Skip to content

🔗 SQL Joins (Data Engineering Focus)

Joins are used to combine data from multiple tables.

In data engineering, joins are everywhere:

  • ETL pipelines
  • Data warehouses
  • Analytics queries

🧠 Mental Model

Think of joins as:

Matching rows from two tables using a common key.

Example:

  • users table
  • orders table

We join them using user_id


📌 INNER JOIN

Returns only matching records from both tables.

SELECT u.name, o.order_id
FROM users u
INNER JOIN orders o
ON u.user_id = o.user_id;

👉 Only users who placed orders appear


📌 LEFT JOIN

Returns all records from left table + matching from right.

SELECT u.name, o.order_id
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id;

👉 All users appear, even if they didn’t order anything


📌 RIGHT JOIN

Returns all records from right table + matching from left.

SELECT u.name, o.order_id
FROM users u
RIGHT JOIN orders o
ON u.user_id = o.user_id;

👉 All orders appear, even if user data is missing


📌 FULL OUTER JOIN

Returns all records from both tables.

SELECT u.name, o.order_id
FROM users u
FULL OUTER JOIN orders o
ON u.user_id = o.user_id;

👉 Combines everything, matching where possible


📌 SELF JOIN

A table joined with itself.

SELECT e1.name, e2.name AS manager_name
FROM employees e1
JOIN employees e2
ON e1.manager_id = e2.employee_id;

👉 Used for hierarchical data (employees, org charts)


📌 COMMON JOIN MISTAKE

⚠️ Missing join condition causes Cartesian product.

SELECT *
FROM users
JOIN orders;

👉 This multiplies all rows with all rows (dangerous)


📌 REAL WORLD EXAMPLE

Get user names with their order counts:

SELECT u.name, COUNT(o.order_id)
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id
GROUP BY u.name;


🧠 DATA ENGINEERING USE CASES

Joins are used in:

  • Combining fact + dimension tables
  • Building analytics tables
  • ETL transformations
  • Data validation checks

⚠️ COMMON MISTAKES

  • Forgetting ON condition
  • Using INNER JOIN when LEFT JOIN is needed
  • Creating duplicate rows unintentionally
  • Not understanding null behavior in joins

🎯 INTERVIEW QUESTIONS

  • Difference between INNER and LEFT JOIN?
  • What happens if join condition is missing?
  • When do you use SELF JOIN?
  • Why do joins become slow in large datasets?

🚀 SUMMARY

  • INNER JOIN → matching rows only
  • LEFT JOIN → keep all left rows
  • RIGHT JOIN → keep all right rows
  • FULL JOIN → everything
  • SELF JOIN → same table relationship