Appearance
π SQL Optimization (Data Engineering Focus) β
SQL optimization is about making queries run faster, cheaper, and at scale.
In data engineering, optimization is critical for:
- large datasets (TBs/PBs)
- production pipelines
- dashboards with low latency
π§ Mental Model β
Think of SQL optimization as:
π βReducing unnecessary work for the databaseβ
Focus on:
- fewer rows processed
- fewer joins
- better data access paths
π 1. Avoid SELECT * β
Selecting unnecessary columns increases I/O.
Bad: SELECT *
FROM employees;
Good: SELECT name, salary
FROM employees;
π 2. Filter Early (WHERE optimization) β
Always filter data as early as possible.
SELECT name, salary
FROM employees
WHERE department = 'Engineering';
π Reduces rows before further processing
π 3. Indexing (VERY IMPORTANT) β
Indexes speed up data lookup.
Example:
CREATE INDEX idx_emp_salary
ON employees(salary);
π Helps WHERE and ORDER BY queries
π 4. Avoid functions on indexed columns β
Bad: WHERE YEAR(created_at) = 2024
Good: WHERE created_at >= '2024-01-01'
AND created_at < '2025-01-01'
π Keeps index usable
π 5. Reduce JOIN size β
Join only required columns and rows.
Bad: SELECT *
FROM users
JOIN orders;
Good: SELECT u.name, o.order_id
FROM users u
JOIN orders o
ON u.user_id = o.user_id;
π 6. Use EXISTS instead of IN (sometimes) β
EXISTS is more efficient for large datasets.
SELECT name
FROM users u
WHERE EXISTS ( SELECT 1
FROM orders o
WHERE o.user_id = u.user_id );
π 7. LIMIT early for testing β
SELECT *
FROM events
LIMIT 100;
π Useful for debugging pipelines
π 8. Partitioning (BIG DATA CONCEPT) β
Split large tables into smaller chunks.
Example:
- partition by date
- partition by region
π Improves query performance significantly
π 9. Understand Query Execution Plan β
Databases execute queries in steps:
- scan
- filter
- join
- aggregate
Use: EXPLAIN SELECT * FROM employees;
π Helps identify bottlenecks
π 10. Avoid unnecessary subqueries β
Bad: SELECT * FROM ( SELECT * FROM employees ) t;
Good: SELECT * FROM employees;
π§ DATA ENGINEERING USE CASES β
Optimization is critical in:
- ETL pipelines
- dashboard queries
- data warehouse performance
- streaming aggregations
β οΈ COMMON MISTAKES β
- Using SELECT *
- Missing indexes on filter columns
- Joining large tables without filters
- Using functions on indexed columns
- Ignoring execution plans
π― INTERVIEW QUESTIONS β
- What is query optimization?
- What is indexing and how does it work?
- Difference between clustered and non-clustered index?
- Why is SELECT * bad in production?
- What is query execution plan?
π SUMMARY β
- Filter early
- Select only needed columns
- Use indexes properly
- Avoid unnecessary joins
- Understand execution plans