PySpark Interview Guide 🎯
This document covers commonly asked PySpark interview questions, real-world scenarios, and the core concepts expected in data engineering roles.
Core Concept Questions
1. What is PySpark?
PySpark is the Python API for Apache Spark, an engine for distributed data processing across clusters.
2. What is a DataFrame in PySpark?
A DataFrame is a distributed collection of data organized into named columns, optimized by Spark's Catalyst engine.
3. What is lazy evaluation?
Lazy evaluation means Spark does not execute transformations immediately. It builds a DAG and executes only when an action is called.
4. Difference between transformations and actions?
Transformations:
- are lazy
- extend the DAG
- return a new DataFrame
Actions:
- trigger execution
- return a result to the driver or write it to storage
5. What is a DAG in Spark?
A DAG (Directed Acyclic Graph) is Spark's representation of a job's transformations and their dependencies, which it uses to plan stages before anything runs.
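Execution-Based Questions
6. What happens when you call an action?
- The DAG built from the transformations is submitted for execution
- Spark splits it into stages at shuffle boundaries
- Each stage is broken into tasks, one per partition, which are distributed to executors
- Tasks run in parallel
- The result is returned to the driver or written to storage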
7. What is a shuffle?
A shuffle is the redistribution of data across partitions and executors. Wide operations such as join or groupBy require it so that rows with the same key end up in the same partition.
8. Why is shuffle expensive?
Because it involves:
- disk I/O
- network I/O
- serialization
- data redistribution
Performance Questions
9. How do you optimize Spark jobs?
- reduce shuffle
- use broadcast joins
- optimize partitioning
- cache wisely
- filter early
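10. What causes data skew?
Data skew occurs when a few partitions hold significantly more data than the others, typically because a small number of key values account for most of the rows.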
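11. How do you fix data skew?
- salt hot keys to spread them across partitions
- broadcast the smaller side so the skewed side is not shuffled
- repartition on a more evenly distributed column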
Joins and Partitioning
12. What is broadcast join?
A join in which the smaller dataset is copied to every executor, so the larger dataset can be joined in place without being shuffled.
13. Difference between repartition and coalesce?
Repartition:
- can increase or decrease the number of partitions
- always triggers a full shuffle
Coalesce:
- only reduces the number of partitions
- merges existing partitions, avoiding a full shuffle
Real-World Scenario Questions
14. A Spark job is slow. How do you debug it?
Steps:
- check Spark UI
- identify shuffle stages
- check skewed partitions
- analyze executor memory
- optimize joins and partitions
15. How do you handle large joins?
- use a broadcast join if one side is small enough
- partition both datasets on the join key
- select only the columns you need before joining
- filter early
Architecture Questions
16. Driver vs Executor?
Driver:
- builds the DAG
- schedules jobs and assigns tasks to executors
Executor:
- runs tasks
- holds cached data and shuffle output
17. Why is Spark faster than MapReduce?
- in-memory computation
- DAG execution model
- lazy evaluation
- optimized execution engine
Common Mistakes in Interviews
- confusing transformations with actions
- not understanding shuffle
- ignoring partitioning impact
- overusing collect()
- not explaining execution flow
Strong Interview Answer Pattern
When answering:
- Define concept
- Explain execution flow
- Mention performance impact
- Give real-world example
Mental Model
Think of PySpark as:
A distributed execution engine where your code is converted into an optimized DAG and executed across a cluster.
Summary
This section covers:
- core PySpark concepts
- execution model
- performance tuning basics
- real interview scenarios