Executors & Partitions ⚙️
Executors and partitions define how Spark distributes computation across a cluster.
They are the foundation of Spark’s parallel processing model.
What is an Executor?
An executor is a JVM process launched on a worker node.
It is responsible for:
- running tasks
- storing intermediate data
- managing memory for execution
- sending results back to the driver
Each Spark application has its own set of executors.
Executor Lifecycle
- Driver requests resources
- Cluster manager allocates executors
- Executors start on worker nodes
- Tasks are assigned to executors
- Executors process data
- Executors are released when the application finishes (or earlier, when idle, if dynamic allocation is enabled)
What is a Partition?
A partition is the smallest unit of data that Spark distributes and schedules as a whole.
It represents:
- a chunk of a DataFrame or RDD
- a unit of parallel execution
Each partition is processed independently.
Relationship Between Executors and Partitions
- Data is split into partitions
- Partitions are distributed across executors
- Each executor processes several partitions in parallel, up to one per core
Example Execution Flow
DataFrame
↓
Partitions: P1, P2, P3, P4
↓
Executors: E1, E2
↓
Parallel task execution
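The flow above can be modelled in a few lines of plain Python. This is only a sketch: `assign_partitions` is a hypothetical helper, and round-robin assignment is a simplifying assumption, since Spark's scheduler also considers data locality and free cores.

```python
from collections import defaultdict


def assign_partitions(partitions, executors):
    """Spread partitions over executors round-robin (simplified model, not Spark's scheduler)."""
    assignment = defaultdict(list)
    for index, partition in enumerate(partitions):
        executor = executors[index % len(executors)]
        assignment[executor].append(partition)
    return dict(assignment)


plan = assign_partitions(["P1", "P2", "P3", "P4"], ["E1", "E2"])
for executor, parts in plan.items():
    print(executor, "->", parts)
```

With four partitions and two executors, each executor ends up with two partitions to process.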
Task Execution Model
- Within a stage, each partition becomes one task
- Each task runs on one core of an executor
- Multiple tasks run in parallel, across executors and across the cores of each executor
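A rough way to picture this model is a thread pool in which every partition is handed to one worker as a task. The sketch below uses Python's standard `ThreadPoolExecutor` as a stand-in for executor cores; `run_stage` and the sample data are hypothetical, and a real executor is a separate JVM process rather than a thread.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(partitions, task_fn, slots):
    """Run one task per partition, with at most `slots` tasks in flight at once."""
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # pool.map returns results in partition order, whatever the completion order
        return list(pool.map(task_fn, partitions))


# Four partitions of a small dataset; each task sums its own partition.
partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
partial_sums = run_stage(partitions, sum, slots=2)

# A final step combines the per-partition results.
total = sum(partial_sums)
print(partial_sums, total)
```

Each task only sees its own partition, which is why the per-partition results have to be combined in a separate step.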
Parallelism in Spark
Parallelism depends on:
- number of partitions
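- total number of executor cores (number of executors × cores per executor)
Formula (assuming one core per task, the default for spark.task.cpus):
Parallel tasks = min(partitions, total executor cores)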
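The formula can be checked with a small calculator. This is a sketch with hypothetical helper names (`task_slots`, `parallel_tasks`, `task_waves`); the `cpus_per_task` parameter mirrors the `spark.task.cpus` setting, and the "waves" figure is an idealised estimate that assumes all tasks take about the same time.

```python
import math


def task_slots(num_executors, cores_per_executor, cpus_per_task=1):
    """Number of tasks the cluster can run at the same moment."""
    if cpus_per_task < 1 or cores_per_executor < cpus_per_task:
        raise ValueError("each executor needs at least cpus_per_task cores")
    return num_executors * (cores_per_executor // cpus_per_task)


def parallel_tasks(num_partitions, num_executors, cores_per_executor, cpus_per_task=1):
    """Parallel tasks = min(partitions, total task slots)."""
    return min(num_partitions, task_slots(num_executors, cores_per_executor, cpus_per_task))


def task_waves(num_partitions, num_executors, cores_per_executor, cpus_per_task=1):
    """Rounds of tasks needed to get through every partition of a stage."""
    slots = task_slots(num_executors, cores_per_executor, cpus_per_task)
    return math.ceil(num_partitions / slots)


print(parallel_tasks(4, 2, 2))    # 4: enough cores for every partition
print(parallel_tasks(200, 2, 4))  # 8: limited by the 8 available cores
print(task_waves(200, 2, 4))      # 25: 200 partitions processed 8 at a time
```

When partitions outnumber cores, the extra partitions wait for a free core, so the stage runs in several waves.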
Partitioning Strategies
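1. Hash Partitioning
- default for key-based shuffles such as groupBy and join
- assigns each row to a partition based on the hash of its key
2. Range Partitioning
- used for sorting (for example orderBy or repartitionByRange)
- places contiguous key ranges in the same partition, so partitions are ordered relative to each other
3. Custom Partitioning
- user-defined logic (a custom Partitioner in the RDD API)
- used in advanced optimization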
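The three strategies differ only in the function that maps a key to a partition index. The sketch below illustrates that idea in plain Python and does not use Spark's own code: `zlib.crc32` stands in for Spark's internal hash functions, the range boundaries are supplied by hand (Spark derives them by sampling the data), and the "vip" routing rule is a made-up example of custom logic.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Hash partitioning: same key always lands in the same partition."""
    # crc32 is an assumption for illustration; Spark uses different hash functions.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, upper_bounds):
    """Range partitioning: keys <= upper_bounds[i] go to partition i, the rest to the last one."""
    return bisect.bisect_left(upper_bounds, key)


def custom_partition(key, num_partitions, hot_key="vip"):
    """Custom partitioning (hypothetical rule): isolate one known hot key in partition 0."""
    if key == hot_key:
        return 0
    # All other keys are hashed over the remaining partitions 1..num_partitions-1.
    return 1 + hash_partition(key, num_partitions - 1)


keys = [1, 10, 11, 20, 21]
print([range_partition(k, [10, 20]) for k in keys])  # [0, 0, 1, 1, 2]
print(hash_partition("user-1", 4) == hash_partition("user-1", 4))  # True
print(custom_partition("vip", 4))  # 0
```

Range partitioning keeps neighbouring keys together, which is what makes a globally sorted output possible; hash partitioning gives no ordering but needs no knowledge of the key distribution.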
Why Partitions Matter
Partitions affect:
- performance
- memory usage
- shuffle cost
- cluster utilization
Common Issues
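- Too few partitions → low parallelism and idle cores
- Too many partitions → scheduling overhead from many small tasks
- Data skew → uneven load across executors, with a few large partitions dominating the run time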
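One simple way to spot skew is to compare the largest partition with the average partition size. The sketch below assumes the partition index of every row is already known; `partition_sizes` and `skew_ratio` are hypothetical helpers, and the sample sizes are made-up numbers chosen for illustration.

```python
from collections import Counter


def partition_sizes(partition_ids, num_partitions):
    """Count rows per partition, including partitions that received no rows."""
    counts = Counter(partition_ids)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly balanced."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


balanced = [25, 25, 25, 25]
skewed = [85, 5, 5, 5]

print(skew_ratio(balanced))  # 1.0
print(skew_ratio(skewed))    # 3.4: one task does 3.4x the average work
```

Because a stage finishes only when its slowest task finishes, a ratio well above 1 means most cores sit idle while one task works through the large partition.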
Optimization Techniques
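- tune the number of shuffle partitions (spark.sql.shuffle.partitions, default 200)
- use repartition to rebalance or increase partitions (full shuffle) and coalesce to reduce them without a full shuffle
- mitigate skewed keys, for example by salting hot keys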
- balance workload across executors
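Salting is one common way to handle a skewed key: a small suffix is appended to the key so that its rows spread over several partitions. The sketch below models the effect in plain Python under stated assumptions: `zlib.crc32` stands in for Spark's hash function, the salt is derived from the row index to keep the example deterministic (a random salt is more usual), and all names are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Hash-partition a string key (crc32 is a stand-in for Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key, row_index, num_salts):
    """Append a salt so that rows sharing a key map to up to num_salts distinct keys."""
    return f"{key}#{row_index % num_salts}"


# 90 rows share the key "hot"; 10 rows have unique keys.
keys = ["hot"] * 90 + [f"k{i}" for i in range(10)]

plain_sizes = Counter(partition_for(k, 4) for k in keys)
salted_sizes = Counter(
    partition_for(salted_key(k, i, 8), 4) for i, k in enumerate(keys)
)

print("largest partition without salt:", max(plain_sizes.values()))
print("largest partition with salt:   ", max(salted_sizes.values()))
```

Salting changes the key, so an aggregation then needs a second step that merges the per-salt results, and a join needs the other side replicated once per salt value.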
Mental Model
Think of Spark execution as:
Data split into partitions, processed independently by executors running in parallel across a cluster.
Key Takeaway
- Executors are compute engines
- Partitions are units of data
- How evenly partitions map onto executor cores largely determines Spark performance