Executors & Partitions ⚙️

Executors and partitions define how Spark distributes computation across a cluster.

They are the foundation of Spark’s parallel processing model.


What is an Executor?

An executor is a JVM process launched on a worker node.

It is responsible for:

  • running tasks
  • storing intermediate data
  • managing memory for execution
  • sending results back to the driver

Each Spark application has its own set of executors.


Executor Lifecycle

  1. Driver requests resources
  2. Cluster manager allocates executors
  3. Executors start on worker nodes
  4. Tasks are assigned to executors
  5. Executors process data
  6. Executors are released when the application finishes (or earlier, if dynamic allocation removes idle executors)

What is a Partition?

A partition is a logical chunk of a distributed dataset and the smallest unit of work that Spark schedules.

It represents:

  • a chunk of a DataFrame or RDD
  • a unit of parallel execution

Each partition is processed independently.
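
To make the idea of a partition concrete, here is a minimal plain-Python sketch that splits a list of rows into near-equal chunks. The helper name split_into_partitions is hypothetical and is not a Spark API; Spark decides partition boundaries itself based on the data source and configuration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_partitions(rows: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Split rows into num_partitions contiguous chunks of near-equal size.

    Illustrative only: a stand-in for how a dataset is divided into partitions.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    base, extra = divmod(len(rows), num_partitions)
    partitions: List[List[T]] = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row.
        size = base + (1 if i < extra else 0)
        partitions.append(list(rows[start:start + size]))
        start += size
    return partitions


rows = list(range(10))
partitions = split_into_partitions(rows, 4)
for index, chunk in enumerate(partitions):
    print(f"partition {index}: {chunk}")
```

Each chunk can then be handed to a separate worker without any coordination with the others.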


Relationship Between Executors and Partitions

  • Data is split into partitions
  • Partitions are distributed across executors
  • Each executor processes several partitions at the same time, one per available core

Example Execution Flow

DataFrame
    ↓
Partitions: P1, P2, P3, P4
    ↓
Executors: E1, E2
    ↓
Parallel task execution
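
The flow above can be sketched in plain Python. This is a simplification: the round-robin rule and the function name assign_round_robin are assumptions made for illustration. Spark's scheduler hands tasks to free cores dynamically and takes data locality into account, so the real placement may differ.

```python
from collections import defaultdict
from typing import Dict, List


def assign_round_robin(partitions: List[str], executors: List[str]) -> Dict[str, List[str]]:
    """Assign partitions to executors in round-robin order (illustrative only)."""
    if not executors:
        raise ValueError("at least one executor is required")
    assignment: Dict[str, List[str]] = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[executors[i % len(executors)]].append(partition)
    return dict(assignment)


assignment = assign_round_robin(["P1", "P2", "P3", "P4"], ["E1", "E2"])
for executor, assigned in assignment.items():
    print(executor, "->", assigned)
```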


Task Execution Model

  • Within a stage, each partition becomes one task
  • Each task runs inside an executor
  • Multiple tasks run in parallel
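
As a rough analogy for this model, the sketch below runs one task per partition on a small thread pool. The threads stand in for executor cores; real executors are separate JVM processes on worker nodes, and run_task is a hypothetical task chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_task(partition: List[int]) -> int:
    """Hypothetical task: sum the rows of a single partition."""
    return sum(partition)


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]

# max_workers plays the role of the available task slots (executor cores).
with ThreadPoolExecutor(max_workers=2) as pool:
    # One task per partition; map returns results in partition order.
    partial_results = list(pool.map(run_task, partitions))

# Combining the partial results is the job of the driver in this analogy.
total = sum(partial_results)
print(partial_results, total)
```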

Parallelism in Spark

Parallelism depends on:

  • number of partitions
  • total number of executor cores (executors × cores per executor)

Formula:

Parallel tasks = min(partitions, total executor cores)
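
A small worked example of the formula, written as a sketch. It assumes each task uses one core (the default value of spark.task.cpus) and counts cores across all executors. The function names are hypothetical.

```python
import math


def parallel_tasks(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Number of tasks that can run at the same time, assuming one core per task."""
    total_cores = num_executors * cores_per_executor
    return min(num_partitions, total_cores)


def task_waves(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Number of rounds needed to process every partition of a stage."""
    total_cores = num_executors * cores_per_executor
    return math.ceil(num_partitions / total_cores)


# 200 partitions on 5 executors with 4 cores each: 20 tasks at a time, 10 waves.
print(parallel_tasks(200, 5, 4), task_waves(200, 5, 4))

# 8 partitions on the same cluster: only 8 of the 20 cores are used.
print(parallel_tasks(8, 5, 4), task_waves(8, 5, 4))
```

The second case shows why too few partitions leave cores idle.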


Partitioning Strategies

1. Hash Partitioning

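  • default strategy for key-based shuffles such as joins and aggregations
  • distributes rows by the hash of the key, so equal keys always land in the same partition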
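
The core idea can be sketched as "hash the key, then take the remainder". The example uses zlib.crc32 as a stand-in hash because it is stable across runs; Spark uses its own hash functions internally, so actual partition numbers will differ. The function name hash_partition is hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions)."""
    # crc32 is only a stand-in for Spark's internal hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = ["user_1", "user_2", "user_3", "user_1", "user_2", "user_1"]

by_partition: Dict[int, List[str]] = defaultdict(list)
for key in keys:
    by_partition[hash_partition(key, 4)].append(key)

for index in sorted(by_partition):
    print(f"partition {index}: {by_partition[index]}")
```

Because the mapping depends only on the key, all rows with the same key are colocated, which is what joins and aggregations rely on.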

2. Range Partitioning

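  • used when data must be sorted (for example, orderBy)
  • gives each partition a contiguous range of keys, so partitions are ordered relative to each other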
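
A minimal sketch of range partitioning with a binary search over sorted boundaries. The boundary values here are hard-coded assumptions; Spark derives boundaries by sampling the data. The names BOUNDARIES and range_partition are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List

# Inclusive upper bounds of the first three ranges; the last range is open-ended.
BOUNDARIES = [10, 20, 30]


def range_partition(key: int, boundaries: List[int]) -> int:
    """Return the index of the range that contains key."""
    return bisect_left(boundaries, key)


keys = [25, 3, 17, 42, 10, 31]
buckets: Dict[int, List[int]] = {i: [] for i in range(len(BOUNDARIES) + 1)}
for key in keys:
    buckets[range_partition(key, BOUNDARIES)].append(key)

for index, values in buckets.items():
    print(f"partition {index}: {values}")
```

Every key in a partition is smaller than every key in the next one, so sorting each partition locally yields a globally sorted result.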

3. Custom Partitioning

  • user-defined logic that maps each key to a partition
  • used for advanced optimization when hash or range partitioning distributes the data poorly
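
As one possible illustration of custom logic, the sketch below reserves a partition for a known hot key and hashes all other keys across the remaining partitions. The key "country=US", the HOT_KEYS mapping, and the function custom_partition are all hypothetical, and crc32 is a stand-in hash.

```python
import zlib
from typing import Dict

# Hypothetical hot key that gets a dedicated partition.
HOT_KEYS: Dict[str, int] = {"country=US": 0}


def custom_partition(key: str, num_partitions: int) -> int:
    """Send hot keys to reserved partitions and hash everything else."""
    reserved = len(HOT_KEYS)
    if num_partitions <= reserved:
        raise ValueError("need more partitions than reserved hot keys")
    if key in HOT_KEYS:
        return HOT_KEYS[key]
    # Remaining keys share the partitions after the reserved ones.
    return reserved + zlib.crc32(key.encode("utf-8")) % (num_partitions - reserved)


for key in ["country=US", "country=DE", "country=JP"]:
    print(key, "->", custom_partition(key, 4))
```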

Why Partitions Matter

Partitions affect:

  • performance
  • memory usage
  • shuffle cost
  • cluster utilization

Common Issues

  • Too few partitions → low parallelism
  • Too many partitions → scheduling overhead from many small tasks
  • Data skew → uneven load across executors
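
Skew can be quantified by comparing the largest partition with the average one, since a stage finishes only when its slowest task does. The metric and the name skew_ratio below are an illustrative assumption, not a built-in Spark measure, and the partition sizes are made-up sample values.

```python
from statistics import mean
from typing import List


def skew_ratio(partition_sizes: List[int]) -> float:
    """Largest partition size divided by the mean partition size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    average = mean(partition_sizes)
    if average == 0:
        return 1.0
    return max(partition_sizes) / average


balanced = [100, 98, 102, 100]
skewed = [10, 12, 370, 8]

print(f"balanced: {skew_ratio(balanced):.2f}")
print(f"skewed:   {skew_ratio(skewed):.2f}")
```

A ratio close to 1 means the load is even; a ratio well above 1 means one task does most of the work while other cores wait.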

Optimization Techniques

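  • tune the number of shuffle partitions (spark.sql.shuffle.partitions, default 200)
  • use repartition to increase or rebalance partitions (full shuffle) and coalesce to reduce them without a full shuffle
  • avoid skewed keys, for example by salting hot keys
  • balance workload across executors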
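
One common way to reduce the effect of a skewed key is salting: appending a small suffix so that rows sharing the key spread over several partitions. The sketch below is plain Python with a stand-in hash (crc32) and hypothetical helper names; in a real job the salted results would need a second aggregation step after the salt is removed.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8


def partition_of(key: str, num_partitions: int) -> int:
    """Stand-in hash partitioner (not Spark's actual hash function)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key: str, row_index: int, num_salts: int) -> str:
    """Append a salt derived from the row index to spread one key over many."""
    return f"{key}#{row_index % num_salts}"


# Made-up sample: one hot key with 1000 rows plus 100 distinct keys.
keys = ["hot"] * 1000 + [f"k{i}" for i in range(100)]

plain = Counter(partition_of(k, NUM_PARTITIONS) for k in keys)
salted = Counter(
    partition_of(salted_key(k, i, NUM_SALTS), NUM_PARTITIONS)
    for i, k in enumerate(keys)
)

print("largest partition without salt:", max(plain.values()))
print("largest partition with salt:   ", max(salted.values()))
```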

Mental Model

Think of Spark execution as:

Data split into partitions, processed independently by executors running in parallel across a cluster.


Key Takeaway

  • Executors are compute engines
  • Partitions are units of data
  • Efficient mapping between them determines Spark performance