DAG Execution Model ⚡
Spark executes all transformations using a Directed Acyclic Graph (DAG).
A DAG represents the logical sequence of operations applied to data.
What is a DAG?
A DAG (Directed Acyclic Graph) is:
- a graph of transformations
- directed (flow has direction)
- acyclic (no loops)
- used to define execution logic
In Spark, every action (for example collect() or count()) triggers execution of the DAG formed by the transformations that precede it.
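The "directed" and "acyclic" properties can be illustrated with a minimal sketch in plain Python. This is not Spark code: the dictionary layout and the step names are assumptions chosen for illustration, and the standard-library graphlib module (Python 3.9+) stands in for Spark's own planning.

```python
from graphlib import TopologicalSorter, CycleError

# Toy DAG: each key is a step, each value is the set of steps it depends on.
# The step names are illustrative and are not Spark identifiers.
dag = {
    "map": {"input_data"},
    "filter": {"map"},
    "groupBy": {"filter"},
    "reduce": {"groupBy"},
}

def execution_order(graph):
    """Return the steps in a valid dependency order, or raise CycleError on a loop."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(dag)
print(order)

# Making "map" depend on "reduce" closes a loop, so no valid order exists.
cyclic = dict(dag, map={"input_data", "reduce"})
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Because the graph has no loops, a valid execution order always exists; once a loop is introduced, ordering the steps becomes impossible.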
Why Spark uses a DAG
Spark uses a DAG to:
- optimize execution plans
- avoid redundant computation
- enable parallel execution
- support fault tolerance through lineage
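The last point, fault tolerance through lineage, can be sketched as follows. This is a toy model in plain Python, not Spark's implementation: the partition dictionary, the lineage list, and compute_partition are hypothetical names used only to show the idea that a lost partition can be rebuilt by replaying the recorded steps.

```python
# Toy model of lineage-based recovery; not Spark's implementation.
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6]}

# Lineage: the ordered list of functions that produced the derived dataset.
lineage = [
    lambda part: [x * 10 for x in part],       # map-like step
    lambda part: [x for x in part if x > 20],  # filter-like step
]

def compute_partition(partition_id):
    """Rebuild one partition by replaying the lineage on its source data."""
    data = source_partitions[partition_id]
    for step in lineage:
        data = step(data)
    return data

cached = {pid: compute_partition(pid) for pid in source_partitions}

# Simulate losing partition 1 (for example, because an executor failed).
del cached[1]

# Only the lost partition is recomputed; partition 0 is left untouched.
cached[1] = compute_partition(1)
print(cached)
```

The design point is that the recipe is stored rather than a copy of the data, so recovery costs recomputation instead of replication.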
How the DAG is created
When you write Spark code, the following happens in order:
- You define transformations
- Spark does NOT execute them immediately
- Spark records them as a logical plan
- The logical plan is converted into a DAG
- The DAG is split into stages
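The deferred execution described in these steps can be mimicked with a small sketch. LazyDataset is a hypothetical class written for this illustration and is not a Spark API; it only shows the pattern of recording transformations and running them when an action is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # recorded transformations, in order

    def map(self, fn):
        # Transformation: record the step and return a new dataset; nothing runs.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._plan + (("filter", pred),))

    def plan(self):
        return [name for name, _ in self._plan]

    def collect(self):
        """Action: only here is the recorded plan actually executed."""
        result = list(self._data)
        for name, fn in self._plan:
            if name == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

calls = []

def double(x):
    calls.append(x)  # record that the function really ran
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before = list(calls)  # still empty: the plan exists but nothing has run
out = ds.collect()
print(ds.plan(), calls_before, out, calls)
```

Before collect() is called, the plan already lists both steps but double has not been invoked once.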
Example Flow
input_data
↓
map()
↓
filter()
↓
groupBy()
↓
reduce()
↓
output
Each step becomes a node in the DAG.
Stages in the DAG
Spark divides the DAG into stages based on the dependencies between transformations:
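Narrow transformations (for example map(), filter())
- need no data shuffle, because each output partition depends on a single input partition
- are pipelined together in the same stage
Wide transformations (for example groupBy())
- require a shuffle, because each output partition may depend on many input partitions
- create a stage boundary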
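The stage-boundary rule can be sketched for a simple linear chain of operations. This is a simplified model, not Spark's scheduler: the WIDE set is a small illustrative subset of shuffle operations, split_into_stages is a hypothetical helper, and the sketch places the whole wide operation in the new stage.

```python
# Operations that need a shuffle; a small illustrative subset, not an exhaustive list.
WIDE = {"groupBy", "reduceByKey", "join", "repartition"}

def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting at each wide operation."""
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])  # the shuffle starts a new stage
        stages[-1].append(op)
    return stages

stages = split_into_stages(
    ["map", "filter", "groupBy", "map", "repartition", "filter"]
)
for number, stage in enumerate(stages):
    print(number, stage)
```

Consecutive narrow operations stay in one stage, and each wide operation opens the next one, so this chain yields three stages.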
DAG Scheduler
The DAG Scheduler is responsible for:
- splitting the DAG into stages
- identifying shuffle boundaries
- submitting each stage's tasks to the Task Scheduler
Execution Summary
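When an action is triggered:
- The DAG is finalized from the recorded transformations
- The DAG Scheduler creates stages
- Tasks are distributed to executors
- Execution happens in parallel, one task per partition
- Results are returned to the driver or written to storage, depending on the action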
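The distribution and collection steps can be approximated on a single machine. In this toy sketch, threads stand in for executors and Python lists stand in for partitions; run_task and run_job are hypothetical names, and nothing here reflects how Spark actually ships tasks across a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model: threads stand in for executors, lists stand in for partitions.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def run_task(partition):
    """One task: apply the stage's pipelined narrow steps to a single partition."""
    squared = (x * x for x in partition)          # map-like step
    return sum(x for x in squared if x % 2 == 1)  # filter-like step, then a partial sum

def run_job(parts, workers=3):
    """Driver side: run one task per partition in the pool, then combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(run_task, parts))  # results come back in partition order
    return partial, sum(partial)

partial_sums, total = run_job(partitions)
print(partial_sums, total)
```

Each partition is processed independently, which is what makes the work parallelizable; only the small partial results travel back to the driver side.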
Mental Model
Think of the DAG as:
A blueprint that describes how Spark will execute your transformations before running anything.
Key Takeaway
- No execution happens while transformations are being defined
- The DAG is executed only when an action is called
- The DAG enables optimized, distributed execution