DAG Execution Model ⚡

Spark executes all transformations using a Directed Acyclic Graph (DAG).

A DAG represents the logical sequence of operations applied to data.

What is a DAG?

A DAG (Directed Acyclic Graph) is:

a graph of transformations
directed (flow has direction)
acyclic (no loops)
used to define execution logic

In Spark, every action triggers execution of the DAG as a job.
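The two properties "directed" and "acyclic" can be checked on a small graph. The sketch below uses Python's standard `graphlib` module, not Spark; the step names are illustrative only.

```python
from graphlib import TopologicalSorter, CycleError

# Each key is a step; its value is the set of steps it depends on.
# Step names are illustrative and not tied to any Spark API.
dag = {
    "map": {"input_data"},
    "filter": {"map"},
    "groupBy": {"filter"},
    "reduce": {"groupBy"},
}

# Directed + acyclic means an ordering exists that respects every edge.
order = list(TopologicalSorter(dag).static_order())
print(order)

# An edge from the last step back to the first creates a loop,
# so the graph is no longer acyclic and cannot be ordered.
cyclic = dict(dag, input_data={"reduce"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Because the graph has no loops, every step can run after the steps it depends on, which is what makes a DAG usable as an execution plan.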

Why Spark uses a DAG

Spark uses a DAG to:

optimize execution plans
avoid redundant computation
enable parallel execution
support fault tolerance through lineage
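Fault tolerance through lineage can be illustrated with a toy model in plain Python. This is not Spark's implementation: the partitions, the two lineage steps, and the simulated failure are all assumptions made for the example.

```python
# Source data split into two partitions (hypothetical values).
source_partitions = [[1, 2, 3], [4, 5, 6]]

# Lineage: the ordered per-partition steps that produce the result.
lineage = [
    lambda part: [x * 10 for x in part],       # plays the role of map()
    lambda part: [x for x in part if x > 20],  # plays the role of filter()
]

def compute(partition):
    """Replay every lineage step on one source partition."""
    for step in lineage:
        partition = step(partition)
    return partition

results = [compute(p) for p in source_partitions]

# Simulate losing the second result partition (for example, a worker dies).
results[1] = None

# Only the lost partition is rebuilt, from its source data and the lineage.
results[1] = compute(source_partitions[1])
print(results)
```

The point of the sketch is that nothing had to be replicated in advance: knowing how a partition was derived is enough to recompute it.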

How DAG is created

When you write Spark code:

You define transformations
Spark does NOT execute them immediately
Spark builds a logical plan
The logical plan is converted into a DAG
The DAG is split into stages
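The "define now, run later" behavior in the steps above can be sketched with a small class. `LazyDataset` is a hypothetical stand-in written for this example, not a Spark class; it only records transformations until an action (`collect`) is called.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical, not part of Spark)."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # recorded transformations, in order

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, never run here.
        return LazyDataset(self._data, self._plan + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan actually executed.
        rows = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

calls = []

def double(x):
    calls.append(x)  # side effect that reveals when work happens
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = ds.collect()             # the plan runs here
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

Deferring work this way is what gives an engine the chance to inspect the whole plan before running any of it.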

Example Flow

input_data
↓
map()
↓
filter()
↓
groupBy()
↓
reduce()
↓
output

Each step becomes a node in the DAG.

Stages in DAG

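Spark divides the DAG into stages based on the dependencies between transformations:

Narrow transformations

no data shuffle (for example, map() and filter())
executed together in the same stage

Wide transformations

require a shuffle (for example, groupBy() and join())
create a stage boundary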
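The stage-splitting rule can be sketched as a small function. This is a simplified model, not Spark's scheduler: the `WIDE` set and the `split_into_stages` helper are assumptions for illustration, and the pipeline is treated as a straight chain of operation names.

```python
# Operations assumed to require a shuffle (simplified, illustrative set).
WIDE = {"groupBy", "reduceByKey", "join"}

def split_into_stages(ops):
    """Group a linear chain of operation names into stages.

    Narrow ops are appended to the current stage. A wide op closes the
    current stage (the shuffle boundary) and starts the next one.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])  # shuffle boundary
        stages[-1].append(op)
    return stages

stages = split_into_stages(["map", "filter", "groupBy", "map", "join", "filter"])
for i, stage in enumerate(stages):
    print(i, stage)
```

In this model two wide operations produce three stages, and every run of narrow operations stays inside a single stage.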

DAG Scheduler

The DAG Scheduler is responsible for:

splitting the DAG into stages
identifying shuffle boundaries
submitting each stage's tasks to the Task Scheduler as a task set

Execution Summary

When an action is triggered:

The DAG for the job is built from the recorded transformations
The DAG Scheduler creates stages
Tasks are distributed to executors
Execution happens in parallel, one task per partition
Results are returned to the driver
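The last three steps can be mimicked locally. In this sketch a thread pool stands in for the executors and the main program stands in for the driver; the partition contents and the `task` function are assumptions for the example, and no Spark API is involved.

```python
from concurrent.futures import ThreadPoolExecutor

# Three partitions of input data (hypothetical values).
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def task(partition):
    """One task: run the stage's work on a single partition."""
    squared = [x * x for x in partition]
    return sum(squared)

# The thread pool plays the role of the executors: one task per partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(task, partitions))

# The "driver" combines the per-partition results into the final answer.
total = sum(partial_results)
print(partial_results, total)
```

Each task sees only its own partition, which is why the tasks can run side by side without coordinating with each other.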

Mental Model

Think of DAG as:

A blueprint that describes how Spark will execute your transformations before running anything.

Key Takeaway

No execution happens while transformations are being defined
The DAG is built and run only when an action is called
The DAG enables optimized, distributed execution
