DAG Execution Model ⚡

Spark plans and executes every job as a Directed Acyclic Graph (DAG) of transformations.

A DAG represents the logical sequence of operations applied to data.


What is a DAG?

A DAG (Directed Acyclic Graph) is:

  • a graph of transformations
  • directed (flow has direction)
  • acyclic (no loops)
  • used to define execution logic

In Spark, every action (for example collect() or count()) triggers execution of the DAG as a job.
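
To make the "directed" and "acyclic" properties concrete, here is a minimal sketch using Python's standard-library graphlib module (Python 3.9+). It is not Spark code: the `dag` dictionary and its node names are hypothetical and only mirror a typical pipeline. The point is that a graph without loops always has a valid execution order, while a graph with a loop has none.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each key depends on the nodes in its set.
dag = {
    "map": {"input_data"},
    "filter": {"map"},
    "groupBy": {"filter"},
    "output": {"groupBy"},
}

# An acyclic graph always has a valid execution order.
order = list(TopologicalSorter(dag).static_order())
print(order)

# Adding an edge back to an earlier node creates a loop, so no order exists.
cyclic = dict(dag, input_data={"output"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

This is why the "no loops" property matters: without it, a scheduler could not decide which step to run first.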


Why Spark uses a DAG

Spark uses a DAG to:

  • optimize execution plans
  • avoid redundant computation
  • enable parallel execution
  • support fault tolerance through lineage

How the DAG is created

When you write Spark code:

  1. You define transformations
  2. Spark does not execute them immediately (lazy evaluation)
  3. Spark records them in a logical plan
  4. The logical plan is converted into a DAG
  5. The DAG is split into stages
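
The lazy behaviour in the steps above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's API or implementation: `LazyDataset`, `double`, and `calls` are hypothetical names introduced for illustration. Transformations only append to a recorded plan, and the `collect()` action is what finally runs it.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, data, plan=()):
        self._data = list(data)
        self._plan = tuple(plan)  # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: only extend the plan, touch no data.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: again, nothing is computed here.
        return LazyDataset(self._data, self._plan + (("filter", pred),))

    def plan(self):
        return [name for name, _ in self._plan]

    def collect(self):
        # Action: now the recorded plan is actually executed.
        rows = self._data
        for name, fn in self._plan:
            if name == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


calls = []

def double(x):
    calls.append(x)  # lets us observe when work really happens
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = ds.collect()             # the action triggers execution
print(ds.plan(), calls_before_action, result)
```

Because the whole plan is known before anything runs, an engine built this way has the chance to reorder or combine steps before execution.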

Example Flow

input_data → map() → filter() → groupBy() → reduce() → output

Each step becomes a node in the DAG.


Stages in the DAG

Spark divides the DAG into stages based on the dependencies between transformations:

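Narrow transformations

  • no data shuffle: each output partition depends on a single input partition
  • pipelined together in the same stage
  • examples: map(), filter()

Wide transformations

  • require a shuffle: records are redistributed across partitions
  • create a stage boundary
  • examples: groupBy(), reduceByKey()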
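
The difference between the two kinds of dependency can be sketched in plain Python. This is a toy model under stated assumptions, not Spark code: `narrow_map`, `wide_group_by_key`, and the byte-sum partitioner are hypothetical helpers chosen so the result is deterministic. The narrow step handles each partition on its own, while the wide step must move records so that equal keys end up together.

```python
from collections import defaultdict

# Two partitions of (key, value) records; in Spark these could live on different executors.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

def narrow_map(parts, fn):
    # Narrow: each output partition depends on exactly one input partition.
    return [[fn(rec) for rec in part] for part in parts]

def wide_group_by_key(parts, num_partitions):
    # Wide: records with the same key must meet in one partition,
    # so any record may have to move (this movement is the shuffle).
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            target = sum(key.encode()) % num_partitions  # simple deterministic partitioner
            shuffled[target][key].append(value)
    return [dict(p) for p in shuffled]

doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(doubled, num_partitions=2)
print(doubled)
print(grouped)
```

Note how the two values for key "a" start in different partitions and only meet after the wide step, which is why that step cannot be pipelined with the ones before it.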

DAG Scheduler

The DAG Scheduler is responsible for:

  • splitting the DAG into stages
  • identifying shuffle boundaries
  • submitting each stage's tasks to the Task Scheduler as a task set
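
As a rough illustration of stage splitting, the sketch below cuts a linear chain of operations at every wide operation. It is a simplification, not the real scheduler, which works on a full dependency graph rather than a flat list. `WIDE_OPS` and `split_into_stages` are hypothetical names, and the set of wide operations is an illustrative subset.

```python
WIDE_OPS = {"groupBy", "reduceByKey", "join"}  # illustrative subset only

def split_into_stages(ops):
    """Cut a linear chain of operations at every wide (shuffle) operation."""
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)  # shuffle boundary closes the running stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(["map", "filter", "groupBy", "map"])
print(stages)
```

Here the narrow map and filter share the first stage, and the groupBy starts a second stage because its input has to be shuffled first.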

Execution Summary

When an action is triggered:

  1. The DAG is built from the recorded transformations
  2. The DAG Scheduler splits it into stages
  3. The Task Scheduler distributes tasks to executors, one task per partition
  4. Tasks run in parallel
  5. Results are returned to the driver
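
The last three steps can be mimicked with Python's standard-library thread pool. This is only an analogy under stated assumptions: the pool stands in for executors, the calling thread plays the driver, and `run_task` is a hypothetical task that processes one partition.

```python
from concurrent.futures import ThreadPoolExecutor

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def run_task(partition):
    # One task processes one partition: keep odd values, then scale them.
    return [x * 10 for x in partition if x % 2 == 1]

# The pool stands in for executors; the calling thread plays the driver.
with ThreadPoolExecutor(max_workers=3) as executors:
    per_partition = list(executors.map(run_task, partitions))

# Results come back to the "driver", which combines them.
collected = [x for part in per_partition for x in part]
print(collected)
```

Each partition is processed independently, so adding partitions and workers increases parallelism without changing the task logic.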

Mental Model

Think of the DAG as:

A blueprint that describes how Spark will execute your transformations before running anything.


Key Takeaway

  • No execution happens when transformations are defined
  • The DAG is built and run only when an action is called
  • The DAG enables optimized, distributed execution