Spark Architecture Overview ⚡

Apache Spark is a distributed computing system designed to process large-scale data efficiently across clusters of machines.


What is Spark Architecture?

Spark architecture defines how a Spark application runs across a distributed cluster.

It follows a master–worker model: the driver acts as the master and the executors act as the workers.


Core Components

1. Driver Program

The Driver is the central control unit of a Spark application.

It is responsible for:

  • converting user code into execution plans
  • building the DAG (Directed Acyclic Graph)
  • scheduling tasks
  • communicating with the cluster manager
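
As a rough illustration of the DAG and scheduling responsibilities listed above, the sketch below models stages as nodes in a dependency graph and derives an order in which they could run. It is plain Python using the standard-library graphlib module, not Spark's own scheduler, and the stage names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each key depends on the stages in its set.
stage_dependencies = {
    "read_orders": set(),
    "read_customers": set(),
    "join": {"read_orders", "read_customers"},
    "aggregate": {"join"},
}


def schedule(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = schedule(stage_dependencies)
print(order)
```

Because the graph is acyclic, a valid order always exists: both read stages come before the join, and the join comes before the aggregate.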

2. Cluster Manager

The Cluster Manager allocates resources for Spark applications.

Examples:

  • Standalone Cluster Manager
  • YARN
  • Kubernetes
  • Mesos (deprecated since Spark 3.2)

Its job is to:

  • allocate executors
  • manage cluster resources
  • handle scheduling at the cluster level
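
To make the idea of executor allocation concrete, here is a minimal sketch of a greedy placement policy. It is an assumption made for illustration only: real cluster managers such as YARN or Kubernetes apply their own, far more elaborate policies, and the function name, node names, and core counts below are hypothetical.

```python
def allocate_executors(free_cores, num_executors, cores_per_executor):
    """Greedily place executors on the nodes with the most free cores.

    free_cores maps node name -> free CPU cores (assumes at least one node).
    Returns the node chosen for each executor that could be granted.
    """
    remaining = dict(free_cores)
    placements = []
    for _ in range(num_executors):
        # Pick the node with the most free cores; ties go to the
        # alphabetically first node name.
        node = max(sorted(remaining), key=lambda n: remaining[n])
        if remaining[node] < cores_per_executor:
            break  # cluster is full: grant fewer executors than requested
        remaining[node] -= cores_per_executor
        placements.append(node)
    return placements


placements = allocate_executors({"node1": 4, "node2": 8}, 3, 4)
print(placements)  # ['node2', 'node1', 'node2']
```

Requesting more executors than the cluster can hold simply returns a shorter list, which mirrors an application receiving fewer resources than it asked for.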

3. Executors

Executors are worker processes that run on cluster nodes.

They are responsible for:

  • executing tasks
  • storing intermediate data
  • caching RDDs/DataFrames
  • sending results back to the driver

Each application gets its own set of executors.


4. Worker Nodes

Worker nodes are machines in the cluster that host executors.

They provide:

  • CPU
  • memory
  • local storage used during execution (for example, shuffle and spill files)

Execution Flow

A Spark job execution follows this flow:

  1. User submits the Spark application
  2. Driver program starts
  3. Driver builds the execution plan as a DAG of stages
  4. Cluster manager allocates resources
  5. Executors are launched on worker nodes
  6. Driver splits the stages into tasks and distributes them to executors
  7. Executors process data partitions in parallel
  8. Results are sent back to the driver
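
The flow above can be imitated in a few lines of plain Python. This is a toy model, not Spark: a thread pool stands in for executors on worker nodes, and run_job is a hypothetical name for the driver's role.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_partitions, task):
    """Toy driver: split the data, run one task per partition, combine results."""
    # Driver side: plan the work as one task per partition.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # "Executors": a thread pool stands in for processes on worker nodes.
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        partial_results = list(executors.map(task, partitions))
    # Driver side: combine the partial results sent back by the executors.
    return sum(partial_results)


job_total = run_job(list(range(1, 101)), num_partitions=4, task=sum)
print(job_total)  # 5050
```

Each partition is summed independently, and only the small partial sums travel back to the driver, which is the same shape as steps 6 to 8.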

High-Level Architecture Diagram

        +---------------------+
        |   Driver Program    |
        |  (DAG Scheduler)    |
        +----------+----------+
                   |
                   |
        +----------v----------+
        |  Cluster Manager    |
        +----------+----------+
                   |
     +-------------+--------------+
     |                            |
+----v-----+              +-------v-----+
| Executor |              |  Executor   |
|  Node 1  |              |   Node 2    |
+----------+              +-------------+
     |                            |
     +-------------+--------------+
                   |
             Worker Nodes


Key Design Principles

1. Fault Tolerance

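Spark achieves fault tolerance through:

  • the lineage graph, which records how each dataset was derived
  • task re-execution

If a node fails, Spark uses the lineage to recompute the lost partitions by re-running the tasks that produced them.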
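
A minimal sketch of lineage-based recovery, in plain Python rather than the Spark API: the hypothetical Partition class below stores the steps that produced its data, so a lost result can be rebuilt by replaying them against the original input.

```python
class Partition:
    """Toy partition that remembers how it was derived (its lineage)."""

    def __init__(self, source, transforms=()):
        self.source = source                  # durable input data
        self.transforms = tuple(transforms)   # ordered functions applied to source
        self.data = None                      # in-memory result, may be lost

    def map(self, fn):
        # Record the step in the lineage instead of computing anything.
        return Partition(self.source, self.transforms + (fn,))

    def compute(self):
        """Return the data, replaying the lineage if the result is missing."""
        if self.data is None:
            rows = list(self.source)
            for fn in self.transforms:
                rows = [fn(r) for r in rows]
            self.data = rows
        return self.data


part = Partition([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
first = part.compute()        # [11, 21, 31]
part.data = None              # simulate losing the node that held the result
recovered = part.compute()    # rebuilt from the recorded lineage
```

Nothing is replicated here; recovery works only because the input is still available and the recorded steps are deterministic.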


2. Parallelism

Data is split into partitions and processed in parallel across executors.
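
As a small sketch of this idea, the code below splits a list into near-equal partitions and processes each one with its own worker. It is plain Python, not Spark; the thread pool is only a stand-in for executors (threads in CPython do not give true CPU parallelism), and the helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row.
        end = start + size + (1 if i < extra else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def count_even(partition):
    """Per-partition task: count the even numbers."""
    return sum(1 for x in partition if x % 2 == 0)


partitions = split_into_partitions(list(range(10)), 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(count_even, partitions))

print(partitions)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(counts)      # [2, 2, 1]
```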


3. Lazy Execution

Spark does not execute transformations immediately.

It records them in a plan first, then executes the plan only when an action (such as count or collect) is triggered.
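
The behaviour can be sketched with a toy class in plain Python. LazyDataset is a hypothetical name and not part of Spark; it only shows that transformations extend a plan while an action runs it.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, actions execute them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = tuple(plan)

    def map(self, fn):
        # Transformation: only extends the plan.
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: only extends the plan.
        return LazyDataset(self._rows, self._plan + (("filter", pred),))

    def collect(self):
        # Action: executes every recorded step in order.
        rows = list(self._rows)
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


calls = []


def double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # 0: nothing has executed yet
result = ds.collect()             # [6, 8]
```

Deferring execution is what gives an engine the chance to look at the whole plan before running it.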


4. In-Memory Computing

Spark keeps intermediate results in memory where possible, which avoids the disk writes between steps that slow down disk-based systems like Hadoop MapReduce.
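
A minimal sketch of why keeping an intermediate result in memory helps, again in plain Python with hypothetical names: the stage is computed once and then reused by later computations instead of being rebuilt each time.

```python
class CachedStage:
    """Toy intermediate result that is computed once and kept in memory."""

    def __init__(self, fn, rows):
        self._fn = fn
        self._rows = rows
        self._cached = None
        self.computations = 0  # how many times the stage actually ran

    def get(self):
        if self._cached is None:
            # First use: compute the stage and keep the result in memory.
            self._cached = self._fn(self._rows)
            self.computations += 1
        return self._cached


squares = CachedStage(lambda rows: [r * r for r in rows], [1, 2, 3])
sum_of_squares = sum(squares.get())  # computes the stage
largest_square = max(squares.get())  # reuses the in-memory result
print(sum_of_squares, largest_square, squares.computations)  # 14 9 1
```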


Driver vs Executor

Component | Responsibility
Driver    | Builds DAG, schedules jobs
Executor  | Executes tasks, stores data

Common Interview Questions

  • What is Spark architecture?
  • What is the role of driver and executor?
  • What happens when a Spark job is submitted?
  • How does Spark achieve fault tolerance?

Mental Model

Think of Spark as:

A central brain (Driver) coordinating many workers (Executors) across a distributed cluster.


Summary

Spark architecture is built on:

  • Driver program (control center)
  • Cluster manager (resource allocator)
  • Executors (compute workers)
  • Worker nodes (physical machines)

Together they enable large-scale distributed data processing.