Spark Architecture Overview ⚡

Apache Spark is a distributed computing system designed to process large-scale data efficiently across clusters of machines.

What is Spark Architecture?

Spark architecture defines how a Spark application runs across a distributed cluster.

It follows a master–worker model.

Core Components

1. Driver Program

The Driver is the central control unit of a Spark application.

It is responsible for:

converting user code into execution plans
building the DAG (Directed Acyclic Graph)
scheduling tasks
communicating with the cluster manager
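
To make the "execution plan" idea concrete, here is a minimal sketch of how a chain of operations could be grouped into stages, with a new stage starting at every operation that needs a shuffle. The `Op` class, the `wide` flag, and `split_into_stages` are hypothetical names used only for illustration; this is a toy model, not Spark's actual scheduler code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One step of a toy plan (hypothetical, not a Spark class)."""
    name: str
    wide: bool  # True if the step needs data shuffled between partitions


def split_into_stages(ops):
    """Group consecutive ops into stages; a wide op starts a new stage."""
    stages = [[]]
    for op in ops:
        if op.wide and stages[-1]:
            stages.append([])  # shuffle boundary: close the current stage
        stages[-1].append(op.name)
    return [stage for stage in stages if stage]


plan = [
    Op("map", wide=False),
    Op("filter", wide=False),
    Op("group_by_key", wide=True),
    Op("map_values", wide=False),
]
stages = split_into_stages(plan)
```

In this model, `map` and `filter` end up in one stage because each output partition depends on a single input partition, while `group_by_key` begins a second stage.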

2. Cluster Manager

The Cluster Manager allocates resources for Spark applications.

Examples:

Standalone Cluster Manager
YARN
Kubernetes
Mesos (deprecated since Spark 3.2)

Its job is to:

allocate executors
manage cluster resources
schedule resources across applications at the cluster level (task scheduling stays with the Driver)
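
As a rough illustration of resource allocation, the sketch below places a requested number of executors on nodes in round-robin order while cores and memory last. The `Node` class, `allocate_executors`, and the node names are assumptions made for this example; real cluster managers such as YARN or Kubernetes use their own, far more elaborate placement logic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A toy worker node with free resources (hypothetical)."""
    name: str
    free_cores: int
    free_memory_gb: int


def allocate_executors(nodes, num_executors, cores_each, memory_gb_each):
    """Place executors round-robin; return the node name of each placement."""
    placements = []
    progress = True
    while len(placements) < num_executors and progress:
        progress = False
        for node in nodes:
            if len(placements) == num_executors:
                break
            fits = (node.free_cores >= cores_each
                    and node.free_memory_gb >= memory_gb_each)
            if fits:
                node.free_cores -= cores_each
                node.free_memory_gb -= memory_gb_each
                placements.append(node.name)
                progress = True
    return placements  # may be shorter than requested if the cluster is full


cluster = [Node("worker-1", 8, 32), Node("worker-2", 4, 16)]
placements = allocate_executors(cluster, num_executors=4,
                                cores_each=2, memory_gb_each=8)
```

If the cluster cannot fit every requested executor, the function simply returns fewer placements, which mirrors the idea that an application only gets what the cluster can actually provide.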

3. Executors

Executors are worker processes that run on cluster nodes.

They are responsible for:

executing tasks
storing intermediate data
caching RDDs/DataFrames
sending results back to driver

Each application gets its own set of executors.

4. Worker Nodes

Worker nodes are machines in the cluster that host executors.

They provide:

CPU
memory
local storage for execution (for example, shuffle and spilled data)

Execution Flow

A Spark job execution follows this flow:

User submits Spark application
Driver program starts
Driver builds the logical plan (DAG) and splits it into stages of tasks
Cluster manager allocates resources
Executors are launched on worker nodes
Tasks are distributed to executors
Executors process data in parallel
Results are sent back to driver
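
The flow above can be sketched end to end with the standard library. In this toy version a thread pool stands in for the executors (in real Spark, executors are separate processes on worker nodes), and `make_partitions` and `run_job` are hypothetical helper names, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(data, num_partitions):
    """Split a list into contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_job(data, task, num_executors=4):
    """Toy driver: plan one task per partition, run them, gather results."""
    partitions = make_partitions(data, num_executors)
    # The pool plays the role of the executors launched for this job.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(task, partitions))
    return partial_results  # per-partition results returned to the driver


# Each "executor" sums its own partition; the driver combines the partial sums.
total = sum(run_job(list(range(1, 101)), sum))
```

The final `sum(...)` on the driver side corresponds to the last step of the flow, where partial results come back and are combined.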

High-Level Architecture Diagram

        +---------------------+
        |   Driver Program    |
        |  (DAG Scheduler)    |
        +----------+----------+
                   |
                   |
        +----------v----------+
        |  Cluster Manager    |
        +----------+----------+
                   |
     +-------------+--------------+
     |                            |
+----v-----+              +-------v-----+
| Executor |              | Executor    |
| Node 1   |              | Node 2      |
+----------+              +-------------+
     |                            |
     +-------------+--------------+
                   |
              Worker Nodes

Key Design Principles

1. Fault Tolerance

Spark achieves fault tolerance through:

lineage graph
task re-execution

If a node fails, Spark uses the lineage graph to recompute only the lost partitions instead of restarting the whole job.
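
Lineage-based recovery can be illustrated with a small model: each dataset remembers its source partitions and the ordered functions applied to them, so a lost partition can be rebuilt on demand. `LineageDataset`, `compute_log`, and the simulated failure are all assumptions for this sketch and do not correspond to Spark classes.

```python
class LineageDataset:
    """Toy dataset that records how its partitions were derived."""

    def __init__(self, source_partitions, transforms=()):
        self.source_partitions = source_partitions  # durable input data
        self.transforms = tuple(transforms)         # the lineage
        self.computed = {}                          # partition index -> rows
        self.compute_log = []                       # which partitions were built

    def map(self, fn):
        # Nothing is computed here; the function is only added to the lineage.
        return LineageDataset(self.source_partitions, self.transforms + (fn,))

    def _compute_partition(self, index):
        rows = self.source_partitions[index]
        for fn in self.transforms:
            rows = [fn(row) for row in rows]
        self.computed[index] = rows
        self.compute_log.append(index)

    def collect(self):
        out = []
        for index in range(len(self.source_partitions)):
            if index not in self.computed:
                self._compute_partition(index)  # rebuild only what is missing
            out.extend(self.computed[index])
        return out


ds = LineageDataset([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
first = ds.collect()
del ds.computed[1]       # simulate losing the node that held partition 1
second = ds.collect()    # only partition 1 is recomputed from its lineage
```

After the simulated failure, `compute_log` shows that partition 0 was built once and partition 1 twice, which is the point of lineage: recovery cost is proportional to what was lost.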

2. Parallelism

Data is split into partitions and processed in parallel across executors.
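
One common way to split keyed data is hash partitioning: the partition index is a hash of the key modulo the number of partitions, so all records with the same key land together. The sketch below uses `zlib.crc32` as a deterministic stand-in hash; the function names and sample records are hypothetical, and Spark's own partitioners work on the JVM with different hash functions.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a string key to a partition index using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions):
    """Distribute (key, value) records across partitions by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


records = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4), ("banana", 5)]
partitions = partition_records(records, num_partitions=3)
```

Because every record for a given key sits in a single partition, a per-key aggregation can then run on each partition independently and in parallel.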

3. Lazy Execution

Spark does not execute transformations immediately.

It builds a plan first, then executes only when an action is triggered.
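
A small model shows the difference between recording a plan and running it. In this sketch, `map` and `filter` only append steps, and nothing runs until the `collect` action is called. `LazyPipeline` and the `calls` list are hypothetical and exist only to make the laziness observable.

```python
class LazyPipeline:
    """Toy pipeline: transformations are recorded, actions execute them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        """Action: run every recorded step and return the results."""
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every element the map function actually touches


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only the plan exists so far
result = pipeline.collect()       # the action triggers execution
```

Building `pipeline` never calls `double`; the function runs only once `collect` is invoked.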

4. In-Memory Computing

Spark can keep intermediate results in memory, which makes repeated access faster than in disk-based systems like Hadoop MapReduce that write intermediate results to disk.
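
The benefit of keeping results in memory shows up when the same data is used by more than one action. The toy class below counts how often its transformation runs with and without caching; `ToyDataset`, `evaluations`, and the method names are assumptions for illustration, not Spark's API.

```python
class ToyDataset:
    """Toy dataset whose results can optionally be kept in memory."""

    def __init__(self, source, transform):
        self.source = source
        self.transform = transform
        self.evaluations = 0        # how many full passes over the source ran
        self._cache_enabled = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; data is stored on the first action.
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached     # served from memory, no recomputation
        self.evaluations += 1
        rows = [self.transform(x) for x in self.source]
        if self._cache_enabled:
            self._cached = rows
        return rows

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


uncached = ToyDataset(range(5), lambda x: x * x)
uncached.count()
uncached.total()   # recomputes the squares a second time

cached = ToyDataset(range(5), lambda x: x * x).cache()
cached.count()
cached_total = cached.total()   # reuses the in-memory result
```

Two actions on the uncached dataset trigger two full passes over the source, while the cached one computes once and reuses the stored rows.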

Driver vs Executor

Component	Responsibility
Driver	Builds DAG, schedules jobs
Executor	Executes tasks, stores data

Common Interview Questions

What is Spark architecture?
What is the role of driver and executor?
What happens when a Spark job is submitted?
How does Spark achieve fault tolerance?

Mental Model

Think of Spark as:

A central brain (Driver) coordinating many workers (Executors) across a distributed cluster.

Summary

Spark architecture is built on:

Driver program (control center)
Cluster manager (resource allocator)
Executors (compute workers)
Worker nodes (physical machines)

Together they enable large-scale distributed data processing.
