Spark Architecture Overview ⚡
Apache Spark is a distributed computing system designed to process large-scale data efficiently across clusters of machines.
What is Spark Architecture?
Spark architecture defines how a Spark application runs across a distributed cluster.
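It follows a master–worker model: the driver acts as the master and coordinates executors, which run on worker nodes and do the actual work.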
Core Components
1. Driver Program
The Driver is the central control unit of a Spark application.
It is responsible for:
- converting user code into execution plans
- building the DAG (Directed Acyclic Graph)
- scheduling tasks
- communicating with the cluster manager
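To make the DAG and scheduling responsibilities concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's scheduler: the step names in `dag` and the `schedule` helper are hypothetical, and the standard-library `graphlib` module stands in for the driver's planning logic. It shows how a dependency graph determines which steps can run at the same time.

```python
from graphlib import TopologicalSorter

# Hypothetical steps of a job. Each key depends on the steps in its value set.
dag = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
}

def schedule(graph):
    """Return batches of steps; all steps in a batch have their dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose inputs are all done
        batches.append(ready)
        sorter.done(*ready)                 # mark them finished to unlock successors
    return batches

batches = schedule(dag)
for number, batch in enumerate(batches):
    print(f"batch {number}: {batch}")
```

The two reads have no dependencies, so they land in the first batch and could run in parallel; every later step waits for its inputs.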
2. Cluster Manager
The Cluster Manager allocates resources for Spark applications.
Examples:
- Standalone Cluster Manager
- YARN
- Kubernetes
- Mesos (deprecated since Spark 3.2)
Its job is to:
- allocate executors
- manage cluster resources
- schedule resources across applications at the cluster level (scheduling tasks within an application remains the driver's job)
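The allocation step can be illustrated with a small toy model. This is only a sketch of the idea and assumes a simple round-robin placement by free cores. The `allocate_executors` function and the worker names are hypothetical and do not reflect how any real cluster manager decides placements.

```python
def allocate_executors(workers, num_executors, cores_per_executor):
    """Place executors on workers that still have enough free cores.

    workers: dict mapping worker name -> free cores.
    Returns a list of (executor_id, worker_name) placements. Fewer executors
    than requested are returned when the cluster runs out of cores.
    """
    free = dict(workers)          # copy so the caller's dict is not modified
    names = sorted(free)
    placements = []
    progress = True
    while len(placements) < num_executors and progress:
        progress = False
        for name in names:        # round-robin over workers
            if len(placements) == num_executors:
                break
            if free[name] >= cores_per_executor:
                free[name] -= cores_per_executor
                placements.append((len(placements), name))
                progress = True
    return placements

placements = allocate_executors(
    {"worker-1": 4, "worker-2": 2}, num_executors=4, cores_per_executor=2
)
print(placements)
```

With 6 free cores in total and 2 cores per executor, only 3 of the 4 requested executors can be placed, which mirrors how a request may be only partially satisfied when the cluster is busy.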
3. Executors
Executors are worker processes that run on cluster nodes.
They are responsible for:
- executing tasks
- storing intermediate data
- caching RDDs/DataFrames
- sending results back to the driver
Each application gets its own set of executors.
4. Worker Nodes
Worker nodes are machines in the cluster that host executors.
They provide:
- CPU
- memory
- local storage used during execution (for example, for shuffle and spilled data)
Execution Flow
A Spark job execution follows this flow:
- The user submits a Spark application
- The driver program starts
- The driver builds the logical plan (DAG)
- The cluster manager allocates resources
- Executors are launched on worker nodes
- Tasks are distributed to executors
- Executors process data in parallel
- Results are sent back to the driver
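The flow above can be imitated on a single machine. The following is a minimal sketch, assuming a thread pool as a stand-in for executors. The names `make_partitions`, `task`, and `run_job` are hypothetical, and nothing here uses the Spark API.

```python
from concurrent.futures import ThreadPoolExecutor

def make_partitions(data, num_partitions):
    """Split data into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

def task(partition):
    """Work done by one task on one partition: a sum of squares."""
    return sum(x * x for x in partition)

def run_job(data, num_partitions=4, num_executors=2):
    # "Driver": plan one task per partition.
    partitions = make_partitions(data, num_partitions)
    # "Executors": a thread pool runs the tasks concurrently.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(task, partitions))
    # "Driver": combine the partial results sent back by the tasks.
    return sum(partial_results)

total = run_job(list(range(10)))
print(total)
```

The driver-side code never touches individual records; it only plans tasks and combines their results, which is the division of labor the list describes.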
High-Level Architecture Diagram
        +---------------------+
        |   Driver Program    |
        |   (DAG Scheduler)   |
        +----------+----------+
                   |
                   |
        +----------v----------+
        |   Cluster Manager   |
        +----------+----------+
                   |
     +-------------+--------------+
     |                            |
+----v-----+              +-------v-----+
| Executor |              |  Executor   |
|  Node 1  |              |   Node 2    |
+----------+              +-------------+
     |                            |
     +-------------+--------------+
                   |
             Worker Nodes
Key Design Principles
1. Fault Tolerance
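Spark achieves fault tolerance through:
- the lineage graph, which records how each dataset was derived
- task re-execution
If a node fails, Spark reruns the affected tasks on other executors and recomputes only the lost partitions from their lineage.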
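Lineage-based recovery can be sketched with a toy class. This is an illustration under simplifying assumptions, not Spark's implementation: the `Dataset` class, its `materialized` dict (standing in for data held on an executor), and `lose_partition` (standing in for a node failure) are all hypothetical.

```python
class Dataset:
    """Toy dataset whose partitions are derived from a parent by a function."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.source_partitions = source_partitions  # durable input, root only
        self.parent = parent                        # lineage: where data comes from
        self.fn = fn                                # lineage: how it is derived
        self.materialized = {}                      # partition index -> computed rows

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def compute(self, index):
        """Return one partition, recomputing it from lineage if it is missing."""
        if index in self.materialized:
            return self.materialized[index]
        if self.parent is None:
            rows = list(self.source_partitions[index])
        else:
            rows = [self.fn(x) for x in self.parent.compute(index)]
        self.materialized[index] = rows
        return rows

    def lose_partition(self, index):
        """Simulate a failure that wipes one computed partition."""
        self.materialized.pop(index, None)

base = Dataset(source_partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

before = [plus_one.compute(i) for i in range(2)]
plus_one.lose_partition(1)
doubled.lose_partition(1)
recovered = plus_one.compute(1)  # rebuilt from lineage; partition 0 is untouched
print(before, recovered)
```

Only partition 1 is rebuilt after the simulated failure; partition 0 is served from what was already computed.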
2. Parallelism
Data is split into partitions and processed in parallel across executors.
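A quick back-of-the-envelope calculation shows how partition count and available cores interact. This sketch assumes one task per partition, one core per task (Spark's default for `spark.task.cpus`), and tasks of equal duration. The `task_waves` helper is hypothetical.

```python
import math

def task_waves(num_partitions, num_executors, cores_per_executor):
    """Number of rounds needed to run one task per partition."""
    slots = num_executors * cores_per_executor  # tasks that can run at once
    return math.ceil(num_partitions / slots)

print(task_waves(200, num_executors=5, cores_per_executor=4))
print(task_waves(10, num_executors=5, cores_per_executor=4))
```

With 20 task slots, 200 partitions need 10 waves, while 10 partitions finish in one wave and leave half the slots idle. Too few partitions waste cores, while very many small partitions add scheduling overhead.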
3. Lazy Execution
Spark does not execute transformations (such as map or filter) immediately.
It records them in a plan and runs that plan only when an action (such as count or collect) is triggered.
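The idea can be shown with a small plain-Python sketch. The `LazyDataset` class below is hypothetical and only mimics the behavior: `map` and `filter` record steps, and nothing runs until `collect` is called.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # recorded plan: tuples of (kind, function)

    def map(self, fn):       # transformation: only extends the plan
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):  # transformation: only extends the plan
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def collect(self):       # action: executes the recorded plan
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

calls = []

def traced_double(x):
    calls.append(x)  # record each real invocation
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = ds.collect()
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

Deferring execution lets the engine see the whole plan before running it, which is what makes optimizations across steps possible.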
4. In-Memory Computing
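Spark can keep intermediate results in memory, spilling to disk when they do not fit. This avoids writing data to disk between steps, which makes it faster than disk-based systems like Hadoop MapReduce, especially for workloads that reuse the same data.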
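The benefit of keeping a result in memory shows up when the same data is used more than once. The sketch below is a toy model: the `CachedResult` class and its `cache` method are hypothetical and only count how often the underlying computation runs.

```python
class CachedResult:
    """Toy wrapper that optionally keeps a computed value in memory."""

    def __init__(self, compute):
        self._compute = compute
        self._value = None
        self._has_value = False
        self.cache_enabled = False
        self.compute_count = 0  # how many times the computation actually ran

    def cache(self):
        self.cache_enabled = True
        return self

    def get(self):
        if self.cache_enabled and self._has_value:
            return self._value  # served from memory
        self.compute_count += 1
        value = self._compute()
        if self.cache_enabled:
            self._value, self._has_value = value, True
        return value

uncached = CachedResult(lambda: sum(range(1000)))
cached = CachedResult(lambda: sum(range(1000))).cache()

for _ in range(3):
    uncached.get()
    cached.get()

print(uncached.compute_count, cached.compute_count)
```

Without caching, each of the three reads repeats the computation; with caching, it runs once and later reads come from memory.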
Driver vs Executor
| Component | Responsibility |
|---|---|
| Driver | Builds DAG, schedules jobs |
| Executor | Executes tasks, stores data |
Common Interview Questions
- What is Spark architecture?
- What is the role of driver and executor?
- What happens when a Spark job is submitted?
- How does Spark achieve fault tolerance?
Mental Model
Think of Spark as:
A central brain (Driver) coordinating many workers (Executors) across a distributed cluster.
Summary
Spark architecture is built on:
- Driver program (control center)
- Cluster manager (resource allocator)
- Executors (compute workers)
- Worker nodes (physical machines)
Together they enable large-scale distributed data processing.