Memory Management 🧠
Spark memory management controls how executor memory is split between execution (computation) and storage (caching) during distributed processing.
Efficient memory usage is critical for performance and stability.
Overview of Spark Memory
Spark memory is divided into two main areas:
- Execution Memory
- Storage Memory
Both share a unified memory pool.
Execution Memory
Execution memory is used for computation tasks such as:
- joins
- aggregations
- sorting
- shuffles
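When execution needs more memory, Spark gives it priority: it can evict cached blocks until storage shrinks to the protected share set by spark.memory.storageFraction.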
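As a rough illustration of how execution memory is shared among tasks: Spark's execution pool aims to let each of the N active tasks obtain at least 1/(2N) of the pool before it is forced to spill, and at most 1/N. The sketch below only performs that arithmetic. The function name and the example pool size are hypothetical, and the pool size is an input you supply rather than something read from a running executor.

```python
def per_task_execution_bounds(execution_pool_mb, active_tasks):
    """Return (min_share_mb, max_share_mb) for a single task.

    Assumes the 1/(2N) .. 1/N sharing rule for N active tasks.
    """
    if active_tasks < 1:
        raise ValueError("active_tasks must be at least 1")
    min_share = execution_pool_mb / (2 * active_tasks)
    max_share = execution_pool_mb / active_tasks
    return min_share, max_share


# Example: a 1200 MB execution pool shared by 4 concurrently running tasks.
low, high = per_task_execution_bounds(1200, 4)
print(f"each task: about {low:.0f} MB before spilling, at most {high:.0f} MB")
```

Fewer concurrent tasks per executor means a larger share for each task, which is one reason lowering cores per executor can reduce spilling.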
Storage Memory
Storage memory is used for:
- caching DataFrames
- persisting RDDs
- broadcast variables
It improves performance by avoiding recomputation.
Unified Memory Model
Spark uses a unified memory system where:
- execution and storage share the same memory pool
- memory can be borrowed dynamically between them
Borrowing is asymmetric: execution can reclaim memory by evicting cached blocks, but storage can only use execution memory that is currently free.
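To make the size of the shared pool concrete, here is a minimal sketch of the arithmetic. It assumes the documented defaults spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5, and the fixed 300 MB reserve. The function name is hypothetical, and the heap the JVM actually reports is usually a little smaller than the configured spark.executor.memory, so treat the numbers as estimates.

```python
RESERVED_MB = 300  # fixed reserve set aside before the unified pool is sized


def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the unified pool and its regions for a given executor heap."""
    available = max(heap_mb - RESERVED_MB, 0)
    unified = available * memory_fraction          # execution + storage
    protected_storage = unified * storage_fraction  # cached blocks safe from eviction
    return {
        "unified_pool_mb": unified,
        "protected_storage_mb": protected_storage,
        "initial_execution_mb": unified - protected_storage,
        "user_memory_mb": available - unified,      # user data structures, metadata
    }


# Example: an executor with a 4 GB (4096 MB) heap.
for name, size in unified_memory_regions(4096).items():
    print(f"{name}: {size:.1f}")
```

The boundary between storage and execution is soft: the "protected" figure is only the amount of cached data that execution cannot evict.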
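Memory Flow Behavior
- Execution requests memory → Storage evicts cached blocks if needed, but only down to the protected share set by spark.memory.storageFraction
- Storage requests memory → it can take execution memory that is free; memory already held by running tasks is never evicted
- If execution still cannot get enough memory → Spark spills to disk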
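The borrowing rules can be sketched with a toy model. This is not Spark's implementation: the class and method names are hypothetical, sizes are plain numbers in MB, and it ignores details such as evicting older cached blocks to make room for new ones. It assumes that execution may evict cached data only down to the protected storage share, and that storage never evicts execution memory.

```python
class ToyUnifiedPool:
    """Toy model of a pool shared by execution and storage (sizes in MB)."""

    def __init__(self, total_mb, storage_fraction=0.5):
        self.total = total_mb
        self.protected_storage = total_mb * storage_fraction
        self.execution_used = 0.0
        self.storage_used = 0.0

    def free(self):
        return self.total - self.execution_used - self.storage_used

    def acquire_execution(self, mb):
        """Grant up to `mb`; the caller must spill whatever is not granted."""
        shortfall = mb - self.free()
        if shortfall > 0:
            # Evict cached blocks, but never below the protected share.
            evictable = max(0.0, self.storage_used - self.protected_storage)
            self.storage_used -= min(shortfall, evictable)
        granted = min(mb, self.free())
        self.execution_used += granted
        return granted

    def acquire_storage(self, mb):
        """Cache a block only if free memory exists; never evict execution."""
        if mb <= self.free():
            self.storage_used += mb
            return True
        return False


pool = ToyUnifiedPool(1000)
pool.acquire_storage(800)              # storage borrows beyond its 500 MB share
granted = pool.acquire_execution(400)  # execution evicts 200 MB of cached data
print(granted, pool.storage_used)
```

A second call such as `pool.acquire_execution(300)` would be granted only part of the request, which is the point at which a real task would spill to disk.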
Memory Issues in Spark
Common memory-related problems:
- OutOfMemoryError (OOM)
- excessive disk spill
- GC overhead
- slow execution due to memory pressure
Garbage Collection (GC)
Spark runs on the JVM, so garbage collection affects performance.
Problems caused by GC:
- pauses in execution
- increased latency
- reduced throughput
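One way to spot GC pressure is to compare the GC time Spark reports for a task or executor with its total run time. The helper below is a minimal sketch of that check. The function names are hypothetical, and the 10% default threshold is an assumed rule of thumb rather than a Spark setting.

```python
def gc_time_fraction(gc_time_ms, task_time_ms):
    """Fraction of run time spent in garbage collection."""
    if task_time_ms <= 0:
        raise ValueError("task_time_ms must be positive")
    return gc_time_ms / task_time_ms


def gc_is_suspicious(gc_time_ms, task_time_ms, threshold=0.10):
    """Flag runs whose GC share exceeds the (assumed) threshold."""
    return gc_time_fraction(gc_time_ms, task_time_ms) > threshold


# Example: 1.5 s of GC in a 10 s task.
print(gc_time_fraction(1500, 10_000), gc_is_suspicious(1500, 10_000))
```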
Memory Optimization Techniques
To improve memory usage:
- avoid unnecessary caching
- use efficient data formats (Parquet, Avro)
- reduce shuffle size
- increase executor memory carefully
- tune serialization (Kryo preferred)
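Several of these techniques come down to configuration. As a hedged sketch, the snippet below collects a few memory-related settings in a dictionary and renders them as `--conf key=value` arguments for spark-submit. The property names are real Spark settings, but the values are illustrative placeholders, not recommendations, and the helper function is hypothetical.

```python
# Illustrative values only; size these for your own workload.
memory_conf = {
    "spark.executor.memory": "4g",
    "spark.memory.fraction": "0.6",
    "spark.memory.storageFraction": "0.5",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def to_submit_args(conf):
    """Render a settings dict as a flat list of spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(memory_conf)))
```

Keeping the settings in one place makes it easier to change a single value, rerun the job, and compare spill and GC metrics.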
Caching vs Persisting
- cache(): uses the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames and Datasets)
- persist(): allows custom storage levels
Use caching only when:
- data is reused multiple times
- computation is expensive
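The reuse rule above can be captured in a small helper. This is a minimal sketch: the function name and the `times_reused` parameter are hypothetical, and it relies only on the `cache()` and `persist(storage_level)` methods that Spark DataFrames and RDDs expose.

```python
def persist_if_reused(df, times_reused, storage_level=None):
    """Persist `df` only when it will be read more than once.

    With no storage level, fall back to cache() and its default level.
    """
    if times_reused < 2:
        return df  # a single use gains nothing from caching
    if storage_level is None:
        return df.cache()
    return df.persist(storage_level)
```

With PySpark this might be called as `persist_if_reused(df, 3, StorageLevel.MEMORY_AND_DISK)`, followed by `df.unpersist()` once the data is no longer needed so the storage memory is released.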
Memory Spill
When memory is insufficient:
- data is written to disk
- performance drops because of the extra disk I/O and serialization
- shuffle-heavy operations are most affected
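Spill in shuffle-heavy stages often shrinks when each task handles less data, which usually means more shuffle partitions (spark.sql.shuffle.partitions, default 200). The sketch below estimates a partition count from the shuffle size. The function name is hypothetical, and the 128 MiB target per partition is an assumed rule of thumb, not a Spark default.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MIB):
    """Estimate a shuffle partition count for a target partition size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


# Example: a stage that shuffles about 50 GiB.
print(suggest_shuffle_partitions(50 * GIB))
```

The shuffle size itself can be read from the stage's shuffle write metric in the Spark UI.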
Mental Model
Think of Spark memory as:
A shared workspace where computation and storage compete dynamically for resources.
Key Takeaway
- Memory is shared between execution and storage
- Poor memory management leads to spills and OOM
- Proper tuning is essential for production workloads