Memory Management 🧠 ​

Spark memory management governs how each executor's memory is divided between computation (execution) and cached data (storage) during distributed processing.

Efficient memory usage is critical for performance and stability.


Overview of Spark Memory ​

Spark memory is divided into two main areas:

  • Execution Memory
  • Storage Memory

Both share a unified memory pool.


Execution Memory ​

Execution memory is used for computation tasks such as:

  • joins
  • aggregations
  • sorting
  • shuffles

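When execution needs more memory than is free, Spark prioritizes it: execution can evict cached blocks from storage, but only down to a protected threshold set by spark.memory.storageFraction.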


Storage Memory ​

Storage memory is used for:

  • caching DataFrames
  • persisting RDDs
  • broadcast variables

It improves performance by avoiding recomputation.


Unified Memory Model ​

Spark uses a unified memory system where:

  • execution and storage share the same memory pool
  • memory can be borrowed dynamically between them

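Either side can borrow memory that the other is not using. Execution can also reclaim memory that storage has borrowed by evicting cached blocks, whereas storage can never evict memory that execution is actively using.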
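
As a rough illustration of how large the shared pool is, the sketch below estimates it from an executor heap size. It assumes the documented defaults of spark.memory.fraction (0.6) and spark.memory.storageFraction (0.5) and a fixed 300 MiB reservation. The function name `unified_pool` is hypothetical, and the heap Spark actually sees is usually slightly smaller than the configured value, so treat the numbers as approximations.

```python
RESERVED_BYTES = 300 * 1024 * 1024  # fixed amount Spark keeps outside the unified pool
GIB = 1024 ** 3


def unified_pool(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate (unified_bytes, protected_storage_bytes) for one executor heap.

    memory_fraction mirrors spark.memory.fraction and storage_fraction mirrors
    spark.memory.storageFraction; both defaults are assumptions taken from the
    Spark configuration documentation.
    """
    usable = max(heap_bytes - RESERVED_BYTES, 0)
    unified = int(usable * memory_fraction)
    # Cached blocks below this size are protected from eviction by execution.
    protected_storage = int(unified * storage_fraction)
    return unified, protected_storage


unified, protected = unified_pool(4 * GIB)
print(f"unified pool: {unified / GIB:.2f} GiB")
print(f"eviction-protected storage: {protected / GIB:.2f} GiB")
```

The rest of the heap is left for user data structures and Spark's internal metadata, which is why the unified pool is noticeably smaller than the executor memory setting.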


Memory Flow Behavior ​

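  • Execution requests memory → Storage gives back borrowed memory by evicting cached blocks if needed
  • Storage requests memory → it can use whatever execution is not using; memory held by execution is never evicted
  • If execution still cannot get enough memory → Spark spills to disk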
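
The borrowing behavior can be sketched with a toy model. This is not Spark's implementation: the class `UnifiedPool` and its methods are hypothetical, sizes are in arbitrary units, and the model assumes that execution may evict cached blocks down to a protected floor while storage may only use memory that is currently free.

```python
class UnifiedPool:
    """Toy model of a shared execution/storage pool (arbitrary units)."""

    def __init__(self, total, protected_storage):
        self.total = total
        self.protected_storage = protected_storage  # storage below this is never evicted
        self.execution = 0
        self.storage = 0

    @property
    def free(self):
        return self.total - self.execution - self.storage

    def acquire_execution(self, amount):
        """Grant execution memory, evicting cached data above the floor if needed.

        Returns the amount granted. Any shortfall is what a real job would
        have to spill to disk.
        """
        if amount > self.free:
            evictable = max(self.storage - self.protected_storage, 0)
            self.storage -= min(evictable, amount - self.free)
        granted = min(amount, self.free)
        self.execution += granted
        return granted

    def acquire_storage(self, amount):
        """Cache a block only if it fits in free memory.

        Execution memory is never evicted. Real Spark would also try to evict
        older cached blocks, which this sketch leaves out.
        """
        if amount > self.free:
            return False
        self.storage += amount
        return True


pool = UnifiedPool(total=100, protected_storage=30)
pool.acquire_storage(80)              # cache fills most of the pool
granted = pool.acquire_execution(50)  # evicts 30 units of cached data
print(granted, pool.storage, pool.execution)
```

In this run the execution request is fully satisfied because storage was holding more than its protected share. A later request that exceeds the evictable amount would be only partly granted, which corresponds to a spill.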

Memory Issues in Spark ​

Common memory-related problems:

  • OutOfMemoryError (OOM)
  • excessive disk spill
  • GC overhead
  • slow execution due to memory pressure

Garbage Collection (GC) ​

Spark runs on the JVM, so garbage collection pauses directly affect task performance.

Problems caused by GC:

  • pauses in execution
  • increased latency
  • reduced throughput
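
One way to spot GC pressure is to compare each task's GC time with its total run time, both of which the Spark UI reports per task. The sketch below assumes you have exported those metrics into plain dictionaries. The field names, the sample values, and the 10% threshold are all illustrative assumptions, not Spark defaults.

```python
def gc_fraction(gc_time_ms, run_time_ms):
    """Share of a task's run time spent in garbage collection."""
    if run_time_ms <= 0:
        return 0.0
    return gc_time_ms / run_time_ms


def flag_gc_heavy(tasks, threshold=0.10):
    """Return the ids of tasks whose GC share exceeds the threshold."""
    return [
        t["task_id"]
        for t in tasks
        if gc_fraction(t["gc_time_ms"], t["run_time_ms"]) > threshold
    ]


# Hypothetical metrics copied from a stage's task table.
sample_tasks = [
    {"task_id": 0, "run_time_ms": 4000, "gc_time_ms": 120},
    {"task_id": 1, "run_time_ms": 5200, "gc_time_ms": 1900},
    {"task_id": 2, "run_time_ms": 3900, "gc_time_ms": 80},
]
print(flag_gc_heavy(sample_tasks))
```

Tasks flagged this way are candidates for smaller partitions, less on-heap caching, or a different storage level.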

Memory Optimization Techniques ​

To improve memory usage:

  • avoid unnecessary caching
  • use efficient data formats (Parquet, Avro)
  • reduce shuffle size
  • increase executor memory carefully, since larger heaps can mean longer GC pauses
  • use Kryo serialization instead of the default Java serialization
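
Several of these techniques come down to configuration. The sketch below collects a few real Spark properties in a dictionary and renders them as `spark-submit` arguments. The values are placeholders to adapt to your cluster, not recommendations, and the helper `to_submit_args` is a hypothetical name.

```python
# Illustrative values only; tune them against your own workload.
tuning_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "1g",
    "spark.memory.fraction": "0.6",
    "spark.memory.storageFraction": "0.5",
    "spark.sql.shuffle.partitions": "400",
}


def to_submit_args(conf):
    """Render a config dict as a list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(tuning_conf)))

# With PySpark installed, the same dict could be applied when building a session:
#   builder = SparkSession.builder
#   for key, value in tuning_conf.items():
#       builder = builder.config(key, value)
```

Keeping the settings in one place makes it easier to change a single value at a time and observe its effect on spill and GC time.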

Caching vs Persisting ​

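  • cache(): shorthand for persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames and Datasets)
  • persist(): accepts an explicit storage level, such as MEMORY_ONLY, MEMORY_AND_DISK, or DISK_ONLY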

Use caching only when:

  • data is reused multiple times
  • computation is expensive
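
A common pattern is to cache a dataset just before it is reused and release it as soon as the reuse is over. The helper below is a minimal sketch of that pattern. The name `reuse_cached` is hypothetical, and it relies only on the `cache()`, `persist()`, and `unpersist()` methods that PySpark DataFrames provide.

```python
def reuse_cached(df, actions, storage_level=None):
    """Cache df, run every action against it, then free the cached blocks.

    df is expected to behave like a PySpark DataFrame. storage_level is an
    optional pyspark.StorageLevel; when omitted, cache() applies the default.
    """
    cached = df.persist(storage_level) if storage_level is not None else df.cache()
    try:
        return [action(cached) for action in actions]
    finally:
        # Release storage memory even if one of the actions fails.
        cached.unpersist()


# Usage, assuming PySpark is installed and `orders` is an existing DataFrame:
#   from pyspark import StorageLevel
#   total, per_day = reuse_cached(
#       orders,
#       [lambda d: d.count(), lambda d: d.groupBy("day").count().collect()],
#       StorageLevel.MEMORY_AND_DISK,
#   )
```

Because caching is lazy, the first action pays the cost of computing and storing the data, and only the later actions benefit from it.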

Memory Spill ​

When execution memory is insufficient:

  • intermediate data is serialized and written to disk
  • performance drops because of the extra disk I/O and serialization
  • shuffle-heavy operations are most affected
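
To make the cost of spilling concrete, here is a small external sort in plain Python. It is an analogy for what a sort or shuffle does under memory pressure, not Spark's actual spill code. The function `sort_with_spill` and the item-count memory budget are assumptions made for illustration.

```python
import heapq
import os
import tempfile


def sort_with_spill(values, max_in_memory):
    """Sort integers while buffering at most max_in_memory items at a time.

    Whenever the buffer fills up, it is sorted and written to a temporary
    file (a "spill"). Returns (sorted_list, spill_count).
    """
    spill_paths = []
    buffer = []

    def spill():
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        spill_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill()

    buffer.sort()
    files = [open(p) for p in spill_paths]
    try:
        # Merge the in-memory remainder with every sorted run read back from disk.
        # The result is collected into a list only to keep the demo simple.
        runs = [map(int, f) for f in files]
        merged = list(heapq.merge(buffer, *runs))
    finally:
        for f in files:
            f.close()
        for p in spill_paths:
            os.remove(p)
    return merged, len(spill_paths)


result, spills = sort_with_spill([5, 3, 9, 1, 7, 2, 8], max_in_memory=3)
print(result, "spills:", spills)
```

Every spill adds a write, a later read, and a merge step, which is why a larger memory budget or smaller partitions can speed up shuffle-heavy stages.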

Mental Model ​

Think of Spark memory as:

A shared workspace where computation and storage compete dynamically for resources.


Key Takeaway ​

  • Memory is shared between execution and storage
  • Poor memory management leads to spills and OOM
  • Proper tuning is essential for production workloads