Skip to content

Joins & Partitions 🔗⚙️

Joins and partitioning are critical concepts in PySpark that directly impact performance and scalability.


What is a Join?

A join is an operation used to combine two DataFrames based on a common key.


Types of Joins

1. Inner Join

Returns only matching records from both DataFrames.

result = df1.join(df2, df1.id == df2.id, "inner")

2. Left Join

Returns all records from left DataFrame and matched records from right.

result = df1.join(df2, df1.id == df2.id, "left")

3. Right Join

Returns all records from right DataFrame and matched records from left.

result = df1.join(df2, df1.id == df2.id, "right")

4. Outer Join

Returns all records from both DataFrames.

result = df1.join(df2, df1.id == df2.id, "outer")

Join Execution in Spark

When a join happens:

  1. Spark checks data size
  2. Chooses join strategy
  3. Performs shuffle if required
  4. Executes distributed join

Join Strategies

1. Broadcast Join

Used when one dataset is small.

  • small dataset is sent to all executors

  • avoids shuffle

  • very fast

    from pyspark.sql.functions import broadcast

    result = df1.join(broadcast(df2), "id")


2. Shuffle Join

Used when both datasets are large.

  • data is shuffled across cluster
  • expensive operation
  • default join type

When Shuffle Happens in Joins

Shuffle occurs when:

  • join keys are not partitioned
  • datasets are large
  • broadcast is not used

What is Partitioning?

Partitioning is how Spark splits data across the cluster.

Each partition is processed independently.


Why Partitioning Matters

Good partitioning:

  • improves parallelism
  • reduces shuffle
  • improves join performance

Bad partitioning:

  • causes skew
  • increases shuffle cost
  • slows execution

Partitioning Strategies

1. Hash Partitioning

  • default method
  • distributes based on hash of key

2. Range Partitioning

  • used for sorted data
  • preserves order

3. Custom Partitioning

  • user-defined logic
  • used in advanced systems

Repartition vs Coalesce

Repartition

  • increases or changes partitions

  • triggers shuffle

    df = df.repartition(10)


Coalesce

  • reduces partitions

  • avoids full shuffle

    df = df.coalesce(5)


Data Skew Problem

Data skew happens when:

  • one partition has more data than others

This causes:

  • slow tasks
  • executor imbalance

Fixing Data Skew

  • use salting keys
  • use broadcast join
  • repartition data properly

Performance Impact

Joins and partitions affect:

  • execution time
  • memory usage
  • shuffle cost
  • cluster utilization

Best Practices

  • use broadcast join for small datasets
  • avoid unnecessary shuffle joins
  • tune partition size properly
  • handle data skew carefully

Mental Model

Think of joins as:

A distributed matching process that often requires reshuffling data across the cluster unless optimized.


Summary

Joins and partitions in Spark:

  • define how data is combined and distributed
  • heavily impact performance
  • must be optimized to reduce shuffle