PySpark Transformations 🔄 ​

Transformations are operations that create a new DataFrame from an existing one.

They are lazy: Spark records them in a plan and runs nothing until an action asks for a result.


What are Transformations? ​

Transformations define how data should be processed, but Spark does not execute them right away.

Execution happens only when an action is triggered, such as show(), count(), collect(), or a write.


Types of Transformations ​

1. Narrow Transformations ​

Narrow transformations do not require data movement across partitions.

Examples:

  • select
  • filter
  • withColumn

2. Wide Transformations ​

Wide transformations require data shuffling across partitions.

Examples:

  • groupBy
  • join (a broadcast join of a small table can avoid the shuffle)
  • distinct
  • orderBy

Narrow Transformations ​

Narrow transformations:

  • operate within a single partition
  • do not trigger shuffle
  • are generally cheaper, since no data crosses the network

Example:

df2 = df.select("name", "age")
df3 = df.filter(df.age > 25)

Wide Transformations ​

Wide transformations:

  • require data movement between partitions, often across executors
  • trigger shuffle
  • are expensive, because a shuffle involves serialization, disk I/O, and network I/O

Example:

df_grouped = df.groupBy("age").count()

Why Shuffle happens in Wide Transformations ​

Shuffle happens because Spark must:

  • bring all rows with the same key into the same partition
  • redistribute data across partitions, often over the network
  • ensure each aggregation or join sees every row for a key
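
The idea can be illustrated with a toy model in plain Python. This is not Spark's shuffle implementation; the helper names `shuffle_by_key` and `count_per_key` are made up for this sketch, and partitions are modelled as simple lists.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_output_partitions):
    """Route every (key, value) record to the partition chosen by its key."""
    output = [[] for _ in range(num_output_partitions)]
    for partition in partitions:
        for key, value in partition:
            output[hash(key) % num_output_partitions].append((key, value))
    return output


def count_per_key(partition):
    """Count records per key using only the rows in one partition."""
    counts = defaultdict(int)
    for key, _ in partition:
        counts[key] += 1
    return dict(counts)


# Two input partitions; the key 30 appears in both.
input_partitions = [
    [(30, "Alice"), (22, "Bob")],
    [(30, "Cara"), (41, "Dan")],
]

# Without a shuffle, each partition sees only part of key 30.
local_counts = [count_per_key(p) for p in input_partitions]

# After the shuffle, all rows for a key sit in one partition, so counts are complete.
shuffled = shuffle_by_key(input_partitions, 2)
global_counts = [count_per_key(p) for p in shuffled]

print(local_counts)
print(global_counts)
```

Counting per partition before the shuffle reports key 30 once in each partition; after the shuffle it is reported once, with a count of 2.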

Transformation Characteristics ​

Lazy Evaluation ​

Transformations are not executed immediately.

Spark only records them in a logical plan, which it optimizes and executes once an action runs.


Immutability ​

Each transformation returns a new DataFrame.

The DataFrame a transformation is called on is never modified.


Example Transformation Flow ​

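# Assumes an active SparkSession named spark and a data.csv file with an age column.
# header=True takes column names from the first row.
# Values load as strings unless a schema is supplied or inferred.
df1 = spark.read.csv("data.csv", header=True)

df2 = df1.filter(df1.age > 25)

df3 = df2.withColumn("age_plus_10", df2.age + 10)

Neither the filter nor the withColumn step runs until an action is called on df3. Reading the header row to resolve column names is the one piece of work Spark does up front here.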


DAG Impact ​

Each transformation:

  • adds a node to the DAG
  • extends the lineage Spark can use to recompute lost partitions
  • gives Spark more of the pipeline to optimize before anything runs

Common Transformations in PySpark ​

Column Operations ​

  • select
  • withColumn
  • drop
  • alias

Row Operations ​

  • filter
  • where (an alias of filter)

Aggregations ​

  • groupBy
  • agg
  • count (as an aggregate, for example groupBy(...).count(); DataFrame.count() on its own is an action)
  • sum

Joins ​

  • inner join
  • left join
  • right join
  • full outer join

Performance Considerations ​

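  • narrow transformations are comparatively cheap, since each partition is processed independently
  • wide transformations are expensive, since they shuffle data between partitions
  • minimize shuffles whenever possible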

Common Mistakes ​

  • chaining many wide transformations, each of which adds a shuffle
  • unnecessary groupBy operations
  • ignoring partitioning strategy
  • collecting rows to the driver and looping in Python instead of using transformations

Best Practices ​

  • filter early in the pipeline
  • reduce data (fewer rows and columns) before a shuffle
  • prefer built-in functions from pyspark.sql.functions over Python UDFs
  • avoid unnecessary wide transformations

Mental Model ​

Think of transformations as instructions that describe how data should flow through the system, not as the execution itself.


Summary ​

PySpark transformations:

  • are lazy
  • build a DAG
  • are classified as narrow or wide
  • strongly influence Spark job performance, mainly through the shuffles they trigger