PySpark Transformations 🔄

Transformations are operations that create a new DataFrame from an existing one.

They are lazy, meaning they are not executed immediately.

What are Transformations?

Transformations define how data should be processed, but Spark does not execute them right away.

Execution happens only when an action is triggered.

Types of Transformations

1. Narrow Transformations

Narrow transformations do not require data movement across partitions.

Examples:

select
filter
withColumn

2. Wide Transformations

Wide transformations require data shuffling across partitions.

Examples:

groupBy
join
distinct
orderBy

Narrow Transformations

Narrow transformations:

operate within a single partition
do not trigger shuffle
are faster

Example:

df2 = df.select("name", "age")
df3 = df.filter(df.age > 25)

Wide Transformations

Wide transformations:

require data movement between executors
trigger shuffle
are expensive

Example:

df_grouped = df.groupBy("age").count()

Why Shuffle happens in Wide Transformations

Shuffle happens because Spark must:

group similar keys together
redistribute data across partitions
ensure correct aggregation or join results

Transformation Characteristics

Lazy Evaluation

Transformations are not executed immediately.

Spark only builds a logical execution plan.

Immutability

Each transformation returns a new DataFrame.

Original data is never modified.

Example Transformation Flow

df1 = spark.read.csv("data.csv", header=True)

df2 = df1.filter(df1.age > 25)

df3 = df2.withColumn("age_plus_10", df2.age + 10)

No execution happens until an action is called.

DAG Impact

Each transformation:

adds a node in DAG
builds execution lineage
helps Spark optimize execution

Common Transformations in PySpark

Column Operations

select
withColumn
drop
alias

Row Operations

filter
where

Aggregations

groupBy
agg
count
sum

Joins

inner join
left join
right join
outer join

Performance Considerations

narrow transformations are cheap
wide transformations are expensive
minimize shuffles whenever possible

Common Mistakes

chaining too many wide transformations
unnecessary groupBy operations
ignoring partitioning strategy
using Python loops instead of transformations

Best Practices

filter early in pipeline
reduce data before shuffle
use built-in Spark functions
avoid unnecessary wide transformations

Mental Model

Think of transformations as:

Instructions that describe how data should flow through the system, not actual execution.

Summary

PySpark transformations:

are lazy
build a DAG
are classified into narrow and wide
determine Spark job performance heavily

PySpark Transformations 🔄 ​

What are Transformations? ​

Types of Transformations ​

1. Narrow Transformations ​

2. Wide Transformations ​

Narrow Transformations ​

Wide Transformations ​

Why Shuffle happens in Wide Transformations ​

Transformation Characteristics ​

Lazy Evaluation ​

Immutability ​

Example Transformation Flow ​

DAG Impact ​

Common Transformations in PySpark ​

Column Operations ​

Row Operations ​

Aggregations ​

Joins ​

Performance Considerations ​

Common Mistakes ​

Best Practices ​

Mental Model ​

Summary ​