Appearance
PySpark Transformations 🔄 ​
Transformations are operations that create a new DataFrame from an existing one.
They are lazy, meaning they are not executed immediately.
What are Transformations? ​
Transformations define how data should be processed, but Spark does not execute them right away.
Execution happens only when an action is triggered.
Types of Transformations ​
1. Narrow Transformations ​
Narrow transformations do not require data movement across partitions.
Examples:
- select
- filter
- withColumn
2. Wide Transformations ​
Wide transformations require data shuffling across partitions.
Examples:
- groupBy
- join
- distinct
- orderBy
Narrow Transformations ​
Narrow transformations:
- operate within a single partition
- do not trigger shuffle
- are faster
Example:
df2 = df.select("name", "age")
df3 = df.filter(df.age > 25)
Wide Transformations ​
Wide transformations:
- require data movement between executors
- trigger shuffle
- are expensive
Example:
df_grouped = df.groupBy("age").count()
Why Shuffle happens in Wide Transformations ​
Shuffle happens because Spark must:
- group similar keys together
- redistribute data across partitions
- ensure correct aggregation or join results
Transformation Characteristics ​
Lazy Evaluation ​
Transformations are not executed immediately.
Spark only builds a logical execution plan.
Immutability ​
Each transformation returns a new DataFrame.
Original data is never modified.
Example Transformation Flow ​
df1 = spark.read.csv("data.csv", header=True)
df2 = df1.filter(df1.age > 25)
df3 = df2.withColumn("age_plus_10", df2.age + 10)
No execution happens until an action is called.
DAG Impact ​
Each transformation:
- adds a node in DAG
- builds execution lineage
- helps Spark optimize execution
Common Transformations in PySpark ​
Column Operations ​
- select
- withColumn
- drop
- alias
Row Operations ​
- filter
- where
Aggregations ​
- groupBy
- agg
- count
- sum
Joins ​
- inner join
- left join
- right join
- outer join
Performance Considerations ​
- narrow transformations are cheap
- wide transformations are expensive
- minimize shuffles whenever possible
Common Mistakes ​
- chaining too many wide transformations
- unnecessary groupBy operations
- ignoring partitioning strategy
- using Python loops instead of transformations
Best Practices ​
- filter early in pipeline
- reduce data before shuffle
- use built-in Spark functions
- avoid unnecessary wide transformations
Mental Model ​
Think of transformations as:
Instructions that describe how data should flow through the system, not actual execution.
Summary ​
PySpark transformations:
- are lazy
- build a DAG
- are classified into narrow and wide
- determine Spark job performance heavily