PySpark Actions ⚡
Actions are operations in PySpark that trigger execution of a Spark job.
Unlike transformations, which are lazy, actions are eager: calling one runs the job right away.
What are Actions?
Actions are operations that:
- trigger execution of the DAG
- send results back to the driver (or write them to external storage)
- produce final output
Until an action is called, Spark runs no job; it only records the transformations in a plan (lazy execution model).
Why Actions matter
Transformations build the plan, but:
Actions execute the plan.
Without actions, no computation happens in Spark.
Common Actions
1. show()
Prints the first rows of the DataFrame as a text table (20 rows by default).
df.show()
2. count()
Returns number of rows.
df.count()
3. collect()
Returns every row to the driver as a list of Row objects.
df.collect()
⚠️ Dangerous for large datasets: the whole result must fit in driver memory, so a large collect() can crash the driver.
4. take(n)
Returns the first n rows as a list.
df.take(5)
5. first()
Returns first row.
df.first()
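6. reduce()
Aggregates elements with a binary function. reduce() is an RDD action, not a DataFrame method and not functools.reduce, so it is reached through df.rdd.
df.rdd.map(lambda row: row.age).reduce(lambda a, b: a + b)  # assumes a numeric age column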
Action Execution Flow
When an action is called:
- Spark builds DAG from transformations
- DAG Scheduler creates stages
- Tasks are sent to executors
- Executors process data
- Results are returned to driver
Transformations vs Actions
| Type | Behavior |
|---|---|
| Transformations | Lazy (no execution) |
| Actions | Eager (triggers execution) |
Execution Trigger Point
Execution starts ONLY when an action is called.
Example:
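df = spark.read.csv("data.csv", header=True)  # header=True so the age column exists by name
df2 = df.filter(df.age > 25)                  # transformation: only extends the plan
df2.show()                                    # action: the filter job runs here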
Memory Impact of Actions
Some actions are expensive:
collect()
- brings all data to driver
- can crash application if data is large
count()
- runs a job over the whole dataset just to return one number
- expensive for large datasets, especially when repeated on an uncached DataFrame
DAG Relation
Actions:
- trigger DAG execution
- finalize computation graph
- initiate stage execution
Common Mistakes
- using collect() on large datasets
- calling several actions on the same uncached data, which recomputes its transformations each time
- ignoring data size before actions
Best Practices
- use show() instead of collect() for inspection
- avoid bringing full dataset to driver
- use actions only when required
- prefer distributed aggregations over collect()
Mental Model
Think of actions as:
The “start button” of Spark execution.
Without them, Spark only prepares instructions but never runs them.
Summary
PySpark actions:
- trigger execution
- return results to the driver (or write them out)
- execute DAG
- must be used carefully for large datasets