PySpark Actions ⚡

Actions are operations in PySpark that trigger execution of a Spark job.

Unlike transformations, which are lazy and only extend the query plan, actions are eager: calling one runs the plan immediately.

What are Actions?

Actions are operations that:

trigger execution of the DAG
return a result to the driver (or write output to external storage)
produce final output

Until an action is called, Spark only records transformations in a plan and processes no data (lazy execution model).

Why Actions matter

Transformations build the plan, but:

Actions execute the plan.

Without actions, no computation happens in Spark.

Common Actions

1. show()

Prints the first rows of a DataFrame to the console (20 rows by default).

df.show()

2. count()

Returns the number of rows in the DataFrame.

df.count()

3. collect()

Returns all rows to the driver as a list of Row objects.

df.collect()

⚠️ Dangerous for large datasets: the whole result must fit in driver memory, otherwise the driver can run out of memory.

4. take(n)

Returns the first n rows as a list.

df.take(5)

5. first()

Returns the first row.

df.first()

6. reduce()

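Aggregates the elements of an RDD using a binary function. reduce() is an RDD action, not a DataFrame method, and it is unrelated to Python's functools.reduce, which only works on local iterables.

spark.sparkContext.parallelize([1, 2, 3, 4]).reduce(lambda a, b: a + b)   # returns 10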

Action Execution Flow

When an action is called:

Spark builds a DAG from the recorded transformations
The DAG scheduler splits the DAG into stages at shuffle boundaries
Tasks are sent to executors
Executors process their partitions of the data
Results are returned to the driver

Transformations vs Actions

Type	Behavior
Transformations	Lazy (no execution)
Actions	Eager (triggers execution)

Execution Trigger Point

Execution starts ONLY when an action is called.

Example:

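# header=True makes the "age" column available by name; reading the header or inferring the schema may touch the file up front
df = spark.read.csv("data.csv", header=True, inferSchema=True)

df2 = df.filter(df.age > 25)   # transformation: only extends the plan

df2.show()   # action: the filter is executed here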

Memory Impact of Actions

Some actions are expensive:

collect()

brings all data to driver
can crash application if data is large

count()

usually requires scanning the whole dataset
expensive for large datasets, especially when called repeatedly

DAG Relation

Actions:

trigger DAG execution
finalize computation graph
initiate stage execution

Common Mistakes

using collect() on large datasets
calling multiple actions unnecessarily
ignoring data size before actions

Best Practices

use show() instead of collect() for inspection
avoid bringing full dataset to driver
use actions only when required
prefer distributed aggregations over collect()

Mental Model

Think of actions as:

The “start button” of Spark execution.

Without them, Spark only prepares instructions but never runs them.
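The "start button" idea can be modeled in a few lines of plain Python. This is a toy analogy only, not how Spark is implemented: the hypothetical LazyPlan class records steps when map() or filter() is called and runs them only when collect() is called.

```python
class LazyPlan:
    """Toy stand-in for a lazy pipeline: records steps, runs them on demand."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)

    def map(self, fn):
        # "Transformation": returns a new plan, runs nothing.
        return LazyPlan(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # "Transformation": returns a new plan, runs nothing.
        return LazyPlan(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # "Action": executes every recorded step in order.
        out = self._data
        for kind, fn in self._steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out


calls = []  # records each time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


plan = LazyPlan([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before = len(calls)   # nothing has run yet
result = plan.collect()     # the "start button"
calls_after = len(calls)
print(calls_before, result, calls_after)
```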

Summary

PySpark actions:

trigger execution
return results to driver
execute DAG
must be used carefully for large datasets
