
PySpark Actions ⚡

Actions are operations in PySpark that trigger execution of a Spark job.

Unlike transformations, which are lazy, actions are eager: calling one runs the job immediately.


What are Actions?

Actions are operations that:

  • trigger execution of the DAG
  • return results to the driver or write them to external storage
  • produce final output

Until an action is called, Spark only records the transformations in a plan and processes no data (lazy execution model).


Why Actions matter

Transformations build the plan, but:

Actions execute the plan.

Without actions, no computation happens in Spark.


Common Actions

1. show()

Prints the first rows of a DataFrame (20 by default).

df.show()

2. count()

Returns the number of rows.

df.count()

3. collect()

Returns all rows to the driver as a list of Row objects.

df.collect()

⚠️ Dangerous for large datasets: the entire result must fit in driver memory.


4. take(n)

Returns the first n rows as a list.

df.take(5)

5. first()

Returns the first row.

df.first()

6. reduce()

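Aggregates the elements of an RDD using a two-argument function. DataFrames have no reduce() method, so go through df.rdd. Python's functools.reduce is a different, driver-side function.

df.rdd.map(lambda row: row.age).reduce(lambda a, b: a + b)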

Action Execution Flow

When an action is called:

  1. Spark builds the DAG from the recorded transformations
  2. The DAG Scheduler splits the DAG into stages
  3. Each stage's tasks are sent to executors
  4. Executors process their partitions of the data
  5. Results are returned to the driver

Transformations vs Actions

Type              Behavior
Transformations   Lazy (no execution)
Actions           Eager (triggers execution)

Execution Trigger Point

Execution starts ONLY when an action is called.

Example:

df = spark.read.csv("data.csv", header=True, inferSchema=True)

df2 = df.filter(df.age > 25)   # transformation: nothing is filtered yet

df2.show()   # action: execution starts here

Memory Impact of Actions

Some actions are expensive:

collect()

  • brings all data to driver
  • can crash application if data is large

count()

  • usually triggers a full scan of the data
  • expensive for large datasets

DAG Relation

Actions:

  • trigger DAG execution
  • finalize computation graph
  • initiate stage execution

Common Mistakes

  • using collect() on large datasets
  • calling multiple actions on the same uncached DataFrame, which recomputes it from the source each time
  • ignoring data size before actions

Best Practices

  • use show() or take(n) instead of collect() for inspection
  • avoid bringing full dataset to driver
  • use actions only when required
  • prefer distributed aggregations over collect()

Mental Model

Think of actions as:

The “start button” of Spark execution.

Without them, Spark only prepares instructions but never runs them.


Summary

PySpark actions:

  • trigger execution
  • return results to driver
  • execute DAG
  • must be used carefully for large datasets