PySpark DataFrame API ⚡

The DataFrame API is the core abstraction in PySpark used for distributed data processing.

A DataFrame represents structured data, organized into named columns, that is distributed across a cluster.


Why DataFrames matter

In real data engineering systems:

  • data is too large for a single machine
  • computation must be distributed
  • performance and optimization are critical

DataFrames provide:

  • distributed execution
  • optimized query planning
  • SQL-like transformations

What is a DataFrame

A DataFrame is:

  • distributed
  • immutable
  • lazily evaluated
  • optimized by the Spark engine

Creating a Spark Session

Before using DataFrames, we create a Spark session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DE Guide").getOrCreate()

Creating a DataFrame from Python data

data = [("Alice", 25), ("Bob", 30), ("Cathy", 28)]

df = spark.createDataFrame(data, ["name", "age"])

df.show()

Creating a DataFrame from a file

df = spark.read.csv(
    "data/users.csv",
    header=True,
    inferSchema=True
)

df.show()

Schema

A schema defines the structure of a DataFrame: its column names, their data types, and whether each column can hold nulls.

df.printSchema()

Core Operations

Select columns

df.select("name", "age").show()

Filter rows

df.filter(df.age > 25).show()

Add new column

from pyspark.sql.functions import col

df = df.withColumn("age_plus_10", col("age") + 10)

Rename column

df = df.withColumnRenamed("name", "full_name")

Drop column

df = df.drop("age")

Transformations vs Actions

Transformations (lazy execution)

  • select
  • filter
  • withColumn
  • groupBy

Actions (trigger execution)

  • show()
  • count()
  • collect()

Key Properties

Immutability

Every transformation returns a new DataFrame.


Lazy Evaluation

Spark does not execute transformations immediately. It only records them in a plan.

Execution happens only when an action is triggered.


Distributed Execution

Data is processed in parallel across cluster nodes.


Execution Flow

  1. The user writes transformations
  2. Spark builds a logical plan
  3. The Catalyst optimizer optimizes the logical plan and produces a physical plan
  4. The physical plan is turned into a DAG of stages and tasks
  5. Tasks run on the executors

DataFrame vs RDD

Feature | DataFrame | RDD
Performance | High | Low
Optimization | Yes | No
Ease of use | High | Low

Common Mistakes

  • using collect() on large data
  • using Python loops instead of transformations
  • ignoring schema design
  • triggering unnecessary shuffles

Best Practices

  • use built-in functions instead of loops
  • avoid collect() in production
  • define schema when possible
  • minimize shuffles

Mental Model

A DataFrame is not data itself.

It is a logical execution plan that Spark optimizes and runs in a distributed system.


Summary

The PySpark DataFrame API allows you to:

  • process large-scale distributed data
  • write SQL-like transformations
  • optimize execution automatically through the Spark engine