PySpark DataFrame API ⚡

The DataFrame API is the core abstraction in PySpark used for distributed data processing.

A DataFrame represents structured data distributed across a cluster.

Why DataFrames matter

In real data engineering systems:

data is too large for a single machine
computation must be distributed
performance and optimization are critical

DataFrames provide:

distributed execution
optimized query planning
SQL-like transformations

What is a DataFrame

A DataFrame is:

distributed
immutable
lazily evaluated
optimized by Spark engine

Creating a Spark Session

Before using DataFrames, we create a Spark session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DE Guide").getOrCreate()

Creating a DataFrame from Python data

data = [("Alice", 25), ("Bob", 30), ("Cathy", 28)]

df = spark.createDataFrame(data, ["name", "age"])

df.show()

Creating DataFrame from file

df = spark.read.csv(
    "data/users.csv",
    header=True,
    inferSchema=True
)

df.show()

Schema

Schema defines structure of DataFrame.

df.printSchema()

Core Operations

Select columns

df.select("name", "age").show()

Filter rows

df.filter(df.age > 25).show()

Add new column

from pyspark.sql.functions import col

df = df.withColumn("age_plus_10", col("age") + 10)

Rename column

df = df.withColumnRenamed("name", "full_name")

Drop column

df = df.drop("age")

Transformations vs Actions

Transformations (lazy execution)

select
filter
withColumn
groupBy

Actions (trigger execution)

show()
count()
collect()

Key Properties

Immutability

Every transformation returns a new DataFrame.

Lazy Evaluation

Spark does not execute immediately.

Execution happens only when an action is triggered.

Distributed Execution

Data is processed in parallel across cluster nodes.

Execution Flow

User writes transformations
Spark builds logical plan
Catalyst optimizer optimizes plan
DAG is created
Execution runs on executors

DataFrame vs RDD

Feature	DataFrame	RDD
Performance	High	Low
Optimization	Yes	No
Ease of use	High	Low

Common Mistakes

using collect() on large data
using Python loops instead of transformations
ignoring schema design
unnecessary shuffles

Best Practices

use built-in functions instead of loops
avoid collect() in production
define schema when possible
minimize shuffles

Mental Model

A DataFrame is not data itself.

It is a logical execution plan that Spark optimizes and runs in a distributed system.

Summary

PySpark DataFrame API allows you to:

process large-scale distributed data
write SQL-like transformations
optimize execution automatically through Spark engine

PySpark DataFrame API ⚡ ​

Why DataFrames matter ​

What is a DataFrame ​

Creating a Spark Session ​

Creating a DataFrame from Python data ​

Creating DataFrame from file ​

Schema ​

Core Operations ​

Select columns ​

Filter rows ​

Add new column ​

Rename column ​

Drop column ​

Transformations vs Actions ​

Transformations (lazy execution) ​

Actions (trigger execution) ​

Key Properties ​

Immutability ​

Lazy Evaluation ​

Distributed Execution ​

Execution Flow ​

DataFrame vs RDD ​

Common Mistakes ​

Best Practices ​

Mental Model ​

Summary ​