Appearance
PySpark DataFrame API ⚡
The DataFrame API is the core abstraction in PySpark used for distributed data processing.
A DataFrame represents structured data distributed across a cluster.
Why DataFrames matter
In real data engineering systems:
- data is too large for a single machine
- computation must be distributed
- performance and optimization are critical
DataFrames provide:
- distributed execution
- optimized query planning
- SQL-like transformations
What is a DataFrame
A DataFrame is:
- distributed
- immutable
- lazily evaluated
- optimized by Spark engine
Creating a Spark Session
Before using DataFrames, we create a Spark session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DE Guide").getOrCreate()
Creating a DataFrame from Python data
data = [("Alice", 25), ("Bob", 30), ("Cathy", 28)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
Creating DataFrame from file
df = spark.read.csv(
"data/users.csv",
header=True,
inferSchema=True
)
df.show()
Schema
Schema defines structure of DataFrame.
df.printSchema()
Core Operations
Select columns
df.select("name", "age").show()
Filter rows
df.filter(df.age > 25).show()
Add new column
from pyspark.sql.functions import col
df = df.withColumn("age_plus_10", col("age") + 10)
Rename column
df = df.withColumnRenamed("name", "full_name")
Drop column
df = df.drop("age")
Transformations vs Actions
Transformations (lazy execution)
- select
- filter
- withColumn
- groupBy
Actions (trigger execution)
- show()
- count()
- collect()
Key Properties
Immutability
Every transformation returns a new DataFrame.
Lazy Evaluation
Spark does not execute immediately.
Execution happens only when an action is triggered.
Distributed Execution
Data is processed in parallel across cluster nodes.
Execution Flow
- User writes transformations
- Spark builds logical plan
- Catalyst optimizer optimizes plan
- DAG is created
- Execution runs on executors
DataFrame vs RDD
| Feature | DataFrame | RDD |
|---|---|---|
| Performance | High | Low |
| Optimization | Yes | No |
| Ease of use | High | Low |
Common Mistakes
- using collect() on large data
- using Python loops instead of transformations
- ignoring schema design
- unnecessary shuffles
Best Practices
- use built-in functions instead of loops
- avoid collect() in production
- define schema when possible
- minimize shuffles
Mental Model
A DataFrame is not data itself.
It is a logical execution plan that Spark optimizes and runs in a distributed system.
Summary
PySpark DataFrame API allows you to:
- process large-scale distributed data
- write SQL-like transformations
- optimize execution automatically through Spark engine