PySpark Interview Guide 🎯

This document covers commonly asked PySpark interview questions, real-world scenarios, and the key concepts needed for data engineering roles.

Core Concept Questions

1. What is PySpark?

PySpark is the Python API for Apache Spark, an engine for distributed data processing across clusters.

2. What is a DataFrame in PySpark?

A DataFrame is a distributed collection of rows organized into named columns, similar to a table. Queries on it are planned by Spark's Catalyst optimizer.

3. What is lazy evaluation?

Lazy evaluation means Spark does not run transformations when they are defined. It records them in a DAG and executes that DAG only when an action is called.

4. Difference between transformations and actions?

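Transformations (for example select, filter, join):

are lazy
extend the DAG
return a new DataFrame

Actions (for example count, collect, show):

trigger execution
return a result to the driver or write output to storage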
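
Questions 3 and 4 can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark code: the class `LazyDataset` and its methods are hypothetical stand-ins that only mimic how a transformation records a step while an action runs the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame (hypothetical, not a Spark class)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded (kind, function) steps, not yet executed

    def filter(self, predicate):
        # Transformation: returns a new object with a longer plan, does no work.
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def map(self, fn):
        # Transformation: also only recorded.
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def collect(self):
        # Action: only now is the recorded plan executed, row by row.
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out


calls = []  # records every time the predicate actually runs


def is_even(x):
    calls.append(x)
    return x % 2 == 0


ds = LazyDataset(range(6)).filter(is_even).map(lambda x: x * 10)
calls_before_action = len(calls)  # nothing has executed yet
result = ds.collect()             # the action triggers execution
```

In PySpark the same pattern appears as a chain such as `df.filter(...).select(...)` followed by `count()` or `collect()`: the chain returns immediately, and work starts at the action.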

5. What is a DAG in Spark?

A DAG (Directed Acyclic Graph) is Spark's record of the transformations to apply and the dependencies between them. Spark uses it to plan execution before any data is processed.

Execution-Based Questions

6. What happens when you call an action?

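Spark turns the recorded transformations into an execution plan
The plan is split into stages at shuffle boundaries
Each stage is divided into tasks, one per partition, and sent to executors
Tasks run in parallel
The result is returned to the driver or written to storage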
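
The stage-splitting step can be sketched as a toy model. This is plain Python under simplifying assumptions: the job is a linear chain, the set `WIDE_OPERATIONS` is a hand-picked illustration of shuffle-inducing operations, and the partition counts are made up. Spark derives all of this from the actual plan.

```python
# Assumed set of shuffle-inducing operations, for illustration only.
WIDE_OPERATIONS = {"groupBy", "join", "distinct", "orderBy", "repartition"}


def split_into_stages(operations):
    """Cut a linear chain of operations into stages at each wide operation."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op)
    return stages


def total_tasks(partitions_per_stage):
    """Each stage runs one task per partition."""
    return sum(partitions_per_stage)


job = ["read", "filter", "groupBy", "select", "orderBy", "collect"]
stages = split_into_stages(job)
tasks = total_tasks([4, 8, 8])  # hypothetical partition counts per stage
```

Narrow operations such as `filter` and `select` stay in the same stage as their input, which is why reducing the number of wide operations reduces the number of stages.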

7. What is a shuffle?

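A shuffle is the redistribution of data across partitions, and usually across executors, so that related records (for example rows with the same key) end up together. Operations like join and groupBy require it.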

8. Why is shuffle expensive?

Because it involves:

disk I/O
network I/O
serialization
data redistribution
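
To make the cost concrete, here is a small plain-Python model of a shuffle by key. It is a sketch, not Spark's implementation: `key % num_out` stands in for a hash partitioner, and a record counts as moved when its target partition differs from the one it started in.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_out):
    """Redistribute (key, value) records so equal keys share an output partition."""
    out = [[] for _ in range(num_out)]
    moved = 0
    for src_index, part in enumerate(partitions):
        for key, value in part:
            dst_index = key % num_out  # stand-in for a hash partitioner
            if dst_index != src_index:
                moved += 1  # on a cluster this record is serialized and sent elsewhere
            out[dst_index].append((key, value))
    return out, moved


def sum_by_key(partition):
    """Aggregate one partition locally, which is safe only after the shuffle."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


partitions = [
    [(1, 10), (2, 20), (3, 30)],
    [(1, 1), (2, 2), (4, 4)],
]
shuffled, moved = shuffle_by_key(partitions, num_out=2)
totals = [sum_by_key(p) for p in shuffled]
```

Here four of the six records change partition before the per-key sums can be computed, which is the data movement that makes groupBy and join expensive.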

Performance Questions

9. How do you optimize Spark jobs?

reduce shuffle
use broadcast joins
optimize partitioning
cache wisely
filter early
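
The "cache wisely" point is easiest to see by counting recomputations. The sketch below is a plain-Python toy, and the class `Lineage` is hypothetical: it recomputes its data on every action unless marked as cached, which mirrors how `DataFrame.cache()` only marks the data and materializes it on the first action.

```python
class Lineage:
    """Toy dataset that recomputes from source on every action unless cached."""

    def __init__(self, source, transform):
        self._source = source
        self._transform = transform
        self._use_cache = False
        self._cached = None
        self.computations = 0  # how many times the transform chain actually ran

    def cache(self):
        self._use_cache = True  # only marks the dataset, computes nothing yet
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        rows = [self._transform(x) for x in self._source]
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


uncached = Lineage(range(5), lambda x: x * x)
uncached.count()
uncached.total()

cached = Lineage(range(5), lambda x: x * x).cache()
cached.count()
cached_total = cached.total()
```

Caching pays off only when the same data feeds more than one action. For data used once, it just consumes executor memory.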

10. What causes data skew?

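Data skew occurs when one partition holds significantly more data than the others. The usual cause is an uneven key distribution, for example a few very frequent keys or many null values in a join or groupBy column.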

11. How do you fix data skew?

salting the hot keys
broadcast join when one side is small
repartitioning on a more evenly distributed column
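
Salting can be sketched in plain Python. The example below is a toy model under stated assumptions: `NUM_PARTITIONS` and `NUM_SALTS` are arbitrary, modulo arithmetic stands in for hashing, and the salt is assigned round-robin so the result is deterministic (a random salt is more typical in practice).

```python
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 4  # hypothetical value; chosen to match the degree of skew


def partition_sizes(partition_ids):
    """Count how many rows land in each partition."""
    counts = Counter(partition_ids)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


# 90 rows share the hot key 0; keys 1..10 appear once each.
keys = [0] * 90 + list(range(1, 11))

# Plain partitioning: every row with key 0 lands in the same partition.
plain = partition_sizes(k % NUM_PARTITIONS for k in keys)

# Salting: pair each key with a salt so the hot key spreads over several sub-keys.
salted_keys = [(k, i % NUM_SALTS) for i, k in enumerate(keys)]
salted = partition_sizes((k + s) % NUM_PARTITIONS for k, s in salted_keys)

# Two-phase count: aggregate per (key, salt), then drop the salt and combine.
partial = Counter(salted_keys)
final = Counter()
for (k, _salt), n in partial.items():
    final[k] += n
```

The largest partition shrinks considerably while the final per-key counts stay the same. The price is a second aggregation step to merge the salted partial results.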

Joins and Partitioning

12. What is broadcast join?

A join where the small dataset is copied to every executor, so the large dataset can be joined in place without being shuffled.
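
The mechanism can be modeled in a few lines of plain Python. This is a sketch, not Spark's implementation: the function `broadcast_hash_join` and the sample data are hypothetical, and a dictionary stands in for the broadcast copy of the small table.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Inner-join each partition of the large side against a full copy of the small side."""
    lookup = dict(small_table)  # built once, then "broadcast" to every partition
    joined = []
    for partition in large_partitions:  # each partition is processed where it already lives
        joined.append([
            (key, value, lookup[key])
            for key, value in partition
            if key in lookup  # inner join: drop rows with no match
        ])
    return joined


# Hypothetical sample data: two partitions of orders and a small lookup table.
orders = [
    [("US", 100), ("DE", 50)],
    [("US", 70), ("FR", 20), ("XX", 5)],
]
countries = [("US", "United States"), ("DE", "Germany"), ("FR", "France")]
result = broadcast_hash_join(orders, countries)
```

In PySpark the equivalent hint is `large_df.join(broadcast(small_df), "key")` with `broadcast` imported from `pyspark.sql.functions`. Spark can also choose this strategy on its own for tables below `spark.sql.autoBroadcastJoinThreshold`.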

13. Difference between repartition and coalesce?

Repartition:

increases or decreases the number of partitions
always causes a full shuffle

Coalesce:

only reduces the number of partitions
merges existing partitions, avoiding a full shuffle
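
The trade-off shows up clearly in a toy model. The two functions below are plain-Python stand-ins, not Spark code: this `coalesce` simply concatenates neighbouring partitions, and this `repartition` redistributes rows round-robin. Spark's own placement logic is more involved.

```python
def coalesce(partitions, n):
    """Merge existing partitions into n groups without splitting any of them."""
    n = min(n, len(partitions))  # coalesce cannot increase the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)  # neighbours are concatenated whole
    return merged


def repartition(partitions, n):
    """Redistribute every row round-robin: a full shuffle with evenly sized output."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
coalesced = coalesce(parts, 2)
repartitioned = repartition(parts, 2)
coalesced_sizes = [len(p) for p in coalesced]
repartitioned_sizes = [len(p) for p in repartitioned]
```

Coalesce is cheaper but keeps whatever imbalance the input had. Repartition moves every row but evens out the sizes, so it is the better choice when the input is already skewed.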

Real-World Scenario Questions

14. A Spark job is slow — how do you debug it?

Steps:

check Spark UI
identify shuffle stages
check skewed partitions
analyze executor memory
optimize joins and partitions
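
One way to make "check skewed partitions" concrete is to compare the slowest task in a stage with the median task. The helper below is a hypothetical sketch in plain Python, and the durations are made-up numbers of the kind shown per stage in the Spark UI.

```python
from statistics import median


def skew_ratio(task_durations):
    """Ratio of the slowest task to the median task; values far above 1 suggest skew."""
    return max(task_durations) / median(task_durations)


# Hypothetical per-task durations in seconds for two stages.
balanced_stage = [12, 11, 13, 12, 12]
skewed_stage = [12, 11, 13, 12, 240]

balanced_ratio = skew_ratio(balanced_stage)
skewed_ratio = skew_ratio(skewed_stage)
```

A stage where one task takes many times longer than the median is a strong hint that a single partition holds most of the data.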

15. How do you handle large joins?

use a broadcast join if one side is small enough
partition both datasets on the join key
select only the columns you need
filter early

Architecture Questions

16. Driver vs Executor?

Driver:

builds DAG
schedules jobs

Executor:

runs tasks
stores cached and intermediate data
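
The division of labour can be sketched with a thread pool. This is a plain-Python analogy, not Spark code: the function `run_job` is hypothetical, the calling code plays the driver, and worker threads stand in for executors.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(partitions, task, combine, num_executors=2):
    """Driver side: hand out one task per partition, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # Executor side: each worker runs the task on one partition at a time.
        partial_results = list(pool.map(task, partitions))
    return combine(partial_results)


partitions = [[1, 2, 3], [4, 5], [6]]
total = run_job(partitions, task=sum, combine=sum)
```

Only the small partial results travel back to the driver here. Pulling whole partitions back, as `collect()` does, is what overloads the driver on large datasets.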

17. Why is Spark faster than MapReduce?

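in-memory computation instead of writing intermediate results to disk between steps
DAG execution model that plans the whole job rather than separate map and reduce passes
lazy evaluation, which lets Spark see the full plan before running it
optimized execution engine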

Common Mistakes in Interviews

confusing transformations with actions
not understanding shuffle
ignoring partitioning impact
overusing collect()
not explaining execution flow

Strong Interview Answer Pattern

When answering:

Define concept
Explain execution flow
Mention performance impact
Give real-world example

Mental Model

Think of PySpark as:

A distributed execution engine where your code is converted into an optimized DAG and executed across a cluster.

Summary

This section covers:

core PySpark concepts
execution model
performance tuning basics
real interview scenarios
