
PySpark Interview Guide 🎯 ​

This document covers commonly asked PySpark interview questions, real-world scenarios, and the concepts you need to explain clearly for data engineering roles.


Core Concept Questions ​

1. What is PySpark? ​

PySpark is the Python API for Apache Spark used for distributed data processing across clusters.


2. What is a DataFrame in PySpark? ​

A DataFrame is a distributed collection of data organized into named columns. Queries against it are planned and optimized by Spark's Catalyst optimizer.


3. What is lazy evaluation? ​

Lazy evaluation means Spark does not execute transformations immediately. It records them in a plan (a DAG) and runs the work only when an action is called.


4. Difference between transformations and actions? ​

Transformations:

  • lazy
  • build DAG
  • return new DataFrame

Actions:

  • trigger execution
  • return a result to the driver or write output to storage

5. What is a DAG in Spark? ​

A DAG (Directed Acyclic Graph) represents the chain of transformations Spark has to run. Spark builds it before execution and uses it to plan stages and tasks.


Execution-Based Questions ​

6. What happens when you call an action? ​

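  1. Spark finalizes the DAG built from the recorded transformations
  2. The DAG is split into stages at shuffle boundaries
  3. Each stage is divided into tasks, one per partition, which are sent to executors
  4. Tasks run in parallel across the executors
  5. The result is returned to the driver or written to storage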

7. What is a shuffle? ​

A shuffle is the redistribution of data across partitions and executors, required by wide operations such as join or groupBy.


8. Why is shuffle expensive? ​

Because it involves:

  • disk I/O
  • network I/O
  • serialization
  • data redistribution

Performance Questions ​

9. How do you optimize Spark jobs? ​

  • reduce shuffle
  • use broadcast joins
  • optimize partitioning
  • cache wisely
  • filter early

10. What causes data skew? ​

Data skew occurs when one partition holds significantly more data than the others, typically because a few keys account for most of the rows.


11. How do you fix data skew? ​

  • salting technique
  • broadcast join
  • repartitioning

Joins and Partitioning ​

12. What is broadcast join? ​

A join in which the smaller dataset is copied to every executor, so the larger dataset can be joined in place without being shuffled.


13. Difference between repartition and coalesce? ​

Repartition:

  • can increase or decrease the number of partitions
  • always causes a full shuffle

Coalesce:

  • reduces partitions
  • avoids full shuffle

Real-World Scenario Questions ​

14. A Spark job is slow: how do you debug it?

Steps:

  • check Spark UI
  • identify shuffle stages
  • check skewed partitions
  • analyze executor memory
  • optimize joins and partitions

15. How do you handle large joins? ​

  • use broadcast join if possible
  • partition both datasets
  • avoid unnecessary columns
  • filter early

Architecture Questions ​

16. Driver vs Executor? ​

Driver:

  • builds DAG
  • schedules jobs

Executor:

  • runs tasks
  • stores intermediate data

17. Why is Spark faster than MapReduce?

  • in-memory computation
  • DAG execution model
  • lazy evaluation
  • optimized execution engine

Common Mistakes in Interviews ​

  • confusing transformations with actions
  • not understanding shuffle
  • ignoring partitioning impact
  • overusing collect()
  • not explaining execution flow

Strong Interview Answer Pattern ​

When answering:

  1. Define concept
  2. Explain execution flow
  3. Mention performance impact
  4. Give real-world example

Mental Model ​

Think of PySpark as:

A distributed execution engine where your code is converted into an optimized DAG and executed across a cluster.


Summary ​

This section covers:

  • core PySpark concepts
  • execution model
  • performance tuning basics
  • real interview scenarios