Skip to content

Data Modeling 🧱 (Foundation of Data Engineering) ​

Data modeling is the process of designing how data is structured, stored, and related inside a system.

🧠 Good data models = fast queries, scalable systems, clean pipelines
❌ Bad data models = slow systems, broken pipelines, high cost


🎯 Why Data Modeling Matters ​

Every data system depends on how data is modeled.

It impacts:

  • Query performance
  • Storage cost
  • Pipeline complexity
  • Scalability
  • Data quality

In interviews, this is often the first system design discussion point.


🧭 Types of Data Models ​

1. Conceptual Model ​

High-level business view.

Example:

  • Customer
  • Orders
  • Products

No technical details yet.


2. Logical Model ​

Defines relationships:

  • Customer β†’ Orders (1:N)
  • Orders β†’ Products (N:M)

Still database-agnostic.


3. Physical Model ​

Actual implementation:

  • Tables
  • Columns
  • Data types
  • Indexes

πŸ—οΈ Relational vs Non-Relational Models ​

Relational (SQL-based) ​

  • Tables with rows and columns
  • Fixed schema
  • Strong consistency

Used in:

  • Banking systems
  • ERP systems
  • Transaction systems

Non-Relational (NoSQL) ​

  • Flexible schema
  • Document / Key-value / Column / Graph

Used in:

  • Logging systems
  • Real-time apps
  • Large-scale distributed systems

πŸ”₯ Normalization vs Denormalization ​

Normalization ​

Splitting data to reduce duplication.

βœ” Pros:

  • Less redundancy
  • Better consistency

❌ Cons:

  • More joins β†’ slower queries

Denormalization ​

Combining data for faster reads.

βœ” Pros:

  • Faster queries
  • Fewer joins

❌ Cons:

  • Data duplication
  • Harder updates

βš™οΈ OLTP vs OLAP ​

FeatureOLTPOLAP
PurposeTransactionsAnalytics
DataCurrent stateHistorical
QueriesSimpleComplex
ExamplesPaymentsReporting

🧠 Star Schema vs Snowflake Schema ​

Star Schema ​

  • Fact table in center
  • Dimension tables around it

βœ” Fast queries ❌ More redundancy


Snowflake Schema ​

  • Normalized dimensions
  • More joins

βœ” Storage efficient ❌ Slower queries


🚨 Common Interview Questions ​

You should be able to answer:

  • How do you design a schema for an e-commerce system?
  • When do you use normalization vs denormalization?
  • Difference between OLTP and OLAP?
  • How would you model event data?

πŸ”— How This Connects Forward ​

  • SQL β†’ queries this data
  • PySpark β†’ processes this data at scale
  • Spark β†’ executes transformations
  • Pipelines β†’ move this data
  • System Design β†’ defines entire architecture

🎯 Goal of Data Modeling ​

By mastering this, you should be able to:

  • Design clean database schemas
  • Optimize query performance
  • Understand tradeoffs in system design
  • Handle real-world data complexity

β€œEvery scalable system starts with how you model your data β€” everything else is execution.”