Appearance
Data Modeling π§± (Foundation of Data Engineering) β
Data modeling is the process of designing how data is structured, stored, and related inside a system.
π§ Good data models = fast queries, scalable systems, clean pipelines
β Bad data models = slow systems, broken pipelines, high cost
π― Why Data Modeling Matters β
Every data system depends on how data is modeled.
It impacts:
- Query performance
- Storage cost
- Pipeline complexity
- Scalability
- Data quality
In interviews, this is often the first system design discussion point.
π§ Types of Data Models β
1. Conceptual Model β
High-level business view.
Example:
- Customer
- Orders
- Products
No technical details yet.
2. Logical Model β
Defines relationships:
- Customer β Orders (1:N)
- Orders β Products (N:M)
Still database-agnostic.
3. Physical Model β
Actual implementation:
- Tables
- Columns
- Data types
- Indexes
ποΈ Relational vs Non-Relational Models β
Relational (SQL-based) β
- Tables with rows and columns
- Fixed schema
- Strong consistency
Used in:
- Banking systems
- ERP systems
- Transaction systems
Non-Relational (NoSQL) β
- Flexible schema
- Document / Key-value / Column / Graph
Used in:
- Logging systems
- Real-time apps
- Large-scale distributed systems
π₯ Normalization vs Denormalization β
Normalization β
Splitting data to reduce duplication.
β Pros:
- Less redundancy
- Better consistency
β Cons:
- More joins β slower queries
Denormalization β
Combining data for faster reads.
β Pros:
- Faster queries
- Fewer joins
β Cons:
- Data duplication
- Harder updates
βοΈ OLTP vs OLAP β
| Feature | OLTP | OLAP |
|---|---|---|
| Purpose | Transactions | Analytics |
| Data | Current state | Historical |
| Queries | Simple | Complex |
| Examples | Payments | Reporting |
π§ Star Schema vs Snowflake Schema β
Star Schema β
- Fact table in center
- Dimension tables around it
β Fast queries β More redundancy
Snowflake Schema β
- Normalized dimensions
- More joins
β Storage efficient β Slower queries
π¨ Common Interview Questions β
You should be able to answer:
- How do you design a schema for an e-commerce system?
- When do you use normalization vs denormalization?
- Difference between OLTP and OLAP?
- How would you model event data?
π How This Connects Forward β
- SQL β queries this data
- PySpark β processes this data at scale
- Spark β executes transformations
- Pipelines β move this data
- System Design β defines entire architecture
π― Goal of Data Modeling β
By mastering this, you should be able to:
- Design clean database schemas
- Optimize query performance
- Understand tradeoffs in system design
- Handle real-world data complexity
βEvery scalable system starts with how you model your data β everything else is execution.β