Appearance
Data Warehousing 🏢 (Analytics at Scale)
A data warehouse is a system designed to store structured data for analytics and reporting.
🧠 If pipelines move data, warehouses make data useful for decision making.
🎯 Why Data Warehouses Exist
Operational databases (OLTP) are not suitable for analytics because:
- They are optimized for transactions
- They handle frequent writes and updates
- They are not optimized for large aggregations
A data warehouse solves this by:
- Optimizing for read-heavy workloads
- Storing historical data
- Supporting complex queries
🧭 OLTP vs OLAP
OLTP (Online Transaction Processing)
Used for day-to-day operations.
Examples:
- Banking systems
- E-commerce checkout
- User login systems
✔ Fast inserts/updates
✔ Real-time operations
❌ Not good for analytics
OLAP (Online Analytical Processing)
Used for analysis and reporting.
Examples:
- Business dashboards
- Sales analysis
- Customer insights
✔ Complex queries
✔ Large-scale aggregations
❌ Not optimized for writes
🏗️ Data Warehouse Architecture
Sources → ETL/ELT → Data Warehouse → BI Tools
Flow:
- Data is extracted from multiple systems
- Transformed and cleaned
- Loaded into warehouse
- Queried by analysts and dashboards
🧱 Core Components
1. Staging Layer
- Raw data landing zone
- Temporary storage
- Used for preprocessing
2. Data Storage Layer
- Cleaned and structured data
- Organized into schemas
3. Presentation Layer
- Final layer for analytics
- Used by BI tools and analysts
🧠 Schema Design Models
1. Star Schema ⭐
- Fact table in center
- Dimension tables around it
Example:
- Fact: Sales
- Dimensions: Customer, Product, Time
✔ Fast queries
✔ Easy to understand
❌ Redundant data
2. Snowflake Schema ❄️
- Normalized version of star schema
- More relationships between tables
✔ Less redundancy
❌ More joins → slower queries
⚙️ Key Concepts in Warehousing
1. Fact Tables
- Contain measurable data
- Example: sales amount, clicks, revenue
2. Dimension Tables
- Descriptive data
- Example: customer info, product details
3. Aggregations
Warehouses are optimized for:
- SUM()
- COUNT()
- AVG()
- GROUP BY
🔥 Modern Data Warehouses
Cloud-Based Systems
- Snowflake
- BigQuery
- Redshift
Characteristics:
- Scalable storage
- Separation of compute & storage
- Pay-as-you-use model
⚡ Data Lake vs Data Warehouse
| Feature | Data Lake | Data Warehouse |
|---|---|---|
| Data | Raw | Processed |
| Schema | On read | On write |
| Cost | Low | Higher |
| Use case | ML, raw data | Analytics, BI |
🚨 Common Warehouse Problems
- Slow query performance
- Poor schema design
- Data duplication
- High storage costs
- Stale data issues
🔗 How This Connects
- Data Modeling → defines structure
- Storage → holds data
- Processing → transforms data
- Pipelines → move data
- Warehouse → makes data usable for analytics
🎯 Goal of Data Warehousing Knowledge
You should be able to:
- Design analytics systems
- Choose schema types
- Explain OLAP vs OLTP
- Understand warehouse architecture
- Optimize query performance
“A data warehouse is where raw data becomes business intelligence.”