Skip to content

Data Storage Systems πŸ’Ύ (How Data is Persisted) ​

Storage is the foundation of every data system.

If data modeling defines structure, then storage defines:

🧠 β€œWhere data lives and how it is accessed efficiently.”


🎯 Why Storage Matters ​

Storage systems directly impact:

  • Query speed
  • Cost of infrastructure
  • Scalability
  • Reliability
  • Data availability

In real systems, storage is not just a database β€” it is an ecosystem.


🧭 Types of Storage Systems ​


1. File Storage ​

Stores data as files in directories.

Examples:

  • HDFS
  • Amazon S3
  • Local filesystem

Characteristics: ​

  • Object/file-based
  • Cheap and scalable
  • Used for raw data storage

Use cases: ​

  • Data lakes
  • Logs
  • Backups
  • Raw ingestion layer

2. Block Storage ​

Data is stored in fixed-size blocks.

Examples:

  • EBS (AWS)
  • SSD disks

Characteristics: ​

  • High performance
  • Low latency
  • Used for databases

Use cases: ​

  • Databases
  • Virtual machines
  • Transaction systems

3. Object Storage ​

Stores data as objects with metadata.

Examples:

  • Amazon S3
  • Azure Blob Storage
  • Google Cloud Storage

Characteristics: ​

  • Infinite scalability
  • Low cost
  • Access via APIs

Use cases: ​

  • Data lakes
  • Big data pipelines
  • ML datasets

πŸ—οΈ Structured vs Semi-Structured vs Unstructured Storage ​

Structured Data ​

  • Tables
  • Fixed schema
  • SQL databases

Semi-Structured Data ​

  • JSON
  • Parquet
  • Avro

Unstructured Data ​

  • Images
  • Videos
  • Logs
  • Text files

βš™οΈ Database Storage Types ​


1. Row-Based Storage ​

Data stored row by row.

βœ” Good for:

  • Transaction systems
  • OLTP workloads

❌ Weak for analytics


2. Column-Based Storage ​

Data stored column by column.

βœ” Good for:

  • Analytics (OLAP)
  • Aggregations
  • Reporting systems

Examples:

  • Parquet
  • ORC

πŸ”₯ Data Lake vs Data Warehouse (Storage Perspective) ​


Data Lake ​

  • Stores raw data
  • Schema applied later (schema-on-read)
  • Uses object storage

βœ” Flexible
βœ” Cheap
❌ Harder governance


Data Warehouse ​

  • Structured, cleaned data
  • Schema applied before storage (schema-on-write)
  • Optimized for analytics

βœ” Fast queries
βœ” Clean data
❌ Expensive


⚑ Partitioning (Very Important) ​

Partitioning improves performance by splitting data:

Example:

  • By date
  • By region
  • By user ID

βœ” Benefits:

  • Faster queries
  • Less data scanned

❌ Wrong partitioning β†’ performance issues


🧠 Compression Formats ​

Used in big data systems:

  • Parquet (columnar + compressed)
  • ORC
  • Avro

βœ” Reduces storage cost βœ” Improves query speed


🚨 Common Storage Problems ​

  • Small file problem (S3 / HDFS)
  • Hot partitions
  • Data skew
  • Inefficient file formats
  • Excessive storage cost

πŸ”— How This Connects ​

  • Data Modeling β†’ defines structure
  • Storage β†’ persists structure
  • PySpark β†’ reads/writes storage
  • Pipelines β†’ move data between storage systems
  • System Design β†’ chooses storage architecture

🎯 Goal of Storage Knowledge ​

You should be able to:

  • Choose correct storage system
  • Explain tradeoffs (cost vs speed)
  • Understand file formats
  • Design scalable data lakes
  • Optimize data access patterns

β€œStorage is not just where data lives β€” it defines how fast your system thinks.”