
This allows late events to still be included.


3. Reprocessing / Backfilling ​

When late data arrives:

  • re-run historical windows
  • recompute aggregates
  • update downstream tables
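The three steps above can be sketched in plain Python. This is a minimal illustration, not a production backfill job: the hourly tumbling window, the `event_time` and `amount` field names, and the `backfill` helper are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its (assumed) hourly window."""
    return ts.replace(minute=0, second=0, microsecond=0)


def backfill(all_events, late_events, aggregates):
    """Recompute only the windows touched by late events.

    all_events:  full event history, including the late arrivals
    late_events: events that arrived after their window was first computed
    aggregates:  dict of window_start -> total, updated in place
    """
    # 1. Find the historical windows that need to be re-run.
    affected = {window_start(e["event_time"]) for e in late_events}

    # 2. Recompute the aggregates for those windows from the full history.
    recomputed = defaultdict(float)
    for e in all_events:
        w = window_start(e["event_time"])
        if w in affected:
            recomputed[w] += e["amount"]

    # 3. Overwrite the stale values in the downstream table.
    for w in affected:
        aggregates[w] = recomputed[w]
    return affected
```

Recomputing from the full history, rather than adding the late amount to the old total, keeps the job safe to re-run: running it twice gives the same result.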

4. Upserts in Storage ​

Instead of appending:

  • update existing records
  • correct previous aggregates

Used in:

  • Delta Lake
  • Hudi
  • Iceberg
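These table formats expose merge or upsert operations of their own. The sketch below does not use any of their APIs; it only mimics the update-on-match, insert-otherwise behaviour with an in-memory dict. The `upsert` helper and the `window_start` key column are hypothetical names chosen for the example.

```python
def upsert(table: dict, rows: list, key: str = "window_start") -> dict:
    """Merge rows into table: update when the key exists, insert otherwise.

    table: dict mapping key value -> stored row (stands in for a real table)
    rows:  incoming rows, each a dict that contains the key column
    """
    stats = {"inserted": 0, "updated": 0}
    for row in rows:
        k = row[key]
        if k in table:
            # Matched: overwrite the stored columns with the corrected values.
            table[k] = {**table[k], **row}
            stats["updated"] += 1
        else:
            # Not matched: behave like a plain append.
            table[k] = dict(row)
            stats["inserted"] += 1
    return stats
```

Because each key holds exactly one row, replaying the same batch leaves the table unchanged instead of creating duplicates, which is what an append-only write cannot guarantee.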

5. Event Versioning ​

Track updates to events:

  • event_id + version
  • latest event wins
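A minimal sketch of "latest event wins", assuming each event is a dict with `event_id` and an integer `version` field (both field names, and the `latest_versions` helper, are illustrative):

```python
def latest_versions(events):
    """Return a dict of event_id -> the record with the highest version."""
    latest = {}
    for e in events:
        current = latest.get(e["event_id"])
        # Keep the incoming record only if it is newer than what we hold.
        # Comparing versions (not arrival order) makes this robust to
        # an old version arriving late.
        if current is None or e["version"] > current["version"]:
            latest[e["event_id"]] = e
    return latest
```

For example, if version 2 of an order arrives before version 1, the later-arriving version 1 is ignored and the stored record stays at version 2.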

🧠 Late Data in Batch vs Streaming ​


Batch Systems ​

✔ Easier handling
✔ Full dataset available
✔ Can reprocess everything


Streaming Systems ​

❌ Hard problem
❌ Must balance latency vs correctness
❌ Needs window + watermark logic


🚨 Common Real-World Issues ​

  • Dashboard numbers that change after a refresh
  • Revenue mismatches
  • Duplicate counts
  • Missing events in reports
  • Delayed fraud detection signals

🔗 How This Connects

  • Processing → defines event handling
  • Pipelines → manage late data flows
  • Storage → supports updates and merges
  • System Design → must decide correctness strategy
  • Idempotency → ensures safe reprocessing

🎯 Goal of Understanding Late Data ​

You should be able to:

  • Design correct streaming systems
  • Handle delayed events safely
  • Explain watermarking clearly
  • Decide between batch vs streaming tradeoffs
  • Fix incorrect metrics in production

🔥 Interview Insight

If you explain this well:

You immediately sound like someone who has worked on real streaming systems


💡 Mental Model

Think of it as:

“The system is always partially blind; late data fills the gaps after the fact.”


“In real-time systems, correctness is not instant; it is eventually consistent.”