Skip to content

Late Arriving Data ⏳ (Real-World Streaming Problem)

Late arriving data refers to:

🧠 “Events that arrive after the expected processing window has already passed.”

This is one of the most common problems in real-time data systems.


🎯 Why This Happens

In distributed systems, delays are normal due to:

  • Network latency
  • Mobile app offline mode
  • Retry mechanisms
  • Kafka lag
  • System outages
  • Batch ingestion delays

So events may arrive:

  • seconds late
  • minutes late
  • hours late
  • even days late

🔄 Simple Example

Expected flow

Event Time: 10:00 AM → processed at 10:01 AM ✔


Late arriving case

Event Time: 10:00 AM → arrives at 10:10 AM ❗

System has already processed 10:00–10:05 window.


🧭 Event Time vs Processing Time


Event Time

When the event actually happened

Example:

  • user clicked at 10:00 AM

Processing Time

When the system receives/processes it

Example:

  • system processes at 10:10 AM

⚠️ The Core Problem

If you only use processing time:

  • metrics become inaccurate
  • dashboards shift unpredictably
  • aggregations are wrong

⚙️ How Systems Handle Late Data


1. Watermarking

Watermarks define how long the system waits for late data.

Example: Wait 10 minutes for late events

After that:

  • late events are dropped OR handled separately

Used in:

  • Spark Structured Streaming
  • Flink

2. Windowing with Grace Period

Time windows are extended: Window: 10:00–10:05 Grace: +5 minutes

This allows late events to still be included.


3. Reprocessing / Backfilling

When late data arrives:

  • re-run historical windows
  • recompute aggregates
  • update downstream tables

4. Upserts in Storage

Instead of appending:

  • update existing records
  • correct previous aggregates

Used in:

  • Delta Lake
  • Hudi
  • Iceberg

5. Event Versioning

Track updates to events:

  • event_id + version
  • latest event wins

🧠 Late Data in Batch vs Streaming


Batch Systems

✔ Easier handling
✔ Full dataset available
✔ Can reprocess everything


Streaming Systems

❌ Hard problem
❌ Must balance latency vs correctness
❌ Needs window + watermark logic


🚨 Common Real-World Issues

  • Dashboard corrections after refresh
  • Revenue mismatches
  • Duplicate counts
  • Missing events in reports
  • Delayed fraud detection signals

🔗 How This Connects

  • Processing → defines event handling
  • Pipelines → manage late data flows
  • Storage → supports updates and merges
  • System Design → must decide correctness strategy
  • Idempotency → ensures safe reprocessing

🎯 Goal of Understanding Late Data

You should be able to:

  • Design correct streaming systems
  • Handle delayed events safely
  • Explain watermarking clearly
  • Decide between batch vs streaming tradeoffs
  • Fix incorrect metrics in production

🔥 Interview Insight

If you explain this well:

You immediately sound like someone who has worked on real streaming systems


💡 Mental Model

Think of it as:

“The system is always partially blind — late data fills the gaps after the fact.”


“In real-time systems, correctness is not instant — it is eventually consistent.”