Late Arriving Data ⏳ (Real-World Streaming Problem)
Late arriving data refers to:
🧠 “Events that arrive after the expected processing window has already passed.”
This is one of the most common problems in real-time data systems.
🎯 Why This Happens
In distributed systems, delays are normal due to:
- Network latency
- Mobile app offline mode
- Retry mechanisms
- Consumer lag (for example, in Kafka)
- System outages
- Batch ingestion delays
So events may arrive:
- seconds late
- minutes late
- hours late
- even days late
🔄 Simple Example
Expected flow
Event Time: 10:00 AM → processed at 10:01 AM ✔
Late arriving case
Event Time: 10:00 AM → arrives at 10:10 AM ❗
The system has already processed the 10:00–10:05 window, so this event is late.
🧭 Event Time vs Processing Time
Event Time
When the event actually happened
Example:
- user clicked at 10:00 AM
Processing Time
When the system receives/processes it
Example:
- system processes at 10:10 AM
⚠️ The Core Problem
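If you aggregate only by processing time, late events are counted in the window in which they arrive, not the one in which they happened: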
- metrics become inaccurate
- dashboards shift unpredictably
- aggregations are wrong
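To make the difference concrete, here is a minimal sketch that counts the same three events twice: once by event time and once by processing time. The 5-minute window size, the `window_start` helper, and the sample timestamps are assumptions made for illustration, reusing the 10:00 AM example above.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # assumed tumbling window size
EPOCH = datetime(1970, 1, 1)

def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute window."""
    return ts - (ts - EPOCH) % WINDOW

# Each event is (event_time, processing_time). Illustrative data only.
events = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 1)),   # on time
    (datetime(2024, 1, 1, 10, 2), datetime(2024, 1, 1, 10, 3)),   # on time
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10)),  # 10 minutes late
]

# Same events, two different ways of assigning them to windows.
by_event_time = Counter(window_start(e) for e, _ in events)
by_processing_time = Counter(window_start(p) for _, p in events)

for start, count in sorted(by_event_time.items()):
    print("event time     ", start.time(), count)
for start, count in sorted(by_processing_time.items()):
    print("processing time", start.time(), count)
```

By event time, all three clicks land in the 10:00 window. By processing time, the late click is counted in the 10:10 window, so the 10:00 total is too low and the 10:10 total is too high.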
⚙️ How Systems Handle Late Data
1. Watermarking
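A watermark is the system's estimate of how far event time has progressed. It is usually the latest event time seen so far minus an allowed delay, and it defines how long the system waits for late data.
Example: Wait 10 minutes for late events
After the watermark has passed:
- late events are dropped OR handled separately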
Used in:
- Spark Structured Streaming
- Flink
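A rough sketch of the idea in plain Python, not the Spark or Flink API. The `WatermarkTracker` class and its method names are hypothetical, and the per-event check is a simplification: real engines usually compare the watermark against window boundaries rather than against each individual event.

```python
from datetime import datetime, timedelta

class WatermarkTracker:
    """Hypothetical helper: watermark = max event time seen - allowed delay."""

    def __init__(self, allowed_delay: timedelta):
        self.allowed_delay = allowed_delay
        self.max_event_time = None

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_delay

    def observe(self, event_time: datetime) -> bool:
        """Return True if the event is accepted, False if it is behind the watermark."""
        wm = self.watermark
        if wm is not None and event_time < wm:
            return False  # too late: drop it or route it to a side path
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time  # watermark only moves forward
        return True

def at(hour: int, minute: int) -> datetime:
    return datetime(2024, 1, 1, hour, minute)

tracker = WatermarkTracker(allowed_delay=timedelta(minutes=10))
results = [
    tracker.observe(at(10, 0)),   # first event, accepted
    tracker.observe(at(10, 15)),  # advances the watermark to 10:05
    tracker.observe(at(10, 8)),   # out of order but ahead of the watermark
    tracker.observe(at(10, 2)),   # behind the watermark, rejected
]
print(results, tracker.watermark)
```

A larger allowed delay accepts more late events but holds results back longer. That is the latency vs correctness tradeoff in one parameter.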
2. Windowing with Grace Period
Time windows are extended:
Window: 10:00–10:05
Grace: +5 minutes
This allows late events to still be included.
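The window and grace values above can be sketched like this. The `GraceWindowCounter` class is a hypothetical in-memory stand-in, and it assumes a window closes once the highest event time seen reaches window end plus grace.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # window size from the example
GRACE = timedelta(minutes=5)    # grace period from the example
EPOCH = datetime(1970, 1, 1)

def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute window."""
    return ts - (ts - EPOCH) % WINDOW

class GraceWindowCounter:
    """Hypothetical counter: a window accepts events until end + grace."""

    def __init__(self):
        self.counts = {}
        self.stream_time = None  # highest event time seen so far

    def add(self, event_time: datetime) -> bool:
        if self.stream_time is None or event_time > self.stream_time:
            self.stream_time = event_time
        start = window_start(event_time)
        closes_at = start + WINDOW + GRACE
        if self.stream_time >= closes_at:
            return False  # window already closed, event is too late
        self.counts[start] = self.counts.get(start, 0) + 1
        return True

def at(hour: int, minute: int) -> datetime:
    return datetime(2024, 1, 1, hour, minute)

counter = GraceWindowCounter()
accepted = [
    counter.add(at(10, 1)),   # on time for 10:00-10:05
    counter.add(at(10, 7)),   # stream time moves to 10:07
    counter.add(at(10, 3)),   # late, but 10:00-10:05 stays open until 10:10
    counter.add(at(10, 12)),  # stream time moves past 10:10
    counter.add(at(10, 4)),   # too late: grace for 10:00-10:05 has expired
]
print(accepted, counter.counts[at(10, 0)])
```

The third event is late but still counted. The fifth arrives after the grace period has ended and is rejected.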
3. Reprocessing / Backfilling
When late data arrives:
- re-run historical windows
- recompute aggregates
- update downstream tables
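One way to picture these steps, as a sketch only: keep the raw events, and when late ones show up, recompute just the windows they touch. The `backfill` function, the list-based raw store, and the amounts in cents are all assumptions for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # assumed window size
EPOCH = datetime(1970, 1, 1)

def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute window."""
    return ts - (ts - EPOCH) % WINDOW

def backfill(raw_events: list, aggregates: dict, late_events: list) -> set:
    """Store late events, then recompute only the windows they affect."""
    raw_events.extend(late_events)
    affected = {window_start(ts) for ts, _ in late_events}
    for start in affected:
        # Recompute from raw data instead of incrementing the old total.
        aggregates[start] = sum(
            amount for ts, amount in raw_events if window_start(ts) == start
        )
    return affected

def at(hour: int, minute: int) -> datetime:
    return datetime(2024, 1, 1, hour, minute)

# (event_time, amount_in_cents), illustrative data only.
raw_events = [(at(10, 1), 2000), (at(10, 6), 500)]
aggregates = {at(10, 0): 2000, at(10, 5): 500}

affected = backfill(raw_events, aggregates, [(at(10, 3), 750)])
print(sorted(affected), aggregates)
```

Only the 10:00 window is recomputed. Note that this sketch would double count if the same late event were passed in twice, so a real pipeline would also need deduplication, for example by event ID.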
4. Upserts in Storage
Instead of appending:
- update existing records
- correct previous aggregates
Used in:
- Delta Lake
- Hudi
- Iceberg
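The difference between appending and upserting can be sketched with plain Python containers. This is not the Delta Lake, Hudi, or Iceberg API; those formats typically expose the idea as a merge or upsert operation. The `append_row` and `upsert_row` helpers and the row shape are hypothetical.

```python
def append_row(table: list, row: dict) -> None:
    """Append-only write: a corrected result becomes a second row."""
    table.append(row)

def upsert_row(table: dict, row: dict, key: str = "window") -> None:
    """Upsert: insert the row, or overwrite the existing row with the same key."""
    table[row[key]] = row

first_result = {"window": "10:00", "clicks": 2}
corrected_result = {"window": "10:00", "clicks": 3}  # after a late event arrived

appended = []
append_row(appended, first_result)
append_row(appended, corrected_result)

upserted = {}
upsert_row(upserted, first_result)
upsert_row(upserted, corrected_result)

# Summing the append-only table double counts the 10:00 window.
print(sum(r["clicks"] for r in appended))           # 5
print(sum(r["clicks"] for r in upserted.values()))  # 3
```

With appends, every downstream reader has to know which row is the latest. With upserts, the table itself holds the corrected value.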
5. Event Versioning
Track updates to events:
- event_id + version
- latest event wins
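A small sketch of "latest event wins", assuming that "latest" means the highest version number for a given event_id rather than the most recent arrival. The `apply_versioned` function and the event fields are hypothetical.

```python
def apply_versioned(store: dict, event: dict) -> bool:
    """Keep an event only if its version is newer than the stored one."""
    current = store.get(event["event_id"])
    if current is not None and current["version"] >= event["version"]:
        return False  # stale or duplicate version, ignore it
    store[event["event_id"]] = event
    return True

store = {}
applied = [
    apply_versioned(store, {"event_id": "order-1", "version": 1, "amount": 100}),
    apply_versioned(store, {"event_id": "order-1", "version": 2, "amount": 120}),
    # Version 1 is delivered again, late. It must not overwrite version 2.
    apply_versioned(store, {"event_id": "order-1", "version": 1, "amount": 100}),
]
print(applied, store["order-1"]["amount"])
```

Because stale and duplicate versions are ignored, replaying the same events leaves the store unchanged, which is what makes reprocessing safe.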
🧠 Late Data in Batch vs Streaming
Batch Systems
✔ Easier handling
✔ Full dataset available
✔ Can reprocess everything
Streaming Systems
❌ Harder: results are emitted before all the data has arrived
❌ Must balance latency vs correctness
❌ Needs window + watermark logic
🚨 Common Real-World Issues
- Dashboard numbers that change after a refresh
- Revenue mismatches
- Duplicate counts
- Missing events in reports
- Delayed fraud detection signals
🔗 How This Connects
- Processing → defines event handling
- Pipelines → manage late data flows
- Storage → supports updates and merges
- System Design → must decide correctness strategy
- Idempotency → ensures safe reprocessing
🎯 Goal of Understanding Late Data
You should be able to:
- Design correct streaming systems
- Handle delayed events safely
- Explain watermarking clearly
- Weigh batch vs streaming tradeoffs
- Fix incorrect metrics in production
🔥 Interview Insight
If you explain this well:
You immediately sound like someone who has worked on real streaming systems
💡 Mental Model
Think of it as:
“The system is always partially blind — late data fills the gaps after the fact.”
“In real-time systems, correctness is not instant — it is eventually consistent.”