This allows late events to still be included.
3. Reprocessing / Backfilling
When late data arrives:
- re-run historical windows
- recompute aggregates
- update downstream tables
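The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production backfill job: the 60-second tumbling window, the `(event_time, amount)` tuple layout, and the names `window_start` and `backfill` are all assumptions made for the example, and the `aggregates` dict stands in for a downstream table.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size


def window_start(event_time: int) -> int:
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def backfill(all_events, late_events, aggregates):
    """Recompute aggregates only for the windows touched by late events.

    all_events:  every stored event, late ones included, as (event_time, amount)
    late_events: the newly arrived late events
    aggregates:  dict of window_start -> total (stand-in for a downstream table)
    """
    affected = {window_start(t) for t, _ in late_events}
    recomputed = defaultdict(int)
    for t, amount in all_events:
        w = window_start(t)
        if w in affected:
            recomputed[w] += amount
    # Overwrite instead of incrementing, so running the backfill twice is safe.
    for w in affected:
        aggregates[w] = recomputed[w]
    return aggregates


on_time = [(5, 10), (70, 20)]
late = [(30, 5)]                 # belongs to the window starting at 0
table = {0: 10, 60: 20}          # totals computed before the late event arrived
backfill(on_time + late, late, table)
print(table)  # {0: 15, 60: 20}
```

Only the window starting at 0 is recomputed; the window starting at 60 is left untouched, which keeps the backfill cheap when late data affects a small part of history.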
4. Upserts in Storage
Instead of appending:
- update existing records
- correct previous aggregates
Supported by table formats such as:
- Delta Lake
- Hudi
- Iceberg
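To make the upsert idea concrete, here is a small sketch using SQLite from the Python standard library. It only illustrates the semantics (insert if missing, overwrite if present); Delta Lake, Hudi, and Iceberg expose merge operations through their own APIs, which are not shown here. The table `daily_revenue` and the helper `upsert_total` are hypothetical, and the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, total INTEGER NOT NULL)"
)


def upsert_total(day: str, total: int) -> None:
    """Insert the aggregate for a day, or overwrite it if the day already exists."""
    conn.execute(
        """
        INSERT INTO daily_revenue (day, total) VALUES (?, ?)
        ON CONFLICT(day) DO UPDATE SET total = excluded.total
        """,
        (day, total),
    )


upsert_total("2024-01-01", 100)  # first write, before late events arrived
upsert_total("2024-01-01", 130)  # corrected total after late events arrived
rows = conn.execute("SELECT day, total FROM daily_revenue").fetchall()
print(rows)  # [('2024-01-01', 130)]
```

An append-only write would have left two rows for the same day, and any report summing them would double count.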
5. Event Versioning
Track updates to events:
- event_id + version
- latest event wins
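A minimal sketch of "latest event wins", assuming each event is a dict with `event_id` and `version` fields and that a higher version number always means a newer update. The function name `latest_versions` and the `amount` field are made up for the example.

```python
def latest_versions(events):
    """Keep only the highest-version record for each event_id."""
    latest = {}
    for event in events:
        current = latest.get(event["event_id"])
        if current is None or event["version"] > current["version"]:
            latest[event["event_id"]] = event
    return latest


events = [
    {"event_id": "a", "version": 1, "amount": 10},
    {"event_id": "b", "version": 1, "amount": 7},
    {"event_id": "a", "version": 2, "amount": 12},  # correction to event "a"
    {"event_id": "a", "version": 1, "amount": 10},  # stale copy arriving late
]
result = latest_versions(events)
print(result["a"]["amount"], result["b"]["amount"])  # 12 7
```

Because the comparison is on version rather than arrival order, the stale copy of event "a" that shows up last does not overwrite the correction.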
Late Data in Batch vs Streaming
Batch Systems
- Easier handling
- Full dataset available
- Can reprocess everything
Streaming Systems
- Hard problem
- Must balance latency vs correctness
- Needs window + watermark logic
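The window + watermark logic can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not how any particular engine implements it: timestamps are non-negative integers, windows are 10-unit tumbling windows, the watermark is defined as the maximum event time seen so far minus an allowed lateness of 5, and an event is dropped once the watermark has passed the end of its window. The class `WindowedCounter` is hypothetical.

```python
from collections import defaultdict

WINDOW = 10           # assumed tumbling-window size
ALLOWED_LATENESS = 5  # assumed grace period for late events


class WindowedCounter:
    """Counts events per tumbling window and drops events that arrive too late."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.max_event_time = 0
        self.dropped = []

    @property
    def watermark(self):
        # Assumed definition: newest event time seen, minus the grace period.
        return self.max_event_time - ALLOWED_LATENESS

    def process(self, event_time):
        """Add the event to its window; return False if the window is closed."""
        self.max_event_time = max(self.max_event_time, event_time)
        start = event_time - event_time % WINDOW
        if start + WINDOW <= self.watermark:
            self.dropped.append(event_time)  # too late: window already closed
            return False
        self.counts[start] += 1
        return True


counter = WindowedCounter()
for t in [1, 4, 12, 3, 17, 8]:
    counter.process(t)

print(dict(counter.counts))  # {0: 3, 10: 2}
print(counter.dropped)       # [8]
```

Event 3 arrives out of order but is still counted, because the watermark (7) has not yet passed the end of window [0, 10). Event 8 arrives after the watermark has moved to 12, so its window is closed and it is dropped. A larger `ALLOWED_LATENESS` accepts more late events at the cost of holding results open longer, which is the latency vs correctness tradeoff.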
Common Real-World Issues
- Dashboard corrections after refresh
- Revenue mismatches
- Duplicate counts
- Missing events in reports
- Delayed fraud detection signals
How This Connects
- Processing: defines event handling
- Pipelines: manage late data flows
- Storage: supports updates and merges
- System Design: must decide correctness strategy
- Idempotency: ensures safe reprocessing
Goal of Understanding Late Data
You should be able to:
- Design correct streaming systems
- Handle delayed events safely
- Explain watermarking clearly
- Weigh batch vs streaming tradeoffs
- Fix incorrect metrics in production
Interview Insight
If you explain this well:
You immediately sound like someone who has worked on real streaming systems
Mental Model
Think of it as:
"The system is always partially blind; late data fills the gaps after the fact."
"In real-time systems, correctness is not instant; it is eventually consistent."