Designing a Resilient and Scalable Data Ingestion System
Introduction
In real-world environments, data ingestion systems rarely deal with perfect or controlled inputs. Data can arrive in multiple formats, from different sources, with errors, duplicates, and unpredictable volumes. This document describes a robust architecture for building a data ingestion system designed for these conditions, prioritizing resilience, scalability, and consistency.
The approach is intended for modern event-driven and microservice-based architectures, but the principles apply to other styles as well.
System Goals
The primary goals of the system are:
- Ingest data from any source
- Process data asynchronously, without blocking clients or upstream systems
- Tolerate partial failures without data loss
- Guarantee consistency through idempotency
- Provide full system observability
Data Sources
The system must be source-agnostic. Common examples include:
- HTTP requests with direct payloads
- CSV files uploaded via HTTP or FTP
- Files published to object storage (S3, GCS, Azure Blob)
Regardless of the origin, all data flows through the same logical pipeline.
Core Design Principles
Asynchronous processing
Ingestion must never block the client or upstream system. The system accepts the input, persists or references it, and continues processing in the background.
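As a concrete illustration, here is a minimal sketch of this principle, assuming an Express-based HTTP entry point and a hypothetical enqueueForIngestion helper (neither is prescribed by the design):

```typescript
import { randomUUID } from "node:crypto";
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical helper: persists or references the raw payload (e.g. writes it to
// object storage or a queue) and hands it to the background pipeline.
async function enqueueForIngestion(payload: unknown): Promise<string> {
  const trackingId = randomUUID();
  // ...persist the payload reference and publish an "ingestion requested" event...
  return trackingId;
}

app.post("/ingest", async (req, res) => {
  // Accept and reference the input, then answer immediately with 202 Accepted:
  // the client is never blocked while processing continues in the background.
  const trackingId = await enqueueForIngestion(req.body);
  res.status(202).json({ accepted: true, trackingId });
});

app.listen(3000);
```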
Event-driven architecture
Components communicate via events. This decouples responsibilities and allows each part of the system to scale independently.
Failure-first mindset
The design assumes:
- Data can be malformed
- Services can crash
- Message brokers can be temporarily unavailable
The system must continue operating under these conditions.
High-level Flow
+-------------+        +------------------+        +------------------+
|   Sources   | -----> |  Ingest Service  | -----> |  Message Broker  |
| (HTTP/CSV)  |        | (stream + clean) |        |  (Kafka / etc.)  |
+-------------+        +------------------+        +------------------+
                                                             |
                                                             v
                                                    +------------------+
                                                    |  Domain Services |
                                                    |  (Invoices, ...) |
                                                    +------------------+

- The Ingest Service receives or detects new data
- Input is processed using streams
- A sanitize and normalization phase is applied
- Valid data is sent in batched events
- Domain services process and persist data
- Errors are isolated and handled explicitly
Stream-based Processing
+-----------+    chunk   +-----------+    chunk   +-----------+
|   File    | ---------> |  Parser   | ---------> | Processor |
|   (CSV)   |            | (stream)  |            |  (logic)  |
+-----------+            +-----------+            +-----------+

One of the most critical aspects is never assuming input size.
The problem
Loading entire files into memory can cause:
- Out-of-memory errors
- Process blocking
- Service crashes
The solution
- Incremental reading (chunk by chunk)
- Line-by-line processing
- Backpressure control
This approach enables stable processing of anything from hundreds to millions of records.
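A minimal sketch of this idea in Node.js, assuming a CSV file on disk and a caller-supplied handleRow function; a production pipeline would use a proper CSV parser that handles quoting and escaping:

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Reads the file incrementally and processes it line by line. `for await` only
// pulls the next chunk once the current row has been handled, so backpressure is
// applied naturally and memory stays flat regardless of file size.
async function processCsv(
  path: string,
  handleRow: (fields: string[]) => Promise<void>,
): Promise<void> {
  const lines = createInterface({
    input: createReadStream(path), // chunk-by-chunk reading, never the whole file
    crlfDelay: Infinity,           // treat \r\n as a single line break
  });

  for await (const line of lines) {
    if (line.trim().length === 0) continue; // skip blank lines
    await handleRow(line.split(","));       // naive split; real CSVs need a proper parser
  }
}
```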
Sanitize Phase
Purpose
Transform untrusted, raw data into well-known and manageable structures.
Responsibilities
- Format conversion (CSV → JSON)
- Type normalization
- Basic data cleanup
Examples:
- "123" → number
- "TRUE" / "FALSE" → boolean
- Normalize date formats
This phase does not apply business rules. It only prepares data for downstream processing.
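A sketch of what this phase might look like; the field names and target shape are illustrative assumptions, not part of the original design:

```typescript
// Target shape after sanitization: typed and normalized, still free of business rules.
interface SanitizedRecord {
  amount: number;   // "123" -> 123
  active: boolean;  // "TRUE" / "FALSE" -> true / false
  issuedAt: string; // normalized to ISO 8601
}

function sanitize(raw: Record<string, string>): SanitizedRecord {
  return {
    amount: Number(raw["amount"]),
    active: raw["active"]?.trim().toUpperCase() === "TRUE",
    issuedAt: new Date(raw["issued_at"]).toISOString(),
  };
}
```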
Chunked Delivery to Domain Services
After sanitization, data is grouped into controlled batches and sent as events.
Benefits:
- Reduces pressure on the message broker
- Improves throughput
- Simplifies retries and failure handling
Each event represents a coherent batch of sanitized data.
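A sketch of the batching step, assuming a generic publish function (any broker client could sit behind it) and an illustrative topic name:

```typescript
// Groups sanitized records into fixed-size batches and publishes each batch as a
// single event, instead of one event per record.
async function deliverInBatches<T>(
  records: AsyncIterable<T>,
  publish: (topic: string, batch: T[]) => Promise<void>,
  batchSize = 500,
): Promise<void> {
  let batch: T[] = [];
  for await (const record of records) {
    batch.push(record);
    if (batch.length >= batchSize) {
      await publish("ingestion.sanitized", batch); // one event per full batch
      batch = [];
    }
  }
  if (batch.length > 0) {
    await publish("ingestion.sanitized", batch);   // flush the final partial batch
  }
}
```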
Domain Service Processing
Example: an Invoice microservice.
Responsibilities
- Map incoming data to the domain model
- Validate schemas and business rules
- Persist data in its own database
Individual records may fail during this process.
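A sketch of the mapping and validation step for such an Invoice service; the Invoice shape and the rules shown are illustrative assumptions:

```typescript
interface Invoice {
  id: string;
  customerId: string;
  amount: number;
  issuedAt: string;
}

// Maps a sanitized record to the domain model and applies business rules.
// Throwing here marks the individual record as failed without affecting the batch.
function mapToInvoice(record: Record<string, unknown>): Invoice {
  const invoice: Invoice = {
    id: String(record["id"] ?? ""),
    customerId: String(record["customer_id"] ?? ""),
    amount: Number(record["amount"]),
    issuedAt: String(record["issued_at"] ?? ""),
  };
  if (!invoice.id) throw new Error("missing invoice id");
  if (!invoice.customerId) throw new Error("missing customer id");
  if (!(invoice.amount > 0)) throw new Error("amount must be positive");
  return invoice;
}
```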
Error Handling and Dead Letter Queues (DLQ)
              +-------------------+
              |  Domain Service   |
              | (validation/map)  |
              +-------------------+
                        |
                valid   |   invalid
                        |
         +--------------+--------------+
         |                             |
         v                             v
+------------------+          +------------------+
|  Persist State   |          |       DLQ        |
|  (success path)  |          |  (audit/replay)  |
+------------------+          +------------------+

The problem
A single invalid record must not block the entire pipeline.
The solution
- Failed records are sent to a Dead Letter Queue (DLQ)
- They are stored together with failure reasons
- They can be audited or reprocessed later
This provides both resilience and traceability.
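A sketch of this routing, with persist and sendToDlq as hypothetical helpers standing in for the real persistence layer and DLQ producer:

```typescript
// Processes a batch record by record; failures are isolated and parked in the DLQ
// together with the failure reason, so one bad record never blocks the pipeline.
async function processBatch<T>(
  batch: T[],
  persist: (record: T) => Promise<void>,
  sendToDlq: (record: T, reason: string) => Promise<void>,
): Promise<void> {
  for (const record of batch) {
    try {
      await persist(record); // success path
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      await sendToDlq(record, reason); // audit/replay path
    }
  }
}
```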
Idempotency
        Event (retry)
              |
              v
  +-----------------------+
  |   Consumer Service    |
  |  (idempotent logic)   |
  +-----------------------+
        |           |
     ignored     applied

In distributed, event-driven systems, idempotency is a fundamental requirement. Due to retries, network issues, or manual reprocessing, the same event may reach a consumer multiple times.
The goal of idempotency is simple: processing the same event multiple times must produce the same result as processing it once.
What idempotency provides
- Safe retries
- Easier failure recovery
- Protection against duplicated data
- Confidence in asynchronous architectures
Strategy 1: Event ID Deduplication
Each event carries a unique event_id.
How it works
- The consumer stores processed event_ids
- Before processing, it checks for existence
- Already-processed events are ignored
What it provides
- Strong deduplication guarantees
- Easy auditing and reasoning
Trade-offs
- Requires additional storage
- Needs long-term cleanup or TTLs
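A sketch of this strategy; the processed-events store is modeled as a small interface (in practice it could be a database table or a Redis set with a TTL):

```typescript
interface ProcessedEventStore {
  has(eventId: string): Promise<boolean>;
  add(eventId: string): Promise<void>;
}

// Applies an event at most once by checking the event_id before processing.
// In production the check and the write should happen in the same transaction
// as the state change, to avoid races between concurrent consumers.
async function handleOnce(
  event: { event_id: string; payload: unknown },
  store: ProcessedEventStore,
  apply: (payload: unknown) => Promise<void>,
): Promise<void> {
  if (await store.has(event.event_id)) return; // duplicate: ignore
  await apply(event.payload);
  await store.add(event.event_id);             // remember it for future retries
}
```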
Strategy 2: Timestamp-based State Control
This strategy assumes the most recent state is authoritative.
How it works
- Events include a created_at timestamp
- Persisted entities store last_updated
- If created_at ≤ last_updated, the event is ignored
- Otherwise, it is processed and updates the state
What it provides
- No need to store processed events
- Works well for state-based models
Limitations
- Relies on reasonably synchronized clocks
- Best-effort consistency
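A sketch of this comparison; the entity shape is an assumption:

```typescript
interface EntityState {
  last_updated: string; // ISO 8601, persisted with the entity
  data: unknown;
}

// Returns the new state to persist, or undefined when the event is stale
// (its created_at is not newer than the state we already hold).
function applyIfNewer(
  current: EntityState | undefined,
  event: { created_at: string; data: unknown },
): EntityState | undefined {
  const eventTime = new Date(event.created_at).getTime();
  if (current && eventTime <= new Date(current.last_updated).getTime()) {
    return undefined; // duplicate or out-of-date event: ignore
  }
  return { last_updated: event.created_at, data: event.data };
}
```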
Strategy 3: Idempotent Operations
Instead of deduplicating events, operations themselves are designed to be idempotent.
Examples:
- Using UPSERT instead of INSERT
- Deterministic updates
- Full state replacement
What it provides
- Lower technical complexity
- Very robust against retries
Drawbacks
- Not always applicable
- Events must not be lost; combine this approach with the Outbox Pattern to guarantee delivery
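A sketch of such an idempotent write, assuming PostgreSQL-style SQL and a generic query helper; replaying the same event simply rewrites the same row:

```typescript
// An UPSERT makes the write idempotent: the first delivery inserts the row,
// any retry just overwrites it with identical values.
async function upsertInvoice(
  query: (sql: string, params: unknown[]) => Promise<void>,
  invoice: { id: string; amount: number; issuedAt: string },
): Promise<void> {
  await query(
    `INSERT INTO invoices (id, amount, issued_at)
     VALUES ($1, $2, $3)
     ON CONFLICT (id) DO UPDATE
       SET amount = EXCLUDED.amount,
           issued_at = EXCLUDED.issued_at`,
    [invoice.id, invoice.amount, invoice.issuedAt],
  );
}
```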
Strategy 4: Combining Approaches
In critical systems, strategies are often combined:
- event_id for strong deduplication
- Timestamps for state control
- Idempotent database operations
This balances safety, performance, and simplicity.
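A sketch of how the strategies can be layered in a single consumer; all dependencies are modeled as an injected interface and are assumptions rather than a prescribed API:

```typescript
interface ConsumerDeps {
  seen(eventId: string): Promise<boolean>;                     // strategy 1 storage
  markSeen(eventId: string): Promise<void>;
  lastUpdated(entityId: string): Promise<string | undefined>;  // strategy 2 state
  upsert(entityId: string, data: unknown): Promise<void>;      // strategy 3 write
}

async function consume(
  deps: ConsumerDeps,
  event: { event_id: string; created_at: string; entity_id: string; data: unknown },
): Promise<void> {
  if (await deps.seen(event.event_id)) return;                 // dedupe by event_id
  const last = await deps.lastUpdated(event.entity_id);
  if (last && new Date(event.created_at).getTime() <= new Date(last).getTime()) {
    return;                                                    // drop stale events
  }
  await deps.upsert(event.entity_id, event.data);              // idempotent write
  await deps.markSeen(event.event_id);
}
```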
Observability
A system without observability is impossible to operate reliably.
Key elements
- Structured logs with context
- Throughput and error metrics
- Automated alerts
These allow fast detection and response to issues.
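As an illustration of the logging side, a sketch of a structured, context-rich log entry; the field names are assumptions:

```typescript
// Emits one JSON log line with enough context to trace a batch end-to-end.
// A real system would typically use a structured logger (pino, winston, ...).
function logBatchProcessed(context: {
  source: string;
  batchId: string;
  recordCount: number;
  failedCount: number;
  durationMs: number;
}): void {
  console.log(
    JSON.stringify({
      timestamp: new Date().toISOString(),
      level: "info",
      message: "batch processed",
      ...context,
    }),
  );
}

logBatchProcessed({ source: "csv-upload", batchId: "b-42", recordCount: 500, failedCount: 3, durationMs: 130 });
```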
Conclusion
This design embraces the reality of production systems: imperfect data, constant failures, and the need to scale. Applying these principles results in a data ingestion system that is robust, maintainable, and ready for future growth.