[EDD] Event-Driven Design
Event-Driven Design (EDD) is an architecture where the system reflects and communicates important business changes through events. Services react to what happens, rather than through direct calls.
EDD is a good fit for systems that must be distributed, scalable, and resilient: assume failures, design for them, and use events as the language of the business. It is often used alongside Domain-Driven Design (DDD).
What is an event?
An event represents a domain fact that has already occurred and cannot be changed.
Examples:
OrderCreated
PaymentConfirmed
UserRegistered
An event is not an intention, it is a statement of the past.
Core components in EDD
| Component | Description | Examples |
|---|---|---|
| Broker | Ensures event delivery | Kafka, RabbitMQ |
| Producer | Service that emits the event | — |
| Consumer | Service that processes the event | — |
Eventual Consistency
In distributed systems like those using EDD, not all parts of the system are consistent at the same time. Each service processes events at its own pace. Over time, all converge to the same state.
Example: right after OrderCreated is emitted, the inventory service may not yet have reserved stock because it has not processed the event. States may differ temporarily, but the system eventually becomes consistent.
Failures in distributed systems
In distributed systems, failures are not an exception, they are the norm.
Common principles:
- At-most-once: the event may be lost; it is never delivered twice.
- At-least-once: the event arrives at least once and may arrive more than once; if it arrives multiple times, the consumer must be idempotent (the final effect is the same whether it is delivered once or several times).
- Exactly-once: the event is received exactly once; this is rare and expensive to guarantee, so it cannot be assumed.
Core Principles
1. Events reflect domain state changes
They are only emitted when something relevant to the business changes.
2. Events are immutable
Once emitted, they do not change. If something changes, a new event is emitted.
3. Idempotency
A consumer must be able to process the same event multiple times without side effects.
This is critical because:
- Distributed systems fail
- Retries occur
- Duplicates may exist
Processing N times must produce the same result as processing once.
Learn more at Idempotency in Distributed Systems.
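A minimal sketch of the "process N times = process once" property, assuming an event carries a unique `id` assigned by the producer; the in-memory `processed` set and the `balance` domain state are illustrative, not part of any broker API:

```typescript
// Idempotent consumer sketch: duplicate deliveries produce no extra side effects.
interface DomainEvent {
  id: string;     // unique event id assigned by the producer
  type: string;
  amount: number;
}

const processed = new Set<string>(); // ids of already-handled events
let balance = 0;                     // example domain state

function handle(event: DomainEvent): void {
  if (processed.has(event.id)) return; // duplicate delivery: skip
  balance += event.amount;             // the actual domain change
  processed.add(event.id);
}

// The broker redelivers the same event twice; the effect is applied once.
const e: DomainEvent = { id: "evt-1", type: "PaymentConfirmed", amount: 50 };
handle(e);
handle(e); // duplicate — ignored
```

A real system would back `processed` with durable storage, since an in-memory set is lost on restart.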
4. Event ordering
In some flows, order matters. This is achieved with partitions: within a partition, events are processed in FIFO order.
Example: for per-customer ordering, the partition key would be the customer ID (client_id), so all of a customer's events land in the same partition.
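The key-to-partition mapping can be sketched as below. The rolling hash is only an illustration (Kafka's default partitioner uses murmur2); the point is that the mapping is deterministic, so every event with the same client_id goes to the same partition:

```typescript
// Deterministic key-based partitioning sketch: same key -> same partition.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple rolling hash
  }
  return hash % numPartitions;
}

// Every event for client "client-42" lands on the same partition,
// so its events keep FIFO order relative to each other.
const p1 = partitionFor("client-42", 6);
const p2 = partitionFor("client-42", 6);
// p1 === p2
```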
5. Durability and reliability
Events must not be lost.
Achieved through: persistent brokers, logs, and database persistence.
6. Security
Events do not expose sensitive data or internal schemas; they follow clear contracts (schema, version).
Authentication happens at the producers and consumers. 👉 It does not occur in Kafka or in the event itself. 👉 It occurs when a microservice exposes or consumes a synchronous API.
Events themselves are not authenticated; they are secured by infrastructure:
- TLS: encryption in transit
- SASL: client authentication to the broker
- Kafka ACLs: an ACL (Access Control List) defines which identity can perform which action on which resource.
  - Example: an ACL per topic is a rule like "Service X can PRODUCE to topic Y, but not CONSUME."
Authorization can be scoped by:
- Topic
- Consumer group
- Service
This way, Kafka controls who can publish and who can consume.
7. Observability
It should be possible to:
- Trace an event end-to-end
- Debug failures
- Correlate events
Events tell you what happened (the business fact); traces and logs tell you how it happened (flow and processing time).
Example of traces for a CustomerCreated event published by a microservice:
```json
{
  "event": "CustomerCreated",
  "customerId": "123",
  "traceId": "trace-1001",  // the entire creation operation
  "spanId": "span-02",      // this specific step: publishing the event
  "createdAt": "2026-01-22T12:00:00Z"
}
```
`traceId` connects the event through all microservices; `spanId` shows exactly which step produced this event.
If you only include traceId without spanId, you can still track the flow, but you won’t be able to measure latency or pinpoint which step generated each event.
Idempotency in depth
As mentioned before, in distributed systems failures are normal, so events may be duplicated due to:
- Service crashes/restarts
- Network failures
- Broker re-sends or producer retries
It is the consumer's responsibility to protect against these cases.
Common techniques
- Idempotence key: store the event_id (for example in an ingested_events table) and ignore the event if it was already processed
- Safe modification operations: use append/patch-style operations instead of blind inserts, so a replayed event does not create duplicates
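A sketch of the idempotence-key technique: record each event_id before applying effects, and skip events already seen. A real implementation would use a UNIQUE constraint on an ingested_events table inside the same transaction as the domain change; the Set below stands in for that table:

```typescript
// Idempotence-key sketch: deduplicate by event id before processing.
const ingestedEvents = new Set<string>(); // stands in for an ingested_events table

function tryIngest(eventId: string): boolean {
  if (ingestedEvents.has(eventId)) return false; // already processed: skip
  ingestedEvents.add(eventId);                   // like INSERT ... ON CONFLICT DO NOTHING
  return true;
}

// The duplicate delivery of evt-9 is detected and ignored:
const results = ["evt-9", "evt-9", "evt-10"].map(tryIngest);
// → [true, false, true]
```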
Event Sourcing
State is not stored as a snapshot (final value); it is reconstructed from events. That is, all events that caused state changes are stored and replayed to rebuild the current state.
Characteristics:
- Events are the source of truth
- State can be recalculated
- Full audit trail
Not always necessary, but fits very well with EDD.
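Rebuilding state from the log can be sketched as a fold over the ordered events; the account/event names here are illustrative, not from the text above:

```typescript
// Event sourcing sketch: the balance is never stored, only recomputed
// by replaying the ordered event log (which doubles as an audit trail).
type AccountEvent =
  | { type: "Deposited"; amount: number }
  | { type: "Withdrawn"; amount: number };

function rebuildBalance(eventLog: AccountEvent[]): number {
  return eventLog.reduce(
    (balance, event) =>
      event.type === "Deposited" ? balance + event.amount : balance - event.amount,
    0,
  );
}

const eventLog: AccountEvent[] = [
  { type: "Deposited", amount: 100 },
  { type: "Withdrawn", amount: 30 },
  { type: "Deposited", amount: 5 },
];
const balance = rebuildBalance(eventLog); // 75, recomputed from the log
```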
How to prevent event loss
On the broker
- Persistent broker (Kafka / RabbitMQ): events are persisted.
- Recovery after crashes: events are resent from the broker's last known state.
- Partitions for ordering: events in a partition are processed in FIFO order.
- Dead Letter Queue (DLQ): failed events are sent to this queue for later detailed evaluation (even manually).
What is a DLQ?
A queue that allows: reprocessing failed events and analyzing errors.
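The retry-then-DLQ behavior can be sketched as below; `maxRetries` and the in-memory `dlq` array are illustrative stand-ins for broker-side retry policy and a real DLQ topic:

```typescript
// DLQ sketch: retry a failing event a few times, then park it in the
// dead letter queue instead of blocking the rest of the partition.
const dlq: { event: string; error: string }[] = [];

function consume(event: string, process: (e: string) => void, maxRetries = 3): void {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      process(event);
      return; // processed successfully
    } catch (err) {
      if (attempt === maxRetries) {
        // retries exhausted: park the event for later (even manual) analysis
        dlq.push({ event, error: String(err) });
      }
    }
  }
}

consume("OrderCreated#1", () => { throw new Error("schema mismatch"); });
consume("OrderCreated#2", () => { /* succeeds */ });
// dlq now holds only OrderCreated#1
```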
In the system
- Persistent logs: Cloudwatch / ELK / Loki, etc.
- Monitoring: allows visualization and analysis of events in real-time and over time (Prometheus/Grafana).
- Alerts: allow preventing erroneous behaviors.
On the consumer
Typical flow: receive the event, process it (idempotently), and only then acknowledge it / commit the offset. If the consumer crashes before acknowledging, the broker redelivers the event instead of losing it.
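A typical consume-then-commit flow can be sketched as follows, with the broker partition faked as an array; committing the offset only after the handler succeeds is what turns a crash into a redelivery rather than a lost event:

```typescript
// At-least-once consumer sketch: process first, commit the offset second.
const topic = ["evt-1", "evt-2", "evt-3"]; // stand-in for a partition
let committedOffset = 0;                    // last acknowledged position
const handled: string[] = [];

function poll(handler: (e: string) => void): void {
  while (committedOffset < topic.length) {
    const event = topic[committedOffset];
    handler(event);       // 1. process (handler must be idempotent)
    committedOffset += 1; // 2. commit only after success; a crash before
                          //    this line causes a redelivery, not a loss
  }
}

poll((e) => handled.push(e));
```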
On the producer
The Outbox Pattern applies
Outbox Pattern
Prevents losing events when the DB and broker are not synchronized.
The Outbox Pattern is applied by the consumer service that receives an event, persists a domain change, and as a consequence must publish a new event.
In other words, the actor implementing the Outbox Pattern is the consumer that becomes a producer.
The problem it solves
A common event-driven flow is:
- A microservice consumes an event
- It persists a domain state change
- It publishes a new event to the broker
If the service crashes between steps 2 and 3:
- The domain state is already persisted
- The outgoing event is never published
- The system becomes inconsistent
Role of the Outbox-enabled service
The service:
- Consumes upstream events
- Persists domain state changes
- Produces downstream events
The Outbox Pattern ensures these responsibilities are handled safely.
How it works
- The service receives an event
- A database transaction is started
- The domain change is persisted
- The outgoing event is written to an outbox table
- The transaction is committed
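The transactional half of the steps above can be sketched as follows; the in-memory `db` object simulates a database where both writes commit atomically (a real implementation would wrap them in one SQL transaction):

```typescript
// Outbox write sketch: the domain change and the outgoing event are
// persisted together, so either both exist or neither does.
type OutboxRow = { id: string; payload: string; status: "PENDING" | "SENT" };

const db = {
  orders: [] as { id: string }[],
  outbox: [] as OutboxRow[],
};

function createOrder(orderId: string): void {
  // stage both writes, then apply together (simulates a DB transaction)
  const order = { id: orderId };
  const row: OutboxRow = {
    id: `evt-${orderId}`,
    payload: JSON.stringify({ event: "OrderCreated", orderId }),
    status: "PENDING",
  };
  db.orders.push(order);
  db.outbox.push(row); // committed atomically with the domain change
}

createOrder("o-1");
// the PENDING event now exists even if the broker is currently down
```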
A separate background worker:
- Polls the outbox table
- Publishes pending events to the broker
- Marks events as SENT
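The relay worker's loop can be sketched as below; `publish` is an illustrative stand-in for a real broker client, and a failed publish simply leaves the row PENDING for the next poll:

```typescript
// Outbox relay sketch: poll PENDING rows, publish, then mark SENT.
type Row = { id: string; status: "PENDING" | "SENT" };

const outbox: Row[] = [
  { id: "evt-1", status: "PENDING" },
  { id: "evt-2", status: "PENDING" },
];
const published: string[] = [];

function publish(id: string): void { published.push(id); } // broker stand-in

function relayOnce(): void {
  for (const row of outbox) {
    if (row.status !== "PENDING") continue;
    try {
      publish(row.id);
      row.status = "SENT"; // marked only after a successful publish
    } catch {
      // leave as PENDING; the next poll retries it
    }
  }
}

relayOnce();
```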
Typical outbox states
- PENDING: waiting to be published
- SENT: successfully published
- FAILED (optional): failed after retries
What this pattern provides
- Guarantees that if the domain state exists, the event will eventually exist
- Enables safe retries
- Prevents event loss
- Decouples domain logic from broker availability
Relationship with idempotency
Because delivery is usually at-least-once:
- Downstream consumers must be idempotent
- The Outbox Pattern guarantees delivery, not uniqueness
Both patterns complement each other.
If the broker is down when the worker polls, events stay PENDING and will be retried when the broker comes back; if the publisher crashes right after committing the transaction, the events are already in the outbox, so no events are lost and the worker can send them when the publisher restarts.