
[EDD] Event-Driven Design

Event-Driven Design (EDD) is an architecture in which the system reflects and communicates important business changes through events: services react to what has happened instead of being invoked through direct calls.

EDD is a good fit for systems that need to be distributed, scalable, and resilient.

💡

Assume failures, design for them, and use events as the language of the business. EDD is often used alongside DDD (Domain-Driven Design).


What is an event?

An event represents a domain fact that has already occurred and cannot be changed.

Examples: OrderCreated, PaymentConfirmed, UserRegistered

⚠️

An event is not an intention; it is a statement about the past.
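A minimal sketch of what such an event might look like in TypeScript; the envelope fields (eventId, type, occurredAt, payload) are illustrative, not a standard schema:

```typescript
// Illustrative event envelope: an immutable record of a domain fact.
interface DomainEvent<T> {
  readonly eventId: string;    // unique id, later used for idempotency checks
  readonly type: string;       // past-tense name: the fact that occurred
  readonly occurredAt: string; // ISO-8601 timestamp
  readonly payload: T;         // immutable business data
}

const orderCreated: DomainEvent<{ orderId: string; total: number }> = {
  eventId: "evt-001",
  type: "OrderCreated",
  occurredAt: "2026-01-22T12:00:00Z",
  payload: { orderId: "ord-123", total: 49.9 },
};
```

The past-tense `type` is the point: the event records something that already happened, so consumers never negotiate, they only react.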


Core components in EDD

| Component | Description | Responsibilities | Examples |
| --- | --- | --- | --- |
| Broker | Ensures event delivery | Message persistence; retries; ordering within a partition | Kafka, RabbitMQ |
| Producer | Service that emits the event | Generate the event after a domain change; send the event to the broker | — |
| Consumer | Service that processes the event | Process the event correctly; be idempotent; handle retries; send ACK to the broker | — |

Eventual Consistency

⚠️

In distributed systems like those using EDD, not all parts of the system are consistent at the same time. Each service processes events at its own pace. Over time, all converge to the same state.

Example: when an OrderCreated event is published, the billing service may already have processed it while the inventory service is still catching up, so for a moment the order exists without reserved stock.

States may differ temporarily, but the system eventually becomes consistent.
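The idea can be sketched with two in-memory read models consuming the same event log at different speeds (all names here are illustrative):

```typescript
// Illustrative only: two "services" consume the same event log at different
// paces; their states differ temporarily but converge once both catch up.
type OrderEvent = { type: "OrderCreated"; orderId: string };

const log: OrderEvent[] = [
  { type: "OrderCreated", orderId: "A" },
  { type: "OrderCreated", orderId: "B" },
];

class Projection {
  orders = new Set<string>();
  private offset = 0; // how far into the log this service has processed
  poll(n: number): void {
    for (const e of log.slice(this.offset, this.offset + n)) {
      this.orders.add(e.orderId);
    }
    this.offset = Math.min(this.offset + n, log.length);
  }
}

const fast = new Projection();
const slow = new Projection();

fast.poll(2); // fast service already sees both orders
slow.poll(1); // slow service lags: states differ temporarily
slow.poll(1); // ...later it catches up and both converge
```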


Failures in distributed systems

⚠️

In distributed systems, failures are not an exception, they are the norm.

Common principles:

  • At-most-once: the event may be lost; it is never delivered twice.
  • At-least-once: the event is delivered at least once, but may arrive more than once; the broker and consumer must handle duplicates, so the consumer must be idempotent (the final effect is the same whether the event is delivered once or several times).
  • Exactly-once: the event is received exactly once; this is rare and expensive to guarantee, so it cannot be assumed.

Core Principles

1. Events reflect domain state changes

They are only emitted when something relevant to the business changes.

2. Events are immutable

Once emitted, they do not change. If something changes, a new event is emitted.

3. Idempotency

A consumer must be able to process the same event multiple times without side effects.
This is critical because:

  • Distributed systems fail
  • Retries occur
  • Duplicates may exist

Processing N times must produce the same result as processing once.

Learn more at Idempotency in Distributed Systems.


4. Event ordering

In some flows, order matters.

This is achieved with partitions: within a partition, events are processed in FIFO order.

Example: for events per customer, the partition key would be the customer ID (client_id), so all of a customer's events land in the same partition and are consumed in order.
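Key-based partition selection can be sketched as follows; the hash below is a simple stand-in for the broker's real partitioner (Kafka's default, for instance, uses murmur2):

```typescript
// Sketch of key-based partition selection: hash the key, mod the partition
// count. The hash here is illustrative, not Kafka's actual algorithm.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep it unsigned 32-bit
  }
  return hash % numPartitions;
}

// The same client_id always maps to the same partition, so that client's
// events are consumed in FIFO order relative to each other.
const p1 = partitionFor("client-42", 6);
const p2 = partitionFor("client-42", 6);
```

With a real client library, the same effect is obtained by setting the message key to `client_id` and letting the broker's partitioner do the hashing.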


5. Durability and reliability

Events must not be lost.

Achieved through: persistent brokers, logs, and database persistence.


6. Security

🚫

Events do not expose sensitive data or internal schemas; they follow clear contracts (schema, version).

Authentication applies to producers and consumers, not to events. 👉 It does not occur in Kafka or in the event itself. 👉 It occurs when a microservice exposes or consumes a synchronous API.

Events are not authenticated; they are authorized/secured by infrastructure:

  • Infrastructure security:

    • TLS
    • SASL
    • Kafka ACLs: an ACL (Access Control List) defines which identity can perform which action on which resource.
      • Example: a per-topic ACL is a rule like: “Service X can PRODUCE to topic Y, but not CONSUME from it.”
  • Authorization by:

    • Topic
    • Consumer group
    • Service

Kafka ensures:

  • Who can publish
  • Who can consume
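Such rules can be created with Kafka's `kafka-acls` CLI; the principal, topic, and group names below are illustrative (and the broker must be configured with an authorizer for them to take effect):

```shell
# Allow service-x to produce to the "orders" topic (names are examples)
kafka-acls.sh --bootstrap-server localhost:9092 \
  --add --allow-principal User:service-x \
  --operation Write --topic orders

# Allow service-y to consume from "orders" with its consumer group
kafka-acls.sh --bootstrap-server localhost:9092 \
  --add --allow-principal User:service-y \
  --operation Read --topic orders --group service-y-consumers
```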

7. Observability

👀

It should be possible to:

  • Event Trace: trace an event end-to-end
  • Debug failures
  • Correlate events
💡
  • Events → what happened (business event)
  • Traces/logs → how it happened (flow and processing time)

Example of traces for a CustomerCreated event published by a microservice:

{
  "event": "CustomerCreated",
  "customerId": "123",
  "traceId": "trace-1001",  // the entire creation operation
  "spanId": "span-02",      // this specific step: publishing the event
  "createdAt": "2026-01-22T12:00:00Z"
}
  • traceId connects the event through all microservices.
  • spanId shows exactly which step produced this event.

If you only include traceId without spanId, you can still track the flow, but you won’t be able to measure latency or pinpoint which step generated each event.


Idempotency in depth

As mentioned before, in distributed systems failures are normal, so events may be duplicated due to:

🚫

Service crashes/restarts

🚫

Network failures

🚫

Broker re-sends or producer retries

⚠️

It is the consumer's responsibility to protect against these cases.

Common techniques

    1. Idempotency key

    💡

    Store the event_id (for example in an ingested_events table) and ignore the event if it has already been processed.

    2. Safe modification operations

    💡

    Use append/patch (upsert-style) operations instead of blind inserts, so reprocessing the same event does not create duplicates.
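A minimal sketch of the idempotency-key technique, using an in-memory set in place of a real ingested_events table (in production the check and the write must be atomic, e.g. via a unique constraint on event_id):

```typescript
// In-memory stand-in for the ingested_events table.
const ingestedEvents = new Set<string>();
let balance = 0; // the domain state the event mutates

function processEvent(eventId: string, amount: number): boolean {
  if (ingestedEvents.has(eventId)) {
    return false; // duplicate delivery: already processed, ignore it
  }
  balance += amount;           // the actual side effect
  ingestedEvents.add(eventId); // record the event_id as processed
  return true;
}

processEvent("evt-1", 100); // first delivery: applied
processEvent("evt-1", 100); // retry/duplicate: ignored, balance unchanged
```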


    Event Sourcing

    Instead of storing the state as a snapshot (the final value), every event that caused a state change is stored, and the current state is rebuilt by replaying those events.

    💡

    Characteristics:

    • Events are the source of truth
    • State can be recalculated
    • Full audit trail

    Not always necessary, but fits very well with EDD.
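A small sketch of replaying events to rebuild state (event names and amounts are made up):

```typescript
// The event log is the source of truth; the balance is derived, never stored.
type AccountEvent =
  | { type: "Deposited"; amount: number }
  | { type: "Withdrawn"; amount: number };

const events: AccountEvent[] = [
  { type: "Deposited", amount: 100 },
  { type: "Withdrawn", amount: 30 },
  { type: "Deposited", amount: 50 },
];

// Replaying the log recalculates the current state: 100 - 30 + 50 = 120.
const balance = events.reduce(
  (acc, e) => (e.type === "Deposited" ? acc + e.amount : acc - e.amount),
  0
);
```

The same replay also serves as the audit trail: every intermediate state can be reconstructed by stopping the fold at any point in the log.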


    How to prevent event loss

    On the broker

    • Persistent broker (Kafka / RabbitMQ): events are persisted.
    • Recovery after crashes: events are resent from the broker's last known state.
    • Partitions for ordering: events in a partition are processed in FIFO order.
    • Dead Letter Queue (DLQ): failed events are sent to this queue for later detailed evaluation (even manually).

    What is a DLQ?

    A queue that allows: reprocessing failed events and analyzing errors.
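A sketch of a retry-then-DLQ policy with in-memory queues; the retry count and queue shape are illustrative:

```typescript
// Failed events are retried a few times, then parked in the DLQ for later
// analysis or manual reprocessing.
const dlq: { event: string; error: string }[] = [];
const MAX_RETRIES = 3;

function consume(event: string, handler: (e: string) => void): void {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    try {
      handler(event);
      return; // processed successfully, nothing to park
    } catch (err) {
      if (attempt === MAX_RETRIES) {
        dlq.push({ event, error: String(err) }); // retries exhausted
      }
    }
  }
}

consume("evt-ok", () => {});                            // succeeds, no DLQ entry
consume("evt-bad", () => { throw new Error("boom"); }); // fails 3 times, goes to DLQ
```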


    In the system

    • Persistent logs: Cloudwatch / ELK / Loki, etc.
    • Monitoring: allows visualization and analysis of events in real-time and over time (Prometheus/Grafana).
    • Alerts: make it possible to detect erroneous behavior early, before it spreads.

    On the consumer

    Typical flow:

    1. Receive the event from the broker
    2. Check idempotency: skip the event if its event_id was already processed
    3. Process the event and persist the resulting change
    4. Send ACK to the broker (on failure, retry or route the event to the DLQ)

    On the producer

    The Outbox Pattern applies

    Outbox Pattern

    Prevents losing events when the DB and broker are not synchronized.

    The Outbox Pattern is applied by any service that persists a domain change and, as a consequence, must publish a new event.

    In the flow described here, that actor is the consumer that becomes a producer: it receives an event, persists the change, and then emits a new one.


    The problem it solves

    A common event-driven flow is:

    1. A microservice consumes an event
    2. It persists a domain state change
    3. It publishes a new event to the broker

    If the service crashes between steps 2 and 3:

    • The domain state is already persisted
    • The outgoing event is never published
    • The system becomes inconsistent

    Role of the Outbox-enabled service

    The service:

    • Consumes upstream events
    • Persists domain state changes
    • Produces downstream events

    The Outbox Pattern ensures these responsibilities are handled safely.


    How it works

    1. The service receives an event
    2. A database transaction is started
    3. The domain change is persisted
    4. The outgoing event is written to an outbox table
    5. The transaction is committed

    A separate background worker:

    • Polls the outbox table
    • Publishes pending events to the broker
    • Marks events as SENT

    Typical outbox states

    • PENDING: waiting to be published
    • SENT: successfully published
    • (optional) FAILED: failed after retries
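The steps above can be sketched with in-memory stand-ins for the tables and the broker; in a real implementation, steps 2-5 run inside a single database transaction:

```typescript
// Outbox Pattern sketch: the "transaction" writes the domain change and the
// outbox row together; a worker later publishes pending rows to the broker.
type OutboxRow = { id: string; payload: string; status: "PENDING" | "SENT" };

const ordersTable: string[] = [];    // domain table
const outboxTable: OutboxRow[] = []; // outbox table (same database)
const broker: string[] = [];         // stands in for Kafka/RabbitMQ

// Steps 2-5: both writes succeed together (or neither, in a real DB transaction).
function handleOrderCreated(orderId: string): void {
  ordersTable.push(orderId); // 3. persist the domain change
  outboxTable.push({         // 4. write the outgoing event to the outbox
    id: `evt-${orderId}`,
    payload: `OrderConfirmed:${orderId}`,
    status: "PENDING",
  });
  // 5. commit
}

// Background worker: polls the outbox, publishes, marks rows as SENT.
function outboxWorker(): void {
  for (const row of outboxTable) {
    if (row.status === "PENDING") {
      broker.push(row.payload);
      row.status = "SENT";
    }
  }
}

handleOrderCreated("ord-1");
outboxWorker();
```

Real implementations publish outbox rows either with a polling worker like this one or via change-data-capture tooling (e.g. Debezium) tailing the outbox table.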

    What this pattern provides

    • Guarantees that if the domain state exists, the event will eventually exist
    • Enables safe retries
    • Prevents event loss
    • Decouples domain logic from broker availability

    Relationship with idempotency

    Because delivery is usually at-least-once:

    • Downstream consumers must be idempotent
    • The Outbox Pattern guarantees delivery, not uniqueness

    Both patterns complement each other.

    💡
    In the Outbox Pattern, if the broker goes down, events remain in the outbox table with status PENDING and are retried when the broker comes back. If the publisher crashes right after committing the transaction, the events are already in the outbox, so no events are lost: the worker sends them when the publisher restarts.