Error Handling in Microservices with Asynchronous Integrations
How to design resilient microservices when integrating with external APIs like OpenAI
Introduction
When a microservice integrates asynchronously with an external service (for example, the OpenAI API), failure is not a possibility — it is a normal system state.
This post describes a practical and scalable approach to building resilience from day one.
1. Assume failure as a normal state
An external service can:
- Stop responding
- Respond slowly
- Fail intermittently
- Go completely down
Designing systems without assuming this reality leads to cascading failures.
In distributed systems, external availability is never under your control.
2. Observability as the foundation
Before you can recover from a failure, you need to detect and understand it.
Always include:
- Structured logs (requestId, service, errorType)
- Metrics (latency, error rate, p95/p99)
- Alerts based on trends, not single events
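A minimal sketch of what a structured log entry can look like (the field names follow the list above; the service name and values are illustrative, and in practice you would emit these through a logger such as pino or winston):

```typescript
// Emit one JSON object per line so log aggregators can parse and index each field.
interface StructuredLog {
  timestamp: string;
  level: "info" | "warn" | "error";
  service: string;
  requestId: string;
  errorType?: string;
  message: string;
  latencyMs?: number;
}

function logEvent(entry: StructuredLog): void {
  console.log(JSON.stringify(entry));
}

logEvent({
  timestamp: new Date().toISOString(),
  level: "error",
  service: "openai-proxy",
  requestId: "req-123",
  errorType: "TimeoutError",
  message: "External API did not respond within the configured timeout",
  latencyMs: 5012,
});
```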
3. Timeouts and SLA
A timeout defines how long you are willing to wait for an external response.
SLA (Service Level Agreement) defines expected availability, latency, and reliability.
Your timeout should be based on the SLA you can accept — not the provider’s SLA.
Best practices: explicit timeouts, never rely on defaults, keep them short and realistic (2–5s).
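As a sketch of an explicit timeout (assuming Node 18+ with the global fetch API; the endpoint and payload are placeholders, not a real OpenAI client setup):

```typescript
// Abort the outbound call if the external service takes longer than 3 seconds.
async function callExternalApi(payload: unknown): Promise<unknown> {
  const response = await fetch("https://api.example.com/v1/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
    signal: AbortSignal.timeout(3_000), // explicit timeout, never the default
  });
  if (!response.ok) {
    throw new Error(`External API returned ${response.status}`);
  }
  return response.json();
}
```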
4. Retries with exponential backoff
Retries should only be applied to transient failures (timeouts, 5xx errors).
Recommended strategy:
- Maximum number of attempts
- Exponential backoff + jitter
- Never infinite retries
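A minimal sketch of this strategy (the transient-error classification is an assumption; adapt it to your HTTP client's error shape):

```typescript
// Retry only transient failures, with exponential backoff plus jitter and a hard cap.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransient(err) || attempt === maxAttempts) throw err;
      // Backoff doubles each attempt (200ms, 400ms, 800ms...), scaled by random jitter
      // so many clients do not retry at the same instant.
      const delay = baseDelayMs * 2 ** (attempt - 1) * Math.random();
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable");
}

function isTransient(err: unknown): boolean {
  // Illustrative classification: treat timeouts and 5xx responses as retryable.
  return err instanceof Error && /timeout|5\d\d/i.test(err.message);
}
```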
5. Circuit Breaker
A circuit breaker prevents continuously calling a service that is already failing.
Trigger
It opens when defined thresholds are exceeded, such as:
- Consecutive failures
- Error rate within a time window
- Repeated timeouts
For a circuit breaker to work correctly, it must persist and evaluate success/failure data from calls to the external service.
States
- Closed: normal operation
- Open: calls are blocked
- Half-open: controlled test requests to check recovery
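A minimal in-memory sketch of these states (the thresholds are illustrative; in production you would typically use a library such as opossum for Node.js, and persist the counters if you run multiple instances):

```typescript
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly resetTimeoutMs = 30_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      // While open, reject immediately; after the reset timeout, allow one test call.
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("Circuit open: call rejected");
      }
      this.state = "half-open";
    }
    try {
      const result = await fn();
      this.state = "closed"; // success closes the circuit and resets the counter
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```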
6. Fallbacks and graceful degradation
When the circuit is open:
- Return cached responses
- Redirect to an alternative service
- Offer limited functionality
- Return an informative message
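A small sketch of such a fallback chain (the cache here is just an in-memory Map, and the degraded message is an example):

```typescript
// Graceful degradation: serve the last good answer if we have one, otherwise an informative message.
async function getCompletionWithFallback(
  prompt: string,
  fetchCompletion: (prompt: string) => Promise<string>,
  cache: Map<string, string>,
): Promise<string> {
  try {
    const result = await fetchCompletion(prompt);
    cache.set(prompt, result); // remember the last good answer for future fallbacks
    return result;
  } catch {
    return cache.get(prompt) ?? "This feature is temporarily degraded. Please try again later.";
  }
}
```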
7. Asynchronous processing and DLQ
For non–time-critical operations:
- Use queues
- Process with workers
- Apply retries in a decoupled way
After multiple failures:
- Send the message to a Dead Letter Queue (DLQ)
- Manual or automated reprocessing
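A decoupled worker might look roughly like this (the Queue interface, queue names, and attempt limit are assumptions standing in for your broker, e.g. SQS or RabbitMQ):

```typescript
interface QueueMessage {
  id: string;
  body: string;
  attempts: number;
}

interface Queue {
  receive(): Promise<QueueMessage | null>;
  publish(queueName: string, message: QueueMessage): Promise<void>;
  ack(message: QueueMessage): Promise<void>;
}

const MAX_ATTEMPTS = 5;

async function processNext(queue: Queue, handle: (m: QueueMessage) => Promise<void>): Promise<void> {
  const message = await queue.receive();
  if (!message) return;
  try {
    await handle(message);
    await queue.ack(message);
  } catch {
    if (message.attempts + 1 >= MAX_ATTEMPTS) {
      // Exhausted retries: park the message in the DLQ for analysis or reprocessing.
      await queue.publish("dead-letter-queue", message);
    } else {
      // Re-publish with an incremented attempt counter so it is retried later.
      await queue.publish("work-queue", { ...message, attempts: message.attempts + 1 });
    }
    await queue.ack(message);
  }
}
```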
8. Idempotency Keys and Correlation IDs
Learn more at Idempotency in Distributed Systems.
Idempotency Keys
Ensure an operation is executed only once, even if retries occur.
Examples of idempotency keys:
eventId, requestId, operationId
Duplicate detection is done by checking whether the idempotency key has already been processed.
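A sketch of that duplicate check (the in-memory Set stands in for a persistent store such as Redis or a database table with a unique constraint on the key):

```typescript
const processedKeys = new Set<string>();

// Execute the operation only if this idempotency key has not been processed before.
async function handleOnce(idempotencyKey: string, operation: () => Promise<void>): Promise<void> {
  if (processedKeys.has(idempotencyKey)) {
    return; // duplicate delivery (e.g. a retried event): skip the side effect
  }
  await operation();
  processedKeys.add(idempotencyKey);
}
```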
Correlation IDs
Allow tracing a full operation across services, logs, and events.
They typically correspond to a traceId.
Using tools like OpenTelemetry, Jaeger, or Zipkin, a traceId makes it possible to follow a complete business operation end-to-end.
Each traced event should include:
```
{
  "event_id": "UUID",
  "correlation_id": "traceId",
  "causation_id": "Parent event_id that triggered this event"
}
```
This allows rebuilding the causal chain and end-to-end auditability.
9. Failure handling priority order
- Timeout
- Retries
- Circuit Breaker
- Fallback / Degradation
- DLQ
10. Architecture diagram
- Client → Service: The client (frontend or another service) sends a request to the microservice.
- Service → ExternalAPI: Synchronous path. The service attempts to call the external API using timeouts, retries, and a circuit breaker.
- Service → Queue: If the operation is not time-critical or the external service is degraded, the service publishes an event to a queue.
- Queue → Worker: An independent worker consumes messages asynchronously, fully decoupled from the user request flow.
- Worker → ExternalAPI: The worker calls the external API without blocking the client and can apply more aggressive retry policies.
- Worker → DLQ: If the operation keeps failing after multiple attempts, the message is sent to the Dead Letter Queue for analysis or reprocessing.
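To tie the flows together, here is a rough sketch of the routing decision; it reuses the hypothetical CircuitBreaker, withRetries, callExternalApi, and Queue helpers from the sketches above, and the timeCritical flag is an assumption about how the caller classifies the operation:

```typescript
import { randomUUID } from "node:crypto";

async function handleRequest(
  payload: unknown,
  timeCritical: boolean,
  breaker: CircuitBreaker,
  queue: Queue,
): Promise<{ status: "done" | "queued"; result?: unknown }> {
  if (timeCritical) {
    // Synchronous path: timeout + retries + circuit breaker.
    const result = await breaker.call(() => withRetries(() => callExternalApi(payload)));
    return { status: "done", result };
  }
  // Asynchronous path: publish to the queue and let a worker handle it.
  await queue.publish("work-queue", {
    id: randomUUID(),
    body: JSON.stringify(payload),
    attempts: 0,
  });
  return { status: "queued" };
}
```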
11. Audit Logs in Distributed Systems
To audit a distributed system, you need to record events in a way that lets you rebuild the causal chain and audit operations end to end. The logs must be immutable (append-only) and stored independently of the application (domain).
Examples of append-only systems:
- WORM database / S3 with Object Lock
- Dedicated Event Store
- Data Lake with retention policies
Example of a stored event:
```
{
  "event_id": "UUID",
  "event_type": "string",
  "aggregate_id": "debtor_id",   // id of the domain entity this event relates to
  "payload_hash": "string",      // allows verifying integrity without exposing the data
  "producer": "DebtorService",
  "timestamp": "string",
  "trace_id": "string"
}
```
Once stored, an event can't be deleted or modified.
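As a sketch of how payload_hash can be produced without storing the sensitive payload itself (using Node's built-in crypto module):

```typescript
import { createHash } from "node:crypto";

// Store only a SHA-256 digest of the payload so integrity can be verified later
// without persisting or exposing the underlying data.
function payloadHash(payload: unknown): string {
  return createHash("sha256").update(JSON.stringify(payload)).digest("hex");
}
```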
Final flow: