Cloud Design Patterns — Ambassador, CQRS, Event Sourcing, Retry & More
Why cloud design patterns exist#
Cloud environments introduce challenges that do not exist in traditional on-premises deployments: transient failures, elastic scaling, multi-region distribution, and pay-per-use economics. Cloud design patterns are proven solutions to these recurring challenges.
These patterns are not theoretical. They are extracted from production systems running at scale across Azure, AWS, and GCP.
1. Ambassador Pattern#
An ambassador is a sidecar service that handles cross-cutting concerns on behalf of your application: retries, circuit breaking, logging, monitoring, and TLS termination.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Your App │────▶│ Ambassador │────▶│ External │
│ Container │ │ Sidecar │ │ Service │
└──────────────┘ └──────────────┘ └──────────────┘
The application talks to localhost. The ambassador handles the complexity of communicating with the outside world.
Real-world implementations:
- Azure: Dapr sidecar, Envoy proxy with AKS
- AWS: App Mesh with Envoy sidecars on ECS/EKS
- GCP: Anthos Service Mesh with Envoy
When to use: when your application communicates with external services and you want to offload retry logic, circuit breaking, mTLS, and observability without modifying application code.
When to avoid: if the added latency of a sidecar hop is unacceptable (sub-millisecond latency requirements), or for simple applications with a single dependency.
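The core idea is that resilience logic lives next to the app, not in it. A minimal single-process sketch of the retry behavior an ambassador provides (illustrative only; in production this runs as a separate sidecar process such as Envoy or Dapr, and the names here are hypothetical):

```python
import time


class Ambassador:
    """Sketch of an ambassador's retry handling. In production this is a
    sidecar process; the app only ever talks to it over localhost."""

    def __init__(self, call_upstream, max_retries=3, base_delay=0.01):
        self.call_upstream = call_upstream  # the real external service call
        self.max_retries = max_retries
        self.base_delay = base_delay

    def request(self, path):
        for attempt in range(self.max_retries):
            try:
                return self.call_upstream(path)
            except ConnectionError:
                if attempt == self.max_retries - 1:
                    raise  # out of retries: surface the failure
                time.sleep(self.base_delay * 2 ** attempt)  # exponential backoff


# The application sees a stable local interface despite upstream flakiness.
attempts = {"n": 0}

def flaky_upstream(path):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient network blip")
    return f"200 OK for {path}"

ambassador = Ambassador(flaky_upstream)
print(ambassador.request("/orders/42"))  # succeeds on the third attempt
```

The application code contains no retry logic at all, which is the point: swap the ambassador's policy (backoff, circuit breaking, mTLS) without redeploying the app.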
2. Anti-Corruption Layer (ACL)#
The ACL sits between your modern system and a legacy or external system. It translates requests and responses so your domain model stays clean.
┌──────────┐ ┌─────────────────┐ ┌──────────────┐
│ Modern │ │ Anti-Corruption │ │ Legacy │
│ Service │────▶│ Layer │────▶│ System │
│ │◀────│ (translates) │◀────│ (SOAP/XML) │
└──────────┘ └─────────────────┘ └──────────────┘
Example: your new microservice needs data from a 15-year-old SOAP service. The ACL translates SOAP/XML to REST/JSON, maps legacy field names to your domain language, and handles the legacy system's quirks.
from datetime import datetime
from decimal import Decimal

class LegacyOrderACL:
    """Translates the legacy SOAP order model to our domain."""

    def translate(self, soap_response: dict) -> Order:
        # Order and OrderStatus are our domain types, defined elsewhere
        legacy = soap_response["OrderRecord"]
        return Order(
            id=str(legacy["ORD_NUM"]),
            customer_id=str(legacy["CUST_ID"]),
            total=Decimal(legacy["TOT_AMT"]) / 100,  # legacy stores cents
            status=self._map_status(legacy["STAT_CD"]),
            created_at=datetime.strptime(legacy["CRT_DT"], "%Y%m%d"),
        )

    def _map_status(self, code: str) -> OrderStatus:
        mapping = {
            "A": OrderStatus.ACTIVE,
            "C": OrderStatus.COMPLETED,
            "X": OrderStatus.CANCELLED,
        }
        return mapping.get(code, OrderStatus.UNKNOWN)
Cloud implementations:
- Azure: Azure Functions or API Management policies as the translation layer
- AWS: Lambda functions behind API Gateway
- GCP: Cloud Functions or Cloud Run
3. CQRS (Command Query Responsibility Segregation)#
Separate the read model from the write model. Commands (writes) go to one model optimized for validation and consistency. Queries (reads) go to a different model optimized for retrieval speed.
┌─────────────┐
Write──▶│ Command │──▶ Write DB (normalized)
│ Model │
└─────────────┘
│
│ events / sync
▼
┌─────────────┐
Read───▶│ Query │──▶ Read DB (denormalized)
│ Model │
└─────────────┘
Why CQRS works in the cloud:
- Scale reads and writes independently — add read replicas without affecting write throughput
- Use different storage technologies — write to PostgreSQL, read from Elasticsearch or Redis
- Optimize each model for its purpose
Cloud implementations:
- Azure: Cosmos DB (write) + Azure Search (read), connected via Change Feed
- AWS: DynamoDB (write) + OpenSearch (read), connected via DynamoDB Streams
- GCP: Cloud Spanner (write) + BigQuery (read), connected via Dataflow
When to use: when read and write workloads have vastly different performance profiles. A product catalog with 1,000 writes/day but 1,000,000 reads/day is a textbook CQRS candidate.
When to avoid: simple CRUD applications where reads and writes are balanced. CQRS adds complexity.
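The split can be sketched in a single process (purely illustrative; in production the write store, the event channel, and the read store are separate services, and every name here is hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class ProductCatalog:
    """Minimal CQRS sketch: one normalized write store, one
    denormalized read store, kept in sync by a projection."""

    write_store: dict = field(default_factory=dict)  # normalized, keyed by product id
    read_store: dict = field(default_factory=dict)   # denormalized, keyed by category

    # Command side: validate, persist, then emit an event
    def handle_upsert(self, product_id: str, name: str, category: str, price: float):
        if price < 0:
            raise ValueError("price must be non-negative")
        self.write_store[product_id] = {"name": name, "category": category, "price": price}
        self._project({"type": "ProductUpserted", "id": product_id,
                       "name": name, "category": category})

    # Projection: build the read-optimized view (here, products grouped by category)
    def _project(self, event: dict):
        self.read_store.setdefault(event["category"], {})[event["id"]] = event["name"]

    # Query side: reads never touch the write store
    def products_in_category(self, category: str) -> list:
        return sorted(self.read_store.get(category, {}).values())
```

In a real deployment, `_project` would be an asynchronous consumer of a change feed or stream, which is also why CQRS read models are usually eventually consistent.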
4. Event Sourcing#
Instead of storing the current state of an entity, store the sequence of events that led to the current state. The current state is derived by replaying events.
Traditional: Account { balance: 150 }
Event Sourced:
1. AccountOpened { initial_balance: 0 }
2. MoneyDeposited { amount: 200 }
3. MoneyWithdrawn { amount: 50 }
→ Replay: 0 + 200 - 50 = 150
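The replay above is just a fold over the event list. A minimal sketch (the event shapes are illustrative, not a fixed schema):

```python
from decimal import Decimal

# Hypothetical event stream for the account above
events = [
    {"type": "AccountOpened",  "initial_balance": Decimal("0")},
    {"type": "MoneyDeposited", "amount": Decimal("200")},
    {"type": "MoneyWithdrawn", "amount": Decimal("50")},
]

def replay(events) -> Decimal:
    """Derive current state by applying each event in order."""
    balance = Decimal("0")
    for e in events:
        if e["type"] == "AccountOpened":
            balance = e["initial_balance"]
        elif e["type"] == "MoneyDeposited":
            balance += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
    return balance

print(replay(events))  # 150
```

Replaying a prefix of the stream answers temporal queries: the state "as of" any point in time is just a fold over the events up to that point.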
Benefits:
- Complete audit trail — every change is recorded
- Temporal queries — "what was the state at 3pm yesterday?"
- Event replay — rebuild read models, fix bugs retroactively
- Natural fit with CQRS — events drive the read model updates
Cloud implementations:
- Azure: Event Hubs or Cosmos DB with Change Feed as the event store
- AWS: EventBridge + DynamoDB or Kinesis as the event store
- GCP: Pub/Sub + Firestore or Cloud Spanner
Event store schema:
CREATE TABLE events (
    event_id     UUID PRIMARY KEY,
    aggregate_id UUID NOT NULL,
    event_type   VARCHAR(255) NOT NULL,
    event_data   JSONB NOT NULL,
    version      INTEGER NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (aggregate_id, version)
);
When to avoid: when you do not need an audit trail and the complexity of event replay is not justified.
5. Gateway Aggregation#
The API gateway aggregates multiple backend service calls into a single response. The client makes one request; the gateway fans out to multiple services and merges the results.
Client ──▶ API Gateway ──┬──▶ User Service
├──▶ Order Service
└──▶ Recommendation Service
│
Client ◀── Merged Response ◀──┘
Why this matters: mobile clients on slow networks cannot afford 5 sequential API calls. Gateway aggregation reduces round trips from N to 1.
// Gateway aggregation handler
async function getProductPage(productId) {
  const [product, reviews, recommendations, inventory] =
    await Promise.all([
      productService.get(productId),
      reviewService.getForProduct(productId),
      recommendationService.getRelated(productId),
      inventoryService.getStock(productId),
    ]);
  return {
    ...product,
    reviews: reviews.items,
    relatedProducts: recommendations.items,
    inStock: inventory.available > 0,
  };
}
Cloud implementations:
- Azure: API Management with policy-based composition
- AWS: AppSync (GraphQL) or API Gateway with Step Functions
- GCP: Apigee with mashup policies
When to avoid: when aggregation logic becomes complex business logic. The gateway should aggregate, not orchestrate.
6. Retry Pattern#
Cloud services experience transient failures: network blips, throttled requests, temporary unavailability. The retry pattern handles these by automatically retrying failed operations.
Retry strategies:
Fixed interval — retry every N seconds. Simple but can overwhelm a recovering service.
Exponential backoff — double the wait time on each retry: 1s, 2s, 4s, 8s. Gives the service time to recover.
Exponential backoff with jitter — add randomness to avoid thundering herd:
import random
import time

class TransientError(Exception):
    """Stand-in for your SDK's retryable error type (throttling, timeout)."""

def retry_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)
Cloud-native retry:
- Azure: Polly library (.NET), built into Azure SDK
- AWS: SDK built-in retry with exponential backoff, Step Functions retry policies
- GCP: Cloud Tasks automatic retry, SDK built-in retry
Critical rules:
- Only retry transient errors (5xx, timeouts) — never retry 400 Bad Request
- Always set a max retry count — infinite retries cause cascading failures
- Make operations idempotent — a retried POST should not create duplicates
- Combine with circuit breaker — stop retrying when the service is clearly down
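The last rule can be sketched as a minimal circuit breaker (illustrative only; libraries such as Polly provide production-grade implementations with half-open probing and metrics):

```python
import time


class CircuitBreaker:
    """Sketch: after `threshold` consecutive failures, fail fast for
    `cooldown` seconds instead of hammering a service that is down."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

While the circuit is open, callers get an immediate error instead of a timeout, which keeps threads free and gives the downstream service room to recover.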
7. Sharding Pattern#
Distribute data across multiple database instances (shards) based on a partition key. Each shard holds a subset of the data.
Sharding strategies:
Range-based — shard by ID range (IDs 1 to 1M on shard 1, 1M+1 to 2M on shard 2). Simple, but causes hotspots if recent data is accessed more.
Hash-based — hash the partition key and mod by shard count. Even distribution but range queries across shards are expensive.
Geography-based — shard by region. US users on US shard, EU users on EU shard. Optimizes for data locality and compliance.
import hashlib

def get_shard(user_id: str, num_shards: int) -> int:
    """Hash-based sharding: stable mapping from user id to shard index."""
    hash_value = hashlib.md5(user_id.encode()).hexdigest()
    return int(hash_value, 16) % num_shards
Cloud implementations:
- Azure: Cosmos DB (automatic partitioning), Azure SQL Elastic Pools
- AWS: DynamoDB (automatic partitioning by partition key), Aurora with custom sharding
- GCP: Cloud Spanner (automatic sharding with interleaved tables), Bigtable
When to use: when a single database instance cannot handle the data volume, write throughput, or read throughput.
When to avoid: when you can scale vertically or use read replicas. Sharding adds significant operational complexity.
Combining patterns#
These patterns rarely exist in isolation. A typical production architecture combines several:
- CQRS + Event Sourcing — events drive the write side; projections build the read side
- Gateway Aggregation + Retry — the gateway retries individual backend calls
- Ambassador + Retry + Circuit Breaker — the sidecar handles all resilience patterns
- Sharding + CQRS — shard the write model, replicate to denormalized read stores
Pattern selection guide#
| Problem | Pattern |
|---|---|
| Cross-cutting concerns | Ambassador |
| Legacy integration | Anti-corruption Layer |
| Read/write scaling | CQRS |
| Audit trail, temporal queries | Event Sourcing |
| Reduce client round trips | Gateway Aggregation |
| Transient failures | Retry with backoff |
| Data volume beyond one node | Sharding |
Visualize your cloud architecture#
Map these patterns in your architecture — generate an interactive diagram with Codelit showing how Ambassador, CQRS, and Gateway Aggregation fit together.
Key takeaways#
- Ambassador offloads cross-cutting concerns to a sidecar — retries, TLS, observability
- Anti-corruption Layer protects your domain from legacy system models
- CQRS separates reads and writes for independent scaling and optimization
- Event Sourcing stores events, not state — complete audit trail and temporal queries
- Gateway Aggregation reduces N client calls to 1 gateway call
- Retry with jitter handles transient failures without thundering herd
- Sharding distributes data when a single node cannot keep up
- This is article #385 of our ongoing system design series