Cloud Design Patterns — Ambassador, CQRS, Event Sourcing, Retry & More
Why cloud design patterns exist#
Cloud environments introduce challenges that do not exist in traditional on-premises deployments: transient failures, elastic scaling, multi-region distribution, and pay-per-use economics. Cloud design patterns are proven solutions to these recurring challenges.
These patterns are not theoretical. They are extracted from production systems running at scale across Azure, AWS, and GCP.
1. Ambassador Pattern#
An ambassador is a sidecar service that handles cross-cutting concerns on behalf of your application: retries, circuit breaking, logging, monitoring, and TLS termination.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Your App │────▶│ Ambassador │────▶│ External │
│ Container │ │ Sidecar │ │ Service │
└──────────────┘ └──────────────┘ └──────────────┘
The application talks to localhost. The ambassador handles the complexity of communicating with the outside world.
Real-world implementations:
- Azure: Dapr sidecar, Envoy proxy with AKS
- AWS: App Mesh with Envoy sidecars on ECS/EKS
- GCP: Anthos Service Mesh with Envoy
When to use: when your application communicates with external services and you want to offload retry logic, circuit breaking, mTLS, and observability without modifying application code.
When to avoid: if the added latency of a sidecar hop is unacceptable (sub-millisecond latency requirements), or for simple applications with a single dependency.
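The core idea is that resilience logic lives next to the app, not in it. A minimal single-process sketch of the retry behavior an ambassador provides (illustrative only; in production this runs as a separate sidecar process such as Envoy or Dapr, and the names here are hypothetical):

```python
import time


class Ambassador:
    """Sketch of an ambassador's retry handling. In production this is a
    sidecar process; the app only ever talks to it over localhost."""

    def __init__(self, call_upstream, max_retries=3, base_delay=0.01):
        self.call_upstream = call_upstream  # the real external service call
        self.max_retries = max_retries
        self.base_delay = base_delay

    def request(self, path):
        for attempt in range(self.max_retries):
            try:
                return self.call_upstream(path)
            except ConnectionError:
                if attempt == self.max_retries - 1:
                    raise  # out of retries: surface the failure
                time.sleep(self.base_delay * 2 ** attempt)  # exponential backoff


# The application sees a stable local interface despite upstream flakiness.
attempts = {"n": 0}

def flaky_upstream(path):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient network blip")
    return f"200 OK for {path}"

ambassador = Ambassador(flaky_upstream)
print(ambassador.request("/orders/42"))  # succeeds on the third attempt
```

The application code contains no retry logic at all, which is the point: swap the ambassador's policy (backoff, circuit breaking, mTLS) without redeploying the app.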
2. Anti-Corruption Layer (ACL)#
The ACL sits between your modern system and a legacy or external system. It translates requests and responses so your domain model stays clean.
┌──────────┐ ┌─────────────────┐ ┌──────────────┐
│ Modern │ │ Anti-Corruption │ │ Legacy │
│ Service │────▶│ Layer │────▶│ System │
│ │◀────│ (translates) │◀────│ (SOAP/XML) │
└──────────┘ └─────────────────┘ └──────────────┘
Example: your new microservice needs data from a 15-year-old SOAP service. The ACL translates SOAP/XML to REST/JSON, maps legacy field names to your domain language, and handles the legacy system's quirks.
from datetime import datetime
from decimal import Decimal

class LegacyOrderACL:
    """Translates the legacy SOAP order model to our domain."""

    def translate(self, soap_response: dict) -> Order:
        # Order and OrderStatus are our domain types, defined elsewhere
        legacy = soap_response["OrderRecord"]
        return Order(
            id=str(legacy["ORD_NUM"]),
            customer_id=str(legacy["CUST_ID"]),
            total=Decimal(legacy["TOT_AMT"]) / 100,  # legacy stores cents
            status=self._map_status(legacy["STAT_CD"]),
            created_at=datetime.strptime(legacy["CRT_DT"], "%Y%m%d"),
        )

    def _map_status(self, code: str) -> OrderStatus:
        mapping = {
            "A": OrderStatus.ACTIVE,
            "C": OrderStatus.COMPLETED,
            "X": OrderStatus.CANCELLED,
        }
        return mapping.get(code, OrderStatus.UNKNOWN)
Cloud implementations:
- Azure: Azure Functions or API Management policies as the translation layer
- AWS: Lambda functions behind API Gateway
- GCP: Cloud Functions or Cloud Run
3. CQRS (Command Query Responsibility Segregation)#
Separate the read model from the write model. Commands (writes) go to one model optimized for validation and consistency. Queries (reads) go to a different model optimized for retrieval speed.
┌─────────────┐
Write──▶│ Command │──▶ Write DB (normalized)
│ Model │
└─────────────┘
│
│ events / sync
▼
┌─────────────┐
Read───▶│ Query │──▶ Read DB (denormalized)
│ Model │
└─────────────┘
Why CQRS works in the cloud:
- Scale reads and writes independently — add read replicas without affecting write throughput
- Use different storage technologies — write to PostgreSQL, read from Elasticsearch or Redis
- Optimize each model for its purpose
Cloud implementations:
- Azure: Cosmos DB (write) + Azure Search (read), connected via Change Feed
- AWS: DynamoDB (write) + OpenSearch (read), connected via DynamoDB Streams
- GCP: Cloud Spanner (write) + BigQuery (read), connected via Dataflow
When to use: when read and write workloads have vastly different performance profiles. A product catalog with 1,000 writes/day but 1,000,000 reads/day is a textbook CQRS candidate.
When to avoid: simple CRUD applications where reads and writes are balanced. CQRS adds complexity.
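The split can be sketched in a single process (purely illustrative; in production the write store, the event channel, and the read store are separate services, and every name here is hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class ProductCatalog:
    """Minimal CQRS sketch: one normalized write store, one
    denormalized read store, kept in sync by a projection."""

    write_store: dict = field(default_factory=dict)  # normalized, keyed by product id
    read_store: dict = field(default_factory=dict)   # denormalized, keyed by category

    # Command side: validate, persist, then emit an event
    def handle_upsert(self, product_id: str, name: str, category: str, price: float):
        if price < 0:
            raise ValueError("price must be non-negative")
        self.write_store[product_id] = {"name": name, "category": category, "price": price}
        self._project({"type": "ProductUpserted", "id": product_id,
                       "name": name, "category": category})

    # Projection: build the read-optimized view (here, products grouped by category)
    def _project(self, event: dict):
        self.read_store.setdefault(event["category"], {})[event["id"]] = event["name"]

    # Query side: reads never touch the write store
    def products_in_category(self, category: str) -> list:
        return sorted(self.read_store.get(category, {}).values())
```

In a real deployment, `_project` would be an asynchronous consumer of a change feed or stream, which is also why CQRS read models are usually eventually consistent.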
4. Event Sourcing#
Instead of storing the current state of an entity, store the sequence of events that led to the current state. The current state is derived by replaying events.
Traditional: Account { balance: 150 }
Event Sourced:
1. AccountOpened { initial_balance: 0 }
2. MoneyDeposited { amount: 200 }
3. MoneyWithdrawn { amount: 50 }
→ Replay: 0 + 200 - 50 = 150
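The replay above is just a fold over the event list. A minimal sketch (the event shapes are illustrative, not a fixed schema):

```python
from decimal import Decimal

# Hypothetical event stream for the account above
events = [
    {"type": "AccountOpened",  "initial_balance": Decimal("0")},
    {"type": "MoneyDeposited", "amount": Decimal("200")},
    {"type": "MoneyWithdrawn", "amount": Decimal("50")},
]

def replay(events) -> Decimal:
    """Derive current state by applying each event in order."""
    balance = Decimal("0")
    for e in events:
        if e["type"] == "AccountOpened":
            balance = e["initial_balance"]
        elif e["type"] == "MoneyDeposited":
            balance += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
    return balance

print(replay(events))  # 150
```

Replaying a prefix of the stream answers temporal queries: the state "as of" any point in time is just a fold over the events up to that point.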
Benefits:
- Complete audit trail — every change is recorded
- Temporal queries — "what was the state at 3pm yesterday?"
- Event replay — rebuild read models, fix bugs retroactively
- Natural fit with CQRS — events drive the read model updates
Cloud implementations:
- Azure: Event Hubs or Cosmos DB with Change Feed as the event store
- AWS: EventBridge + DynamoDB or Kinesis as the event store
- GCP: Pub/Sub + Firestore or Cloud Spanner
Event store schema:
CREATE TABLE events (
    event_id     UUID PRIMARY KEY,
    aggregate_id UUID NOT NULL,
    event_type   VARCHAR(255) NOT NULL,
    event_data   JSONB NOT NULL,
    version      INTEGER NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (aggregate_id, version)
);
When to avoid: when you do not need an audit trail and the complexity of event replay is not justified.
5. Gateway Aggregation#
The API gateway aggregates multiple backend service calls into a single response. The client makes one request; the gateway fans out to multiple services and merges the results.
Client ──▶ API Gateway ──┬──▶ User Service
├──▶ Order Service
└──▶ Recommendation Service
│
Client ◀── Merged Response ◀──┘
Why this matters: mobile clients on slow networks cannot afford 5 sequential API calls. Gateway aggregation reduces round trips from N to 1.
// Gateway aggregation handler
async function getProductPage(productId) {
  const [product, reviews, recommendations, inventory] =
    await Promise.all([
      productService.get(productId),
      reviewService.getForProduct(productId),
      recommendationService.getRelated(productId),
      inventoryService.getStock(productId),
    ]);
  return {
    ...product,
    reviews: reviews.items,
    relatedProducts: recommendations.items,
    inStock: inventory.available > 0,
  };
}
Cloud implementations:
- Azure: API Management with policy-based composition
- AWS: AppSync (GraphQL) or API Gateway with Step Functions
- GCP: Apigee with mashup policies
When to avoid: when aggregation logic becomes complex business logic. The gateway should aggregate, not orchestrate.
6. Retry Pattern#
Cloud services experience transient failures: network blips, throttled requests, temporary unavailability. The retry pattern handles these by automatically retrying failed operations.
Retry strategies:
Fixed interval — retry every N seconds. Simple but can overwhelm a recovering service.
Exponential backoff — double the wait time on each retry: 1s, 2s, 4s, 8s. Gives the service time to recover.
Exponential backoff with jitter — add randomness to avoid thundering herd:
import random
import time

class TransientError(Exception):
    """Stand-in for your SDK's retryable error type (throttling, timeout)."""

def retry_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)
Cloud-native retry:
- Azure: Polly library (.NET), built into Azure SDK
- AWS: SDK built-in retry with exponential backoff, Step Functions retry policies
- GCP: Cloud Tasks automatic retry, SDK built-in retry
Critical rules:
- Only retry transient errors (5xx, timeouts) — never retry 400 Bad Request
- Always set a max retry count — infinite retries cause cascading failures
- Make operations idempotent — a retried POST should not create duplicates
- Combine with circuit breaker — stop retrying when the service is clearly down
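The last rule can be sketched as a minimal circuit breaker (illustrative only; libraries such as Polly provide production-grade implementations with half-open probing and metrics):

```python
import time


class CircuitBreaker:
    """Sketch: after `threshold` consecutive failures, fail fast for
    `cooldown` seconds instead of hammering a service that is down."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

While the circuit is open, callers get an immediate error instead of a timeout, which keeps threads free and gives the downstream service room to recover.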
7. Sharding Pattern#
Distribute data across multiple database instances (shards) based on a partition key. Each shard holds a subset of the data.
Sharding strategies:
Range-based — shard by ID range (IDs 1 to 1M on shard 1, 1M+1 to 2M on shard 2). Simple, but causes hotspots if recent data is accessed more.
Hash-based — hash the partition key and mod by shard count. Even distribution but range queries across shards are expensive.
Geography-based — shard by region. US users on US shard, EU users on EU shard. Optimizes for data locality and compliance.
import hashlib

def get_shard(user_id: str, num_shards: int) -> int:
    """Hash-based sharding: stable mapping from user id to shard index."""
    hash_value = hashlib.md5(user_id.encode()).hexdigest()
    return int(hash_value, 16) % num_shards
Cloud implementations:
- Azure: Cosmos DB (automatic partitioning), Azure SQL Elastic Pools
- AWS: DynamoDB (automatic partitioning by partition key), Aurora with custom sharding
- GCP: Cloud Spanner (automatic sharding with interleaved tables), Bigtable
When to use: when a single database instance cannot handle the data volume, write throughput, or read throughput.
When to avoid: when you can scale vertically or use read replicas. Sharding adds significant operational complexity.
Combining patterns#
These patterns rarely exist in isolation. A typical production architecture combines several:
- CQRS + Event Sourcing — events drive the write side; projections build the read side
- Gateway Aggregation + Retry — the gateway retries individual backend calls
- Ambassador + Retry + Circuit Breaker — the sidecar handles all resilience patterns
- Sharding + CQRS — shard the write model, replicate to denormalized read stores
Pattern selection guide#
| Problem | Pattern |
|---|---|
| Cross-cutting concerns | Ambassador |
| Legacy integration | Anti-corruption Layer |
| Read/write scaling | CQRS |
| Audit trail, temporal queries | Event Sourcing |
| Reduce client round trips | Gateway Aggregation |
| Transient failures | Retry with backoff |
| Data volume beyond one node | Sharding |
Visualize your cloud architecture#
Map these patterns in your architecture — generate an interactive diagram with Codelit showing how Ambassador, CQRS, and Gateway Aggregation fit together.
Key takeaways#
- Ambassador offloads cross-cutting concerns to a sidecar — retries, TLS, observability
- Anti-corruption Layer protects your domain from legacy system models
- CQRS separates reads and writes for independent scaling and optimization
- Event Sourcing stores events, not state — complete audit trail and temporal queries
- Gateway Aggregation reduces N client calls to 1 gateway call
- Retry with jitter handles transient failures without thundering herd
- Sharding distributes data when a single node cannot keep up
- This is article #385 of our ongoing system design series