Design a Task Queue — From Simple Workers to Distributed Job Processing
Why task queues exist#
Some work takes too long for a request-response cycle:
- Sending emails (2-5 seconds per email API call)
- Processing uploaded images (resize, compress, generate thumbnails)
- Generating reports from millions of rows
- Syncing data with external APIs
A task queue lets you say "do this later" and respond to the user immediately.
The producer-consumer pattern#
Producer (API server) → Queue (Redis/SQS/RabbitMQ) → Consumer (Worker)
- Producer creates a job with a payload and pushes it to the queue
- Queue stores the job until a worker picks it up
- Consumer (worker) pulls the job, executes it, acknowledges completion
The user gets an instant response. The work happens in the background.
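The pattern can be sketched in-process with Python's standard `queue` module. Here the "queue backend" is just `queue.Queue` standing in for Redis/SQS/RabbitMQ, and the producer/consumer are threads instead of separate servers:

```python
import queue
import threading

jobs = queue.Queue()   # stands in for Redis/SQS/RabbitMQ
results = []

def producer():
    # The API server enqueues a job and returns to the user immediately.
    jobs.put({"type": "send_email", "payload": {"to": "user@example.com"}})

def consumer():
    # A worker blocks until a job arrives, executes it, then acks.
    job = jobs.get()
    results.append(f"processed {job['type']}")
    jobs.task_done()   # acknowledge completion

t = threading.Thread(target=consumer)
t.start()
producer()
jobs.join()   # wait until the job is acknowledged
t.join()
print(results)   # → ['processed send_email']
```

The key property survives the simplification: the producer returns as soon as the job is enqueued, not when it finishes.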
Core components#
Job definition#
```json
{
  "id": "job_abc123",
  "type": "send_email",
  "payload": {
    "to": "user@example.com",
    "template": "welcome",
    "data": { "name": "Mo" }
  },
  "priority": "high",
  "attempts": 0,
  "max_attempts": 3,
  "created_at": "2026-03-24T10:00:00Z",
  "scheduled_at": null
}
```
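A record like this can be modeled as a small dataclass before serializing it onto the queue. `Job` and `to_json` are illustrative names for this sketch, not from any particular library:

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Job:
    id: str
    type: str
    payload: dict
    priority: str = "normal"
    attempts: int = 0
    max_attempts: int = 3
    created_at: str = ""
    scheduled_at: Optional[str] = None

    def to_json(self) -> str:
        # The serialized form is what gets pushed onto the queue.
        return json.dumps(asdict(self))

job = Job(id="job_abc123", type="send_email",
          payload={"to": "user@example.com", "template": "welcome"},
          priority="high", created_at="2026-03-24T10:00:00Z")
print(job.to_json())
```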
Queue backends#
| Backend | Best for | Durability |
|---|---|---|
| Redis (Bull/BullMQ) | Simple, fast, Node.js | Configurable (RDB/AOF) |
| RabbitMQ | Complex routing, multiple consumers | Durable queues |
| SQS | AWS native, zero ops | Fully managed |
| Kafka | Event streaming, replay | Persistent log |
| Celery | Python ecosystem | Depends on broker |
Workers#
Workers are processes that:
- Connect to the queue
- Pull the next job (or receive via push)
- Execute the job logic
- Acknowledge success or report failure
Scale workers independently: 1 worker for low-traffic, 100 for high-traffic.
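A worker's lifecycle is a loop: pull, execute, ack or report failure. A minimal sketch, where the handler registry and the `None` shutdown sentinel are assumptions of this example:

```python
import queue

def run_worker(job_queue, handlers, results):
    """Pull jobs and dispatch by type until a None sentinel arrives."""
    while True:
        job = job_queue.get()
        if job is None:   # shutdown signal
            break
        handler = handlers[job["type"]]
        try:
            handler(job["payload"])
            results.append(("ok", job["id"]))       # acknowledge success
        except Exception:
            results.append(("failed", job["id"]))   # report failure for retry

# Usage: one handler per job type; the second handler simulates a crash.
q = queue.Queue()
results = []
handlers = {"send_email": lambda p: None,
            "resize_image": lambda p: 1 / 0}
q.put({"id": "1", "type": "send_email", "payload": {}})
q.put({"id": "2", "type": "resize_image", "payload": {}})
q.put(None)
run_worker(q, handlers, results)
print(results)   # → [('ok', '1'), ('failed', '2')]
```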
Reliability patterns#
At-least-once delivery#
The queue delivers every job at least once. If the worker crashes mid-processing, the job is re-delivered. Your job logic must be idempotent — processing the same job twice produces the same result.
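One common way to get idempotency is to record processed job IDs and skip duplicates. Here the dedup store is an in-memory set for illustration; in production that check would live in Redis or your database, inside a transaction with the side effect:

```python
processed = set()
emails_sent = []

def send_welcome_email(job):
    # Re-delivery of the same job ID becomes a no-op.
    if job["id"] in processed:
        return
    emails_sent.append(job["payload"]["to"])   # the side effect
    processed.add(job["id"])

job = {"id": "job_abc123", "payload": {"to": "user@example.com"}}
send_welcome_email(job)
send_welcome_email(job)   # simulated duplicate delivery
print(emails_sent)        # → ['user@example.com']  (sent exactly once)
```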
Retry with exponential backoff#
Attempt 1: immediate
Attempt 2: wait 10s
Attempt 3: wait 30s
Attempt 4: wait 2min
Attempt 5: wait 10min; if it still fails → Dead Letter Queue
Add jitter (random delay) to prevent thundering herd when many retries fire simultaneously.
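The schedule above isn't a pure exponential; a sketch with base 10s, factor 3, a 10-minute cap, and full jitter (parameters are illustrative) looks like this:

```python
import random

def backoff_delay(attempt, base=10.0, factor=3.0, cap=600.0):
    """Seconds to wait before retry `attempt` (1 = first retry).

    Exponential growth with a cap, plus full jitter so many failing
    jobs don't all retry at the same instant (thundering herd).
    """
    delay = min(cap, base * factor ** (attempt - 1))
    return random.uniform(0, delay)   # full jitter: anywhere in [0, delay]

for attempt in range(1, 5):
    cap_for_attempt = min(600, 10 * 3 ** (attempt - 1))
    print(f"retry {attempt}: up to {cap_for_attempt}s")
```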
Dead Letter Queue (DLQ)#
Jobs that fail all retry attempts go to a separate queue for manual inspection:
- View the payload and error
- Fix the bug
- Re-queue the job
Never silently drop failed jobs.
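The retry-or-park decision can be sketched in a few lines, assuming the job carries the `attempts` and `max_attempts` fields from the job definition above (lists stand in for the real queues):

```python
def handle_failure(job, main_queue, dead_letters, error):
    """Retry a failed job, or park it in the DLQ once retries are exhausted."""
    job["attempts"] += 1
    if job["attempts"] >= job["max_attempts"]:
        job["last_error"] = error    # keep context for manual inspection
        dead_letters.append(job)     # never silently drop it
    else:
        main_queue.append(job)       # re-queue for another attempt

main, dlq = [], []
job = {"id": "job_1", "attempts": 2, "max_attempts": 3}
handle_failure(job, main, dlq, "SMTP timeout")
print(len(main), len(dlq))   # → 0 1
```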
Visibility timeout#
When a worker picks a job, the queue hides it from other workers for N seconds. If the worker doesn't acknowledge in time, the job becomes visible again for another worker to pick up.
Set the timeout to roughly 2x your worst-case (e.g., P99) processing time: too short and slow-but-healthy jobs get processed twice, too long and crashed jobs sit invisible.
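The mechanism can be sketched as a reservation with a deadline, plus a reaper that re-exposes expired reservations. This is an in-memory sketch; SQS implements the same idea natively as `VisibilityTimeout`:

```python
import time

class VisibleQueue:
    """Jobs are hidden while reserved; a reaper re-exposes timed-out ones."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.ready = []        # visible jobs
        self.in_flight = {}    # job_id -> (job, deadline)

    def reserve(self, now=None):
        now = time.time() if now is None else now
        job = self.ready.pop(0)
        self.in_flight[job["id"]] = (job, now + self.timeout)
        return job

    def ack(self, job_id):
        del self.in_flight[job_id]

    def reap(self, now=None):
        # Re-expose jobs whose worker never acknowledged in time.
        now = time.time() if now is None else now
        for job_id, (job, deadline) in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[job_id]
                self.ready.append(job)

q = VisibleQueue(timeout=30)
q.ready.append({"id": "j1"})
q.reserve(now=0)     # worker takes the job, then crashes (no ack)
q.reap(now=31)       # past the deadline: job becomes visible again
print(len(q.ready))  # → 1
```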
Job scheduling#
Beyond "do it now" — schedule jobs for later:
- Delayed jobs — "Send this email in 30 minutes"
- Cron jobs — "Generate the daily report at 6 AM"
- Rate-limited jobs — "Process at most 10 API calls per second"
Implementation#
Store scheduled jobs in a sorted set (Redis ZSET) keyed by execution time. A scheduler process polls for jobs whose scheduled time has passed and moves them to the active queue.
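An in-memory sketch of that scheduler, using `heapq` as the sorted set (with Redis this would be `ZADD` to schedule and `ZRANGEBYSCORE` plus `ZREM` to poll):

```python
import heapq

scheduled = []   # min-heap of (run_at, job_id, job), ordered by run_at
active = []      # the queue workers actually consume

def schedule(job, run_at):
    # job_id breaks ties so the heap never compares dicts.
    heapq.heappush(scheduled, (run_at, job["id"], job))

def poll_due(now):
    # Move every job whose scheduled time has passed to the active queue.
    while scheduled and scheduled[0][0] <= now:
        _, _, job = heapq.heappop(scheduled)
        active.append(job)

schedule({"id": "report"}, run_at=100)
schedule({"id": "email"}, run_at=50)
poll_due(now=60)
print([j["id"] for j in active])   # → ['email']
```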
Scaling strategies#
Horizontal scaling: Add more workers. Each worker pulls from the same queue. Works out of the box.
Concurrency per worker: Each worker process can run multiple jobs concurrently (threads or async). Tune based on job type — CPU-bound (1 per core) vs. I/O-bound (many per process).
Priority queues: Separate queues for high/medium/low priority. Workers process high-priority queue first.
Dedicated queues: Separate queues per job type. Email queue, image queue, report queue. Prevents slow jobs from blocking fast ones.
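A worker honoring priority queues can simply poll them in fixed order (queue names here are illustrative):

```python
def next_job(queues):
    """Take from the highest-priority non-empty queue, else None."""
    for name in ("high", "medium", "low"):
        if queues[name]:
            return queues[name].pop(0)
    return None

queues = {"high": [], "medium": [{"id": "m1"}], "low": [{"id": "l1"}]}
print(next_job(queues)["id"])   # → m1 (medium beats low; high is empty)
```

Note the trade-off: strict priority order can starve the low-priority queue under sustained high-priority load, which is one reason dedicated queues with dedicated workers are often simpler.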
Monitoring#
Track these metrics:
- Queue depth — Jobs waiting. If growing, add workers.
- Processing time — P50, P95, P99 per job type.
- Failure rate — Percentage of jobs that fail.
- DLQ size — Failed jobs awaiting manual intervention.
- Worker utilization — Busy vs. idle time.
Alert when queue depth exceeds a threshold or DLQ grows.
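A minimal alert check over these metrics might look like this (thresholds and metric names are illustrative):

```python
def check_alerts(metrics, max_depth=1000, max_dlq=10):
    """Return a list of human-readable alerts for out-of-range metrics."""
    alerts = []
    if metrics["queue_depth"] > max_depth:
        alerts.append("queue depth growing: add workers")
    if metrics["dlq_size"] > max_dlq:
        alerts.append("DLQ growing: inspect failed jobs")
    return alerts

print(check_alerts({"queue_depth": 5000, "dlq_size": 3}))
# → ['queue depth growing: add workers']
```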
Visualize your queue architecture#
See how producers, queues, workers, and DLQs connect — try Codelit to generate an interactive diagram of your task queue system.
Key takeaways#
- Decouple heavy work from the request-response cycle
- Jobs must be idempotent — at-least-once delivery means duplicates happen
- Retry with backoff + jitter — don't hammer a failing dependency
- Dead letter queues — never silently drop failed jobs
- Monitor queue depth — growing queues mean you need more workers
- Start with managed (SQS, Cloud Tasks) unless you need complex routing