Design a Task Queue — From Simple Workers to Distributed Job Processing
Why task queues exist#
Some work takes too long for a request-response cycle:
- Sending emails (2-5 seconds per email API call)
- Processing uploaded images (resize, compress, generate thumbnails)
- Generating reports from millions of rows
- Syncing data with external APIs
A task queue lets you say "do this later" and respond to the user immediately.
The producer-consumer pattern#
Producer (API server) → Queue (Redis/SQS/RabbitMQ) → Consumer (Worker)
- Producer creates a job with a payload and pushes it to the queue
- Queue stores the job until a worker picks it up
- Consumer (worker) pulls the job, executes it, acknowledges completion
The user gets an instant response. The work happens in the background.
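The pattern can be sketched in-process with Python's standard `queue` module. Here the "queue backend" is just `queue.Queue` standing in for Redis/SQS/RabbitMQ, and the producer/consumer are threads instead of separate servers:

```python
import queue
import threading

jobs = queue.Queue()   # stands in for Redis/SQS/RabbitMQ
results = []

def producer():
    # The API server enqueues a job and returns to the user immediately.
    jobs.put({"type": "send_email", "payload": {"to": "user@example.com"}})

def consumer():
    # A worker blocks until a job arrives, executes it, then acks.
    job = jobs.get()
    results.append(f"processed {job['type']}")
    jobs.task_done()   # acknowledge completion

t = threading.Thread(target=consumer)
t.start()
producer()
jobs.join()   # wait until the job is acknowledged
t.join()
print(results)   # → ['processed send_email']
```

The key property survives the simplification: the producer returns as soon as the job is enqueued, not when it finishes.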
Core components#
Job definition#
```json
{
  "id": "job_abc123",
  "type": "send_email",
  "payload": {
    "to": "user@example.com",
    "template": "welcome",
    "data": { "name": "Mo" }
  },
  "priority": "high",
  "attempts": 0,
  "max_attempts": 3,
  "created_at": "2026-03-24T10:00:00Z",
  "scheduled_at": null
}
```
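A record like this can be modeled as a small dataclass before serializing it onto the queue. `Job` and `to_json` are illustrative names for this sketch, not from any particular library:

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Job:
    id: str
    type: str
    payload: dict
    priority: str = "normal"
    attempts: int = 0
    max_attempts: int = 3
    created_at: str = ""
    scheduled_at: Optional[str] = None

    def to_json(self) -> str:
        # The serialized form is what gets pushed onto the queue.
        return json.dumps(asdict(self))

job = Job(id="job_abc123", type="send_email",
          payload={"to": "user@example.com", "template": "welcome"},
          priority="high", created_at="2026-03-24T10:00:00Z")
print(job.to_json())
```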
Queue backends#
| Backend | Best for | Durability |
|---|---|---|
| Redis (Bull/BullMQ) | Simple, fast, Node.js | Configurable (RDB/AOF) |
| RabbitMQ | Complex routing, multiple consumers | Durable queues |
| SQS | AWS native, zero ops | Fully managed |
| Kafka | Event streaming, replay | Persistent log |
| Celery | Python ecosystem | Depends on broker |
Workers#
Workers are processes that:
- Connect to the queue
- Pull the next job (or receive via push)
- Execute the job logic
- Acknowledge success or report failure
Scale workers independently: 1 worker for low-traffic, 100 for high-traffic.
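A worker's lifecycle is a loop: pull, execute, ack or report failure. A minimal sketch, where the handler registry and the `None` shutdown sentinel are assumptions of this example:

```python
import queue

def run_worker(job_queue, handlers, results):
    """Pull jobs and dispatch by type until a None sentinel arrives."""
    while True:
        job = job_queue.get()
        if job is None:   # shutdown signal
            break
        handler = handlers[job["type"]]
        try:
            handler(job["payload"])
            results.append(("ok", job["id"]))       # acknowledge success
        except Exception:
            results.append(("failed", job["id"]))   # report failure for retry

# Usage: one handler per job type; the second handler simulates a crash.
q = queue.Queue()
results = []
handlers = {"send_email": lambda p: None,
            "resize_image": lambda p: 1 / 0}
q.put({"id": "1", "type": "send_email", "payload": {}})
q.put({"id": "2", "type": "resize_image", "payload": {}})
q.put(None)
run_worker(q, handlers, results)
print(results)   # → [('ok', '1'), ('failed', '2')]
```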
Reliability patterns#
At-least-once delivery#
The queue delivers every job at least once. If the worker crashes mid-processing, the job is re-delivered. Your job logic must be idempotent — processing the same job twice produces the same result.
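One common way to get idempotency is to record processed job IDs and skip duplicates. Here the dedup store is an in-memory set for illustration; in production that check would live in Redis or your database, inside a transaction with the side effect:

```python
processed = set()
emails_sent = []

def send_welcome_email(job):
    # Re-delivery of the same job ID becomes a no-op.
    if job["id"] in processed:
        return
    emails_sent.append(job["payload"]["to"])   # the side effect
    processed.add(job["id"])

job = {"id": "job_abc123", "payload": {"to": "user@example.com"}}
send_welcome_email(job)
send_welcome_email(job)   # simulated duplicate delivery
print(emails_sent)        # → ['user@example.com']  (sent exactly once)
```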
Retry with exponential backoff#
Attempt 1: immediate
Attempt 2: wait 10s
Attempt 3: wait 30s
Attempt 4: wait 2min
Attempt 5: wait 10min; if it still fails → Dead Letter Queue
Add jitter (random delay) to prevent thundering herd when many retries fire simultaneously.
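The schedule above isn't a pure exponential; a sketch with base 10s, factor 3, a 10-minute cap, and full jitter (parameters are illustrative) looks like this:

```python
import random

def backoff_delay(attempt, base=10.0, factor=3.0, cap=600.0):
    """Seconds to wait before retry `attempt` (1 = first retry).

    Exponential growth with a cap, plus full jitter so many failing
    jobs don't all retry at the same instant (thundering herd).
    """
    delay = min(cap, base * factor ** (attempt - 1))
    return random.uniform(0, delay)   # full jitter: anywhere in [0, delay]

for attempt in range(1, 5):
    cap_for_attempt = min(600, 10 * 3 ** (attempt - 1))
    print(f"retry {attempt}: up to {cap_for_attempt}s")
```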
Dead Letter Queue (DLQ)#
Jobs that fail all retry attempts go to a separate queue for manual inspection:
- View the payload and error
- Fix the bug
- Re-queue the job
Never silently drop failed jobs.
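The retry-or-park decision can be sketched in a few lines, assuming the job carries the `attempts` and `max_attempts` fields from the job definition above (lists stand in for the real queues):

```python
def handle_failure(job, main_queue, dead_letters, error):
    """Retry a failed job, or park it in the DLQ once retries are exhausted."""
    job["attempts"] += 1
    if job["attempts"] >= job["max_attempts"]:
        job["last_error"] = error    # keep context for manual inspection
        dead_letters.append(job)     # never silently drop it
    else:
        main_queue.append(job)       # re-queue for another attempt

main, dlq = [], []
job = {"id": "job_1", "attempts": 2, "max_attempts": 3}
handle_failure(job, main, dlq, "SMTP timeout")
print(len(main), len(dlq))   # → 0 1
```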
Visibility timeout#
When a worker picks a job, the queue hides it from other workers for N seconds. If the worker doesn't acknowledge in time, the job becomes visible again for another worker to pick up.
Set the timeout to roughly 2x your worst-case (e.g., P99) processing time: too short and slow-but-healthy jobs get processed twice, too long and crashed jobs sit invisible.
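The mechanism can be sketched as a reservation with a deadline, plus a reaper that re-exposes expired reservations. This is an in-memory sketch; SQS implements the same idea natively as `VisibilityTimeout`:

```python
import time

class VisibleQueue:
    """Jobs are hidden while reserved; a reaper re-exposes timed-out ones."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.ready = []        # visible jobs
        self.in_flight = {}    # job_id -> (job, deadline)

    def reserve(self, now=None):
        now = time.time() if now is None else now
        job = self.ready.pop(0)
        self.in_flight[job["id"]] = (job, now + self.timeout)
        return job

    def ack(self, job_id):
        del self.in_flight[job_id]

    def reap(self, now=None):
        # Re-expose jobs whose worker never acknowledged in time.
        now = time.time() if now is None else now
        for job_id, (job, deadline) in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[job_id]
                self.ready.append(job)

q = VisibleQueue(timeout=30)
q.ready.append({"id": "j1"})
q.reserve(now=0)     # worker takes the job, then crashes (no ack)
q.reap(now=31)       # past the deadline: job becomes visible again
print(len(q.ready))  # → 1
```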
Job scheduling#
Beyond "do it now" — schedule jobs for later:
- Delayed jobs — "Send this email in 30 minutes"
- Cron jobs — "Generate the daily report at 6 AM"
- Rate-limited jobs — "Process at most 10 API calls per second"
Implementation#
Store scheduled jobs in a sorted set (Redis ZSET) keyed by execution time. A scheduler process polls for jobs whose scheduled time has passed and moves them to the active queue.
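An in-memory sketch of that scheduler, using `heapq` as the sorted set (with Redis this would be `ZADD` to schedule and `ZRANGEBYSCORE` plus `ZREM` to poll):

```python
import heapq

scheduled = []   # min-heap of (run_at, job_id, job), ordered by run_at
active = []      # the queue workers actually consume

def schedule(job, run_at):
    # job_id breaks ties so the heap never compares dicts.
    heapq.heappush(scheduled, (run_at, job["id"], job))

def poll_due(now):
    # Move every job whose scheduled time has passed to the active queue.
    while scheduled and scheduled[0][0] <= now:
        _, _, job = heapq.heappop(scheduled)
        active.append(job)

schedule({"id": "report"}, run_at=100)
schedule({"id": "email"}, run_at=50)
poll_due(now=60)
print([j["id"] for j in active])   # → ['email']
```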
Scaling strategies#
Horizontal scaling: Add more workers. Each worker pulls from the same queue. Works out of the box.
Concurrency per worker: Each worker process can run multiple jobs concurrently (threads or async). Tune based on job type — CPU-bound (1 per core) vs. I/O-bound (many per process).
Priority queues: Separate queues for high/medium/low priority. Workers process high-priority queue first.
Dedicated queues: Separate queues per job type. Email queue, image queue, report queue. Prevents slow jobs from blocking fast ones.
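A worker honoring priority queues can simply poll them in fixed order (queue names here are illustrative):

```python
def next_job(queues):
    """Take from the highest-priority non-empty queue, else None."""
    for name in ("high", "medium", "low"):
        if queues[name]:
            return queues[name].pop(0)
    return None

queues = {"high": [], "medium": [{"id": "m1"}], "low": [{"id": "l1"}]}
print(next_job(queues)["id"])   # → m1 (medium beats low; high is empty)
```

Note the trade-off: strict priority order can starve the low-priority queue under sustained high-priority load, which is one reason dedicated queues with dedicated workers are often simpler.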
Monitoring#
Track these metrics:
- Queue depth — Jobs waiting. If growing, add workers.
- Processing time — P50, P95, P99 per job type.
- Failure rate — Percentage of jobs that fail.
- DLQ size — Failed jobs awaiting manual intervention.
- Worker utilization — Busy vs. idle time.
Alert when queue depth exceeds a threshold or DLQ grows.
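A minimal alert check over these metrics might look like this (thresholds and metric names are illustrative):

```python
def check_alerts(metrics, max_depth=1000, max_dlq=10):
    """Return a list of human-readable alerts for out-of-range metrics."""
    alerts = []
    if metrics["queue_depth"] > max_depth:
        alerts.append("queue depth growing: add workers")
    if metrics["dlq_size"] > max_dlq:
        alerts.append("DLQ growing: inspect failed jobs")
    return alerts

print(check_alerts({"queue_depth": 5000, "dlq_size": 3}))
# → ['queue depth growing: add workers']
```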
Visualize your queue architecture#
See how producers, queues, workers, and DLQs connect — try Codelit to generate an interactive diagram of your task queue system.
Key takeaways#
- Decouple heavy work from the request-response cycle
- Jobs must be idempotent — at-least-once delivery means duplicates happen
- Retry with backoff + jitter — don't hammer a failing dependency
- Dead letter queues — never silently drop failed jobs
- Monitor queue depth — growing queues mean you need more workers
- Start with managed (SQS, Cloud Tasks) unless you need complex routing