Distributed Job Scheduling: Cron at Scale with Deduplication & Exactly-Once Guarantees
Single-server cron breaks the moment you add a second machine. Distributed job scheduling solves the hard problems: deduplication, delivery guarantees, priority queuing, and fault tolerance across a cluster.
Why Single-Server Cron Fails#
Server A: crontab → run billing-job at 00:00
Server B: crontab → run billing-job at 00:00
Result → billing-job runs TWICE
Adding servers multiplies executions. You need coordination.
Core Concepts#
Job Types#
| Type | Example | Constraint |
|---|---|---|
| One-time | Send welcome email | Execute once, then discard |
| Recurring | Daily report | Execute on schedule, indefinitely |
| Delayed | Reminder in 30 min | Execute once after delay |
| Chained | ETL pipeline | Execute steps in sequence |
Job Deduplication#
The first challenge: ensuring a job runs exactly the number of times intended.
Leader election approach:
1. All schedulers compete for a distributed lock
2. Winner becomes leader, schedules jobs
3. If leader dies, new election triggers
4. Only leader writes jobs to the queue
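The lease-based election above can be sketched in a few lines. This is an in-process stand-in for a distributed lock with a TTL lease (in production you would use something like Redis `SET NX PX`, a ZooKeeper ephemeral node, or an etcd lease; the class and node names here are illustrative):

```python
class LeaseLock:
    """In-process stand-in for a distributed lock with a TTL lease."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None   # current leader, if any
        self.expires = 0.0   # lease deadline

    def try_acquire(self, node: str, now: float) -> bool:
        # Lock is free, or the previous leader stopped renewing its lease.
        if self.holder is None or now >= self.expires:
            self.holder, self.expires = node, now + self.ttl
            return True
        return self.holder == node  # current leader renews its own lease

lock = LeaseLock(ttl=5.0)
won_a = lock.try_acquire("scheduler-a", now=0.0)  # wins the election
won_b = lock.try_acquire("scheduler-b", now=1.0)  # loses: lease still held
# scheduler-a crashes and stops renewing; its lease lapses after the TTL.
won_b_later = lock.try_acquire("scheduler-b", now=6.0)  # new leader elected
```

The TTL is what makes step 3 work: a dead leader cannot renew, so its lease expires and another scheduler wins the next attempt.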
Database-backed deduplication:
1. Insert job with unique constraint (job_type + scheduled_time)
2. First insert wins — duplicates rejected
3. Worker claims job via UPDATE ... SET status = 'running' WHERE status = 'pending' (atomic claim)
4. Unclaimed jobs after timeout → re-queued
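Steps 1–3 can be sketched against SQLite (any database with unique constraints and atomic UPDATEs works the same way; the schema and job names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    job_type TEXT, scheduled_time TEXT, status TEXT DEFAULT 'pending',
    UNIQUE (job_type, scheduled_time))""")

def enqueue(job_type, scheduled_time):
    """First insert wins; duplicates hit the unique constraint."""
    try:
        conn.execute("INSERT INTO jobs (job_type, scheduled_time) VALUES (?, ?)",
                     (job_type, scheduled_time))
        return True
    except sqlite3.IntegrityError:
        return False

def claim(job_type, scheduled_time):
    """Atomic claim: only one worker's UPDATE matches status = 'pending'."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'running' "
        "WHERE job_type = ? AND scheduled_time = ? AND status = 'pending'",
        (job_type, scheduled_time))
    return cur.rowcount == 1

won = enqueue("billing-job", "2024-01-01T00:00")      # scheduler A: True
dup = enqueue("billing-job", "2024-01-01T00:00")      # scheduler B: False
claimed = claim("billing-job", "2024-01-01T00:00")    # worker 1: True
lost = claim("billing-job", "2024-01-01T00:00")       # worker 2: False
```

The key property: both the insert and the claim are single atomic statements, so two schedulers (or two workers) racing on the same row cannot both succeed.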
Delivery Guarantees#
At-Least-Once#
The job will run one or more times. If a worker crashes mid-execution, the job is retried.
Queue → Worker picks job → Worker crashes
↓
Visibility timeout expires
↓
Job reappears in queue → Another worker picks it
Trade-off: You must make jobs idempotent. A payment job that runs twice must not charge twice.
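The visibility-timeout mechanics above can be modeled with a toy in-memory queue (real systems like SQS implement this server-side; the class here is a hypothetical sketch):

```python
class VisibilityQueue:
    """Toy at-least-once queue: a received job stays invisible for `timeout`
    seconds; if the worker never acks (e.g. it crashed), the job reappears."""
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.ready = []      # jobs available to receive
        self.inflight = {}   # job -> visibility deadline

    def send(self, job):
        self.ready.append(job)

    def receive(self, now: float):
        # Re-queue jobs whose visibility timeout expired (worker presumed dead).
        for job, deadline in list(self.inflight.items()):
            if now >= deadline:
                del self.inflight[job]
                self.ready.append(job)
        if not self.ready:
            return None
        job = self.ready.pop(0)
        self.inflight[job] = now + self.timeout
        return job

    def ack(self, job):
        self.inflight.pop(job, None)  # done: delete for good

q = VisibilityQueue(timeout=30.0)
q.send("billing-job")
j1 = q.receive(now=0.0)   # worker picks it up... then crashes (no ack)
j2 = q.receive(now=10.0)  # still invisible: None
j3 = q.receive(now=31.0)  # timeout expired: redelivered
```

Note that the crashed worker may have partially executed the job before the redelivery, which is exactly why idempotency is mandatory.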
At-Most-Once#
The job will run at most once. If delivery fails, the job is lost.
Queue → Delete job → Send to worker → Worker crashes
Result: job is gone, never retried
Trade-off: Simple, but you lose jobs on failure.
Exactly-Once (Effectively)#
True exactly-once is impossible in distributed systems. Effectively exactly-once combines at-least-once delivery with idempotent processing:
1. Worker receives job (at-least-once)
2. Worker checks idempotency key in database
3. If key exists → skip (already processed)
4. If key missing → process + write key in same transaction
5. Ack job to queue
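Steps 2–4 hinge on writing the idempotency key and the side effect in the same transaction. A minimal SQLite sketch (table and key names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE charges (idempotency_key TEXT, amount REAL)")

def handle(job):
    """Process the job and record its idempotency key in one transaction,
    so a redelivered job (at-least-once) is detected and skipped."""
    key = job["idempotency_key"]
    try:
        with conn:  # one transaction: side effect + key commit or roll back together
            conn.execute("INSERT INTO processed VALUES (?)", (key,))
            conn.execute("INSERT INTO charges VALUES (?, ?)", (key, job["amount"]))
        return "processed"
    except sqlite3.IntegrityError:
        return "skipped"  # key already exists: just ack the redelivery

job = {"idempotency_key": "charge-order-42", "amount": 9.99}
first = handle(job)   # "processed"
second = handle(job)  # "skipped" -- the charge is recorded exactly once
```

If the worker crashes between the commit and the ack, the job is redelivered, the key is found, and the duplicate is skipped, which is the "effectively" in effectively exactly-once.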
Job Priorities and Fairness#
Not all jobs are equal. A user-facing notification matters more than a nightly analytics roll-up.
Priority queue approach:
High Priority Queue → [send-otp, payment-confirm]
Medium Priority Queue → [order-update, sync-inventory]
Low Priority Queue → [analytics-rollup, cleanup-logs]
Workers poll high first, then medium, then low.
Weighted fair queuing:
High: 60% of worker capacity
Medium: 30% of worker capacity
Low: 10% of worker capacity
This prevents low-priority starvation while keeping high-priority latency minimal.
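Weighted fair dequeuing can be sketched as a probabilistic pull: pick a tier with probability proportional to its weight, then fall back through the tiers if the chosen one is empty (queue contents and weights mirror the example above; the function name is illustrative):

```python
import random

queues = {
    "high":   ["send-otp", "payment-confirm"],
    "medium": ["order-update", "sync-inventory"],
    "low":    ["analytics-rollup", "cleanup-logs"],
}
weights = {"high": 0.6, "medium": 0.3, "low": 0.1}  # shares of worker capacity

def next_job(rng=random):
    """Pick a tier with probability proportional to its weight, falling back
    in priority order if the chosen tier is empty. Low priority gets ~10%
    of pulls instead of starving behind high."""
    tiers = list(weights)  # ["high", "medium", "low"] in priority order
    chosen = rng.choices(tiers, weights=[weights[t] for t in tiers])[0]
    for tier in [chosen] + tiers:
        if queues[tier]:
            return queues[tier].pop(0)
    return None  # all queues empty

drained = [next_job() for _ in range(6)]
```

Contrast this with strict priority polling, which gives the lowest latency to high-priority jobs but can starve the low tier indefinitely under sustained high-priority load.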
Work Stealing#
In a cluster, some workers finish faster than others. Work stealing rebalances load dynamically:
Worker A: [job1, job2, job3, job4, job5] ← overloaded
Worker B: [job6] ← idle
Worker C: [] ← idle
Worker C steals job5 from Worker A's queue
Worker B steals job4 from Worker A's queue
Result: balanced execution across all workers
Implementation pattern:
1. Each worker has a local deque (double-ended queue)
2. Worker pushes/pops from the front (LIFO — cache friendly)
3. Thieves steal from the back (FIFO — coarse-grained tasks)
4. Lock-free compare-and-swap for theft attempts
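The deque discipline in steps 1–3 looks like this single-threaded sketch. A real runtime uses a lock-free deque (e.g. Chase-Lev) with compare-and-swap for steal attempts, which this toy version omits:

```python
from collections import deque

class Worker:
    """Per-worker deque: the owner works one end, thieves take the other,
    so they rarely contend over the same task."""
    def __init__(self, name, jobs=()):
        self.name = name
        self.jobs = deque(jobs)

    def pop_local(self):
        # Owner pops the front (LIFO): newest, cache-warm tasks first.
        return self.jobs.popleft() if self.jobs else None

    def steal_from(self, victim):
        # Thief takes from the back (FIFO): oldest, typically coarsest tasks.
        return victim.jobs.pop() if victim.jobs else None

a = Worker("A", ["job1", "job2", "job3", "job4", "job5"])  # overloaded
b = Worker("B", ["job6"])
c = Worker("C")                                            # idle
stolen_c = c.steal_from(a)  # "job5"
stolen_b = b.steal_from(a)  # "job4"
local = a.pop_local()       # "job1"
```

Stealing from the opposite end is the point: the owner and the thief only collide when the deque is nearly empty, which keeps synchronization off the hot path.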
Failure Handling#
Dead Letter Queues#
Jobs that fail repeatedly need quarantine:
Job fails → retry 1 → retry 2 → retry 3 → Dead Letter Queue
↓
Alert ops team
Manual inspection
Fix and replay
Exponential Backoff with Jitter#
Retry 1: wait 1s + random(0-500ms)
Retry 2: wait 2s + random(0-500ms)
Retry 3: wait 4s + random(0-500ms)
Retry 4: wait 8s + random(0-500ms)
Retry 5: wait 16s + random(0-500ms) → then DLQ
Jitter prevents the thundering herd when many jobs retry simultaneously.
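The retry schedule above is one function (the base delay, jitter range, and retry cap match the example; the names are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, max_jitter: float = 0.5) -> float:
    """Delay before retry N: base * 2^(N-1) seconds, plus up to 500 ms of
    random jitter so simultaneous failures don't all retry in lockstep."""
    return base * 2 ** (attempt - 1) + random.uniform(0, max_jitter)

MAX_RETRIES = 5  # after the fifth failed retry, route the job to the DLQ

delays = [backoff_delay(n) for n in range(1, MAX_RETRIES + 1)]
```

Without jitter, every job that failed at the same moment (say, during a dependency outage) retries at the same moment, recreating the spike that caused the failure.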
Tools Comparison#
| Tool | Language | Best For | Delivery |
|---|---|---|---|
| Temporal | Go/Java/TS/Python | Complex workflows, long-running | Exactly-once (effective) |
| Airflow | Python | Data pipelines, DAGs | At-least-once |
| Quartz Scheduler | Java/JVM | Enterprise Java apps | At-least-once |
| Hangfire | C#/.NET | .NET background jobs | At-least-once |
Temporal#
Durable execution engine. Workflows survive process crashes — the framework replays history to rebuild state:
Temporal Server ← persists workflow state
↓
Worker polls for tasks → executes activity
↓
Worker crashes → new worker replays from history
↓
Workflow continues from exact point of failure
Airflow#
DAG-based scheduler for data pipelines:
DAG: extract (S3 pull) → transform (Spark job) → load (write to warehouse)
Scheduler triggers DAG on cron
Each task retries independently
Backfill: re-run DAG for past dates
Quartz Scheduler#
Mature JVM scheduler with clustering support:
Quartz Node 1 ←→ Shared Database ←→ Quartz Node 2
↓ ↓
Row-level locks prevent duplicate execution
Misfire policies handle late triggers
Hangfire#
.NET background job framework with dashboard:
Application → Hangfire Client → Job Storage (SQL/Redis)
↓
Hangfire Server → Execute job
↓
Dashboard → monitor/retry
Architecture Decision Checklist#
- Idempotency first — assume every job will run at least twice
- Choose your guarantee — at-least-once is the practical default
- Prioritize fairly — weighted queues prevent starvation
- Work stealing for heterogeneous clusters
- Dead letter queues for poisoned jobs
- Exponential backoff + jitter to avoid thundering herds
- Observability — trace every job from enqueue to completion