Distributed Job Scheduling: Cron at Scale with Deduplication & Exactly-Once Guarantees
Single-server cron breaks the moment you add a second machine. Distributed job scheduling solves the hard problems: deduplication, delivery guarantees, priority queuing, and fault tolerance across a cluster.
Why Single-Server Cron Fails#
Server A: crontab → run billing-job at 00:00
Server B: crontab → run billing-job at 00:00
Result → billing-job runs TWICE
Adding servers multiplies executions. You need coordination.
Core Concepts#
Job Types#
| Type | Example | Constraint |
|---|---|---|
| One-time | Send welcome email | Execute once, then discard |
| Recurring | Daily report | Execute on schedule, indefinitely |
| Delayed | Reminder in 30 min | Execute once after delay |
| Chained | ETL pipeline | Execute steps in sequence |
Job Deduplication#
The first challenge: ensuring a job runs exactly the number of times intended.
Leader election approach:
1. All schedulers compete for a distributed lock
2. Winner becomes leader, schedules jobs
3. If leader dies, new election triggers
4. Only leader writes jobs to the queue
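The lease-based election above can be sketched in a few lines. This is an in-process stand-in for a distributed lock with a TTL lease (in production you would use something like Redis `SET NX PX`, a ZooKeeper ephemeral node, or an etcd lease; the class and node names here are illustrative):

```python
class LeaseLock:
    """In-process stand-in for a distributed lock with a TTL lease."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None   # current leader, if any
        self.expires = 0.0   # lease deadline

    def try_acquire(self, node: str, now: float) -> bool:
        # Lock is free, or the previous leader stopped renewing its lease.
        if self.holder is None or now >= self.expires:
            self.holder, self.expires = node, now + self.ttl
            return True
        return self.holder == node  # current leader renews its own lease

lock = LeaseLock(ttl=5.0)
won_a = lock.try_acquire("scheduler-a", now=0.0)  # wins the election
won_b = lock.try_acquire("scheduler-b", now=1.0)  # loses: lease still held
# scheduler-a crashes and stops renewing; its lease lapses after the TTL.
won_b_later = lock.try_acquire("scheduler-b", now=6.0)  # new leader elected
```

The TTL is what makes step 3 work: a dead leader cannot renew, so its lease expires and another scheduler wins the next attempt.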
Database-backed deduplication:
1. Insert job with unique constraint (job_type + scheduled_time)
2. First insert wins — duplicates rejected
3. Worker claims job via UPDATE ... SET status = 'running' WHERE status = 'pending' (atomic claim)
4. Unclaimed jobs after timeout → re-queued
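Steps 1–3 can be sketched against SQLite (any database with unique constraints and atomic UPDATEs works the same way; the schema and job names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    job_type TEXT, scheduled_time TEXT, status TEXT DEFAULT 'pending',
    UNIQUE (job_type, scheduled_time))""")

def enqueue(job_type, scheduled_time):
    """First insert wins; duplicates hit the unique constraint."""
    try:
        conn.execute("INSERT INTO jobs (job_type, scheduled_time) VALUES (?, ?)",
                     (job_type, scheduled_time))
        return True
    except sqlite3.IntegrityError:
        return False

def claim(job_type, scheduled_time):
    """Atomic claim: only one worker's UPDATE matches status = 'pending'."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'running' "
        "WHERE job_type = ? AND scheduled_time = ? AND status = 'pending'",
        (job_type, scheduled_time))
    return cur.rowcount == 1

won = enqueue("billing-job", "2024-01-01T00:00")      # scheduler A: True
dup = enqueue("billing-job", "2024-01-01T00:00")      # scheduler B: False
claimed = claim("billing-job", "2024-01-01T00:00")    # worker 1: True
lost = claim("billing-job", "2024-01-01T00:00")       # worker 2: False
```

The key property: both the insert and the claim are single atomic statements, so two schedulers (or two workers) racing on the same row cannot both succeed.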
Delivery Guarantees#
At-Least-Once#
The job will run one or more times. If a worker crashes mid-execution, the job is retried.
Queue → Worker picks job → Worker crashes
↓
Visibility timeout expires
↓
Job reappears in queue → Another worker picks it
Trade-off: You must make jobs idempotent. A payment job that runs twice must not charge twice.
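The visibility-timeout mechanics above can be modeled with a toy in-memory queue (real systems like SQS implement this server-side; the class here is a hypothetical sketch):

```python
class VisibilityQueue:
    """Toy at-least-once queue: a received job stays invisible for `timeout`
    seconds; if the worker never acks (e.g. it crashed), the job reappears."""
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.ready = []      # jobs available to receive
        self.inflight = {}   # job -> visibility deadline

    def send(self, job):
        self.ready.append(job)

    def receive(self, now: float):
        # Re-queue jobs whose visibility timeout expired (worker presumed dead).
        for job, deadline in list(self.inflight.items()):
            if now >= deadline:
                del self.inflight[job]
                self.ready.append(job)
        if not self.ready:
            return None
        job = self.ready.pop(0)
        self.inflight[job] = now + self.timeout
        return job

    def ack(self, job):
        self.inflight.pop(job, None)  # done: delete for good

q = VisibilityQueue(timeout=30.0)
q.send("billing-job")
j1 = q.receive(now=0.0)   # worker picks it up... then crashes (no ack)
j2 = q.receive(now=10.0)  # still invisible: None
j3 = q.receive(now=31.0)  # timeout expired: redelivered
```

Note that the crashed worker may have partially executed the job before the redelivery, which is exactly why idempotency is mandatory.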
At-Most-Once#
The job will run at most once. If delivery fails, the job is lost.
Queue → Delete job → Send to worker → Worker crashes
Result: job is gone, never retried
Trade-off: Simple, but you lose jobs on failure.
Exactly-Once (Effectively)#
True exactly-once is impossible in distributed systems. Effectively exactly-once combines at-least-once delivery with idempotent processing:
1. Worker receives job (at-least-once)
2. Worker checks idempotency key in database
3. If key exists → skip (already processed)
4. If key missing → process + write key in same transaction
5. Ack job to queue
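Steps 2–4 hinge on writing the idempotency key and the side effect in the same transaction. A minimal SQLite sketch (table and key names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE charges (idempotency_key TEXT, amount REAL)")

def handle(job):
    """Process the job and record its idempotency key in one transaction,
    so a redelivered job (at-least-once) is detected and skipped."""
    key = job["idempotency_key"]
    try:
        with conn:  # one transaction: side effect + key commit or roll back together
            conn.execute("INSERT INTO processed VALUES (?)", (key,))
            conn.execute("INSERT INTO charges VALUES (?, ?)", (key, job["amount"]))
        return "processed"
    except sqlite3.IntegrityError:
        return "skipped"  # key already exists: just ack the redelivery

job = {"idempotency_key": "charge-order-42", "amount": 9.99}
first = handle(job)   # "processed"
second = handle(job)  # "skipped" -- the charge is recorded exactly once
```

If the worker crashes between the commit and the ack, the job is redelivered, the key is found, and the duplicate is skipped, which is the "effectively" in effectively exactly-once.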
Job Priorities and Fairness#
Not all jobs are equal. A user-facing notification matters more than a nightly analytics roll-up.
Priority queue approach:
High Priority Queue → [send-otp, payment-confirm]
Medium Priority Queue → [order-update, sync-inventory]
Low Priority Queue → [analytics-rollup, cleanup-logs]
Workers poll high first, then medium, then low.
Weighted fair queuing:
High: 60% of worker capacity
Medium: 30% of worker capacity
Low: 10% of worker capacity
This prevents low-priority starvation while keeping high-priority latency minimal.
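Weighted fair dequeuing can be sketched as a probabilistic pull: pick a tier with probability proportional to its weight, then fall back through the tiers if the chosen one is empty (queue contents and weights mirror the example above; the function name is illustrative):

```python
import random

queues = {
    "high":   ["send-otp", "payment-confirm"],
    "medium": ["order-update", "sync-inventory"],
    "low":    ["analytics-rollup", "cleanup-logs"],
}
weights = {"high": 0.6, "medium": 0.3, "low": 0.1}  # shares of worker capacity

def next_job(rng=random):
    """Pick a tier with probability proportional to its weight, falling back
    in priority order if the chosen tier is empty. Low priority gets ~10%
    of pulls instead of starving behind high."""
    tiers = list(weights)  # ["high", "medium", "low"] in priority order
    chosen = rng.choices(tiers, weights=[weights[t] for t in tiers])[0]
    for tier in [chosen] + tiers:
        if queues[tier]:
            return queues[tier].pop(0)
    return None  # all queues empty

drained = [next_job() for _ in range(6)]
```

Contrast this with strict priority polling, which gives the lowest latency to high-priority jobs but can starve the low tier indefinitely under sustained high-priority load.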
Work Stealing#
In a cluster, some workers finish faster than others. Work stealing rebalances load dynamically:
Worker A: [job1, job2, job3, job4, job5] ← overloaded
Worker B: [job6] ← idle
Worker C: [] ← idle
Worker C steals job5 from Worker A's queue
Worker B steals job4 from Worker A's queue
Result: balanced execution across all workers
Implementation pattern:
1. Each worker has a local deque (double-ended queue)
2. Worker pushes/pops from the front (LIFO — cache friendly)
3. Thieves steal from the back (FIFO — coarse-grained tasks)
4. Lock-free compare-and-swap for theft attempts
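The deque discipline in steps 1–3 looks like this single-threaded sketch. A real runtime uses a lock-free deque (e.g. Chase-Lev) with compare-and-swap for steal attempts, which this toy version omits:

```python
from collections import deque

class Worker:
    """Per-worker deque: the owner works one end, thieves take the other,
    so they rarely contend over the same task."""
    def __init__(self, name, jobs=()):
        self.name = name
        self.jobs = deque(jobs)

    def pop_local(self):
        # Owner pops the front (LIFO): newest, cache-warm tasks first.
        return self.jobs.popleft() if self.jobs else None

    def steal_from(self, victim):
        # Thief takes from the back (FIFO): oldest, typically coarsest tasks.
        return victim.jobs.pop() if victim.jobs else None

a = Worker("A", ["job1", "job2", "job3", "job4", "job5"])  # overloaded
b = Worker("B", ["job6"])
c = Worker("C")                                            # idle
stolen_c = c.steal_from(a)  # "job5"
stolen_b = b.steal_from(a)  # "job4"
local = a.pop_local()       # "job1"
```

Stealing from the opposite end is the point: the owner and the thief only collide when the deque is nearly empty, which keeps synchronization off the hot path.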
Failure Handling#
Dead Letter Queues#
Jobs that fail repeatedly need quarantine:
Job fails → retry 1 → retry 2 → retry 3 → Dead Letter Queue
↓
Alert ops team
Manual inspection
Fix and replay
Exponential Backoff with Jitter#
Retry 1: wait 1s + random(0-500ms)
Retry 2: wait 2s + random(0-500ms)
Retry 3: wait 4s + random(0-500ms)
Retry 4: wait 8s + random(0-500ms)
Retry 5: wait 16s + random(0-500ms) → then DLQ
Jitter prevents the thundering herd when many jobs retry simultaneously.
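The retry schedule above is one function (the base delay, jitter range, and retry cap match the example; the names are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, max_jitter: float = 0.5) -> float:
    """Delay before retry N: base * 2^(N-1) seconds, plus up to 500 ms of
    random jitter so simultaneous failures don't all retry in lockstep."""
    return base * 2 ** (attempt - 1) + random.uniform(0, max_jitter)

MAX_RETRIES = 5  # after the fifth failed retry, route the job to the DLQ

delays = [backoff_delay(n) for n in range(1, MAX_RETRIES + 1)]
```

Without jitter, every job that failed at the same moment (say, during a dependency outage) retries at the same moment, recreating the spike that caused the failure.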
Tools Comparison#
| Tool | Language | Best For | Delivery |
|---|---|---|---|
| Temporal | Go/Java/TS/Python | Complex workflows, long-running | Exactly-once (effective) |
| Airflow | Python | Data pipelines, DAGs | At-least-once |
| Quartz Scheduler | Java/JVM | Enterprise Java apps | At-least-once |
| Hangfire | C#/.NET | .NET background jobs | At-least-once |
Temporal#
Durable execution engine. Workflows survive process crashes — the framework replays history to rebuild state:
Temporal Server ← persists workflow state
↓
Worker polls for tasks → executes activity
↓
Worker crashes → new worker replays from history
↓
Workflow continues from exact point of failure
Airflow#
DAG-based scheduler for data pipelines:
DAG: extract (S3 pull) → transform (Spark job) → load (write to warehouse)
Scheduler triggers DAG on cron
Each task retries independently
Backfill: re-run DAG for past dates
Quartz Scheduler#
Mature JVM scheduler with clustering support:
Quartz Node 1 ←→ Shared Database ←→ Quartz Node 2
↓ ↓
Row-level locks prevent duplicate execution
Misfire policies handle late triggers
Hangfire#
.NET background job framework with dashboard:
Application → Hangfire Client → Job Storage (SQL/Redis)
↓
Hangfire Server → Execute job
↓
Dashboard → monitor/retry
Architecture Decision Checklist#
- Idempotency first — assume every job will run at least twice
- Choose your guarantee — at-least-once is the practical default
- Prioritize fairly — weighted queues prevent starvation
- Work stealing for heterogeneous clusters
- Dead letter queues for poisoned jobs
- Exponential backoff + jitter to avoid thundering herds
- Observability — trace every job from enqueue to completion