Kubernetes Jobs and CronJobs — Batch Processing Done Right
Not everything is a long-running service
Kubernetes excels at running Deployments — pods that stay alive, serve traffic, and restart on failure. But many workloads are fundamentally different: data migrations, report generation, batch imports, email campaigns, nightly cleanups.
These tasks run to completion and exit. Kubernetes Jobs and CronJobs are purpose-built for this.
Jobs: run to completion
A Job creates one or more pods and ensures a specified number of them successfully terminate. Unlike a Deployment, a Job does not restart pods after success.
apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: myapp/migrate:v2.3
          command: ["python", "migrate.py", "--target", "v2.3"]
      restartPolicy: Never
  backoffLimit: 4
When this Job runs, Kubernetes creates a pod. If the pod exits with code 0, the Job is complete. If it fails, Kubernetes starts a replacement pod, retrying up to backoffLimit times before marking the Job as failed.
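Since the Job's outcome hinges on the container's exit code, the entrypoint should turn every failure into a non-zero exit. A minimal sketch, assuming a hypothetical run_migration function stands in for the real work inside migrate.py:

```python
import logging
from typing import Callable

log = logging.getLogger("migrate")


def run_migration(target: str) -> None:
    """Hypothetical placeholder for the real migration logic."""
    log.info("migrating to %s", target)


def main(task: Callable[[str], None] = run_migration, target: str = "v2.3") -> int:
    """Run the task and translate the outcome into a process exit code.

    0 tells Kubernetes the pod succeeded; any non-zero code counts as a
    failure and is eligible for a retry under backoffLimit.
    """
    try:
        task(target)
    except Exception:
        log.exception("migration failed")
        return 1
    return 0
```

The script would typically end with `raise SystemExit(main())` so the return value becomes the exit code Kubernetes sees.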
Job types: single, parallel, and indexed
Single completion (default)
One pod runs. It succeeds or fails. This is the simplest form — good for migrations, one-off scripts, and database seeds.
Parallel with fixed completions
Multiple pods run, and the Job succeeds when a target number of completions is reached.
spec:
  completions: 10
  parallelism: 3
  template:
    # ...
This runs 3 pods at a time until 10 pods have succeeded. Each pod processes one work item. Use this for batch processing where each item is independent.
Parallel work queue
Pods pull work from an external queue (Redis, SQS, RabbitMQ). Once any pod exits successfully, Kubernetes starts no new pods, and the Job is complete when the remaining pods have also terminated.
spec:
  parallelism: 5
  template:
    spec:
      containers:
        - name: worker
          image: myapp/worker:latest
          env:
            - name: QUEUE_URL
              value: "redis://redis:6379/0"
      restartPolicy: Never
Each pod dequeues items until the queue is empty, then exits 0. No completions field — Kubernetes doesn't know the total work items.
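The drain-then-exit loop each worker runs can be sketched as follows. This is only an illustration: the in-memory queue.Queue is a stand-in for the external queue, and the drain and handle names are hypothetical.

```python
import queue
from typing import Callable


def drain(work: "queue.Queue[str]", handle: Callable[[str], None]) -> int:
    """Process items until the queue is empty; return the count handled.

    `work` is an in-memory stand-in for the external queue (an assumption
    for this sketch); a real worker would pop from Redis, SQS, or RabbitMQ.
    """
    handled = 0
    while True:
        try:
            item = work.get_nowait()
        except queue.Empty:
            # Queue drained: return so the process can exit with code 0.
            return handled
        handle(item)
        handled += 1
```

An exception raised by handle propagates out of drain, so the process exits non-zero and the pod counts as failed rather than silently dropping the item.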
Indexed Jobs (Kubernetes 1.24+)
Each pod gets a unique index via the JOB_COMPLETION_INDEX environment variable. Useful when you need to partition work deterministically.
spec:
  completions: 8
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      containers:
        - name: processor
          image: myapp/processor:latest
          # Pod receives JOB_COMPLETION_INDEX (0-7)
      restartPolicy: Never
Pod 0 processes shard 0, pod 1 processes shard 1, and so on. No need for an external queue — the index itself is the work assignment.
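Inside the container, turning the index into a work assignment might look like the sketch below. The stride-based split and the function names are assumptions chosen for illustration; any deterministic partitioning works as long as every index maps to a disjoint slice.

```python
import os
from typing import List, Mapping, Sequence


def completion_index(env: Mapping[str, str] = os.environ) -> int:
    """Read the index Kubernetes injects into each pod of an Indexed Job."""
    return int(env["JOB_COMPLETION_INDEX"])


def shard_for(items: Sequence[str], index: int, total: int) -> List[str]:
    """Return the items owned by this index using a simple stride split.

    `total` must match the Job's `completions` value (8 in the manifest above).
    """
    if not 0 <= index < total:
        raise ValueError(f"index {index} outside 0..{total - 1}")
    return list(items[index::total])
```

Because the split depends only on the index and the item list, a retried pod with the same index picks up exactly the same shard.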
Failure handling: backoffLimit and restartPolicy
backoffLimit
Controls how many times Kubernetes retries failed pods before marking the Job as failed. The default is 6.
spec:
  backoffLimit: 3
Once the number of failed attempts exceeds this limit, the Job is marked Failed. Between retries Kubernetes applies exponential backoff: 10s, 20s, 40s, and so on, capped at 6 minutes.
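The retry schedule is easy to tabulate. The helper below only models the doubling-with-cap pattern described here; it is not the Job controller's implementation, and backoff_delays is a hypothetical name.

```python
from typing import List


def backoff_delays(retries: int, base: int = 10, cap: int = 360) -> List[int]:
    """Delay in seconds before each retry: doubles from `base`, capped at `cap`."""
    return [min(base * 2 ** n, cap) for n in range(retries)]


print(backoff_delays(7))  # [10, 20, 40, 80, 160, 320, 360]
```

With the default backoffLimit of 6, the delays alone add up to roughly ten minutes, which is worth keeping in mind when choosing activeDeadlineSeconds.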
restartPolicy
Jobs support two restart policies:
- Never: failed pods are not restarted. Kubernetes creates a new pod for each retry, and the failed pods remain for log inspection.
- OnFailure: the failed container is restarted in place inside the same pod. This saves resources, but typically only the previous attempt's logs remain retrievable (kubectl logs --previous).
For debugging, Never is better. For production cost efficiency, OnFailure is usually preferred.
activeDeadlineSeconds
A hard timeout for the entire Job. If the Job hasn't completed within this window, all pods are terminated and the Job is marked Failed with reason DeadlineExceeded.
spec:
  activeDeadlineSeconds: 3600  # 1 hour max
  backoffLimit: 3
This prevents runaway jobs from consuming cluster resources indefinitely.
CronJobs: scheduled execution
A CronJob creates Jobs on a schedule. It uses standard cron syntax.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"  # 2:00 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: report
              image: myapp/report-generator:latest
              command: ["python", "generate_report.py"]
          restartPolicy: OnFailure
      backoffLimit: 2
Cron syntax quick reference
| Field | Values | Example |
|---|---|---|
| Minute | 0-59 | 30 = minute 30 |
| Hour | 0-23 | 2 = 2 AM |
| Day of month | 1-31 | 15 = 15th |
| Month | 1-12 | */3 = every 3 months |
| Day of week | 0-6 (Sun=0) | 1-5 = weekdays |
Common patterns:
- */15 * * * * = every 15 minutes
- 0 */6 * * * = every 6 hours
- 0 2 * * 1 = Monday at 2 AM
- 0 0 1 * * = first day of each month at midnight
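To sanity-check an expression before deploying it, a small matcher can help. The sketch below handles only the numeric forms used in this article (`*`, `*/n`, `a`, `a-b`, comma lists); it does not cover month or weekday names, or 7 as Sunday. Kubernetes parses the schedule itself, so expand and matches are purely illustrative helpers.

```python
from datetime import datetime
from typing import Set


def expand(field: str, lo: int, hi: int) -> Set[int]:
    """Expand one cron field (`*`, `*/n`, `a`, `a-b`, `a-b/n`, comma lists)."""
    values: Set[int] = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            end = hi if step else start
        values.update(range(start, end + 1, int(step or 1)))
    return values


def matches(expr: str, when: datetime) -> bool:
    """Return True if the five-field cron expression fires at `when`."""
    minute, hour, dom, month, dow = expr.split()
    time_ok = (
        when.minute in expand(minute, 0, 59)
        and when.hour in expand(hour, 0, 23)
        and when.month in expand(month, 1, 12)
    )
    dom_ok = when.day in expand(dom, 1, 31)
    dow_ok = when.isoweekday() % 7 in expand(dow, 0, 6)  # cron: Sunday = 0
    # Classic cron rule: if both day fields are restricted, either may match.
    if not dom.startswith("*") and not dow.startswith("*"):
        return time_ok and (dom_ok or dow_ok)
    return time_ok and dom_ok and dow_ok
```

For example, `matches("0 2 * * 1", datetime(2024, 1, 1, 2, 0))` is true because 1 January 2024 was a Monday.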
concurrencyPolicy: handling overlaps
What happens when a CronJob triggers but the previous run is still active?
spec:
  concurrencyPolicy: Forbid
- Allow (default): multiple Jobs run simultaneously. Risk of resource contention and data races.
- Forbid: skip the new run if the previous one is still active. Safest for tasks that must never overlap.
- Replace: kill the running Job and start a new one. Use when freshness matters more than completion.
For most production workloads, Forbid is the right choice. If a report takes 90 minutes but runs every hour, you want to skip rather than overlap.
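The 90-minute report on an hourly schedule can be replayed with a toy model. This is a simplified simulation of the three policies under stated assumptions (fixed duration, times in minutes), not the CronJob controller's actual logic; simulate is a hypothetical helper.

```python
from typing import List, Tuple


def simulate(policy: str, interval: int, duration: int, ticks: int) -> List[Tuple[int, str]]:
    """Replay `ticks` scheduled triggers and record what each one does.

    Times are in minutes. `policy` is "Allow", "Forbid", or "Replace".
    """
    events: List[Tuple[int, str]] = []
    running_until: List[int] = []  # end times of Jobs still considered active
    for n in range(ticks):
        now = n * interval
        running_until = [end for end in running_until if end > now]
        if running_until and policy == "Forbid":
            events.append((now, "skip"))
            continue
        if running_until and policy == "Replace":
            running_until.clear()  # the active Job is killed
            events.append((now, "replace"))
        elif running_until:
            events.append((now, "overlap"))
        else:
            events.append((now, "start"))
        running_until.append(now + duration)
    return events


print(simulate("Forbid", interval=60, duration=90, ticks=4))
# [(0, 'start'), (60, 'skip'), (120, 'start'), (180, 'skip')]
```

Running the same scenario with "Replace" shows every trigger after the first killing its predecessor, so in this model no run ever finishes. That is a useful reminder that Replace only suits jobs that reliably complete within one interval.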
TTL controller: automatic cleanup
By default, finished Jobs and their pods are not deleted automatically. They clutter kubectl get pods and consume etcd storage.
spec:
  ttlSecondsAfterFinished: 86400  # Clean up after 24 hours
The TTL-after-finished controller deletes the Job and its pods after the specified duration. Set this on every Job.
For CronJobs, also use:
spec:
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
This keeps the last 3 successful and 5 failed Jobs for debugging while cleaning up older ones.
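The effect of the two limits can be modeled as keeping the newest N Jobs in each group. The sketch below is an illustration of that rule only; FinishedJob and prune are hypothetical names, and the real pruning is done by the CronJob controller.

```python
from typing import List, NamedTuple


class FinishedJob(NamedTuple):
    name: str
    finished_at: float  # seconds since epoch
    succeeded: bool


def prune(jobs: List[FinishedJob], keep_ok: int = 3, keep_failed: int = 5) -> List[str]:
    """Return names of Jobs that fall outside the history limits, oldest first."""
    doomed: List[FinishedJob] = []
    for succeeded, keep in ((True, keep_ok), (False, keep_failed)):
        group = sorted(
            (j for j in jobs if j.succeeded == succeeded),
            key=lambda j: j.finished_at,
            reverse=True,  # newest first, so the slice drops the oldest
        )
        doomed.extend(group[keep:])
    return [j.name for j in sorted(doomed, key=lambda j: j.finished_at)]
```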
Monitoring job failures
Pod-level signals
Check exit codes and logs from completed pods:
# List all pods for a Job
kubectl get pods --selector=job-name=data-migration
# Check logs from a failed pod
kubectl logs data-migration-abc12
# Describe for events and failure reasons
kubectl describe job data-migration
Conditions and events
Jobs expose conditions in their status:
kubectl get job data-migration -o jsonpath='{.status.conditions}'
A Failed condition with reason: BackoffLimitExceeded means all retries are exhausted.
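A script that consumes this output could check for the Failed condition as sketched below. It assumes the conditions arrive as a JSON list (recent kubectl versions print complex jsonpath results as JSON; `-o json` is a safer source if in doubt), and failure_reason is a hypothetical helper.

```python
import json
from typing import Optional


def failure_reason(conditions_json: str) -> Optional[str]:
    """Return the reason of an active Failed condition, or None.

    Expects a JSON list of condition objects with `type`, `status`,
    and `reason` fields, as found under `.status.conditions`.
    """
    conditions = json.loads(conditions_json) if conditions_json.strip() else []
    for cond in conditions:
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return cond.get("reason", "Unknown")
    return None
```

An empty string (a Job with no conditions yet) is treated as "not failed" rather than as an error.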
Alerting on failures
Use Prometheus with kube-state-metrics to alert on Job failures:
# Prometheus alert rule
- alert: KubeJobFailed
  expr: kube_job_status_failed{job_name!=""} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Job {{ $labels.job_name }} failed"
For CronJobs, also monitor missed schedules:
- alert: CronJobMissedSchedule
  expr: time() - kube_cronjob_next_schedule_time > 3600
  for: 10m
  labels:
    severity: critical
Production checklist
Resource limits — always set CPU and memory requests/limits on Job pods. A runaway batch job can starve your Deployments.
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "2"
    memory: "1Gi"
Node selection — run batch jobs on dedicated node pools to isolate them from serving workloads.
nodeSelector:
  workload-type: batch
tolerations:
  - key: "batch"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
Idempotency — Jobs can be retried. Your task must handle being run multiple times without side effects. Use database transactions, deduplication keys, or idempotency tokens.
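One way to apply a deduplication key is to record it in the same transaction as the work. The sketch below uses SQLite from the standard library purely as a stand-in for the application database; the processed table and the function names are assumptions.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the marker table exists."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (key TEXT PRIMARY KEY)")
    return conn


def process_once(conn: sqlite3.Connection, key: str, action: Callable[[], None]) -> bool:
    """Run `action` only if `key` has not been recorded; return True if it ran.

    The marker insert and the action share one transaction, so an exception
    in `action` rolls the marker back and a retry can try again.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed (key) VALUES (?)", (key,)
        )
        if cur.rowcount == 0:
            return False  # already handled by an earlier attempt
        action()
        return True
```

The rollback only undoes writes made through the same connection. Side effects outside the database, such as sending an email, still need their own idempotency token.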
Graceful shutdown — handle SIGTERM in your containers. Kubernetes sends SIGTERM before SIGKILL (default 30s grace period). Save progress so retries can resume.
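A common pattern is to have the signal handler set a flag and let the main loop stop between items. The following is a minimal sketch under that assumption; the done list stands in for a durable checkpoint, and all names are hypothetical.

```python
import signal
from typing import Iterable, List

stop_requested = False


def request_stop(signum, frame) -> None:
    """Signal handler: only set a flag; the main loop does the cleanup."""
    global stop_requested
    stop_requested = True


def install_handler() -> None:
    """Register the handler; call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, request_stop)


def process(items: Iterable[str], done: List[str]) -> bool:
    """Handle items until finished or a stop is requested.

    `done` stands in for a durable checkpoint (an assumption for this
    sketch); a retry skips whatever is already recorded there.
    Returns True if every item was processed.
    """
    for item in items:
        if stop_requested:
            return False  # stop between items so none is left half-done
        if item in done:
            continue  # finished in a previous attempt
        done.append(item)  # real work plus checkpoint write goes here
    return True
```

Each item needs to finish well within the grace period, otherwise SIGKILL arrives before the loop gets a chance to check the flag.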
Key takeaways
- Jobs run to completion — use them for migrations, batch processing, and one-off tasks
- Parallel Jobs scale horizontally with completions and parallelism
- backoffLimit controls retries with exponential backoff; the default is 6
- CronJobs schedule recurring Jobs using standard cron syntax
- concurrencyPolicy: Forbid prevents dangerous overlapping runs
- TTL controller and history limits prevent Job clutter in your cluster
- Monitor with Prometheus — alert on failures and missed schedules
Article #429 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.