Kubernetes Jobs and CronJobs — Batch Processing Done Right

March 29, 2026 | 7 min read | By Codelit Team

Not everything is a long-running service

Kubernetes excels at running Deployments — pods that stay alive, serve traffic, and restart on failure. But many workloads are fundamentally different: data migrations, report generation, batch imports, email campaigns, nightly cleanups.

These tasks run to completion and exit. Kubernetes Jobs and CronJobs are purpose-built for this.

Jobs: run to completion

A Job creates one or more pods and ensures a specified number of them successfully terminate. Unlike a Deployment, a Job does not restart pods after success.

apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: myapp/migrate:v2.3
          command: ["python", "migrate.py", "--target", "v2.3"]
      restartPolicy: Never
  backoffLimit: 4

When this Job runs, Kubernetes creates a pod. If the pod succeeds (exit code 0), the Job is complete. If it fails, Kubernetes retries up to backoffLimit times.

Job types: single, parallel, and indexed

Single completion (default)

One pod runs. It succeeds or fails. This is the simplest form — good for migrations, one-off scripts, and database seeds.

Parallel with fixed completions

Multiple pods run, and the Job succeeds when a target number of completions is reached.

spec:
  completions: 10
  parallelism: 3
  template:
    # ...

This runs 3 pods at a time until 10 pods have succeeded. Each pod processes one work item. Use this for batch processing where each item is independent.

Parallel work queue

Pods pull work from an external queue (Redis, SQS, RabbitMQ). Once any pod exits successfully, Kubernetes starts no new pods, and the Job is complete when the remaining pods have also terminated.

spec:
  parallelism: 5
  template:
    spec:
      containers:
        - name: worker
          image: myapp/worker:latest
          env:
            - name: QUEUE_URL
              value: "redis://redis:6379/0"
      restartPolicy: Never

Each pod dequeues items until the queue is empty, then exits 0. No completions field — Kubernetes doesn't know the total work items.
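The worker loop described above can be sketched in Python. This is a minimal illustration, assuming an in-memory `queue.Queue` as a stand-in for the external queue; the `drain` function and its `handle` callback are hypothetical names, not part of any Kubernetes or Redis API.

```python
import queue


def drain(work_queue, handle):
    """Process items until the queue is empty; return how many were handled."""
    processed = 0
    while True:
        try:
            # Stand-in for a non-blocking dequeue from Redis, SQS, or RabbitMQ.
            item = work_queue.get_nowait()
        except queue.Empty:
            # Nothing left: the worker can now exit with status 0.
            return processed
        handle(item)
        processed += 1
```

With a real queue, "empty" may only mean "empty right now", so a production worker usually needs an explicit signal that no more work will arrive before it exits.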

Indexed Jobs (Kubernetes 1.24+)

Each pod gets a unique index via the JOB_COMPLETION_INDEX environment variable. Useful when you need to partition work deterministically.

spec:
  completions: 8
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      containers:
        - name: processor
          image: myapp/processor:latest
          # Pod receives JOB_COMPLETION_INDEX (0-7)
      restartPolicy: Never

Pod 0 processes shard 0, pod 1 processes shard 1, and so on. No need for an external queue — the index itself is the work assignment.
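Inside the container, the index-to-shard mapping might look like the sketch below. Only the JOB_COMPLETION_INDEX variable comes from Kubernetes; the function names and the round-robin partitioning are assumptions chosen for illustration.

```python
import os


def current_index(environ=os.environ):
    """Read the completion index that Kubernetes injects into an Indexed Job pod."""
    return int(environ["JOB_COMPLETION_INDEX"])


def shard_for_index(items, index, total):
    """Return the items assigned to one completion index (round-robin split)."""
    return [item for position, item in enumerate(items) if position % total == index]
```

With `completions: 8`, a pod would call `shard_for_index(items, current_index(), 8)`, provided every pod sees the same item list in the same order.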

Failure handling: backoffLimit and restartPolicy

backoffLimit

Controls how many times Kubernetes retries failed pods before it marks the Job as failed. The default is 6.

spec:
  backoffLimit: 3

Once the retries are used up, the Job is marked Failed. Kubernetes waits between retries with an exponential backoff: 10s, 20s, 40s, and so on, capped at 6 minutes.
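The delay sequence can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, using the 10-second base and 6-minute cap stated above; `backoff_delays` is a hypothetical helper, and the real controller's timing may differ in detail.

```python
def backoff_delays(retries, base=10, cap=360):
    """Return the nominal delay in seconds before each retry, doubling up to the cap."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]
```

For example, `backoff_delays(3)` gives the three delays that apply with `backoffLimit: 3`, and from the seventh retry onward the delay stays at the 360-second cap.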

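restartPolicy

Jobs support two restart policies:

Never: failed pods are not restarted. Kubernetes creates a new pod for each retry, and the failed pods remain for log inspection.
OnFailure: the failed container is restarted in place inside the same pod. This avoids rescheduling, but only the most recent previous attempt's logs stay available (kubectl logs --previous), and the pod is terminated once backoffLimit is reached.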

For debugging, Never is better. For production cost efficiency, OnFailure is usually preferred.

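activeDeadlineSeconds

A hard timeout for the entire Job, measured from its start time. If the Job hasn't completed within this window, all running pods are terminated and the Job is marked Failed with reason DeadlineExceeded. activeDeadlineSeconds takes precedence over backoffLimit.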

spec:
  activeDeadlineSeconds: 3600  # 1 hour max
  backoffLimit: 3

This prevents runaway jobs from consuming cluster resources indefinitely.

CronJobs: scheduled execution

A CronJob creates Jobs on a schedule. It uses standard cron syntax.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"  # 2:00 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: report
              image: myapp/report-generator:latest
              command: ["python", "generate_report.py"]
          restartPolicy: OnFailure
      backoffLimit: 2

Cron syntax quick reference

Field	Values	Example
Minute	0-59	`30` = minute 30
Hour	0-23	`2` = 2 AM
Day of month	1-31	`15` = 15th
Month	1-12	`*/3` = every 3 months
Day of week	0-6 (Sun=0)	`1-5` = weekdays

Common patterns:

*/15 * * * * — every 15 minutes
0 */6 * * * — every 6 hours
0 2 * * 1 — Monday at 2 AM
0 0 1 * * — first day of each month at midnight
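To check how a pattern like the ones above behaves, a small matcher can help. The sketch below is a simplified illustration, not the parser Kubernetes uses: it supports only `*`, `*/n`, `a-b`, comma lists, and literal numbers, and it requires all five fields to match, which differs from standard cron when both day-of-month and day-of-week are restricted. The function names are hypothetical.

```python
from datetime import datetime


def field_matches(field, value, minimum):
    """Check one cron field against a value; `minimum` is the field's lowest legal value."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Steps count from the field's minimum, e.g. */3 in months is 1, 4, 7, 10.
            if (value - minimum) % int(part[2:]) == 0:
                return True
        elif "-" in part:
            low, high = (int(bound) for bound in part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def schedule_matches(expression, moment):
    """Return True if `moment` falls on the five-field cron `expression`."""
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (moment.weekday() + 1) % 7  # Python: Mon=0; cron: Sun=0
    return (
        field_matches(minute, moment.minute, 0)
        and field_matches(hour, moment.hour, 0)
        and field_matches(day, moment.day, 1)
        and field_matches(month, moment.month, 1)
        and field_matches(weekday, cron_weekday, 0)
    )
```

For instance, `schedule_matches("0 2 * * 1", datetime(2026, 3, 30, 2, 0))` checks the "Monday at 2 AM" pattern against a specific timestamp.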

concurrencyPolicy: handling overlaps

What happens when a CronJob triggers but the previous run is still active?

spec:
  concurrencyPolicy: Forbid

Allow (default): multiple Jobs run simultaneously. Risk of resource contention and data races.
Forbid: skip the new run if the previous one is still active. Safest for tasks that must not overlap, such as non-idempotent work.
Replace: stop the running Job and start a new one. Use when freshness matters more than completion.

For most production workloads, Forbid is the right choice. If a report takes 90 minutes but runs every hour, you want to skip rather than overlap.
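The 90-minute report on an hourly schedule can be traced with a short simulation. This is a simplified model of Forbid, assuming fixed run durations and trigger times given in minutes; `simulate_forbid` is a hypothetical helper, not a Kubernetes API.

```python
def simulate_forbid(triggers, duration):
    """Return (started, skipped) trigger times when overlapping runs are forbidden."""
    started, skipped = [], []
    busy_until = None
    for trigger in triggers:
        if busy_until is not None and trigger < busy_until:
            # The previous run is still active, so this scheduled run is skipped.
            skipped.append(trigger)
        else:
            started.append(trigger)
            busy_until = trigger + duration
    return started, skipped
```

Calling `simulate_forbid([0, 60, 120, 180, 240], 90)` shows every second trigger being skipped, which means the report effectively runs every two hours rather than every hour.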

TTL controller: automatic cleanup

By default, a finished Job and its pods stay in the cluster until something deletes them. They clutter kubectl get pods and consume etcd storage.

spec:
  ttlSecondsAfterFinished: 86400  # Clean up after 24 hours

The TTL-after-finished controller deletes the Job and its pods after the specified duration. Set this on every Job.

For CronJobs, also use:

spec:
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5

This keeps the last 3 successful and 5 failed Jobs for debugging while cleaning up older ones.

Monitoring job failures

Pod-level signals

Check exit codes and logs from completed pods:

# List all pods for a Job
kubectl get pods --selector=job-name=data-migration

# Check logs from a failed pod
kubectl logs data-migration-abc12

# Describe for events and failure reasons
kubectl describe job data-migration

Conditions and events

Jobs expose conditions in their status:

kubectl get job data-migration -o jsonpath='{.status.conditions}'

A Failed condition with reason: BackoffLimitExceeded means all retries are exhausted.
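A script can make the same check on the JSON form of a Job. The sketch below assumes the Job object has already been loaded into a dict (for example from the output of kubectl get job data-migration -o json); `failure_reason` is a hypothetical helper.

```python
def failure_reason(job):
    """Return the reason of a true Failed condition, or None if the Job has not failed."""
    conditions = job.get("status", {}).get("conditions") or []
    for condition in conditions:
        if condition.get("type") == "Failed" and condition.get("status") == "True":
            return condition.get("reason")
    return None
```

A return value of "BackoffLimitExceeded" corresponds to the exhausted-retries case described above, while None covers Jobs that are still running or have completed.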

Alerting on failures

Use Prometheus with kube-state-metrics to alert on Job failures:

# Prometheus alert rule
- alert: KubeJobFailed
  expr: kube_job_status_failed{job_name!=""} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Job {{ $labels.job_name }} failed"

For CronJobs, also monitor missed schedules:

- alert: CronJobMissedSchedule
  expr: time() - kube_cronjob_next_schedule_time > 3600
  for: 10m
  labels:
    severity: critical

Production checklist

Resource limits — always set CPU and memory requests/limits on Job pods. A runaway batch job can starve your Deployments.

resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "2"
    memory: "1Gi"

Node selection — run batch jobs on dedicated node pools to isolate them from serving workloads.

nodeSelector:
  workload-type: batch
tolerations:
  - key: "batch"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Idempotency — Jobs can be retried. Your task must handle being run multiple times without side effects. Use database transactions, deduplication keys, or idempotency tokens.
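One way to apply the deduplication-key idea is sketched below. It assumes each record carries a unique "id" field and uses an in-memory set as a stand-in for a durable store of processed keys; both are illustrative assumptions, as is the `apply_once` name.

```python
def apply_once(records, seen_keys, apply):
    """Apply each record at most once, keyed by its id; return how many were applied."""
    applied = 0
    for record in records:
        key = record["id"]
        if key in seen_keys:
            # Already handled by an earlier attempt, so a retry skips it.
            continue
        apply(record)
        seen_keys.add(key)
        applied += 1
    return applied
```

In a real Job the key would need to be recorded in the same transaction as the side effect, otherwise a crash between the two steps could still cause a duplicate.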

Graceful shutdown — handle SIGTERM in your containers. Kubernetes sends SIGTERM before SIGKILL (default 30s grace period). Save progress so retries can resume.
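A SIGTERM-aware processing loop might look like the following sketch. The `GracefulStop` class and `run` function are hypothetical names; how the returned resume index gets persisted (a file, a database row) is left open and is an assumption of this example.

```python
import signal


class GracefulStop:
    """Records that SIGTERM arrived so the work loop can stop at a safe point."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum, frame):
        self.requested = True


def run(items, handle, stop, start=0):
    """Handle items from `start` onward; return the index to resume from."""
    for position in range(start, len(items)):
        if stop.requested:
            # Stop between items so no item is left half-processed.
            return position
        handle(items[position])
    return len(items)
```

A retry can pass the saved index back in as `start`, so it picks up where the terminated pod left off instead of repeating finished items.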

Visualize your Kubernetes architecture

Map out your Jobs, CronJobs, queues, and services together — try Codelit to generate interactive architecture diagrams.

Key takeaways

Jobs run to completion — use them for migrations, batch processing, and one-off tasks
Parallel Jobs scale horizontally with completions and parallelism
backoffLimit controls retries with exponential backoff — default is 6
CronJobs schedule recurring Jobs using standard cron syntax
concurrencyPolicy: Forbid prevents dangerous overlapping runs
TTL controller and history limits prevent Job clutter in your cluster
Monitor with Prometheus — alert on failures and missed schedules

Article #429 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.


