Health Check Endpoints — Liveness, Readiness, Startup Probes, and Dependency Checks
Why Health Checks Are Not Optional#
A service that responds to requests but returns wrong data is worse than a service that is down. Health checks let your infrastructure know whether a service is truly healthy, partially degraded, or needs to be replaced. Without them, your load balancer sends traffic to broken instances and your orchestrator never restarts failing pods.
The Three Probe Types#
Liveness Probe — "Is the Process Alive?"#
Answers one question: should this instance be killed and restarted?
Check: The process is running and not deadlocked.
Do not check: Database connectivity, downstream services, disk space. If the database is down, restarting your app will not fix it.
```http
GET /healthz

200 OK
{
  "status": "alive",
  "uptime": 84923
}
```
Failure response: The orchestrator kills and restarts the container. If your liveness probe depends on external services, a database outage will cascade into restarting every pod in your cluster.
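A liveness payload should be computable from process-local state alone, with no I/O. A minimal sketch in Python (handler wiring omitted; `START_TIME` is captured once at boot):

```python
import time

START_TIME = time.monotonic()  # captured once at process start

def liveness_response():
    """Build the /healthz payload from process-local state only -- no I/O,
    no dependency checks, nothing that can fail because of another system."""
    uptime = int(time.monotonic() - START_TIME)
    return 200, {"status": "alive", "uptime": uptime}
```

If this handler ever fails to respond, the process itself is wedged, which is exactly the signal a liveness probe is for.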
Readiness Probe — "Can This Instance Handle Traffic?"#
Answers: should the load balancer send requests to this instance?
Check: The service has completed initialization, database connection pool is warm, caches are loaded, and the service can actually process requests.
```http
GET /ready

200 OK
{
  "status": "ready",
  "checks": {
    "database": "connected",
    "cache": "warm",
    "migrations": "complete"
  }
}

503 Service Unavailable
{
  "status": "not_ready",
  "checks": {
    "database": "connected",
    "cache": "warming",
    "migrations": "complete"
  }
}
```
Failure response: The instance is removed from the load balancer's pool but is not killed. Once it passes again, traffic resumes.
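One way to implement the readiness aggregation above, assuming each dependency exposes a zero-argument probe callable (the probe names are illustrative, not a real API):

```python
def readiness_response(checks):
    """checks: name -> zero-arg callable, truthy when the dependency is usable.
    Returns an HTTP status code and a /ready-style payload."""
    results, ready = {}, True
    for name, probe in checks.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a probe that throws counts as a failed check
        results[name] = "connected" if ok else "failed"
        ready = ready and ok
    code = 200 if ready else 503
    return code, {"status": "ready" if ready else "not_ready", "checks": results}
```

Usage might look like `readiness_response({"database": db_ping, "cache": cache_ping})`, where `db_ping` and `cache_ping` are your own probe functions.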
Startup Probe — "Has the Process Finished Starting?"#
Answers: is the application still initializing?
Why it exists: Some applications take 30-120 seconds to start (JVM warmup, loading ML models, running migrations). Without a startup probe, the liveness probe would kill the pod before it finishes booting.
```http
GET /startup

200 OK                  → startup complete; liveness/readiness probes begin
503 Service Unavailable → still starting; keep waiting
```
Failure response: If the startup probe does not pass within the configured deadline, the container is killed. Liveness and readiness probes do not run until the startup probe succeeds.
Kubernetes Probe Configuration#
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
        timeoutSeconds: 2
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 2
        timeoutSeconds: 3
      startupProbe:
        httpGet:
          path: /startup
          port: 8080
        periodSeconds: 5
        failureThreshold: 30
        timeoutSeconds: 2
```
Key settings:
- initialDelaySeconds: Wait before first check (use startup probe instead when possible)
- periodSeconds: How often to check
- failureThreshold: Consecutive failures before action
- timeoutSeconds: How long to wait for a response (keep this short)
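These settings compose: roughly `failureThreshold × periodSeconds` passes between the first failed check and the orchestrator taking action. A back-of-envelope helper (illustrative arithmetic, not a Kubernetes API):

```python
def worst_case_reaction_seconds(period_seconds, failure_threshold):
    """Rough upper bound on time from first failure to orchestrator action."""
    return period_seconds * failure_threshold

# Liveness config above: 10s period x 3 failures -> up to ~30s before a restart.
# Startup config above:  5s period x 30 failures -> a 150s boot budget.
```

Tuning is a trade-off: a shorter window reacts faster but restarts pods on transient blips; a longer one tolerates blips but leaves a broken instance running longer.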
Dependency Checks#
Health endpoints should check dependencies, but only in the readiness probe.
What to Check#
| Dependency | Check Method | Timeout |
|---|---|---|
| Database | SELECT 1 or connection pool status | 2 seconds |
| Redis/Cache | PING command | 1 second |
| Message queue | Connection status | 2 seconds |
| Downstream API | HEAD request or circuit breaker state | 3 seconds |
| Disk space | Filesystem stats | instant |
| Certificate expiry | X.509 not-after date | instant |
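The database row can be as small as a `SELECT 1`. A sketch using sqlite3 as a stand-in for your real driver (production drivers usually take a connection or per-query timeout; sqlite3 here keeps the example self-contained):

```python
import sqlite3

def check_database(conn):
    """Cheapest possible round trip; anything heavier defeats the purpose."""
    try:
        row = conn.execute("SELECT 1").fetchone()
        return "connected" if row == (1,) else "failed"
    except sqlite3.Error:
        return "failed"
```

The same shape applies to the other rows in the table: one trivial operation, a tight timeout, and a string result the readiness payload can report directly.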
What Not to Check in Liveness#
- External APIs (their outage should not restart your pods)
- Non-critical dependencies (analytics, logging services)
- Expensive queries (a health check that takes 5 seconds defeats the purpose)
Parallel Dependency Checks#
Run checks concurrently, not sequentially. If you check 5 dependencies at 2 seconds each, sequential checks take 10 seconds. Parallel checks take 2 seconds.
```
GET /ready
 ├── check_database()    (2s timeout)
 ├── check_redis()       (1s timeout)
 ├── check_queue()       (2s timeout)
 └── check_filesystem()  (instant)

Total: max(individual timeouts) = 2s, not the sum
```
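A thread-pool sketch of the fan-out above. The check callables are assumptions standing in for real probes; a production version might also cancel stragglers or run the checks on an event loop:

```python
import concurrent.futures as cf

def run_checks_parallel(checks, overall_timeout=3.0):
    """checks: name -> zero-arg callable returning a status string.
    Wall time is bounded by the slowest check (or the timeout), not the sum."""
    results = {}
    pool = cf.ThreadPoolExecutor(max_workers=max(len(checks), 1))
    futures = {pool.submit(fn): name for name, fn in checks.items()}
    done, pending = cf.wait(futures, timeout=overall_timeout)
    for f in done:
        try:
            results[futures[f]] = f.result()
        except Exception:
            results[futures[f]] = "unhealthy"
    for f in pending:
        results[futures[f]] = "timeout"  # probe did not answer in time
    pool.shutdown(wait=False)  # do not block the handler on stragglers
    return results
```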
Degraded State#
Not everything is binary healthy/unhealthy. A service might be functional but impaired.
Three-State Model#
HEALTHY → All systems operational, full capacity
DEGRADED → Functional but impaired (cache miss, replica lag, high latency)
UNHEALTHY → Cannot serve requests reliably
Response Format#
```json
{
  "status": "degraded",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 12
    },
    "cache": {
      "status": "degraded",
      "message": "Redis replica lag > 5s, serving stale reads",
      "latency_ms": 340
    },
    "disk": {
      "status": "healthy",
      "free_gb": 42.8
    }
  },
  "version": "2.14.3",
  "uptime_seconds": 259200
}
```
Degraded means readiness still passes but the response signals that capacity is reduced. Monitoring tools can alert on degraded state without triggering restarts.
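The overall status can be derived as the worst individual check, with degraded still mapping to HTTP 200 so the instance stays in rotation. A sketch:

```python
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def overall_status(check_statuses):
    """Worst individual status wins; only unhealthy maps to a 503."""
    worst = max(check_statuses, key=SEVERITY.__getitem__)
    return worst, (503 if worst == "unhealthy" else 200)
```

This keeps the HTTP semantics simple for load balancers while the JSON body carries the nuance for monitoring tools.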
Health Aggregation#
When you have dozens of services, you need a single view of system health.
Aggregation Architecture#
```
Service A ──/ready──┐
Service B ──/ready──┤                        ┌──▶ /system/health
Service C ──/ready──┼──▶ Health Aggregator ──┼──▶ Dashboard
Database  ──ping────┤                        └──▶ Alerting
Redis     ──ping────┘
```
Aggregation Rules#
- System healthy: All critical services healthy
- System degraded: Any non-critical service unhealthy OR any critical service degraded
- System unhealthy: Any critical service unhealthy
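The three rules translate directly into code. A sketch where each service reports a status plus a `critical` flag (field names are illustrative):

```python
def system_status(services):
    """services: name -> {"status": "healthy"|"degraded"|"unhealthy",
                          "critical": bool}"""
    vals = services.values()
    if any(s["critical"] and s["status"] == "unhealthy" for s in vals):
        return "unhealthy"
    if any((not s["critical"] and s["status"] == "unhealthy") or
           (s["critical"] and s["status"] == "degraded") for s in vals):
        return "degraded"
    return "healthy"
```

Note the rules are evaluated most-severe first, so a critical outage is never masked by an otherwise healthy fleet.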
Deep vs Shallow Health Checks#
Shallow (/healthz): Process is alive. Fast, no dependency checks. Used by liveness probes.
Deep (/ready): Checks all dependencies. Used by readiness probes and health dashboards.
Recursive: Service A checks Service B's health as part of its own readiness. Dangerous: creates circular dependencies and cascading failures. Avoid recursive health checks. Check your own dependencies, not your dependencies' dependencies.
Standard Response Formats#
RFC Health Check (draft-inadarei-api-health-check)#
```json
{
  "status": "pass",
  "version": "1.2.3",
  "releaseId": "abc123",
  "serviceId": "user-service",
  "description": "User account management API",
  "checks": {
    "postgresql:connection": [
      {
        "componentType": "datastore",
        "observedValue": 12,
        "observedUnit": "ms",
        "status": "pass",
        "time": "2026-03-29T10:15:00Z"
      }
    ]
  }
}
```
Status values: pass, fail, warn.
Simple Format (Recommended for Most Services)#
```json
{
  "status": "ok",
  "timestamp": "2026-03-29T10:15:00Z",
  "checks": {
    "db": "ok",
    "cache": "ok",
    "queue": "degraded"
  }
}
```
Use HTTP status codes: 200 for healthy/degraded, 503 for unhealthy.
Implementation Patterns#
Caching Health Results#
Do not hit the database on every probe check. Cache results for a short window:
```python
import time

HEALTH_CACHE_TTL = 5.0  # seconds

_cache = {"result": None, "at": 0.0}

def ready():
    now = time.monotonic()
    if _cache["result"] is not None and now - _cache["at"] < HEALTH_CACHE_TTL:
        return _cache["result"]          # serve the cached result
    result = run_all_checks()            # assumed helper: runs every dependency check
    _cache.update(result=result, at=now)
    return result
```
This prevents health checks from overwhelming dependencies when the orchestrator checks every 5 seconds across 50 pods.
Circuit Breaker Integration#
If a dependency's circuit breaker is open, report degraded instead of running the check:
```python
def check_database():
    # If the breaker is already open, skip the probe instead of
    # hammering a dependency that is known to be failing.
    if circuit_breaker.is_open("database"):
        return {"status": "degraded", "reason": "circuit breaker open"}
    try:
        db.execute("SELECT 1")
        return {"status": "healthy"}
    except Exception:
        return {"status": "unhealthy"}
```
Security#
Health endpoints should not require authentication (the load balancer needs unauthenticated access), but they should:
- Not expose sensitive information (connection strings, credentials)
- Be rate-limited
- Only be accessible from internal networks (not public-facing)
- Return minimal info on /healthz; expose the detailed /ready payload only behind internal access
Summary#
- Liveness checks if the process is alive. Keep it simple. Never check external dependencies.
- Readiness checks if the service can handle traffic. Check dependencies here.
- Startup prevents premature liveness kills during slow initialization.
- Degraded state gives you a third option between healthy and unhealthy.
- Run dependency checks in parallel with short timeouts.
- Health aggregation gives a single system-wide view across all services.
Article #438 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.