# Health Check Patterns: Liveness, Readiness, and Monitoring Strategies
A service that reports "running" is not necessarily a service that is working. Health checks bridge that gap — they give load balancers, orchestrators, and operators a structured way to ask a service: "Can you handle traffic right now?"
## Why Health Checks Matter
Without health checks, failures hide. A process stays up but stops processing requests because a database connection pool is exhausted. The load balancer keeps routing traffic to a pod that is stuck in a deadlock. Health checks surface these conditions before users notice.
## Shallow vs Deep Health Checks
Shallow (liveness) check:
- Returns 200 if the process is alive and the HTTP server can respond.
- Does not verify downstream dependencies.
- Fast, cheap, and rarely flaps.
```
GET /healthz
200 OK
```
Deep (readiness) check:
- Verifies that the service can actually do useful work.
- Pings the database, checks cache connectivity, validates that critical config is loaded.
- Slower and more likely to fail, but more informative.
```
GET /readyz

{
  "status": "ready",
  "checks": {
    "database": "ok",
    "cache": "ok",
    "config": "ok"
  }
}
```
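The split above can be sketched as two framework-agnostic handler functions. This is a minimal illustration, not a specific library's API; the dependency ping callables (`ping_db`, `ping_cache`) are hypothetical placeholders for real client calls:

```python
# Sketch: liveness vs readiness logic. Liveness touches no dependencies;
# readiness runs a set of caller-supplied check callables.

def liveness() -> tuple[int, dict]:
    # If this code runs at all, the process is alive and can respond.
    return 200, {"status": "alive"}

def readiness(checks: dict) -> tuple[int, dict]:
    # `checks` maps dependency name -> zero-argument callable returning True/False.
    results = {}
    for name, ping in checks.items():
        try:
            results[name] = "ok" if ping() else "failed"
        except Exception:
            results[name] = "failed"
    ready = all(v == "ok" for v in results.values())
    return (200 if ready else 503,
            {"status": "ready" if ready else "unhealthy", "checks": results})
```

A web framework would wire `liveness` to `/healthz` and `readiness` (with real pings) to `/readyz`, serializing the dict as JSON.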
The key rule: liveness checks should never call external dependencies. If the database goes down, you do not want the orchestrator to restart your service — the service itself is fine; it is the dependency that failed.
## Kubernetes Probes
Kubernetes defines three probe types that map directly to health check patterns:
Liveness probe:
- Determines whether the container should be restarted.
- If the liveness probe fails N consecutive times, Kubernetes kills and restarts the pod.
- Should only check the process itself (deadlock detection, memory corruption).
Readiness probe:
- Determines whether the pod should receive traffic.
- If the readiness probe fails, the pod is removed from the Service endpoints.
- Should check dependency connectivity and warm-up status.
Startup probe:
- Gives slow-starting containers extra time before liveness kicks in.
- Prevents premature restarts during initialization (loading ML models, warming caches).
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```
## Dependency Health Checks
A deep health check should verify each critical dependency individually:
```
GET /readyz

{
  "status": "degraded",
  "checks": {
    "postgres": { "status": "ok", "latency_ms": 3 },
    "redis": { "status": "ok", "latency_ms": 1 },
    "s3": { "status": "timeout", "latency_ms": 5000 },
    "config": { "status": "ok" }
  }
}
```
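One way to enforce a per-dependency timeout is to run each check on its own thread and bound the wait. This is a sketch under the assumption that check functions are synchronous callables; the check names and latency reporting are illustrative:

```python
# Sketch: run each dependency check with its own short timeout so one
# hanging dependency cannot stall the whole probe response.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_checks(checks: dict, timeout_s: float = 2.0) -> dict:
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(checks), 1)) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, fut in futures.items():
            start = time.monotonic()
            try:
                fut.result(timeout=timeout_s)
                # Approximate latency: time spent waiting on this future.
                results[name] = {"status": "ok",
                                 "latency_ms": int((time.monotonic() - start) * 1000)}
            except FutureTimeout:
                results[name] = {"status": "timeout"}
            except Exception:
                results[name] = {"status": "error"}
    return results
```

Note that a timed-out check's thread keeps running in the background; in production you would also want timeouts inside the client calls themselves.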
Design guidelines:
- Timeouts — Each dependency check must have a short timeout (1-3 seconds). A hanging check defeats the purpose.
- Caching — Cache dependency check results for a few seconds to avoid hammering downstream services with probe traffic.
- Degraded vs unhealthy — If a non-critical dependency fails (e.g., analytics service), return "degraded" but still accept traffic. Only return "unhealthy" when the service truly cannot function.
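The caching guideline can be implemented with a small TTL wrapper around the check runner. A minimal sketch, assuming `check_fn` is any zero-argument callable that produces the aggregated check results:

```python
# Sketch: cache aggregated check results for a few seconds so frequent
# probe traffic does not hammer downstream dependencies.
import time

class CachedHealth:
    def __init__(self, check_fn, ttl_s: float = 5.0):
        self._check_fn = check_fn
        self._ttl_s = ttl_s
        self._cached = None
        self._expires_at = 0.0

    def result(self):
        now = time.monotonic()
        if self._cached is None or now >= self._expires_at:
            # Cache miss or expired: run the real checks and stamp a new TTL.
            self._cached = self._check_fn()
            self._expires_at = now + self._ttl_s
        return self._cached
```

With a 5-second TTL and a 5-second probe period across many replicas, each dependency sees one check per replica per TTL window instead of one per probe.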
## Circuit Breaker Health Integration
When your service uses circuit breakers for downstream calls, expose their state in the health endpoint:
```json
{
  "circuitBreakers": {
    "payment-service": { "state": "closed", "failureRate": 0.02 },
    "notification-service": { "state": "open", "openSince": "2026-03-29T10:15:00Z" },
    "inventory-service": { "state": "half-open", "testRequests": 3 }
  }
}
```
An open circuit breaker does not necessarily mean the service is unhealthy — it may mean the service is protecting itself by rejecting calls to a failing dependency. Use this information for operational dashboards, not for readiness decisions.
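The separation can be made explicit in code: the readiness decision is computed first, and breaker state is merely attached for observability. A sketch, where the breaker registry shape (`state`, `failureRate` keys) is an assumption rather than a real library's API:

```python
# Sketch: report circuit-breaker state in the health payload without
# letting it influence the readiness decision.

def health_payload(ready: bool, breakers: dict) -> dict:
    # `ready` comes from dependency checks; breakers are informational only.
    return {
        "status": "ready" if ready else "unhealthy",
        "circuitBreakers": {
            name: {"state": b["state"], "failureRate": b.get("failureRate")}
            for name, b in breakers.items()
        },
    }
```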
## Health Check Endpoint Design
Standard paths:
| Path | Purpose |
|---|---|
| `/healthz` | Liveness — is the process alive? |
| `/readyz` | Readiness — can the service handle requests? |
| `/startupz` | Startup — has initialization completed? |
| `/metrics` | Prometheus metrics (not a health check, but related) |
Response codes:
- `200 OK` — healthy
- `503 Service Unavailable` — unhealthy (load balancer should stop routing)
- `429 Too Many Requests` — overloaded (back off)
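A small mapping keeps the status-to-code decision in one place. The status names are this article's; note that "degraded" still maps to 200 so load balancers keep routing traffic:

```python
# Sketch: map aggregate health status to HTTP response codes.
STATUS_CODES = {"healthy": 200, "degraded": 200, "overloaded": 429, "unhealthy": 503}

def response_code(status: str) -> int:
    # Unknown statuses fail closed: treat them as unhealthy.
    return STATUS_CODES.get(status, 503)
```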
Security considerations:
- Health check endpoints should not require authentication (load balancers need unauthenticated access).
- Do not expose sensitive information (connection strings, credentials) in health responses.
- Consider restricting deep health checks to internal networks only.
## Monitoring Tools and Integration
Health checks feed into the broader monitoring stack:
Infrastructure-level:
- Kubernetes — Built-in probe support, automatic pod restart and traffic management.
- AWS ELB/ALB — Target group health checks determine routing.
- Consul — Service mesh health checks for service discovery.
Application-level:
- Prometheus — Scrape `/metrics` and `/healthz`; alert on probe failures.
- Datadog / New Relic — Synthetic health check monitors with geographic distribution.
- Pingdom / UptimeRobot — External uptime monitoring for public endpoints.
Health check aggregation:
For microservice architectures, aggregate individual service health into a system-wide dashboard:
```
System Health Dashboard
├── API Gateway .............. OK
├── Auth Service ............. OK
├── Order Service ............ DEGRADED (Redis timeout)
├── Payment Service .......... OK
├── Notification Service ..... DOWN
└── Overall .................. DEGRADED
```
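A common aggregation rule, used in the dashboard above, is "overall status = worst per-service status." A sketch, where the severity ranking is an assumption you would tune to your own status vocabulary:

```python
# Sketch: aggregate per-service statuses into an overall system status
# by taking the worst-ranked one.
SEVERITY = {"ok": 0, "degraded": 1, "down": 2}

def overall_status(services: dict) -> str:
    # Pick the status value with the highest severity rank.
    return max(services.values(), key=lambda s: SEVERITY[s])
```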
## Alerting Strategies
Not every failed health check deserves a page. Use tiered alerting:
Tier 1 — Page immediately:
- Multiple critical services report unhealthy simultaneously.
- The overall system health is "down."
- Liveness probes fail (process is stuck or crashed).
Tier 2 — Alert (Slack / email):
- A single service is degraded for more than 5 minutes.
- A circuit breaker has been open for longer than the expected recovery window.
- Health check latency exceeds thresholds.
Tier 3 — Log and review:
- Transient readiness failures that self-resolve within one probe cycle.
- Non-critical dependency timeouts.
Alert fatigue prevention:
- Debounce — Require N consecutive failures before alerting.
- Hysteresis — Require M consecutive successes before clearing an alert.
- Grouping — If 20 pods fail readiness because the database is down, send one alert about the database, not 20 alerts about pods.
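Debounce and hysteresis together form a small state machine: N consecutive failures raise the alert, M consecutive successes clear it. A minimal sketch with illustrative thresholds:

```python
# Sketch: debounce (N consecutive failures to raise) plus hysteresis
# (M consecutive successes to clear) for health-check alerting.

class AlertState:
    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.failures = 0
        self.successes = 0
        self.alerting = False

    def record(self, healthy: bool) -> bool:
        # Each probe result resets the opposite counter, so only
        # consecutive runs count toward either threshold.
        if healthy:
            self.successes += 1
            self.failures = 0
            if self.alerting and self.successes >= self.recover_threshold:
                self.alerting = False
        else:
            self.failures += 1
            self.successes = 0
            if not self.alerting and self.failures >= self.fail_threshold:
                self.alerting = True
        return self.alerting
```

The asymmetry (raise on 3, clear on 2) is deliberate: it prevents a flapping dependency from toggling the alert on every probe cycle.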
## Anti-Patterns
1. Liveness checks that call the database
If the database is slow, Kubernetes restarts all pods, which reconnect simultaneously and overwhelm the database — a cascading failure triggered by health checks.
2. Health checks with no timeout
A hanging dependency check causes the probe to time out at the orchestrator level, which interprets it as a failure. Always set explicit timeouts shorter than the probe timeout.
3. Binary healthy/unhealthy with no detail
A bare 200/503 tells you nothing about why the service is unhealthy. Include structured details for operators.
4. Health checks that do real work
Do not run a database migration check or a heavy computation inside a health probe. Probes run frequently and must be lightweight.
## Key Takeaways
Health checks are the contract between your service and its infrastructure. Separate liveness from readiness. Keep liveness probes simple and dependency-free. Make readiness probes informative but fast. Integrate with circuit breakers for a complete picture. And build alerting tiers that match the severity of the failure.
This is article #264 of the Codelit system design series. For more deep dives on observability and reliability patterns, explore the full blog archive.