# Health Check Patterns: Liveness, Readiness, and Monitoring Strategies
A service that reports "running" is not necessarily a service that is working. Health checks bridge that gap — they give load balancers, orchestrators, and operators a structured way to ask a service: "Can you handle traffic right now?"
## Why Health Checks Matter
Without health checks, failures hide. A process stays up but stops processing requests because a database connection pool is exhausted. The load balancer keeps routing traffic to a pod that is stuck in a deadlock. Health checks surface these conditions before users notice.
## Shallow vs Deep Health Checks
Shallow (liveness) check:
- Returns 200 if the process is alive and the HTTP server can respond.
- Does not verify downstream dependencies.
- Fast, cheap, and rarely flaps.
```
GET /healthz
200 OK
```
Deep (readiness) check:
- Verifies that the service can actually do useful work.
- Pings the database, checks cache connectivity, validates that critical config is loaded.
- Slower and more likely to fail, but more informative.
```
GET /readyz

{
  "status": "ready",
  "checks": {
    "database": "ok",
    "cache": "ok",
    "config": "ok"
  }
}
```
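The split above can be sketched as two framework-agnostic handler functions. This is a minimal illustration, not a specific library's API; the dependency ping callables (`ping_db`, `ping_cache`) are hypothetical placeholders for real client calls:

```python
# Sketch: liveness vs readiness logic. Liveness touches no dependencies;
# readiness runs a set of caller-supplied check callables.

def liveness() -> tuple[int, dict]:
    # If this code runs at all, the process is alive and can respond.
    return 200, {"status": "alive"}

def readiness(checks: dict) -> tuple[int, dict]:
    # `checks` maps dependency name -> zero-argument callable returning True/False.
    results = {}
    for name, ping in checks.items():
        try:
            results[name] = "ok" if ping() else "failed"
        except Exception:
            results[name] = "failed"
    ready = all(v == "ok" for v in results.values())
    return (200 if ready else 503,
            {"status": "ready" if ready else "unhealthy", "checks": results})
```

A web framework would wire `liveness` to `/healthz` and `readiness` (with real pings) to `/readyz`, serializing the dict as JSON.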
The key rule: liveness checks should never call external dependencies. If the database goes down, you do not want the orchestrator to restart your service — the service itself is fine; it is the dependency that failed.
## Kubernetes Probes
Kubernetes defines three probe types that map directly to health check patterns:
Liveness probe:
- Determines whether the container should be restarted.
- If the liveness probe fails N consecutive times, Kubernetes kills and restarts the pod.
- Should only check the process itself (deadlock detection, memory corruption).
Readiness probe:
- Determines whether the pod should receive traffic.
- If the readiness probe fails, the pod is removed from the Service endpoints.
- Should check dependency connectivity and warm-up status.
Startup probe:
- Gives slow-starting containers extra time before liveness kicks in.
- Prevents premature restarts during initialization (loading ML models, warming caches).
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```
## Dependency Health Checks
A deep health check should verify each critical dependency individually:
```
GET /readyz

{
  "status": "degraded",
  "checks": {
    "postgres": { "status": "ok", "latency_ms": 3 },
    "redis": { "status": "ok", "latency_ms": 1 },
    "s3": { "status": "timeout", "latency_ms": 5000 },
    "config": { "status": "ok" }
  }
}
```
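One way to enforce a per-dependency timeout is to run each check on its own thread and bound the wait. This is a sketch under the assumption that check functions are synchronous callables; the check names and latency reporting are illustrative:

```python
# Sketch: run each dependency check with its own short timeout so one
# hanging dependency cannot stall the whole probe response.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_checks(checks: dict, timeout_s: float = 2.0) -> dict:
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(checks), 1)) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, fut in futures.items():
            start = time.monotonic()
            try:
                fut.result(timeout=timeout_s)
                # Approximate latency: time spent waiting on this future.
                results[name] = {"status": "ok",
                                 "latency_ms": int((time.monotonic() - start) * 1000)}
            except FutureTimeout:
                results[name] = {"status": "timeout"}
            except Exception:
                results[name] = {"status": "error"}
    return results
```

Note that a timed-out check's thread keeps running in the background; in production you would also want timeouts inside the client calls themselves.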
Design guidelines:
- Timeouts — Each dependency check must have a short timeout (1-3 seconds). A hanging check defeats the purpose.
- Caching — Cache dependency check results for a few seconds to avoid hammering downstream services with probe traffic.
- Degraded vs unhealthy — If a non-critical dependency fails (e.g., analytics service), return "degraded" but still accept traffic. Only return "unhealthy" when the service truly cannot function.
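The caching guideline can be implemented with a small TTL wrapper around the check runner. A minimal sketch, assuming `check_fn` is any zero-argument callable that produces the aggregated check results:

```python
# Sketch: cache aggregated check results for a few seconds so frequent
# probe traffic does not hammer downstream dependencies.
import time

class CachedHealth:
    def __init__(self, check_fn, ttl_s: float = 5.0):
        self._check_fn = check_fn
        self._ttl_s = ttl_s
        self._cached = None
        self._expires_at = 0.0

    def result(self):
        now = time.monotonic()
        if self._cached is None or now >= self._expires_at:
            # Cache miss or expired: run the real checks and stamp a new TTL.
            self._cached = self._check_fn()
            self._expires_at = now + self._ttl_s
        return self._cached
```

With a 5-second TTL and a 5-second probe period across many replicas, each dependency sees one check per replica per TTL window instead of one per probe.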
## Circuit Breaker Health Integration
When your service uses circuit breakers for downstream calls, expose their state in the health endpoint:
```json
{
  "circuitBreakers": {
    "payment-service": { "state": "closed", "failureRate": 0.02 },
    "notification-service": { "state": "open", "openSince": "2026-03-29T10:15:00Z" },
    "inventory-service": { "state": "half-open", "testRequests": 3 }
  }
}
```
An open circuit breaker does not necessarily mean the service is unhealthy — it may mean the service is protecting itself by rejecting calls to a failing dependency. Use this information for operational dashboards, not for readiness decisions.
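The separation can be made explicit in code: the readiness decision is computed first, and breaker state is merely attached for observability. A sketch, where the breaker registry shape (`state`, `failureRate` keys) is an assumption rather than a real library's API:

```python
# Sketch: report circuit-breaker state in the health payload without
# letting it influence the readiness decision.

def health_payload(ready: bool, breakers: dict) -> dict:
    # `ready` comes from dependency checks; breakers are informational only.
    return {
        "status": "ready" if ready else "unhealthy",
        "circuitBreakers": {
            name: {"state": b["state"], "failureRate": b.get("failureRate")}
            for name, b in breakers.items()
        },
    }
```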
## Health Check Endpoint Design
Standard paths:
| Path | Purpose |
|---|---|
| `/healthz` | Liveness — is the process alive? |
| `/readyz` | Readiness — can the service handle requests? |
| `/startupz` | Startup — has initialization completed? |
| `/metrics` | Prometheus metrics (not a health check, but related) |
Response codes:
- `200 OK` — healthy
- `503 Service Unavailable` — unhealthy (load balancer should stop routing)
- `429 Too Many Requests` — overloaded (back off)
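A small mapping keeps the status-to-code decision in one place. The status names are this article's; note that "degraded" still maps to 200 so load balancers keep routing traffic:

```python
# Sketch: map aggregate health status to HTTP response codes.
STATUS_CODES = {"healthy": 200, "degraded": 200, "overloaded": 429, "unhealthy": 503}

def response_code(status: str) -> int:
    # Unknown statuses fail closed: treat them as unhealthy.
    return STATUS_CODES.get(status, 503)
```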
Security considerations:
- Health check endpoints should not require authentication (load balancers need unauthenticated access).
- Do not expose sensitive information (connection strings, credentials) in health responses.
- Consider restricting deep health checks to internal networks only.
## Monitoring Tools and Integration
Health checks feed into the broader monitoring stack:
Infrastructure-level:
- Kubernetes — Built-in probe support, automatic pod restart and traffic management.
- AWS ELB/ALB — Target group health checks determine routing.
- Consul — Service mesh health checks for service discovery.
Application-level:
- Prometheus — Scrape `/metrics` and `/healthz`; alert on probe failures.
- Datadog / New Relic — Synthetic health check monitors with geographic distribution.
- Pingdom / UptimeRobot — External uptime monitoring for public endpoints.
Health check aggregation:
For microservice architectures, aggregate individual service health into a system-wide dashboard:
```
System Health Dashboard
├── API Gateway .............. OK
├── Auth Service ............. OK
├── Order Service ............ DEGRADED (Redis timeout)
├── Payment Service .......... OK
├── Notification Service ..... DOWN
└── Overall .................. DEGRADED
```
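A common aggregation rule, used in the dashboard above, is "overall status = worst per-service status." A sketch, where the severity ranking is an assumption you would tune to your own status vocabulary:

```python
# Sketch: aggregate per-service statuses into an overall system status
# by taking the worst-ranked one.
SEVERITY = {"ok": 0, "degraded": 1, "down": 2}

def overall_status(services: dict) -> str:
    # Pick the status value with the highest severity rank.
    return max(services.values(), key=lambda s: SEVERITY[s])
```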
## Alerting Strategies
Not every failed health check deserves a page. Use tiered alerting:
Tier 1 — Page immediately:
- Multiple critical services report unhealthy simultaneously.
- The overall system health is "down."
- Liveness probes fail (process is stuck or crashed).
Tier 2 — Alert (Slack / email):
- A single service is degraded for more than 5 minutes.
- A circuit breaker has been open for longer than the expected recovery window.
- Health check latency exceeds thresholds.
Tier 3 — Log and review:
- Transient readiness failures that self-resolve within one probe cycle.
- Non-critical dependency timeouts.
Alert fatigue prevention:
- Debounce — Require N consecutive failures before alerting.
- Hysteresis — Require M consecutive successes before clearing an alert.
- Grouping — If 20 pods fail readiness because the database is down, send one alert about the database, not 20 alerts about pods.
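Debounce and hysteresis together form a small state machine: N consecutive failures raise the alert, M consecutive successes clear it. A minimal sketch with illustrative thresholds:

```python
# Sketch: debounce (N consecutive failures to raise) plus hysteresis
# (M consecutive successes to clear) for health-check alerting.

class AlertState:
    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.failures = 0
        self.successes = 0
        self.alerting = False

    def record(self, healthy: bool) -> bool:
        # Each probe result resets the opposite counter, so only
        # consecutive runs count toward either threshold.
        if healthy:
            self.successes += 1
            self.failures = 0
            if self.alerting and self.successes >= self.recover_threshold:
                self.alerting = False
        else:
            self.failures += 1
            self.successes = 0
            if not self.alerting and self.failures >= self.fail_threshold:
                self.alerting = True
        return self.alerting
```

The asymmetry (raise on 3, clear on 2) is deliberate: it prevents a flapping dependency from toggling the alert on every probe cycle.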
## Anti-Patterns
1. Liveness checks that call the database
If the database is slow, Kubernetes restarts all pods, which reconnect simultaneously and overwhelm the database — a cascading failure triggered by health checks.
2. Health checks with no timeout
A hanging dependency check causes the probe to time out at the orchestrator level, which interprets it as a failure. Always set explicit timeouts shorter than the probe timeout.
3. Binary healthy/unhealthy with no detail
A bare 200/503 tells you nothing about why the service is unhealthy. Include structured details for operators.
4. Health checks that do real work
Do not run a database migration check or a heavy computation inside a health probe. Probes run frequently and must be lightweight.
## Key Takeaways
Health checks are the contract between your service and its infrastructure. Separate liveness from readiness. Keep liveness probes simple and dependency-free. Make readiness probes informative but fast. Integrate with circuit breakers for a complete picture. And build alerting tiers that match the severity of the failure.
This is article #264 of the Codelit system design series. For more deep dives on observability and reliability patterns, explore the full blog archive.