Health Check Endpoints — Liveness, Readiness, Startup Probes, and Dependency Checks
Why Health Checks Are Not Optional#
A service that responds to requests but returns wrong data is worse than a service that is down. Health checks let your infrastructure know whether a service is truly healthy, partially degraded, or needs to be replaced. Without them, your load balancer sends traffic to broken instances and your orchestrator never restarts failing pods.
The Three Probe Types#
Liveness Probe — "Is the Process Alive?"#
Answers one question: should this instance be killed and restarted?
Check: The process is running and not deadlocked.
Do not check: Database connectivity, downstream services, disk space. If the database is down, restarting your app will not fix it.
```http
GET /healthz

200 OK
{
  "status": "alive",
  "uptime": 84923
}
```
Failure response: The orchestrator kills and restarts the container. If your liveness probe depends on external services, a database outage will cascade into restarting every pod in your cluster.
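A liveness payload should be computable from process-local state alone, with no I/O. A minimal sketch in Python (handler wiring omitted; `START_TIME` is captured once at boot):

```python
import time

START_TIME = time.monotonic()  # captured once at process start

def liveness_response():
    """Build the /healthz payload from process-local state only -- no I/O,
    no dependency checks, nothing that can fail because of another system."""
    uptime = int(time.monotonic() - START_TIME)
    return 200, {"status": "alive", "uptime": uptime}
```

If this handler ever fails to respond, the process itself is wedged, which is exactly the signal a liveness probe is for.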
Readiness Probe — "Can This Instance Handle Traffic?"#
Answers: should the load balancer send requests to this instance?
Check: The service has completed initialization, database connection pool is warm, caches are loaded, and the service can actually process requests.
```http
GET /ready

200 OK
{
  "status": "ready",
  "checks": {
    "database": "connected",
    "cache": "warm",
    "migrations": "complete"
  }
}

503 Service Unavailable
{
  "status": "not_ready",
  "checks": {
    "database": "connected",
    "cache": "warming",
    "migrations": "complete"
  }
}
```
Failure response: The instance is removed from the load balancer's pool but is not killed. Once it passes again, traffic resumes.
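One way to implement the readiness aggregation above, assuming each dependency exposes a zero-argument probe callable (the probe names are illustrative, not a real API):

```python
def readiness_response(checks):
    """checks: name -> zero-arg callable, truthy when the dependency is usable.
    Returns an HTTP status code and a /ready-style payload."""
    results, ready = {}, True
    for name, probe in checks.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a probe that throws counts as a failed check
        results[name] = "connected" if ok else "failed"
        ready = ready and ok
    code = 200 if ready else 503
    return code, {"status": "ready" if ready else "not_ready", "checks": results}
```

Usage might look like `readiness_response({"database": db_ping, "cache": cache_ping})`, where `db_ping` and `cache_ping` are your own probe functions.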
Startup Probe — "Has the Process Finished Starting?"#
Answers: is the application still initializing?
Why it exists: Some applications take 30-120 seconds to start (JVM warmup, loading ML models, running migrations). Without a startup probe, the liveness probe would kill the pod before it finishes booting.
```http
GET /startup

200 OK                  → startup complete; liveness/readiness probes begin
503 Service Unavailable → still starting; keep waiting
```
Failure response: If the startup probe does not pass within the configured deadline, the container is killed. Liveness and readiness probes do not run until the startup probe succeeds.
Kubernetes Probe Configuration#
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
        timeoutSeconds: 2
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 2
        timeoutSeconds: 3
      startupProbe:
        httpGet:
          path: /startup
          port: 8080
        periodSeconds: 5
        failureThreshold: 30
        timeoutSeconds: 2
```
Key settings:
- initialDelaySeconds: Wait before first check (use startup probe instead when possible)
- periodSeconds: How often to check
- failureThreshold: Consecutive failures before action
- timeoutSeconds: How long to wait for a response (keep this short)
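These settings compose: roughly `failureThreshold × periodSeconds` passes between the first failed check and the orchestrator taking action. A back-of-envelope helper (illustrative arithmetic, not a Kubernetes API):

```python
def worst_case_reaction_seconds(period_seconds, failure_threshold):
    """Rough upper bound on time from first failure to orchestrator action."""
    return period_seconds * failure_threshold

# Liveness config above: 10s period x 3 failures -> up to ~30s before a restart.
# Startup config above:  5s period x 30 failures -> a 150s boot budget.
```

Tuning is a trade-off: a shorter window reacts faster but restarts pods on transient blips; a longer one tolerates blips but leaves a broken instance running longer.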
Dependency Checks#
Health endpoints should check dependencies, but only in the readiness probe.
What to Check#
| Dependency | Check Method | Timeout |
|---|---|---|
| Database | SELECT 1 or connection pool status | 2 seconds |
| Redis/Cache | PING command | 1 second |
| Message queue | Connection status | 2 seconds |
| Downstream API | HEAD request or circuit breaker state | 3 seconds |
| Disk space | Filesystem stats | instant |
| Certificate expiry | X.509 not-after date | instant |
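The database row can be as small as a `SELECT 1`. A sketch using sqlite3 as a stand-in for your real driver (production drivers usually take a connection or per-query timeout; sqlite3 here keeps the example self-contained):

```python
import sqlite3

def check_database(conn):
    """Cheapest possible round trip; anything heavier defeats the purpose."""
    try:
        row = conn.execute("SELECT 1").fetchone()
        return "connected" if row == (1,) else "failed"
    except sqlite3.Error:
        return "failed"
```

The same shape applies to the other rows in the table: one trivial operation, a tight timeout, and a string result the readiness payload can report directly.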
What Not to Check in Liveness#
- External APIs (their outage should not restart your pods)
- Non-critical dependencies (analytics, logging services)
- Expensive queries (a health check that takes 5 seconds defeats the purpose)
Parallel Dependency Checks#
Run checks concurrently, not sequentially. If you check 5 dependencies at 2 seconds each, sequential checks take 10 seconds. Parallel checks take 2 seconds.
```
GET /ready
 ├── check_database()    (2s timeout)
 ├── check_redis()       (1s timeout)
 ├── check_queue()       (2s timeout)
 └── check_filesystem()  (instant)

Total: max(individual timeouts) = 2s, not the sum
```
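A thread-pool sketch of the fan-out above. The check callables are assumptions standing in for real probes; a production version might also cancel stragglers or run the checks on an event loop:

```python
import concurrent.futures as cf

def run_checks_parallel(checks, overall_timeout=3.0):
    """checks: name -> zero-arg callable returning a status string.
    Wall time is bounded by the slowest check (or the timeout), not the sum."""
    results = {}
    pool = cf.ThreadPoolExecutor(max_workers=max(len(checks), 1))
    futures = {pool.submit(fn): name for name, fn in checks.items()}
    done, pending = cf.wait(futures, timeout=overall_timeout)
    for f in done:
        try:
            results[futures[f]] = f.result()
        except Exception:
            results[futures[f]] = "unhealthy"
    for f in pending:
        results[futures[f]] = "timeout"  # probe did not answer in time
    pool.shutdown(wait=False)  # do not block the handler on stragglers
    return results
```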
Degraded State#
Not everything is binary healthy/unhealthy. A service might be functional but impaired.
Three-State Model#
HEALTHY → All systems operational, full capacity
DEGRADED → Functional but impaired (cache miss, replica lag, high latency)
UNHEALTHY → Cannot serve requests reliably
Response Format#
```json
{
  "status": "degraded",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 12
    },
    "cache": {
      "status": "degraded",
      "message": "Redis replica lag > 5s, serving stale reads",
      "latency_ms": 340
    },
    "disk": {
      "status": "healthy",
      "free_gb": 42.8
    }
  },
  "version": "2.14.3",
  "uptime_seconds": 259200
}
```
Degraded means readiness still passes but the response signals that capacity is reduced. Monitoring tools can alert on degraded state without triggering restarts.
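The overall status can be derived as the worst individual check, with degraded still mapping to HTTP 200 so the instance stays in rotation. A sketch:

```python
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def overall_status(check_statuses):
    """Worst individual status wins; only unhealthy maps to a 503."""
    worst = max(check_statuses, key=SEVERITY.__getitem__)
    return worst, (503 if worst == "unhealthy" else 200)
```

This keeps the HTTP semantics simple for load balancers while the JSON body carries the nuance for monitoring tools.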
Health Aggregation#
When you have dozens of services, you need a single view of system health.
Aggregation Architecture#
```
Service A ──/ready──┐
Service B ──/ready──┤                        ┌──▶ /system/health
Service C ──/ready──┼──▶ Health Aggregator ──┼──▶ Dashboard
Database  ──ping────┤                        └──▶ Alerting
Redis     ──ping────┘
```
Aggregation Rules#
- System healthy: All critical services healthy
- System degraded: Any non-critical service unhealthy OR any critical service degraded
- System unhealthy: Any critical service unhealthy
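The three rules translate directly into code. A sketch where each service reports a status plus a `critical` flag (field names are illustrative):

```python
def system_status(services):
    """services: name -> {"status": "healthy"|"degraded"|"unhealthy",
                          "critical": bool}"""
    vals = services.values()
    if any(s["critical"] and s["status"] == "unhealthy" for s in vals):
        return "unhealthy"
    if any((not s["critical"] and s["status"] == "unhealthy") or
           (s["critical"] and s["status"] == "degraded") for s in vals):
        return "degraded"
    return "healthy"
```

Note the rules are evaluated most-severe first, so a critical outage is never masked by an otherwise healthy fleet.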
Deep vs Shallow Health Checks#
Shallow (/healthz): Process is alive. Fast, no dependency checks. Used by liveness probes.
Deep (/ready): Checks all dependencies. Used by readiness probes and health dashboards.
Recursive: Service A checks Service B's health as part of its own readiness. Dangerous: creates circular dependencies and cascading failures. Avoid recursive health checks. Check your own dependencies, not your dependencies' dependencies.
Standard Response Formats#
RFC Health Check (draft-inadarei-api-health-check)#
```json
{
  "status": "pass",
  "version": "1.2.3",
  "releaseId": "abc123",
  "serviceId": "user-service",
  "description": "User account management API",
  "checks": {
    "postgresql:connection": [
      {
        "componentType": "datastore",
        "observedValue": 12,
        "observedUnit": "ms",
        "status": "pass",
        "time": "2026-03-29T10:15:00Z"
      }
    ]
  }
}
```
Status values: pass, fail, warn.
Simple Format (Recommended for Most Services)#
```json
{
  "status": "ok",
  "timestamp": "2026-03-29T10:15:00Z",
  "checks": {
    "db": "ok",
    "cache": "ok",
    "queue": "degraded"
  }
}
```
Use HTTP status codes: 200 for healthy/degraded, 503 for unhealthy.
Implementation Patterns#
Caching Health Results#
Do not hit the database on every probe check. Cache results for a short window:
```python
import time

HEALTH_CACHE_TTL = 5.0  # seconds

_cache = {"result": None, "at": 0.0}

def ready():
    now = time.monotonic()
    if _cache["result"] is not None and now - _cache["at"] < HEALTH_CACHE_TTL:
        return _cache["result"]          # serve the cached result
    result = run_all_checks()            # assumed helper: runs every dependency check
    _cache.update(result=result, at=now)
    return result
```
This prevents health checks from overwhelming dependencies when the orchestrator checks every 5 seconds across 50 pods.
Circuit Breaker Integration#
If a dependency's circuit breaker is open, report degraded instead of running the check:
```python
def check_database():
    # If the breaker is already open, skip the probe instead of
    # hammering a dependency that is known to be failing.
    if circuit_breaker.is_open("database"):
        return {"status": "degraded", "reason": "circuit breaker open"}
    try:
        db.execute("SELECT 1")
        return {"status": "healthy"}
    except Exception:
        return {"status": "unhealthy"}
```
Security#
Health endpoints should not require authentication (the load balancer needs unauthenticated access), but they should:
- Not expose sensitive information (connection strings, credentials)
- Be rate-limited
- Only be accessible from internal networks (not public-facing)
- Return minimal info on /healthz; expose the detailed /ready payload only behind internal access
Summary#
- Liveness checks if the process is alive. Keep it simple. Never check external dependencies.
- Readiness checks if the service can handle traffic. Check dependencies here.
- Startup prevents premature liveness kills during slow initialization.
- Degraded state gives you a third option between healthy and unhealthy.
- Run dependency checks in parallel with short timeouts.
- Health aggregation gives a single system-wide view across all services.
Article #438 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.