Kubernetes HPA Deep Dive — CPU, Custom Metrics, Scaling Policies, and KEDA
What the HPA does#
The Horizontal Pod Autoscaler watches metrics and adjusts the number of pod replicas. More traffic means more pods. Less traffic means fewer pods. You define the rules — Kubernetes executes them.
Metrics Server → HPA controller → Deployment replica count → Scheduler → Pods
The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). Each iteration: fetch metrics, compute desired replicas, update the deployment.
CPU and memory metrics#
The simplest HPA configuration targets CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
How the formula works#
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
If you have 4 replicas at 90% CPU and your target is 70%:
desiredReplicas = ceil(4 * (90 / 70)) = ceil(5.14) = 6
The HPA scales to 6 replicas.
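The calculation above can be sketched in a few lines of Python. This is an illustrative model, not controller code; it also includes the controller's default 10% tolerance (configurable via --horizontal-pod-autoscaler-tolerance), which suppresses scaling when the metric is close to target.

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Model the core HPA formula, including the dead band:
    if the ratio is within +/- tolerance of 1.0, no scaling happens."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave replicas alone
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 70))  # 4 pods at 90% CPU, target 70% -> 6
print(desired_replicas(4, 72, 70))  # within the 10% tolerance -> stays at 4
```

The tolerance is why a cluster sitting at 72% against a 70% target does not flap: the ratio 1.03 falls inside the dead band.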
Resource requests matter#
CPU utilization is calculated relative to the pod's resource request, not the node capacity. If your pod requests 500m CPU and uses 350m, utilization is 70%.
resources:
  requests:
    cpu: 500m
    memory: 256Mi
No resource request means the HPA cannot calculate utilization. Always set requests.
Custom metrics (Prometheus adapter)#
CPU and memory are not enough for most workloads. A web server might have low CPU but a queue of 10,000 pending requests. You need custom metrics.
Architecture#
Your App → Prometheus (scrape) → Prometheus Adapter → Kubernetes Custom Metrics API → HPA
Step 1: Expose metrics from your app#
# Your app exposes a /metrics endpoint
http_requests_in_flight 42
http_request_queue_length 1500
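A minimal sketch of such an endpoint using only the Python standard library (no Prometheus client library; the metric names match the example above and the gauge values are placeholders — a real app would track live state):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder gauges -- a real app would update these from live state.
METRICS = {
    "http_requests_in_flight": 42,
    "http_request_queue_length": 1500,
}

def render_metrics(metrics: dict) -> str:
    """Render gauges in the Prometheus text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), MetricsHandler).serve_forever()
```

In production you would normally use an official Prometheus client library, which handles registries, label escaping, and content negotiation for you.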
Step 2: Install Prometheus adapter#
The adapter translates Prometheus queries into the Kubernetes custom metrics API.
# prometheus-adapter config
rules:
- seriesQuery: 'http_requests_in_flight{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "${1}"
  # in-flight requests is a gauge, so sum the current values;
  # rate() is only meaningful for counters
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Step 3: Reference custom metrics in HPA#
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_in_flight
    target:
      type: AverageValue
      averageValue: 100
Now the HPA scales based on in-flight requests, targeting an average of 100 concurrent requests per pod: when the average climbs above 100, the HPA adds pods until it falls back under the target.
Common custom metrics#
- Request queue length — scale when work is piling up
- Consumer lag (Kafka) — scale consumers when messages back up
- Active connections — scale based on connection count
- Custom business metrics — orders per second, jobs in queue
Scaling behavior policies#
Kubernetes v2 HPA lets you control how fast scaling happens, not just when.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
    - type: Pods
      value: 4
      periodSeconds: 60
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
    selectPolicy: Min
What this means#
Scale up: Every 60 seconds, add up to 100% more pods OR 4 pods (whichever is higher). Aggressive — respond fast to load spikes.
Scale down: Every 60 seconds, remove at most 10% of pods. Conservative — avoid thrashing on temporary dips.
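The per-period limits above can be sketched as follows. This is an illustrative Python model of the example configuration, not controller code, and the rounding is an assumption (the real controller's rounding may differ at the margins):

```python
import math

def scale_up_allowed(current: int) -> int:
    """Highest replica count one 60s period permits under the example
    scaleUp policies: +100% OR +4 pods, with selectPolicy: Max."""
    by_percent = current + math.ceil(current * 100 / 100)  # add up to 100% more
    by_pods = current + 4                                  # add up to 4 pods
    return max(by_percent, by_pods)  # Max -> the more permissive policy wins

def scale_down_allowed(current: int) -> int:
    """Lowest replica count one 60s period permits under the example
    scaleDown policy: remove at most 10% of pods."""
    return current - math.ceil(current * 10 / 100)

print(scale_up_allowed(3))     # -> 7  (+4 pods beats +100%)
print(scale_up_allowed(10))    # -> 20 (+100% beats +4 pods)
print(scale_down_allowed(10))  # -> 9
```

Notice how the two scale-up policies complement each other: the Pods policy dominates at small replica counts, the Percent policy dominates at large ones.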
selectPolicy#
- Max — use the policy that allows the most change (aggressive)
- Min — use the policy that allows the least change (conservative)
- Disabled — prevent scaling in that direction entirely
Cooldown and stabilization windows#
The problem: thrashing#
Without stabilization, the HPA oscillates:
10:00 — CPU spikes to 90% → scale up to 8 pods
10:01 — 8 pods reduce CPU to 40% → scale down to 4 pods
10:02 — 4 pods, CPU spikes again → scale up to 8 pods
This is thrashing. Pods are constantly created and destroyed.
Stabilization window#
The stabilization window looks at all desired replica recommendations over the window period and picks the most conservative one.
scaleDown:
  stabilizationWindowSeconds: 300  # 5 minutes
For scale-down: the HPA looks at all computed desired replica counts in the last 5 minutes and picks the highest one. It only scales down if all recommendations in the window agree.
For scale-up: same logic but picks the lowest recommendation. Usually set shorter (0-60 seconds) because you want to respond quickly to load.
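The window logic reduces to picking the most conservative recommendation. A small illustrative sketch (the real controller timestamps each recommendation and expires them individually; here the window is just a list):

```python
def stabilized(recommendations: list[int], direction: str) -> int:
    """Pick the most conservative recommendation from the window:
    for scale-down the highest count, for scale-up the lowest."""
    return max(recommendations) if direction == "down" else min(recommendations)

# Desired-replica recommendations over the last 5 minutes,
# including one transient dip at 10:02:
window = [8, 8, 4, 8, 8]
print(stabilized(window, "down"))  # -> 8: the dip to 4 is ignored
```

The transient dip never reaches the deployment, which is exactly what prevents the thrashing scenario above.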
Default values#
| Parameter | Default |
|---|---|
| Scale-up stabilization | 0 seconds (immediate) |
| Scale-down stabilization | 300 seconds (5 minutes) |
Practical tuning#
- Bursty traffic (API servers): short scale-up window (0-30s), long scale-down window (5-10 min)
- Steady traffic (batch workers): moderate both directions (60s up, 300s down)
- Cost-sensitive workloads: aggressive scale-down (60-120s) with conservative policies
Scale-to-zero with KEDA#
The built-in HPA has a hard floor: minReplicas must be at least 1 (barring the alpha HPAScaleToZero feature gate). You always pay for at least one pod, even with zero traffic.
KEDA (Kubernetes Event-Driven Autoscaling) solves this.
How KEDA works#
Event Source → KEDA Scaler → KEDA Operator → HPA (1→N) or Deployment (0→1)
KEDA handles the 0-to-1 and 1-to-0 transitions. Once at least 1 pod exists, it creates a standard HPA to handle 1-to-N scaling.
Example: scale on Kafka consumer lag#
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0   # KEDA allows zero
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: order-group
      topic: orders
      lagThreshold: "100"
When there are zero messages in the topic, KEDA scales to zero pods. When messages arrive, KEDA spins up a pod, then the HPA takes over for further scaling.
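KEDA's decision for this trigger can be approximated as follows. This is an illustrative sketch, not KEDA's actual code: the real Kafka scaler also honors activation thresholds, pollingInterval, and caps replicas at the topic's partition count.

```python
import math

def keda_desired_replicas(total_lag: int, lag_threshold: int = 100,
                          min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Approximate the Kafka scaler: one replica per lagThreshold messages
    of consumer lag, clamped to [min, max]; zero lag -> scale to zero."""
    if total_lag == 0:
        return min_replicas  # idle topic: no pods at all
    desired = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(desired, max_replicas))

print(keda_desired_replicas(0))     # -> 0  (idle topic)
print(keda_desired_replicas(250))   # -> 3
print(keda_desired_replicas(9999))  # -> 50 (capped at maxReplicaCount)
```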
KEDA scalers#
KEDA supports 60+ event sources out of the box:
- Message queues — Kafka, RabbitMQ, SQS, Azure Service Bus
- Databases — PostgreSQL, MySQL, MongoDB (query-based)
- HTTP — scale on request rate (via KEDA HTTP add-on)
- Cron — schedule-based scaling (scale up during business hours)
- Prometheus — any Prometheus metric
- Cloud services — AWS CloudWatch, GCP Pub/Sub, Azure Event Hubs
Scale-to-zero tradeoffs#
Cold start latency: The first request after scale-to-zero waits for a pod to start. Container pull, init containers, application startup — this can be 5-30 seconds.
Mitigation strategies:
- Pre-pull images on nodes (DaemonSet that pulls images)
- Use lightweight base images (distroless, Alpine)
- Optimize application startup (lazy initialization, connection pooling on first request)
- Set minReplicaCount: 1 during business hours, 0 at night
Monitoring your HPA#
Always monitor the autoscaler itself, not just the workload.
# Check HPA status
kubectl get hpa web-app-hpa
# Detailed conditions
kubectl describe hpa web-app-hpa
Key things to watch:
- AbleToScale — can the HPA actually make changes?
- ScalingActive — is the HPA actively scaling?
- ScalingLimited — has the HPA hit min/max bounds?
- Current vs desired replicas — is there a gap?
Export these as Prometheus metrics and alert when the HPA is stuck at maxReplicas (you may need to increase the limit) or when ScalingActive is false (metrics pipeline might be broken).
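A small illustrative check over an HPA's state shows the shape such an alert could take. The condition names follow the autoscaling/v2 status schema, but the simplified dict and the alert wording are assumptions (in the real API, maxReplicas lives in spec, not status):

```python
def hpa_alerts(hpa: dict) -> list[str]:
    """Return human-readable alerts for common stuck-HPA situations."""
    alerts = []
    conditions = {c["type"]: c["status"] for c in hpa.get("conditions", [])}
    if conditions.get("ScalingActive") == "False":
        alerts.append("ScalingActive=False: metrics pipeline may be broken")
    if hpa.get("currentReplicas") == hpa.get("maxReplicas"):
        alerts.append("pinned at maxReplicas: consider raising the limit")
    return alerts

hpa = {
    "currentReplicas": 20,
    "maxReplicas": 20,
    "conditions": [{"type": "ScalingActive", "status": "True"}],
}
print(hpa_alerts(hpa))  # -> ['pinned at maxReplicas: consider raising the limit']
```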
Key takeaways#
- Always set resource requests — the HPA cannot compute utilization without them
- CPU alone is rarely enough — use custom metrics (queue depth, request count) for accurate scaling
- Tune stabilization windows — fast scale-up (0-60s), slow scale-down (300s+) prevents thrashing
- Use behavior policies to control the rate of scaling, not just the trigger
- KEDA enables scale-to-zero — essential for event-driven and cost-sensitive workloads
- Monitor the HPA itself — a broken metrics pipeline means no autoscaling
Article #435 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.