# Kubernetes Autoscaling: HPA, VPA, KEDA & Cluster Autoscaler
Kubernetes workloads rarely need a fixed number of replicas around the clock. Traffic spikes, batch jobs, and seasonal patterns all demand elastic capacity. Autoscaling lets the cluster match resources to real demand — saving cost during lulls and preserving latency during surges.
## The Four Layers of Kubernetes Autoscaling
Autoscaling in Kubernetes operates at multiple levels, each solving a different part of the puzzle:
```
┌──────────────────────────────────────────────────┐
│ Cluster Autoscaler                               │
│   Adds / removes nodes from the pool             │
├──────────────────────────────────────────────────┤
│ Horizontal Pod Autoscaler (HPA)                  │
│   Scales replica count based on metrics          │
├──────────────────────────────────────────────────┤
│ Vertical Pod Autoscaler (VPA)                    │
│   Adjusts CPU / memory requests per pod          │
├──────────────────────────────────────────────────┤
│ KEDA — Event-driven autoscaling                  │
│   Scales from / to zero based on external events │
└──────────────────────────────────────────────────┘
```
Understanding when to use each layer — and how they interact — is the key to a well-tuned cluster.
## Horizontal Pod Autoscaler (HPA)
The HPA watches one or more metrics and adjusts the replicas field on a Deployment, StatefulSet, or ReplicaSet.
### Basic CPU-Based HPA

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```
The controller computes `desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))`. If average CPU sits at 90 % with a target of 60 %, the HPA scales up by 50 %.
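The formula can be sketched in a few lines. This is a simplified model: the real controller also applies a small tolerance around the target and adjusts for unready pods and missing metrics.

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Simplified HPA formula: scale replica count by the metric ratio."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# 6 replicas averaging 90% CPU against a 60% target: ceil(6 * 1.5) = 9
print(desired_replicas(6, 90, 60))   # 9
# Under-utilized: 4 replicas at 30% against a 60% target: ceil(4 * 0.5) = 2
print(desired_replicas(4, 30, 60))   # 2
```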
### Custom and External Metrics
CPU alone is a poor proxy for many workloads. The HPA v2 API supports three metric types:
- Resource — CPU, memory (built-in metrics-server).
- Pods — Per-pod custom metrics exposed via the custom metrics API (e.g., requests-per-second from Prometheus Adapter).
- External — Metrics that do not map to any Kubernetes object, such as an SQS queue depth or a Pub/Sub subscription backlog.
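An External metric entry in the HPA spec might look like the following sketch for queue depth. The metric name and labels are illustrative; what is actually available depends on how your external metrics adapter exposes the queue.

```yaml
metrics:
- type: External
  external:
    metric:
      name: sqs_queue_depth        # illustrative; depends on your adapter
      selector:
        matchLabels:
          queue: order-queue
    target:
      type: AverageValue
      averageValue: "30"           # aim for ~30 messages per replica
```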
### Scaling Policies and Cooldown
HPA v2 introduced `behavior` to control the speed and stability of scaling:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 2
      periodSeconds: 60
```
- Stabilization window — The controller looks back over this window and uses the lowest recommendation when scaling up and the highest when scaling down, so replicas change only once the recommendation is stable. A longer window prevents flapping.
- Policies — You can cap scaling by absolute pod count or by percentage, and combine multiple policies. By default the policy permitting the greatest change wins (`selectPolicy: Max`); set `selectPolicy: Min` if you want the most restrictive policy to apply.
Best practice: scale up aggressively (short window, high percentage) and scale down conservatively (long window, small steps). Premature scale-down causes latency spikes that trigger another scale-up.
## Vertical Pod Autoscaler (VPA)
Where the HPA changes replica count, the VPA changes the resource requests and limits of individual pods.
### How VPA Works
- Recommender — Observes historical CPU and memory usage and computes target requests.
- Updater — Evicts pods whose requests deviate significantly from the recommendation.
- Admission Controller — Mutates the pod spec at creation time to apply the recommended values.
### Update Modes

| Mode | Behavior |
|---|---|
| Off | Only produces recommendations; no mutations |
| Initial | Sets requests at pod creation; never evicts |
| Auto | Sets requests at creation and evicts running pods to apply updates |
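A minimal VPA manifest in recommendation-only mode might look like this sketch (the target Deployment name is assumed):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommendations only; no evictions or mutations
```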
### VPA and HPA Together
Running VPA in Auto mode alongside an HPA that scales on CPU can cause a feedback loop: VPA raises requests, CPU utilization drops, HPA scales down, utilization rises, VPA raises again. Two safe patterns:
- Use VPA in `Off` or `Initial` mode and feed its recommendations into your deployment manifests during CI.
- Use HPA on a custom metric (requests-per-second) so it does not conflict with VPA adjusting CPU requests.
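The custom-metric pattern might look like this sketch: a Pods-type metric so the HPA tracks request rate instead of CPU. The metric name assumes a Prometheus Adapter rule exposing it under that name.

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second   # assumed custom metric
    target:
      type: AverageValue
      averageValue: "100"              # target ~100 req/s per pod
```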
## KEDA — Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) extends the HPA with 60+ scalers for external event sources.
### Why KEDA?
- Scale to zero — The HPA cannot reduce replicas below 1. KEDA can deactivate a deployment entirely when there is no work, then spin pods back up when events arrive.
- Rich scaler ecosystem — Kafka consumer lag, RabbitMQ queue length, Prometheus queries, cron schedules, AWS SQS, Azure Service Bus, and more.
- ScaledObject abstraction — A single CRD that wires a trigger to a target deployment.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0
  maxReplicaCount: 50
  cooldownPeriod: 120
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: orders
      topic: order-events
      lagThreshold: "10"
```
KEDA creates an HPA under the hood. The `cooldownPeriod` controls how long KEDA waits after the last trigger activation before scaling the deployment back down to `minReplicaCount` — analogous to the HPA stabilization window.
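Triggers are not limited to queues. A sketch of a cron trigger that pre-scales for known business hours (the timezone and schedule are placeholders):

```yaml
triggers:
- type: cron
  metadata:
    timezone: America/New_York    # placeholder
    start: 0 8 * * 1-5            # scale up at 08:00, Mon-Fri
    end: 0 18 * * 1-5             # scale back down at 18:00
    desiredReplicas: "10"
```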
## Cluster Autoscaler
The cluster autoscaler adjusts the number of nodes in a node group (or managed node pool) based on pod scheduling pressure.
### Scale-Up Trigger
When the scheduler cannot place a pod because no node has sufficient allocatable resources, the cluster autoscaler provisions a new node. It evaluates node group templates to find the cheapest option that satisfies the pending pod's requests.
### Scale-Down Trigger
A node becomes a candidate for removal when its utilization (sum of pod requests / allocatable) falls below a threshold (default 50 %) for a sustained period (default 10 minutes). The autoscaler checks:
- Are there pods that cannot be moved (DaemonSets are excluded, but PDBs, local storage, and system-critical pods block removal)?
- Can the remaining nodes absorb the evicted pods?
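Because PDBs gate evictions during scale-down, every workload you expect the autoscaler to move should carry one. A minimal sketch (the label selector is assumed to match your deployment's pods):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2          # never evict below 2 ready pods
  selector:
    matchLabels:
      app: api-server
```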
### Tuning Parameters

| Parameter | Default | Recommendation |
|---|---|---|
| `scan-interval` | 10 s | Keep default |
| `scale-down-unneeded-time` | 10 min | Increase to 15–20 min for bursty workloads |
| `scale-down-utilization-threshold` | 0.5 | Lower to 0.35 for cost optimization |
| `max-node-provision-time` | 15 min | Reduce to 5 min with warm pools |
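These parameters are passed as flags on the cluster-autoscaler container. A sketch applying the tuned values from the table above (these are the table's suggestions, not universal defaults):

```yaml
command:
- ./cluster-autoscaler
- --scan-interval=10s
- --scale-down-unneeded-time=15m
- --scale-down-utilization-threshold=0.35
- --max-node-provision-time=5m
```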
## Right-Sizing Pods
Autoscalers can only work well when the resource requests on your pods reflect reality. Over-requested pods waste capacity; under-requested pods get throttled or OOM-killed.
### A Practical Right-Sizing Workflow

1. Deploy with generous requests and VPA in `Off` mode.
2. Let VPA collect at least 24 hours (ideally 7 days) of usage data.
3. Review VPA recommendations: `kubectl get vpa -o jsonpath='{.status.recommendation}'`.
4. Set requests to the VPA target and limits to 2–3x the target (or remove CPU limits entirely — CPU is compressible).
5. Re-enable HPA and monitor p99 latency for regressions.
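Applying the "set requests to the VPA target" step might produce a container spec like this, assuming a hypothetical VPA target of 250m CPU / 300Mi memory:

```yaml
resources:
  requests:
    cpu: 250m        # VPA target (assumed)
    memory: 300Mi    # VPA target (assumed)
  limits:
    memory: 600Mi    # ~2x target; no CPU limit (CPU is compressible)
```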
### Memory vs. CPU
- CPU — Compressible. Pods are throttled, not killed, when they hit their CPU limit; without a limit they can burst into idle node capacity. Many teams remove CPU limits altogether and rely on requests for scheduling.
- Memory — Incompressible. Exceeding the limit triggers an OOM kill. Set memory limits close to (but above) the VPA upper-bound recommendation.
## Putting It All Together
A production-grade autoscaling stack typically combines:

```
Traffic ──▶ HPA (requests/sec) ──▶ scales Deployment replicas
VPA (Off mode) ──▶ CI pipeline ──▶ updates resource requests
Cluster Autoscaler ──▶ adds nodes when pods are pending
KEDA ──▶ scales async workers from 0 based on queue depth
```
## Checklist
- Set meaningful resource requests on every pod.
- Use HPA with a custom metric that reflects user-facing load.
- Use VPA in `Off` mode to inform request tuning in CI.
- Configure HPA `behavior` for asymmetric scaling (fast up, slow down).
- Enable cluster autoscaler with appropriate node group sizing.
- Use KEDA for event-driven and scale-to-zero workloads.
- Set Pod Disruption Budgets to protect availability during scale-down.
- Monitor autoscaler decisions with dedicated dashboards and alerts.
## Conclusion
Kubernetes autoscaling is not a single switch you flip. It is a layered system — HPA for replica count, VPA for resource requests, KEDA for event-driven workloads, and the cluster autoscaler for node capacity. Each layer has its own metrics, policies, and failure modes. Master them individually, compose them carefully, and your cluster will handle traffic surges without burning budget during quiet hours.
Article #316 of the Codelit system design series. Explore all articles at codelit.io.