# Kubernetes Autoscaling: HPA, VPA, KEDA & Cluster Autoscaler
Kubernetes workloads rarely need a fixed number of replicas around the clock. Traffic spikes, batch jobs, and seasonal patterns all demand elastic capacity. Autoscaling lets the cluster match resources to real demand — saving cost during lulls and preserving latency during surges.
## The Four Layers of Kubernetes Autoscaling
Autoscaling in Kubernetes operates at multiple levels, each solving a different part of the puzzle:
```
┌──────────────────────────────────────────────────┐
│ Cluster Autoscaler                               │
│   Adds / removes nodes from the pool             │
├──────────────────────────────────────────────────┤
│ Horizontal Pod Autoscaler (HPA)                  │
│   Scales replica count based on metrics          │
├──────────────────────────────────────────────────┤
│ Vertical Pod Autoscaler (VPA)                    │
│   Adjusts CPU / memory requests per pod          │
├──────────────────────────────────────────────────┤
│ KEDA — Event-driven autoscaling                  │
│   Scales from / to zero based on external events │
└──────────────────────────────────────────────────┘
```
Understanding when to use each layer — and how they interact — is the key to a well-tuned cluster.
## Horizontal Pod Autoscaler (HPA)
The HPA watches one or more metrics and adjusts the replicas field on a Deployment, StatefulSet, or ReplicaSet.
### Basic CPU-Based HPA

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```
The controller computes `desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))`. If average CPU sits at 90 % with a target of 60 %, the HPA scales up by 50 %.
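The formula can be sketched in a few lines. This is a simplified model: the real controller also applies a small tolerance around the target and adjusts for unready pods and missing metrics.

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Simplified HPA formula: scale replica count by the metric ratio."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# 6 replicas averaging 90% CPU against a 60% target: ceil(6 * 1.5) = 9
print(desired_replicas(6, 90, 60))   # 9
# Under-utilized: 4 replicas at 30% against a 60% target: ceil(4 * 0.5) = 2
print(desired_replicas(4, 30, 60))   # 2
```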
### Custom and External Metrics
CPU alone is a poor proxy for many workloads. The HPA v2 API supports three metric types:
- Resource — CPU, memory (built-in metrics-server).
- Pods — Per-pod custom metrics exposed via the custom metrics API (e.g., requests-per-second from Prometheus Adapter).
- External — Metrics that do not map to any Kubernetes object, such as an SQS queue depth or a Pub/Sub subscription backlog.
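An External metric entry in the HPA spec might look like the following sketch for queue depth. The metric name and labels are illustrative; what is actually available depends on how your external metrics adapter exposes the queue.

```yaml
metrics:
- type: External
  external:
    metric:
      name: sqs_queue_depth        # illustrative; depends on your adapter
      selector:
        matchLabels:
          queue: order-queue
    target:
      type: AverageValue
      averageValue: "30"           # aim for ~30 messages per replica
```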
### Scaling Policies and Cooldown
HPA v2 introduced `behavior` to control the speed and stability of scaling:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 2
      periodSeconds: 60
```
- Stabilization window — The controller looks back over this window and uses the lowest recommendation when scaling up and the highest when scaling down, so replicas change only once the recommendation is stable. A longer window prevents flapping.
- Policies — You can cap scaling by absolute pod count or by percentage, and combine multiple policies. By default the policy permitting the greatest change wins (`selectPolicy: Max`); set `selectPolicy: Min` if you want the most restrictive policy to apply.
Best practice: scale up aggressively (short window, high percentage) and scale down conservatively (long window, small steps). Premature scale-down causes latency spikes that trigger another scale-up.
## Vertical Pod Autoscaler (VPA)
Where the HPA changes replica count, the VPA changes the resource requests and limits of individual pods.
### How VPA Works
- Recommender — Observes historical CPU and memory usage and computes target requests.
- Updater — Evicts pods whose requests deviate significantly from the recommendation.
- Admission Controller — Mutates the pod spec at creation time to apply the recommended values.
### Update Modes

| Mode | Behavior |
|---|---|
| Off | Only produces recommendations; no mutations |
| Initial | Sets requests at pod creation; never evicts |
| Auto | Sets requests at creation and evicts running pods to apply updates |
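A minimal VPA manifest in recommendation-only mode might look like this sketch (the target Deployment name is assumed):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommendations only; no evictions or mutations
```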
### VPA and HPA Together
Running VPA in Auto mode alongside an HPA that scales on CPU can cause a feedback loop: VPA raises requests, CPU utilization drops, HPA scales down, utilization rises, VPA raises again. Two safe patterns:
- Use VPA in `Off` or `Initial` mode and feed its recommendations into your deployment manifests during CI.
- Use HPA on a custom metric (requests-per-second) so it does not conflict with VPA adjusting CPU requests.
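The custom-metric pattern might look like this sketch: a Pods-type metric so the HPA tracks request rate instead of CPU. The metric name assumes a Prometheus Adapter rule exposing it under that name.

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second   # assumed custom metric
    target:
      type: AverageValue
      averageValue: "100"              # target ~100 req/s per pod
```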
## KEDA — Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) extends the HPA with 60+ scalers for external event sources.
### Why KEDA?
- Scale to zero — The HPA cannot reduce replicas below 1. KEDA can deactivate a deployment entirely when there is no work, then spin pods back up when events arrive.
- Rich scaler ecosystem — Kafka consumer lag, RabbitMQ queue length, Prometheus queries, cron schedules, AWS SQS, Azure Service Bus, and more.
- ScaledObject abstraction — A single CRD that wires a trigger to a target deployment.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0
  maxReplicaCount: 50
  cooldownPeriod: 120
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: orders
      topic: order-events
      lagThreshold: "10"
```
KEDA creates an HPA under the hood. The `cooldownPeriod` controls how long KEDA waits after the last trigger activation before scaling the deployment back down to `minReplicaCount` — analogous to the HPA stabilization window.
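Triggers are not limited to queues. A sketch of a cron trigger that pre-scales for known business hours (the timezone and schedule are placeholders):

```yaml
triggers:
- type: cron
  metadata:
    timezone: America/New_York    # placeholder
    start: 0 8 * * 1-5            # scale up at 08:00, Mon-Fri
    end: 0 18 * * 1-5             # scale back down at 18:00
    desiredReplicas: "10"
```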
## Cluster Autoscaler
The cluster autoscaler adjusts the number of nodes in a node group (or managed node pool) based on pod scheduling pressure.
### Scale-Up Trigger
When the scheduler cannot place a pod because no node has sufficient allocatable resources, the cluster autoscaler provisions a new node. It evaluates node group templates to find the cheapest option that satisfies the pending pod's requests.
### Scale-Down Trigger
A node becomes a candidate for removal when its utilization (sum of pod requests / allocatable) falls below a threshold (default 50 %) for a sustained period (default 10 minutes). The autoscaler checks:
- Are there pods that cannot be moved (DaemonSets are excluded, but PDBs, local storage, and system-critical pods block removal)?
- Can the remaining nodes absorb the evicted pods?
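Because PDBs gate evictions during scale-down, every workload you expect the autoscaler to move should carry one. A minimal sketch (the label selector is assumed to match your deployment's pods):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2          # never evict below 2 ready pods
  selector:
    matchLabels:
      app: api-server
```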
### Tuning Parameters

| Parameter | Default | Recommendation |
|---|---|---|
| `scan-interval` | 10 s | Keep default |
| `scale-down-unneeded-time` | 10 min | Increase to 15–20 min for bursty workloads |
| `scale-down-utilization-threshold` | 0.5 | Lower to 0.35 for cost optimization |
| `max-node-provision-time` | 15 min | Reduce to 5 min with warm pools |
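These parameters are passed as flags on the cluster-autoscaler container. A sketch applying the tuned values from the table above (these are the table's suggestions, not universal defaults):

```yaml
command:
- ./cluster-autoscaler
- --scan-interval=10s
- --scale-down-unneeded-time=15m
- --scale-down-utilization-threshold=0.35
- --max-node-provision-time=5m
```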
## Right-Sizing Pods
Autoscalers can only work well when the resource requests on your pods reflect reality. Over-requested pods waste capacity; under-requested pods get throttled or OOM-killed.
### A Practical Right-Sizing Workflow

1. Deploy with generous requests and VPA in `Off` mode.
2. Let VPA collect at least 24 hours (ideally 7 days) of usage data.
3. Review VPA recommendations: `kubectl get vpa -o jsonpath='{.status.recommendation}'`.
4. Set requests to the VPA target and limits to 2–3x the target (or remove CPU limits entirely — CPU is compressible).
5. Re-enable HPA and monitor p99 latency for regressions.
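Applying the "set requests to the VPA target" step might produce a container spec like this, assuming a hypothetical VPA target of 250m CPU / 300Mi memory:

```yaml
resources:
  requests:
    cpu: 250m        # VPA target (assumed)
    memory: 300Mi    # VPA target (assumed)
  limits:
    memory: 600Mi    # ~2x target; no CPU limit (CPU is compressible)
```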
### Memory vs. CPU
- CPU — Compressible. Pods are throttled, not killed, when they hit their CPU limit; without a limit they can burst into idle node capacity. Many teams remove CPU limits altogether and rely on requests for scheduling.
- Memory — Incompressible. Exceeding the limit triggers an OOM kill. Set memory limits close to (but above) the VPA upper-bound recommendation.
## Putting It All Together
A production-grade autoscaling stack typically combines:

```
Traffic ──▶ HPA (requests/sec) ──▶ scales Deployment replicas
VPA (Off mode) ──▶ CI pipeline ──▶ updates resource requests
Cluster Autoscaler ──▶ adds nodes when pods are pending
KEDA ──▶ scales async workers from 0 based on queue depth
```
## Checklist
- Set meaningful resource requests on every pod.
- Use HPA with a custom metric that reflects user-facing load.
- Use VPA in `Off` mode to inform request tuning in CI.
- Configure HPA `behavior` for asymmetric scaling (fast up, slow down).
- Enable cluster autoscaler with appropriate node group sizing.
- Use KEDA for event-driven and scale-to-zero workloads.
- Set Pod Disruption Budgets to protect availability during scale-down.
- Monitor autoscaler decisions with dedicated dashboards and alerts.
## Conclusion
Kubernetes autoscaling is not a single switch you flip. It is a layered system — HPA for replica count, VPA for resource requests, KEDA for event-driven workloads, and the cluster autoscaler for node capacity. Each layer has its own metrics, policies, and failure modes. Master them individually, compose them carefully, and your cluster will handle traffic surges without burning budget during quiet hours.
Article #316 of the Codelit system design series. Explore all articles at codelit.io.