Kubernetes HPA Deep Dive — CPU, Custom Metrics, Scaling Policies, and KEDA
What the HPA does#
The Horizontal Pod Autoscaler watches metrics and adjusts the number of pod replicas. More traffic means more pods. Less traffic means fewer pods. You define the rules — Kubernetes executes them.
Metrics Server → HPA controller → Deployment replica count → Scheduler → Pods
The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). Each iteration: fetch metrics, compute desired replicas, update the deployment.
CPU and memory metrics#
The simplest HPA configuration targets CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
How the formula works#
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
If you have 4 replicas at 90% CPU and your target is 70%:
desiredReplicas = ceil(4 * (90 / 70)) = ceil(5.14) = 6
The HPA scales to 6 replicas.
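The calculation above can be sketched in a few lines of Python. This is an illustrative model, not controller code; it also includes the controller's default 10% tolerance (configurable via --horizontal-pod-autoscaler-tolerance), which suppresses scaling when the metric is close to target.

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Model the core HPA formula, including the dead band:
    if the ratio is within +/- tolerance of 1.0, no scaling happens."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave replicas alone
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 70))  # 4 pods at 90% CPU, target 70% -> 6
print(desired_replicas(4, 72, 70))  # within the 10% tolerance -> stays at 4
```

The tolerance is why a cluster sitting at 72% against a 70% target does not flap: the ratio 1.03 falls inside the dead band.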
Resource requests matter#
CPU utilization is calculated relative to the pod's resource request, not the node capacity. If your pod requests 500m CPU and uses 350m, utilization is 70%.
resources:
  requests:
    cpu: 500m
    memory: 256Mi
No resource request means the HPA cannot calculate utilization. Always set requests.
Custom metrics (Prometheus adapter)#
CPU and memory are not enough for most workloads. A web server might have low CPU but a queue of 10,000 pending requests. You need custom metrics.
Architecture#
Your App → Prometheus (scrape) → Prometheus Adapter → Kubernetes Custom Metrics API → HPA
Step 1: Expose metrics from your app#
# Your app exposes a /metrics endpoint
http_requests_in_flight 42
http_request_queue_length 1500
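A minimal sketch of such an endpoint using only the Python standard library (no Prometheus client library; the metric names match the example above and the gauge values are placeholders — a real app would track live state):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder gauges -- a real app would update these from live state.
METRICS = {
    "http_requests_in_flight": 42,
    "http_request_queue_length": 1500,
}

def render_metrics(metrics: dict) -> str:
    """Render gauges in the Prometheus text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), MetricsHandler).serve_forever()
```

In production you would normally use an official Prometheus client library, which handles registries, label escaping, and content negotiation for you.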
Step 2: Install Prometheus adapter#
The adapter translates Prometheus queries into the Kubernetes custom metrics API.
# prometheus-adapter config
rules:
- seriesQuery: 'http_requests_in_flight{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "${1}"
  # in-flight requests is a gauge, so sum the current values;
  # rate() is only meaningful for counters
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Step 3: Reference custom metrics in HPA#
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_in_flight
    target:
      type: AverageValue
      averageValue: 100
Now the HPA scales based on in-flight requests, targeting an average of 100 concurrent requests per pod: when the average climbs above 100, the HPA adds pods until it falls back under the target.
Common custom metrics#
- Request queue length — scale when work is piling up
- Consumer lag (Kafka) — scale consumers when messages back up
- Active connections — scale based on connection count
- Custom business metrics — orders per second, jobs in queue
Scaling behavior policies#
Kubernetes v2 HPA lets you control how fast scaling happens, not just when.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
    - type: Pods
      value: 4
      periodSeconds: 60
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
    selectPolicy: Min
What this means#
Scale up: Every 60 seconds, add up to 100% more pods OR 4 pods (whichever is higher). Aggressive — respond fast to load spikes.
Scale down: Every 60 seconds, remove at most 10% of pods. Conservative — avoid thrashing on temporary dips.
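The per-period limits above can be sketched as follows. This is an illustrative Python model of the example configuration, not controller code, and the rounding is an assumption (the real controller's rounding may differ at the margins):

```python
import math

def scale_up_allowed(current: int) -> int:
    """Highest replica count one 60s period permits under the example
    scaleUp policies: +100% OR +4 pods, with selectPolicy: Max."""
    by_percent = current + math.ceil(current * 100 / 100)  # add up to 100% more
    by_pods = current + 4                                  # add up to 4 pods
    return max(by_percent, by_pods)  # Max -> the more permissive policy wins

def scale_down_allowed(current: int) -> int:
    """Lowest replica count one 60s period permits under the example
    scaleDown policy: remove at most 10% of pods."""
    return current - math.ceil(current * 10 / 100)

print(scale_up_allowed(3))     # -> 7  (+4 pods beats +100%)
print(scale_up_allowed(10))    # -> 20 (+100% beats +4 pods)
print(scale_down_allowed(10))  # -> 9
```

Notice how the two scale-up policies complement each other: the Pods policy dominates at small replica counts, the Percent policy dominates at large ones.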
selectPolicy#
- Max — use the policy that allows the most change (aggressive)
- Min — use the policy that allows the least change (conservative)
- Disabled — prevent scaling in that direction entirely
Cooldown and stabilization windows#
The problem: thrashing#
Without stabilization, the HPA oscillates:
10:00 — CPU spikes to 90% → scale up to 8 pods
10:01 — 8 pods reduce CPU to 40% → scale down to 4 pods
10:02 — 4 pods, CPU spikes again → scale up to 8 pods
This is thrashing. Pods are constantly created and destroyed.
Stabilization window#
The stabilization window looks at all desired replica recommendations over the window period and picks the most conservative one.
scaleDown:
  stabilizationWindowSeconds: 300  # 5 minutes
For scale-down: the HPA looks at all computed desired replica counts in the last 5 minutes and picks the highest one. It only scales down if all recommendations in the window agree.
For scale-up: same logic but picks the lowest recommendation. Usually set shorter (0-60 seconds) because you want to respond quickly to load.
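The window logic reduces to picking the most conservative recommendation. A small illustrative sketch (the real controller timestamps each recommendation and expires them individually; here the window is just a list):

```python
def stabilized(recommendations: list[int], direction: str) -> int:
    """Pick the most conservative recommendation from the window:
    for scale-down the highest count, for scale-up the lowest."""
    return max(recommendations) if direction == "down" else min(recommendations)

# Desired-replica recommendations over the last 5 minutes,
# including one transient dip at 10:02:
window = [8, 8, 4, 8, 8]
print(stabilized(window, "down"))  # -> 8: the dip to 4 is ignored
```

The transient dip never reaches the deployment, which is exactly what prevents the thrashing scenario above.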
Default values#
| Parameter | Default |
|---|---|
| Scale-up stabilization | 0 seconds (immediate) |
| Scale-down stabilization | 300 seconds (5 minutes) |
Practical tuning#
- Bursty traffic (API servers): short scale-up window (0-30s), long scale-down window (5-10 min)
- Steady traffic (batch workers): moderate both directions (60s up, 300s down)
- Cost-sensitive workloads: aggressive scale-down (60-120s) with conservative policies
Scale-to-zero with KEDA#
The built-in HPA has a hard floor: minReplicas must be at least 1 (barring the alpha HPAScaleToZero feature gate). You always pay for at least one pod, even with zero traffic.
KEDA (Kubernetes Event-Driven Autoscaling) solves this.
How KEDA works#
Event Source → KEDA Scaler → KEDA Operator → HPA (1→N) or Deployment (0→1)
KEDA handles the 0-to-1 and 1-to-0 transitions. Once at least 1 pod exists, it creates a standard HPA to handle 1-to-N scaling.
Example: scale on Kafka consumer lag#
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0   # KEDA allows zero
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: order-group
      topic: orders
      lagThreshold: "100"
When there are zero messages in the topic, KEDA scales to zero pods. When messages arrive, KEDA spins up a pod, then the HPA takes over for further scaling.
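KEDA's decision for this trigger can be approximated as follows. This is an illustrative sketch, not KEDA's actual code: the real Kafka scaler also honors activation thresholds, pollingInterval, and caps replicas at the topic's partition count.

```python
import math

def keda_desired_replicas(total_lag: int, lag_threshold: int = 100,
                          min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Approximate the Kafka scaler: one replica per lagThreshold messages
    of consumer lag, clamped to [min, max]; zero lag -> scale to zero."""
    if total_lag == 0:
        return min_replicas  # idle topic: no pods at all
    desired = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(desired, max_replicas))

print(keda_desired_replicas(0))     # -> 0  (idle topic)
print(keda_desired_replicas(250))   # -> 3
print(keda_desired_replicas(9999))  # -> 50 (capped at maxReplicaCount)
```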
KEDA scalers#
KEDA supports 60+ event sources out of the box:
- Message queues — Kafka, RabbitMQ, SQS, Azure Service Bus
- Databases — PostgreSQL, MySQL, MongoDB (query-based)
- HTTP — scale on request rate (via KEDA HTTP add-on)
- Cron — schedule-based scaling (scale up during business hours)
- Prometheus — any Prometheus metric
- Cloud services — AWS CloudWatch, GCP Pub/Sub, Azure Event Hubs
Scale-to-zero tradeoffs#
Cold start latency: The first request after scale-to-zero waits for a pod to start. Container pull, init containers, application startup — this can be 5-30 seconds.
Mitigation strategies:
- Pre-pull images on nodes (DaemonSet that pulls images)
- Use lightweight base images (distroless, Alpine)
- Optimize application startup (lazy initialization, connection pooling on first request)
- Set minReplicaCount: 1 during business hours, 0 at night
Monitoring your HPA#
Always monitor the autoscaler itself, not just the workload.
# Check HPA status
kubectl get hpa web-app-hpa
# Detailed conditions
kubectl describe hpa web-app-hpa
Key things to watch:
- AbleToScale — can the HPA actually make changes?
- ScalingActive — is the HPA actively scaling?
- ScalingLimited — has the HPA hit min/max bounds?
- Current vs desired replicas — is there a gap?
Export these as Prometheus metrics and alert when the HPA is stuck at maxReplicas (you may need to increase the limit) or when ScalingActive is false (metrics pipeline might be broken).
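A small illustrative check over an HPA's state shows the shape such an alert could take. The condition names follow the autoscaling/v2 status schema, but the simplified dict and the alert wording are assumptions (in the real API, maxReplicas lives in spec, not status):

```python
def hpa_alerts(hpa: dict) -> list[str]:
    """Return human-readable alerts for common stuck-HPA situations."""
    alerts = []
    conditions = {c["type"]: c["status"] for c in hpa.get("conditions", [])}
    if conditions.get("ScalingActive") == "False":
        alerts.append("ScalingActive=False: metrics pipeline may be broken")
    if hpa.get("currentReplicas") == hpa.get("maxReplicas"):
        alerts.append("pinned at maxReplicas: consider raising the limit")
    return alerts

hpa = {
    "currentReplicas": 20,
    "maxReplicas": 20,
    "conditions": [{"type": "ScalingActive", "status": "True"}],
}
print(hpa_alerts(hpa))  # -> ['pinned at maxReplicas: consider raising the limit']
```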
Key takeaways#
- Always set resource requests — the HPA cannot compute utilization without them
- CPU alone is rarely enough — use custom metrics (queue depth, request count) for accurate scaling
- Tune stabilization windows — fast scale-up (0-60s), slow scale-down (300s+) prevents thrashing
- Use behavior policies to control the rate of scaling, not just the trigger
- KEDA enables scale-to-zero — essential for event-driven and cost-sensitive workloads
- Monitor the HPA itself — a broken metrics pipeline means no autoscaling
Article #435 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.