Kubernetes Resource Management: Requests, Limits, QoS & Beyond
Kubernetes schedules containers onto nodes with finite CPU and memory. If you do not tell the scheduler how much each container needs, you get noisy neighbors, OOM kills, and unpredictable performance. Resource management is how you communicate your workload's needs and how Kubernetes enforces them.
Requests vs Limits#
Every container can declare two resource values for CPU and memory:
- Requests — The scheduler's reservation. The scheduler uses this value to place pods on nodes with enough unreserved capacity, and the container is guaranteed at least this much under contention.
- Limits — The hard ceiling. A container that exceeds its CPU limit is throttled; one that exceeds its memory limit is OOM killed.
```yaml
resources:
  requests:
    cpu: "250m"      # 0.25 CPU cores guaranteed
    memory: "256Mi"  # 256 MiB guaranteed
  limits:
    cpu: "1000m"     # 1 CPU core maximum
    memory: "512Mi"  # 512 MiB maximum — OOM kill if exceeded
```
CPU vs Memory Behavior#
CPU and memory are enforced differently:
| Resource | Compressible | Over-limit behavior |
|---|---|---|
| CPU | Yes | Throttled — the container gets fewer cycles but keeps running |
| Memory | No | OOM killed — the kernel terminates the process |
This distinction is critical. You can be generous with CPU limits (throttling is survivable) but must be precise with memory limits (OOM kills restart your container).
QoS Classes#
Kubernetes assigns each pod a Quality of Service class based on its resource configuration. QoS determines eviction priority when a node runs out of resources.
Guaranteed#
All containers in the pod have requests equal to limits for both CPU and memory.
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"      # same as request
    memory: "256Mi"  # same as request
```
Guaranteed pods are the last to be evicted. Use this for critical workloads like databases and stateful services.
Burstable#
At least one container has a request or limit set, but the pod does not qualify as Guaranteed (requests and limits differ, or some values are unset).
```yaml
resources:
  requests:
    cpu: "250m"
    memory: "128Mi"
  limits:
    cpu: "1000m"     # different from request
    memory: "512Mi"  # different from request
```
Burstable pods are evicted after BestEffort pods. Most application workloads fall here.
BestEffort#
No requests or limits are set on any container.
```yaml
resources: {}
```
BestEffort pods are evicted first. Never run production workloads without resource requests.
Eviction order (node under pressure):
1. BestEffort ← evicted first
2. Burstable ← evicted based on usage vs request ratio
3. Guaranteed ← evicted last (only if node is critically low)
LimitRange#
A LimitRange sets default and maximum resource values for containers in a namespace. It prevents developers from deploying pods without resource declarations.
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:           # default limits for containers that declare none
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:    # default requests
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4"
        memory: "8Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
```
When a pod is created without resource specs, the LimitRange injects the defaults. When a pod exceeds the max, the API server rejects it.
ResourceQuota#
While LimitRange controls individual containers, ResourceQuota controls aggregate resource consumption across all pods in a namespace.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "50"
    persistentvolumeclaims: "10"
```
This prevents a single team from consuming all cluster resources. When the quota is reached, new pod creation is rejected until existing pods are removed or the quota is increased.
OOM Kills: Causes and Prevention#
An OOM (Out of Memory) kill happens when a container's memory usage exceeds its limit. The Linux kernel's OOM killer terminates the process, and Kubernetes restarts the container.
Common Causes#
- Memory limit set too low — The application legitimately needs more memory.
- Memory leaks — Slow growth over time until the limit is hit.
- JVM heap misconfiguration — The JVM allocates heap larger than the container limit.
- Large request payloads — Unbounded request body parsing.
- Cache without eviction — In-memory caches grow without bound.
Diagnosis#
```bash
kubectl describe pod my-app-xyz
# Look for: Last State: Terminated, Reason: OOMKilled, Exit Code: 137

kubectl top pod my-app-xyz
# Check current memory usage relative to limits
```
Prevention#
- Set memory requests based on steady-state usage (p50).
- Set memory limits based on peak usage (p99) plus a 20-30% buffer.
- For JVMs, set `-Xmx` to roughly 75% of the container memory limit to leave room for non-heap memory.
- Use memory-bounded caches with eviction policies.
- Load-test to find actual memory profiles before setting limits.
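As a concrete sketch of the JVM guidance above (the container name and values are illustrative, not from the original): with a 512Mi limit, capping the heap at about 75% leaves headroom for metaspace, thread stacks, and direct buffers.

```yaml
containers:
  - name: my-app
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-Xmx384m"    # ~75% of the 512Mi limit
    resources:
      requests:
        memory: "384Mi"
      limits:
        memory: "512Mi"
```

On JDK 10+, `-XX:MaxRAMPercentage=75.0` achieves the same ratio without hard-coding the heap size, so the flag stays correct if the limit changes.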
CPU Throttling#
When a container exceeds its CPU limit, the kernel's CFS (Completely Fair Scheduler) throttles it. The container does not get killed — it just gets fewer CPU cycles.
How CFS Throttling Works#
Kubernetes translates CPU limits into CFS quota and period:
```
CPU limit:  500m (0.5 cores)
CFS period: 100ms
CFS quota:  50ms
```
The container gets 50ms of CPU time per 100ms period.
If it uses all 50ms in the first 20ms, it is throttled for 80ms.
Throttling Symptoms#
- High request latency that does not correlate with load.
- `container_cpu_cfs_throttled_seconds_total` metric increasing.
- Application appears slow despite low CPU utilization in monitoring.
Mitigation#
Some teams remove CPU limits entirely and rely only on requests. This prevents throttling while still guaranteeing scheduling. The trade-off is that a runaway process can starve neighbors on the same node.
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    # cpu: omitted — no throttling
    memory: "512Mi"
```
This is a valid strategy when nodes run a small number of well-understood workloads.
Vertical Pod Autoscaler (VPA)#
Setting the right requests and limits is hard. The Vertical Pod Autoscaler monitors actual resource usage and recommends (or automatically applies) better values.
VPA Modes#
| Mode | Behavior |
|---|---|
| Off | VPA computes recommendations but does not apply them |
| Initial | VPA sets resources only when pods are created |
| Auto | VPA evicts and recreates pods with updated resources |
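The modes above map to the `updateMode` field on the VPA object. A minimal sketch, assuming the VPA components are installed in the cluster and a Deployment named `my-app` exists:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # compute recommendations only; do not apply them
```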
VPA Recommendation Example#
```
kubectl describe vpa my-app-vpa

Recommendation:
  Container: my-app
    Lower Bound: cpu: 100m, memory: 128Mi
    Target:      cpu: 250m, memory: 320Mi
    Upper Bound: cpu: 800m, memory: 640Mi
    Uncapped:    cpu: 250m, memory: 320Mi
```
- Target is the recommended request value.
- Lower Bound and Upper Bound define the confidence interval.
- Start with VPA in Off mode to review recommendations before enabling auto-scaling.
VPA Caveats#
- VPA recreates pods to change resources, causing brief downtime. Run multiple replicas.
- VPA and HPA (Horizontal Pod Autoscaler) should not both target CPU. Use HPA for scaling replicas and VPA for right-sizing individual pods on memory.
- VPA needs several days of usage data to produce stable recommendations.
Resource Management Strategy#
A practical approach for production clusters:
- Always set requests — Never deploy BestEffort pods in production.
- Set memory limits — OOM kills are preferable to nodes running out of memory.
- Consider omitting CPU limits — Throttling causes hidden latency; rely on requests for scheduling.
- Use LimitRange — Enforce defaults so no pod slips through without resource declarations.
- Use ResourceQuota — Prevent any single namespace from monopolizing the cluster.
- Deploy VPA in Off mode — Review recommendations quarterly and adjust requests.
- Monitor throttling and OOM kills — Alert on `container_cpu_cfs_throttled_seconds_total` and OOMKilled restart reasons.
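The monitoring advice above can be sketched as Prometheus alerting rules (assuming the Prometheus Operator and kube-state-metrics are deployed; the rule names and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-alerts
spec:
  groups:
    - name: resources
      rules:
        - alert: CPUThrottlingHigh
          # Sustained throttling over 15 minutes
          expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 1
          for: 15m
        - alert: PodOOMKilled
          # kube-state-metrics exposes the last termination reason
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```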
Key Takeaways#
- Requests guarantee scheduling capacity; limits enforce ceilings. Set both for memory, and at minimum requests for CPU.
- QoS classes (Guaranteed, Burstable, BestEffort) determine eviction priority — never run production without requests.
- LimitRange and ResourceQuota provide namespace-level guardrails against resource sprawl.
- OOM kills are caused by exceeding memory limits; diagnose with `kubectl describe pod` and prevent with proper profiling.
- CPU throttling is silent and insidious — monitor CFS throttling metrics and consider omitting CPU limits.
- VPA automates right-sizing but requires multiple replicas to handle pod recreation.