Kubernetes Resource Management: Requests, Limits, QoS & Beyond
Kubernetes schedules containers onto nodes with finite CPU and memory. If you do not tell the scheduler how much each container needs, you get noisy neighbors, OOM kills, and unpredictable performance. Resource management is how you communicate your workload's needs and how Kubernetes enforces them.
Requests vs Limits#
Every container can declare two resource values for CPU and memory:
- Requests — The scheduler's reservation. The scheduler uses this value to place pods on nodes with enough unreserved capacity, and the container is guaranteed at least this much under contention.
- Limits — The hard ceiling. A container that exceeds its CPU limit is throttled; one that exceeds its memory limit is OOM killed.
```yaml
resources:
  requests:
    cpu: "250m"      # 0.25 CPU cores guaranteed
    memory: "256Mi"  # 256 MiB guaranteed
  limits:
    cpu: "1000m"     # 1 CPU core maximum
    memory: "512Mi"  # 512 MiB maximum — OOM kill if exceeded
```
CPU vs Memory Behavior#
CPU and memory are enforced differently:
| Resource | Compressible | Over-limit behavior |
|---|---|---|
| CPU | Yes | Throttled — the container gets fewer cycles but keeps running |
| Memory | No | OOM killed — the kernel terminates the process |
This distinction is critical. You can be generous with CPU limits (throttling is survivable) but must be precise with memory limits (OOM kills restart your container).
QoS Classes#
Kubernetes assigns each pod a Quality of Service class based on its resource configuration. QoS determines eviction priority when a node runs out of resources.
Guaranteed#
All containers in the pod have requests equal to limits for both CPU and memory.
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"      # same as request
    memory: "256Mi"  # same as request
```
Guaranteed pods are the last to be evicted. Use this for critical workloads like databases and stateful services.
Burstable#
At least one container has a request or limit set, but the pod does not qualify as Guaranteed (requests and limits differ, or some values are unset).
```yaml
resources:
  requests:
    cpu: "250m"
    memory: "128Mi"
  limits:
    cpu: "1000m"     # different from request
    memory: "512Mi"  # different from request
```
Burstable pods are evicted after BestEffort pods. Most application workloads fall here.
BestEffort#
No requests or limits are set on any container.
```yaml
resources: {}
```
BestEffort pods are evicted first. Never run production workloads without resource requests.
Eviction order (node under pressure):
1. BestEffort ← evicted first
2. Burstable ← evicted based on usage vs request ratio
3. Guaranteed ← evicted last (only if node is critically low)
LimitRange#
A LimitRange sets default and maximum resource values for containers in a namespace. It prevents developers from deploying pods without resource declarations.
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:           # default limits for containers that declare none
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:    # default requests
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4"
        memory: "8Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
```
When a pod is created without resource specs, the LimitRange injects the defaults. When a pod exceeds the max, the API server rejects it.
ResourceQuota#
While LimitRange controls individual containers, ResourceQuota controls aggregate resource consumption across all pods in a namespace.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "50"
    persistentvolumeclaims: "10"
```
This prevents a single team from consuming all cluster resources. When the quota is reached, new pod creation is rejected until existing pods are removed or the quota is increased.
OOM Kills: Causes and Prevention#
An OOM (Out of Memory) kill happens when a container's memory usage exceeds its limit. The Linux kernel's OOM killer terminates the process, and Kubernetes restarts the container.
Common Causes#
- Memory limit set too low — The application legitimately needs more memory.
- Memory leaks — Slow growth over time until the limit is hit.
- JVM heap misconfiguration — The JVM allocates heap larger than the container limit.
- Large request payloads — Unbounded request body parsing.
- Cache without eviction — In-memory caches grow without bound.
Diagnosis#
```bash
kubectl describe pod my-app-xyz
# Look for: Last State: Terminated, Reason: OOMKilled, Exit Code: 137

kubectl top pod my-app-xyz
# Check current memory usage relative to limits
```
Prevention#
- Set memory requests based on steady-state usage (p50).
- Set memory limits based on peak usage (p99) plus a 20-30% buffer.
- For JVMs, set `-Xmx` to roughly 75% of the container memory limit to leave room for non-heap memory.
- Use memory-bounded caches with eviction policies.
- Load-test to find actual memory profiles before setting limits.
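As a concrete sketch of the JVM guidance above (the container name and values are illustrative, not from the original): with a 512Mi limit, capping the heap at about 75% leaves headroom for metaspace, thread stacks, and direct buffers.

```yaml
containers:
  - name: my-app
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-Xmx384m"    # ~75% of the 512Mi limit
    resources:
      requests:
        memory: "384Mi"
      limits:
        memory: "512Mi"
```

On JDK 10+, `-XX:MaxRAMPercentage=75.0` achieves the same ratio without hard-coding the heap size, so the flag stays correct if the limit changes.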
CPU Throttling#
When a container exceeds its CPU limit, the kernel's CFS (Completely Fair Scheduler) throttles it. The container does not get killed — it just gets fewer CPU cycles.
How CFS Throttling Works#
Kubernetes translates CPU limits into CFS quota and period:
```
CPU limit:  500m (0.5 cores)
CFS period: 100ms
CFS quota:  50ms
```
The container gets 50ms of CPU time per 100ms period.
If it uses all 50ms in the first 20ms, it is throttled for 80ms.
Throttling Symptoms#
- High request latency that does not correlate with load.
- `container_cpu_cfs_throttled_seconds_total` metric increasing.
- Application appears slow despite low CPU utilization in monitoring.
Mitigation#
Some teams remove CPU limits entirely and rely only on requests. This prevents throttling while still guaranteeing scheduling. The trade-off is that a runaway process can starve neighbors on the same node.
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    # cpu: omitted — no throttling
    memory: "512Mi"
```
This is a valid strategy when nodes run a small number of well-understood workloads.
Vertical Pod Autoscaler (VPA)#
Setting the right requests and limits is hard. The Vertical Pod Autoscaler monitors actual resource usage and recommends (or automatically applies) better values.
VPA Modes#
| Mode | Behavior |
|---|---|
| Off | VPA computes recommendations but does not apply them |
| Initial | VPA sets resources only when pods are created |
| Auto | VPA evicts and recreates pods with updated resources |
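The modes above map to the `updateMode` field on the VPA object. A minimal sketch, assuming the VPA components are installed in the cluster and a Deployment named `my-app` exists:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # compute recommendations only; do not apply them
```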
VPA Recommendation Example#
```
kubectl describe vpa my-app-vpa

Recommendation:
  Container: my-app
    Lower Bound: cpu: 100m, memory: 128Mi
    Target:      cpu: 250m, memory: 320Mi
    Upper Bound: cpu: 800m, memory: 640Mi
    Uncapped:    cpu: 250m, memory: 320Mi
```
- Target is the recommended request value.
- Lower Bound and Upper Bound define the confidence interval.
- Start with VPA in Off mode to review recommendations before enabling auto-scaling.
VPA Caveats#
- VPA recreates pods to change resources, causing brief downtime. Run multiple replicas.
- VPA and HPA (Horizontal Pod Autoscaler) should not both target CPU. Use HPA for scaling replicas and VPA for right-sizing individual pods on memory.
- VPA needs several days of usage data to produce stable recommendations.
Resource Management Strategy#
A practical approach for production clusters:
- Always set requests — Never deploy BestEffort pods in production.
- Set memory limits — OOM kills are preferable to nodes running out of memory.
- Consider omitting CPU limits — Throttling causes hidden latency; rely on requests for scheduling.
- Use LimitRange — Enforce defaults so no pod slips through without resource declarations.
- Use ResourceQuota — Prevent any single namespace from monopolizing the cluster.
- Deploy VPA in Off mode — Review recommendations quarterly and adjust requests.
- Monitor throttling and OOM kills — Alert on `container_cpu_cfs_throttled_seconds_total` and OOMKilled restart reasons.
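The monitoring advice above can be sketched as Prometheus alerting rules (assuming the Prometheus Operator and kube-state-metrics are deployed; the rule names and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-alerts
spec:
  groups:
    - name: resources
      rules:
        - alert: CPUThrottlingHigh
          # Sustained throttling over 15 minutes
          expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 1
          for: 15m
        - alert: PodOOMKilled
          # kube-state-metrics exposes the last termination reason
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```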
Key Takeaways#
- Requests guarantee scheduling capacity; limits enforce ceilings. Set both for memory, and at minimum requests for CPU.
- QoS classes (Guaranteed, Burstable, BestEffort) determine eviction priority — never run production without requests.
- LimitRange and ResourceQuota provide namespace-level guardrails against resource sprawl.
- OOM kills are caused by exceeding memory limits; diagnose with `kubectl describe pod` and prevent with proper profiling.
- CPU throttling is silent and insidious — monitor CFS throttling metrics and consider omitting CPU limits.
- VPA automates right-sizing but requires multiple replicas to handle pod recreation.