Kubernetes resource management is one of those areas where small misconfigurations cause large, hard-to-diagnose problems in production. Pods that run fine in a test cluster randomly get OOM-killed in production. The cluster appears healthy by node utilization but pods are throttled and latency spikes. Deployments get stuck in Pending with Insufficient cpu even though the cluster isn't obviously full. Most of these issues trace back to misunderstood resource requests and limits.
This guide covers how requests and limits actually work in Kubernetes — including the important differences between CPU and memory behavior — and explains QoS classes, namespace quotas, and autoscaling.
Requests and limits: what they actually mean
Every container in a pod can specify two types of resource constraints: requests and limits.
Requests are a guarantee. When Kubernetes schedules a pod, it finds a node with enough available capacity to satisfy the pod's total requested resources. The scheduler tracks allocated resources (sum of all pod requests), not actual utilization. This means a node can be at 20% actual CPU usage but fully scheduled if the sum of requests equals node capacity.
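The point that scheduling depends on requests rather than usage can be illustrated with a minimal sketch. The `Node` class and `fits` function below are hypothetical names for illustration only; the real scheduler applies many more filters (taints, affinity, ports, and so on).

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_cpu_m: int      # millicores the node offers to pods
    allocatable_mem_mi: int     # MiB the node offers to pods
    requested_cpu_m: int = 0    # sum of CPU requests already placed here
    requested_mem_mi: int = 0   # sum of memory requests already placed here


def fits(node: Node, cpu_m: int, mem_mi: int) -> bool:
    """True if the pod's requests fit in what is still unreserved.

    Actual usage never enters the calculation.
    """
    return (node.requested_cpu_m + cpu_m <= node.allocatable_cpu_m
            and node.requested_mem_mi + mem_mi <= node.allocatable_mem_mi)


# A node that may be idling at 20% real CPU usage,
# but with every millicore already requested.
full_on_paper = Node(allocatable_cpu_m=4000, allocatable_mem_mi=16384,
                     requested_cpu_m=4000, requested_mem_mi=8192)
print(fits(full_on_paper, cpu_m=250, mem_mi=128))  # False
```

This is the situation behind a Pending pod reporting Insufficient cpu on a cluster that looks mostly idle.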
Limits are a ceiling. A container cannot exceed its resource limit. What happens when it tries depends on the resource type — and this is where most confusion comes from.
resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    cpu: "500m"
In this example: the scheduler finds a node with at least 250m CPU and 128Mi memory available. At runtime, the container can burst up to 500m CPU and 256Mi memory.
CPU: throttled, not killed
CPU is a compressible resource. When a container tries to use more CPU than its limit, the Linux kernel's CFS (Completely Fair Scheduler) throttles it — it gets fewer CPU time slices. The container keeps running, but slower. It does not die.
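As a rough illustration of why throttling happens, the sketch below converts a CPU limit into a CFS quota and estimates how long a container sits throttled in each period. It assumes the commonly used 100 ms CFS period and a simplified model in which every thread is runnable for the whole period; the function names are hypothetical.

```python
CFS_PERIOD_US = 100_000  # assumed scheduling period: 100 ms


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (microseconds) the container may consume per period."""
    return limit_millicores * period_us // 1000


def throttled_us_per_period(limit_millicores: int, busy_threads: int,
                            period_us: int = CFS_PERIOD_US) -> float:
    """Wall-clock time per period spent throttled, in the simplified model."""
    quota = cfs_quota_us(limit_millicores, period_us)
    # N busy threads burn the quota N times faster than wall-clock time.
    wall_time_to_exhaust = quota / busy_threads
    return max(0.0, period_us - wall_time_to_exhaust)


print(cfs_quota_us(500))                 # 50000
print(throttled_us_per_period(500, 4))   # 87500.0
```

Under these assumptions, a container with a 500m limit and four busy threads uses its quota in the first 12.5 ms of each period and then waits out the remaining 87.5 ms, which is how a moderate average CPU figure can coexist with poor tail latency.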
CPU is measured in millicores: 1000m = 1 vCPU core. 250m = 0.25 of one core. Fractional CPUs are valid: cpu: "0.5" equals cpu: "500m".
CPU throttling is a common source of latency spikes that are invisible in node-level metrics. A container hitting its CPU limit doesn't show up as high CPU utilization — it shows up as elevated p95/p99 latency. If your application has latency problems and the nodes look healthy, check whether pods are being throttled:
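# Actual CPU/memory usage per container; compare against the configured limits
kubectl top pod --containers

# More detailed view: CFS throttling counters for the container's cgroup
# cgroup v1: nr_periods, nr_throttled, throttled_time (cumulative nanoseconds throttled)
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu/cpu.stat
# cgroup v2: same counters, with throttled_usec (cumulative microseconds throttled)
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu.stat
# nr_throttled divided by nr_periods gives the share of periods in which the container was throttled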
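The counters in cpu.stat are cumulative, so a ratio is easier to reason about than the raw numbers. A minimal sketch, assuming the file's `name value` line format and the `nr_periods` and `nr_throttled` counters; the `throttle_ratio` helper and the sample values are made up for illustration.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit set, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods


# Illustrative cpu.stat contents, not real measurements.
sample = """nr_periods 1200
nr_throttled 300
throttled_time 4500000000
"""
print(f"{throttle_ratio(sample):.0%}")  # 25%
```

Because the counters only grow, comparing two readings taken a few minutes apart gives a better picture of current throttling than a single read.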
Memory: OOM-killed
Memory is an incompressible resource. When a container exceeds its memory limit, the Linux kernel's OOM (Out of Memory) killer terminates it. In Kubernetes, the container's last state then shows the reason OOMKilled with exit code 137 (128 + 9, the signal number of SIGKILL). If the container keeps hitting the limit after each restart, the pod ends up in CrashLoopBackOff.
The important distinction: memory requests set a guarantee that the scheduler uses for placement, but memory limits are enforced by cgroup memory limits at runtime. A container that slowly leaks memory will eventually hit its limit and be killed — even if the node has plenty of free memory.
# Check if a pod was OOM-killed
kubectl describe pod <pod-name>
# Look for: OOMKilled in the Last State section
# Exit Code: 137 = OOM kill
# Exit Code: 1 = application crash

# Check memory usage vs limits
kubectl top pod --containers
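Exit codes above 128 conventionally encode "128 + signal number", which is where 137 comes from. The hypothetical helper below decodes that convention on a Unix-like system; note that 137 only proves the process received SIGKILL, so the OOMKilled reason in the pod status is still needed to confirm an OOM kill.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable hint."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not known on this platform
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"application exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
print(describe_exit_code(1))    # application exited with status 1
```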
Setting memory limits equal to requests (or not setting limits at all) changes scheduling and eviction behavior significantly — more on this in the QoS section below.
Units: CPU and memory notation
- CPU: 1, 0.5, 500m (millicores). 1 = 1 vCPU/core/hyperthread.
- Memory: 128Mi (mebibytes, base-2), 128M (megabytes, base-10). Always use Mi for memory to avoid confusion: 1Mi = 1,048,576 bytes, 1M = 1,000,000 bytes. The difference compounds at large scales.
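To make the unit differences concrete, here is a small sketch that converts quantity strings to millicores and bytes. It handles only the suffixes discussed above plus their immediate neighbours and is not a full implementation of Kubernetes quantity parsing; both function names are hypothetical.

```python
from decimal import Decimal

BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicores(quantity: str) -> int:
    """'500m' -> 500, '0.5' -> 500, '1' -> 1000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """'128Mi' -> 134217728, '128M' -> 128000000, '1024' -> 1024."""
    for suffix, factor in BINARY.items():   # two-letter suffixes first
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-2]) * factor)
    for suffix, factor in DECIMAL.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-1]) * factor)
    return int(quantity)                    # plain byte count


print(cpu_to_millicores("0.5") == cpu_to_millicores("500m"))  # True
print(memory_to_bytes("128Mi") - memory_to_bytes("128M"))     # 6217728
```

The roughly 6 MB gap at 128 units grows in proportion to the value, which is why mixing Mi and M across manifests is worth avoiding.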
QoS classes
Kubernetes assigns every pod one of three Quality of Service classes based on its resource configuration. QoS class determines the order in which pods are evicted when a node runs low on resources:
Guaranteed
Every container in the pod has both requests and limits set, and requests equal limits for both CPU and memory.
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"  # same as request
    cpu: "500m"      # same as request
Guaranteed pods are the last to be evicted. The node will evict BestEffort and Burstable pods before touching Guaranteed pods. This is the right QoS class for critical production workloads that cannot be interrupted.
The tradeoff: the container cannot burst beyond its request. If it needs 600m CPU momentarily, it gets throttled at 500m.
Burstable
At least one container has a CPU or memory request or limit set, but the pod does not meet the Guaranteed criteria: either a limit is higher than its request, or requests and limits aren't set on every container. Most production workloads fall here.
resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # limit > request: Burstable
    cpu: "1000m"
Burstable pods can use more resources than their request if the node has headroom. They're evicted after BestEffort pods but before Guaranteed pods. The eviction order within Burstable is based on how far each pod's usage exceeds its request — pods using the most above their request get evicted first.
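The ordering described above can be sketched as a sort on "usage minus request". This is a deliberate simplification: the kubelet also weighs pod priority, and the pod names and figures below are invented for illustration.

```python
def eviction_order(pods: dict[str, tuple[int, int]]) -> list[str]:
    """Order pods by how far memory usage exceeds the request.

    pods maps a pod name to (memory_usage_mi, memory_request_mi).
    The pod furthest above its request comes first.
    """
    return sorted(pods,
                  key=lambda name: pods[name][0] - pods[name][1],
                  reverse=True)


# Hypothetical pods: (usage MiB, request MiB)
pods = {"cache": (900, 256), "api": (300, 256), "worker": (200, 512)}
print(eviction_order(pods))  # ['cache', 'api', 'worker']
```

Note that `worker` uses 200 MiB yet is the safest of the three, because it is well under its request, while `cache` is the first candidate.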
BestEffort
No requests or limits are set on any container. These pods get whatever CPU/memory is left over after scheduled pods get their requests. They're the first to be evicted under pressure.
resources: {} # or omitting resources entirely
BestEffort is appropriate for batch jobs or non-critical background tasks where eviction is acceptable. Never use BestEffort for production services: they are the first pods evicted, with little warning, when the node they run on comes under resource pressure.
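The three class definitions can be condensed into a short classifier. This is a sketch under stated assumptions: containers are plain dicts with optional `requests` and `limits` maps, quantities are compared as strings (so "0.5" and "500m" would need normalizing first), and a missing request is treated as equal to the limit, which is how Kubernetes defaults it.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is omitted defaults to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "500m", "memory": "256Mi"},
                  "limits": {"cpu": "500m", "memory": "256Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "250m", "memory": "128Mi"},
                  "limits": {"cpu": "1000m", "memory": "512Mi"}}]))  # Burstable
print(qos_class([{}]))                                               # BestEffort
```

The class Kubernetes actually assigned is recorded in the pod's `status.qosClass` field, which is the authoritative value to check.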
LimitRange: default limits per namespace
If a container doesn't specify requests/limits, and there's no LimitRange, it runs without constraints. On a shared cluster, this can cause noisy-neighbor problems. LimitRange lets you set per-namespace defaults and constraints:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:            # applied if no limit is specified
      memory: "256Mi"
      cpu: "500m"
    defaultRequest:     # applied if no request is specified
      memory: "128Mi"
      cpu: "100m"
    max:                # hard ceiling: containers cannot exceed this
      memory: "2Gi"
      cpu: "2"
    min:                # must specify at least this much
      memory: "64Mi"
      cpu: "50m"
LimitRange also applies to PersistentVolumeClaims (max/min storage size) and Pods as a whole (not just individual containers).
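A rough sketch of how the Container-type LimitRange above might be applied at admission time, using integer millicores and MiB for simplicity. The function name and key names are hypothetical, and the real admission logic covers more cases; the one subtlety modelled here is that a container with an explicit limit but no request gets its request set to that limit rather than to defaultRequest.

```python
DEFAULT_LIMIT = {"cpu_m": 500, "memory_mi": 256}
DEFAULT_REQUEST = {"cpu_m": 100, "memory_mi": 128}
MAX = {"cpu_m": 2000, "memory_mi": 2048}
MIN = {"cpu_m": 50, "memory_mi": 64}


def apply_limit_range(requests: dict, limits: dict) -> tuple[dict, dict]:
    """Fill in defaults, then validate against min/max. Raises ValueError."""
    requests, limits = dict(requests), dict(limits)
    for res in ("cpu_m", "memory_mi"):
        if res in limits:
            # Explicit limit, missing request: request follows the limit.
            requests.setdefault(res, limits[res])
        else:
            limits[res] = DEFAULT_LIMIT[res]
            requests.setdefault(res, DEFAULT_REQUEST[res])
        if requests[res] < MIN[res]:
            raise ValueError(f"{res} request {requests[res]} is below min {MIN[res]}")
        if limits[res] > MAX[res]:
            raise ValueError(f"{res} limit {limits[res]} is above max {MAX[res]}")
        if requests[res] > limits[res]:
            raise ValueError(f"{res} request {requests[res]} exceeds limit {limits[res]}")
    return requests, limits


print(apply_limit_range({}, {}))
print(apply_limit_range({}, {"cpu_m": 1000}))
```

One consequence worth noticing: a container that sets only a request larger than the default limit is rejected, because the defaulted limit ends up below the request.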
ResourceQuota: namespace-level capacity caps
ResourceQuota sets total resource caps for an entire namespace. This prevents any single team or application from consuming disproportionate cluster resources:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.cpu: "8"
    limits.memory: "16Gi"
    pods: "50"
    services: "10"
    persistentvolumeclaims: "20"
Once a namespace hits its quota, new pods that would exceed it are rejected. Operations fail with a clear error: exceeded quota: team-quota, requested: requests.cpu=500m, used: requests.cpu=4, limited: requests.cpu=4.
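ResourceQuota works best combined with LimitRange. Once a quota covers CPU or memory requests or limits, every new pod in the namespace must declare those values or it is rejected, because the quota cannot be tracked without them. A LimitRange fills in defaults for containers that omit them, so the pair enforces that every container has requests/limits set and caps total namespace consumption.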
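The quota check itself amounts to "used plus requested must not exceed hard". A minimal sketch of that arithmetic, with CPU in millicores and memory in MiB; the `exceeded` function and the usage figures are hypothetical and chosen to mirror the error message quoted above.

```python
def exceeded(hard: dict, used: dict, requested: dict) -> list[str]:
    """Return the quota entries that this pod would push past their cap."""
    return [
        name for name, amount in requested.items()
        if name in hard and used.get(name, 0) + amount > hard[name]
    ]


hard = {"requests.cpu": 4000, "requests.memory": 8192, "pods": 50}
used = {"requests.cpu": 4000, "requests.memory": 6144, "pods": 12}
new_pod = {"requests.cpu": 500, "requests.memory": 128, "pods": 1}

print(exceeded(hard, used, new_pod))  # ['requests.cpu']
```

A single exceeded entry is enough to reject the pod, even when memory and pod count still have room.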
Horizontal Pod Autoscaler (HPA)
HPA automatically scales the number of pod replicas based on observed metrics. The most common metric is CPU utilization relative to the request:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # target 60% of request across all pods
HPA requires metrics-server to be installed. It compares actual usage to the target utilization of the request. If your pods have no CPU request set, HPA can't calculate utilization and won't scale correctly.
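The scaling decision follows the documented formula desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies it to the manifest above, assuming the default 10% tolerance; it leaves out stabilization windows, scaling policies, and the handling of pods that are not ready, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int = 2,
                     max_replicas: int = 20, tolerance: float = 0.1) -> int:
    """Replica count HPA would aim for, given average utilization in percent."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60))    # 6
print(desired_replicas(4, 63, 60))    # 4
print(desired_replicas(10, 30, 60))   # 5
```

Because utilization is usage divided by request, halving a pod's CPU request doubles its reported utilization for the same load, which changes when HPA scales.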
Custom metrics (via the custom.metrics.k8s.io API) let HPA scale on application-level metrics: request rate, queue depth, active connections. KEDA (Kubernetes Event-Driven Autoscaling) extends this further to scale on events from external sources like SQS queue length, Kafka lag, or scheduled time windows.
Vertical Pod Autoscaler (VPA)
VPA automatically adjusts CPU and memory requests based on observed usage. It addresses the common problem of over-allocated requests that waste cluster capacity.
VPA's main update modes are:
- Off: VPA only recommends new request values, doesn't apply them. Use this to get sizing recommendations without automatic changes.
- Initial: VPA sets requests only at pod creation time. Running pods aren't modified.
- Auto: VPA updates requests on running pods by evicting and recreating them. This can disrupt availability if not carefully managed.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"  # recommendation-only mode
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        memory: "64Mi"
      maxAllowed:
        memory: "2Gi"
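HPA and VPA can conflict: HPA scales replicas based on current utilization, VPA changes the request that utilization is measured against. Running both against the same CPU or memory metric on one Deployment is generally discouraged. Combining them is workable when HPA scales on custom or external metrics (for example, ones supplied through KEDA) rather than on the resources VPA adjusts, and even then it needs careful coordination. A common pattern: use VPA in recommendation-only mode to right-size requests, then use HPA for replica scaling.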
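To show how minAllowed and maxAllowed bound a recommendation, here is a deliberately simplified stand-in for a right-sizing calculation: take a high percentile of observed memory usage and clamp it to the policy range. This is not VPA's actual recommender, which works from decaying usage histograms; the function, the percentile choice, and the sample data are all assumptions for illustration.

```python
def recommend_memory_mi(samples_mi: list[int], min_allowed_mi: int = 64,
                        max_allowed_mi: int = 2048, percentile: int = 90) -> int:
    """Nearest-rank percentile of usage samples, clamped to the allowed range."""
    if not samples_mi:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mi)
    # Integer ceiling of percentile/100 * n, without floating point.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    raw = ordered[rank - 1]
    return max(min_allowed_mi, min(max_allowed_mi, raw))


# Invented usage samples in MiB, including one spike.
usage = [100, 120, 110, 130, 500, 115, 125, 105, 118, 122]
print(recommend_memory_mi(usage))         # 130
print(recommend_memory_mi([10, 20, 30]))  # 64
```

The single 500 MiB spike is ignored at the 90th percentile, which is the kind of trade-off to review by hand before applying a recommendation to memory, where undersizing means an OOM kill.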
Common mistakes
No requests set. Without requests, the scheduler places pods anywhere, nodes become overcommitted, and under pressure pods with BestEffort QoS get evicted. Always set requests on production pods.
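Requests equal node capacity. If a pod requests 8 CPU and your nodes have 8 vCPUs, the pod will usually not schedule at all: a node's allocatable CPU is lower than its raw capacity once kubelet and system reservations are subtracted, and DaemonSet pods (kube-proxy, CNI agent, monitoring agents) hold requests of their own. Size pods so that 10–15% of node capacity is left for these system components.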
Memory limit set much higher than request. A container that requests 128Mi but has no memory limit (or a very high limit) can consume all node memory and trigger eviction of other pods. Set memory limits at 2–4x the request, not uncapped.
Setting CPU limit equal to CPU request. Together with memory set the same way, this produces Guaranteed QoS, but it also prevents CPU bursting. For latency-sensitive workloads that occasionally need extra CPU, this causes throttling. Use Burstable QoS, with the CPU limit above the request, for services that benefit from burst headroom; use Guaranteed only for services where predictability matters more than burst performance.