Ever wondered what happens behind the scenes when you set resource requests and limits in your Kubernetes YAML? This deep dive explores how Kubernetes translates abstract definitions into concrete Linux kernel instructions, covering scheduling algorithms, CPU throttling logic, and memory termination signals.
1. The Linux OOM Killer & QoS Classes
When a node runs out of memory, the Linux Kernel triggers the Out of Memory (OOM) Killer. This is not a random process; it is a deterministic algorithm based on a scoring system. Kubernetes manipulates this system using Quality of Service (QoS) classes to protect critical workloads.
The Mechanism: oom_score and oom_score_adj
Every process in Linux has an oom_score (visible at /proc/<pid>/oom_score), ranging from 0 (never kill) to 1000 (kill first). Kubernetes adjusts this score via the oom_score_adj value based on the Pod's QoS class.
QoS Classes Explained
Guaranteed
Condition: requests.memory == limits.memory (and CPU)
oom_score_adj: -997
Behavior: "The VIP". The kernel will practically never kill this pod unless the system is in a catastrophic state (OOM on system daemons).
Burstable
Condition: requests.memory < limits.memory
oom_score_adj: 2 to 999
Behavior: "The Middle Class". The score is calculated dynamically based on how much memory the pod requested vs. how much it is currently using above that request.
BestEffort
Condition: No requests or limits defined
oom_score_adj: 1000
Behavior: "The Sacrificial Lamb". These pods have the highest possible kill score. They are the first to be terminated when the node experiences memory pressure.
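The three classes above can be sketched as a small function. This is a simplified model of the kubelet's assignment logic (the Burstable formula mirrors the kubelet's, which scales the adjustment by how much of the node's memory the pod requests; constants reflect current kubelet versions):

```python
# Sketch of how the kubelet assigns oom_score_adj per QoS class.
# Modeled on the kubelet's qos policy; simplified for illustration.

GUARANTEED_ADJ = -997   # -997 in current kubelet versions
BEST_EFFORT_ADJ = 1000  # highest possible kill score

def oom_score_adj(qos_class: str, memory_request: int = 0,
                  node_memory_capacity: int = 1) -> int:
    """Return the oom_score_adj the kubelet would set for a container."""
    if qos_class == "Guaranteed":
        return GUARANTEED_ADJ
    if qos_class == "BestEffort":
        return BEST_EFFORT_ADJ
    # Burstable: the more memory you request relative to node capacity,
    # the lower (safer) your score; clamped to the 2..999 range.
    adj = 1000 - (1000 * memory_request) // node_memory_capacity
    return min(max(adj, 2), 999)

# A pod requesting half the node's memory lands mid-range:
print(oom_score_adj("Burstable", memory_request=4 << 30,
                    node_memory_capacity=8 << 30))  # 500
```

Note how a Burstable pod that requests almost nothing ends up near 999, barely safer than BestEffort.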
The OOM Algorithm
When the kernel cannot satisfy a memory allocation:
- It invokes select_bad_process() and iterates through all killable processes
- For each, it computes a badness score from the process's memory footprint, adjusted by oom_score_adj
- The process with the highest badness receives SIGKILL (Signal 9). SIGKILL cannot be caught or ignored; the process is terminated immediately.
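The selection loop above can be modeled in a few lines. This is a toy: the real oom_badness() works in pages (RSS + page tables + swap) and scales oom_score_adj by total memory, but the shape of the decision is the same:

```python
# Toy model of kernel victim selection: score every process, kill the
# highest. Simplified: real oom_badness() is more involved.

def badness(memory_usage_pages: int, total_pages: int, oom_score_adj: int) -> int:
    # Usage normalized to 0..1000, then shifted by the adjustment.
    points = (1000 * memory_usage_pages) // total_pages + oom_score_adj
    return max(points, 0)  # scores never go below zero

def select_bad_process(procs: dict, total_pages: int) -> str:
    """procs maps name -> (usage_pages, oom_score_adj); returns the victim."""
    return max(procs, key=lambda p: badness(procs[p][0], total_pages, procs[p][1]))

procs = {
    "guaranteed-db":    (400_000, -997),  # heavy user, but protected by QoS
    "besteffort-batch": (50_000,  1000),  # light user, maximal kill score
}
print(select_bad_process(procs, total_pages=1_000_000))  # besteffort-batch
```

This is why QoS matters: the database uses eight times the memory, but the BestEffort batch job dies first.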
2. CPU Architecture: CFS, Quotas, and Throttling
Unlike memory, CPU is a compressible resource. When a process demands more CPU than allowed, it is not killed; it is throttled. Kubernetes relies on the Completely Fair Scheduler (CFS) within the Linux Kernel to handle this.
The CFS Mechanism
The CFS manages CPU allocation using two key settings in the cgroup:
cpu.cfs_period_us: The accounting period (usually 100ms, i.e. 100,000µs)
cpu.cfs_quota_us: The amount of CPU time the process is allowed to run within that period
The Throttling Algorithm
If you set limits: cpu: 250m (0.25 cores), Kubernetes translates this to:
- Period: 100ms
- Quota: 25ms
- Your application starts a heavy computation at T=0ms
- It burns through its 25ms quota by T=25ms
- The Kernel suspends the process threads. They are removed from the CPU run queue
- The application sleeps for the remaining 75ms
- At T=100ms (next period), the quota resets, and the application resumes
Each throttled period increments the container_cpu_cfs_throttled_periods_total metric, which makes this behavior observable.
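The millicore-to-quota translation described above is simple arithmetic. A small sketch (period default and function names are illustrative):

```python
# How a CPU limit in millicores maps onto CFS quota/period.

CFS_PERIOD_US = 100_000  # cpu.cfs_period_us default: 100ms

def cfs_quota_us(cpu_limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    # 1000m == 1 full core == the whole period; 250m == a quarter of it.
    return cpu_limit_millicores * period_us // 1000

print(cfs_quota_us(250))   # 25000 -> 25ms of runtime per 100ms period
print(cfs_quota_us(1500))  # 150000 -> 1.5 cores: quota can exceed one period
```

Note that quotas above one period are valid: a multi-threaded app with a 1500m limit can burn 150ms of CPU time per 100ms of wall clock across its threads.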
3. The Kube-Scheduler: Filtering & Scoring Logic
The kube-scheduler does not randomly pick nodes. It follows a strict
Scheduling Framework pipeline for every unscheduled Pod.
Phase 1: Filtering (Predicates)
This is a boolean (Yes/No) pass. If a node fails any check, it is discarded.
PodFitsResources
Does the node have enough allocatable CPU/Memory for the Pod's requests?
MatchNodeSelector / NodeAffinity
Does the node match the required labels?
TaintToleration
Does the Pod tolerate the Node's taints (e.g., NoSchedule)?
VolumeZone
If the pod needs an EBS volume in us-east-1a, filter out nodes in us-east-1b.
Phase 2: Scoring (Priorities)
The remaining nodes are ranked 0-100. The highest score wins.
ImageLocality
Nodes that already have the Docker image cached get a boost (saves bandwidth/time)
LeastRequested
Favors nodes with the most free resources (spreads load)
MostRequested
Favors nodes that are already mostly full (bin-packing, saves cost)
InterPodAffinity
Favors nodes running other specific pods (for lower network latency)
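The two phases compose into a simple pipeline: filters discard, scores rank. A minimal sketch with two toy plugins (real scheduler plugins are far richer; field names here are illustrative):

```python
# Minimal filter-then-score pipeline: boolean predicates discard nodes,
# then the survivors are ranked and the top score wins.

def filter_nodes(nodes, pod):
    # PodFitsResources-style predicate: enough free CPU and memory?
    return [n for n in nodes
            if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]]

def score_node(node, pod):
    score = 0
    if pod["image"] in node["cached_images"]:  # ImageLocality-style boost
        score += 50
    # LeastRequested flavor: favor nodes with more CPU left after placement
    score += min(50, node["free_cpu"] - pod["cpu"])
    return score

def schedule(nodes, pod):
    feasible = filter_nodes(nodes, pod)
    if not feasible:
        return None  # triggers preemption logic (Phase 3)
    return max(feasible, key=lambda n: score_node(n, pod))["name"]

nodes = [
    {"name": "node-a", "free_cpu": 10, "free_mem": 8, "cached_images": {"app:v1"}},
    {"name": "node-b", "free_cpu": 40, "free_mem": 8, "cached_images": set()},
    {"name": "node-c", "free_cpu": 2,  "free_mem": 1, "cached_images": set()},
]
pod = {"cpu": 4, "mem": 2, "image": "app:v1"}
print(schedule(nodes, pod))  # node-a: the cached image outweighs node-b's free CPU
```

node-c never reaches scoring at all: filtering is a hard gate, and scoring only arbitrates among nodes that passed every predicate.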
Phase 3: Preemption (The Bully Logic)
If no nodes survive Phase 1, the Scheduler checks PriorityClass.
If the pending Pod has a higher priorityClassName than existing pods on a node, the
Scheduler initiates Preemption:
- It evicts the lower-priority pods
- It waits for them to terminate
- It schedules the high-priority pod
4. Kubelet Eviction Manager (Node Pressure)
While OOMKill is the Kernel's last resort (panic button), Eviction is Kubelet's proactive attempt to save the node. The Kubelet monitors node stability signals.
Eviction Signals
memory.available
Low RAM
nodefs.available
Low disk space (for logs/containers)
nodefs.inodesFree
Critical. Running out of filesystem inodes. Even if you have 100GB of free disk space, millions of tiny files can exhaust the inode table, and the filesystem refuses to create new files.
Hard vs. Soft Eviction
Soft Eviction
Config: eviction-soft: memory.available<1.5Gi
eviction-soft-grace-period: memory.available=1m30s
Behavior: If the threshold is breached, Kubelet waits for the grace period. If it persists, it terminates pods gracefully (SIGTERM → wait → SIGKILL).
Hard Eviction
Config: eviction-hard: memory.available<100Mi
Behavior: Immediate action. Kubelet sends SIGKILL instantly. No grace period.
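The decision between the two modes can be sketched as a single function. This is a simplified model (thresholds mirror the example flags above; the real Eviction Manager tracks breach duration per signal internally):

```python
# Sketch of the soft-vs-hard decision per monitoring tick.
# Thresholds in MiB, durations in seconds; values mirror the examples above.

def eviction_action(memory_available_mib: float, breach_duration_s: float,
                    soft_mib: float = 1536, soft_grace_s: float = 90,
                    hard_mib: float = 100) -> str:
    if memory_available_mib < hard_mib:
        return "evict-now"           # hard threshold: immediate SIGKILL
    if memory_available_mib < soft_mib:
        if breach_duration_s >= soft_grace_s:
            return "evict-graceful"  # SIGTERM -> grace period -> SIGKILL
        return "wait"                # soft breach, still inside grace period
    return "ok"

print(eviction_action(2000, 0))    # ok
print(eviction_action(1200, 30))   # wait
print(eviction_action(1200, 120))  # evict-graceful
print(eviction_action(50, 0))      # evict-now
```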
Eviction Ranking (Who dies first?)
When Kubelet decides to evict, it ranks candidates differently than the Kernel OOM Killer:
- First, whether the pod's usage of the starved resource exceeds its requests (BestEffort pods, having zero requests, always do)
- Then PriorityClass (lowest priority first)
- Then by how far usage exceeds requests (largest overage first)
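That ranking maps naturally onto a sort key. A sketch of the ordering logic (field names are illustrative, not the kubelet's actual structs):

```python
# Sketch of eviction ranking: pods exceeding their requests go first,
# then lower priority, then the largest overage. Index 0 dies first.

def eviction_sort_key(pod):
    over_request = pod["usage"] > pod["request"]
    overage = pod["usage"] - pod["request"]
    # Tuples sort left-to-right: exceeders first (False < True),
    # then low priority, then biggest overage.
    return (not over_request, pod["priority"], -overage)

pods = [
    {"name": "besteffort", "usage": 300, "request": 0,   "priority": 0},
    {"name": "burstable",  "usage": 500, "request": 400, "priority": 100},
    {"name": "guaranteed", "usage": 350, "request": 400, "priority": 1000},
]
victims = sorted(pods, key=eviction_sort_key)
print([p["name"] for p in victims])  # ['besteffort', 'burstable', 'guaranteed']
```

The practical takeaway: a pod that stays under its requests is effectively eviction-proof under node pressure, regardless of its absolute consumption.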
Practical Implications
Best Practices for Resource Configuration
Set Requests Accurately
Requests determine scheduling decisions and eviction priority. Set them based on actual baseline usage, not peak usage.
Monitor Throttling
Watch container_cpu_cfs_throttled_periods_total metrics. High throttling
indicates your CPU limits are too restrictive.
Use Guaranteed QoS for Critical Workloads
For mission-critical services, set requests == limits to achieve Guaranteed QoS
and maximum protection from OOM killer.
Leverage PriorityClasses
Define PriorityClasses for different workload tiers to ensure critical services can preempt less important ones during resource contention.
Conclusion
Understanding how Kubernetes interacts with the Linux kernel is crucial for building reliable, performant containerized applications. The abstractions provided by Kubernetes YAML are powerful, but knowing what happens under the hood—from OOM scores to CFS throttling to scheduler predicates—enables you to make informed decisions about resource allocation and troubleshoot production issues effectively.
The next time you see a pod getting OOMKilled or experiencing high latency, you'll know exactly where to look: check the QoS class, examine throttling metrics, review eviction events, and understand the kernel-level mechanisms at play.