Ever wondered what happens behind the scenes when you set resource requests and limits in your Kubernetes YAML? This deep dive explores how Kubernetes translates abstract definitions into concrete Linux kernel instructions, covering scheduling algorithms, CPU throttling logic, and memory termination signals.

1. The Linux OOM Killer & QoS Classes

When a node runs out of memory, the Linux Kernel triggers the Out of Memory (OOM) Killer. This is not a random process; it is a deterministic algorithm based on a scoring system. Kubernetes manipulates this system using Quality of Service (QoS) classes to protect critical workloads.

The Mechanism: oom_score and oom_score_adj

Every process in Linux has an oom_score (visible at /proc/<pid>/oom_score), ranging from 0 (least likely to be killed) to 1000 (killed first). Kubernetes shifts this score by writing an oom_score_adj value (range -1000 to +1000) for each container process, based on the Pod's QoS class.
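On a Linux node you can inspect both values directly through procfs. A minimal helper (Linux-only; the paths are the standard /proc locations):

```python
import os
from pathlib import Path

def oom_values(pid: int) -> tuple[int, int]:
    """Read the kernel's current oom_score and oom_score_adj for a process."""
    base = Path(f"/proc/{pid}")
    score = int((base / "oom_score").read_text())
    adj = int((base / "oom_score_adj").read_text())
    return score, adj

score, adj = oom_values(os.getpid())
print(f"oom_score={score} oom_score_adj={adj}")
```

Run it inside a container and you'll see the adjustment Kubernetes applied for that Pod's QoS class.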

💡 Key Insight: The QoS class is automatically determined by Kubernetes based on your resource configuration—you don't set it explicitly.

QoS Classes Explained

Guaranteed

Condition: requests == limits for both CPU and memory, in every container of the Pod

oom_score_adj: -997

Behavior: "The VIP". The kernel will practically never kill this pod unless the system is in a catastrophic state (OOM on system daemons).

Burstable

Condition: At least one request or limit is set, but the Guaranteed criteria are not met (e.g., requests.memory < limits.memory, or limits left unset)

oom_score_adj: 2 to 999

Behavior: "The Middle Class". The adjustment is calculated from how much memory the pod requested relative to the node's capacity — roughly 1000 − (1000 × memoryRequest / nodeCapacity), clamped to the 2–999 range. The more you request, the safer you are.

BestEffort

Condition: No requests or limits defined

oom_score_adj: 1000

Behavior: "The Sacrificial Lamb". These pods have the highest possible kill score. They are the first to be terminated when the node experiences memory pressure.
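The derivation can be sketched in a few lines. This is a simplified single-container model — the real kubelet checks every container in the Pod — but the Burstable clamp mirrors the documented formula min(max(2, 1000 − 1000·request/capacity), 999):

```python
def qos_class(requests_mem, limits_mem, requests_cpu, limits_cpu):
    """Simplified QoS derivation for a single-container pod."""
    if not any([requests_mem, limits_mem, requests_cpu, limits_cpu]):
        return "BestEffort"
    if (requests_mem and requests_mem == limits_mem
            and requests_cpu and requests_cpu == limits_cpu):
        return "Guaranteed"
    return "Burstable"

def oom_score_adj(qos, mem_request_bytes, node_capacity_bytes):
    """Map a QoS class to the kubelet's oom_score_adj values."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: the more memory you request, the lower (safer) the score
    adj = 1000 - (1000 * mem_request_bytes) // node_capacity_bytes
    return min(max(adj, 2), 999)
```

For example, a Burstable pod requesting 4GiB on a 16GiB node lands at 750 — safer than a tiny-request pod, but far from the Guaranteed tier.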

The OOM Algorithm

When Available Memory < Threshold:

  1. The Kernel invokes select_bad_process()
  2. It iterates through all processes
  3. For each, it computes a badness score from the process's memory footprint (RSS, swap, page tables), normalized to the 0–1000 scale, plus the scaled oom_score_adj
  4. The process with the highest badness receives SIGKILL (Signal 9). It cannot be caught or ignored; the process is terminated immediately.
⚠️ Deep Dive Note: If a container exceeds its own Limit (not the Node's limit), the cgroup memory controller kills it immediately, regardless of the node's overall status. This is a Cgroup OOM, not a Node OOM.
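A toy model of that selection loop. The real oom_badness() weighs RSS, swap, and page tables separately; here the footprint is collapsed into a single page count, but the ranking idea is the same — largest adjusted footprint loses:

```python
def badness(mem_pages, total_pages, oom_score_adj):
    """Toy approximation of the kernel's oom_badness():
    memory footprint normalized to 0..1000, shifted by oom_score_adj."""
    points = (1000 * mem_pages) // total_pages + oom_score_adj
    return max(points, 0)

def select_bad_process(procs, total_pages):
    """procs: list of (name, mem_pages, oom_score_adj). Returns the victim."""
    return max(procs, key=lambda p: badness(p[1], total_pages, p[2]))[0]

procs = [
    ("guaranteed-db",  400_000, -997),   # big, but protected by its QoS class
    ("burstable-api",  200_000,  600),
    ("besteffort-job",  50_000, 1000),   # small, but sacrificial
]
print(select_bad_process(procs, 1_000_000))  # besteffort-job
```

Note how the BestEffort pod dies first despite using the least memory — the adjustment dominates the footprint.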

2. CPU Architecture: CFS, Quotas, and Throttling

Unlike memory, CPU is a compressible resource. When a process demands more CPU than allowed, it is not killed; it is throttled. Kubernetes relies on the Completely Fair Scheduler (CFS) within the Linux Kernel to handle this.

The CFS Mechanism

The CFS manages CPU allocation using two key settings in the cgroup:

  • cpu.cfs_period_us: The accounting period (typically 100ms, i.e., 100,000µs)
  • cpu.cfs_quota_us: The total CPU time the cgroup's threads may consume within that period

The Throttling Algorithm

If you set limits: cpu: 250m (0.25 cores), Kubernetes translates this to:

  • Period: 100ms
  • Quota: 25ms
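That translation can be sketched as follows (the helper name is illustrative; the arithmetic is how millicores map onto the 100ms period):

```python
def cfs_quota_us(cpu_millicores: int, period_us: int = 100_000) -> int:
    """Translate a Kubernetes CPU limit (in millicores) into a CFS quota.
    1000m == 1 core == the full period."""
    return cpu_millicores * period_us // 1000

print(cfs_quota_us(250))   # 25000 µs == 25ms per 100ms period
print(cfs_quota_us(1500))  # 150000 µs: multi-core limits exceed one period
```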
The Scenario:
  1. Your application starts a heavy computation at T=0ms
  2. It burns through its 25ms quota by T=25ms
  3. The Kernel suspends the process threads. They are removed from the CPU run queue
  4. The application sleeps for the remaining 75ms
  5. At T=100ms (next period), the quota resets, and the application resumes
⚠️ The Latency Trap: This "stop-start" behavior causes tail latency. Your application isn't slow because of your code; it's slow because the Kernel is actively preventing it from running 75% of the time. You can monitor this via the container_cpu_cfs_throttled_periods_total metric.
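A back-of-the-envelope model of this stop-start pattern for a single-threaded, CPU-bound task — a toy calculation, not a scheduler simulation:

```python
def completion_time_ms(work_ms: float, quota_ms: float = 25,
                       period_ms: float = 100) -> float:
    """Wall-clock time for `work_ms` of CPU work under CFS throttling.
    Each period grants quota_ms of run time; the rest is spent throttled."""
    full_periods = int(work_ms // quota_ms)
    remainder = work_ms - full_periods * quota_ms
    if remainder == 0:
        # finishes exactly as the quota runs out in the final period
        return (full_periods - 1) * period_ms + quota_ms
    return full_periods * period_ms + remainder

print(completion_time_ms(25))   # 25.0  -> fits within one quota
print(completion_time_ms(50))   # 125.0 -> 25ms run, 75ms throttled, 25ms run
```

A request needing 50ms of CPU takes 125ms of wall-clock time — a 2.5× latency penalty invisible in CPU usage graphs.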

3. The Kube-Scheduler: Filtering & Scoring Logic

The kube-scheduler does not randomly pick nodes. It follows a strict Scheduling Framework pipeline for every unscheduled Pod.

Phase 1: Filtering (Predicates)

This is a boolean (Yes/No) pass. If a node fails any check, it is discarded.

PodFitsResources

Does the node have enough allocatable CPU/Memory for the Pod's requests?

MatchNodeSelector / NodeAffinity

Does the node match the required labels?

TaintToleration

Does the Pod tolerate the Node's taints (e.g., NoSchedule)?

VolumeZone

If the pod needs an EBS volume in us-east-1a, filter out nodes in us-east-1b.
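A simplified sketch of the PodFitsResources check — real predicates also handle extended resources, in-flight pods, and more, but the core arithmetic is just requests vs. remaining allocatable:

```python
def pod_fits_resources(pod_requests, node_allocatable, node_requested):
    """Does the node's remaining allocatable capacity cover the pod's requests?
    (Simplified to a flat resource dict, e.g. CPU millicores and memory GiB.)"""
    for res, req in pod_requests.items():
        free = node_allocatable.get(res, 0) - node_requested.get(res, 0)
        if req > free:
            return False
    return True

def filter_nodes(pod_requests, nodes):
    """nodes: dict of name -> (allocatable, already_requested). Boolean pass."""
    return [name for name, (alloc, used) in nodes.items()
            if pod_fits_resources(pod_requests, alloc, used)]
```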

Phase 2: Scoring (Priorities)

The remaining nodes are ranked 0-100. The highest score wins.

ImageLocality

Nodes that already have the container image cached get a boost (saves bandwidth/time)

LeastRequested

Favors nodes with the most free resources (spreads load)

MostRequested

Favors nodes that are already mostly full (bin-packing, saves cost)

InterPodAffinity

Favors nodes running other specific pods (for lower network latency)
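The LeastRequested idea in miniature — a simplified version of the scheduler's least-allocated scoring (MostRequested is essentially the inverse):

```python
def least_requested_score(requested, allocatable):
    """Score 0-100: average, over resources, of (free / allocatable) * 100.
    Higher score == emptier node, so load spreads out."""
    scores = [(allocatable[r] - requested[r]) * 100 // allocatable[r]
              for r in allocatable]
    return sum(scores) // len(scores)

# A node with 1000m/4000m CPU and 2/8 GiB memory already requested:
print(least_requested_score({"cpu": 1000, "memory": 2},
                            {"cpu": 4000, "memory": 8}))  # 75
```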

Phase 3: Preemption (The Bully Logic)

If no nodes survive Phase 1, the Scheduler checks PriorityClass.

If the pending Pod has a higher priorityClassName than existing pods on a node, the Scheduler initiates Preemption:

  1. It evicts the lower-priority pods (graceful termination)
  2. It waits for them to terminate
  3. It schedules the high-priority pod onto the freed node
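A toy version of the victim-selection step. The real scheduler also respects PodDisruptionBudgets, affinity, and termination ordering; the core idea, though, is "free just enough capacity from strictly lower-priority pods":

```python
def pick_victims(pending_priority, pods_on_node, needed):
    """Evict the lowest-priority pods (strictly below the pending pod's
    priority) until `needed` requested capacity is freed.
    Returns the victim names, or None if preemption can't help."""
    candidates = sorted(
        (p for p in pods_on_node if p["priority"] < pending_priority),
        key=lambda p: p["priority"])
    victims, freed = [], 0
    for p in candidates:
        if freed >= needed:
            break
        victims.append(p["name"])
        freed += p["request"]
    return victims if freed >= needed else None
```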

4. Kubelet Eviction Manager (Node Pressure)

While OOMKill is the Kernel's last resort (panic button), Eviction is Kubelet's proactive attempt to save the node. The Kubelet monitors node stability signals.

Eviction Signals

memory.available

Low RAM

nodefs.available

Low disk space (for logs/containers)

nodefs.inodesFree

Critical. Running out of inodes (the filesystem's index nodes, not file descriptors). Even with 100GB of free disk space, millions of tiny files can exhaust the inode table, and the filesystem can no longer create files.

Hard vs. Soft Eviction

Soft Eviction

Config: eviction-soft: memory.available<1.5Gi
eviction-soft-grace-period: 1m30s

Behavior: If the threshold is breached, Kubelet waits for the grace period. If it persists, it terminates pods gracefully (SIGTERM → wait → SIGKILL).

Hard Eviction

Config: eviction-hard: memory.available<100Mi

Behavior: Immediate action. Kubelet sends SIGKILL instantly. No grace period.

Eviction Ranking (Who dies first?)

When Kubelet decides to evict, it ranks candidates differently than the Kernel OOM Killer:

  1. Whether the pod's usage exceeds its requests (over-consumers first; BestEffort pods, having no requests, always qualify)
  2. Pod Priority (lower PriorityClass first)
  3. The amount of usage above requests (biggest offenders first)
✅ Key Takeaway: This logic ensures that pods sticking to their Service Level Agreements (SLAs)—i.e., staying within requests—are the last to be touched.
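That ranking can be expressed as a single sort key — a simplified model of the kubelet's ordering, not its actual implementation:

```python
def eviction_rank(pods):
    """Order eviction candidates: pods whose usage exceeds their request go
    first, then lower priority, then the largest overage."""
    return sorted(pods, key=lambda p: (
        p["usage"] <= p["request"],    # over-consumers first (False sorts first)
        p["priority"],                 # lower priority first
        -(p["usage"] - p["request"]),  # biggest overage first
    ))
```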

Practical Implications

Best Practices for Resource Configuration

Set Requests Accurately

Requests determine scheduling decisions and eviction priority. Set them based on actual baseline usage, not peak usage.

Monitor Throttling

Watch container_cpu_cfs_throttled_periods_total metrics. High throttling indicates your CPU limits are too restrictive.

Use Guaranteed QoS for Critical Workloads

For mission-critical services, set requests == limits to achieve Guaranteed QoS and maximum protection from OOM killer.

Leverage PriorityClasses

Define PriorityClasses for different workload tiers to ensure critical services can preempt less important ones during resource contention.

Conclusion

Understanding how Kubernetes interacts with the Linux kernel is crucial for building reliable, performant containerized applications. The abstractions provided by Kubernetes YAML are powerful, but knowing what happens under the hood—from OOM scores to CFS throttling to scheduler predicates—enables you to make informed decisions about resource allocation and troubleshoot production issues effectively.

The next time you see a pod getting OOMKilled or experiencing high latency, you'll know exactly where to look: check the QoS class, examine throttling metrics, review eviction events, and understand the kernel-level mechanisms at play.