Kubernetes Overhead
You’re probably aware that Kubernetes can help with management and scalability when you run a lot of machines. Those benefits come at a price, though: the overhead of Kubernetes and the container runtime can be significant, and a poorly designed or naively deployed Kubernetes cluster can make things worse by leaving all of your machines underutilized. When we migrated our existing job system to Kubernetes, it consumed far more CPU time than before and completed jobs 40% to 50% more slowly.
This post describes how we recovered from that performance drop. Getting back to parity took some careful experiment design, light performance tuning, and timing analysis. Along the way we also answered a significant deployment question: how much overhead does each pod add?
Experiment setup
Our first Kubernetes effort focused mainly on deployment rather than performance. The job system uses a worker-coordinator design, with “parent” processes serving as coordinators that distribute work to their workers and track their progress.
| | VM | Kubernetes |
|---|---|---|
| Node Count | 100 | 45 |
| Instance Types | c5.4xlarge | c5.4xlarge |
| Kernel Version | 3.13.0-141 | 5.0.0-1023 |
| Parent Processes / Node | 16 | 8 |
| Children / Parent | 5 | 5 |
| Percent Jobs Enqueued | 75% | 25% |
| Avg CPU Idle | 52% | 22% |
| Jobs Completed | 12,830 | 3,360 |
In this deployed configuration, the Kubernetes cluster has about half as many nodes and half as many parent processes per node, so roughly a quarter of the throughput is what you would expect.
We did observe something odd, though: the Kubernetes cluster had less CPU idle time, yet this correlated with its lower job-enqueue rate. Because the deployment architectures and workloads were so different, there wasn’t much we could do with this at the time; we couldn’t tell whether changes in performance were due to the nodes doing different work. That made performance analysis extremely difficult.
The ideal experiment setup has no known differences apart from the one under test, so that any observed difference can be attributed to the thing being studied. The job system team came back with a fantastic setup:
| | VM | Kubernetes |
|---|---|---|
| Node Count | 3 | 3 |
| Instance Types | c5.2xlarge | c5.2xlarge |
| Kernel Version | 3.13.0-141 | 5.0.0-1023 |
| Parent Processes / Node | 12 | 4 |
| Children / Parent | 10 | 10 |
| Percent Jobs Enqueued | 100% | 100% |
| Avg CPU Idle | 76% | 79% |
| Jobs Completed | 998 | 459 |
This gave us a much more even comparison between the two. And because both clusters were running a simple Python script as their jobs, we could see exactly what was going on.
Each pod in the Kubernetes configuration ran one parent process and its children, so the number of parent processes on a node equaled the number of job system pods. It’s worth noting that the older kernel used in the VM configuration has none of the CPU mitigations (e.g., for Spectre and Meltdown) enabled.
Work metrics
Now that we had equivalent work running on VMs and Kubernetes, it was time to decide on metrics. Performance indicators are nuanced and can be deceiving, so the choice of metric is crucial.
Two essential measures were examined: effort and performance.
The metric of effort
How much effort is each node putting in? This tells us how much more work the node can take on relative to its total capacity.
The first effort metric we looked at was the load average. Each CPU has a run queue, the list of tasks that are ready to execute on it right now. The load average is an exponentially weighted average of the length of that run queue, plus one if the CPU is currently busy. That’s a reasonable way to tell how active a machine is without knowing anything else about what it’s doing.
This metric, however, proved misleading in our Kubernetes setup. On Kubernetes, there are many processes doing background work: polling the cluster, checking the condition of pods, and so on. These processes wake up every so often, do a little work, and go back to sleep; they don’t use much CPU, but they do raise the load average.
When only a small number of processes are doing the actual work, these blips can inflate the load average significantly. The problem is one of counting processes rather than how much CPU they need: a process that runs for a microsecond adds 1.0 to the load while it is runnable, just as a process that runs for hours adds 1.0.
So, instead of focusing on what the CPU intended to do (its run queue), we focused on how hard it actually worked, or, to put it differently, on how idle it was. Idle CPU is a great metric for how much work the machine is really doing. Given the CPU consumption $p_i$ of each process $i$, it’s $1 - \sum_i p_i$, which tells us how much CPU we have left on the machine. It doesn’t count processes, only the time they spend on the CPU.
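As a rough illustration of that metric (a minimal sketch, not the collection code we actually run; in practice tools like mpstat report the same thing as %idle), the idle fraction can be derived from two samples of the aggregate counters in /proc/stat:

```python
# Minimal sketch: estimate the machine-wide idle fraction from /proc/stat.
# Assumes the standard Linux field order on the aggregate "cpu" line
# (user, nice, system, idle, iowait, irq, softirq, steal, ...).
import time

def read_cpu_counters():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]   # drop the leading "cpu" label
    values = [int(v) for v in fields]
    idle = values[3] + values[4]            # idle + iowait jiffies
    return idle, sum(values)

def idle_fraction(interval=1.0):
    idle_a, total_a = read_cpu_counters()
    time.sleep(interval)
    idle_b, total_b = read_cpu_counters()
    return (idle_b - idle_a) / (total_b - total_a)

if __name__ == "__main__":
    print(f"CPU idle: {idle_fraction() * 100:.1f}%")
```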
The metric for measuring performance
How well is the system performing? This is what we ultimately aim for most of the time.
Eventually, any system must choose between a latency bias and a throughput bias: how much of one can you give up for the other?
You usually watch both, although their roles are different. You can optimize throughput while keeping an eye on latency for potential issues. That makes sense for batch processing, where latency is a proxy for the length of your task queue. Alternatively, you can optimize for latency and utilize throughput as a proxy for capacity utilization.
The job system team’s main priority is throughput. The team’s primary metric is the number of jobs it can accomplish in 30 seconds.
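For illustration only, here is a sketch of how such a throughput metric could be computed; the job system has its own bookkeeping, and record_completion / jobs_completed_last_30s are hypothetical helpers:

```python
# Hypothetical sketch: count job completions in a trailing 30-second window.
import time
from collections import deque

WINDOW_SECONDS = 30
_completions = deque()          # timestamps of recently completed jobs

def record_completion(ts=None):
    _completions.append(ts if ts is not None else time.time())

def jobs_completed_last_30s(now=None):
    now = now if now is not None else time.time()
    while _completions and _completions[0] < now - WINDOW_SECONDS:
        _completions.popleft()  # drop completions that fell out of the window
    return len(_completions)
```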
Kubernetes configuration
The main difference between running on raw VMs and on Kubernetes was getting the latter’s scheduling right. In the job system, a parent coordinates its workers; a parent and its associated workers are deployed together, and the number of these parent-worker groups is scaled to match the workload. In Kubernetes, each group runs as a pod.
Most of our performance gains came from fine-tuning resource requests to optimize pod scheduling. The goal was to fit six pods on each node.
Initially, each pod was configured to request a whole CPU core. A c5.2xlarge has eight cores and 16 GB of RAM, with Kubernetes, system services, and the kernel taking up about 1.5 GB. With those CPU and memory requests, only four pods could be scheduled on each node. So we made our first changes: lowering the memory request to 500MB and the CPU request to 100m, that is, 100 millicores.
Adjusting the CPU requests brought us to six pods per node for the most part, although some nodes only scheduled five pods.
The memory request had been sized in a similar spirit, but it was still too high. In addition to Kubernetes itself, we run other daemons on every node, and they took up enough memory that some hosts could only fit five of the six desired pods given the memory request value.
When sizing request values to land a specific number of pods on a node, only one resource should be used as the limiting factor; the other requests should simply reflect how much of each resource the pod actually needs.
These changes had no negative impact on runtime. Request values are used only as a minimum when scheduling pods onto nodes: as long as a pod requests enough resources to function, it will run fine. Once the pods are up and running, it is the limit values that constrain them, and limit values have no impact on pod scheduling.
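To make the requests/limits distinction concrete, here is a hedged sketch using the official Kubernetes Python client (we deploy plain pod specs; the container name, image, and limit values below are placeholders, while the request values are the ones described above):

```python
# Sketch only: the resources stanza of a job-system container.
# Requests influence scheduling; limits constrain the running pod.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "100m", "memory": "500Mi"},   # minimums used for scheduling
    limits={"cpu": "2", "memory": "2Gi"},          # placeholder runtime ceilings
)

container = client.V1Container(
    name="jobsystem-parent",                 # placeholder name
    image="registry.example.com/jobsystem",  # placeholder image
    resources=resources,
)
```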
Why does each pod have only one parent?
Should each pod run several parent processes? A single parent and its workers form a natural unit for the application, but is that unit too small?
Lurking behind that question was one about overhead: how much does a pod cost? If the overhead is considerable, we should reduce the number of pods and pack more into each one; if it’s low, one parent per pod is far easier to manage than several.
Recognizing the pods
We looked at all the processes on a node using pstree and found six job system instances (together with all their child processes and threads) in lines like: containerd-shim───tini───/opt/app/bin───3*[jobsystem – dumm]. After some finagling with the results, we figured out which processes belonged to each job system pod, and then started tallying up what was left over.
That left the per-pod overhead: a few extra processes per pod, chiefly containerd-shim. Next, we looked at their CPU and RAM usage.
Per pod CPU overhead
We looked at perf sched for the CPU:
$ sudo perf sched record -- sleep 10
$ sudo perf sched latency > k8s-jobsystem-lat-x2.txt
Ten seconds of wall-clock time on eight CPUs gives us 80 possible CPU-seconds. After some tabulation of the file k8s-jobsystem-lat-x2.txt, we got:
| Process | Total CPU Time (ms) |
|---|---|
| TOTAL: | 27,914.6580 |
| jobsystem – dumm:(779) | 11,823.1570 |
| /opt/app/bin:(52) | 5,214.8220 |
| runc:(1939) | 2,623.0330 |
| containerd:(195) | 1,684.0350 |
| agent:(123) | 1,608.1120 |
| psql:(24) | 819.7830 |
| trace-agent:(19) | 800.7580 |
| containerd-shim:(299) | 601.6430 |
| systemd:(3) | 516.8040 |
Out of the 80 CPU-seconds available, the CPU was only active for about 28 seconds. All of the containerd-shim processes together took roughly 600 milliseconds of that. Spread over six pods, that’s 100 milliseconds per pod per ten seconds, or about 10 milliseconds per pod per second once you average it out.
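For reference, the tabulation itself is just a grouping-and-summing pass over the latency report. Below is a minimal sketch of that step, assuming the standard pipe-separated output of perf sched latency (task names carry a :pid suffix that we strip before grouping); it approximates rather than reproduces the script we actually used:

```python
# Sketch: sum "Runtime ms" per process name from a `perf sched latency` report.
import re
from collections import defaultdict

runtime_ms = defaultdict(float)
instances = defaultdict(int)

with open("k8s-jobsystem-lat-x2.txt") as f:
    for line in f:
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 2 or "Runtime ms" in line:
            continue                                 # skip separators and the header row
        task, runtime = cols[0], cols[1]             # e.g. "containerd-shim:1234", "12.345 ms"
        match = re.match(r"([\d.]+)\s*ms", runtime)
        if not match:
            continue
        name = task.rsplit(":", 1)[0]                # strip the ":pid" suffix
        runtime_ms[name] += float(match.group(1))
        instances[name] += 1

for name, total in sorted(runtime_ms.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}:({instances[name]})  {total:,.4f}")
```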
Memory overhead per pod
The memory overhead of a process is complicated to measure: processes share memory, and the kernel uses RAM that isn’t explicitly assigned to any process (the page cache). We don’t have to worry about the page cache for containerd-shim, but we do have many instances of it sharing a lot of memory.
So we asked the kernel how much private memory each of them used, and the answer was between 1 and 5 megabytes apiece.
$ for i in $(pgrep containerd-shim); do
    echo -n $i
    sudo cat /proc/$i/smaps | grep '^Private' | awk 'BEGIN{ f=0 } { f += $2; } END { print ": ", f, " kB" }'
  done
436: 1404 kB
.. snip ..
23796: 4688 kB
24722: 1380 kB
26633: 1332 kB
26734: 4536 kB
.. snip ..
32355: 1080 kB
Things we didn’t account for
Ten milliseconds of CPU per pod per second, with 1-5 MB of memory per pod, is not much. But it is only part of the story: additional pods share the same image, require a few more routing table rules, and appear in various Kubernetes data structures.
We don’t believe those overheads are any higher than the ones we measured for containerd-shim, but they are harder to measure, so the results are less reliable. Each of our Kubernetes nodes, for example, can hold at most 110 pods, and some of the overheads may scale as $\log(N)$; with such small values of $N$, it is hard to separate that class of runtime and memory cost from the noise.
Work execution
Next, we tried to observe what each machine was actually doing. We ran mpstat on one of these production hosts to watch how busy the CPU was, and the %usr column (similar to “User CPU (percent)” in the host dashboard) oscillated high and low in a regular cadence.
The CPUs were frequently idle, which was very surprising compared to the job system’s baseline data from four months earlier, when everything ran on VMs. Digging through the source, we found a block like this:
for w in workers:
    w.check()
    time.sleep(w.sleep_delay)
w.check() checks whether the worker has finished its job, and the loop sleeps between checks, so the total interval between two checks of the same worker is len(workers) * sleep_delay. A worker sits idle from the moment it finishes a job until it is next checked. That is acceptable when the ratio of check interval to average job time is low, but it wasn’t here: each worker was checked every two seconds, and most jobs took about three seconds to complete. That leaves $\frac{3 \bmod 2}{2} = \frac{1}{2} = 0.5$ seconds of idle time per job, an overhead of $0.5 / 3 = \frac{1}{6} \approx 16.67\%$ idle CPU per worker. That’s more than one idle core per instance.
Reducing the delay so that each worker is checked once every second cut the node count by 15%. There may still be more fine-tuning to be done to improve performance further.
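A minimal sketch of what that adjustment looks like (simplified; the real parent carries more state, and check() is the job system’s own method): instead of a fixed per-worker sleep, the delay is derived from the number of workers, so each worker is rechecked roughly once per second.

```python
# Sketch of the adjusted polling loop: one full pass over the workers takes
# about CHECK_BUDGET_SECONDS regardless of how many workers a parent owns.
import time

CHECK_BUDGET_SECONDS = 1.0   # target: recheck every worker about once per second

def poll_workers(workers):
    per_worker_delay = CHECK_BUDGET_SECONDS / max(len(workers), 1)
    while True:
        for w in workers:
            w.check()                     # hand out new work if this worker is done
            time.sleep(per_worker_delay)  # spread checks evenly across the budget
```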
Results
Compared to bare VMs, the switch to Kubernetes initially carried a 100 percent overhead. We reduced that overhead to 38%, with room for further improvement.
The Kubernetes cluster started with 200 nodes, replacing a non-Kubernetes AWS cluster of 100 nodes. After the request and delay adjustments, the node count dropped to 138. The AWS cluster runs an older kernel (3.13) than the Kubernetes cluster (5.0) and has no CPU mitigations enabled. Because the cost of those mitigations plus the remaining Kubernetes overhead is around 15-20%, we believe further tuning will bring the cluster down to about 120 nodes.
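To make those overhead figures concrete (measured here as extra nodes relative to the 100-node VM baseline):

$$\text{overhead} = \frac{N_{\text{Kubernetes}} - N_{\text{VM}}}{N_{\text{VM}}}, \qquad \frac{200 - 100}{100} = 100\%, \qquad \frac{138 - 100}{100} = 38\%.$$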
How We Reduced Kubernetes Overhead in Our Job System
We were left with two questions: why was the Kubernetes deployment performing so poorly, and how should our job system map its processes to pods?
Experiment design can be time-consuming, but once we had sorted that out, we could answer both questions. The answer to the first comes down to the fact that deployment configuration is hard: to fit enough pods on each node, we had to choose a better set of request values in the pod specification.
The second question prompted us to investigate how Kubernetes pods are represented in the process tree. We used perf sched to measure CPU overhead and /proc/$pid/smaps to measure memory overhead. The answer was “very little”: a few megabytes of memory per pod, and a few kubelet status requests every few seconds amounting to roughly ten milliseconds of CPU time per second.
Finally, we noticed that each node’s CPU usage followed a sawtooth pattern: the workers were spending a lot of time doing nothing. The cause was the parent’s interval between worker status checks; lowering that interval roughly halved the time between a worker finishing a job and the parent handing it a new one.
The service may now be maintained and scaled the same way as any other Kubernetes service, with all of the advantages of a single containerized workflow that works across cloud providers.
A look to the future: bin-packing pods
We also recognized that by improving the pod specification and node size, we might be able to squeeze out even more efficiency. But that’s a story for another time.