Kubernetes Scaling - How it's Done
So you’ve finally moved to Kubernetes, great! But how do you plan on making use of the handy technology it brings you? In this article I’ll give a quick breakdown of the methods of scaling, their use cases, and the pitfalls of each.
The Elephant in the Room: Proper Resource Requests #
While I was at a previous job, I encountered a customer who, for the life of them, couldn’t figure out why their AKS cluster wasn’t scaling nodes properly. The reason: resource requests. In this case, it was a lack of them rather than misconfigured ones.
Kubernetes schedules workloads onto nodes based on, among other things, the resources requested by that workload. If these are misconfigured, pods can suddenly get terminated because the node doesn’t have enough free resources, or more nodes can get spun up than are actually needed. That’s why it’s so important to get this right. What I usually do when running a deployment in a test environment is to look at the usage there while running a load test tool such as K6. (Article on it here.)
Then I set that observed usage plus a few percent as the request. I usually set the limit to twice the request, then tweak both once it’s in production.
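As a rough sketch, here’s what that looks like in a Deployment spec. The name, image, and numbers are all placeholders — substitute whatever your own load test shows:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app          # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          resources:
            requests:
              cpu: 250m        # observed usage under load, plus a few %
              memory: 256Mi
            limits:
              cpu: 500m        # roughly double the request
              memory: 512Mi
```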
Node Autoscaling #
In most managed Kubernetes scenarios, this is what can really drive up cost, but it can also be great for saving money depending on how you use it.
Simply put, if your workloads are requesting so many resources that you’re nearly out of CPU or RAM, a new node should get provisioned. This is great for workloads whose demand you expect to ramp up at different periods. Since you can use multiple node pools as well, you could have a standard node pool with the minimum number of nodes you know you’ll always need, and then another node pool made up only of spot instances. While spot instances aren’t great for workloads that need to run 24/7 with no interruptions, they can save you up to 90%!
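To steer only interruption-tolerant workloads onto a spot pool, you typically tolerate the taint the provider puts on spot nodes and add a node affinity for them. A minimal sketch of the pod spec fields involved, using the taint key AKS applies to spot node pools (other providers use their own keys, so check your platform’s docs):

```yaml
# Pod spec snippet: schedule a fault-tolerant workload onto spot nodes.
spec:
  tolerations:
    - key: kubernetes.azure.com/scalesetpriority   # AKS spot taint
      operator: Equal
      value: spot
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values: ["spot"]
```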
If you’re running managed kubernetes in a lab environment, this is an absolute life saver.
DaemonSets #
These go somewhat hand-in-hand with node autoscaling, although neither strictly requires the other. A DaemonSet ensures that all (or some) nodes run a copy of a pod. This is great if you have something like node-exporter installed to expose node metrics to Prometheus. Lots of storage tech also needs DaemonSets, so you’ll often see them in Kubernetes clusters.
While this isn’t strictly pod autoscaling, it’s definitely something to keep in mind if you have boilerplate pods that should run on every node alongside your workload.
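The node-exporter case mentioned above makes a good minimal example. This is a stripped-down sketch (the `monitoring` namespace is an assumption, and a production setup would add host mounts and security context):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring   # assumed namespace
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:latest
          ports:
            - containerPort: 9100   # Prometheus scrapes metrics here
              name: metrics
```

Because it’s a DaemonSet, every node the autoscaler adds automatically gets its own copy of this pod.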
Pod Autoscaling #
In a lot of scenarios, you may need your pods to scale as well as your nodes. In fact, it’s something I’d highly recommend no matter what type of application you run in your cluster. However, there are a few different ways to approach it.
HPA #
Horizontal Pod Autoscaling looks a little bit like the diagram above. For any pod covered by an HPA, the HPA controller checks its resource usage via the Metrics API (served by a component like metrics-server), and if usage breaches a threshold you set (e.g. 80% CPU or RAM) then another pod gets spun up. This is great if you have an application that can run in a distributed manner. Something like an nginx or Apache server would be perfect for this, assuming you have a layer 7 load balancer in front of it such as Nginx Ingress.
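An 80% CPU threshold like the one described above looks like this with the `autoscaling/v2` API (the Deployment name and replica bounds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web             # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # scale out once average usage passes 80% of requested CPU
```

Note that utilization here is measured against the pod’s resource *requests*, which is another reason to get those right first.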
There is a bit of a gotcha here for certain workloads, though. If your workload stores any information in an emptyDir, you really should move it to a Persistent Volume instead. Additionally, if you currently use any Persistent Volumes, make sure the access mode is set to ReadWriteMany. If you don’t, new copies of your pod may fail to spin up on other nodes since they can’t bind the volume to themselves.
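A sketch of such a claim. Note that ReadWriteMany needs a storage backend that actually supports it (NFS, Azure Files, EFS, and similar); the storage class named here is just an example:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany          # lets every replica mount the volume, on any node
  resources:
    requests:
      storage: 5Gi
  storageClassName: azurefile   # example RWX-capable class; varies by provider
```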
VPA #
So what if your workload can’t be served by multiple pods? If that’s the case, Vertical Pod Autoscaling is your friend! Remember how I mentioned at the start how important it is to set good resource requests? Well, this does it for you. You can either have the VPA make recommendations without modifying any configuration, or have it apply the changes automatically as well. I’d be cautious of the latter option in some cases though: applying new requests means restarting pods, so if you experience very sudden upticks in demand, you may have a small period where your app slows to a crawl. Here’s how you set it up in EKS, just for reference.
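Here’s a sketch of the recommendations-only mode (the target StatefulSet name is hypothetical; the VPA CRDs must be installed in the cluster first):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: monitoring-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: prometheus        # hypothetical target workload
  updatePolicy:
    updateMode: "Off"       # recommend only; "Auto" would apply changes by evicting pods
```

With `updateMode: "Off"`, you can read the suggested requests from the VPA object’s status and apply them yourself on your own schedule.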
What would I recommend? #
Now that I’ve explained each method of scaling, you might ask which one I’d recommend. The answer really is all of them! In most scenarios, especially in a multi-tenant environment, you’d likely benefit from every single method. For example, in my GKE cluster I have the node autoscaler set just in case I get a big spike in load, a VPA set to make recommendations (but not change my configuration) for my monitoring stack since that’s currently the biggest resource hog, and HPAs for my web deployments so I never have requests rejected due to high load. I probably have a DaemonSet somewhere too.