Scott Ringwelski's Blog

Autoscaling Your Kubernetes Service

When I first started exploring Kubernetes, autoscaling appeared to be a flip-a-switch option that simply needed to be turned on. Indeed, Kubernetes provides a lot out of the box, such as the Cluster Autoscaler and Horizontal Pod Autoscalers, and abstractions like Deployments provide the solid foundations needed to make autoscaling easy.

But autoscaling involves more than creating and deleting Nodes and Pods. If you are running a highly available system where errors, timeouts, or downtime are unacceptable, there is a bit more digging required. In this post I’ll outline what it takes to get autoscaling working for a highly available system on Kubernetes.

The Two Autoscalers

In Kubernetes, there are two autoscalers required to have a truly autoscaling system: the Horizontal Pod Autoscaler and the Cluster Autoscaler.

Horizontal Pod Autoscaler

The core of how your Kubernetes Deployments autoscale is the HorizontalPodAutoscaler (HPA). The HPA is a built-in Kubernetes resource that can be declared just like a Deployment or DaemonSet. An HPA spec has four key components:

  1. The target to autoscale, such as a Deployment
  2. Minimum number of Replicas
  3. Maximum number of Replicas
  4. The metric criteria to scale on

Let’s take a look at an example for our “blog” app:
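A minimal manifest for this HPA might look like the following sketch (the 75% CPU target and the autoscaling/v1 API version are assumptions, not taken from the original example):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: blog-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: blog-web
  minReplicas: 5
  maxReplicas: 25
  # Target average CPU utilization, as a percentage of each Pod's CPU request
  targetCPUUtilizationPercentage: 75
```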

This HPA scales the blog-web Deployment between 5 and 25 replicas. It scales on CPU utilization, measured as a percentage of the CPU resource requests given to each Pod.

The HPA can also autoscale on your own custom metrics, and custom-metric support has greatly improved as of Kubernetes 1.10. In our case, CPU has worked well not only for web servers but also for background workers.

Cluster Autoscaler

Horizontal Pod Autoscalers won’t get you anywhere on their own. Kubernetes runs workloads on its pool of Nodes, and if you run out of compute, your new replicas will sit Pending, marked unschedulable.

This is where the Cluster Autoscaler comes in. Cluster Autoscaler looks at your total resource requests across all of your workloads, and adds and removes nodes based on this information. Roughly speaking, it targets 75% of Node capacity at all times.

As your HPA scales up blog-web Pods, the total resource requests in your cluster increase. If the total requests go above 75% (in either CPU or Memory), the Cluster Autoscaler will add Nodes to ensure that the new Pods can run. And as the HPA scales down, the Cluster Autoscaler will remove a Node that is no longer needed (one whose workloads can all be moved to a different Node).

To use the Cluster Autoscaler effectively, it’s important that all your Pods have a resource request and limit specified. The autoscaler uses the requested CPU and Memory to make its decisions, not the actual CPU usage or Memory usage.
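For example, the blog-web container spec might declare requests and limits like this (the values are illustrative, not recommendations):

```yaml
# Container spec fragment inside the blog-web Deployment
resources:
  requests:
    cpu: 500m      # the HPA and Cluster Autoscaler do their math on these
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```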

Keeping Services Highly Available

Autoscaling has an important side effect: Pods are constantly being disrupted, moved to different Nodes, and spun up. To ensure our users aren’t hitting errors or slow page loads during these autoscaling events, we need to put a number of safeguards and policies in place for our services. These policies tell Kubernetes how to gracefully move Pods around without impacting our Services’ availability or performance.

Liveness and Readiness Probes

Core to running an HA service on Kubernetes are Liveness and Readiness probes. Even without autoscaling, these are important to have set up and working. These health checks are what determine whether a Pod is available and serving traffic.

When it comes to cluster autoscaling, these checks are critical to moving Pods between Nodes. Without them, Kubernetes has no way to know whether a Pod is ready to accept traffic after it has been moved.

They also matter for the HPA. With an HPA, we may be creating new Pods fairly frequently, and it’s important that a Pod does not start receiving requests before it is ready to handle them.
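As a sketch, HTTP-based probes for the blog-web container might look like this (the /healthz path and port 8080 are assumptions about the app):

```yaml
# Container spec fragment: restart the container if liveness fails,
# and withhold traffic until readiness passes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
```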

Pod Disruption Budget

The reason we autoscale is to handle increased traffic, which requires more Pods. But what if half of our Pods are running on a Node that the Cluster Autoscaler has calculated should be removed?

Kubernetes provides the PodDisruptionBudget (PDB) to set policies around what level of “disruption” is allowed for a given set of Pods. A PDB policy spec has two parts:

  1. The target Pods, using a selector
  2. The availability rule to apply to those Pods

Here’s an example:
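This sketch assumes the web Pods carry an app: blog-web label (and uses the policy/v1beta1 API version that was current at the time):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: blog-web
spec:
  # At most 2 of the matched Pods may be voluntarily disrupted at once
  maxUnavailable: 2
  selector:
    matchLabels:
      app: blog-web   # label selector is an assumption
```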

In this example, we are telling Kubernetes to ensure that the web Pods for our blog never have more than 2 Pods unavailable. With this information, Kubernetes will only evict a target Pod if the availability spec would still be met, and it uses the Liveness and Readiness checks to ensure a moved Pod is ready and available before moving the next one. A PDB spec can also take a percentage, such as maxUnavailable: 10%, to better account for a set of Pods whose count varies under an HPA.

To see the PDBs in the cluster, and how many disruptions are allowed at a given time, you can use kubectl get poddisruptionbudget --all-namespaces.

Graceful Termination

In our autoscaling cluster, Pods will be terminated constantly, so it’s important that terminating a Pod be graceful. What “graceful” means depends on the service. For a Pod serving web requests, it means the Pod finishes processing its current requests and terminates only once all of them are complete. For a worker Pod, it means finishing the jobs it is running and putting any long-running jobs back onto the queue.

When a Pod needs to be terminated, Kubernetes uses the following steps:

  1. The Pod is removed from any Services, so it stops receiving traffic
  2. The Pod’s status is set to “Terminating”
  3. (Simultaneous with 4) If configured, the preStop hook runs. This hook is custom to each Pod and depends on the code the container is running. After the preStop hook completes, Kubernetes sends a SIGTERM to the Pod.
  4. (Simultaneous with 3) Kubernetes waits terminationGracePeriodSeconds. The default is 30s, but it can be configured.
  5. A SIGKILL is sent to the Pod
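Both the grace period and the hook are set on the Pod spec. As a sketch (the 60-second grace period, the image name, and the sleep command are illustrative):

```yaml
# Pod spec fragment
spec:
  terminationGracePeriodSeconds: 60   # default is 30
  containers:
    - name: blog-web
      image: blog-web:latest          # image name is illustrative
      lifecycle:
        preStop:
          exec:
            # e.g. pause briefly so load balancers finish draining
            command: ["/bin/sh", "-c", "sleep 5"]
```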

For many services, the default setup will work great out-of-the-box. For example, a default Sidekiq worker will respond to the SIGTERM by waiting 8 seconds for Jobs to finish; any that don’t finish in 8 seconds are killed and pushed back onto the queue.

For other systems, some configuration may be needed. For example, PgBouncer treats SIGTERM as an immediate shutdown. For a more graceful termination, first send a SIGINT in a preStop hook to allow queries to finish.
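A sketch of such a preStop hook, assuming PgBouncer runs as PID 1 in its container (the 20-second sleep is illustrative):

```yaml
# Container spec fragment for a PgBouncer Pod
lifecycle:
  preStop:
    exec:
      # SIGINT tells PgBouncer to stop accepting new connections and
      # let in-flight queries finish before the SIGTERM arrives
      command: ["/bin/sh", "-c", "kill -INT 1 && sleep 20"]
```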


Kubernetes provides all the right abstractions and policies to do autoscaling right, but it isn’t just flip-the-switch: policies and configuration are required to use it effectively. Even if you aren’t autoscaling, these policies are good to have in place for continuous delivery, performing upgrades, and auto-healing.

More reading: