Planning for Microsoft Azure Kubernetes Services (AKS) Upgrades


Kyle Nicolo

Cloud Operations Lead
January 24, 2023

It can certainly feel intimidating: your AKS cluster version is behind, and you're in danger of falling out of support. Regular upgrades to your AKS clusters ensure that you have the latest features, bug fixes, and security improvements. AKS is one of the few Azure PaaS offerings that require customer-initiated upgrades, and while these are significantly easier than traditional Kubernetes upgrades, they still require some planning.
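Before planning anything, it helps to know where you stand. Below is a quick sketch of how we might check a cluster's current version and the upgrade paths available to it, using the same $rg, $aksName, and $sub placeholders as the examples later in this post:

# Check the cluster's current Kubernetes version

az aks show --resource-group $rg --name $aksName --subscription $sub --query kubernetesVersion --output tsv

# List the upgrade versions available to the cluster

az aks get-upgrades --resource-group $rg --name $aksName --subscription $sub --output table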

Approach

During our many upgrade cycles, we’ve essentially come up with two general approaches to AKS upgrades. Both have their pros and cons, but let’s run through the general concepts. 

It is worth noting that there is a third option: a full in-place, single-command upgrade. We'll call this "Method 0," since we can't really recommend it outside of cases such as lab or test deployments where you aren't concerned if things go up in flames. This method lets Azure run the full upgrade of the control plane, system, and user nodepool(s) automatically on its own. It may sound nice, but we can tell you from experience it's challenging to troubleshoot if any issues occur.
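For reference, a minimal sketch of what Method 0 looks like; omitting --control-plane-only from the upgrade command tells Azure to upgrade the control plane and every nodepool in one operation:

# Method 0: upgrade control plane and all nodepools in a single command

az aks upgrade --resource-group $rg --name $aksName --subscription $sub --kubernetes-version $kubeVersion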

The first method we’ll call the “traditional approach”. This will closely resemble what you would find in Microsoft’s documentation if you were to research upgrading AKS.


Method 1: Traditional

Our example will show commands using Azure CLI.  First, we initiate the upgrade of our control-plane to our desired version:

az aks upgrade --resource-group $rg --name $aksName --subscription $sub --kubernetes-version $kubeVersion --control-plane-only

Once that completes, we would proceed with upgrading the individual nodepools one by one starting with the system nodepool:

az aks nodepool upgrade --name $nodepoolName --cluster-name $aksName --resource-group $rg --subscription $sub --kubernetes-version $kubeVersion --max-surge 25%

While each nodepool is upgrading, we keep an eye on the pods running on it and ensure they are restarted successfully on new nodes. Once a nodepool is complete, we move on to the next. We also periodically perform application or service health checks against the microservices running on the cluster to head off, or at least identify, issues before they cause impact.
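As a sketch of what that watching looks like, here are a couple of standard kubectl queries we might keep running in a second terminal (assuming, as AKS does by default, that the agentpool node label matches $nodepoolName):

# Watch pods across all namespaces as they restart on new nodes

kubectl get pods --all-namespaces -o wide --watch

# Confirm the nodepool's nodes come up Ready on the expected version

kubectl get nodes -l agentpool=$nodepoolName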

One of the problems we've noticed with the "traditional approach" is that when things don't go as planned, your options can be limited: you're frequently stuck sitting on your hands for an extended amount of time while you wait for the upgrade command to fail out and allow you to adjust. We've found that troublesome pods that aren't in a healthy state, or a laundry list of other things, can cause issues with upgrades, many of which could be caught in pre-checks but aren't integrated into the base process.
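A couple of pre-checks we might run by hand before kicking off an upgrade; this is a sketch of the idea, not an exhaustive list:

# Flag any pods that aren't in a healthy state before upgrading

kubectl get pods --all-namespaces -o wide | grep -v Running | grep -v Completed

# Review PodDisruptionBudgets; a PDB that allows zero disruptions can stall node drains

kubectl get pdb --all-namespaces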

This brought us to our next approach which we’ll call “Nodepool Shuffle”.


Method 2: Nodepool Shuffle

With this approach, we reduce risk and the time spent watching paint dry while the running command spins in circles. The general idea is that instead of allowing Azure to run the upgrade fully in the background, we take things into our own hands: we deploy new nodepools at the desired version, then manually evict pods from the old nodes until those nodes are empty and can be deleted.

We still start by upgrading the control plane, as we do in the traditional method. We'll also often still do the traditional upgrade on the system nodepool, as those are generally very small and do not run many microservice/app pods. Once that is complete, we deploy the new nodepool(s) at the desired version, matching the specs of the existing nodes.

# Control Plane Upgrade

az aks upgrade --resource-group $rg --name $aksName --kubernetes-version 1.23.12 --control-plane-only --subscription $subscription

# Upgrade Existing System Nodepool

az aks nodepool upgrade --name $systemNodepoolName --resource-group $rg --cluster-name $aksName --kubernetes-version 1.23.12 --max-surge 25% --subscription $subscription
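Before deploying the replacement pool, it can help to confirm the original pool's specs so the new pool mirrors them. A sketch, assuming $origNodepoolName is the existing user nodepool:

# Inspect the original nodepool's specs to mirror in the new pool

az aks nodepool show --name $origNodepoolName --resource-group $rg --cluster-name $aksName --subscription $subscription --query "{vmSize:vmSize, count:count, maxPods:maxPods, osDiskSizeGb:osDiskSizeGb, osType:osType}" --output table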

Once our new nodepool is up, we cordon each node in the original nodepool to prevent pods from being scheduled or restarted on them. Then we launch a "drain" operation, evicting all running pods from the node and causing them to restart on another node. We do this one node at a time while carefully watching that all pods restart successfully on the new nodepool. We've taken much of the legwork out of figuring out the commands by building them into our automated process, so they are provided for us, as in the example below.

# Deploy New User Nodepool

az aks nodepool add --name $newNodepoolName --resource-group $rg --cluster-name $aksName --subscription $subscription --kubernetes-version 1.23.12 --max-surge '25%' --mode User --node-count 4 --node-osdisk-size 30 --node-vm-size Standard_D4s_v3 --os-type Linux --max-pods 75 --enable-cluster-autoscaler --min-count 4 --max-count 86

## Cordon All Nodes

kubectl cordon $nodeName01

kubectl cordon $nodeName02

kubectl cordon $nodeName03

kubectl cordon $nodeName04

## Drain Node $nodeName01

kubectl drain $nodeName01 --ignore-daemonsets --delete-emptydir-data

# Verify no application pods remain on the drained node

kubectl get pods -o wide --all-namespaces | grep $nodeName01 | grep -v kube-system | grep -v aad-pod-identity

# Confirm evicted pods have restarted successfully elsewhere

kubectl get pods -o wide --all-namespaces | grep -v Running | grep -v Completed

… Once complete, proceed to drain the next node in the pool until all are complete …

# Delete Original Nodepool

az aks nodepool delete --name $origNodepoolName --resource-group $rg --cluster-name $aksName --subscription $subscription

It's essential to do this in a controlled fashion so that you don't drain too many original nodes at once and cause adverse impacts to the running microservices. Again, while doing this we periodically perform whatever application or service validation checks we can to catch potential negative impacts.
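If you'd rather not cordon and drain node names by hand, the per-node sequence can be scripted. A minimal sketch in bash, assuming the original pool's nodes carry the standard AKS agentpool label and pausing for validation between drains:

# Cordon every node in the original pool up front
for node in $(kubectl get nodes -l agentpool=$origNodepoolName -o name); do
  kubectl cordon "$node"
done

# Then drain them one at a time, validating between nodes
for node in $(kubectl get nodes -l agentpool=$origNodepoolName -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Pause to verify pods restarted successfully on the new pool before continuing
  read -p "Drained $node. Press Enter to drain the next node..."
done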

The benefit of this approach is that it puts additional control in our hands. If we discover issues at any point, we can easily pause, resolve them, or roll pods back to the previous nodes running the "old" code. The downside is that it involves a fair bit more user interaction and keyboard surfing.
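Rolling back mid-flight is as simple as re-enabling scheduling on an old node so pods can land there again; using the same placeholder node names as above:

# Re-enable scheduling on an old node to roll pods back to it

kubectl uncordon $nodeName01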


Recommendations

With a lot in motion, and often moving rather quickly, here are a few general recommendations we like to keep in mind:

  • Aim to run the latest patch release of the minor version you are running.
  • Preferences vary, but we generally recommend staying one GA minor release behind the latest version.
  • Schedule regular upgrade cycles to maintain support compliance and avoid having your running version deprecated.
    • You have 30 days from a version's removal to upgrade to a supported minor version and continue receiving support; however, you can no longer create clusters or node pools on a version once it has been deprecated/removed.
  • Regardless of which approach you choose, upgrade the control plane first, followed by the individual nodepools.

Kubernetes Versions

AKS/Kubernetes versions are released rather rapidly compared to many other services, so let's examine the cadence. The Kubernetes community releases minor versions roughly every three months. Starting with version 1.19, each minor version is supported for 12 months. AKS supports three GA minor versions of Kubernetes:

  • The latest GA minor version that is released in AKS (which we'll refer to as N).
  • Two previous minor versions.
    • Each supported minor version also supports a maximum of two stable patch releases.
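To see exactly which versions AKS currently supports in your region, the CLI can list them; a sketch using eastus as an example location:

# List the Kubernetes versions AKS currently supports in a region

az aks get-versions --location eastus --output table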

Version Format

[major].[minor].[patch]

Example:
1.22.6

Each number in the version indicates general compatibility with the previous version:

  • Major versions change when incompatible API updates are introduced or backwards compatibility may be broken.
  • Minor versions change when functionality updates are made that are backwards compatible with the other minor releases.
  • Patch versions change when backwards-compatible bug fixes are made.

Release and Deprecation Process

Upcoming version releases and deprecations can be viewed on the AKS Kubernetes Release Calendar.


Summary

While certainly not without their hiccups, AKS upgrades are usually a smooth and painless process. Keeping up with release cycles will prevent you from falling too far behind or having to cram multiple upgrades into a short period. With some structured planning you can feel more comfortable with the process, and it won't be daunting anymore. Plus, who doesn't want to take advantage of all those great feature releases!

Have questions or comments?  Feel free to contact us directly at ecms@eplus.com anytime. Be sure to check out the full ePlus Cloud Managed Services blog.  Happy Clouds!
