Cloud Operations Lead
January 24, 2023
It can certainly feel intimidating: your AKS cluster version is behind and you’re in danger of falling out of support. Regular upgrades to your AKS clusters ensure that you have the latest features, bug fixes, and security improvements. AKS is one of the few Azure PaaS offerings that require customer-initiated upgrades, and while they are significantly easier than traditional Kubernetes upgrades, they still require some planning.
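If you’re unsure where a cluster stands, the CLI can show its current version and available upgrade targets (using the same shell variables as the examples later in this post):
# Show the cluster's current version and available upgrade paths
az aks get-upgrades --resource-group $rg --name $aksName --output table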
During our many upgrade cycles, we’ve essentially come up with two general approaches to AKS upgrades. Both have their pros and cons, but let’s run through the general concepts.
It is worth noting that there is a third option: a full in-place, single-command upgrade. We’ll call this “Method 0”, as we can’t really recommend it except in cases such as lab or test deployments where you aren’t concerned if things go up in flames. This method lets Azure run the full upgrade of the control plane, system nodepool, and user nodepool(s) in one automated pass. It may sound nice, but we can tell you from experience it’s challenging to troubleshoot if any issues occur.
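For reference, Method 0 is simply the upgrade command without any scoping flags, which rolls the control plane and every nodepool in one shot:
# Method 0: single-command, full in-place upgrade (lab/test clusters only)
az aks upgrade --resource-group $rg --name $aksName --subscription $sub --kubernetes-version $kubeVersion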
The first method we’ll call the “traditional approach”. This will closely resemble what you would find in Microsoft’s documentation if you were to research upgrading AKS.
Our example will show commands using Azure CLI. First, we initiate the upgrade of our control-plane to our desired version:
az aks upgrade --resource-group $rg --name $aksName --subscription $sub --kubernetes-version $kubeVersion --control-plane-only
Once that completes, we would proceed with upgrading the individual nodepools one by one starting with the system nodepool:
az aks nodepool upgrade --name $nodepoolName --cluster-name $aksName --resource-group $rg --subscription $sub --kubernetes-version $kubeVersion --max-surge 25%
While each nodepool is upgrading, we keep an eye on the pods running on it and make sure they restart successfully on the new nodes. Once a nodepool is complete, we move on to the next. We also periodically perform application or service health checks on anything driven by the microservices running on the cluster, so we can head off issues, or at least identify them, before they cause impact.
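As a rough sketch, that monitoring can be as simple as a couple of kubectl checks run in another terminal:
# Watch surge nodes join and old nodes drain during the nodepool upgrade
kubectl get nodes -o wide --watch
# Flag any pods that haven't settled into a Running or Completed state
kubectl get pods -o wide --all-namespaces | grep -v Running | grep -v Completed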
One of the problems we’ve noticed with the “traditional approach” is that when things don’t go as planned, your options can be limited, and you’re frequently stuck sitting on your hands for an extended amount of time waiting for the upgrade command to fail out before you can adjust. We’ve found that troublesome pods that aren’t in a healthy state, or a laundry list of other things, can cause issues with upgrades; many of these can be caught in pre-checks but aren’t integrated into the base process.
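For example, two quick pre-checks we’d suggest running by hand before kicking off an upgrade; unhealthy pods and restrictive PodDisruptionBudgets are both common culprits for stalled node drains:
# Find pods that aren't in a healthy state before starting
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
# Review PodDisruptionBudgets that could block node drains
kubectl get pdb --all-namespaces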
This brought us to our next approach which we’ll call “Nodepool Shuffle”.
With this approach, we reduce risk and the time spent watching paint dry while the upgrade command spins in circles. The general idea is that instead of letting Azure run the upgrade fully in the background, we take things into our own hands: we deploy new nodepools at the desired version, then manually drain the old nodes until they are empty and can be deleted.
We still start by upgrading the control plane, as in the traditional method. We also often do the traditional upgrade on the system nodepool, since it is generally very small and doesn’t run many microservice/app pods. Once that is complete, we deploy new nodepool(s) at the new desired version, matching the specs of the existing nodepool(s).
# Control Plane Upgrade
az aks upgrade --resource-group $rg --name $aksName --kubernetes-version 1.23.12 --control-plane-only --subscription $subscription
# Upgrade Existing System Nodepool
az aks nodepool upgrade --name $systemNodepoolName --resource-group $rg --cluster-name $aksName --kubernetes-version 1.23.12 --max-surge 25% --subscription $subscription
Once our new nodepool is up, we cordon off each node in the original nodepool to prevent pods from being scheduled or restarted on them. Then we launch a “drain” operation, evicting all of the node’s running pods so they restart on another node. We do this one node at a time while carefully watching that all pods restart successfully on the new nodepool. We’ve built these commands into our automated process to take the legwork out, but the essentials look like the example below.
# Deploy New User Nodepool
az aks nodepool add --name $newNodepoolName --resource-group $rg --cluster-name $aksName --subscription $subscription --kubernetes-version 1.23.12 --max-surge '25%' --mode User --node-count 4 --node-osdisk-size 30 --node-vm-size Standard_D4s_v3 --os-type Linux --max-pods 75 --enable-cluster-autoscaler --min-count 4 --max-count 86
## Cordon All Nodes
kubectl cordon $nodeName01
kubectl cordon $nodeName02
kubectl cordon $nodeName03
kubectl cordon $nodeName04
## Drain Node $nodeName01
kubectl drain $nodeName01 --ignore-daemonsets --delete-emptydir-data
## Verify no application pods remain on $nodeName01
kubectl get pods -o wide --all-namespaces | grep $nodeName01 | grep -v kube-system | grep -v aad-pod-identity
## Confirm all pods are back in a Running or Completed state
kubectl get pods -o wide --all-namespaces | grep -v Running | grep -v Completed
… Once complete, proceed to drain next node in the pool until all are complete …
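As a minimal sketch, the cordon-and-drain cycle above could be looped over the whole pool; this assumes the standard AKS agentpool node label, so verify it against your cluster first:
# Drain every node in the original pool, one at a time (sketch only;
# the agentpool label is standard on AKS nodes, but confirm before relying on it)
for node in $(kubectl get nodes -l agentpool=$origNodepoolName -o name); do
  kubectl drain $node --ignore-daemonsets --delete-emptydir-data
  # pause here and validate workloads before moving to the next node
done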
# Delete Original Nodepool
az aks nodepool delete --name $origNodepoolName --resource-group $rg --cluster-name $aksName --subscription $subscription
It’s essential to do this in a controlled fashion so that you don’t drain too many original nodes at once and cause adverse impacts to the running microservices. As before, we periodically run whatever application or service validation checks we can to catch any negative impact early.
The benefit of this approach is that it puts additional control in our hands. If we discover issues at any point, we can easily pause, resolve issues, or roll pods back to previous nodes running the “old” code. The downside is there is a fair bit more user interaction and keyboard surfing.
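For instance, if something goes sideways mid-shuffle, uncordoning the old nodes makes them schedulable again so pods can land back on the “old” nodepool:
## Roll back: allow scheduling on an original node again
kubectl uncordon $nodeName01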
With a lot in motion and things moving rather quickly, here are a few general recommendations we like to keep in mind:
AKS/Kubernetes versions are released rapidly compared to many other services. The Kubernetes community releases minor versions roughly every three months, and starting with version 1.19, each minor version receives 12 months of support. AKS supports three GA minor versions of Kubernetes at any given time. Versions follow the scheme:
[major].[minor].[patch]
Example:
1.22.6
Each number in the version indicates general compatibility with the previous version:
[major] changes when incompatible API updates land or backwards compatibility may be broken
[minor] changes with functionality updates that remain backwards compatible with other minor releases
[patch] changes with backwards-compatible bug fixes
Upcoming version releases and deprecations can be viewed on the AKS Kubernetes Release Calendar.
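The same information is available from the CLI; substitute your Azure region for the $location placeholder here:
# List the Kubernetes versions available in a region
az aks get-versions --location $location --output table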
While certainly not without their hiccups, AKS upgrades are usually a smooth and painless process. Keeping up with release cycles will prevent you from falling too far behind or having to do frequent upgrades in a short period. With some structured planning you can feel more comfortable with the process, and it won’t be daunting anymore. Plus, who doesn't want to take advantage of all those great feature releases!
Have questions or comments? Feel free to contact us directly at ecms@eplus.com anytime. Be sure to check out the full ePlus Cloud Managed Services blog. Happy Clouds!