## Update the default node pool
Increasing or decreasing the node count of an existing pool is as easy as modifying the appropriate file in `_infra/terraform` and running `tofu apply`. However, updating the `default_node_pool` node types requires special consideration: terraform treats the `default_node_pool` stanza as immutable, so we have to "trick" terraform into allowing us to change the `default_node_pool` without triggering a cluster rebuild.

See: https://pumpingco.de/blog/modify-aks-default-node-pool-in-terraform-without-redeploying-the-cluster/

(Note: we use slightly different steps from the article above in order to keep the new node configuration in terraform.)
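For a routine node count change, the workflow is just an edit plus an apply. A minimal sketch, assuming the cluster's config lives in `_infra/terraform/glhf` as in the examples below:

```shell
# After editing node_count in the pool's .tf file:
cd _infra/terraform/glhf
tofu plan   # confirm the diff only touches node_count
tofu apply
```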
### Steps
1. Define a node pool that will be the new default. Be sure to set `mode = "System"` to allow system pods to schedule on this pool. Note that the new pool name cannot be changed later.

   ```hcl
   # _infra/terraform/glhf/azure-aks.tf
   resource "azurerm_kubernetes_cluster_node_pool" "new_default" {
     name                  = "newdefault"
     kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
     vm_size               = "Standard_D4as_v6"
     node_count            = 4

     upgrade_settings {
       max_surge                     = "10%"
       drain_timeout_in_minutes      = 0
       node_soak_duration_in_minutes = 0
     }

     mode = "System"
   }
   ```
2. Deploy the new node pool.

   ```shell
   tofu plan
   # Verify the `tofu plan` diff only includes the new node pool.
   tofu apply
   ```

   > **Tip:** If you get a quota error, you may need to request quota increases in the Azure Portal.
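   Before requesting an increase, it can help to check current vCPU quota usage for the region (assuming the cluster lives in `eastus`, per the resource group name):

   ```shell
   # Look for the VM family that backs the new pool's vm_size.
   az vm list-usage --location eastus --output table
   ```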
3. Cordon, then drain the old pool (replace `default` with the correct pool name).

   ```shell
   kubectl cordon -l agentpool=default
   # The following command should exit without errors after all nodes are successfully drained.
   kubectl drain -l agentpool=default --ignore-daemonsets --delete-emptydir-data
   ```
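   To confirm the cordon took effect, list the old pool's nodes; cordoned nodes report `SchedulingDisabled` (this assumes the old pool's label is `agentpool=default`, as above):

   ```shell
   # Cordoned nodes show STATUS "Ready,SchedulingDisabled".
   kubectl get nodes -l agentpool=default
   ```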
4. Monitor the migration and wait for pods to gracefully terminate.

   ```shell
   # Use the :pods or :nodes views in k9s.
   # Wait for all non-DaemonSet pods to terminate and reschedule.
   # Hint: use shift-o to sort by Node.
   k9s
   ```
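   If you prefer plain `kubectl` over k9s, a rough equivalent is to watch pod placement across nodes:

   ```shell
   # -o wide adds the NODE column; --watch streams updates as pods reschedule.
   kubectl get pods --all-namespaces -o wide --watch
   ```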
5. Delete the old node pool.

   ```shell
   # This must be done with the Azure CLI to avoid terraform trying to rebuild the cluster.
   az aks nodepool delete --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --name default
   ```
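   Afterwards, you can verify that only the new pool remains (cluster and resource group names as above):

   ```shell
   # Expect a single "newdefault" pool with mode System.
   az aks nodepool list --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --output table
   ```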
6. Update the `default_node_pool` with the new node pool details, and remove the extra node pool definition.

   ```hcl
   resource "azurerm_kubernetes_cluster" "aks_cluster" {
     # ... (same as previous)
     default_node_pool {
       # Update with new details
       name       = "newdefault"
       node_count = 4
       vm_size    = "Standard_D4as_v6"
       # ... (same as previous)
     }
     # ... (same as previous)
   }

   # Remove old definition for the same pool.
   # resource "azurerm_kubernetes_cluster_node_pool" "new_default" {
   #   ...
   # }
   ```
7. Manually tell terraform that we have "deleted" the resource we created.

   ```shell
   tofu state rm azurerm_kubernetes_cluster_node_pool.new_default
   ```
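   A quick sanity check that the resource is gone from state (`tofu state list` mirrors `terraform state list`):

   ```shell
   # Should print nothing if the node pool resource was removed from state.
   tofu state list | grep azurerm_kubernetes_cluster_node_pool || true
   ```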
8. Confirm `tofu plan` reflects no differences.

   ```shell
   tofu plan
   # ...
   # No changes. Your infrastructure matches the configuration.
   ```
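Finally, it can be worth confirming that the cluster itself is healthy and every node belongs to the new pool:

```shell
# All nodes should be in the "newdefault" pool and report Ready.
kubectl get nodes -o wide
```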