Update the default node pool

Increasing or decreasing the node count of an existing pool is as easy as modifying the appropriate file in _infra/terraform and running tofu apply.
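
For example, a minimal sketch of a node count change (the resource name workers and the exact file are assumptions; edit whichever resource actually defines the pool you are resizing):

    # _infra/terraform/glhf/azure-aks.tf (hypothetical pool resource)
    resource "azurerm_kubernetes_cluster_node_pool" "workers" {
      # ... (other settings unchanged)
      node_count = 5 # was 4; tofu apply resizes the pool in place
    }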

However, updating the node type of the default_node_pool requires special consideration.

Due to a limitation in terraform whereby the default_node_pool stanza is treated as immutable, we have to "trick" terraform into changing the default_node_pool without triggering a full cluster rebuild.

See: https://pumpingco.de/blog/modify-aks-default-node-pool-in-terraform-without-redeploying-the-cluster/

(Note: our steps differ slightly from the article above, in order to keep the new node pool configuration in terraform.)

Steps

  1. Define a node pool that will be the new default.

    Be sure to set mode = "System" so that system pods can be scheduled on this pool.

    Note that the new pool name cannot be changed later.

    # _infra/terraform/glhf/azure-aks.tf
    resource "azurerm_kubernetes_cluster_node_pool" "new_default" {
      name                  = "newdefault"
      kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
      vm_size               = "Standard_D4as_v6"
      node_count            = 4

      upgrade_settings {
        max_surge                     = "10%"
        drain_timeout_in_minutes      = 0
        node_soak_duration_in_minutes = 0
      }

      mode = "System"
    }
  2. Deploy the new node pool.

    tofu plan
    # Verify `tofu plan` diffs only include new node pool.
    tofu apply
    Tip: If you get a quota error, you may need to request a quota increase in the Azure Portal.
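
    To see current vCPU usage against the regional quota before requesting an increase (eastus is an assumption; use the cluster's actual region):

    az vm list-usage --location eastus --output table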

  3. Cordon then drain the old pool (replace default with the correct pool name).

    kubectl cordon -l agentpool=default
    # The following command should exit without errors after all nodes are successfully drained.
    kubectl drain -l agentpool=default --ignore-daemonsets --delete-emptydir-data
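
    Optionally, verify the old nodes are cordoned (kubectl get nodes accepts the same label selector):

    # Cordoned nodes show SchedulingDisabled in the STATUS column.
    kubectl get nodes -l agentpool=default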
  4. Monitor the migration and wait for pods to gracefully terminate.

    # Use the :pods or :nodes views in k9s.
    # Wait for all non-daemon set pods to terminate and reschedule.
    # Hint: use shift-o to sort by Node.
    k9s
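
    If k9s is not available, plain kubectl can watch the same migration:

    # Watch pods in all namespaces; node assignment appears in the NODE column.
    kubectl get pods --all-namespaces -o wide --watch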
  5. Delete the old node pool.

    # This must be done with the Azure CLI to avoid terraform trying to rebuild the cluster.
    az aks nodepool delete --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --name default
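
    To confirm the old pool is gone and only the new pool remains (same cluster and resource group names as above):

    az aks nodepool list --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --output table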
  6. Update the default_node_pool with the new node pool details, and remove the extra node pool definition.

    resource "azurerm_kubernetes_cluster" "aks_cluster" {
    # ... (same as previous)

    default_node_pool {
    # Update with new deetails
    name = "newdefault"

    node_count = 4
    vm_size = "Standard_D4as_v6"

    # ... (same as previous)
    }
    # ... (same as previous)
    }

    # Remove old definition for same pool.
    # resource "azurerm_kubernetes_cluster_node_pool" "new_default" {
    # ...
    # }
  7. Manually tell terraform that we have "deleted" the standalone resource we created; the pool itself is now managed through the default_node_pool stanza.

    tofu state rm azurerm_kubernetes_cluster_node_pool.new_default
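
    To double-check, list the remaining state entries; the standalone pool resource should no longer appear:

    # Should print nothing (grep exits non-zero once the resource is gone).
    tofu state list | grep node_pool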
  8. Confirm tofu plan reflects no differences.

    tofu plan
    # ...
    # No changes. Your infrastructure matches the configuration.