(LEGACY) Update the default node pool.

note

Now that we use AWS EKS, these documents no longer apply.

Increasing or decreasing the node count of an existing pool is as easy as modifying the appropriate file in `_infra/terraform` and running `tofu apply`.
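
For the routine case, the workflow is just an edit followed by a plan/apply cycle. A minimal sketch (the directory is the one referenced later in this document; the diff check is illustrative):

    cd _infra/terraform/glhf
    # Edit the pool's node_count in the relevant .tf file, then:
    tofu plan   # the diff should only show the node_count change
    tofu apply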

However, updating the default_node_pool node types requires special consideration.

Also, because terraform treats the default_node_pool block as immutable, changing it in place would normally force a full cluster rebuild. We therefore have to "trick" terraform into letting us replace the default_node_pool without triggering that rebuild.

See: https://pumpingco.de/blog/modify-aks-default-node-pool-in-terraform-without-redeploying-the-cluster/

(Note: we use slightly different steps from the article above so that the new node pool configuration is kept in terraform.)

Steps

  1. Define a node pool that will be the new default.

    Be sure to set mode = "System" so that system pods can be scheduled on this pool.

    Note that the new pool name cannot be changed later.

    # _infra/terraform/glhf/azure-aks.tf
    resource "azurerm_kubernetes_cluster_node_pool" "new_default" {
      name                  = "newdefault"
      kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
      vm_size               = "Standard_D4as_v6"
      node_count            = 4

      upgrade_settings {
        max_surge                     = "10%"
        drain_timeout_in_minutes      = 0
        node_soak_duration_in_minutes = 0
      }

      mode = "System"
    }
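
    Before applying, it is worth a quick check that the chosen VM size is actually available in the cluster's region (eastus is assumed here from the resource group name used later in this document):

    # List matching SKUs in the target region; an empty result means the size is not offered there.
    az vm list-skus --location eastus --size Standard_D4as_v6 --output table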
  2. Deploy the new node pool.

    tofu plan
    # Verify that the `tofu plan` diff only includes the new node pool.
    tofu apply
    tip

    If you get a quota error, you may need to request quota increases in the Azure Portal.
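
    Once the apply completes, a quick way to confirm the new pool registered and its nodes joined the cluster (cluster and resource group names as used elsewhere in this document):

    az aks nodepool list --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --output table
    # Nodes in the new pool carry the agentpool=<pool name> label.
    kubectl get nodes -l agentpool=newdefault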

  3. Cordon and then drain the old pool (replace `default` with the correct pool name).

    kubectl cordon -l agentpool=default
    # The following command should exit without errors after all nodes are successfully drained.
    kubectl drain -l agentpool=default --ignore-daemonsets --delete-emptydir-data
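
    To double-check the cordon took effect before draining (the label value is assumed to match the old pool name):

    # Old nodes should report SchedulingDisabled; drain only exits cleanly once they are empty.
    kubectl get nodes -l agentpool=default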
  4. Monitor the migration and wait for pods to gracefully terminate.

    # Use the :pods or :nodes views in k9s.
    # Wait for all non-daemon set pods to terminate and reschedule.
    # Hint: use shift-o to sort by Node.
    k9s
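
    If you would rather not use k9s, a rough kubectl equivalent of the same check:

    # List pods grouped by node and look for stragglers on the old nodes,
    # then confirm nothing is stuck waiting for somewhere to schedule.
    kubectl get pods --all-namespaces -o wide --sort-by=.spec.nodeName
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending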
  5. Delete the old node pool.

    # This must be done with the Azure CLI to avoid terraform trying to rebuild the cluster.
    az aks nodepool delete --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --name default
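
    Afterwards, confirm the pool is gone on both the Azure side and the Kubernetes side:

    # Only the new pool should be listed now, and the old nodes should have left the cluster.
    az aks nodepool list --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --output table
    kubectl get nodes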
  6. Update the default_node_pool with the new node pool details, and remove the extra node pool definition.

    resource "azurerm_kubernetes_cluster" "aks_cluster" {
    # ... (same as previous)

    default_node_pool {
    # Update with new deetails
    name = "newdefault"

    node_count = 4
    vm_size = "Standard_D4as_v6"

    # ... (same as previous)
    }
    # ... (same as previous)
    }

    # Remove old definition for same pool.
    # resource "azurerm_kubernetes_cluster_node_pool" "new_default" {
    # ...
    # }
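
    A quick syntax check after the edit catches typos before the next plan (run from the terraform directory, assumed here to be _infra/terraform/glhf):

    cd _infra/terraform/glhf
    tofu validate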
  7. Manually remove the temporary node pool resource from the terraform state, so terraform does not try to destroy the pool that is now the default.

    tofu state rm azurerm_kubernetes_cluster_node_pool.new_default
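
    To confirm the address is gone from the state while the cluster itself is still tracked:

    # Should list azurerm_kubernetes_cluster.aks_cluster but no azurerm_kubernetes_cluster_node_pool.new_default.
    tofu state list | grep kubernetes_cluster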
  8. Confirm tofu plan reflects no differences.

    tofu plan
    # ...
    # No changes. Your infrastructure matches the configuration.
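
    If the plan does still show a diff on the default_node_pool, compare the values in the config against what Azure reports for the live pool, e.g.:

    # The live node count, VM size, and mode should match the default_node_pool block.
    az aks nodepool show --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --name newdefault --output table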