Update the default node pool

Increasing or decreasing the node count of an existing pool is as easy as modifying the appropriate file in _infra/terraform and running tofu apply.
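
For example, a minimal sketch of a node count change (the resource name workers and the exact file are assumptions; edit whichever resource actually defines the pool you are resizing):

    # _infra/terraform/glhf/azure-aks.tf (hypothetical pool resource)
    resource "azurerm_kubernetes_cluster_node_pool" "workers" {
      # ... (other settings unchanged)
      node_count = 5 # was 4; tofu apply resizes the pool in place
    }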

However, updating the node type of the default_node_pool requires special consideration.

Due to a limitation in terraform whereby the default_node_pool stanza is treated as immutable, we have to "trick" terraform into changing the default_node_pool without triggering a full cluster rebuild.

See: https://pumpingco.de/blog/modify-aks-default-node-pool-in-terraform-without-redeploying-the-cluster/

(Note: our steps differ slightly from the article above, in order to keep the new node pool configuration in terraform.)

Steps

  1. Define a node pool that will be the new default.

    Be sure to set mode = "System" so that system pods can be scheduled on this pool.

    Note that the new pool name cannot be changed later.

    # _infra/terraform/glhf/azure-aks.tf
    resource "azurerm_kubernetes_cluster_node_pool" "new_default" {
      name                  = "newdefault"
      kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
      vm_size               = "Standard_D4as_v6"
      node_count            = 4

      upgrade_settings {
        max_surge                     = "10%"
        drain_timeout_in_minutes      = 0
        node_soak_duration_in_minutes = 0
      }

      mode = "System"
    }
  2. Deploy the new node pool.

    tofu plan
    # Verify `tofu plan` diffs only include new node pool.
    tofu apply
    Tip: If you get a quota error, you may need to request a quota increase in the Azure Portal.
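
    To see current vCPU usage against the regional quota before requesting an increase (eastus is an assumption; use the cluster's actual region):

    az vm list-usage --location eastus --output table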

  3. Cordon then drain the old pool (replace default with the correct pool name).

    kubectl cordon -l agentpool=default
    # The following command should exit without errors after all nodes are successfully drained.
    kubectl drain -l agentpool=default --ignore-daemonsets --delete-emptydir-data
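
    Optionally, verify the old nodes are cordoned (kubectl get nodes accepts the same label selector):

    # Cordoned nodes show SchedulingDisabled in the STATUS column.
    kubectl get nodes -l agentpool=default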
  4. Monitor the migration and wait for pods to gracefully terminate.

    # Use the :pods or :nodes views in k9s.
    # Wait for all non-daemon set pods to terminate and reschedule.
    # Hint: use shift-o to sort by Node.
    k9s
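
    If k9s is not available, plain kubectl can watch the same migration:

    # Watch pods in all namespaces; node assignment appears in the NODE column.
    kubectl get pods --all-namespaces -o wide --watch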
  5. Delete the old node pool.

    # This must be done with the Azure CLI to avoid terraform trying to rebuild the cluster.
    az aks nodepool delete --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --name default
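
    To confirm the old pool is gone and only the new pool remains (same cluster and resource group names as above):

    az aks nodepool list --cluster-name glhf-aks-cluster --resource-group glhf-eastus-rg --output table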
  6. Update the default_node_pool with the new node pool details, and remove the extra node pool definition.

    resource "azurerm_kubernetes_cluster" "aks_cluster" {
    # ... (same as previous)

    default_node_pool {
    # Update with new deetails
    name = "newdefault"

    node_count = 4
    vm_size = "Standard_D4as_v6"

    # ... (same as previous)
    }
    # ... (same as previous)
    }

    # Remove old definition for same pool.
    # resource "azurerm_kubernetes_cluster_node_pool" "new_default" {
    # ...
    # }
  7. Manually tell terraform that we have "deleted" the standalone resource we created; the pool itself is now managed through the default_node_pool stanza.

    tofu state rm azurerm_kubernetes_cluster_node_pool.new_default
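
    To double-check, list the remaining state entries; the standalone pool resource should no longer appear:

    # Should print nothing (grep exits non-zero once the resource is gone).
    tofu state list | grep node_pool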
  8. Confirm tofu plan reflects no differences.

    tofu plan
    # ...
    # No changes. Your infrastructure matches the configuration.