jayunit100: breaking a CAPV Log Jam: What to do if you have zombie VSphereMachines?

The CAPV controller was down today, and I was trying to delete some VSphereMachines, but ... I couldn't!

I filed an issue today because I saw a panic due to a CAPV failure where some objects in a cluster were missing but CAPV expected them to be there: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2302

As I was trying to delete a cluster, I figured I had better start manually deleting VSphereMachines, since CAPV wasn't in a good state (normally, it's CAPV's job to do this cleanup).

kubectl edit vspheremachine tkg-vsphere-default-v1.1.0-control-plane

So I deleted the finalizer, and it still failed because...

error: vspheremachines.infrastructure.cluster.x-k8s.io "windows-cluster-control-plane-7fw2m-49c99" could not be patched: Internal error occurred: failed calling webhook "default.vspheremachine.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capv-webhook-service.capv-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine?timeout=10s": dial tcp 100.64.55.82:443: connect: connection refused

Well, I guess there's a MutatingWebhook up in there.
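Before ripping anything out, it can help to confirm that nothing is actually serving that webhook. A minimal sketch, assuming the namespace (capv-system) and service name (capv-webhook-service) from the error above; the function name check_capv_webhook is just my own label, not anything CAPV ships:

```shell
# Sketch: see whether anything is serving the CAPV webhook.
# Namespace and service name are taken from the error message above;
# pass different values if your install differs.
check_capv_webhook() {
  ns="${1:-capv-system}"
  svc="${2:-capv-webhook-service}"
  # Are the CAPV controller pods running at all?
  kubectl get pods -n "$ns"
  # Does the webhook service have any endpoints behind it?
  kubectl get endpoints "$svc" -n "$ns"
}
```

Run it as check_capv_webhook. An empty endpoints list would be consistent with the "connection refused" above.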

So ...

kubectl get validatingwebhookconfigurations capv-validating-webhook-configuration -o yaml > old_webhook

kubectl delete validatingwebhookconfiguration capv-validating-webhook-configuration

kubectl get mutatingwebhookconfiguration capv-mutating-webhook-configuration -o yaml > old_webhook_mut

kubectl delete -f old_webhook_mut

THEN make sure to delete the VSphereCluster that underlies these clusters. In a broken CAPI installation there may be cluster shrapnel floating around.

And then I could finish deleting all those zombie VSphereMachines.
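If there are a lot of zombies, editing each one by hand gets old fast. Here is a rough sketch of doing the finalizer removal in bulk with kubectl patch instead of kubectl edit. The function name and the namespace argument are my own placeholders, and this assumes the webhooks are already out of the way as described above:

```shell
# Sketch: strip finalizers from every VSphereMachine in one namespace so that
# pending deletes can complete. With CAPV down nothing cleans up the backing
# VMs for you, so they may be left behind in vCenter.
strip_vspheremachine_finalizers() {
  ns="$1"
  # -o name prints one "<resource>/<name>" per line, which patch accepts as-is.
  for m in $(kubectl get vspheremachines -n "$ns" -o name); do
    kubectl patch "$m" -n "$ns" --type=merge -p '{"metadata":{"finalizers":null}}'
  done
}
```

Something like strip_vspheremachine_finalizers default would cover the default namespace. Removing a finalizer only lets an object go away if it has already been marked for deletion, so the kubectl delete still has to be issued as well.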

NOTE: In my case my ClusterClass observedGeneration != generation, and that was because the ClusterClass object was paused! This should never happen, though. If your ClusterClass gets paused, edit it and remove the paused annotation: cluster.x-k8s.io/paused: ""
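The same check and fix can be done without opening an editor. A small sketch, where the function name, the ClusterClass name, and the namespace are all placeholders of mine:

```shell
# Sketch: print generation and observedGeneration side by side, then drop the
# paused annotation. A trailing "-" on the key tells kubectl annotate to
# remove that annotation.
unpause_clusterclass() {
  name="$1"
  ns="$2"
  kubectl get clusterclass "$name" -n "$ns" \
    -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'
  kubectl annotate clusterclass "$name" -n "$ns" cluster.x-k8s.io/paused-
}
```

For example: unpause_clusterclass my-clusterclass default. If the two numbers printed differ, the ClusterClass has not been reconciled since its last change.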

30.8.23
