30.8.23

Breaking a CAPV Log Jam: What to do if you have zombie VSphereMachines?

The CAPV controller was down today, and I was trying to delete some VSphereMachines, but... I couldn't!

I filed an issue today because I saw a panic due to a CAPV failure where some objects in a cluster were missing, but CAPV expected them to be there: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2302


As I was trying to delete a cluster, I figured I'd better start manually deleting VSphereMachines, since CAPV wasn't in a good state (normally, it's CAPV's job to do this cleanup).

kubectl edit vspheremachine tkg-vsphere-default-v1.1.0-control-plane

So I deleted the finalizer, and then it still failed because...

error: vspheremachines.infrastructure.cluster.x-k8s.io "windows-cluster-control-plane-7fw2m-49c99" could not be patched: Internal error occurred: failed calling webhook "default.vspheremachine.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capv-webhook-service.capv-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine?timeout=10s": dial tcp 100.64.55.82:443: connect: connection refused


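Well, I guess there's a MutatingWebhook up in there. With the CAPV controller down, nothing is answering behind capv-webhook-service, so every edit to a VSphereMachine gets refused before it is ever saved.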
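A quick way to confirm that diagnosis is to list the webhook configurations and look at the endpoints behind the webhook service. This is only a sketch: the service name and namespace are read off the error message above, and the function name is something I made up for illustration.

```shell
# Sketch: show which webhook configurations exist and whether the CAPV
# webhook Service has any endpoints behind it.
# capv-system / capv-webhook-service come from the error message URL.
capv_webhook_status() {
  kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
  kubectl -n capv-system get endpoints capv-webhook-service
}
```

Calling capv_webhook_status with the controller down should show the service with no usable endpoints, which matches the "connection refused" above.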

So ...

kubectl get validatingwebhookconfigurations capv-validating-webhook-configuration -o yaml > old_webhook

kubectl delete validatingwebhookconfiguration capv-validating-webhook-configuration

kubectl get mutatingwebhookconfiguration capv-mutating-webhook-configuration -o yaml > old_webhook_mut

kubectl delete -f old_webhook_mut
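The two saved files are what make this reversible. If the management cluster is going to keep running CAPV, the webhooks can be put back once the controller is healthy again. One catch, as far as I know: a dump from `kubectl get -o yaml` carries server-populated fields such as resourceVersion, and the API server generally rejects a create that includes them. The helper below is a crude line filter I would try, not a proper YAML tool, and its name is hypothetical.

```shell
# Crude sketch: drop server-populated metadata fields from a saved
# `kubectl get -o yaml` dump so the object can be created again.
# Assumes the usual two-space indent under metadata.
strip_server_fields() {
  grep -v -E '^  (resourceVersion|uid|creationTimestamp|generation):' "$1"
}

# Usage, once the CAPV controller and its webhook service are back:
#   strip_server_fields old_webhook     | kubectl apply -f -
#   strip_server_fields old_webhook_mut | kubectl apply -f -
```

Inspect the filtered output before applying it; a line-based filter like this can miss or mangle fields in a dump that is laid out differently.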

THEN make sure to delete the VSphereCluster that underlies these clusters. In a broken CAPI installation there may be cluster shrapnel floating around.


And then I could finish deleting all those zombie VSphereMachines.
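If there are a lot of zombies, editing each one by hand gets old fast. Here is a minimal sketch of doing the finalizer removal in a loop with a merge patch, assuming the webhooks are already out of the way. The function name and the namespace argument are my own. Keep in mind that the finalizer is what gives CAPV the chance to clean up the VM, so anything still running in vCenter has to be checked by hand afterwards.

```shell
# Sketch: clear the finalizers on every VSphereMachine in one namespace.
# Only sensible when the CAPV controller cannot do the cleanup itself.
strip_vspheremachine_finalizers() {
  ns="$1"   # namespace holding the stuck machines
  # `-o name` prints resource/name pairs that `kubectl patch` accepts as-is
  for m in $(kubectl -n "$ns" get vspheremachines -o name); do
    kubectl -n "$ns" patch "$m" --type=merge \
      -p '{"metadata":{"finalizers":null}}'
  done
}

# Usage (namespace is an example):
#   strip_vspheremachine_finalizers default
```

Machines that already have a deletionTimestamp should disappear as soon as their finalizers are gone; the rest still need a `kubectl delete`.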


NOTE: In my case the ClusterClass had observedGeneration != generation, and that was because the ClusterClass object was paused! This should never happen, though. If your ClusterClass gets paused, edit it and remove the paused annotation: cluster.x-k8s.io/paused: ""
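Both the check and the fix can be done without opening an editor. A minimal sketch, assuming the ClusterClass exposes status.observedGeneration as described above; the function names and the namespace/name arguments are placeholders of mine.

```shell
# Print metadata.generation and status.observedGeneration side by side.
# If the two numbers differ, the ClusterClass is not being reconciled.
clusterclass_generations() {
  kubectl -n "$1" get clusterclass "$2" \
    -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'
}

# Remove the paused annotation; a trailing "-" tells kubectl to delete it.
unpause_clusterclass() {
  kubectl -n "$1" annotate clusterclass "$2" cluster.x-k8s.io/paused-
}

# Usage (namespace and name are examples):
#   clusterclass_generations default my-clusterclass
#   unpause_clusterclass default my-clusterclass
```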
