31.8.22

Modifying Cluster API for wider deployments in unstable networks

Updating Tanzu/CAPI to support wide deployments: 3 parameters

Preventing MHC from killing a healthy node in a cluster on a disconnected or slow network.


Setting MHC_FALSE_STATUS_TIMEOUT 


The file below adds a new parameter, MHC_FALSE_STATUS_TIMEOUT, set to a large value such as 40 minutes. This quadruples the time that CAPI machine health checks (MHCs) wait before assuming a node has not come up and recreating it. It thus increases tolerance for long disconnections or flaky networks.



This parameter goes in the YAML file that defines your cluster, i.e., the file you pass as input to tanzu cluster create.


CLUSTER_PLAN: prod
CNI: antrea
INFRASTRUCTURE_PROVIDER: vsphere
KUBERNETES_VERSION: v1.23.8+vmware.2
OS_ARCH: amd64
OS_NAME: photon
OS_VERSION: '3'
_VSPHERE_CONTROL_PLANE_ENDPOINT: 10.92.160.149
MHC_FALSE_STATUS_TIMEOUT: 40m
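
For orientation, here is a rough sketch of where this value should end up on the cluster side. It assumes that Tanzu renders MHC_FALSE_STATUS_TIMEOUT into the Ready=False unhealthy condition of the Cluster API MachineHealthCheck object. The object name, namespace, and selector label are hypothetical placeholders, not values taken from a real cluster.

```yaml
# Sketch of a Cluster API MachineHealthCheck (names and labels are hypothetical).
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: my-cluster              # hypothetical
  namespace: default            # hypothetical
spec:
  clusterName: my-cluster       # hypothetical
  selector:
    matchLabels:
      node-pool: my-cluster-worker-pool   # hypothetical label
  unhealthyConditions:
    # A node whose Ready condition stays "False" for longer than this
    # timeout is treated as unhealthy and becomes eligible for remediation.
    - type: Ready
      status: "False"
      timeout: 40m              # assumed to come from MHC_FALSE_STATUS_TIMEOUT
```

If that assumption holds, inspecting the MachineHealthCheck objects on the management cluster after creating the cluster is a quick way to confirm that the longer timeout was picked up.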

Preventing MHC from killing a node before it is born.


Setting NODE_STARTUP_TIMEOUT



Raising MHC_FALSE_STATUS_TIMEOUT was important in order to prevent machine health checks from deleting a healthy node, but sometimes a node simply takes a long time to come up. In this case, MHCs can delete machines pre-emptively (i.e. on the assumption that something went wrong in bootstrapping). In edge scenarios, you may want this timeout to be more forgiving.


Below, we quadruple the timeout from 15 minutes to 60, similar to what we did for MHC_FALSE_STATUS_TIMEOUT.


CLUSTER_PLAN: prod
CNI: antrea
INFRASTRUCTURE_PROVIDER: vsphere
KUBERNETES_VERSION: v1.23.8+vmware.2
OS_ARCH: amd64
OS_NAME: photon
OS_VERSION: '3'
_VSPHERE_CONTROL_PLANE_ENDPOINT: 10.92.160.149
NODE_STARTUP_TIMEOUT: 60m
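
As a rough illustration of what this setting controls, the fragment below shows the corresponding field on a Cluster API MachineHealthCheck. It is a sketch that assumes Tanzu passes NODE_STARTUP_TIMEOUT through to spec.nodeStartupTimeout. The object and cluster names are hypothetical.

```yaml
# Sketch of the startup-timeout field on a MachineHealthCheck
# (object and cluster names are hypothetical).
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: my-cluster              # hypothetical
spec:
  clusterName: my-cluster       # hypothetical
  # How long a newly created Machine may go without its Node joining
  # the cluster before the MHC treats it as failed and remediates it.
  nodeStartupTimeout: 60m       # assumed to come from NODE_STARTUP_TIMEOUT
  selector:
    matchLabels:
      node-pool: my-cluster-worker-pool   # hypothetical label
```

This timeout covers the window before a node has registered at all, which is separate from the Ready=False and Ready=Unknown timeouts that apply once the node exists.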

Preventing etcd clients on the management cluster from failing prematurely while checking the health of etcd on the workload clusters.


Setting etcd-dial-timeout-duration in CAPV management clusters




kubectl edit deployment capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system


Set the --etcd-dial-timeout-duration argument to a long value, e.g. 40s. The default is 10 seconds, so this quadruples it, in line with what was done for the other parameters above.


    - args:
        - --leader-elect
        - --metrics-bind-addr=localhost:8080
        - --feature-gates=ClusterTopology=false
        - --etcd-dial-timeout-duration=40s
      command:
        - /manager
      image: projects.registry.vmware.com/tkg/cluster-api/kubeadm-control-plane-controller:v1.0.1_vmware.1

