8.12.22

Job rescheduling in k8s is done w/ exponential backoff, but it's not purely exponential WRT time

Someone asked today if Kubernetes Jobs back off exponentially. They do, but the pattern of how they back off looks like it depends on HOW FAST your scheduler is, which in turn depends on things like how many nodes you have, how much your job needs in order to run (PVs, CPUs, memory), and so on.
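
For reference, the Kubernetes docs describe the per-Job backoff as starting at 10s and doubling on each failure, capped at six minutes. Here's a minimal bash sketch (my own toy model, not anything from the Job controller) of what that nominal schedule looks like before scheduler speed smears it around:

# toy model of the documented Job backoff: 10s, doubling per retry, capped at 360s
delay=10
elapsed=0
for retry in $(seq 1 10); do
  elapsed=$((elapsed + delay))
  echo "retry $retry starts ~${elapsed}s after the first failure (waited ${delay}s)"
  delay=$((delay * 2))
  [ "$delay" -gt 360 ] && delay=360
done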

As an example, take the same job run on two clusters w/ different numbers of nodes. We can see that a bunch of pods (in the 2nd picture) are bunched together, created about 3500 seconds ago, but then there was a rapid dropoff, and once backoff kicked in, no pods were scheduled together after that... until we reach the 100 mark, which is the backoffLimit.


The Y axis here is how OLD a given pod is, and we just have a histogram on the X axis, one bar per pod created by a single Job. We can see that depending on cluster type, the exact start times of pods in a Job's exponential backoff vary widely, but ultimately they do converge and drop off at a roughly exponential rate. The more nodes and the more consistent scheduling you have (i.e. plenty of CPU and nodes to quickly place pods), the cleaner the exponential backoff is, I think... The top cluster is resource constrained, the bottom cluster was not. The top cluster was also on 1.23, and the bottom on 1.24. But in any case, we can see the overall convergent result is roughly similar...

 

In the graphs below, I'm graphing the age of all the pods in the cluster and sorting them in descending order. The pattern in the ages, where you have many pods clustered at the same age... I think that means the scheduler is faster, and thus the exponential backoff isn't seen until after several retries.

To reproduce this, just make a job.yaml file that dies every time it runs... I borrowed this from a colleague at work! Ryan Richards

kubo@VygjCMuUh6GjF:~/jay$ cat job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-exp
spec:
  backoffLimit: 100
  template:
    spec:
      containers:
        - name: die
          image: ubuntu
          command: [ "/bin/false" ]
          imagePullPolicy: IfNotPresent
      restartPolicy: Never
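
If you want to try it, something like this should do it (backoff-exp is just the Job name from the yaml above, and the job-name label is one the Job controller puts on its pods):

# create the job, then watch the failed pods pile up
kubectl apply -f job.yaml
kubectl get pods -l job-name=backoff-exp --watch

# or ask the Job how many failures it has recorded so far
kubectl get job backoff-exp -o jsonpath='{.status.failed}'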

But if you have a slow scheduler, you see fewer pods bunched together w/ the same age, because there was more time, I guess, for the backoff to begin taking effect.

You can then generate these graphs by running:


kubectl get pods -o wide | awk '
  BEGIN {
    a["d"] = 24*(\
    a["h"] = 60*(\
    a["m"] = 60*(\
    a["s"] = 1)));   # seconds per unit: d=86400, h=3600, m=60, s=1
  }
  NR > 1 {
    print fx($5);    # $5 is the AGE column; NR > 1 skips the header row
  }
  # convert a kubectl age like "1d2h" or "45m" into seconds
  # (the 4-arg split, which captures the separators in u[], needs gawk)
  function fx(ktym,    f,u,n,t,i) {
    n = split(ktym, f, /[dhms]/, u)
    t = 0
    for (i = 1; i < n; i++) {
      t += f[i] * a[u[i]]
    }
    return t
  }
' | sort -n
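
If you don't feel like pasting numbers into a spreadsheet, a rough way to eyeball the same thing right in the terminal (assuming gnuplot is installed; ages.txt is just my made-up name for a file holding the pipeline's output):

# quick ASCII plot, one bar per pod, bar height = pod age in seconds
gnuplot -e "set terminal dumb; plot 'ages.txt' using 0:1 with boxes notitle"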

Here's the output from one run, which is what I pasted into the graph:

116
476
780
1140
1500
1860
2220
2580
2940
3300
3660
4020
4380
4500
4620
4620
4680
4680
4680
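
One more trick with those numbers: diff consecutive ages and you get the gap between one retry and the next, which is where the backoff actually shows up (again, ages.txt is my made-up filename for the saved output, not anything kubectl produces):

# gap in seconds between each pod and the pod created right before it
sort -n ages.txt | awk 'NR > 1 { print $1 - prev } { prev = $1 }'

Most of the gaps in the run above come out to roughly 360 seconds, which looks consistent with the six-minute cap on the Job backoff delay, while the pods bunched up at the old end of the list are the early retries firing close together before the delay had grown.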

 
