Thanks to Tim St. Claire at Red Hat for showing me some of these tricks.
First create a YAML file describing your POD.
The first mistake I made was not pushing my image to Docker Hub: Kubernetes will not magically share docker images across a cluster. The second mistake I made, in the pod file itself, was assuming I could collapse multiple command arguments into one string. To define your docker CMD in a pod file, you currently need to do it like this:
[Screenshot of the pod YAML. Kubernetes might allow one-liners eventually, but right now, make sure you separate the args for your docker commands!]
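Since the screenshot doesn't show it, here is a minimal sketch of what I mean. The pod name, image, and command below are hypothetical placeholders rather than my actual spark pod, and the field names follow the v1 pod API (older API versions look slightly different); the point is only that every argument is its own list element.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-master              # hypothetical pod name
spec:
  containers:
    - name: spark-master
      image: myrepo/spark         # must be pullable by every node, e.g. pushed to Docker Hub
      # WRONG: command: ["/opt/spark/sbin/start-master.sh --host 0.0.0.0"]  (one big string)
      # RIGHT: each argument gets its own element in the list
      command: ["/opt/spark/sbin/start-master.sh", "--host", "0.0.0.0"]
```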
That looked reasonable to me. In any case, when I ran the container I saw some very ugly failures, so I will trace through the steps I used to find them.
NOW Launch your POD.
Once the YAML file is created, you have to launch a pod. This is easy. Assume the name of the file above was spark-kube.yaml. Then we run...
kubectl create --validate=true -f ./spark-kube.yaml
NOW Check on your POD.
To do this, you have to run "kubectl get pods". That command will show you the host to which your pod is assigned. In my case, it yielded a hostname, which I could ssh into like so...
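A sketch of those two commands (the exact output columns vary by kubernetes version, and host07-rack10.scale is simply the hostname my pod happened to land on):

```shell
kubectl get pods           # the HOST column shows which node each pod was scheduled onto
ssh host07-rack10.scale    # ssh into that node to poke around
```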
WHAT IF IT FAILS?
Thanks to "get pods", we can see where the failure happened. So, we now have to go and search around on the assigned node. Old fashioned ssh and log mining here, no fancy tricks in the kube API (yet) that we can use.
So, the next thing I did was ssh into host07-rack10.scale.
From there, you can run journalctl -f -u kubelet
That command will follow the logs, and it will show you where failures are happening.
You can also run docker ps -a to see whether the container is being launched at all.
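For example (the grep pattern is just a convenience; containers that kubelet creates get names prefixed with "k8s_"):

```shell
docker ps -a | grep k8s_    # every container kubelet has created, including dead ones with no status
```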
In my case, I saw that the journalctl logs had a "no such container" message buried in them. It looked like this...
[Screenshot of the kubelet logs, with the "no such container" message buried in the noise. Where's Waldo!?]
So, at this point it's clear that a container seems to have been created, but kubelet can't get any information about it. Well, let's just go into docker, then, and see whether docker knows anything.
[Screenshot of docker ps -a output. Yup, it's in there alright (4th from the bottom), but with no status.]
Interestingly, it looks like kubernetes is trying to relaunch the same container over and over, and each attempt keeps failing without ever reporting any status.
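When a container shows up with no status like this, another standard docker check (not one I ended up needing here) is to inspect it directly; the container ID below is a hypothetical placeholder:

```shell
docker inspect <container-id>   # dumps the container's full state, including any recorded error
docker logs <container-id>      # whatever the container wrote before dying, if it ever started
```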
Finally, as a last resort... you can run
journalctl -f -u docker
This will give you information that might otherwise be lost (for example, failures that happen when docker execs the container's command)...
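If following the live log is too noisy, you can also dump it and search for your image or container name (the "spark" pattern here is hypothetical):

```shell
journalctl -u docker --no-pager | grep -i spark
```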
This is the point where I had been getting stuck: neither docker nor kubernetes was yielding any good information about why the container died, but my specific error was sitting right there in the docker daemon logs.
So, this process should allow you to trace down pretty much any failure in a distributed, container-orchestrated system, even with five layers of magic in the way.
In my case, I used this process to ultimately discover that a minor feature is missing in the interpretation of the "command" field in kubernetes pod declarations: https://github.com/GoogleCloudPlatform/kubernetes/issues/3575
But for most folks that is just a minor detail; the real take-home is that you can debug your kubernetes containers: just ssh into your nodes and use a few kubectl, journalctl, and docker ps commands.
This workflow is pretty manual. Someday we might have something more advanced... but this seems to be the simplest way to trace failed pod/container launches in k8s at the moment. Let me know if I'm missing something :)



