Argo Workflows is a tool for running a series (or a graph) of containers on Kubernetes, tying them together into a workflow.
It's a relatively young project, so things like error messages and documentation still need some work.
In the meantime, here are some steps you can take when your Workflow doesn't behave as expected.
Inspect the Workflow with Argo CLI
For simple issues, the Argo CLI will include a short description in the MESSAGE column of its argo get output.
$ argo get workflow-template-dag-diamond
Name: workflow-template-dag-diamond
Namespace: default
ServiceAccount: default
Status: Succeeded
Conditions:
Completed True
Created: Mon Mar 01 08:51:26 -0500 (7 minutes ago)
Started: Mon Mar 01 08:51:26 -0500 (7 minutes ago)
Finished: Mon Mar 01 08:51:36 -0500 (7 minutes ago)
Duration: 10 seconds
Progress: 1/1
ResourcesDuration: 5s*(1 cpu),5s*(100Mi memory)
STEP TEMPLATE PODNAME DURATION MESSAGE
✔ workflow-template-dag-diamond diamond
└─✔ A workflow-template-whalesay-template/whalesay-template workflow-template-dag-diamond-2997968480 6s
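(If you don't remember which Workflow to pass to argo get in the first place, argo list will show recent Workflows so you can pick out the failed one. A minimal sketch, with <workflow-name> as a placeholder; add -n <namespace> if your Workflows don't run in your current namespace.)
$ argo list
$ argo get <workflow-name>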
If there's a message that's just not quite detailed enough to figure out the problem, copy the PODNAME of the failed step and skip to the section about using kubectl to describe the Pod.
If there's no useful message, try describing the Workflow.
Use kubectl to describe the Workflow
Sometimes there's a problem with the whole Workflow that doesn't fit nicely in argo get's MESSAGE column. kubectl describe workflow will print a lot more details about the Workflow, including a list of Events. The Events often contain details about what went wrong.
$ kubectl describe workflow workflow-template-dag-diamond
Name: workflow-template-dag-diamond
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WorkflowRunning 8m42s workflow-controller Workflow Running
Normal WorkflowNodeSucceeded 8m32s workflow-controller Succeeded node workflow-template-dag-diamond.A
Normal WorkflowNodeSucceeded 8m32s workflow-controller Succeeded node workflow-template-dag-diamond
Normal WorkflowSucceeded 8m32s workflow-controller Workflow completed
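If you only want the Events, and not the rest of the describe output, you can also ask for them directly. A rough sketch, using the Workflow name from the example above; add --sort-by=.lastTimestamp if you want them in time order:
$ kubectl get events --field-selector involvedObject.name=workflow-template-dag-diamond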
Use kubectl to describe the Pod
If a Pod fails, there are a number of places that may hold clues. One is the Events associated with the Pod.
Pod names may be unpredictable (they often have random suffixes like -93750129), so use argo get to find the name of the suspect Pod.
Then use kubectl describe po to see the Pod details, including the Events.
$ kubectl describe po workflow-template-dag-diamond-2997968480
Name: workflow-template-dag-diamond-2997968480
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 36s default-scheduler Successfully assigned default/workflow-template-dag-diamond-2997968480 to docker-desktop
Normal Pulled 35s kubelet Container image "argoproj/argoexec:v2.12.9" already present on machine
Normal Created 35s kubelet Created container wait
Normal Started 35s kubelet Started container wait
Warning Failed 31s kubelet Failed to pull image "docker/whalesay": rpc error: code = Unknown desc = Error response from daemon: Head https://registry-1.docker.io/v2/docker/whalesay/manifests/latest: x509: certificate is valid for auth.docker.io, not registry-1.docker.io
Normal Pulling 17s (x2 over 35s) kubelet Pulling image "docker/whalesay"
Warning Failed 16s (x2 over 31s) kubelet Error: ErrImagePull
Warning Failed 16s kubelet Failed to pull image "docker/whalesay": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: x509: certificate is valid for auth.docker.io, not registry-1.docker.io
Normal BackOff 5s (x2 over 30s) kubelet Back-off pulling image "docker/whalesay"
Warning Failed 5s (x2 over 30s) kubelet Error: ImagePullBackOff
This Workflow failed because a proxy issue is preventing pulls from Docker Hub.
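As an aside, if copying the PODNAME out of the argo get output is awkward, Argo also labels each step's Pod with its Workflow's name, so a label selector will list them all. A sketch that assumes the usual workflows.argoproj.io/workflow label is present on your version's Pods:
$ kubectl get po -l workflows.argoproj.io/workflow=workflow-template-dag-diamond
$ kubectl describe po -l workflows.argoproj.io/workflow=workflow-template-dag-diamond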
Sometimes the problem doesn't show up in the Events, because the failure is inside one of the step's containers.
Use kubectl to read the Pod logs
Pods that run as part of Argo Workflows have two or three containers: wait, main, and sometimes init.
The wait sidecar is injected by Argo to keep an eye on the main container (your code) and communicate with the Argo Workflow controller (another Pod) about the step's progress.
The main container is the one you set up when you defined the Workflow in YAML. (Look for the image, command, args, and source items to see part of this Pod's configuration.)
The init container, if present, is also injected by Argo. It does things like pulling artifacts into the Pod.
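If you want to confirm which containers a particular Pod has before pulling logs, a quick jsonpath query will list them (a sketch using the Pod name from the earlier example; you'd expect to see wait and main, with any init container listed under .spec.initContainers instead):
$ kubectl get po workflow-template-dag-diamond-2997968480 -o jsonpath='{.spec.containers[*].name}'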
To read the logs, use kubectl logs. For example:
$ kubectl logs workflow-template-dag-diamond-2997968480 init
error: container init is not valid for pod workflow-template-dag-diamond-2997968480
$ kubectl logs workflow-template-dag-diamond-2997968480 wait
time="2021-03-01T14:09:18.339Z" level=info msg="Starting Workflow Executor" version=v2.12.9
time="2021-03-01T14:09:18.346Z" level=info msg="Creating a docker executor"
time="2021-03-01T14:09:18.346Z" level=info msg="Executor (version: v2.12.9, build_date: 2021-02-16T22:51:48Z) initialized (pod: default/workflow-template-dag-diamond-2997968480) with template:\n{\"name\":\"whalesay-template\",\"arguments\":{},\"inputs\":{\"parameters\":[{\"name\":\"message\",\"value\":\"A\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"docker/whalesay\",\"command\":[\"cowsay\"],\"resources\":{}}}"
time="2021-03-01T14:09:18.346Z" level=info msg="Waiting on main container"
time="2021-03-01T14:14:17.998Z" level=info msg="Alloc=4699 TotalAlloc=14633 Sys=70080 NumGC=6 Goroutines=7"
The logs from init and wait may be a bit difficult to read, because they come from Argo. The logs for main will be from your configured image, so they'll probably be more familiar.
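You can also grab the main container's output without hunting down Pod names at all: the Argo CLI will fetch logs by Workflow name, and kubectl works too once you know the Pod. A sketch, assuming main is the container you're after (check argo logs --help for a container flag if you need the wait logs instead):
$ argo logs workflow-template-dag-diamond
$ kubectl logs workflow-template-dag-diamond-2997968480 -c main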
Use kubectl to read the Workflow controller logs
Argo comes with a Pod called the "Workflow controller", which ushers a Workflow through the process of running all its steps.
If all the other debugging techniques fail, the Workflow controller logs may hold helpful information.
First, find the Pod name. If you used the default Argo installation command, the Pod will be in the argo namespace.
$ kubectl get po -n argo
NAME READY STATUS RESTARTS AGE
argo-server-6bb488c6c8-ff88g 1/1 Running 0 40m
workflow-controller-57db6b46f-7qfr9 1/1 Running 0 40m
$ kubectl logs workflow-controller-57db6b46f-7qfr9 -n argo
... lots of stuff here ...
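The controller logs cover every Workflow in the cluster, so grep for your Workflow's name to narrow them down. A sketch that targets the Deployment (so you don't need the Pod's random suffix) and assumes the default Deployment name from the standard install:
$ kubectl logs -n argo deploy/workflow-controller | grep workflow-template-dag-diamond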
Ask for help
If none of these solves your problem, ask a question on Stack Overflow, start a discussion on GitHub, or ask in the Argo Slack.
These are just a few of my go-to tools. If I'm missing anything, please comment!
Top comments (1)
Good article! I have something to add that helps me a lot in debugging workflows.
Often I need to get a shell inside the main pod of the stage I'm trying to debug. To do this I temporarily replace/add the command field of that workflow container with an infinite command like tail -f /dev/null or sleep infinity, and submit the workflow. Now the workflow will get stuck in the stage running the infinite command, and I'm free to exec into the main container of that pod with something like
kubectl -n argo exec -itc main <pod_name> -- bash
(replace bash with any other command that your container can run). After this, I can manually run the code the stage was intended to run with an interactive debugger, or explore how the container environment is set up. This is useful when your code depends on artifacts that were produced in past stages, or for debugging directly in the environment that is giving problems.