
Retrying workflows have short deadline ignoring activeDeadlineSeconds #13044

Closed
4 tasks done
leesungbin opened this issue May 13, 2024 · 2 comments · Fixed by #13068
Labels
area/controller Controller issues, panics area/retryStrategy Template-level retryStrategy P3 Low priority type/bug

Comments

@leesungbin
Contributor

leesungbin commented May 13, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

I expect the activeDeadlineSeconds parameter to also apply to the pods of retried workflow nodes. However, the retried container is given a short deadline of approximately 1 minute.

Additionally, when I set activeDeadlineSeconds at the template level, the deadline for the first attempt's wait container is 0001-01-01 00:00:00 +0000 UTC (i.e. no deadline), while the second attempt's wait container has a deadline of about 1 minute.

Is this behavior intentional?

Below is the workflow status after setting activeDeadlineSeconds at the workflow level.

Name:                retry-sample-knssl
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed
Message:             Max duration limit exceeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Mon May 13 11:42:44 +0900 (3 minutes ago)
Started:             Mon May 13 11:42:44 +0900 (3 minutes ago)
Finished:            Mon May 13 11:45:55 +0900 (now)
Duration:            3 minutes 11 seconds
Progress:            0/2
ResourcesDuration:   17s*(1 cpu),4m29s*(100Mi memory)

STEP                        TEMPLATE       PODNAME                                      DURATION  MESSAGE
 ✖ retry-sample-knssl       retry-example                                                         Max duration limit exceeded
 ├─✖ retry-sample-knssl(0)  retry-example  retry-sample-knssl-retry-example-2352189541  2m        Error (exit code 10)
 └─✖ retry-sample-knssl(1)  retry-example  retry-sample-knssl-retry-example-2151005208  47s       Error (exit code 143)

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-sample-
spec:
  activeDeadlineSeconds: 3600
  entrypoint: retry-example
  templates:
    - name: retry-example
      retryStrategy:
        limit: 2
        backoff:
          duration: 10s
          factor: 2
          maxDuration: 3m
      container:
        image: alpine
        command: [sh, -c]
        args: ["sleep 120; exit 10"]

Logs from the workflow controller

time="2024-05-13T02:42:44.653Z" level=info msg="Processing workflow" Phase= ResourceVersion=1899 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.661Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=retry-sample-knssl
time="2024-05-13T02:42:44.661Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.661Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.662Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.662Z" level=info msg="Retry node retry-sample-knssl initialized Running" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.662Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.662Z" level=info msg="Pod node retry-sample-knssl-2352189541 initialized Pending" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.686Z" level=info msg="Created pod: retry-sample-knssl(0) (retry-sample-knssl-retry-example-2352189541)" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.686Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.686Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:44.694Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1902 workflow=retry-sample-knssl
time="2024-05-13T02:42:54.659Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1902 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:54.660Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-sample-knssl
time="2024-05-13T02:42:54.660Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:42:54.661Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=retry-sample-knssl-2352189541 old.message= old.phase=Pending old.progress=0/1 workflow=retry-sample-knssl
time="2024-05-13T02:42:54.661Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:54.661Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:42:54.687Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1928 workflow=retry-sample-knssl
time="2024-05-13T02:43:04.695Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1928 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:04.697Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-sample-knssl
time="2024-05-13T02:43:04.697Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:43:04.698Z" level=info msg="node unchanged" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:43:04.701Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:04.701Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:04.728Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1935 workflow=retry-sample-knssl
time="2024-05-13T02:43:27.542Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1935 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:27.544Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-sample-knssl
time="2024-05-13T02:43:27.544Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:43:27.545Z" level=info msg="node unchanged" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:43:27.552Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:27.552Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:43:27.578Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1935 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.808Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1935 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.809Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.809Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.810Z" level=info msg="Pod failed: Error (exit code 10)" displayName="retry-sample-knssl(0)" namespace=argo pod=retry-sample-knssl-retry-example-2352189541 templateName=retry-example workflow=retry-sample-knssl
time="2024-05-13T02:44:58.810Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 10)" new.phase=Failed new.progress=0/1 nodeID=retry-sample-knssl-2352189541 old.message= old.phase=Running old.progress=0/1 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.811Z" level=info msg="node has maxDuration set, setting executionDeadline to: Mon May 13 02:45:44 +0000 (45 seconds from now)" namespace=argo node=retry-sample-knssl workflow=retry-sample-knssl
time="2024-05-13T02:44:58.811Z" level=info msg="Retry Policy: OnFailure (onFailed: true, onError false)" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.812Z" level=info msg="1 child nodes of retry-sample-knssl failed. Trying again..." namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.812Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.812Z" level=info msg="Pod node retry-sample-knssl-2151005208 initialized Pending" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.856Z" level=info msg="Created pod: retry-sample-knssl(1) (retry-sample-knssl-retry-example-2151005208)" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.856Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.857Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:44:58.890Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1999 workflow=retry-sample-knssl
time="2024-05-13T02:44:58.897Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/retry-sample-knssl-retry-example-2352189541/labelPodCompleted
time="2024-05-13T02:45:08.863Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1999 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:08.864Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-sample-knssl
time="2024-05-13T02:45:08.865Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:45:08.865Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2151005208 workflow=retry-sample-knssl
time="2024-05-13T02:45:08.866Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=retry-sample-knssl-2151005208 old.message= old.phase=Pending old.progress=0/1 workflow=retry-sample-knssl
time="2024-05-13T02:45:08.868Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:08.868Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:08.892Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=2022 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.935Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2022 namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.938Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.938Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2352189541 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.939Z" level=info msg="task-result changed" namespace=argo nodeID=retry-sample-knssl-2151005208 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.940Z" level=info msg="Pod failed: Error (exit code 143)" displayName="retry-sample-knssl(1)" namespace=argo pod=retry-sample-knssl-retry-example-2151005208 templateName=retry-example workflow=retry-sample-knssl
time="2024-05-13T02:45:55.940Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 143)" new.phase=Failed new.progress=0/1 nodeID=retry-sample-knssl-2151005208 old.message= old.phase=Running old.progress=0/1 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="Max duration limit exceeded. Failing..." namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="node retry-sample-knssl phase Running -> Failed" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="node retry-sample-knssl message: Max duration limit exceeded" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="node retry-sample-knssl finished: 2024-05-13 02:45:55.941690803 +0000 UTC" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.941Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.942Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.942Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.942Z" level=info msg="Updated message  -> Max duration limit exceeded" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.942Z" level=info msg="Marking workflow completed" namespace=argo workflow=retry-sample-knssl
time="2024-05-13T02:45:55.948Z" level=info msg="cleaning up pod" action=deletePod key=argo/retry-sample-knssl-1340600742-agent/deletePod
time="2024-05-13T02:45:55.966Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=2054 workflow=retry-sample-knssl
time="2024-05-13T02:45:55.994Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/retry-sample-knssl-retry-example-2151005208/labelPodCompleted

Logs from in your workflow's wait container

* first attempt (has a 1 hour deadline)
time="2024-05-13T02:44:48.780Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-05-13T02:44:48.780Z" level=info msg="No output parameters"
time="2024-05-13T02:44:48.780Z" level=info msg="No output artifacts"
time="2024-05-13T02:44:48.781Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: retry-sample-knssl/retry-sample-knssl-retry-example-2352189541/main.log"
time="2024-05-13T02:44:48.781Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-05-13T02:44:48.781Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=retry-sample-knssl/retry-sample-knssl-retry-example-2352189541/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-05-13T02:44:48.791Z" level=info msg="Save artifact" artifactName=main-logs duration=10.050083ms error="<nil>" key=retry-sample-knssl/retry-sample-knssl-retry-example-2352189541/main.log
time="2024-05-13T02:44:48.791Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-05-13T02:44:48.791Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-05-13T02:44:48.803Z" level=info msg="Alloc=7730 TotalAlloc=13965 Sys=23909 NumGC=5 Goroutines=10"

* second attempt (has a short deadline of less than 1 minute)
time="2024-05-13T02:45:00.792Z" level=info msg="Starting Workflow Executor" version=v3.5.6
time="2024-05-13T02:45:00.795Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-05-13T02:45:00.795Z" level=info msg="Executor initialized" deadline="2024-05-13 02:45:44 +0000 UTC" includeScriptOutput=false namespace=argo podName=retry-sample-knssl-retry-example-2151005208 templateName=retry-example version="&Version{Version:v3.5.6,BuildDate:2024-04-19T20:56:53Z,GitCommit:555030053825dd61689a086cb3c2da329419325a,GitTag:v3.5.6,GitTreeState:clean,GoVersion:go1.21.9,Compiler:gc,Platform:linux/arm64,}"
time="2024-05-13T02:45:00.806Z" level=info msg="Starting deadline monitor"
time="2024-05-13T02:45:44.001Z" level=info msg="Step exceeded its deadline"
time="2024-05-13T02:45:44.004Z" level=info msg="Killing containers"
time="2024-05-13T02:45:45.866Z" level=info msg="Main container completed" error="<nil>"
time="2024-05-13T02:45:45.866Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-05-13T02:45:45.866Z" level=info msg="No output parameters"
time="2024-05-13T02:45:45.866Z" level=info msg="No output artifacts"
time="2024-05-13T02:45:45.867Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: retry-sample-knssl/retry-sample-knssl-retry-example-2151005208/main.log"
@agilgur5 agilgur5 added area/controller Controller issues, panics P3 Low priority area/retryStrategy Template-level retryStrategy labels May 13, 2024
leesungbin added a commit to leesungbin/argo-workflows that referenced this issue May 14, 2024
@leesungbin
Contributor Author

I want to confirm whether I am misunderstanding the maxDuration parameter.

The documentation states: "maxDuration is the maximum amount of time allowed for a workflow in the backoff strategy." Based on this, I assumed that maxDuration only affects the backoff process.

However, in the code, I noticed that opts.executionDeadline is overridden by maxDurationDeadline.

opts.executionDeadline = maxDurationDeadline

Is this intentional?
If so, I don't think this is a bug.
If not, I believe some tests should also be updated (e.g. func TestPropagateMaxDurationProcess in workflow/controller/operator_test.go).

@agilgur5
Member

We may want to reconsider this later within the context of #10341, i.e. if pending and active time are split in two. The backoff maxDuration may make more sense for the pending state / pendingTimeout only, while activeDeadlineSeconds may make more sense to only apply to the active state.
