-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dapr Workflows cannot be terminated if they are running lots of activities #7706
Comments
if you look at the dapr sidecar logs you will see entries like
what is happening with this test it that it starts the worker that connects via grpc with the dapr sidecar, and then it schedules the workflow so it starts running, and as soon as the workflow starts running your test exits which also exits the worker and the grpc connection to the sidecar closes. To my understanding, the workflow engine cannot move forward with the event log for this workflow, because it cannot send commands to the application. If it cannot move forward the event log it cannot process the workflow terminate command and the workflow gets stuck retrying any previous command (which in this case was an activity execution most likely) I don't have sufficient knowledge on the workflow engine to propose a solution but to me it looks like there is a bit of a disconnect between the workflow actor and the engine, the actor tries to send work to the engine so it sends it to the app, but if the engine is not connected to the app nothing works. Maybe there should be some optimization or logic that breaks this kind of retry loop if a terminate command is detected. IDK if the client side MUST receive the terminate workflow command to safely terminate the workflow or if its safe to terminate the workflow from the backend POV if the connection to the application is absent. |
cc @cgillum |
Yes, I believe @famarting is correct here. If the worker has disconnected from the sidecar, then it will be unable to receive and process the terminate message, leaving the workflow stuck in the Termination works by sending a message to a workflow. When the workflow receives the terminate message, it transitions itself into a completed state with the
I think this can be considered as an optimization, but it would need to be implemented carefully to ensure that the workflow state is correctly updated in the same way as when a workflow transitions itself into a terminated state, and that the OTel spans are properly emitted. |
In what area(s)?
/area runtime
What version of Dapr?
1.13.2
Expected Behavior
If a workflow is started, that creates tons of activities. It keeps running forever, but it cannot be terminated. Leaving the workflow in a forever Running state.
Actual Behavior
Workflows, no matter if they are running tons of activities, should be able to be terminated by calling the terminateWorkflow API.
One approach that can be implemented here, is to pause the workflow if it execute activities in a loop to avoid unwanted recursion.
Steps to Reproduce the Problem
Release Note
RELEASE NOTE:
The text was updated successfully, but these errors were encountered: