Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many applications causing webhook to timeout #14269

Open
3 tasks done
KevinM2k opened this issue Jun 29, 2023 · 11 comments · May be fixed by #15326 or #18173
Open
3 tasks done

Too many applications causing webhook to timeout #14269

KevinM2k opened this issue Jun 29, 2023 · 11 comments · May be fixed by #15326 or #18173
Labels
bug Something isn't working

Comments

@KevinM2k
Copy link

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

We have over 1500 applications, in Gitlab when we have configured a webhook it has a maximum timeout of 10s. Running the webhook locally its taking around 15s, meaning gitlab is timing out and disabling the webhook.

Is there anyway of speeding this up?

To Reproduce

Create over 1500 applications in argocd, call the webhook url within postman with a valid push event, time how long it takes.

Expected behavior

Returns with a response within 10s

Version

v2.6.11+697fd7c

@KevinM2k KevinM2k added the bug Something isn't working label Jun 29, 2023
@hlastras
Copy link

hlastras commented Aug 2, 2023

We had similar problems in the past. Clusters with ~500 apps, the webhook request take about 10 seconds to process all. With the latest version that time has been decreased to about 5s.

If I am not mistaken in a webhook request ArgoCD processes all applications sequentially instead to parallelise it.

Is there any reason not to do it in parallel?

@crenshaw-dev
Copy link
Collaborator

I'm fairly certain it can and should be paralleled.

I'm surprised it takes more than a second. Feels a bit like the processing loop is probably doing some network-bound work as part of each iteration. That would probably be worth some investigation.

@phanama
Copy link
Contributor

phanama commented Aug 26, 2023

Faced the same issue, webhook calls taking >10s to finish.

It is surely doing quite some network-bound work here, potentially calling argo.RefreshApp() or storePreviouslyCachedManifests() in each iteration.

for _, app := range filteredApps {
for _, source := range app.Spec.GetSources() {
if sourceRevisionHasChanged(source, revision, touchedHead) && sourceUsesURL(source, webURL, repoRegexp) {
if appFilesHaveChanged(&app, changedFiles) {
namespacedAppInterface := a.appClientset.ArgoprojV1alpha1().Applications(app.ObjectMeta.Namespace)
_, err = argo.RefreshApp(namespacedAppInterface, app.ObjectMeta.Name, v1alpha1.RefreshTypeNormal)
if err != nil {
log.Warnf("Failed to refresh app '%s' for controller reprocessing: %v", app.ObjectMeta.Name, err)
continue
}
// No need to refresh multiple times if multiple sources match.
break
} else if change.shaBefore != "" && change.shaAfter != "" {
if err := a.storePreviouslyCachedManifests(&app, change, trackingMethod, appInstanceLabelKey); err != nil {
log.Warnf("Failed to store cached manifests of previous revision for app '%s': %v", app.Name, err)
}
}
}
}
}

@phanama phanama linked a pull request Sep 2, 2023 that will close this issue
12 tasks
@alexymantha
Copy link
Member

While doing the refreshes concurrently should solve the issue since it will be much faster, I was looking at the code and it seems like the result of the operation is not used in any way:

a.HandleEvent(payload)

I'm wondering if we should instead return a response as soon as the request is validated and do the actual processing in parallel. I don't see why we need to make the webhook wait for the operation to complete if we don't need the result in the response.

@phanama
Copy link
Contributor

phanama commented Sep 13, 2023

I also imagined similar thing and thought that we'd ideally need some kind of queue for the webhook requests, but this would make the argocd-server somewhat stateful. We could optionally use the existing redis as the queue and run background workers inside the argocd-server, not stateful but not sure if we want to do this (additionally redis is not just a cache now).

I'm wondering if we should instead return a response as soon as the request is validated and do the actual processing in parallel.

Thinking about it again, maybe we could just put all processing to the background and then return 200 immediately to webhook clients. The webhook handler anyway just runs to tell argocd-app-controller to refresh specific apps by adding the refresh-annotation. The actual "queue" is all those apps with their refresh-annotation set.
But in this setup, we confidently assume that the argocd-server will always finish putting apps into the "queue". Not sure if this is ok, especially if webhook clients might want to do something (e.g. retry or notify/alerts) in case the webhook server crashes while still adding refresh-annotations to apps.

Also, I think we still want the would-be-background app-refreshes to finish quickly (with concurrent processing).

@taliastocks
Copy link
Contributor

I'm running into this too. I don't know that much about ArgoCD internals, but I think refreshing the apps in a background thread and returning a response immediately could be a decent solution - even if the processing fails to complete, it's not the end of the world - apps get refreshed periodically anyway, so "best effort" seems alright (to me anyway).

@Savasw
Copy link

Savasw commented Feb 15, 2024

We are running into this too where our webhook requests are being timed out at argocd. Has there been any update on the fix? I see the PR #15326 has no updates since 4 months.

@morawat
Copy link

morawat commented May 2, 2024

Any update on this bug? or any workaround like increase the timeout?

dhruvang1 added a commit to dhruvang1/argo-cd that referenced this issue May 12, 2024
dhruvang1 added a commit to dhruvang1/argo-cd that referenced this issue May 12, 2024
dhruvang1 added a commit to dhruvang1/argo-cd that referenced this issue May 12, 2024
@dhruvang1
Copy link
Contributor

dhruvang1 commented May 12, 2024

I have raised a PR to do processing in background in #18173

@ricardojdsilva87
Copy link

Hello everyone,
Just to give an update, we are getting the same situation using GitHub Enterprise but it seems that the webhook events are processed in a few seconds by ArgoCD. We can see the commits after a few seconds of pushing them.
In this case I think it's just that the response takes more than the 10 seconds needed to be sent to Github or Gitlab so it is considered as not delivered, altought the payload is processed.
Thanks for the PR!

@solidnerd
Copy link
Contributor

solidnerd commented May 21, 2024

We are seeing also the problem on GitLab.com we have over 500 Apps and the webhook takes mostly between 10.2 - 10.1 seconds. It would be nice to have implemented upstream.

This results in the enad that GitLab will disable our Webhook integration.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet