fix: detect and report cluster connection errors #559

Open
wants to merge 7 commits into master from update-cluster-status

Conversation

chetan-rns (Member)
Currently, Argo CD doesn't report connection errors on the cluster UI page unless the cache is invalidated (manually, or periodically every 24 hours) or there is a sync error. During auto-sync, the error is only updated when there is a new commit to sync. This PR adds a new field, ConnectionStatus, which is updated whenever Argo CD fails to access the cluster. The ClusterInfoUpdater periodically fetches this cluster info and updates the appstate cache.
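
For orientation, a minimal sketch of the shape of the change (the constant names match the code discussed below, but the exact struct layout in pkg/cache/cluster.go may differ):

// Sketch only; not the exact diff from this PR.
type ConnectionStatus string

const (
	ConnectionStatusSuccessful ConnectionStatus = "Successful"
	ConnectionStatusFailed     ConnectionStatus = "Failed"
)

// ClusterInfo gains a ConnectionStatus field that the ClusterInfoUpdater can
// copy into the appstate cache on each periodic refresh.
type ClusterInfo struct {
	// ... existing fields elided
	ConnectionStatus ConnectionStatus
}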

codecov bot commented Dec 5, 2023

Codecov Report

Attention: Patch coverage is 22.32143%, with 87 lines in your changes missing coverage. Please review.

Project coverage is 53.95%. Comparing base (5fd9f44) to head (cea6dcc).
Report is 4 commits behind head on master.

❗ Current head cea6dcc differs from the pull request's most recent head b3e1c67. Consider uploading reports for commit b3e1c67 to get more accurate results.

Files                   Patch %   Lines
pkg/cache/cluster.go    24.27%    76 Missing and 2 partials ⚠️
pkg/cache/settings.go   0.00%     9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #559      +/-   ##
==========================================
- Coverage   54.71%   53.95%   -0.77%     
==========================================
  Files          41       41              
  Lines        4834     4934     +100     
==========================================
+ Hits         2645     2662      +17     
- Misses       1977     2058      +81     
- Partials      212      214       +2     

☔ View full report in Codecov by Sentry.

chetan-rns force-pushed the update-cluster-status branch 2 times, most recently from 6dff0cb to 223a309, on December 5, 2023 13:01
chetan-rns changed the title from "bug: detect and report cluster connection errors" to "fix: detect and report cluster connection errors" on Dec 5, 2023
pkg/cache/cluster.go (outdated review thread; resolved)
chetan-rns requested a review from jgwest on January 3, 2024 13:29
@@ -615,6 +630,29 @@ func (c *clusterCache) watchEvents(ctx context.Context, api kube.APIResourceInfo
	if errors.IsNotFound(err) {
		c.stopWatching(api.GroupKind, ns)
	}
	var connectionUpdated bool
	if err != nil {
if err != nil {

Member:

Hey Chetan, me again, a few more thoughts while thinking about the ramifications of the cluster invalidation c.Invalidate() and c.EnsureSynced() calls. (Feel free to poke holes in my logic!)

  • The watch function (defined here) is called for each watched resource, so if there are 100 resources, it will be called once for each, to establish the watch.

  • What about the scenario where only 1 watch (or a small number) is failing, for example because the ServiceAccount does not have the correct permissions to watch that resource?

    • If 1 watch out of 100 fails, will the cluster oscillate between Successful and Failed?
    • It might look like this:
    • Watch 1: succeeds, set cluster status to Successful
    • Watch 2: succeeds, set cluster status to Successful
    • Watch 3: succeeds, set cluster status to Successful
    • (...)
    • Watch 80: fails, set cluster status to Failed
    • Perhaps the ideal behaviour would be to report success/failure only after all watches were established. (Easier said than done, I'm sure.)

  • Another question: since it appears we are invalidating the cluster cache on any failure:

    • Will this cause cache invalidation and re-sync to occur much more often if the network connection between Argo CD and the target (watched) cluster happens to be unstable? (For example, if 1% of connections fail, would that magnify into invalidating and forcing a resync of all the resources?)

Member Author:

Hi Jonathan, your concerns are valid. Since watches can fail for any reason, it is difficult to determine the cluster status from the errors returned during a watch. These errors can be transient, which makes it complicated to track their state, and the error messages can be too generic to tell why exactly a watch failed.

To fix this, I have added a goroutine that periodically checks for watch errors, pings the remote cluster (to get its version), and updates the status. With this approach we don't rely entirely on the watches, which prevents the status from oscillating on transient watch failures. We use the watch status to limit the number of pings to the remote cluster, pinging only when watches fail. I've also added checks so that we invalidate the cluster cache only when the status changes, which shouldn't happen frequently.

Let me know what you think. The periodic goroutine handles some of the edge cases that arise from the dynamic nature of the watches, but I'm open to other approaches to solve this problem.

@jgwest (Member) left a comment:

I would love to rewrite the cluster cache logic entirely and replace the locks with Go channels (which are much easier to reason about, when done right!), but that's probably not happening any time soon. 😄

pkg/cache/cluster.go (outdated review thread; resolved)
Comment on lines 1246 to 1300
	for {
		select {
		case <-ticker.C:
			watchErrors := c.watchFails.len()
			// Ping the cluster for connection verification if there are watch failures or
			// if the cluster has recovered back from watch failures.
			if watchErrors > 0 || (watchErrors == 0 && c.connectionStatus == ConnectionStatusFailed) {
				c.log.V(1).Info("verifying cluster connection", "watches", watchErrors)

				_, err := c.kubectl.GetServerVersion(c.config)
				if err != nil {
					if c.connectionStatus != ConnectionStatusFailed {
						c.updateConnectionStatus(ConnectionStatusFailed)
					}
				} else if c.connectionStatus != ConnectionStatusSuccessful {
					c.updateConnectionStatus(ConnectionStatusSuccessful)
				}
			}
		case <-ctx.Done():
			ticker.Stop()
			return
		}
	}
}

Member:

Within this function:

  • The cluster cache lock should be owned before reading/writing c.connectionStatus.
    • Now you might be tempted to wrap the entire case in a lock/unlock, BUT we probably don't want to own the lock while calling GetServerVersion (it's nearly always bad to hold a lock across I/O), so that makes things a bit more complicated.

Member Author:

I've refactored the function to take the lock when reading c.connectionStatus, and managed to avoid holding it around GetServerVersion. Hopefully this fixes the issue.
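
For anyone following along, a minimal, self-contained sketch of the locking pattern under discussion (the names here are illustrative, not the PR's exact code): the lock is held only to read or write the status, never while the network call is in flight.

package main

import (
	"errors"
	"fmt"
	"sync"
)

type connectionStatus string

const (
	statusSuccessful connectionStatus = "Successful"
	statusFailed     connectionStatus = "Failed"
)

type cache struct {
	lock   sync.RWMutex
	status connectionStatus
}

func (c *cache) getStatus() connectionStatus {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.status
}

func (c *cache) setStatus(s connectionStatus) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.status = s
}

// checkConnection pings the cluster via the supplied function while holding
// no lock, then writes the status (and, in the real code, would invalidate
// the cache) only if the status actually changed.
func (c *cache) checkConnection(ping func() error) {
	prev := c.getStatus() // lock held only for this read
	err := ping()         // I/O happens with no lock held
	next := statusSuccessful
	if err != nil {
		next = statusFailed
	}
	if next != prev {
		c.setStatus(next) // lock held only for this write
		fmt.Printf("connection status changed: %s -> %s\n", prev, next)
	}
}

func main() {
	c := &cache{status: statusSuccessful}
	c.checkConnection(func() error { return errors.New("connection refused") })
}

Taking the write path only on an actual status change is what keeps the (expensive) invalidation from running on every tick.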

pkg/cache/cluster.go (outdated review thread; resolved)
pkg/cache/cluster.go (outdated review thread; resolved)
}

func (c *clusterCache) clusterConnectionService(ctx context.Context) {
	clusterConnectionTimeout := 10 * time.Second

Member:

I'm wondering if 10 seconds is maybe too fast. My main concern is that updateConnectionStatus is a very heavy call, because (among other things) it will invalidate the cache and cancel all the watches.

Perhaps a minute? Perhaps longer? WDYT?

Member Author:

I've used 10 seconds as the polling interval for the watch-failure state, so we might ping the remote cluster every 10s while watches are failing. But updateConnectionStatus runs only when the status actually changes; on subsequent polls the condition checks ensure we don't call it again.

These are the conditions we check before invalidating the cache:
1. Check whether there are watch failures, or whether the cluster has recovered (no watch errors, but the connection status is 'Failed').
2. If either condition is met, we ping the remote cluster to get its version.
3. If there is no error but the connection status is Failed, we invalidate the cache.
4. If there is an error but the connection status is Successful, we invalidate the cache.

I'm open to using a longer duration, but the tradeoff is that we can't recognize failures as early. That should be okay, I suppose, since Argo CD isn't primarily used for cluster monitoring.

Member:

I'd recommend making the interval configurable per cache, so it can be parameterized by a consumer.

Member Author:

Made the interval configurable 👍
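
For reference, the cluster cache already exposes functional options (UpdateSettingsFunc) for knobs like this; a configurable interval would presumably look roughly like the sketch below (the option and field names are assumptions, not necessarily the PR's exact API):

// Sketch only, in the style of pkg/cache/settings.go; names are assumed.
func SetClusterConnectionInterval(interval time.Duration) UpdateSettingsFunc {
	return func(cache *clusterCache) {
		cache.clusterConnectionInterval = interval
	}
}

// A consumer would then pass it when constructing the cache, e.g.:
//   clusterCache := cache.NewClusterCache(config, cache.SetClusterConnectionInterval(1*time.Minute))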

	if watchErrors > 0 || watchesRecovered {
		c.log.V(1).Info("verifying cluster connection", "watches", watchErrors)

		_, err := c.kubectl.GetServerVersion(c.config)

Member:

I feel evaluation of this call should be done with a certain degree of tolerance for intermittent failures, e.g. a (configurable) retry on certain errors such as a flaky network.

Member Author:

Since we call this from a ticker, it will be evaluated every X seconds (default 10), so it will be called again on the next interval in the case of an intermittent failure. Do we still need an explicit retry? If so, can we reuse the existing listRetryFunc?

Member:

It depends: will a failing version call lead to cluster cache invalidation, with the cache then being rebuilt on the next interval?

Member Author:

There is a rare possibility of frequent invalidations if both GetServerVersion() and the watches fail/recover between successive ticker intervals. I have wrapped the call in a retry that checks for transient network errors using isTransientNetworkErr(), so flaky network errors are retried without invalidating the cache.
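
For context, a sketch of what such a retry can look like; isTransientNetworkErr is the helper named above, but the body shown here is an assumed implementation built on apimachinery's error helpers, not the PR's exact code:

package cache

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"

	"github.com/argoproj/gitops-engine/pkg/utils/kube"
)

// isTransientNetworkErr reports whether an error looks like flaky networking
// rather than a real outage (assumed implementation).
func isTransientNetworkErr(err error) bool {
	return utilnet.IsConnectionReset(err) ||
		utilnet.IsConnectionRefused(err) ||
		utilnet.IsProbableEOF(err) ||
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTimeout(err)
}

// pingCluster returns nil if the cluster answered a version request, retrying
// only when the failure is classified as transient.
func pingCluster(kubectl kube.Kubectl, config *rest.Config) error {
	return retry.OnError(retry.DefaultBackoff, isTransientNetworkErr, func() error {
		_, err := kubectl.GetServerVersion(config)
		return err
	})
}

With this shape, a one-off network blip is absorbed by the retry, while a persistent failure still surfaces as an error, flips the connection status, and only then triggers invalidation.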

sonarcloud bot commented Feb 19, 2024

Quality Gate failed

Failed conditions
C Security Rating on New Code (required ≥ A)

See analysis details on SonarCloud


sonarcloud bot commented Apr 4, 2024

Quality Gate failed

Failed conditions
C Security Rating on New Code (required ≥ A)

See analysis details on SonarCloud


chetan-rns requested a review from jannfis on April 4, 2024 10:53

@jannfis (Member) left a comment:

LGTM, thank you @chetan-rns !

Signed-off-by: Chetan Banavikalmutt <[email protected]>
- Refactor the clusterConnectionService to use locks before reading the status
- Rename the function that monitors the cluster connection status
- Fix typo

Signed-off-by: Chetan Banavikalmutt <[email protected]>
sonarcloud bot commented May 9, 2024

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
1.7% Duplication on New Code

See analysis details on SonarCloud

3 participants