Refactor egress IP healthcheck #4286

Open

jcaamano wants to merge 12 commits into ovn-org:master from jcaamano:healthcheck
Conversation

jcaamano
Contributor

This PR refactors the egress IP health check with the following objectives:

  • have the same runtime health check for egress IP and egress service
  • remove the hardcoded 5-second health check period and 2-attempt
    minimum that made the quickest possible reaction time range
    from 7s to 12s without providing finer control
  • health check each node in a separate goroutine, preventing timeouts
    on one node's health check from delaying other nodes' health checks
  • TENTATIVE: have different unavailability tiers, so that an ovnk rollout
    does not cause unnecessary service disruption

Adds a new centralized controller in charge of health checks. It listens to
node events to cancel the health check of deleted nodes and to become aware
of node configuration changes that impact the health check function.

Provides an API for parties interested in node health information,
such as the egress IP or egress service controllers. These parties use this
API to express and maintain their interest in the health information of
specific nodes, which serves as an additional input to the health
check controller's reconciliation loop. The API itself is level driven:
interested parties are notified of health state changes but not of
the details; they are expected to query the controller for those details
in their own reconciliation loops.

The health check itself is done on a separate goroutine for each node.
The health check period is hardcoded to 1 second instead of the previous
5 seconds. The health check probe timeout is no longer governed by the
timeout configuration parameter and is also hardcoded to 1 second, which
was the previous default. There is no longer a hardcoded minimum number of
consecutive failed attempts to declare a node unreachable: the timeout
configuration parameter now measures the amount of time that needs to
pass since the last time a node was observed available before declaring
it unreachable, situating the range where this happens at (value, value+1).

Changes the default timeout configuration parameter value to 10s. With the
new implementation, an unreachable node is declared unreachable
after 10s-11s of being last observed available, which is similar to the previous
behavior, where that range had a minimum and default of 7s-12s. Of course, users
now have the option of lowering this configuration down to 1s, making this
range 1s-2s.

This is up to commit "Change EgressIPReachabiltyTotalTimeout to 10s".

After that, a compromise exercise is done to attempt to avoid disruption on ovnk
rollouts. A new UNAVAILABLE (vs UNREACHABLE) health state is introduced that
expresses a situation where the node is alive with an apparently functioning OVN
dataplane (given that the health check itself is done through the management port)
even though ovn-k is not running (a similar interpretation to the one we had with
the legacy health check). In this state, other controllers might not want to
unschedule their services, to avoid unnecessary disruptions. As mentioned, this is
a compromise where we try to improve expected scenarios while potentially worsening
the effect of other, unexpected scenarios.
This new UNAVAILABLE health state is not supported in hypershift given
the implementation of proxied connections that we have there.
UP FOR DISCUSSION.

TODO: e2e coverage for this new UNAVAILABLE health state


netlify bot commented Apr 17, 2024

Deploy Preview for subtle-torrone-bb0c84 ready!

Name Link
🔨 Latest commit 2ea50d1
🔍 Latest deploy log https://app.netlify.com/sites/subtle-torrone-bb0c84/deploys/66279ea69381ab000816ce3d
😎 Deploy Preview https://deploy-preview-4286--subtle-torrone-bb0c84.netlify.app

@coveralls

coveralls commented Apr 17, 2024

Coverage Status

coverage: 52.837% (+0.2%) from 52.682%
when pulling d90c2d1 on jcaamano:healthcheck
into 13fb201 on ovn-org:master.

@jcaamano jcaamano force-pushed the healthcheck branch 3 times, most recently from 7ab198d to b202096 Compare April 17, 2024 17:12
@tssurya tssurya added kind/enhancement All issues/PRs that are new enhancement requests feature/egress-ip Issues related to EgressIP feature labels Apr 18, 2024
@jcaamano jcaamano force-pushed the healthcheck branch 3 times, most recently from 369a805 to 2ea50d1 Compare April 23, 2024 11:42
@jcaamano jcaamano requested a review from a team as a code owner May 9, 2024 11:30
@jcaamano jcaamano force-pushed the healthcheck branch 7 times, most recently from 9667a8c to 57ae35b Compare May 10, 2024 09:32
Health check is used by both the egress IP and egress service controllers
and could potentially be used for more in the future. It makes sense for
the code to live in its own package.

It's a copy and not a move due to the way egress service mocks those
functions in its tests. This will be cleaned up in a later commit.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
So they can be cancelled.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A queue of items that are popped no earlier than their associated
deadline. It will be used by the health check function to schedule
probing tasks against different nodes.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A new centralized controller in charge of health checks. It will listen to
node events to cancel the health check of deleted nodes and to become aware
of node configuration changes that impact the health check function.

It will provide an API for parties interested in node health information,
such as the egress IP or egress service controllers. These parties will use
this API to express and maintain their interest in the health information of
specific nodes, which will serve as an additional input to the health
check controller's reconciliation loop. The API itself is level driven:
interested parties will be notified of health state changes but not of
the details; they are expected to query the controller for those details
in their own reconciliation loops.

The health check itself is done on a separate goroutine for each node.
The health check period is hardcoded to 1 second instead of the previous
5 seconds. The health check probe timeout is no longer governed by the
timeout configuration parameter and is also hardcoded to 1 second, which
was the previous default. There is no longer a hardcoded minimum number of
consecutive failed attempts to declare a node unreachable: the timeout
configuration parameter now measures the amount of time that needs to
pass since the last time a node was observed available before declaring
it unreachable, situating the range where this happens at (value, value+1).

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
To:

* return an error instead of a bool, and interpret this error towards the
  health state of a node
* async connect *AND* probe instead of sync connect *OR* probe, following
  the recommended gRPC usage patterns
* remove the hardcoded extra attempt that adds further delay
* use proper semantics for the parameters: it makes no sense to pass IPs
  and a client that might be connected to a different set of IPs as
  parameters to the reachability method. Define a client that takes
  IPs as a property, to be used either for legacy probing
  or to build & use a gRPC probe client
* clean up logging

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
With the previous 1 second default and the health check implementation
that was in place before the health check controller, an unreachable
node would be declared unreachable 7 to 12 seconds after actually
becoming unreachable.

Change the default to 10 seconds, so that with the new implementation a
node is declared unreachable 10 to 11 seconds after actually becoming
unreachable, in an attempt to keep the default behavior close to what
we had before.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Expresses a situation where the node is alive with an apparently
functioning OVN dataplane (given that the health check itself is done
through the management port) even though ovn-k is not running.

This new health state allows interpreting this scenario, one for which
other controllers might not want to unschedule their services, for
example during ovn-kube rollouts.

This new UNAVAILABLE health state is not supported in hypershift given
the implementation of proxied connections that we have there.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A node might become unavailable temporarily during an ovn-kube rollout
without degrading the services that are already scheduled on it. Don't
de-allocate in that case to prevent an unnecessary service disruption.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A node might become unavailable temporarily during an ovn-kube rollout
without degrading the services that are already scheduled on it. Don't
de-allocate in that case to prevent an unnecessary service disruption.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
@tssurya tssurya requested review from kyrtapz and removed request for girishmg, trozet and dcbw May 14, 2024 10:01
Member

@tssurya tssurya left a comment

don't respond yet to this set, I wanna go till commit 6 so that I gain more context and might be able to answer my own questions.

Member

thank you for the commit split + commit description, made it very easy to understand what you are doing.

@@ -38,7 +39,7 @@ func isReachableViaGRPC(mgmtIPs []net.IP, client healthcheck.EgressIPHealthClien
return client.Probe(dialCtx)
}

func isReachableLegacy(node string, mgmtIPs []net.IP, totalTimeout int) bool {
func isReachableLegacy(ctx context.Context, node string, mgmtIPs []net.IP, totalTimeout int) bool {
Member

is it perhaps time to remove the legacy healthchecks and go with just grpc? I wonder why we didn't remove the older way till now

dialCtx, dialCancel := context.WithTimeout(ctx, timeout)
defer dialCancel()
var d net.Dialer
conn, err := d.DialContext(dialCtx, "tcp", net.JoinHostPort(ip.String(), "9"))
Member

self-note, older DialTimeout was just using DialContext underneath so logic hasn't changed.

func SleepWithContext(ctx context.Context, duration time.Duration) {
ctx, cancel := context.WithTimeout(ctx, duration)
defer cancel()
<-ctx.Done()
Member

smart! two birds one stone. So you want to ensure we do nothing for the given time before retrying and then also ensure if the context gets cancelled cancel the run and return by interrupting the nothing mode? The other way here was to interrupt sleep by catching the Done but this works.

Member

can you expand on "Will be used by health check function to schedule probing tasks against different nodes." ? Will all the nodes be added to this queue along with the associated time to retry or something? -> maybe the future commits have more info

jitter = 200
)

// nodeState contians the latest declared health state for a node and the latest
Member

nit: contains

popM sync.Mutex
pushM sync.Mutex
items heapImpl[T]
consumers map[chan struct{}]time.Time
Member

can you also expand on what are consumers?

Contributor

yeah I don't really get this design either. I thought it was a bunch of channels that were added for each push, then it cycles through them to decide when to "pop", but it doesn't look like that's how it works, and I only see values assigned during pop.

I'm not sure what the value is here in using channels.


const (
// period in seconds of each node probe
period = 1
Member

do we need some initial scale testing for this? 1s each node might be too much on 500 nodes? TBH I have no idea if I am over thinking here which is why the question. Not sure if 5s was intentionally picked or not.

}

// Push an item into the queue that can't be popped before its associated
// timeout
Member

nit: associated deadline (because you seem to be adding the timeout value to the current time)

}

func (tq *TimeoutQueue[T]) pop(ctx context.Context) item[T] {
// We make extensive use of channels and timers. Suposedly they are cheap,
Member

self-note: stopped review here. pick up from here.

@tssurya tssurya requested a review from trozet June 7, 2024 15:09
@trozet trozet self-assigned this Jun 20, 2024
Contributor

@trozet trozet left a comment

Made it up to "Move egress service to new healthcheck controller"

}

func (tq *TimeoutQueue[T]) pop(ctx context.Context) item[T] {
// We make extensive use of channels and timers. Suposedly they are cheap,
Contributor

nit: Supposedly

tq.push(item[T]{item: it, deadline: time.Now().Add(timeout)})
}

// item is the internal representation of an item with its asssociated deadline
Contributor

nit associated


// probeTask represents a probing task for a node including the the context it
// is executed under and the client used for the probe
type probeTask struct {
node string
Contributor

node here is targetNode right? In other words, nodes that are not this node.

@@ -0,0 +1,426 @@
package healthcheck
Contributor

so in this model, you are running health checks from cluster manager to the nodes. I think this is the same model we had before right? Just want to make sure I understand this correctly.

return false
}

updated := c.setNodeState(task.node, new, on, false)
Contributor

avoid using "new" as it's also a builtin in golang. Maybe newHealthState or something

// consumer might be informed about a node it is not interested in and
// should be able to ignore it in that case.
HealthStateChanged(node string)
// HealthStateChanged queries the interest of the consumer of being informed
Contributor

// MonitorHealthState

// Provider of node health state information
type Provider interface {
// Register a consumer to be informed about node health state changes
Register(consumer Consumer)
Contributor

how does a consumer deRegister?

type Consumer interface {
// Name the consumer goes by for logging purposes
Name() string
// HealthStateChanged informs consumers of node health state changes. A
Contributor

it's not clear from this whether the function informs consumers of the node health. It returns nothing. Also, is this function triggered when a change happens, or is it left to the consumer to call it, returning whether something changed since it was last queried?

// about health state changes of the node. A node is monitored if any
// consumer is interested. Monitoring for a node might terminate as soon as
// no consumer is interested.
MonitorHealthState(node string) bool
Contributor

so how does a consumer actually use this function? It's not clear to me what it is for.

Labels
feature/egress-ip Issues related to EgressIP feature kind/enhancement All issues/PRs that are new enhancement requests