Refactor egress IP healthcheck #4286

Open

jcaamano wants to merge 12 commits into ovn-org:master from jcaamano:healthcheck
Conversation

jcaamano
Contributor

This PR refactors the egress IP health check with the following objectives:

  • have the same runtime health check for egress IP and egress service
  • remove the hardcoded 5-second health check period and 2-attempt
    minimum that made the quickest possible reaction time range
    from 7s to 12s without providing finer control
  • health check each node in a separate goroutine, preventing timeouts
    on one node's health check from delaying other nodes' health checks
  • TENTATIVE: have different unavailability tiers, so that an ovnk rollout
    does not cause unnecessary service disruption

Adds a new centralized controller in charge of health checks. It listens to
node events to cancel the health check of deleted nodes and to become aware
of node configuration changes that impact the health check function.

Provides an API for parties interested in node health information,
such as the egress IP or egress service controllers. These parties use this
API to express and maintain their interest in the health information of
specific nodes, which serves as an additional input to the health
check controller's reconciliation loop. The API itself is level driven:
interested parties are notified of health state changes but not of
the details; they are expected to query the controller for those details
in their own reconciliation loops.

The health check itself is done on a separate goroutine for each node.
The health check period is hardcoded to 1 second instead of the previous
5 seconds. The health check probe timeout is no longer governed by the
timeout configuration parameter and is also hardcoded to 1 second, which
was the previous default. There is no longer a hardcoded minimum number of
consecutive failed attempts to declare a node unreachable: the timeout
configuration parameter now measures the amount of time that needs to
pass since the last time a node was observed available before declaring
it unreachable, situating the range where this happens at (value, value+1).

Changes the default timeout configuration parameter value to 10s. With the
new implementation, an unreachable node is declared unreachable
after 10s-11s of being last observed available, which is similar to the previous
behavior, where that range had a minimum and default of 7s-12s. Of course, users
now have the option of lowering this configuration down to 1s, making this
range 1s-2s.

This is up to commit "Change EgressIPReachabiltyTotalTimeout to 10s".

After that, a compromise exercise is done to attempt to avoid disruption on ovnk
rollouts. A new UNAVAILABLE (vs UNREACHABLE) health state is introduced that
expresses a situation where the node is alive with an apparently functioning OVN
dataplane (given that the health check itself is done through the management port)
even though ovn-k is not running (a similar interpretation to the one we had with
the legacy health check). In this state, other controllers might not want to
unschedule their services, to avoid unnecessary disruptions. As mentioned, this is
a compromise where we try to improve expected scenarios while potentially worsening
the effect of other, unexpected scenarios.
This new UNAVAILABLE health state is not supported in hypershift given
the implementation of proxied connections that we have there.
UP FOR DISCUSSION.

TODO: e2e coverage for this new UNAVAILABLE health state


netlify bot commented Apr 17, 2024

Deploy Preview for subtle-torrone-bb0c84 ready!

Name Link
🔨 Latest commit 2ea50d1
🔍 Latest deploy log https://app.netlify.com/sites/subtle-torrone-bb0c84/deploys/66279ea69381ab000816ce3d
😎 Deploy Preview https://deploy-preview-4286--subtle-torrone-bb0c84.netlify.app

@coveralls

coveralls commented Apr 17, 2024

Coverage Status

coverage: 52.837% (+0.2%) from 52.682%
when pulling d90c2d1 on jcaamano:healthcheck
into 13fb201 on ovn-org:master.

@jcaamano jcaamano force-pushed the healthcheck branch 3 times, most recently from 7ab198d to b202096 Compare April 17, 2024 17:12
@tssurya tssurya added kind/enhancement All issues/PRs that are new enhancement requests feature/egress-ip Issues related to EgressIP feature labels Apr 18, 2024
@jcaamano jcaamano force-pushed the healthcheck branch 3 times, most recently from 369a805 to 2ea50d1 Compare April 23, 2024 11:42
@jcaamano jcaamano requested a review from a team as a code owner May 9, 2024 11:30
@jcaamano jcaamano force-pushed the healthcheck branch 7 times, most recently from 9667a8c to 57ae35b Compare May 10, 2024 09:32
Health check is used by both the egress IP and egress service controllers
and could potentially be used for more in the future. It makes sense for
the code to live in its own package.

It's a copy and not a move due to the way egress service mocks those
functions in its tests. This will be cleaned up in a later commit.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
So they can be cancelled.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A queue of items that are popped no earlier than their associated
deadline. It will be used by the health check function to schedule
probing tasks against different nodes.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A new centralized controller in charge of health checks. It will listen to
node events to cancel the health check of deleted nodes and to become aware
of node configuration changes that impact the health check function.

It will provide an API for parties interested in node health information,
such as the egress IP or egress service controllers. These parties will use
this API to express and maintain their interest in the health information of
specific nodes, which will serve as an additional input to the health
check controller's reconciliation loop. The API itself is level driven:
interested parties will be notified of health state changes but not of
the details; they are expected to query the controller for those details
in their own reconciliation loops.

The health check itself is done on a separate goroutine for each node.
The health check period is hardcoded to 1 second instead of the previous
5 seconds. The health check probe timeout is no longer governed by the
timeout configuration parameter and is also hardcoded to 1 second, which
was the previous default. There is no longer a hardcoded minimum number of
consecutive failed attempts to declare a node unreachable: the timeout
configuration parameter now measures the amount of time that needs to
pass since the last time a node was observed available before declaring
it unreachable, situating the range where this happens at (value, value+1).

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
To:

* return an error instead of a bool, and interpret this error towards the
  health state of a node
* async connect *AND* probe instead of sync connect *OR* probe, following
  the recommended gRPC usage patterns
* remove the hardcoded extra attempt that adds further delay
* use proper semantics for the parameters: it makes no sense to pass IPs
  and a client that might be connected to a different set of IPs as
  parameters to the reachability method. Define a client that takes
  IPs as a property, to be used either for legacy probing
  or to build & use a gRPC probe client
* clean up logging

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
With the previous 1 second default and the health check implementation
that was in place before the health check controller, an unreachable
node would be declared unreachable 7 to 12 seconds after actually
becoming unreachable.

Change the default to 10 seconds, so that with the new implementation a
node is declared unreachable 10 to 11 seconds after actually becoming
unreachable, in an attempt to keep the default behavior close to what
we had before.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Expresses a situation where the node is alive with an apparently
functioning OVN dataplane (given that the health check itself is done
through the management port) even though ovn-k is not running.

This new health state allows interpreting this scenario, one for which
other controllers might not want to unschedule their services, for
example during ovn-kube rollouts.

This new UNAVAILABLE health state is not supported in hypershift given
the implementation of proxied connections that we have there.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A node might become unavailable temporarily during an ovn-kube rollout
without degrading the services that are already scheduled on it. Don't
de-allocate in that case to prevent an unnecessary service disruption.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A node might become unavailable temporarily during an ovn-kube rollout
without degrading the services that are already scheduled on it. Don't
de-allocate in that case to prevent an unnecessary service disruption.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
@tssurya tssurya requested review from kyrtapz and removed request for girishmg, trozet and dcbw May 14, 2024 10:01
Member

@tssurya tssurya left a comment

don't respond yet to this set, I wanna go till commit 6 so that I gain more context and might be able to answer my own questions.

Member

thank you for the commit split + commit description, made it very easy to understand what you are doing.

@@ -38,7 +39,7 @@ func isReachableViaGRPC(mgmtIPs []net.IP, client healthcheck.EgressIPHealthClien
return client.Probe(dialCtx)
}

func isReachableLegacy(node string, mgmtIPs []net.IP, totalTimeout int) bool {
func isReachableLegacy(ctx context.Context, node string, mgmtIPs []net.IP, totalTimeout int) bool {
Member

is it perhaps time to remove the legacy healthchecks and go with just grpc? I wonder why we didn't remove the older way till now

dialCtx, dialCancel := context.WithTimeout(ctx, timeout)
defer dialCancel()
var d net.Dialer
conn, err := d.DialContext(dialCtx, "tcp", net.JoinHostPort(ip.String(), "9"))
Member

self-note, older DialTimeout was just using DialContext underneath so logic hasn't changed.

func SleepWithContext(ctx context.Context, duration time.Duration) {
ctx, cancel := context.WithTimeout(ctx, duration)
defer cancel()
<-ctx.Done()
Member

smart! two birds one stone. So you want to ensure we do nothing for the given time before retrying and then also ensure if the context gets cancelled cancel the run and return by interrupting the nothing mode? The other way here was to interrupt sleep by catching the Done but this works.

Member

can you expand on "Will be used by health check function to schedule probing tasks against different nodes." ? Will all the nodes be added to this queue along with the associated time to retry or something? -> maybe the future commits have more info

jitter = 200
)

// nodeState contians the latest declared health state for a node and the latest
Member

nit: contains

popM sync.Mutex
pushM sync.Mutex
items heapImpl[T]
consumers map[chan struct{}]time.Time
Member

can you also expand on what are consumers?

Contributor

yeah I don't really get this design either. I thought it was a bunch of channels that were added for each push, then it cycles through them to decide when to "pop", but it doesn't look like that's how it works, and I only see values assigned during pop.

I'm not sure what the value is here in using channels.


const (
// period in seconds of each node probe
period = 1
Member

do we need some initial scale testing for this? 1s each node might be too much on 500 nodes? TBH I have no idea if I am over thinking here which is why the question. Not sure if 5s was intentionally picked or not.

}

// Push an item into the queue that can't be popped before its associated
// timeout
Member

nit: associated deadline (because you seem to be adding the timeout value to the current time)

}

func (tq *TimeoutQueue[T]) pop(ctx context.Context) item[T] {
// We make extensive use of channels and timers. Suposedly they are cheap,
Member

self-note: stopped review here. pick up from here.

@tssurya tssurya requested a review from trozet June 7, 2024 15:09
@trozet trozet self-assigned this Jun 20, 2024
Contributor

@trozet trozet left a comment

Made it up to "Move egress service to new healthcheck controller"

}

func (tq *TimeoutQueue[T]) pop(ctx context.Context) item[T] {
// We make extensive use of channels and timers. Suposedly they are cheap,
Contributor

nit: Supposedly

tq.push(item[T]{item: it, deadline: time.Now().Add(timeout)})
}

// item is the internal representation of an item with its asssociated deadline
Contributor

nit associated


// probeTask represents a probing task for a node including the the context it
// is executed under and the client used for the probe
type probeTask struct {
node string
Contributor

node here is targetNode right? In other words, nodes that are not this node.

@@ -0,0 +1,426 @@
package healthcheck
Contributor

so in this model, you are running health checks from cluster manager to the nodes. I think this is the same model we had before right? Just want to make sure I understand this correctly.

return false
}

updated := c.setNodeState(task.node, new, on, false)
Contributor

avoid using "new" as it's also a builtin in golang. Maybe newHealthState or something

// consumer might be informed about a node it is not interested in and
// should be able to ignore it in that case.
HealthStateChanged(node string)
// HealthStateChanged queries the interest of the consumer of being informed
Contributor

// MonitorHealthState

// Provider of node health state information
type Provider interface {
// Register a consumer to be informed about node health state changes
Register(consumer Consumer)
Contributor

how does a consumer deRegister?

type Consumer interface {
// Name the consumer goes by for logging purposes
Name() string
// HealthStateChanged informs consumers of node health state changes. A
Contributor

it's not clear from this whether the function informs consumers of the node health. It returns nothing. Also, is this function triggered when a change happens, or is it left to the consumer to call it, returning whether something changed since it was last queried?

// about health state changes of the node. A node is monitored if any
// consumer is interested. Monitoring for a node might terminate as soon as
// no consumer is interested.
MonitorHealthState(node string) bool
Contributor

so how does a consumer actually use this function? It's not clear to me what it is for.

Labels
feature/egress-ip Issues related to EgressIP feature kind/enhancement All issues/PRs that are new enhancement requests