Refactor egress IP healthcheck #4286
base: master
Conversation
Health check is used both by the egress IP and egress services controllers and could potentially be used for more things in the future. It makes sense for the code to live in its own package. It's a copy and not a move due to the way egress service mocks those functions in its tests. This will be cleaned up in a later commit. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
So they can be cancelled. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A queue of items that are popped no earlier than their associated timeout. Will be used by the health check function to schedule probing tasks against different nodes. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A new centralized controller in charge of health checks. Will listen to node events to cancel health checks of deleted nodes or to become aware of node configuration changes that impact the health check function.

Will provide an API for parties interested in node health information, like the egress IP or egress service controllers. These parties will use this API to express and upkeep their interest in the health information of specific nodes, which will serve as an additional input to the health check controller reconciliation loop. The API itself is level driven: interested parties get notified of health state changes but not of the details; they are expected to query the controller for those details in their reconciliation loops.

The health check itself is done on a separate goroutine for each node. The health check period is hardcoded to 1 second instead of the previous 5 seconds. The health check probe timeout is no longer governed by the timeout configuration parameter and is also hardcoded to 1 second, which was the previous default. There is no hardcoded number of minimum consecutive failed attempts to declare a node unreachable: the timeout configuration parameter now measures the amount of time that needs to pass since the last time a node was observed available before declaring it unreachable, situating the range where this happens at (value, value+1). Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
To:
* return error instead of bool, and interpret this error towards the health state of a node
* async connect *AND* probe instead of sync connect *OR* probe, following the recommended GRPC usage patterns
* remove the hardcoded extra attempt that adds further delay
* use proper semantics for the parameters: it makes no sense to pass IPs and a client that might be connected to a different set of IPs as parameters to the reachability method. Define a client that takes IPs as a property, to be used either for legacy probing or to build & use a GRPC probe client
* cleanup logging

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
With the previous 1 second default and the health check implementation that was in place before the health check controller, an unreachable node would be declared unreachable after 7 to 12 seconds of actually becoming unreachable. Change the default to 10 seconds, so that with the new implementation a node is declared unreachable after 10 to 11 seconds of actually becoming unreachable in an attempt to keep the default behavior closer to what we had before. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Expressing a situation where the node is alive with an apparently functioning OVN dataplane (given that the health check itself is done through the management port) even though ovn-k is not running. This new health state allows interpreting this scenario, for which other controllers might not want to unschedule their services: for example, during ovn-kube rollouts. This new UNAVAILABLE health state is not supported in hypershift given the implementation of proxied connections that we have there. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A node might become unavailable temporarily during an ovn-kube rollout without degrading the services that are already scheduled on it. Don't de-allocate in that case to prevent an unnecessary service disruption. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
A node might become unavailable temporarily during an ovn-kube rollout without degrading the services that are already scheduled on it. Don't de-allocate in that case to prevent an unnecessary service disruption. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
don't respond yet to this set, I wanna go till commit 6 so that I gain more context and might be able to answer my own questions.
thank you for the commit split + commit description, made it very easy to understand what you are doing.
@@ -38,7 +39,7 @@ func isReachableViaGRPC(mgmtIPs []net.IP, client healthcheck.EgressIPHealthClien
 	return client.Probe(dialCtx)
 }
 
-func isReachableLegacy(node string, mgmtIPs []net.IP, totalTimeout int) bool {
+func isReachableLegacy(ctx context.Context, node string, mgmtIPs []net.IP, totalTimeout int) bool {
is it perhaps time to remove the legacy healthchecks and go with just grpc? I wonder why we didn't remove the older way till now
	dialCtx, dialCancel := context.WithTimeout(ctx, timeout)
	defer dialCancel()
	var d net.Dialer
	conn, err := d.DialContext(dialCtx, "tcp", net.JoinHostPort(ip.String(), "9"))
self-note, older DialTimeout was just using DialContext underneath so logic hasn't changed.
func SleepWithContext(ctx context.Context, duration time.Duration) {
	ctx, cancel := context.WithTimeout(ctx, duration)
	defer cancel()
	<-ctx.Done()
smart! two birds one stone. So you want to ensure we do nothing for the given time before retrying and then also ensure if the context gets cancelled cancel the run and return by interrupting the nothing mode? The other way here was to interrupt sleep by catching the Done but this works.
can you expand on "Will be used by health check function to schedule probing tasks against different nodes." ? Will all the nodes be added to this queue along with the associated time to retry or something? -> maybe the future commits have more info
	jitter = 200
)

// nodeState contians the latest declared health state for a node and the latest
nit: contains
	popM      sync.Mutex
	pushM     sync.Mutex
	items     heapImpl[T]
	consumers map[chan struct{}]time.Time
can you also expand on what are consumers?
yeah I don't really get this design either. I thought it was a bunch of channels that were added for each push, then it cycles through them to decide when to "pop" but it doesn't look like thats how it works, and i only see values assigned during pop.
I'm not sure what the value is here in using channels.
const (
	// period in seconds of each node probe
	period = 1
do we need some initial scale testing for this? 1s each node might be too much on 500 nodes? TBH I have no idea if I am over thinking here which is why the question. Not sure if 5s was intentionally picked or not.
}

// Push an item into the queue that can't be popped before its associated
// timeout
nit: associated deadline (because you seem to be adding the timeout value to the current time)
}

func (tq *TimeoutQueue[T]) pop(ctx context.Context) item[T] {
	// We make extensive use of channels and timers. Suposedly they are cheap,
self-note: stopped review here. pick up from here.
Made it up to "Move egress service to new healthcheck controller"
}

func (tq *TimeoutQueue[T]) pop(ctx context.Context) item[T] {
	// We make extensive use of channels and timers. Suposedly they are cheap,
nit: Supposedly
	tq.push(item[T]{item: it, deadline: time.Now().Add(timeout)})
}

// item is the internal representation of an item with its asssociated deadline
nit: associated
// probeTask represents a probing task for a node including the context it
// is executed under and the client used for the probe
type probeTask struct {
	node string
node here is targetNode right? In other words, nodes that are not this node.
@@ -0,0 +1,426 @@
package healthcheck
so in this model, you are running health checks from cluster manager to the nodes. I think this is the same model we had before right? Just want to make sure I understand this correctly.
	return false
}

	updated := c.setNodeState(task.node, new, on, false)
avoid using "new" as it's also a builtin in golang. Maybe newHealthState or something
	// consumer might be informed about a node it is not interested in and
	// should be able to ignore it in that case.
	HealthStateChanged(node string)
	// HealthStateChanged queries the interest of the consumer of being informed
// MonitorHealthState
// Provider of node health state information
type Provider interface {
	// Register a consumer to be informed about node health state changes
	Register(consumer Consumer)
how does a consumer deRegister?
type Consumer interface {
	// Name the consumer goes by for logging purposes
	Name() string
	// HealthStateChanged informs consumers of node health state changes. A
it's not clear from this if the function informs consumers of the node health. It returns nothing. Also, is this function triggered when a change happens, or is it left to the consumer to call it, returning whether something changed since it was last queried?
	// about health state changes of the node. A node is monitored if any
	// consumer is interested. Monitoring for a node might terminate as soon as
	// no consumer is interested.
	MonitorHealthState(node string) bool
so how does a consumer actually use this function? It's not clear to me what it is for.
This PR refactors the egress IP health check with the following objectives:
* remove the hardcoded number of minimum consecutive failed attempts that was making the quickest possible reaction time range from 7s to 12s without providing finer control
* avoid failures on a node's health check delaying other nodes' health checks
* ensure that an ovn-kube rollout does not cause unnecessary service disruption
Adds a new centralized controller in charge of health checks. Listens to
node events to cancel health check of deleted nodes or to become aware of
node configuration changes that impact the health check function.
Provides an API for parties interested in node health information,
like egress IP or egress service controllers. These parties will use this
API to express and upkeep their interest in the health information of
specific nodes, which will serve as an additional input to the health
check controller reconciliation loop. The API itself is level driven:
interested parties get notified of health state changes but not of the
details; they are expected to query the controller for those details in
their reconciliation loops.
The health check itself is done on a separate goroutine for each node.
The health check period is hardcoded to 1 second instead of the previous
5 seconds. The health check probe timeout is no longer governed by the
timeout configuration parameter and is also hardcoded to 1 second which
was the previous default. There is no hardcoded number of minimum
consecutive failed attempts to declare a node unreachable: the timeout
configuration parameter now measures the amount of time that needs to
pass since the last time a node was observed available before declaring
it unreachable, situating the range where this happens at (value, value+1).
Changes the default timeout configuration parameter value to 10s. With the
new implementation, an unreachable node is actually declared unreachable
after 10s-11s of being last observed available, which is similar to the previous
behavior where that range had minimum and default of 7s-12s. Of course, now
users have the option of changing this configuration down to 1, making this
range 1s-2s.
This is up to commit "Change EgressIPReachabiltyTotalTimeout to 10s".
After that, a compromise exercise is done to attempt to avoid disruption on ovnk
rollouts. A new UNAVAILABLE (vs UNREACHABLE) health state is introduced that
expresses a situation where the node is alive with an apparently functioning OVN dataplane
(given that the health check itself is done through the management port) even though
ovn-k is not running (similar interpretation we had with the legacy health check), where
other controllers might not want to unschedule their services to avoid unnecessary
disruptions. As mentioned, this is a compromise where we try to improve expected
scenarios while potentially worsening the effect of other unexpected scenarios.
This new UNAVAILABLE health state is not supported in hypershift given
the implementation of proxied connections that we have there.
UP FOR DISCUSSION.
TODO: e2e coverage for this new UNAVAILABLE health state