Namerd: Improve /dtab/delegator.json for discovery health checking #2013

Open
1 of 2 tasks
mohsenrezaeithe opened this issue Jun 22, 2018 · 2 comments
mohsenrezaeithe commented Jun 22, 2018

Issue Type:

  • Bug report
  • Feature request

What happened:
We've been building tools, checks, and monitoring for alerting on discovery problems, given a test prefix and service to resolve.

The /dtab/delegator.json endpoint is great, but unlike the admin UI (red vs. green), it doesn't concretely identify discovery issues at the return-code level:

[screenshots of a failed (red) vs. a successful (green) delegation in the admin UI were attached here]

...and the response body needs to be parsed to tell red from green, which is particularly challenging when running Namerd in a horizontally scaled, containerized environment such as Kubernetes.

What you expected to happen:
While designing the "right" monitoring for this service, I've been treating /dtab/delegator.json as a health endpoint. If that's the right way to health-check the discovery side of Namerd, does it make sense for it to return distinct status codes for successful and unsuccessful delegations? Right now the service returns 200 OK for both hits and misses, but I'd expect misses to return a non-200 response.
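To make the problem concrete, here is a sketch of the kind of "health check wrapper" this forces today: a small function that walks the delegation tree returned by /dtab/delegator.json and decides hit vs. miss. The tree schema used here (a "type" field with "leaf"/"neg"/"fail" and children under "delegate"/"alt"/"union" keys) is an assumption for illustration, not Namerd's documented response format.

```python
def tree_resolves(node):
    """Return True if any path through the delegation tree reaches a bound leaf.

    This mirrors the red/green distinction in the admin UI, under the assumed
    (hypothetical) schema described above.
    """
    if not isinstance(node, dict):
        return False
    node_type = node.get("type")
    if node_type == "leaf":           # successfully bound -> "green"
        return True
    if node_type in ("neg", "fail"):  # negative or failed resolution -> "red"
        return False
    # Recurse into whatever child nodes are present.
    children = []
    if "delegate" in node:
        children.append(node["delegate"])
    children.extend(node.get("alt", []))
    children.extend(node.get("union", []))
    return any(tree_resolves(child) for child in children)


# Example trees (illustrative shapes, not real Namerd output):
miss = {"type": "delegate", "path": "/svc/bogus", "delegate": {"type": "neg"}}
hit = {"type": "delegate", "path": "/svc/web",
       "delegate": {"type": "leaf", "bound": {"addr": "10.0.0.1:8080"}}}
```

A wrapper like this is exactly the per-consumer parsing burden that a non-200 status code for misses would remove.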

How to reproduce it (as minimally and precisely as possible):
/<prefix>/<bogus_service_name> --> red
/<prefix>/<discovered_service_name> --> green

Anything else we need to know?:
I'm not particularly married to any specific return codes, but I was thinking this could return a 5xx code to identify a server-side discovery issue.

Another option could be to standardize service mesh discovery (error) codes, given the fast adoption of service meshes by SOA users, but I don't have any info on how a new return-code space could be introduced and standardized.

Environment:

  • linkerd/namerd version, config files: Namerd 1.3.6 with an example dtab as such for outgoing REST:
    /srv/default=>/#/io.l5d.k8s.clu1DsHttp/default/http | /#/io.l5d.k8s.clu2DsHttp/default/http;
    /svc=>/srv;
    
  • Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes 1.10.x
  • Cloud provider or hardware configuration: GCP
adleong commented Jun 26, 2018

@mrezaei00 This is an interesting idea. The delegator.json endpoint is intended to return the delegation tree for a particular path and dtab. I interpret HTTP status code 200 to mean that the delegation tree was successfully returned (whereas, for example, 404 means the delegation tree was not found and 500 means there was an error producing it).

So I'm not sure it makes sense for the HTTP status code to describe the content of the delegation tree (i.e. whether the tree is neg, bound, or fail).

Here is the logic that we use to color the boxes in the delegator UI: https://github.com/linkerd/linkerd/blob/master/admin/src/main/resources/io/buoyant/admin/js/src/delegator.js

It seems reasonable to move that logic onto the server side and have the status of each node returned as part of the json structure. Another possibility would be to encode the status (bound/neg/fail) of the tree in a response header.
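A rough sketch of what moving that logic server-side could look like: collapse the tree to a single status that could be returned in the JSON body or a response header. The node types, child keys, and the idea of a dedicated status header are illustrative assumptions here, not existing Namerd behavior.

```python
def aggregate_status(node):
    """Collapse a delegation tree to 'bound', 'neg', or 'fail'.

    Any bound leaf wins; otherwise a failure outranks a plain negative.
    Tree schema is a hypothetical stand-in for the delegator.json payload.
    """
    if not isinstance(node, dict):
        return "neg"
    node_type = node.get("type")
    if node_type == "leaf":
        return "bound"
    if node_type in ("neg", "fail"):
        return node_type
    statuses = [aggregate_status(child) for child in _children(node)]
    if "bound" in statuses:
        return "bound"
    if "fail" in statuses:
        return "fail"
    return "neg"


def _children(node):
    # Gather child nodes regardless of which combinator produced them.
    kids = []
    if "delegate" in node:
        kids.append(node["delegate"])
    kids.extend(node.get("alt", []))
    kids.extend(node.get("union", []))
    return kids
```

With the status computed on the server, clients (and probes) would no longer need to reimplement the delegator.js coloring rules.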

What do you think?

mohsenrezaeithe (Author) commented
@adleong I think moving the logic to the server is a great idea, but I still think the health check could benefit from the return code. My main reason is to avoid implementing a "health check wrapper" just to get the service health-checked; that pattern could quickly diverge as the service sees more use. If the health of the service is decided on the server, everyone uses the same core health check.

Another reason for including a corresponding response code is the limitations of the monitoring systems out there. In our case we use Kubernetes to do health checking through a livenessProbe, and without a custom health script we can't achieve proper liveness.
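To illustrate the limitation: a Kubernetes httpGet livenessProbe can only branch on the HTTP status code, so as long as delegation misses return 200 OK, a probe like the following (path, port, and timings are placeholders, not a working config) would always report healthy:

```yaml
livenessProbe:
  httpGet:
    path: /dtab/delegator.json
    port: 9991
  periodSeconds: 10
  failureThreshold: 3
```

A non-200 response for misses would let this stock probe work without any wrapper script.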

I also agree that 5xx is not the right choice, and 404(4xx) or new codes may be better choice(s).

@adleong adleong removed this from To do in Linkerd Kanban Jul 16, 2018