Namerd: Improve /dtab/delegator.json for discovery health checking #2013

Open
1 of 2 tasks
mohsenrezaeithe opened this issue Jun 22, 2018 · 2 comments
mohsenrezaeithe commented Jun 22, 2018

Issue Type:

  • Bug report
  • Feature request

What happened:
We've been building tools, checks, and monitoring for alerting on discovery problems, given a test prefix and service to resolve.

The /dtab/delegator.json endpoint is great, but unlike the admin UI (red vs. green), it doesn't concretely identify discovery issues at the return-code level:

[screenshots of a failed (red) vs. a successful (green) delegation in the admin UI were attached here]

...and the response body needs to be parsed to tell red from green, which is particularly challenging when running Namerd in a horizontally scaled, containerized environment such as Kubernetes.

What you expected to happen:
While designing the "right" monitoring for this service, I've been treating /dtab/delegator.json as a health endpoint. If that's the right way to health-check the discovery side of Namerd, does it make sense for it to return distinct status codes for successful and unsuccessful delegations? Right now the service returns 200 OK for both hits and misses, but I'd expect misses to return a non-200 response.
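To make the problem concrete, here is a sketch of the kind of "health check wrapper" this forces today: a small function that walks the delegation tree returned by /dtab/delegator.json and decides hit vs. miss. The tree schema used here (a "type" field with "leaf"/"neg"/"fail" and children under "delegate"/"alt"/"union" keys) is an assumption for illustration, not Namerd's documented response format.

```python
def tree_resolves(node):
    """Return True if any path through the delegation tree reaches a bound leaf.

    This mirrors the red/green distinction in the admin UI, under the assumed
    (hypothetical) schema described above.
    """
    if not isinstance(node, dict):
        return False
    node_type = node.get("type")
    if node_type == "leaf":           # successfully bound -> "green"
        return True
    if node_type in ("neg", "fail"):  # negative or failed resolution -> "red"
        return False
    # Recurse into whatever child nodes are present.
    children = []
    if "delegate" in node:
        children.append(node["delegate"])
    children.extend(node.get("alt", []))
    children.extend(node.get("union", []))
    return any(tree_resolves(child) for child in children)


# Example trees (illustrative shapes, not real Namerd output):
miss = {"type": "delegate", "path": "/svc/bogus", "delegate": {"type": "neg"}}
hit = {"type": "delegate", "path": "/svc/web",
       "delegate": {"type": "leaf", "bound": {"addr": "10.0.0.1:8080"}}}
```

A wrapper like this is exactly the per-consumer parsing burden that a non-200 status code for misses would remove.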

How to reproduce it (as minimally and precisely as possible):
/<prefix>/<bogus_service_name> --> red
/<prefix>/<discovered_service_name> --> green

Anything else we need to know?:
I'm not particularly married to any specific return codes, but I was thinking this could return a 5xx code to identify a server-side discovery issue.

Another option could be to standardize service mesh discovery (error) codes, given the fast adoption of service meshes by SOA users, but I don't have any info on how a new return-code space could be introduced and standardized.

Environment:

  • linkerd/namerd version, config files: Namerd 1.3.6 with an example dtab as such for outgoing REST:
    /srv/default=>/#/io.l5d.k8s.clu1DsHttp/default/http | /#/io.l5d.k8s.clu2DsHttp/default/http;
    /svc=>/srv;
    
  • Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes 1.10.x
  • Cloud provider or hardware configuration: GCP
adleong commented Jun 26, 2018

@mrezaei00 This is an interesting idea. The delegator.json endpoint is intended to return the delegation tree for a particular path and dtab. I interpret HTTP status code 200 to mean that the delegation tree was successfully returned (whereas, for example, 404 means the delegation tree was not found and 500 means there was an error producing it).

So I'm not sure it makes sense for the HTTP status code to describe the content of the delegation tree (i.e. whether the tree is neg, bound, or fail).

Here is the logic that we use to color the boxes in the delegator UI: https://github.com/linkerd/linkerd/blob/master/admin/src/main/resources/io/buoyant/admin/js/src/delegator.js

It seems reasonable to move that logic onto the server side and have the status of each node returned as part of the json structure. Another possibility would be to encode the status (bound/neg/fail) of the tree in a response header.
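A rough sketch of what moving that logic server-side could look like: collapse the tree to a single status that could be returned in the JSON body or a response header. The node types, child keys, and the idea of a dedicated status header are illustrative assumptions here, not existing Namerd behavior.

```python
def aggregate_status(node):
    """Collapse a delegation tree to 'bound', 'neg', or 'fail'.

    Any bound leaf wins; otherwise a failure outranks a plain negative.
    Tree schema is a hypothetical stand-in for the delegator.json payload.
    """
    if not isinstance(node, dict):
        return "neg"
    node_type = node.get("type")
    if node_type == "leaf":
        return "bound"
    if node_type in ("neg", "fail"):
        return node_type
    statuses = [aggregate_status(child) for child in _children(node)]
    if "bound" in statuses:
        return "bound"
    if "fail" in statuses:
        return "fail"
    return "neg"


def _children(node):
    # Gather child nodes regardless of which combinator produced them.
    kids = []
    if "delegate" in node:
        kids.append(node["delegate"])
    kids.extend(node.get("alt", []))
    kids.extend(node.get("union", []))
    return kids
```

With the status computed on the server, clients (and probes) would no longer need to reimplement the delegator.js coloring rules.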

What do you think?

mohsenrezaeithe (Author) commented
@adleong I think moving the logic to the server is a great idea, but I still think the health check could benefit from the return code. My main reason is to avoid implementing a "health check wrapper" just to get the service health-checked; that pattern could quickly diverge as the service sees more use. If the health of the service is decided on the server, everyone uses the same core health check.

Another reason for including a corresponding response code is the limitations of the monitoring systems out there. In our case we use Kubernetes to do health checking through a livenessProbe, and without a custom health script we can't achieve proper liveness.
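To illustrate the limitation: a Kubernetes httpGet livenessProbe can only branch on the HTTP status code, so as long as delegation misses return 200 OK, a probe like the following (path, port, and timings are placeholders, not a working config) would always report healthy:

```yaml
livenessProbe:
  httpGet:
    path: /dtab/delegator.json
    port: 9991
  periodSeconds: 10
  failureThreshold: 3
```

A non-200 response for misses would let this stock probe work without any wrapper script.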

I also agree that 5xx is not the right choice, and 404(4xx) or new codes may be better choice(s).

@adleong adleong removed this from To do in Linkerd Kanban Jul 16, 2018