Produce high-level telemetry #2012

Open
2 of 6 tasks
Nealm03 opened this issue Apr 2, 2024 · 0 comments
Nealm03 commented Apr 2, 2024

Feature description

A key part of understanding and debugging production code is identifying the sources of issues, from errors to latency. It would be great if the "key" steps of PostGraphile's execution model could be exposed as out-of-the-box metrics that can be switched on and ingested by popular metric stacks, e.g. OpenTelemetry / Prometheus.

Motivating example

  • We have noticed high latency in some requests, yet the database reports low utilisation and relatively quick response times. We'd like to identify where our bottleneck is.

Ideally there are some "significant events" in the lifecycle of a request that we could measure and understand better. Perhaps:

  • Planning (internal vs plugins): expose the relative latency that custom plugins add during the planning phase, as well as the significant planning steps in the request-processing pipeline. This would help engineers track down slowness incurred by custom functionality, and better understand the planning model and where their usage patterns are not ideal.
  • Execution (I/O latency): expose the async steps that reach out to the database, and perhaps custom resolver steps. This would help engineers identify whether there is a misconfiguration in connection pooling or general networking overhead.
  • Response (might be considered part of the former): expose response-mapping and validation timing so it can be correlated with large requests. Anecdotally, response validation has often caused performance degradations in my experience, and is largely symptomatic of pathological requests.
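To make the idea concrete, here is a minimal sketch of what timing these lifecycle phases could look like. Everything here is hypothetical illustration, not PostGraphile API: the `timePhase` helper and the in-memory `timings` map are assumptions standing in for whatever hook points the library would expose, and an exporter could later publish the collected samples as Prometheus histograms or OpenTelemetry metrics.

```typescript
// Hypothetical sketch: time "significant events" (planning, execution,
// response) and collect per-phase duration samples in memory.

type PhaseTimings = Map<string, number[]>;

const timings: PhaseTimings = new Map();

// Wrap an async step, recording its wall-clock duration under `phase`.
async function timePhase<T>(phase: string, fn: () => Promise<T>): Promise<T> {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    const samples = timings.get(phase) ?? [];
    samples.push(elapsedMs);
    timings.set(phase, samples);
  }
}

// Illustrative request handler: each lifecycle phase is wrapped, so the
// relative cost of planning vs execution vs response becomes visible.
async function handleRequest(): Promise<string> {
  const plan = await timePhase("planning", async () => ({ steps: 3 }));
  const rows = await timePhase("execution", async () => [1, 2, 3]);
  return timePhase("response", async () => JSON.stringify({ plan, rows }));
}

handleRequest().then(() => {
  for (const [phase, samples] of timings) {
    console.log(`${phase}: ${samples.length} sample(s)`);
  }
});
```

In a real integration these samples would feed an OpenTelemetry histogram per phase rather than a `Map`, but the shape of the instrumentation points is the same.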

Supporting development

  • I am interested in building this feature myself
  • I am interested in collaborating on building this feature
  • I am willing to help test this feature before it's released
  • I am willing to write a test-driven test suite for this feature (before it exists)
  • I am a Graphile sponsor ❤️
  • I have an active support or consultancy contract with Graphile