Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement predicate_query to validate which are the metrics to be collected #4503

Merged
merged 9 commits into from
May 29, 2024

Conversation

gabrosys
Copy link
Contributor

@gabrosys gabrosys commented May 9, 2024

What

Implement a validation logic to verify if a query to collect metrics should be executed or not.

How

  • Define a property predicate_query.
  • Define a method isCollectable that is invoked by collect passing the same tx (to reduce the performance consumption) that returns a bool.
  • Add a new e2e test, with related files, to check the scenarios of predicate_query.

Test

  • Execute the e2e-test-local with FEATURE_TYPE= observability and TEST_DEPTH=3.

Open points - Questions

  • Do you think there is a need to add any specific log when a query is not executed due to the false response of the predicate_query? I want to avoid bloating PG with this kind of log so this is why currently it is not present.
  • isCollectable is intended to be invoked only within the collect method, maybe I could define it as anonymous method inside the collect?
  • We can simplify a lot the logic in the isCollectable method if we assume strong and tight rules regarding the output of the predicate_query, what do you think about that?
  • Do I have to update the doc within this PR or, when approved, I will create another PR to document what is and it is working the predicate_query?

Thanks at disposal for any question 🙇 🚀

@gabrosys gabrosys requested a review from a team as a code owner May 9, 2024 13:57
@github-actions github-actions bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.21 release-1.22 release-1.23 labels May 9, 2024
Copy link
Contributor

github-actions bot commented May 9, 2024

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@@ -378,6 +387,66 @@ func (c QueryCollector) collect(conn *sql.DB, ch chan<- prometheus.Metric) error
return nil
}

// isCollectable checks if a query to collect metrics can be executed or not.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why we are accepting multiple types?
In my opinion, we should only evaluate booleans, and it should be the responsibility of the query to return the intended type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Armando. It's better to rely on booleans to do that.
True means it's collectible, false means it's not, and NULL means we don't know.
And if we don't know, perhaps it's better not to collect it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to summarize we want to support ONLY bools so that even numbers are not tolerated. Also, do we want to force the developers to deliver a query that returns ONLY one column? If so we can simplify the whole logic with a single QueryRow and a Scan that accepts a bool variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think evaluating a bool should be enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the comment and the method accordingly. Now assume that the type is a bool if not we return false and log a warning.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I disagree and would prefer to see it accept both a 0-row result and a null as a false result. I can see the argument both ways though, so what you say goes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated code handles:

  • bool checking the first row first column.
  • no rows, returning false.
  • null value, returning false.

@leonardoce
Copy link
Contributor

Thank you, @gabrosys, for this! Welcome to the CNPG community!

@leonardoce leonardoce changed the title feat(4499): implement predicate_query to validate which are the metrics to be collected feat: implement predicate_query to validate which are the metrics to be collected May 9, 2024
@ringerc
Copy link

ringerc commented May 9, 2024

Do you think there is a need to add any specific log when a query is not executed due to the false response of the predicate_query? I want to avoid bloating PG with this kind of log so this is why currently it is not present.

No, it definitely should not log, or not at above debug level. That's part of the point, so unwanted log spam can be avoided when the target server doesn't support a specific view, column, function etc.

In general there is IMO a need to improve how CNP's metrics scraper diagnostics work and make it easier to understand its behaviour + detect errors. That's why I wrote a little tool to run a one-shot scrape against a pre-configured postgres recently. I'll see if I can tidy that up for a community contribution. But logging isn't IMO the answer to that. I already have enough problems with it quietly logging a complaint and happily continuing with some confusing behaviour.

It might make sense to emit metrics as scrape output to report on skips, but they should probably be optional, and I don't personally think they're necessary as part of the basic feature. They could easily bloat scrape results and add unwanted scrape cardinality, requiring more filtering by scrape result consumers to discard the expected ones.

We can simplify a lot the logic in the isCollectable method if we assume strong and tight rules regarding the output of the predicate_query, what do you think about that?

The problem with that is that the user has to reliably assume the query has the expected results under all circumstances. Part of the point of this is to make the CNP scraper robust in the face of differing server versions, extension installs, vendor flavours, etc etc.

OTOH if the query does something unexpected, a clear error log might be better than just silently returning false and not running the query. IDK. Right now it's hard to properly test CNP scraper configs across a wide matrix of possible server configs, so I prefer to be error-tolerant, but I can see the argument for strictness too.

I'd personally prefer it to accept a 0-row result as false, but it's not hard to wrap that in a SELECT EXISTS (...) if you want to require a strict boolean result.

@mnencia
Copy link
Member

mnencia commented May 17, 2024

/test tl=4 l=local

Copy link
Contributor

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9131921366

@ringerc
Copy link

ringerc commented May 20, 2024

@mnencia What can we do to progress this PR?

@ringerc
Copy link

ringerc commented May 21, 2024

@gabrosys I see a test failure

[FAIL] Metrics [It] can gather metrics according with predicate query [observability]
/home/runner/work/cloudnative-pg/cloudnative-pg/tests/e2e/asserts_test.go:2926

https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9131921366/job/25112487661#step:9:5716

It failed in some of the other runs too, but passed in others.

Details https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9131921366/job/25112490237#step:9:2344

   Timeline >>
  Putting Tail on the operator log
  STEP: verifying the custom metrics ConfigMaps and Secrets exist @ 05/17/24 18:30:22.086
  STEP: setting up curl client pod @ 05/17/24 18:30:22.245
  STEP: having a predicate-query-metrics-e2e-9171 namespace @ 05/17/24 18:30:27.285
  STEP: creating a Cluster in the predicate-query-metrics-e2e-9171 namespace @ 05/17/24 18:30:27.291
  STEP: having a Cluster postgresql-metrics with each instance in status ready @ 05/17/24 18:30:27.335
  Cluster ready, took 1m12.298617472s
  STEP: ensuring metrics with positive predicate are collected @ 05/17/24 18:31:39.665
  STEP: checking metrics for pod: postgresql-metrics-1 @ 05/17/24 18:31:39.672
  [FAILED] in [It] - /home/runner/work/cloudnative-pg/cloudnative-pg/tests/e2e/asserts_test.go:2926 @ 05/17/24 18:31:40.062
  DUMPING tailed Operator Logs with error/warning (at most 10 lines ). Failed Spec: Metrics can gather metrics according with predicate query

  ================================================================================
  -- no error / warning logs --
  ================================================================================
  << Timeline

  [FAILED] Found no match for metric cnpg_pg_predicate_query_return_true_and_multiple_rows_fixed

  Priting rawMetricsOutput:
...
  cnpg_errors_total{errorUserQueries="pg_predicate_query_return_true_and_multiple_rows on db app: ERROR: subquery in FROM must have an alias (SQLSTATE 42601)"} 1
...
  # HELP cnpg_pg_predicate_query_return_true_and_multiple_columns_fixed Always 42, used to test predicate_query
  # TYPE cnpg_pg_predicate_query_return_true_and_multiple_columns_fixed gauge
  cnpg_pg_predicate_query_return_true_and_multiple_columns_fixed 42
  # HELP cnpg_pg_predicate_query_return_true_and_multiple_columns_multiple_rows_fixed Always 42, used to test predicate_query
  # TYPE cnpg_pg_predicate_query_return_true_and_multiple_columns_multiple_rows_fixed gauge
  cnpg_pg_predicate_query_return_true_and_multiple_columns_multiple_rows_fixed 42
  # HELP cnpg_pg_predicate_query_return_true_fixed Always 42, used to test predicate_query
  # TYPE cnpg_pg_predicate_query_return_true_fixed gauge
  cnpg_pg_predicate_query_return_true_fixed 42

so it looks like an issue with the spec for pg_predicate_query_return_true_and_multiple_rows

Copy link

@ringerc ringerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identified query bug causing test failure

Copy link
Collaborator

@jsilvela jsilvela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't isCollectable be substantially simplified?

pkg/management/postgres/metrics/collector.go Outdated Show resolved Hide resolved
pkg/management/postgres/metrics/collector.go Outdated Show resolved Hide resolved
@gabrosys
Copy link
Contributor Author

Update the PR as requested. Now we accept only predicate_query with at most one row and with a single bool column.
In detail, we handle:

  • true and false
  • no row as false
  • null as false

@jsilvela ready to be reviewed, thanks 🙇

Copy link
Collaborator

@jsilvela jsilvela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, Gabriele

@armru
Copy link
Member

armru commented May 27, 2024

/test limit=local

Copy link
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9255900078

@armru
Copy link
Member

armru commented May 28, 2024

/test limit=local

Copy link
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9271964382

@github-actions github-actions bot added the ok to merge 👌 This PR can be merged label May 28, 2024
@armru armru removed backport-requested ◀️ This pull request should be backported to all supported releases release-1.21 release-1.22 release-1.23 labels May 29, 2024
@armru
Copy link
Member

armru commented May 29, 2024

Hello @gabrosys could you please integrate the User defined metrics doc section with the new field and explain how to use it?

@armru armru self-requested a review May 29, 2024 09:07
@gabrosys
Copy link
Contributor Author

Update the doc with a new paragraph with a simple example and add the new prop to the prop list. Thank you all for your support 🙇

@armru armru merged commit c7d241c into cloudnative-pg:main May 29, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok to merge 👌 This PR can be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants