Introduce CRD for Iceberg table maintenance #484

Open
6 tasks
sbernauer opened this issue Oct 5, 2023 · 5 comments

Comments

@sbernauer
Member

As a Trino Iceberg user I want to define a CR that allows me to regularly run maintenance actions on my tables.

  • Come up with a CRD
  • Figure out how to authenticate against the Trino cluster, e.g. always create a k8s Secret for a service user, add it to the authentication chain using password file authentication, and mount it into the k8s CronJob

Should

  • Allow running against a whole schema, iterating through its tables
  • Emit Prometheus metrics so we can alert on failures and have a Dashboard
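
The two "Should" items could fit together roughly as in the following sketch. It is only an illustration under assumptions: `run_for_schema`, `render_metrics` and the metric name `iceberg_maintenance_tables_processed` are hypothetical names, the list of tables is assumed to have been fetched from Trino already, and how the rendered text reaches Prometheus is left open.

```python
from typing import Callable, Dict, Iterable


def run_for_schema(tables: Iterable[str], run_action: Callable[[str], None]) -> Dict[str, int]:
    """Run one maintenance action per table and count the outcomes.

    A failure on one table is recorded and does not stop the loop,
    so the remaining tables in the schema are still maintained.
    """
    counts = {"success": 0, "failure": 0}
    for table in tables:
        try:
            run_action(table)
            counts["success"] += 1
        except Exception:
            counts["failure"] += 1
    return counts


def render_metrics(catalog: str, schema: str, counts: Dict[str, int]) -> str:
    """Render the per-run counts in the Prometheus text exposition format."""
    name = "iceberg_maintenance_tables_processed"  # hypothetical metric name
    lines = [f"# TYPE {name} gauge"]
    for result, value in sorted(counts.items()):
        labels = f'catalog="{catalog}",schema="{schema}",result="{result}"'
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"
```

An alert on failures could then watch the series with `result="failure"`.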

Could

  • Prometheus alerts
  • Grafana dashboard with e.g. files compacted, bytes and rows read/written

One possible solution would be to create a k8s CronJob for every maintenance CR.
The CRD could look something like:

spec:
  target:
    catalog: lakehouse
    schema: default
    table: my_table # Optional
  schedule:
    interval: 24h # using new Duration struct
    # OR
    cronExpression: XXX
  actions:
    - name: optimize
      fileSizeThreshold: 100MB # optional, otherwise let Trino use its internal default
    - name: expire_snapshots
      retentionThreshold: 7d # optional, otherwise let Trino use its internal default
    - name: remove_orphan_files
      # Document: The value for retention_threshold must be higher than or equal to iceberg.remove_orphan_files.min-retention in the catalog, otherwise the procedure fails with a message similar to: Retention specified (1.00d) is shorter than the minimum retention configured in the system (7.00d)
      retentionThreshold: 7d # optional, otherwise let Trino use its internal default
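
To make the mapping from such a spec to Trino concrete, here is a minimal sketch of how the job could turn the `target` and `actions` into `ALTER TABLE ... EXECUTE` statements of the Iceberg connector. The function names and the field-to-parameter table are assumptions for illustration, not an agreed design; the sketch expects `table` to be set and passes the threshold values through unchanged.

```python
from typing import Any, Dict, List

# Maps the CRD field names from the sketch above to the parameter names
# of the corresponding Trino Iceberg table procedures.
_PARAMS = {
    "optimize": {"fileSizeThreshold": "file_size_threshold"},
    "expire_snapshots": {"retentionThreshold": "retention_threshold"},
    "remove_orphan_files": {"retentionThreshold": "retention_threshold"},
}


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_statements(target: Dict[str, str], actions: List[Dict[str, Any]]) -> List[str]:
    """Build one ALTER TABLE ... EXECUTE statement per action, in order."""
    table = ".".join(quote_ident(target[k]) for k in ("catalog", "schema", "table"))
    statements = []
    for action in actions:
        name = action["name"]
        if name not in _PARAMS:
            raise ValueError(f"unknown action: {name}")
        # Only pass parameters the user set, so Trino falls back to its defaults.
        args = [
            f"{sql_name} => {quote_literal(str(action[crd_name]))}"
            for crd_name, sql_name in _PARAMS[name].items()
            if crd_name in action
        ]
        call = f"{name}({', '.join(args)})" if args else name
        statements.append(f"ALTER TABLE {table} EXECUTE {call}")
    return statements
```

For the example spec, the first statement would be `ALTER TABLE "lakehouse"."default"."my_table" EXECUTE optimize(file_size_threshold => '100MB')`.
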
@soenkeliebau
Member

At the risk of killing this issue with scope creep: a while back we discussed having TrinoTable CRDs that the operator would read and then use to create the tables in Trino.

If that happens, I think that object should also contain the information described in this issue, rather than putting it into a separate CRD?

@soenkeliebau
Member

Or, well... thinking about it some more... we might actually want to apply the same maintenance object to many tables...

@sbernauer
Member Author

Legit point, we should consider this when designing the CRD. I think it should be an ADR in any case.

@therealslimjp

Is this still in active development & is there already a release date determined?

@sbernauer
Member Author

Hi @therealslimjp, sadly we have not started any work on this yet, and I'm not aware of any ETA.
