
feat(registry): add remove stale partition job #38165

Merged
merged 1 commit into master on May 17, 2024

Conversation

bnchrch
Contributor

@bnchrch bnchrch commented May 13, 2024

What

Add a job that lets us remove partition keys that no longer exist

Why

We have > 10,000 partitions, one for every metadata file ever uploaded. Likely only ~500 of those reference files that still exist.

Adding this job should let us clean out the noise.

Future

If it works, I'll add it to a nightly job.
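For reference, the core of the cleanup can be sketched in plain Python (names are illustrative, not the actual orchestrator code; `delete_fn` stands in for Dagster's per-key partition delete call):

```python
# Minimal sketch of the stale-partition cleanup: compare registered
# partition keys against the etags of metadata files that still exist,
# then delete any key with no matching etag.
def remove_stale_partitions(all_etag_partitions, all_fresh_etags, delete_fn, log=print):
    """Find registered partition keys whose etag no longer exists, then delete them."""
    stale_etags = [etag for etag in all_etag_partitions if etag not in all_fresh_etags]
    log(f"Stale etags found: {stale_etags}")
    for stale_etag in stale_etags:
        delete_fn(stale_etag)  # stands in for instance.delete_dynamic_partition
    return stale_etags
```

For example, calling it with partitions `["a", "b", "c"]` and fresh etags `["b"]` deletes `"a"` and `"c"`.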


vercel bot commented May 13, 2024

The latest updates on your projects.

1 Ignored Deployment: airbyte-docs (Ignored), updated May 17, 2024 7:23pm (UTC)

Contributor Author

bnchrch commented May 13, 2024

@bnchrch bnchrch marked this pull request as ready for review May 13, 2024 16:48
@bnchrch bnchrch requested a review from a team as a code owner May 13, 2024 16:48
Contributor

@alafanechere alafanechere left a comment

LGTM.
Just wondering how you're going to test it.
This change will materialize into a job you can manually trigger from Dagster?

Is there a reason we're not bumping the orchestrator package version?

Comment on lines 39 to 41
stale_etags = [etag for etag in all_etag_partitions if etag not in all_fresh_etags]
context.log.info(f"Stale etags found: {stale_etags}")
for stale_etag in stale_etags:
Contributor

nit: you can avoid one iteration if you filter and delete in a single iteration on all_etag_partitions

Contributor Author

Ah great call. Will change
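A sketch of that single-pass version (assumed names; `delete_fn` again stands in for the Dagster delete call):

```python
def delete_stale_in_one_pass(all_etag_partitions, all_fresh_etags, delete_fn):
    """Filter and delete in one loop instead of building the stale list first."""
    fresh = set(all_fresh_etags)  # set membership keeps the single pass O(n)
    deleted = []
    for etag in all_etag_partitions:
        if etag not in fresh:
            delete_fn(etag)
            deleted.append(etag)
    return deleted
```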


all_fresh_etags = [blob.etag for blob in all_metadata_file_blobs]

all_etag_partitions = context.instance.get_dynamic_partitions(partition_name)
Contributor

context.instance.get_dynamic_partitions calls the Dagster backend we want to clean up, right?

Contributor Author

Correct! That line asks Dagster "Hey what metadata files do you have partitions for?"
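To illustrate the exchange above, here is a stubbed sketch of the two sources being compared (`FakeBlob` and `FakeInstance` are hypothetical stand-ins for the GCS blobs and the Dagster instance):

```python
class FakeBlob:
    """Hypothetical stand-in for a GCS blob carrying an etag."""
    def __init__(self, etag):
        self.etag = etag

class FakeInstance:
    """Hypothetical stand-in for Dagster's context.instance."""
    def __init__(self, keys):
        self._keys = keys

    def get_dynamic_partitions(self, partition_name):
        # Answers: "Hey, what metadata files do you have partitions for?"
        return list(self._keys)

# Fresh etags come from the files that currently exist in the bucket;
# registered partitions come from the Dagster backend being cleaned up.
all_metadata_file_blobs = [FakeBlob("etag-1"), FakeBlob("etag-2")]
instance = FakeInstance(["etag-1", "etag-2", "etag-3"])  # etag-3 is stale

all_fresh_etags = [blob.etag for blob in all_metadata_file_blobs]
all_etag_partitions = instance.get_dynamic_partitions("metadata_partitions")
```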

@@ -24,6 +24,35 @@
)


@op(required_resource_keys={"all_metadata_file_blobs"})
def remove_stale_metadata_partitions_op(context):
Contributor

Any reason to not perform this clean up at partition insertion time?

Contributor Author

Hmm, I wanted to introduce this logic in a separate area for now, simply because I'm scared of it since it's destructive.

If we accidentally delete ALL partitions, it's not the end of the world; it just means we may miss a failed metadata file and have to reingest all existing metadata files to know we're ok.

Also, Dagster doesn't let you bulk delete partition keys, and I'm worried about the time it would take to iterate over 10,000 keys to delete them. It may lock up our sensor if deployed inside the add_partition sensor today.

So I wanted to keep it separate for now.

If things are looking good, I want to look at merging it back in.

Does that make sense?
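Since Dagster only deletes dynamic partition keys one at a time, one hypothetical way to keep a long delete loop from blocking (a sketch only; the function name and chunk size are assumptions, not the PR's code):

```python
def delete_in_chunks(stale_etags, delete_fn, chunk_size=500):
    """Delete stale keys in bounded batches, yielding after each batch
    so the caller can pause, log progress, or bail out between chunks."""
    for start in range(0, len(stale_etags), chunk_size):
        chunk = stale_etags[start:start + chunk_size]
        for etag in chunk:
            delete_fn(etag)
        yield len(chunk)
```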

Contributor

💯 👍 - step by step :)

@octavia-squidington-iv octavia-squidington-iv requested a review from a team May 13, 2024 17:40
@bnchrch
Contributor Author

bnchrch commented May 13, 2024

LGTM. Just wondering how you're going to test it.

I've added unit tests and tested it on my local Dagster instance.

The next step is to trigger the job on production 😬

This change will materialize into a job you can manually trigger from Dagster?

Exactly!

Is there a reason we're not bumping the orchestrator package version?

Oh, that was a miss! I'll update.

@bnchrch bnchrch force-pushed the 05-13-feat_registry_add_remove_stale_partition_job branch from 2a8de52 to 92c0f72 Compare May 13, 2024 17:52
Contributor

@erohmensing erohmensing left a comment

After clearing up which ones are getting cleaned up, this makes sense 👍🏻 Interested in how many we're left with; I feel like it should be about half.

Comment on lines +29 to +31
"""
This op is responsible for polling for new metadata files and adding their etag to the dynamic partition.
"""
Contributor

This should be updated!
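A hedged suggestion for the corrected docstring (wording assumed; the `@op` decorator is omitted so the sketch stands alone):

```python
def remove_stale_metadata_partitions_op(context):
    """
    This op is responsible for removing dynamic partition keys whose
    etags no longer correspond to an existing metadata file.
    """
```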

Comment on lines +32 to +35
all_metadata_file_blobs = context.resources.all_metadata_file_blobs
partition_name = registry_entry.metadata_partitions_def.name

all_fresh_etags = [blob.etag for blob in all_metadata_file_blobs]
Contributor

Would set subtraction be more appropriate here?
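The reviewer's set-subtraction suggestion could look like this (sketch with assumed inputs; note that sets drop ordering and duplicates, which is fine for etags):

```python
# Assumed example inputs mirroring the snippet above.
all_fresh_etags = ["etag-1", "etag-2"]
all_etag_partitions = ["etag-1", "etag-2", "etag-3", "etag-4"]

# Set subtraction: membership checks become O(1) and the intent is explicit.
stale_etags = set(all_etag_partitions) - set(all_fresh_etags)
```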

@bnchrch bnchrch force-pushed the 05-10-update_registry_bump_to_dagster-cloud_1.7.5_and_use_serverless branch from c433512 to ca2fbea Compare May 16, 2024 20:27
@bnchrch bnchrch force-pushed the 05-13-feat_registry_add_remove_stale_partition_job branch from 92c0f72 to c99de1f Compare May 16, 2024 20:27
@bnchrch bnchrch force-pushed the 05-10-update_registry_bump_to_dagster-cloud_1.7.5_and_use_serverless branch from ca2fbea to bae39f4 Compare May 17, 2024 18:03
@bnchrch bnchrch force-pushed the 05-13-feat_registry_add_remove_stale_partition_job branch from c99de1f to 3ce2aeb Compare May 17, 2024 18:04
Copy link
Contributor Author

bnchrch commented May 17, 2024

Merge activity

  • May 17, 3:20 PM EDT: @bnchrch started a stack merge that includes this pull request via Graphite.
  • May 17, 3:23 PM EDT: Graphite rebased this pull request as part of a merge.
  • May 17, 3:24 PM EDT: @bnchrch merged this pull request with Graphite.

@bnchrch bnchrch changed the base branch from 05-10-update_registry_bump_to_dagster-cloud_1.7.5_and_use_serverless to graphite-base/38165 May 17, 2024 19:21
@bnchrch bnchrch changed the base branch from graphite-base/38165 to master May 17, 2024 19:21
@bnchrch bnchrch force-pushed the 05-13-feat_registry_add_remove_stale_partition_job branch from 3ce2aeb to 33c257e Compare May 17, 2024 19:22
@bnchrch bnchrch merged commit caec5f2 into master May 17, 2024
23 of 25 checks passed
@bnchrch bnchrch deleted the 05-13-feat_registry_add_remove_stale_partition_job branch May 17, 2024 19:24

3 participants