Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7758] Only consider files in Hudi partitions when initializing MDT #11219

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

the-other-tim-brown
Copy link
Contributor

Change Logs

  • Only considers files in Hudi partitions for MDT to avoid any parsing issues

Impact

  • Avoids runtime exceptions trying to parse information from file paths

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label May 14, 2024
@github-actions github-actions bot added size:S PR with lines of changes in (10, 100] and removed size:XS PR with lines of changes in <= 10 labels May 14, 2024
@@ -2000,16 +2000,15 @@ public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, Strin
// Pre-allocate with the maximum length possible
filenameToSizeMap = new HashMap<>(pathInfos.size());

// Presence of partition meta file implies this is a HUDI partition
isHoodiePartition = pathInfos.stream().anyMatch(status -> status.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we fix the FSUtils.isDataFile instead? The check for log file uses the regex pattern match, we should fix the base file check to be in line with the log file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And can you clarify what kind of unexpected parquets would cause issue here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you expose your Hudi table as a Delta Lake table with XTable, you will have parquet files in the _delta_log and this will lead to a parsing issue.

This is the proper way to fix the issue in my opinion. The intention of this code is to only add files that are in directories with a partition marker file. I'm worried that changing the isDataFile may lead to some unintended side effects

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that changing the isDataFile may lead to some unintended side effects

Should be okay if all the CI tests pass. Actually the isDataFile for base file does not make sense because the invoker always needs to consider the directory is a Hudi partition dir, let's fix it.

Another concern is iterate through all the files under one partition is inefficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you expose your Hudi table as a Delta Lake table with XTable, you will have parquet files in the _delta_log and this will lead to a parsing issue.

A complete dataset includes files that both conform to and do not conform to the Hudi filename format. If the metadata table (MDT) only includes files that conform to Hudi's format, then some file data will be missing. It is not clear whether XTable has its own solution for maintaining the MDT. I think such handling should be maintained on the XTable side, not on the Hudi side. On the Hudi side, I think the MDT construction should fail and throw an exception, prompting the user to handle such anomalous files. what about you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the Hudi bootstrap should only consider files that are managed by Hudi. Letting people do things with their Hudi tables is important in my opinion. This can include adding directories under a base path that are not managed by Hudi to store some metadata.

The issue here is that if you had a Hudi table without MDT and then turn it on and you happen to have any parquet files that are not managed by Hudi then you will get an error even if those files are not in a data partition directory.

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, I have left some comments.

@@ -2000,16 +2000,15 @@ public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, Strin
// Pre-allocate with the maximum length possible
filenameToSizeMap = new HashMap<>(pathInfos.size());

// Presence of partition meta file implies this is a HUDI partition
isHoodiePartition = pathInfos.stream().anyMatch(status -> status.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for partioned table, if the parent path is already a Hudi partition, is it still necessary to validate the partition metadata files of the subdirectories, can we use short-circuit condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense. I will add this

@@ -2000,16 +2000,15 @@ public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, Strin
// Pre-allocate with the maximum length possible
filenameToSizeMap = new HashMap<>(pathInfos.size());

// Presence of partition meta file implies this is a HUDI partition
isHoodiePartition = pathInfos.stream().anyMatch(status -> status.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you expose your Hudi table as a Delta Lake table with XTable, you will have parquet files in the _delta_log and this will lead to a parsing issue.

A complete dataset includes files that both conform to and do not conform to the Hudi filename format. If the metadata table (MDT) only includes files that conform to Hudi's format, then some file data will be missing. It is not clear whether XTable has its own solution for maintaining the MDT. I think such handling should be maintained on the XTable side, not on the Hudi side. On the Hudi side, I think the MDT construction should fail and throw an exception, prompting the user to handle such anomalous files. what about you?

@danny0405
Copy link
Contributor

@KnightChess

I think such handling should be maintained on the XTable side, not on the Hudi side.

Probably true, but keeping the files scoped in Hudi itself still make sense for Hudi table, I guess it is hard to fix on XTable side because the metadata table is kind of an inner component.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-0.15.0 size:S PR with lines of changes in (10, 100]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants