[HUDI-7758] Only consider files in Hudi partitions when initializing MDT #11219
base: master
Conversation
@@ -2000,16 +2000,15 @@ public DirectoryInfo(String relativePath, List<StoragePathInfo> pathInfos, Strin
// Pre-allocate with the maximum length possible
filenameToSizeMap = new HashMap<>(pathInfos.size());

// Presence of partition meta file implies this is a HUDI partition
isHoodiePartition = pathInfos.stream().anyMatch(status -> status.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX));
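In isolation, the check added by this diff amounts to scanning a directory listing for a file whose name starts with the partition meta file prefix. A minimal sketch of that idea (the class name, prefix constant value, and use of plain `String` names instead of `StoragePathInfo` are illustrative simplifications, not Hudi's actual API):

```java
import java.util.List;

public class PartitionCheck {
    // Illustrative stand-in for HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX
    static final String PARTITION_METAFILE_PREFIX = ".hoodie_partition_metadata";

    // A directory is treated as a Hudi partition iff its listing contains a partition meta file.
    static boolean isHoodiePartition(List<String> fileNames) {
        return fileNames.stream()
            .anyMatch(name -> name.startsWith(PARTITION_METAFILE_PREFIX));
    }

    public static void main(String[] args) {
        System.out.println(isHoodiePartition(List.of(".hoodie_partition_metadata", "f1.parquet"))); // true
        System.out.println(isHoodiePartition(List.of("_delta_log", "stray.parquet")));              // false
    }
}
```

With this gate in place, files in directories that lack the marker (such as `_delta_log`) are never considered for the metadata table at all.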
Can we fix the FSUtils.isDataFile instead? The check for log files uses a regex pattern match; we should fix the base file check to be in line with the log file check.
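The suggestion here could look roughly like the following: validate the whole base-file name against a pattern instead of only its extension. The regex below is an illustrative approximation of the `<fileId>_<writeToken>_<instantTime>.<ext>` naming convention, not the actual pattern used by `FSUtils`:

```java
import java.util.regex.Pattern;

public class DataFileCheck {
    // Illustrative approximation of a Hudi base-file name,
    // <fileId>_<writeToken>_<instantTime>.<ext>; the real pattern in FSUtils may differ.
    static final Pattern BASE_FILE_PATTERN = Pattern.compile("^[\\w-]+_[\\w-]+_\\d+\\.\\w+$");

    // Match on the full name pattern rather than the extension alone, so stray
    // parquet files (e.g. Delta checkpoints under _delta_log) are not picked up.
    static boolean looksLikeBaseFile(String fileName) {
        return BASE_FILE_PATTERN.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeBaseFile("abc-123_0-1-2_20240501000000000.parquet")); // true
        System.out.println(looksLikeBaseFile("00000000000000000010.checkpoint.parquet")); // false
    }
}
```

The trade-off discussed later in the thread applies: a stricter name check silently skips any parquet file that does not follow Hudi's convention, rather than failing loudly.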
And can you clarify what kind of unexpected parquet files would cause an issue here?
If you expose your Hudi table as a Delta Lake table with XTable, you will have parquet files in the _delta_log directory, and this will lead to a parsing issue.
This is the proper way to fix the issue, in my opinion. The intention of this code is to only add files that are in directories with a partition marker file. I'm worried that changing the isDataFile may lead to some unintended side effects.
I'm worried that changing the isDataFile may lead to some unintended side effects
Should be okay if all the CI tests pass. Actually, the isDataFile check for base files does not make sense, because the invoker always needs to treat the directory as a Hudi partition dir; let's fix it.
Another concern is that iterating through all the files under one partition is inefficient.
If you expose your Hudi table as a Delta Lake table with XTable, you will have parquet files in the _delta_log and this will lead to a parsing issue.
A complete dataset includes files that both conform to and do not conform to the Hudi filename format. If the metadata table (MDT) only includes files that conform to Hudi's format, then some file data will be missing. It is not clear whether XTable has its own solution for maintaining the MDT. I think such handling should be maintained on the XTable side, not on the Hudi side. On the Hudi side, I think the MDT construction should fail and throw an exception, prompting the user to handle such anomalous files. What about you?
I think that the Hudi bootstrap should only consider files that are managed by Hudi. Letting people do things with their Hudi tables is important in my opinion. This can include adding directories under a base path that are not managed by Hudi to store some metadata.
The issue here is that if you have a Hudi table without the MDT and then turn it on, and you happen to have parquet files that are not managed by Hudi, you will get an error even if those files are not in a data partition directory.
Thanks for the contribution, I have left some comments.
For a partitioned table, if the parent path is already a Hudi partition, is it still necessary to validate the partition metadata files of the subdirectories? Can we use a short-circuit condition?
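The short-circuit being asked for could be sketched as below, where a flag passed in by the caller skips the listing scan entirely when the answer is already known (the method signature and names here are illustrative, not the actual change):

```java
import java.util.List;

public class ShortCircuitCheck {
    // Illustrative stand-in for HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX
    static final String PARTITION_METAFILE_PREFIX = ".hoodie_partition_metadata";

    // If the caller already knows the parent directory is a Hudi partition,
    // `||` short-circuits and the listing stream is never evaluated.
    static boolean isHoodiePartition(boolean parentIsPartition, List<String> fileNames) {
        return parentIsPartition
            || fileNames.stream().anyMatch(n -> n.startsWith(PARTITION_METAFILE_PREFIX));
    }

    public static void main(String[] args) {
        System.out.println(isHoodiePartition(true, List.of()));            // true, no scan needed
        System.out.println(isHoodiePartition(false, List.of("a.parquet"))); // false
    }
}
```

This addresses the efficiency concern raised above: the per-file scan only runs for directories whose status is genuinely unknown.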
Yes, makes sense. I will add this
Probably true, but keeping the files scoped within Hudi itself still makes sense for a Hudi table. I guess it is hard to fix on the XTable side because the metadata table is kind of an inner component.
Change Logs
Impact
Risk level (write none, low medium or high below)
None
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
Contributor's checklist