[SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter #11204

Open
ziudu opened this issue May 13, 2024 · 3 comments
Labels
hive Issues related to hive performance

Comments

@ziudu

ziudu commented May 13, 2024

According to parisni in [HUDI-6150] Support bucketing for each hive client (https://github.com//pull/8657):

"So I assume hudi way of doing (which is not compliant with both hive and spark) cannot be used to improve query engines queries such join and filter. Then this leads all of below are wrong:

the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
this current PR
the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index"

Do you have any update on this?

danny0405 added the hive (Issues related to hive performance) label on May 14, 2024
@ziudu
Author

ziudu commented May 14, 2024

Hi Danny0405,

I think supporting bucket-optimized Spark sort-merge joins between two Hudi tables is an important feature.

Currently, if we join two Hudi tables, Spark cannot use the bucket information from the bucket index, so a shuffle is always needed. As explained in PR 8657, the hashing, file naming, file numbering, and file sorting are all different.
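
To illustrate the hashing part of that mismatch, here is a minimal Python sketch. It is a toy model under stated assumptions: the two hash functions (CRC32 and Adler-32) are arbitrary stand-ins, not the actual functions used by Hudi, Hive, or Spark, and names such as `bucket_scheme_a` and `bucket_local_join` are hypothetical. It only shows why joining bucket i with bucket i is safe when both sides share one bucketing scheme, and loses matches when they do not, which is why an engine has to fall back to a shuffle.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count, identical on both sides


def bucket_scheme_a(key: str) -> int:
    """Hypothetical writer-side scheme: CRC32 of the key modulo the bucket count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def bucket_scheme_b(key: str) -> int:
    """Hypothetical engine-side scheme: Adler-32 of the key modulo the bucket count."""
    return zlib.adler32(key.encode("utf-8")) % NUM_BUCKETS


def bucketize(keys, bucket_fn):
    """Group keys into NUM_BUCKETS sets using the given bucketing function."""
    buckets = {b: set() for b in range(NUM_BUCKETS)}
    for key in keys:
        buckets[bucket_fn(key)].add(key)
    return buckets


def bucket_local_join(left, right):
    """Join bucket i of `left` only with bucket i of `right` (no shuffle)."""
    matched = set()
    for b in range(NUM_BUCKETS):
        matched |= left[b] & right[b]
    return matched


keys = [f"user-{i}" for i in range(100)]

# Both tables bucketed with the same scheme: every key finds its match.
same = bucket_local_join(bucketize(keys, bucket_scheme_a),
                         bucketize(keys, bucket_scheme_a))

# Tables bucketed with different schemes: matching keys land in different
# buckets, so a bucket-local join silently misses them.
mixed = bucket_local_join(bucketize(keys, bucket_scheme_a),
                          bucketize(keys, bucket_scheme_b))

print(f"same scheme:  {len(same)} of {len(keys)} keys matched")
print(f"mixed scheme: {len(mixed)} of {len(keys)} keys matched")
```

The sketch covers hashing only. The differences in file naming, file numbering, and file sorting mentioned above would have to line up as well before an engine could map a file to a bucket and skip the sort step.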

Unfortunately, according to https://issues.apache.org/jira/browse/SPARK-19256, Spark's bucketing is not yet compatible with Hive's. So if we have to choose one between spark and hive, I think spark might be of higher priority.

@danny0405
Contributor

"So if we have to choose one between spark and hive, I think spark might be of higher priority"

I agree. Do you have the energy to complete that suspended PR?

@ziudu
Author

ziudu commented May 14, 2024

I'm a newbie. It took me a while to understand why the bucket join does not work.

Projects
Status: Awaiting Triage