Support parquet input in local stats #783

Open
wants to merge 1 commit into
base: develop

@DevinWu DevinWu commented May 13, 2024

Issue description
When I tested local stats with Parquet data as the raw input, it failed because the Parquet format was not supported. This PR adds that support.

StatsModelProcessor calls AkkaStatsWorker to compute stats locally, which in turn calls ShifuFileUtils.getDataScanners to build Java Scanner objects over the user's raw input data. That path did not support Parquet as an input format, which broke local testing.

How Parquet data is loaded into the scanner:

  1. Read a Parquet Group from the Parquet file.
  2. Convert the Parquet Group to a String in CSV format, with columns separated by the Shifu output data delimiter.
  3. Wrap the String in a ByteArrayInputStream to serve as the input stream.

These three steps run as the stream is read, so little data is held in memory at any time.
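
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `DelimitedRowStream` and its methods are hypothetical names, and each record is modeled as a plain `List<String>` of column values instead of a real Parquet Group, so the sketch needs only the JDK. The delimiter is passed in as a parameter; the sketch does not look up the Shifu output data delimiter.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.Scanner;

/** Hypothetical sketch of the streaming conversion; not the PR's implementation. */
final class DelimitedRowStream {

    private DelimitedRowStream() {}

    /** Step 2: join one record's column values with the delimiter and end the line. */
    static String toLine(List<String> columns, String delimiter) {
        return String.join(delimiter, columns) + "\n";
    }

    /**
     * Steps 1-3 combined. Records are pulled from the iterator on demand
     * (standing in for reading Parquet Groups), and each one becomes its own
     * small ByteArrayInputStream. SequenceInputStream asks for the next chunk
     * only after the current one is exhausted, so the whole file is never
     * materialized in memory at once.
     */
    static InputStream asInputStream(Iterator<List<String>> records, String delimiter) {
        Enumeration<InputStream> chunks = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return records.hasNext();
            }

            @Override
            public InputStream nextElement() {
                byte[] bytes = toLine(records.next(), delimiter)
                        .getBytes(StandardCharsets.UTF_8);
                return new ByteArrayInputStream(bytes);
            }
        };
        return new SequenceInputStream(chunks);
    }

    /** Wraps the lazy stream in a java.util.Scanner for line-by-line reading. */
    static Scanner asScanner(Iterator<List<String>> records, String delimiter) {
        return new Scanner(asInputStream(records, delimiter), StandardCharsets.UTF_8.name());
    }
}
```

Calling `DelimitedRowStream.asScanner(rows.iterator(), "|")` on two records `["1", "a"]` and `["2", "b"]` yields a Scanner whose `nextLine()` calls return one delimited line per record. The `"|"` delimiter here is only an example value. Note that Scanner buffers its reads, so a few records may be pulled ahead of the line currently being consumed, but the amount held in memory stays bounded.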

Tested with the stats test in ShifuCLITest; it runs successfully.
