[FEA] LoRe framework - Support operator specific dump tool for performance issue local reproduce #10843

Open
winningsix opened this issue May 20, 2024 · 0 comments

Is your feature request related to a problem? Please describe.
Some performance issues are hard to reproduce locally with a simple approach. As a developer/user, it would be great to have a tool with the following capabilities:

  • “At will” dumping of data at operator granularity. For either a performance issue or a compatibility issue (a.k.a. semantic consistency), the problematic operator may appear at a relatively late stage. An upstream filter may behave quite differently in selectivity, or the plan may contain a join across multiple tables whose join keys are not easy to reproduce, so it is really hard to reproduce the original issue locally. “At will” means we want to reproduce the issue at a specific operator without rerunning the entire plan fragment or SQL query.

  • “Replay” the execution locally within a single Spark application. A “replay” means the exact operator, together with the dumped data, can be run directly on the developer's side.

  • Being able to desensitize the dumped data if needed. Two running modes are provided: masked mode and plain mode. In plain mode, data is not masked and is used directly to generate the needed NCU/Nsys output for that data. In masked mode, data is translated into masked data in an irreversible way.

  • Being able to reproduce both diff and performance issues.

  • Provide a reasonable, controllable way to dump the data. For diff issues, it should allow dumping a specified number of rows. For performance issues, it should allow dumping any execution batch that takes longer than a preconfigured threshold. Additionally, it should provide a task limit to avoid dumping too much data.

  • Async dump mode for a problematic operator: don't wait for the job to complete before triggering the file/class dump.
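
The dump controls and masked mode listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `DumpPolicy`, `should_dump_batch`, `rows_to_dump`, and `mask_value` are hypothetical names, not existing plugin configs or APIs, and a salted SHA-256 digest is just one possible way to get a one-way mask.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DumpPolicy:
    """Hypothetical knobs; the names are illustrative, not real plugin configs."""
    time_threshold_s: float = 2.0  # performance issues: dump batches slower than this
    row_limit: int = 1000          # diff issues: dump at most this many rows
    task_limit: int = 4            # cap on how many tasks are allowed to dump data


def should_dump_batch(policy: DumpPolicy, batch_seconds: float, tasks_dumped: int) -> bool:
    """Dump only slow batches, and only while still under the task limit."""
    return batch_seconds > policy.time_threshold_s and tasks_dumped < policy.task_limit


def rows_to_dump(policy: DumpPolicy, rows: list) -> list:
    """For diff issues, keep only the first `row_limit` rows."""
    return rows[: policy.row_limit]


def mask_value(value, salt: bytes) -> str:
    """Masked mode (assumed scheme): salted SHA-256 digest of the value's string form."""
    return hashlib.sha256(salt + str(value).encode("utf-8")).hexdigest()
```

One property of this masking choice is that equal inputs map to equal masked outputs, so join keys and group cardinalities would still line up after masking. Whether that is strong enough desensitization for low-entropy columns would need to be decided separately.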

Describe the solution you'd like
The workflow for using the LOcal REplay (LoRe) framework to replay a performance issue is as follows:

  • Dump: The developer decides which operator to dump and the operator-time threshold of interest. Using the following capture as an example, if we set the dump filter with a threshold of 2 seconds, it will dump the related columnar batches as well as a serialized GpuProject class into binary files. A task-limit configuration can specify how many data files are dumped.

  • Replay: By restoring from the dumped files, the problematic operator can easily be run locally against the specific columnar batches.
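
The dump/replay round trip can be illustrated with a small toy sketch. It makes several assumptions: JSON files and plain Python lists stand in for the binary files and columnar batches, `project` is a toy stand-in for an operator such as GpuProject, and the file names and `OPERATORS` registry are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def project(batch, column, factor):
    """Toy stand-in for an operator such as GpuProject: scale one column."""
    return [{**row, column: row[column] * factor} for row in batch]


OPERATORS = {"project": project}  # hypothetical registry of replayable operators


def dump(dump_dir: Path, op_name: str, op_args: dict, batch: list) -> None:
    """Write the operator description and its input batch to separate files."""
    meta = {"name": op_name, "args": op_args}
    (dump_dir / "operator.json").write_text(json.dumps(meta))
    (dump_dir / "batch-0.json").write_text(json.dumps(batch))


def replay(dump_dir: Path) -> list:
    """Restore the operator and batch from disk and run only that operator."""
    meta = json.loads((dump_dir / "operator.json").read_text())
    batch = json.loads((dump_dir / "batch-0.json").read_text())
    return OPERATORS[meta["name"]](batch, **meta["args"])


def round_trip_example() -> list:
    """Dump one batch for the toy operator, then replay it from the files."""
    with tempfile.TemporaryDirectory() as tmp:
        dump_dir = Path(tmp)
        dump(dump_dir, "project", {"column": "price", "factor": 2},
             [{"id": 1, "price": 10}])
        return replay(dump_dir)
```

The point of the sketch is that `replay` needs nothing but the dump directory: neither the upstream plan nor the original data sources are involved.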

The framework comprises several sub-tasks:

@winningsix winningsix added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 20, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 21, 2024
@winningsix winningsix changed the title [FEA] Support operator specific dump tool for performance issue local reproduce [FEA] LoRe framework - Support operator specific dump tool for performance issue local reproduce May 22, 2024