Is your feature request related to a problem? Please describe.
There are some challenges in reproducing certain performance issues locally with a simple approach. As a developer/user, it would be great to have a tool that offers:
“At will” dumping of data at operator granularity. Whether the problem is a performance issue or a compatibility issue (a.k.a. semantic consistency), the problematic operator may come at a relatively late stage. A prior filter may behave quite differently in selectivity, or a join across multiple tables may have join keys that are not easy to reproduce. Thus, it is really hard to reproduce the original issue locally. “At will” means we want to reproduce the issue at a specific operator without rerunning the entire plan fragment or SQL.
“Replay” of the execution locally with a single Spark application. A “replay” means the exact operator, with the dumped data, can run directly on the developer's side.
The ability to desensitize the dumped data if needed. Two running modes are provided: masked mode vs. plain mode. In plain mode, data is not masked and is used directly to generate the needed NCU/Nsys profiles for that data. In masked mode, data is translated into masked data in an irreversible way.
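One way to get irreversible masking while still preserving the properties that matter for replay (equal values stay equal, so join keys and filter selectivity survive) is a salted one-way hash. Below is a minimal Python sketch of that idea; the real tool targets Scala/Spark columnar data, and `mask_value` is an invented name for illustration, not part of the proposal.

```python
import hashlib
import os

def mask_value(value: bytes, salt: bytes) -> str:
    """Irreversibly mask a value with a salted one-way hash.

    Equal inputs map to equal masked outputs, so join keys and
    selectivity are preserved in the dump, but the original data
    cannot be recovered from the masked form.
    """
    return hashlib.sha256(salt + value).hexdigest()

# The salt would be generated once per dump session and discarded,
# so masked values cannot be brute-forced back to the originals.
salt = os.urandom(16)
masked = [mask_value(v.encode(), salt) for v in ["alice", "bob", "alice"]]
# equal plaintext values remain equal after masking
assert masked[0] == masked[2] and masked[0] != masked[1]
```

Discarding the salt after the dump is what makes the translation irreversible in practice.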
The ability to reproduce both diff and performance issues.
A reasonable way to dump data in a controllable fashion. For diff issues, it should allow dumping a dedicated number of rows. For performance issues, it should allow dumping execution batches that last longer than a preconfigured threshold. Additionally, it should provide a task limit to avoid dumping too much data.
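The three controls above (row limit, time threshold, task limit) combine into a simple per-batch decision. Here is a minimal Python sketch of that policy; `DumpPolicy` and its field names are invented for illustration and do not reflect the actual configuration keys.

```python
from dataclasses import dataclass

@dataclass
class DumpPolicy:
    """Illustrative dump-control policy; names are hypothetical."""
    row_limit: int          # diff issues: cap on total dumped rows
    time_threshold_ms: int  # perf issues: only dump slow batches
    task_limit: int         # cap on how many tasks may dump at all

    def should_dump(self, batch_ms: int, rows_dumped: int,
                    tasks_dumping: int) -> bool:
        # Hard caps first: never exceed the task or row budget.
        if tasks_dumping >= self.task_limit:
            return False
        if rows_dumped >= self.row_limit:
            return False
        # Only batches slower than the threshold are worth dumping.
        return batch_ms >= self.time_threshold_ms

policy = DumpPolicy(row_limit=10_000, time_threshold_ms=2_000, task_limit=4)
assert policy.should_dump(batch_ms=3_500, rows_dumped=0, tasks_dumping=1)
assert not policy.should_dump(batch_ms=500, rows_dumped=0, tasks_dumping=1)
assert not policy.should_dump(batch_ms=3_500, rows_dumped=0, tasks_dumping=4)
```

For diff issues the time threshold would be set to zero so the row limit alone governs the dump.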
An async dump mode for a problematic operator: don't wait for the job to complete before triggering the file/class dump.
Describe the solution you'd like
The workflow of the LOcal REplay (LoRe) framework for performance issue replay is as follows:
Dump: The developer decides which operator to dump and the threshold of operator time to watch for. Using the following capture as an example, if we set the dump filter with a threshold of 2 seconds, it will dump the related columnar batches as well as a serialized GpuProject class into binary files. With a task-limit configuration, it can cap how many data files are dumped per task.
Replay: Restoring from the dumped files, the problematic operator, together with the specific columnar batches, can easily be run locally.
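The replay step boils down to deserializing the operator and feeding it each dumped batch in order. The Python sketch below is a stand-in for that restore-then-run flow: the real tool would deserialize a GpuProject instance and GPU columnar batches, whereas here pickled files and the invented names `replay`, `operator.bin`, and `batch_*.bin` merely model the layout.

```python
import pickle
from pathlib import Path

def replay(dump_dir: str):
    """Replay a dumped operator against its captured input batches.

    Illustrative stand-in: loads a pickled callable as the "operator"
    and applies it to each pickled batch file, yielding the results
    in batch order.
    """
    operator = pickle.loads(Path(dump_dir, "operator.bin").read_bytes())
    for batch_file in sorted(Path(dump_dir).glob("batch_*.bin")):
        batch = pickle.loads(batch_file.read_bytes())
        yield operator(batch)
```

Because every batch is self-contained on disk, a single local Spark application (or even a bare JVM process in the real tool) can re-execute just the problematic operator, which is the whole point of “at will” replay.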
winningsix changed the title from “[FEA] Support operator specific dump tool for performance issue local reproduce” to “[FEA] LoRe framework - Support operator specific dump tool for performance issue local reproduce” on May 22, 2024.