Handle Empty/Small Data DataFrames as a separate case #4605

naren-ponder · 2022-06-27T15:51:12Z

Is your feature request related to a problem? Please describe.

In our current approach, we default empty dataframes to pandas at the query compiler level, which adds some overhead and has caused bugs with empty dataframes (#4306, #4307). Ideally, we would default to pandas not only for empty dataframes, but also for dataframes with so little data that distributing it costs more than it is worth.
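
As a rough illustration of the proposal (not Modin's actual implementation), the decision could come down to a single size check that treats an empty frame as the smallest possible small frame. The cutoff value and the names `SMALL_DATA_THRESHOLD` and `choose_execution` are hypothetical and chosen only for this sketch.

```python
# Hypothetical cutoff: frames with fewer cells than this stay in plain pandas.
SMALL_DATA_THRESHOLD = 10_000


def choose_execution(n_rows: int, n_cols: int,
                     threshold: int = SMALL_DATA_THRESHOLD) -> str:
    """Return "pandas" for empty or small frames, "distributed" otherwise."""
    if n_rows < 0 or n_cols < 0:
        raise ValueError("axis lengths must be non-negative")
    n_cells = n_rows * n_cols
    # An empty frame (either axis has length 0) has zero cells, so the same
    # comparison covers both the empty case and the small-data case.
    return "pandas" if n_cells < threshold else "distributed"


print(choose_execution(0, 5))          # empty frame
print(choose_execution(100, 10))       # small frame
print(choose_execution(1_000_000, 10)) # large frame
```

Handling empty and small frames through one code path is the point of the sketch: it avoids a dedicated empty-frame branch, which is where the linked bugs came from.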

mvashishtha · 2022-06-28T19:10:22Z

For reference, #4191 and #4060 are also bugs coming from improper treatment of empty dataframes.

Signed-off-by: Naren Krishna <[email protected]>

billiam-wang · 2022-09-14T19:47:18Z

@modin-project/modin-core @modin-project/modin-contributors @RehanSD @vnlitvinov @anmyachev Currently, indexes are computed asynchronously, which makes it difficult to determine whether a dataframe is empty without waiting for the index computation to complete. Does anybody have suggestions on how to approach this problem?

Some ideas we have include making changes at the query compiler level, the API level, or the Modin core level wherever columns or rows may be added or removed.

vnlitvinov · 2022-09-15T15:09:00Z

In most cases the axes are known, and I'm pretty sure most operations can be analyzed to see what effect they have on the axes, so in a typical case both axes would be known. We can simply assume that either we know the axes (and can therefore use their sizes to decide which compiler to apply) or the dataframe is big.

Only a few operations have unpredictable resulting axes: filtering by a user-defined condition (like df[df.a == b]), running groupby operations, etc. All other operations could be analyzed in advance.
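
A minimal sketch of this idea, assuming axis lengths are tracked as optional integers where `None` means "not known without computing". The names `AxisInfo`, `after_op`, `choose_compiler`, and `SMALL_CELL_LIMIT` are hypothetical, and the handful of operations modeled here is illustrative rather than a description of Modin's internals.

```python
from dataclasses import dataclass
from typing import Optional

SMALL_CELL_LIMIT = 10_000  # hypothetical cutoff for "small" data


@dataclass(frozen=True)
class AxisInfo:
    n_rows: Optional[int]  # None: length cannot be known in advance
    n_cols: Optional[int]


def after_op(info: AxisInfo, op: str, **kwargs) -> AxisInfo:
    """Predict the axis lengths that result from applying `op`."""
    if op == "head":
        n = kwargs["n"]  # assumed non-negative in this sketch
        rows = None if info.n_rows is None else min(info.n_rows, n)
        return AxisInfo(rows, info.n_cols)
    if op == "select_columns":
        return AxisInfo(info.n_rows, len(kwargs["columns"]))
    if op == "transpose":
        return AxisInfo(info.n_cols, info.n_rows)
    if op == "filter":
        # Row count depends on the data; columns are unchanged.
        return AxisInfo(None, info.n_cols)
    if op == "groupby":
        # Conservatively treat both axes as unknown.
        return AxisInfo(None, None)
    raise ValueError(f"unmodeled operation: {op}")


def choose_compiler(info: AxisInfo) -> str:
    """Unknown axes are assumed to mean the frame is big."""
    if info.n_rows is None or info.n_cols is None:
        return "distributed"
    small = info.n_rows * info.n_cols < SMALL_CELL_LIMIT
    return "pandas" if small else "distributed"


big = AxisInfo(1_000_000, 4)
small = after_op(big, "head", n=5)
filtered = after_op(small, "filter")
print(choose_compiler(big), choose_compiler(small), choose_compiler(filtered))
```

Note that under this assumption a filter applied to a known-small frame falls back to the distributed path, since its row count is no longer tracked. A real design might instead carry an upper bound through such operations.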

naren-ponder self-assigned this Jun 27, 2022

naren-ponder mentioned this issue Jun 27, 2022

FEAT-#4605: Handle small and empty dataframes #4606

Closed

8 tasks

mvashishtha mentioned this issue Jun 28, 2022

using reset_index on empty DataFrame coverts column datatypes to object #4615

Closed

mvashishtha mentioned this issue Jun 30, 2022

Different behavior between modin and pandas for isin operation #4618

Closed

naren-ponder mentioned this issue Jul 1, 2022

Converting already categorical series using pd.Categorical results in AttributeError. #4623

Closed

naren-ponder added a commit to naren-ponder/modin that referenced this issue Jul 26, 2022

FEAT-modin-project#4605: Basic approach layout

0512180

Signed-off-by: Naren Krishna <[email protected]>

pyrito added the pandas concordance 🐼 (Functionality that does not match pandas) and P1 (Important tasks that we should complete soon) labels Aug 31, 2022

vnlitvinov mentioned this issue Sep 6, 2022

On empty dataframes, methods (that go through __getattribute__) default to pandas #4418

Open

mvashishtha mentioned this issue Sep 6, 2022

BUG: try_cast_to_pandas doesn't preserve dtypes for empty frame #4934

Closed

vnlitvinov mentioned this issue Sep 9, 2022

FEAT-#3043: allow pandas.DataFrame to be passed into merge and join functions #4324

Closed

8 tasks

mvashishtha unassigned naren-ponder Oct 12, 2022

mvashishtha added the Epic label Oct 12, 2022

mvashishtha added the empty dataframes and series 🚫 (Bugs having to do with empty dataframes and series) label Oct 27, 2022

dchigarev linked a pull request Nov 14, 2022 that will close this issue

FEAT-#4605: Implementation of Small Query Compiler to support small and empty DataFrames #5113

Open

7 tasks

arunjose696 linked a pull request May 13, 2024 that will close this issue

FEAT-#4605: Adding small query compiler #7259

Draft

7 tasks
