Different behavior between modin and pandas for isin operation #4618

Closed
Garra1980 opened this issue Jun 29, 2022 · 2 comments

@Garra1980
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Modin version (modin.__version__):
  • Python version:
  • Code we can use to reproduce:

Another example of a difference between Modin and plain pandas, for the following snippet:

import modin.pandas as pd  # swap for `import pandas as pd` to compare

df = pd.DataFrame(columns=['col1', 'col2'])
df = df[df['col1'].isin(['1','2'])]
print(df)

Describe the problem

modin.pandas will print:
Empty DataFrame
Columns: []
Index: []

default pandas will print:
Empty DataFrame
Columns: [col1, col2]
Index: []

Not sure pandas is super correct here though
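
For reference, the plain pandas behavior described above can be checked with a small script. This is only a sketch of the baseline and uses pandas alone, not Modin; the variable names are illustrative.

```python
import pandas

# Plain pandas baseline for the snippet above (no Modin involved).
pdf = pandas.DataFrame(columns=["col1", "col2"])
mask = pdf["col1"].isin(["1", "2"])

# The mask is empty but still boolean, so pandas treats it as a row filter
# and the column labels survive the filtering step.
mask_dtype = mask.dtype
filtered = pdf[mask]
filtered_columns = list(filtered.columns)

print(mask_dtype)
print(filtered_columns)
```

A check like this could serve as the expected result when comparing against modin.pandas.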

Source code / logs

@mvashishtha
Collaborator

@Garra1980 thank you for reporting this issue. I can reproduce it at version 86d3610.

The root cause is that when Modin defaults to pandas for the dataframe __getitem__, it converts the boolean indexer df['col1'].isin(['1','2']) to pandas but the result has the wrong dtype. So Modin indexes the pandas dataframe with pandas.Series([], dtype="object", name='col1') instead of pandas.Series([], dtype=bool, name='col1'). For some reason, indexing with the former gives a dataframe with no columns instead of one with the correct columns.

Here is Modin getting the wrong dtype for the indexer when converting to pandas:

import modin.pandas as pd

df = pd.DataFrame(columns=['col1', 'col2'])
modin_indexer = df['col1'].isin(['1','2'])
# Modin dtype is bool
print(modin_indexer.dtype)
# _to_pandas() dtype is object
print(modin_indexer._to_pandas().dtype)

and here is the difference in behavior for the two indexers:

import pandas

pdf = pandas.DataFrame(columns=['col1', 'col2'])
bool_indexer = pandas.Series([], dtype=bool, name='col1')
object_indexer = pandas.Series([], dtype="object", name='col1')
# prints Index(['col1', 'col2'], dtype='object')
print(pdf[bool_indexer].columns)
# prints Index([], dtype='object')
print(pdf[object_indexer].columns)
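
A plausible explanation, stated as an assumption rather than something confirmed in this thread, is that pandas does not recognize an empty object-dtype Series as a boolean mask and instead reads it as an empty list of column labels. Under that assumption, a user-side workaround is to restore the bool dtype on an empty indexer before indexing. The helper `as_bool_mask` below is hypothetical: it is not part of pandas or Modin, and it is not the fix proposed in #4606.

```python
import pandas


def as_bool_mask(indexer: pandas.Series) -> pandas.Series:
    """Hypothetical helper: restore bool dtype on an empty indexer."""
    # Only touch empty indexers; non-empty ones keep their original dtype.
    if len(indexer) == 0 and indexer.dtype != bool:
        return indexer.astype(bool)
    return indexer


pdf = pandas.DataFrame(columns=["col1", "col2"])
object_indexer = pandas.Series([], dtype="object", name="col1")

# Indexing with the object-dtype indexer drops every column.
broken_columns = list(pdf[object_indexer].columns)
# Indexing with the coerced bool indexer keeps both columns.
fixed_columns = list(pdf[as_bool_mask(object_indexer)].columns)

print(broken_columns)
print(fixed_columns)
```

The cast is limited to empty indexers so that a non-empty object Series, which may legitimately hold column labels, is left alone.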

_to_pandas() is known to produce incorrect dtypes for empty dataframes, e.g. in #4191 and #4060. #4605 tracks a way to handle empty dataframes robustly in Modin in general. We have a draft PR, #4606, ready for that feature, and I think that PR should fix this bug.

I will mark this issue as a duplicate of #4605.

@mvashishtha
Collaborator

Duplicate of #4605

@mvashishtha mvashishtha marked this as a duplicate of #4605 Jun 30, 2022