[Ray Data] repartition causes an error to be thrown for non-pyarrow compatible column types #45236
Comments
Well, if you try without the repartition call, i.e. see the code below:
import ray.data
import pandas as pd
import numpy as np
ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))
def my_fn(batch):
    return batch
ds.map_batches(my_fn).to_pandas()
print("The above will pass!")
I like this behavior more because it expects less work from the user, but regardless of my preferences, I would expect the behavior to be consistent whether a |
Yeah, this is a bug. After the repartition, the dataset contains two pandas blocks. However, after map_batches the output blocks have different types:
import numpy as np
import pandas as pd
import ray
ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))
def my_fn(batch):
    return batch
refs = ds.repartition(2).map_batches(my_fn).get_internal_block_refs()
print([type(ray.get(ref)) for ref in refs])
# <class 'pyarrow.lib.Table'>, <class 'pandas.core.frame.DataFrame'>
A possible fix is to convert unexpected types in |
What happened + What you expected to happen
I have a simple pandas dataframe with a string column that contains NumPy NaNs.
When I attempt to build a dataset from it and apply transformations, it works as expected. However, when I include a repartition call, a TypeError is thrown.
Versions / Dependencies
Python version: 3.10.11 (main, Dec 12 2023, 16:25:48) [Clang 15.0.0 (clang-1500.0.40.1)]
Ray version: 2.12.0
Reproduction script
See this code snippet to reproduce
Issue Severity
Low: It annoys or frustrates me.