Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ray Data] repartition causes an error to be thrown for non-pyarrow compatible column types #45236

Open
marwan116 opened this issue May 10, 2024 · 3 comments
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P2 Important issue, but not time-critical

Comments

@marwan116
Copy link
Contributor

marwan116 commented May 10, 2024

What happened + What you expected to happen

I have a simple pandas dataframe with a string column that contains numpy nans.

When I attempt to build a dataset from it and apply transformations it works as expected. However when I include a repartition call, it causes a TypeError to be thrown.

Versions / Dependencies

Version info:
3.10.11 (main, Dec 12 2023, 16:25:48) [Clang 15.0.0 (clang-1500.0.40.1)]
Ray version:
2.12.0

Reproduction script

See this code snippet to reproduce

import ray.data
import pandas as pd
import numpy as np

ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))


def my_fn(batch):
    return batch


ds.repartition(2).map_batches(my_fn).take_all()
print("The above will pass!")

ds.repartition(2).map_batches(my_fn).to_pandas()
print("The above will crash!")

Issue Severity

Low: It annoys or frustrates me.

@marwan116 marwan116 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 10, 2024
@scottjlee scottjlee added P2 Important issue, but not time-critical data Ray Data-related issues and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 10, 2024
@982945902
Copy link
Contributor

This is not a bug, conservatively speaking, if you change ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]})) to
ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]}))
is will pass.

why?
look that:
image
a return block will try cover to arrow first if failed will be pandas.
["a", "b", np.nan] cannot cover to arrow, because arrow require same type (type("a") != type(np.nan))

so in you case , first block cover to arrow , second cover to pandas, i will cause block join check raise exception.

you can do like this

ds.repartition(2).map_batches(my_fn,batch_format="pandas").to_pandas()
pass

@marwan116
Copy link
Contributor Author

marwan116 commented May 10, 2024

Well if you try without the .repartition, the code will pass without having to explicitly set batch_format="pandas"

i.e. see the below code:

import ray.data
import pandas as pd
import numpy as np

ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))


def my_fn(batch):
    return batch

ds.map_batches(my_fn).to_pandas()
print("The above will pass!")

I like this behavior more because it expects less work from the user - but regardless of my preferences, I would expect the behavior to be consistent whether a repartition operation is applied or not.

@bveeramani
Copy link
Member

Yeah, this is a bug.

After the repartition, the dataset contains two pandas blocks. However, map_batches converts the first block to an Arrow table and the keeps the second block as a pandas DataFrame.

import numpy as np
import pandas as pd

import ray

ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))


def my_fn(batch):
    return batch


refs = ds.repartition(2).map_batches(my_fn).get_internal_block_refs()
print([type(ray.get(ref)) for ref in refs])
# <class 'pyarrow.lib.Table'>, <class 'pandas.core.frame.DataFrame'>

A possible fix is to convert unexpected types in DelegatingBlockBuilder rather than raise an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P2 Important issue, but not time-critical
Projects
None yet
Development

No branches or pull requests

4 participants