[Ray Data] repartition causes an error to be thrown for non-pyarrow compatible column types #45236

marwan116 · 2024-05-10T01:21:20Z

What happened + What you expected to happen

I have a simple pandas dataframe with a string column that contains numpy nans.

When I attempt to build a dataset from it and apply transformations it works as expected. However when I include a repartition call, it causes a TypeError to be thrown.

Versions / Dependencies

Version info:
3.10.11 (main, Dec 12 2023, 16:25:48) [Clang 15.0.0 (clang-1500.0.40.1)]
Ray version:
2.12.0

Reproduction script

See this code snippet to reproduce

import ray.data
import pandas as pd
import numpy as np

ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))


def my_fn(batch):
    return batch


ds.repartition(2).map_batches(my_fn).take_all()
print("The above will pass!")

ds.repartition(2).map_batches(my_fn).to_pandas()
print("The above will crash!")

Issue Severity

Low: It annoys or frustrates me.


982945902 · 2024-05-10T08:15:52Z

Conservatively speaking, this is not a bug. If you change
ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))
to
ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]}))
it will pass.

Why?

A returned block is first converted to Arrow; if that conversion fails, it falls back to pandas.
["a", "b", np.nan] cannot be converted to Arrow, because Arrow requires all values in a column to have the same type (type("a") != type(np.nan)).

So in your case, the first block is converted to Arrow and the second to pandas, which causes the check performed when the blocks are joined to raise an exception.

You can work around it like this:

ds.repartition(2).map_batches(my_fn, batch_format="pandas").to_pandas()

This passes.
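
The type mismatch mentioned in this comment can be reproduced with the standard library alone, because np.nan is an ordinary Python float. The following is only a rough illustration: the helper name value_types is hypothetical, and the check is a simplified stand-in for a "one type per column" rule, not Arrow's actual type inference.

```python
def value_types(values):
    """Return the sorted names of the Python types found in a column."""
    return sorted({type(v).__name__ for v in values})


# Stand-in for np.nan, which is also a plain Python float.
nan = float("nan")

print(value_types(["a", "b", "c"]))  # ['str']
print(value_types(["a", "b", nan]))  # ['float', 'str']
```

A column that reports more than one type here is the kind of column that would need either an explicit cast or the pandas batch format.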

marwan116 · 2024-05-10T14:12:44Z

Well if you try without the .repartition, the code will pass without having to explicitly set batch_format="pandas"

i.e. see the below code:

import ray.data
import pandas as pd
import numpy as np

ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))


def my_fn(batch):
    return batch

ds.map_batches(my_fn).to_pandas()
print("The above will pass!")

I like this behavior more because it expects less work from the user - but regardless of my preferences, I would expect the behavior to be consistent whether a repartition operation is applied or not.

bveeramani · 2024-05-10T15:07:22Z

Yeah, this is a bug.

After the repartition, the dataset contains two pandas blocks. However, map_batches converts the first block to an Arrow table and keeps the second block as a pandas DataFrame.

import numpy as np
import pandas as pd

import ray

ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", np.nan]}))


def my_fn(batch):
    return batch


refs = ds.repartition(2).map_batches(my_fn).get_internal_block_refs()
print([type(ray.get(ref)) for ref in refs])
# <class 'pyarrow.lib.Table'>, <class 'pandas.core.frame.DataFrame'>

A possible fix is to convert unexpected types in DelegatingBlockBuilder rather than raise an error.
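
To make the idea concrete, here is a toy sketch of "convert instead of raise". It does not use Ray's classes: StrictBuilder, ConvertingBuilder, and to_rows are hypothetical names, a dict of columns stands in for an Arrow table, and a list of row dicts stands in for a pandas DataFrame. It also assumes that the failing builder locks onto the type of the first block it receives, which matches the TypeError reported here but is not taken from the Ray source.

```python
from typing import Any, Dict, List, Union

# Toy stand-ins, not Ray types: a dict of columns plays the Arrow table,
# a list of row dicts plays the pandas DataFrame.
ColumnarBlock = Dict[str, List[Any]]
RowBlock = List[Dict[str, Any]]
Block = Union[ColumnarBlock, RowBlock]


def to_rows(block: Block) -> RowBlock:
    """Convert either toy block kind to the row-oriented format."""
    if isinstance(block, list):
        return list(block)
    names = list(block)
    n_rows = len(block[names[0]]) if names else 0
    return [{name: block[name][i] for name in names} for i in range(n_rows)]


class StrictBuilder:
    """Assumed current behavior: the first block fixes the accepted kind."""

    def __init__(self) -> None:
        self._kind = None
        self._rows: RowBlock = []

    def add_block(self, block: Block) -> None:
        if self._kind is None:
            self._kind = type(block)
        elif type(block) is not self._kind:
            raise TypeError(
                f"expected {self._kind.__name__}, got {type(block).__name__}"
            )
        self._rows.extend(to_rows(block))

    def build(self) -> RowBlock:
        return self._rows


class ConvertingBuilder:
    """Suggested behavior: convert an unexpected kind instead of raising."""

    def __init__(self) -> None:
        self._rows: RowBlock = []

    def add_block(self, block: Block) -> None:
        self._rows.extend(to_rows(block))

    def build(self) -> RowBlock:
        return self._rows


arrow_like = {"a": [1, 2], "b": ["a", "b"]}
pandas_like = [{"a": 3, "b": float("nan")}]

strict = StrictBuilder()
strict.add_block(arrow_like)
try:
    strict.add_block(pandas_like)
except TypeError as err:
    print("strict builder:", err)

lenient = ConvertingBuilder()
lenient.add_block(arrow_like)
lenient.add_block(pandas_like)
print("converting builder rows:", len(lenient.build()))
```

In this toy model the converting builder always normalizes to the fallback (row) format; a real fix would also have to decide which format to normalize to and what that conversion costs.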

marwan116 added the bug and triage labels · May 10, 2024

scottjlee added the P2 and data labels and removed the triage label · May 10, 2024

982945902 mentioned this issue May 11, 2024

avoid merge errors when blocks contain different type #45269

