
[Data] Can't return array-like data from UDF if batch contains unsupported type #45235

Closed
bveeramani opened this issue May 10, 2024 · 3 comments · Fixed by #45287 · May be fixed by #45272
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
P0: Issue that must be fixed in short order

Comments

@bveeramani (Member)

What happened + What you expected to happen

I returned array data from my UDF, but I got an error saying that arrays must be 1-dimensional:

(ReadRange->MapBatches(f) pid=69903) Could not construct Arrow block from numpy array; encountered values of unsupported numpy type `17` in column named 'unsupported', which cannot be casted to an Arrow data type. Falling back to using pandas block type, which is slower and consumes more memory. For maximum performance, consider applying the following suggestions before ingesting into Ray Data in order to use native Arrow block types:
(ReadRange->MapBatches(f) pid=69903) - Expand out each key-value pair in the dict column into its own column                  
(ReadRange->MapBatches(f) pid=69903) - Replace `None` values with an Arrow supported data type                                
(ReadRange->MapBatches(f) pid=69903)                                                                                          
Running 0:   0%|                                             | 0/20 [00:00<?, ?it/s]
2024-05-09 17:23:05,898 ERROR streaming_executor_state.py:455 -- An exception was raised from a task of operator "ReadRange->MapBatches(f)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
2024-05-09 17:23:05,916 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
ray.data.exceptions.SystemException

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/balaji/Documents/GitHub/ray/1.py", line 30, in <module>
    ray.data.range(100).map_batches(f).materialize()
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 4541, in materialize
    copy._plan.execute(force_read=True)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/exceptions.py", line 86, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(ValueError): ray::ReadRange->MapBatches(f)() (pid=69903, ip=127.0.0.1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/arrow_block.py", line 210, in numpy_to_block
    col = ArrowTensorArray.from_numpy(col, col_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/tensor_extensions/arrow.py", line 376, in from_numpy
    raise e
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/tensor_extensions/arrow.py", line 336, in from_numpy
    pa_dtype = pa.from_numpy_dtype(arr.dtype)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/types.pxi", line 5164, in pyarrow.lib.from_numpy_dtype
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 17

During handling of the above exception, another exception occurred:

ray::ReadRange->MapBatches(f)() (pid=69903, ip=127.0.0.1)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/execution/operators/map_operator.py", line 410, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 393, in __call__
    add_fn(data)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/output_buffer.py", line 48, in add_batch
    self._buffer.add_batch(batch)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/delegating_block_builder.py", line 38, in add_batch
    block = BlockAccessor.batch_to_block(batch)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/block.py", line 380, in batch_to_block
    return pd.DataFrame(dict(batch))
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pandas/core/frame.py", line 664, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 653, in _extract_index
    raise ValueError("Per-column arrays must each be 1-dimensional")
ValueError: Per-column arrays must each be 1-dimensional

Versions / Dependencies

6a266db

Reproduction script

import numpy as np

import ray


class UnsupportedType:
    pass


def f(batch):
    batch_size = len(batch["id"])
    return {
        "array": np.zeros((batch_size, 32, 32, 3)),
        "unsupported": [UnsupportedType()] * batch_size,
    }


ray.data.range(100).map_batches(f).materialize()
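Until the fix lands, one workaround is to serialize the unsupported objects into an Arrow-friendly type (e.g. bytes) inside the UDF, so the batch no longer falls back to the pandas code path that rejects the multi-dimensional array. A minimal sketch; the `f_workaround` name and the pickle-to-bytes encoding are illustrative, not from the issue:

```python
import pickle

import numpy as np


class UnsupportedType:
    pass


def f_workaround(batch):
    batch_size = len(batch["id"])
    return {
        "array": np.zeros((batch_size, 32, 32, 3)),
        # Encode the objects Arrow can't represent as bytes; downstream
        # consumers can recover them with pickle.loads.
        "unsupported": [pickle.dumps(UnsupportedType()) for _ in range(batch_size)],
    }
```

Downstream, `ray.data.range(100).map_batches(f_workaround)` would then receive only Arrow-compatible column types.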

Issue Severity

High: It blocks me from completing my task.

bveeramani added the bug, P0, and data labels on May 10, 2024
bveeramani self-assigned this on May 10, 2024
@982945902 (Contributor)

This is not a bug. When f() returns a dict, it is converted to a pandas DataFrame, and a DataFrame initialized from a dict requires each value to be 1-dimensional. For example:

pd.DataFrame({
    "array": np.zeros(10),
})
passes, while

pd.DataFrame({
    "array": np.zeros((10, 1)),
})
raises the same error.

Maybe you can return a whole DataFrame instead.
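One way to follow the suggestion above, sketched here with pandas alone (outside Ray, so it sidesteps Ray's block conversion entirely): keep every column 1-dimensional by storing one per-row array per element, then return the DataFrame itself.

```python
import numpy as np
import pandas as pd

batch_size = 4
arr = np.zeros((batch_size, 32, 32, 3))

# list(arr) yields one (32, 32, 3) array per row, so the column is a
# 1-D object column and the DataFrame constructor accepts it.
df = pd.DataFrame({
    "array": list(arr),
    "unsupported": [object()] * batch_size,
})
```

Note this stores the tensor column as Python objects, which loses the performance of a native Arrow or tensor-extension representation.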

@bveeramani (Member, Author)

@982945902 Ray Data has a custom extension type for multi-dimensional array data. We should automatically use the extension type, but we don't in this code path.

@sjincho commented May 24, 2024

Hi @bveeramani , thanks for the fix.
Does this fix #39559 too?
I had a unit test that checks the example there, and once I upgraded to 2.23, it no longer failed.
