
[Data] Can't return array-like data from UDF if batch contains unsupported type #45235

Closed
bveeramani opened this issue May 10, 2024 · 3 comments · Fixed by #45287 · May be fixed by #45272
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
P0: Issue that must be fixed in short order

Comments

@bveeramani (Member)

What happened + What you expected to happen

I returned array data from my UDF, but I got an error saying that arrays must be 1-dimensional:

(ReadRange->MapBatches(f) pid=69903) Could not construct Arrow block from numpy array; encountered values of unsupported numpy type `17` in column named 'unsupported', which cannot be casted to an Arrow data type. Falling back to using pandas block type, which is slower and consumes more memory. For maximum performance, consider applying the following suggestions before ingesting into Ray Data in order to use native Arrow block types:
(ReadRange->MapBatches(f) pid=69903) - Expand out each key-value pair in the dict column into its own column                  
(ReadRange->MapBatches(f) pid=69903) - Replace `None` values with an Arrow supported data type                                
(ReadRange->MapBatches(f) pid=69903)                                                                                          
Running 0:   0%|                                             | 0/20 [00:00<?, ?it/s]
2024-05-09 17:23:05,898 ERROR streaming_executor_state.py:455 -- An exception was raised from a task of operator "ReadRange->MapBatches(f)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
2024-05-09 17:23:05,916 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
ray.data.exceptions.SystemException

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/balaji/Documents/GitHub/ray/1.py", line 30, in <module>
    ray.data.range(100).map_batches(f).materialize()
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 4541, in materialize
    copy._plan.execute(force_read=True)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/exceptions.py", line 86, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(ValueError): ray::ReadRange->MapBatches(f)() (pid=69903, ip=127.0.0.1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/arrow_block.py", line 210, in numpy_to_block
    col = ArrowTensorArray.from_numpy(col, col_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/tensor_extensions/arrow.py", line 376, in from_numpy
    raise e
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/tensor_extensions/arrow.py", line 336, in from_numpy
    pa_dtype = pa.from_numpy_dtype(arr.dtype)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/types.pxi", line 5164, in pyarrow.lib.from_numpy_dtype
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 17

During handling of the above exception, another exception occurred:

ray::ReadRange->MapBatches(f)() (pid=69903, ip=127.0.0.1)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/execution/operators/map_operator.py", line 410, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 393, in __call__
    add_fn(data)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/output_buffer.py", line 48, in add_batch
    self._buffer.add_batch(batch)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/delegating_block_builder.py", line 38, in add_batch
    block = BlockAccessor.batch_to_block(batch)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/block.py", line 380, in batch_to_block
    return pd.DataFrame(dict(batch))
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pandas/core/frame.py", line 664, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 653, in _extract_index
    raise ValueError("Per-column arrays must each be 1-dimensional")
ValueError: Per-column arrays must each be 1-dimensional

Versions / Dependencies

6a266db

Reproduction script

import numpy as np

import ray


class UnsupportedType:
    pass


def f(batch):
    batch_size = len(batch["id"])
    return {
        "array": np.zeros((batch_size, 32, 32, 3)),
        "unsupported": [UnsupportedType()] * batch_size,
    }


ray.data.range(100).map_batches(f).materialize()
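Until the fix lands, one workaround is to serialize the unsupported objects into an Arrow-friendly type (e.g. bytes) inside the UDF, so the batch no longer falls back to the pandas code path that rejects the multi-dimensional array. A minimal sketch; the `f_workaround` name and the pickle-to-bytes encoding are illustrative, not from the issue:

```python
import pickle

import numpy as np


class UnsupportedType:
    pass


def f_workaround(batch):
    batch_size = len(batch["id"])
    return {
        "array": np.zeros((batch_size, 32, 32, 3)),
        # Encode the objects Arrow can't represent as bytes; downstream
        # consumers can recover them with pickle.loads.
        "unsupported": [pickle.dumps(UnsupportedType()) for _ in range(batch_size)],
    }
```

Downstream, `ray.data.range(100).map_batches(f_workaround)` would then receive only Arrow-compatible column types.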

Issue Severity

High: It blocks me from completing my task.

bveeramani added the bug, P0, and data labels on May 10, 2024
bveeramani self-assigned this on May 10, 2024
@982945902 (Contributor)

This is not a bug. When f() returns a dict, it is converted to a pandas DataFrame, and a DataFrame initialized from a dict requires each value to be 1-dimensional. For example:

pd.DataFrame({
    "array": np.zeros(10),
})
passes, while

pd.DataFrame({
    "array": np.zeros((10, 1)),
})
raises the same error.

Maybe you can return a whole DataFrame instead.
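One way to follow the suggestion above, sketched here with pandas alone (outside Ray, so it sidesteps Ray's block conversion entirely): keep every column 1-dimensional by storing one per-row array per element, then return the DataFrame itself.

```python
import numpy as np
import pandas as pd

batch_size = 4
arr = np.zeros((batch_size, 32, 32, 3))

# list(arr) yields one (32, 32, 3) array per row, so the column is a
# 1-D object column and the DataFrame constructor accepts it.
df = pd.DataFrame({
    "array": list(arr),
    "unsupported": [object()] * batch_size,
})
```

Note this stores the tensor column as Python objects, which loses the performance of a native Arrow or tensor-extension representation.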

@bveeramani (Member, Author)

@982945902 Ray Data has a custom extension type for multi-dimensional array data. We should automatically use the extension type, but we don't in this code path.

@sjincho commented May 24, 2024

Hi @bveeramani , thanks for the fix.
Does this fix #39559 too?
I had a unit test that checks the example there, and once I upgraded to 2.23, it no longer failed.
