feat: 2772 pyarrow doesnt permit selective reading with extensionarray #3127

Merged

Conversation

tcawlfield (Collaborator)

Adding a module, awkward._connect.pyarrow_table_conv, to convert pyarrow tables with ExtensionArray types into tables without them, storing the metadata necessary to reconstruct the ExtensionArray types in the table-level metadata.

This allows reading selected columns: newly written Parquet files will have native column types, together with the metadata needed to convert back again on read.
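As a rough illustration of the idea only (top-level columns, with a hypothetical "ext:" metadata-key convention; the real module also handles nested types and uses its own layout), the general pyarrow pattern is:

import pyarrow as pa

def strip_extension_types(table: pa.Table) -> pa.Table:
    # Replace extension-typed columns with their storage arrays and stash
    # the serialized extension info in the table-level metadata.
    new_metadata = dict(table.schema.metadata or {})
    columns, fields = [], []
    for i, field in enumerate(table.schema):
        column = table.column(i)
        if isinstance(field.type, pa.ExtensionType):
            # record what is needed to reconstruct the extension type on read
            new_metadata[f"ext:{field.name}".encode()] = (
                field.type.__arrow_ext_serialize__()
            )
            column = pa.chunked_array(
                [chunk.storage for chunk in column.chunks],
                type=field.type.storage_type,
            )
            field = pa.field(field.name, field.type.storage_type, field.nullable)
        columns.append(column)
        fields.append(field)
    return pa.Table.from_arrays(
        columns, schema=pa.schema(fields, metadata=new_metadata)
    )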

Commit notes along the way:

  • Each object now includes "field_name". Each composite object, including the table itself, is represented as a list of objects. This feels more consistent and robust. It does not work yet (Table.cast() seems to fail me here), but the schema generation appears to work so far.
  • Replacing Table.cast() with a custom function, replace_schema(). Replacing the dictionary of lambdas with a function using an elif-isinstance chain. This is currently very poorly tested.
  • Handling all the pyarrow types that we use now, but there are still errors converting native DictionaryType arrays to awkward extension types. It turns out you need AwkwardArrowArray.from_storage to create extension-type dictionary arrays; see the sketch after this list. Strange, but I'm evidently not the first poor soul to bump into this.
  • The unit tests do not yet cover Parquet file operations; those will likely come in the next commit. Also expanded test_2772 a bit, trying to reproduce errors from test_1440, but instead of reproducing those I found new errors. Ugh. Checking this in because it's just where I'm at right now.
  • Added a new test for actually doing a selective read. This required changing the top-level metadata from a list to a JSON object. Also fixed a bug: when converting a table to native, keep any existing table metadata entries.
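To illustrate the from_storage point with plain pyarrow (a generic toy extension type standing in for Awkward's AwkwardArrowType): dictionary-typed storage can't simply be cast to an extension type; it has to be wrapped explicitly.

import pyarrow as pa

class AnnotatedType(pa.ExtensionType):
    # A toy extension type; a real one would serialize its parameters.
    def __init__(self, storage_type):
        super().__init__(storage_type, "example.annotated")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(storage_type)

storage = pa.array(["a", "b", "a"]).dictionary_encode()  # a DictionaryArray
ext_type = AnnotatedType(storage.type)
# Casting dictionary storage to the extension type is what errored above;
# wrapping it with from_storage works:
ext_array = pa.ExtensionArray.from_storage(ext_type, storage)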
@agoose77 (Collaborator)

Super excited to see this work @tcawlfield!

@jpivarski (Member) left a comment

It's looking good! The next step would be to make ak.to_arrow_table and ak.from_arrow_table (and, by extension, ak.to_parquet and ak.from_parquet) do this automatically. Then a lot of the existing tests in

  • tests/test_1125_to_arrow_from_arrow.py
  • tests/test_1154_arrow_tables_should_preserve_parameters.py
  • tests/test_1294_to_and_from_parquet.py
  • tests/test_1453_write_single_records_to_parquet.py
  • tests/test_2340_unknown_type_to_arrow_and_back.py
  • tests/test_2772_parquet_extn_array_metadata.py

would verify that the metadata is being fully preserved and arrays are being properly reconstructed.

I tried this:

>>> arrow_table = ak.to_arrow_table(   # tuples need metadata
...     ak.Array([[(1, 1.1), (2, 2.2)], [], [(3, 3.3)]]),
...     extensionarray=True,
... )
>>> ak._connect.pyarrow_table_conv.convert_awkward_arrow_table_to_native(
...     arrow_table
... )

because this function takes a pa.Table and returns a pa.Table, but it didn't work:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jpivarski/irishep/awkward/src/awkward/_connect/pyarrow_table_conv.py", line 37, in convert_awkward_arrow_table_to_native
    return replace_schema(aatable, new_schema)
  File "/Users/jpivarski/irishep/awkward/src/awkward/_connect/pyarrow_table_conv.py", line 182, in replace_schema
    for col, new_field in zip(batch.itercolumns(), new_schema):
AttributeError: 'pyarrow.lib.RecordBatch' object has no attribute 'itercolumns'. Did you mean: 'num_columns'?

(I wanted to see what the collected metadata looks like.) Maybe I'm pulling the wrong function, though. (This whole new submodule consists of internal functions, so they don't have to have an obvious way to use them. They can be eclectic.)

This is looking good, but the real test will be when it's applied to all Awkward ↔ Arrow Table conversions, which will verify that the full conversions are lossless and correct.
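(Aside: RecordBatch.itercolumns appears only in newer pyarrow releases, which would explain the difference; a version-tolerant loop, sketched here as one possibility rather than the PR's actual fix, can fall back on the columns property:)

def iter_record_batch_columns(batch):
    # Yield the columns of a pyarrow RecordBatch, tolerating older releases
    # in which RecordBatch has a .columns list but no .itercolumns() method.
    if hasattr(batch, "itercolumns"):
        yield from batch.itercolumns()
    else:
        yield from batch.columns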

(Three review threads on src/awkward/_connect/pyarrow_table_conv.py, all resolved.)
@tcawlfield (Collaborator, Author)

> It's looking good! The next step would be to make ak.to_arrow_table and ak.from_arrow_table (and, by extension, ak.to_parquet and ak.from_parquet) do this automatically. […] I tried this […] because this function takes a pa.Table and returns a pa.Table, but it didn't work […]

This should have worked; it works for me. Perhaps you found the first issue with respect to differing pyarrow versions? Or maybe I had failed to push local commits yesterday? I pushed some commits just now; curious what this does for you now.

@tcawlfield marked this pull request as ready for review on May 23, 2024.
@tcawlfield (Collaborator, Author)

I'm marking this as ready for review, even though there are still failures on Python 3.8. I'll look into what it takes to reproduce them, but so far I'm hitting errors like:

AttributeError("'pyarrow.lib.LargeListType' object has no attribute 'field'")

that come specifically from here.

I'm having trouble installing all of requirements-test-minimal.txt in a Python 3.11 environment, although I got pyarrow 7.0.0 installed. And indeed, LargeListType in that version has a value_field property but not a field() method. I can probably special-case that with hasattr, but without reproducing the whole environment I'm not sure what would fail next. So I'll pause here and wait for any advice you might offer, and in the meantime try to get this working with an old Python somewhere.
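For concreteness, the hasattr special case could look something like this (a sketch assuming only the two spellings mentioned: value_field in pyarrow 7, field() in newer releases):

import pyarrow as pa

def list_element_field(list_type) -> pa.Field:
    # Element field of a ListType/LargeListType across pyarrow versions:
    # newer releases provide .field(i); pyarrow 7 only has .value_field.
    if hasattr(list_type, "field"):
        return list_type.field(0)
    return list_type.value_field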

You'll quickly notice that I finally broke down and converted ._connect.pyarrow into a package, with my new code as a submodule, .table_conv. I did a very minimal restructuring of the rest of pyarrow.py: the top-level __init__.py just handles safe importing with or without pyarrow and creates the same namespace of symbols regardless, so that other code can import anything here without having to get too fancy (a sketch of that pattern follows below). The extension classes went into a separate submodule because that part was easy to chop off and it felt reasonable to do so. Everything else went into a grab-bag, .../pyarrow/conversions.py. (I'm happy to rename this.)

I felt this was a safe place to draw the line between minor and major rearranging. My primary motivation was to pull the new table-conversion functions, convert_awkward_arrow_table_to_native and convert_native_arrow_table_to_awkward, up into ..._connect.pyarrow in a way that kept the new code separate, avoided circular imports, and worked well with or without pyarrow.
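The safe-import shape is roughly the following (an illustrative sketch, not the file's literal contents):

# src/awkward/_connect/pyarrow/__init__.py (illustrative)
try:
    import pyarrow
except ImportError:
    pyarrow = None

if pyarrow is not None:
    # the same names exist whether or not pyarrow is importable
    from awkward._connect.pyarrow.table_conv import (  # noqa: F401
        convert_awkward_arrow_table_to_native,
        convert_native_arrow_table_to_awkward,
    )
else:
    def _unavailable(*args, **kwargs):
        raise ImportError("pyarrow is required for this operation")

    convert_awkward_arrow_table_to_native = _unavailable
    convert_native_arrow_table_to_awkward = _unavailable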

@jpivarski (Member)

> And indeed, LargeListType in that version has a value_field property but not a field() method. I can probably special-case that with hasattr, but without reproducing the whole environment I'm not sure what would fail next.

Actually, I remember that issue with LargeListArray. I might have even reported it, but it's probably from back when Arrow used JIRA instead of GitHub and I'd have trouble finding it. Special-casing with hasattr is definitely fine, and it's an isolated thing. There's no reason to expect a deluge of issues afterward.

Is the difficulty of making an environment with pyarrow 7 related to conda versus pip venv (because pyarrow has a binary dependency)? Making environments with the minimal and maximal dependencies works with conda (mamba).

Reorganizing all of the pyarrow connections into a directory is a good idea. Thanks!

Moving to_awkwardarrow_storage_types from .conversions to .extn_types.
@tcawlfield (Collaborator, Author)

Okay I think this is all ready for review now, with tests passing!

> Is the difficulty of making an environment with pyarrow 7 related to conda versus pip venv (because pyarrow has a binary dependency)? Making environments with the minimal and maximal dependencies works with conda (mamba).

My issue with pyarrow 7.0.0 is resolved. I'm not sure why I had trouble installing the older requirements earlier; it was a compilation error of some sort. In any case, I was able to grab the old pyarrow together with the latest versions of the other dependencies in our requirements-test-minimal.txt and work out all the changes needed.

I left some earlier code commented out, with the rationale that if we drop support for older pyarrow versions in the future, we can clean up and/or improve the code. Some people don't like checking in commented-out code ("that's what git is for!"), and that's true, but I have things commented out that might be worth revisiting in the future, or that make the intentions more clear. I do try to keep this to a bare minimum, though.

@jpivarski (Member) left a comment

This is looking great! I think this is ready to merge, if updating to main is uneventful (i.e. it merges cleanly and all tests continue to pass).

(Review threads on src/awkward/_connect/pyarrow/__init__.py, conversions.py, extn_types.py, table_conv.py, and tests/test_2772_parquet_extn_array_metadata.py, all resolved.)
The failing test is being moved to a new file, to be added later.
@tcawlfield merged commit ef2e08f into main on May 27, 2024.
41 checks passed
@tcawlfield deleted the 2772-pyarrow-doesnt-permit-selective-reading-with-extensionarray branch on May 27, 2024.
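As a closing illustration (file path and field names made up), the selective read that issue 2772 asked for now works end to end:

import awkward as ak

ak.to_parquet(
    ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}]),
    "/tmp/example.parquet",
)
# Reading a subset of columns no longer trips over the Awkward extension
# types: the file stores native column types plus table-level metadata
# for reconstructing the Awkward types on read.
partial = ak.from_parquet("/tmp/example.parquet", columns=["x"])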
Successfully merging this pull request may close this issue:

  • PyArrow doesn't permit selective reading with ExtensionArray