[FEATURE-REQUEST] Support for HDF5 special types (e.g., variable-length dtypes.) #2393

Open
callous4567 opened this issue Sep 19, 2023 · 0 comments


Thank you for reaching out and helping us improve Vaex!

Description
Add support in vaex.DataFrame.export_hdf5(...) for columns whose elements are variable-length lists, arrays, and so on, along with other HDF5 "special types" (see https://docs.h5py.org/en/stable/special.html). Here's some example code that would ideally run and generate an appropriate HDF5 file:

import vaex
import numpy as np

# Generate 1000 variable-length lists (lengths between 0 and 99)
rng = np.random.default_rng()
lol = [list(range(rng.integers(0, 100))) for _ in range(1000)]
# Object array holding one Python list per element
lol = np.array(lol, dtype=object)

# To vaex
df = vaex.from_arrays(_primary=lol)

# Export to a file 
df.export_hdf5("test.hdf5")

The column _primary (built from the list-of-lists lol) holds variable-length lists; these could equally be other variable-length objects. h5py/HDF5 supports such data (see https://docs.h5py.org/en/stable/special.html), and I've confirmed this in Python 3.10 with the method below. It is just a scrap of code from something I'm writing that happens to write lists-of-lists fine:


    # Method from a larger class; assumes `import h5py` and `import numpy as np` at module level.
    def write_list(self, group: str, dataset: str, _list: list, **kwargs):

        """
        Write the provided list within [group,dataset] in the file located at self.path.

        Behaviour
        ----
            If [group,dataset] exists, it is deleted from the group and a new dataset is created. Note that this
            only removes the data from the HDF5 file's tree; it does not free file space. Special behaviour arises
            when the elements of your list are not all of the same size,
            see https://docs.h5py.org/en/stable/special.html.

        **kwargs
        ----
            _vtype: str (optional, default False)
                In the case that your list is made up of lists or other elements of various length, you must specify
                the dtype, e.g., "int32" or "float64." The list-of-lists will be converted to a list-of-arrays before
                being written.

        :param group: Parent key
        :param dataset: Child key
        :param _list: list
        :return: None.
        """

        with h5py.File(self.path, 'a') as f:

            if group not in f.keys():

                f.create_group(group)

            if dataset in f[group].keys():

                del f[group][dataset]

            _vtype = kwargs.get("_vtype", False)

            if _vtype is not False:

                _dtype = h5py.vlen_dtype(np.dtype(_vtype))
                _list = [np.array(d, _vtype) for d in _list]
                f.create_dataset(name=group + "/" + dataset, dtype=_dtype, data=_list)

            else:

                f.create_dataset(name=group + "/" + dataset, data=_list)
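
For anyone wanting to check the h5py behaviour outside of my class, here is a minimal standalone sketch of the same idea: write ragged rows using h5py.vlen_dtype, then read them back. The file name, the dataset path "group/lol" and the sample rows are made up for illustration, and this version assigns rows one at a time rather than passing data= as the method above does.

```python
import os
import tempfile

import h5py
import numpy as np

# Ragged input: rows of different lengths, like the list-of-lists in the issue.
rows = [[0, 1, 2], [5], [7, 8]]

# Variable-length dtype whose elements are int32 arrays.
vlen_int32 = h5py.vlen_dtype(np.dtype("int32"))

# Illustrative location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "vlen_demo.hdf5")

with h5py.File(path, "w") as f:
    # One variable-length element per row; intermediate group is created automatically.
    dset = f.create_dataset("group/lol", shape=(len(rows),), dtype=vlen_int32)
    for i, row in enumerate(rows):
        dset[i] = np.asarray(row, dtype="int32")

with h5py.File(path, "r") as f:
    # Each element comes back as a NumPy array; convert to plain lists to compare.
    restored = [row.tolist() for row in f["group/lol"][:]]
    base_dtype = h5py.check_vlen_dtype(f["group/lol"].dtype)
```

Here base_dtype is the element dtype recovered from the file, which is the piece of information an exporter would need per column.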

Is your feature request related to a problem? Please describe.
Not as far as I am aware.

Additional context
When vaex attempts to write a column of variable-length objects, it raises this error:

Traceback (most recent call last):
  File "A:\straszaks\pycharm_tpa\DBKnowPy-sstrasza\class_DB.py", line 430, in <module>
    DB().test()
  File "A:\straszaks\pycharm_tpa\DBKnowPy-sstrasza\class_DB.py", line 428, in test
    self._export()
  File "A:\straszaks\pycharm_tpa\DBKnowPy-sstrasza\class_DB.py", line 211, in _export
    self.FileLookup.export_hdf5(os.path.join(self.Root, self.Name + "_FileLookup.hdf5"), progress=False)
  File "C:\Users\sstrasza\Documents\miniforge3\lib\site-packages\vaex\dataframe.py", line 6949, in export_hdf5
    writer.layout(self, progress=progressbar_layout)
  File "C:\Users\sstrasza\Documents\miniforge3\lib\site-packages\vaex\hdf5\writer.py", line 85, in layout
    raise TypeError(f"Cannot export column of type: {dtype} (column {name})")
TypeError: Cannot export column of type: object (column _keys)

There should be an option under vaex.DataFrame.export_hdf5 that lets the user declare which columns of the DataFrame hold variable-length types (or other HDF5 "special" types), so that vaex can export those columns to the HDF5 file instead of raising.
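
To make the request concrete, here is a rough sketch of the kind of per-column option I have in mind, written against plain h5py rather than vaex. The function export_columns and its vlen_columns argument are hypothetical names of my own and are not part of the vaex API. The file it produces is a flat HDF5 file, not vaex's own HDF5 layout, so I would not assume vaex can open it.

```python
import os
import tempfile

import h5py
import numpy as np


def export_columns(path, columns, vlen_columns=None):
    """Hypothetical helper (not vaex API): write named columns to an HDF5 file.

    columns: dict mapping column name -> sequence of values.
    vlen_columns: dict mapping column name -> element dtype string, for the
        columns whose rows have varying length.
    """
    vlen_columns = vlen_columns or {}
    with h5py.File(path, "w") as f:
        for name, values in columns.items():
            if name in vlen_columns:
                # User-declared variable-length column: one array per row.
                base = np.dtype(vlen_columns[name])
                dset = f.create_dataset(
                    name, shape=(len(values),), dtype=h5py.vlen_dtype(base)
                )
                for i, row in enumerate(values):
                    dset[i] = np.asarray(row, dtype=base)
            else:
                # Ordinary fixed-type column.
                f.create_dataset(name, data=np.asarray(values))


path = os.path.join(tempfile.mkdtemp(), "export_demo.hdf5")
export_columns(
    path,
    {"x": [1.0, 2.0, 3.0], "_keys": [[1, 2], [3], [4, 5, 6]]},
    vlen_columns={"_keys": "int64"},
)

with h5py.File(path, "r") as f:
    x = f["x"][:].tolist()
    keys = [row.tolist() for row in f["_keys"][:]]
```

The point of the vlen_columns mapping is that the user states the element dtype up front, so the exporter does not have to guess it from an object column.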
