compare_classifiers_performance_from_pred - hdf5 is not supported for ground truth file #3550

iflow · 2023-08-27T19:32:53Z

Describe the bug
The visualization compare_classifiers_performance_from_pred does not work, because the following error is raised:
ValueError: hdf5 is not supported for ground truth file, valid types are {'stata', 'dataframe', <class 'dask.dataframe.core.DataFrame'>, 'html', 'df', 'tsv', 'json', 'jsonl', <class 'pandas.core.frame.DataFrame'>, 'orc', 'parquet', 'sas', 'fwf', 'feather', 'csv', 'spss', 'excel', 'pickle'}
According to the documentation the parameter ground_truth should be the name of the HDF5 file obtained during training preprocessing.
Documentation: https://ludwig.ai/latest/user_guide/visualizations/#compare_classifiers_performance_from_pred

To Reproduce
Steps to reproduce the behavior:

Go to Google Colab and generate some training + prediction data.
Generate a visualization:

!ludwig visualize --visualization compare_classifiers_performance_from_pred \
  --predictions predictions_20230827_183245.csv \
  --ground_truth train.hdf5 \
  --ground_truth_metadata 1dbf206244e911ee93d40242ac1c000c.meta.json \
  --output_feature_name MyTarget

See error

Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 191, in main
    CLI()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 71, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 116, in visualize
    visualize.cli(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 4172, in cli
    vis_func(**vars(args))
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 469, in compare_classifiers_performance_from_pred_cli
    ground_truth = _extract_ground_truth_values(ground_truth, output_feature_name, ground_truth_split, split_file)
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 264, in _extract_ground_truth_values
    ground_truth_df = _get_ground_truth_df(ground_truth) if isinstance(ground_truth, str) else ground_truth
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 233, in _get_ground_truth_df
    raise ValueError(
ValueError: hdf5 is not supported for ground truth file, valid types are {'stata', 'dataframe', <class 'dask.dataframe.core.DataFrame'>, 'html', 'df', 'tsv', 'json', 'jsonl', <class 'pandas.core.frame.DataFrame'>, 'orc', 'parquet', 'sas', 'fwf', 'feather', 'csv', 'spss', 'excel', 'pickle'}

Expected behavior
Plots like those shown in the documentation.

Environment:

OS: Google Colab - Linux Ubuntu
Python version: 3.10
Ludwig version: 0.8.1.post1

Additional context

I tried to find a bug in ludwig/utils/data_utils.py, but it looks good. I have also tried calling compare_classifiers_performance_from_pred_cli directly from a Jupyter Notebook, but the same error is raised there.


tgaddair · 2023-08-30T04:56:22Z

Hey @iflow, thanks for reporting this issue! It looks like somehow we excluded HDF5 from the list of valid file formats in this check. Can you try running with the changes in #3557 and let me know if that addresses the issue?
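
For readers following along, here is a minimal sketch of the kind of allowlist check that produces this error. It is not Ludwig's actual code: the names `SUPPORTED_GROUND_TRUTH_FORMATS`, `EXTENSION_TO_FORMAT`, and `check_ground_truth_format` are hypothetical, and the format set is deliberately abbreviated. It only illustrates how a format missing from the allowlist gets rejected even when the file itself is fine.

```python
import os

# Hypothetical, abbreviated allowlist (assumption, not Ludwig's real set).
SUPPORTED_GROUND_TRUTH_FORMATS = {"csv", "tsv", "json", "jsonl", "parquet", "hdf5"}

# Hypothetical mapping from file extension to format name.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".h5": "hdf5",
    ".hdf5": "hdf5",
}


def check_ground_truth_format(path, supported=SUPPORTED_GROUND_TRUTH_FORMATS):
    """Return the format name for `path`, or raise if it is not allowed."""
    ext = os.path.splitext(path)[1].lower()
    data_format = EXTENSION_TO_FORMAT.get(ext)
    if data_format is None or data_format not in supported:
        raise ValueError(
            f"{data_format or ext} is not supported for ground truth file, "
            f"valid types are {sorted(supported)}"
        )
    return data_format
```

With "hdf5" in the set, `check_ground_truth_format("train.hdf5")` returns the format name; dropping it from `supported` raises a `ValueError` shaped like the one in the report.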

arnavgarg1 · 2023-08-30T20:34:11Z

Hi @iflow, whenever you can confirm that this fixes the issue, we're good to merge our fix in!

iflow · 2023-08-31T12:12:08Z

Thanks for the quick fix! Unfortunately I couldn't try it yet because I installed the library in Google Colab. So I have to set everything up on my local machine, which might take some time.

tgaddair · 2023-08-31T18:05:12Z

Hey @iflow, in Colab you can test out the branch by installing Ludwig like this:

!pip install "git+https://github.com/ludwig-ai/ludwig.git@fix-gt-formats#egg=ludwig[llm]" --quiet

iflow · 2023-08-31T21:48:03Z

Thank you @tgaddair, I did not know about this awesome command :)

With the fixed version, the error "hdf5 is not supported..." no longer shows up 👍

However, a different error is raised:
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

I guess it is not related to this issue?

Full trace:

Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 191, in main
    CLI()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 71, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 116, in visualize
    visualize.cli(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 4175, in cli
    vis_func(**vars(args))
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 475, in compare_classifiers_performance_from_pred_cli
    predictions_per_model = _get_cols_from_predictions(predictions, [col], metadata)
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 305, in _get_cols_from_predictions
    pred_df = pd.read_parquet(predictions_path)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 503, in read_parquet
    return impl.read(
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 251, in read
    result = self.api.parquet.read_table(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
    dataset = _ParquetDatasetV2(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/__init__.py", line 2368, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 898, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status

tgaddair · 2023-09-03T17:04:26Z

Hey @iflow, for the --predictions, can you try using the parquet file generated by Ludwig instead of the CSV? There should be a file called something like predictions_20230827_183245.parquet in the same folder.
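
The traceback above shows the predictions path being handed to `pd.read_parquet`, so a CSV file fails with the "magic bytes not found" error. As a quick sanity check before running the visualization, the sketch below tests whether a file looks like Parquet. It relies on the fact that an unencrypted Parquet file starts and ends with the 4-byte marker `PAR1`; the helper name `looks_like_parquet` is hypothetical and not part of Ludwig, pandas, or pyarrow.

```python
import os

# Unencrypted Parquet files begin and end with this 4-byte marker.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file at `path` has Parquet magic bytes at both ends."""
    # Too small to hold both the leading and trailing marker.
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

If `looks_like_parquet("predictions_20230827_183245.csv")` returns False, that is consistent with the error above, and the `.parquet` file from the same folder is the one to pass to `--predictions`.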

arnavgarg1 linked a pull request Aug 30, 2023 that will close this issue

Fixed ground truth formats to include hdf5 #3557

Open

arnavgarg1 assigned tgaddair Aug 30, 2023
