compare_classifiers_performance_from_pred - hdf5 is not supported for ground truth file #3550

Open
iflow opened this issue Aug 27, 2023 · 6 comments · May be fixed by #3557

@iflow

iflow commented Aug 27, 2023

Describe the bug
The visualization compare_classifiers_performance_from_pred does not work, because the following error is raised:
ValueError: hdf5 is not supported for ground truth file, valid types are {'stata', 'dataframe', <class 'dask.dataframe.core.DataFrame'>, 'html', 'df', 'tsv', 'json', 'jsonl', <class 'pandas.core.frame.DataFrame'>, 'orc', 'parquet', 'sas', 'fwf', 'feather', 'csv', 'spss', 'excel', 'pickle'}
According to the documentation, the ground_truth parameter should be the name of the HDF5 file obtained during training preprocessing.
Documentation: https://ludwig.ai/latest/user_guide/visualizations/#compare_classifiers_performance_from_pred

To Reproduce
Steps to reproduce the behavior:

  1. Go to Google Colab and generate some training + prediction data.
  2. Generate a visualization:
!ludwig visualize --visualization compare_classifiers_performance_from_pred \
  --predictions predictions_20230827_183245.csv \
  --ground_truth train.hdf5 \
  --ground_truth_metadata 1dbf206244e911ee93d40242ac1c000c.meta.json \
  --output_feature_name MyTarget
  3. See error
Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 191, in main
    CLI()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 71, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 116, in visualize
    visualize.cli(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 4172, in cli
    vis_func(**vars(args))
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 469, in compare_classifiers_performance_from_pred_cli
    ground_truth = _extract_ground_truth_values(ground_truth, output_feature_name, ground_truth_split, split_file)
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 264, in _extract_ground_truth_values
    ground_truth_df = _get_ground_truth_df(ground_truth) if isinstance(ground_truth, str) else ground_truth
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 233, in _get_ground_truth_df
    raise ValueError(
ValueError: hdf5 is not supported for ground truth file, valid types are {'stata', 'dataframe', <class 'dask.dataframe.core.DataFrame'>, 'html', 'df', 'tsv', 'json', 'jsonl', <class 'pandas.core.frame.DataFrame'>, 'orc', 'parquet', 'sas', 'fwf', 'feather', 'csv', 'spss', 'excel', 'pickle'}

Expected behavior
Plots like those shown in the documentation.

Environment:

  • OS: Google Colab - Linux Ubuntu
  • Python version: 3.10
  • Ludwig version: 0.8.1.post1

Additional context

I looked for a bug in ludwig/utils/data_utils.py, but it looks fine. I also tried calling compare_classifiers_performance_from_pred_cli directly from a Jupyter Notebook, but the same error is raised there.

@tgaddair
Collaborator

Hey @iflow, thanks for reporting this issue! It looks like somehow we excluded HDF5 from the list of valid file formats in this check. Can you try running with the changes in #3557 and let me know if that addresses the issue?
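
For readers following along, the kind of check that produces this error can be sketched roughly as below. This is an illustrative stand-in, not Ludwig's actual code: the names FORMATS_WITHOUT_HDF5, FORMATS_WITH_HDF5, EXTENSION_TO_FORMAT, infer_format and check_ground_truth_format are all hypothetical, and the format set is abbreviated from the error message above. It only shows how a format missing from an allow-list raises the ValueError, and how adding hdf5 to the list lets the same file through.

```python
import os

# Hypothetical allow-lists, abbreviated from the error message in this issue.
FORMATS_WITHOUT_HDF5 = {"csv", "tsv", "json", "jsonl", "parquet", "feather", "pickle"}
FORMATS_WITH_HDF5 = FORMATS_WITHOUT_HDF5 | {"hdf5"}

# Hypothetical mapping from file extension to format name.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".feather": "feather",
    ".pickle": "pickle",
    ".hdf5": "hdf5",
    ".h5": "hdf5",
}


def infer_format(path):
    """Guess the data format from the file extension (None if unknown)."""
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_TO_FORMAT.get(ext)


def check_ground_truth_format(path, supported):
    """Return the inferred format, or raise if it is not in the allow-list."""
    fmt = infer_format(path)
    if fmt not in supported:
        raise ValueError(
            f"{fmt} is not supported for ground truth file, "
            f"valid types are {sorted(supported)}"
        )
    return fmt
```

With the first allow-list, check_ground_truth_format("train.hdf5", FORMATS_WITHOUT_HDF5) raises a ValueError like the one reported; with FORMATS_WITH_HDF5 the same call returns "hdf5".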

arnavgarg1 linked a pull request on Aug 30, 2023 that will close this issue
@arnavgarg1
Contributor

Hi @iflow, whenever you can confirm that this fixes the issue, we're good to merge our fix in!

@iflow
Author

iflow commented Aug 31, 2023

Thanks for the quick fix! Unfortunately, I haven't been able to try it yet because I installed the library in Google Colab, so I would have to set everything up on my local machine, which might take some time.

@tgaddair
Collaborator

Hey @iflow, in Colab you can test out the branch by installing Ludwig like this:

!pip install "git+https://github.com/ludwig-ai/ludwig.git@fix-gt-formats#egg=ludwig[llm]" --quiet

@iflow
Author

iflow commented Aug 31, 2023

Thank you @tgaddair, I did not know about this awesome command :)

With the fixed version, the error "hdf5 is not supported..." no longer shows up 👍

However, a different error is raised:
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

I guess it is not related to this issue?

Full trace:

Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 191, in main
    CLI()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 71, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 116, in visualize
    visualize.cli(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 4175, in cli
    vis_func(**vars(args))
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 475, in compare_classifiers_performance_from_pred_cli
    predictions_per_model = _get_cols_from_predictions(predictions, [col], metadata)
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 305, in _get_cols_from_predictions
    pred_df = pd.read_parquet(predictions_path)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 503, in read_parquet
    return impl.read(
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 251, in read
    result = self.api.parquet.read_table(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
    dataset = _ParquetDatasetV2(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/__init__.py", line 2368, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 898, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status

@tgaddair
Collaborator

tgaddair commented Sep 3, 2023

Hey @iflow, for --predictions, can you try using the parquet file generated by Ludwig instead of the CSV? There should be a file called something like predictions_20230827_183245.parquet in the same folder.
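
As a quick way to confirm which predictions file is actually parquet before passing it to --predictions, a small check along these lines may help. It is a minimal sketch, not part of Ludwig: the helper name looks_like_parquet is hypothetical. It relies on the fact that an unencrypted Parquet file begins and ends with the 4-byte marker PAR1, which is what the "Parquet magic bytes not found in footer" error refers to.

```python
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper: it only inspects the 4-byte markers and does not
    validate the rest of the file.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)  # jump to the end to measure the file size
        size = f.tell()
        # Header magic + footer length + footer magic need at least 12 bytes.
        if size < 12:
            return False
        f.seek(-4, 2)  # last four bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A CSV file such as predictions_20230827_183245.csv would return False here, which matches the pyarrow error in the trace above.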
