Fix: fit_surrogate CSV handling issues in DeepHyper #187

evvaletov · 2023-05-18T15:32:19Z

This pull request addresses issues described in bug #186 related to the fit_surrogate method in DeepHyper. The specific issues addressed include:

Categorical hyperparameters, represented as numbers, are now correctly read as categorical values rather than numeric values, provided that a list of hyperparameter ConfigSpace types or the context YAML filename is passed to fit_surrogate as an optional argument.
Rows with missing values or an incorrect number of columns are now handled properly, avoiding inaccurate modeling and data misalignment.
When multiple results CSV files are concatenated, duplicate header rows are now ignored to prevent incorrect data ingestion.

With these fixes, users can expect fit_surrogate to correctly interpret and handle their CSV files, thereby improving the reliability and accuracy of the method.
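
As a rough illustration of the row-level behaviors listed above, the following sketch uses pandas directly. It is not the code from this pull request: the function name clean_results_csv, the sample column names, and the policy of dropping (rather than repairing) incomplete rows are all assumptions made for the example.

```python
import io

import pandas as pd


def clean_results_csv(text: str) -> pd.DataFrame:
    """Read concatenated results CSV text and drop unusable rows (illustrative only)."""
    # Read every column as a string so that repeated header rows do not
    # disturb type inference; rows with too many fields are skipped.
    df = pd.read_csv(io.StringIO(text), dtype=str, on_bad_lines="skip")

    # A repeated header row has every cell equal to its own column name.
    header = pd.Series(df.columns, index=df.columns)
    is_header = df.eq(header, axis="columns").all(axis="columns")
    df = df[~is_header]

    # Rows with too few fields are padded with NaN by the parser; drop them.
    return df.dropna().reset_index(drop=True)


# Two result files concatenated, plus one short row and one long row.
sample = (
    "units,activation,objective\n"
    "16,relu,0.5\n"
    "32,tanh\n"
    "units,activation,objective\n"
    "64,relu,0.7,extra\n"
    "128,tanh,0.9\n"
)
cleaned = clean_results_csv(sample)
```

Reading everything as strings first keeps the header comparison simple; numeric columns would still need to be converted afterwards.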

Please review the changes and let me know if there are any questions or concerns.

Resolves #186

Deathn0t · 2023-05-19T09:19:18Z

Hello @evvaletov,

Thanks for submitting the PR! I started commenting on some minor aspects.

I think it would be great to complement the tests, to check for the different cases you mentioned in #186, see tests here. Also, it can help us provide better documentation.

My biggest concern is the context_yaml_file_or_datatypes, which I don't understand well as the variable types are accessible in self._problem.space. Could you explain?

evvaletov · 2023-05-19T09:58:12Z

Hi @Deathn0t,

Complementing the tests sounds good.

The reason I added the context_yaml_file_or_datatypes argument is that I use categorical hyperparameters which are read as numeric values by Pandas, causing a check in check_x_in_space to fail with a ValueError. In other words, when a checkpoint is ingested, its hyperparameters are checked against the hyperparameter specifications of the currently running optimization rather than assumed to satisfy them. The hyperparameter types are not specified in the CSV file, but they are listed in context.yaml.
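
The type inference described here can be reproduced with pandas alone. The snippet below is a minimal sketch under assumed names: the column batch_size and its values are hypothetical, and whether the search space expects strings or some other representation depends on how the categorical choices were declared.

```python
import io

import pandas as pd

# Hypothetical results file: "batch_size" is meant to be categorical
# (a choice among a few values), but its values look numeric.
csv_text = "batch_size,learning_rate,objective\n16,0.01,0.81\n64,0.10,0.77\n"

# Default parsing infers an integer dtype for the categorical column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Passing an explicit dtype keeps the categorical values as strings,
# while the remaining columns are still inferred as usual.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"batch_size": str})
```

Because the CSV file carries no type information, the column types have to come from somewhere else, which is why an external source of types is needed at all.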

However, it also makes sense, and would be more convenient for the user, to get the variable types from self._problem.space and use them to read the CSV file. I hadn't thought of this option. Would you like to go with it instead?

As a third possibility, _cbo.py could use self._problem.space for this purpose by default but also allow the user to specify a context_yaml_file_or_datatypes argument. However, I could not think of a use case for this option at the moment.

Deathn0t · 2023-05-22T19:11:26Z

Yes, I think the best option is to use self._problem.space.

Deathn0t · 2023-05-30T07:37:28Z

Hi @evvaletov, let me know if this is good for you or if you need help. Here is example code using the self._problem.space object (which is a ConfigurationSpace).

evvaletov · 2023-05-30T08:15:07Z

Hi @Deathn0t, this does sound good to me, but I am at a conference at the moment and will need to catch up on other things when I get back before returning to this pull request.

arnold-jr · 2024-05-24T13:37:06Z

deephyper/search/hps/_cbo.py

+ raise ValueError("Provided object is not a pandas DataFrame")
+
+ # Check if objective columns exist
+ if "objective" not in df.columns and not any(


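Since Python 3.8, this could be simplified with an assignment expression (testing .empty, because a pandas Index has no unambiguous truth value):

if (objective_columns := df.filter(regex=r"^objective(?:_\d+)?$").columns).empty: raise ValueError(...)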
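
A runnable variant of this suggestion might look like the sketch below. The helper name find_objective_columns is hypothetical, and the error message is copied from the diff context in this review rather than from the final code. Note that the emptiness test uses .empty, since calling bool() on a pandas Index raises a ValueError of its own.

```python
import pandas as pd


def find_objective_columns(df: pd.DataFrame) -> pd.Index:
    """Return columns named 'objective' or 'objective_<n>' (hypothetical helper)."""
    # DataFrame.filter(regex=...) keeps the columns whose labels match the
    # pattern; the walrus operator binds the result inside the condition.
    if (columns := df.filter(regex=r"^objective(?:_\d+)?$").columns).empty:
        raise ValueError("Objective column(s) missing from DataFrame")
    return columns


# "objectives" does not match because the pattern is anchored at both ends.
multi = pd.DataFrame(columns=["x", "objective_0", "objective_1", "objectives"])
found = find_objective_columns(multi)

# A frame without any objective column triggers the error.
try:
    find_objective_columns(pd.DataFrame(columns=["x", "y"]))
    missing_raised = False
except ValueError:
    missing_raised = True
```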

arnold-jr · 2024-05-24T13:39:12Z

deephyper/search/hps/_cbo.py

+ raise ValueError("Objective column(s) missing from DataFrame")
+
+ # Convert objective columns to numeric if they're not
+ if "objective" in df.columns:


The if/else here can be removed using

for column in objective_columns: df[column] = ...
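
Spelled out, the loop could look like the following sketch. The right-hand side is elided in the comment above, so the use of pd.to_numeric with errors="coerce" is an assumption, as are the sample column names and values.

```python
import pandas as pd

# Hypothetical data: one objective column stored as strings with a
# non-numeric entry, and one that is already numeric.
df = pd.DataFrame(
    {
        "x": [1, 2, 3],
        "objective_0": ["0.5", "failed", "0.7"],
        "objective_1": [1, 2, 3],
    }
)

objective_columns = df.filter(regex=r"^objective(?:_\d+)?$").columns

# One loop handles both the single-objective and multi-objective cases.
for column in objective_columns:
    # errors="coerce" turns unparsable entries into NaN (assumed policy).
    df[column] = pd.to_numeric(df[column], errors="coerce")
```

Because the regular expression also matches a plain "objective" column, the same loop covers the case that the original if branch handled separately.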

Eremey Valetov added 2 commits May 18, 2023 08:19

Improved ValueError messages in check_x_in_space function

dc31b09

Fix CSV handling in fit_surrogate and add clean_dataframe function

2e0f954

evvaletov changed the base branch from master to develop May 18, 2023 15:49

evvaletov marked this pull request as draft May 18, 2023 15:52

Enhance to accept hyperparameter types as a list

2785d0e

evvaletov marked this pull request as ready for review May 18, 2023 19:21

evvaletov marked this pull request as draft May 18, 2023 19:21

evvaletov marked this pull request as ready for review May 18, 2023 21:18

Remove an unnecessary check from clean_dataframe

8c81830

evvaletov changed the title ~~Fix: fit_surrogate CSV handling issues in DeepHyperDevelop~~ Fix: fit_surrogate CSV handling issues in DeepHyper May 19, 2023

Eremey Valetov added 2 commits May 19, 2023 05:41

Revise comments from recent commits

0626fd4

Update fit_surrogate function documentation

8d1bc60

arnold-jr reviewed May 24, 2024
