Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: fit_surrogate CSV handling issues in DeepHyper #187

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

evvaletov
Copy link

@evvaletov evvaletov commented May 18, 2023

This pull request addresses issues described in bug #186 related to the fit_surrogate method in DeepHyper. The specific issues addressed include:

  1. Categorical hyperparameters, represented as numbers, are now correctly read as categorical values rather than numeric values, provided that a list of hyperparameter ConfigSpace types or the context YAML filename is passed to fit_surrogate as an optional argument.
  2. Rows with missing values or incorrect number of columns are properly handled now to avoid inaccurate modeling or data misalignment.
  3. When multiple results CSV files are concatenated, duplicate header rows are now ignored to prevent incorrect data ingestion.

With these fixes, users can expect fit_surrogate to correctly interpret and handle their CSV files, thereby improving the reliability and accuracy of the method.

Please review the changes and let me know if there are any questions or concerns.

Resolves #186

@evvaletov evvaletov changed the base branch from master to develop May 18, 2023 15:49
@evvaletov evvaletov marked this pull request as draft May 18, 2023 15:52
@evvaletov evvaletov marked this pull request as ready for review May 18, 2023 19:21
@evvaletov evvaletov marked this pull request as draft May 18, 2023 19:21
@evvaletov evvaletov marked this pull request as ready for review May 18, 2023 21:18
@evvaletov evvaletov changed the title Fix: fit_surrogate CSV handling issues in DeepHyperDevelop Fix: fit_surrogate CSV handling issues in DeepHyper May 19, 2023
@Deathn0t
Copy link
Member

Hello @evvaletov,

Thanks for submitting the PR! I started commenting on some minor aspects.

I think it would be great to complement the tests, to check for the different cases you mentioned in #186, see tests here. Also, it can help us provide better documentation.

My biggest concern is the context_yaml_file_or_datatypes, which I don't understand well as the variable types are accessible in self._problem.space. Could you explain?

@evvaletov
Copy link
Author

evvaletov commented May 19, 2023

Hi @Deathn0t,

It sounds good about complementing the tests.

The reason I added the context_yaml_file_or_datatypes argument is that I use categorical hyperparameters which are read as numeric values by Pandas, causing a check in check_x_in_space to fail with a ValueError. So, when ingesting a checkpoint, it is checked and not assumed that the hyperparameters in the checkpoint satisfy the hyperparameter specifications in the currently running optimization. The hyperparameter types are not specified in the CSV file, but they are listed in context.yaml.

However, it also makes sense and would be more convenient for the user to get the variable types from self._problem.space and to use them to read the CSV file. I haven't thought about this option. Would you like to go with this option instead?

As a third possibility, _cbo.py could use self._problem.space for this purpose by default but also allow the user to specify a context_yaml_file_or_datatypes argument. However, I could not think of a use case for this option at the moment.

@Deathn0t
Copy link
Member

Yes, I think the best option is to use self._problem.space.

@Deathn0t
Copy link
Member

Hi @evvaletov, let me know if this is good for you or if you need help. Here is an example code using the self._problem.spaceobject (which is a ConfigurationSpace).

@evvaletov
Copy link
Author

Hi @Deathn0t , this does sound good to me, but I am at a conference at the moment and will have to catch up on other things when I return before returning to this pull request.

raise ValueError("Provided object is not a pandas DataFrame")

# Check if objective columns exist
if "objective" not in df.columns and not any(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since python 3.8, this could be simplified as

if not (objective_columns := df.filter(regex=r"^objective(?:_\d+)?$").columns):
    raise ValueError(...)

raise ValueError("Objective column(s) missing from DataFrame")

# Convert objective columns to numeric if they're not
if "objective" in df.columns:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can get rid of the if else here using

for column in objective_columns:
    df[column] = ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Inaccurate Reading of CSV Files by fit_surrogate in Certain Conditions
3 participants