[Question] How do I fix this issue? #1716

Open
jordannelson0 opened this issue Dec 28, 2023 · 20 comments

@jordannelson0

Here is my code:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

dataframe = read_csv("Spy.csv", skiprows=0)
dataset = dataframe.values
x = dataset[:, 0:9503]
y = dataset[:, 9503]
print(dataset)

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# define search
model = AutoSklearnClassifier(ensemble_kwargs={'ensemble_size': 1},
							  initial_configurations_via_metalearning=0,
							  memory_limit=2000,
							  time_left_for_this_task=10 * 60,
							  per_run_time_limit=60,
							  n_jobs=24)
# perform the search
model.fit(x_train, y_train)

# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(x_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

Here is the warning I'm receiving:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 961, in fit
    self._logger.exception(e)
AttributeError: 'NoneType' object has no attribute 'exception'

During handling of the above exception, another exception occurred:

Followed by:

File "/home/jordan/Documents/Brighton_University/auto_scikit.py", line 23, in <module>
    model.fit(x_train, y_train)
AttributeError: 'NoneType' object has no attribute 'info'

I have no idea how to fix this. I have been looking for hours and trying different things, even changing datasets, and nothing has worked. Can anyone help, preferably with code snippets?

Expected behaviour

For it to run as normal

Environment and installation:

Please give details about your installation:

  • OS: Ubuntu 22.04.3 LTS
  • Pycharm IDE
  • Python version: 3.10.12
  • Auto-sklearn version 0.15.0
@eddiebergman
Contributor

Hi @jordannelson0,

Have you tried putting your code in an if __name__ == '__main__': block? This is required when using multiple processes on Windows, and nothing can be done about that.

@jordannelson0
Author

Hi @jordannelson0,

Have you tried putting your code in an if __name__ == '__main__': block? This is required when using multiple processes on Windows, and nothing can be done about that.

I'm not on Windows.

@eddiebergman
Contributor

Oh sorry, that initial error looks very much like a Windows one, i.e. based on this:

This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

By default, we use forkserver for spawning new processes, which is almost identical to using fork. This is the default on Linux, whereas on Windows it would have to use spawn, hence my assumption that you were using Windows (sorry for not seeing the bottom part). Does running a simple example, with all of auto-sklearn's defaults, work?
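For anyone landing here with the same error, here is a minimal sketch of the main-guard idiom using only the standard library. It assumes nothing about auto-sklearn's internals; the names `square`, `report` and `main` are hypothetical. As the `_fixup_main_from_path` frames in the traceback later in this thread show, a forkserver child re-imports the main script, so code that starts processes has to sit behind the guard.

```python
import multiprocessing as mp


def square(n):
    """Pure helper so the child has something to compute."""
    return n * n


def report(n):
    """Runs inside the child process."""
    print(f"child computed {square(n)}")


def main():
    # Prefer forkserver; fall back to spawn where forkserver is
    # unavailable (for example on Windows). Both re-import the main script.
    methods = mp.get_all_start_methods()
    method = "forkserver" if "forkserver" in methods else "spawn"
    ctx = mp.get_context(method)
    child = ctx.Process(target=report, args=(4,))
    child.start()
    child.join(timeout=30)
    print(f"start method: {method}, exit code: {child.exitcode}")


# Without this guard, the re-import in the child would reach the
# process-starting code again and raise the bootstrapping RuntimeError.
if __name__ == "__main__":
    main()
```

Moving the body of a script into a function and calling it only under the guard is usually the smallest change that keeps module-level imports and definitions available to the child.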

@jordannelson0
Author

Auto-SKL works fine using the datasets the API has integrated. But not with this dataset.

@jordannelson0
Author

The dataset itself, while large, is extremely clean. Using standard scikit-learn or Keras, for example, you can expect results close to 100% (accuracy metric), which is a testament to the fidelity of the dataset. So despite its size, I don't consider the dataset an issue.

@jordannelson0
Author

Using all defaults returns the same error(s)

@eddiebergman
Contributor

My best advice is to subsample 100 rows or so and see whether that still causes the issue, and if so, subsample down to 50, and so on.

If you can construct artificial data that causes this issue then maybe I can help, but otherwise it seems like it's dataset related. There's not much I can go off of based on what's provided.
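The halving idea above could be sketched roughly like this. It is only an illustration: `smallest_failing_prefix` and the `fails` callback are hypothetical names, and in practice `fails` would wrap the fit call in a try/except and return True when it raises.

```python
from typing import Callable, Optional, Sequence


def smallest_failing_prefix(
    rows: Sequence, fails: Callable[[Sequence], bool], start: int = 100
) -> Optional[int]:
    """Halve the sample size for as long as the failure still reproduces.

    Returns the smallest size tried that still failed, or None if the
    first sample of `start` rows did not fail at all.
    """
    size = min(start, len(rows))
    smallest = None
    while size >= 1 and fails(rows[:size]):
        smallest = size
        if size == 1:
            break
        size //= 2
    return smallest


# Toy stand-in for "the fit crashes": fails whenever 10 or more rows are used.
rows = list(range(1000))
print(smallest_failing_prefix(rows, lambda sample: len(sample) >= 10))
```

If the failure reproduces all the way down to a single row, the data size is unlikely to be the cause and the problem is more likely in the setup.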

This part of the traceback:

Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 961, in fit
    self._logger.exception(e)
AttributeError: 'NoneType' object has no attribute 'exception'

Is just due to the __del__ part of autosklearn and some odd choices of the logging system. However it seems to be caused by the first error.

Just to be clear, have you tried using the if __name__ == "__main__" block?

@jordannelson0
Author

Regarding your last comment, I haven't tried. I'm sorry to admit I'm overloaded with other work at the moment (I'm doing a PhD). If you have time and are kind enough to provide me with some code samples to copy, paste and test, I'd be more than willing.

@eddiebergman
Contributor

I took your sample and added the small bit to take 100 samples. If you can provide the prints, that might help.

Don't worry, I also work in a research lab and understand it can be busy. Let me know when you can try it

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

dataframe = read_csv("Spy.csv", skiprows=0)
dataset = dataframe.values

N_SAMPLES = 100
x = dataset[:N_SAMPLES, 0:9503]
y = dataset[:N_SAMPLES, 9503]
print(x, y)
print(x.dtype, y.dtype)

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# define search
model = AutoSklearnClassifier(ensemble_kwargs={'ensemble_size': 1},
							  initial_configurations_via_metalearning=0,
							  memory_limit=2000,
							  time_left_for_this_task=10 * 60,
							  per_run_time_limit=60,
							  n_jobs=24)
# perform the search
model.fit(x_train, y_train)

# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(x_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

@jordannelson0
Author

Thanks, I'll try this tomorrow and get back to you. I'm in GMT timezone. For reference Thursday 4th Jan GMT.

@jordannelson0
Author

`[[ 6 0 0 ... 0 0 0]
[304 0 0 ... 0 0 0]
[224 0 0 ... 0 0 0]
...
[ 1 0 0 ... 0 0 0]
[304 0 0 ... 0 0 0]
[ 3 0 0 ... 0 0 0]] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
int64 int64
[[ 6 0 0 ... 0 0 0]
[304 0 0 ... 0 0 0]
[224 0 0 ... 0 0 0]
...
[ 1 0 0 ... 0 0 0]
[304 0 0 ... 0 0 0]
[ 3 0 0 ... 0 0 0]] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
int64 int64
Traceback (most recent call last):
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 634, in fit
self._logger = self._get_logger(dataset_name)
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 390, in _get_logger
self.logging_server.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_forkserver.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 961, in fit
self._logger.exception(e)
AttributeError: 'NoneType' object has no attribute 'exception'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
code = _serve_one(child_r, fds,
File "/usr/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
code = spawn._main(child_r, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/jordan/Documents/Brighton_University/New_Idea/auto_scikit.py", line 26, in <module>
model.fit(x_train, y_train)
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 1448, in fit
super().fit(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 540, in fit
self.automl_.fit(load_models=self.load_models, **kwargs)
File "/home/jordan/Documents/Brighton_University/PhD/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2304, in fit
return super().fit(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 964, in fit
self._fit_cleanup()
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 1064, in _fit_cleanup
self._logger.info("Closing the dask infrastructure")
AttributeError: 'NoneType' object has no attribute 'info'
`
Hi, here's the full output after trying to run with 100 samples.

@jordannelson0
Author

I ran this with sample sizes of 100, 50, 10 and 5. Same output each time.

@jordannelson0
Author

I also ran this with an alternate dataset which has the same datatypes and properties: a label in the final column, used for binary classification. Both datasets come from a cyber security context relating to malware on the Android platform. Each column represents a permission an app does or doesn't have access to, with 1 representing access to that permission and 0 the opposite. The final label column is 1 for a malicious application and 0 for a non-malicious one. I hope this provides some insight into the datasets I'm using.

@eddiebergman
Contributor

And this? With the main guard included?

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

if __name__ == "__main__":
	dataframe = read_csv("Spy.csv", skiprows=0)
	dataset = dataframe.values
	
	N_SAMPLES = 100
	x = dataset[:N_SAMPLES, 0:9503]
	y = dataset[:N_SAMPLES, 9503]
	print(x, y)
	print(x.dtype, y.dtype)
	
	# Split the dataset into training and testing sets
	x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
	
	# define search
	model = AutoSklearnClassifier(ensemble_kwargs={'ensemble_size': 1},
								  initial_configurations_via_metalearning=0,
								  memory_limit=2000,
								  time_left_for_this_task=10 * 60,
								  per_run_time_limit=60,
								  n_jobs=24)
	# perform the search
	model.fit(x_train, y_train)
	
	# summarize
	print(model.sprint_statistics())
	# evaluate best model
	y_hat = model.predict(x_test)
	acc = accuracy_score(y_test, y_hat)
	print("Accuracy: %.3f" % acc)

@jordannelson0
Author

`[[ 6 0 0 ... 0 0 0]
[304 0 0 ... 0 0 0]
[224 0 0 ... 0 0 0]
...
[ 1 0 0 ... 0 0 0]
[304 0 0 ... 0 0 0]
[ 3 0 0 ... 0 0 0]] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
int64 int64
[ERROR] [2024-01-04 15:29:19,768:Client-AutoML(1):016542bd-ab16-11ee-a0b9-0ddc104e5793] (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 2000 MB).', 'configuration_origin': 'DUMMY'}.",)
[ERROR] [2024-01-04 15:29:19,768:Client-AutoML(1):016542bd-ab16-11ee-a0b9-0ddc104e5793] (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 2000 MB).', 'configuration_origin': 'DUMMY'}.",)
Traceback (most recent call last):
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 765, in fit
self._do_dummy_prediction()
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 489, in _do_dummy_prediction
raise ValueError(msg)
ValueError: (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 2000 MB).', 'configuration_origin': 'DUMMY'}.",)
Traceback (most recent call last):
File "/home/jordan/Documents/Brighton_University/New_Idea/auto_scikit.py", line 27, in <module>
model.fit(x_train, y_train)
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 1448, in fit
super().fit(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 540, in fit
self.automl_.fit(load_models=self.load_models, **kwargs)
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2304, in fit
return super().fit(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 962, in fit
raise e
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 765, in fit
self._do_dummy_prediction()
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 489, in _do_dummy_prediction
raise ValueError(msg)
ValueError: (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 2000 MB).', 'configuration_origin': 'DUMMY'}.",)

Process finished with exit code 1
`

@eddiebergman
Contributor

eddiebergman commented Jan 4, 2024

Okay, so that's a lot more helpful of an error. My guess is that since you have 9000+ features and they are all integers, autosklearn is trying to one-hot encode them. This effectively adds X new columns per column, where X is the number of unique integer values in that column. Multiply that by ~9000 and it's likely the dataset size explodes.
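To make that arithmetic concrete, here is a rough sketch in plain Python. Whether auto-sklearn really one-hot encodes these columns is the guess above, not something this code checks; the function names and the toy rows are made up for illustration.

```python
def one_hot_width(rows):
    """Number of columns after one-hot encoding every column of `rows`."""
    columns = zip(*rows)  # transpose: iterate column by column
    return sum(len(set(column)) for column in columns)


def dense_megabytes(n_rows, n_columns, bytes_per_value=8):
    """Size in MB of a dense array of 8-byte values (int64 or float64)."""
    return n_rows * n_columns * bytes_per_value / 1_000_000


# Toy rows, not taken from Spy.csv.
sample = [
    [6, 0, 0],
    [304, 0, 1],
    [224, 0, 0],
]
print(one_hot_width(sample))       # 3 + 1 + 2 distinct values per column
print(dense_megabytes(100, 9503))  # the 100-row sample before any encoding
```

With 100 rows, no column can have more than 100 distinct values, so the encoded width is at most 100 × 9503 = 950,300 columns; a dense 100-row array of that width would be about 760 MB, which is an upper bound rather than a measurement of what actually happened here.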

Estimators like a histogram gradient boosting classifier do not really care about one-hot encoded variables, while something like an MLP will. The only thing I can suggest is to try disabling "data_preprocessor" with the exclude parameter, since your data is already pretty clean. If you need to do some data preprocessing, I would suggest doing it manually before AutoSklearn.

exclude : Optional[Dict[str, List[str]]] = None
If None, all possible algorithms are used.
Otherwise, specifies a step and the components that are excluded from search.
See ``/pipeline/components/<step>/*`` for available components.
Incompatible with parameter ``include``.
**Possible Steps**:
* ``"data_preprocessor"``

Another alternative may be to convert the data to float dtypes, as autosklearn then won't try to one-hot encode them, but I do not know your data or whether these values represent categoricals.

@jordannelson0
Author

`from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

if __name__ == "__main__":
    dataframe = read_csv("Spy.csv", skiprows=0)
    dataset = dataframe.values

    N_SAMPLES = 100
    x = dataset[:N_SAMPLES, 0:9503]
    y = dataset[:N_SAMPLES, 9503]
    print(x, y)
    print(x.dtype, y.dtype)

    # Split the dataset into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    # define search
    model = AutoSklearnClassifier(ensemble_kwargs={'ensemble_size': 1},
                                  initial_configurations_via_metalearning=0,
                                  memory_limit=2000,
                                  time_left_for_this_task=10 * 60,
                                  per_run_time_limit=60,
                                  n_jobs=24,
                                  exclude={
                                      'data_preprocessor': ['feature_type']
                                  })
    # perform the search
    model.fit(x_train, y_train)

    # summarize
    print(model.sprint_statistics())
    # evaluate best model
    y_hat = model.predict(x_test)
    acc = accuracy_score(y_test, y_hat)
    print("Accuracy: %.3f" % acc)

And got: [ERROR] [2024-01-04 16:52:46,496:Client-AutoML(1):aef2d3dd-ab21-11ee-adc1-0ddc104e5793] No valid pipeline found.
Traceback (most recent call last):
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 751, in fit
self.configuration_space, configspace_path = self._create_search_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2252, in _create_search_space
configuration_space = pipeline.get_configuration_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/util/pipeline.py", line 53, in get_configuration_space
return _get_classification_configuration_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/util/pipeline.py", line 155, in _get_classification_configuration_space
return SimpleClassificationPipeline(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/classification.py", line 88, in __init__
super().__init__(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 66, in __init__
self.config_space = self.get_hyperparameter_search_space(feat_type=feat_type)
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 276, in get_hyperparameter_search_space
self.config_space = self._get_hyperparameter_search_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/classification.py", line 206, in _get_hyperparameter_search_space
cs = self._get_base_search_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 384, in _get_base_search_space
assert np.sum(matches) != 0, "No valid pipeline found."
AssertionError: No valid pipeline found.
Traceback (most recent call last):
File "/home/jordan/Documents/Brighton_University/New_Idea/auto_scikit.py", line 30, in <module>
model.fit(x_train, y_train)
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 1448, in fit
super().fit(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 540, in fit
self.automl_.fit(load_models=self.load_models, **kwargs)
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2304, in fit
return super().fit(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 962, in fit
raise e
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 751, in fit
self.configuration_space, configspace_path = self._create_search_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2252, in _create_search_space
configuration_space = pipeline.get_configuration_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/util/pipeline.py", line 53, in get_configuration_space
return _get_classification_configuration_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/util/pipeline.py", line 155, in _get_classification_configuration_space
return SimpleClassificationPipeline(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/classification.py", line 88, in __init__
super().__init__(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 66, in __init__
self.config_space = self.get_hyperparameter_search_space(feat_type=feat_type)
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 276, in get_hyperparameter_search_space
self.config_space = self._get_hyperparameter_search_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/classification.py", line 206, in _get_hyperparameter_search_space
cs = self._get_base_search_space(
File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 384, in _get_base_search_space
assert np.sum(matches) != 0, "No valid pipeline found."
AssertionError: No valid pipeline found.`

@eddiebergman
Contributor

@jordannelson0
Author

I tried, the memory issue persisted unfortunately

@jordannelson0
Author

from typing import Optional
from pprint import pprint

import autosklearn.classification
import autosklearn.pipeline.components.data_preprocessing
import sklearn.metrics
from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.askl_typing import FEAT_TYPE_TYPE
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT
from pandas import read_csv
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, **kwargs):
        """This preprocessor does not change the data"""
        # Some internal checks make sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        return ConfigurationSpace()  # Return an empty configuration as there is None


# Add NoPreprocessing component to auto-sklearn.
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

dataframe = read_csv("adware1.csv", skiprows=0)
dataset = dataframe.values

N_SAMPLES = 100
x = dataset[:, 0:440]
y = dataset[:, 440]
print(x, y)
print(x.dtype, y.dtype)

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

clf = autosklearn.classification.AutoSklearnClassifier(
    ensemble_kwargs={'ensemble_size': 0},
    time_left_for_this_task=30*60,
    include={"data_preprocessor": ["NoPreprocessing"]},
    # Below two flags are provided to speed up calculations
    # Not recommended for a real implementation
    initial_configurations_via_metalearning=0,
    per_run_time_limit=60,
)
clf.fit(x_train, y_train)

# To check that models were found without issue when running examples
assert len(clf.get_models_with_weights()) > 0
print(clf.sprint_statistics())

# summarize
print(clf.sprint_statistics())

# evaluate best model
y_hat = clf.predict(x_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)
pprint(clf.show_models())

I do have this example working with a different smaller dataset.
