-
Notifications
You must be signed in to change notification settings - Fork 853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pecos #2153
base: master
Are you sure you want to change the base?
Pecos #2153
Conversation
Job PR-2153-b18bf2f is done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution, very nice! I added some comments. Once they are addressed it should be ready for benchmarking.
def __init__(self, model_type = "XRLinear", workdir = None, model_dir = None, | ||
cat_features = None, text_features = None, num_features = None, **kwargs): | ||
super().__init__(**kwargs) | ||
|
||
# Create directory to house model artifacts | ||
run_id = str(uuid.uuid4())[:10] | ||
if model_dir is None: | ||
self.model_dir = pathlib.Path(f'./pecos-workdir/{run_id}/model') | ||
else: | ||
self.model_dir = pathlib.Path(model_dir) | ||
self.model_dir.mkdir(parents=True, exist_ok=True) | ||
|
||
# Create working directory to house input data and model output | ||
if workdir is None: | ||
self.workdir = pathlib.Path(f'./pecos-workdir/{run_id}/') | ||
else: | ||
self.workdir = pathlib.Path(workdir + f'{run_id}/') | ||
self.workdir.mkdir(parents=True, exist_ok=True) | ||
|
||
# Configure model type | ||
self.model_type = model_type | ||
if self.model_type not in self.SUPPORTED_MODEL_TYPES: | ||
raise f"model_type {self.model_type} not supported. model_type should be one of the following: {self.SUPPORTED_MODEL_TYPES}" | ||
|
||
# Specify the types of input features | ||
self.cat_features = cat_features | ||
self.text_features = text_features | ||
self.num_features = num_features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic should live in ._fit
instead of in __init__
. Additionally, use self._feature_metadata
instead of cat_features, text_features, num_features
.
workdir
and model_dir
should either be hyperparameters or automatically set to a fixed location relative to and inside of self.path
. This logic should occur in ._fit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added a commit to move workdir and model_dir. With this, all comments should be addressed
Convert X and self.train_labels to the required format for PECOS. | ||
|
||
Currently only supports one label per training example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't update y in _preprocess
, instead do this in _fit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will do.
|
||
SUPPORTED_MODEL_TYPES = ["XRLinear"] | ||
|
||
def __init__(self, model_type = "XRLinear", workdir = None, model_dir = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model_type
should be a hyperparameter, not an init arg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it, thanks for the suggestion!
self.train_labels = y | ||
self.label_dict = {label:i for i, label in enumerate(y.unique())} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.train_labels and self.label_dict shouldn't be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed; they are not necessary if y is processed in train() instead of preprocess()
# If no features are defined during initialization, we assume features are specified in model hyperparameters. | ||
if self.cat_features is None and self.text_features is None and self.num_features is None: | ||
params = self._get_model_params() | ||
print(f'Hyperparameters: {params}') | ||
self.cat_features = params['cat_features'] | ||
self.text_features = params['text_features'] | ||
self.num_features = params['num_features'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to set this in _preprocess
, set it in _fit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed self.cat_features/text_features/num_features and used self.feature_metadata instead.
if seed is not None: | ||
random.seed(seed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a seed? Isn't this running via a command line tool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command line tool accepts random seed as a parameter. Agreed that there is no need for this, removed
cat_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country'] | ||
text_features = None | ||
num_features = ['age', 'fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will automatically be stored in self._feature_metadata
in PecosModel during the call to _fit
. No need to define it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
""" | ||
return re.sub('\W+', '_', str(s)) | ||
|
||
def read_pred_outfile(pred_outfile, k = 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: PEP8 suggests 2 newlines between functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, fixed :)
from pecos_interface import PecosInterface | ||
from pecos_utils import clean_str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use relative imports: from .pecos_interface
from .pecos_utils
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
# Run PECOS with AutoGluon. Testing file provided for convenience | ||
from autogluon.tabular import TabularDataset | ||
from pecos_model import PecosModel | ||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! Can you also add unit tests akin to the other models? After addressing my other comments PECOS shouldn't need any special inputs to work (ex: cat_features, text_features, num_features shouldn't be necessary).
Here is an example unit tests for vowpalwabbit that is also a command-line based model interface: https://github.com/awslabs/autogluon/blob/master/tabular/tests/unittests/models/test_vowpalwabbit.py
You should be able to do the same unit test, just replacing VW with PECOS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added unit tests. Thanks!
Just published a few new commits. Some notes on this "revision":
|
Thanks for the revision @noahj08! Could you look into why the PECOS unit test is failing? https://github.com/awslabs/autogluon/actions/runs/3185772278/jobs/5195752047#step:4:1687 If this can be fixed that will make the 2nd round of review easier since I can move straight to benchmarking. |
I'm not quite sure why this would be necessary. Worst case scenario you can do If this was truly an issue we should have seen it with |
|
||
def predict(self, X: np.ndarray): | ||
df_pred = self.predict_proba(X, k=1) | ||
df_pred.columns = ['label', 'score'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the line that is causing unit test to fail
[2022-10-04T22:38:11.619Z] Warning: Exception caused PecosModel to fail during training... Skipping this model.
[2022-10-04T22:38:11.619Z] 'numpy.ndarray' object has no attribute 'columns'
[2022-10-04T22:38:11.619Z] Detailed Traceback:
[2022-10-04T22:38:11.619Z] Traceback (most recent call last):
[2022-10-04T22:38:11.619Z] File "/home/ci/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 1159, in _train_and_save
[2022-10-04T22:38:11.619Z] y_pred_proba_val = model.predict_proba(X_val)
[2022-10-04T22:38:11.619Z] File "/home/ci/autogluon/core/src/autogluon/core/models/abstract/abstract_model.py", line 687, in predict_proba
[2022-10-04T22:38:11.619Z] y_pred_proba = self._predict_proba(X=X, **kwargs)
[2022-10-04T22:38:11.619Z] File "/home/ci/autogluon/core/src/autogluon/core/models/abstract/abstract_model.py", line 702, in _predict_proba
[2022-10-04T22:38:11.619Z] y_pred = self.model.predict(X)
[2022-10-04T22:38:11.619Z] File "/home/ci/autogluon/tabular/src/autogluon/tabular/models/pecos_tabular/pecos_interface.py", line 110, in predict
[2022-10-04T22:38:11.619Z] df_pred.columns = ['label', 'score']
[2022-10-04T22:38:11.619Z] AttributeError: 'numpy.ndarray' object has no attribute 'columns'
[2022-10-04T22:38:11.619Z] No base models to train on, skipping auxiliary stack level 2...
Revisited this - I still don't think I can name the directory pecos. Unlike with the catboost model, we are running the When I use relative imports to run this code in a directory named |
Gotcha, that makes sense. I suppose it is fine as it is then |
Job PR-2153-b510684 is done. |
Benchmarking this branch on a sample of 10 datasets, you can reproduce results via running this code: https://github.com/Innixma/autogluon-benchmark/blob/master/examples/train_ag_tiny.py and editing config1 and config2:
Results:
Take-aways:
As a next step, an example dataset should be provided where PECOS (as implemented in this PR via use in AutoGluon) meaningfully competes with existing models to justify its inclusion, or a description of what would need to be added for it to meaningfully compete. If text support needs to be added, please provide a text dataset either publicly or privately that PECOS is supposed to do well on with proper implementation for ease of tracking text support progress of PECOS in AutoGluon. |
Added support for text data and vectorized the preprocess method in the latest commit. New results on benchmark are copied below. We see that PECOS performance is better - it very slightly outperforms GBM on the kc1 task. We don't expect PECOS to be great at these tasks; PECOS is meant for solving extreme classification problems with thousands of labels or more. I am sending an internal-only dataset offline where PECOS clearly outperforms the other algorithms in AutoGluon. Integrating this PR enables AutoGluon to solve a new domain of problems; it adds considerable value to the framework.
|
Fixed lint check - issue was that I needed to merge with upstream :). A multimodal test is now failing due to a Bad Request error when trying to read from S3. This does not seem like it would be related to my changes; my changes are in the tabular module, where all tests pass. I could not fix the issue within the time I allocated for this today, so I will need to revisit this another day. Copying the error below for reference.
|
Regarding this contribution: I'm waiting for v0.7 to release to simplify the amount of testing needed when creating the model contrib example repo and moving forward with this contribution, thanks! |
Issue #, if available:
1827
Description of changes:
Add PECOS as a custom model to AutoGluon.
Adds a PecosModel custom model object and PecosInterface object to interact with PECOS through command-line tools.
Also contains a run.py file to demonstrate how the model can be used in-conjunction with AutoGluon.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.