
Pecos #2153

Open · wants to merge 12 commits into master

Conversation


@noahj08 noahj08 commented Sep 21, 2022

Issue #, if available:
1827

Description of changes:

Add PECOS as a custom model to AutoGluon.
Adds a PecosModel custom model class and a PecosInterface object that interacts with PECOS through its command-line tools.
Also includes a run.py file demonstrating how the model can be used in conjunction with AutoGluon.
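For context, driving a custom model through AutoGluon typically looks like the following. This is a minimal sketch, not the PR's actual run.py: the dataset URLs and the local pecos_model import path are assumptions.

    # Minimal usage sketch; dataset URLs and the PecosModel import path are illustrative.
    from autogluon.tabular import TabularDataset, TabularPredictor
    from pecos_model import PecosModel  # hypothetical local import, mirroring run.py

    train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
    test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')

    # Passing the custom model class as a hyperparameters key tells AutoGluon to train it.
    predictor = TabularPredictor(label='class').fit(train_data, hyperparameters={PecosModel: {}})
    print(predictor.leaderboard(test_data))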

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@noahj08 noahj08 changed the title from "Pecos branch" to "Pecos" on Sep 21, 2022
@github-actions

Job PR-2153-b18bf2f is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2153/b18bf2f/index.html

@Innixma Innixma (Contributor) left a comment

Thanks for this contribution, very nice! I added some comments. Once they are addressed it should be ready for benchmarking.

Comment on lines 24 to 51
def __init__(self, model_type = "XRLinear", workdir = None, model_dir = None,
             cat_features = None, text_features = None, num_features = None, **kwargs):
    super().__init__(**kwargs)

    # Create directory to house model artifacts
    run_id = str(uuid.uuid4())[:10]
    if model_dir is None:
        self.model_dir = pathlib.Path(f'./pecos-workdir/{run_id}/model')
    else:
        self.model_dir = pathlib.Path(model_dir)
    self.model_dir.mkdir(parents=True, exist_ok=True)

    # Create working directory to house input data and model output
    if workdir is None:
        self.workdir = pathlib.Path(f'./pecos-workdir/{run_id}/')
    else:
        self.workdir = pathlib.Path(workdir + f'{run_id}/')
    self.workdir.mkdir(parents=True, exist_ok=True)

    # Configure model type
    self.model_type = model_type
    if self.model_type not in self.SUPPORTED_MODEL_TYPES:
        raise f"model_type {self.model_type} not supported. model_type should be one of the following: {self.SUPPORTED_MODEL_TYPES}"

    # Specify the types of input features
    self.cat_features = cat_features
    self.text_features = text_features
    self.num_features = num_features
Contributor:

This logic should live in ._fit instead of in __init__. Additionally, use self._feature_metadata instead of cat_features, text_features, num_features.

workdir and model_dir should either be hyperparameters or automatically set to a fixed location relative to and inside of self.path. This logic should occur in ._fit.
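A rough sketch of what this could look like, assuming AutoGluon's AbstractModel conventions (_get_model_params, self.path, self._feature_metadata); the PecosInterface call at the end is a placeholder, not the PR's actual signature:

    import pathlib

    from autogluon.core.models import AbstractModel


    class PecosModel(AbstractModel):
        def _fit(self, X, y, **kwargs):
            params = self._get_model_params()
            model_type = params.get('model_type', 'XRLinear')
            if model_type not in self.SUPPORTED_MODEL_TYPES:
                raise ValueError(f'model_type {model_type} not supported; expected one of {self.SUPPORTED_MODEL_TYPES}')

            # Keep all artifacts inside the model's own directory rather than ./pecos-workdir
            workdir = pathlib.Path(self.path) / 'pecos-workdir'
            model_dir = pathlib.Path(self.path) / 'pecos-model'
            workdir.mkdir(parents=True, exist_ok=True)
            model_dir.mkdir(parents=True, exist_ok=True)

            # Feature types come from the metadata AutoGluon already computed, not from init args
            cat_features = self._feature_metadata.get_features(valid_raw_types=['category', 'object'])
            num_features = self._feature_metadata.get_features(valid_raw_types=['int', 'float'])

            X = self.preprocess(X)
            # Placeholder call; the real PecosInterface API in this PR may differ
            self.model = PecosInterface(model_type=model_type, workdir=workdir, model_dir=model_dir)
            self.model.fit(X, y, cat_features=cat_features, num_features=num_features)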

Author:

Just added a commit to move workdir and model_dir. With this, all comments should be addressed

Comment on lines 55 to 57
Convert X and self.train_labels to the required format for PECOS.

Currently only supports one label per training example
Contributor:

Don't update y in _preprocess, instead do this in _fit.
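In other words, keep _preprocess feature-only and handle labels where training happens. A small sketch under that assumption (y is a pandas Series here):

    def _preprocess(self, X, **kwargs):
        # Feature transformations only; labels never pass through _preprocess
        X = super()._preprocess(X, **kwargs)
        return X

    def _fit(self, X, y, **kwargs):
        X = self.preprocess(X)
        # Map class labels to contiguous integer ids at training time
        label_to_idx = {label: i for i, label in enumerate(y.unique())}
        y_encoded = y.map(label_to_idx)
        # ... hand X and y_encoded to the PECOS training routine ...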

Author:

Thanks, will do.


SUPPORTED_MODEL_TYPES = ["XRLinear"]

def __init__(self, model_type = "XRLinear", workdir = None, model_dir = None,
Contributor:

model_type should be a hyperparameter, not an init arg.
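One way to wire this up, following the usual AbstractModel pattern (a sketch; the default mirrors the existing SUPPORTED_MODEL_TYPES):

    def _set_default_params(self):
        # model_type becomes an ordinary hyperparameter with a default value
        default_params = {'model_type': 'XRLinear'}
        for param, val in default_params.items():
            self._set_default_param_value(param, val)

    def _fit(self, X, y, **kwargs):
        model_type = self._get_model_params()['model_type']
        # ... continue training with the chosen model_type ...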

Author:

Moved it, thanks for the suggestion!

Comment on lines 119 to 120
self.train_labels = y
self.label_dict = {label:i for i, label in enumerate(y.unique())}
Contributor:

self.train_labels and self.label_dict shouldn't be necessary.

Author:

Removed; they are not necessary if y is processed in train() instead of preprocess()

Comment on lines 70 to 76
# If no features are defined during initialization, we assume features are specified in model hyperparameters.
if self.cat_features is None and self.text_features is None and self.num_features is None:
    params = self._get_model_params()
    print(f'Hyperparameters: {params}')
    self.cat_features = params['cat_features']
    self.text_features = params['text_features']
    self.num_features = params['num_features']
Contributor:

No need to set this in _preprocess, set it in _fit

Author:

Removed self.cat_features/text_features/num_features and used self.feature_metadata instead.

Comment on lines 51 to 52
if seed is not None:
    random.seed(seed)
Contributor:

Why do we need a seed? Isn't this running via a command line tool?

Author:

The command line tool accepts random seed as a parameter. Agreed that there is no need for this, removed

Comment on lines 46 to 48
cat_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country']
text_features = None
num_features = ['age', 'fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week']
Contributor:

this will automatically be stored in self._feature_metadata in PecosModel during the call to _fit. No need to define it here.
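With that, the run.py call site would shrink to something like this (a sketch; the train_data and label variable names are placeholders, not code from this PR):

    # No manual cat/text/num feature lists: AutoGluon infers feature types and
    # hands PecosModel a FeatureMetadata object during _fit.
    model = PecosModel()
    model.fit(X=train_data.drop(columns=[label]), y=train_data[label])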

Author:

Fixed!

"""
return re.sub('\W+', '_', str(s))

def read_pred_outfile(pred_outfile, k = 1):
Contributor:

nit: PEP8 suggests 2 newlines between functions

Author:

Oops, fixed :)

Comment on lines 10 to 11
from pecos_interface import PecosInterface
from pecos_utils import clean_str
Contributor:

use relative imports: from .pecos_interface import PecosInterface and from .pecos_utils import clean_str

Author:

Fixed!

Comment on lines 1 to 4
# Run PECOS with AutoGluon. Testing file provided for convenience
from autogluon.tabular import TabularDataset
from pecos_model import PecosModel
import pandas as pd
Contributor:

Thanks for this! Can you also add unit tests akin to the other models? After addressing my other comments PECOS shouldn't need any special inputs to work (ex: cat_features, text_features, num_features shouldn't be necessary).

Here is an example unit test for vowpalwabbit, which is also a command-line based model interface: https://github.com/awslabs/autogluon/blob/master/tabular/tests/unittests/models/test_vowpalwabbit.py

You should be able to do the same unit test, just replacing VW with PECOS.
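A sketch of the corresponding PECOS test, modeled on that vowpalwabbit test's structure (the fit_helper fixture and fit_and_validate_dataset call mirror that file and are assumptions about the test harness; the import path assumes the final pecos_tabular module layout):

    from autogluon.tabular.models.pecos_tabular.pecos_model import PecosModel


    def test_pecos_binary(fit_helper):
        # Train only the custom PECOS model on a small binary classification dataset
        fit_args = dict(hyperparameters={PecosModel: {}})
        fit_helper.fit_and_validate_dataset(dataset_name='adult', fit_args=fit_args)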

Author:

Added unit tests. Thanks!

@noahj08 noahj08 commented Oct 4, 2022

Just published a few new commits. Some notes on this "revision":

  • During development I realized that I need to fundamentally change how I call PECOS in order to use it to process text input with AutoGluon, since AutoGluon automatically pre-processes text features before training and testing. I will make this change in a future PR since the current code will technically still work on tabular data (though I don’t expect PECOS to perform well without text inputs)
  • I had to rename the pecos directory to pecos_tabular in order to avoid conflicting names with relative imports (the pecos module we import/call is also called ‘pecos’). Sorry for the messy diff that results
  • All other comments were addressed, including:
    • Removing explicit naming of cat_features/text_features/num_features, use FeatureMetadata instead
    • Remove text features entirely for now
    • Formatting fixes
    • Importing constants like R_INT, R_BOOL, etc
    • Stop preprocessing y in _preprocess
    • Make model_type a hyperparam
    • Use relative imports
    • Add a time_limit
    • Remove random seed
    • Add version cap to PECOS
    • Add unit tests

Thanks for the feedback on rev 1!

@Innixma Innixma (Contributor) commented Oct 4, 2022

Thanks for the revision @noahj08! Could you look into why the PECOS unit test is failing? https://github.com/awslabs/autogluon/actions/runs/3185772278/jobs/5195752047#step:4:1687

If this can be fixed that will make the 2nd round of review easier since I can move straight to benchmarking.

@Innixma Innixma (Contributor) commented Oct 4, 2022

> I had to rename the pecos directory to pecos_tabular in order to avoid conflicting names with relative imports (the pecos module we import/call is also called ‘pecos’). Sorry for the messy diff that results

I'm not quite sure why this would be necessary. Worst case scenario you can do import pecos as _pecos to avoid collisions. Ideally pecos/ would be a nice directory name.

If this was truly an issue we should have seen it with catboost since we call import catboost within catboost/catboost_model.py, but this works without issue.


def predict(self, X: np.ndarray):
    df_pred = self.predict_proba(X, k=1)
    df_pred.columns = ['label', 'score']
@Innixma Innixma (Contributor) commented Oct 4, 2022:

This is the line that is causing unit test to fail

[2022-10-04T22:38:11.619Z] 	Warning: Exception caused PecosModel to fail during training... Skipping this model.
[2022-10-04T22:38:11.619Z] 		'numpy.ndarray' object has no attribute 'columns'
[2022-10-04T22:38:11.619Z] Detailed Traceback:
[2022-10-04T22:38:11.619Z] Traceback (most recent call last):
[2022-10-04T22:38:11.619Z]   File "/home/ci/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 1159, in _train_and_save
[2022-10-04T22:38:11.619Z]     y_pred_proba_val = model.predict_proba(X_val)
[2022-10-04T22:38:11.619Z]   File "/home/ci/autogluon/core/src/autogluon/core/models/abstract/abstract_model.py", line 687, in predict_proba
[2022-10-04T22:38:11.619Z]     y_pred_proba = self._predict_proba(X=X, **kwargs)
[2022-10-04T22:38:11.619Z]   File "/home/ci/autogluon/core/src/autogluon/core/models/abstract/abstract_model.py", line 702, in _predict_proba
[2022-10-04T22:38:11.619Z]     y_pred = self.model.predict(X)
[2022-10-04T22:38:11.619Z]   File "/home/ci/autogluon/tabular/src/autogluon/tabular/models/pecos_tabular/pecos_interface.py", line 110, in predict
[2022-10-04T22:38:11.619Z]     df_pred.columns = ['label', 'score']
[2022-10-04T22:38:11.619Z] AttributeError: 'numpy.ndarray' object has no attribute 'columns'
[2022-10-04T22:38:11.619Z] No base models to train on, skipping auxiliary stack level 2...
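
A possible fix in PecosInterface.predict is to construct the DataFrame explicitly rather than assigning .columns on whatever predict_proba returns (a sketch, assuming the top-1 predictions come back as rows of (label, score)):

    import numpy as np
    import pandas as pd

    def predict(self, X: np.ndarray):
        pred = self.predict_proba(X, k=1)
        # predict_proba may return a plain numpy array, which has no .columns attribute
        df_pred = pd.DataFrame(pred, columns=['label', 'score'])
        return df_pred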

@noahj08 noahj08 commented Oct 5, 2022

> I'm not quite sure why this would be necessary. Worst case scenario you can do import pecos as _pecos to avoid collisions. Ideally pecos/ would be a nice directory name.

Revisited this - I still don't think I can name the directory pecos. Unlike with the catboost model, we are running the pecos library from the command line. The command to run pecos from the PecosInterface looks like this: python3 -m pecos.apps.text2text.train.

When I use relative imports to run this code in a directory named pecos, the command I run looks like python3 -m pecos.run, and I get the error pecos.apps module does not exist due to the name conflict. I am not sure how I could get around this other than renaming the code directory.
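
For reference, the interface shells out to the library roughly like this (a sketch; the helper name and argument plumbing are illustrative, only the module path comes from the PR). A local package also named pecos would shadow the installed library during that module lookup, which is why the directory rename avoids the error.

    import subprocess
    import sys

    def run_pecos_text2text_train(extra_args):
        # Invoke the PECOS CLI trainer; whatever flags it expects are passed through extra_args
        cmd = [sys.executable, '-m', 'pecos.apps.text2text.train'] + list(extra_args)
        subprocess.run(cmd, check=True)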

@Innixma Innixma (Contributor) commented Oct 6, 2022

> I'm not quite sure why this would be necessary. Worst case scenario you can do import pecos as _pecos to avoid collisions. Ideally pecos/ would be a nice directory name.

> Revisited this - I still don't think I can name the directory pecos. Unlike with the catboost model, we are running the pecos library from the command line. The command to run pecos from the PecosInterface looks like this: python3 -m pecos.apps.text2text.train.

> When I use relative imports to run this code in a directory named pecos, the command I run looks like python3 -m pecos.run, and I get the error pecos.apps module does not exist due to the name conflict. I am not sure how I could get around this other than renaming the code directory.

Gotcha, that makes sense. I suppose it is fine as it is then

@github-actions github-actions bot commented Oct 6, 2022

Job PR-2153-b510684 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2153/b510684/index.html

@Innixma Innixma (Contributor) commented Oct 20, 2022

I benchmarked this branch on a sample of 10 datasets; you can reproduce the results by running this script:

https://github.com/Innixma/autogluon-benchmark/blob/master/examples/train_ag_tiny.py

and editing config1 and config2:

    config1 = dict(
        name='GBM',
        fit_args={
            'hyperparameters': {'GBM': {}},
        },
    )
    from autogluon.tabular.models.pecos_tabular.pecos_model import PecosModel
    config2 = dict(
        name='PECOS',
        fit_args={
            'hyperparameters': {PecosModel: {}},
        }
    )

Results:

     name                task_name  test_score   time_fit  time_predict eval_metric  test_error  fold  repeat  sample  task_id problem_type
0     GBM                     wilt    0.978166   0.592907      0.004127     roc_auc    0.021834     0       0       0   146820       binary
1     GBM                 credit-g    0.807619   0.358318      0.011728     roc_auc    0.192381     0       0       0   168757       binary
2     GBM                  jasmine    0.877181   0.638222      0.025533     roc_auc    0.122819     0       0       0   168911       binary
3     GBM                 madeline    0.903035   0.953085      0.006725     roc_auc    0.096965     0       0       0   190392       binary
4     GBM               eucalyptus   -0.763752   1.622462      0.010539    log_loss    0.763752     0       0       0   359954   multiclass
5     GBM              qsar-biodeg    0.948016   0.494164      0.006107     roc_auc    0.051984     0       0       0   359956       binary
6     GBM                      pc4    0.933594   0.454187      0.005441     roc_auc    0.066406     0       0       0   359958       binary
7     GBM                      kc1    0.786837   0.457706      0.004599     roc_auc    0.213163     0       0       0   359962       binary
8     GBM                  segment   -0.199786   1.927116      0.004838    log_loss    0.199786     0       0       0   359963   multiclass
9     GBM  Internet-Advertisements    0.967545   2.554853      0.250002     roc_auc    0.032455     0       0       0   359966       binary
10  PECOS                     wilt    0.614293   2.989111      1.256844     roc_auc    0.385707     0       0       0   146820       binary
11  PECOS                 credit-g    0.650000   2.679966      1.271916     roc_auc    0.350000     0       0       0   168757       binary
12  PECOS                  jasmine    0.772394   4.292119      1.451271     roc_auc    0.227606     0       0       0   168911       binary
13  PECOS                 madeline    0.551241   6.048249      1.617987     roc_auc    0.448759     0       0       0   190392       binary
14  PECOS               eucalyptus   -1.146890   5.497149      1.242149    log_loss    1.146890     0       0       0   359954   multiclass
15  PECOS              qsar-biodeg    0.797222   2.746031      1.265541     roc_auc    0.202778     0       0       0   359956       binary
16  PECOS                      pc4    0.579427   2.842677      1.271110     roc_auc    0.420573     0       0       0   359958       binary
17  PECOS                      kc1    0.655115   2.815633      1.263598     roc_auc    0.344885     0       0       0   359962       binary
18  PECOS                  segment   -0.814216   5.409791      1.268010    log_loss    0.814216     0       0       0   359963   multiclass
19  PECOS  Internet-Advertisements    0.911270  26.414598      3.977463     roc_auc    0.088730     0       0       0   359966       binary

Take-aways:

  1. Inference speed is quite slow (not unusable, but still slow)
  2. Test scores are quite poor compared to LightGBM across all datasets.
    • wilt: LightGBM has 0.97 AUC, PECOS has 0.65 AUC.
    • segment: LightGBM has 0.2 logloss, PECOS has 0.81 logloss.

As a next step, an example dataset should be provided where PECOS (as implemented in this PR via use in AutoGluon) meaningfully competes with existing models to justify its inclusion, or a description of what would need to be added for it to meaningfully compete. If text support needs to be added, please provide a text dataset either publicly or privately that PECOS is supposed to do well on with proper implementation for ease of tracking text support progress of PECOS in AutoGluon.

@noahj08 noahj08 commented Dec 21, 2022

Added support for text data and vectorized the preprocess method in the latest commit.

New results on benchmark are copied below. We see that PECOS performance is better - it very slightly outperforms GBM on the kc1 task. We don't expect PECOS to be great at these tasks; PECOS is meant for solving extreme classification problems with thousands of labels or more. I am sending an internal-only dataset offline where PECOS clearly outperforms the other algorithms in AutoGluon.

Integrating this PR enables AutoGluon to solve a new domain of problems; it adds considerable value to the framework.

     name                task_name  test_score   time_fit  time_predict eval_metric  test_error  fold  repeat  sample  task_id problem_type
0   PECOS                     wilt    0.813151   3.515500      1.612471     roc_auc    0.186849     0       0       0   146820       binary
1   PECOS                 credit-g    0.747143   3.320124      1.603239     roc_auc    0.252857     0       0       0   168757       binary
2   PECOS                  jasmine    0.814564   3.968651      1.723619     roc_auc    0.185436     0       0       0   168911       binary
3   PECOS                 madeline    0.540084   4.666079      1.718924     roc_auc    0.459916     0       0       0   190392       binary
4   PECOS               eucalyptus   -0.962728   7.613407      1.615996    log_loss    0.962728     0       0       0   359954   multiclass
5   PECOS              qsar-biodeg    0.893651   3.489304      1.632812     roc_auc    0.106349     0       0       0   359956       binary
6   PECOS                      pc4    0.865017   3.549638      1.654120     roc_auc    0.134983     0       0       0   359958       binary
7   PECOS                      kc1    0.789892   3.511672      1.621228     roc_auc    0.210108     0       0       0   359962       binary
8   PECOS                  segment   -0.588283   6.144801      1.580236    log_loss    0.588283     0       0       0   359963   multiclass
9   PECOS  Internet-Advertisements    0.953438  11.147243      2.585245     roc_auc    0.046562     0       0       0   359966       binary
10    GBM                     wilt    0.978166   1.521617      0.021018     roc_auc    0.021834     0       0       0   146820       binary
11    GBM                 credit-g    0.807619   0.898730      0.029716     roc_auc    0.192381     0       0       0   168757       binary
12    GBM                  jasmine    0.877181   1.623613      0.050219     roc_auc    0.122819     0       0       0   168911       binary
13    GBM                 madeline    0.903035   2.149817      0.023113     roc_auc    0.096965     0       0       0   190392       binary
14    GBM               eucalyptus   -0.763752   4.365198      0.026870    log_loss    0.763752     0       0       0   359954   multiclass
15    GBM              qsar-biodeg    0.948016   1.087405      0.022294     roc_auc    0.051984     0       0       0   359956       binary
16    GBM                      pc4    0.933594   1.503057      0.021492     roc_auc    0.066406     0       0       0   359958       binary
17    GBM                      kc1    0.786837   1.501772      0.025907     roc_auc    0.213163     0       0       0   359962       binary
18    GBM                  segment   -0.199786  15.058131      0.020986    log_loss    0.199786     0       0       0   359963   multiclass
19    GBM  Internet-Advertisements    0.967545   4.548597      0.392050     roc_auc    0.032455     0       0       0   359966       binary

  • The linter is currently failing - the failure does not reproduce when I run the linter locally. Will investigate more tomorrow.

@noahj08 noahj08 commented Dec 22, 2022

Fixed lint check - issue was that I needed to merge with upstream :).

A multimodal test is now failing due to a Bad Request error when trying to read from S3. This does not seem like it would be related to my changes; my changes are in the tabular module, where all tests pass. I could not fix the issue within the time I allocated for this today, so I will need to revisit this another day. Copying the error below for reference.

Downloading /home/noahjaco/.automm_unit_tests/datasets/petfinder_for_unit_tests.zip from s3://automl-mm-bench/unit-tests-0.4/datasets/petfinder_for_unit_tests.zip...
download failed due to ClientError('An error occurred (400) when calling the HeadObject operation: Bad Request'), retrying, 4 attempts left
Downloading /home/noahjaco/.automm_unit_tests/datasets/petfinder_for_unit_tests.zip from s3://automl-mm-bench/unit-tests-0.4/datasets/petfinder_for_unit_tests.zip...
download failed due to ClientError('An error occurred (400) when calling the HeadObject operation: Bad Request'), retrying, 3 attempts left
Downloading /home/noahjaco/.automm_unit_tests/datasets/petfinder_for_unit_tests.zip from s3://automl-mm-bench/unit-tests-0.4/datasets/petfinder_for_unit_tests.zip...
download failed due to ClientError('An error occurred (400) when calling the HeadObject operation: Bad Request'), retrying, 2 attempts left
Downloading /home/noahjaco/.automm_unit_tests/datasets/petfinder_for_unit_tests.zip from s3://automl-mm-bench/unit-tests-0.4/datasets/petfinder_for_unit_tests.zip...
download failed due to ClientError('An error occurred (400) when calling the HeadObject operation: Bad Request'), retrying, 1 attempt left
Downloading /home/noahjaco/.automm_unit_tests/datasets/petfinder_for_unit_tests.zip from s3://automl-mm-bench/unit-tests-0.4/datasets/petfinder_for_unit_tests.zip...
________________________________________________________________________ ERROR collecting tests/unittests/others/test_data_augmentation.py ________________________________________________________________________
tests/unittests/others/test_data_augmentation.py:26: in <module>
    from ..predictor.test_predictor import verify_predictor_save_load
<frozen importlib._bootstrap>:1007: in _find_and_load
    ???
<frozen importlib._bootstrap>:986: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:680: in _load_unlocked
    ???
../../../final-ag/lib/python3.9/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
    exec(co, module.__dict__)
tests/unittests/predictor/test_predictor.py:38: in <module>
    "petfinder": PetFinderDataset(),
tests/unittests/others/unittest_datasets.py:24: in __init__
    download(
src/autogluon/multimodal/utils/download.py:268: in download
    raise e
src/autogluon/multimodal/utils/download.py:210: in download
    response = s3.meta.client.head_object(Bucket=s3_bucket_name, Key=s3_key)
../../../final-ag/lib/python3.9/site-packages/botocore/client.py:530: in _api_call
    return self._make_api_call(operation_name, kwargs)
../../../final-ag/lib/python3.9/site-packages/botocore/client.py:960: in _make_api_call
    raise error_class(parsed_response, operation_name)
E   botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

@Innixma Innixma added this to the 0.7 Release milestone Jan 6, 2023
@Innixma Innixma (Contributor) commented Feb 1, 2023

Regarding this contribution: I'm waiting for the v0.7 release, which will simplify the amount of testing needed when creating the model contrib example repo and moving forward with this contribution. Thanks!

@Innixma Innixma added the "module: tabular" and "dependency" labels Feb 2, 2023
@Innixma Innixma removed this from the 0.7 Release milestone Feb 9, 2023
@Innixma Innixma added this to the 0.7 Fast-Follow Items milestone Feb 9, 2023
@Innixma Innixma modified the milestones: 0.7 Fast-Follow Items, 0.9 Release May 23, 2023
@Innixma Innixma modified the milestones: 0.9 Release, Feature Backlog Oct 10, 2023