
Major difference between predictions from trained model #2627

Open

DevBerge opened this issue Apr 4, 2024 · 4 comments

DevBerge commented Apr 4, 2024

Problem: Inference differs from training for CatBoostClassifier
catboost version: 1.2.2
Operating System: Ubuntu
CPU: AMD Ryzen 9 7950X3D
GPU: MSI GeForce RTX 4090 Ventus

I'm having issues using trained models for inference after tuning, and I'm going mad trying to debug the data pipeline: there is a huge difference between the probabilities produced by .predict_proba() at the end of parameter tuning and the probabilities produced after loading the model from file and calling .predict_proba(). The model is saved with the native save_model function. When the best model is fitted, I save the dtypes, the categorical features and the required features. When the model is loaded, the data is processed according to that feature schema, and we then construct a Pool object with the categorical keys and feature names recorded during fitting. We also use text and embeddings produced by a BERT model (the embeddings are converted into separate columns).
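
Roughly, the save/load flow looks like this (a simplified, self-contained sketch with dummy data and made-up file names, not our actual pipeline):

    import json
    import pandas as pd
    import catboost as cb

    # tiny dummy training data standing in for the real features
    X_train = pd.DataFrame({
        'num_feat': [0.1, 0.7, 0.3, 0.9],
        'cat_feat': ['a', 'b', 'a', 'b'],
    })
    y_train = [0, 1, 0, 1]
    cat_features = ['cat_feat']

    model = cb.CatBoostClassifier(iterations=10, verbose=False)
    model.fit(X_train, y_train, cat_features=cat_features)

    # save the model natively plus a feature schema next to it
    model.save_model('model.cbm')
    schema = {
        'required': list(X_train.columns),
        'cat_features': cat_features,
        'dtypes': {c: str(t) for c, t in X_train.dtypes.items()},
    }
    with open('schema.json', 'w') as f:
        json.dump(schema, f)

    # at inference time: reload the model, reorder/cast columns per the schema,
    # and build the Pool with the recorded categorical features
    loaded = cb.CatBoostClassifier()
    loaded.load_model('model.cbm')
    with open('schema.json') as f:
        schema = json.load(f)
    X = X_train[schema['required']].astype(schema['dtypes'])
    pool = cb.Pool(X, cat_features=schema['cat_features'])
    print(loaded.predict_proba(pool))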

We're using the models in a Flask API for inference and training. For testing I've set up a "bulk prediction" endpoint and a single-observation prediction endpoint. What is worrying is that the same observation gets different predictions from the two endpoints.
The same thing happens when predicting on the test set, which was already predicted once at the end of training the model.

Here is an example of predictions for one observation from the test set:
(from training) prob_negative: 0.10914
(from single) prob_negative: 0.87085
(from bulk) prob_negative: 0.78522

While this is a single observation, the same shift applies to the entire test set when predicted: the whole probability distribution is shifted, which changes the decision thresholds dramatically and in turn causes major trouble for inference serving. Has anybody encountered this before, or does anyone know a solution? I've seen inference results differ before, but this is way too much.
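
To rule out the model itself, the same row scored alone and as part of a batch should give identical probabilities when the feature values are identical. A toy check along those lines (dummy data, not our real pipeline):

    import numpy as np
    import pandas as pd
    import catboost as cb

    # toy data standing in for the real test set
    X = pd.DataFrame({'f0': np.random.rand(100), 'f1': np.random.rand(100)})
    y = (X['f0'] > 0.5).astype(int)

    model = cb.CatBoostClassifier(iterations=20, verbose=False)
    model.fit(X, y)

    bulk_probs = model.predict_proba(X)              # "bulk" endpoint path: whole frame
    single_probs = model.predict_proba(X.iloc[[0]])  # "single" endpoint path: one row
    assert np.allclose(bulk_probs[0], single_probs[0])

In our service the equivalent comparison fails for the same observation, which is what I'm trying to track down.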

@andrey-khropov (Member)

Please provide an example on which the issue can be reproduced.

I tried on a small dataset with numerical, categorical, text and embedding features that we use for tests and cannot reproduce any problems - the result is the same after the model is saved and loaded again:

import os

import numpy as np
import catboost as cb

print(f'cb.version={cb.version.VERSION}')

data_root = os.path.join(
    <path to the root of the working copy of CatBoost's git repository>
    , 'catboost', 'pytest', 'data', 'rotten_tomatoes_small_with_embeddings'
)

train = cb.Pool(
    os.path.join(data_root, 'train_two_labels'),
    column_description=os.path.join(data_root, 'cd_binclass')
)

model = cb.CatBoostClassifier(iterations=100)
model.fit(train)
predicted_probabilities_original = model.predict_proba(train)

print(f'predicted_probabilities_original={predicted_probabilities_original[:5]}')

model.save_model('model.cbm')


loaded_model = cb.CatBoostClassifier()
loaded_model.load_model('model.cbm')

predicted_probabilities_from_loaded = loaded_model.predict_proba(train)

print(f'predicted_probabilities_from_loaded={predicted_probabilities_from_loaded[:5]}')

assert np.allclose(predicted_probabilities_original, predicted_probabilities_from_loaded)

DevBerge (Author) commented Apr 9, 2024

I want to clarify: it was a bug in the data pipeline, which makes sense, as I've never seen this much of a difference in results before. In short, the embeddings were not being generated properly for the samples.

However, I've tried to debug the variation in prediction probabilities before and given up. As of now, the same sample still gets 3 different probabilities:

(from training) prob_negative: 0.9361
(from single) prob_negative: 0.9171
(from bulk) prob_negative: 0.9024

While the difference is "small", I'm still unsure about the reason for it; the data pipeline is the same (hopefully lol). When I first noticed this, I read that the output may be non-deterministic because of the categorical features in the data. I found this here: https://catboost.ai/en/docs/concepts/faq#applying-the-model-to-train-dataset and in previous issues.

Any clarification on this would be super helpful, and sorry for the hostility earlier.

Code showing how we load the model in the Flask endpoint:

    cb_mod = cb.CatBoostClassifier()
    cb_mod.load_model(path)
    cols = features["required"]
    data = data[cols]
    category_keys = [key for key, value in features["properties"].items() if value.get("type") == "category"]
    pool = cb.Pool(data=data, cat_features=category_keys, feature_names=cols, text_features=['text'])
    probs = cb_mod.predict_proba(pool)

Here, the path to the model and "features" are retrieved from a database for the model version. "features" is a feature schema storing the datatypes, categories and required features for that model version. I don't have an easy way of sharing code/data for reproducibility as of now; this is just a short snippet that might provide some insight into how the model is loaded.
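
For illustration, the schema has roughly this shape (made-up column names, not the real schema):

    features = {
        "required": ["some_number", "some_category", "text"],  # plus the exploded embedding columns 0..767
        "properties": {
            "some_number":   {"type": "number"},
            "some_category": {"type": "category"},
            "text":          {"type": "text"},
        },
    }
    # category_keys in the snippet above then becomes ["some_category"]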

@andrey-khropov (Member)

pool = Pool(data=data, cat_features=category_keys, feature_names=cols, text_features=['text'])

You said that you have embedding features, but I don't see you passing the embedding_features parameter here.

Have you compared all feature values that you passed to predict_proba in all cases?
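
For example, something along these lines (with stand-in frames here; in your case they would be the processed frames just before Pool construction in the two endpoints):

    import pandas as pd

    # stand-ins for the processed frames right before Pool construction
    bulk_df = pd.DataFrame({'f0': [1.0, 2.0], 'cat': ['a', 'b'], 'text': ['x', 'y']})
    single_df = bulk_df.iloc[[0]].copy()

    # raises if any value or dtype differs for the same observation
    pd.testing.assert_frame_equal(
        single_df.reset_index(drop=True),
        bulk_df.iloc[[0]].reset_index(drop=True),
        check_dtype=True,
    )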

DevBerge (Author) commented Apr 9, 2024

We had issues using the embedding_features parameter a while ago, so we're passing the vector as embedding_feature_0, ..., embedding_feature_767 instead.

Example of how we generate and pass cols:

    data = pd.concat([data.loc[:, data.columns != 'embeddings'], data.embeddings.apply(pd.Series)], axis=1)
    bert_names = []
    for column in data.columns:
        if str(column).isdigit() and 0 <= int(column) <= 767:
            bert_names.append(column)

    cols = cols + bert_names
    # previous loading
    data = data[cols]

During processing we create an embeddings column in the dataframe and then explode the vector into separate columns before passing the data to the CatBoost model. The embedding columns (named 0, ..., N) are always the last columns in the dataframe.

If I recall correctly, when this was created we passed a dataframe to the model instead of constructing the pool object with Pool(x, y, cat_features=cat_features, text_features=['text'], feature_names=x.columns.tolist()).
I can't really remember the specific issue we had with passing the vector in the embeddings column of the dataframe, but would the correct method be to pass it like this,
Pool(x, y, cat_features=cat_features, text_features=['text'], embedding_features=['embeddings'], feature_names=x.columns.tolist()), instead of exploding the vector?
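
For reference, the non-exploded variant I have in mind would look roughly like this (toy data and column names; I haven't verified this against our pipeline):

    import numpy as np
    import pandas as pd
    import catboost as cb

    df = pd.DataFrame({
        'cat_feat': ['a', 'b'] * 50,
        'text': ['good movie', 'bad movie'] * 50,
        'embeddings': [np.random.rand(8) for _ in range(100)],  # 768-dim in our case
    })
    y = [1, 0] * 50

    # keep the vector in a single column of fixed-length arrays and declare it
    # via embedding_features instead of exploding it into separate columns
    pool = cb.Pool(
        df, y,
        cat_features=['cat_feat'],
        text_features=['text'],
        embedding_features=['embeddings'],
    )
    model = cb.CatBoostClassifier(iterations=10, verbose=False)
    model.fit(pool)
    print(model.predict_proba(pool)[:2])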

I'll check on the feature values and report back on this.
