In classification tasks, perform ordinal encoding on the target column of string types #1260

Open
HJH0924 opened this issue Jan 2, 2024 · 7 comments

Comments

@HJH0924

HJH0924 commented Jan 2, 2024

In a classification task, using ordinal encoding on a string-typed target column raises an error. Looking at the source, line 1021 of 'pgml-extension/src/om/snapshot.rs' converts the encoded value to f32, a floating-point number, rather than following the description on line 80: "Encode each category as ascending integer values". Changing the task to regression avoids the error, but that defeats the purpose.

@montanalow
Contributor

Can you add steps to reproduce an error? It's typical that things are eventually encoded as f32 in ML for hardware acceleration, even integer categoricals, depending on the underlying library implementation.
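As a rough illustration of why storing integer category codes as f32 is usually harmless, the sketch below round-trips a few integers through IEEE 754 single precision using only the Python standard library. The helper name `to_f32` and the sample codes are hypothetical and are not taken from the extension's source.

```python
import struct


def to_f32(value: float) -> float:
    """Round-trip a number through IEEE 754 single precision (f32)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


# Hypothetical ordinal codes for a categorical column.
codes = [1, 2, 3, 1000, 2**24]
round_tripped = [to_f32(code) for code in codes]

# Integers up to 2**24 are represented exactly in f32.
all_exact = all(float(code) == value for code, value in zip(codes, round_tripped))

# 2**24 + 1 is the first positive integer that f32 cannot hold exactly.
first_inexact = to_f32(2**24 + 1)

print(all_exact)
print(first_inexact == float(2**24 + 1))
```

So a float representation only loses information once a column has more than about 16.7 million distinct codes, which is far beyond a two-class label.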

@HJH0924
Author

HJH0924 commented Jan 3, 2024

For example, I have a table named test whose label column is a string type with the values positive or negative. I preprocess the label column with the following statement:
SELECT aisql.train(
    project_name => 'preprocessed_model',
    task => 'classification',
    relation_name => 'test',
    y_column_name => 'label',
    preprocess => '{
        "label": {"encode": {"ordinal": ["positive", "negative"]}}
    }'
);
The preprocessing step runs, encoding positive as 1.0 and negative as 2.0, but training cannot complete because a classification task requires integer labels. The comment on line 80 of 'pgml-extension/src/om/snapshot.rs' says ordinal encoding produces integers, yet both running the statement and reading line 1021 of the code show that the value is encoded as a floating-point number (f32).
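The behavior described here can be sketched in plain Python. This is only an illustration of the reported result (categories numbered from 1 in the listed order and stored as floats), not the extension's Rust implementation; the function name `ordinal_encode` and the sample labels are assumptions.

```python
def ordinal_encode(values, categories):
    """Map each value to its 1-based position in `categories`, as a float."""
    codes = {category: float(index) for index, category in enumerate(categories, start=1)}
    return [codes[value] for value in values]


# Hypothetical label column matching the example table.
labels = ["positive", "negative", "negative", "positive"]

# Float codes, as reported: positive becomes 1.0 and negative becomes 2.0.
encoded = ordinal_encode(labels, ["positive", "negative"])

# The integer class ids a classifier would expect carry the same information.
class_ids = [int(code) for code in encoded]

print(encoded)
print(class_ids)
```

The float codes and the integer class ids are interchangeable here, so the mismatch lies in the type the classifier expects rather than in the encoded values themselves.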

@montanalow
Contributor

Postgres does not have a "string" type, and TEXT is not positive or negative. You may want to cast SQL TEXT to int?

@HJH0924
Author

HJH0924 commented Jan 3, 2024

yes

@montanalow
Contributor

This is related to #631, with mixed text/int datasets

@HJH0924
Author

HJH0924 commented Jan 4, 2024

got it. Thanks!

@hobson

hobson commented Mar 21, 2024

Strings cannot automatically be converted to meaningful representations (machine learning features) as ordinal ints. You must explicitly map them to floating-point values that represent their position on a continuous number line that you define. You may want to represent "positive" as 100, "negative" as -100, and "neutral" as 0. Or maybe 1, 0, and 0.5.
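A minimal sketch of such an explicit mapping, assuming the 100 / 0 / -100 scale mentioned above. The names `SENTIMENT_SCALE` and `encode_sentiment` are hypothetical, and the right scale depends on the data and the model.

```python
# Explicit, user-defined positions on a number line (values from the comment above).
SENTIMENT_SCALE = {"positive": 100.0, "neutral": 0.0, "negative": -100.0}


def encode_sentiment(labels):
    """Replace each label with its explicitly chosen numeric value."""
    unknown = sorted(set(labels) - set(SENTIMENT_SCALE))
    if unknown:
        # Fail loudly instead of guessing a value for an unmapped label.
        raise ValueError(f"no numeric value defined for labels: {unknown}")
    return [SENTIMENT_SCALE[label] for label in labels]


values = encode_sentiment(["positive", "neutral", "negative"])
print(values)
```

Rejecting unmapped labels keeps the mapping explicit: a new category cannot silently acquire a position on the scale.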


3 participants