In classification tasks, perform ordinal encoding on the target column of string types #1260

HJH0924 · 2024-01-02T07:22:03Z

In the classification task, when using ordinal to encode the target column of a string type, an error will be reported. After checking the source code, it was found that in the 1021st line of code in 'pgml extension/src/om/snapshot. rs', it will be converted to f32, which is a floating-point number, instead of "Encode each category as ascending integer values" as described in the 80th line of code. However, changing the task to Regression is sufficient, but that would be meaningless.

montanalow · 2024-01-03T03:54:47Z

Can you add steps to reproduce an error? It's typical that things are eventually encoded as f32 in ML for hardware acceleration, even integer categoricals, depending on the underlying library implementation.

HJH0924 · 2024-01-03T06:25:07Z

For example, I have a data table test whose label column is a string type, either positive or negative. I preprocess the label column data using the following statement:
SELECT aisql.train(
project_name => 'preprocessed_model',
task => 'classification',
relation_name => 'test',
y_column_name => 'label',
preprocess => '{
"label": {"encode": {"ordinal": ["positive", "negative"]}}
}'
);
It can be preprocessed, encoding positive as 1.0 and negative as 2.0, but it cannot be trained to end because this is a classification task that requires the label to be an integer. At the same time, the comment on line 80 of 'pgml-extension/src/om/snapshot. rs' mentions that ordinal encoding will encode it as an integer, but in practice, both running and viewing line 1021 of the code indicate that it is encoded as a floating-point number (f32).

montanalow · 2024-01-03T06:58:36Z

Postgres does not have a "string" type. And TEXT is not positive or negative. You may want to cast sql TEXT to int?

HJH0924 · 2024-01-03T09:01:28Z

yes

montanalow · 2024-01-03T18:49:49Z

This is related to #631, with mixed text/int datasets

HJH0924 · 2024-01-04T01:50:04Z

got it. Thanks!

hobson · 2024-03-21T13:21:25Z

Strings cannot automatically be converted to meaningful representations (machine learning features) as ordinal ints. You must explicitly map them to floating point values that represent their position on a continuous number line that you explicitly define. You may want to represent "positive" as 100 and "negative” as -100 and "neutral" as 0. Or maybe 1, 0, and .5.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

In classification tasks, perform ordinal encoding on the target column of string types #1260

In classification tasks, perform ordinal encoding on the target column of string types #1260

HJH0924 commented Jan 2, 2024

montanalow commented Jan 3, 2024

HJH0924 commented Jan 3, 2024 •

edited

montanalow commented Jan 3, 2024

HJH0924 commented Jan 3, 2024

montanalow commented Jan 3, 2024

HJH0924 commented Jan 4, 2024

hobson commented Mar 21, 2024

In classification tasks, perform ordinal encoding on the target column of string types #1260

In classification tasks, perform ordinal encoding on the target column of string types #1260

Comments

HJH0924 commented Jan 2, 2024

montanalow commented Jan 3, 2024

HJH0924 commented Jan 3, 2024 • edited

montanalow commented Jan 3, 2024

HJH0924 commented Jan 3, 2024

montanalow commented Jan 3, 2024

HJH0924 commented Jan 4, 2024

hobson commented Mar 21, 2024

HJH0924 commented Jan 3, 2024 •

edited