Incorrect ASCII detection #9

cbourgeois · 2021-03-01T10:10:01Z

Hi,

I think the test set for this package is too limited; the default results for very simple strings are wrong:

echo $LANG
en_US.UTF-8

charamel.Detector().probe('abc')
[(<Encoding.CP_1006: 'cp1006'>, 0.9521461826551444), (<Encoding.CP_864: 'cp864'>, 0.9462450387005286), (<Encoding.UTF_7: 'utf_7'>, 0.9452766125829656)]
charamel.Detector().probe('Param1234567890*ą_')
[(<Encoding.CP_1006: 'cp1006'>, 0.9521461826551444), (<Encoding.CP_864: 'cp864'>, 0.9462450387005286), (<Encoding.UTF_7: 'utf_7'>, 0.9452766125829656)]

The first one should return ascii and the second one UTF-8.

Thanks in advance for looking into that,

chomechome · 2021-03-01T11:25:31Z

Hi Clément,

These two examples run detection on Unicode strings rather than on bytes; the correct test would be:

charamel.Detector().probe('abc'.encode('ascii'))
charamel.Detector().probe('Param1234567890*ą_'.encode('utf-8'))

By design, charamel returns encodings that can likely decode a sequence of bytes into a string correctly. The result does not have to be the same encoding that was used to encode the string, as long as the result of .decode(encoding) is the same.
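To illustrate the point about equivalent decodings, here is a minimal sketch that uses only Python's built-in codecs. The helper name `decodes_identically` is hypothetical and not part of charamel; it simply checks which candidate encodings decode a byte sequence to the same string as a reference encoding.

```python
from typing import List


def decodes_identically(data: bytes, reference: str, candidates: List[str]) -> List[str]:
    """Return the candidates that decode `data` to the same string as `reference`."""
    expected = data.decode(reference)
    matches = []
    for name in candidates:
        try:
            if data.decode(name) == expected:
                matches.append(name)
        except (UnicodeDecodeError, LookupError):
            # The codec cannot decode these bytes, or the codec name is unknown.
            continue
    return matches


# Pure ASCII bytes decode to the same string under many encodings.
print(decodes_identically(b"abc", "ascii", ["ascii", "utf_8", "utf_7", "latin_1"]))

# Bytes containing a non-ASCII character are far more selective.
data = "Param1234567890*ą_".encode("utf-8")
print(decodes_identically(data, "utf_8", ["ascii", "utf_8", "latin_1"]))
```

For `b"abc"` every listed codec qualifies, which is why a detector can report something other than ASCII and still be correct in this sense.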

This holds true for the first test with abc, because the most probable returned encoding is UTF-7 and it can decode ASCII correctly. However, the second test is indeed not working as expected, because it returns shift_jis_2004, which is used to encode Japanese text. Thank you for notifying me about that. I am currently working on a new release and will take that into account.

cbourgeois · 2021-03-01T11:52:50Z

Hi Vladislav,

Indeed, your answer makes sense.
From the user's perspective, I can think of the following two enhancements:

When given a Python Unicode string, it might make sense to either raise an error (like chardet.detect) or return a specific reference to the internal Python Unicode encoding (instead of CP1006, which is not really meaningful in that case).

When processing an arbitrary string, there's some value in having Detector() tell you that the ASCII encoding is sufficient to decode it (like chardet does, allowing charamel to be a drop-in replacement).
I do not know if this would need to be applied to other encodings that are strict subsets of others, i.e. tweak charamel to return the smaller subset that can decode the string.
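As a rough sketch of how both suggestions could be layered on top of a detector from user code, assuming nothing about charamel's internals: the names `require_bytes` and `narrowest_encoding` are hypothetical, and `fallback` stands for any callable that takes bytes and returns an encoding (for example, a thin wrapper around a detector).

```python
from typing import Callable


def require_bytes(content) -> bytes:
    """Reject str input up front instead of returning a meaningless guess."""
    if not isinstance(content, (bytes, bytearray)):
        raise TypeError(
            "expected bytes or bytearray, got %s" % type(content).__name__
        )
    return bytes(content)


def narrowest_encoding(content, fallback: Callable[[bytes], object]):
    """Prefer ASCII, then UTF-8, and only then defer to the fallback detector."""
    data = require_bytes(content)
    for name in ("ascii", "utf_8"):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    # Neither strict subset applies, so ask the (assumed) detector.
    return fallback(data)


# Usage with a stub in place of a real detector.
stub = lambda data: "unknown"
print(narrowest_encoding(b"abc", stub))
print(narrowest_encoding("Param1234567890*ą_".encode("utf-8"), stub))
```

The fixed ASCII-then-UTF-8 order is only one possible policy; whether other subset relationships between encodings deserve the same treatment is left open here, as in the question above.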

chomechome added the bug label ("Something isn't working") · Mar 1, 2021
