Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect ASCII detection #9

Open
cbourgeois opened this issue Mar 1, 2021 · 2 comments
Open

Incorrect ASCII detection #9

cbourgeois opened this issue Mar 1, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@cbourgeois
Copy link

Hi,

I think that the test set for this package is too reduced, the default values for very simple strings are wrong:

echo $LANG
en_US.UTF-8

charamel.Detector().probe('abc')
[(<Encoding.CP_1006: 'cp1006'>, 0.9521461826551444), (<Encoding.CP_864: 'cp864'>, 0.9462450387005286), (<Encoding.UTF_7: 'utf_7'>, 0.9452766125829656)]
charamel.Detector().probe('Param1234567890*ą_')
[(<Encoding.CP_1006: 'cp1006'>, 0.9521461826551444), (<Encoding.CP_864: 'cp864'>, 0.9462450387005286), (<Encoding.UTF_7: 'utf_7'>, 0.9452766125829656)]

The first one should return ascii and the second one UTF-8.

Thanks in advance for looking into that,

@chomechome chomechome added the bug Something isn't working label Mar 1, 2021
@chomechome
Copy link
Owner

chomechome commented Mar 1, 2021

Hi Clément,

These two examples run detection on Unicode strings, the correct test would be:

charamel.Detector().probe('abc'.encode('ascii'))
charamel.Detector().probe('Param1234567890*ą_'.encode('utf-8'))

By design, charamel returns encodings that likely can decode a sequence of bytes into a string correctly. It does not have to be the same encoding that was used to encode the string as long as the result of .decode(encoding) is the same.

This holds true for the first test with abc because the most probable returned encoding is UTF-7 and it can decode ASCII correctly. However, the second test is indeed not working as expected because it returns shift_jis_2004 which is used to encode Japanese text. Thank you for notifying me about that. I am currently working on a new release and will take that into account.

@cbourgeois
Copy link
Author

Hi Vladislav,

Indeed your answer makes sense.
From the user perspective I can think of the following two enchancements:

When providing a python unicode string it might make sense to either raise an error (like chardet.detect) or return a specific reference to the internal python unicode encoding (instead of CP1006 which is not really meaningful in that case).

When processing an arbitrary string, there's some value in having Detector() telling you that the ASCII encoding is sufficient to decode it (like chardet does, allowing charamel to be a dropped-in replacement).
I do not know if this would need to be applied to other encodings that are strict subsets of others, i.e. tweak charamel to return the smaller subset that can decode the string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants