File detected as Windows-1250, but is UTF-8 #108

Open
tobbi opened this issue Feb 20, 2020 · 3 comments

Comments

@tobbi

tobbi commented Feb 20, 2020

I'm using UTF.Unknown 2.3.0.
The following file is detected as Windows-1250, but it is UTF-8:

csv_test_correct_GZ.zip

@rstm-sf
Collaborator

rstm-sf commented Feb 20, 2020

Hello, @tobbi !

Thank you for the report.

Could you attach the text file itself? Why did you choose zip? Are you passing the zip archive to the detector as input?

@tobbi
Author

tobbi commented Feb 20, 2020

Sorry, my bad: it was originally a CSV file, and GitHub wouldn't accept those. Here's the file with the extension changed to .txt:

csv_test_correct_GZ.txt

@rstm-sf
Collaborator

rstm-sf commented Feb 20, 2020

Thanks for clarifying.

At first glance, I think the result is normal. Why? The detection algorithm is statistical, so the more varied the input data, the more accurate the final result. Details can be found in the article "A composite approach to language/encoding detection".

But we should still try to improve the result :)
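
As a side note for readers: a strict UTF-8 decode is a common cross-check next to a statistical detector, because text in a single-byte encoding with accented characters rarely happens to form valid UTF-8 multi-byte sequences. The sketch below is in Python and is not part of UTF.Unknown. `looks_like_utf8` is a hypothetical helper name, and treating "valid UTF-8 with at least one non-ASCII byte" as a strong hint is an assumption here, not the library's rule.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes strictly as UTF-8 and is not pure ASCII.

    Hypothetical helper for illustration only; not UTF.Unknown's logic.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # At least one byte sequence is not well-formed UTF-8.
        return False
    # Pure ASCII is valid in UTF-8 and in most single-byte encodings,
    # so it carries no evidence either way.
    return any(byte >= 0x80 for byte in data)


sample = "Gänsefüßchen"
print(looks_like_utf8(sample.encode("utf-8")))   # UTF-8 bytes
print(looks_like_utf8(sample.encode("cp1250")))  # Windows-1250 bytes
print(looks_like_utf8(b"plain ascii"))           # no non-ASCII bytes
```

A check like this could be run on the input before or after detection to decide whether a single-byte result such as windows-1250 deserves a second look.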


Status Logs:

SBCS: Detected windows-1250 with confidence of 0.7738685

Get confidence:
-- new match found: confidence 0.01, index 0, charset windows-1251.
-- new match found: confidence 0.18598664, index 6, charset iso-8859-7.
-- new match found: confidence 0.7133932, index 15, charset iso-8859-1.
-- new match found: confidence 0.71340704, index 18, charset iso-8859-1.
-- new match found: confidence 0.76677626, index 23, charset iso-8859-1.
-- new match found: confidence 0.7738685, index 86, charset windows-1250.
Get confidence done.
SBCS Group Prober --------begin status
SBCS 0.01: [windows-1251]
SBCS: 0.01 [windows-1251]

SBCS 0.01: [koi8-r]
SBCS: 0.01 [koi8-r]

SBCS 0: [iso-8859-5]
SBCS: 0.00 [iso-8859-5]

SBCS 0.01: [x-mac-cyrillic]
SBCS: 0.01 [x-mac-cyrillic]

SBCS 0.01: [ibm866]
SBCS: 0.01 [ibm866]

SBCS 0.01: [ibm855]
SBCS: 0.01 [ibm855]

SBCS 0.18598664: [iso-8859-7]
SBCS: 0.1859866 [iso-8859-7]

SBCS 0.18598664: [windows-1253]
SBCS: 0.1859866 [windows-1253]

SBCS 0: [iso-8859-5]
SBCS: 0.00 [iso-8859-5]

SBCS 0.01: [windows-1251]
SBCS: 0.01 [windows-1251]

SBCS 0: [windows-1255]
HEB: 0 - 0 [Logical-Visual score]

SBCS 0: [windows-1255]
SBCS: 0.00 [windows-1255]

SBCS 0: [windows-1255]
SBCS: 0.00 [windows-1255]

SBCS 0.09991017: [tis-620]
SBCS: 0.09991017 [tis-620]

SBCS 0.09991017: [iso-8859-11]
SBCS: 0.09991017 [iso-8859-11]

SBCS 0.7133932: [iso-8859-1]
SBCS: 0.7133932 [iso-8859-1]

SBCS 0.6674997: [iso-8859-15]
SBCS: 0.6674997 [iso-8859-15]

SBCS 0.7133932: [windows-1252]
SBCS: 0.7133932 [windows-1252]

SBCS 0.71340704: [iso-8859-1]
SBCS: 0.713407 [iso-8859-1]

SBCS 0.67082536: [iso-8859-15]
SBCS: 0.6708254 [iso-8859-15]

SBCS 0.71340704: [windows-1252]
SBCS: 0.713407 [windows-1252]

SBCS 0.6861101: [iso-8859-2]
SBCS: 0.6861101 [iso-8859-2]

SBCS 0.6861101: [windows-1250]
SBCS: 0.6861101 [windows-1250]

SBCS 0.76677626: [iso-8859-1]
SBCS: 0.7667763 [iso-8859-1]

SBCS 0.76677626: [windows-1252]
SBCS: 0.7667763 [windows-1252]

SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.717128: [iso-8859-9]
SBCS: 0.717128 [iso-8859-9]

SBCS inactive: [iso-8859-6] (i.e. confidence is too low).
SBCS 0: [windows-1256]
SBCS: 0.00 [windows-1256]

SBCS 0.40016073: [viscii]
SBCS: 0.4001607 [viscii]

SBCS 0.44124976: [windows-1258]
SBCS: 0.4412498 [windows-1258]

SBCS 0.71854687: [iso-8859-15]
SBCS: 0.7185469 [iso-8859-15]

SBCS 0.7641578: [iso-8859-1]
SBCS: 0.7641578 [iso-8859-1]

SBCS 0.7641578: [windows-1252]
SBCS: 0.7641578 [windows-1252]

SBCS 0.71640146: [iso-8859-13]
SBCS: 0.7164015 [iso-8859-13]

SBCS 0.6377162: [iso-8859-10]
SBCS: 0.6377162 [iso-8859-10]

SBCS 0.6736411: [iso-8859-4]
SBCS: 0.6736411 [iso-8859-4]

SBCS 0.71818155: [iso-8859-13]
SBCS: 0.7181816 [iso-8859-13]

SBCS 0.6363546: [iso-8859-10]
SBCS: 0.6363546 [iso-8859-10]

SBCS 0.6753149: [iso-8859-4]
SBCS: 0.6753149 [iso-8859-4]

SBCS 0.666065: [iso-8859-1]
SBCS: 0.666065 [iso-8859-1]

SBCS 0.666065: [iso-8859-9]
SBCS: 0.666065 [iso-8859-9]

SBCS 0.62630904: [iso-8859-15]
SBCS: 0.626309 [iso-8859-15]

SBCS 0.666065: [windows-1252]
SBCS: 0.666065 [windows-1252]

SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.6366351: [windows-1250]
SBCS: 0.6366351 [windows-1250]

SBCS 0.6366351: [iso-8859-2]
SBCS: 0.6366351 [iso-8859-2]

SBCS 0.72143143: [x-mac-ce]
SBCS: 0.7214314 [x-mac-ce]

SBCS 0.72143143: [ibm852]
SBCS: 0.7214314 [ibm852]

SBCS 0.6434225: [windows-1250]
SBCS: 0.6434225 [windows-1250]

SBCS 0.64008415: [iso-8859-2]
SBCS: 0.6400841 [iso-8859-2]

SBCS 0.7291228: [x-mac-ce]
SBCS: 0.7291228 [x-mac-ce]

SBCS 0.7253399: [ibm852]
SBCS: 0.7253399 [ibm852]

SBCS 0.58494663: [windows-1250]
SBCS: 0.5849466 [windows-1250]

SBCS 0.5881849: [iso-8859-2]
SBCS: 0.5881849 [iso-8859-2]

SBCS 0.61615247: [iso-8859-13]
SBCS: 0.6161525 [iso-8859-13]

SBCS 0.58494663: [iso-8859-16]
SBCS: 0.5849466 [iso-8859-16]

SBCS 0.66285837: [x-mac-ce]
SBCS: 0.6628584 [x-mac-ce]

SBCS 0.65958494: [ibm852]
SBCS: 0.6595849 [ibm852]

SBCS 0.7628341: [iso-8859-1]
SBCS: 0.7628341 [iso-8859-1]

SBCS 0.71730226: [iso-8859-4]
SBCS: 0.7173023 [iso-8859-4]

SBCS 0.71730226: [iso-8859-9]
SBCS: 0.7173023 [iso-8859-9]

SBCS 0.7628341: [iso-8859-13]
SBCS: 0.7628341 [iso-8859-13]

SBCS 0.71730226: [iso-8859-15]
SBCS: 0.7173023 [iso-8859-15]

SBCS 0.7628341: [windows-1252]
SBCS: 0.7628341 [windows-1252]

SBCS 0.76252055: [iso-8859-1]
SBCS: 0.7625206 [iso-8859-1]

SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.76252055: [iso-8859-9]
SBCS: 0.7625206 [iso-8859-9]

SBCS 0.71700746: [iso-8859-15]
SBCS: 0.7170075 [iso-8859-15]

SBCS 0.76252055: [windows-1252]
SBCS: 0.7625206 [windows-1252]

SBCS 0.6695262: [windows-1250]
SBCS: 0.6695262 [windows-1250]

SBCS 0.6695262: [iso-8859-2]
SBCS: 0.6695262 [iso-8859-2]

SBCS 0.7052443: [iso-8859-13]
SBCS: 0.7052443 [iso-8859-13]

SBCS 0.6695262: [iso-8859-16]
SBCS: 0.6695262 [iso-8859-16]

SBCS 0.7587035: [x-mac-ce]
SBCS: 0.7587035 [x-mac-ce]

SBCS 0.7587035: [ibm852]
SBCS: 0.7587035 [ibm852]

SBCS 0.76380235: [windows-1252]
SBCS: 0.7638023 [windows-1252]

SBCS 0.76380235: [windows-1257]
SBCS: 0.7638023 [windows-1257]

SBCS 0.71821266: [iso-8859-4]
SBCS: 0.7182127 [iso-8859-4]

SBCS 0.76380235: [iso-8859-13]
SBCS: 0.7638023 [iso-8859-13]

SBCS 0.71821266: [iso-8859-15]
SBCS: 0.7182127 [iso-8859-15]

SBCS 0.6575037: [iso-8859-1]
SBCS: 0.6575037 [iso-8859-1]

SBCS 0.6575037: [iso-8859-9]
SBCS: 0.6575037 [iso-8859-9]

SBCS 0.61825883: [iso-8859-15]
SBCS: 0.6182588 [iso-8859-15]

SBCS 0.6575037: [windows-1252]
SBCS: 0.6575037 [windows-1252]

SBCS 0.7738685: [windows-1250]
SBCS: 0.7738685 [windows-1250]

SBCS 0.7738685: [iso-8859-2]
SBCS: 0.7738685 [iso-8859-2]

SBCS 0.7738685: [iso-8859-16]
SBCS: 0.7738685 [iso-8859-16]

SBCS 0.75962406: [ibm852]
SBCS: 0.7596241 [ibm852]

SBCS 0.66994256: [windows-1250]
SBCS: 0.6699426 [windows-1250]

SBCS 0.66994256: [iso-8859-2]
SBCS: 0.6699426 [iso-8859-2]

SBCS 0.66994256: [iso-8859-16]
SBCS: 0.6699426 [iso-8859-16]

SBCS 0.75917524: [x-mac-ce]
SBCS: 0.7591752 [x-mac-ce]

SBCS 0.75917524: [ibm852]
SBCS: 0.7591752 [ibm852]

SBCS 0.76376295: [iso-8859-1]
SBCS: 0.763763 [iso-8859-1]

SBCS 0.7181756: [iso-8859-4]
SBCS: 0.7181756 [iso-8859-4]

SBCS 0.76376295: [iso-8859-9]
SBCS: 0.763763 [iso-8859-9]

SBCS 0.7181756: [iso-8859-15]
SBCS: 0.7181756 [iso-8859-15]

SBCS 0.76376295: [windows-1252]
SBCS: 0.763763 [windows-1252]

SBCS Group found best match [windows-1250] confidence 0.7738685.

MBCS: Detected utf-8 with confidence of 0.7525

Get confidence:
-- new match found: confidence 0.7525, index 0, charset utf-8.
Get confidence done.
MBCS Group Prober --------begin status
MBCS 0.7525: [utf-8]

MBCS 0.01: [shift-jis]

MBCS 0.01: [euc-jp]

MBCS 0.01: [gb18030]

MBCS 0.01: [euc-kr]

MBCS 0.01: [cp949]

MBCS 0.01: [big5]

MBCS inactive: euc-tw (i.e. confidence is too low).
MBCS Group found best match [utf-8] confidence 0.7525.

Latin1Prober: Detected windows-1252 with confidence of 0.43269232

Latin1Prober: 0.43269232 [windows-1252]
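
To make the outcome in the log easier to see, here is a small Python sketch that ranks the three top-level confidences reported above. It assumes the final answer is simply the candidate with the highest confidence, which matches the reported result but is not confirmed here as the library's actual selection logic. `rank_candidates` is a hypothetical name.

```python
def rank_candidates(confidences):
    """Sort (charset, confidence) pairs from most to least confident."""
    return sorted(confidences.items(), key=lambda item: item[1], reverse=True)


# Top-level values copied from the status log above.
reported = {
    "windows-1250": 0.7738685,   # SBCS group prober
    "utf-8": 0.7525,             # MBCS group prober
    "windows-1252": 0.43269232,  # Latin1Prober
}

ranking = rank_candidates(reported)
best, runner_up = ranking[0], ranking[1]
margin = best[1] - runner_up[1]

print(f"best: {best[0]}, runner-up: {runner_up[0]}, margin: {margin:.4f}")
```

Under that assumption, windows-1250 wins over utf-8 by a margin of only about 0.02, so a small change in the input could flip the result.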
