Tika-2421 : About the encoding of HTML #338

PeterAlfredLee · 2020-08-13T07:34:47Z

It seems we can use charsetdetector.StandardHtmlEncodingDetector to detect the charset of HTML. Is there a reason we are not using it?

I also stopped treating ISO-8859-1 as an alias of Windows-1252.

tballison · 2020-08-13T14:21:21Z

Inertia... I never got around to doing a bakeoff between the two, and, unless there's evidence of improvement, I'm hesitant to change the default detector.

PeterAlfredLee · 2020-08-14T01:41:11Z

As TIKA-2421 says, according to the W3C description, we should check the HTML for a byte order mark (BOM) first.
If there is no BOM, the document can be assumed to be ASCII-compatible, so we can read the HTML's meta tag as ASCII and get the charset from it.

HtmlEncodingDetector does not check for a BOM first; it assumes the HTML's meta tag is ASCII-compatible.
StandardHtmlEncodingDetector checks for a BOM first, then reads the metadata if there is no BOM, then reads the meta tag if the metadata has no charset.
So I think using StandardHtmlEncodingDetector is more compliant with the W3C standard.
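
To make the detection order described above concrete, here is a minimal sketch in plain Java. It is not Tika's implementation: the class `HtmlCharsetOrderSketch`, the method `detect`, and the `declaredCharset` parameter (standing in for a charset taken from metadata such as a Content-Type header) are hypothetical names, and the regex is a deliberately simplified stand-in for the real meta-tag prescan.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical class, not part of Tika. */
final class HtmlCharsetOrderSketch {

    // Simplified stand-in for the meta prescan: finds charset=... inside a <meta> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_:.+-]*)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the first charset found in BOM, metadata, meta tag order, or null. */
    static Charset detect(byte[] head, String declaredCharset) {
        // 1. A BOM, if present, wins over everything else.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2. No BOM: fall back to a charset declared in the metadata (assumed input).
        Charset declared = lookup(declaredCharset);
        if (declared != null) return declared;

        // 3. No metadata charset: the bytes are assumed ASCII-compatible,
        //    so the meta tag can be read as ASCII.
        Matcher m = META_CHARSET.matcher(new String(head, StandardCharsets.US_ASCII));
        return m.find() ? lookup(m.group(1)) : null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Resolves a charset name, returning null for missing, illegal, or unsupported names.
    private static Charset lookup(String name) {
        if (name == null) return null;
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"ISO-8859-1\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(html, null));    // meta tag is used: ISO-8859-1
        System.out.println(detect(html, "UTF-8")); // metadata takes precedence: UTF-8
    }
}
```

A detector that skips step 1 would read a UTF-16 document with a BOM as if it were ASCII and miss or misread the meta tag, which is the gap being discussed here.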

The only problem I can see is that StandardHtmlEncodingDetector treats ISO-8859-1 as Windows-1252; I have changed that in this PR.
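
For context on what the alias changes in practice: the two charsets differ only in the byte range 0x80 to 0x9F, where ISO-8859-1 maps to C1 control characters and Windows-1252 maps mostly to printable characters. The sketch below uses only the JDK's own decoders; the class name `Latin1VsCp1252` is hypothetical, and it assumes the JDK in use provides the `windows-1252` charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: a hypothetical class, not part of Tika. */
final class Latin1VsCp1252 {

    // Decodes the same bytes with whichever charset the detector reported.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0x80};
        // Assumption: windows-1252 is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        String asLatin1 = decode(data, StandardCharsets.ISO_8859_1);
        String asCp1252 = decode(data, cp1252);

        // Same byte, different characters.
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0)); // U+0080 (control)
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(0)); // U+20AC (euro sign)
    }
}
```

Bytes outside that range, such as 0xE9 (é), decode identically under both charsets, so the choice only matters for documents that actually use 0x80 to 0x9F.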

So I think we can make StandardHtmlEncodingDetector the default detector.
Or we can modify HtmlEncodingDetector to comply with the W3C standard. WDYT?

tballison · 2020-09-03T16:32:07Z

Wait, it turns out I did get around to doing this study...

https://github.com/tballison/share/blob/main/slides/Tika_charset_detector_study_201909.docx

Let me read it and remember what I found... 🤣

PeterAlfredLee force-pushed the TIKA-2421 branch from 2a1b6a5 to dcc5e38 · August 25, 2020 09:08

PeterAlfredLee force-pushed the TIKA-2421 branch from dcc5e38 to 90af867 · September 2, 2020 01:33

PeterAlfredLee force-pushed the TIKA-2421 branch from 90af867 to ffc6e5b · September 5, 2020 02:31

PeterAlfredLee force-pushed the TIKA-2421 branch from ffc6e5b to 6a9f4ca · December 3, 2020 01:24

PeterAlfredLee added 2 commits December 5, 2020 10:17

Modify charset aliases: stop treating ISO-8859-1 as an alias of Windows-1252

0e7b475

Modify default encoding detector

99eaa8a

Replace HtmlEncodingDetector with StandardHtmlEncodingDetector; adjust some test cases

PeterAlfredLee force-pushed the TIKA-2421 branch from 6a9f4ca to 99eaa8a · December 5, 2020 02:17
