CSV files and other TXT files charset encoding selection needed ( `windows-1251` etc ) #2277

alfablend · 2024-03-26T13:15:16Z

Version and OS
0.45.16 on windows 11/docker

Is your feature request related to a problem? Please describe.
There is no way to select the charset encoding for CSV files (tables in plain-text format), so the content of these files may be unreadable.
As far as I understand, changedetection uses the UTF-8 charset for these files. The CSV files that I need to monitor are in the windows-1251 charset.

Describe the solution you'd like
I need an option to select the correct charset.
My CSV files are encoded in the windows-1251 charset.

Describe the use-case and give concrete real-world examples

A lot of large datasets are published in CSV format. It is a text format that represents data tables using commas or other delimiters. You can read more about it on Wikipedia:
https://en.wikipedia.org/wiki/Comma-separated_values
As text files, CSV files may be encoded in something other than UTF-8. For example, they can use the windows-1251 or koi8-r charset.
The CSV files that I am trying to use with the changedetection app are unreadable due to the absence of charset selection.
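
To illustrate the problem described above, here is a minimal sketch of what happens when windows-1251 bytes are read as UTF-8. The sample row is hypothetical (it is not taken from the actual file), and `decode_as` is a helper name made up for this example.

```python
# Hypothetical sample row with Cyrillic text; not taken from the reporter's file.
row = "Город;Население"
raw = row.encode("windows-1251")  # bytes as a windows-1251 CSV would store them


def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes, substituting U+FFFD for bytes that are invalid in `encoding`."""
    return data.decode(encoding, errors="replace")


as_utf8 = decode_as(raw, "utf-8")           # wrong charset: Cyrillic letters are lost
as_cp1251 = decode_as(raw, "windows-1251")  # right charset: text is recovered

print(as_utf8)
print(as_cp1251)
```

Only the ASCII delimiter survives the wrong decoding, which is why the file looks unreadable when it is assumed to be UTF-8.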

dgtlmoon · 2024-03-26T14:24:54Z

Any chance you can copy+paste the request headers for the site you are trying? I need more exact info.

Load the URL in Chrome and open the Inspect > Network tab.

alfablend · 2024-03-26T15:04:02Z

Thanks for your answer!

When I try to open the link to the CSV file in Chrome, it automatically downloads a file with the .csv extension. The Chrome window stays blank.

So the Network tab, as far as I understand, is blank too.

I use the plain parser in changedetection.io to work with CSV files; Chrome mode does not work with these files.

dgtlmoon · 2024-03-26T15:13:42Z

@alfablend use curl from command line instead

$ curl --head https://changedetection.io/CHANGELOG.txt
HTTP/2 200 
server: nginx
date: Tue, 26 Mar 2024 15:13:13 GMT
content-type: text/plain
content-length: 86815
last-modified: Tue, 26 Mar 2024 15:01:02 GMT
vary: Accept-Encoding
etag: "6602e32e-1531f"
strict-transport-security: max-age=63072000
accept-ranges: bytes

try that
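
Once the headers are available, the declared MIME type and charset can be pulled out of the `content-type` line. The following is only a sketch using the Python standard library; `declared_type_and_charset` is a hypothetical helper name, and the header strings in the usage lines are illustrative values, not output from the site in question.

```python
from email.message import Message
from typing import Optional, Tuple


def declared_type_and_charset(content_type: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header value into (mime type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_type(), msg.get_content_charset()


# Illustrative header values (assumptions, not real responses).
print(declared_type_and_charset("text/plain"))
print(declared_type_and_charset("text/html; charset=UTF-8"))
```

A header such as the `text/plain` one shown above carries no charset parameter at all, so the client has to guess or fall back to a default.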

alfablend · 2024-03-26T16:16:35Z

Thanks, done (I replaced the link in your command with my link first).

As I see it, the response reports the UTF-8 charset. But that does not match the encoding of the downloaded CSV file itself, which is windows-1251. Is there any way to force the windows-1251 charset?

dgtlmoon · 2024-03-26T16:23:23Z

it seems the server is returning the wrong information, your CSV is reported as "text/html"

can you attach the CSV file?

alfablend · 2024-03-26T17:03:42Z

Thank you for the explanation!
Here is the file:
urvi (1).csv

dgtlmoon · 2024-03-27T12:47:08Z

import chardet

def detect_encoding(file_path):
    # Read the file as raw bytes so no decoding is attempted yet
    with open(file_path, 'rb') as f:
        rawdata = f.read()
        # chardet returns a dict with 'encoding' and 'confidence' keys
        result = chardet.detect(rawdata)
    return result

result = detect_encoding("urvi.1.csv")
print("The encoding of the file is:", result['encoding'])
print("Confidence level:", result['confidence'])

$ python3 ./test.py 
The encoding of the file is: windows-1251
Confidence level: 0.9414230748073508

so the file is windows-1251 but the web server is reporting the wrong encoding type

i'm also not sure if windows-1251 is supported by any of our text difference handlers, more than likely not...

alfablend · 2024-03-27T14:32:56Z

Thank you! If I understand you correctly, there is a general problem with non-Unicode (non-Latin) content, and the solution may be to find a preprocessor (charset converter).

According to Wikipedia, windows-1251 is still "the second most-used single-byte character encoding (or third most-used character encoding overall)". But of course it is still a small percentage at the scale of the global internet, and I understand it may not be a priority task.

dgtlmoon · 2024-03-27T14:38:16Z

> Thank you! If I understand you correctly, there is a general problem with non-Unicode (non-Latin) content, and the solution may be to find a preprocessor (charset converter).

the software already has the chardet detection library installed :) so the first step is to write some tests and understand the relationship between the windows encoding type and websites that return the wrong MIME type
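
One possible shape for such a test is sketched below. It is not changedetection.io's actual code: `decode_body` is a hypothetical helper, and the fixed fallback list stands in for whatever a detector such as chardet would suggest. The idea is to try the charset the server declared first, and only move on when the bytes do not decode cleanly.

```python
from typing import Iterable, Optional, Tuple


def decode_body(
    raw: bytes,
    declared: Optional[str],
    fallbacks: Iterable[str] = ("windows-1251",),
) -> Tuple[str, str]:
    """Return (text, encoding used), preferring the declared charset."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            # Strict decoding: raises if the bytes are invalid for this charset.
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # Last resort: keep what can be read and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"


# Server claims UTF-8, but the body is really windows-1251 (hypothetical data).
body = "Привет".encode("windows-1251")
print(decode_body(body, "utf-8"))
```

Strict decoding is used on purpose here: with `errors="replace"` the wrong charset would "succeed" silently and the fallback would never be tried.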

alfablend added the enhancement New feature or request label Mar 26, 2024

dgtlmoon changed the title ~~CSV files and other TXT files charset encoding selection needed~~ CSV files and other TXT files charset encoding selection needed ( windows-1251 etc ) Mar 27, 2024
