CSV files and other TXT files charset encoding selection needed ( windows-1251 etc ) #2277

Open
alfablend opened this issue Mar 26, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@alfablend

Version and OS
0.45.16 on windows 11/docker

Is your feature request related to a problem? Please describe.
Charset selection for CSV files (tables in plain-text format) is not supported, so the content of these files may be unreadable.
As far as I understand, changedetection uses the UTF-8 charset for these files. The CSV files that I need to monitor are in the windows-1251 charset.

Describe the solution you'd like
I need a way to select the correct charset.
My CSV files are encoded in the windows-1251 charset.

Describe the use-case and give concrete real-world examples

image

A lot of large datasets are published in CSV format. It is a text format that represents data tables using commas or other delimiters. You can read more about it on Wikipedia:
https://en.wikipedia.org/wiki/Comma-separated_values
Being text files, CSV files may be encoded in something other than UTF-8. For example, they can use the windows-1251 or koi8-r charset.
The CSV files that I am trying to use with the changedetection app are unreadable due to the absence of charset selection.
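
To illustrate the problem, here is a minimal sketch. The sample row is made up for illustration and is not taken from the actual file. It shows what happens when bytes written as windows-1251 are decoded as UTF-8:

```python
# Hypothetical sample row; the real CSV content is not reproduced here.
sample = "Дата;Значение\n2024-03-26;42\n"

# Bytes as a server would send them for a windows-1251 encoded file.
raw = sample.encode("windows-1251")

# Decoding with the wrong charset: invalid sequences become U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the file was actually written in.
right = raw.decode("windows-1251")

print(repr(wrong))
print(repr(right))
```

The ASCII parts (digits, semicolons) survive either way, which is why only the Cyrillic text looks broken.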

@alfablend alfablend added the enhancement New feature or request label Mar 26, 2024
@dgtlmoon
Owner

Any chance you can copy+paste the request headers for the site you are trying? i need more exact info

Load the URL in chrome and hit up the inspection > network tab

image

@alfablend
Author

Thanks for your answer!

When I try to open the link to the CSV file in Chrome, it automatically downloads a file with a .csv extension. The Chrome window stays blank.

image

So the network tab, as far as I understand, is blank too.

image

I use the plain parser in changedetection.io to work with CSV files; Chrome mode does not work with these files.

@dgtlmoon
Owner

@alfablend use curl from command line instead

$ curl --head https://changedetection.io/CHANGELOG.txt
HTTP/2 200 
server: nginx
date: Tue, 26 Mar 2024 15:13:13 GMT
content-type: text/plain
content-length: 86815
last-modified: Tue, 26 Mar 2024 15:01:02 GMT
vary: Accept-Encoding
etag: "6602e32e-1531f"
strict-transport-security: max-age=63072000
accept-ranges: bytes

try that

@alfablend
Author

Thanks, done (I replaced the link in your command with my link first).

image

As I see, this response declares the UTF-8 charset. But that does not match the encoding of the downloaded CSV file itself, which is windows-1251. Is there any way to force the windows-1251 charset?
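
What "forcing" a charset could look like, as a rough sketch only. The function `decode_body` and its `forced_charset` parameter are hypothetical names for illustration; nothing in this thread says changedetection.io has such an option today:

```python
from typing import Optional


def decode_body(body: bytes,
                declared_charset: Optional[str],
                forced_charset: Optional[str] = None) -> str:
    """Decode a response body, letting a user-chosen charset win.

    forced_charset is a hypothetical per-watch setting; declared_charset
    is whatever the server put in its Content-Type header.
    """
    charset = forced_charset or declared_charset or "utf-8"
    # errors="replace" keeps the call from raising on bad bytes.
    return body.decode(charset, errors="replace")


# Body is windows-1251, but the server claims UTF-8.
body = "Привет".encode("windows-1251")
print(decode_body(body, "utf-8"))                  # unreadable
print(decode_body(body, "utf-8", "windows-1251"))  # readable
```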

@dgtlmoon
Owner

it seems the server is returning the wrong information, your CSV is reported as "text/html"

can you attach the CSV file?

@alfablend
Author

Thank you for the explanation!
Here is the file:
urvi (1).csv

@dgtlmoon
Owner

import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        rawdata = f.read()
        result = chardet.detect(rawdata)
    return result

result = detect_encoding("urvi.1.csv")
print("The encoding of the file is:", result['encoding'])
print("Confidence level:", result['confidence'])
$ python3 ./test.py 
The encoding of the file is: windows-1251
Confidence level: 0.9414230748073508

so the file is windows-1251 but the web server is reporting the wrong encoding type

i'm also not sure if windows-1251 is supported by any of our text difference handlers, more than likely not...
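
A small sketch of how the mismatch described above could be checked programmatically, using only the standard library. The header values and the sample bytes are assumptions for illustration, since the actual response is only visible in the screenshot; the function names are hypothetical too:

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lowercased, or None if absent


def declared_charset_fits(content_type: str, body: bytes) -> bool:
    """True if the body decodes cleanly under the declared charset."""
    charset = charset_from_content_type(content_type)
    if charset is None:
        return True  # nothing declared, so nothing to contradict
    try:
        body.decode(charset)  # strict decoding: raises on invalid bytes
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Assumed example: windows-1251 bytes served with a UTF-8 label.
body = "Дата;Значение\n".encode("windows-1251")
print(declared_charset_fits("text/html; charset=UTF-8", body))
print(declared_charset_fits("text/csv; charset=windows-1251", body))
```

A clean decode does not prove the label is right (many single-byte charsets accept any bytes), but a failed strict decode is a strong hint that the server's label is wrong.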

@dgtlmoon dgtlmoon changed the title CSV files and other TXT files charset encoding selection needed CSV files and other TXT files charset encoding selection needed ( windows-1251 etc ) Mar 27, 2024
@alfablend
Author

Thank you! If I understand you correctly, there is a general problem with non-Unicode (non-Latin) content, and the solution may be to find a preprocessor (charset converter).

According to Wikipedia, windows-1251 is still "the second most-used single-byte character encoding (or third most-used character encoding overall)". But, of course, it is still a small percentage at the scale of the global internet, and I understand it may not be a priority task.

@dgtlmoon
Owner

Thank you! If I understand you correctly, there is a general problem with non-Unicode (non-Latin) content, and the solution may be to find a preprocessor (charset converter).

the software already has the chardet detection library installed :) so the first step is to write some tests and understand the relationship between the windows encoding type and websites that return the wrong mime type
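
One possible shape for such a fallback, as a sketch under assumptions rather than the project's actual code. `decode_with_fallback` is a hypothetical name; `chardet.detect` is the same call used earlier in this thread, and the import is guarded so the sketch also runs where chardet is missing:

```python
from typing import Optional

try:
    import chardet  # third-party; already a dependency per this thread
except ImportError:
    chardet = None  # the sketch still runs, just without detection


def decode_with_fallback(body: bytes, declared: Optional[str]) -> str:
    """Trust the declared charset only if the bytes decode cleanly."""
    if declared:
        try:
            return body.decode(declared)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong or unknown label: fall through to detection
    if chardet is not None:
        guess = chardet.detect(body)["encoding"]
        if guess:
            try:
                return body.decode(guess)
            except (UnicodeDecodeError, LookupError):
                pass  # the guess was unusable: use the last resort
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return body.decode("utf-8", errors="replace")
```

Tests for this would need real-world samples such as the attached file, since detection on very short inputs is unreliable.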
