Dealing with invalid utf-8 #121

Open
OrKoN opened this issue Apr 8, 2019 · 5 comments

Comments

@OrKoN
Contributor

OrKoN commented Apr 8, 2019

Expected behavior

Filter out invalid UTF-8 bytes automatically, or provide an option to opt in to this behavior.

Actual behavior

The error "XML syntax error on line 93: invalid UTF-8" is produced and the feed cannot be processed.

Steps to reproduce the behavior

This seems to happen only when I fetch the feed from https://ain.ua/feed using f.ParseURL. When I open a locally saved copy with f.Parse, it works.

ain.zip

@musabgultekin

You can fetch the feed with your own code instead of f.ParseURL, and add this header to your request:

req.Header.Set("Accept-Charset", "utf-8")

Then read the response through this auto-detecting decoder (charset.NewReader from golang.org/x/net/html/charset):

// Determine encoding and read body
reader, err := charset.NewReader(resp.Body, resp.Header.Get("Content-Type"))
if err != nil {
	return nil, err
}

@OrKoN
Contributor Author

OrKoN commented Apr 20, 2019

I have used the following workaround: https://github.com/kisielk/gorge/blob/master/util/util.go which strips non-utf8 chars from the stream.

@musabgultekin

Yes, this could work for any content, but it removes the bad bytes rather than decoding them. Anyway, if you are satisfied with this, no problem :)

@OrKoN
Contributor Author

OrKoN commented Apr 20, 2019

@musabgultekin yeah, I don't think the problem is that the site serves the wrong encoding. I think it's really just badly encoded UTF-8, with only some characters broken. The content still looks good after the bad characters are removed.

@googollee

I dug into this issue and found that it is caused by the encoding/xml package. The package checks whether each character is within the XML character range and, if not, returns that error.

I copied isInCharacterRange() into my code and use it to filter all characters before feeding the input into gofeed.Parser.
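
A rough sketch of what such a filter might look like follows. The ranges in isInCharacterRange are written here to follow the XML 1.0 Char production, which is what the unexported function in encoding/xml is understood to check; treat that as an assumption and compare against your Go version's source. The name filterXMLChars is hypothetical and not part of gofeed or encoding/xml.

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// isInCharacterRange reports whether r is allowed by the XML 1.0 Char
// production. Assumed to mirror the unexported helper in encoding/xml.
func isInCharacterRange(r rune) bool {
	return r == 0x09 ||
		r == 0x0A ||
		r == 0x0D ||
		r >= 0x20 && r <= 0xD7FF ||
		r >= 0xE000 && r <= 0xFFFD ||
		r >= 0x10000 && r <= 0x10FFFF
}

// filterXMLChars (hypothetical helper) drops bytes that are not valid UTF-8
// and runes that fall outside the XML character range.
func filterXMLChars(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		i += size
		if r == utf8.RuneError && size == 1 {
			continue // invalid UTF-8 byte, skip it
		}
		if isInCharacterRange(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}
```

The filtered text can then be handed to the parser, for example through strings.NewReader. A correctly encoded U+FFFD in the input is kept, since only undecodable bytes and out-of-range runes are dropped.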
