Make illegal character sanitization more robust #206
I'm guessing I need to handle this by first:
I could do something like:

    func convertToUTF8(data []byte) (string, error) {
        reader, err := charset.NewReader(bytes.NewReader(data), "")
        if err != nil {
            return "", err
        }
        utf8Data, err := ioutil.ReadAll(reader)
        if err != nil {
            return "", err
        }
        return string(utf8Data), nil
    }

    func sanitizeXML(xmlData []byte) (string, error) {
        utf8Data, err := convertToUTF8(xmlData)
        if err != nil {
            utf8Data = string(xmlData) // Fallback to original data if conversion fails
        }
        var buffer bytes.Buffer
        for _, r := range utf8Data {
            if isLegalXMLChar(r) {
                buffer.WriteRune(r)
            } else {
                buffer.WriteString(fmt.Sprintf("&#x%X;", r))
            }
        }
        return buffer.String(), nil
    }

I could call this at the beginning of the sanitize function, but I'm not sure what I'd do if |
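The snippet above calls isLegalXMLChar without showing it. A minimal sketch of what it might look like, assuming it is meant to follow the Char production of XML 1.0, is below. The function body and the stripIllegalXMLChars helper are my assumptions, not code from the project. One design note: as far as I read the XML 1.0 spec, a character reference such as `&#x0;` must itself refer to a legal character, so escaping an illegal character would probably still leave the document not well-formed. This sketch drops such characters instead.

```go
package main

import (
	"fmt"
	"strings"
)

// isLegalXMLChar reports whether r matches the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// (Assumed definition; the thread does not show the original.)
func isLegalXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// stripIllegalXMLChars is a hypothetical helper that removes every
// character that is not legal in XML 1.0 instead of escaping it.
func stripIllegalXMLChars(s string) string {
	return strings.Map(func(r rune) rune {
		if isLegalXMLChar(r) {
			return r
		}
		return -1 // a negative rune tells strings.Map to drop the character
	}, s)
}

func main() {
	// NUL and U+001F are not legal XML 1.0 characters and are dropped.
	fmt.Printf("%q\n", stripIllegalXMLChars("<title>a\x00b\x1fc</title>"))
	// prints "<title>abc</title>"
}
```

Whether dropping or replacing with a placeholder is the right behavior for feeds is a judgment call; the sketch only shows the range check and one way to apply it.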
This was referenced Mar 25, 2023
Following issues #180, #25, and several others, I'd like to make character sanitization more robust.
I've previously tried to have the code do something like the following:
However, an old issue, #21, showed that sanitizing these characters broke parsing of non-UTF-8 feeds.
If anyone has any suggestions for how to accommodate both requirements:
It would be much appreciated!
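For anyone trying to reproduce the non-UTF-8 problem, here is one plausible mechanism, sketched under the assumption that the sanitizer walks the raw bytes as runes before any charset conversion. I have not confirmed this is what happened in #21. In Go, ranging over a string that is not valid UTF-8 yields U+FFFD for each bad byte, so a Latin-1 byte such as 0xE9 is silently rewritten. The rewriteRunes name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// rewriteRunes is a hypothetical stand-in for a rune-by-rune sanitizer:
// it copies every rune it sees and changes nothing on purpose.
func rewriteRunes(data []byte) string {
	var b strings.Builder
	for _, r := range string(data) {
		b.WriteRune(r) // invalid bytes arrive here as U+FFFD
	}
	return b.String()
}

func main() {
	latin1 := []byte("caf\xe9") // "café" encoded as ISO-8859-1

	fmt.Println(utf8.Valid(latin1)) // prints false

	// The single byte 0xE9 comes back as the three bytes of U+FFFD.
	fmt.Printf("% x\n", rewriteRunes(latin1)) // prints 63 61 66 ef bf bd
}
```

If this is the cause, a utf8.Valid check (or a charset conversion) before the rune loop would let the sanitizer leave non-UTF-8 input alone until it has been decoded.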