Make illegal character sanitization more robust #206

Open
mmcdole opened this issue Mar 25, 2023 · 1 comment

mmcdole (Owner) commented Mar 25, 2023

Following issues #180, #25, and a few others, I'd like to make illegal character sanitization more robust.

I previously tried having the code do something like the following:

func sanitizeXML(xmlData string) string {
	var buffer bytes.Buffer

	for _, r := range xmlData {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			// Replace illegal characters with their XML character reference.
			// To drop them instead, comment out the next line. Note that XML 1.0
			// also rejects character references to illegal characters (e.g. &#x0;).
			buffer.WriteString(fmt.Sprintf("&#x%X;", r))
		}
	}

	return buffer.String()
}

func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

However, an old issue, #21, reported that sanitizing these characters broke the parsing of non-UTF-8 feeds.

Suggestions for how to accommodate both of the following requirements would be much appreciated:

  • Stripping illegal characters from feeds to prevent the XML parser from throwing an error
  • Allowing the parsing of non-UTF-8 feeds
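One way to satisfy both requirements without decoding at all might be to filter at the byte level. The sketch below drops only the C0 control bytes that XML 1.0 forbids (everything below 0x20 except tab, LF, and CR) and leaves every other byte alone, so ASCII-compatible encodings such as UTF-8, ISO-8859-x, and windows-125x pass through unchanged. The function name `stripIllegalControlBytes` is hypothetical. This is not safe for UTF-16/UTF-32 input or for escape-based encodings such as ISO-2022-JP, and it does not catch U+FFFE, U+FFFF, or surrogates, which would still need a rune-level pass after conversion to UTF-8.

```go
package main

import "fmt"

// stripIllegalControlBytes returns a copy of data without the control
// bytes that XML 1.0 forbids: 0x00-0x1F except tab (0x09), LF (0x0A)
// and CR (0x0D). Bytes >= 0x20 are copied untouched, so the original
// encoding of non-ASCII text is preserved.
// Assumption: the input uses an ASCII-compatible encoding.
func stripIllegalControlBytes(data []byte) []byte {
	out := make([]byte, 0, len(data))
	for _, b := range data {
		if b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D {
			continue // illegal in XML 1.0, skip it
		}
		out = append(out, b)
	}
	return out
}

func main() {
	// ISO-8859-1 bytes with a stray backspace (0x08); 0xE9 is é.
	in := []byte{'c', 'a', 0x08, 'f', 0xE9}
	fmt.Printf("% X\n", stripIllegalControlBytes(in)) // 63 61 66 E9
}
```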

mmcdole (Owner, Author) commented Mar 25, 2023

I'm guessing I need to handle this in two steps:

  1. Convert non-UTF-8 feeds to UTF-8
  2. Sanitize the feed afterwards

I could do something like:

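import (
	"bytes"
	"fmt"
	"io"

	// Assumed import path for the charset package used below.
	"golang.org/x/net/html/charset"
)

// convertToUTF8 sniffs the encoding of data and transcodes it to UTF-8.
func convertToUTF8(data []byte) (string, error) {
	reader, err := charset.NewReader(bytes.NewReader(data), "")
	if err != nil {
		return "", err
	}
	// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
	utf8Data, err := io.ReadAll(reader)
	if err != nil {
		return "", err
	}
	return string(utf8Data), nil
}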

func sanitizeXML(xmlData []byte) (string, error) {
	utf8Data, err := convertToUTF8(xmlData)
	if err != nil {
		utf8Data = string(xmlData) // Fallback to original data if conversion fails
	}

	var buffer bytes.Buffer

	for _, r := range utf8Data {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			buffer.WriteString(fmt.Sprintf("&#x%X;", r))
		}
	}

	return buffer.String(), nil
}

I could call this at the beginning of the sanitize function, but I'm not sure what I'd do if charset.NewReader failed to detect the encoding.
