Make illegal character sanitization more robust #206

mmcdole · 2023-03-25T16:42:54Z

Following issues #180 and #25, among others, I'd like to make character sanitization more robust.

I've previously tried to have the code do something like the following:

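import (
	"bytes"
	"fmt"
)

// sanitizeXML escapes every rune that XML 1.0 does not allow.
// Caveat: ranging over a string decodes it as UTF-8, so each byte that is
// not valid UTF-8 comes out as U+FFFD.
func sanitizeXML(xmlData string) string {
	var buffer bytes.Buffer

	for _, r := range xmlData {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			// Replace the illegal character with its XML character reference.
			// To drop it instead, remove the next line. Note that XML 1.0 also
			// requires character references to name a legal character, so a
			// strict parser may still reject the reference.
			fmt.Fprintf(&buffer, "&#x%X;", r)
		}
	}

	return buffer.String()
}

// isLegalXMLChar reports whether r matches the XML 1.0 Char production.
func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}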

However, an old issue (#21) showed that sanitizing these characters broke the parsing of non-UTF-8 feeds.
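One plausible explanation, offered here as an assumption rather than something confirmed in #21: a range loop over a Go string decodes it as UTF-8, and every byte that is not valid UTF-8 is yielded as U+FFFD. Rebuilding the string rune by rune therefore rewrites the non-UTF-8 bytes before the XML decoder ever sees them. The helper name rangeRoundTrip below is hypothetical and only illustrates the effect.

```go
package main

import "strings"

// rangeRoundTrip rebuilds s one rune at a time, the same way a
// range-based sanitizer does, without filtering anything.
func rangeRoundTrip(s string) string {
	var b strings.Builder
	for _, r := range s {
		// For a byte that is not valid UTF-8, r is U+FFFD here,
		// so the original byte is lost.
		b.WriteRune(r)
	}
	return b.String()
}
```

With the ISO-8859-1 bytes for "café" ("caf\xe9"), the result is "caf" followed by U+FFFD: the 0xE9 byte is gone, so a later charset conversion has nothing left to decode.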

If anyone has suggestions for how to accommodate both requirements:

- Strip illegal characters from feeds so the XML parser does not throw an error
- Allow parsing of non-UTF-8 feeds

It would be much appreciated!
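One possible way to meet both requirements, sketched under assumptions: work on bytes instead of runes, drop only sequences that decode cleanly to an illegal XML character, and copy bytes that are not valid UTF-8 through untouched so a later charset conversion still sees them. The helper name stripIllegalXMLBytes is hypothetical. The sketch assumes an ASCII-compatible encoding (UTF-8, ISO-8859-x, windows-125x and similar); it would damage UTF-16 input or ISO-2022-JP, which relies on the ESC byte.

```go
package main

import "unicode/utf8"

// isLegalXMLChar reports whether r matches the XML 1.0 Char production.
func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// stripIllegalXMLBytes (hypothetical helper) removes characters that are
// illegal in XML 1.0 while leaving non-UTF-8 bytes exactly as they were.
func stripIllegalXMLBytes(data []byte) []byte {
	out := make([]byte, 0, len(data))
	for len(data) > 0 {
		r, size := utf8.DecodeRune(data)
		switch {
		case r == utf8.RuneError && size == 1:
			// Not valid UTF-8 at this position: most likely a byte from
			// another encoding, so keep it unchanged.
			out = append(out, data[0])
		case isLegalXMLChar(r):
			out = append(out, data[:size]...)
		}
		// Anything else decoded to an illegal character and is dropped.
		data = data[size:]
	}
	return out
}
```

Dropping is used here instead of emitting a character reference because XML 1.0 requires character references to name a legal character as well. A rare false positive remains: bytes from another encoding that happen to form a valid UTF-8 sequence for U+FFFE or U+FFFF would be removed.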

mmcdole · 2023-03-25T16:52:31Z

I'm guessing I need to handle this in two steps:

1. Convert non-UTF-8 feeds to UTF-8
2. Sanitize the feed afterwards

I could do something like:

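import (
	"bytes"
	"fmt"
	"io"

	// Assumed import path; charset.NewReader(r, contentType) matches this package.
	"golang.org/x/net/html/charset"
)

// convertToUTF8 decodes data to UTF-8, letting charset.NewReader guess the
// source encoding (the empty string means no Content-Type hint is given).
func convertToUTF8(data []byte) (string, error) {
	reader, err := charset.NewReader(bytes.NewReader(data), "")
	if err != nil {
		return "", err
	}
	// io.ReadAll supersedes the deprecated ioutil.ReadAll.
	utf8Data, err := io.ReadAll(reader)
	if err != nil {
		return "", err
	}
	return string(utf8Data), nil
}

// sanitizeXML converts xmlData to UTF-8, then escapes illegal XML characters.
// The returned error is currently always nil.
func sanitizeXML(xmlData []byte) (string, error) {
	utf8Data, err := convertToUTF8(xmlData)
	if err != nil {
		utf8Data = string(xmlData) // Fall back to the original data if conversion fails
	}

	var buffer bytes.Buffer

	for _, r := range utf8Data {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			fmt.Fprintf(&buffer, "&#x%X;", r)
		}
	}

	return buffer.String(), nil
}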

I could call convertToUTF8 at the beginning of the sanitize function, but I'm not sure what to do if charset.NewReader fails to detect the encoding.
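One option for the detection-failure case, offered as a sketch and not as what the library does: read the encoding label from the XML declaration at the start of the feed and use it as a hint for the conversion step. The names xmlDeclEncoding and declaredEncoding are hypothetical. The sketch only handles ASCII-compatible input, so a UTF-16 feed would not match.

```go
package main

import (
	"bytes"
	"regexp"
)

// xmlDeclEncoding matches the encoding pseudo-attribute of an XML
// declaration located at the very start of the document.
var xmlDeclEncoding = regexp.MustCompile(
	`^<\?xml\s[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']`)

// declaredEncoding (hypothetical helper) returns the encoding label from the
// XML declaration, or "" when the document does not declare one.
func declaredEncoding(data []byte) string {
	// Skip a UTF-8 byte order mark if one precedes the declaration.
	data = bytes.TrimPrefix(data, []byte("\xef\xbb\xbf"))
	m := xmlDeclEncoding.FindSubmatch(data)
	if m == nil {
		return ""
	}
	return string(m[1])
}
```

When the label comes back empty, a reasonable fallback might be to treat the data as UTF-8 if utf8.Valid reports true, and otherwise leave the bytes alone and let the XML decoder report the error.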

mmcdole added the enhancement label Mar 25, 2023

This was referenced Mar 25, 2023

corrupted / mangled nested custom XML #203

Open

XML syntax error on line 34: illegal character code U+0008 #180

Open
