Byte order marks #1

Open
jennybc opened this issue Jun 23, 2021 · 3 comments

@jennybc
Contributor

jennybc commented Jun 23, 2021

I recently got to spend some quality time with my best friend charToRaw(), courtesy of a byte order mark 😬

I was doing a round trip like so:

local plain text file --> upload to Google Drive & convert to a Google Doc --> export from Google Drive as text/plain --> read into memory in R --> parse back to character vector

While developing a test I see:

>   expect_setequal(
+     chicken_poem,
+     readLines(drive_example("chicken.txt"))
+   )
Error: `chicken_poem`[1] absent from readLines(drive_example("chicken.txt"))
> chicken_poem[[1]]
[1] "A chicken whose name was Chantecler"
> readLines(drive_example("chicken.txt"))[[1]]
[1] "A chicken whose name was Chantecler"
> chicken_poem[[1]] == readLines(drive_example("chicken.txt"))[[1]]
[1] FALSE
> Encoding(chicken_poem[[1]])
[1] "UTF-8"
> Encoding(readLines(drive_example("chicken.txt"))[[1]])
[1] "unknown"
> charToRaw(chicken_poem[[1]])
 [1] ef bb bf 41 20 63 68 69 63 6b 65 6e 20 77 68 6f 73 65 20 6e 61 6d 65 20 77 61
[27] 73 20 43 68 61 6e 74 65 63 6c 65 72
> charToRaw(readLines(drive_example("chicken.txt"))[[1]])
 [1] 41 20 63 68 69 63 6b 65 6e 20 77 68 6f 73 65 20 6e 61 6d 65 20 77 61 73 20 43
[27] 68 61 6e 74 65 63 6c 65 72

And thus I found the BOM on the text returning from the round trip.

Do you have anything to say about ... when you're likely to encounter BOMs? Should you get rid of them? If so, how? Or can you compare two strings in a way that ignores them?

@gaborcsardi
Owner

Yeah, that is tricky. UTF-8 of course does not need a BOM because it is byte order independent.

Some tools, however, use \xef\xbb\xbf to mark a plain text file as UTF-8. E.g. Microsoft tools like to do that. Some of them also require it at the beginning of a text file.
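
For reference, those three bytes are simply the UTF-8 encoding of the code point U+FEFF. A minimal sketch to confirm this in an R session (the variable name `bom` is only illustrative):

```r
# U+FEFF written as a \u escape; R stores it as a UTF-8 string.
bom <- "\ufeff"

# The raw bytes behind that single code point.
charToRaw(bom)

# Three bytes, but one character as far as R is concerned.
nchar(bom, type = "bytes")
nchar(bom, type = "chars")
```

This is why a BOM is invisible when printed yet still counts as one character at the start of the string.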

It is a really tough question what to do with it in R, because R does not need it; in fact, it trips up common R string functions:

❯ x <- paste0("\xef\xbb\xbfword ", "\u30de")
❯ Encoding(x)
[1] "UTF-8"

❯ x
[1] "word マ"

❯ nchar(x)
[1] 7

❯ substr(x, 1, 4)
[1] "wor"

❯ grepl("^word", x)
[1] FALSE

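I am not sure why pasting strings with "unknown" and "UTF-8" encodings marks the result as "UTF-8". But the really weird part is that nchar(), substr() and grepl() all give surprising results, because they treat the BOM as part of the string: nchar() counts 7 characters instead of 6, substr() returns the invisible BOM plus "wor", and "^word" does not match.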

So yes, ideally you would remove the BOM when manipulating the strings in R.
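
One way to do that, as a rough sketch rather than an established recipe: compare the first three bytes against ef bb bf and drop them if they match. `strip_bom()` is a hypothetical helper written for this example, not a function from base R or any package.

```r
# Hypothetical helper: remove a leading UTF-8 BOM from each element of `x`.
strip_bom <- function(x) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  vapply(x, function(s) {
    bytes <- charToRaw(s)
    if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
      # Drop the three BOM bytes and mark what is left as UTF-8.
      s <- rawToChar(bytes[-(1:3)])
      Encoding(s) <- "UTF-8"
    }
    s
  }, character(1), USE.NAMES = FALSE)
}

# Build a string that starts with a BOM, byte by byte.
x <- rawToChar(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("word")))
y <- strip_bom(x)

charToRaw(y)
grepl("^word", y)
```

Running both sides of a comparison through the same helper, for example `strip_bom(a) == strip_bom(b)`, is one way to compare two strings while ignoring a BOM. Working on raw bytes avoids depending on the session's locale.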

OTOH, if you are downloading a file from Google Drive that you would use in some (MS) tool later, then you'd want to keep it, otherwise that tool might not be able to read in the file.

I am not sure what the right solution is here. I am afraid that if you want to handle all use cases, then you'd need to make BOM handling explicit when downloading text files from Google Drive. E.g. have an option and/or function argument for it. Maybe the default of the option could be to remove it, and mark the string as UTF-8.
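
To make the idea concrete, here is a minimal sketch of what an explicit switch could look like. Everything in it is an assumption made for illustration: the function `read_utf8_text()`, its `keep_bom` argument, and the option name `example.keep_bom` do not exist in googledrive or any other package.

```r
# Hypothetical reader with explicit BOM handling.
# The default comes from a made-up option and falls back to removing the BOM.
read_utf8_text <- function(path,
                           keep_bom = getOption("example.keep_bom", FALSE)) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  if (!keep_bom && length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  text <- rawToChar(bytes)
  Encoding(text) <- "UTF-8"     # mark the content as UTF-8
  strsplit(text, "\r?\n")[[1]]  # split into lines, accepting CRLF or LF
}

# Write a small file that starts with a BOM, then read it both ways.
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("A chicken\nline two\n")),
  tmp
)

stripped <- read_utf8_text(tmp)
kept <- read_utf8_text(tmp, keep_bom = TRUE)
```

With the default, the first line comes back without the mark; with `keep_bom = TRUE` the bytes are left untouched for a downstream tool that expects them.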

@jennybc
Contributor Author

jennybc commented Jul 2, 2021

I think the suggestion for the R Encoding FAQ, then, is just to create awareness of the potential for these marks to exist.

When two strings look the same but clearly are not the same, charToRaw() is, as usual, your friend, and a BOM is one of the specific things to look for.
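
A small sketch of that check, assuming a hypothetical helper named `has_utf8_bom()` (it is not part of base R):

```r
# Hypothetical helper: TRUE for each element of `x` that starts with ef bb bf.
has_utf8_bom <- function(x) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  vapply(x, function(s) {
    bytes <- charToRaw(s)
    length(bytes) >= 3 && identical(bytes[1:3], bom)
  }, logical(1), USE.NAMES = FALSE)
}

# Two strings that print the same but differ in their leading bytes.
with_bom <- rawToChar(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("A chicken")))
without_bom <- "A chicken"

has_utf8_bom(c(with_bom, without_bom))
```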

@jimhester

FWIW readr / vroom have code to skip byte order marks at https://github.com/r-lib/vroom/blob/b3ba15212978253174c9f99f1098799cca9a6f74/src/utils.h#L215-L266, since they are pretty common in CSVs created using Microsoft programs.
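
Base R offers something similar at the connection level. This is a different mechanism from the vroom code linked above, shown only as a sketch: `file()` accepts `encoding = "UTF-8-BOM"` for reading, which drops a leading BOM if one is present. The temporary file here is made up for the example.

```r
# Create a small file that starts with a UTF-8 BOM.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("A chicken\n")), path)

# Read it through a connection that removes the BOM while reading.
con <- file(path, encoding = "UTF-8-BOM")
lines <- readLines(con)
close(con)

# Inspect the bytes of the first line.
charToRaw(lines[[1]])
```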
