I try to read a rds file, but get the following error: #49

Open
Erikvvats opened this issue Sep 23, 2020 · 13 comments
Labels
bug Something isn't working

Comments

@Erikvvats

This is my code:

import pyreadr
result = pyreadr.read_r('data/injuryTimeDataset.rds')

This is the error:
parser.parse(path)
File "pyreadr\librdata.pyx", line 117, in pyreadr.librdata.Parser.parse
File "pyreadr\librdata.pyx", line 139, in pyreadr.librdata.Parser.parse
File "pyreadr\librdata.pyx", line 102, in pyreadr.librdata._handle_value_label
File "pyreadr\librdata.pyx", line 197, in pyreadr.librdata.Parser.__handle_value_label
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte

What should I do? I have not looked in the rds file, but it is supposed to be a mixture of strings, ints and floats. Lastly, this works:
pyreadr.object_list
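
For readers hitting the same traceback, the message itself can be reproduced in plain Python. This is a minimal sketch, assuming (hypothetically) that a value label in the file was stored in a single-byte encoding such as Latin-1, where 0xfc is "ü". The sample value "Müller" is made up and not taken from the dataset:

```python
# Hypothetical sample: "Müller" encoded as Latin-1, so "ü" is the single byte 0xfc.
raw = "Müller".encode("latin-1")  # b'M\xfcller'

try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    # 0xfc can never start a UTF-8 sequence, hence "invalid start byte".
    message = str(err)

print(message)

# The same bytes decode cleanly once the single-byte encoding is named.
recovered = raw.decode("latin-1")
print(recovered)
```

If this assumption holds, the file itself is probably intact and the failure lies in how the bytes of a label are interpreted, not in the numeric columns.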

@ofajardo
Owner

Make sure you are using the latest version of pyreadr. If the problem persists send a file to reproduce the issue. If I cannot reproduce it, I cannot fix it.

@Erikvvats
Author

I have the latest pyreadr. However, I am not allowed to share the dataset.

@ofajardo
Owner

That's unfortunate, because if I can't reproduce it there is nothing I can do for now.
If you can prepare a minimal synthetic dataset that reproduces the error, that would be ideal.
Otherwise, we will have to wait until somebody else runs into the same issue and provides a file to reproduce it.

@FiniDG

FiniDG commented Dec 14, 2021

Hello,

I have the same issue and have made a reproducible file for you to check out (however, I cannot find how to upload it here). I think my file is not a "good" .RData file and I tried to find the reason why, so far without success. Could you have a look? This is what I have tried, along with probably more things during my long internet search:

  1. Loaded it into RStudio and tried saving it in different ways: saveRDS() with different parameters (compress, version, ascii).
  2. Tried to change the encoding with the R function below and then saved the result as an .RData file:
fix.encoding <- function(df, originalEncoding = "UTF-8") {
  df <- data.frame(df)
  for (col in seq_len(ncol(df))) {
    if (is.character(df[, col])) {
      # Encoding<- only changes the declared encoding mark; it does not convert the bytes
      Encoding(df[, col]) <- originalEncoding
    } else if (is.factor(df[, col])) {
      # for factors, the strings live in the levels
      Encoding(levels(df[, col])) <- originalEncoding
    }
    # other column types (numeric, logical, ...) carry no encoding and are left untouched
  }
  return(tibble::as_tibble(df))
}
  3. Tried to open it with this Python code, which kind of works, but not really:
# file: path to the .RData file
with open(file, 'rb') as f:
    text = f.read()
# .RData files are binary and compressed by default, so the raw bytes are not plain UTF-8 text
text = text.decode("utf-8")
  4. Also tried to remove the rownames, or to change all my factors to characters, but still got an error:
df <- tibble::rownames_to_column(df, "VALUE")
and
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)

@ofajardo
Owner

Thanks, but I need the file to take a look. Zip it and upload it here by dragging and dropping it into this text box. If the file is too big, put it on Dropbox, Google Drive or similar, share it publicly and paste the link here. You can also look for other services that let you upload a file without having an account.

Without the file it is impossible for me to take a look.

@FiniDG

FiniDG commented Dec 14, 2021

Sorry, I uploaded some corrupt files earlier. This one should work
test8.RData.zip

@FiniDG

FiniDG commented Dec 14, 2021

I finally found a solution! However, I would prefer not to have to load the file into R and re-save it before using it in my code. I would rather just use the original RData files. But I was trying all kinds of things as a proof of concept.

I load this file into R and run the following to convert the factors to characters (rlvnc2 is the name of the dataframe, change accordingly):

i <- sapply(rlvnc2, is.factor)
rlvnc2[i] <- lapply(rlvnc2[i], as.character)

And then save it with R's standard save() function:
save(rlvnc2, file = "/file/path/test9.RData")

Then it works fine with pyreadr. But if I save it with saveRDS(), it does not work anymore. The original file (with factors instead of characters) does not work either.

@ofajardo ofajardo reopened this Dec 14, 2021
@ofajardo
Owner

OK, thanks, I can reproduce it. The issue is coming from the C library, so I have submitted a new issue about this there.

I see that every factor in the file has a lot of levels, and I wonder if there is a non-UTF-8 character hidden somewhere among them. On the other hand, it seems that you already tried to change the encoding of all factors and that didn't work.
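
One way to look for such a hidden non-UTF-8 string without R is to scan the decompressed file for text-like byte runs that fail to decode. This is a rough sketch, assuming the file is either gzip-compressed or uncompressed (R can also write bzip2 or xz, which this does not handle). `load_payload` and `suspicious_strings` are hypothetical helper names, and numeric data in the file may produce false positives:

```python
import gzip
import re


def load_payload(path):
    """Return the file's bytes, gunzipping when the gzip magic number is present."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(raw)
    return raw


# Runs of at least 4 bytes that are printable ASCII or have the high bit set.
_TEXT_RUN = re.compile(rb"[\x20-\x7e\x80-\xff]{4,}")


def suspicious_strings(payload):
    """Return (offset, bytes) pairs for text-like runs that are not valid UTF-8."""
    hits = []
    for match in _TEXT_RUN.finditer(payload):
        chunk = match.group()
        if all(byte < 0x80 for byte in chunk):
            continue  # pure ASCII is always valid UTF-8
        try:
            chunk.decode("utf-8")
        except UnicodeDecodeError:
            hits.append((match.start(), chunk))
    return hits
```

Running `suspicious_strings(load_payload("test8.RData"))` on the attached file would list candidate offsets to inspect. Whether any of them is a real factor level still has to be checked by hand.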

@FiniDG

FiniDG commented Dec 14, 2021

To be sure, I tried changing the encoding again and saving with save() instead of saveRDS(), and I still get the error.

Good luck finding the exact problem. If you need any help with trial and error, let me know.

@ofajardo
Owner

Interesting: when I save the file myself, it looks completely different in a hex editor. What version of R are you using, and on which platform (Windows, Mac, Linux, ...)?

@FiniDG

FiniDG commented Dec 14, 2021

I think that the original file (the one that isn't working) was made on a Linux-based computer with an old version of R, or on a Windows computer with an old version of R. I do not know the exact origin, because I only work with this file and it was created before I was involved.

The new file (after changing the factors to characters) was made with R version 4.0.2 and RStudio 2021.09.0 Build 351 "Ghost Orchid" Release (077589bc, 2021-09-20) for macOS.
test9.RData.zip

EDIT:
Now that I think about it... both files were made with R version 4.0.2 on macOS. I made a reproducible example using my own computer. The original-original file is much bigger, but it also contains some information that I am unable to share. The example is just the first 4 lines of the original file, saved with R version 4.0.2 on macOS (test8).
After changing the factors to characters (as explained earlier), the same dataframe works again (test9).

@ofajardo
Owner

OK, anyway, saving the file again with 4.0.2 gives exactly the same error. I think the C library is somehow not reading one of the fields in the binary file from the correct byte.

@FiniDG

FiniDG commented Dec 14, 2021

I saved a working version for you in a previous post. Might be a good way to compare the two.
Screen Shot 2021-12-14 at 15 54 01
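
The comparison between the failing and the working file can also be done programmatically instead of by eye in a hex editor. This is a small sketch that reports the first offset at which two byte strings differ and renders a hex dump around it. The helper names are hypothetical, and the inputs are assumed to be the already decompressed contents of the two files (for gzip-compressed files, `gzip.decompress(open(path, "rb").read())`):

```python
def first_difference(a, b):
    """Return the first offset where a and b differ, or None if they are identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) == len(b):
        return None
    # One input is a prefix of the other: they differ where the shorter one ends.
    return min(len(a), len(b))


def hexdump(data, start=0, length=32):
    """Render `length` bytes of `data` from `start` as offset, hex and ASCII columns."""
    lines = []
    for off in range(start, min(start + length, len(data)), 16):
        row = data[off:off + 16]
        hex_part = " ".join(f"{byte:02x}" for byte in row)
        text = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in row)
        lines.append(f"{off:08x}  {hex_part:<47}  {text}")
    return "\n".join(lines)
```

Since test8 and test9 hold different column types (factors versus characters), they are expected to diverge early. The dump is mainly useful for seeing what the bytes around a factor's levels look like in each file.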

@ofajardo ofajardo added the bug Something isn't working label Dec 16, 2021