UTF-8 error when using ccread #1425

Open
cote3804 opened this issue May 10, 2024 · 6 comments

cote3804 commented May 10, 2024

I'm getting a UTF-8 coding error when trying to parse an ORCA file using ccread:

from cclib.io.ccio import ccread

with open("out.txt", "r") as handle:
    parse_obj = ccread(handle)

parsed_dict = parse_obj.getattributes()
print(parsed_dict)

Which returns this error:

Traceback (most recent call last):
  File "/gpfs/alpine1/scratch/cote3804/orca/ca/4c/test/./parse_test.py", line 6, in <module>
    parse_obj = ccread(handle)
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/site-packages/cclib/io/ccio.py", line 185, in ccread
    return log.parse()
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/site-packages/cclib/parser/logfileparser.py", line 157, in parse
    for line in self.inputfile:
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/site-packages/cclib/parser/logfilewrapper.py", line 240, in __next__
    return self.next()
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/site-packages/cclib/parser/logfilewrapper.py", line 219, in next
    line = next(self.files[self.file_pointer])
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 803: invalid start byte

However, if I use ccopen within a script or ccget on the command line, the file is parsed without issue. I need to use a file handle as an input to cclib (not a filename), so I either need to fix this issue with ccread or pass the file handle to the API in some other way that works. I've attached the output file I'm trying to parse:
out.txt

I'm using Python 3.9.13 and cclib 1.8.1

Thanks!

@oliver-s-lee oliver-s-lee self-assigned this May 13, 2024
@oliver-s-lee
Contributor

Hi @cote3804, thanks for the bug report!

This isn't actually an issue with cclib at all, but rather with the default behaviour of Python's open() function. You'll get the same error if you do this, for example:

with open("out.txt", "r") as handle:
    handle.read()

The problem is that open() uses strict error handling by default, so any byte that can't be decoded raises an exception. To replace the undecodable bytes with a placeholder instead (when reading, this is the Unicode replacement character U+FFFD; '?' is what you get when writing), set errors="replace":

with open("out.txt", "r", errors="replace") as handle:
    parse_obj = ccread(handle)

When you pass a filename to cclib.io.ccread, cclib sets the error handler automatically (which is why you don't see the error on the command line).

Of course, if you know that the file is not actually in UTF-8 (the default encoding on your system, judging by the traceback), set encoding to the correct value instead. In this case, it's most probably Latin-1 (ISO-8859-1):

with open("out.txt", "r", encoding="ISO-8859-1") as handle:
    parse_obj = ccread(handle)
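
As a rough, self-contained illustration of the three behaviours described above (strict decoding, errors="replace", and an explicit Latin-1 encoding), here is a minimal sketch that does not involve cclib at all. The file name `demo_out.txt`, its contents, and the helper `read_three_ways` are stand-ins invented for this example, not the attached out.txt.

```python
import os
import tempfile

# Stand-in for the ORCA output: ASCII text preceded by four 0xff bytes.
# The contents and file name are illustrative assumptions, not the real file.
payload = b"\xff\xff\xff\xffThe total number of vibrations considered is 0\n"


def read_three_ways(path):
    """Return (strict_error, replaced_text, latin1_text) for the file at path."""
    strict_error = None
    try:
        # Strict UTF-8 decoding: expected to fail on the 0xff bytes.
        with open(path, "r", encoding="utf-8") as handle:
            handle.read()
    except UnicodeDecodeError as exc:
        strict_error = exc
    # Same encoding, but undecodable bytes become U+FFFD.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        replaced_text = handle.read()
    # Latin-1 maps every byte to a character, so it never fails.
    with open(path, "r", encoding="ISO-8859-1") as handle:
        latin1_text = handle.read()
    return strict_error, replaced_text, latin1_text


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo_out.txt")
    with open(path, "wb") as handle:
        handle.write(payload)
    strict_error, replaced_text, latin1_text = read_three_ways(path)

print(type(strict_error).__name__)           # the exception raised by strict decoding
print(replaced_text.count("\ufffd"))         # number of replacement characters
print([hex(ord(c)) for c in latin1_text[:4]])  # code points of the first four characters
```

Note that Latin-1 never raises because every byte value is a valid character in it, so a successful read does not by itself prove that Latin-1 was the intended encoding.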

For more information on these options to open(), have a look at the Python documentation.

@berquist We should probably add this to the documentation somewhere, as it's likely to crop up again...

@berquist berquist added the docs label May 13, 2024
@berquist berquist added this to the v2.0 milestone May 13, 2024
@cote3804
Author

Thanks for the great response @oliver-s-lee. You were indeed correct that changing the encoding to latin-1 fixed the issue. What's puzzling to me is that all of my ORCA files are encoded in UTF-8 except this one log file. Definitely not a cclib issue!

@oliver-s-lee
Contributor

@cote3804 no problem at all.

Yeah, that is weird. 99% of the time these encoding errors are caused by umlauts, accents, or other such characters in scientists' names getting printed by the program, but I don't think that's the problem here.
Line 1125 of your output has this:

ÿÿÿÿThe total number of vibrations considered is 0

Which I've not seen before, it's almost like the output file got slightly mangled at some point...
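
For anyone wanting to track down where such bytes sit in their own file, one possible approach is to scan the raw bytes line by line. This is only a sketch: `find_non_ascii_lines` is a hypothetical helper written for this example, and the sample data is made up to mimic the line quoted above.

```python
def find_non_ascii_lines(data: bytes):
    """Return (line_number, offending_bytes) for each line containing bytes >= 0x80."""
    hits = []
    for line_number, line in enumerate(data.splitlines(), start=1):
        # Iterating over a bytes object yields integers in the range 0-255.
        bad = bytes(b for b in line if b >= 0x80)
        if bad:
            hits.append((line_number, bad))
    return hits


# Made-up sample that mimics the mangled line; not the real ORCA output.
sample = (
    b"ORCA-like header\n"
    b"\xff\xff\xff\xffThe total number of vibrations considered is 0\n"
    b"end\n"
)
hits = find_non_ascii_lines(sample)
print(hits)
```

For a real file, the bytes would come from opening it in binary mode, e.g. `open("out.txt", "rb").read()`, so that no decoding happens before the scan.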

@berquist berquist reopened this May 13, 2024
@berquist
Member

Reopening so we can use this to track the docs update.

@cote3804
Author

In case anyone else encounters this issue, the ORCA developers acknowledged that it is a bug and said it will be fixed in the next version.

@oliver-s-lee
Contributor

Just repeating the developer's response here, as the ORCA forums are annoyingly locked behind a sign-up:

Re: Different encoding when calculating vibrations of a single atom
Post by Axel_K. » Wed May 15, 2024 6:24 pm

Hello,

Actually, the ORCA output should never contain any characters other than 7-bit ASCII, but this is probably not entirely fulfilled. The strange characters you observe are clearly not wanted, and I can reproduce them with the official ORCA 5.0.4 binaries. The error will be fixed in the next release.

Best regards,
Axel
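
Given the statement that ORCA output is meant to be pure 7-bit ASCII, another option for affected files might be to drop the stray bytes before handing a text handle to whatever consumes it. The sketch below is an assumption-laden illustration: `ascii_only_handle` is a hypothetical helper, the sample bytes are made up, and whether cclib accepts an `io.StringIO` object in place of a real file handle has not been verified here.

```python
import io


def ascii_only_handle(raw: bytes) -> io.StringIO:
    """Drop every byte outside 7-bit ASCII and return the rest as a text handle."""
    text = raw.decode("ascii", errors="ignore")
    return io.StringIO(text)


# Made-up sample mimicking the mangled line; not the real ORCA output.
raw = b"\xff\xff\xff\xffThe total number of vibrations considered is 0\n"

print(raw.isascii())  # False for this sample, since it contains 0xff bytes
handle = ascii_only_handle(raw)
first_line = handle.readline()
print(first_line, end="")
```

Unlike errors="replace", this silently discards the offending bytes, so it is only reasonable when those bytes are known to be junk, as in this case.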
