UTF-8 error when using ccread #1425

cote3804 · 2024-05-10T19:49:52Z

I'm getting a UTF-8 coding error when trying to parse an ORCA file using ccread:

from cclib.io.ccio import ccread

with open("out.txt", "r") as handle:
    parse_obj = ccread(handle)

parsed_dict = parse_obj.getattributes()
print(parsed_dict)

Which returns this error:

Traceback (most recent call last):
  File "/gpfs/alpine1/scratch/cote3804/orca/ca/4c/test/./parse_test.py", line 6, in <module>
    parse_obj = ccread(handle)
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/site-packages/cclib/io/ccio.py", line 185, in ccread
    return log.parse()
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/site-packages/cclib/parser/logfileparser.py", line 157, in parse
    for line in self.inputfile:
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/site-packages/cclib/parser/logfilewrapper.py", line 240, in __next__
    return self.next()
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/site-packages/cclib/parser/logfilewrapper.py", line 219, in next
    line = next(self.files[self.file_pointer])
  File "/projects/cote3804/software/anaconda/envs/jdft/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 803: invalid start byte

However, if I use ccopen within a script or ccget on the command line, the file is parsed without issue. I need to use a file handle as an input to cclib (not a filename), so I either need to fix this issue with ccread or pass the file handle to the API in some other way that works. I've attached the output file I'm trying to parse:
out.txt

I'm using Python 3.9.13 and cclib 1.8.1

Thanks!


oliver-s-lee · 2024-05-13T07:37:26Z

Hi @cote3804, thanks for the bug report!

This isn't actually an issue with cclib at all, but rather with the default behaviour of Python's open() function. You'll get the same error if you do this, for example:

with open("out.txt", "r") as handle:
    handle.read()

The problem is that open() uses strict error handling for decoding by default, so any byte that is not valid in the chosen encoding raises an exception. To instead replace each undecodable byte with a placeholder (when decoding, this is the Unicode replacement character, U+FFFD, rather than '?'), set errors="replace":

with open("out.txt", "r", errors="replace") as handle:
    parse_obj = ccread(handle)
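
To see what the two error handlers do without needing the attached file, here is a minimal sketch on an in-memory byte string. The sample bytes are made up for illustration; only the 0xff byte is taken from the traceback above.

```python
# Made-up sample: four 0xff bytes (never valid in UTF-8) followed by ASCII text.
raw = b"\xff\xff\xff\xffsome ordinary ASCII text\n"

# Default behaviour (errors="strict"): decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(f"strict decoding failed: {exc}")

# errors="replace": each undecodable byte becomes U+FFFD and decoding continues.
replaced = raw.decode("utf-8", errors="replace")
print(repr(replaced))
```

The same errors argument is accepted by open(), which is what the snippet above relies on.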

When you pass a filename to cclib.io.ccread, cclib sets the error handler automatically (which is why you don't see the error on the command line).

Of course, if you know that the file is not actually in UTF-8 (the default encoding on most systems), set encoding to the correct value instead. In this case, it's most probably Latin-1 (ISO-8859-1):

with open("out.txt", "r", encoding="ISO-8859-1") as handle:
    parse_obj = ccread(handle)
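
As a rough illustration of why Latin-1 never raises a decoding error: it maps every one of the 256 possible byte values to a character, so any byte string decodes, whether or not the result is what the author of the file intended. The sample bytes below are made up; only the 0xff value comes from the traceback.

```python
# Made-up sample containing the byte that UTF-8 rejected.
raw = b"\xff\xff\xff\xffsome ordinary ASCII text\n"

# Latin-1 decodes byte 0xff to U+00FF (LATIN SMALL LETTER Y WITH DIAERESIS).
decoded = raw.decode("latin-1")
print(repr(decoded))

# Every byte value round-trips through Latin-1 unchanged.
all_bytes = bytes(range(256))
roundtrip = all_bytes.decode("latin-1").encode("latin-1")
print(roundtrip == all_bytes)
```

Because nothing is ever rejected, a successful Latin-1 decode does not by itself prove that the file was written as Latin-1.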

For more information on these options to open(), have a look at the Python documentation.

@berquist We should probably add this to the documentation somewhere, as it's likely to crop up again...

cote3804 · 2024-05-13T15:31:38Z

Thanks for the great response @oliver-s-lee. You were indeed correct that changing the encoding to latin-1 fixed the issue. What's puzzling to me is that all of my ORCA files are encoded in UTF-8 except this one log file. Definitely not a cclib issue!

oliver-s-lee · 2024-05-13T16:04:09Z

@cote3804 no problem at all.

Yeah, that is weird. 99% of the time these encoding errors are caused by umlauts, accents, or other such characters in scientists' names getting printed by the program, but I don't think that's the problem here.
Line 1125 of your output has this:

ÿÿÿÿThe total number of vibrations considered is 0

I've not seen that before; it's almost as if the output file got slightly mangled at some point...
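
For anyone wanting to track down where such stray bytes sit, here is a small sketch that scans raw bytes and reports every value above 0x7F with its line and column. The helper name find_non_ascii and the sample data are hypothetical; for a real file, the bytes would come from reading it in binary mode, e.g. open("out.txt", "rb").read().

```python
from typing import List, Tuple


def find_non_ascii(data: bytes) -> List[Tuple[int, int, int]]:
    """Return (line_number, column, byte_value) for every byte above 0x7F.

    Line numbers are 1-based and columns are 0-based byte offsets in the line.
    """
    hits = []
    for line_number, line in enumerate(data.split(b"\n"), start=1):
        for column, byte in enumerate(line):
            if byte > 0x7F:
                hits.append((line_number, column, byte))
    return hits


# Made-up two-line sample with four 0xff bytes at the start of line 2.
sample = b"A plain ASCII line\n\xff\xff\xff\xffThe total number of vibrations considered is 0\n"
hits = find_non_ascii(sample)
for line_number, column, byte in hits:
    print(f"line {line_number}, column {column}: 0x{byte:02x}")
```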

berquist · 2024-05-13T16:46:08Z

Reopening so we can use this to track the docs update.

cote3804 · 2024-05-15T17:52:32Z

In case anyone else encounters this issue, the ORCA developers have acknowledged that it is a bug and said it will be fixed in the next version.

oliver-s-lee · 2024-05-17T06:57:47Z

Just repeating the developer's response here, as the ORCA forums are annoyingly locked behind a sign-up:

Re: Different encoding when calculating vibrations of a single atom
Post by Axel_K. » Wed May 15, 2024 6:24 pm

Hello,

Actually, the ORCA output should never contain any characters other than 7-bit ASCII, but this is probably not entirely fulfilled. The strange characters you observe are clearly not wanted, and I can reproduce them with the official ORCA 5.0.4 binaries. The error will be fixed in the next release.

Best regards,
Axel
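
Since the output is meant to be plain 7-bit ASCII, one possible approach when only a binary handle is available is to wrap it in a text layer that decodes as ASCII and substitutes anything else. The sketch below uses io.TextIOWrapper from the standard library. The helper name as_text_handle and the sample bytes are hypothetical, and whether the wrapped handle suits a particular cclib version has not been checked here.

```python
import io


def as_text_handle(binary_handle, encoding="ascii", errors="replace"):
    """Wrap a binary file-like object so that reading it yields text.

    With errors="replace", bytes outside the encoding become U+FFFD
    instead of raising UnicodeDecodeError.
    """
    return io.TextIOWrapper(binary_handle, encoding=encoding, errors=errors)


# Made-up stand-in for a handle opened with open(path, "rb").
binary = io.BytesIO(b"\xff\xff\xff\xffThe total number of vibrations considered is 0\n")
text = as_text_handle(binary)
line = text.readline()
print(repr(line))
```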

oliver-s-lee self-assigned this May 13, 2024

berquist added the docs label May 13, 2024

berquist added this to the v2.0 milestone May 13, 2024

cote3804 closed this as completed May 13, 2024

berquist reopened this May 13, 2024
