Bad UTF-8 filename encoding #30

rodarima · 2018-03-17T21:16:24Z

I have some files with a bad encoding. I think they came from a FAT32 pendrive using the latin-1 encoding, and are now on an ext4 filesystem. When I view the directory in rover, each of those files appears as an empty line; not even the size is shown.

I can replicate the behaviour by creating a bogus file (I'm using a Spanish locale, but any UTF-8 locale should work):

$ touch $(echo "bad\0355char")
$ ls
'bad'$'\355''char'
$ locale
LANG=es_ES.UTF-8
LC_CTYPE="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_PAPER="es_ES.UTF-8"
LC_NAME="es_ES.UTF-8"
LC_ADDRESS="es_ES.UTF-8"
LC_TELEPHONE="es_ES.UTF-8"
LC_MEASUREMENT="es_ES.UTF-8"
LC_IDENTIFICATION="es_ES.UTF-8"
LC_ALL=
$ rover

The problem is that the \355 byte (í in latin-1) is 0xED, which in UTF-8 is the lead byte of a 3-byte sequence. As the next byte ('c', 0x63) is not a continuation byte, the sequence is never completed and the name is not a valid UTF-8 string.
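
To make the byte-level reasoning concrete, here is a minimal sketch in C of how a UTF-8 lead byte determines the expected sequence length. The helper names (`utf8_seq_len`, `utf8_is_cont`, `utf8_seq_ok`) are hypothetical and not part of Rover, and the check is structural only: it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Expected length of a UTF-8 sequence given its first byte,
 * or 0 if the byte cannot start a sequence. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 110xxxxx */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* 1110xxxx (0xED is here) */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 11110xxx */
    return 0;  /* continuation byte (0x80-0xBF) or invalid lead byte */
}

/* Continuation bytes have the form 10xxxxxx. */
static int utf8_is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Returns 1 if the sequence starting at s[0] is structurally complete
 * within the n available bytes, 0 otherwise. */
static int utf8_seq_ok(const unsigned char *s, size_t n)
{
    int len, i;

    assert(s != NULL);
    if (n == 0)
        return 0;
    len = utf8_seq_len(s[0]);
    if (len == 0 || (size_t)len > n)
        return 0;
    for (i = 1; i < len; i++)
        if (!utf8_is_cont(s[i]))
            return 0;
    return 1;
}
```

With these helpers, `utf8_seq_len(0xED)` is 3, while `utf8_seq_ok()` rejects the bytes `\355char` because 'c' is not a continuation byte.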

The functions mbstowcs() and swprintf() fail without any visible error, returning -1 (for mbstowcs() that is (size_t)-1), as they cannot convert the string. So nothing gets copied to the WBUF buffer, and the row remains empty.
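
As an illustration of checking that return value, here is a minimal sketch in C. The helper `to_wide_checked` is hypothetical and is not Rover's actual code; it only shows how a caller could detect the `(size_t)-1` result and avoid leaving the destination buffer empty or stale. Whether a given byte string is rejected depends on the current locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert src to wide characters, falling back to
 * a "?" placeholder when mbstowcs() reports an invalid sequence.
 * Returns the number of wide characters stored, excluding the
 * terminating null. dstlen is the capacity of dst in wchar_t units. */
static size_t to_wide_checked(wchar_t *dst, size_t dstlen, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL && dstlen >= 2);
    n = mbstowcs(dst, src, dstlen - 1);
    if (n == (size_t)-1) {
        /* Conversion failed: overwrite dst so that no old or partial
         * contents are displayed. */
        dst[0] = L'?';
        dst[1] = L'\0';
        return 1;
    }
    /* mbstowcs() does not write a terminator when it fills all
     * dstlen - 1 slots, so terminate explicitly. */
    dst[n] = L'\0';
    return n;
}
```

The explicit fallback also addresses the stale-buffer case, since the destination is always overwritten with something well defined.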

If you create a bogus directory, the behavior is even more interesting. WBUF still holds whatever was written to it last, so the directory appears to be named after the CWD or the previous directory.

$ mkdir bad-$'\355'
$ rover

I was thinking about how to solve the issue, perhaps with a workaround like the one the ls(1) program uses, replacing the spurious character with a ? symbol or similar. Deletion and other operations work fine.
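
A rough sketch of that workaround in C, assuming the conversion is done one character at a time with mbrtowc() so that only the offending bytes are replaced. The function name `sanitize_name` is hypothetical, and this is not necessarily how ls(1) or Rover implements it.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert src to wide characters in the current
 * locale, substituting L'?' for every byte that is not part of a valid
 * multibyte sequence. Returns the number of wide characters stored,
 * excluding the terminating null. dstlen is the capacity of dst. */
static size_t sanitize_name(wchar_t *dst, size_t dstlen, const char *src)
{
    mbstate_t st;
    size_t out = 0;
    size_t left;

    assert(dst != NULL && src != NULL && dstlen > 0);
    left = strlen(src);
    memset(&st, 0, sizeof st);
    while (left > 0 && out + 1 < dstlen) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, src, left, &st);

        if (r == (size_t)-1 || r == (size_t)-2) {
            /* Invalid or truncated sequence: emit a placeholder, skip
             * one byte, and reset the now-unspecified shift state. */
            wc = L'?';
            r = 1;
            memset(&st, 0, sizeof st);
        } else if (r == 0) {
            break;  /* embedded null; not expected given strlen() */
        }
        dst[out++] = wc;
        src += r;
        left -= r;
    }
    dst[out] = L'\0';
    return out;
}
```

In a UTF-8 locale, the name `bad\355char` would come out as `bad?char`, so the row is never empty and the file stays selectable.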

lecram · 2018-05-03T23:12:16Z

Thanks for the detailed report, @rodarima!
The ls(1) approach looks good. I'll try to implement this in Rover when I have some time.