Rework calls to git #30

andrewshadura · 2021-02-03T12:09:04Z

Right now, git-crecord calls git to perform all Git operations; it uses code derived from Mercurial to do that. However, it’s quite verbose and sometimes using subprocess directly was still necessary.

pristine-lfs uses python-sh to talk to Git, but maybe using Dulwich or GitPython could work too?

In the past, Dulwich was used for some functionality, but other parts required Git anyway, so Dulwich was replaced by a hand-rolled alternative. Maybe this is the time to reconsider.

The text was updated successfully, but these errors were encountered:

okeeblow · 2021-11-28T09:30:21Z

Relevant: I had to work around an issue caused by the parsing of git diff's stdout: #31

okeeblow · 2021-12-03T05:57:55Z

I also have a file which breaks git-crecord's assumption that git-diff's bytes output can be decoded as valid UTF-8:

[okeeblow@emi#CHECKING YOU OUT] git crecord
Traceback (most recent call last):
  File "/home/okeeblow/.local/bin/git-crecord", line 4, in <module>
    main()
  File "/home/okeeblow/Works/git-crecord/git_crecord/main.py", line 203, in main
    crecord_core.dorecord(ui, repo, None, **(opts))
  File "/home/okeeblow/Works/git-crecord/git_crecord/crecord_core.py", line 202, in dorecord
    return recordfunc(ui, repo, "", None, opts)
  File "/home/okeeblow/Works/git-crecord/git_crecord/crecord_core.py", line 61, in recordfunc
    chunks = crpatch.parsepatch(fp)
  File "/home/okeeblow/Works/git-crecord/git_crecord/crpatch.py", line 665, in parsepatch
    for newstate, data in scanpatch(fp):
  File "/home/okeeblow/Works/git-crecord/git_crecord/crpatch.py", line 55, in scanpatch
    for line in iter(lr.readline, ''):
  File "/home/okeeblow/Works/git-crecord/git_crecord/crpatch.py", line 29, in readline
    return self.fp.readline().decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 69: invalid start byte

Here's the line it's trying to decode, the first line of a MacJapanese-encoded Rich Text Format document:

b'+{\\rtf0\\mac\\deff21{\\fonttbl{\\f16384 \\fnil Osaka;}{\\f16436 \\fnil Osaka\x81|\x93\x99\x95\x9d;}

The problematic bytes there (\x81\x7c\x93\x99\x95\x9d) fail to decode as valid UTF-8 because they fall within the range which must be expressed as UTF8-2, two bytes with a preceding control-plane identifier:

>>> b'\x81'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
>>> b'\xc2\x81'.decode('utf-8')
'\x81'
>>> b'\xc3\x81'.decode('utf-8')
'Á'

However in MacJapanese those are valid two-byte sequences representing part of the font name Osaka－等幅, the default System font for KanjiTalk and the language kit, and that does match the raw bytes in the file:

irb(main):134:0> "Osaka\u2212等幅".encode(Encoding::Shift_JIS).bytes.map { _1.to_s(16) }
=> ["4f", "73", "61", "6b", "61", "81", "7c", "93", "99", "95", "9d"]

[okeeblow@emi#rtf] xxd -l 80 CYO\ MacWrite\ Pro\ 1.5\ RTF
00000000: 7b5c 7274 6630 5c6d 6163 5c64 6566 6632  {\rtf0\mac\deff2
00000010: 317b 5c66 6f6e 7474 626c 7b5c 6631 3633  1{\fonttbl{\f163
00000020: 3834 205c 666e 696c 204f 7361 6b61 3b7d  84 \fnil Osaka;}
00000030: 7b5c 6631 3634 3336 205c 666e 696c 204f  {\f16436 \fnil O
00000040: 7361 6b61 817c 9399 959d 3b7d 7b5c 6631  saka.|....;}{\f1

I started working on a fix for this by setting decode's errors argument in crpatch.py to ignore or replace, causing any invalid sequences to be skipped or replaced with U+FDD0 NONCHARACTER (�), respectively. That did fix the crecord interface, but then the ignored or replaced contents ended up getting committed instead of the actual bytes.

Then I started converting git-crecord to use Python3's bytes internally instead of the unicode str and got it kinda working (UI functional but patch contents still getting messed up somewhere), but it got to be a pretty involved change that I don't have time to test thoroughly so I stashed it and just committed this file with git commit <filename> before I got back to using crecord for other files.

I'm intentionally doing things which put me in contact with esoteric formats and encodings as part of a file-identification library I'm working on, so this is probably not something many other users will encounter, but FYI regardless :)

andrewshadura · 2022-03-06T21:31:58Z

I believe your file should work when I finish my current refactoring effort: #33.

andrewshadura · 2022-03-06T21:34:27Z

andrewshadura added this to To Do in git-crecord Mar 8, 2021

andrewshadura added the enhancement label Mar 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework calls to git #30

Rework calls to git #30

andrewshadura commented Feb 3, 2021 •

edited

okeeblow commented Nov 28, 2021

okeeblow commented Dec 3, 2021

andrewshadura commented Mar 6, 2022

andrewshadura commented Mar 6, 2022

Rework calls to git #30

Rework calls to git #30

Comments

andrewshadura commented Feb 3, 2021 • edited

okeeblow commented Nov 28, 2021

okeeblow commented Dec 3, 2021

andrewshadura commented Mar 6, 2022

andrewshadura commented Mar 6, 2022

andrewshadura commented Feb 3, 2021 •

edited