Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework calls to git #30

Open
andrewshadura opened this issue Feb 3, 2021 · 4 comments
Open

Rework calls to git #30

andrewshadura opened this issue Feb 3, 2021 · 4 comments

Comments

@andrewshadura
Copy link
Owner

andrewshadura commented Feb 3, 2021

Right now, git-crecord calls git to perform all Git operations; it uses code derived from Mercurial to do that. However, it’s quite verbose and sometimes using subprocess directly was still necessary.

pristine-lfs uses python-sh to talk to Git, but maybe using Dulwich or GitPython could work too?

In the past, Dulwich was used for some functionality, but other parts required Git anyway, so Dulwich was replaced by a hand-rolled alternative. Maybe this is the time to reconsider.

@andrewshadura andrewshadura added this to To Do in git-crecord Mar 8, 2021
@okeeblow
Copy link
Contributor

Relevant: I had to work around an issue caused by the parsing of git diff's stdout: #31

@okeeblow
Copy link
Contributor

okeeblow commented Dec 3, 2021

I also have a file which breaks git-crecord's assumption that git-diff's bytes output can be decoded as valid UTF-8:

[okeeblow@emi#CHECKING YOU OUT] git crecord
Traceback (most recent call last):
  File "/home/okeeblow/.local/bin/git-crecord", line 4, in <module>
    main()
  File "/home/okeeblow/Works/git-crecord/git_crecord/main.py", line 203, in main
    crecord_core.dorecord(ui, repo, None, **(opts))
  File "/home/okeeblow/Works/git-crecord/git_crecord/crecord_core.py", line 202, in dorecord
    return recordfunc(ui, repo, "", None, opts)
  File "/home/okeeblow/Works/git-crecord/git_crecord/crecord_core.py", line 61, in recordfunc
    chunks = crpatch.parsepatch(fp)
  File "/home/okeeblow/Works/git-crecord/git_crecord/crpatch.py", line 665, in parsepatch
    for newstate, data in scanpatch(fp):
  File "/home/okeeblow/Works/git-crecord/git_crecord/crpatch.py", line 55, in scanpatch
    for line in iter(lr.readline, ''):
  File "/home/okeeblow/Works/git-crecord/git_crecord/crpatch.py", line 29, in readline
    return self.fp.readline().decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 69: invalid start byte

Here's the line it's trying to decode, the first line of a MacJapanese-encoded Rich Text Format document:

b'+{\\rtf0\\mac\\deff21{\\fonttbl{\\f16384 \\fnil Osaka;}{\\f16436 \\fnil Osaka\x81|\x93\x99\x95\x9d;}

The problematic bytes there (\x81\x7c\x93\x99\x95\x9d) fail to decode as valid UTF-8 because they fall within the range which must be expressed as UTF8-2, two bytes with a preceding control-plane identifier:

>>> b'\x81'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
>>> b'\xc2\x81'.decode('utf-8')
'\x81'
>>> b'\xc3\x81'.decode('utf-8')
'Á'

However in MacJapanese those are valid two-byte sequences representing part of the font name Osaka-等幅, the default System font for KanjiTalk and the language kit, and that does match the raw bytes in the file:

irb(main):134:0> "Osaka\u2212等幅".encode(Encoding::Shift_JIS).bytes.map { _1.to_s(16) }
=> ["4f", "73", "61", "6b", "61", "81", "7c", "93", "99", "95", "9d"]
[okeeblow@emi#rtf] xxd -l 80 CYO\ MacWrite\ Pro\ 1.5\ RTF
00000000: 7b5c 7274 6630 5c6d 6163 5c64 6566 6632  {\rtf0\mac\deff2
00000010: 317b 5c66 6f6e 7474 626c 7b5c 6631 3633  1{\fonttbl{\f163
00000020: 3834 205c 666e 696c 204f 7361 6b61 3b7d  84 \fnil Osaka;}
00000030: 7b5c 6631 3634 3336 205c 666e 696c 204f  {\f16436 \fnil O
00000040: 7361 6b61 817c 9399 959d 3b7d 7b5c 6631  saka.|....;}{\f1

I started working on a fix for this by setting decode's errors argument in crpatch.py to ignore or replace, causing any invalid sequences to be skipped or replaced with U+FDD0 NONCHARACTER (�), respectively. That did fix the crecord interface, but then the ignored or replaced contents ended up getting committed instead of the actual bytes.

Then I started converting git-crecord to use Python3's bytes internally instead of the unicode str and got it kinda working (UI functional but patch contents still getting messed up somewhere), but it got to be a pretty involved change that I don't have time to test thoroughly so I stashed it and just committed this file with git commit <filename> before I got back to using crecord for other files.

I'm intentionally doing things which put me in contact with esoteric formats and encodings as part of a file-identification library I'm working on, so this is probably not something many other users will encounter, but FYI regardless :)

@andrewshadura
Copy link
Owner Author

I believe your file should work when I finish my current refactoring effort: #33.

@andrewshadura
Copy link
Owner Author

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

2 participants