
[feature] improved document hashing and duplicate checking #841

Open · pmiam opened this issue May 2, 2024 · 8 comments

Comments

pmiam (Contributor) commented May 2, 2024

Hello! This issue is a follow-up to my post in #549 (comment).

My PR branch is on my fork; it doesn't pass all the tests yet:
https://github.com/pmiam/papis/tree/PR/redefine-papis-id-to-aid-duplicate-sniffing

The scope of my changes has grown slightly, thanks to feedback from @alexfikl and @choreutes. I saw an opportunity to add an option to use a deterministic papis_id as a duplicate checker for the rare case in which reference files are redundant despite differing metadata.

Summary of the current PR branches (I've split them up for the time being instead of keeping a single branch):

  1. re-organized the add command's run function
    • add files to tmp_document as soon as possible
    • add tmp_document to the database early
    • delete get_document_hash_folder
  2. instead of hashing a byte string of the doc dict, I reused some of the code from the old get_document_hash_folder to hash file contents, so this should be no different from the old behavior (see the sketch below)
    • using hashlib's md5 constructor with usedforsecurity=False for the benefit of FIPS-compliant users
    • this requires bumping the minimum Python version to 3.9
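For concreteness, roughly the shape of that hashing (a minimal sketch, not the exact PR code; the function name and chunk size here are just for illustration):

```python
import hashlib

def hash_file(path: str, chunk_size: int = 8192) -> str:
    # usedforsecurity=False declares a non-cryptographic use of MD5,
    # which keeps it available on FIPS-restricted builds (Python 3.9+).
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as f:
        # Read in chunks so large attachments never load fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```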

My current issue is that I need to rework the duplicate checking in order to pass the tests. Currently, duplicate checking is a try-except-else block in the add command's run function.

In order to use papis_id in duplicate checking, the database entry must be created before the check runs, but then the check trivially finds that same entry. There's no simple way around that with the current code.
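To illustrate why the fresh entry trips the check (modeling the database as a plain list of dicts here, not the real papis API):

```python
# "papis_id" and "folder" stand in for the real document fields and lookup.
def has_duplicate(db, tmp_document) -> bool:
    matches = [d for d in db if d["papis_id"] == tmp_document["papis_id"]]
    # The entry we just inserted always matches itself, so exclude it
    # and only report *other* documents sharing the same id.
    others = [d for d in matches if d["folder"] != tmp_document["folder"]]
    return bool(others)

tmp_document = {"papis_id": "abc123", "folder": "/lib/doc-a"}
db = [tmp_document]  # the entry is created before the check runs
assert not has_duplicate(db, tmp_document)  # no longer trivially "found"
```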

I think duplicate checking should be (or possibly already is) a database function, but I welcome some more feedback.

alexfikl (Collaborator) commented May 2, 2024

I haven't looked at your duplicate checking changes, but I can't say I understand why you'd want to use papis_id as a duplicate check.

I suggest you start with a PR containing the clearer changes, e.g. using papis_id instead of get_document_hash_folder, and then we can see whether it's worth reworking anything else.

pmiam (Contributor, Author) commented May 3, 2024

No duplicate checking changes yet. I'd like to discuss what they should be, if anything, before I move forward.

Thanks for the advice. I refocused on my original goal and opened my PR. I had no idea you could make a draft PR. Seems obvious in hindsight.

Using a checksum to compare files is common in my work. Comparisons based on metadata are definitely more useful when collecting e.g. literature or videos, but I also do a lot of archiving. Note that I'm brand new to Papis, but I think it could help me there, especially if it kept track of meaningful checksums. Of course, it will be completely opt-in.

alexfikl (Collaborator) commented May 3, 2024

> No duplicate checking changes yet. I'd like to discuss what they should be, if anything, before I move forward.
>
> Thanks for the advice. I refocused on my original goal and opened my PR. I had no idea you could make a draft PR. Seems obvious in hindsight.

Thank you! That looks a lot more manageable in both scope and code 😁 (always nice to see as a reviewer!)

> Using a checksum to compare files is common in my work. Comparisons based on metadata are definitely more useful when collecting e.g. literature or videos, but I also do a lot of archiving. Note that I'm brand new to Papis, but I think it could help me there, especially if it kept track of meaningful checksums. Of course, it will be completely opt-in.

I'm not sure what benefit you would get from storing a checksum in the info.yml file. Can you give some examples of what workflow you had in mind? I would have thought other tools are much better at archiving and checksumming a bunch of folders.

pmiam (Contributor, Author) commented May 3, 2024

I am very glad to hear it. I'm a novice contributor, so I'm sure to learn a lot.

> I'm not sure what benefit you would get from storing a checksum in the info.yml file.

The papis_id is already being stored in the info file, no? What's the harm in it doubling as a checksum?

This is something you could help me better understand: I noticed in several places that you go to some effort to keep the info.yaml file "pure." At least, saving a document after adding a computed key to it is often avoided. In the yaml dump function used to feed the editor, even the papis ID is popped off (though it still appears in the file, I think).

I see the value in keeping the info file friendly, but the option to pull a checksum from a file with jq or something is pretty friendly to me, especially if that checksum is the root of a great big hash tree.
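For example, assuming a hypothetical checksum key were stored (papis has no such key today), pulling it out is a one-liner in any language:

```python
import yaml  # PyYAML

# Hypothetical info file contents: papis does not store a "checksum"
# key today; this only shows how scriptable the plain-text format is.
info_yml = """
title: Some Archived Page
checksum: 9e107d9d372bb6826bd81d3542a419d6
"""

info = yaml.safe_load(info_yml)
print(info["checksum"])
```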

> Can you give some examples of what workflow you had in mind?

The main thing I have in mind is periodically performed webpage archives. If nothing changes in a period, the checksum catches it. You may well be right that there are "better" tools out there. I have a tendency to miss them, evidently. I slept on Papis for years, hahaha.
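A minimal sketch of that idea, with an assumed digest cache (none of this is existing papis behavior):

```python
import hashlib

def should_archive(snapshot: bytes, seen: set[str]) -> bool:
    # Digest the fetched page; usedforsecurity=False as in the PR branch.
    digest = hashlib.md5(snapshot, usedforsecurity=False).hexdigest()
    if digest in seen:
        return False  # page unchanged since the last snapshot
    seen.add(digest)
    return True  # new content: worth adding to the library

seen: set[str] = set()
assert should_archive(b"<html>v1</html>", seen)
assert not should_archive(b"<html>v1</html>", seen)  # duplicate skipped
```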

I'm not a professional archivist, I just like to keep track of some things. And I really like papis for its interface(s), composability, hackability, etc. If the option is available, why not have it?

alexfikl (Collaborator) commented May 4, 2024

> The papis_id is already being stored in the info file, no? What's the harm in it doubling as a checksum?

Calling it a checksum carries some important implications, e.g. about the security and integrity of the data, that I'm not sure we can guarantee. For example, since the papis library is plain text, anyone can run `sed 's/title/titles/g' -i doc/info.yml` and invalidate the checksum.

> This is something you could help me better understand: I noticed in several places that you go to some effort to keep the info.yaml file "pure." At least, saving a document after adding a computed key to it is often avoided. In the yaml dump function used to feed the editor, even the papis ID is popped off (though it still appears in the file, I think).

I'm not sure I understand the question. In papis.document.dump, we just pop the papis_id because it's mostly an internal identifier and isn't useful when displaying a document to the user.

> Can you give some examples of what workflow you had in mind?

> The main thing I have in mind is periodically performed webpage archives. If nothing changes in a period, the checksum catches it. You may well be right that there are "better" tools out there. I have a tendency to miss them, evidently. I slept on Papis for years, hahaha.

I think any backup software will do that for you, i.e. compare the latest backup with the current state and only archive the changes. To keep it on the command line, I've heard good things about borg, but I've never actually used it.

pmiam (Contributor, Author) commented May 4, 2024

> I think any backup software will [...] compare the latest backup with the current state and only archive the changes. To keep it on the command line, I've heard good things about borg

I actually already use borg for all my local backups. It's great.

That said, borg doesn't provision collector programs to populate its archives with random internet pages. If I really prioritized de-duplicating storage, I could use papis to fetch web snapshots into a borg archive mounted to my filesystem through FUSE. Just the same, I could put my papis library on a ZFS dataset with de-duplication enabled. Then I'd really be limiting my on-disk storage to unique 512-byte blocks.

> Calling it a checksum carries some important implications [...] about the security and integrity of the data

Fair enough. I mean "checksum" as a representative digest of some data that could be used to check its integrity and identity. Checking integrity is not relevant here, and neither is defending against collision attacks. Checking identity requires keeping indexes/hashtables/caches like those already employed by borg/ZFS/papis.

Borg and ZFS definitely beat Papis at this particular game, but I'm not interested in playing at that level. The simple option to check incoming data against a cache of prior digests is useful, and if papis won't do it, another tool will have to, for the few users like me who want it.

Not much can be done about intentionally invalidated checksums, I think. As of my PR, they would be recalculated by `papis add` operations anyway.

> we just pop the papis_id because it's mostly an internal identifier and isn't useful when displaying a document to the user.

Indeed, the papis_id is just a random string right now, and I think it should stay random by default, but it doesn't have to be. Even if it isn't, hiding it in curated displays is fair enough.

I think I'll make a draft PR with the (mostly done) hashing changes. That might do a better job of explaining my goal here. If it does, we can keep talking about the best way to enable duplicate checking with it.

pmiam (Contributor, Author) commented May 5, 2024

I slept on this and I think I've come around to your side.

The uid is not a checksum. Again, obvious in hindsight. I still think I'll make that draft. It might be useful if papis.id.compute_an_id can digest entry contents sometime in the future. However, that won't have anything to do with papis.commands.add, the typical use of papis.database.Database.maybe_compute_id, or duplicate checking. I agree data de-duplication is much better handled by any of the previously mentioned archival tools.

However, it's still possible that duplicate checking might benefit from refactoring. At least, that try-except-else block could be its own function.
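Something like this, purely as a sketch; every name here is hypothetical rather than the actual papis code:

```python
class DuplicateDocumentError(Exception):
    """Hypothetical: raised when an equivalent document already exists."""

def locate_similar(db, document):
    # Stand-in for the lookup the add command performs today; like the
    # real code path, assume it raises when nothing comparable is found.
    return next(d for d in db if d.get("title") == document.get("title"))

def check_duplicates(db, document) -> None:
    # The add command's try-except-else block, pulled out into a helper
    # that other commands or plugins could reuse.
    try:
        found = locate_similar(db, document)
    except StopIteration:
        return  # nothing comparable in the library: safe to add
    else:
        raise DuplicateDocumentError(
            f"a similar document already exists: {found}")
```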

alexfikl (Collaborator) commented May 5, 2024

> However, it's still possible that duplicate checking might benefit from refactoring. At least, that try-except-else block could be its own function.

Yeah, that definitely sounds like a useful function to have around! At the moment, I don't think any other papis command would need it, but it would be nice to clean it up and make it available if any plugin wants to use it.
