
[feature] improved document hashing and duplicate checking #841

Open · pmiam opened this issue May 2, 2024 · 8 comments

Comments

pmiam (Contributor) commented May 2, 2024

Hello! This issue is a follow-up to my post in #549 (comment).

My PR branch is on my fork; it doesn't pass all the tests yet:
https://github.com/pmiam/papis/tree/PR/redefine-papis-id-to-aid-duplicate-sniffing

The scope of my changes has grown slightly, thanks to feedback from @alexfikl and @choreutes. I saw an opportunity to add an option to use a deterministic papis_id as a duplicate checker for the rare case in which reference files are redundant despite differing metadata.

Summary of the current PR branches (I've split them up for the time being instead of keeping a single branch):

  1. re-organized the add command's run function
    • add files to tmp_document as soon as possible
    • add tmp_document to the database early
    • delete get_document_hash_folder
  2. instead of hashing a byte string of the doc dict, I reused some of the code from the old get_document_hash_folder to hash file contents, so this should be no different from the old behavior (see the sketch below)
    • using hashlib's md5 constructor with usedforsecurity=False for the benefit of FIPS-compliant users
    • this requires bumping the minimum Python version to 3.9
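For concreteness, roughly the shape of that hashing (a minimal sketch, not the exact PR code; the function name and chunk size here are just for illustration):

```python
import hashlib

def hash_file(path: str, chunk_size: int = 8192) -> str:
    # usedforsecurity=False declares a non-cryptographic use of MD5,
    # which keeps it available on FIPS-restricted builds (Python 3.9+).
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as f:
        # Read in chunks so large attachments never load fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```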

My current issue is that I need to rework the duplicate checking in order to pass the tests. Currently, duplicate checking is a try-except-else block in the add command's run function.

In order to use papis_id in duplicate checking, the database entry must be created before the check runs, but then the check trivially finds that same entry. There's no simple way around that with the current code.
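To illustrate why the fresh entry trips the check (modeling the database as a plain list of dicts here, not the real papis API):

```python
# "papis_id" and "folder" stand in for the real document fields and lookup.
def has_duplicate(db, tmp_document) -> bool:
    matches = [d for d in db if d["papis_id"] == tmp_document["papis_id"]]
    # The entry we just inserted always matches itself, so exclude it
    # and only report *other* documents sharing the same id.
    others = [d for d in matches if d["folder"] != tmp_document["folder"]]
    return bool(others)

tmp_document = {"papis_id": "abc123", "folder": "/lib/doc-a"}
db = [tmp_document]  # the entry is created before the check runs
assert not has_duplicate(db, tmp_document)  # no longer trivially "found"
```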

I think duplicate checking should be (or possibly already is) a database function, but I welcome some more feedback.

alexfikl (Collaborator) commented May 2, 2024

I haven't looked at your duplicate checking changes, but I can't say I understand why you'd want to use papis_id as a duplicate check.

I suggest you start with a PR containing the clearer changes, e.g. using papis_id instead of get_document_hash_folder, and then we can see whether it's worth reworking anything else.

pmiam (Contributor, Author) commented May 3, 2024

No duplicate checking changes yet. I'd like to discuss what they should be, if anything, before I move forward.

Thanks for the advice. I refocused on my original goal and opened my PR. I had no idea you could make a draft PR. Seems obvious in hindsight.

Using a checksum to compare files is common in my work. Comparisons based on metadata are definitely more useful when collecting e.g. literature or videos, but I also do a lot of archiving. Note that I'm brand new to Papis, but I think it could help me there, especially if it kept track of meaningful checksums. Of course, it will be completely opt-in.

alexfikl (Collaborator) commented May 3, 2024

> No duplicate checking changes yet. I'd like to discuss what they should be, if anything, before I move forward.
>
> Thanks for the advice. I refocused on my original goal and opened my PR. I had no idea you could make a draft PR. Seems obvious in hindsight.

Thank you! That looks a lot more manageable in both scope and code 😁 (always nice to see as a reviewer!)

> Using a checksum to compare files is common in my work. Comparisons based on metadata are definitely more useful when collecting e.g. literature or videos, but I also do a lot of archiving. Note that I'm brand new to Papis, but I think it could help me there, especially if it kept track of meaningful checksums. Of course, it will be completely opt-in.

I'm not sure what benefit you would get from storing a checksum in the info.yml file. Can you give some examples of what workflow you had in mind? I would have thought other tools are much better at archiving and checksumming a bunch of folders.

pmiam (Contributor, Author) commented May 3, 2024

I am very glad to hear it. I'm a novice contributor, so I'm sure to learn a lot.

> I'm not sure what benefit you would get from storing a checksum in the info.yml file.

The papis_id is already being stored in the info file, no? What's the harm in it doubling as a checksum?

This is something you could help me better understand: I noticed in several places that you go to some effort to keep the info.yaml file "pure." At least, saving a document after adding a computed key to it is often avoided. In the yaml dump function used to feed the editor, even the papis ID is popped off (though it still appears in the file, I think).

I see the value in keeping the info file friendly, but the option to pull a checksum from a file with jq or something is pretty friendly to me, especially if that checksum is the root of a great big hash tree.
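For example, assuming a hypothetical checksum key were stored (papis has no such key today), pulling it out is a one-liner in any language:

```python
import yaml  # PyYAML

# Hypothetical info file contents: papis does not store a "checksum"
# key today; this only shows how scriptable the plain-text format is.
info_yml = """
title: Some Archived Page
checksum: 9e107d9d372bb6826bd81d3542a419d6
"""

info = yaml.safe_load(info_yml)
print(info["checksum"])
```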

> Can you give some examples of what workflow you had in mind?

The main thing I have in mind is periodically performed webpage archives. If nothing changes in a period, the checksum catches it. You may well be right that there are "better" tools out there. I have a tendency to miss them, evidently. I slept on Papis for years, hahaha.
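A minimal sketch of that idea, with an assumed digest cache (none of this is existing papis behavior):

```python
import hashlib

def should_archive(snapshot: bytes, seen: set[str]) -> bool:
    # Digest the fetched page; usedforsecurity=False as in the PR branch.
    digest = hashlib.md5(snapshot, usedforsecurity=False).hexdigest()
    if digest in seen:
        return False  # page unchanged since the last snapshot
    seen.add(digest)
    return True  # new content: worth adding to the library

seen: set[str] = set()
assert should_archive(b"<html>v1</html>", seen)
assert not should_archive(b"<html>v1</html>", seen)  # duplicate skipped
```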

I'm not a professional archivist, I just like to keep track of some things. And I really like papis for its interface(s), composability, hackability, etc. If the option is available, why not have it?

alexfikl (Collaborator) commented May 4, 2024

> The papis_id is already being stored in the info file, no? What's the harm in it doubling as a checksum?

Calling it a checksum carries some important implications, e.g. about the security and integrity of the data, that I'm not sure we can guarantee. For example, since the papis library is plain text, anyone can run `sed 's/title/titles/g' -i doc/info.yml` and invalidate the checksum.

> This is something you could help me better understand: I noticed in several places that you go to some effort to keep the info.yaml file "pure." At least, saving a document after adding a computed key to it is often avoided. In the yaml dump function used to feed the editor, even the papis ID is popped off (though it still appears in the file, I think).

I'm not sure I understand the question. In papis.document.dump, we just pop the papis_id because it's mostly an internal identifier and isn't useful when displaying a document to the user.

> Can you give some examples of what workflow you had in mind?

> The main thing I have in mind is periodically performed webpage archives. If nothing changes in a period, the checksum catches it. You may well be right that there are "better" tools out there. I have a tendency to miss them, evidently. I slept on Papis for years, hahaha.

I think any backup software will do that for you, i.e. compare the latest backup with the current state and only archive the changes. To keep it on the command line, I've heard good things about borg, but I've never actually used it.

pmiam (Contributor, Author) commented May 4, 2024

> I think any backup software will [...] compare the latest backup with the current state and only archive the changes. To keep it on the command line, I've heard good things about borg

I actually already use borg for all my local backups. It's great.

That said, borg doesn't provision collector programs to populate its archives with random internet pages. If I really prioritized de-duplicating storage, I could use papis to fetch web snapshots into a borg archive mounted to my filesystem through FUSE. Just the same, I could put my papis library on a ZFS dataset with de-duplication enabled. Then I'd really be limiting my on-disk storage to unique 512-byte blocks.

> Calling it a checksum carries some important implications [...] about the security and integrity of the data

Fair enough. I mean "checksum" as a representative digest of some data that could be used to check its integrity and identity. Checking integrity is not relevant here, and neither is defending against collision attacks. Checking identity requires keeping indexes/hashtables/caches like those already employed by borg/ZFS/papis.

Borg and ZFS definitely beat Papis at this particular game, but I'm not interested in playing at that level. The simple option to check incoming data against a cache of prior digests is useful, and if papis won't do it, another tool will have to, for the few users like me who want it.

Not much can be done about intentionally invalidated checksums, I think. As of my PR, they would be recalculated by `papis add` operations anyway.

> we just pop the papis_id because it's mostly an internal identifier and isn't useful when displaying a document to the user.

Indeed, the papis_id is just a random string right now, and I think it should stay random by default, but it doesn't have to be. Even if it isn't, hiding it in curated displays is fair enough.

I think I'll make a draft PR with the (mostly done) hashing changes. That might do a better job of explaining my goal here. If it does, we can keep talking about the best way to enable duplicate checking with it.

pmiam (Contributor, Author) commented May 5, 2024

I slept on this and I think I've come around to your side.

The uid is not a checksum. Again, obvious in hindsight. I still think I'll make that draft. It might be useful if papis.id.compute_an_id can digest entry contents sometime in the future. However, that won't have anything to do with papis.commands.add, the typical use of papis.database.Database.maybe_compute_id, or duplicate checking. I agree data de-duplication is much better handled by any of the previously mentioned archival tools.

However, it's still possible that duplicate checking might benefit from refactoring. At least, that try-except-else block could be its own function.
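Something like this, purely as a sketch; every name here is hypothetical rather than the actual papis code:

```python
class DuplicateDocumentError(Exception):
    """Hypothetical: raised when an equivalent document already exists."""

def locate_similar(db, document):
    # Stand-in for the lookup the add command performs today; like the
    # real code path, assume it raises when nothing comparable is found.
    return next(d for d in db if d.get("title") == document.get("title"))

def check_duplicates(db, document) -> None:
    # The add command's try-except-else block, pulled out into a helper
    # that other commands or plugins could reuse.
    try:
        found = locate_similar(db, document)
    except StopIteration:
        return  # nothing comparable in the library: safe to add
    else:
        raise DuplicateDocumentError(
            f"a similar document already exists: {found}")
```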

alexfikl (Collaborator) commented May 5, 2024

> However, it's still possible that duplicate checking might benefit from refactoring. At least, that try-except-else block could be its own function.

Yeah, that definitely sounds like a useful function to have around! At the moment, I don't think any other papis command would need it, but it would be nice to clean it up and make it available if any plugin wants to use it.
