[feature] improved document hashing and duplicate checking #841
I haven't looked at your duplicate-checking changes, but I can't say I understand why you'd want to use that approach. I suggest you start with a PR containing the clearer changes.
No duplicate-checking changes yet. I'd like to discuss what they should be (and whether they should exist at all) before I move forward. Thanks for the advice. I refocused on my original goal and opened my PR. I had no idea you could make a draft PR; seems obvious in hindsight. Using a checksum to compare files is common in my work. Comparisons based on metadata are definitely more useful when collecting e.g. literature or videos, but I also do a lot of archiving. Note: I'm brand new to Papis, but I think it could help me there, especially if it kept track of meaningful checksums. Of course, it will be completely opt-in.
Thank you! That looks a lot more manageable in both scope and code 😁 (always nice to see as a reviewer!)
I'm not sure what benefit you would have in storing a checksum in the info file.
I am very glad to hear it. I'm a novice contributor, so I'm sure to learn a lot.
The papis_id is already being stored in the info file, no? What's the harm in it doubling as a checksum? This is something you could help me better understand: I noticed in several places that you go to some effort to keep the info.yaml file "pure." At least, saving a document after adding a computed key to it is often avoided. In the YAML dump function used to feed the editor, even the papis ID is popped off (though it still appears in the file, I think). I see the value in keeping the info file friendly, but the option to pull a checksum from a file with jq or something is pretty friendly to me, especially if that checksum is the root of a great big hash tree.
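To make the idea concrete, here is a minimal sketch of computing a per-file digest that could sit in an info file as one more plain key. The `file_checksum` helper and the `checksum` key are my own illustrations, not papis's actual API:

```python
import hashlib
import tempfile

def file_checksum(path: str, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large attachments never load fully into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Info entries are plain key/value pairs, so a checksum could sit alongside
# the rest of the metadata ("checksum" is a made-up key for illustration).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"archived page contents")
    sample_path = tmp.name

info = {"title": "Some archived page", "checksum": file_checksum(sample_path)}
print(info["checksum"])
```

A value like this is exactly as easy to pull back out with jq (after converting the YAML to JSON) as any other key, which is the "friendliness" argued for above.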
The main thing I have in mind is periodically performed webpage archives. If nothing changes in a period, the checksum catches it. You may be very correct that there are "better" tools out there; I have a tendency to miss them, evidently. I slept on Papis for years, hahaha. I'm not a professional archivist, I just like to keep track of some things. And I really like Papis for its interface(s), composability, hackability, etc. If the option is available, why not have it?
Calling it a checksum has some important implications, e.g. related to security and integrity of the data, that I'm not sure how we can guarantee. For example, since the papis library is plain text, anyone can do a
I'm not sure I understand the question. In
I think any backup software will do that for you, i.e. compare the latest backup with the current state and only archive the changes. To keep it on the command line, I've heard good things about borg, but I've never actually used it.
I actually already use borg for all my local backups. It's great. That said, borg doesn't provision collector programs to populate its archives with random internet pages. If I really prioritized de-duplicating storage, I could use papis to fetch web snapshots into a borg archive mounted to my filesystem through FUSE. Just the same, I could put my papis library on a ZFS dataset with de-duplication enabled. Then I'd really be limiting my on-disk storage to unique 512-byte blocks.
Fair enough. I mean "checksum" as a representative digest of some data that could be used to check its integrity and identity. Checking integrity is not relevant here, and neither is defending against collision attacks. Checking identity requires keeping indexes/hashtables/caches like those already employed by borg/ZFS/papis. Borg and ZFS definitely beat Papis at this particular game, but I'm not interested in playing at that level. The simple option to check incoming data against a cache of prior digests is useful, and if papis won't do it, another tool will have to for those few users who want it, like me. Not much to be done about intentionally invalidated checksums, I think; as of my PR, they would be recalculated.
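A digest cache of that kind is only a few lines. This sketch (all names are hypothetical, not anything papis ships) checks incoming data against previously seen digests:

```python
import hashlib

def digest(data: bytes) -> str:
    """Representative digest of a payload, used for identity rather than security."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical cache of digests seen in earlier fetches. In practice this
# would be persisted alongside the library, not kept in memory.
seen: set[str] = set()

def is_duplicate(data: bytes) -> bool:
    """Return True if this exact payload was ingested before; record it otherwise."""
    key = digest(data)
    if key in seen:
        return True
    seen.add(key)
    return False

print(is_duplicate(b"snapshot of a page"))  # False: first sighting
print(is_duplicate(b"snapshot of a page"))  # True: unchanged since last fetch
```

For the periodic-webpage-archive use case above, a repeated `True` simply means nothing changed in that period and the snapshot can be skipped.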
Indeed, the papis_id is just a random string right now, and I think it should stay random by default, but it doesn't have to be random. Even if it isn't random, hiding it in curated displays is fair enough. I think I'll make a draft PR with the (mostly done) hashing changes. That might do a better job of explaining my goal here. If it does, we can keep talking about the best way to enable duplicate checking with it.
I slept on this and I think I've come around to your side. The uid is not a checksum. Again, obvious in hindsight. I still think I'll make that draft. It might be useful if papis.id.compute_an_id can digest entry contents sometime in the future. However, that won't have anything to do with papis.commands.add, the typical use of papis.database.Database.maybe_compute_id, or duplicate checking. I agree data de-duplication is much better handled by any of the previously mentioned archival tools. However, it's still possible that duplicate checking might benefit from refactoring. At least, that try-except-else block could be its own function. |
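For what it's worth, extracting that try-except-else block into a standalone function could look something like the following sketch. The database stand-in, exception, and function names here are my own, not papis's real internals:

```python
class DuplicateError(Exception):
    """Raised when a matching document already exists in the library."""

# Stand-in "database": maps a lookup key to an existing document.
_db = {"doc-abc123": {"title": "already here"}}

def find_document(key: str) -> dict:
    """Look up a document, raising KeyError when there is no match."""
    if key not in _db:
        raise KeyError(key)
    return _db[key]

def check_duplicate(key: str) -> None:
    """The try-except-else from the add command's run function, as its own unit.

    Raises DuplicateError if a matching document already exists; returns
    silently when the key is new and safe to add.
    """
    try:
        existing = find_document(key)
    except KeyError:
        return  # no match: safe to add
    else:
        raise DuplicateError(f"document already in library: {existing['title']}")

check_duplicate("doc-new")  # no error: not in the library yet
try:
    check_duplicate("doc-abc123")
except DuplicateError as err:
    print(err)
```

Once isolated like this, the same check is trivially callable from a database method or a plugin, which is the refactoring benefit being discussed.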
Yeah, that definitely sounds like a useful function to have around! At the moment, I don't think any other papis command would need it, but it would be nice to clean it up and make it available in case any plugin wants to use it.
Hello! This issue is a follow-up to my post here:
Originally posted by @pmiam in #549 (comment)
My PR branch is on my fork. I've not passed all the tests yet.
https://github.com/pmiam/papis/tree/PR/redefine-papis-id-to-aid-duplicate-sniffing
The scope of my changes has grown slightly, thanks to your feedback @alexfikl and @choreutes. I saw an opportunity to enable the option to use a deterministic `papis_id` as a useful duplicate checker for the rare case in which reference files are redundant despite differing metadata.

Summary of current PR branches (instead of a single branch, I've split them up for the time being):

- the add command's `run` function: `get_document_hash_folder`
- `get_document_hash_folder` to hash file contents, so this should be no different from the old behavior
- use `usedforsecurity=False` for the benefit of FIPS-compliant users

My current issue is that I need to redress the duplicate checking in order to pass tests. Currently, duplicate checking is a try-except-else block in the add command's `run` function. In order to use `papis_id` in duplicate checking, the database entry must be made before the duplicate check; but then that same entry is trivially found by the check, and there's no simple way around that with the current code. I think duplicate checking should be (or possibly already is) a database function, but I welcome some more feedback.
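On the FIPS point: keeping md5 for file hashing while flagging it as non-cryptographic is straightforward on Python 3.9+, where hashlib constructors accept a `usedforsecurity` keyword. A hedged sketch (the helper name is mine, not the PR's code):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hash file contents with MD5, matching the old behavior, but mark the
    call as non-security so FIPS-mode OpenSSL builds don't reject it.

    The usedforsecurity keyword exists on hashlib constructors since Python 3.9;
    fall back to a plain call on older interpreters.
    """
    try:
        return hashlib.md5(data, usedforsecurity=False).hexdigest()
    except TypeError:  # Python < 3.9 without the keyword
        return hashlib.md5(data).hexdigest()

print(content_hash(b"attached reference file"))
```

The digest value is identical either way; only the security annotation changes, so existing hash folders keep their names.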