[Feature request] papis doctor: Find documents with duplicate files #702
Comments
The check in #695 should report some duplication, but if the files are named differently it's a bit more complicated (and costly) to robustly compare them. Would be very useful though!
Perhaps have the

    author: Isaac Newton
    title: Opticks, or a treatise of the reflections refractions, inflections and
      colours of light
    files:
    - path: document.pdf
      md5: "c42d011aae85a44e265a8690aaf0e585"
    - path: supporting-information.pdf
      md5: "34c222ceeb0da087790f18d5ddd14662"

one could add
Thank you @hseg for your valuable input. I would discourage changing the files field: it causes compatibility issues and is not maintainable. I would suggest to implement a But one of these checks can be to have a file in the cache
then the After all these checks, we can check for the md5sum. It will always take quite a long time, but that is OK.

    time find . -type f -name '*.pdf' -exec md5sum {} \;

What do you think about this?
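For illustration only, the md5sum pass in the comment above could be sketched in Python roughly as follows. This is not papis code: the names `md5_of_file` and `group_pdfs_by_md5` are hypothetical, and the sketch assumes the library is a plain directory tree of PDF files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_pdfs_by_md5(root):
    """Hypothetical helper: map each digest shared by 2+ PDFs under root to their paths."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*.pdf")):
        if path.is_file():
            groups[md5_of_file(path)].append(path)
    # Keep only digests that occur more than once, i.e. byte-identical files.
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `group_pdfs_by_md5(".")` from the library root would list the same duplicates as sorting the output of the `find ... -exec md5sum` command by digest, regardless of how the files are named.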
Hadn't thought of that use case, thanks for the pushback. Out of curiosity, I tried hashing the 1302 papers on my system and got wildly diverging results (one run of b2 took 78 seconds, whereas the next one took 4 seconds; presumably a caching artifact?).
($ papis --version or commit number): 0.13

Sometimes multiple downloaders match a given query and download to a document, leading to duplication. It would be nice to have a papis doctor check for this situation. Possible extensions might include looking for duplicate documents (e.g. by high similarity of metadata/shared files).
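As a rough sketch of what such a check might do for a single document, the snippet below compares the files attached to one document and reports groups with identical content. It is an assumption-laden illustration, not the papis doctor API: the function name `duplicate_files_in_document` and its arguments (a document folder plus the list of file names from its `files` field) are hypothetical. Files are first bucketed by size so that only same-sized files are hashed, which keeps the costly step small.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_files_in_document(doc_folder, file_names):
    """Hypothetical check: return sorted groups of file names with identical content."""
    folder = Path(doc_folder)

    # Cheap pre-filter: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for name in file_names:
        path = folder / name
        if path.is_file():
            by_size[path.stat().st_size].append(name)

    duplicates = []
    for names in by_size.values():
        if len(names) < 2:
            continue
        # Only same-sized files are hashed (BLAKE2b from the standard library).
        by_digest = defaultdict(list)
        for name in names:
            digest = hashlib.blake2b((folder / name).read_bytes()).hexdigest()
            by_digest[digest].append(name)
        duplicates.extend(sorted(group) for group in by_digest.values() if len(group) > 1)
    return sorted(duplicates)
```

An empty result means no two listed files are byte-identical. Detecting near-duplicates (for example, the same paper fetched from two publishers with different watermarks) would need a different comparison and is outside this sketch.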