
[Feature request] papis doctor: Find documents with duplicate files #702

Open
hseg opened this issue Nov 13, 2023 · 4 comments

@hseg
Contributor

hseg commented Nov 13, 2023

  • papis version ($ papis --version or commit number): 0.13

Sometimes multiple downloaders match a given query and each downloads a file into the same document, leading to duplicated files. It would be nice to have a papis doctor check for this situation. Possible extensions might include looking for duplicate documents (e.g. by high similarity of metadata or shared files).

@alexfikl
Collaborator

The check in #695 should report some duplication, but if the files are named differently it's a bit more complicated (and costly) to compare them robustly. It would be very useful though!

@hseg
Contributor Author

hseg commented Nov 13, 2023

Perhaps have the files entry in the info.yaml carry a checksum of each file, and compare the checksums instead?
Indeed, with such a setup, both #695 and duplicated-keys could satisfy the request here, modulo an extension allowing the keys they check to contain a restricted jq expression, i.e. with a format like

```yaml
author: Isaac Newton
title: Opticks, or a treatise of the reflections, refractions, inflections and
  colours of light
files:
  - path: document.pdf
    md5: "c42d011aae85a44e265a8690aaf0e585"
  - path: supporting-information.pdf
    md5: "34c222ceeb0da087790f18d5ddd14662"
```

one could add .files.[].md5 to doctor-duplicated-keys-keys to get the desired behaviour (a different spelling, closer to what is used elsewhere in papis, would be files[][md5]).
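
To illustrate, such a check could boil down to inverting the stored checksums into a checksum-to-documents map. This is only a sketch: it assumes the extended files entries from the example above and papis' one-folder-per-document layout, and find_duplicate_checksums is a made-up name, not an existing papis function.

```python
import collections
import pathlib

import yaml  # PyYAML


def find_duplicate_checksums(libdir: str) -> dict:
    """Map each md5 appearing in more than one document to the
    document folders that share it."""
    seen = collections.defaultdict(list)
    for info in pathlib.Path(libdir).glob("*/info.yaml"):
        data = yaml.safe_load(info.read_text()) or {}
        for entry in data.get("files", []):
            # Only the extended form {path: ..., md5: ...} carries a checksum
            if isinstance(entry, dict) and "md5" in entry:
                seen[entry["md5"]].append(info.parent.name)
    return {md5: docs for md5, docs in seen.items() if len(docs) > 1}
```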

@alejandrogallo
Member

alejandrogallo commented Nov 13, 2023

Thank you @hseg for your valuable input.

I would discourage changing the files field: it causes compatibility issues, and it is not maintainable.
Sometimes I put comments in the files and modify the PDF, things like that.
Having to keep the checksums up to date seems like a big hurdle to me.

I would suggest implementing a duplicate-document check in doctor.
This check would be built from several sub-checks, and in the future
we can make its model better and better.

One of these sub-checks could keep a file in the cache,
${XDG_CACHE}/.../papis/duplicate-check, containing a dictionary like

```
{
    papis_id: {
        files: [
            "md5sum of first file",
            ...
        ],
        metadata_parsed_date: "date of when this metadata was collected"
    }
}
```

The duplicate-document check can then go through the selected documents,
look up each papis_id in this dictionary,
compare the metadata_parsed_date with the document's last-modified (Unix) timestamp,
and recreate the metadata if the cached entry is older, and so on.

After all these checks, we can compare the md5sums.
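
As a rough sketch of that flow (the cache path and the (papis_id, files) interface here are stand-ins for illustration, not the actual papis API):

```python
import hashlib
import json
import os
import time

# Stand-in for the proposed ${XDG_CACHE}/.../papis/duplicate-check file
CACHE = os.path.expanduser("~/.cache/papis/duplicate-check")


def md5_of(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def refresh_cache(documents) -> dict:
    """Update stale cache entries and return the cache.

    `documents` is an iterable of (papis_id, file_paths) pairs, a
    made-up interface standing in for the selected papis documents.
    """
    cache = {}
    if os.path.exists(CACHE):
        with open(CACHE) as fp:
            cache = json.load(fp)
    for papis_id, files in documents:
        entry = cache.get(papis_id)
        mtime = max(os.path.getmtime(f) for f in files)
        # Recompute checksums only if the entry is missing or older
        # than the document's last edit.
        if entry is None or entry["metadata_parsed_date"] < mtime:
            cache[papis_id] = {
                "files": [md5_of(f) for f in files],
                "metadata_parsed_date": time.time(),
            }
    os.makedirs(os.path.dirname(CACHE), exist_ok=True)
    with open(CACHE, "w") as fp:
        json.dump(cache, fp)
    return cache
```

Flagging duplicates is then a matter of inverting the cached files lists into a checksum-to-papis_id map.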

It will always take quite a long time, but that is OK.
On my work computer it takes around 6 seconds for 2k documents, which is acceptable:

```sh
time find . -type f -name '*.pdf' -exec md5sum {} \;
```

What do you think about this?

@hseg
Contributor Author

hseg commented Nov 14, 2023

I hadn't thought of that use case, thanks for the pushback.
That seems a perfectly reasonable alternative; moreover, the checksums could be used for the per-document deduplication I suggested as an extension, which is nice.
As for the speed question, besides the fact that this is probably going to be an on-demand operation, we could always expose a configuration option for hash selection (see the sketch after the timings below).

Out of curiosity, I tried hashing the 1302 papers on my system and got wildly diverging results (one run of b2sum took 78 seconds, whereas the next took 4s; presumably a caching artifact?):

```
b2sum:      4.28s
cksum:      1.97s
md5sum:     5.0s
sha1sum:    4.0s
sha224sum:  7.1s
sha256sum:  7.2s
sha384sum:  5.4s
sha512sum:  5.4s
sum:        4.8s
xxh128sum:  1.6s
xxh32sum:   1.6s
xxh64sum:   1.5s
xxhsum:     1.4s
```
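
For what it's worth, hash selection would be cheap to implement on top of hashlib, which already ships md5, sha1, sha256, and blake2b (the b2 above); the xxh* family would need the third-party xxhash package. A sketch, with a made-up default:

```python
import hashlib


def file_digest(path: str, algorithm: str = "blake2b") -> str:
    """Checksum a file with a user-configurable hashlib algorithm."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fp:
        # Read in 1 MiB chunks so large PDFs do not load fully into memory
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```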
