
[Feature request] papis doctor: Find documents with duplicate files #702

Open
hseg opened this issue Nov 13, 2023 · 4 comments

@hseg
Contributor

hseg commented Nov 13, 2023

  • papis version ($ papis --version or commit number): 0.13

Sometimes multiple downloaders match a given query and each downloads a file into the same document, leading to duplicated files. It would be nice to have a papis doctor check for this situation. Possible extensions might include looking for duplicate documents (e.g. by high similarity of metadata or shared files).

@alexfikl
Collaborator

The check in #695 should report some duplication, but if the files are named differently it's a bit more complicated (and costly) to compare them robustly. It would be very useful though!

@hseg
Contributor Author

hseg commented Nov 13, 2023

Perhaps have the files entry in the info.yaml carry a checksum of each file, and compare the checksums instead?
Indeed, with such a setup, both #695 and duplicated-keys could satisfy the request here, modulo an extension allowing the keys they check to contain a restricted jq expression, i.e. with a format like

```yaml
author: Isaac Newton
title: Opticks, or a treatise of the reflections, refractions, inflections and
  colours of light
files:
  - path: document.pdf
    md5: "c42d011aae85a44e265a8690aaf0e585"
  - path: supporting-information.pdf
    md5: "34c222ceeb0da087790f18d5ddd14662"
```

one could add .files.[].md5 to doctor-duplicated-keys-keys to get the desired behaviour (a different spelling, closer to what is used elsewhere in papis, would be files[][md5]).
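
To illustrate, such a check could boil down to inverting the stored checksums into a checksum-to-documents map. This is only a sketch: it assumes the extended files entries from the example above and papis' one-folder-per-document layout, and find_duplicate_checksums is a made-up name, not an existing papis function.

```python
import collections
import pathlib

import yaml  # PyYAML


def find_duplicate_checksums(libdir: str) -> dict:
    """Map each md5 appearing in more than one document to the
    document folders that share it."""
    seen = collections.defaultdict(list)
    for info in pathlib.Path(libdir).glob("*/info.yaml"):
        data = yaml.safe_load(info.read_text()) or {}
        for entry in data.get("files", []):
            # Only the extended form {path: ..., md5: ...} carries a checksum
            if isinstance(entry, dict) and "md5" in entry:
                seen[entry["md5"]].append(info.parent.name)
    return {md5: docs for md5, docs in seen.items() if len(docs) > 1}
```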

@alejandrogallo
Member

alejandrogallo commented Nov 13, 2023

Thank you @hseg for your valuable input.

I would discourage changing the files field: it causes compatibility issues, and it is not maintainable.
Sometimes I put comments in the files and modify the PDF, things like that.
Having to keep the checksums up to date seems like a big hurdle to me.

I would suggest implementing a duplicate-document check in doctor.
This check would be built from several sub-checks, and in the future
we can make its model better and better.

One of these sub-checks could keep a file in the cache,
${XDG_CACHE}/.../papis/duplicate-check, containing a dictionary like

```
{
    papis_id: {
        files: [
            "md5sum of first file",
            ...
        ],
        metadata_parsed_date: "date of when this metadata was collected"
    }
}
```

The duplicate-document check can then go through the selected documents,
look up each papis_id in this dictionary,
compare the metadata_parsed_date with the document's last-modified (Unix) timestamp,
and recreate the metadata if the cached entry is older, and so on.

After all these checks, we can compare the md5sums.
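
As a rough sketch of that flow (the cache path and the (papis_id, files) interface here are stand-ins for illustration, not the actual papis API):

```python
import hashlib
import json
import os
import time

# Stand-in for the proposed ${XDG_CACHE}/.../papis/duplicate-check file
CACHE = os.path.expanduser("~/.cache/papis/duplicate-check")


def md5_of(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def refresh_cache(documents) -> dict:
    """Update stale cache entries and return the cache.

    `documents` is an iterable of (papis_id, file_paths) pairs, a
    made-up interface standing in for the selected papis documents.
    """
    cache = {}
    if os.path.exists(CACHE):
        with open(CACHE) as fp:
            cache = json.load(fp)
    for papis_id, files in documents:
        entry = cache.get(papis_id)
        mtime = max(os.path.getmtime(f) for f in files)
        # Recompute checksums only if the entry is missing or older
        # than the document's last edit.
        if entry is None or entry["metadata_parsed_date"] < mtime:
            cache[papis_id] = {
                "files": [md5_of(f) for f in files],
                "metadata_parsed_date": time.time(),
            }
    os.makedirs(os.path.dirname(CACHE), exist_ok=True)
    with open(CACHE, "w") as fp:
        json.dump(cache, fp)
    return cache
```

Flagging duplicates is then a matter of inverting the cached files lists into a checksum-to-papis_id map.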

It will always take quite a long time, but that is OK.
On my work computer it takes around 6 seconds for 2k documents, which is acceptable:

```sh
time find . -type f -name '*.pdf' -exec md5sum {} \;
```

What do you think about this?

@hseg
Contributor Author

hseg commented Nov 14, 2023

I hadn't thought of that use case, thanks for the pushback.
That seems a perfectly reasonable alternative; moreover, the checksums could be used for the per-document deduplication I suggested as an extension, which is nice.
As for the speed question, besides the fact that this is probably going to be an on-demand operation, we could always expose a configuration option for hash selection (see the sketch after the timings below).

Out of curiosity, I tried hashing the 1302 papers on my system and got wildly diverging results (one run of b2sum took 78 seconds, whereas the next took 4s; presumably a caching artifact?):

```
b2sum:      4.28s
cksum:      1.97s
md5sum:     5.0s
sha1sum:    4.0s
sha224sum:  7.1s
sha256sum:  7.2s
sha384sum:  5.4s
sha512sum:  5.4s
sum:        4.8s
xxh128sum:  1.6s
xxh32sum:   1.6s
xxh64sum:   1.5s
xxhsum:     1.4s
```
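
For what it's worth, hash selection would be cheap to implement on top of hashlib, which already ships md5, sha1, sha256, and blake2b (the b2 above); the xxh* family would need the third-party xxhash package. A sketch, with a made-up default:

```python
import hashlib


def file_digest(path: str, algorithm: str = "blake2b") -> str:
    """Checksum a file with a user-configurable hashlib algorithm."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fp:
        # Read in 1 MiB chunks so large PDFs do not load fully into memory
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```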
