
v4.1.0

@elisemercury elisemercury released this 20 Feb 22:37

Major enhancements and new features:

  • Enhancement: fixes #80. difPy now uses an improved, more memory-efficient algorithm for handling larger datasets; see Using difPy with Large Datasets. As part of this enhancement, two new parameters were added:
    • processes has been added to difPy.build and difPy.search, and defines the number of worker processes for multiprocessing. Read more here.
    • chunksize has been added to difPy.search and sets the size of the batches in which the job is processed simultaneously when multiprocessing. Read more here.
  • Enhancement: difPy comes with improved performance due to major improvements in the comparison algorithms. As part of this enhancement, a new parameter was added:
    • lazy was added to difPy.search, which allows difPy to search more efficiently for exact duplicates (i.e. two exact file copies). By default, lazy is set to True; it should only be turned off when searching for images that are not exact duplicates (i.e. images with different dimensions, different file types, etc.). Read more here.
  • Enhancement: the default value of the similarity parameter was reduced from 50 to 5.
  • Enhancement: the progress bar has been improved.
  • New feature: difPy.search now supports the rotate parameter. If set to False, images will not be rotated on comparison, which can significantly reduce comparison times. Read more here.
  • New feature: the output structure of difPy has been adjusted for improved user-friendliness: the structure of search.result is now simpler, with fewer levels of depth, and search.lower_quality now comes as a list. When invoked via the CLI, the lower_quality output file will now be in .txt format.
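
As a rough illustration of how the new parameters fit together, here is a minimal sketch. The helper names search_options and find_duplicates are hypothetical and not part of difPy. The calling pattern (difPy.build on a folder path, then difPy.search on the build object) is assumed from the usage guide. The parameter names lazy, rotate, processes and chunksize, and the attributes search.result and search.lower_quality, are the ones named in these notes.

```python
def search_options(exact_copies_only=True, rotate=False, processes=None, chunksize=None):
    """Collect keyword arguments for difPy.search (hypothetical helper).

    Parameter names come from the v4.1.0 notes; the values here are examples only.
    """
    # lazy=True targets exact file copies; turn it off for non-exact duplicates.
    options = {"lazy": exact_copies_only, "rotate": rotate}
    # Only pass processes/chunksize when chosen explicitly, so difPy keeps its defaults.
    if processes is not None:
        options["processes"] = processes
    if chunksize is not None:
        options["chunksize"] = chunksize
    return options


def find_duplicates(folder, **options):
    """Run a build and a search on one folder (hypothetical helper, assumed call pattern)."""
    import difPy  # imported lazily; assumes difPy >= 4.1.0 is installed

    # processes is accepted by both difPy.build and difPy.search per the notes.
    build_options = {}
    if "processes" in options:
        build_options["processes"] = options["processes"]

    dif = difPy.build(folder, **build_options)
    search = difPy.search(dif, **options)
    # Per the notes, lower_quality is now a list.
    return search.result, search.lower_quality
```

A call such as find_duplicates("path/to/images", **search_options(processes=4, chunksize=1000)) would then return the matches and the list of lower-quality files. The folder path and the numbers are placeholders, not recommended settings.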

See the difPy usage guide for more details. Happy deduplicating! 🎉