Enhancement - Optional parameter set for source folder / comparison folder mode #72

Get-Ryan · 2023-04-16T20:10:25Z

My apologies if this is already possible, but I didn't find any method within the documentation at https://difpy.readthedocs.io/en/latest/usage.html.

Currently, the multiple folder method within difPy searches for duplicates amongst all the folders listed. However, once you have de-duplicated the images within a folder, if you then look to compare an additional folder against those that you have already de-duplicated, the de-duplicated folder's contents again get compared against themselves.

It would be nice if there was an optional parameter set where a source folder could be set that multiple comparison folders could check against. Basically, difPy would assume the image contents in the source folder are unique, and only need to be processed against duplicates within the provided comparison folders.

I believe this would help in larger image projects like mine. In my scenario, I've downloaded photos from my partner's and my phones and tablets multiple times throughout the years. As I started de-duplicating with difPy, I would move the de-duplicated files into a central project folder and then scan the next photo dump folder against the project folder. Since the project folder already contains only unique images, I don't need difPy to check those images against each other again, but I'm not seeing any existing method to do that. I think this would be a noticeable performance improvement, especially as image sets get larger and larger.
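To put a rough number on the potential saving, here is a back-of-the-envelope sketch. It assumes difPy compares every pair of images exactly once and ignores shortcuts such as `fast_search`; the function names and folder sizes are made up for illustration and are not part of difPy.

```python
from math import comb

def all_pairs(n_source, n_comparison):
    # Every image is compared with every other image, regardless of folder.
    return comb(n_source + n_comparison, 2)

def source_mode_pairs(n_source, n_comparison):
    # Hypothetical source/comparison mode: source images are assumed unique,
    # so only source-vs-comparison and comparison-vs-comparison pairs remain.
    return n_source * n_comparison + comb(n_comparison, 2)

source, new_dump = 20_000, 500  # illustrative sizes only
print(all_pairs(source, new_dump))          # 210114750
print(source_mode_pairs(source, new_dump))  # 10124750
```

The difference is exactly the number of pairs inside the source folder, so the saving grows with the square of the size of the already de-duplicated set.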

dedefelrodrigues · 2023-05-03T19:51:22Z

I'd like this as well; it would let me make better use of difPy in my projects.

UplandsDynamic · 2023-05-07T18:57:17Z

@elisemercury, just noticed this issue and had a quick look to see how it might be implemented.

I've not coded/tested this yet - just did a very quick code review and noted down the idea, so it may not work (and may well have idiotic mistakes!). But if you think it's a valid approach - and want to add this feature - let me know and I'll code/test it and open a pull request.

1. Add an input param to take a list of directories where files should only be checked against files located outwith directories in this list, and assign its value to a new `dif` class parameter, e.g., `self.dupe_free_dirs`.
2. Change the `_search.exclude_from_search` variable to a `dif` class parameter and pass that in as an arg to both the `_search.matches` and `_compute.id_by_location` methods.
3. Amend line 339 to `if (number_B > number_A) and id_B not in self.exclude_from_search`.
4. Use the existing directory for-loop in `_compute.id_by_location` to check the file locations against directories stored in `self.dupe_free_dirs`. If found, add the file ID (once created) to `self.exclude_from_search`.
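One reading of the idea, as a standalone sketch rather than the exact amendments listed above: collect the IDs of files that sit under a "dupe-free" directory, then skip any pair in which both files come from that set. The helpers `in_any_dir` and `candidate_pairs` are hypothetical and do not exist in difPy; `id_by_location` is assumed to map a file ID to its path.

```python
from itertools import combinations
from pathlib import Path

def in_any_dir(file_path, directories):
    # True if file_path lies somewhere below one of the given directories.
    path = Path(file_path)
    return any(Path(d) in path.parents for d in directories)

def candidate_pairs(id_by_location, dupe_free_dirs):
    # IDs of files assumed to be unique among themselves.
    dupe_free_ids = {
        file_id for file_id, location in id_by_location.items()
        if in_any_dir(location, dupe_free_dirs)
    }
    for id_a, id_b in combinations(id_by_location, 2):
        # Skip only when BOTH files are dupe-free; mixed pairs are still compared.
        if id_a in dupe_free_ids and id_b in dupe_free_ids:
            continue
        yield id_a, id_b

locations = {1: "project/a.jpg", 2: "project/sub/b.jpg", 3: "dump/c.jpg"}
print(list(candidate_pairs(locations, ["project"])))  # [(1, 3), (2, 3)]
```

Checking both sides of the pair matters here: if dupe-free IDs were simply skipped on either side, those files would never be compared against the new folders at all.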

jdoe1917 · 2023-10-18T03:28:56Z

I had implemented a very rough way to do pairwise comparison between folders in difPy v3, but I don't have the knowledge to do it in v4. It only works for two folders, but it was sometimes useful when you need to compare a small number of files (500) against a much larger set (20,000) and don't want the run time to grow quadratically with the total number of images. The break point (bp) between the folders is hard-coded here and is the number of images in the smaller folder.

```python
# Assumes this replaces _matches inside the difPy v3 module, where Path,
# _help and _compute are already available.
def _matches(imgs_matrices, id_by_location, similarity, show_output, show_progress, fast_search):
    # Function that searches the images on duplicates/similarity matches
    progress_count = 0
    duplicate_count, similar_count = 0, 0
    total_count = len(imgs_matrices)
    exclude_from_search = []
    result = {}

    bp = 89  # EDIT: hard-coded break point (number of images in the smaller folder)
    for number_A, (id_A, matrix_A) in enumerate(imgs_matrices.items()):
        if number_A > bp:  # EDIT: stop once past the smaller folder
            break
        if show_progress:
            _help._show_progress(progress_count, total_count, task='comparing images')
        if id_A in exclude_from_search:
            progress_count += 1
        else:
            for number_B, (id_B, matrix_B) in enumerate(imgs_matrices.items()):
                if number_B > number_A and number_B > bp - 2:  # EDIT
                    rotations = 0
                    while rotations <= 3:
                        if rotations != 0:
                            matrix_B = _help._rotate_img(matrix_B)
                        try:
                            mse = _compute._mse(matrix_A, matrix_B)
                        except Exception:
                            # The original assigned to an unused name (MSE), leaving
                            # mse stale or undefined; treat a failure as a non-match.
                            rotations += 1
                            continue
                        if mse <= similarity:
                            match = {'location': str(Path(id_by_location[id_B])),
                                     'mse': mse}
                            check = False
                            # If id_A is already a match of an earlier image, file id_B there too
                            for key in result.keys():
                                if id_A in result[key]['matches']:
                                    result[key]['matches'][id_B] = dict(match)
                                    check = True
                            if not check:
                                if id_A not in result.keys():
                                    result[id_A] = {'location': str(Path(id_by_location[id_A])),
                                                    'matches': {id_B: match}}
                                else:
                                    result[id_A]['matches'][id_B] = match
                            if show_output:
                                _help._show_img_figs(matrix_A, matrix_B, mse)
                                _help._show_file_info(str(Path(id_by_location[id_A])), str(Path(id_by_location[id_B])))
                            if fast_search:
                                exclude_from_search.append(id_B)
                            rotations = 4  # match found: stop rotating
                        else:
                            rotations += 1
            progress_count += 1

    if similarity > 0:
        # With a similarity threshold, an MSE of 0 is a duplicate and anything else is similar
        for entry in result.values():
            for match in entry['matches'].values():
                if match['mse'] > 0:
                    similar_count += 1
                else:
                    duplicate_count += 1
    else:
        for entry in result.values():
            duplicate_count += len(entry['matches'])
    return result, exclude_from_search, total_count, duplicate_count, similar_count
```
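The hard-coded `bp` could in principle be derived from the folders themselves. The following is only a sketch of that idea, assuming the intent is that each image in the smaller folder is compared only with images in the larger folder; `order_with_break_point` and `cross_folder_pairs` are hypothetical helpers, not difPy functions, and the index handling is deliberately simpler than in the snippet above.

```python
def order_with_break_point(small_folder_ids, large_folder_ids):
    # Put the smaller folder first so that the break point is simply its length.
    ordered = list(small_folder_ids) + list(large_folder_ids)
    return ordered, len(small_folder_ids)

def cross_folder_pairs(ordered_ids, bp):
    # Yield only pairs that straddle the break point:
    # one image before it (smaller folder), one at or after it (larger folder).
    for id_a in ordered_ids[:bp]:
        for id_b in ordered_ids[bp:]:
            yield id_a, id_b

ordered, bp = order_with_break_point(["s1", "s2"], ["l1", "l2", "l3"])
print(bp)                                       # 2
print(len(list(cross_folder_pairs(ordered, bp))))  # 6
```

With this ordering the number of comparisons is the product of the two folder sizes, and neither folder is checked against itself.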

elisemercury added the feature : new New feature for difPy. label Aug 31, 2023

elisemercury mentioned this issue Feb 21, 2024

is it not possible to leverage difPy to match just a single image to a folder of images? #92

Closed

elisemercury added status : likely Feature will likely be implemented in difPy. and removed status : likely Feature will likely be implemented in difPy. labels Feb 21, 2024
