Enhancement - Optional parameter set for source folder / comparison folder mode #72

Open
Get-Ryan opened this issue Apr 16, 2023 · 3 comments
Labels
feature : new New feature for difPy. status : likely Feature will likely be implemented in difPy.

Comments

@Get-Ryan

My apologies if this is already possible, but I didn't find any method within the documentation at https://difpy.readthedocs.io/en/latest/usage.html.

Currently, the multiple folder method within difPy searches for duplicates amongst all the folders listed. However, once you have de-duplicated the images within a folder, if you then look to compare an additional folder against those that you have already de-duplicated, the de-duplicated folder's contents again get compared against themselves.

It would be nice if there was an optional parameter set where a source folder could be set that multiple comparison folders could check against. Basically, difPy would assume the image contents in the source folder are unique, and only need to be processed against duplicates within the provided comparison folders.

I believe this would help in larger image projects like mine. In my scenario, I've downloaded photos from my partner's and my phones and tablets multiple times over the years. As I started de-duplicating with difPy, I would move the de-duplicated files into a central project folder and then scan the next photo dump folder against the project folder. Since the project folder already contains only unique images, I don't need difPy to check those images against each other again, but I'm not seeing any existing method to do that. I think this would be a noticeable performance improvement, especially as image sets get larger and larger.
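To make the expected saving concrete, here is a minimal sketch of which pairs would still need comparing if the source folder is assumed to be duplicate-free. It is not difPy code: `candidate_pairs` and `pair_counts` are hypothetical helper names, and the folder sizes in the usage note are purely illustrative.

```python
from itertools import combinations, product


def candidate_pairs(source_ids, comparison_ids):
    """Yield only the pairs that still need checking.

    Assumption: every image in the source folder is already unique,
    so source-vs-source pairs are skipped entirely.
    """
    # Each source image against each comparison image.
    yield from product(source_ids, comparison_ids)
    # Comparison images against each other (they may contain duplicates).
    yield from combinations(comparison_ids, 2)


def pair_counts(n_source, n_comparison):
    """Return (pairs in a full search, pairs in the restricted search)."""
    total = n_source + n_comparison
    all_pairs = total * (total - 1) // 2
    needed = n_source * n_comparison + n_comparison * (n_comparison - 1) // 2
    return all_pairs, needed
```

With an illustrative 20,000 images in the source folder and 500 in a comparison folder, `pair_counts(20000, 500)` gives 210,114,750 pairs for a full search versus 10,124,750 for the restricted one.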

@dedefelrodrigues

I'd like this feature as well; it would let me make better use of difPy in my projects.

@UplandsDynamic

@elisemercury, just noticed this issue and had a quick look to see how it might be implemented.

I've not coded/tested this yet - just did a very quick code review and noted down the idea, so it may not work (and may well have idiotic mistakes!). But if you think it's a valid approach - and want to add this feature - let me know and I'll code/test/pull request.

  1. Add an input param to take a list of directories where files should only be checked against files located outwith directories in this list, and assign its value to a new dif class parameter, e.g., self.dupe_free_dirs

  2. Change the _search.exclude_from_search variable to a dif class parameter and pass that in as an arg to both the _search.matches and _compute.id_by_location methods.

  3. Amend line 339 to if (number_B > number_A) and id_B not in self.exclude_from_search.

  4. Use the existing directory for-loop in _compute.id_by_location, to check the file locations against directories stored in self.dupe_free_dirs. If found, add the file ID (once created) to self.exclude_from_search.
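The idea in the steps above can be sketched outside difPy roughly as follows. This is an untested-against-difPy illustration, not the library's code: `ids_in_dupe_free_dirs` and `pairs_to_compare` are hypothetical names, and `id_by_location` is assumed to be a plain dict mapping file IDs to file paths. One deliberate difference from step 3: the sketch skips a pair only when both files sit in a dupe-free directory, because filtering on `id_B` alone would make the outcome depend on the order in which files are enumerated.

```python
from pathlib import PurePath


def ids_in_dupe_free_dirs(id_by_location, dupe_free_dirs):
    """Collect the IDs of files located under any dupe-free directory."""
    roots = {PurePath(d) for d in dupe_free_dirs}
    return {
        file_id
        for file_id, location in id_by_location.items()
        # PurePath.parents lists every ancestor directory of the file.
        if any(root in PurePath(location).parents for root in roots)
    }


def pairs_to_compare(id_by_location, dupe_free_dirs):
    """Yield (id_A, id_B) pairs, skipping pairs that are both dupe-free."""
    trusted = ids_in_dupe_free_dirs(id_by_location, dupe_free_dirs)
    ids = list(id_by_location)
    for number_A, id_A in enumerate(ids):
        for id_B in ids[number_A + 1:]:
            if id_A in trusted and id_B in trusted:
                continue  # both files are assumed unique already
            yield id_A, id_B
```

Files in nested subfolders of a dupe-free directory are treated as dupe-free too, which may or may not be the desired behaviour.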

@elisemercury elisemercury added the feature : new New feature for difPy. label Aug 31, 2023
@jdoe1917

I had implemented a very rough way to do pairwise comparison between folders in difPy v3, but I don't have the knowledge to do it in v4. It only works for two folders, but it was sometimes useful when comparing a small number of files (500) against a much larger set (20,000) without paying the quadratic cost of comparing every file against every other. The break point (bp) between the folders is hard-coded here and is the number of images in the smaller folder.

```python
def _matches(imgs_matrices, id_by_location, similarity, show_output, show_progress, fast_search):
    # Function that searches the images on duplicates/similarity matches
    progress_count = 0
    duplicate_count, similar_count = 0, 0
    total_count = len(imgs_matrices)
    exclude_from_search = []
    result = {}

    bp = 89  # EDIT: hard-coded break point (number of images in the smaller folder)
    for number_A, (id_A, matrix_A) in enumerate(imgs_matrices.items()):
        if number_A > bp:  # EDIT: stop once past the smaller folder
            break
        if show_progress:
            _help._show_progress(progress_count, total_count, task='comparing images')
        if id_A in exclude_from_search:
            progress_count += 1
        else:
            for number_B, (id_B, matrix_B) in enumerate(imgs_matrices.items()):
                if number_B > number_A and number_B > bp - 2:  # EDIT: only compare against the larger folder
                    rotations = 0
                    while rotations <= 3:
                        if rotations != 0:
                            matrix_B = _help._rotate_img(matrix_B)
                        try:
                            mse = _compute._mse(matrix_A, matrix_B)
                        except Exception:
                            # fall back to 0 if the MSE cannot be computed
                            mse = 0
                        if mse <= similarity:
                            check = False
                            # attach id_B to an existing group that already contains id_A
                            for key in result.keys():
                                if id_A in result[key]['matches']:
                                    result[key]['matches'][id_B] = {'location': str(Path(id_by_location[id_B])),
                                                                    'mse': mse}
                                    check = True
                            if not check:
                                if id_A not in result.keys():
                                    result[id_A] = {'location': str(Path(id_by_location[id_A])),
                                                    'matches': {id_B: {'location': str(Path(id_by_location[id_B])),
                                                                       'mse': mse}}}
                                else:
                                    result[id_A]['matches'][id_B] = {'location': str(Path(id_by_location[id_B])),
                                                                     'mse': mse}
                            if show_output:
                                _help._show_img_figs(matrix_A, matrix_B, mse)
                                _help._show_file_info(str(Path(id_by_location[id_A])), str(Path(id_by_location[id_B])))
                            if fast_search:
                                exclude_from_search.append(id_B)
                            rotations = 4  # match found, stop rotating
                        else:
                            rotations += 1
            progress_count += 1

    if similarity > 0:
        # with a similarity threshold, mse == 0 is a duplicate, anything else is "similar"
        for match_group in result.values():
            for match in match_group['matches'].values():
                if match['mse'] > 0:
                    similar_count += 1
                else:
                    duplicate_count += 1
    else:
        for match_group in result.values():
            duplicate_count += len(match_group['matches'])
    return result, exclude_from_search, total_count, duplicate_count, similar_count
```
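As a possible follow-up to the hard-coded `bp`, the break point could be derived from the file locations instead. The sketch below is standalone and makes several assumptions: `order_with_breakpoint` and `cross_folder_pairs` are hypothetical names, `id_by_location` is assumed to be a dict of file ID to path, and IDs `0..bp-1` are taken to be the smaller folder. It is not a drop-in patch for the function above.

```python
from pathlib import PurePath


def order_with_breakpoint(id_by_location, small_folder):
    """Return (ordered_ids, bp) with the small folder's files first.

    bp is the number of files found under small_folder, so
    ordered_ids[:bp] is the small folder and ordered_ids[bp:] the rest.
    """
    root = PurePath(small_folder)
    in_small = [
        file_id
        for file_id, location in id_by_location.items()
        if root in PurePath(location).parents
    ]
    small_set = set(in_small)
    others = [file_id for file_id in id_by_location if file_id not in small_set]
    return in_small + others, len(in_small)


def cross_folder_pairs(ordered_ids, bp):
    """Yield only pairs that straddle the break point."""
    for id_A in ordered_ids[:bp]:
        for id_B in ordered_ids[bp:]:
            yield id_A, id_B
```

This yields `bp * (len(ordered_ids) - bp)` pairs and never compares two files from the same folder, so it only suits the case where both folders are already free of internal duplicates.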

@elisemercury elisemercury added status : likely Feature will likely be implemented in difPy. and removed status : likely Feature will likely be implemented in difPy. labels Feb 21, 2024