[Feature]: Free text image search using CLIP features #1269

Closed
TheStealthReporter opened this issue Jan 8, 2023 · 6 comments
Labels
feature · needs triage (Bug that needs triage from maintainer)

Comments

@TheStealthReporter

Feature detail

I've seen that work is currently being done in Immich to implement image search. If this search system is based on "fixed" tags/labels, it might be worth looking into CLIP embeddings. I tried the CLIP embedding approach on my photo collection, and it was vastly superior at retrieving images compared to any class-output-based neural network (e.g. one predicting the 1000 ImageNet classes) that I tried.

How it works

The idea behind the embeddings is that two different neural networks transform their input into a common "semantic" space, where related concepts end up positioned close together:

  • text -> CLIP embedding space
  • images -> CLIP embedding space

The CLIP embeddings for the photos can be precomputed once. The "text -> CLIP embedding" model only has to be run each time the user enters a search query. A standard nearest-neighbor search inside the CLIP space then retrieves the photos most related to the query.
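
As a minimal sketch (the image paths and the query string below are just placeholders; my full experiment script is further down in the comments), the whole pipeline boils down to something like this:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# the image and text encoders share the same embedding space
model = SentenceTransformer('clip-ViT-B-32')

# precompute once: every photo -> 512-dimensional CLIP embedding
photo_paths = ['photo1.jpg', 'photo2.jpg']
photo_embs = model.encode([Image.open(p) for p in photo_paths])

# at query time: text -> the same CLIP space, then nearest neighbors by cosine similarity
query_emb = model.encode(['a person wearing a hat next to a dog'])
scores = util.cos_sim(query_emb, photo_embs)[0]
best = scores.argsort(descending=True)[:5]  # indices of the (up to) 5 closest photos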

This idea has been discussed for PhotoPrism before. The code I used on my photo collection was derived from the example given there (alternatively, a minimal example is also provided here).

Advantage compared to class-based image search

The advantage of this approach is that you can also successfully search for more complicated queries like "three people" or "a person wearing a hat next to a dog". With queries like these, I was able to find almost any specific photo within the top five nearest-neighbor results in a database of 5000 images, usually with the first query that came to mind.

If you were already aware of this approach, feel free to close this issue (I haven't seen it discussed in this repo before, though) - I'm just hoping to spread awareness of it.

Platform

Server

@TheStealthReporter TheStealthReporter added the feature and needs triage labels Jan 8, 2023
@jrasm91
Contributor

jrasm91 commented Jan 8, 2023

So you use an existing model, encode each image (convert it to CLIP space) and save the result as a binary file, then for queries you encode the query, load the binary file, and do a nearest-neighbor search? Am I understanding that correctly?

Do you know how long it takes to encode ~5000 pictures? Or how big the binary file is in relation to the image count? This looks really interesting and potentially a better approach than the image classification we're doing now.

I assume we would index new files as they're uploaded. Is it possible to remove an image from the index as well?

@TheStealthReporter
Author

Yes, the pre-trained model clip-ViT-B-32 (if I remember correctly, ~600 MB) is what I've used in my experiments. Each embedding is a 512-dimensional vector. The "database" file for my 5000 photos has a final size of 11 MB.
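
That file size is also roughly what the embedding dimensions alone would predict (assuming the embeddings are stored as float32):

# back-of-the-envelope check of the "database" size:
# 5000 images x 512 dimensions x 4 bytes (float32) ≈ 10 MB,
# plus filenames and pickle overhead -> roughly the 11 MB observed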

Running the "image -> CLIP space" model on my Ryzen 7 5800X CPU (but only single-threaded) took roughly 30-45 minutes for the 5000 photos, so each image takes a bit less than 1 s.

I've tried to add multi-threading but haven't managed to get it working on my first attempts. I'm not sure how complicated it is to apply the model to multiple images concurrently in Python (without loading the model separately for each thread)...
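
One thing that might help here (untested, just going by the sentence-transformers documentation, and using the same variable names as in the script below) is to pass the whole list of images to model.encode with a batch_size instead of encoding them one at a time, so the library can batch the forward passes itself:

# hedged sketch: batched encoding instead of one encode() call per image
# (note: this loads all images into memory at once; chunking would be needed for big collections)
images = [ImageOps.exif_transpose(Image.open(p)) for p in img_names]
img_emb = model.encode(images, batch_size=32, convert_to_tensor=True, show_progress_bar=True)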

I'd advise using a spatial acceleration structure for the (approximate) nearest-neighbor search. For 5000 photos it's fine to iterate over all of them, but for larger image databases we'd probably want the (roughly) logarithmic query complexity such a structure provides. In a pull request for PhotoPrism, the qdrant database was proposed. I've also stumbled upon the FAISS library for that purpose. I don't know these libraries well; this is just what others have used. An investigation into which nearest-neighbor databases exist might be necessary.
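
For reference, a FAISS version of the nearest-neighbor lookup could look roughly like this (just a sketch, assuming the faiss-cpu package is installed and the embeddings from the script below; the index type and parameters would need tuning):

import faiss
import numpy as np

emb = np.asarray(img_emb, dtype='float32')
faiss.normalize_L2(emb)                   # with normalized vectors, inner product == cosine similarity
index = faiss.IndexFlatIP(emb.shape[1])   # exact search; an approximate index (e.g. HNSW) for large sets
index.add(emb)

query_emb = model.encode(['three people']).astype('float32')
faiss.normalize_L2(query_emb)
scores, ids = index.search(query_emb, 5)  # top-5 most similar photos

qdrant would provide much the same thing as a separate service, whereas FAISS runs in-process.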

Loading the model takes a few seconds, so the script below should not be used as-is but rather as a starting point for experimentation.

Here is my code, which also visualizes the results:

from sentence_transformers import SentenceTransformer, util
from PIL import Image, ImageOps
import glob
import torch
import pickle
import numpy as np

from matplotlib import pyplot as plt
import sys

yourimageglob = '/home/user/Pictures/Camera/*.jpg'


if __name__ == "__main__":
    #First, we load the respective CLIP model
    model = SentenceTransformer('clip-ViT-B-32')
    #model = SentenceTransformer('clip-ViT-L-14')

    use_precomputed_embeddings = True

    emb_file = 'pretrained_embeddings.pkl'

    try:
        with open(emb_file, 'rb') as fIn:
            img_names, img_emb = pickle.load(fIn)
    except (OSError, EOFError, pickle.UnpicklingError):
        # no (valid) cached embedding file yet -> compute the embeddings from scratch
        img_names = list(glob.glob(yourimageglob))
        img_names = img_names[0:5000]
        #print("Images:", len(img_names))

        def compute_embedding(i, img_name):
            global model
            print("analyze {}/{} {} ".format(i + 1, len(img_names), img_name))
            img = Image.open(img_name)
            img = ImageOps.exif_transpose(img)
            img_emb = model.encode(img, device='cpu')
            img.close()
            return img_emb

        img_emb = []
        for this_img_emb in map(compute_embedding, range(len(img_names)), img_names):
            img_emb.append(this_img_emb)
        # print(img_emb)
        img_emb = np.array(img_emb)
        img_emb = torch.tensor(img_emb)
        # img_emb = list()
        # for filepath in img_names:
        #     print('Analyzing {}'.format(filepath))
        #     this_img_emb = model.encode(Image.open(filepath))
        #     img_emb.append(this_img_emb)

        data = (img_names, img_emb)

        # cache the embeddings so future runs can skip the expensive encoding step
        with open(emb_file, 'wb') as fOut:
            pickle.dump(data, fOut)


    query = sys.argv[1]
    text_emb = model.encode([query])
    scores = list()
    for img_name, img_em in zip(img_names, img_emb):
        cos_score = util.cos_sim(img_em, text_emb).tolist()[0][0] * 100
        scores.append(cos_score)

    scored_imgs = list(zip(img_names, scores))
    scored_imgs.sort(key=lambda v: v[1], reverse=True)
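    # print from worst to best match so the strongest matches end up at the bottom of the terminal output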
    for img_name, score in reversed(scored_imgs):
        print("{:.2f} {}".format(score, img_name))

    print(text_emb.shape)

    # create figure
    fig = plt.figure(figsize=(10, 7))


    for i, (img_name, score) in enumerate(scored_imgs[:9]):
        # Adds a subplot at the 1st position
        fig.add_subplot(3, 3, i + 1)

        img = Image.open(img_name)
        img = ImageOps.exif_transpose(img)

        # showing image
        plt.imshow(img)
        plt.axis('off')
        plt.title("{:.2f}".format(score))

    plt.show()

@bo0tzz
Member

bo0tzz commented Jan 8, 2023

Very cool stuff! For the nearest-neighbour search, do you know whether it'd be possible to use Postgres for that somehow? That way we wouldn't need to add another stateful container :)

@TheStealthReporter
Author

I'm not familiar with Postgres, but as far as I can tell this isn't supported natively. A quick Google search for "postgresql high dimensional nearest neighbor search extension" yields the Postgres extension PASE (paper). How easy it would be to include compared to qdrant, I don't know.

@yowmamasita

@jrasm91
Contributor

jrasm91 commented Feb 7, 2023

It's funny - we were just talking about this internally 👍

@immich-app immich-app locked and limited conversation to collaborators Feb 8, 2023
@alextran1502 alextran1502 converted this issue into discussion #1613 Feb 8, 2023

