[Feature]: Free text image search using CLIP features #1269

Closed
TheStealthReporter opened this issue Jan 8, 2023 · 6 comments
Labels
feature · needs triage (Bug that needs triage from maintainer)

Comments

@TheStealthReporter

Feature detail

I've seen that work is currently being done in Immich to implement image search. If this search system is based on "fixed" tags/labels, it might be worth looking into CLIP embeddings. I tried the CLIP embedding approach on my photo collection, and it was vastly superior at retrieving images compared to any class-output-based neural network (e.g. one predicting the 1000 ImageNet classes) that I tried.

How it works

The idea behind the embeddings is that two different neural networks transform their input into a common "semantic" space, where related concepts end up positioned close together:

  • text -> CLIP embedding space
  • images -> CLIP embedding space

The CLIP embeddings for the photos can be precomputed once. The "text -> CLIP embedding" model only has to be run each time the user enters a search query. A standard nearest-neighbor search inside the CLIP space then retrieves the photos most related to the query.
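
As a minimal sketch (the image paths and the query string below are just placeholders; my full experiment script is further down in the comments), the whole pipeline boils down to something like this:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# the image and text encoders share the same embedding space
model = SentenceTransformer('clip-ViT-B-32')

# precompute once: every photo -> 512-dimensional CLIP embedding
photo_paths = ['photo1.jpg', 'photo2.jpg']
photo_embs = model.encode([Image.open(p) for p in photo_paths])

# at query time: text -> the same CLIP space, then nearest neighbors by cosine similarity
query_emb = model.encode(['a person wearing a hat next to a dog'])
scores = util.cos_sim(query_emb, photo_embs)[0]
best = scores.argsort(descending=True)[:5]  # indices of the (up to) 5 closest photos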

This idea has been discussed for PhotoPrism before. The code I used on my photo collection was derived from the example given there (alternatively, a minimal example is also provided here).

Advantage compared to class-based image search

The advantage of this approach is that you can also successfully search for more complicated queries like "three people" or "a person wearing a hat next to a dog". With queries like these, I was able to find almost any specific photo within the top five nearest-neighbor results in a database of 5000 images, usually with the first query that came to mind.

If you were already aware of this approach, feel free to close this issue (I haven't seen it discussed in this repo before, though) - I'm just hoping to spread awareness of it.

Platform

Server

@TheStealthReporter TheStealthReporter added the feature and needs triage labels Jan 8, 2023
@jrasm91
Contributor

jrasm91 commented Jan 8, 2023

So you use an existing model, encode each image (convert it to CLIP space) and save the result as a binary file, then for queries you encode the query, load the binary file, and do a nearest-neighbor search? Am I understanding that correctly?

Do you know how long it takes to encode ~5000 pictures? Or how big the binary file is in relation to the image count? This looks really interesting and potentially a better approach than the image classification we're doing now.

I assume we would index new files as they're uploaded. Is it possible to remove an image from the index as well?

@TheStealthReporter
Author

Yes, the pre-trained model clip-ViT-B-32 (if I remember correctly, ~600 MB) is what I've used in my experiments. Each embedding is a 512-dimensional vector. The "database" file for my 5000 photos has a final size of 11 MB.
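
That file size is also roughly what the embedding dimensions alone would predict (assuming the embeddings are stored as float32):

# back-of-the-envelope check of the "database" size:
# 5000 images x 512 dimensions x 4 bytes (float32) ≈ 10 MB,
# plus filenames and pickle overhead -> roughly the 11 MB observed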

Running the "image -> CLIP space" model on my Ryzen 7 5800X CPU (but only single-threaded) took roughly 30-45 minutes for the 5000 photos, so each image takes a bit less than 1 s.

I've tried to add multi-threading but haven't managed to get it working on my first attempts. I'm not sure how complicated it is to apply the model to multiple images concurrently in Python (without loading the model separately for each thread)...
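
One thing that might help here (untested, just going by the sentence-transformers documentation, and using the same variable names as in the script below) is to pass the whole list of images to model.encode with a batch_size instead of encoding them one at a time, so the library can batch the forward passes itself:

# hedged sketch: batched encoding instead of one encode() call per image
# (note: this loads all images into memory at once; chunking would be needed for big collections)
images = [ImageOps.exif_transpose(Image.open(p)) for p in img_names]
img_emb = model.encode(images, batch_size=32, convert_to_tensor=True, show_progress_bar=True)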

I'd advise using a spatial acceleration structure for the (approximate) nearest-neighbor search. For 5000 photos it's fine to iterate over all of them, but for larger image databases we'd probably want the (roughly) logarithmic query complexity such a structure provides. In a pull request for PhotoPrism, the qdrant database was proposed. I've also stumbled upon the FAISS library for that purpose. I don't know these libraries well; this is just what others have used. An investigation into which nearest-neighbor databases exist might be necessary.
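
For reference, a FAISS version of the nearest-neighbor lookup could look roughly like this (just a sketch, assuming the faiss-cpu package is installed and the embeddings from the script below; the index type and parameters would need tuning):

import faiss
import numpy as np

emb = np.asarray(img_emb, dtype='float32')
faiss.normalize_L2(emb)                   # with normalized vectors, inner product == cosine similarity
index = faiss.IndexFlatIP(emb.shape[1])   # exact search; an approximate index (e.g. HNSW) for large sets
index.add(emb)

query_emb = model.encode(['three people']).astype('float32')
faiss.normalize_L2(query_emb)
scores, ids = index.search(query_emb, 5)  # top-5 most similar photos

qdrant would provide much the same thing as a separate service, whereas FAISS runs in-process.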

Loading the model takes a few seconds, so the script below should not be used as-is but rather as a starting point for experimentation.

Here is my code, which also visualizes the results:

from sentence_transformers import SentenceTransformer, util
from PIL import Image, ImageOps
import glob
import torch
import pickle
import numpy as np

from matplotlib import pyplot as plt
import sys

yourimageglob = '/home/user/Pictures/Camera/*.jpg'


if __name__ == "__main__":
    #First, we load the respective CLIP model
    model = SentenceTransformer('clip-ViT-B-32')
    #model = SentenceTransformer('clip-ViT-L-14')

    use_precomputed_embeddings = True

    emb_file = 'pretrained_embeddings.pkl'

    try:
        with open(emb_file, 'rb') as fIn:
            img_names, img_emb = pickle.load(fIn)
    except (OSError, EOFError, pickle.UnpicklingError):
        # no (valid) cached embedding file yet -> compute the embeddings from scratch
        img_names = list(glob.glob(yourimageglob))
        img_names = img_names[0:5000]
        #print("Images:", len(img_names))

        def compute_embedding(i, img_name):
            global model
            print("analyze {}/{} {} ".format(i + 1, len(img_names), img_name))
            img = Image.open(img_name)
            img = ImageOps.exif_transpose(img)
            img_emb = model.encode(img, device='cpu')
            img.close()
            return img_emb

        img_emb = []
        for this_img_emb in map(compute_embedding, range(len(img_names)), img_names):
            img_emb.append(this_img_emb)
        # print(img_emb)
        img_emb = np.array(img_emb)
        img_emb = torch.tensor(img_emb)
        # img_emb = list()
        # for filepath in img_names:
        #     print('Analyzing {}'.format(filepath))
        #     this_img_emb = model.encode(Image.open(filepath))
        #     img_emb.append(this_img_emb)

        data = (img_names, img_emb)

        # cache the embeddings so future runs can skip the expensive encoding step
        with open(emb_file, 'wb') as fOut:
            pickle.dump(data, fOut)


    query = sys.argv[1]
    text_emb = model.encode([query])
    scores = list()
    for img_name, img_em in zip(img_names, img_emb):
        cos_score = util.cos_sim(img_em, text_emb).tolist()[0][0] * 100
        scores.append(cos_score)

    scored_imgs = list(zip(img_names, scores))
    scored_imgs.sort(key=lambda v: v[1], reverse=True)
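    # print from worst to best match so the strongest matches end up at the bottom of the terminal output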
    for img_name, score in reversed(scored_imgs):
        print("{:.2f} {}".format(score, img_name))

    print(text_emb.shape)

    # create figure
    fig = plt.figure(figsize=(10, 7))


    for i, (img_name, score) in enumerate(scored_imgs[:9]):
        # Adds a subplot at the 1st position
        fig.add_subplot(3, 3, i + 1)

        img = Image.open(img_name)
        img = ImageOps.exif_transpose(img)

        # showing image
        plt.imshow(img)
        plt.axis('off')
        plt.title("{:.2f}".format(score))

    plt.show()

@bo0tzz
Member

bo0tzz commented Jan 8, 2023

Very cool stuff! For the nearest-neighbour search, do you know whether it'd be possible to use Postgres for that somehow? That way we wouldn't need to add another stateful container :)

@TheStealthReporter
Author

I'm not familiar with Postgres, but as far as I can tell this isn't supported natively. A quick Google search for "postgresql high dimensional nearest neighbor search extension" yields the Postgres extension PASE (paper). How easy it would be to include compared to qdrant, I don't know.

@yowmamasita

@jrasm91
Contributor

jrasm91 commented Feb 7, 2023

It's funny - we were just talking about this internally 👍

@immich-app immich-app locked and limited conversation to collaborators Feb 8, 2023
@alextran1502 alextran1502 converted this issue into discussion #1613 Feb 8, 2023

