
Search: Find image content using CLIP #1287 #2005

Draft · wants to merge 4 commits into base: develop

Conversation

@tknobi commented Feb 3, 2022

This pull request adds the ability to search photos by image content.
The awesome CLIP model is used for that.
It implements the idea given in #1287.

Contributions:

  1. Add a Python-based CLIP API as a new Docker image.
  2. Add Qdrant, an existing vector database.
  3. Encode images while indexing and save their embeddings in the vector database.
  4. Encode the search query while searching and perform a k-nearest-neighbor search to find matching images (a rough sketch of steps 3-4 follows below).
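For illustration, here is a minimal, hypothetical sketch of how steps 3 and 4 could look, assuming the openai/CLIP package and the qdrant-client library; the collection name, point ID, and payload fields are made up and do not necessarily match the code in this PR:

```python
# Hypothetical sketch only: encode an image with CLIP, store its embedding in
# Qdrant, then answer a text query with a k-nearest-neighbor search.
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image
from qdrant_client import QdrantClient, models

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
client = QdrantClient(host="localhost", port=6333)

# Step 3: encode the image during indexing and upsert its embedding
# (the "photoprism" collection is assumed to exist already).
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_vec = model.encode_image(image)[0].cpu().tolist()
client.upsert(
    collection_name="photoprism",
    points=[models.PointStruct(id=1, vector=image_vec, payload={"file": "photo.jpg"})],
)

# Step 4: encode the search query and find the k nearest images.
with torch.no_grad():
    text_vec = model.encode_text(clip.tokenize(["a dog on the beach"]).to(device))[0]
hits = client.search(
    collection_name="photoprism",
    query_vector=text_vec.cpu().tolist(),
    limit=10,
)
```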

I know it's not a small change, so please take as much time as needed. I am in no way upset, even if it takes months (or years? :D)

Acceptance Criteria:

  • Features and improvements are fully implemented so that they can be released at any time without additional work
  • Automated unit and/or acceptance tests have been added to ensure the changes work as expected and to reduce repetitive manual work
  • User interface changes are fully responsive and have been tested on all major browsers and various devices
  • Database-related changes are compatible with SQLite and MariaDB
  • Translations have been / will be updated (specify if needed)
  • Documentation has been / will be updated (specify if needed)
  • Contributor License Agreement (CLA) has been signed

@tknobi (Author) commented Feb 3, 2022

Right now it's still a WIP; I'll ping you when it's ready for review :) Points 1-3 are more or less implemented (I think refactoring is still needed), and I'm struggling with point 4 at the moment.

@tknobi changed the title from "Image content search using CLIP #1287" to "WIP: Image content search using CLIP #1287" on Feb 3, 2022
@tknobi marked this pull request as a draft on February 3, 2022, 20:30
@lastzero (Member) commented Feb 3, 2022

Would it be possible to use the model directly with Go instead of Python? PhotoPrism already has a huge list of dependencies. We're worried that adding more will delay the release of standalone packages (without Docker) and/or break the FreeBSD port.

@tknobi (Author) commented Feb 3, 2022

You're right, of course. I started that way because I knew it would work and wanted to see a breakthrough first.

It's a PyTorch model, not TensorFlow, so the following options could be tried:

  1. Use a Go Torch binding for it. There are two repos doing that: wangkuiyi/gotorch and sugarme/gotch. Unfortunately, neither looks complete, and the model would have to be re-implemented there. Whether the weights could then be loaded, I don't know.
  2. Convert the Torch model to an ONNX model, as mentioned here and as someone did for CLIP here. We could then try to load that model with gorgonia/gorgonia (a rough export sketch follows at the end of this comment).

Whether this would succeed is questionable. Is this something that precludes a merge? The feature can of course be disabled, just like image classification, and it could be disabled by default for non-Docker users.
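To make option 2 a bit more concrete, here is a rough, untested sketch of what exporting CLIP's image encoder to ONNX could look like; the file name, opset version, and the idea of exporting only the visual tower are my assumptions, not necessarily how the linked converter does it:

```python
# Hedged sketch of option 2: export only CLIP's image tower to ONNX.
# The text tower would need a separate export; names and opset are illustrative.
import clip
import torch

model, _ = clip.load("ViT-B/32", device="cpu")
model.eval()

dummy_image = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model.visual.float(),   # vision transformer only
    dummy_image,
    "clip_visual.onnx",
    input_names=["image"],
    output_names=["embedding"],
    opset_version=14,
)
```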

@CLAassistant commented Mar 1, 2022

CLA assistant check
All committers have signed the CLA.

@tknobi (Author) commented May 10, 2022

Hello @lastzero,

after not having had much time recently, I would like to continue trying to integrate CLIP into PhotoPrism.
I have made some effort to integrate CLIP in Go directly, unfortunately without success :( In addition to my two approaches described above, I have also tried, unsuccessfully, to turn CLIP into a TensorFlow model. (I am not alone in this: openai/CLIP#236)

Here, by the way, is a video from reddit user Playgroundai that very impressively shows the potential of the CLIP model for a photo gallery.

What I would like to know now is: should I continue with the external Python container approach, or will this never get merged? I am waiting for a commitment before doing further integration work.

@taha-yassine

I think this is an excellent initiative, as it would take PhotoPrism's capabilities to the next level. However, IMHO, trying to reimplement CLIP in Go could be a daunting task. The most straightforward way, in my view, would be to run CLIP in a separate service (e.g. a dedicated Docker container) and have it communicate with the main PhotoPrism instance through an API, as suggested above. This approach is certainly the fastest, as the CLIP part is straightforward to get up and running, and it could serve as a PoC before further investigating a reimplementation.

Another way to look at the problem would be to completely separate PhotoPrism from the ML heavy lifting (image classification, face recognition, etc.) and to have pluggable neural modules that enhance PhotoPrism's capabilities. The advantage of this approach is that it would give users total control over which backend they use for the different tasks and open the door to experimentation with new custom modules. This would certainly involve restructuring a big part of the project, but moving forward, I think it is something to consider.

@hoeflechner

Hi, I came across this commit, and I am very interested in using CLIP with PhotoPrism! What is the status of this project?

@thomasverelst commented Nov 4, 2022

With Python (and PyTorch) being by far the most popular language/framework for image processing, computer vision, and deep learning, it would be great to see support for extending PhotoPrism with Python/PyTorch code (if necessary through some API, as suggested by @taha-yassine). I was looking to add semantic search with CLIP (already done in this PR, thanks!), but there is so much other potential for smarter processing, e.g. automatic selection of photos for Moments/Memories, automatic archival of blurry photos, or classifying by photo type (screenshots, documents, receipts, selfies). While it's probably possible to port algorithms to Go, it's tedious and, I believe, has more drawbacks than advantages (e.g. losing the easy GPU acceleration, quantization, and other optimizations that come with PyTorch).
Anyway, I think CLIP-like search using a (pretrained) model trained on billions of images is really the way to go for improving the search functionality, as ImageNet-style classification with predefined classes will always be limited, no matter how well the model is trained.

@tknobi (Author) commented Nov 4, 2022

Thanks for your messages, @thomasverelst, @taha-yassine and @hoeflechner. I agree with you on every single point. There are still so many great deep learning approaches that could make PhotoPrism even better 🙂

However, I still need a commitment from @lastzero on whether the container/Python approach has a chance of getting merged. Before that, further development is not worthwhile (IMHO).

@lastzero (Member) commented Nov 4, 2022

Generally yes. Whether we can merge and ship it to everyone will depend on the quality, stability, and documentation. Be aware that our development resources are limited, so there's little chance I will have time before next year, as we first need to finalize multi-user and Stripe support.

@ghost commented Nov 9, 2022

AI is a fast-advancing field.
It would be nice to also allow API integration for newer facial recognition methods.

Different PyTorch extensions (for example with or without GPU support) could also be put in separate Docker images, so people can choose what they want to use with PhotoPrism. @tknobi

@johnzielke

First of all, amazing project! Now here is my opinion after trying to work on some of the AI features of Photoprism (see below).

IMHO, trying to integrate everything in Go would seriously reduce the chance of other people contributing integrations for the amazing AI models/ML algorithms that are publicly available (or could even be trained for this project in the future). And for an AI-powered photos app, this is definitely one of the important selling points.

Apart from the fact that there are very limited bindings for ML frameworks in Go, the maturity of the existing ones is not there yet in my opinion.
In addition, doing everything in Go requires developers with experience in deep learning/ML to also be familiar with Go. And since, as you mentioned, development resources are very limited, that makes easy integration even less likely.

I would prefer the approach with an additional Docker container. Python is a very mature language with support for many platforms, so running it in Docker on different platforms should not be an issue. As for how to integrate the ML models, I would suggest using ONNX, which is developed/maintained by Microsoft, at least for the additional "AI/ML" container. Many available models can easily be converted to ONNX, and ONNX supports many platforms (ARM, x86, x64) and devices (NVIDIA & AMD GPUs, Apple CoreML). Even when just running on the CPU, it can provide significant inference performance benefits compared to TensorFlow or PyTorch, and new versions are kept backward-compatible.
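As a small illustration of that point (the model file, input name, and provider list are assumptions, not an existing PhotoPrism artifact), running an exported CLIP image encoder with ONNX Runtime boils down to something like:

```python
# Illustrative only: run an exported CLIP image encoder with ONNX Runtime,
# preferring the GPU and falling back to the CPU if no CUDA device is available.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "clip_visual.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a preprocessed photo
(embedding,) = session.run(None, {"image": image})
print(embedding.shape)
```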

I used this approach to build a rudimentary proof-of-concept face recognition version using RetinaFace for much-improved face detection, converted the existing FaceNet model to ONNX, and ran it on the GPU, providing massive speedups while freeing up the CPU for other tasks. I also used the existing battle- and unit-tested DBSCAN implementation in SciPy to run the cluster search significantly faster (as an example of how one can benefit from a language with a mature ML ecosystem).

As more and more of these models come up, the way AI is applied at the moment should probably be refactored. Since I imagine a lot of people run PhotoPrism on a small home server (RPi, mini PC) or similar, running advanced AI models there might be difficult.
Ideally, we would therefore separate the compute-intensive AI processing from the simple ingestion and WebUI/API. With the right design, this would allow running the AI asynchronously from normal ingestion, and possibly even on a different device: say you also have a gaming computer or even just a recent laptop, you could use it to run the AI whenever it is turned on, letting you run a cheap and energy-efficient server while still being able to apply larger AI models.

Another advantage of a more generic API for these AI models would be the potential to quickly attach other models to the system for evaluation or "community-maintained" plugins. The models that provide good results and/or are popular could then be converted to the "supported" model format and integrated into the built-in ML container.

I would be happy to contribute to this in my free time, but I think an agreement on the architecture/implementation of such an approach should be achieved first, as changing that later on will require a lot of effort. Possibly we could also join forces, with developers more experienced in Go contributing to that side and python/DL developers (like me) focusing on the other side.

@tknobi (Author) commented Nov 20, 2022

Hey @johnzielke ,

Thanks for your detailed post! The current design relies on the container approach, but I tried to do everything that is feasible in Go, so the Python side does nothing more than convert an image or text into an embedding.

Since it's a proof-of-concept so far, I haven't made any efforts towards a "generic python ML container" yet.

I would suggest using ONNX developed/maintained by Microsoft

Here you are right. As mentioned above, Lednik7 has already created an ONNX model of CLIP and achieved a 3x speedup with it. I wanted to keep it simple at the beginning and haven't made any optimizations yet. If you want to contribute here, I would be thrilled 🙂

Another optimization on the Python side would be support for more languages, as proposed with Multilingual-CLIP.

@johnzielke

Thanks @tknobi, from your PR I can see you put quite some work into this as well! While looking through the PR and googling for CLIP models in general, I came across Jina's CLIP-as-service. I have never used it and would love to hear opinions on it, but maybe it could be leveraged as a more generic system. It seems to not only support CLIP but also provide an easy way to integrate other models, and it already comes with a Go client, gRPC APIs, and protobuf support with compression.

I am not sure whether it makes more sense to first include one AI implementation in the first iteration of an external system, or to first overhaul the AI system in Go and then start implementing the AI services. Maybe this is outside the scope of this PR and should be discussed somewhere else, but I could imagine something like this:

  1. A new image is added. The normal indexing is done (thumbnails, Exif, etc.).
  2. A worker periodically checks for new images. It also checks for images that were indexed with old versions of an AI plugin (or new AIs) and re-evaluates those.
  3. The different AIs run over each image, add metadata to it (such as tags, NSFW labels, faces, or OCRed text), and possibly index additional data in their own data stores (i.e. embeddings or similar, see below).
  4. The added metadata is then transferred to the PhotoPrism database.

An open question here is how AIs that provide search functionality should be included. Either their data is transferred to the PhotoPrism database and all the queries and ranking are implemented there, or the ranking and embedding are implemented in the service, and the service is simply called as part of the request, keeping everything specific to a service out of the Go implementation. While this might work for a lot of cases, other features will still require implementation on the front end and in other parts of the central PhotoPrism server.
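To make step 3 a bit more tangible, here is a purely hypothetical sketch of what such a stateless "AI plugin" endpoint could look like; the framework, route, and response fields are all invented for illustration:

```python
# Hypothetical plugin service for step 3: the Go worker would POST an image and
# receive metadata plus an embedding back. All names here are illustrative.
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel

app = FastAPI()


class AnalyzeResult(BaseModel):
    labels: list[str]
    nsfw: bool
    embedding: list[float]


@app.post("/v1/analyze", response_model=AnalyzeResult)
async def analyze(image: UploadFile = File(...)) -> AnalyzeResult:
    data = await image.read()
    # ... run the wrapped model(s) on `data` here ...
    return AnalyzeResult(labels=["beach", "dog"], nsfw=False, embedding=[0.0] * 512)
```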

@ghost commented Nov 30, 2022

the ranking and embedding are implemented in the service, and the service is simply called as part of the request, keeping everything specific to a service out of the Go implementation.

Maybe this is better, as it allows people to choose the algorithm or even tweak it for their needs, while leaving the rather stable components in Go (which is compiled and distributed as a binary).

@lastzero (Member)

I'd still like to have that, but my concerns about simplicity are the same - that is, by default, users shouldn't have to run more than one or two (with a database) Docker containers. Otherwise, we'd end up with a microservice architecture that would be good for commercial hosting, but not for home users who need to understand it. On the other hand, a native Go solution might not be feasible either. Let us know if you would like us to continue working on this!

@lastzero added the labels waiting (Impediment / blocked / waiting), work-in-progress (Please don't merge just yet) and ai (AI Features, Machine Learning, Clustering and Models) on Jun 27, 2023
@johnzielke

Hi, I'd still be interested in implementing features in this space, maybe not exclusive to the CLIP embedding mentioned here but something similar.
May I ask the specific reason for limiting yourself to one or two containers? While I agree that there should not be a hundred different services all being executed, I think packaging the AI parts in a separate container is a reasonable approach, in order not to have to mix different sets of dependencies and maybe even to be able to supply builds optimized for specific hardware in the future (such as ARM, Intel, NVIDIA GPUs, etc.). Another big difference, in my opinion, is whether the additional containers would be stateful or stateless. I think extra stateless containers are not that big of a problem, whereas stateful containers would make the system a lot more complicated.
While we might be able to switch out the current models for better versions using something like https://pkg.go.dev/github.com/yalue/onnxruntime#section-readme, I am not sure this will make new types of models and features easy. In my opinion, there is a lot of development going on in the tooling around different AI models, and it is mostly happening in Python. We can try to reimplement each of those features, but that will probably make development quite slow and error-prone when implementing new algorithms. Python has a lot of tooling and docs around AI, accelerators, bindings to highly optimized libraries, etc., but is bad at parallelism and type safety. Go is not that advanced in the AI/algorithms department but is great for developing safe and fast web servers. So why not combine the two?

Since the search algorithms in particular use vector embeddings and similarity search, we would also need a way to query that information quickly. But maybe we could get away with something like https://github.com/pgvector/pgvector for this instead of having to run an additional database.
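As a sketch of the pgvector idea (the table and column names and the embedding size are invented, and this assumes a PostgreSQL instance, which PhotoPrism does not currently require):

```python
# Illustrative pgvector usage with psycopg 3: store CLIP embeddings in a table
# and find the nearest photos to a query embedding via cosine distance (<=>).
import psycopg

query_embedding = [0.0] * 512  # would come from the CLIP text encoder

with psycopg.connect("dbname=photoprism") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS photo_embeddings ("
        "photo_uid text PRIMARY KEY, embedding vector(512))"
    )
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    rows = conn.execute(
        "SELECT photo_uid FROM photo_embeddings "
        "ORDER BY embedding <=> %s::vector LIMIT 10",
        (vec_literal,),
    ).fetchall()
```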

@lastzero (Member)

@johnzielke Thank you very much for your remarks! This is mainly because we focus on self-hosting and know from experience that many of our users have difficulty dealing with the simple architecture we currently use and we don't have the capacity to help them or troubleshoot when something doesn't work. That's why we try to limit the number of moving parts, at least for a basic installation. Unfortunately, thinking about a plugin system or an extension API to make this optional means additional work. Another issue is security, because if we include Python, that also means including a lot of (potentially insecure) dependencies that we wouldn't need otherwise. On top of that, I have very little experience with Python packages, so I don't know how to review them, if that's even possible.

@johnzielke

@lastzero For my understanding: What are the ways users typically have problems with the architecture and how would this be made more complicated by a well-named and documented additional container?

And on the second part: of course you are right that this would introduce another language that you would have to learn and understand, as well as be able to review, but as far as languages go, Python should be one of the easier ones in that regard. Regarding security, that is of course a very valid point, but I would say that the attack surface there is controllable because:

  1. The actual code for running AI models uses relatively few dependencies
  2. The dependencies would mainly be libraries maintained by companies like Microsoft, Google, Facebook or bigger well-established communities.
  3. The attack surface could be kept as small as possible, as all calls to/from the Python service would only go through the Go web server; an attacker could therefore not submit anything to the Python service directly.

Regarding the plugin system, that would probably mean some work, but other people (including me) could help with that if it is planned out properly and there is enough commitment. The plugin system should of course be designed as an advanced feature, without a UI or anything, and instead be something manually configured for anything but the default case. And in my experience, features like this also help code maintainability in the long run, since they enforce a good separation of concerns and make loosely coupled features easy to switch out or change.

I appreciate you looking into this and giving it some thought, and the concerns you raise are all valid. I am wondering whether you have any ideas yourself on how these problems could be reasonably handled or mitigated to acceptable levels? If you think this is not possible at all, there is little point in discussing it here, since you as the maintainer and owner ultimately have to agree with whatever could be done. Just making sure we don't waste our time on this; I'm happy to continue the discussion in search of a solution.

@lastzero (Member)

@lastzero For my understanding: What are the ways users typically have problems with the architecture and how would this be made more complicated by a well-named and documented additional container?

To get an idea of what can go wrong - often due to misconfiguration and a misunderstanding of the technology - see our troubleshooting checklists:

https://docs.photoprism.app/getting-started/troubleshooting/

The problems start with YAML, which some users don't know and then remove indentation. That's the experience level we're talking about.

Often users also don't use the examples we provide, but hack their own Docker network configuration, which then leads to problems connecting to the database (since that's the only other service we require). If there are more services, that probably leads to more connection issues. Sometimes it's due to a typo in the network name they assign to each service.

I'm also not 100% sure if everyone knows how to update additional containers e.g. to get the latest versions or when there's a security update needed.

And on the second part: of course you are right that this would introduce another language that you would have to learn and understand, as well as be able to review, but as far as languages go, Python should be one of the easier ones in that regard.

I can surely read Python code or even write a few lines if needed, like when I created our TensorFlow model export. But I don't have enough experience for a professional security audit.

  1. The actual code for running AI models uses relatively few dependencies
  2. The dependencies would mainly be libraries maintained by companies like Microsoft, Google, Facebook or bigger well-established communities.

That's a plus.

  3. The attack surface could be kept as small as possible, as all calls to/from the Python service would only go through the Go web server; an attacker could therefore not submit anything to the Python service directly.

My understanding is that an attacker would either inject malicious code through a pip dependency or attempt to gain console access to the container (e.g., from another container, the host or another server with access to the network).

With access to the container shell, pip is then used to load platform-independent malicious code. Platform independence is probably what makes Python so popular for this purpose. Apple is currently struggling with this as well, since Python is pre-installed on macOS.

Regarding the plugin system, that would probably mean some work, but other people (including me) could help with that if it is planned out properly and there is enough commitment.

From my side, commitment and willingness are definitely there. However, it needs to be well thought out and communicated, which will be difficult for me given our usual workload with maintaining this project. That means I won't be able to push this forward constantly, but only when I can afford to set aside extra time for it.

For the start, it would be good to gather suggestions and then move from there. It might also be worth looking at how other software handles this.

Note that we will be on vacation for the first two weeks of July, so responding to questions may take longer than usual.

@johnzielke commented Jun 29, 2023

The problems start with YAML, which some users don't know and then remove indentation. That's the experience level we're talking about.

Often users also don't use the examples we provide, but hack their own Docker network configuration, which then leads to problems connecting to the database (since that's the only other service we require). If there are more services, that probably leads to more connection issues. Sometimes it's due to a typo in the network name they assign to each service.

Thank you. Of course, misconfigurations will always be an issue and it's understandable to keep things simple, but while there would be 50% more opportunity for typos, I think the complexity would not increase that much for the user. You obviously have much more experience with this, though.

I'm also not 100% sure if everyone knows how to update additional containers e.g. to get the latest versions or when there's a security update needed.

Updates seem to be covered pretty well by referring to and including Watchtower in your docs and config, and the update procedure would be the same as for all the existing containers, so it would be covered automatically. Btw: do you collect any statistics on which versions users are running? I could not find anything about that with a quick glance at your privacy policy.

Again, for my understanding: are you doing special security audits for the current Go server?

My understanding is that an attacker would either inject malicious code through a pip dependency or attempt to gain console access to the container (e.g., from another container, the host or another server with access to the network).

With access to the container shell, pip is then used to load platform-independent malicious code. Platform independence is probably what makes Python so popular for this purpose. Apple is currently struggling with this as well, since Python is pre-installed on macOS.

By locking down the containers as much as possible, limiting the permissions of the process running inside a container, and removing any unnecessary programs such as the shell, the attack surface exposed through vulnerabilities can be limited. A vulnerability in a pip dependency would have to be introduced at build time, since there is no need to install additional packages at runtime, so this would work the same way it currently does with the Go dependencies. To mitigate this, a security scan using any of the available services (e.g. Snyk, without ever having used that particular one) could be implemented.

From my side, commitment and willingness are definitely there. However, it needs to be well thought out and communicated, which will be difficult for me given our usual workload with maintaining this project. That means I won't be able to push this forward constantly, but only when I can afford to set aside extra time for it.

That's great to hear. I wonder if @tknobi is still interested in doing something in this field. It's understandable that there are many other things to do.

For the start, it would be good to gather suggestions and then move from there. It might also be worth looking at how other software handles this.

Sounds like a good idea. What type of software are you referring to? Other self-hosted image services?

Note that we will be on vacation for the first two weeks of July, so responding to questions may take longer than usual.

Enjoy your vacation!

@lastzero (Member)

Sounds like a good idea. What type of software are you referring to? Other self-hosted image services?

Other popular open source software in general and similar apps in particular (if available). Thank you! Need to hurry to get ready for our vacation.

@lastzero changed the title from "WIP: Image content search using CLIP #1287" to "Search: Find image content using CLIP #1287" on Jul 23, 2023