QDrant-NLP

Keeping the human in the loop. I am not a developer at QDrant, nor directly associated with them, but I think they've built something excellent that is, thus far, under-appreciated. This repo is here to act as a demo more than anything else.

https://github.com/qdrant/qdrant

I'll call it done once it's tidy and available on DockerHub: just enough that you could use it to run your own proof of concept (POC) without writing any additional code.

You can achieve almost half of this tool just via their Swagger UI, but that's designed for hitting simple APIs, not data-centric AI workflows, so it's missing a few useful components. This work is written up in more depth at https://medium.com/@george.pearse (vector databases part 2).

The toy logo is somewhere between a magnifying glass, for how the tooling lets you focus in on a specific data subset, and a classic Bayesian graph, in case I get carried away enough to try adding active learning.

Finding the documentation for Hugging Face sentence-transformers via Google Search drove me mad. It lives here: https://www.sbert.net/docs/hugging_face.html

Quick labelling with Hugging Face, Streamlit and QDrant. First I'll support NLP, then I'll think about adding image support (which is where this idea came from).

Features

Supports interactively creating and storing queries for the QDrant Vector Database for an NLP dataset.
For each query, show the positives, show the negatives, then display the results.
Maybe support active learning (eventually). This could be a two-part system: one part uses active learning to optimise the similarity search, the other to optimise downstream fine-tuning. Alternatively, one part surfaces the datapoint a nearest-neighbour approach is least certain about (this can be updated almost instantly), and the other corrects the model that generates the embeddings.
Download datasets directly from Hugging Face and convert them to embeddings.
Loading indicator while generating embeddings.
SQLite database to store the query results and the names of the queries, and maybe to run heuristics-based checks like those used on the MIMIC dataset.
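
The positives-and-negatives query from the list above can be sketched without a running QDrant instance. This is a minimal illustration rather than the repo's code: it assumes a tiny in-memory set of embeddings and uses one simple strategy (mean of the positives minus mean of the negatives, ranked by cosine similarity). The names `POINTS` and `recommend`, and the toy vectors, are hypothetical.

```python
import math

# Hypothetical toy embeddings keyed by point id. Real ones would come from a
# sentence-transformers model and live in the vector database.
POINTS = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
    4: [0.1, 0.9],
    5: [0.7, 0.7],
}


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def recommend(positive, negative, limit=3):
    """Rank unlabelled points against mean(positives) - mean(negatives)."""
    target = mean([POINTS[i] for i in positive])
    if negative:
        away = mean([POINTS[i] for i in negative])
        target = [t - a for t, a in zip(target, away)]
    labelled = set(positive) | set(negative)
    scored = [
        (pid, cosine(vec, target))
        for pid, vec in POINTS.items()
        if pid not in labelled
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]


# Point 1 is a positive example, point 3 a negative one.
results = recommend(positive=[1], negative=[3])
```

Each entry in `results` is an `(id, score)` pair, best match first, which is the shape you would then show in the interface for labelling.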

Improving Deployment Experience

One docker-compose file for streamlit, QDrant and FastAPI
Make the docker images available via DockerHub

See Kern.AI for a full-blown solution that uses QDrant behind the scenes. This tool is meant to be simple enough to act as an intro to vector databases: you can write and see the requests, just as you would via the Python API.

Similarly, koaning/bulk is excellent, but what if UMAP (or whichever dimensionality reduction technique you prefer) loses all of the nuance, and high-level visualizations fail to provide value for your dataset?

I also wanted to give FastAPI a tiny test run, so for each query (POST request) you save, you can retrieve its results by hitting the FastAPI endpoint with the name of the query.
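
The save-then-fetch-by-name flow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repo's implementation: the table layout, the function names (`open_store`, `save_query`, `get_results`) and the example query name are all hypothetical.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store. The table layout is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saved_queries ("
        "name TEXT PRIMARY KEY, request TEXT NOT NULL, results TEXT NOT NULL)"
    )
    return conn


def save_query(conn, name, request, results):
    """Store a request body and its results as JSON, replacing a same-named query."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO saved_queries (name, request, results) "
            "VALUES (?, ?, ?)",
            (name, json.dumps(request), json.dumps(results)),
        )


def get_results(conn, name):
    """Return the stored results for a query name, or None if it was never saved."""
    row = conn.execute(
        "SELECT results FROM saved_queries WHERE name = ?", (name,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store()
save_query(
    conn,
    "billing-complaints",  # hypothetical query name
    {"positive": [1], "negative": [3], "limit": 3},
    [{"id": 2, "score": 0.62}],
)
found = get_results(conn, "billing-complaints")
missing = get_results(conn, "no-such-query")
```

An endpoint handler could wrap `get_results` and respond with a 404 when it returns `None`.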

To apply these tools to a multi-modal dataset, you would only need to concatenate the embeddings for each component, and away you go with all the same techniques.
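
As a rough sketch of that concatenation: the helper names and the toy vectors below are hypothetical, and normalising each part before joining is an assumption of this example (so that one modality's larger raw magnitudes do not dominate the distance), not something the tool prescribes.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(*embeddings):
    """Concatenate per-modality embeddings into one vector.

    Each part is L2-normalised first (an assumption of this sketch).
    """
    combined = []
    for emb in embeddings:
        combined.extend(l2_normalise(emb))
    return combined


text_embedding = [3.0, 4.0]        # hypothetical 2-d text embedding
image_embedding = [0.0, 5.0, 0.0]  # hypothetical 3-d image embedding
point = combine(text_embedding, image_embedding)
```

The combined vector's length is the sum of the parts' lengths, so the collection storing it would need to be created with that dimension.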

NB: Other names

Consider calling this thing grouper if you take it more seriously and upgrade the components.
Or carve-n-serve (if people actually like the FastAPI component): carving up the data into small chunks.

It might make sense to apply a similarity cutoff instead of the nearest K.
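
To make the difference concrete, here is a small sketch contrasting the two selection rules. The scores and the function names are hypothetical. QDrant's search request also accepts a `score_threshold` field for this purpose, though check the current docs for its exact behaviour.

```python
def top_k(scored, k):
    """Return the k highest-scoring (id, score) pairs, however weak the match."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def above_cutoff(scored, cutoff):
    """Return every (id, score) pair scoring at least `cutoff`, best first."""
    kept = [item for item in scored if item[1] >= cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical cosine-similarity scores for one query.
scored = [("a", 0.91), ("b", 0.88), ("c", 0.35), ("d", 0.12)]

nearest = top_k(scored, k=3)                 # includes the weak match "c"
similar = above_cutoff(scored, cutoff=0.8)   # keeps only "a" and "b"
```

With nearest K the result set always has K items, even when some are poor matches. With a cutoff the set size varies, but every item clears the same similarity bar.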

To Run:

To get started, just run

docker-compose up

GeorgePearse/QDrant-NLP
