I think there are a few issues here to unpack.

### Current State
Because of 1) and 2), the lowest-latency way to get predictions that use all available compute is often a single model in a single connection, backed by a query queue in PgCat. It's better to queue outside of Postgres than inside it, and fewer concurrent connections also lowers lock contention while still allowing large-scale resource utilization. This is also good because the query cache stays warm across many clients without having to re-init the model every time a client connects, similar to having a background worker hold the model. The drawback is that you need to set up PgCat or some other proxy/pooler, but that's pretty common in production environments at scale these days, so I think this is an acceptable state of affairs, although we could definitely use better documentation on this front.

### Next Steps

I would prioritize work on removing Python from the inference path through one of three options so we can get around the GIL. This is part of the plan for PostgresML 3.0, although it's not currently under active development. I'd love to see some benchmarks and configurations of LLMs in these runtimes to see how much we can improve things (there's a rough sketch of what inference in one of these runtimes looks like at the end of this comment). PRs are welcome on this front, but I'd love to coordinate somewhat closely as we explore the possibilities. This is somewhat tricky in that not all LLMs are supported by these options, although my impression is that we could get most of the mainstream ones working.

### End State

After we've removed Python from the execution path, it will be simpler to potentially share a model across multiple connections.

### Additionally

The largest LLMs need to be loaded on the GPU, and shared memory on the GPU is even trickier than normal shared memory, but moving outside of Python is a prerequisite for that as well.

### Finally

We've implemented a background worker that can share models across databases, not just connections, so that our serverless users can have even cheaper access to the latest LLMs, but we haven't open sourced it yet since we're still testing it internally. It still involves Python, and there is a fair amount of complexity around queuing, locks, memory management, etc., e.g. what happens when someone tries to load a model and there isn't enough memory? Our goal is to work more on documentation for the many use cases before adding too much more complexity and too many more options, but if you can help drive things forward while we clean up the existing bits, PRs will always be welcome.
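For a concrete reference point on the Next Steps above, here is roughly what Python-free, GIL-free inference looks like in Rust with the rustformers `llm` crate (the same crate the question below mentions). This is a sketch adapted from that crate's README, assuming the v0.1-era API plus the `rand` crate; the exact signatures (model parameters, tokenizer handling, callback types) have shifted between releases, and the model path is a placeholder:

```rust
use std::io::Write;
use llm::Model;

fn main() {
    // Load a GGML-format model from disk. No Python interpreter, no GIL.
    let llama = llm::load::<llm::models::Llama>(
        std::path::Path::new("/path/to/ggml-model.bin"), // placeholder path
        Default::default(),                  // llm::ModelParameters
        llm::load_progress_callback_stdout,  // print load progress
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));

    // Each session holds the KV cache for one generation.
    let mut session = llama.start_session(Default::default());
    session
        .infer::<std::convert::Infallible>(
            &llama,
            &mut rand::thread_rng(),
            &llm::InferenceRequest {
                prompt: "PostgreSQL is a",
                ..Default::default()
            },
            &mut Default::default(), // llm::OutputRequest
            |t| {
                // Tokens stream back through this callback as they're sampled.
                print!("{t}");
                std::io::stdout().flush().unwrap();
                Ok(())
            },
        )
        .expect("inference failed");
}
```

Because the whole pipeline is native code, a model like this could eventually live behind a shared queue or in a background worker without fighting the GIL.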
---
Hello,
First of all, I want to thank you for the fantastic work you've done on this project! The trend of separating data from compute has been pushed to the extreme, introducing a lot of sometimes unneeded complexity.
I took a look at the extension code and was a little surprised to see that models are loaded per connection process. This is not a huge deal for small models, but it could be a huge bottleneck for LLMs. In addition, you lose the ability to batch requests, and you are constrained to keep those expensive connections running, which is not a big deal if you have a connection pooler like PgCat. As I understand it, taking the embedding function as an example, the extension calls into a Python-wrapped function that uses the transformers library. I think this design is highly inefficient for both memory and compute.
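To illustrate the cost being described, here is a minimal sketch with a hypothetical stand-in model type (not PostgresML's actual code): each Postgres connection is a separate OS process, so any per-process cache is duplicated per connection.

```rust
use std::sync::OnceLock;

// Hypothetical model type standing in for the Python transformers
// pipeline; not PostgresML's actual code.
struct Model;

impl Model {
    fn load(path: &str) -> Model {
        // Imagine hundreds of MB (or many GB, for an LLM) read from
        // disk and deserialized here.
        println!("loading model from {path}");
        Model
    }

    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; 384] // fake 384-dim embedding
    }
}

// Each Postgres connection is its own OS process, so this static is
// duplicated per backend: every new connection pays the full load cost
// and holds its own copy of the weights.
static MODEL: OnceLock<Model> = OnceLock::new();

fn embed(text: &str) -> Vec<f32> {
    MODEL
        .get_or_init(|| Model::load("/models/all-MiniLM-L6-v2")) // placeholder path
        .embed(text)
}

fn main() {
    let _ = embed("hello");
    let _ = embed("world"); // cache hit, but only within this one process
}
```

Ten connections means ten copies of the weights and ten load penalties, which is manageable for a small embedding model and painful for a multi-gigabyte LLM.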
IMO, a better design choice would be to have a background worker that loads the model when instantiated, plus a shared memory queue between connections. In addition to reducing memory (or VRAM) load, the model could do continuous batching, streaming responses, etc. Of course, getting rid of Python entirely would be very nice; this is achievable, as I recently wrote an open-source OpenAI-compatible API using the `llm` crate: cria. I opened this as a discussion because I don't really know the ins and outs of building a pg extension and would like to understand the limits of this suggestion.
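To make the background-worker suggestion concrete, here is a minimal, hypothetical sketch of the pattern, using an ordinary thread and an mpsc channel as a stand-in; a real Postgres background worker would instead communicate with backends over shared memory with appropriate locking:

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

// Hypothetical request type: a prompt plus a channel to send the result
// back to the originating connection.
struct Request {
    prompt: String,
    reply: Sender<String>,
}

// Spawn a single worker that owns the model. Connections only hold the
// cheap Sender handle, not a copy of the weights.
fn spawn_model_worker() -> Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // The model is loaded exactly once, here, instead of once per
        // connection process.
        // let model = Model::load("/models/llama"); // hypothetical
        loop {
            // Block for the first request, then drain whatever else has
            // queued up so the model can run one batched forward pass.
            let first = match rx.recv() {
                Ok(req) => req,
                Err(_) => return, // all senders dropped; shut down
            };
            let mut batch = vec![first];
            while let Ok(next) = rx.try_recv() {
                batch.push(next);
            }
            // let outputs = model.generate_batch(...); // hypothetical
            for req in batch {
                // Placeholder output; a real worker would stream tokens.
                let _ = req.reply.send(format!("completion for: {}", req.prompt));
            }
        }
    });
    tx
}

fn main() {
    let worker = spawn_model_worker();
    let (reply_tx, reply_rx) = mpsc::channel();
    worker
        .send(Request { prompt: "hello".into(), reply: reply_tx })
        .unwrap();
    println!("{}", reply_rx.recv().unwrap());
}
```

The drain-then-batch loop is what enables continuous batching: requests that arrive while the model is busy are grouped into the next forward pass instead of each paying full per-request latency.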
I would also love to help if you accept contributions on this front 😺!