I think there are a few issues here to unpack.

### Current State
Because of 1) and 2), the lowest-latency way to get predictions that use all available compute is often a single model in a single connection, backed by a query queue in PgCat. It's better to queue outside of Postgres than inside it, and fewer concurrent connections also lowers lock contention while still allowing large-scale resource utilization. This is also good because the query cache stays warm across many clients without having to re-init the model every time a client connects, similar to having a background worker hold the model. The drawback is that you need to set up PgCat or some other proxy/pooler, but that's pretty common in production environments at scale these days, so I think this is an acceptable state of affairs, although we could definitely use better documentation on this front.

### Next Steps

I would prioritize work on removing Python from the inference path through one of three options so we can get around the GIL. This is part of the plan for PostgresML 3.0, although it's not currently under active development. I'd love to see some benchmarks and configurations of LLMs in these runtimes to see how much we can improve things (there's a rough sketch of what inference in one of these runtimes looks like at the end of this comment). PRs are welcome on this front, but I'd love to coordinate somewhat closely as we explore the possibilities. This is somewhat tricky in that not all LLMs are supported by these options, although my impression is that we could get most of the mainstream ones working.

### End State

After we've removed Python from the execution path, it will be simpler to potentially share a model across multiple connections.

### Additionally

The largest LLMs need to be loaded on the GPU, and shared memory on the GPU is even trickier than normal shared memory, but moving outside of Python is a prerequisite for that as well.

### Finally

We've implemented a background worker that can share models across databases, not just connections, so that our serverless users can have even cheaper access to the latest LLMs, but we haven't open sourced it yet since we're still testing it internally. It still involves Python, and there is a fair amount of complexity around queuing, locks, memory management, etc., e.g. what happens when someone tries to load a model and there isn't enough memory? Our goal is to work more on documentation for the many use cases before adding too much more complexity and too many more options, but if you can help drive things forward while we clean up the existing bits, PRs will always be welcome.
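For a concrete reference point on the Next Steps above, here is roughly what Python-free, GIL-free inference looks like in Rust with the rustformers `llm` crate (the same crate the question below mentions). This is a sketch adapted from that crate's README, assuming the v0.1-era API plus the `rand` crate; the exact signatures (model parameters, tokenizer handling, callback types) have shifted between releases, and the model path is a placeholder:

```rust
use std::io::Write;
use llm::Model;

fn main() {
    // Load a GGML-format model from disk. No Python interpreter, no GIL.
    let llama = llm::load::<llm::models::Llama>(
        std::path::Path::new("/path/to/ggml-model.bin"), // placeholder path
        Default::default(),                  // llm::ModelParameters
        llm::load_progress_callback_stdout,  // print load progress
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));

    // Each session holds the KV cache for one generation.
    let mut session = llama.start_session(Default::default());
    session
        .infer::<std::convert::Infallible>(
            &llama,
            &mut rand::thread_rng(),
            &llm::InferenceRequest {
                prompt: "PostgreSQL is a",
                ..Default::default()
            },
            &mut Default::default(), // llm::OutputRequest
            |t| {
                // Tokens stream back through this callback as they're sampled.
                print!("{t}");
                std::io::stdout().flush().unwrap();
                Ok(())
            },
        )
        .expect("inference failed");
}
```

Because the whole pipeline is native code, a model like this could eventually live behind a shared queue or in a background worker without fighting the GIL.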
---
Hello,
First of all, I want to thank you for the fantastic work you've done on this project! The trend of separating data from compute has been pushed to the extreme, introducing a lot of sometimes unneeded complexity.
I took a look at the extension code and was a little surprised to see that models are loaded per connection process. This is not a huge deal for small models, but it could be a huge bottleneck for LLMs. In addition, you lose the ability to batch requests, and you are constrained to keep those expensive connections running, which is not a big deal if you have a connection pooler like PgCat. As I understand it, taking the embedding function as an example, the extension calls into a Python-wrapped function that uses the transformers library. I think this design is highly inefficient for both memory and compute.
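To illustrate the cost being described, here is a minimal sketch with a hypothetical stand-in model type (not PostgresML's actual code): each Postgres connection is a separate OS process, so any per-process cache is duplicated per connection.

```rust
use std::sync::OnceLock;

// Hypothetical model type standing in for the Python transformers
// pipeline; not PostgresML's actual code.
struct Model;

impl Model {
    fn load(path: &str) -> Model {
        // Imagine hundreds of MB (or many GB, for an LLM) read from
        // disk and deserialized here.
        println!("loading model from {path}");
        Model
    }

    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; 384] // fake 384-dim embedding
    }
}

// Each Postgres connection is its own OS process, so this static is
// duplicated per backend: every new connection pays the full load cost
// and holds its own copy of the weights.
static MODEL: OnceLock<Model> = OnceLock::new();

fn embed(text: &str) -> Vec<f32> {
    MODEL
        .get_or_init(|| Model::load("/models/all-MiniLM-L6-v2")) // placeholder path
        .embed(text)
}

fn main() {
    let _ = embed("hello");
    let _ = embed("world"); // cache hit, but only within this one process
}
```

Ten connections means ten copies of the weights and ten load penalties, which is manageable for a small embedding model and painful for a multi-gigabyte LLM.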
IMO, a better design choice would be to have a background worker that loads the model when instantiated, plus a shared memory queue between connections. In addition to reducing memory (or VRAM) load, the model could do continuous batching, streaming responses, etc. Of course, getting rid of Python entirely would be very nice; this is achievable, as I recently wrote an open-source OpenAI-compatible API using the `llm` crate: cria. I opened this as a discussion because I don't really know the ins and outs of building a pg extension and would like to understand the limits of this suggestion.
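To make the background-worker suggestion concrete, here is a minimal, hypothetical sketch of the pattern, using an ordinary thread and an mpsc channel as a stand-in; a real Postgres background worker would instead communicate with backends over shared memory with appropriate locking:

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

// Hypothetical request type: a prompt plus a channel to send the result
// back to the originating connection.
struct Request {
    prompt: String,
    reply: Sender<String>,
}

// Spawn a single worker that owns the model. Connections only hold the
// cheap Sender handle, not a copy of the weights.
fn spawn_model_worker() -> Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // The model is loaded exactly once, here, instead of once per
        // connection process.
        // let model = Model::load("/models/llama"); // hypothetical
        loop {
            // Block for the first request, then drain whatever else has
            // queued up so the model can run one batched forward pass.
            let first = match rx.recv() {
                Ok(req) => req,
                Err(_) => return, // all senders dropped; shut down
            };
            let mut batch = vec![first];
            while let Ok(next) = rx.try_recv() {
                batch.push(next);
            }
            // let outputs = model.generate_batch(...); // hypothetical
            for req in batch {
                // Placeholder output; a real worker would stream tokens.
                let _ = req.reply.send(format!("completion for: {}", req.prompt));
            }
        }
    });
    tx
}

fn main() {
    let worker = spawn_model_worker();
    let (reply_tx, reply_rx) = mpsc::channel();
    worker
        .send(Request { prompt: "hello".into(), reply: reply_tx })
        .unwrap();
    println!("{}", reply_rx.recv().unwrap());
}
```

The drain-then-batch loop is what enables continuous batching: requests that arrive while the model is busy are grouped into the next forward pass instead of each paying full per-request latency.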
I would also love to help if you accept contributions on this front 😺!