Replies: 5 comments 9 replies
-
Hi @siddhsql, it does not seem that there is support for Mac GPUs: https://onnxruntime.ai/docs/get-started/with-java.html How slow is it? How much text do you embed? Thanks!
-
Hi all, thanks for all your inputs. One thing I wanted to understand (maybe this is a topic for a separate thread, but I will start it here): when I looked at the dependency graph of langchain4j, I see that it uses both the DJL library from Amazon and the ONNX Runtime from Microsoft. I am not familiar with the internals of either of these libraries, but wouldn't it be better for us to stick with one deep learning library? Am I misunderstanding something here?
-
I now tried to run the same code on a machine with an NVIDIA RTX A6000 GPU. Again, it seems that by default langchain4j does not use the GPU even when one is available. What can I do to make it use the GPU in this case? My code:
I am only showing minimal code. I calculated thousands of embeddings this way; GPU usage stayed at 0 while the CPU was maxed out.
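(The code block itself did not survive the export. A minimal sketch of what such an embedding loop could look like with langchain4j 0.22.0's in-process model, based on the type name mentioned elsewhere in this thread; the exact package and method names here are assumptions and may differ by version:)

```java
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.inprocess.InProcessEmbeddingModel;
import dev.langchain4j.model.inprocess.InProcessEmbeddingModelType;

public class EmbedExample {
    public static void main(String[] args) {
        // Loads all-MiniLM-L6-v2 and runs it in-process via ONNX Runtime.
        // By default this executes on CPU only, which matches the behavior
        // described above (CPU maxed out, GPU idle).
        EmbeddingModel model =
                new InProcessEmbeddingModel(InProcessEmbeddingModelType.ALL_MINILM_L6_V2);

        for (String text : new String[]{"first document", "second document"}) {
            // embed(...) returns the embedding for the given text
            float[] vector = model.embed(text).vector();
            System.out.println(text + " -> " + vector.length + " dimensions");
        }
    }
}
```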
-
Thanks for the response. A separate question: how can I get the source
code of version 0.22.0 of dev.langchain4j:langchain4j-embeddings? That is
the one I'd like to use, and the repo does not have that tag anymore.
On Wed, May 8, 2024 at 12:02 AM LangChain4j wrote:
Hi, there is no way to utilize the GPU right now; we would have to use another
library, com.microsoft.onnxruntime:onnxruntime_gpu
<https://search.maven.org/artifact/com.microsoft.onnxruntime/onnxruntime_gpu>,
instead of com.microsoft.onnxruntime:onnxruntime
<https://search.maven.org/artifact/com.microsoft.onnxruntime/onnxruntime>.
It should be pretty easy to make this change in
https://github.com/langchain4j/langchain4j-embeddings; could you give it
a try?
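(The dependency swap suggested in the quoted reply could look roughly like this in the langchain4j-embeddings pom.xml; the version number below is illustrative, not taken from the project:)

```xml
<!-- Replace the CPU-only ONNX Runtime artifact with the CUDA-enabled one -->
<dependency>
    <groupId>com.microsoft.onnxruntime</groupId>
    <artifactId>onnxruntime_gpu</artifactId>
    <!-- illustrative version; align it with what langchain4j-embeddings expects -->
    <version>1.16.3</version>
</dependency>
```

Note that onnxruntime_gpu targets NVIDIA CUDA, so this would help on the RTX A6000 machine but not on Apple Silicon Macs.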
-
Thanks a lot.
On Fri, May 10, 2024 at 2:11 AM LangChain4j wrote:
Hi @siddhsql <https://github.com/siddhsql>, here is the 0.22.0 commit:
<langchain4j/langchain4j-embeddings@ee2a050>
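(Since the tag is gone, checking out that commit directly is one way to get the 0.22.0 sources; a sketch using the short hash from the link above:)

```shell
# Fetch the repository and check out the commit corresponding to 0.22.0
git clone https://github.com/langchain4j/langchain4j-embeddings.git
cd langchain4j-embeddings
git checkout ee2a050
```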
-
Hello,
I am synced to version 0.22.0 of langchain4j and am using
dev.langchain4j.model.inprocess.InProcessEmbeddingModelType.ALL_MINILM_L6_V2
to calculate embeddings on an M2 Mac Mini. It does not seem to use the GPU and as a result is a bit slow. How can I get GPU acceleration? Thanks very much.