Enhancement: Load balancing on Gemini API #2723
msg7086 started this conversation in Feature Requests & Suggestions
Replies: 2 comments · 2 replies
-
I'd be interested to see a PR for this, but may not get to this myself.
-
Not yet, at least. It should be fairly simple to introduce a battle-tested load-balancing algorithm to quickly swap among supported regions. What kind of throughput is hitting the rate limits for you?
-
What features would you like to see added?
As shown here:
LibreChat/api/app/clients/GoogleClient.js (line 24 in 94eeec3)
We only use the us-central1 endpoint, which concentrates query load on the us-central1 servers and also subjects users to that single region's quota of 1-2 queries per minute. It would be great if you could load-balance this across all regional endpoints, both to spread the load more evenly and to work around per-region query limits.
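To illustrate the idea, here is a minimal sketch of parameterizing the region rather than hardcoding us-central1. The function name `getVertexEndpoint` is illustrative, not LibreChat's actual API; it only relies on the fact that Vertex AI regional endpoints follow the `<region>-aiplatform.googleapis.com` pattern.

```javascript
// Hypothetical sketch: build the Vertex AI endpoint host from a region
// argument instead of hardcoding 'us-central1'. The function name is
// illustrative and not part of LibreChat's codebase.
function getVertexEndpoint(region = 'us-central1') {
  // Vertex AI regional endpoints follow <region>-aiplatform.googleapis.com
  return `${region}-aiplatform.googleapis.com`;
}
```

With this in place, swapping regions becomes a one-argument change, e.g. `getVertexEndpoint('europe-west4')`, which is the hook a load balancer would use.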
More details
Due to the quota "Generate content requests per minute per project per base model per minute per region per base_model", the number of requests is limited per minute per region per base model, and the limit is usually 1. This is used up very quickly if you are having a conversation of short sentences with Gemini. Many other regions provide the same capabilities.
We can utilize all of them and, if possible, give users the ability to override which regions to use from the .env file. We can pick regions randomly, or we can do LRU. The goal is to spread query load evenly across all Google regions, giving a much lower chance of hitting the quota limit and getting an error.
Which components are impacted by your request?
Endpoints
Pictures
No response