Enhancement: Load balancing on Gemini API #2723
msg7086 started this conversation in Feature Requests & Suggestions
Replies: 2 comments · 2 replies
-
I'd be interested to see a PR for this, but may not get to this myself.
-
Not yet, at least. It should be fairly simple to introduce a battle-tested load-balancing algorithm to quickly swap among supported regions. What kind of throughput is hitting the rate limits for you?
-
What features would you like to see added?
As shown here:
LibreChat/api/app/clients/GoogleClient.js (line 24 in 94eeec3)
We only use the us-central1 endpoint, which concentrates query load on the us-central1 servers and also subjects users to that single region's quota of 1-2 queries per minute. It would be great if you could load-balance this across all regional endpoints, both to spread the load more evenly and to work around per-region query limits.
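To illustrate the idea, here is a minimal sketch of parameterizing the region rather than hardcoding us-central1. The function name `getVertexEndpoint` is illustrative, not LibreChat's actual API; it only relies on the fact that Vertex AI regional endpoints follow the `<region>-aiplatform.googleapis.com` pattern.

```javascript
// Hypothetical sketch: build the Vertex AI endpoint host from a region
// argument instead of hardcoding 'us-central1'. The function name is
// illustrative and not part of LibreChat's codebase.
function getVertexEndpoint(region = 'us-central1') {
  // Vertex AI regional endpoints follow <region>-aiplatform.googleapis.com
  return `${region}-aiplatform.googleapis.com`;
}
```

With this in place, swapping regions becomes a one-argument change, e.g. `getVertexEndpoint('europe-west4')`, which is the hook a load balancer would use.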
More details
Due to the quota "Generate content requests per minute per project per base model per minute per region per base_model", the number of requests is limited per minute per region per base model, and the limit is usually 1. This is used up very quickly if you are having a conversation of short sentences with Gemini. Many other regions provide the same capabilities.
We can utilize all of them and, if possible, give users the ability to override which regions to use from the .env file. We can pick regions randomly, or we can do LRU. The goal is to spread query load evenly across all Google regions, giving a much lower chance of hitting the quota limit and getting an error.
Which components are impacted by your request?
Endpoints
Pictures
No response