made training call more robust. #370

Open
wants to merge 1 commit into main

Conversation

@thisismygitrepo commented Apr 17, 2024

If you have an extremely large number of SQL statements / docs / plan items / examples etc. (typically above a thousand), the probability of hitting this error becomes very high, practically inevitable:

HTTPSConnectionPool(host='ask.vanna.ai', port=443): Max retries exceeded with url: /rpc (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)')))

I solved this problem by making the call robust to transient failures. There are libraries for this, but instead of making the project more complex with dependencies, I added my own implementation.

Note: I only made the fix for train(plan=plan), but the same needs to be done for all other sections of the train method (i.e. wherever there is a loop of API calls); see the sketch below.
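Roughly, the idea is a retry wrapper with exponential backoff around each API call. This is an illustrative sketch, not the exact diff; `robust_call` and `submit_to_api` are stand-in names:

```python
import time
import requests

def robust_call(func, *args, max_retries=5, base_delay=1.0, **kwargs):
    """Retry `func` with exponential backoff on transient connection errors."""
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except (requests.exceptions.SSLError,
                requests.exceptions.ConnectionError):
            if attempt == max_retries - 1:
                raise  # out of retries; surface the original error
            time.sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, 4s, ...

# Hypothetical usage inside the train(plan=plan) loop:
# for item in plan:
#     robust_call(submit_to_api, item)
```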

@thisismygitrepo (Author)

While at it with the for loops, one should consider adding a progress visualizer.
With thousands or more items, the user has no clue whether the app is hanging or making progress in training.
I recommend tqdm unless there is something simpler; a sketch is below.
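Something as small as this would do (illustrative only; `training_items` and `train_one` are stand-ins for whatever the loop actually iterates over and calls):

```python
from tqdm import tqdm

# Wrap the loop's iterable so the user sees live progress
# instead of a silent, possibly-hung process.
for item in tqdm(training_items, desc="Training"):
    train_one(item)  # hypothetical per-item API call
```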

@zainhoda (Contributor)

Thanks for this -- I think you're the first user to experience this. I'd be curious what your experience was like after you trained. In most other cases we recommend that people "start small" with a specific subset of data and then expand gradually as the accuracy improves.

@thisismygitrepo (Author) commented Apr 21, 2024

I came to the same conclusion as you; it means I'm the first one to try this out on a massive SQL database.
To add context, I have a department-of-health (state-wide) database with 4k tables that is a spaghetti monster, and the provided train methods all fail with the max-retries error due to the large number of calls.

To your question: it worked on simple queries, but for seriously complex stuff that involves a significant amount of corporate knowledge (e.g. how many patients with dxg insulin results exceeded that level, provided they went to service x over the past three months in facility y), that is when it starts to crack (using GPT-4 Turbo). I'm thinking a larger context window would improve it, judging by the simple errors it's making (like "column doesn't exist").

I'm not sure if you are hinting that more data may reduce accuracy.
