Vanna trulens performance metrics #238

Open
wants to merge 4 commits into
base: main

@samoliverschumacher samoliverschumacher commented Feb 7, 2024

This PR adds a script to support improving the performance (accuracy, cost and latency) of a vanna app.

The problem:

  • Each component and prompt contributes to performance, but it's not clear how much each one affects it.
  • Making improvements means changing something, then manually assessing the new outputs. Manual assessment does not scale.

Context:

vn.ask() carries out RAG in multiple steps, each of which can be optimised:

  1. Retrieves examples of 3 different data types (SQL, DDL, etc.)
    • parameters: embedding model chosen, retrieval system, retrieval parameters
  2. Connects to an LLM
    • parameters: model chosen, fine-tuned vs. not
  3. Prompts the LLM about each of these in different ways.
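
To make the three steps concrete, here is a minimal sketch of a staged pipeline that records the latency of each step separately. It is not vanna's implementation: `StageTimer`, `retrieve_context`, `build_prompt`, `call_llm` and `ask` are hypothetical stand-ins, and the third data type is left as a placeholder because it is not named above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StageTimer:
    """Collects wall-clock latency per named pipeline stage."""
    timings: Dict[str, float] = field(default_factory=dict)

    def run(self, name: str, fn: Callable, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.timings[name] = time.perf_counter() - start
        return result


# Hypothetical stand-ins for the three steps; these are NOT vanna functions.
def retrieve_context(question: str) -> Dict[str, List[str]]:
    # A real retriever would query a vector store using an embedding model.
    return {
        "sql": ["SELECT name FROM customers LIMIT 5"],
        "ddl": ["CREATE TABLE customers (id INTEGER, name TEXT)"],
        "other": ["placeholder for the third data type"],
    }


def build_prompt(question: str, context: Dict[str, List[str]]) -> str:
    # Each data type gets its own prompt section, followed by the question.
    sections = [f"## {kind}\n" + "\n".join(items) for kind, items in context.items()]
    return "\n\n".join(sections + [f"## question\n{question}"])


def call_llm(prompt: str) -> str:
    # Placeholder for the model call; returns a canned answer.
    return "SELECT name FROM customers"


def ask(question: str, timer: StageTimer) -> str:
    context = timer.run("retrieve", retrieve_context, question)
    prompt = timer.run("prompt", build_prompt, question, context)
    return timer.run("generate", call_llm, prompt)


timer = StageTimer()
sql = ask("Who are our customers?", timer)
print(sql, timer.timings)
```

Timing each stage separately is what makes it possible to attribute a latency change to the retriever, the prompt construction or the model rather than to the pipeline as a whole.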

Future improvements to vanna could open up even more possibilities, such as:

  • Self-corrective systems, e.g. diagnosing a SQL error and retrying the database call.
  • Chain-of-thought reasoning for complex questions.
  • Multi-hop programs for complex SQL generation, i.e. "break a question into multiple SQL sub-queries to validate a hypothesised correct SQL".
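
As an illustration of the self-corrective idea in the first bullet, the sketch below retries SQL generation and passes the database error back to the generator. It assumes SQLite purely so the example is runnable; `run_with_retry` and `fake_generate_sql` are hypothetical names, and the generator stands in for an LLM call.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def run_with_retry(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
) -> Tuple[str, List[tuple]]:
    """Generate SQL, execute it, and feed any database error back to the generator."""
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate_sql(question, last_error)
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            last_error = str(exc)  # handed to the next attempt as a hint
    raise RuntimeError(f"no valid SQL after {max_attempts} attempts: {last_error}")


# Hypothetical generator standing in for an LLM: wrong table name on the first try.
def fake_generate_sql(question: str, error: Optional[str]) -> str:
    if error is None:
        return "SELECT name FROM customer"  # table does not exist
    return "SELECT name FROM customers"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
sql, rows = run_with_retry(conn, "List customer names", fake_generate_sql)
print(sql, rows)
```

The number of attempts is itself a parameter that trades accuracy against cost and latency, which is the kind of change an evaluation harness would need to measure.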

The solution:

The script uses trulens-eval and lets you configure what is evaluated, and how. It presents the results in a dashboard (see the doc for visuals).

Using TruLens allows the system to be evaluated without changing vanna (it only adds logging to the vanna model). The alternative would be to build evaluation into the app's code itself, which might require major refactoring to decouple the vanna components.
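
The general idea of evaluating from the outside can be sketched as wrapping a method so that every call is recorded. This is not the TruLens API and not the code in this PR; `with_call_log` and `ToyApp` are hypothetical names used only to show the pattern.

```python
import functools
import time
from typing import Any, Callable, Dict, List


def with_call_log(fn: Callable, log: List[Dict[str, Any]]) -> Callable:
    """Wrap fn so each call appends its inputs, output and latency to log."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        log.append({
            "name": fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
            "seconds": time.perf_counter() - start,
        })
        return result
    return wrapper


class ToyApp:
    """Hypothetical stand-in for the app under evaluation."""
    def generate_sql(self, question: str) -> str:
        return "SELECT 1"


app = ToyApp()
calls: List[Dict[str, Any]] = []
# Patch this instance only; the ToyApp class itself is left unchanged.
app.generate_sql = with_call_log(app.generate_sql, calls)
app.generate_sql("How many orders were placed?")
print(calls[0]["name"], calls[0]["result"])
```

Because the wrapper is attached to an instance, the evaluated class needs no edits, and the recorded inputs and outputs can be scored afterwards by whatever metrics are configured.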

Other evaluation frameworks exist, though not many as of yet.

Tests performed

Manual testing only, using a few example prompts (shown in the code). No unit tests.


2 participants