
Running evaluation on dataset outputs #543

Open
chasemcdo opened this issue Mar 23, 2024 · 2 comments
Labels
enhancement New feature or request


@chasemcdo
Contributor

Feature request

It would be great to be able to generate a dataset with outputs and then perform evaluation directly on these "reference outputs".

Motivation

While building out a LangSmith evaluation pipeline, you'll likely need several iterations on your evaluation metrics to tune them as desired. If each iteration also requires regenerating the output examples, you end up spending a lot of tokens on generation that could otherwise be reused.

I've seen the compute_test_metrics beta function from the cookbooks, which achieves a similar result; however, it adds metrics on top of existing test runs rather than letting you run evaluation directly on a created/imported dataset.

Thanks!

@hinthornw
Collaborator

We are working on something in this vein, but want to make sure we satisfy your use case: could you elaborate on this a bit more?

Is the flow something like: first create a set of inputs, generate candidate outputs, manually review and revise, then continue iterating?

Or is it more a case where ground truth isn't super meaningful and you mainly want to compute relative performance to some baseline that you may update over time?

Or something different?

@chasemcdo
Contributor Author

chasemcdo commented Mar 23, 2024

Closest to the second one. There is no ground truth in the work I'm currently doing. The flow I imagined is something like:

  • Follow the normal dataset creation process, but in this case the "outputs" aren't a reference / ground truth but rather the thing to be evaluated
  • Run evaluation directly on said dataset's outputs

The primary motivation is to save money and inference time when iterating on LangSmith evaluators. I've found myself making several tweaks to the evaluators I've set up to ensure they align with my expectations, but each evaluator iteration requires regenerating the outputs to be evaluated, which costs extra and changes the very outputs the tweaks were meant to address.

So the specific use case is having a set of inputs/outputs which I want to use to essentially tune my evaluators.
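
To make that concrete, here's the rough shape of the loop I have in mind (just a sketch using the Python SDK's Client.list_examples; the dataset name and relevance_evaluator are made-up placeholders, not an existing API, and it assumes the dataset already stores the generated outputs):

```python
from langsmith import Client

client = Client()  # assumes LANGSMITH_API_KEY is set in the environment

def relevance_evaluator(inputs: dict, outputs: dict) -> dict:
    # Placeholder scoring logic -- in practice this would be the LLM-as-judge
    # or heuristic evaluator being tuned against my expectations.
    score = 1.0 if outputs.get("answer") else 0.0
    return {"key": "relevance", "score": score}

# Read back the stored examples (inputs plus previously generated outputs)
# instead of re-running generation on every evaluator tweak.
scores = []
for example in client.list_examples(dataset_name="my-dataset"):
    scores.append(relevance_evaluator(example.inputs, example.outputs or {}))

if scores:
    print(sum(s["score"] for s in scores) / len(scores))
```

The point is that example.outputs is read straight back from the dataset, so rerunning this after each evaluator tweak costs no generation tokens and the outputs being evaluated stay fixed.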

@hinthornw added the bug and enhancement labels and removed the bug label Apr 10, 2024