
Add common benchmarks #50

Open
steventkrawczyk opened this issue Aug 1, 2023 · 9 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@steventkrawczyk (Contributor)

🚀 The feature

We need to add benchmark test sets so folks can run them against models, embeddings, and systems.

A few essentials:

  • BEIR for information retrieval
  • MTEB for embeddings
  • Some metrics from HELM (e.g. ROUGE, BLEU) for LLMs

Motivation, pitch

Users have told us that they want to run academic benchmarks as "smoke tests" on new models.

Alternatives

No response

Additional context

No response

@steventkrawczyk added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Aug 1, 2023
@LuvvAggarwal

Can I work on this?

@steventkrawczyk (Contributor, Author) commented Aug 4, 2023

@LuvvAggarwal Sure thing. The scope of this one is a bit large because we currently don't have any common benchmarks. I think a simple first pass would look like the following:

  • Add a new benchmarks directory to prompttools
  • Add a Python file to read in a test dataset from a given filepath (probably in CSV format)
  • Add a utility function to compute the relevant metric from the responses
  • Add a dataset to use for the benchmark to a new directory, e.g. prompttools/data
  • Add an example notebook that runs the benchmark and computes the metric

Some benchmarks to start with would be HellaSwag and TruthfulQA, or perhaps simpler metrics like ROUGE and BLEU.

Feel free to deviate from this plan; it's just a suggestion for how to get started (see the loader sketch below).
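A minimal sketch of the CSV-loading piece from the list above; the `prompttools/benchmarks` module path, the data file, and the function name are hypothetical, not existing prompttools APIs:

```python
# prompttools/benchmarks/loader.py (hypothetical module, for illustration only)
import csv
from typing import Dict, List


def load_benchmark_csv(filepath: str) -> List[Dict[str, str]]:
    """Read a benchmark test set (e.g. prompttools/data/truthfulqa.csv) into a list of rows."""
    with open(filepath, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```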

@LuvvAggarwal commented Aug 5, 2023

Thanks @steventkrawczyk for the guidance. Based on my initial research, I have found a package, Evaluate, that provides methods for evaluating models.
Link to package: https://huggingface.co/docs/evaluate/index
I was thinking of using it.

Please feel free to suggest better approaches, as I am new to ML but would love to contribute.
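For reference, a quick example of what the Evaluate package mentioned above looks like in use (assuming `pip install evaluate rouge_score`); this demonstrates the library itself, not prompttools code:

```python
import evaluate

# Compute ROUGE on a toy prediction/reference pair.
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(scores)  # dict of ROUGE-1 / ROUGE-2 / ROUGE-L scores
```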

@LuvvAggarwal commented Aug 6, 2023

@steventkrawczyk, can we use the "Datasets" library for loading benchmark datasets instead of creating a separate directory?
Link to the library: https://github.com/huggingface/datasets

It can also be used for quick tests on prebuilt datasets.
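For example, loading HellaSwag straight from the Hub with the Datasets library instead of shipping a CSV (assuming `pip install datasets`; newer versions of the library may require `trust_remote_code=True` for script-based datasets):

```python
from datasets import load_dataset

# Pull the HellaSwag validation split from the Hugging Face Hub.
hellaswag = load_dataset("hellaswag", split="validation")

example = hellaswag[0]
print(example["ctx"])      # context / prompt
print(example["endings"])  # candidate completions
print(example["label"])    # index of the correct ending (stored as a string)
```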

@steventkrawczyk (Contributor, Author) commented Aug 6, 2023

@LuvvAggarwal Using datasets sounds like a good start. As for evaluate, we want to write our own eval methods that support more than just Hugging Face (e.g. OpenAI, Anthropic).
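One way to keep the eval methods provider-agnostic is to have them operate on plain strings and accept a completion callable, so the same metric works whether responses come from OpenAI, Anthropic, or Hugging Face. A rough sketch; the `completion_fn` interface is hypothetical, not an existing prompttools API:

```python
from typing import Callable, List


def run_exact_match_eval(
    prompts: List[str],
    references: List[str],
    completion_fn: Callable[[str], str],  # wraps any provider: OpenAI, Anthropic, HF, ...
) -> float:
    """Return the fraction of completions that exactly match the reference answer."""
    predictions = [completion_fn(p) for p in prompts]
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0
```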

@LuvvAggarwal

@steventkrawczyk Sure, but I have no idea about eval methods. It would be great if you could share some references so I can start coding.
Thanks

@steventkrawczyk (Contributor, Author)

For example, if you are using the hellaswag dataset, we need to compute the accuracy of predictions, e.g. https://github.com/openai/evals/blob/main/evals/metrics.py#L12
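In that spirit, a minimal accuracy helper for a multiple-choice benchmark like HellaSwag might look like the following (a sketch modeled loosely on the linked `evals` metric, not the actual prompttools implementation):

```python
from typing import Sequence


def get_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predicted ending indices that match the gold labels."""
    if not gold:
        return 0.0
    return sum(int(p == g) for p, g in zip(predicted, gold)) / len(gold)


# Example: 3 of 4 predictions match the gold ending index -> 0.75
print(get_accuracy([0, 2, 1, 3], [0, 2, 1, 0]))
```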

@HashemAlsaket (Contributor)

@LuvvAggarwal I kick-started the code for benchmarks here if you would like to branch from it: #72

@LuvvAggarwal

Thanks @HashemAlsaket, I will branch from it.
