
Add common benchmarks #50

Open
steventkrawczyk opened this issue Aug 1, 2023 · 9 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@steventkrawczyk (Contributor)

🚀 The feature

We need to add benchmark test sets so folks can run them against models, embeddings, and systems.

A few essentials:

  • BEIR for information retrieval
  • MTEB for embeddings
  • Some metrics from HELM (e.g. ROUGE, BLEU) for LLMs

Motivation, pitch

Users have told us that they want to run academic benchmarks as "smoke tests" on new models.

Alternatives

No response

Additional context

No response

@steventkrawczyk added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Aug 1, 2023
@LuvvAggarwal

Can I work on this?

@steventkrawczyk (Contributor, Author) commented Aug 4, 2023

@LuvvAggarwal Sure thing. The scope of this one is a bit large because we currently don't have any common benchmarks. I think a simple first pass would look like the following:

  • Add a new benchmarks directory to prompttools
  • Add a Python file to read in a test dataset from a given filepath (probably in CSV format)
  • Add a utility function to compute the relevant metric from the responses
  • Add a dataset to use for the benchmark to a new directory, e.g. prompttools/data
  • Add an example notebook that runs the benchmark and computes the metric

Some benchmarks to start with would be HellaSwag and TruthfulQA, or perhaps simpler metrics like ROUGE and BLEU.

Feel free to deviate from this plan; it's just a suggestion for how to get started (see the loader sketch below).
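A minimal sketch of the CSV-loading piece from the list above; the `prompttools/benchmarks` module path, the data file, and the function name are hypothetical, not existing prompttools APIs:

```python
# prompttools/benchmarks/loader.py (hypothetical module, for illustration only)
import csv
from typing import Dict, List


def load_benchmark_csv(filepath: str) -> List[Dict[str, str]]:
    """Read a benchmark test set (e.g. prompttools/data/truthfulqa.csv) into a list of rows."""
    with open(filepath, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```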

@LuvvAggarwal commented Aug 5, 2023

Thanks @steventkrawczyk for the guidance. Based on my initial research, I have found a package, Evaluate, that provides methods for evaluating models.
Link to package: https://huggingface.co/docs/evaluate/index
I was thinking of using it.

Please feel free to suggest better approaches, as I am new to ML but would love to contribute.
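For reference, a quick example of what the Evaluate package mentioned above looks like in use (assuming `pip install evaluate rouge_score`); this demonstrates the library itself, not prompttools code:

```python
import evaluate

# Compute ROUGE on a toy prediction/reference pair.
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(scores)  # dict of ROUGE-1 / ROUGE-2 / ROUGE-L scores
```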

@LuvvAggarwal commented Aug 6, 2023

@steventkrawczyk, can we use the "Datasets" library for loading benchmark datasets instead of creating a separate directory?
Link to the library: https://github.com/huggingface/datasets

It can also be used for quick tests on prebuilt datasets.
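For example, loading HellaSwag straight from the Hub with the Datasets library instead of shipping a CSV (assuming `pip install datasets`; newer versions of the library may require `trust_remote_code=True` for script-based datasets):

```python
from datasets import load_dataset

# Pull the HellaSwag validation split from the Hugging Face Hub.
hellaswag = load_dataset("hellaswag", split="validation")

example = hellaswag[0]
print(example["ctx"])      # context / prompt
print(example["endings"])  # candidate completions
print(example["label"])    # index of the correct ending (stored as a string)
```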

@steventkrawczyk (Contributor, Author) commented Aug 6, 2023

@LuvvAggarwal Using datasets sounds like a good start. As for evaluate, we want to write our own eval methods that support more than just Hugging Face (e.g. OpenAI, Anthropic).
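One way to keep the eval methods provider-agnostic is to have them operate on plain strings and accept a completion callable, so the same metric works whether responses come from OpenAI, Anthropic, or Hugging Face. A rough sketch; the `completion_fn` interface is hypothetical, not an existing prompttools API:

```python
from typing import Callable, List


def run_exact_match_eval(
    prompts: List[str],
    references: List[str],
    completion_fn: Callable[[str], str],  # wraps any provider: OpenAI, Anthropic, HF, ...
) -> float:
    """Return the fraction of completions that exactly match the reference answer."""
    predictions = [completion_fn(p) for p in prompts]
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0
```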

@LuvvAggarwal

@steventkrawczyk Sure, but I have no idea about eval methods. It would be great if you could share some references so I can start coding.
Thanks

@steventkrawczyk (Contributor, Author)

For example, if you are using the hellaswag dataset, we need to compute the accuracy of predictions, e.g. https://github.com/openai/evals/blob/main/evals/metrics.py#L12
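In that spirit, a minimal accuracy helper for a multiple-choice benchmark like HellaSwag might look like the following (a sketch modeled loosely on the linked `evals` metric, not the actual prompttools implementation):

```python
from typing import Sequence


def get_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predicted ending indices that match the gold labels."""
    if not gold:
        return 0.0
    return sum(int(p == g) for p, g in zip(predicted, gold)) / len(gold)


# Example: 3 of 4 predictions match the gold ending index -> 0.75
print(get_accuracy([0, 2, 1, 3], [0, 2, 1, 0]))
```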

@HashemAlsaket (Contributor)

@LuvvAggarwal I kick-started the code for benchmarks here if you would like to branch from it: #72

@LuvvAggarwal

Thanks @HashemAlsaket, I will branch from it.
