
experiment.evaluate() shows stale evaluation results #79

Open
davidtan-tw opened this issue Aug 17, 2023 · 3 comments

davidtan-tw commented Aug 17, 2023

🐛 Describe the bug

Hi folks,

Thanks again for your work on this library.

I noticed an issue where similarity scores do not get updated when I change my expected fields. Only when I re-run the experiment are the values updated.

Bug

[Screenshot: evaluation results remain unchanged after the second evaluate() call]

Steps to reproduce:

from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import semantic_similarity

models = ["gpt-3.5-turbo", "gpt-3.5-turbo-0613"]
messages = [
    [
        {"role": "system", "content": "Who is the first president of the US? Give me only the name"},
    ]
]
temperatures = [0.0]

experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
experiment.run()
experiment.visualize()

experiment.evaluate("similar_to_expected", semantic_similarity, expected=["George Washington"] * 2)
experiment.visualize()

experiment.evaluate("similar_to_expected", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()  # the evaluation results here indicate that "Lady Gaga" is semantically identical to "George Washington"

In my opinion, evaluate() should re-compute metrics every time it is called, rather than being coupled to run(). I haven't tested other eval_fns, but it may be worth checking whether they behave the same way.

@NivekT
Copy link
Collaborator

NivekT commented Aug 17, 2023

Your observation is correct. Currently, if a metric with the same name already exists ("similar_to_expected" in your case), evaluate() logs a warning (as seen in your notebook: "WARNING: similar_to_expected is already present, skipping") rather than overwriting the column.

If you change the metric name in the second .evaluate call (e.g. experiment.evaluate("similar_to_expected_2", ...)), it will compute a new column.
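For example, a minimal sketch of that workaround, reusing the experiment from the snippet above:

# Assumes `experiment` has already been run, as in the reproduction snippet above.
from prompttools.utils import semantic_similarity

# Each call uses a distinct metric name, so a new column is computed instead of being skipped.
experiment.evaluate("similar_to_expected", semantic_similarity, expected=["George Washington"] * 2)
experiment.evaluate("similar_to_expected_2", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()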

We are open to considering overwriting the existing column when the metric name is reused. Let us know what you think.

@Sruthi5797
Copy link

Sruthi5797 commented Oct 4, 2023

Thank you for this issue. I changed the variable name, but the response column is still stale. Any leads on this issue? I am using Python 3.11.5.

@NivekT
Copy link
Collaborator

NivekT commented Oct 4, 2023

Hi @Sruthi5797,

Can you post a minimal code snippet of what you are running? Also, are you seeing any warning message?
