Develop better LLM apps by testing different models and prompts in bulk.

Crucible

Prompt evaluation ("evals") package: test multiple models, prompts, and variables in bulk.

Uses Ollama to run LLMs locally.

How to use

  1. Setup: `python -m venv venv`, `source venv/bin/activate`, `pip install -r requirements.txt`
  2. Set the models in `eval_models.py`, prompts in `eval_prompts.py`, and variables in `eval_variables.py`. See the Parameters section below.
  3. (Not implemented) Set the grading style in `main.py`.
    • "binary": the response is either right or wrong
    • "qualitative": ask Claude to judge the response
  4. Run `python eval.py`.
  5. Logs from the run are written to `output/<datetime>.yaml`.
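
The configuration files in step 2 are plain Python. As a hedged sketch of what `eval_models.py` might contain (the `Model` definition below is an illustrative stand-in based on the Parameters section, and `MODELS` is a hypothetical variable name, not necessarily the package's actual API):

```python
from dataclasses import dataclass

# Assumption: the package's Model class is a thin wrapper around an
# Ollama model id, as the Parameters section suggests. This stand-in
# dataclass is illustrative, not the package's actual definition.
@dataclass
class Model:
    id: str

# eval_models.py would then simply list the models to benchmark:
MODELS = [
    Model("llama3"),
    Model("mistral"),
]
```

Each entry's `id` must be a model name Ollama recognizes, so every listed model needs to be pulled locally before the run.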

Parameters

  • model

    • id (str): model name as understood by Ollama; you may need to download it first (e.g. `ollama pull llama3`)

            Model("llama3")
      
  • prompt

    • id (str): name of the test case

    • slots (str): placeholder string marking where the variable's content is inserted into the prompt (e.g. `{variable}`)

    • content (str): the actual prompt text. The example below is in Portuguese: it asks whether the given text mentions a need to buy medicine or health items, and instructs the model to answer "<<SIM>>" (yes) or "<<NÃO>>" (no).

            Prompt(
                id="test_3",
                slots="{variable}",
                content="""Sua tarefa é analisar e responder se o texto a seguir menciona a necessidade de comprar remédios ou itens de saúde. Aqui está o texto:\n\n###\n\n{variable}\n\n###\n\n\nPrimeiro, analise cuidadosamente o texto em um rascunho. Depois, responda: a solicitação citada menciona a necessidade de comprar remédios ou itens de saúde? Responda "<<SIM>>" ou "<<NÃO>>".""",
            )
      
  • variable

    • id (str): name of the test case

    • content (str): text of the snippet to be inserted into the prompt. The example below is in Portuguese: it describes a single-parent family requesting general vulnerability assistance, with no mention of medicine.

    • expected (str list): values that would be considered correct

    • options (str list): all values that the response could take

            Variable(
                id="despesas_essenciais",
                content="Família monoparental composta por Josefa e 5 filhos com idades entre 1 e 17 anos. Contam apenas com a renda de coleta de material reciclável e relatam dificuldade para manter as despesas essenciais. Solicita-se, portanto, o auxílio vulnerabilidade.",
                expected=["<<NAO>>", "<<NÃO>>"],
                options=["<<NAO>>", "<<NÃO>>", "<<SIM>>"],
            ),
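
Putting the three parameter types together, a run presumably takes every combination of model, prompt, and variable, fills each prompt's slot, and checks the model's response against `expected`. A minimal sketch of that logic (the `render` and `grade_binary` helpers are illustrative names, not the package's actual API):

```python
from dataclasses import dataclass

# Stand-ins for the package's Prompt and Variable classes (assumption:
# simple dataclass-like containers mirroring the fields documented above).
@dataclass
class Prompt:
    id: str
    slots: str    # placeholder token, e.g. "{variable}"
    content: str  # prompt template containing the placeholder

@dataclass
class Variable:
    id: str
    content: str          # text substituted into the prompt
    expected: list[str]   # answers counted as correct
    options: list[str]    # all answers the response could take

def render(prompt: Prompt, variable: Variable) -> str:
    # Fill the prompt's slot with the variable's content.
    return prompt.content.replace(prompt.slots, variable.content)

def grade_binary(response: str, variable: Variable) -> bool:
    # "binary" grading: correct iff one of the expected markers
    # appears in the model's response.
    return any(marker in response for marker in variable.expected)

# The full run would iterate over every (model, prompt, variable)
# combination, send each rendered prompt to Ollama, and log the
# graded results to output/<datetime>.yaml.
```

Listing multiple markers in `expected` (as in the example above, which accepts both "<<NAO>>" and "<<NÃO>>") makes the binary grade robust to small spelling variations in the model's answer.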
      

TODO

  • add tests
  • add qualitative eval
  • add asyncio
  • add details on which answers tend to be wrong (a summary is expected)
