The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment in one place.
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
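A rough sketch of the kind of declarative, assertion-based model comparison such a tool automates. The test-case layout, the `run_suite` helper, and the stand-in model callables are illustrative assumptions, not the tool's actual config format or API.

```python
# Illustrative sketch only: stand-in model callables, not any specific tool's API.
from typing import Callable, Dict, List

PROMPT = "Answer concisely: {question}"

# Declarative test cases: template variables plus an expected substring.
TESTS: List[Dict] = [
    {"vars": {"question": "What is 2 + 2?"}, "expect_contains": "4"},
    {"vars": {"question": "What is the capital of France?"}, "expect_contains": "Paris"},
]

def run_suite(models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Return the pass rate per model over the declarative test cases."""
    scores = {}
    for name, call in models.items():
        passed = sum(
            1
            for test in TESTS
            if test["expect_contains"].lower()
            in call(PROMPT.format(**test["vars"])).lower()
        )
        scores[name] = passed / len(TESTS)
    return scores

# Usage with stand-in models; a real harness would call provider APIs here.
print(run_suite({
    "model_a": lambda p: "The answer is 4.",
    "model_b": lambda p: "Paris.",
}))
```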
🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
🐢 Open-Source Evaluation & Testing for LLMs and ML models
Python SDK for running evaluations on LLM-generated responses
The official evaluation suite and dynamic data release for MixEval.
A prompt collection for testing and evaluation of LLMs.
The LLM Evaluation Framework
Awesome papers involving LLMs in Social Science.
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Summary Evaluation Tool
A list of LLM Tools & Projects
Open-Source Evaluation for GenAI Application Pipelines
LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative grading modes, and more. It also lists available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.
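As a generic illustration of rubric-based absolute grading with an LLM judge (the general technique this line describes), the sketch below assembles a judge prompt from a rubric, an optional reference answer, and the candidate response, then parses an integer score. The `judge_llm` callable, the 1-5 scale, and the prompt wording are assumptions for illustration, not PHUDGE's actual prompts or code.

```python
import re
from typing import Callable, Optional

# Assumed judge prompt; not taken from the paper or repo.
JUDGE_TEMPLATE = """You are grading a model response on a 1-5 scale.

Rubric:
{rubric}

Question:
{question}

Reference answer (may be empty):
{reference}

Response to grade:
{response}

Reply with only the integer score (1-5)."""

def absolute_grade(
    judge_llm: Callable[[str], str],   # hypothetical wrapper around a judge model
    question: str,
    response: str,
    rubric: str,
    reference: str = "",
) -> Optional[int]:
    """Ask the judge model for a 1-5 score and parse the first digit it returns."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric, question=question, reference=reference, response=response
    )
    raw = judge_llm(prompt)
    match = re.search(r"[1-5]", raw)
    return int(match.group()) if match else None

# Usage with a stand-in judge that always answers "4":
score = absolute_grade(
    judge_llm=lambda p: "4",
    question="Explain overfitting in one sentence.",
    response="Overfitting is when a model memorizes training data and generalizes poorly.",
    rubric="5 = accurate and concise; 1 = incorrect or off-topic.",
)
print(score)  # 4
```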