evaluation
Here are 1,082 public repositories matching this topic...
(Windows/Linux) Local web UI for fine-tuning neural network models (currently LLMs only), written in Python with a Gradio interface
Updated May 19, 2024 - Python
🤖 Build AI applications with confidence ✅ Understand how your users are using your LLM app ✅ Get a full picture of the quality and performance of your LLM app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM app.
Updated May 19, 2024 - TypeScript
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Updated May 19, 2024 - TypeScript
Test your prompts, models, and RAG pipelines. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models, with CI/CD integration.
Updated May 19, 2024 - TypeScript
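Eval tools in this category typically express a prompt suite as test cases with assertions that can run in CI. A minimal sketch of that idea in Python, assuming a hypothetical call_model(prompt) helper wired to whatever provider you use (this is not any specific tool's API):

```python
# Minimal prompt-regression check, runnable under pytest in a CI job.
# call_model() is a hypothetical stand-in for your LLM provider client.
import pytest

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider")

TEST_CASES = [
    # (prompt, substring that must appear in the answer)
    ("Translate 'bonjour' to English.", "hello"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]

@pytest.mark.parametrize("prompt,expected", TEST_CASES)
def test_prompt_regression(prompt, expected):
    answer = call_model(prompt)
    assert expected.lower() in answer.lower(), f"missing {expected!r} in: {answer!r}"
```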
The production toolkit for LLMs. Observability, prompt management and evaluations.
Updated May 19, 2024 - TypeScript
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative grading modes, and more. It also lists available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.
Updated May 19, 2024 - Jupyter Notebook
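The absolute vs. relative grading mentioned above maps to two kinds of judge prompts: one that scores a single answer against a rubric, and one that compares two answers head to head. A rough sketch of how such prompts could be built, with judge(prompt) standing in for a call to the judge model (hypothetical, not PHUDGE's actual API):

```python
def judge(prompt: str) -> str:
    """Hypothetical call to a judge model (e.g. Phi-3); returns its raw text."""
    raise NotImplementedError

def absolute_grade(question, answer, rubric, reference=None):
    # Score one answer 1-5 against a rubric, optionally with a reference answer.
    ref = f"\nReference answer:\n{reference}" if reference else ""
    return judge(
        f"Rubric:\n{rubric}{ref}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Give brief feedback, then a score from 1 to 5 as 'Score: <n>'."
    )

def relative_grade(question, answer_a, answer_b):
    # Pick the better of two answers (pairwise / relative grading).
    return judge(
        f"Question:\n{question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        "Which answer is better? Reply with 'A' or 'B' and a short justification."
    )
```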
Unlock the potential of AI-driven solutions and delve into the world of Large Language Models. Explore cutting-edge concepts, real-world applications, and best practices to build powerful systems with these state-of-the-art models.
Updated May 19, 2024 - Jupyter Notebook
FuzzBench - Fuzzer benchmarking as a service.
Updated May 19, 2024 - Python
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Updated May 19, 2024
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Updated May 19, 2024 - Jupyter Notebook
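Automatic evaluators of this kind usually reduce instruction-following quality to pairwise preferences between a candidate model and a reference model, aggregated into a win rate. A small, self-contained sketch of that aggregation step (the preference labels themselves would come from a judge model):

```python
from collections import Counter

def win_rate(preferences):
    """preferences: list of 'candidate', 'reference', or 'tie' labels,
    one per instruction; ties count as half a win for each side."""
    counts = Counter(preferences)
    total = len(preferences)
    if total == 0:
        return 0.0
    return (counts["candidate"] + 0.5 * counts["tie"]) / total

# Example: 3 wins, 1 loss, 1 tie -> 0.7
print(win_rate(["candidate", "candidate", "reference", "tie", "candidate"]))
```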
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
Updated May 19, 2024 - Python
The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and the RAG pattern.
Updated May 19, 2024 - Python
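Independent of the Azure-specific tooling, RAG experiments generally need a retrieval metric to compare configurations. A minimal sketch of recall@k over retrieved chunk IDs (illustrative only, not the accelerator's own code):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant chunks that appear in the top-k retrieved chunks."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(set(relevant_ids))

# Example: 1 of 2 relevant chunks retrieved in the top 3 -> 0.5
print(recall_at_k(["c7", "c2", "c9"], ["c2", "c4"], k=3))
```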
Notebooks for evaluating LLM-based applications using the LLM-as-a-judge pattern.
Updated May 18, 2024 - Jupyter Notebook
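The LLM-as-a-judge pattern referenced above boils down to prompting a strong model to grade an application's output against a rubric and parsing a score out of its reply. A minimal sketch using the OpenAI Python SDK (the model name and rubric wording are placeholders, not taken from the notebooks):

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_score(question: str, answer: str) -> int:
    """Ask a judge model to rate an answer 1-5 and parse the score."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[
            {"role": "system",
             "content": "You grade answers for correctness and helpfulness. "
                        "End your reply with 'Score: <1-5>'."},
            {"role": "user",
             "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    text = resp.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)
    return int(match.group(1)) if match else 0
```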
A Python tool to evaluate the performance of VLMs in the medical domain.
Updated May 18, 2024 - Python
This is the official implementation of "LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models", and it is also an efficient LLM compression tool with various advanced compression methods, supporting multiple inference backends.
Updated May 18, 2024 - Python
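To make the "post-training quantization" part concrete: the core operation is mapping trained weights to low-bit integers with a scale factor, then checking how much reconstruction error that introduces. A toy per-channel int8 sketch in NumPy (illustrative, not the repo's method):

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a 2-D weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per row
    scale = np.where(scale == 0, 1.0, scale)               # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
w_hat = q.astype(np.float32) * scale                       # dequantize
print("mean squared error:", float(np.mean((w - w_hat) ** 2)))
```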
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Updated May 18, 2024 - Go
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ HF models, and 20+ benchmarks.
Updated May 18, 2024 - Python
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside the recently released LLM data processing library datatrove and the LLM training library nanotron.
Updated May 18, 2024 - Python