evaluation
Here are 1,082 public repositories matching this topic...
(Windows/Linux) Local web UI for fine-tuning neural network models (currently LLMs only), written in Python with a Gradio interface
Updated May 19, 2024 - Python
🤖 Build AI applications with confidence ✅ Understand how your users are using your LLM app ✅ Get a full picture of the quality and performance of your LLM app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM app.
Updated May 19, 2024 - TypeScript
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Updated May 19, 2024 - TypeScript
Test your prompts, models, and RAG pipelines. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models, with CI/CD integration.
Updated May 19, 2024 - TypeScript
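Eval tools in this category typically express a prompt suite as test cases with assertions that can run in CI. A minimal sketch of that idea in Python, assuming a hypothetical call_model(prompt) helper wired to whatever provider you use (this is not any specific tool's API):

```python
# Minimal prompt-regression check, runnable under pytest in a CI job.
# call_model() is a hypothetical stand-in for your LLM provider client.
import pytest

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider")

TEST_CASES = [
    # (prompt, substring that must appear in the answer)
    ("Translate 'bonjour' to English.", "hello"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]

@pytest.mark.parametrize("prompt,expected", TEST_CASES)
def test_prompt_regression(prompt, expected):
    answer = call_model(prompt)
    assert expected.lower() in answer.lower(), f"missing {expected!r} in: {answer!r}"
```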
The production toolkit for LLMs. Observability, prompt management and evaluations.
Updated May 19, 2024 - TypeScript
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative grading modes, and more. It also lists available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.
Updated May 19, 2024 - Jupyter Notebook
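The absolute vs. relative grading mentioned above maps to two kinds of judge prompts: one that scores a single answer against a rubric, and one that compares two answers head to head. A rough sketch of how such prompts could be built, with judge(prompt) standing in for a call to the judge model (hypothetical, not PHUDGE's actual API):

```python
def judge(prompt: str) -> str:
    """Hypothetical call to a judge model (e.g. Phi-3); returns its raw text."""
    raise NotImplementedError

def absolute_grade(question, answer, rubric, reference=None):
    # Score one answer 1-5 against a rubric, optionally with a reference answer.
    ref = f"\nReference answer:\n{reference}" if reference else ""
    return judge(
        f"Rubric:\n{rubric}{ref}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Give brief feedback, then a score from 1 to 5 as 'Score: <n>'."
    )

def relative_grade(question, answer_a, answer_b):
    # Pick the better of two answers (pairwise / relative grading).
    return judge(
        f"Question:\n{question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        "Which answer is better? Reply with 'A' or 'B' and a short justification."
    )
```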
Unlock the potential of AI-driven solutions and delve into the world of Large Language Models. Explore cutting-edge concepts, real-world applications, and best practices to build powerful systems with these state-of-the-art models.
Updated May 19, 2024 - Jupyter Notebook
FuzzBench - Fuzzer benchmarking as a service.
Updated May 19, 2024 - Python
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Updated May 19, 2024
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Updated May 19, 2024 - Jupyter Notebook
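Automatic evaluators of this kind usually reduce instruction-following quality to pairwise preferences between a candidate model and a reference model, aggregated into a win rate. A small, self-contained sketch of that aggregation step (the preference labels themselves would come from a judge model):

```python
from collections import Counter

def win_rate(preferences):
    """preferences: list of 'candidate', 'reference', or 'tie' labels,
    one per instruction; ties count as half a win for each side."""
    counts = Counter(preferences)
    total = len(preferences)
    if total == 0:
        return 0.0
    return (counts["candidate"] + 0.5 * counts["tie"]) / total

# Example: 3 wins, 1 loss, 1 tie -> 0.7
print(win_rate(["candidate", "candidate", "reference", "tie", "candidate"]))
```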
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
Updated May 19, 2024 - Python
The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and the RAG pattern.
Updated May 19, 2024 - Python
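Independent of the Azure-specific tooling, RAG experiments generally need a retrieval metric to compare configurations. A minimal sketch of recall@k over retrieved chunk IDs (illustrative only, not the accelerator's own code):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant chunks that appear in the top-k retrieved chunks."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(set(relevant_ids))

# Example: 1 of 2 relevant chunks retrieved in the top 3 -> 0.5
print(recall_at_k(["c7", "c2", "c9"], ["c2", "c4"], k=3))
```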
Notebooks for evaluating LLM-based applications using the LLM-as-a-judge pattern.
Updated May 18, 2024 - Jupyter Notebook
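The LLM-as-a-judge pattern referenced above boils down to prompting a strong model to grade an application's output against a rubric and parsing a score out of its reply. A minimal sketch using the OpenAI Python SDK (the model name and rubric wording are placeholders, not taken from the notebooks):

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_score(question: str, answer: str) -> int:
    """Ask a judge model to rate an answer 1-5 and parse the score."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[
            {"role": "system",
             "content": "You grade answers for correctness and helpfulness. "
                        "End your reply with 'Score: <1-5>'."},
            {"role": "user",
             "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    text = resp.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)
    return int(match.group(1)) if match else 0
```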
A Python tool to evaluate the performance of VLMs in the medical domain.
Updated May 18, 2024 - Python
This is the official implementation of "LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models", and it is also an efficient LLM compression tool with various advanced compression methods, supporting multiple inference backends.
Updated May 18, 2024 - Python
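To make the "post-training quantization" part concrete: the core operation is mapping trained weights to low-bit integers with a scale factor, then checking how much reconstruction error that introduces. A toy per-channel int8 sketch in NumPy (illustrative, not the repo's method):

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a 2-D weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per row
    scale = np.where(scale == 0, 1.0, scale)               # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
w_hat = q.astype(np.float32) * scale                       # dequantize
print("mean squared error:", float(np.mean((w - w_hat) ** 2)))
```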
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Updated May 18, 2024 - Go
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ HF models, and 20+ benchmarks.
Updated May 18, 2024 - Python
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside the recently released LLM data processing library datatrove and the LLM training library nanotron.
Updated May 18, 2024 - Python