A curated list of awesome responsible machine learning resources.
Updated May 24, 2024
🐢 Open-Source Evaluation & Testing for LLMs and ML models
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Aligning AI With Shared Human Values (ICLR 2021)
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Open-source toolkit for building trustworthy LLM applications: TigerArmor (AI safety), TigerRAG (embeddings, retrieval-augmented generation), TigerTune (fine-tuning)
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
RuLES: a benchmark for evaluating rule-following in language models
Code accompanying the paper "Pretraining Language Models with Human Preferences"
LAMBDA is a model-based reinforcement learning agent that uses Bayesian world models for safe policy optimization
[ICCV 2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
An attack that induces hallucinations in LLMs
Feature Space Singularity for Out-of-Distribution Detection (SafeAI 2021)
📚 A curated list of papers & technical articles on AI Quality & Safety
A project that adds scalable, state-of-the-art out-of-distribution detection (open-set recognition) support by changing two lines of code. It performs efficient inference (no increase in inference time) and detection without a drop in classification accuracy, hyperparameter tuning, or collecting additional data.
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
Full code accompanying the sparse probing paper.
A curated list of awesome resources for getting started with, and staying current on, Artificial Intelligence Alignment research.
Scan your AI/ML models for problems before you put them into production.