T2T LLMs
Text-to-Text (T2T) large language models (LLMs) that power general-purpose chatbots, such as ChatGPT, have transformed how artificial intelligence is used, both in industry and by individuals. Challenges remain, however, in areas such as trust, safety, factual accuracy, and bias mitigation.
Though not comprehensive, the following list of research and white papers represents the work of hundreds of individuals and organizations working to make AI safer and more trustworthy for us all.
Hazard, Harm and/or Risk Taxonomies
Aug 2024: IBM, AI risk atlas
Jun 2024: Virtue AI, AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies
May 2024: OECD, Defining AI Incidents and Related Terms
Oct 2023: Google DeepMind, Sociotechnical Safety Evaluation of Generative AI Systems
Jan 2023: Microsoft, Types of harm
Dec 2021: Google DeepMind, Ethical and social risks of harm from Language Models
United Nations, Taxonomy of Human Rights Risks Connected to Generative AI
LLM Safety Benchmarks
Feb 2025: AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. A comprehensive industry-standard benchmark for assessing AI-product risk and reliability
Dec 2024: BEST-OF-N JAILBREAKING: A simple black-box algorithm that jailbreaks frontier AI systems across modalities
Dec 2024: AILuminate v1.0. A benchmark that analyzes a model's responses to prompts across twelve hazard categories to produce safety grades for LLMs used to power general-purpose chat systems like ChatGPT
Sep 2024: TRUSTLLM: TRUSTWORTHINESS IN LARGE LANGUAGE MODELS – A PRINCIPLE AND BENCHMARK. A comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, and an evaluation and analysis of trustworthiness for mainstream LLMs
Aug 2024: AIR-BENCH 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies
Jun 2024: SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Jun 2024: SALAD-Bench: A Comprehensive Safety Benchmark for Large Language Models
May 2024: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. A dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security
May 2024: Introducing v0.5 of the AI Safety Benchmark from MLCommons
Feb 2024: SIMPLESAFETYTESTS: A Test Suite for Identifying Critical Safety Risks in Large Language Models
Feb 2024: DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models. A comprehensive trustworthiness evaluation for large language models, considering diverse perspectives – including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness
Feb 2024: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Oct 2023: HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
Feb 2023: ALIGNING AI WITH SHARED HUMAN VALUES. Introduces the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality.
Mar 2022: BBQ: A Hand-Built Bias Benchmark for Question Answering
Sep 2020: REALTOXICITYPROMPTS: Evaluating Neural Toxic Degeneration in Language Models. A dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely used and commercially deployed toxicity detector (PERSPECTIVE API).
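Most of the safety benchmarks above are distributed as prompt datasets that are run against a candidate model and then scored by a classifier or judge. As a minimal illustration of that workflow, the sketch below loads RealToxicityPrompts and selects a challenging subset; it assumes the dataset is mirrored on the Hugging Face Hub as allenai/real-toxicity-prompts and that each record exposes a prompt dict with text and toxicity fields, so treat those identifiers as assumptions to verify.

    # Minimal sketch: load a safety benchmark dataset and pick a challenging subset.
    # The dataset ID and field names are assumptions about the Hugging Face Hub copy.
    from datasets import load_dataset

    ds = load_dataset("allenai/real-toxicity-prompts", split="train")

    # Keep prompts the original toxicity detector already scored as risky (> 0.5),
    # a common way of building a harder evaluation slice.
    challenging = ds.filter(
        lambda row: row["prompt"]["toxicity"] is not None
        and row["prompt"]["toxicity"] > 0.5
    )

    for row in challenging.select(range(3)):
        prompt_text = row["prompt"]["text"]
        # A real harness would generate a continuation with the model under test here
        # and score it with a toxicity classifier (e.g., the Perspective API).
        print(prompt_text)

In a full evaluation it is the model's continuations, not the prompts, that get scored; each benchmark paper listed above describes its own scoring protocol and judge.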
LLM Capability Benchmarks
Feb 2025: Humanity’s Last Exam. A multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage
Jan 2025: ARC-AGI: Abstraction and Reasoning Corpus for Artificial General Intelligence. A benchmark to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on
Sep 2024: EUREKA: Evaluating and Understanding Large Foundation Models
Mar 2024: Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (Paper)
Jun 2023: BIG-Bench: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
May 2023: Chatbot Arena (Leaderboard)
May 2022: TruthfulQA: Measuring How Models Mimic Human Falsehoods
Nov 2021: Measuring Mathematical Problem Solving With the MATH Dataset
Jul 2021: HumanEval: Evaluating Large Language Models Trained on Code (scored with the pass@k metric; a sketch of the estimator follows this list)
Jan 2021: MMLU (MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING)
Nov 2019: On the Measure of Intelligence. Introduces the Abstraction and Reasoning Corpus (ARC), a benchmark used to measure a human-like form of general fluid intelligence that enables fair general intelligence comparisons between AI systems and humans
May 2019: HellaSwag: Can a Machine Really Finish Your Sentence?
Apr 2019: DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs
Feb 2019: GLUE: A MULTI-TASK BENCHMARK AND ANALYSIS PLATFORM FOR NATURAL LANGUAGE UNDERSTANDING
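Several of the capability benchmarks above come with a simple scoring rule. For code benchmarks such as HumanEval, the standard metric is pass@k: generate n candidate solutions per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn candidates passes, i.e. pass@k = 1 - C(n-c, k)/C(n, k) averaged over problems. Below is a small sketch of the numerically stable estimator described in the HumanEval paper (the example numbers are illustrative):

    # Unbiased pass@k estimator for functional-correctness benchmarks like HumanEval.
    # n = samples generated per problem, c = samples that pass the tests, k = budget.
    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            # Fewer than k failing samples: any draw of k must include a passing one.
            return 1.0
        # 1 - C(n-c, k) / C(n, k), computed as a running product to avoid huge binomials.
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Illustrative example: 200 samples per problem, 12 of them pass the tests.
    print(pass_at_k(200, 12, 1))    # 0.06  (equals c / n when k = 1)
    print(pass_at_k(200, 12, 10))   # ≈ 0.47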
LLM Benchmark Aggregation
Aug 2024: PromptBench: A Unified Library for Evaluation of Large Language Models
Jun 2024: HELM Safety: Towards Standardized Safety Evaluations of Language Models. HELM Safety v1.0 is a collection of 5 safety benchmarks spanning 6 risk categories (violence, fraud, discrimination, sexual content, harassment, deception) and evaluates 24 prominent language models as an ongoing effort to standardize safety evaluations
Feb 2024: EleutherAI Evaluation Harness. A unified framework for testing generative language models on a large number of evaluation tasks, featuring over 60 standard academic benchmarks for LLMs with hundreds of subtasks and variants implemented
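As an illustration of how an aggregation library like the evaluation harness is typically driven, the sketch below runs two tasks against a small Hugging Face model. The lm_eval package name, the simple_evaluate entry point, and the task identifiers reflect the harness's v0.4-era Python API and should be treated as assumptions to check against the project's documentation.

    # Hedged sketch of the EleutherAI lm-evaluation-harness Python API (v0.4-era);
    # the entry point, argument names, and task IDs are assumptions to verify.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                              # Hugging Face backend
        model_args="pretrained=gpt2",            # any causal LM on the HF Hub
        tasks=["hellaswag", "truthfulqa_mc2"],   # task names from the harness registry
        num_fewshot=0,
        batch_size=8,
    )
    print(results["results"])                    # per-task metrics (accuracy, etc.)

The same tasks can usually be launched from the harness's command-line interface, which is how leaderboard-style evaluations are commonly scripted.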
Assessing AI LLM Benchmarks
Dec 2024: The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?
Nov 2024: BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices. An assessment framework covering 46 best practices across an AI benchmark’s lifecycle, applied to evaluate 24 AI benchmarks
Oct 2024: Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts. The training data for many LLMs is contaminated with benchmark test data, which means that public benchmarks used to assess LLMs are compromised. This paper introduces a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating that this retro-holdout dataset is statistically indistinguishable from the target dataset, and (iii) comparing LLMs on the two datasets to quantify the performance gap due to the dataset’s public availability.
Oct 2024: Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence. A study critically assessing 23 state-of-the-art LLM benchmarks
Jul 2024: Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless
Nov 2021: AI and the Everything in the Whole Wide World Benchmark. A position paper exploring the limits of benchmarks to reveal the construct validity issues in their framing as the functionally “general” broad measures of progress they are set up to be. Claims justified through benchmark datasets often extend far beyond the tasks they were initially designed for, and reach beyond even the initial ambitions for their development
Evaluating LLMs
Dec 2024: FLI AI Safety Index 2024
Oct 2024: To Err is AI : A Case Study Informing LLM Flaw Reporting Practices
Sep 2024: Lessons for Editors of AI Incidents from the AI Incident Database. A review of the AIID’s dataset of 750+ AI incidents and two independent taxonomies applied to these incidents to identify common challenges to indexing and analyzing AI incidents
Apr 2024: A.I. Has a Measurement Problem
Dec 2023: A Survey on Evaluation of Large Language Models. A survey aiming to offer insights to researchers in the field of LLM evaluation, thereby aiding the development of more proficient LLMs
Oct 2023: Challenges in evaluating AI systems
Jun 2023: Rethinking Model Evaluation as Narrowing the Socio-Technical Gap. Argues for grounding the development of evaluation methods in real-world socio-requirements and for embracing diverse evaluation methods
Nov 2020: Preventing Repeated Real World AI Failures by Cataloging Incidents: The AI Incident Database