In The Space
Research and white papers
There are hundreds of smart people focused on issues related to AI trust and safety. The following is just a sample of some of the great work being done.
Hazard, Harm and/or Risk Taxonomies
Aug 2024: IBM, AI risk atlas
Jun 2024: Virtue AI, AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies
May 2024: OECD, Defining AI Incidents and Related Terms
Oct 2023: Google DeepMind, Sociotechnical Safety Evaluation of Generative AI Systems
Jan 2023: Microsoft, Types of harm
Dec 2021: Google DeepMind, Ethical and social risks of harm from Language Models
United Nations, Taxonomy of Human Rights Risks Connected to Generative AI
Safety Benchmarks
Dec 2024: BEST-OF-N JAILBREAKING: A simple black-box algorithm that jailbreaks frontier AI systems across modalities (a minimal sketch of the approach appears after this list)
Dec 2024: AILuminate v1.0 A benchmark analyzing a model’s responses to prompts across twelve hazard categories to produce safety grades for LLMs used to power general-purpose chat systems like ChatGPT
Sep 2024: TRUSTLLM: TRUSTWORTHINESS IN LARGE LANGUAGE MODELS – A PRINCIPLE AND BENCHMARK A comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, and an evaluation and analysis of trustworthiness for mainstream LLMs
Aug 2024: AIR-BENCH 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies
Jun 2024: SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Jun 2024: SALAD-Bench: A Comprehensive Safety Benchmark for Large Language Models
May 2024: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning A dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security
May 2024: Introducing v0.5 of the AI Safety Benchmark from MLCommons
Feb 2024: SIMPLESAFETYTESTS: A Test Suite for Identifying Critical Safety Risks in Large Language Models
Feb 2024: DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models A comprehensive trustworthiness evaluation for large language models, considering diverse perspectives – including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness
Feb 2024: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Oct 2023: HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
Mar 2022: BBQ: A Hand-Built Bias Benchmark for Question Answering
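The Best-of-N Jailbreaking entry above describes the core idea well enough to sketch: keep applying cheap random augmentations to a prompt and resampling the target model until one attempt slips past its refusal behavior. The following is a minimal sketch of that loop, assuming hypothetical query_model and is_harmful callables standing in for a black-box model API and a response classifier; it illustrates the technique rather than reproducing the authors’ code.

```python
import random

def augment(prompt: str) -> str:
    """Apply cheap random text augmentations (flip character case, swap a
    few adjacent characters), in the spirit of Best-of-N Jailbreaking."""
    chars = [c.upper() if random.random() < 0.5 else c.lower() for c in prompt]
    if len(chars) > 1:
        for _ in range(max(1, len(chars) // 20)):
            i = random.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def best_of_n(prompt: str, query_model, is_harmful, n: int = 100):
    """Resample augmented prompts up to n times; return the first
    (augmented_prompt, response) pair flagged as harmful, else None.

    query_model and is_harmful are placeholders for a black-box model API
    and a safety classifier -- they are assumptions, not a specific library.
    """
    for _ in range(n):
        candidate = augment(prompt)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response
    return None
```

The attack’s strength comes purely from sampling breadth: no gradients or model internals are needed, which is why the paper calls it black-box and why defenses have to be robust to many superficially different rephrasings of the same request.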
Capability Benchmarks
Jan 2025: ARC-AGI: Abstract and Reasoning Corpus for Artificial General Intelligence A “benchmark” to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on
Sep 2024: EUREKA: Evaluating and Understanding Large Foundation Models
Mar 2024: Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (Paper)
Jun 2023: BIG-Bench: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
May 2023: Chatbot Arena (Leaderboard)
May 2022: TruthfulQA: Measuring How Models Mimic Human Falsehoods
Nov 2021: Measuring Mathematical Problem Solving With the MATH Dataset
Jul 2021: HumanEval: Evaluating Large Language Models Trained on Code
Jan 2021: MMLU (MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING)
Nov 2019: On the Measure of Intelligence Introduces the Abstraction and Reasoning Corpus (ARC), a benchmark for measuring a human-like form of general fluid intelligence that enables fair general-intelligence comparisons between AI systems and humans
May 2019: HellaSwag: Can a Machine Really Finish Your Sentence?
Apr 2019: DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs
Feb 2019: GLUE: A MULTI-TASK BENCHMARK AND ANALYSIS PLATFORM FOR NATURAL LANGUAGE UNDERSTANDING
Benchmark Aggregation
Aug 2024: PromptBench: A Unified Library for Evaluation of Large Language Models
Jun 2024: HELM Safety: Towards Standardized Safety Evaluations of Language Models HELM Safety v1.0 is a collection of 5 safety benchmarks spanning 6 risk categories (violence, fraud, discrimination, sexual content, harassment, deception); it evaluates 24 prominent language models as part of an ongoing effort to standardize safety evaluations
Feb 2024: Eleuther AI Evaluation Harness A unified framework for testing generative language models on a large number of evaluation tasks, featuring over 60 standard academic benchmarks for LLMs with hundreds of subtasks and variants implemented
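For readers who want to see what running one of these aggregated suites looks like in practice, here is a minimal usage sketch of the harness’s Python entry point. It assumes the lm-eval package is installed and that lm_eval.simple_evaluate with the Hugging Face ("hf") backend behaves as currently documented; the model name and task list are illustrative choices, not recommendations.

```python
# Minimal sketch: score a small Hugging Face model on two bundled benchmarks.
# Assumes `pip install lm-eval` and enough local memory for the chosen model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # illustrative small model
    tasks=["hellaswag", "truthfulqa_mc2"],           # two of the 60+ bundled tasks
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, etc.) are keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```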
Evaluating AI
Dec 2024: FLI AI Safety Index 2024
Dec 2024: The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?
Nov 2024: BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices An assessment framework that considers 46 best practices across an AI benchmark’s lifecycle and evaluates 24 AI benchmarks against them
Oct 2024: Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence A study to critically assess 23 state-of-the-art LLM benchmarks
Oct 2024: To Err is AI: A Case Study Informing LLM Flaw Reporting Practices
Jul 2024: Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless
Apr 2024: A.I. Has a Measurement Problem
Dec 2023: A Survey on Evaluation of Large Language Models A survey intended to offer insights to researchers working on LLM evaluation and thereby aid the development of more capable LLMs
Oct 2023: Challenges in evaluating AI systems
Nov 2021: AI and the Everything in the Whole Wide World Benchmark This position paper explores the limits of benchmarks to reveal the construct validity issues in their framing as the functionally “general”, broad measures of progress they are set up to be. Claims justified through benchmark datasets often extend far beyond the tasks the benchmarks were initially designed for, and reach beyond even the initial ambitions for their development
Nov 2020: Preventing Repeated Real World AI Failures by Cataloging Incidents: The AI Incident Database
Model Development
Bias
Examining Gender and Race Bias in Sentiment Analysis Systems
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction
Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models
The Woman Worked as a Babysitter: On Biases in Language Generation
General
Liquid Foundation Models: Our First Series of Generative AI Models: Could Liquid Foundation Models (LFMs) replace Transformer-based models?
Building Socio-culturally Inclusive Stereotype Resources with Community Engagement
All that Agrees Is Not Gold: Evaluating Ground Truth Labels and Dialogue Content for Safety
A Framework to Assess (Dis)agreement Among Diverse Rater Groups
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
An Insider’s Guide to Designing and Operationalizing a Responsible AI Governance Framework
THE HISTORY AND RISKS OF REINFORCEMENT LEARNING AND HUMAN FEEDBACK
Seminal Papers
Attention Is All You Need 2017 paper introducing the transformer architecture.
Gradient-Based Learning Applied to Document Recognition 1998 paper discussing how “Multilayer Neural Networks trained with the backpropagation algorithm constitute the best example of a successful Gradient-Based Learning technique.”
Learning representations by back-propagating errors 1986 paper popularizing back-propagation for efficiently training deep neural networks
In the Past
Anthropic makes the case for government regulation of catastrophic AI risks
Oct 2024
Anthropic, a leading AI safety and research company, has developed a Responsible Scaling Policy (RSP) that provides a credible roadmap for how governments can move forward with regulation of catastrophic AI risks, for example CBRNE (chemical, biological, radiological, nuclear and explosives) threats. Regulations based on an RSP-type document would be workable and could be passed in a reasonable timeframe. At the same time, however, we also need to encourage governments to adopt regulations around non-existential AI risks such as hate speech and bias (a comprehensive taxonomy of these types of risks can be found here).
U.S. Senate hearing: Subcommittee on Privacy, Technology and the Law
May 2023
At just over three hours, it takes a bit of time to go through, but this hearing demonstrates how the Senate is trying to understand the mistakes that were made with respect to the oversight of social media and how it is really trying to get it right with AI.
White House Involvement: Blueprint for an AI Bill of Rights
Oct 2022
“…the White House Office of Science and Technology Policy has identified five principles that should guide the design, use, and deployment of automated systems to protect the American public in the age of artificial intelligence.”
Congress needs to clarify section 230
Oct 2022
The Internet has changed enormously in the last 25 years; it is high time Congress put in the work to clarify Section 230 (part of the Communications Decency Act of 1996) instead of leaving it to the Supreme Court to try and interpret how the law should work.
The Pause: Future of Life Institute details AI oversight policies
March 2023
Whether or not you agree with the FLI open letter asking for “…all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4”, their policy recommendations provide a good set of “concrete recommendations for how we can manage AI risks.”
EU AI Act: EU moves forward with AI governance
April 2021
The European Union is making great progress on attempting to mitigate the risks of artificial intelligence. The Act isn’t perfect, but it’s a good first step and is definitely moving in the right direction. The link below provides the text of the act in 24 different languages.
We need to work together to tackle disinformation
Oct 2022
The New York Times has a great article on the challenges of combating disinformation/misinformation. One of the key takeaways of the article is that, because of the way information is shared, tackling the problem will require companies and organizations to work together to find a comprehensive solution.
EU Artificial Intelligence Act Approved
March 2024
This groundbreaking agreement by member States of the European Union isn’t perfect, but it provides guidance that is sorely needed to help ensure AI can be deployed safely and effectively. Passage “is just the beginning of a long road of further rulemaking, delegating and legalese wrangling.”
ASEAN releases Guide on AI Governance and Ethics
February 2024
This is a practical guide for organizations of the 10 member States of the Association of Southeast Asian Nations (ASEAN) in their use of AI. It “includes recommendations on national-level and regional-level initiatives that governments in the region can consider implementing to design, develop, and deploy AI systems responsibly.”
Implementation of President Biden’s Executive Order on AI Moving Forward
April 2024
It’s been 180 days since Biden’s EO 14110 on Safe, Secure, and Trustworthy Artificial Intelligence was issued, and NIST (the National Institute of Standards and Technology) is continuing to make progress with the release of four documents (AI RMF Generative AI Profile, Secure Software Development Practices for Generative AI and Dual-Use Foundation Models, Reducing Risks Posed by Synthetic Content, and A Plan for Global Engagement on AI Standards) focused on trying to make AI safer. These documents are drafts and NIST is soliciting public feedback.
MLCommons announces AI safety benchmark
April 2024
This benchmark is a proof-of-concept intended to show that it is possible to assess the risks posed by AI systems in a concrete way. The use case for this benchmark is “...text-to-text interactions with a general purpose AI chat model…” It is a great first step on the long road toward ensuring AI can be deployed safely and effectively.
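At a very high level, a hazard-category benchmark of this kind loops a fixed prompt set through the system under test and grades the responses per category. The sketch below is a hypothetical illustration of that shape, not the MLCommons implementation: chat_model, response_is_unsafe, and the category names are placeholder assumptions.

```python
# Hypothetical illustration of a hazard-category safety evaluation loop.
# chat_model and response_is_unsafe are stand-ins for a system under test
# and a safety grader; neither reflects the actual MLCommons pipeline.

def evaluate_safety(prompts_by_category, chat_model, response_is_unsafe):
    """Return the fraction of unsafe responses for each hazard category."""
    scores = {}
    for category, prompts in prompts_by_category.items():
        unsafe = sum(
            1 for prompt in prompts
            if response_is_unsafe(category, prompt, chat_model(prompt))
        )
        scores[category] = unsafe / len(prompts)
    return scores

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    prompts = {
        "violent_crimes": ["<hazard prompt 1>", "<hazard prompt 2>"],
        "hate": ["<hazard prompt 3>"],
    }
    refusal_model = lambda prompt: "I can't help with that."
    never_unsafe = lambda category, prompt, response: False
    print(evaluate_safety(prompts, refusal_model, never_unsafe))
```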