Agentic AI
Agentic systems are developing at a rapid pace, evolving from passive tools into proactive, decision-making entities that can plan, execute tasks, and adapt to changing environments. While promising, agentic AI also raises challenges around alignment, safety, and ethics.
The following is an effort to document the work being done in this area.
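The plan-execute-adapt loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the pattern common to many agentic systems, not any particular framework's API; `call_llm` and the tool registry are stand-ins.

```python
# Minimal sketch of the plan-act-observe loop underlying many agentic
# systems. `call_llm` and TOOLS are hypothetical stand-ins, not a real API.
from typing import Callable

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; a real agent would send `prompt`
    # to an LLM and parse its proposed action. Here it finishes at once.
    return "FINISH: done"

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"results for {query}",
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    history = f"Goal: {goal}"
    for _ in range(max_steps):
        action = call_llm(history)              # plan: model proposes a step
        if action.startswith("FINISH:"):
            return action.removeprefix("FINISH:").strip()
        name, _, arg = action.partition(" ")    # execute: dispatch to a tool
        observation = TOOLS.get(name, lambda a: "unknown tool")(arg)
        history += f"\nAction: {action}\nObservation: {observation}"  # adapt
    return "step budget exhausted"
```

Real systems differ mainly in how the planning step is prompted and how tool calls are structured, but this loop is the shared skeleton.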
Overview of agentic systems
Feb 2025: The AI Agent Index. A public database documenting agentic systems’ components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails).
Feb 2025: Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis. This study investigates the factors that make web AI agents more vulnerable than the standalone LLMs upon which they are built.
E2B (an open-source runtime for executing AI-generated code in secure cloud sandboxes, built for agentic and AI use cases) maintains a public list of AI agents, assistants, and apps.
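The core idea behind sandboxed execution of AI-generated code can be illustrated with a local sketch: run the code in a separate interpreter process with an isolated environment and a timeout. This is not E2B's API and offers far weaker isolation than a cloud sandbox; it only shows the general pattern.

```python
# Hedged sketch of running untrusted, AI-generated code in a subprocess.
# Services like E2B provide real isolation (cloud VMs/containers);
# this local version only demonstrates the pattern.
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 5.0) -> str:
    """Write `code` to a temp file, run it in isolated mode, return stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores user env
            capture_output=True,
            text=True,
            timeout=timeout_s,             # kill runaway code
        )
        return result.stdout
    finally:
        os.unlink(path)
```

A subprocess boundary plus a timeout guards against infinite loops and environment tampering, but not against network or filesystem access, which is why production agent runtimes use stronger isolation.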
Notable agentic systems
Operator: An agent from OpenAI that can use a browser to perform tasks
Mariner: Built with Gemini 2.0, Mariner combines strong multimodal understanding and reasoning capabilities to automate tasks using a browser.
Magentic-One: A Generalist Multi-Agent System from Microsoft for Solving Complex Tasks
Agentic Safety Benchmarks
Oct 2024: AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. A benchmark that aims to measure the propensity and ability of LLM agents to complete harmful tasks.
Agentic Capability Benchmarks
Feb 2025: Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents. A simulated environment designed to test an LLM-based agent’s ability to manage a straightforward, long-running business scenario: operating a vending machine.
Feb 2025: MLGym: A New Framework and Benchmark for Advancing AI Research Agents. Consists of 13 research tasks from domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task.
Dec 2024: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. An extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: browsing the web, writing code, running programs, and communicating with coworkers.
Jun 2024: WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. A benchmark of real-world tasks drawn from 15 popular websites, together with an automatic evaluation protocol for open-ended web agents.
Apr 2024: WebArena: A Realistic Web Environment for Building Autonomous Agents. A set of tasks focused on evaluating the functional correctness of task completions on the web. The tasks are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. The results show that current state-of-the-art large language models are far from perfect at these real-life tasks, and that WebArena can be used to measure progress.
Dec 2023: Mind2Web: Towards a Generalist Agent for the Web. A dataset for developing and evaluating generalist web agents that can follow language instructions to complete complex tasks on any website.