AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or keep error severity bounded.
Key contributions:
> 1. A formal taxonomy and metric suite: We translate qualitative safety-critical principles into computable metrics, enabling evaluation of agent reliability independently of task success (see the sketch after this list).
> 2. A comprehensive reliability profile of modern agents: A detailed mapping of where state-of-the-art agentic models succeed and fail, isolating consistency and predictability as the dimensions requiring the most immediate research focus.
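To make "computable reliability metrics" concrete, here is a minimal Python sketch of two such metrics, run-to-run consistency and worst-case error severity. These definitions are illustrative assumptions for this digest, not the paper's actual formulations.

```python
from collections import Counter
from typing import Sequence

def consistency(outcomes: Sequence[bool]) -> float:
    """Fraction of repeated runs that agree with the majority outcome.

    1.0 means the agent behaves identically across runs; values near
    0.5 mean success on this task is essentially a coin flip.
    (Illustrative definition, not the paper's.)
    """
    counts = Counter(outcomes)
    return counts.most_common(1)[0][1] / len(outcomes)

def worst_case_severity(severities: Sequence[float]) -> float:
    """Maximum observed error severity across runs (0.0 = harmless).

    An agent with bounded error severity keeps this below a fixed
    threshold even on runs where it fails.
    """
    return max(severities, default=0.0)

# Ten repeated runs of one task: 7 successes and 3 failures, with a
# hypothetical 0-1 severity score attached to each run.
outcomes = [True] * 7 + [False] * 3
severities = [0.0] * 7 + [0.2, 0.9, 0.4]

print(f"consistency: {consistency(outcomes):.2f}")                    # 0.70
print(f"worst-case severity: {worst_case_severity(severities):.2f}")  # 0.90
```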
A frozen-in-time version of our Paper Finder agent for reproducing evaluation results. This repo contains the code for the standalone Paper Finder agent, our paper-seeking agent intended to assist in locating sets of papers according to content-based and metadata criteria.
MCP-Universe is a comprehensive benchmark designed to evaluate LLMs on realistic tasks through interaction with real-world MCP servers, spanning 6 core domains and 231 tasks. It highlights the challenges of long-context reasoning, unfamiliar tool spaces, and cross-domain variation in LLM performance.
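For a sense of what interacting with an MCP server involves, here is a minimal client sketch using the official `mcp` Python SDK's stdio transport. The server command, tool name, and arguments are hypothetical placeholders, not part of the MCP-Universe benchmark itself.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical local MCP server launched over stdio; benchmark tasks
# would instead point at the domain-specific servers under test.
server_params = StdioServerParameters(command="python", args=["example_server.py"])

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the (possibly unfamiliar) tool space the server exposes.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Invoke one tool; the name and arguments are placeholders.
            result = await session.call_tool("search", arguments={"query": "example"})
            print(result.content)

asyncio.run(main())
```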
This article discusses methods to measure and improve the accuracy of Large Language Model (LLM) applications, focusing on building an SQL Agent where precision is crucial. It covers setting up the environment, creating a prototype, evaluating accuracy, and using techniques like self-reflection and retrieval-augmented generation (RAG) to enhance performance.
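The self-reflection technique mentioned above amounts to a retry loop that feeds execution errors back into the model's prompt. A minimal sketch, assuming a hypothetical `generate_sql` function standing in for the actual LLM call:

```python
import sqlite3

def generate_sql(question: str, feedback: str | None = None) -> str:
    """Placeholder for the LLM call; a real agent would prompt the model
    with the schema, the question, and any feedback from prior attempts."""
    raise NotImplementedError

def run_with_self_reflection(conn: sqlite3.Connection, question: str,
                             max_attempts: int = 3):
    """Execute generated SQL, feeding execution errors back to the model
    so it can correct itself on the next attempt."""
    feedback = None
    for _ in range(max_attempts):
        query = generate_sql(question, feedback)
        try:
            return conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            # Self-reflection: the error message becomes context for the retry.
            feedback = f"Query failed with: {exc}. Query was: {query}"
    raise RuntimeError(f"No valid query after {max_attempts} attempts")
```

RAG would slot into `generate_sql`, retrieving relevant schema snippets or example queries to ground the model before each attempt.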