Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Paper • 2606.19704 • Published 4 days ago • 30
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents Paper • 2606.12674 • Published 12 days ago • 5
Article Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic ibm-research • 21 days ago • 87
Article ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM ibm-research • 26 days ago • 17
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows Paper • 2605.24219 • Published 27 days ago • 9
Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines Paper • 2605.20630 • Published May 20 • 12
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules Paper • 2605.08614 • Published May 9 • 7
SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks Paper • 2605.14051 • Published May 13 • 1
Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge Paper • 2605.08518 • Published May 8 • 11
MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments Paper • 2605.09131 • Published May 9 • 59
When to Trust Imagination: Adaptive Action Execution for World Action Models Paper • 2605.06222 • Published May 7 • 44
IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance Paper • 2604.23446 • Published Apr 25 • 4
From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents Paper • 2603.22386 • Published Mar 23 • 57
SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search Paper • 2512.23167 • Published Dec 29, 2025 • 1
Article OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments +3 christian-washington, ajasuja, santosh-iima, lewtun, burtenshaw • Feb 12 • 35
Article Community Evals: Because we're done trusting black-box leaderboards over the community +5 burtenshaw, SaylorTwift, kramp, merve, davanstrien, nielsr, julien-c • Feb 4 • 90
Enterprise Agents and Benchmarks Collection Enterprise agent ecosystem featuring AssetOpsBench (industrial) and ITBench (SRE, FinOps, CISO), CUGA to accelerate AI Automation • 19 items • Updated 25 days ago • 17
Article AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality ibm-research • Jan 21 • 33
AI-Agent-4-Industry-4.0 Collection This category highlights the collective efforts of the AI Automation team in advancing Industry 4.0 applications and exploring innovations beyond it. • 6 items • Updated Oct 8, 2025 • 8
Granite Docling Collection Models for parsing complex PDFs and structured documents, designed to complement Docling. • 4 items • Updated Apr 29 • 64