Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators Paper • 2503.19877 • Published Mar 25, 2025 • 2
LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation Paper • 2505.23832 • Published May 28, 2025 • 1
Cognitive Foundations for Reasoning and Their Manifestation in LLMs Paper • 2511.16660 • Published Nov 20, 2025 • 11
Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics Paper • 2512.01020 • Published Nov 30, 2025 • 1
Entailment-Preserving First-order Logic Representations in Natural Language Entailment Paper • 2502.16757 • Published Feb 24, 2025
SymBa: Symbolic Backward Chaining for Structured Natural Language Reasoning Paper • 2402.12806 • Published Feb 5, 2025
ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces Paper • 2606.05402 • Published 11 days ago • 1
Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization Paper • 2605.26457 • Published 19 days ago • 6
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts Paper • 2606.02404 • Published 13 days ago • 56
A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls Paper • 2412.01340 • Published Dec 2, 2024
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists Paper • 2605.20668 • Published 25 days ago • 12
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation Paper • 2505.24456 • Published May 30, 2025
PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization Paper • 2507.10057 • Published Jul 14, 2025