Datasets: 29

Active filters: official

Benchmark datasets

Live leaderboards rank Hub models on evals like SWE-bench, AIME 2026 and HLE.

openai/gsm8k

Benchmark • Updated Mar 23 • 17.6k • 966k • 1.35k

Idavidrein/gpqa

Benchmark • Updated Mar 5 • 1.25k • 140k • 448

allenai/olmOCR-bench

Benchmark • Updated Feb 19 • 5.3k • 218

SWE-bench/SWE-bench_Verified

Benchmark • Updated Feb 27 • 500 • 74.7k • 93

harborframework/terminal-bench-2.0

Benchmark • Updated Apr 24 • 13k • 38

llamaindex/ParseBench

Benchmark • Updated Apr 19 • 169k • 50.3k • 88

ScaleAI/SWE-bench_Pro

Benchmark • Updated Feb 23 • 731 • 54.4k • 119

mercor/apex-agents

Benchmark • Updated Mar 3 • 480 • 25.8k • 127

TIGER-Lab/MMLU-Pro

Benchmark • Updated 29 days ago • 12.1k • 161k • 478

hf-audio/open-asr-leaderboard

Benchmark • Updated 4 days ago • 99k • 30k • 38

MathArena/aime_2026

Benchmark • Updated 15 days ago • 30 • 15k • 43

claw-eval/Claw-Eval

Benchmark • Updated 22 days ago • 4.79k • 27

cais/hle

Benchmark • Updated Jan 20 • 2.5k • 41.3k • 812

likaixin/ScreenSpot-Pro

Benchmark • Updated Mar 18 • 10.3k • 65

nvidia/compute-eval

Benchmark • Updated Apr 27 • 2.46k • 5.96k • 25

FutureMa/EvasionBench

Benchmark • Updated Feb 19 • 16.7k • 675 • 110

mteb/BRIGHT

Benchmark • Updated Apr 2 • 1.35M • 4.43k • 3

Delores-Lin/MDPBench

Benchmark • Updated Apr 26 • 8.73k • 20

mteb/arguana

Benchmark • Updated Apr 17 • 11.5k • 20.9k • 5

MMMU/MMMU_Pro

Benchmark • Updated about 18 hours ago • 5.19k • 41k • 58

LEXam-Benchmark/LEXam

Benchmark • Updated 9 days ago • 7.54k • 1.52k • 42

mercor/ACE

Benchmark • Updated Apr 13 • 592 • 226 • 5

mercor/APEX-v1-extended

Benchmark • Updated Apr 22 • 100 • 2.59k • 16

VLABench/vlabench_primitive_ft_lerobot_video

Benchmark • Updated Apr 23 • 575k • 4.85k • 1

tiiuae/PBench

Benchmark • Updated 19 days ago • 6.34k • 2.1k • 15

MathArena/hmmt_feb_2026

Benchmark • Updated 15 days ago • 33 • 3.89k • 4

collinear-ai/yc-bench

Benchmark • Updated Mar 23 • 149 • 18

internlm/WildClawBench

Benchmark • Updated 16 days ago • 9.88k • 59

MME-Benchmarks/Video-MME-v2

Benchmark • Updated 8 days ago • 3.2k • 5.23k • 41