
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Dongjun Kim, Gyuho Shim, Yong Chan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim

EMNLP 2025 Main

[Figure: Benchmark Profiling paper figure]

Abstract

Large Language Models are commonly judged by their benchmark scores, yet those scores often mask the mixture of abilities a benchmark actually requires. We introduce BENCHMARK PROFILING, a diagnostic framework that decomposes benchmark performance into cognitively grounded abilities. It combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS). The analysis explains why performance gains do not always translate to user-perceived competence, and it supports benchmark auditing and interpretability.
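To make the pipeline in the abstract concrete, here is a minimal toy sketch of the general idea: score parameters by gradient-based importance on ability-specific data, ablate the top-scoring ones, and measure the effect on a benchmark. Everything in it is an assumption made for illustration rather than the paper's method: the four-weight linear classifier, the logistic loss, the weight-times-gradient importance rule, the relative-accuracy-drop formula standing in for AIS, and all names (`importance`, `ablate`, `ability_impact`, the abilities "A" and "B").

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    # Linear score of a toy classifier (stand-in for an LLM).
    return sum(wi * xi for wi, xi in zip(w, x))

def importance(w, examples):
    # Assumed importance rule: sum over examples of |w_i * dL/dw_i|,
    # where L = log(1 + exp(-y * w.x)) is the logistic loss.
    imp = [0.0] * len(w)
    for x, y in examples:
        s = sigmoid(-y * score(w, x))
        for i, xi in enumerate(x):
            grad_i = -y * xi * s
            imp[i] += abs(w[i] * grad_i)
    return imp

def top_k(imp, k):
    # Indices of the k most important parameters.
    return sorted(range(len(imp)), key=lambda i: imp[i], reverse=True)[:k]

def ablate(w, idx):
    # Targeted ablation: zero out the selected parameters.
    return [0.0 if i in idx else wi for i, wi in enumerate(w)]

def accuracy(w, examples):
    # Predict +1 when the score is positive, otherwise -1.
    correct = sum((1 if score(w, x) > 0 else -1) == y for x, y in examples)
    return correct / len(examples)

def ability_impact(w, ability_data, benchmark, k):
    # Hypothetical stand-in for AIS: relative benchmark accuracy drop
    # after ablating each ability's top-k parameters.
    base = accuracy(w, benchmark)
    result = {}
    for name, examples in ability_data.items():
        idx = set(top_k(importance(w, examples), k))
        result[name] = (base - accuracy(ablate(w, idx), benchmark)) / base
    return result

# Toy setup: ability "A" exercises weights 0-1, ability "B" weights 2-3.
W = [1.5, -1.0, 2.0, 0.5]
ABILITIES = {
    "A": [([1.0, 0.5, 0.0, 0.0], 1), ([-1.0, 1.0, 0.0, 0.0], -1)],
    "B": [([0.0, 0.0, 1.0, 1.0], 1), ([0.0, 0.0, -1.0, 0.5], -1)],
}
# This toy benchmark only depends on the weights used by ability "A".
BENCHMARK = [
    ([1.0, 0.0, 0.0, 0.0], 1),
    ([0.0, 1.0, 0.0, 0.0], -1),
    ([1.0, 1.0, 0.0, 0.0], 1),
    ([-1.0, 0.0, 0.0, 0.0], -1),
]

ais = ability_impact(W, ABILITIES, BENCHMARK, k=2)
print(ais)
```

In this toy setup, ablating the parameters tied to ability "A" halves benchmark accuracy, while ablating those tied to "B" leaves it unchanged, so the benchmark's profile is dominated by "A". A real profile would apply the same loop to an actual model and to ability-specific datasets.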