Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
EMNLP 2025 Main — arXiv — Code
Abstract
Large Language Models are commonly judged by benchmark scores, yet a single score often masks the mixture of abilities a benchmark actually requires. We introduce BENCHMARK PROFILING, a diagnostic framework that decomposes benchmark performance into cognitively grounded abilities. It combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS). The analysis explains why performance gains do not always translate into user-perceived competence, and it supports benchmark auditing and interpretability.
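To make the pipeline in the abstract concrete, here is a minimal sketch on a toy logistic-regression "model". It is not the paper's implementation. The abstract does not give the AIS formula, so the definition used here (relative accuracy drop on a benchmark after ablating the parameters most important for one ability) is an assumption, as is the first-order importance score |w · ∂L/∂w|. All function names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accuracy(w, X, y):
    # Fraction of examples whose thresholded prediction matches the 0/1 label.
    preds = (sigmoid(X @ w) > 0.5).astype(int)
    return float(np.mean(preds == y))

def importance(w, X, y):
    # Assumed importance score: |w * dL/dw| for mean binary cross-entropy,
    # whose gradient for logistic regression is X^T (p - y) / n.
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)
    return np.abs(w * grad)

def ablate_top_k(w, scores, k):
    # Zero out the k parameters with the highest importance scores.
    idx = np.argsort(scores)[-k:]
    ablated = w.copy()
    ablated[idx] = 0.0
    return ablated

def ability_impact(w, ability_data, bench_data, k=1):
    # Hypothetical AIS: relative benchmark accuracy lost when the parameters
    # most important for one ability are ablated.
    Xa, ya = ability_data
    Xb, yb = bench_data
    base = accuracy(w, Xb, yb)
    ablated = accuracy(ablate_top_k(w, importance(w, Xa, ya), k), Xb, yb)
    return (base - ablated) / base if base > 0 else 0.0

# Toy setup: parameter 0 serves one "ability", parameter 3 serves another.
w = np.array([2.0, 0.0, 0.0, 2.0])
ability = (np.array([[1.0, 0, 0, 0], [-1.0, 0, 0, 0]]), np.array([1, 0]))
bench_dim0 = (np.array([[1.0, 0, 0, 0], [-1.0, 0, 0, 0],
                        [2.0, 0, 0, 0], [-2.0, 0, 0, 0]]),
              np.array([1, 0, 1, 0]))
bench_dim3 = (np.array([[0, 0, 0, 1.0], [0, 0, 0, -1.0]]), np.array([1, 0]))

ais_dim0_bench = ability_impact(w, ability, bench_dim0)
ais_dim3_bench = ability_impact(w, ability, bench_dim3)
print(ais_dim0_bench, ais_dim3_bench)
```

In this toy setup, the benchmark that relies on the ablated parameter loses accuracy, while the benchmark that relies on a different parameter is unaffected. That contrast is the kind of signal a per-ability impact score is meant to surface. A real study would ablate parameters of an actual LLM and would likely compare against a control ablation.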