Papers
arxiv:2512.07647

A Mathematical Theory of Top-k Sparse Attention via Total Variation Distance

Published on Dec 8, 2025
Authors:
,
,
,

Abstract

A mathematical framework is established for certifying Top-k attention truncation errors in transformer models, providing precise bounds on approximation errors and showing significant key reduction potential.

We develop a unified mathematical framework for certified Top-k attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution P and its Top-k truncation hat P, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies TV(P,hat P)=1-e^{-KL(hat PVert P)}, yielding sharp Top-k-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds -- from a single boundary gap through multi-gap and blockwise variants -- that control TV(P,hat P) using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as |Attn(q,K,V)-Attn_k(q,K,V)|_2=τ|μ_{tail}-μ_{head}|_2 with τ=TV(P,hat P), yielding a new head-tail diameter bound |Attn(q,K,V)-Attn_k(q,K,V)|_2leτ,diam_{H,T} and refinements linking the error to Var_P(V). Under an i.i.d. Gaussian score model s_isimmathcal N(μ,σ^2) we derive closed-form tail masses and an asymptotic rule for the minimal k_varepsilon ensuring TV(P,hat P)levarepsilon, namely k_varepsilon/napproxΦ_c(σ+Φ^{-1}(varepsilon)). Experiments on bert-base-uncased and synthetic logits confirm the predicted scaling of k_varepsilon/n and show that certified Top-k can reduce scored keys by 2-4times on average while meeting the prescribed total-variation budget.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2512.07647
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2512.07647 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2512.07647 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2512.07647 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.