Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation

158 points · fheinsen · 13 hours ago

arxiv.org

thomasahle12 hours ago
There's a graveyard of 100s of papers with "approximate near linear time attention."
They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.
Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214
amluto11 hours ago
I skimmed the paper, and I think I completely lost the plot.
Sections 2.1 through 2.4 talk about the decomposing the per-token-pair attention (key vector from the ith token with query vector from the jth token, where, in inference, the jth token is the one being sampled) into an approximation that is only mildly outrageously exponential in size compared to the original exponential-of-a-dot product. And they get something that's a polynomial (in the mathematical sense -- you're literally evaluating a polynomial) and has a size that's manageable at 4th order.
Okay, great, they took something simple and made it bigger and nastier but less transcendental without losing too much precision. (As far as I know, there is really nothing special about the exp in attention in the first place, so trying to approximate it well seems mostly useful insofar as it will keep existing models working.)
But the reason that attention is quadratic is that each token gets evaluated with respect to each other token. They haven't changed this at all. Section 2.5 seems like it's deferring this to an appendix. Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).
So what did the paper gain? Is there some detail that I missed or that the paper completely glossed over that explains why there is any gain of efficiency at all?
For what it's worth, the paper's overall claim is, in some sense, impossible. You can think of attention as being a sort of vector database, and this gets more accurate the sharper you make the exponential. If you replace softmax with actual max, a query locates the key that is the closest match to the query and returns the associated value. This operation is a plain linear search, it's possible (in principle anyway) to do lots of queries and recover the entire contents of the database, and I think that any paper claiming to do it faster than linear time should explain how it's compressing the data and where the loss is.
In language model terms, imagine an prompt like so:
```
    1: [string 1]
    2: [string 2]
    3: [string 3]
    ...
    n: [string n]
    
    Tell me the string associated with the number k.
```
As long as there's enough precision and enough query/key space to fit some embedding of the number k that will match the right thing (and there is a lot of room in high-dimensional spaces), one might expect a transformer to be able to answer this question. But this obviously requires memory with size linear in the prompt length. If you try to get rid of that, you necessarily lose something. (This is not to say that nice attention scaling is impossible -- one could imagine schemes where it takes the model multiple tokens to answer the question, and the number of tokens needed could scale, say, logarithmically with prompt size. But you still need that linear memory.)
riemannzeta11 hours ago
Neat result. The symmetry exploitation here reminds me of recent work connecting neural network training dynamics to renormalization group theory. Charles Martin's SETOL paper https://arxiv.org/abs/2507.17912 shows that well-trained layers converge to something like an RG fixed point—the eigenvalue spectrum of the weight matrix develops power-law tails with exponent α ≈ 2, which is the signature of scale invariance. At this fixed point, the "effective correlation space" is low-dimensional: you can truncate the SVD aggressively and recover nearly identical test accuracy.
I wonder if there's a connection to your Taylor truncation order. In RG terms, higher-order polynomial interactions are "irrelevant operators"—they get suppressed as you flow toward the fixed point. If trained attention heads are sitting near this fixed point, that might explain why modest truncation orders work: the network has already learned to concentrate its computation in the lower-order terms. A testable prediction: layers with α closer to 2 (measurable via weightwatcher https://github.com/CalculatedContent/WeightWatcher) might need fewer Taylor terms for accurate approximation than layers with α far from 2. If true, you could potentially use the spectral statistics to adaptively choose truncation order per-head.
bluecoconut12 hours ago
I almost feel like this goes opposite to what attention is good at. This would be good at approximating all the places where attention is low / not sharp. Where attention/the exponential is key is when it selects out / needle-in-haystack / winner-takes-all focus (the word "attention" itself sorta implies this), and this is where the taylor expression would fail to represent the values well. This just... softens attentions ability to attend?
(I'm imagining that if in the context there's ~4-8 "similar" attention-targets that should be sharp, and regular attention learns to select the correct one, this taylor approximation version would wash out any difference and they'd all loosly be attended to, and it'd fail to isolate the correct signal)
Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.
But also, maybe this combined with some other selective (sparse attention) trick, means that the hybrid model gets the "fuzzy long tail" of attention well represented as well as the sharpness well represented, and all together it could actually be a part of the larger solution.
Kubuxu8 hours ago
A paper on the same topic: On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective, Gabriel Mongaras, Eric C. Larson, https://arxiv.org/abs/2507.23632
Video presentation if someone prefers it: https://www.youtube.com/watch?v=PN3nYBowSvM
Linear attention is a first-degree approximation of Softmax attention, and model performance gets better as you increase the degree of the Taylor approximation.
I'm thinking about adapting an existing model to Taylor-approximated attention. I think it should be possible with some model surgery and rehabilitation training.
abeppu12 hours ago
I haven't tried to follow the math closely but should there not be some concern about the region of convergence? It looks like they don't specifically discuss it. Or is there some reason this isn't a problem in this context?
alyxya12 hours ago
The best and proven linear attention is the Gated DeltaNet or variations of it, used by Kimi and Qwen. Anyone who thinks linear attention can't work is forgetting that models are a fixed size so attention should always be compressable to be linear. Another way to think of the feasibility of linear attention is that the standard attention mechanism can be made linear simply by removing the softmax so the kv cache stores the kv product as a constant size matrix instead. Softmax just normalizes attention, but it's not theoretically required.
mapontosevenths12 hours ago
This uses the Taylor approximation to approximate softmax, but that IS only an approximation. I wonder exactly how much that trade-off costs in terms of accuracy vs performance? I note that they say it's close to float16 with four Taylor terms.
My other concern would be that Taylor itself is fairly complex. I wonder how well GPU's handle this in comparison to good old fashioned softmax? The last time I used Taylor with a custom Triton kernel it was still very slow. That could just have been my own jank vibe-coded implementation though.
spacewhales12 hours ago
Github here: https://github.com/glassroom/sata_attention
observationist12 hours ago
This could turbocharge ByT5 and other tokenless architectures, whose big downside was the increase in compute over longer sequences. It's easy to imagine a bunch of strategies with variable levels of "focus" and so on with a fixed compute budget assigned on the fly with learned optimizers informing the distribution.
yanosh_kunsh12 hours ago
So does that mean that LLM inference could go down significantly in price and/or context length would dramatically increase?
physicsguy8 hours ago
With this, they've not provided an upper bound the error on the kernel expanded with N terms which I think is a big missing piece.
NedCode11 hours ago
Reference implementation: https://github.com/glassroom/sata_attention
andes31412 hours ago
Linear time attention doesn’t work, by principle. Dead end pursuit. Much great research on more efficient quadratic time inference
rvz12 hours ago
> Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
Now this is a very interesting paper, which hopefully should address the chronic inefficiencies of the AI lack of efficient methods and approaches in reducing their significant computational and energy demands which are off the charts.
> These factors penalize performance relative to what a fused, hardware-optimized implementation could achieve, and the reported runtime results should therefore be interpreted conservatively.
It's still early with several limitations, but the need for wasting billions on GPUs will begin to not make any sense soon.

Loading comments...

news.ycombinator.com/item?id=46886265

thomasahle12 hours ago
There's a graveyard of 100s of papers with "approximate near linear time attention."
They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.
Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214
amluto11 hours ago
I skimmed the paper, and I think I completely lost the plot.
Sections 2.1 through 2.4 talk about the decomposing the per-token-pair attention (key vector from the ith token with query vector from the jth token, where, in inference, the jth token is the one being sampled) into an approximation that is only mildly outrageously exponential in size compared to the original exponential-of-a-dot product. And they get something that's a polynomial (in the mathematical sense -- you're literally evaluating a polynomial) and has a size that's manageable at 4th order.
Okay, great, they took something simple and made it bigger and nastier but less transcendental without losing too much precision. (As far as I know, there is really nothing special about the exp in attention in the first place, so trying to approximate it well seems mostly useful insofar as it will keep existing models working.)
But the reason that attention is quadratic is that each token gets evaluated with respect to each other token. They haven't changed this at all. Section 2.5 seems like it's deferring this to an appendix. Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).
So what did the paper gain? Is there some detail that I missed or that the paper completely glossed over that explains why there is any gain of efficiency at all?
For what it's worth, the paper's overall claim is, in some sense, impossible. You can think of attention as being a sort of vector database, and this gets more accurate the sharper you make the exponential. If you replace softmax with actual max, a query locates the key that is the closest match to the query and returns the associated value. This operation is a plain linear search, it's possible (in principle anyway) to do lots of queries and recover the entire contents of the database, and I think that any paper claiming to do it faster than linear time should explain how it's compressing the data and where the loss is.
In language model terms, imagine an prompt like so:
```
    1: [string 1]
    2: [string 2]
    3: [string 3]
    ...
    n: [string n]
    
    Tell me the string associated with the number k.
```
As long as there's enough precision and enough query/key space to fit some embedding of the number k that will match the right thing (and there is a lot of room in high-dimensional spaces), one might expect a transformer to be able to answer this question. But this obviously requires memory with size linear in the prompt length. If you try to get rid of that, you necessarily lose something. (This is not to say that nice attention scaling is impossible -- one could imagine schemes where it takes the model multiple tokens to answer the question, and the number of tokens needed could scale, say, logarithmically with prompt size. But you still need that linear memory.)
riemannzeta11 hours ago
Neat result. The symmetry exploitation here reminds me of recent work connecting neural network training dynamics to renormalization group theory. Charles Martin's SETOL paper https://arxiv.org/abs/2507.17912 shows that well-trained layers converge to something like an RG fixed point—the eigenvalue spectrum of the weight matrix develops power-law tails with exponent α ≈ 2, which is the signature of scale invariance. At this fixed point, the "effective correlation space" is low-dimensional: you can truncate the SVD aggressively and recover nearly identical test accuracy.
I wonder if there's a connection to your Taylor truncation order. In RG terms, higher-order polynomial interactions are "irrelevant operators"—they get suppressed as you flow toward the fixed point. If trained attention heads are sitting near this fixed point, that might explain why modest truncation orders work: the network has already learned to concentrate its computation in the lower-order terms. A testable prediction: layers with α closer to 2 (measurable via weightwatcher https://github.com/CalculatedContent/WeightWatcher) might need fewer Taylor terms for accurate approximation than layers with α far from 2. If true, you could potentially use the spectral statistics to adaptively choose truncation order per-head.
bluecoconut12 hours ago
I almost feel like this goes opposite to what attention is good at. This would be good at approximating all the places where attention is low / not sharp. Where attention/the exponential is key is when it selects out / needle-in-haystack / winner-takes-all focus (the word "attention" itself sorta implies this), and this is where the taylor expression would fail to represent the values well. This just... softens attentions ability to attend?
(I'm imagining that if in the context there's ~4-8 "similar" attention-targets that should be sharp, and regular attention learns to select the correct one, this taylor approximation version would wash out any difference and they'd all loosly be attended to, and it'd fail to isolate the correct signal)
Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.
But also, maybe this combined with some other selective (sparse attention) trick, means that the hybrid model gets the "fuzzy long tail" of attention well represented as well as the sharpness well represented, and all together it could actually be a part of the larger solution.
Kubuxu8 hours ago
A paper on the same topic: On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective, Gabriel Mongaras, Eric C. Larson, https://arxiv.org/abs/2507.23632
Video presentation if someone prefers it: https://www.youtube.com/watch?v=PN3nYBowSvM
Linear attention is a first-degree approximation of Softmax attention, and model performance gets better as you increase the degree of the Taylor approximation.
I'm thinking about adapting an existing model to Taylor-approximated attention. I think it should be possible with some model surgery and rehabilitation training.
abeppu12 hours ago
I haven't tried to follow the math closely but should there not be some concern about the region of convergence? It looks like they don't specifically discuss it. Or is there some reason this isn't a problem in this context?
alyxya12 hours ago
The best and proven linear attention is the Gated DeltaNet or variations of it, used by Kimi and Qwen. Anyone who thinks linear attention can't work is forgetting that models are a fixed size so attention should always be compressable to be linear. Another way to think of the feasibility of linear attention is that the standard attention mechanism can be made linear simply by removing the softmax so the kv cache stores the kv product as a constant size matrix instead. Softmax just normalizes attention, but it's not theoretically required.
mapontosevenths12 hours ago
This uses the Taylor approximation to approximate softmax, but that IS only an approximation. I wonder exactly how much that trade-off costs in terms of accuracy vs performance? I note that they say it's close to float16 with four Taylor terms.
My other concern would be that Taylor itself is fairly complex. I wonder how well GPU's handle this in comparison to good old fashioned softmax? The last time I used Taylor with a custom Triton kernel it was still very slow. That could just have been my own jank vibe-coded implementation though.
spacewhales12 hours ago
Github here: https://github.com/glassroom/sata_attention
observationist12 hours ago
This could turbocharge ByT5 and other tokenless architectures, whose big downside was the increase in compute over longer sequences. It's easy to imagine a bunch of strategies with variable levels of "focus" and so on with a fixed compute budget assigned on the fly with learned optimizers informing the distribution.
yanosh_kunsh12 hours ago
So does that mean that LLM inference could go down significantly in price and/or context length would dramatically increase?
physicsguy8 hours ago
With this, they've not provided an upper bound the error on the kernel expanded with N terms which I think is a big missing piece.
NedCode11 hours ago
Reference implementation: https://github.com/glassroom/sata_attention
andes31412 hours ago
Linear time attention doesn’t work, by principle. Dead end pursuit. Much great research on more efficient quadratic time inference
rvz12 hours ago
> Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
Now this is a very interesting paper, which hopefully should address the chronic inefficiencies of the AI lack of efficient methods and approaches in reducing their significant computational and energy demands which are off the charts.
> These factors penalize performance relative to what a fused, hardware-optimized implementation could achieve, and the reported runtime results should therefore be interpreted conservatively.
It's still early with several limitations, but the need for wasting billions on GPUs will begin to not make any sense soon.