How Attention Works in Transformers

If there is one idea responsible for the modern AI era, it is attention. The 2017 paper that introduced the Transformer architecture — built around a mechanism called self-attention — did not just improve language models. It eventually made them capable enough to chat, write code, summarize documents, and translate between languages at a level that earlier approaches never reached. Nearly every headline-grabbing model since then traces its lineage back to that single idea.

But "attention" is a slippery word. It gets used to mean several different things, and explanations often jump straight to matrices and softmax before explaining why anyone should care. This article takes the opposite approach: we start with the intuition, build up to what the model actually does, and then look at why the idea was so powerful — and where it still costs you.

The problem attention was invented to solve

Before Transformers, the dominant way to model language was with recurrent networks. These read text one word at a time, left to right, carrying a hidden state forward like a running summary. To understand a sentence, the model folded each new word into that summary as it arrived.

This worked, but it had a structural weakness: distance. By the time a recurrent network reached the end of a long sentence, the early words had been compressed and re-compressed through many steps. Information from the beginning simply degraded. The model struggled to connect a pronoun at the end of a paragraph with the noun it referred to at the beginning.

There was a second problem, too: recurrent models were inherently sequential. Word n could not be processed until word n-1 was done. That made them hard to parallelize on modern hardware, which capped how large and how fast they could get.
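Both weaknesses can be seen in a deliberately tiny toy model. The sketch below is not a real recurrent network: it assumes a scalar hidden state and a linear update with a hypothetical `decay` factor, chosen only to make the arithmetic visible. Real recurrent networks use vectors, nonlinearities, and often gating.

```python
def run_recurrence(inputs, decay=0.9):
    """Toy linear recurrence: h_t = decay * h_(t-1) + x_t (an assumed, simplified update)."""
    hidden = 0.0
    for x in inputs:  # each step needs the previous hidden state, so the loop is sequential
        hidden = decay * hidden + x
    return hidden


def first_input_share(length, decay=0.9):
    """How much of the first input survives in the final state of this toy model."""
    return decay ** (length - 1)


for length in (5, 50, 500):
    print(length, first_input_share(length))
```

In this toy setting the first word's contribution shrinks geometrically with sentence length, and the loop cannot be split across processors because every step waits on the one before it.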

Attention was proposed as a way to let the model look at all the words at once and decide for itself which ones matter for the word it is currently trying to understand, no matter how far apart they sit. The Transformer then dropped recurrence altogether and relied on attention alone. Both problems, distance and sequentiality, dissolved in one move.

The intuition: meaning is borrowed, not stored

Here is the key shift in perspective. We tend to think of a word as having a fixed meaning. "Bank" means bank. But the meaning of almost any word is actually shaped by the words around it. Consider:

"I deposited cash at the bank." → a financial institution.
"I sat by the bank of the river." → the edge of a waterway.

The word "bank" is identical in both sentences. What changes is its meaning, and that meaning is borrowed from context — "deposited" and "cash" pull it toward finance, "river" pulls it toward water.

This is exactly what self-attention models. When the model processes the word "bank," it does not settle on a single fixed definition. Instead, it looks at every other word in the sentence and asks, in effect: how relevant is each of you to understanding me right now? It then blends its own representation with the representations of the most relevant words, weighted by how strongly it judged each one. The diagram below captures this.

[Figure: diagram of self-attention. The word "bank" draws strong attention to the words "cash" and "river," medium attention to "deposited," and weak attention to filler words, letting the model resolve the correct meaning.]

The thickness of each line represents how much one word's meaning borrows from another. Notice that "bank" attends most to "cash" and "river" (the words that disambiguate it in the two example sentences) and barely at all to grammatical filler like "at" and "the." The model learned to do this weighting on its own, from data.
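The blending step can be made concrete with a few hand-picked numbers. Everything in the sketch below is an assumption made for illustration: the two-dimensional vectors (one axis for "finance," one for "water") and the attention weights are invented, not taken from any trained model.

```python
def blend(weights, vectors):
    """Weighted average of word vectors: the 'borrowing' step, with the weights given."""
    dims = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dims)]


# Hypothetical vectors in the form [finance, water] for "I deposited cash at the bank."
words = ["I", "deposited", "cash", "at", "the", "bank"]
vectors = [
    [0.0, 0.0],  # I
    [0.8, 0.0],  # deposited
    [1.0, 0.0],  # cash
    [0.0, 0.0],  # at
    [0.0, 0.0],  # the
    [0.5, 0.5],  # bank: ambiguous before it looks at its context
]

# Hypothetical attention weights for "bank" (they sum to one).
weights = [0.02, 0.25, 0.40, 0.01, 0.02, 0.30]

bank_in_context = blend(weights, vectors)
print([round(x, 2) for x in bank_in_context])
```

With these made-up numbers, the blended vector for "bank" leans heavily toward the finance axis, which is the effect the figure describes.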

What the model actually computes

For each word, attention performs a small, repeated ceremony. The word's vector is passed through three learned projections to produce three different vectors, called the query, the key, and the value. The naming comes from an analogy with a database:

The query is what the current word is "looking for."
The keys are what every word (including itself) is "offering."
The values are what each word actually contributes if selected.

To decide how much the current word should attend to some other word, the model compares the current word's query against the other word's key, typically with a dot product scaled by the square root of the vector dimension. A good match produces a high score; a poor match produces a low one. These scores are passed through a softmax, which makes them positive and forces them to sum to one, and then they are used as weights to combine the values. The output is a new, context-aware representation of the current word, one that has already absorbed relevant information from the rest of the sentence.
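The compare, normalize, and combine steps can be sketched for a single word as follows. This is a minimal illustration that assumes the standard Transformer formulation (dot-product scores divided by the square root of the vector dimension, normalized with a softmax). The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Attention for one query: score each key, normalize, then blend the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dims = len(values[0])
    output = [sum(w * val[i] for w, val in zip(weights, values)) for i in range(dims)]
    return weights, output


# Toy vectors: the first key matches the query, the second is unrelated, the third opposes it.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The best-matching key receives the largest weight, so its value dominates the output, while the poorly matching keys still contribute a little.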

The clever part is that the projections producing the query, key, and value are all learned. Nobody tells the model what to look for; it discovers useful patterns of "looking" through training. Over millions of examples, it learns that pronouns should attend to their antecedents, that verbs should attend to their subjects and objects, and countless subtler regularities that no one could hand-code.

This whole computation happens for every word simultaneously, which is why Transformers parallelize so well — and why they scaled past recurrent networks so decisively.
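To show what "every word simultaneously" looks like, the sketch below writes self-attention as a handful of matrix products over the whole sentence. It is an illustration under stated assumptions: the matrices `w_q`, `w_k`, and `w_v` are random stand-ins for weights a real model would learn, and the pure-Python `matmul` helper is there only to keep the example dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """All tokens at once: project, score every pair, normalize each row, blend values."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_transposed)]
    weights = [softmax(row) for row in scores]  # one row of weights per token
    return weights, matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (an assumption for this sketch)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
tokens = random_matrix(4, 3, rng)  # 4 tokens, each a 3-dimensional vector
w_q, w_k, w_v = (random_matrix(3, 2, rng) for _ in range(3))

weights, outputs = self_attention(tokens, w_q, w_k, w_v)
print(len(weights), len(weights[0]))  # 4 4
print(len(outputs), len(outputs[0]))  # 4 2
```

No row of the weight matrix depends on any other row, which is why the work for different tokens can run in parallel.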

Multi-head attention: looking in many ways at once

A single attention computation produces just one set of weights per word, so it tends to capture one "kind" of relationship at a time. Real language is richer than that: a single word might simultaneously relate to the grammar of the sentence, the entity it refers to, and the tone of the paragraph.

The solution is multi-head attention. Instead of running the query-key-value ceremony once, the model runs it many times in parallel — each "head" learning to attend to a different kind of relationship. The results are then combined. One head might specialize in subject-verb agreement, another in coreference, another in long-range dependencies. The combined picture is far richer than any single head could produce.

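This is part of why Transformers are so parameter-hungry: every head in every layer carries its own learned projections. In the common design the heads divide the model's width among themselves, so the parameter count grows mainly with model width and the number of layers rather than with the head count alone. Stacking many heads across many layers is what gives the model its capacity to represent so many relationships at once. It is also where the cost story begins.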
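The multi-head procedure can be sketched as follows. This illustration assumes the common arrangement in which each head has its own query, key, and value projections, the head outputs are concatenated, and a final output projection mixes them. All weights are random stand-ins for learned ones, and the names (`heads`, `w_out`, `d_head`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention over all tokens for one head."""
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, heads, w_out):
    """Run every head on the same input, concatenate per token, then mix."""
    per_head = [
        attention(matmul(x, h["w_q"]), matmul(x, h["w_k"]), matmul(x, h["w_v"]))
        for h in heads
    ]
    concatenated = [
        [value for out in per_head for value in out[t]] for t in range(len(x))
    ]
    return matmul(concatenated, w_out)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, num_heads = 4, 2
d_head = d_model // num_heads  # each head works in a smaller slice of the model width
heads = [
    {name: random_matrix(d_model, d_head, rng) for name in ("w_q", "w_k", "w_v")}
    for _ in range(num_heads)
]
w_out = random_matrix(d_model, d_model, rng)
x = random_matrix(3, d_model, rng)  # 3 tokens

y = multi_head_attention(x, heads, w_out)
print(len(y), len(y[0]))  # 3 4
```

Because each head gets its own projections, each one can learn to score word pairs differently. The heads in this sketch are untrained, so they show only the data flow, not any specialization.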

Why attention is powerful — and expensive

The power of attention comes from the fact that every token can attend to every other token. Nothing is out of reach by virtue of distance. A word at the end of a long document can directly consult a word at the beginning, with no degradation through intermediate steps. This is what closed the "distance" gap that plagued recurrent models.

But that same exhaustiveness is the source of attention's signature drawback. If every token must consider every other token, then the amount of work grows with the square of the sequence length. Doubling the input roughly quadruples the attention computation. For short texts this is invisible; for long documents or long conversations it becomes the dominant cost.
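The arithmetic is easy to check. The sketch below counts only query-key comparisons for a single head in a single layer and ignores everything else a model does, so it is a rough proxy for cost rather than a benchmark. The function name is hypothetical.

```python
def attention_pairs(num_tokens):
    """Query-key comparisons when every token attends to every token."""
    return num_tokens * num_tokens


for n in (1_000, 2_000, 4_000):
    print(n, attention_pairs(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```

Each doubling of the input multiplies the count by four, which is the square-law growth described above.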

This is directly connected to the limits you feel as a user. It is one of the reasons a model's context window — how much text it can consider at once — is finite and why pushing it larger is hard. If you want to understand why windows exist and why they cost what they do, attention's quadratic scaling is a big part of the answer. Our companion piece on context windows goes deeper on that trade-off.

The field has responded to this cost in several ways. Some models approximate attention to avoid the full every-to-every comparison. Others adopt alternative sequence backbones, such as linear-attention variants and state-space model families like Mamba, that handle long sequences more cheaply, sometimes in hybrid designs that mix cheap backbones with selective full attention. Techniques like Mixture of Experts tackle a different axis of the problem: they activate only a subset of the model's parameters for each token, which lowers per-token compute outside of attention and leaves more of the budget for handling long context.
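As one illustration of how restricting the comparison helps, the sketch below counts pairs under sliding-window (local) attention, where each token looks only at nearby tokens. The article does not single out this technique, so it is used here purely as an assumed example. The `window` parameter and function names are hypothetical.

```python
def full_pairs(num_tokens):
    """Comparisons under full every-to-every attention."""
    return num_tokens * num_tokens


def windowed_pairs(num_tokens, window):
    """Comparisons when each token sees itself and up to `window` tokens on each side."""
    total = 0
    for i in range(num_tokens):
        first = max(0, i - window)
        last = min(num_tokens - 1, i + window)
        total += last - first + 1
    return total


for n in (1_000, 2_000, 4_000):
    print(n, full_pairs(n), windowed_pairs(n, window=128))
```

With a fixed window, the count grows roughly in proportion to the sequence length instead of its square. The price is that distant tokens can no longer consult each other directly within a single layer.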

What this means when you write prompts

You do not need to do matrix multiplication to benefit from understanding attention. The idea translates into a few practical habits:

Repetition creates emphasis. Because attention is a weighted blend, a word or instruction that appears multiple times tends to gather more attention in total. If something is critical, stating it clearly more than once, including right before the part where it matters, often helps.

Proximity still matters. Although attention can reach across long distances in principle, models are empirically more reliable when the most relevant material is close to where it is used. Putting the key instruction or data near the end of your prompt, just before the task, tends to work better than burying it at the top.

Disambiguate explicitly. When a word is ambiguous, the model resolves it by attending to context. If the context is thin, you can help by adding the disambiguating signal yourself — naming the sense you intend rather than relying on the model to guess.

Structure helps attention find things. Clear sections, headings, and delimiters give attention a map. In a long input, well-structured text is easier for the model to navigate than a wall of prose, which tends to improve accuracy.
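These habits can be folded into a small prompt-building helper. The sketch below is one possible layout, not a recommended standard: the function name, the section labels, and the `===` delimiters are all hypothetical choices made for illustration.

```python
def build_prompt(instruction, documents, task):
    """Assemble a prompt with labeled sections and the instruction repeated near the end."""
    sections = [f"INSTRUCTION: {instruction}"]
    for index, text in enumerate(documents, start=1):
        # Delimiters mark where each document starts, so the input is not a wall of prose.
        sections.append(f"=== DOCUMENT {index} ===\n{text.strip()}")
    # Repeat the critical instruction right before the task, close to where it is used.
    sections.append(f"REMINDER: {instruction}")
    sections.append(f"TASK: {task}")
    return "\n\n".join(sections)


prompt = build_prompt(
    instruction="Answer only from the documents.",
    documents=[
        "The bank raised its savings rate in March.",
        "The river bank flooded after heavy rain.",
    ],
    task='Which document uses "bank" in the financial sense?',
)
print(prompt)
```

The task wording names the intended sense of "bank" explicitly, which applies the disambiguation habit as well.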

Conclusion

Attention is the mechanism that let language models stop reading one word at a time and start seeing whole passages at once. By letting every word consult every other word and blend in what is relevant, it solved the distance problem that crippled earlier architectures and unlocked the parallelism that made modern scale possible. Its signature weakness, that every-to-every comparison grows quadratically, is a large part of why context windows are finite and long inputs are expensive.

Once you grasp that a model is constantly performing this quiet act of "looking around" to decide what each word means in context, a great deal of model behavior stops being mysterious. The model is not reasoning the way a person does, but it is doing something genuinely powerful: letting every piece of text borrow its meaning from everything around it, all at once.