Attention Is All You Need, explained for builders
A browser-first lab for seeing how tokens look at each other, how softmax turns similarity into weights, and why transformers changed modern AI.
Why transformers matter
Older sequence models read text step by step, so distant context had to survive a long chain of intermediate states. Attention lets every token compare itself with every other token, so context can move through the whole sentence at once.
Self-attention visualization
Edit the sentence, pick a focus token, and change the temperature. The canvas shows which tokens are being attended to, while the heatmap shows the full attention matrix.
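The temperature control works by dividing the similarity scores before they go through softmax: low temperature sharpens attention onto the best match, high temperature spreads it out. A minimal sketch of that idea (the function name and scores are illustrative, not the lab's actual code):

```javascript
// Temperature-scaled softmax: lower temperature sharpens the
// distribution, higher temperature flattens it toward uniform.
function softmaxWithTemperature(scores, temperature) {
  const scaled = scores.map((s) => s / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Same scores, two temperatures: contrast grows as temperature drops.
const demoScores = [1.2, 0.4, 0.1];
console.log(softmaxWithTemperature(demoScores, 1.0));
console.log(softmaxWithTemperature(demoScores, 0.25));
```

Both outputs sum to 1; only how the weight is distributed changes.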
Attention lines
sat attends most to cat
Attention heatmap
Rows are queries, columns are keys.
The formula, without the fog
Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k))V. Q asks a question, K describes what a token can match, and V is the information that gets blended into the result. Dividing by sqrt(d_k), where d_k is the key dimension, keeps the dot products from growing so large that softmax saturates on a single token.
Tiny attention in JavaScript
Every number in the lab is generated in the browser from small deterministic embeddings.
const tokens = ["the", "cat", "sat"];
const embeddings = [
[0.1, 0.8],
[0.7, 0.2],
[0.6, 0.9]
];
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const softmax = (xs) => {
const exps = xs.map((x) => Math.exp(x - Math.max(...xs)));
const total = exps.reduce((a, b) => a + b, 0);
return exps.map((e) => e / total);
};
const weightedSum = (ws, values) => values[0].map((_, j) => ws.reduce((sum, w, i) => sum + w * values[i][j], 0));
// Attend from "sat"; the embeddings double as queries, keys, and values.
const query = embeddings[2];
const scores = embeddings.map((key) => dot(query, key) / Math.sqrt(query.length));
const weights = softmax(scores);
const output = weightedSum(weights, embeddings);
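The heatmap extends the same computation to every token at once: each row is one token acting as the query, scored against all tokens as keys. A self-contained sketch, assuming (as in the snippet above) that the raw embeddings stand in for Q, K, and V:

```javascript
const embeddings = [
  [0.1, 0.8], // the
  [0.7, 0.2], // cat
  [0.6, 0.9]  // sat
];
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const softmax = (xs) => {
  const exps = xs.map((x) => Math.exp(x - Math.max(...xs)));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
};

// One row per query token; each row softmaxes the scaled scores
// against every key, so every row sums to 1.
const attentionMatrix = embeddings.map((query) =>
  softmax(embeddings.map((key) => dot(query, key) / Math.sqrt(query.length)))
);
```

Rendering attentionMatrix as a grid reproduces the "rows are queries, columns are keys" view from the lab.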
Attention as human focus
The parallel is useful if we keep it grounded: people also allocate focus, ignore noise, and connect memory with context. That does not make a model conscious. It does give us a practical metaphor for learning how pattern systems work.