Open-source AI school

Attention Is All You Need, explained for builders

A browser-first lab for seeing how tokens look at each other, how softmax turns similarity into weights, and why transformers changed modern AI.

Part 1

Why transformers matter

Old sequence models read text mostly in order. Attention lets every token compare itself with every other token, so context can move through the whole sentence at once.

RNN bottleneck: Context had to squeeze through step-by-step memory.
Attention shift: Each token gets a spotlight and chooses what matters.
Parallel building: Many token comparisons can run together on modern hardware.
Part 2

Self-attention visualization

Edit the sentence, pick a focus token, and change the temperature. The canvas shows which tokens are being attended to, while the heatmap shows the full attention matrix.

Attention lines

sat attends most to cat

Attention heatmap

Rows are queries, columns are keys.
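The heatmap can be sketched as a small matrix computation: score every query row against every key column, then softmax each row. This is a minimal sketch, not the lab's actual source; the embeddings are the same illustrative ones used later, and dividing scores by a temperature is an assumption about how the slider works.

```javascript
// Illustrative embeddings for "the", "cat", "sat" (assumed, not lab output).
const embeddings = [
  [0.1, 0.8],
  [0.7, 0.2],
  [0.6, 0.9]
];
const temperature = 1.0; // higher temperature flattens the rows

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const softmax = (xs) => {
  const exps = xs.map((x) => Math.exp(x - Math.max(...xs)));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
};

// matrix[i][j] = how much query token i attends to key token j
const matrix = embeddings.map((q) =>
  softmax(embeddings.map((k) => dot(q, k) / (Math.sqrt(q.length) * temperature)))
);
```

Each row of `matrix` sums to 1, which is why every heatmap row reads as a probability distribution over the keys.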
Part 3

The formula, without the fog

Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k))V. Q asks a question, K describes what a token can match, and V is the information that gets blended into the result.

Query: What am I looking for?
Key: What do I offer to match?
Value: What information do I pass along?
Softmax: Turn scores into attention weights.
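One part of the formula worth seeing numerically is the sqrt(d_k) divisor. This is a hypothetical demo, not lab code: dot products of random vectors grow with dimension, which would push softmax toward near-one-hot weights, while the scaled scores stay in a comparable range.

```javascript
// Hypothetical demo: mean |dot product| of random vectors grows with
// dimension d, but dividing by sqrt(d) keeps it roughly constant.
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const randVec = (d) => Array.from({ length: d }, () => Math.random() * 2 - 1);

const trials = 2000;
const results = [2, 64, 512].map((d) => {
  let raw = 0;
  for (let t = 0; t < trials; t++) {
    raw += Math.abs(dot(randVec(d), randVec(d)));
  }
  raw /= trials;
  return { d, raw, scaled: raw / Math.sqrt(d) };
});

// raw grows with d; scaled stays in one range
results.forEach((r) => console.log(r));
```

That stability is the point of the divisor: softmax sees scores of a similar magnitude no matter how wide the embeddings are.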
Part 4

Tiny attention in JavaScript

Every number in the lab is generated in the browser from small deterministic embeddings.

const tokens = ["the", "cat", "sat"];
const embeddings = [
  [0.1, 0.8],
  [0.7, 0.2],
  [0.6, 0.9]
];

// Helpers the snippet relies on, defined inline so it runs as-is.
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const softmax = (xs) => {
  const exps = xs.map((x) => Math.exp(x - Math.max(...xs)));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
};

// One attention step for the query token "sat"; each embedding doubles as key and value here.
const query = embeddings[2];
const scores = embeddings.map((key) => dot(query, key) / Math.sqrt(query.length));
const weights = softmax(scores);
const output = query.map((_, d) => weights.reduce((s, w, i) => s + w * embeddings[i][d], 0));
Part 5

Attention as human focus

The parallel is useful if we keep it grounded: people also allocate focus, ignore noise, and connect memory with context. That does not make a model conscious. It does give us a practical metaphor for learning how pattern systems work.