01
The chat · what you use every day
System is set once, then User / Assistant take turns
System — house rules the app sets for you, once per chat — kicks things off. After that it is just User (you) and Assistant (it), piling up turn by turn. Nothing else is in there.
02
The essence · what it actually does
Everything is joined into one text; it guesses the next word
Take the boxes away and the pieces join end to end into one long string. The model reads it and continues from the tail, one word at a time. However fancy the AI, this is the whole job.
So how exactly does it “guess”? Let’s step out of the frame and flatten it.
02
The essence · through its eyes
Flattened out: one string, new words popping out at the tail
Bubbles and roles are packaging for you. In its eyes there is only this one long string (tokens). Its job: read the whole string, pop out the next word at the end — that “typing” feel you see is exactly this.
02
How it guesses · ① attention
Before guessing, it looks back at what matters most
This step is attention — the heart of the transformer. To pick the next word it weighs every earlier word differently: here it leans hard on “To be, or not to be” and the cadence of the line.
02
How it guesses · ② probability
It never knows the answer — it scores every candidate
Having read everything, it produces a probability table for the next word: arrows .71 / pains .12 / stings .05… “Guessing” means picking from this table. And because it is always guessing — confident does not mean correct: a “hallucination” is not a bug — it is this table doing its normal thing.
02
How it guesses · ③ append, repeat
Pick one, append it, guess the next the same way
The picked word goes onto the tail, then it guesses the next one exactly the same way — word by word, the whole reply gets generated. That is the entire core of an LLM.
03
The constraint · one “head” first
It reads the honest way: every word × every word
The reader doing this is called a “head”. It scores how every word relates to every other and keeps the result as a web: 100 words = 10,000 cells; 10,000 words = 100 million. What you stuff in is the context; the web’s ceiling is the context length, fixed at build time.
03
The constraint · that was one head
There are dozens of heads — not split work, but angles
Not head 1 reads the front, head 2 the back — every head reads the whole text, each watching its own thing: cadence, names, tone. Dozens at once: faster, but no less work — dozens of webs, all computed. (The picture uses 64.)
03
The constraint · deeper layer by layer
That stack of webs is just layer 1 — dozens more follow
Layer 1 reads the raw text; each later layer reads what the one before digested — a fresh web, one level deeper (the picture uses 80). The bill multiplies: words × words × heads × layers. Double the window, quadruple the bill — that is why the window must have a cap.
03
The constraint · the one hard limit
It can only read so much — and the middle goes muddy
The string has a hard cap (context length) — the two classic complaints, big files choke it, long chats make it forget, both start here. And the longer it gets, the easier the middle is to miss — the buried line is the one it “reads right past”.
04
Stateless · there is no “just now”
The chat is an illusion: every message is a first meeting
Between two replies, nothing stays in its head. The “continuous chat” is the app resending the whole history every time; it rereads from the top and guesses on (also why long chats get pricey). It has no state — what you send is all it has.
05
Feed it · start with a handoff list
First, list every piece of knowledge the job takes
Treat it as a brilliant new hire on a permanent first day. To hand the job over, what would they need to know? — your situation and goal, the private materials, the latest progress, plus public knowledge and common sense. Never mind who supplies what — list it all.
06
Cross off · what it comes with
Two items on the list can be crossed right off
Public knowledge and common sense — anything public, common and pre-cutoff, it read in training. Already in its head; cross them off. Four remain that only you can supply: goal / status / materials / constraints. Those four are what goes into the window.
07
Ask · information comes out in talk
Just talk — and for anything complex, have it ask you first
No need to fill the list in one go — information comes out in talk: it asks, you answer, the slots fill in. For anything complex, open with “Before you start, ask me about anything unclear.” One round of questions saves rounds of rework.
08
Sand it · v1 is usually wrong
Toss “what went wrong” back as-is, round after round
The first version is almost never right — it was guessed into being. Paste the result or error back raw; it revises. Still wrong? Paste again — the chat grows, and that grind is what sharpens the result. You only toss back symptoms — no need to diagnose. Stop at good enough.
09
Save · don’t just close the tab
Have it distill the chat into a document — that is a Skill
Job worked? Say one more thing before you leave: “Distill this chat into a reusable doc.” It pulls the context, the final solution and every pitfall out of those rounds of back-and-forth, and writes a Skill. Accumulating experience means accumulating documents like this one.
09
Save · how it pays off
Next time’s handoff: the Skill fills the slots
Next time the same kind of job comes up, drop the Skill in — the how-to, the pitfalls, the house rules ride along; you only state this time’s goal and status. Five rounds of sanding last time, one round now — that is the compound interest of saving.
10
Fit the hands · tool
The brain has no hands: it writes an order; a tool does the work
“Book me a meeting” — the brain can’t reach your calendar. So it writes an order on the pad — check_calendar(Wed,…) — and stops. A program outside actually checks the calendar: a tool. The result comes back as text; it reads on and answers. It only ever talks; the tool does the doing.
11
Fit the memory · injection (RAG/Memory/Skill)
Memory is bolted on: it fetches, or someone writes
The brain forgets on sight; memory is bolted on. Two routes: a tool fetches (previous page), or someone writes onto its pad directly — documents (RAG), preferences and past results (Memory), sanded how-tos (Skill), written in at fixed moments. Who writes? Coming up.
12
The body moves · the loop (Agent)
The same brain, fed a few more rounds = an Agent
Hands fitted, memory fitted — now it spins on its own: write → tool runs → result written back, lap after lap. None of it needs you; the pad just keeps growing. “Agent” sounds like it might rebel; the mechanism is disappointingly plain: the same brain, fed a few more rounds.
13
The full skeleton · harness
The shell that assembles this body is the harness
Writing the pad, handing tools, spinning the loop — the program assembling this body is the harness (Claude Code, Cursor, your AI app). “Who does the writing?” — this is who. It can also run several loops as a workflow. The point: only that small piece is brain-like; the rest is ordinary code — every bit of scaffolding in your grip.
14
Reflexes & red lines · hooks
Fixed points on the loop come with sockets — hooks
“Hook” is an old programmers’ word: a fixed event fires, your attached action runs — “dinner” fires, “wash hands first” runs. The loop’s key points all have sockets: after you speak, before it acts, after a tool runs, at wrap-up. Hang gates, injections, follow-ups on them. A prompt is a request; a hook is law.