live · phi-3-mini · webgpu

A real transformer, laid bare.

3.8 billion parameters. On your GPU, in your browser. Every point, every line, every pulse — a real tensor, read one-to-one. The model itself, drawn from the inside. Shift-click any of the 1,024 attention heads to turn it off and watch what breaks.

3.8B
parameters
292
dispatches / token
22
GPU buffers
0
servers
scroll
01 · The Promise

The model, drawn from inside.

Most "AI visualizations" online are decoration — dots pulsing to a chosen rhythm, animated metaphors built on top of static images. The model itself sits somewhere else entirely.

Neuropulse takes the opposite path. Every brightness, every line, every motion is a direct readout of a real WebGPU buffer mid-forward-pass. When the model thinks, you watch it think — the thing itself, not a representation of it.

Strict 1:1. Every pixel a function of a real tensor.

3.8B weights in your GPU. Attention in WGSL. Next-token sampled in your tab. The whole pipeline lives inside this page — close it and the inference stops with it.

model
Phi-3-mini, q4f16_1 · 3.8B parameters · same weights Microsoft ships
runtime
WebGPU compute shaders · 13 pipelines · 22 buffers · 292 dispatches per token
privacy
Your GPU only · zero server calls · nothing leaves your machine

02 · What you're watching

Every part of the model, labeled.

The 3D scene is not a metaphor. Every glowing element maps to a specific tensor in Phi-3-mini's compute graph.

The 3,072 points of the residual stream are laid out by PCA of the model's layer-0 qkv_proj weights — so dims read into attention together sit near each other. Each point's brightness is the live value of that residual dim on every step.

Hover an attention head — the brightness you see is that head's output magnitude.

attention heads · FFN slab · residual stream · KV cache · LM head → next token
fig. 1 — the anatomy of a single forward pass

03 · Validation

Cross-checked against reference Phi-3.

"Strict 1:1" is a strong claim, so it has to be falsifiable. Neuropulse ships with a built-in test suite that diffs the WebGPU implementation against a reference HuggingFace fp16 Phi-3-mini on a fixed set of prompts cached as reference.json. Click the wrench icon inside the demo to run it — the actual numbers from your GPU print to your browser console.

═══ What the suite checks ═══
GPU: q4f16_1 Phi-3-mini   Reference: HF fp16 Phi-3-mini
 
[1] Tokenizer — GPU input ids match HF byte-for-byte on every prompt
[2] Hidden states — full 3,072-dim residual diffed at layers 0, 4, 8, 12, 16, 20, 24, 28, 31
[3] Attention (layer 31) — online softmax cross-checked against an explicit-softmax reference path
[4] Logits — top-k probabilities + Jensen–Shannon divergence vs HF on a 15-prompt sweep, teacher-forced for 5 steps each
[5] Long context — 290-token prompt, 10 decode steps, top-1 matched against HF
[6] Sampler — 5,000-sample empirical distribution vs softmax, JSD < 1e-2

Expect tiny deltas at the hidden-state level — that's the cost of int4, not drift. What matters is the last line: identical top-1 tokens vs fp16 Phi-3 on the test set. Re-run it on your own machine in under a minute.
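Check [3] rests on a simple identity: the streaming (online) softmax a fused attention kernel computes in one pass must equal the plain two-pass softmax. A minimal TypeScript model of that cross-check follows; the function names are illustrative, not the actual suite code.

```typescript
// Two-pass reference: find the max, exponentiate, normalize.
function explicitSoftmax(scores: number[]): number[] {
  const m = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// One-pass online softmax: carry a running max and a rescaled running
// sum, as in FlashAttention-style streaming kernels.
function onlineSoftmax(scores: number[]): number[] {
  let runMax = -Infinity;
  let runSum = 0;
  for (const s of scores) {
    const newMax = Math.max(runMax, s);
    // Rescale the accumulated sum into the new max's frame, then add.
    runSum = runSum * Math.exp(runMax - newMax) + Math.exp(s - newMax);
    runMax = newMax;
  }
  return scores.map(s => Math.exp(s - runMax) / runSum);
}

const scores = [1.2, -0.7, 3.4, 0.0, 2.1];
const a = explicitSoftmax(scores);
const b = onlineSoftmax(scores);
const maxDiff = Math.max(...a.map((v, i) => Math.abs(v - b[i])));
// maxDiff is at floating-point rounding level: both paths agree.
```

The real check runs the WGSL kernel on layer 31 and diffs it against an explicit-softmax reference path; the identity being tested is the one above.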


04 · Causal Scrubbing

And you can reach in.

Watching is the easy half. The harder, more interesting half: shift-click any of the 1,024 attention-head spheres and that head's contribution to the residual stream is zeroed before O-projection — the canonical single-head ablation from the mech-interp literature, running live on the same WebGPU context that just rendered the frame. Press Run ablated and the same prompt generates twice, side by side: once whole, once missing that one head.
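The primitive itself is small. Each head owns a contiguous slice of the concatenated attention output, so zeroing that slice before the O-projection removes exactly that head's contribution to the residual stream. A toy-sized TypeScript sketch (illustrative dimensions and matrices, not the Neuropulse kernel):

```typescript
const numHeads = 4;
const headDim = 2;
const hidden = numHeads * headDim; // 8; the real model is 32 heads × 96 dims

// Toy concatenated per-head attention outputs for one token position.
const attnOut = Array.from({ length: hidden }, (_, i) => i + 1);

// Toy O-projection matrix (hidden × hidden), deterministic small values.
const W_O: number[][] = Array.from({ length: hidden }, (_, i) =>
  Array.from({ length: hidden }, (_, j) => ((i + 1) * (j + 2)) % 5 - 2)
);

const matVec = (W: number[][], x: number[]) =>
  W.map(row => row.reduce((acc, w, j) => acc + w * x[j], 0));

// Zero out one head's slice of the concatenated output.
function ablateHead(x: number[], head: number): number[] {
  const y = x.slice();
  for (let d = 0; d < headDim; d++) y[head * headDim + d] = 0;
  return y;
}

const whole = matVec(W_O, attnOut);
const ablated = matVec(W_O, ablateHead(attnOut, 2)); // silence head 2

// Because the O-projection is linear, whole - ablated is exactly
// W_O applied to head 2's slice alone: a clean single-head removal.
```

That linearity is why zeroing pre-projection is the canonical form of the ablation: the intervention subtracts one head's contribution and nothing else.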

Or pick a layer, run a sweep, and let it ablate every head in turn. The panel paints a strip the model itself authored — each cell colored by how much the answer moved when that head went silent. Cool tones for the heads the prompt didn't need; warm ones for the heads it leaned on. About sixty seconds buys you a layer's importance map.
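One natural way to score a sweep cell, and the same metric the validation suite already uses on logits, is Jensen–Shannon divergence between the whole and ablated next-token distributions. A sketch with made-up probabilities; `jsd` here is an assumption about the scoring, not confirmed Neuropulse internals:

```typescript
// KL divergence in nats; q is assumed strictly positive where p > 0.
function kl(p: number[], q: number[]): number {
  return p.reduce((acc, pi, i) => (pi > 0 ? acc + pi * Math.log(pi / q[i]) : acc), 0);
}

// Jensen–Shannon divergence: symmetric, bounded, zero iff p === q.
function jsd(p: number[], q: number[]): number {
  const m = p.map((pi, i) => (pi + q[i]) / 2);
  return 0.5 * kl(p, m) + 0.5 * kl(q, m);
}

const whole = [0.85, 0.1, 0.05];    // p(next token), all heads on (toy numbers)
const ablated = [0.3, 0.45, 0.25];  // same prompt, one head zeroed (toy numbers)
const score = jsd(whole, ablated);  // larger score → the prompt leaned on this head
```

Cool cells are heads where `score` stays near zero; warm cells are heads whose silencing moved the distribution.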

═══ Verified ablation gradient · prompt: "Paris is the capital of" ═══
 
{L0} all heads off — output barely shifts. Layer-0 attention near the embeddings is heavily redundant.
{L31} all heads off — diverges at the low-certainty positions. The final layer's FFN and residual still carry the decision.
{L28..L31} stacked — generation hits the stop token early. Compounded ablation bites.
every layer off — pure FFN + residual, no token-to-token information flow. "-,unlintzegesenma\\. #quierudo\\i'mʔholmo<0x95>-"

This is the same primitive that found "induction heads" and traced the circuits behind in-context learning — work that, until now, lived in Python notebooks tethered to CPUs and A100s. Here it runs in a browser tab, on your laptop, in real time, on a 3.8B-parameter model. The next thirty seconds of your life can look exactly like a mech-interp paper figure.


05 · The Stack

How it's built.

Four pieces. No frameworks for the inference path, no dependency soup, no clever tricks hiding the model from you.

  1. WebGPU compute & WGSL 13 pipelines, 22 buffers, 292 dispatches per token. Quantization: q4f16_1. Hand-written attention and FFN kernels.
  2. MLC Phi-3-mini weights The same weights as mlc-ai/Phi-3-mini-4k-instruct-q4f16_1-MLC, fetched directly from HuggingFace and cached in the browser's Cache API.
  3. Three.js scene Plain WebGLRenderer. No bloom, no particles, no decorative shaders. Every pixel pulls from a real tensor on every frame.
  4. PCA layout from the model's own weights Residual points are placed by PCA of layer 0's qkv_proj.weight columns; FFN points by PCA of down_proj.weight. Dims that get read or written together end up near each other, so the geometry is shaped by the model itself, not by hand.
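The layout idea in item 4 can be sketched with power iteration: extract the top principal component of a covariance-like matrix over dims, then use each dim's loading as a coordinate, so correlated dims land near each other. Toy numbers and hypothetical names; the real layout runs on `qkv_proj.weight` and `down_proj.weight` columns.

```typescript
// Power iteration: repeatedly apply the matrix and renormalize, converging
// to the dominant eigenvector (the first principal component).
function powerIteration(C: number[][], iters = 200): number[] {
  let v = C.map(() => 1 / Math.sqrt(C.length));
  for (let k = 0; k < iters; k++) {
    const w = C.map(row => row.reduce((a, c, j) => a + c * v[j], 0));
    const norm = Math.hypot(...w);
    v = w.map(x => x / norm);
  }
  return v;
}

// Toy covariance over 3 dims: dims 0 and 1 are strongly correlated
// (read together), dim 2 is nearly independent.
const cov = [
  [1.0, 0.9, 0.1],
  [0.9, 1.0, 0.1],
  [0.1, 0.1, 1.0],
];

const pc1 = powerIteration(cov);
// Dims 0 and 1 get nearly equal pc1 coordinates, so they are placed
// side by side; dim 2 sits apart. Scale this up and the geometry of the
// point cloud is dictated by the weights themselves.
```

The same loop with a second deflated component gives a 2D or 3D embedding; any standard PCA routine would do the same job.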

06 · In real numbers

What it actually is.

3.8B
parameters
Phi-3-mini, q4f16_1. The same weights Microsoft ships — loaded into your GPU, not emulated.
1,024
attention heads, live
32 layers × 32 heads. Every one of them drawn from its real output magnitude on every token.
292
dispatches / token
13 WGSL pipelines, 22 GPU buffers. One full forward pass, in your browser tab.
0
servers
Inference begins and ends inside this tab — every weight, every kernel, every readback. Close the tab; it's gone with you.
Now you

See it for yourself.

Open Neuropulse. Feed it a prompt. Watch the model think. First load streams ~2 GB into your GPU; next visit is instant (OPFS-cached to disk).

Launch Neuropulse