live · phi-3-mini · webgpu

A real transformer, laid bare.

3.8 billion parameters. On your GPU, in your browser. Every point, every line, every pulse — a real tensor, read one-to-one. The model itself, drawn from the inside. Shift-click any of the 1,024 attention heads to turn it off and watch what breaks.

3.8B
parameters
292
dispatches / token
22
GPU buffers
0
servers
scroll
01 · The Promise

The model, drawn from inside.

Most "AI visualizations" online are decoration — dots pulsing to a chosen rhythm, animated metaphors built on top of static images. The model itself sits somewhere else entirely.

Neuropulse takes the opposite path. Every brightness, every line, every motion is a direct readout of a real WebGPU buffer mid-forward-pass. When the model thinks, you watch it think — the thing itself, not a representation of it.

Strict 1:1. Every pixel a function of a real tensor.

3.8B weights in your GPU. Attention in WGSL. Next-token sampled in your tab. The whole pipeline lives inside this page — close it and the inference stops with it.

model
Phi-3-mini, q4f16_1 · 3.8B parameters · same weights Microsoft ships
runtime
WebGPU compute shaders · 13 pipelines · 22 buffers · 292 dispatches per token
privacy
Your GPU only · zero server calls · nothing leaves your machine

02 · What you're watching

Every part of the model, labeled.

The 3D scene is not a metaphor. Every glowing element maps to a specific tensor in Phi-3-mini's compute graph.

The 3,072 points of the residual stream are laid out by PCA of the model's layer-0 qkv_proj weights — so dims read into attention together sit near each other. Each point's brightness is the live value of that residual dim on every step.

Hover an attention head — the brightness you see is that head's output magnitude.

attention heads · FFN slab · residual stream · KV cache · LM head → next token
fig. 1 — the anatomy of a single forward pass

03 · Validation

Cross-checked against reference Phi-3.

"Strict 1:1" is a strong claim, so it has to be falsifiable. Neuropulse ships with a built-in test suite that diffs the WebGPU implementation against a reference HuggingFace fp16 Phi-3-mini on a fixed set of prompts cached as reference.json. Click the wrench icon inside the demo to run it — the actual numbers from your GPU print to your browser console.

═══ What the suite checks ═══
GPU: q4f16_1 Phi-3-mini   Reference: HF fp16 Phi-3-mini
 
[1] Tokenizer — GPU input ids match HF byte-for-byte on every prompt
[2] Hidden states — full 3,072-dim residual diffed at layers 0, 4, 8, 12, 16, 20, 24, 28, 31
[3] Attention (layer 31) — online softmax cross-checked against an explicit-softmax reference path
[4] Logits — top-k probabilities + Jensen–Shannon divergence vs HF on a 15-prompt sweep, teacher-forced for 5 steps each
[5] Long context — 290-token prompt, 10 decode steps, top-1 matched against HF
[6] Sampler — 5,000-sample empirical distribution vs softmax, JSD < 1e-2

Expect tiny deltas at the hidden-state level — that's the cost of int4, not drift. What matters is the last line: identical top-1 tokens vs fp16 Phi-3 on the test set. Re-run it on your own machine in under a minute.
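Check [3] rests on a simple identity: the streaming (online) softmax a fused attention kernel computes in one pass must equal the plain two-pass softmax. A minimal TypeScript model of that cross-check follows; the function names are illustrative, not the actual suite code.

```typescript
// Two-pass reference: find the max, exponentiate, normalize.
function explicitSoftmax(scores: number[]): number[] {
  const m = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// One-pass online softmax: carry a running max and a rescaled running
// sum, as in FlashAttention-style streaming kernels.
function onlineSoftmax(scores: number[]): number[] {
  let runMax = -Infinity;
  let runSum = 0;
  for (const s of scores) {
    const newMax = Math.max(runMax, s);
    // Rescale the accumulated sum into the new max's frame, then add.
    runSum = runSum * Math.exp(runMax - newMax) + Math.exp(s - newMax);
    runMax = newMax;
  }
  return scores.map(s => Math.exp(s - runMax) / runSum);
}

const scores = [1.2, -0.7, 3.4, 0.0, 2.1];
const a = explicitSoftmax(scores);
const b = onlineSoftmax(scores);
const maxDiff = Math.max(...a.map((v, i) => Math.abs(v - b[i])));
// maxDiff is at floating-point rounding level: both paths agree.
```

The real check runs the WGSL kernel on layer 31 and diffs it against an explicit-softmax reference path; the identity being tested is the one above.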


04 · Causal Scrubbing

And you can reach in.

Watching is the easy half. The harder, more interesting half: shift-click any of the 1,024 attention-head spheres and that head's contribution to the residual stream is zeroed before O-projection — the canonical single-head ablation from the mech-interp literature, running live on the same WebGPU context that just rendered the frame. Press Run ablated and the same prompt generates twice, side by side: once whole, once missing that one head.
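The primitive itself is small. Each head owns a contiguous slice of the concatenated attention output, so zeroing that slice before the O-projection removes exactly that head's contribution to the residual stream. A toy-sized TypeScript sketch (illustrative dimensions and matrices, not the Neuropulse kernel):

```typescript
const numHeads = 4;
const headDim = 2;
const hidden = numHeads * headDim; // 8; the real model is 32 heads × 96 dims

// Toy concatenated per-head attention outputs for one token position.
const attnOut = Array.from({ length: hidden }, (_, i) => i + 1);

// Toy O-projection matrix (hidden × hidden), deterministic small values.
const W_O: number[][] = Array.from({ length: hidden }, (_, i) =>
  Array.from({ length: hidden }, (_, j) => ((i + 1) * (j + 2)) % 5 - 2)
);

const matVec = (W: number[][], x: number[]) =>
  W.map(row => row.reduce((acc, w, j) => acc + w * x[j], 0));

// Zero out one head's slice of the concatenated output.
function ablateHead(x: number[], head: number): number[] {
  const y = x.slice();
  for (let d = 0; d < headDim; d++) y[head * headDim + d] = 0;
  return y;
}

const whole = matVec(W_O, attnOut);
const ablated = matVec(W_O, ablateHead(attnOut, 2)); // silence head 2

// Because the O-projection is linear, whole - ablated is exactly
// W_O applied to head 2's slice alone: a clean single-head removal.
```

That linearity is why zeroing pre-projection is the canonical form of the ablation: the intervention subtracts one head's contribution and nothing else.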

Or pick a layer, run a sweep, and let it ablate every head in turn. The panel paints a strip the model itself authored — each cell colored by how much the answer moved when that head went silent. Cool tones for the heads the prompt didn't need; warm ones for the heads it leaned on. About sixty seconds buys you a layer's importance map.
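One natural way to score a sweep cell, and the same metric the validation suite already uses on logits, is Jensen–Shannon divergence between the whole and ablated next-token distributions. A sketch with made-up probabilities; `jsd` here is an assumption about the scoring, not confirmed Neuropulse internals:

```typescript
// KL divergence in nats; q is assumed strictly positive where p > 0.
function kl(p: number[], q: number[]): number {
  return p.reduce((acc, pi, i) => (pi > 0 ? acc + pi * Math.log(pi / q[i]) : acc), 0);
}

// Jensen–Shannon divergence: symmetric, bounded, zero iff p === q.
function jsd(p: number[], q: number[]): number {
  const m = p.map((pi, i) => (pi + q[i]) / 2);
  return 0.5 * kl(p, m) + 0.5 * kl(q, m);
}

const whole = [0.85, 0.1, 0.05];    // p(next token), all heads on (toy numbers)
const ablated = [0.3, 0.45, 0.25];  // same prompt, one head zeroed (toy numbers)
const score = jsd(whole, ablated);  // larger score → the prompt leaned on this head
```

Cool cells are heads where `score` stays near zero; warm cells are heads whose silencing moved the distribution.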

═══ Verified ablation gradient · prompt: "Paris is the capital of" ═══
 
{L0} all heads off — output barely shifts. Layer-0 attention near the embeddings is heavily redundant.
{L31} all heads off — diverges at the low-certainty positions. The final layer's FFN and residual still carry the decision.
{L28..L31} stacked — generation hits the stop token early. Compounded ablation bites.
every layer off — pure FFN + residual, no token-to-token information flow. "-,unlintzegesenma\\. #quierudo\\i'mʔholmo<0x95>-"

This is the same primitive that found "induction heads" and traced the circuits behind in-context learning — work that, until now, lived in Python notebooks tethered to CPUs and A100s. Here it runs in a browser tab, on your laptop, in real time, on a 3.8B-parameter model. The next thirty seconds of your life can look exactly like a mech-interp paper figure.


05 · The Stack

How it's built.

Four pieces. No frameworks for the inference path, no dependency soup, no clever tricks hiding the model from you.

  1. WebGPU compute & WGSL 13 pipelines, 22 buffers, 292 dispatches per token. Quantization: q4f16_1. Hand-written attention and FFN kernels.
  2. MLC Phi-3-mini weights The same weights as mlc-ai/Phi-3-mini-4k-instruct-q4f16_1-MLC, fetched directly from HuggingFace and cached in the browser's Cache API.
  3. Three.js scene Plain WebGLRenderer. No bloom, no particles, no decorative shaders. Every pixel pulls from a real tensor on every frame.
  4. PCA layout from the model's own weights Residual points are placed by PCA of layer 0's qkv_proj.weight columns; FFN points by PCA of down_proj.weight. Dims that get read or written together end up near each other, so the geometry is shaped by the model itself, not by hand.
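The layout idea in item 4 can be sketched with power iteration: extract the top principal component of a covariance-like matrix over dims, then use each dim's loading as a coordinate, so correlated dims land near each other. Toy numbers and hypothetical names; the real layout runs on `qkv_proj.weight` and `down_proj.weight` columns.

```typescript
// Power iteration: repeatedly apply the matrix and renormalize, converging
// to the dominant eigenvector (the first principal component).
function powerIteration(C: number[][], iters = 200): number[] {
  let v = C.map(() => 1 / Math.sqrt(C.length));
  for (let k = 0; k < iters; k++) {
    const w = C.map(row => row.reduce((a, c, j) => a + c * v[j], 0));
    const norm = Math.hypot(...w);
    v = w.map(x => x / norm);
  }
  return v;
}

// Toy covariance over 3 dims: dims 0 and 1 are strongly correlated
// (read together), dim 2 is nearly independent.
const cov = [
  [1.0, 0.9, 0.1],
  [0.9, 1.0, 0.1],
  [0.1, 0.1, 1.0],
];

const pc1 = powerIteration(cov);
// Dims 0 and 1 get nearly equal pc1 coordinates, so they are placed
// side by side; dim 2 sits apart. Scale this up and the geometry of the
// point cloud is dictated by the weights themselves.
```

The same loop with a second deflated component gives a 2D or 3D embedding; any standard PCA routine would do the same job.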

06 · In real numbers

What it actually is.

3.8B
parameters
Phi-3-mini, q4f16_1. The same weights Microsoft ships — loaded into your GPU, not emulated.
1,024
attention heads, live
32 layers × 32 heads. Every one of them drawn from its real output magnitude on every token.
292
dispatches / token
13 WGSL pipelines, 22 GPU buffers. One full forward pass, in your browser tab.
0
servers
Inference begins and ends inside this tab — every weight, every kernel, every readback. Close the tab; it's gone with you.
Now you

See it for yourself.

Open Neuropulse. Feed it a prompt. Watch the model think. First load streams ~2 GB into your GPU; next visit is instant (OPFS-cached to disk).

Launch Neuropulse