live · phi-3-mini · webgpu

A real transformer, laid bare.

3.8 billion parameters. On your GPU, in your browser. Every point, every line, every pulse is a real tensor, read one-to-one. No mockups. Nothing faked.

3.8B
parameters
292
dispatches / token
22
GPU buffers
0
servers
01 · The Promise

No mockups. The actual model.

Most "AI visualizations" online are decoration. Dots pulsing to a fake rhythm. A metaphor with no model behind it.

Neuropulse is the opposite. Every brightness, every line, every motion is a direct readout of a real WebGPU buffer mid-forward-pass. When the model thinks, you watch it think — not a representation.

Strict 1:1. Every pixel a function of a real tensor.

3.8B weights in your GPU. Attention in WGSL. Next-token sampled in your tab. No server. No API key. Close the tab and the inference stops.

model
Phi-3-mini, q4f16_1 · 3.8B parameters · the same weights Microsoft ships
runtime
WebGPU compute shaders · 13 pipelines · 22 buffers · 292 dispatches per token
privacy
Your GPU only · zero server calls · nothing leaves your machine

02 · What you're watching

Every part of the model, labeled.

The 3D scene is not a metaphor. Every glowing element maps to a specific tensor in Phi-3-mini's compute graph.

The 3,072 points of the residual stream are laid out by PCA of the model's layer-0 qkv_proj weights, so dimensions that are read into attention together sit near each other. Each point's brightness is the live value of that residual dimension at every decode step.
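The layout idea can be sketched as plain power-iteration PCA over the weight columns. Everything below (the `pcaLayout` name, the iteration count) is illustrative, not Neuropulse's actual code:

```typescript
// Sketch: project each residual dim to a 2D point by PCA of weight columns.
// Input: one vector per dim (e.g. a column of qkv_proj.weight).

function pcaLayout(cols: number[][], iters = 100): number[][] {
  const n = cols.length, d = cols[0].length;

  // Center the column vectors around their mean.
  const mean = new Array(d).fill(0);
  for (const c of cols) for (let i = 0; i < d; i++) mean[i] += c[i] / n;
  const X = cols.map(c => c.map((v, i) => v - mean[i]));

  const dot = (a: number[], b: number[]) => a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (a: number[]) => Math.sqrt(dot(a, a)) || 1;

  // Power iteration for one principal direction, deflating earlier ones.
  const component = (prev: number[][]): number[] => {
    let v = Array.from({ length: d }, () => Math.random() - 0.5);
    for (let t = 0; t < iters; t++) {
      // w = X^T X v  (covariance times v, up to scale)
      const w = new Array(d).fill(0);
      for (const x of X) {
        const s = dot(x, v);
        for (let i = 0; i < d; i++) w[i] += s * x[i];
      }
      // Gram-Schmidt against previously found components.
      for (const p of prev) {
        const s = dot(w, p);
        for (let i = 0; i < d; i++) w[i] -= s * p[i];
      }
      const nz = norm(w);
      v = w.map(x => x / nz);
    }
    return v;
  };

  const pc1 = component([]);
  const pc2 = component([pc1]);
  // Each column (one residual dim) becomes an (x, y) point.
  return X.map(x => [dot(x, pc1), dot(x, pc2)]);
}
```

Columns that point in similar directions in weight space project to nearby (x, y) points, which is what makes the scene's geometry model-derived rather than hand-placed.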

Hover over an attention head: the brightness you see is that head's live output magnitude.

fig. 1 — the anatomy of a single forward pass (labeled: attention heads · FFN slab · residual stream · KV cache · LM head → next token)

03 · Validation

Cross-checked against reference Phi-3.

"Strict 1:1" is a strong claim, so it has to be falsifiable. Neuropulse ships with a built-in test suite that diffs the WebGPU implementation against a reference HuggingFace fp16 Phi-3-mini on a fixed set of prompts cached as reference.json. Click the wrench icon inside the demo to run it — the actual numbers from your GPU print to your browser console.

═══ What the suite checks ═══
GPU: q4f16_1 Phi-3-mini   Reference: HF fp16 Phi-3-mini
 
[1] Tokenizer — GPU input ids match HF byte-for-byte on every prompt
[2] Hidden states — full 3,072-dim residual diffed at layers 0, 4, 8, 12, 16, 20, 24, 28, 31
[3] Attention (layer 31) — online softmax cross-checked against an explicit-softmax reference path
[4] Logits — top-k probabilities + Jensen–Shannon divergence vs HF on a 15-prompt sweep, teacher-forced for 5 steps each
[5] Long context — 290-token prompt, 10 decode steps, top-1 matched against HF
[6] Sampler — 5,000-sample empirical distribution vs softmax, JSD < 1e-2
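Check [3] rests on the fact that a one-pass, running-max softmax is algebraically identical to the explicit two-pass form. A minimal sketch of the two paths, in TypeScript rather than the suite's actual WGSL kernels:

```typescript
// Two-pass reference: find the max, exponentiate, normalize.
function explicitSoftmax(scores: number[]): number[] {
  const m = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - m));
  const z = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / z);
}

// One pass: track a running max `m` and running normalizer `z`,
// rescaling `z` whenever a new maximum appears (the online-softmax trick
// used by streaming attention kernels such as FlashAttention).
function onlineSoftmax(scores: number[]): number[] {
  let m = -Infinity, z = 0;
  for (const s of scores) {
    const mNew = Math.max(m, s);
    z = z * Math.exp(m - mNew) + Math.exp(s - mNew);
    m = mNew;
  }
  return scores.map(s => Math.exp(s - m) / z);
}
```

The two agree to floating-point round-off on any input, which is exactly the property the cross-check exploits.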

Expect tiny deltas at the hidden-state level — that's the cost of int4, not drift. What matters is the last line: identical top-1 tokens vs fp16 Phi-3 on the test set. Re-run it on your own machine in under a minute.
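The divergence metric in checks [4] and [6] is the standard Jensen–Shannon divergence. A generic sketch of how such a comparison could be computed (not the suite's exact code):

```typescript
// Kullback-Leibler divergence; terms with p_i = 0 contribute nothing.
function kl(p: number[], q: number[]): number {
  return p.reduce((s, pi, i) => (pi > 0 ? s + pi * Math.log(pi / q[i]) : s), 0);
}

// Jensen-Shannon divergence: symmetric, finite, 0 iff the
// distributions match, and at most ln(2) in nats.
function jsd(p: number[], q: number[]): number {
  const m = p.map((pi, i) => 0.5 * (pi + q[i]));
  return 0.5 * kl(p, m) + 0.5 * kl(q, m);
}
```

A JSD near zero between the GPU's top-k distribution and the fp16 reference means the quantized forward pass lands on essentially the same next-token probabilities.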


04 · The Stack

How it's built.

Four pieces. No frameworks for the inference path, no dependency soup, no clever tricks hiding the model from you.

  1. WebGPU compute & WGSL. 13 pipelines, 22 buffers, 292 dispatches per token. Quantization: q4f16_1. Hand-written attention and FFN kernels.
  2. MLC Phi-3-mini weights. The same weights as mlc-ai/Phi-3-mini-4k-instruct-q4f16_1-MLC, fetched directly from HuggingFace and cached in the browser's Cache API.
  3. Three.js scene. Plain WebGLRenderer. No bloom, no particles, no decorative shaders. Every pixel pulls from a real tensor on every frame.
  4. PCA layout from the model's own weights. Residual points are placed by PCA of layer 0's qkv_proj.weight columns; FFN points by PCA of down_proj.weight. Dims that get read or written together end up near each other, so the geometry is shaped by the model itself, not by hand.
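For intuition on item 1's quantized weights, here is a sketch of 4-bit group dequantization. The simple symmetric scheme below (eight nibbles per u32, a fixed zero-point of 8, one scale per group) is an assumption for illustration; the exact q4f16_1 packing follows MLC's format and may differ in layout and group size:

```typescript
// Dequantize 4-bit group-quantized weights:
// weight = (nibble - 8) * scale, with one scale per `groupSize` weights.
function dequantize(
  packed: Uint32Array,      // eight 4-bit weights per 32-bit word
  scales: Float32Array,     // one scale per group
  groupSize = 32,
): Float32Array {
  const out = new Float32Array(packed.length * 8);
  for (let w = 0; w < packed.length; w++) {
    for (let k = 0; k < 8; k++) {
      const idx = w * 8 + k;
      const nibble = (packed[w] >>> (4 * k)) & 0xf; // k-th 4-bit field
      out[idx] = (nibble - 8) * scales[Math.floor(idx / groupSize)];
    }
  }
  return out;
}
```

The GPU kernels do the same unpacking inline in WGSL, so the full fp16 weight matrix never has to exist in memory.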

05 · In real numbers

What it actually is.

3.8B
parameters
Phi-3-mini, q4f16_1. The same weights Microsoft ships — loaded into your GPU, not emulated.
1,024
attention heads, live
32 layers × 32 heads. Every one of them drawn from its real output magnitude on every token.
292
dispatches / token
13 WGSL pipelines, 22 GPU buffers. One full forward pass, in your browser tab.
0
servers
No API calls. No telemetry. No key. Close the tab and the inference stops.
06 · Now you

See it for yourself.

Open Neuropulse. Feed it a prompt. Watch the model think. The first load streams ~2 GB of weights to your GPU; subsequent visits load instantly from the on-disk cache (OPFS).

Launch Neuropulse