<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[eric makes software]]></title><description><![CDATA[this is where I write about making software outside of work! systems/pl, interaction design, open-source]]></description><link>https://ss.ekzhang.com</link><image><url>https://ss.ekzhang.com/img/substack.png</url><title>eric makes software</title><link>https://ss.ekzhang.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 08 May 2026 11:19:51 GMT</lastBuildDate><atom:link href="https://ss.ekzhang.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Eric]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[ekzhang@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[ekzhang@substack.com]]></itunes:email><itunes:name><![CDATA[Eric]]></itunes:name></itunes:owner><itunes:author><![CDATA[Eric]]></itunes:author><googleplay:owner><![CDATA[ekzhang@substack.com]]></googleplay:owner><googleplay:email><![CDATA[ekzhang@substack.com]]></googleplay:email><googleplay:author><![CDATA[Eric]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[jax-js: an ML library for the web]]></title><description><![CDATA[JAX in pure JavaScript, as a flexible machine learning library and compiler.]]></description><link>https://ss.ekzhang.com/p/jax-js-an-ml-library-for-the-web</link><guid isPermaLink="false">https://ss.ekzhang.com/p/jax-js-an-ml-library-for-the-web</guid><dc:creator><![CDATA[Eric]]></dc:creator><pubDate>Thu, 18 Dec 2025 15:02:32 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!7bb1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m excited to release <a href="https://jax-js.com/">jax-js</a>, a machine learning library for the web.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7bb1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7bb1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png 424w, https://substackcdn.com/image/fetch/$s_!7bb1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png 848w, https://substackcdn.com/image/fetch/$s_!7bb1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png 1272w, https://substackcdn.com/image/fetch/$s_!7bb1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7bb1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png" width="446" height="167.80936454849498" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1196,&quot;resizeWidth&quot;:446,&quot;bytes&quot;:26596,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/179060245?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7bb1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png 424w, https://substackcdn.com/image/fetch/$s_!7bb1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png 848w, https://substackcdn.com/image/fetch/$s_!7bb1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png 1272w, https://substackcdn.com/image/fetch/$s_!7bb1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1466b600-4dc2-4a2c-9639-843f5a1c700a_1196x450.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>You can think of it as a reimplementation of Google DeepMind&#8217;s <a href="https://docs.jax.dev/en/latest/">JAX</a> framework (similar to PyTorch) in pure 
JavaScript.</p><p>jax-js runs completely in the browser by generating fast <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebGPU_API">WebGPU</a> and <a href="https://webassembly.org/">Wasm</a> kernels.</p><h2>Numerical computing on the web</h2><p>Starting in February this year, I spent nights and weekends working on a new ML library for the browser. I wanted a cross-platform way to run numerical programs on the frontend web, so that you can do machine learning right in the browser.</p><p>Python and JavaScript are the <a href="https://survey.stackoverflow.co/2025/technology#most-popular-technologies-language">most popular languages</a> in the world:</p><ol><li><p><strong>JavaScript</strong> is the language of the web.</p></li><li><p><strong>Python</strong> is simple, expressive, and now ubiquitous in ML thanks to frameworks like PyTorch and JAX.</p></li></ol><p>But most developers would balk at running any number crunching in JavaScript. While the JavaScript JIT is really good, it&#8217;s not optimized for tight numerical loops. JavaScript doesn&#8217;t even have a fast, native integer data type! So how can you run fast numerical code on the web?</p><p>The answer is to rely on new browser technologies &#8212; WebAssembly and WebGPU, which allow you to run programs at near-native speeds. WebAssembly is a low-level portable bytecode, and WebGPU is GPU shaders on the web.</p><p>If we can use these native runtimes, then this lends itself to a programming model similar to JAX, where you <em>trace</em> programs and <em>JIT compile</em> them to GPU kernels. Here, instead of Nvidia CUDA, we write pure JavaScript to generate WebAssembly and WebGPU kernels. Then we can run them and execute instructions at near-native speed, skipping the JavaScript interpreter bottleneck.</p><p>That is what I ended up doing in jax-js, and now it &#8220;just works&#8221;.</p><h2>Getting started</h2><p>You can install jax-js as a library.
It&#8217;s pure JS with zero dependencies.</p><pre><code>npm install @jax-js/jax</code></pre><p>Then you can use it with an API almost identical to JAX.</p><pre><code>import { numpy as np } from "@jax-js/jax";

const ar = np.array([1, 5, 6, 7]);
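// Arrays have move semantics (see the note below): taking `.ref` bumps the
// refcount, so `ar` survives this call and can still be used afterwards.
// (Illustrative line, not part of the original example.)
const doubled = ar.ref.mul(2);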
console.log(ar.mul(10).js());  // -&gt; [10, 50, 60, 70]</code></pre><p>Under the hood, this generates a WebAssembly kernel and dispatches it.</p><blockquote><p><strong>Note:</strong> There are some surface-level syntax differences here versus JAX:</p><ul><li><p>JavaScript doesn&#8217;t have operator overloading like Python. Instead of <code>ar * 10</code> in Python, you have to call <code>ar.mul(10)</code>.</p></li><li><p>The <code>.js()</code> method converts a <code>jax.Array</code> object back into a plain JS array.</p></li><li><p>JS has no destructors or reference counting to free memory deterministically, so array values in jax-js have <a href="https://doc.rust-lang.org/rust-by-example/scope/move.html">move semantics</a> like Rust, with <code>.ref</code> incrementing their reference counts.</p></li></ul></blockquote><p>If you&#8217;d like to use WebGPU, just start your program with:</p><pre><code>import { init, setDevice } from "@jax-js/jax";

await init("webgpu");
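// init() is awaited because acquiring a WebGPU adapter and device is an
// asynchronous browser API.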
setDevice("webgpu");</code></pre><p>You can leverage <code>grad</code>, <code>vmap</code>, and other features of JAX. Here&#8217;s automatic differentiation with <code>grad()</code>:</p><pre><code>import { grad, numpy as np } from "@jax-js/jax";

const f = (x: np.Array) =&gt; np.sqrt(x.ref.mul(x).sum());
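// f is the L2 norm: f(x) = sqrt(sum(x_i^2)), so its gradient is x / ||x||.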
const df = grad(f);

const x = np.array([1, 2, 3, 4]);
console.log(df(x).js());</code></pre><p>And here&#8217;s an example of the compiler fusing operations with <code>jit()</code>. The following function gets translated into a single compiled GPU compute kernel:</p><pre><code>import { jit, numpy as np } from "@jax-js/jax";

const f = jit((x: np.Array) =&gt; {
  return np.sqrt(x.add(2).mul(Math.PI)).sum();
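  // add, mul, sqrt, and the sum reduction all fuse into one generated
  // kernel, so no intermediate arrays need to be materialized.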
});</code></pre><h2>Machine learning</h2><p>With these simple building blocks, you can implement most machine learning algorithms and backpropagate through them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Tz2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Tz2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png 424w, https://substackcdn.com/image/fetch/$s_!8Tz2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png 848w, https://substackcdn.com/image/fetch/$s_!8Tz2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png 1272w, https://substackcdn.com/image/fetch/$s_!8Tz2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Tz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png" width="1456" height="554" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:366314,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/179060245?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8Tz2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png 424w, https://substackcdn.com/image/fetch/$s_!8Tz2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png 848w, https://substackcdn.com/image/fetch/$s_!8Tz2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png 1272w, https://substackcdn.com/image/fetch/$s_!8Tz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a346a77-cf0a-407d-aecc-ec882cd2ebe0_3442x1310.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is <a href="https://jax-js.com/mnist">a runnable example</a> of training a neural network from scratch on MNIST dataset in your browser. It learns to &gt;99% accuracy in seconds, and everything from dataset loading to matmul kernels is <em>pure frontend JavaScript code</em>.</p><p>It&#8217;s remarkable to write ML programs with hot module reloading. You can edit code in real time <em>while</em> the model is training!</p><p>&#8212;</p><p>You can also build applications. 
<a href="https://jax-js.com/mobileclip">Here&#8217;s a demo I built yesterday</a>: download the whole text of <em>Great Expectations</em> (180,000 words), run it through a CLIP-based embedding model, and semantic search it in real time&#8212;all from your browser.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;e2749108-67b6-417e-a78d-9a55826b1f1b&quot;,&quot;duration&quot;:null}"></div><p><em>(The text embedding actually runs at a respectable ~500 GFLOP/s on my M1 Pro with just jax.jit(), despite me not having tried to optimize it at all yet. Not bad, crunching 500,000,000,000 calculations/second in browser on a 4-year-old laptop!)</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bRAB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bRAB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png 424w, https://substackcdn.com/image/fetch/$s_!bRAB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png 848w, https://substackcdn.com/image/fetch/$s_!bRAB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bRAB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bRAB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png" width="610" height="246.7651098901099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:589,&quot;width&quot;:1456,&quot;resizeWidth&quot;:610,&quot;bytes&quot;:263760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/179060245?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bRAB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png 424w, https://substackcdn.com/image/fetch/$s_!bRAB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png 848w, 
https://substackcdn.com/image/fetch/$s_!bRAB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!bRAB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bad2a9e-a8ab-4041-a13c-79f5f6abf67d_2520x1020.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Running with batch size 16 (77-token context each), each CLIP transformer inference takes 200 ms, for an estimated 485 GFLOP/s end-to-end.</figcaption></figure></div><p>For a lot of inference use cases, you might reach for a &#8220;model runtime&#8221; like <a href="https://onnxruntime.ai/docs/tutorials/web/">ONNX Runtime</a> to add prebuilt ML models to your site: ML developers hand off pre-packaged weights to be used in the product. jax-js is a bit different. I&#8217;m imagining how a full ML framework, usually relegated to the backend, can run in the browser.</p><p>As for performance, it hasn&#8217;t been my primary focus so far, as just &#8220;getting the ML framework working&#8221; comes first. I have checked that jax-js&#8217;s generated kernels for <a href="https://jax-js.com/bench/matmul">matmuls are fast</a> (&gt;3 TFLOP/s on a MacBook M4 Pro). But there&#8217;s a lot of room to improve (e.g., <a href="https://jax-js.com/bench/conv2d">conv2d</a> is slow), and I haven&#8217;t done much optimization work on transformer inference in particular yet. There&#8217;s plenty of low-hanging fruit.</p><h2>Project release</h2><p>I am open-sourcing jax-js today at <a href="https://github.com/ekzhang/jax-js">ekzhang/jax-js</a>.</p><p>There are rough edges in this initial release, but it&#8217;s ready to try out now.</p><p>Links:</p><ul><li><p><a href="https://jax-js.com">Website</a></p><ul><li><p><a href="https://jax-js.com/repl">Try it out! (REPL)</a></p></li><li><p><a href="https://jax-js.com/docs/">API reference</a></p></li></ul></li><li><p><a href="https://github.com/ekzhang/jax-js">GitHub repository</a></p></li></ul><p>I look forward to seeing what you create. &#129392;</p><div><hr></div><h2>Appendix</h2><p>This is a personal project and not related to Thinking Machines Lab. I started working on jax-js before starting my current job, and in a way, it&#8217;s partly how I ended up working in ML.
Turns out this stuff is kind of fun!</p><p>If you&#8217;re still reading, hello&#8212;I have a bunch more details to share.</p><h3>Acknowledgements</h3><p>Thanks to:</p><ul><li><p>The authors of <a href="https://docs.jax.dev/en/latest/">JAX</a> for making an important ML library that&#8217;s a joy to use.</p><ul><li><p>Thanks to Matthew Johnson, Dougal Maclaurin, and others for <a href="https://docs.jax.dev/en/latest/autodidax.html">Autodidax</a>, an instructive implementation of the JAX core from scratch.</p></li><li><p>And thanks for all of the JAX ecosystem libraries as well.</p></li></ul></li><li><p><a href="https://github.com/tinygrad/tinygrad">Tinygrad</a> for an excellent autograd library &#8212; you showed that code-generating kernels from scratch isn&#8217;t really that <em>intrinsically</em> complex!</p><ul><li><p>Many parts of jax-js&#8217;s backend internals follow Tinygrad&#8217;s design closely. The biggest example of this is <a href="https://github.com/ekzhang/jax-js/blob/jax/v0.0.5/src/shape.ts">ShapeTracker</a>, which was directly ported.</p></li></ul></li><li><p>Chrome, Safari, and Firefox for WebGPU, now <a href="https://chromestatus.com/metrics/feature/timeline/popularity/3888">used on 2% of all websites</a>.</p></li><li><p>The open-source community, for inspiration and for showing that ML on the web is actually possible!</p><ul><li><p><a href="https://www.tensorflow.org/js">TensorFlow.js</a></p></li><li><p><a href="https://www.npmjs.com/package/onnxruntime-web">onnxruntime-web</a></p></li><li><p><a href="https://github.com/praeclarum/webgpu-torch">webgpu-torch</a>, <a href="https://github.com/zanussbaum/surfgrad">surfgrad</a>, and <a href="https://jott.live/markdown/mm_wasm">wasmblr</a></p></li><li><p><a href="https://github.com/mrdoob/three.js/wiki/Three.js-Shading-Language">Three.js Shading Language</a> (<a href="https://github.com/holtsetio/flow/blob/master/src/mls-mpm/mlsMpmSimulator.js">example</a>)</p></li></ul></li><li><p><a
href="https://pytorch.org/">PyTorch</a>, <a href="https://github.com/ml-explore/mlx">MLX</a>, and <a href="https://numpy.org/">NumPy</a></p></li></ul><h3>How it works: An overview of internals</h3><p>In general, I think there are roughly two parts to an ML library:</p><ol><li><p><strong>&#8220;Frontend&#8221; (think JAX):</strong> The interface for creating and manipulating arrays, the autograd engine, JIT, typing and transformations. This is also where you handle the sync/async boundary and track memory allocations.</p></li><li><p><strong>&#8220;Backend&#8221; (think XLA):</strong> Actual kernels for executing operations. The frontend has some representation of a kernel and dispatches it to the backend, which optimizes it, compiles it down to native code (CPU or GPU), and runs it very efficiently.</p></li></ol><p>This dichotomy obviously isn&#8217;t perfect (e.g., where do <a href="https://github.com/triton-lang/triton">Triton</a>/<a href="https://docs.jax.dev/en/latest/pallas/">Pallas</a> fit in? how about warp-specialized <a href="https://docs.nvidia.com/cuda/cutile-python/">cuTile</a>?), and there are certainly concerns that span both parts. But it&#8217;s how jax-js works.</p><p><strong>Let&#8217;s start with the backend and build our way up.</strong> In jax-js, the backend code is actually quite self-contained; each backend implements the <code>Backend</code> interface (abridged):</p><pre><code>/** A device backend. */
export interface Backend {
  /** Allocate a new slot with reference count 1. */
  malloc(size: number, initialData?: Uint8Array): Slot;

  /** Increment the reference count of the slot. */
  incRef(slot: Slot): void;

  /**
   * Decrement the reference count of the slot. If the reference count reaches
   * zero, it is freed. This should throw if the slot was already freed.
   */
  decRef(slot: Slot): void;

  /** Read a range of bytes from a buffer. */
  read(
    slot: Slot,
    start?: number,
    count?: number,
  ): Promise&lt;Uint8Array&lt;ArrayBuffer&gt;&gt;;

  /** Prepare an expression to be executed later. */
  prepare(kernel: Kernel): Promise&lt;Executable&gt;;

  /**
   * Run a backend operation that was previously prepared.
   *
   * The operation may not run immediately, but operations are guaranteed to run
   * in the dispatch order. Also, `read()` will wait for all pending operations
   * on that slot to finish.
   */
  dispatch(exe: Executable, inputs: Slot[], outputs: Slot[]): void;
}</code></pre><p>In other words, backends need to be able to malloc/free chunks of memory for tensors, and to execute <code>Kernel</code> objects. Inside a <code>Kernel</code> there is:</p><ul><li><p>A pointwise operation on one or more tensors, with</p></li><li><p>Lazy shape-tracking information for how to index the tensors, and</p></li><li><p>A reduction to be performed (optional).<br><em>Reductions can be any associative operation (add/multiply/max/min), and they can optionally have a fused <a href="https://docs.nvidia.com/cutlass/media/docs/cpp/gemm_api_3x.html#cutlass-gemm-model">epilogue</a> as well.</em></p></li></ul><p>The pointwise operation is constructed from a pure expression tree, an <code>AluExp</code>, where each node is a symbolic <code>AluOp</code>. There are 28 AluOps &#8212; you don&#8217;t need so many distinct operations when you can depend on kernel fusion!</p><p>Note that no automatic differentiation happens here; these are pure low-level operations, so we can introduce arbitrary building blocks this way.</p><pre><code>/** Symbolic form for each mathematical operation. */
export enum AluOp {
  Add = "Add",
  Sub = "Sub",
  Mul = "Mul",
  Idiv = "Idiv",
  Mod = "Mod",
  Min = "Min",
  Max = "Max",

  Sin = "Sin",
  Cos = "Cos",
  Asin = "Asin",
  Atan = "Atan",
  Exp = "Exp",
  Log = "Log",
  Erf = "Erf",
  Erfc = "Erfc",
  Sqrt = "Sqrt",
  Reciprocal = "Reciprocal",
  Cast = "Cast",
  Bitcast = "Bitcast",

  Cmplt = "Cmplt",
  Cmpne = "Cmpne",
  Where = "Where", // Ternary operator: `cond ? a : b`

  Threefry2x32 = "Threefry2x32", // PRNG operation, arg = 'xor' | 0 | 1

  // Const is a literal constant, while GlobalIndex takes data from an array
  // buffer. Special and Variable are distinguished since the former is for
  // indices like the global invocation, while the latter is a value.
  Const = "Const", // arg = value
  Special = "Special", // arg = [variable, n]
  Variable = "Variable", // arg = variable
  GlobalIndex = "GlobalIndex", // arg = [gid, len]; src = [bufidx]
  GlobalView = "GlobalView", // arg = [gid, ShapeTracker], src = [indices...]
}</code></pre><p>Auto-generated GPU kernels are pretty simple for pointwise ops. The tricky part is when there&#8217;s a reduction (a.k.a. <a href="https://en.wikipedia.org/wiki/Tensor_contraction">tensor contraction</a>), most commonly in matmuls and convolutions. These can be optimized pretty well on the web by unrolling judiciously and tiling the loads/stores.</p><p>Below is an example WebGPU matmul kernel generated by jax-js for <code>float32[4096,4096]</code> matrices.</p><pre><code>@group(0) @binding(0) var&lt;storage, read&gt; in0 : array&lt;f32&gt;;
@group(0) @binding(1) var&lt;storage, read&gt; in1 : array&lt;f32&gt;;
@group(0) @binding(2) var&lt;storage, read_write&gt; result : array&lt;f32&gt;;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id : vec3&lt;u32&gt;) {
  if (id.x &gt;= 1048576) { return; }
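  // Each invocation computes a 4x4 output tile: 4096^2 / 16 = 1,048,576 threads.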
  let gidx: i32 = i32(id.x);
  var acc0: f32 = f32(0);
  var acc1: f32 = f32(0);
  var acc2: f32 = f32(0);
  var acc3: f32 = f32(0);
  var acc4: f32 = f32(0);
  var acc5: f32 = f32(0);
  var acc6: f32 = f32(0);
  var acc7: f32 = f32(0);
  var acc8: f32 = f32(0);
  var acc9: f32 = f32(0);
  var acc10: f32 = f32(0);
  var acc11: f32 = f32(0);
  var acc12: f32 = f32(0);
  var acc13: f32 = f32(0);
  var acc14: f32 = f32(0);
  var acc15: f32 = f32(0);
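  // K loop, unrolled 4 wide: 1024 iterations x 4 elements = K = 4096.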
  for (var ridx: i32 = 0; ridx &lt; 1024; ridx++) {
    let x0: i32 = ((gidx / 8192) * 131072) + ((((gidx / 8) % 8) * 16384) + (ridx * 4));
    let x1: f32 = in0[x0];
    let x2: i32 = (((gidx / 64) % 128) * 32) + (((gidx % 8) * 4) + (ridx * 16384));
    let x3: f32 = in1[x2];
    let x4: f32 = in0[x0 + 1];
    let x6: f32 = in0[x0 + 2];
    let x8: f32 = in0[x0 + 3];
    let x10: f32 = in0[x0 + 4096];
    let x11: f32 = in0[x0 + 4097];
    let x12: f32 = in0[x0 + 4098];
    let x13: f32 = in0[x0 + 4099];
    let x14: f32 = in0[x0 + 8192];
    let x15: f32 = in0[x0 + 8193];
    let x16: f32 = in0[x0 + 8194];
    let x17: f32 = in0[x0 + 8195];
    let x18: f32 = in0[x0 + 12288];
    let x19: f32 = in0[x0 + 12289];
    let x20: f32 = in0[x0 + 12290];
    let x21: f32 = in0[x0 + 12291];
    let x22: f32 = in1[x2 + 1];
    let x26: f32 = in1[x2 + 2];
    let x30: f32 = in1[x2 + 3];
    let x5: f32 = in1[x2 + 4096];
    let x23: f32 = in1[x2 + 4097];
    let x27: f32 = in1[x2 + 4098];
    let x31: f32 = in1[x2 + 4099];
    let x7: f32 = in1[x2 + 8192];
    let x24: f32 = in1[x2 + 8193];
    let x28: f32 = in1[x2 + 8194];
    let x32: f32 = in1[x2 + 8195];
    let x9: f32 = in1[x2 + 12288];
    let x25: f32 = in1[x2 + 12289];
    let x29: f32 = in1[x2 + 12290];
    let x33: f32 = in1[x2 + 12291];
    acc0 += x1 * x3 + x4 * x5 + x6 * x7 + x8 * x9;
    acc1 += x10 * x3 + x11 * x5 + x12 * x7 + x13 * x9;
    acc2 += x14 * x3 + x15 * x5 + x16 * x7 + x17 * x9;
    acc3 += x18 * x3 + x19 * x5 + x20 * x7 + x21 * x9;
    acc4 += x1 * x22 + x4 * x23 + x6 * x24 + x8 * x25;
    acc5 += x10 * x22 + x11 * x23 + x12 * x24 + x13 * x25;
    acc6 += x14 * x22 + x15 * x23 + x16 * x24 + x17 * x25;
    acc7 += x18 * x22 + x19 * x23 + x20 * x24 + x21 * x25;
    acc8 += x1 * x26 + x4 * x27 + x6 * x28 + x8 * x29;
    acc9 += x10 * x26 + x11 * x27 + x12 * x28 + x13 * x29;
    acc10 += x14 * x26 + x15 * x27 + x16 * x28 + x17 * x29;
    acc11 += x18 * x26 + x19 * x27 + x20 * x28 + x21 * x29;
    acc12 += x1 * x30 + x4 * x31 + x6 * x32 + x8 * x33;
    acc13 += x10 * x30 + x11 * x31 + x12 * x32 + x13 * x33;
    acc14 += x14 * x30 + x15 * x31 + x16 * x32 + x17 * x33;
    acc15 += x18 * x30 + x19 * x31 + x20 * x32 + x21 * x33;
  }
  let x34: i32 = ((gidx / 8192) * 131072) + ((((gidx / 64) % 128) * 32) + ((((gidx / 8) % 8) * 16384) + ((gidx % 8) * 4)));
  result[x34] = acc0;
  result[x34 + 4096] = acc1;
  result[x34 + 8192] = acc2;
  result[x34 + 12288] = acc3;
  result[x34 + 1] = acc4;
  result[x34 + 4097] = acc5;
  result[x34 + 8193] = acc6;
  result[x34 + 12289] = acc7;
  result[x34 + 2] = acc8;
  result[x34 + 4098] = acc9;
  result[x34 + 8194] = acc10;
  result[x34 + 12290] = acc11;
  result[x34 + 3] = acc12;
  result[x34 + 4099] = acc13;
  result[x34 + 8195] = acc14;
  result[x34 + 12291] = acc15;
}</code></pre><p>If you&#8217;re writing a native library, this isn&#8217;t good enough. For example, you have to at least use tensor cores (<code>mma.sync.aligned.*</code>) on Nvidia GPUs! But on the web, it gets pretty comparable performance to the best open-source libraries, and it seems that <a href="https://github.com/google/dawn">Dawn</a> is alright at bridging any gaps with optimization.</p><p><strong>Onto the frontend.</strong> This is the core of the library, and where the actual autograd and tracing happens. We follow the JAX design quite closely, where there is a set of primitives along with an ambient <em>interpreter stack</em>. This is&#8230; quite difficult, magical, and took me a while to figure out. To learn more, see:</p><ul><li><p><a href="https://docs.jax.dev/en/latest/autodidax.html">Autodidax: JAX core from scratch</a> (2021)</p></li><li><p><a href="https://arxiv.org/abs/1804.00746">The simple essence of automatic differentiation</a> (Elliott 2018)</p></li></ul><p><em>(One particularly cool moment about this way of building an ML library is that you get reverse-mode AD &#8220;for free&#8221; by inverting/transposing the forward-mode rules. I found this really beautiful after I wrapped my head around it; it&#8217;s quite mathematically pleasing. Another cool moment is when you first get arbitrary 2nd, 3rd, &#8230; n-th order derivatives after just implementing the first-order derivative rules &#8212; GradientTape could never!)</em></p><p>Honestly, this is probably the most lost I&#8217;ve ever felt while writing code. It&#8217;s like, nested mutually recursive interpreters to model functors in the &#8220;category of tensors.&#8221;</p><p>Anyway, once I reviewed my differential geometry notes from college and dusted off my understanding of tangents, pulling back cotangents, functors, and so on, I think I eventually figured it out. Though I still had tiny bugs for the next 6 months.
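</p><p>To make the &#8220;transpose the forward-mode rules&#8221; idea concrete, here is a toy dual-number sketch in plain TypeScript. This is illustrative only, not the actual jax-js internals: each forward-mode (JVP) rule is linear in the tangent arguments, and transposing that linear map gives the corresponding reverse-mode rule.</p>

```typescript
// Toy forward-mode AD with dual numbers (a sketch, not jax-js internals).
// Each value carries a tangent; the derivative is read off by seeding tan = 1.
type Dual = { val: number; tan: number };

const mul = (a: Dual, b: Dual): Dual => ({
  val: a.val * b.val,
  // Product rule: the tangent is *linear* in (a.tan, b.tan). Transposing
  // this linear map is exactly the reverse-mode (VJP) rule for mul.
  tan: a.tan * b.val + a.val * b.tan,
});

const sin = (a: Dual): Dual => ({
  val: Math.sin(a.val),
  tan: Math.cos(a.val) * a.tan,
});

// f(x) = x * sin(x), so f'(x) = sin(x) + x * cos(x)
function dfdx(x: number): number {
  const xd: Dual = { val: x, tan: 1 };
  return mul(xd, sin(xd)).tan;
}

console.log(dfdx(2)); // sin(2) + 2 * cos(2)
```
<p>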
&#128514;</p><p>The full list of high-level <code>Primitive</code> operations in jax-js is below:</p><pre><code>/**
 * Frontend primitive operations, which are lowered into Kernel objects before
 * being dispatched to the backend.
 *
 * Any operation between arrays can be described in these parts. This is also
 * the set of primitives that can occur in Jaxpr programs, and the level at
 * which transformations like vmap, grad, and jvp occur. They are loosely based
 * on [XLA](https://openxla.org/xla/operation_semantics).
 *
 * All n-ary operations support broadcasting, with NumPy semantics.
 */
export enum Primitive {
  Add = "add",
  Mul = "mul",
  Idiv = "idiv",
  Neg = "neg",
  Reciprocal = "reciprocal",
  StopGradient = "stop_gradient",
  Cast = "cast",
  Bitcast = "bitcast",
  RandomBits = "random_bits",
  Sin = "sin",
  Cos = "cos",
  Asin = "asin",
  Atan = "atan",
  Exp = "exp",
  Log = "log",
  Erf = "erf",
  Erfc = "erfc",
  Sqrt = "sqrt",
  Min = "min",
  Max = "max",
  Reduce = "reduce",
  Dot = "dot", // sum(x*y, axis=-1)
  Conv = "conv", // see lax.conv_general_dilated
  Pool = "pool",
  PoolTranspose = "pool_transpose",
  Compare = "compare",
  Where = "where",
  Transpose = "transpose",
  Broadcast = "broadcast",
  Reshape = "reshape",
  Flip = "flip",
  Shrink = "shrink",
  Pad = "pad",
  Gather = "gather",
  JitCall = "jit_call",
}</code></pre><p>Notice that many of these are similar to the backend operations above, but some are different. In particular, there are convolutions and matrix multiplications here. These are useful to see in the frontend IR (and for autograd) but can be lowered to a simpler form before the kernels are generated on the backend.</p><p>By default, an operation is just lowered directly to a backend kernel after passing through any necessary transformations (<code>vmap</code>, <code>jvp</code>, <code>grad</code>). But if you&#8217;re using <code>jit</code>, jax-js will trace your program to produce a &#8220;Jaxpr&#8221; (a list of operations), followed by <a href="https://substack.com/home/post/p-163548742">automatic kernel fusion</a> to generate kernels specialized to each input shape.</p><h3>Bugs</h3><p>Building an ML framework is very hard, and it&#8217;s a long task! So far, jax-js has implemented a lot of JAX&#8217;s core functionality, but there&#8217;s still much more. If there&#8217;s an API or operation you want to see, please consider adding it or filing an issue (examples: np.split, FFT, AdamW).</p><p>I have a pretty varied, portable test suite that runs fast:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E_s1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E_s1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png 424w,
https://substackcdn.com/image/fetch/$s_!E_s1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png 848w, https://substackcdn.com/image/fetch/$s_!E_s1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png 1272w, https://substackcdn.com/image/fetch/$s_!E_s1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E_s1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png" width="566" height="395.8580060422961" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:926,&quot;width&quot;:1324,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:252710,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/179060245?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!E_s1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png 424w, https://substackcdn.com/image/fetch/$s_!E_s1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png 848w, https://substackcdn.com/image/fetch/$s_!E_s1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png 1272w, https://substackcdn.com/image/fetch/$s_!E_s1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee5101ba-af11-4ba0-b041-0b5fd0624d32_1324x926.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So we are in a good position to find bugs and fix them. But making an ML library is quite difficult, and WebGPU is a nascent technology (e.g., I somehow gave my MacBook <a href="https://x.com/ekzhang1/status/1957107092868727225">kernel panics</a>)&#8212;there will be bugs! Please report.</p><h3>Technical: Performance</h3><p>We haven&#8217;t spent a ton of time optimizing yet, but performance is generally pretty good. <code>jit</code> is very helpful for fusing operations together, and it&#8217;s a feature only available on the web in jax-js. The default kernel-tuning heuristics get about 3000 GFLOP/s for matrix multiplication on an M4 Pro chip (<a href="https://jax-js.com/bench/matmul">try it</a>).</p><p>On that specific benchmark, it&#8217;s actually more GFLOP/s than both <a href="https://github.com/tensorflow/tfjs">TensorFlow.js</a> and <a href="https://www.npmjs.com/package/onnxruntime-web">ONNX</a>, which both use handwritten libraries of custom kernels (versus jax-js, which generates kernels with an ML compiler).</p><p>Some particularly useful / low-hanging fruit to look at:</p><ul><li><p>The WebAssembly backend currently is quite simple, I didn&#8217;t spend a ton of time optimizing it, but measurably it could be &gt;150x faster on my MacBook Pro. This difference comes from a few things multiplying:</p><ul><li><p>Don&#8217;t recompute loop indices each time, we could improve FLOPs by ~1-3x.</p></li><li><p>Do loop unrolling/tiling, will improve FLOPs by ~2-3x.</p></li><li><p>Use SIMD instructions. This would improve FLOPs by 4x.</p></li><li><p>Add multi-threading (10x on my laptop), to use all available cores. 
This requires SharedArrayBuffer (<a href="https://developer.mozilla.org/en-US/docs/Web/API/Window/crossOriginIsolated">crossOriginIsolated</a>), and there are some caveats around sync/async handling that require care.</p></li></ul></li><li><p>Running the forward pass of the MobileCLIP2 transformer model only reaches about 1/3 the FLOP/s of a pure 4096x4096 matmul. Maybe we can improve this, especially in the causal self-attention layer.</p></li><li><p>Although WebGPU is rapidly gaining in popularity <a href="https://caniuse.com/webgpu">and support</a>, it&#8217;s probably worth having a WebGL backend as well, as a fallback that&#8217;s guaranteed to work in pretty much all browsers and is still pretty fast. This isn&#8217;t a huge amount of work; the WebGPU backend is &lt;700 lines of code, for example.</p></li></ul><h3>Technical: Feature parity</h3><p>jax-js strives for <em>approximate</em> API compatibility with the JAX Python library (and through that, NumPy). But some features vary for a few reasons:</p><ol><li><p><strong>Data model:</strong> jax-js has <em>ownership</em> of arrays using the <code>.ref</code> system, which obviates the need for APIs like <code>jit()</code>&#8217;s <code>donate_argnums</code> and <code>numpy.asarray()</code>.</p></li><li><p><strong>Language primitives:</strong> JavaScript has no named arguments, so method call signatures may take objects instead of Python&#8217;s keyword arguments. Also, PyTrees are translated in spirit to &#8220;JsTree&#8221; in jax-js, but their specification is different.</p></li><li><p><strong>Maturity:</strong> JAX has various types like <code>complex64</code>, advanced functions like <code>hessenberg()</code>, and advanced higher-order features like <code>lax.while_loop()</code> that we haven&#8217;t implemented. Some of these are not easy to implement on GPU.</p></li></ol><p>Other features just aren&#8217;t implemented yet.
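</p><p>As a toy illustration of the pytree idea (a minimal sketch in TypeScript; the actual JsTree specification in jax-js differs), flattening separates numeric leaves from container structure, so a transformation can map over the leaves and rebuild the container afterwards:</p>

```typescript
// Minimal pytree-style flatten (illustrative; not the real JsTree spec).
// Leaves are numbers; containers are arrays and plain objects.
type Tree = number | Tree[] | { [key: string]: Tree };

function flatten(t: Tree, leaves: number[] = []): number[] {
  if (typeof t === "number") {
    leaves.push(t);
  } else if (Array.isArray(t)) {
    t.forEach((child) => flatten(child, leaves));
  } else {
    // Sort keys so the leaf order is deterministic.
    for (const k of Object.keys(t).sort()) flatten(t[k], leaves);
  }
  return leaves;
}

console.log(flatten({ w: [1, 2], b: 3 })); // [3, 1, 2] (keys sorted: b, w)
```
<p>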
But those can probably be added easily!</p><p>I&#8217;ve made a table of every JAX library feature and its implementation status in jax-js; <a href="https://github.com/ekzhang/jax-js/blob/main/FEATURES.md">see here</a>. There are a couple of big ones that stand out.</p><p>You&#8217;re welcome to contribute, though I&#8217;d also love it if you could try using jax-js. :D</p>]]></content:encoded></item><item><title><![CDATA[ssh-hypervisor: "SimCity for VMs"]]></title><description><![CDATA[Tackling a larger systems programming project with AI tools.]]></description><link>https://ss.ekzhang.com/p/ssh-hypervisor-simcity-for-vms</link><guid isPermaLink="false">https://ss.ekzhang.com/p/ssh-hypervisor-simcity-for-vms</guid><dc:creator><![CDATA[Eric]]></dc:creator><pubDate>Tue, 23 Sep 2025 05:11:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!f9Sm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This weekend I tried to make <a href="https://github.com/ekzhang/ssh-hypervisor">a hypervisor hooked up to SSH</a>. It&#8217;s like:</p><pre><code>ssh &lt;YOUR_NAME&gt;@vmcity.ekzhang.com</code></pre><p>But every time someone logs in with a different name, instead of being a user on the host machine, it greets you and then spins up a <em>virtual machine</em> with <a href="https://firecracker-microvm.github.io/">Firecracker</a>.</p><pre><code>$ ssh eric@vmcity.ekzhang.com
<strong>
Hello, eric! &#127800;</strong>

Today is Sunday. It's your first time here.

Recent logins:
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;  User   &#9474;  Last login  &#9474;
&#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
&#9474; matthew &#9474; 2 hours ago  &#9474;
&#9474; kathy   &#9474; 4 hours ago  &#9474;
&#9474; linus   &#9474; 16 hours ago &#9474;
&#9474; sen     &#9474; 4 days ago   &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;

<strong>Booting up your fresh VM:</strong>
&#128161; &#9646;&#9646;&#9646;&#9647;&#9647;&#9647;&#9647;&#9647;&#9647;&#9647;&#9647;&#9647; 25%</code></pre><p>If you haven&#8217;t logged in for a while, we store your VM in a <a href="https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md">snapshot</a>.</p><p><em>This isn&#8217;t an original idea, by the way!</em> I had seen this somewhere online, with a person showing off their tiny OS with Firecracker microVMs over public SSH. Unfortunately I don&#8217;t remember where I saw this, but I wanted to take this idea and make it a bit whimsical, while adding a couple of toy features.</p><blockquote><p>Update: A commenter shared the project <a href="https://github.com/nuta/kerla">https://github.com/nuta/kerla</a></p></blockquote><p>Back in high school and college, I used to make a lot of smaller, fun projects <a href="https://www.ekzhang.com/projects">over the weekend</a> and share them with people. I don&#8217;t do this as much now with a job. These tiny projects became less interesting as I grew familiar with systems; they turned more implementation-heavy rather than driven by new ideas.</p><p>I think that&#8217;s sad though. This project would maybe have taken me 1-2 weeks in the past, so I was hoping that with AI tools, I could do it in just a weekend (<a href="https://matklad.github.io/2025/08/31/vibe-coding-terminal-editor.html">inspiration</a>). Then I could spend time on more frivolous projects. I still get ideas all the time. This is one of them, so let&#8217;s just build it, see where it goes, and let my creative side take control!</p><h2>What is a hypervisor?</h2><p>I saw <a href="https://seiya.me/blog/hypervisor-in-1000-lines">this quote</a> recently that sums it up well:</p><blockquote><p><strong>Hypervisor is essentially a hardware-assisted catch block<br><br></strong>This is all what I want you to learn from this book. Hardware-assisted hypervisors are event handlers.
They are not like a CPU emulator.<br><br>In JavaScript, the life of a hypervisor looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XW68!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XW68!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png 424w, https://substackcdn.com/image/fetch/$s_!XW68!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png 848w, https://substackcdn.com/image/fetch/$s_!XW68!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png 1272w, https://substackcdn.com/image/fetch/$s_!XW68!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XW68!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png" width="606" height="242.2335164835165" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:606,&quot;bytes&quot;:156659,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/174157700?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XW68!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png 424w, https://substackcdn.com/image/fetch/$s_!XW68!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png 848w, https://substackcdn.com/image/fetch/$s_!XW68!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png 1272w, https://substackcdn.com/image/fetch/$s_!XW68!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb700df60-5515-4025-90fe-21b5859c8c4e_2020x808.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A hypervisor runs the guest OS in a <code>try</code> block, <code>catch</code>es events (VM exits), and goes back to the guest mode again.</p></blockquote><p>I want to keep this in mind while working through the project. Firecracker is a very lightweight hypervisor, and they spin up &#8220;microVMs&#8221; &#8212; since hypervisors are catch-blocks, that essentially means the catch-block is small. Firecracker <a href="https://unixism.net/2019/10/how-aws-firecracker-works-a-deep-dive/#:~:text=Firecracker%20is%20a%20VM%20environment,()%20%2C%20cgroups%20and%20seccomp%20rules.">only emulates a few devices</a> and relies on host features for as much as possible. This makes it really fast to boot compared to QEMU.</p><p>However, this <em>doesn&#8217;t mean</em> that Firecracker is any simpler to set up than other hypervisors. 
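</p><p>The quoted mental model can be made concrete with a toy TypeScript sketch. This is purely illustrative: real VM exits come from the KVM API, not a generator, but the control flow is the same loop.</p>

```typescript
// "Hypervisor = event handler": the guest runs until something traps,
// the host handles the exit, and execution resumes. Here a generator
// simulates guest execution and its VM exits.
type VmExit = { kind: "serial"; byte: string } | { kind: "halt" };

function* guest(): Generator<VmExit, void> {
  // The guest "writes" to a serial port, then halts.
  for (const b of "hi") yield { kind: "serial", byte: b };
  yield { kind: "halt" };
}

function hypervisor(): string {
  let serialLog = "";
  for (const exit of guest()) {
    // Each loop iteration is one VM exit: the hardware-assisted "catch".
    if (exit.kind === "serial") serialLog += exit.byte; // emulate the device
    else break; // halt: stop resuming the guest
  }
  return serialLog;
}

console.log(hypervisor()); // "hi"
```
<p>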
You still need to hook up all the parts of a virtual computer in the right places to get things working! For instance:</p><ul><li><p>Bring your own init system like OpenRC / systemd.</p></li><li><p>Attach a kernel, ramfs, and disk at startup.</p></li><li><p>Want network? Set up a MAC address, TAP device, bridge, IP routing rules, firewall filters, packet forwarding, and so on.</p><ul><li><p>Want multiple VMs? Create a network bridge, set the controller of the TAP to that bridge, allocate private IPs from a pool, dynamically configure iptables.</p></li></ul></li><li><p>Want serial logs? Edit your kernel boot arguments to send them at a baud rate over the /dev/console TTY.</p></li></ul><p>It&#8217;s a good reminder that VMs are tiny little computers that live inside your own. When you start up VMs, you&#8217;re building up your own computer from scratch!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0mH3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0mH3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0mH3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0mH3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg 1272w,
https://substackcdn.com/image/fetch/$s_!0mH3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0mH3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg" width="6296" height="2226" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2226,&quot;width&quot;:6296,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2102976,&quot;alt&quot;:&quot;M4 MacBook Air Teardown: Apple, When Will MacBooks Finally Get Repair  Upgrades? - iFixit&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="M4 MacBook Air Teardown: Apple, When Will MacBooks Finally Get Repair  Upgrades? - iFixit" title="M4 MacBook Air Teardown: Apple, When Will MacBooks Finally Get Repair  Upgrades? 
- iFixit" srcset="https://substackcdn.com/image/fetch/$s_!0mH3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0mH3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0mH3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0mH3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F111fb6f5-abdb-4cac-b67c-50a892287e6e_6296x2226.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Apple M4. Look at that wild chip. Virtualization is taming the beast. (<a href="https://www.ifixit.com/News/108697/m4-macbook-air-teardown-apple-when-will-macbooks-finally-get-repair-upgrades">source</a>)</figcaption></figure></div><p>What I found in this project is that, while AI tools made coding a lot faster (thousands of lines in minutes), they didn&#8217;t speed up the debugging process all that much. This, understandably, ended up being a big part of the work. &#128517;</p><h2>Setting up the project</h2><p>The first step is to decide what we&#8217;re building. I&#8217;m going to use Go because:</p><ul><li><p>It is a simple, relatively easy systems programming language that compiles fast.</p><ul><li><p>This means AI is very good at writing Go code.</p></li></ul></li><li><p>It has good support for Firecracker via the <a href="https://github.com/firecracker-microvm/firecracker-go-sdk">official Go SDK</a>.</p></li><li><p>It compiles to native, statically-linked executables (without <a href="https://go.dev/wiki/cgo">cgo</a>). I <em>could</em> deal with glibc / musl issues, but that&#8217;s just one more dimension added to an already tricky setup with kernels and virtualization.</p></li></ul><p>So we initialize the project, and I give the AI some basic direction. To start up the VM, we&#8217;ll embed a static firecracker binary + vmlinux inside our own binary during the build process, which can be spooled to a tempfile and <a href="https://en.wikipedia.org/wiki/Exec_(system_call)">execv</a>&#8217;d.</p><p>This will live in a <code>Manager</code> on the Go side that creates VMs and sets up networking.
Then, we build a custom SSH server (not relying on OpenSSH) to forward connections to a VM. Here&#8217;s what the AI spat back at me:</p><pre><code>Architecture Overview

ssh-hypervisor is a Go-based SSH server that dynamically provisions
Firecracker microVMs. When users SSH in, they get a fresh or restored
VM instance.

Core Components

1. Network Architecture
- Host SSH server (using Wish library) on configurable port (default 2222)
- TAP devices provide VM network connectivity (configurable CIDR, default 192.168.100.0/24)
- SSH channel proxying: Host SSH channels &#8594; TAP network &#8594; Guest SSH server
- Health check: Poll guest SSH port (22) to detect VM boot completion

2. VM Lifecycle
- Provisioning: Extract embedded Firecracker binary, configure VM (memory/CPU), boot with TAP networking
- Snapshots: Idle VMs suspended to disk, restored on user reconnection
- Guest OS: Standard Linux with sshd, handles all SSH protocol complexity (SFTP, port forwarding, etc.)

3. Code Structure
cmd/main.go              # CLI entry point, Linux+amd64/arm64 only
internal/
  config.go              # Configuration &amp; validation
  server/server.go       # SSH server implementation (TODO)
scripts/download-firecracker.sh  # Downloads &amp; embeds Firecracker binary
binaries/firecracker     # Embedded static binary

4. Key Features
- Static binary: Firecracker embedded via go:embed, no external dependencies
- Build constraints: Prevents compilation on unsupported platforms
- Configuration: CLI flags for port, VM specs, network CIDR, data directory
- Channel transparency: Host becomes invisible proxy between client SSH and guest SSH

Next Steps

1. Integrate Wish SSH server
2. Implement VM management &amp; Firecracker integration
3. Build TAP networking &amp; health checks
4. Add snapshot/restore logic</code></pre><p>Some parts of this summary seem a bit overcomplicated to me, but overall the direction is fine for now.</p><p>It sets up the scaffold; I rearrange some files manually to get it working and to fix the //go:embed paths. Alright, this seems like it will work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ls7c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ls7c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png 424w, https://substackcdn.com/image/fetch/$s_!Ls7c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png 848w, https://substackcdn.com/image/fetch/$s_!Ls7c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png 1272w, https://substackcdn.com/image/fetch/$s_!Ls7c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ls7c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png" width="514" height="386.80677966101695"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1180,&quot;resizeWidth&quot;:514,&quot;bytes&quot;:164061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/174157700?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ls7c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png 424w, https://substackcdn.com/image/fetch/$s_!Ls7c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png 848w, https://substackcdn.com/image/fetch/$s_!Ls7c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png 1272w, https://substackcdn.com/image/fetch/$s_!Ls7c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d476370-c94a-4dc8-998c-301ef431037d_1180x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Getting Firecracker to run &amp; SSH</h2><p>The hardest part of this project will be getting the VM to run and be accessible by SSH. So before I add more complexity, we should figure this out first. I ask the AI to comment-out the server startup code in the CLI entrypoint, and temporarily just have the entrypoint start up a machine &#8212; this works.</p><p>I need to make some manual changes to get logging with logrus working, as well as other things that the AI can&#8217;t figure out:</p><ul><li><p>Adding <code>syscall.SysProcAttr{ Setpgid: true }</code> to the process and providing the <code>firecracker.WithProcessRunner()</code> option. 
The latter specifies the spooled Firecracker binary path, and the former is needed so that Ctrl+C on the <a href="https://www.man7.org/linux/man-pages/man2/setpgid.2.html">controlling terminal</a> (the server) doesn&#8217;t also interrupt the Firecracker subprocess prematurely, before the Go code can shut down gracefully.</p></li><li><p>Setting up the network of the machine using auto-generated MAC addresses, passing an ip=&#8230; option to the kernel at boot to configure its eth0 network interface with the proper gateway and netmask. This was based on <a href="https://gist.github.com/jvns/9b274f24cfa1db7abecd0d32483666a3">Julia Evans&#8217;s gist</a>.</p></li><li><p>Creating a network bridge on startup and assigning the VM&#8217;s TAP device to that bridge. Also, setting up CI in GitHub Actions (with kvm support) and figuring out CAP_NET_ADMIN permissions to run the aforementioned setup.</p></li></ul><p>So now it&#8217;s working. I can run the binary with a provided rootfs (built from an Alpine image with some customization) and it starts a full VM!
I can see the serial logs too.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f9Sm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f9Sm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png 424w, https://substackcdn.com/image/fetch/$s_!f9Sm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png 848w, https://substackcdn.com/image/fetch/$s_!f9Sm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png 1272w, https://substackcdn.com/image/fetch/$s_!f9Sm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f9Sm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png" width="1456" height="562" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:256554,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/174157700?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f9Sm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png 424w, https://substackcdn.com/image/fetch/$s_!f9Sm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png 848w, https://substackcdn.com/image/fetch/$s_!f9Sm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png 1272w, https://substackcdn.com/image/fetch/$s_!f9Sm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ec47a4-0854-4ab0-bee6-c9562bdfb29e_1988x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Unfortunately, I can&#8217;t SSH into the machine.</strong> Something is wrong. And there are no serial logs for the sshd daemon with OpenRC. <strong>&#128557;</strong></p><pre><code>$ ssh root@192.168.100.2
ssh: connect to host 192.168.100.2 port 22: Connection refused</code></pre><p>So we begin the network debugging, yet again. Let&#8217;s see if packets are reaching the destination at least.</p><pre><code>$ ping 192.168.100.2
PING 192.168.100.2 (192.168.100.2) 56(84) bytes of data.
64 bytes from 192.168.100.2: icmp_seq=1 ttl=64 time=0.401 ms
64 bytes from 192.168.100.2: icmp_seq=2 ttl=64 time=0.359 ms

$ tcptraceroute 192.168.100.2
Selected device sshvm-br0, address 192.168.100.1, port 52619 for outgoing packets
Tracing the path to 192.168.100.2 on TCP port 80 (http), 30 hops max
 1  192.168.100.2 [closed]  0.430 ms  0.243 ms  0.287 ms</code></pre><p>Ok, &#8230; so that works. Looks like ICMP and TCP packets are both reaching the VM&#8217;s destination IP address on the bridge, but port 22 is still not accepting connections.</p><p>At this point, the issue can be broken up into two possibilities:</p><ol><li><p><strong>sshd is not starting in the guest VM.</strong> We have no logs, so it could just be not listening at all, and then the host isn&#8217;t reaching port 22 of course.</p></li><li><p><strong>There is some kind of networking problem.</strong> I find this less likely because &#8220;Connection refused&#8221; (ECONNREFUSED) usually means that a server <em>actively rejected</em> a connection attempt due to a port not being open; network issues usually show up as a timeout / no reachable route.</p></li></ol><p>Still, both are possible, so we should first figure out which category this falls into. I&#8217;ll start by checking whether sshd is indeed starting up.</p><p>So I copy-pasted everything back into Claude Code and had it try to figure things out. It flailed around for a while; luckily, it&#8217;s able to undo its own work.</p><p>Okay, let&#8217;s go back to the basics. How do we get sshd to run on init? I think the immediate issue for observability is that I can&#8217;t SSH into the Firecracker machine and figure out why it&#8217;s not running. This turns out to be a whole rabbit hole:</p><ul><li><p>The next step is to debug why &#8220;<a href="https://www.man7.org/linux/man-pages/man8/agetty.8.html">agetty</a>&#8221; is not running on startup, but I can&#8217;t figure this out either.</p></li><li><p>So I add agetty to my VM&#8217;s inittab instead of as an OpenRC service.</p><pre><code>cat &gt; /etc/inittab &lt;&lt;'EOF'
ttyS0::respawn:/sbin/agetty -L 115200 ttyS0 linux
EOF</code></pre></li><li><p>But even after that, I can&#8217;t log in on the serial console as root! It won&#8217;t accept the password I set with chpasswd. I search around; maybe the issue is using busybox login, so I update to util-linux-login. No dice, though.</p></li><li><p>I&#8217;m not really a sysadmin. The only thing that has worked so far is opening up a &#8220;rescue shell&#8221; by setting init=/bin/sh at boot, so I&#8217;ll just try that next.</p></li><li><p>This does work! I still can&#8217;t run sshd, but running <code>nc -l -p 42</code> in the guest shell and <code>nc 192.168.100.2 42</code> outside establishes network communication between the host and guest, so that&#8217;s great. <strong>The problem is no longer in the network</strong>, but in sshd itself. &#129395;</p></li><li><p>I&#8217;m kind of tired of working in a VM. So I see if I can get sshd working in a simple Alpine container (which this VM is based on) instead.</p><pre><code>$ docker run -it --rm -p 1022:22 alpine sh

/ # apk add --no-cache util-linux util-linux-login openssh
...
OK: 21 MiB in 63 packages

/ # apk add --no-cache openrc
...
OK: 23 MiB in 71 packages

/ # echo "root:root" | chpasswd
chpasswd: password for 'root' changed

/ # sed -i 's/^#PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config

/ # ssh-keygen -A
ssh-keygen: generating new host keys: RSA ECDSA ED25519 

/ # /usr/sbin/sshd -D -e
Server listening on 0.0.0.0 port 22.
Server listening on :: port 22.</code></pre><p>And well, yes, it works. I can ssh into the container from port 1022. Okay, so something is very different about the VM, and <strong>it is causing sshd to hang</strong>.</p></li><li><p>I&#8217;m very confused, but the very next thing I try fixes the issue. <strong>The issue is entropy.</strong> If I run <code>cat /proc/sys/kernel/random/entropy_avail</code>, there are only a few bits of random entropy available! So the operating system blocks on reading random initial state (needed for cryptography in sshd). This is because Firecracker does not provide a virtio-rng device by default.</p><ul><li><p>I originally validated this by installing rngd and running it in the background manually, which gets sshd working.</p></li><li><p>But it&#8217;s probably better to produce actual entropy, so I&#8217;ll add virtio-rng now.</p></li><li><p>Never mind: the Firecracker Go SDK <a href="https://github.com/firecracker-microvm/firecracker-go-sdk/issues/505">doesn&#8217;t support adding an entropy device</a> during machine creation. Maybe there&#8217;s a raw way to do it?</p></li><li><p>Eh&#8230; this is not worth spending more time on; I&#8217;ll just use rngd.</p></li></ul></li></ul><p>It works now! Yes! Turns out that running VMs isn&#8217;t just like Docker; it&#8217;s being your own sysadmin, but even more difficult than usual. :&#8217;)</p><p><em>During this whole debugging session (5+ hours), I asked ChatGPT a lot of stuff. I gave up on Claude Code since it kept making changes. The AI very confidently guided me toward directions that didn&#8217;t work, and it gave me a lot of false hope. But it did eventually find the issue (the lack of random entropy causing silent blocking), which I wouldn&#8217;t have found otherwise without Google searches or strace. I think it probably saved time overall?</em></p><p>Then, I spent another hour trying to get this working with OpenRC. It does not work.
I&#8217;m just going to call it quits and use <a href="https://unix.stackexchange.com/questions/34462/why-does-linux-allow-init-bin-bash">bash as my init process</a>, oh well.</p><p>And then! It&#8217;s working now, but SSH still takes 6 seconds to start up.</p><h2>virtio-rng and building my own vmlinux</h2><p>Remember the entropy device from earlier? I still have this rngd hack in my init script that initializes fake entropy:</p><pre><code>rngd -f -r /dev/urandom &amp;</code></pre><p>It&#8217;s become clear that this is a bad idea: it adds exactly 5 seconds to VM startup to spin up a &#8220;jitter&#8221; generator, which makes the time between boot and getting a shell ~4x slower. I work out how to add an entropy device by manually hitting the Firecracker HTTP endpoint, but it still doesn&#8217;t appear as /dev/hwrng in the guest.</p><p>I think this is because the guest kernel I&#8217;m using is from the <a href="https://s3.amazonaws.com/spec.ccfc.min/img/quickstart_guide/x86_64/kernels/vmlinux.bin">quickstart_guide</a> public bucket in S3, and it&#8217;s a very old Linux 4.14 image without many devices. Or maybe not. In any case, if I use a newer kernel, then <code>random.trust_cpu</code> (introduced in Linux 4.19) will be respected, and I shouldn&#8217;t have an entropy problem either way, since the kernel can rely on hardware RNG instructions.</p><p>So I try to build an image <a href="https://github.com/firecracker-microvm/firecracker/tree/main/resources/guest_configs">based on Linux 6.1</a>, and I run into&#8212;problems!</p><pre><code>[   13.203150] clk: Disabling unused clocks
[   13.207664] /dev/root: Can't open blockdev
[   13.209652] VFS: Cannot open root device "vda" or unknown-block(0,0): error -6
[   13.212952] Please append a correct "root=" boot option; here are the available partitions:
[   13.216454] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[   13.219839] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.153 #2
[   13.220411] Call Trace:
[   13.220411]  &lt;TASK&gt;
[   13.220411]  show_stack+0x3a/0x40
[   13.220411]  dump_stack_lvl+0x3d/0x51
[   13.220411]  dump_stack+0x10/0x16
[   13.220411]  panic+0x100/0x297
[   13.220411]  mount_block_root+0x13e/0x1d9
[   13.220411]  mount_root+0x117/0x138
[   13.220411]  prepare_namespace+0x135/0x16a
[   13.220411]  kernel_init_freeable+0x166/0x188
[   13.220411]  ? rest_init+0xc0/0xc0
[   13.220411]  kernel_init+0x15/0x120
[   13.220411]  ret_from_fork+0x1f/0x30
[   13.220411]  &lt;/TASK&gt;
[   13.220411] Kernel Offset: disabled
[   13.220411] Rebooting in 1 seconds..
2025-09-23T00:36:38.075124657 [anonymous-instance:main] Vmm is stopping.
2025-09-23T00:36:38.075549933 [anonymous-instance:main] Vmm is stopping.
2025-09-23T00:36:38.090112389 [anonymous-instance:main] Firecracker exiting successfully. exit_code=0</code></pre><p>This is a giant pain. &#8220;Cannot open root device&#8221; is a completely useless error message that could mean any number of things: APIC issues, uninitialized modules, or even Firecracker bugs. The AI is equally confused.</p><p>I spend about an hour stuck on this, trying different Linux and Firecracker versions and <a href="https://www.kernelconfig.io/index.html">flipping kernel configs</a> on/off.</p><p>At this point, it&#8217;s Monday. So I go to work. And while I&#8217;m on the train there, I do some Googling and find this bit from <a href="https://github.com/firecracker-microvm/firecracker/blob/main/docs/kernel-policy.md">kernel-policy.md</a>:</p><blockquote><p>We use these configurations to build microVM-specific kernels vended by <strong>Amazon Linux</strong> &#8230; As a result, kernel configurations found in this repo <strong>should be used to build exclusively the aforementioned Amazon Linux kernels</strong>. We do not guarantee that using these configurations to build upstream kernels, will work or produce usable kernel images.</p></blockquote><p>&#128557;</p><p>Okay, so that&#8217;s it. I need to use Amazon Linux, and then it will work, right? Of course the people at Amazon would use their own Linux fork.
So it&#8217;s back to the AI.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6NRe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6NRe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png 424w, https://substackcdn.com/image/fetch/$s_!6NRe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png 848w, https://substackcdn.com/image/fetch/$s_!6NRe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png 1272w, https://substackcdn.com/image/fetch/$s_!6NRe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6NRe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png" width="508" height="423.92957746478874" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:1136,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:163059,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/174157700?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6NRe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png 424w, https://substackcdn.com/image/fetch/$s_!6NRe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png 848w, https://substackcdn.com/image/fetch/$s_!6NRe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png 1272w, https://substackcdn.com/image/fetch/$s_!6NRe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c66803-7b12-46f8-a054-3d49d0afe41f_1136x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let&#8217;s run this build again, using <a href="https://docs.orbstack.dev/machines/">Orbstack</a> for their seamless VMs on macOS. Now that we&#8217;re building from Amazon Linux, it should work with Firecracker, right?</p><p>And it fails again &#8212; but I then <a href="https://github.com/firecracker-microvm/firecracker/issues/4881">removed the pci=off acpi=off options</a>, and this combined with Amazon Linux allows it to finally boot. Hooray.</p><p>Even better, I&#8217;m no longer on an ancient Linux version. Timidly, I decide to try returning to OpenRC despite my issues from earlier. And yes: OpenRC works, sshd is running, and even agetty is finally no longer stalling. Everything is blissful. Yay! It all makes sense again, definitely worth debugging.</p><p>Now that things boot, everything just got a lot easier. 
I also build the vmlinux kernel for ARM64, just for fun, again inside an Orbstack VM. :)</p><h2>Hooking Firecracker up to an SSH server</h2><p>We have VMs working! It&#8217;s time to hook them up to SSH and build our app.</p><p>I&#8217;m relying heavily on the AI to figure out the implementation here, and it&#8217;s going swimmingly. It worked out the SSH protocol with no issues at all, and it&#8217;s especially good at making cute interactive terminal output, like animated progress bars.</p><p>For things like session management and architecture, it also works well to direct it in broad strokes as we make changes.</p><p>(At some point my VM ran out of RAM and started thrashing.)</p><p>But for the most part, this was pretty easy to code, since nothing was too tricky to debug on the application side. I just kept iterating, trying it out and fixing things that didn&#8217;t quite look right.</p><p>On the systems side, I worked out some <code>iptables</code> rules and added a couple of entries, applied only when <code>-allow-internet</code> is passed in, so the VMs get Internet access.</p><pre><code>iptables -A FORWARD -i sshvm-br0 ! -o sshvm-br0 -j ACCEPT -m comment --comment "ssh-hypervisor"
iptables -A FORWARD ! -i sshvm-br0 -o sshvm-br0 -j ACCEPT -m comment --comment "ssh-hypervisor"
iptables -t nat -A POSTROUTING -s &lt;VM_CIDR&gt; ! -o sshvm-br0 -j MASQUERADE -m comment --comment "ssh-hypervisor"</code></pre><h2>The end result</h2><p>It works! And it is very cute :)</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;a244cba9-58e2-492b-8ad8-a2650b1b5940&quot;,&quot;duration&quot;:null}"></div><p>I am still hosting this at vmcity.ekzhang.com for now, but I will stop at some point, earlier if I notice any crypto miners or other unscrupulous folk.</p><pre><code>ssh &lt;YOUR_NAME&gt;@vmcity.ekzhang.com  # try it now!</code></pre><p>This was lots of fun, and we ended up with a static binary that runs VMs hooked up to on-demand SSH.</p><p>You can get the code here: <a href="https://github.com/ekzhang/ssh-hypervisor">https://github.com/ekzhang/ssh-hypervisor</a></p>]]></content:encoded></item><item><title><![CDATA[Abridged notes on the LLM scaling book]]></title><description><![CDATA[Essays on ML systems, looking at LLMs on TPUs]]></description><link>https://ss.ekzhang.com/p/abridged-notes-on-the-llm-scaling</link><guid isPermaLink="false">https://ss.ekzhang.com/p/abridged-notes-on-the-llm-scaling</guid><dc:creator><![CDATA[Eric]]></dc:creator><pubDate>Mon, 25 Aug 2025 17:14:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8ws9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In February, folks at Google DeepMind published a book on <a href="https://jax-ml.github.io/scaling-book/">LLM scaling</a>.</p><p>The book focuses on how you can model LLM scaling with math. I&#8217;m a big fan of stuff like this. 
Intuitive reasoning about systems is really important; it lets you visualize their shape and behaviors at a glance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8ws9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8ws9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png 424w, https://substackcdn.com/image/fetch/$s_!8ws9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png 848w, https://substackcdn.com/image/fetch/$s_!8ws9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!8ws9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8ws9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png" width="580" height="411.8956043956044" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1034,&quot;width&quot;:1456,&quot;resizeWidth&quot;:580,&quot;bytes&quot;:579964,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/171607012?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8ws9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png 424w, https://substackcdn.com/image/fetch/$s_!8ws9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png 848w, https://substackcdn.com/image/fetch/$s_!8ws9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!8ws9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cfc2249-41fd-4755-b58f-786132725af9_1802x1280.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>How to Scale Your Model</em> (2025)</figcaption></figure></div><p>I thought it would be a good time to read, learn and add some personal commentary (as someone working in the industry at <a href="https://modal.com/">Modal</a>). Expect these notes to be super abridged and not a book replacement &#8212; I will skip things, and any errors are my fault.</p><p><em>(I&#8217;m still working on jax-js, by the way. It&#8217;s going well! Since the <a href="https://substack.com/home/post/p-163548742">last update in May</a>, 3 months ago, we&#8217;ve gone from matrix multiplication to full neural networks, complex operations like softmax + convolution, optimizers, and kernel tuning / optimization.
The project is now over 10,000 lines of code.)</em></p><p>Without further ado, let&#8217;s begin!</p><h2><a href="https://jax-ml.github.io/scaling-book/">Part 0: Intro</a></h2><p>This book is about <em>scaling LLMs on TPUs</em>. In the past, ML researchers didn&#8217;t think so much about performance. But today, research takes a lot of compute.</p><p>ML systems are complex enough that you can&#8217;t just fiddle with parameters until they become fast. You need a deep understanding of <em>how long it takes to run</em> LLMs, based on compute, memory, and network factors. This informs the fundamental research you do, as well as systems design and efficiency.</p><p>We&#8217;ll then discuss:</p><ul><li><p>Transformer architecture, FLOP math for forward and backward passes.</p></li><li><p>Parallelism strategies (<strong>data, tensor, pipeline, expert</strong>) and other tricks (<strong>FSDP, host offload, gradient accumulation</strong>) for scaling LLM training and inference with increased numbers of GPUs and nodes, hopefully linear in performance.</p></li><li><p>Practical examples in JAX and with the LLaMA-3 model.</p></li><li><p>The final chapter is about Nvidia GPUs.</p></li></ul><h2><a href="https://jax-ml.github.io/scaling-book/roofline/">Part 1: Rooflines</a></h2><p>The roofline model considers <strong>communication time</strong> and <strong>computation time</strong>:</p><p>(Note: Communication could either be within a single chip, loading from <a href="https://modal.com/gpu-glossary/device-software/global-memory">global memory</a> in a GPU, or over multi-chip / multi-node links like PCIe, NVLink, Infiniband, RoCEv2, &#8230;)</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\max(T_\\text{math}, T_\\text{comms}) \\leq T \\leq T_\\text{math} + T_\\text{comms}.&quot;,&quot;id&quot;:&quot;HTHLKRXSTO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Typically, we use the maximum of communication and computation, since you can overlap
them in most cases. But even if you can&#8217;t overlap them, it&#8217;s a good approximation, since it&#8217;s off by at most a factor of 2.</p><p>Since there&#8217;s a max() here, we have two regimes:</p><ul><li><p><strong>Compute-bound:</strong> T_math &gt; T_comms. You are getting full utilization from your hardware, and the link is not saturated.</p></li><li><p><strong>Comms-bound:</strong> T_comms &gt; T_math. You&#8217;re wasting at least some of the FLOPs/s from your hardware, waiting on the saturated link.</p></li></ul><p>You want to be <em>compute-bound</em>, since that&#8217;s what you&#8217;re paying for &#8212; <a href="https://modal.com/blog/gpu-utilization-guide#what-is-model-flops-utilization-mfu">FLOPs</a>.</p><p>Assuming a well-written kernel, you can estimate whether it will be compute-bound based on the arithmetic intensity AI = W/Q, or work over memory traffic. On <a href="https://cloud.google.com/tpu/docs/v5e">TPU v5e MXU</a>, you want &#8805;<strong>240 FLOPs/byte for bfloat16</strong> (= compute / mem bandwidth).</p><p>For matmul in neural networks, this translates to a <strong>batch size of ~240</strong> (bfloat16 weights are 2 bytes each, but each multiply-accumulate is 2 FLOPs, so the factors of 2 cancel and intensity &#8776; batch size).</p><p>You can do the same roofline analysis for tensor parallelism: splitting along the reduction axis in a neural network. This would give us a critical threshold in the rough thousands for reduction axis length when this is viable (basically: each device needs to do at least X work per byte transferred).</p><p>Roofline analysis is the main way to evaluate parallelism.</p><h2><a href="https://jax-ml.github.io/scaling-book/tpus/">Part 2: How to Think About TPUs</a></h2><p>TPUs are <a href="https://modal.com/gpu-glossary/device-hardware/tensor-core">tensor cores</a> on <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory">high-bandwidth memory (HBM)</a>. They can do matrix multiplications fast with systolic arrays. Lots of FLOPs for matmuls.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QaWO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QaWO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png 424w, https://substackcdn.com/image/fetch/$s_!QaWO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png 848w, https://substackcdn.com/image/fetch/$s_!QaWO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QaWO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QaWO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png" width="1456" height="730" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QaWO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png 424w, https://substackcdn.com/image/fetch/$s_!QaWO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png 848w, https://substackcdn.com/image/fetch/$s_!QaWO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QaWO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F510aad71-5777-43fc-8b9b-bb31390430f2_1600x802.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>How does it work? There are <a href="https://fleetwood.dev/posts/domain-specific-architectures#google-tpu">some animations</a> about the pipelining and systolic array architecture on the hardware level. 
Basically, it does an 8x128 x 128x128 &#8594; 8x128 matmul every 8 cycles, and it&#8217;s very fast but needs a bit of ramp-up.</p><p>There are two kinds of memory on a TPU chip (1 chip = 2 cores, shared HBM):</p><ul><li><p>HBM is the main memory, similar to GPUs. This is ~16-95 GB, ~1 TB/s.</p></li><li><p>VMEM is smaller working memory / cache. It&#8217;s about 0.1 GB, much smaller, but its bandwidth is high enough that the compute-bound threshold drops to an intensity of only ~10-20 FLOPs/byte. Good for fast inference on small-batch workloads if you can fit weights.</p></li></ul><p>Chips are &#8220;logical megacores&#8221; each consisting of two cores. Four chips are exposed on a single TPU-VM host with PCIe (~200 Gbps NIC).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8-4B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8-4B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png 424w, https://substackcdn.com/image/fetch/$s_!8-4B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png 848w, https://substackcdn.com/image/fetch/$s_!8-4B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png 1272w, https://substackcdn.com/image/fetch/$s_!8-4B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8-4B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png" width="559" height="219.22321428571428" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:1456,&quot;resizeWidth&quot;:559,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8-4B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png 424w, https://substackcdn.com/image/fetch/$s_!8-4B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png 848w, https://substackcdn.com/image/fetch/$s_!8-4B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png 1272w, https://substackcdn.com/image/fetch/$s_!8-4B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16140cc-8d44-4287-a2aa-7cc58ad5ef7e_1484x582.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>TPUs are connected to each other in 2D/3D torus configurations with ICI, 
<em>inter-chip interconnects</em>. These are ~1.6-4.8 Tbps, and there are 4 or 6 of them. Compare to Nvidia&#8217;s <a href="https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-to-dgxb200.html#dgx-b200-component-descriptions">3.2 Tbps Infiniband</a> cluster networking.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gYC7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gYC7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png 424w, https://substackcdn.com/image/fetch/$s_!gYC7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png 848w, https://substackcdn.com/image/fetch/$s_!gYC7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png 1272w, https://substackcdn.com/image/fetch/$s_!gYC7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gYC7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png" width="594" height="295.36813186813185" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:1456,&quot;resizeWidth&quot;:594,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gYC7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png 424w, https://substackcdn.com/image/fetch/$s_!gYC7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png 848w, https://substackcdn.com/image/fetch/$s_!gYC7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png 1272w, https://substackcdn.com/image/fetch/$s_!gYC7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5c8db66-d7c7-45f3-bc91-a829371b7645_1516x754.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It&#8217;s cheaper and more scalable than <a href="https://en.wikipedia.org/wiki/Fat_tree">fat tree</a>-style networks that Nvidia uses for Infiniband, but it probably makes collective communication tricky. 
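</p><p>To reason about collectives on a ring or torus axis, the usual toy model is that a bidirectional ring AllGather sends shards both ways around the ring, so its time barely depends on the number of chips. Here is a small sketch of that model (my own illustration with made-up numbers, not figures from the book):</p><pre><code>def allgather_time(total_bytes, n_devices, link_bw):
    """Throughput-bound time (s) for a bidirectional ring AllGather."""
    # Each device starts with total_bytes / n_devices and forwards shards
    # around the ring; the two directions split the (n_devices - 1) hops.
    shard = total_bytes / n_devices
    return shard * (n_devices - 1) / (2 * link_bw)

t16 = allgather_time(1e9, 16, 100e9)  # about 4.7 ms
t64 = allgather_time(1e9, 64, 100e9)  # about 4.9 ms</code></pre><p>As the ring grows, the time approaches total_bytes / (2 * link_bw), so per-link bandwidth matters far more than the number of chips on the axis.</p><p>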
When you purchase TPUs from Google, you buy a <em>slice</em> of the topology.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KULK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KULK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png 424w, https://substackcdn.com/image/fetch/$s_!KULK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png 848w, https://substackcdn.com/image/fetch/$s_!KULK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png 1272w, https://substackcdn.com/image/fetch/$s_!KULK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KULK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png" width="592" height="299.25274725274727" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1456,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:200896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/171607012?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KULK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png 424w, https://substackcdn.com/image/fetch/$s_!KULK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png 848w, https://substackcdn.com/image/fetch/$s_!KULK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png 1272w, https://substackcdn.com/image/fetch/$s_!KULK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e150416-6956-4ce7-815d-af1c2a99dfc9_2378x1202.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The smallest slice is a single 2x2x1 host. This checks out with each host being connected to 4 TPU chips.</p><h2><a href="https://jax-ml.github.io/scaling-book/sharding/">Part 3: Sharded Matrices and How to Multiply Them</a></h2><p>We introduce tensor notation for device sharding. When you have multiple devices (relevant for TPUs especially due to topology), they live on a <em>mesh</em> with axis names. For example, a 2x2 mesh of 4 TPUs, with axes (X, Y):</p><pre><code>Mesh(devices=((0, 1), (2, 3)), axis_names=('X', 'Y'))</code></pre><p>Notation for sharding arrays is A[Ix, Jy]. 
The indices can be subscripted by mesh axes, which tells us how each mesh axis corresponds to a tensor axis.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nxjN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nxjN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png 424w, https://substackcdn.com/image/fetch/$s_!nxjN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png 848w, https://substackcdn.com/image/fetch/$s_!nxjN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png 1272w, https://substackcdn.com/image/fetch/$s_!nxjN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nxjN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png" width="457" height="474.8907967032967" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1513,&quot;width&quot;:1456,&quot;resizeWidth&quot;:457,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nxjN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png 424w, https://substackcdn.com/image/fetch/$s_!nxjN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png 848w, https://substackcdn.com/image/fetch/$s_!nxjN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png 1272w, https://substackcdn.com/image/fetch/$s_!nxjN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8caeb38a-ba03-4fe1-a2b2-aaf018433c20_4148x4309.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are a couple rules to this notation system. Internalizing the rules will help you reason about device sharding:</p><ul><li><p><strong>Not all mesh axes need to be mentioned in a sharding.</strong> For example, A[I, J] would fully replicate the array across all axes. A[Ix, J] would shard the first tensor axis against X and replicate the rest along Y.</p></li><li><p><strong>Each mesh axis can be mentioned at most once.</strong> A[Ix, Jx] is invalid, since that doesn&#8217;t actually include all the data.</p></li><li><p><strong>The order of axes matters.</strong> A[Ixy, J] shards the first tensor axis on both dimensions of the mesh. But it shards against the X mesh axis first, and then the Y mesh axis second. Contrast this with A[Iyx, J], which reverses the order.</p></li></ul><p>This notation lets us talk about tensor sharding over TPU devices in a torus. 
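</p>

<p>To make these rules concrete, here&#8217;s a small sketch (a hypothetical helper of my own, not code from the book or jax-js) that computes the per-device shard shape implied by the notation:</p>

```python
# Hypothetical helper: per-device shard shape for a named 2D mesh.
def shard_shape(global_shape, mesh, spec):
    """mesh: {"X": 2, "Y": 4}; spec: one ordered tuple of mesh axis names
    per tensor axis, e.g. [("X",), ()] means A[Ix, J]."""
    used = [a for axes in spec for a in axes]
    # Rule 2: each mesh axis may be mentioned at most once.
    assert len(used) == len(set(used)), "mesh axis used twice, e.g. A[Ix, Jx]"
    shape = list(global_shape)
    for i, axes in enumerate(spec):
        for a in axes:
            assert shape[i] % mesh[a] == 0
            shape[i] //= mesh[a]  # each mesh axis splits this tensor axis
    return tuple(shape)

mesh = {"X": 2, "Y": 4}
print(shard_shape((8, 8), mesh, [("X",), ()]))      # A[Ix, J]  -> (4, 8)
print(shard_shape((8, 8), mesh, [("X", "Y"), ()]))  # A[Ixy, J] -> (1, 8)
print(shard_shape((8, 8), mesh, [(), ()]))          # A[I, J]   -> (8, 8), replicated
```

<p>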
Each mesh axis can do full <a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html">collective operations</a> like AllReduce, AllGather, and ReduceScatter. Next, we ask the big question.</p><p><strong>Question: How long does it take to do matmul on sharded arrays?</strong></p><p>Matrix multiplication is a <em><a href="https://en.wikipedia.org/wiki/Tensor_contraction">tensor contraction</a></em> (&#8220;like numpy.<a href="https://openxla.org/xla/operation_semantics#dotgeneral">dot</a>&#8221;) op. When you do a dot product of A[I, J] * B[J, K] &#8594; C[I, K], you&#8217;re contracting along the J axis.</p><p>Generally, if your tensor is sharded along the contracting dimension, you may need to use one of the collective operations:</p><blockquote><ol><li><p><strong><a href="https://jax-ml.github.io/scaling-book/sharding/#case-1-neither-multiplicand-has-a-sharded-contracting-dimension">Case 1</a>:</strong> neither input is sharded along the contracting dimension. <em>We can multiply local shards without any communication.</em></p></li><li><p><strong><a href="https://jax-ml.github.io/scaling-book/sharding/#case-2-one-multiplicand-has-a-sharded-contracting-dimension">Case 2</a>:</strong> one input has a sharded contracting dimension. <em>We typically &#8220;AllGather&#8221; the sharded input along the contracting dimension.</em></p></li><li><p><strong><a href="https://jax-ml.github.io/scaling-book/sharding/#case-3-both-multiplicands-have-sharded-contracting-dimensions">Case 3</a>:</strong> both inputs are sharded along the contracting dimension. <em>We can multiply the local shards, then &#8220;AllReduce&#8221; the result.</em></p></li><li><p><strong><a href="https://jax-ml.github.io/scaling-book/sharding/#case-4-both-multiplicands-have-a-non-contracting-dimension-sharded-along-the-same-axis">Case 4</a>:</strong> both inputs have a non-contracting dimension sharded along the same axis.
We cannot proceed without AllGathering one of the two inputs first.</p></li></ol></blockquote><p>I think Case 3 is probably the most illustrative one, since it&#8217;s the AllReduce that you typically see when you shard computations along a contracting dimension and need to aggregate the results.</p><p>The book has derivations to work through and very nice animations. Here&#8217;s a summary of the communication primitives and their effect:</p><ul><li><p><strong>AllGather:</strong> [Ix, J] &#8594; [I, J]. Costs |I|*|J| comms/device.</p></li><li><p><strong>ReduceScatter:</strong> [I, J]{Ux} &#8594; [I, Jx]. Costs |I|*|J| comms/device.</p></li><li><p><strong>AllToAll:</strong> [I, Jx] &#8594; [Ix, J]. Costs 0.5*|I|*|J| comms/device (assuming 2D torus).</p></li><li><p><strong>AllReduce:</strong> [Ix, J]{Uy} &#8594; [Ix, J]. Costs 2*|I|*|J| comms/device.</p><ul><li><p>This is the same as ReduceScatter + AllGather.</p></li></ul></li></ul><p>Notably, the AllToAll primitive is suited for toroidal TPU topologies. The other collective operations use a standard ring algorithm.</p><h2><a href="https://jax-ml.github.io/scaling-book/transformers/">Part 4: All the Transformer Math You Need to Know</a></h2><p>We start with some tensor math. When you have a dot product, some axes are <em>contracting</em>, some are <em>batching</em>, and others are just broadcast. Cheatsheet:</p><ul><li><p><strong>Contracting:</strong> I * I &#8594; &#8709;. This is a reduction axis.</p></li><li><p><strong>Batching:</strong> I * I &#8594; I. This axis is mapped / vectorized over in both tensors.</p></li><li><p><strong>Others:</strong> I * J &#8594; IJ. The axes are broadcast like an outer product.</p></li></ul><p>The total number of FLOPs equals 2x the product of all axis sizes, taking care not to double-count axes that appear in both operands of the product.</p><p>The <em>reverse pass</em> (backprop) takes twice the number of FLOPs as the forward pass.
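</p>

<p>As a quick check on the 2x-the-product-of-axes rule, here&#8217;s a toy counter of my own (not code from the book) that takes named axis sizes and counts shared axes once:</p>

```python
# Sketch of the "2 x product of all axes" rule for dot-general FLOPs.
# Axes are labeled by name; an axis shared by both operands (contracting
# or batching) is counted once via the dict union. Sizes are illustrative.
def dot_flops(lhs_axes, rhs_axes):
    sizes = {**lhs_axes, **rhs_axes}  # union: shared axes counted once
    flops = 2
    for n in sizes.values():
        flops *= n
    return flops

# C[B, I, K] = A[B, I, J] . W[B, J, K]: 2*B*I*J*K FLOPs in the forward pass.
fwd = dot_flops({"B": 4, "I": 128, "J": 512}, {"B": 4, "J": 512, "K": 256})
assert fwd == 2 * 4 * 128 * 512 * 256
train = 3 * fwd  # forward + ~2x backward: the training rule of thumb
```

<p>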
This isn&#8217;t exactly the case for all operations (e.g., scalar ones), but since most FLOPs in a transformer are in matmuls, a good rule-of-thumb is to <strong>multiply the total FLOPs by 3</strong> (= 1 + 2) when thinking about training.</p><p>Anyway, if you go ahead and use this trick, you get all the transformer FLOPs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mYQ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mYQ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png 424w, https://substackcdn.com/image/fetch/$s_!mYQ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png 848w, https://substackcdn.com/image/fetch/$s_!mYQ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png 1272w, https://substackcdn.com/image/fetch/$s_!mYQ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mYQ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png" width="1456" height="623"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:167495,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/171607012?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mYQ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png 424w, https://substackcdn.com/image/fetch/$s_!mYQ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png 848w, https://substackcdn.com/image/fetch/$s_!mYQ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png 1272w, https://substackcdn.com/image/fetch/$s_!mYQ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96a0ee2-0631-4afb-b1a8-3e44c4215599_2322x994.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Or, in a nutshell: multiply params by 6BT (BT = #tokens), and some of the multi-head attention layers scale by 3BT&#178;/D instead.</p><p>Great! Will be useful for thinking about KV cache later, too.</p><h2><a href="https://jax-ml.github.io/scaling-book/training/">Part 5: How to Parallelize a Transformer for Training</a></h2><p>This chapter is about <em>train-time scaling</em>. Assume a big but fixed batch size (too big slows down convergence), so each chip is compute-bound rather than limited by HBM access. You want to use more chips to speed up each iteration.</p><p>My tl;dr is that you can scale along three different axes: data parallelism (or FSDP) over the batch, tensor parallelism over the model dimension, and pipeline parallelism over the layers.
<strong>Each of them stresses a different communication overhead,</strong> so if you combine all of them together, you can sustain very low batch sizes per device without being comms-bound! (i.e., run lots of devices, train fast)</p><p>That means there&#8217;s no &#8220;best&#8221; parallelism strategy. You apply all of them as needed, since they multiply together. Start with data parallelism though (it&#8217;s easy).</p><p>I&#8217;m going to ignore the TPU numbers here and speak more generally, since I only use GPUs in my job anyway. The TPU numbers don&#8217;t map cleanly onto GPUs because the bandwidths differ (listed here as in-and-out):</p><ul><li><p>TPUs have more ICI bandwidth (v5p = 4 * 0.8 Tbps / v5e = 6 * 2.4 Tbps) in bigger &#8220;pods&#8221; of up to 8960 chips, but with wide-diameter torus connectivity, and</p></li><li><p>GPUs have NVLink (7.2 Tbps, fully-connected) within nodes of 8 GPUs each and Infiniband (0.125 * 3.2 Tbps, switched tree) connections between nodes. Also, each GPU has more FLOPs than each TPU.</p></li></ul><p>This means TPU clusters can skip pipeline parallelism, which is complicated, but GPU clusters need to use it between nodes to reduce comms overhead.</p><h3>Data parallelism</h3><p>This is the simplest method.</p><ol><li><p>Split the batch across X devices and do the forward and backward passes independently.</p></li><li><p><em>(Interleaved)</em> When gradients are ready for a layer, do an all-reduce, then update optimizer state with the accumulated gradients across all devices.</p></li></ol><p>You become bottlenecked on comms when B/X &lt; C/W_ici, i.e., when the per-device batch size falls below the chip&#8217;s FLOP/s divided by its interconnect bandwidth.
(The constants cancel out for fp16.)</p><p>Within an <a href="https://www.nvidia.com/en-us/data-center/h100/">8x H100 node</a>, data parallelism requires a batch size of <strong>(1979 TFLOP/s) / (900 GB/s) ~ 2200</strong> per GPU to max out compute with sparsity, or <strong>~1100 without</strong>.</p><p>But between nodes, your bandwidth per GPU is 18x lower. So you&#8217;d probably want to run ReduceScatter first between GPUs within each node, followed by AllReduce inter-node (8x less comms) and another AllGather within each node.</p><h3>FSDP / ZeRO-3</h3><p>Ah yes, the famous fully-sharded data parallelism. It&#8217;s like DDP, but model weights &amp; optimizer states are sharded. Each device stores 1/X of the params. See the &#8220;experiences&#8221; <a href="https://www.vldb.org/pvldb/vol16/p3848-huang.pdf">FSDP paper</a> for details on this, including how to interleave compute and comms within the framework.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jaa4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jaa4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png 424w, https://substackcdn.com/image/fetch/$s_!jaa4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png 848w,
https://substackcdn.com/image/fetch/$s_!jaa4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png 1272w, https://substackcdn.com/image/fetch/$s_!jaa4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jaa4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png" width="656" height="296.46153846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:656,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Getting Started with Fully Sharded Data Parallel (FSDP2) &#8212; PyTorch  Tutorials 2.8.0+cu128 documentation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Getting Started with Fully Sharded Data Parallel (FSDP2) &#8212; PyTorch  Tutorials 2.8.0+cu128 documentation" title="Getting Started with Fully Sharded Data Parallel (FSDP2) &#8212; PyTorch  Tutorials 2.8.0+cu128 documentation" srcset="https://substackcdn.com/image/fetch/$s_!jaa4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png 424w, 
https://substackcdn.com/image/fetch/$s_!jaa4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png 848w, https://substackcdn.com/image/fetch/$s_!jaa4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png 1272w, https://substackcdn.com/image/fetch/$s_!jaa4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6cc667-cb0d-4281-ae5e-efa8e79f62c0_4372x1975.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Compared to DDP, you incur 1.5x comms cost in mixed-precision training (&#8220;full sharding&#8221; at least, they also have &#8220;hybrid sharding&#8221; which is partial DDP), since you have to AllGather weights in addition to the AllReduce of gradients.</p><p>It&#8217;s said that the FSDP backward pass is &#8220;free&#8221; though, in the sense that sharding optimizer state and weights via AllGather+ReduceScatter costs about the same comms as DDP&#8217;s AllReduce, while splitting the optimizer memory (and update FLOPs) across devices. This progression of what gets sharded (optimizer state, then gradients, then weights) is also the difference between ZeRO-1, ZeRO-2, and ZeRO-3 = FSDP.</p><p>FSDP lets you scale up model sizes that don&#8217;t fit in a single GPU&#8217;s memory.</p><h3>Tensor parallelism (Megatron)</h3><p>Let&#8217;s switch our mesh axis from X to Y. Tensor parallelism shards both the weights &amp; activations across devices. It makes each layer run faster because we don&#8217;t have to do as much work on each device, but we do need to insert AllGather / ReduceScatter ops in the forward and reverse passes.</p><p>Generally, this becomes worth it when the dimension of the MLP hidden layer exceeds C/W_ici * Y.</p><p>In other words, you can start splitting apart the model with tensor parallelism on H100 GPUs if the model dimension is over ~250 or so (since there&#8217;s 4x expansion ratio). This becomes very useful for large models, and it combines well with FSDP. Generally the tensor parallel factor is 8 or 16.</p><h3>Pipeline parallelism</h3><p>The book doesn&#8217;t really talk about pipeline parallelism since it&#8217;s mostly used on GPUs. It saves comms overhead by only requiring you to transfer activations between layers, while also partitioning the model and speeding up training.
This way, you don&#8217;t actually have to send O(weights) data, only O(activations).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SQIc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SQIc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png 424w, https://substackcdn.com/image/fetch/$s_!SQIc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png 848w, https://substackcdn.com/image/fetch/$s_!SQIc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png 1272w, https://substackcdn.com/image/fetch/$s_!SQIc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SQIc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png" width="1456" height="250" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:250,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SQIc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png 424w, https://substackcdn.com/image/fetch/$s_!SQIc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png 848w, https://substackcdn.com/image/fetch/$s_!SQIc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png 1272w, https://substackcdn.com/image/fetch/$s_!SQIc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff015207-babc-4d1f-ac40-dc18e5e72eaf_1954x336.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The hard thing here is avoiding bubbles. Honestly pipeline schedules make my head hurt. But people have figured it out, likely at great engineering cost!</p><h3>Takeaways</h3><p>With enough parallelism strategies, we can achieve near-linear scaling over many nodes while keeping batch size per device low, nearing ~100. 
Each choice requires a lot of engineering effort though, especially for interleaving operations and reliability in the face of failures.</p><p>You can combine the strategies together (DP+TP+PP) for a multiplicative speedup. Requires you to do some math though.</p><h2><a href="https://jax-ml.github.io/scaling-book/applied-training/">Part 6: Training LLaMA 3 on TPUs</a></h2><p>This was just an applied exercise of the previous section.</p><p>One interesting thing was that they introduced &#8220;sequence parallelism&#8221; here, which is similar to data parallelism but over the sequence axis. This happened when they ran out of &#8220;batch&#8221; to FSDP over. I guess this introduces a bit more comms overhead, but not too much since you&#8217;re just syncing activations ahead of attention.</p><p>I was also curious about the cost for this hypothetical training of LLaMA 3 70B with 40% MFU on TPUs. Assuming 3-year commitment prices, you get:</p><p><strong>8960 chips * $1.89/chip/hr * 1056 hours = $18 million</strong></p><p>That&#8217;s just a 70B model. Makes sense why the big labs are raising billions of dollars for their frontier models with trillions of parameters.</p><h2><a href="https://jax-ml.github.io/scaling-book/inference/">Part 7: All About Transformer Inference</a></h2><p>Inference is very different from training. You have a <strong>latency-throughput tradeoff</strong> curve, since big batches take longer but vastly improve throughput due to higher arithmetic intensity, making them less memory-bound.</p><p>(This section is also relevant in post-training, since you do rollouts for RL.)</p><h3>Basics of transformer inference</h3><p>Alright, as we know: to sample from a transformer, you run prior tokens through the model layers.
This generates logits; you sample from the output distribution according to temperature, and then repeat the process for each subsequent token.</p><p>You also need a paged <strong>KV cache</strong> though, so you don&#8217;t have to recompute intermediate activations each time. Instead, you reuse the previous ones. KV cache size is proportional to the sequence length, number of layers (L), number of KV heads, and model dimension (D). Example: <a href="https://github.com/Dao-AILab/flash-attention/blob/v2.8.3/flash_attn/flash_attn_interface.py#L1464-L1484">flash_attn_with_kvcache()</a>.</p><p>Given this KV cache, there are two phases to inference:</p><ol><li><p><strong>Prefill.</strong> Generate all the KV cache for a long prompt, and produce the first set of logits. Initializes the cache.</p></li><li><p><strong>Generation (also &#8220;decode&#8221;).</strong> From the KV cache of all previous tokens in the sequence, incrementally sample one token and generate logits. Appends +1 token to the cache.</p></li></ol><p>That said, engines like vLLM can <a href="https://docs.vllm.ai/en/latest/configuration/optimization.html#chunked-prefill_1">run both simultaneously (&#8220;chunked prefill&#8221;)</a>, and other inference systems may <a href="https://arxiv.org/abs/2401.09670">split them across separate machines (&#8220;disaggregated prefill&#8221;)</a>.</p><p>Anyway, here&#8217;s the tl;dr about the two parts of inference from a performance lens:</p><ul><li><p><strong>MLP:</strong> Arithmetic intensity. Token batch size &#8805; FLOP/s / HBM bandwidth to stay compute-bound.</p><ul><li><p>For TPU v5e, ~240.
For H100, this is ~600 (with sparsity) or ~300 (without).</p></li><li><p>Critical batch size decreases with param quantization (fewer loads), but increases if FLOPs are in lower precision since they become faster.</p></li><li><p>Trivial to get this batch size in prefill with sequence length, harder during decode when batching up many concurrent requests.</p></li></ul></li><li><p><strong>Attention:</strong> With S past tokens and T new tokens, AI ~ ST / (S+T).</p><ul><li><p>During prefill, attention over the whole sequence gets good arithmetic intensity (growing with sequence length), easy to saturate and not the bottleneck.</p></li><li><p>During decode, AI ~ 1 because you load the entire KV cache to produce each token. You&#8217;re bottlenecked on memory bandwidth, loading from KV cache, since each entry is <em>only used once</em> in attention.</p></li><li><p>So yeah, that&#8217;s sad. Once you increase your decode batch size enough, you&#8217;ll get diminishing returns &#8212; each forward pass gets slower because loading the <strong>KV cache comes to dominate loading the model weights</strong>.</p></li></ul></li></ul><p>This observation about the memory-bound nature of attention is fundamentally why we have a latency-throughput tradeoff.
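</p>

<p>The AI ~ ST/(S+T) estimate is easy to play with directly; this toy function (my own sketch, not code from the book, with constant factors dropped) shows why prefill saturates compute while decode cannot:</p>

```python
# Sketch of the arithmetic-intensity estimate AI ~ S*T/(S+T) for
# attention with S cached tokens and T new ones (constant factors and
# the head dimension dropped).
def attn_intensity(s, t):
    return (s * t) / (s + t)

# Prefill: S = T = prompt length, so AI grows linearly with the prompt.
print(attn_intensity(4096, 4096))  # 2048.0
# Decode: T = 1, so AI stays ~1 no matter how long the context gets.
print(attn_intensity(4096, 1))     # ~0.9998, memory-bound
```

<p>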
You can&#8217;t actually run transformer inference (decode) at the MLP critical batch size of ~300 without spending lots of time loading the KV cache, <em>slowing down inference (inter-token latency)</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K3sy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K3sy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png 424w, https://substackcdn.com/image/fetch/$s_!K3sy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png 848w, https://substackcdn.com/image/fetch/$s_!K3sy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png 1272w, https://substackcdn.com/image/fetch/$s_!K3sy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K3sy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png" width="624" height="487.98668885191347"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:940,&quot;width&quot;:1202,&quot;resizeWidth&quot;:624,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K3sy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png 424w, https://substackcdn.com/image/fetch/$s_!K3sy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png 848w, https://substackcdn.com/image/fetch/$s_!K3sy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png 1272w, https://substackcdn.com/image/fetch/$s_!K3sy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4929ccf-64f9-46c3-acb6-6aaf9402a02c_1202x940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you squint a bit, this is also the graph you get <a href="https://modal.com/llm-almanac/advisor?metric=itl&amp;filters=model%3DLlama+3.1+70B%2Ctokens%3D128%3B1024%2Cttft_p95%3C10&amp;aggregate=p50">when benchmarking vLLM / SGLang</a>.</p><h3>Tricks to improve latency / throughput</h3><p>At scale, we do actually want to run big batch sizes and not waste all our time on memory bandwidth to load the KV cache! It would also be nice to make the KV cache smaller on a per-token basis, since that saves memory. So we have two mutually beneficial reasons to reduce the KV cache.</p><p>Here are common tricks people use in service of this goal:</p><ul><li><p><em>Grouped Query Attention (GQA)</em> reduces the number of KV heads and shares each with multiple Q heads. You can pick a point on a sliding scale from 1 KV head per Q head (standard multi-head attention) down to just 1 KV head in total. 
Seems like not all of the KV heads are needed for performance!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nfeL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nfeL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png 424w, https://substackcdn.com/image/fetch/$s_!nfeL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png 848w, https://substackcdn.com/image/fetch/$s_!nfeL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png 1272w, https://substackcdn.com/image/fetch/$s_!nfeL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nfeL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png" width="551" height="218.12241653418124" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:1258,&quot;resizeWidth&quot;:551,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nfeL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png 424w, https://substackcdn.com/image/fetch/$s_!nfeL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png 848w, https://substackcdn.com/image/fetch/$s_!nfeL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png 1272w, https://substackcdn.com/image/fetch/$s_!nfeL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88723164-c71d-493b-8ab7-59a189cb75b3_1258x498.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><em>Mixing local attention layers</em> is done by some models. For example, <a href="https://arxiv.org/pdf/2503.19786">Gemma 3</a> uses 5 local layers between each global attention layer. <a href="https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the">GPT-OSS</a> alternates between local and global attention in a 1:1 ratio. 
The idea is that you can focus on local details most of the time, and this reduces the KV cache.</p></li><li><p><em>Sharing KVs across layers:</em> You can go even further than GQA and share KVs across layers, not just across query heads. This reduces KV cache size, but it doesn&#8217;t reduce memory bandwidth, since the KVs still need to be read in each layer.</p></li><li><p><em>Quantization</em> saves on memory bandwidth and size for both params and KV cache. If you keep activations at the same precision, it helps reach the roofline.</p></li><li><p><em>Paged attention</em> uses ragged reads into sections (&#8220;pages&#8221;) of HBM that are allocated as needed, based on the sequence length. It adds a bunch of complexity around memory allocation, preemption, and interleaving, but it&#8217;s used by almost every inference engine to save memory. See <a href="https://github.com/GeeeekExplorer/nano-vllm/blob/main/nanovllm/engine/scheduler.py">nano-vllm&#8217;s scheduler</a>.</p></li></ul><h3>Distributing inference</h3><p>If you&#8217;re scaling to multiple accelerators, you now have the opportunity to explore various parallelism strategies. The default is to just replicate the model, with all of its weights in multiple instances, which is simple and doesn&#8217;t need any comms / syncing.</p><p>But you might want to speed up the model or fit large models that are too big for a single chip&#8217;s HBM. Then you have some choices.</p><ul><li><p><strong>Prefill:</strong> This is almost identical to training because of the sequence length dimension. You can shard prefill with model parallelism (~4-8 shards, as determined by ICI bandwidth) and then use sequence parallelism.</p><ul><li><p>Here, sequence parallelism doesn&#8217;t incur much overhead because you just AllGather activations. 
Note that this is different from <em><a href="https://docs.vllm.ai/en/v0.4.2/models/performance.html">chunked prefill</a></em>, which batches prefill+decode on a single device.</p></li></ul></li><li><p><strong>Generation:</strong> FSDP / weight gathering is bad here because you&#8217;re bottlenecked on memory bandwidth. So your option is model parallelism (or maybe pipeline / expert parallelism at scale?). You can also shard the KV cache while doing this, which reduces memory &#8220;bandwidth cost&#8221; if done along the right axis.</p></li></ul><p>Anyway, at this point the book goes back to discussing basic principles of inference engine scheduling. Most of these ideas apply to single-node serving as well. You can see them better in engines like <a href="https://docs.vllm.ai/en/v0.7.3/index.html">vLLM</a> and <a href="https://docs.sglang.ai/">SGLang</a>, so I&#8217;ll just summarize.</p><ul><li><p>Typically you <em>interleave prefill and generation</em>, so you run prefill requests with priority and at a smaller-than-max batch size to reduce time-to-first-token (TTFT). With a smaller batch size, you also avoid blocking ongoing generations for too long.</p></li><li><p>The natural next step at sufficient scale is to <em>disaggregate prefill and generation</em>, since disaggregated prefill allows machines to specialize on that particular workload and not block the generation step for other queries. This requires transmitting KV cache over the network.</p></li><li><p><em>Continuous batching</em> is an obvious optimization for generation steps: you run each decode step while concurrently listening for incoming requests to add to the batch, until the batch is full. Don&#8217;t wait for a full batch to be ready before starting. 
This also means inter-token latency naturally degrades as your load (~continuous batch size) increases, making it a nice global signal of system load.</p></li><li><p>Obviously, you might serve inference requests with the same prefix later on, especially in chat applications, so <em>prefix caching</em> and <em>sticky routing</em> are essential, probably using some kind of <a href="https://en.wikipedia.org/wiki/Consistent_hashing">consistent hashing</a> + <a href="https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_Recently_Used_(LRU)">LRU</a> scheme.</p></li></ul><p>The book links their <a href="https://github.com/AI-Hypercomputer/JetStream">JetStream library</a> as an implementation example for inference at scale on TPUs. Some exercises analyze &#8220;expert sharding&#8221; in MoE models.</p><p><em>Some industry commentary:</em> This section focuses on TPUs, but the common industry standard by far is Nvidia GPUs. These come 8 per node, and 8x B200 GPUs are enough to serve all but the very largest open models like <a href="https://kimik2.com/">Kimi K2</a> (and those can use expert parallelism). Mostly you can get away with model parallelism on a single node, running 8x Nvidia H100/A100 (perhaps A10 if small) with NVLink. This reduces the <em>engineering complexity</em> a lot (a big forcing function for whether something will actually get built), and indeed most companies&#8212;including specialized inference ones like Baseten/Fireworks&#8212;only have nascent multi-node offerings, if at all. 
Edit: feedback I&#8217;ve gotten is that frontier labs all use expert parallelism and multi-node inference, which makes sense given their much larger models.</p><p>The appendices talk about other methods and considerations, specifically for low-latency inference (inter-token latency).</p><ul><li><p>As device count increases, you may implement <a href="https://jax-ml.github.io/scaling-book/inference/#appendix-b-2d-weight-stationary-sharding">2D weight sharding</a> for MLP weights, along both the hidden and input axes. This becomes useful when sharding along the hidden axis makes the per-device dimension smaller than the input dimension, so you balance the two to reduce comms cost.</p></li><li><p>It&#8217;s mentioned that during inference, you can actually become <a href="https://jax-ml.github.io/scaling-book/inference/#appendix-c-latency-bound-communications">latency-bound</a> in AllGather due to the small amount of data, such that the cost of sending stuff around the ring comes just from hops, not bandwidth. This seems like a TPU-specific problem arising from toroidal topologies.</p></li><li><p>The book briefly discusses <a href="https://jax-ml.github.io/scaling-book/inference/#appendix-d-speculative-sampling">speculative decoding</a>, which uses a cheaper draft model to &#8220;guess&#8221; the next several tokens and verifies them post-hoc with rejection sampling or MCMC. 
It trades off some throughput for more tokens/sec.</p></li></ul><h2><a href="https://jax-ml.github.io/scaling-book/applied-inference/">Part 8: Serving LLaMA 3-70B on TPUs</a></h2><p>The first thing I notice is this comparison of devices / cost per hour on GCP.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E19y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E19y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png 424w, https://substackcdn.com/image/fetch/$s_!E19y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png 848w, https://substackcdn.com/image/fetch/$s_!E19y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png 1272w, https://substackcdn.com/image/fetch/$s_!E19y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E19y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png" width="1456" height="282" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:282,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96953,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/171607012?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E19y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png 424w, https://substackcdn.com/image/fetch/$s_!E19y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png 848w, https://substackcdn.com/image/fetch/$s_!E19y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png 1272w, https://substackcdn.com/image/fetch/$s_!E19y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15d5c12d-ce70-4caf-ad33-df43bd6df126_2614x506.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As a principle, FLOPs / $ is right. But the price for H100 GPUs is off by quite a lot. 
For instance, Modal offers <em>serverless</em> H100s (boot in &lt;2s, a premium offering) for $3.95/hr. A quick search shows you can get H100s for much cheaper than even that if you&#8217;re willing to run your own servers. <em>Anyway, just take these prices with a grain of salt; the authors and GCP have a business incentive to make TPUs look good.</em></p><p>Some quick takeaways from this chapter:</p><ul><li><p><strong>KV cache takes up a lot of space!</strong> Each token is about 1/440k the total size of the model in memory, assuming int8 quantization for both. If you have 32k context windows, this limits your batch size a lot. The quick-and-dirty explanation of the &#8220;440k&#8221; number is that it&#8217;s roughly the ratio of the (MLP-dominated) param count to the per-token KV cache size.</p></li><li><p><strong>Consider doing int8 quantization but bf16 FLOPs.</strong> You don&#8217;t pay for the extra precision in FLOPs because batch size isn&#8217;t high enough to get to the point where matmul is compute-bound. Low-precision arithmetic may affect model quality, though.</p></li><li><p><strong>Below a point, you pay a lot for lower latency.</strong> <a href="https://jax-ml.github.io/scaling-book/applied-inference/#visualizing-the-latency-throughput-tradeoff">These graphs</a> are quite dramatic and show that if you have a very small batch size, you get to run super fast, but throughput sucks.</p></li></ul><h2><a href="https://jax-ml.github.io/scaling-book/profiling/">Part 9: How to Profile TPU Programs</a></h2><p>We&#8217;ve finally started writing JAX. I really like JAX, so I&#8217;m familiar with the framework + its functional (even <em>functorial</em>?) style. But using the JAX profiler, a tool for understanding TPU traces, is something new for me.</p><p>Review of the compiler pipeline: Jaxpr &#8594; StableHLO &#8594; HLO &#8594; TPU LLO. 
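</p><p>You can peek at the first stages of this pipeline from Python. A quick sketch, using the standard <code>jax.make_jaxpr</code> and jit <code>lower()</code>/<code>compile()</code> APIs (on CPU, the last stage prints optimized HLO rather than TPU LLO):</p>

```python
import jax
import jax.numpy as jnp

def f(x):
    return (x @ x).sum()

x = jnp.ones((8, 8))

# Stage 1: trace the function into a jaxpr, JAX's internal IR.
print(jax.make_jaxpr(f)(x))

# Stage 2: lower the jitted function to StableHLO text.
print(jax.jit(f).lower(x).as_text())

# Stage 3: compile to the backend's optimized HLO.
print(jax.jit(f).lower(x).compile().as_text())
```

<p>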
Or you can write custom kernels in <a href="https://docs.jax.dev/en/latest/pallas/index.html">Pallas</a>.</p><p>The key thing to remember is that you can wrap code in <code>jax.profiler.trace()</code> contexts (as well as named scopes / calls) to generate linear traces, profiles, and XLA graph views that open in TensorBoard.</p><pre><code>import jax

with jax.profiler.trace("/tmp/tensorboard"):
    key = jax.random.key(0)
    x = jax.random.normal(key, (1024, 1024))
    y = x @ x
    y.block_until_ready()</code></pre><p>The visualization suite reminds me of <a href="https://github.com/google/pprof">pprof</a>, which I guess makes sense as another Google product. Graphs, views of how various parts of the code get lowered, and a focus on the trace timeline first and foremost.</p><p>Profiling is never super fun, but general rules apply here. It becomes easier as you get fluent in the domain (e.g., the author just happens to know &#8220;AllReduce + dynamic slice = ReduceScatter&#8221;), and it&#8217;s also definitely much more approachable if you come in with an idea of what you <em>think</em> the profile will look like, then validate it.</p><h2><a href="https://jax-ml.github.io/scaling-book/jax-stuff/">Part 10: Programming TPUs in JAX</a></h2><p>There are three modes: fully automatic, explicit sharding (via the type system), and manual sharding with shard_map().</p><p>It&#8217;s pretty cool to see these automatic sharding modes in JAX. All of these methods are pretty cutting-edge. I first read about auto-sharding in 2022 via <a href="https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin">Alpa (OSDI &#8216;22)</a>, which also included inter-operator (pipeline) parallelism in its scope. Probably too much scope, but hey, it was a research prototype.</p><p><a href="https://jax-ml.github.io/scaling-book/jax-stuff/#auto-sharding-mode">Auto sharding mode:</a></p><ol><li><p>You create a device mesh with <code>jax.make_mesh()</code>, with axis shapes and names. Each array gets a <code>jax.NamedSharding</code> set as its device on construction, which lets you specify how to shard the array across devices.</p></li><li><p>After that, <code>jax.jit()</code> allows you to specify in- and out-shardings, and all intermediates are then automatically inferred (via heuristics) by <a href="https://openxla.org/shardy">Shardy</a> (XLA).</p></li><li><p>You can then profile it. 
If you see an issue, you can give the compiler a hint with <code>jax.lax.with_sharding_constraint()</code> to change the behavior.</p></li></ol><p><a href="https://jax-ml.github.io/scaling-book/jax-stuff/#explicit-sharding-mode">Explicit sharding mode:</a></p><ol><li><p>You create a mesh as before but pass in the &#8220;Explicit&#8221; axis type.</p></li><li><p>Now, every array you create has sharding in its metadata / type. When you run an operation, JAX determines whether the output sharding can be inferred cleanly or not. If it can&#8217;t, you have to resolve the ambiguity by providing an <code>out_sharding</code> kwarg.</p></li></ol><p><a href="https://jax-ml.github.io/scaling-book/jax-stuff/#manual-sharding-mode-via-shard-map">Manual sharding mode with shard_map:</a> (see also <a href="https://docs.jax.dev/en/latest/notebooks/shard_map.html">tutorial</a>)</p><ol><li><p>You write a program that runs on one device, in <a href="https://en.wikipedia.org/wiki/Single_program%2C_multiple_data">SPMD</a> style, like torch.</p></li><li><p>Decorate it with <code>jax.shard_map()</code>, and it will run on all devices in parallel, with each device receiving a particular shard.</p></li><li><p>Insert <a href="https://docs.jax.dev/en/latest/jax.lax.html#parallel-operators">collective operations</a> as needed, like <code>jax.lax.ppermute()</code>, <code>jax.lax.pmean()</code>, <code>jax.lax.all_gather()</code>, etc.</p></li></ol><h2><a href="https://jax-ml.github.io/scaling-book/gpus/">Part 12: How to Think About GPUs</a></h2><p>(Slightly out of order; leaving Part 11 until the end since it&#8217;s the conclusion.)</p><p>We have one last &#8220;addon&#8221; chapter, which compares TPUs and GPUs. Their description of GPUs is short and witty.</p><blockquote><p>A modern ML GPU (e.g. 
H100, B200) is basically a bunch of compute cores that specialize in matrix multiplication (called <strong>Streaming Multiprocessors</strong> or <strong>SMs</strong>) connected to a stick of fast memory (called <strong>HBM</strong>).</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aPSy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aPSy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png 424w, https://substackcdn.com/image/fetch/$s_!aPSy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png 848w, https://substackcdn.com/image/fetch/$s_!aPSy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png 1272w, https://substackcdn.com/image/fetch/$s_!aPSy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aPSy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png" width="664" height="308.74175824175825" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:677,&quot;width&quot;:1456,&quot;resizeWidth&quot;:664,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!aPSy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png 424w, https://substackcdn.com/image/fetch/$s_!aPSy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png 848w, https://substackcdn.com/image/fetch/$s_!aPSy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png 1272w, https://substackcdn.com/image/fetch/$s_!aPSy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2ba96a9-16c6-49b3-970a-cb49d376d2f1_1910x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I&#8217;m pretty familiar with GPUs, having done some CUDA programming, so there&#8217;s a bunch of concepts here like warp = 32 threads, warp scheduler, divergence, shared memory, warpgroups (SM) and so on. GPUs are more general-purpose than TPUs, but they also have a huge chunk cut out for tensor cores that do matmul.</p><p>As I mentioned before, networking is very different in GPUs versus TPUs. Among nodes in a <em>Scalable Unit (SU)</em>, GPUs get full bisection bandwidth, and every node is accessible to every other node by Infiniband switches in <a href="https://en.wikipedia.org/wiki/Fat_tree">fat tree</a> topology. 
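</p><p>A classic bit of arithmetic on this (the three-tier k-ary fat-tree formula from the networking literature, not something from the book): a fat tree built from k-port switches supports k&#179;/4 hosts at full bisection bandwidth, which is why bigger clusters need higher-radix switches or more tiers.</p>

```python
# Hosts supported at full bisection bandwidth by a three-tier fat tree
# built from k-port switches: k pods with (k/2)^2 hosts each.
def fat_tree_hosts(k: int) -> int:
    return k**3 // 4

for k in (16, 32, 64):
    print(k, fat_tree_hosts(k))  # 16 -> 1024, 32 -> 8192, 64 -> 65536
```

<p>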
You use RDMA to communicate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5ySO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5ySO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png 424w, https://substackcdn.com/image/fetch/$s_!5ySO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png 848w, https://substackcdn.com/image/fetch/$s_!5ySO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png 1272w, https://substackcdn.com/image/fetch/$s_!5ySO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5ySO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png" width="640" height="284.83516483516485" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:648,&quot;width&quot;:1456,&quot;resizeWidth&quot;:640,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5ySO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png 424w, https://substackcdn.com/image/fetch/$s_!5ySO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png 848w, https://substackcdn.com/image/fetch/$s_!5ySO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png 1272w, https://substackcdn.com/image/fetch/$s_!5ySO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05831512-176d-4ddf-8e8a-01e36606bac0_1936x862.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Each node itself has a few NVLink switches that offer very fast, 1-hop networking between its 8 GPUs, in an all-to-all fashion. 
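</p><p>As a back-of-envelope sketch of why intra-node bandwidth matters (my numbers, not the book&#8217;s: I&#8217;m assuming ~450 GB/s of unidirectional NVLink bandwidth per H100 and a standard ring AllReduce, e.g. for data-parallel gradient syncing):</p>

```python
# Rough time to AllReduce bf16 gradients of an 8B-param model across the
# 8 NVLink-connected GPUs of one node. The bandwidth figure is an assumed
# ballpark (~450 GB/s unidirectional per H100), not a measured one.
n_gpus = 8
grad_bytes = 8e9 * 2  # 8B params in bf16 = 16 GB

# Ring AllReduce: each GPU sends/receives 2*(N-1)/N of the buffer.
bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * grad_bytes
nvlink_bw = 450e9  # bytes/s per GPU, assumed
t_ms = bytes_on_wire / nvlink_bw * 1e3
print(f"{t_ms:.0f} ms per AllReduce")  # ~62 ms
```

<p>Over a 400 Gb/s (50 GB/s) InfiniBand NIC instead, the same transfer would take roughly 9x longer, which is why you keep the chattiest parallelism axes inside the node.</p><p>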
This is a pretty nice diagram of how the connectivity changed over time!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_4wB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_4wB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png 424w, https://substackcdn.com/image/fetch/$s_!_4wB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png 848w, https://substackcdn.com/image/fetch/$s_!_4wB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!_4wB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_4wB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png" width="598" height="326.10714285714283" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:598,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_4wB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png 424w, https://substackcdn.com/image/fetch/$s_!_4wB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png 848w, https://substackcdn.com/image/fetch/$s_!_4wB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!_4wB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2bcf20-d4dd-45ea-8b05-dd908f412916_1999x1090.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Beyond the level of a scalable unit, customers and cloud providers do various things to connect GPUs. It sounds like the most popular option is just Infiniband though. You need more high-bandwidth switches to support the fat tree as the number of GPUs and nodes grows. This may add to switching latency, but it also means that you get full connectivity, unlike TPUs.</p><p>Of course, everything changes with the &#8220;GB200 NVL72 SuperPod&#8221; system (what a name&#8230;). Instead of 8 devices on NVLink, you have 72. Great. 
Not going to think about that one for a while, haha.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cyAi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cyAi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png 424w, https://substackcdn.com/image/fetch/$s_!cyAi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png 848w, https://substackcdn.com/image/fetch/$s_!cyAi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png 1272w, https://substackcdn.com/image/fetch/$s_!cyAi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cyAi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png" width="618" height="456.4654300168634" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1186,&quot;resizeWidth&quot;:618,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cyAi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png 424w, https://substackcdn.com/image/fetch/$s_!cyAi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png 848w, https://substackcdn.com/image/fetch/$s_!cyAi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png 1272w, https://substackcdn.com/image/fetch/$s_!cyAi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57a39373-2a9f-46bc-97b3-6aec191f95ee_1186x876.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The full connectivity changes up some collectives. AllReduce, ReduceScatter, and AllGather mostly stay the same, though latency can be reduced with an optimized implementation. AllToAll gets significantly faster because you can send the sharded data directly from one device to others.</p><p>One fun gadget is <a href="https://developer.nvidia.com/blog/advancing-performance-with-nvidia-sharp-in-network-computing/">SHARP</a>, which lets you do reductions within the network switch itself. This theoretically reduces time by half, but the authors here only saw 30% improvement in practice. 
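The theoretical halving comes from each GPU only pushing its data through the network once; here&#8217;s the arithmetic as a sketch (the sizes and bandwidths are my own filler numbers, not the book&#8217;s):</p>

```python
# Back-of-envelope: why in-network reduction can at most halve AllReduce.
# Illustrative numbers only.
bytes_per_gpu = 4e9  # e.g. 1B fp32 gradient values (assumed)
link_bw = 450e9      # per-GPU interconnect bandwidth in B/s (assumed)

# A classic ring AllReduce moves roughly 2x the data per GPU
# (a reduce-scatter pass plus an all-gather pass).
t_ring = 2 * bytes_per_gpu / link_bw

# With the switch reducing in-network, each GPU sends its data up
# and receives the reduced result down just once.
t_in_network = bytes_per_gpu / link_bw

print(t_ring / t_in_network)  # 2.0 in theory; the authors measured ~1.3x
```

<p>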
May slightly change up calculations!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-yoD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-yoD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png 424w, https://substackcdn.com/image/fetch/$s_!-yoD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png 848w, https://substackcdn.com/image/fetch/$s_!-yoD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png 1272w, https://substackcdn.com/image/fetch/$s_!-yoD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-yoD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png" width="601" height="324.02815934065933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1456,&quot;resizeWidth&quot;:601,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-yoD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png 424w, https://substackcdn.com/image/fetch/$s_!-yoD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png 848w, https://substackcdn.com/image/fetch/$s_!-yoD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png 1272w, https://substackcdn.com/image/fetch/$s_!-yoD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79e406b-8a18-4f23-b2be-0da4cae7a6b7_1650x890.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The second half of the chapter follows with a bunch of lore on GPU training and inference, basically going through the same roofline calculations as before. It&#8217;s slightly different because you need to consider the inter-node bandwidth as well.</p><p>Remember that <em>pipeline parallelism does not play well with FSDP</em> due to the weight sharding getting screwed up by pipelines.</p><p>During training, you need a local batch size of about 2500 tokens per GPU. Besides that, you can combine model parallelism / expert parallelism (8-64 GPUs), then pipeline parallelism, and finally ZeRO-1 data parallelism.</p><h2><a href="https://jax-ml.github.io/scaling-book/conclusion/">Part 11: Conclusions and Further Reading</a></h2><p>This was a great read. I&#8217;m happy that so many people came together to write this; the mental model and visual acuity are on point. 
Also a nice way to advertise TPUs (haha), at least in spreading awareness of their programming model.</p><p>It&#8217;s a &#8220;textbook&#8221; but definitely one of the more well-written textbooks I&#8217;ve seen, and it presents a nice calculus of sorts for reasoning about these systems. Kind of like a math book, where you introduce new complexity, help the reader get used to it, simplify and then move on to more difficult challenges.</p><p>I went through my notes again with a friend and realized that they&#8217;re quite sparse if you aren&#8217;t already familiar with parallelism. Sorry! Consider this a band-pass filter over the book&#8217;s content, made for Eric. :)</p>]]></content:encoded></item><item><title><![CDATA[How the jax.jit() JIT compiler works in jax-js]]></title><description><![CDATA[A lightweight compiler for a new numerical computing / ML library.]]></description><link>https://ss.ekzhang.com/p/how-the-jaxjit-jit-compiler-works</link><guid isPermaLink="false">https://ss.ekzhang.com/p/how-the-jaxjit-jit-compiler-works</guid><dc:creator><![CDATA[Eric]]></dc:creator><pubDate>Wed, 14 May 2025 14:02:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nQCt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Since the start of this year, I&#8217;ve been working on a version of <strong>JAX in pure JavaScript</strong>.</p><p>For this, I need to make a deep learning compiler from scratch, and I want to keep it lightweight (e.g., JAX uses XLA as its compiler, which is 200 KLoC &#8212; too much bundle size for the web!). This is a note about the trickiest fundamental problem I&#8217;ve run into, and how I&#8217;m going about solving it.</p><h2>What is jax-js?</h2><p><a href="https://docs.jax.dev/en/latest/">JAX</a> is a great library. 
It takes the numerical computing properties of NumPy, shoves in GPU + Autograd, then packages it all up in a convenient API.</p><pre><code>import jax.numpy as jnp
from jax import grad

a = jnp.array([1, 2, 3])
a * 10  # [10, 20, 30]
grad(lambda x: (x * x).sum())(a)  # [2, 4, 6]</code></pre><p>By writing JAX in pure JS, using web APIs, we solve two problems:</p><ol><li><p><strong>How do you do <a href="https://github.com/numpy/numpy">numerical compute</a> in the browser?</strong> Like taking the mean of some numbers, or applying an image filter. Lots of applications (statistics, data science, classical ML, CV, etc.), but right now it&#8217;s pretty hard to do well.</p></li><li><p><strong>How do you run <a href="https://github.com/amandaghassaei/gpu-io">GPU compute</a> in the browser?</strong> There are technologies like WebGPU if you want to write your own shaders, which is great if you&#8217;re making a video game. But this is tricky if you just want to do something simple. After all, a lot more people use PyTorch/JAX than write CUDA kernels!</p></li></ol><p>No other library, ported to JS directly, would solve both problems at the same time. JAX hits the sweet spot since it&#8217;s useful for ML, and it also matches <a href="https://data-apis.org/array-api/latest/">NumPy&#8217;s API</a>.</p><pre><code>import { grad, numpy as np } from "@jax-js/jax";

const a = np.array([1, 2, 3]);  // note: type is np.Array
a.mul(10);  // [10, 20, 30]
grad((x) =&gt; x.mul(x).sum())(a);  // [2, 4, 6]</code></pre><p>If you just want numerical computing features, import <code>numpy as np</code>. If you need everything else, you can pull it in as needed.</p><h2>Optimistic dispatch</h2><p>So how do you implement this? If your operations are individual CPU calls and you&#8217;re following NumPy, you would dispatch them one-by-one to a kernel. Maybe that&#8217;s a Wasm kernel for instance, and you could implement core operations like:</p><pre><code>function neg(a: Array) {  // a =&gt; -a
  const output = arrayLike(a);
  wasmBackend.dispatch("NEG:1", [a.buffer], [output.buffer]);
  return output;
}
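
// (Illustrative sketch, not from the original post) other elementwise ops
// would follow the same pattern, one kernel dispatch per call -- e.g. a
// hypothetical add, with an assumed "ADD:2" kernel name:
function add(a: Array, b: Array) {  // a, b =&gt; a + b
  [a, b] = broadcast(a, b);
  const output = arrayLike(a);
  wasmBackend.dispatch("ADD:2", [a.buffer, b.buffer], [output.buffer]);
  return output;
}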

function mul(a: Array, b: Array) {  // a, b =&gt; a * b
  [a, b] = broadcast(a, b);
  const output = arrayLike(a);
  wasmBackend.dispatch("MUL:2", [a.buffer, b.buffer], [output.buffer]);
  return output;
}</code></pre><p>And then you&#8217;d have optimized Wasm kernels for each of these core operations. This is what <a href="https://www.npmjs.com/package/@tensorflow/tfjs-backend-wasm">tfjs-backend-wasm</a> does, for instance.</p><p>But for deep learning workloads, you often want to fuse operations together. For example, let&#8217;s say you want to compute <strong>norm(x * 3 + 2)</strong> for a vector <strong>x</strong>. Doing this naively might take <strong>4 data round-trips</strong> to the GPU or other device:</p><ol><li><p>Compute <strong>x * 3</strong>, store the result in <strong>a</strong>.</p></li><li><p>Compute <strong>a + 2</strong>, store the result in <strong>b</strong>.</p></li><li><p>Compute <strong>b * b</strong>, store the result in <strong>c</strong>.</p></li><li><p>Compute <strong>sum(c)</strong>, store the result in <strong>d</strong>.</p></li><li><p>Return <strong>sqrt(d)</strong>.</p></li></ol><p>For experimenting on small data, a few round trips won&#8217;t hurt anyone. But this can get painfully slow for more complex math, especially when you add in <a href="https://docs.jax.dev/en/latest/autodidax.html">JAX-style autograd</a> via transformations, which can increase the number of generated operations a lot.</p><p>So we&#8217;d like a way to make the operations more efficient, especially for repeated operations. This way, your browser simulation doesn&#8217;t skip frames, and your LLM produces more output tokens.</p><h2>Understanding the machine learning JIT</h2><p>The inspiration for this JIT, or <a href="https://en.wikipedia.org/wiki/Just-in-time_compilation">just-in-time compiler</a>, comes from XLA, which is JAX&#8217;s backend, originating from the <a href="https://www.tensorflow.org/">TensorFlow</a> project. XLA represents computations as directed acyclic graphs (DAGs) of core primitives. 
Some examples:</p><ul><li><p>Exponential computes e<sup>x</sup>.</p></li><li><p>Multiply multiplies two numbers.</p></li><li><p>Broadcast expands the axes of its input by repeating it.</p></li><li><p>Reduce(Subcomputation:add) takes the sum of a tensor along some axes.</p></li></ul><p>Then XLA transforms the graph on the left into the graph on the right through a series of optimization passes. In this case, several operations are turned into <em>fused expressions</em>, which reduces the number of round-trips and makes the overall computation <strong>~50x faster</strong> on a T4 GPU.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nQCt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nQCt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png 424w, https://substackcdn.com/image/fetch/$s_!nQCt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png 848w, https://substackcdn.com/image/fetch/$s_!nQCt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png 1272w, https://substackcdn.com/image/fetch/$s_!nQCt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nQCt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png" width="1040" height="1488" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1488,&quot;width&quot;:1040,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nQCt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png 424w, https://substackcdn.com/image/fetch/$s_!nQCt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png 848w, https://substackcdn.com/image/fetch/$s_!nQCt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png 1272w, https://substackcdn.com/image/fetch/$s_!nQCt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcbb046e-d61b-4f17-9040-282a6230ec28_1040x1488.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Well you might say, XLA is a really high-caliber compiler, at the state of the art for ML compilation. jax-js doesn&#8217;t <em>need to</em> achieve top-level performance like this, since it&#8217;s running in the browser. People&#8217;s hardware / platforms are different, and it can tolerate some slack. But not <em><strong>50x</strong></em>(!!); I think getting within 3-5x of optimal would be reasonable&#8212;so we need the JIT compiler.</p><p><em>(Aside: A lot of the performance difference in this case is because jax.jit() saves the graph and avoids dynamic tracing on each run, which is also relevant for us. 
Ignoring the dynamic tracing, I would guess the compiler alone accounts for only ~10x, maybe.)</em></p><h2>How do you build an ML JIT?</h2><p>So you need a compiler, and with compilers, you need an <em>intermediate representation (IR)</em> that lets you represent the computation internally. The plan is to take an input, pass it through the frontend to create an IR, then optimize that IR and produce an output.</p><p>To make this work, I&#8217;m basing my IR on <a href="https://github.com/tinygrad/tinygrad">tinygrad</a>, which is a very small deep learning library. The key difference between tinygrad and XLA is that tinygrad has a lot fewer primitive operations. For example, to represent a 2048x2048 matmul, XLA&#8217;s HLO would be:</p><pre><code>HloModule jit_matmul, entry_computation_layout={(f32[2048,2048]{1,0}, f32[2048,2048]{1,0})-&gt;f32[2048,2048]{1,0}}

ENTRY main.4 {
  Arg_0.1 = f32[2048,2048]{1,0} parameter(0)
  Arg_1.2 = f32[2048,2048]{1,0} parameter(1)
  <strong>ROOT dot.3 = f32[2048,2048]{1,0} dot(Arg_0.1, Arg_1.2), lhs_contracting_dims={1}, rhs_contracting_dims={0}</strong>
}</code></pre><p>The last line uses the primitive <strong>dot</strong> operation, which is literally just a matmul.</p><p>In contrast, tinygrad produces something more like:</p><pre><code>a1 = a.reshape([2048, 1, 2048])
b1 = b.transpose().reshape([1, 2048, 2048])
return (a1 * b1).sum(axis=2)</code></pre><p>The first two lines are &#8220;movement operations&#8221; that just produce views of the data, and crucially, tracking the view can all be done within a single kernel without actually making copies. They call this <em>laziness</em> &#8212; but honestly I think the core thing that makes it work is not the laziness, but rather their algebra of tracking views.</p><p>So I&#8217;m taking this view-tracking system for jax-js, and it&#8217;s been working great. jax-js has an IR defined by the <strong>AluExp</strong> class, which is a (very) simplified version of tinygrad&#8217;s <strong>UOp</strong> and looks like:</p><pre><code>/** Mathematical expression on scalar values. */
export class AluExp {
  constructor(
    readonly op: AluOp,
    readonly dtype: DType,
    readonly src: AluExp[],
    readonly arg: any = undefined,
  ) {}
  // ...
}</code></pre><p>An expression is fused and then is placed into a <em>kernel</em>, where each kernel contains at most one reduction.</p><pre><code>/**
 * Description of a kernel to be compiled.
 *
 * Each of these can be processed by a backend into some lower-level
 * representation. It consists of one or more fused operations, optionally
 * indexing into a buffer.
 */
export class Kernel {
  constructor(
    /** Number of global arguments / arrays. */
    readonly nargs: number,
    /** Size of the result array in element count. */
    readonly size: number,
    /** Expression to be evaluated. */
    readonly exp: AluExp,
    /** Optional reduction to be performed. */
    readonly reduction?: Reduction,
  ) {
    this.exp = exp.simplify();
  }
  // ...
}</code></pre><p>This gives us everything we need to implement compiler optimizations and lower IR expressions into optimized WebGPU or WebAssembly code.</p><h2>jax.jit() &#8211; joining the frontend with the IR</h2><p>Now that we have the IR done, let&#8217;s return to the actual library frontend. Recall we&#8217;ve been generating graphs of operations through JAX, which can have combinators like <strong>grad()</strong> and <strong>jvp()</strong> for <a href="https://docs.jax.dev/en/latest/automatic-differentiation.html">automatic differentiation</a>. So you could write an operation like <strong>log(2*x)</strong>, and it would produce the computation graph for <strong>2/(2*x)</strong> after applying the chain rule.</p><p>These graphs are almost what we need &#8212; but we need to decide when to dispatch them to the backend via <strong>Kernel</strong> objects, knowing that:</p><ol><li><p>Each kernel fuses a common subexpression and then runs it on the GPU.</p></li><li><p>A kernel can have at most 1 reduction (for technical reasons; reductions are the starting point for optimizations).</p></li></ol><p>A motivating example is the matmul operation, which we can try porting over:</p><pre><code>function matmul(a: Array, b: Array) {
  // for clarity, assume a, b are of shape (n, n)
  const c = a.reshape([n, 1, n]).mul(b.transpose().reshape([1, n, n]));
  return c.sum({ axis: 2 });
}</code></pre><p>There&#8217;s a tradeoff with this approach. tinygrad doesn&#8217;t actually <em>do anything</em> until you call the <strong>realize()</strong> function, which kicks off work. So it&#8217;s fine that you&#8217;re multiplying these matrices and producing c, which is of size n<sup>3</sup>, since c never actually gets realized.</p><p>jax-js tries to be a general-purpose library, though, and that kind of laziness might be a bit confusing to people used to NumPy&#8217;s eager evaluation.</p><p>Luckily, we can borrow another primitive from JAX, which is the jit() function. This traces an expression, produces a &#8220;Jaxpr&#8221; or DAG of operations, and then passes it down to the ML compiler.</p><pre><code><strong>const matmul = jit(</strong>function matmul(a: Array, b: Array) {
  // for clarity, assume a, b are of shape (n, n)
  const c = a.reshape([n, 1, n]).mul(b.transpose().reshape([1, n, n]));
  return c.sum({ axis: 2 });
}<strong>);</strong></code></pre><p>This <em>opts into</em> kernel fusion and optimization. Now, whenever the function is called with inputs of a certain shape, we get the full DAG and can run a graph algorithm to break it down into common subexpressions, each lowered into a <strong>Kernel</strong> object containing a fused <strong>AluExp</strong>.</p><p>With this, I think I&#8217;m able to offer a really fast, optimized matrix multiplication, while doing minimal work on the compiler side and keeping in line with the &#8220;spirit&#8221; of JAX: composable function transformations. There&#8217;s no need for me to write new primitives for every ML operation, like pad, fused batch normalization, and so on.</p><h2>Conclusion</h2><p>I started this project at the beginning of the year, so it&#8217;s been about 3-4 months now. Back then, I never thought that I would actually be implementing an ML compiler, but here we are. What made it more manageable was a combination of:</p><ol><li><p><strong>Relying on JAX&#8217;s in-built JIT tracing.</strong> So composite operations like matmul(), but also anything from norm() to einsum(), can be implemented in terms of smaller parts. It gives us a clean DAG, after autograd and any combinators, to hand off to the compiler backend.</p></li><li><p><strong>Borrowing tinygrad&#8217;s &#8220;view&#8221; system.</strong> This drastically simplifies the IR (see <a href="https://openxla.org/xla/operation_semantics">XLA&#8217;s IR</a> for instance) and the amount of work needed to build a working library.</p></li></ol><p>So that&#8217;s how jax-js is going. We&#8217;ll soon have jax.jit() support, and then some demos.</p><h3>What comes next?</h3><p>On the performance front, jax-js is already looking pretty good. It produces better matmul benchmarks than TensorFlow.js. I think landing jit() will be okay for now.</p><p>There are some unresolved questions related to memory:</p><ul><li><p>How do you free memory? 
JS doesn&#8217;t have a destructor hook like Python&#8217;s reference-counted <code>__del__()</code> method. Maybe linear types are the answer.</p></li><li><p>For the WebAssembly backend, how do you allocate buffers in Wasm linear memory? Generally you want to avoid fragmentation, so maybe there&#8217;s a simple way to do memory allocation here, like relying on a <a href="https://en.wikipedia.org/wiki/Buddy_memory_allocation">buddy allocator</a> to track free chunks of pages.</p></li></ul><p>But I&#8217;m excited about what&#8217;s coming up, since it&#8217;s almost fully usable as a numerical computing library. Some stuff I want to put in the browser soon: <a href="https://www.webgpuaudio.com/">audio visualizers</a>, <a href="https://www.ekzhang.com/webgl-julia-viewer/">fractals</a>, <a href="https://apps.amandaghassaei.com/gpu-io/examples/fluid/">fluid simulation</a>, and <a href="https://github.com/facebookresearch/encodec">neural audio coding</a>. After that, I&#8217;ll open-source the library for others to try out.</p><p>If you want to keep up to date, feel free to follow me at <a href="https://twitter.com/ekzhang1">@ekzhang1</a>.</p><p>Hope you learned something about compilers. &#128062;</p><p></p><div><hr></div><p><strong>EDIT (May 24):</strong> Since writing this post, I&#8217;ve implemented jax.jit() in the library and now have an auto-fusing, composable, GPU kernel-tuned ML compiler in JavaScript. I think that&#8217;s the first of its kind! As a simple example, here&#8217;s an implementation of matmul() in terms of jit(). 
:D</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kP1q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kP1q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png 424w, https://substackcdn.com/image/fetch/$s_!kP1q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png 848w, https://substackcdn.com/image/fetch/$s_!kP1q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png 1272w, https://substackcdn.com/image/fetch/$s_!kP1q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kP1q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png" width="1128" height="836" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1128,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:188830,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ekzhang.substack.com/i/163548742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kP1q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png 424w, https://substackcdn.com/image/fetch/$s_!kP1q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png 848w, https://substackcdn.com/image/fetch/$s_!kP1q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png 1272w, https://substackcdn.com/image/fetch/$s_!kP1q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33cf466-4e18-4d11-9602-97ac786200fb_1128x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[jax-js devlog feb 17]]></title><description><![CDATA[staging interpreters and a swirl of info]]></description><link>https://ss.ekzhang.com/p/jax-js-devlog-feb-17</link><guid isPermaLink="false">https://ss.ekzhang.com/p/jax-js-devlog-feb-17</guid><dc:creator><![CDATA[Eric]]></dc:creator><pubDate>Tue, 18 Feb 2025 01:20:12 GMT</pubDate><content:encoded><![CDATA[<p>So a lot has happened since last week. Today is a holiday, and besides the nice lunch meetup with an old acquaintance, my focus today is just to make progress on this side project.</p><h3>personal reflections</h3><p>it&#8217;s cool that so far this year, I&#8217;ve been making the most (non-work) GitHub activity of any year in the past, at least in terms of rate. this is pretty good! 
I wonder if I&#8217;ll be able to keep that up.</p><p>but that&#8217;s a good reminder that, no, I&#8217;m not getting lazier or less inspired with software work, even though it sometimes seems that way. I&#8217;m just pushing myself to do bigger things!</p><blockquote><p><em>it matters</em> -A</p></blockquote><p>when you&#8217;re small, your elementary school looks so big, and then you come back as an adult and marvel at how small everything was. this is kind of like that</p><h3>relevant to jax-js plans</h3><p>um so we had Chenyu from tinygrad come over to NYSRG, and after reading the codebase and <a href="https://github.com/ekzhang/nysrg-notes/blob/main/2025/02-gpu-kernel-programming.md">taking notes on related cuda things</a>:</p><ul><li><p>i understand the picture of compiling operations into kernels a lot better now</p><ul><li><p>the rewrite rules / lazy pattern matchers are just a less PL jargon-infused way (or should I say, less PL-aware) of talking about staging, like what JAX does with HLO/XLA</p></li><li><p>my takeaway from the tinygrad paper is that you can get a <em>long</em> way with pretty simple kernels and just a couple of hand-rolled heuristics</p></li><li><p>in retrospect this should be pretty obvious. like, automatic heuristics should certainly at least be better than a static library of a couple compiled kernels. it&#8217;s smaller and more flexible with low development resources</p></li><li><p>and gpus can&#8217;t be <em>that</em> complicated. there are memory hierarchies, but even complex problems tend to have fairly parsimonious solutions</p></li></ul></li><li><p>this means that I am pretty confident (overconfident??) 
in being able to get rid of the dependency on tfjs-core at some point in the future</p></li><li><p>which is huge, since then I&#8217;m not limited to a couple dtypes and can also optimize any operations of my choice, and extend the project arbitrarily to support even more operations or algorithms to achieve numpy API-compatibility</p><ul><li><p>you want a QR decomposition from numpy.linalg.qr()? sure, have it</p></li></ul></li></ul><pre><code>$ TC=0 DEBUG=4 python3 test.py

# ... stuff

 0: (64, 32, 8, 16, 1, 4, 4, 1) float.ptr(4194304)   (65536, 64, 8192, 4, 0, 1, 2048, 0)
 1: (64, 32, 8, 16, 512, 4, 4, 4) float.ptr(4194304)   (0, 64, 0, 4, 8192, 1, 0, 2048)
 2: (64, 32, 8, 16, 512, 4, 4, 4) float.ptr(4194304)   (65536, 0, 8192, 0, 4, 0, 2048, 1)
[Opt(op=OptOps.UPCAST, axis=1, arg=4), Opt(op=OptOps.UPCAST, axis=0, arg=4), Opt(op=OptOps.UNROLL, axis=0, arg=4), Opt(op=OptOps.LOCAL, axis=0, arg=8), Opt(op=OptOps.LOCAL, axis=1, arg=16)]
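# ^ the chosen schedule, read off the Opt list: two UPCAST(4)s give each thread a
# 4x4 register tile, UNROLL(4) widens the reduction step, and LOCAL 8x16 sets the
# threadgroup size (these map onto lidx0 /* 8 */ and lidx1 /* 16 */ below)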
#include &lt;metal_stdlib&gt;
using namespace metal;
kernel void r_64_32_8_16_512_4_4_4(device float* data0, device float* data1, device float* data2, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  int gidx0 = gid.x; /* 32 */
  int gidx1 = gid.y; /* 64 */
  int lidx0 = lid.x; /* 8 */
  int lidx1 = lid.y; /* 16 */
  int alu0 = (gidx0&lt;&lt;6);
  int alu1 = (gidx1&lt;&lt;16);
  int alu2 = (lidx0&lt;&lt;13);
  int alu3 = (lidx1&lt;&lt;2);
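  // precomputed element offsets: alu0 = gidx0*64, alu1 = gidx1*65536,
  // alu2 = lidx0*8192, alu3 = lidx1*4, matching the strides printed above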
  float acc0 = 0.0f;
  float acc1 = 0.0f;
  float acc2 = 0.0f;
  float acc3 = 0.0f;
  float acc4 = 0.0f;
  float acc5 = 0.0f;
  float acc6 = 0.0f;
  float acc7 = 0.0f;
  float acc8 = 0.0f;
  float acc9 = 0.0f;
  float acc10 = 0.0f;
  float acc11 = 0.0f;
  float acc12 = 0.0f;
  float acc13 = 0.0f;
  float acc14 = 0.0f;
  float acc15 = 0.0f;
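  // the 16 accumulators above hold a 4x4 output tile per thread, the
  // product of the two UPCAST(4) opts; keeping them in registers avoids
  // repeated trips to device memory inside the reduction loop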
  for (int ridx0 = 0; ridx0 &lt; 512; ridx0++) {
    int alu4 = (alu1+alu2+(ridx0&lt;&lt;2));
    float4 val0 = *((device float4*)((data1+alu4)));
    float4 val1 = *((device float4*)((data1+(alu4+2048))));
    float4 val2 = *((device float4*)((data1+(alu4+4096))));
    float4 val3 = *((device float4*)((data1+(alu4+6144))));
    int alu5 = (alu0+alu3+(ridx0&lt;&lt;13));
    float4 val4 = *((device float4*)((data2+alu5)));
    float4 val5 = *((device float4*)((data2+(alu5+2048))));
    float4 val6 = *((device float4*)((data2+(alu5+4096))));
    float4 val7 = *((device float4*)((data2+(alu5+6144))));
    acc0 = (acc0+(val0.x*val4.x)+(val0.y*val5.x)+(val0.z*val6.x)+(val0.w*val7.x));
    acc1 = (acc1+(val1.x*val4.x)+(val1.y*val5.x)+(val1.z*val6.x)+(val1.w*val7.x));
    acc2 = (acc2+(val2.x*val4.x)+(val2.y*val5.x)+(val2.z*val6.x)+(val2.w*val7.x));
    acc3 = (acc3+(val3.x*val4.x)+(val3.y*val5.x)+(val3.z*val6.x)+(val3.w*val7.x));
    acc4 = (acc4+(val0.x*val4.y)+(val0.y*val5.y)+(val0.z*val6.y)+(val0.w*val7.y));
    acc5 = (acc5+(val1.x*val4.y)+(val1.y*val5.y)+(val1.z*val6.y)+(val1.w*val7.y));
    acc6 = (acc6+(val2.x*val4.y)+(val2.y*val5.y)+(val2.z*val6.y)+(val2.w*val7.y));
    acc7 = (acc7+(val3.x*val4.y)+(val3.y*val5.y)+(val3.z*val6.y)+(val3.w*val7.y));
    acc8 = (acc8+(val0.x*val4.z)+(val0.y*val5.z)+(val0.z*val6.z)+(val0.w*val7.z));
    acc9 = (acc9+(val1.x*val4.z)+(val1.y*val5.z)+(val1.z*val6.z)+(val1.w*val7.z));
    acc10 = (acc10+(val2.x*val4.z)+(val2.y*val5.z)+(val2.z*val6.z)+(val2.w*val7.z));
    acc11 = (acc11+(val3.x*val4.z)+(val3.y*val5.z)+(val3.z*val6.z)+(val3.w*val7.z));
    acc12 = (acc12+(val0.x*val4.w)+(val0.y*val5.w)+(val0.z*val6.w)+(val0.w*val7.w));
    acc13 = (acc13+(val1.x*val4.w)+(val1.y*val5.w)+(val1.z*val6.w)+(val1.w*val7.w));
    acc14 = (acc14+(val2.x*val4.w)+(val2.y*val5.w)+(val2.z*val6.w)+(val2.w*val7.w));
    acc15 = (acc15+(val3.x*val4.w)+(val3.y*val5.w)+(val3.z*val6.w)+(val3.w*val7.w));
  }
  int alu23 = (alu0+alu1+alu2+alu3);
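  // write the 4x4 register tile back as four vectorized float4 stores,
  // one per output row (rows are 2048 floats apart)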
  *((device float4*)((data0+alu23))) = float4(acc0,acc4,acc8,acc12);
  *((device float4*)((data0+(alu23+2048)))) = float4(acc1,acc5,acc9,acc13);
  *((device float4*)((data0+(alu23+4096)))) = float4(acc2,acc6,acc10,acc14);
  *((device float4*)((data0+(alu23+6144)))) = float4(acc3,acc7,acc11,acc15);
}
*** METAL      9 r_64_32_8_16_512_4_4_4                    arg  3 mem  0.05 GB tm     19.86ms/    22.58ms (   864.86 GFLOPS    2.5|865.7   GB/s) ['__matmul__']</code></pre><p>but right now the milestones still look like:</p><p>- [ ] It works!<br>- [ ] Demos: Navier-Stokes, neural networks, statistics<br>- [ ] We figure out the `dispose()` / linear types stuff<br>- [ ] Device switching with `.to()` between webgl/webgpu/cpu/wasm<br>- [ ] First custom kernel<br>- [ ] numpy/jax API compatibility table<br>- [ ] Convert Jaxprs into a tree data structure<br><strong>- [ ] Pattern matchers for kernel fusion<br>- [ ] Kernel codegen, or synthesis</strong></p><p>in particular I think the pattern matchers, scheduling, and codegen components (equivalent of ExecItem in tinygrad) will probably end up fitting into the equivalent of the `xla_call` operation in JAX. so we&#8217;ll have two separate parts of the codebase, one for compilation and one for non-jitted code.</p><p>this sounds kind of weird at first, but I think it&#8217;s the right choice given the design tradeoffs we&#8217;re making. we want it to be fast, but we don&#8217;t need to squeeze out every drop of performance &#8212; after all, we don&#8217;t even know what hardware we&#8217;re running on since it&#8217;s a javascript in-browser library.</p><p>the other advantage of jitting this is that we can auto-manage memory (er, we have to predict static memory patterns anyway, so we get this for free) and that&#8217;s important given that javascript has no reliable GC dispose hook (destructor)</p><p>anyway this seems pretty solid</p><h3>development</h3><p>tests continue to pass and reveal their utility over time. also vitest&#8217;s <a href="https://vitest.dev/guide/snapshot.html">inline snapshot testing</a> is quite fast &amp; awesome.</p><p>anyway, it&#8217;s 8 PM right now, here&#8217;s what we got from today</p><pre><code>git --no-pager diff --stat "@{1 day ago}"

 README.md            |   3 +
 src/core.ts          | 501 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 src/index.ts         |  13 ++-
 src/pprint.test.ts   |  68 +++++++++++++
 src/pprint.ts        |  57 +++++++++++
 src/utils.ts         |   2 +-
 test/tracing.test.ts |  48 +++++++++
 7 files changed, 688 insertions(+), 4 deletions(-)</code></pre><p>basically just finished implementing jaxpr logic and tracing. I understand how xla_call works as well now, the jaxpr is placed into the parameters and it composes in some interesting ways. mental exercises:</p><ul><li><p>what happens when you jit() a jit()</p></li><li><p>what happens when you jvp() a jit()</p></li><li><p>what happens when you jit() a jvp()</p></li><li><p>what happens when you makeJaxpr() a jit()</p></li></ul><p>I also understand (I think?) this gem of a quote laden with PL terminology, lmao</p><blockquote><p>There are two options for how to handle higher-order primitives. Each requires a different approach to tracing and engenders different tradeoffs:</p><ol><li><p><strong>On-the-fly processing, where </strong><code>bind</code><strong> takes a Python callable as an argument.</strong> We defer forming a jaxpr until as late as possible, namely until we&#8217;re running the final interpreter at the bottom of the interpreter stack. That way we can swap a <code>JaxprTrace</code> in at the bottom of the interpreter stack and thus stage out rather than execute all primitive operations. With this approach, transformations in the stack get applied as we execute the Python callable as usual. This approach can be very tricky to implement, but it&#8217;s as general as possible because it allows higher-order primitives not to raise the abstraction level of their arguments and thus allows data-dependent Python control flow. 
We refer to this approach as using a &#8220;final-style higher-order primitive&#8221; employing the discharge-at-tracing-time &#8220;final-style transformations&#8221; we&#8217;ve used so far.</p></li><li><p><strong>Staged processing, where </strong><code>bind</code><strong> takes a jaxpr as an argument.</strong> Before we call <code>bind</code>, in the primitive wrapper we can just use <code>make_jaxpr</code> to form a jaxpr up-front and be done with the Python callable entirely. In this case, <code>make_jaxpr</code> puts its <code>JaxprTrace</code> at the top of the interpreter stack, and no transformations lower in the stack, which might enter via closed-over Tracers, are applied to the Python callable as we trace it. (Transformations applied within the Python callable are applied as usual, being added to the stack above the JaxprTrace.) Instead, the transformations lower in the stack are later applied to the call primitive, and the call primitive&#8217;s rules must then transform the jaxpr itself. Because we trace to a jaxpr up-front, this approach can&#8217;t support data-dependent Python control flow, but it is more straightforward to implement. We refer to this kind of higher-order primitive as an &#8220;initial-style higher-order primitive&#8221;, and say that its jaxpr-processing transformation rules are &#8220;initial-style transformation rules.&#8221;</p></li></ol><p>The latter approach fits for <code>jit</code> because we don&#8217;t need to support data-dependent Python control flow in the user-provided Python callable, as the whole purpose of <code>jit</code> is to stage computation out of Python to be executed by XLA. 
(In contrast, <code>custom_jvp</code> is a higher-order primitive in which we want to support data-dependent Python control flow.)</p><p>Historically, we started using the &#8220;initial-style&#8221; and &#8220;final-style&#8221; terminology after reading the <a href="http://okmij.org/ftp/tagless-final/index.html">typed tagless final interpreters</a> paper, and jokingly referring to JAX as an implementation of &#8220;untyped tagful final interpreters.&#8221; We don&#8217;t claim to carry over (or understand) any deep meaning behind these terms; we loosely use &#8220;initial style&#8221; to mean &#8220;build an AST and then transform it&#8221;, and we use &#8220;final style&#8221; to mean &#8220;transform as we trace.&#8221; But it&#8217;s just imprecise yet sticky jargon.</p></blockquote><p>next up is linearize / vjp, which I&#8217;m excited about. finally getting a glimpse into <a href="http://conal.net/papers/essence-of-ad/">conal elliott&#8217;s mind</a></p><p>anyway, we&#8217;re getting there a bit at a time!</p><h3>concluding</h3><p>i think side projects are hard, but I&#8217;m reminded that like a lot of things in life, you just need to make a routine. discipline is hard, but routines are easy</p><p>if you write 200 lines of code each day for a month, you&#8217;ll have written 6000 lines of code in that month</p><p>that&#8217;s pretty substantial. like <a href="https://sshx.io/">sshx.io</a>-sized! the difference is that sshx.io took nearly 2 years, lol &#8212; but to be fair, you&#8217;re oftentimes debugging or removing code too. 
in any case, routines make everything easier, whether it&#8217;s organizing meetups like nysrg or running or learning to play an instrument, and let&#8217;s try to find one.</p><p>it can be pretty hard to work on something so difficult by yourself, but at the same time, it&#8217;s true that I find it really cool, and i enjoy this kind of creative work :)</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;a9df7d9a-70ff-4155-b0d0-ae7ac7d58dad&quot;,&quot;duration&quot;:null}"></div><p></p>]]></content:encoded></item><item><title><![CDATA[jax-js progress]]></title><description><![CDATA[feb 8 &#8212; at val]]></description><link>https://ss.ekzhang.com/p/jax-js-progress</link><guid isPermaLink="false">https://ss.ekzhang.com/p/jax-js-progress</guid><dc:creator><![CDATA[Eric]]></dc:creator><pubDate>Sat, 08 Feb 2025 16:49:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nDbS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nDbS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nDbS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic 424w, 
https://substackcdn.com/image/fetch/$s_!nDbS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!nDbS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic 1272w, https://substackcdn.com/image/fetch/$s_!nDbS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nDbS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic" width="1456" height="1092" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1903103,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nDbS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic 424w, 
https://substackcdn.com/image/fetch/$s_!nDbS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!nDbS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic 1272w, https://substackcdn.com/image/fetch/$s_!nDbS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0fa168b-428a-40e9-a831-d87bad4d7a6e_4032x3024.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>currently doing this at the val town office, hosted by justin-b. it&#8217;s a chilly day</p><p>on the backlog: tabletop.js, and Jute. lots of stuff, but let me make progress on this for the morning! sometimes it feels like my personal work is a stack: recency bias toward what I work on. but that&#8217;s fine, exploration is hard work. let&#8217;s have fun</p><p></p><h2>progress</h2><p>I just implemented jacfwd(). got distracted by type signatures a bit, but I think things are going roughly smoothly. some folks next to me are collaborating on an API.</p><p>one important thing to remember is that there are some type signatures that aren&#8217;t quite exactly right, but what matters is that they <em>work</em></p><pre><code>// Convert a subtype of JsTree&lt;A&gt; into a JsTree&lt;B&gt;, with the same structure.
type MapJsTree&lt;T, A, B&gt; = T extends A
  ? B
  : T extends globalThis.Array&lt;infer U&gt;
    ? MapJsTree&lt;U, A, B&gt;[]
    : { [K in keyof T]: MapJsTree&lt;T[K], A, B&gt; };
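// For instance (checking the branches above by hand): an object of arrays falls
// through to the mapped-type branch, and a plain array hits the Array branch:
//   MapJsTree&lt;{ w: Array; b: Array }, Array, number&gt;  ~&gt;  { w: number; b: number }
//   MapJsTree&lt;Array[], Array, number&gt;                 ~&gt;  number[]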

// Assert that a function's arguments are a subtype of the given type.
type WithArgsSubtype&lt;F extends (args: any[]) =&gt; any, T&gt; =
  Parameters&lt;F&gt; extends T ? F : never;
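// i.e. WithArgsSubtype&lt;F, JsTree&lt;ArrayLike&gt;&gt; resolves to F itself when F's
// parameter list is a valid JsTree&lt;ArrayLike&gt;, and to never otherwise.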

/** Compute the forward-mode Jacobian-vector product for a function. */
export const jvp = core.jvp as &lt;F extends (...args: any[]) =&gt; JsTree&lt;Array&gt;&gt;(
  f: WithArgsSubtype&lt;F, JsTree&lt;ArrayLike&gt;&gt;,
  primals: MapJsTree&lt;Parameters&lt;F&gt;, Array, ArrayLike&gt;,
  tangents: MapJsTree&lt;Parameters&lt;F&gt;, Array, ArrayLike&gt;
) =&gt; [ReturnType&lt;F&gt;, ReturnType&lt;F&gt;];
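// e.g. for a one-argument f, per the signature above:
//   const [y, dy] = jvp(f, [x], [dx]);  // primal output and its tangent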

/** Vectorize an operation on a batched axis for one or more inputs. */
export const vmap = core.vmap as &lt;F extends (...args: any[]) =&gt; JsTree&lt;Array&gt;&gt;(
  f: WithArgsSubtype&lt;F, JsTree&lt;ArrayLike&gt;&gt;,
  inAxes: MapJsTree&lt;Parameters&lt;F&gt;, Array, number&gt;
) =&gt; F;
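// e.g. vmap(f, [0]) maps a one-argument f over the leading axis of its input,
// since inAxes mirrors the parameter tree with an axis number per Array.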

/** Compute the Jacobian evaluated column-by-column by forward-mode AD. */
export const jacfwd = core.jacfwd as &lt;F extends (x: Array) =&gt; Array&gt;(
  f: F,
  x: Array
) =&gt; F;</code></pre><p>i&#8217;m feeling a bit annoyed by vs code freezing every so often, as well as by my lack of tests. let me restart my computer for perf, and then I can add some utilities to assert that arrays are close to each other, which will be needed for tests.</p><p>I&#8217;ll skip the utility for printing arrays for now &#8212; we&#8217;ll get there later!</p><p></p><p>&#8212;my computer has restarted.</p><h2>today&#8217;s idea</h2><p>you cannot tell other people how to work. you can only show them, make it easy to change, give suggestions. but it&#8217;s ultimately their choice, and learning how to work together with others takes time. relationships are damaged when we aren&#8217;t patient and generous with each other.</p><p>sometimes, you just need to sit down with the other party, and hear them out. there should always be a place for honest listening</p><p></p><h2>&#8230;and back to tests</h2><p>i got tests to work, and then my tests promptly discovered several bugs (lol), and in the process of debugging futilely, I learned the core place where you need to intercept stuff at bind(), and then the necessity of having a toString() LOL.</p><p>but yeah it turns out a one-line typo in fullLower() caused all this mess. now you know &#8212; experiences!</p><p></p><h2>and now it&#8217;s noon</h2><p>so they&#8217;re doing demos. 
a nice group of people</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P99e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P99e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic 424w, https://substackcdn.com/image/fetch/$s_!P99e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!P99e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic 1272w, https://substackcdn.com/image/fetch/$s_!P99e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P99e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic" width="1456" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2051084,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P99e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic 424w, https://substackcdn.com/image/fetch/$s_!P99e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!P99e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic 1272w, https://substackcdn.com/image/fetch/$s_!P99e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13da6893-2291-4ad3-b7d4-1c7b9ba64639_4032x3024.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>when you&#8217;re working on a team that moves fast, you often find yourself in situations where you don&#8217;t understand everything. a primitive capability. context, understanding, some representation. design docs take energy to write</p><p>expressing your ideas is a lot of the work; perhaps it&#8217;s the productive part</p><p><a href="https://protomaps.com/">protomaps</a>! a nice file format for serving full-world maps via byte ranges. seems like a cool system. <a href="https://maplibre.org/">maplibre</a> is the client rendering library, able to 
replace mapbox gl :)</p><p></p>]]></content:encoded></item><item><title><![CDATA[jax-js note]]></title><description><![CDATA[feb 7]]></description><link>https://ss.ekzhang.com/p/jax-js-note</link><guid isPermaLink="false">https://ss.ekzhang.com/p/jax-js-note</guid><dc:creator><![CDATA[Eric]]></dc:creator><pubDate>Fri, 07 Feb 2025 16:43:38 GMT</pubDate><content:encoded><![CDATA[<ul><li><p>Current goals</p><ul><li><p>"Tutorial mindset" &#8212; imagine making a workshop where I'm teaching people about array programming or functional programming in practice.</p></li><li><p>Try to just implement float32, grad() and arithmetic.</p></li><li><p><a href="https://jax.readthedocs.io/en/latest/autodidax.html">https://jax.readthedocs.io/en/latest/autodidax.html</a></p></li><li><p>Build out a quick code editor / REPL in the browser and import the library using Vite. Then run code to experiment a bit.</p><ul><li><p>After that, play it by ear. PyTrees -&gt; JSON, &#8230;</p></li></ul></li></ul></li><li><p>Description (future)</p><ul><li><p>NumPy and JAX for the browser, running on CPU or GPU.</p></li><li><p>Machine learning and numerical computing in JavaScript with the JAX/NumPy API. Define arrays, then run arbitrary differentiable code on CPU, WASM, WebGL, or WebGPU backends.</p></li><li><p>Examples: fluid simulation, neural networks, computer vision, robotics, statistics.</p></li></ul></li></ul><pre><code>import { grad, numpy as np } from "jax-js";

const y = grad(x =&gt; x.mul(2).sum())(np.array([1, 2, 3]));  // grad requires a scalar output, hence .sum()
console.log(y.js());</code></pre><ul><li><p>Memory management</p><ul><li><p>Refcount / ownership contract: all arguments to functions must be used or disposed.</p></li><li><p>ref() and dispose()</p></li></ul></li></ul><p><strong>who is your target?</strong> scientists, artists, anyone who uses numerical computing. maybe eventually porting ML models (but that&#8217;s better suited for ONNX, probably)</p><p><strong>why?</strong></p><ul><li><p>story: you can&#8217;t figure out how to use pip. oh look, here&#8217;s a webgpu version that&#8217;s fast and &#8220;just works&#8221; &#8212; shaders can do a lot</p></li><li><p>story: XLA is hard and very complicated. what if you set up a minimal compiler toolchain that does 80%+ of the operation fusion directly in the browser?</p></li><li><p>story: you are an artist and want to write some numerical simulations, but you don&#8217;t want to invent a whole library like https://github.com/amandaghassaei/gpu-io yourself</p></li></ul>]]></content:encoded></item></channel></rss>