A complete GPT in 553 lines of TypeScript — no dependencies, no frameworks, no magic
Advanced · February 14, 2026 · 40 min read

Build a working GPT from scratch in pure TypeScript: autograd, multi-head attention, Adam optimizer, training, and inference — all in a single file with zero dependencies.

Patrick the AI Engineer

TypeScript · Machine Learning · Transformers · From Scratch · Zero Dependencies

GPT-3, the model ChatGPT grew out of, has 175 billion parameters, runs across thousands of GPUs, and required months of training. The file you're about to read has 4,192 parameters, runs on a single CPU core, and trains in about one second. They use the exact same algorithm.

That's the claim, anyway. And it sounds absurd until you actually read the code. This is a complete GPT — training and inference — in 553 lines of TypeScript. No PyTorch. No TensorFlow. No npm dependencies at all. Just node:fs to read and write files. The opening comment in the source says it best:

This file is the complete algorithm. Everything else is just efficiency.

It's a port of Andrej Karpathy's microGPT Python implementation into dependency-free TypeScript. Karpathy, former director of AI at Tesla and a founding member of OpenAI, has a gift for stripping things down to their essence. This port takes that essence, drops it into a language most web developers already know, and applies targeted performance optimizations that reduce the computation graph by over 90% — without changing the algorithm.

By the end of this article, you'll understand every major building block of a GPT — not from reading a textbook, but from reading working code you can run right now with npx tsx scripts/gpt.ts.

What's a GPT, stripped naked?

Before we look at the code, let's get one thing straight: a GPT is a next-token predictor. That's it. You give it a sequence of tokens (characters, in our case), and it predicts the most likely next one. Do that in a loop and you get generation.
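
Here's that loop as a sketch. predictNext is a hypothetical stand-in for the model's forward pass plus sampling; the real versions appear later in this article:

function generate(predictNext: (context: number[]) => number, start: number[], maxLen: number): number[] {
  const tokens = [...start]
  while (tokens.length < maxLen) {
    tokens.push(predictNext(tokens)) // predict one token, append it, repeat
  }
  return tokens
}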

The magic isn't in some exotic secret sauce. It's in three things working together:

  1. An autograd engine that can compute gradients automatically — the backbone of learning
  2. A transformer architecture that decides which parts of the input to pay attention to
  3. An optimizer that nudges the model's parameters in the right direction, step by step

Our file implements all three from scratch. Let's walk through each one.

The Value class: calculus you can actually read

The entire learning mechanism of the GPT rests on a single class called Value. It's an autograd engine — a system that tracks how numbers flow through calculations and then automatically computes gradients (derivatives) so the model knows how to improve.

const EMPTY_VALUES: Value[] = []
const EMPTY_GRADS: number[] = []

class Value {
  data: number
  grad: number
  private _children: Value[]
  private _localGrads: number[]
  private _vis: number
  private static _genCounter = 0

  constructor(data: number, children: Value[] = EMPTY_VALUES, localGrads: number[] = EMPTY_GRADS) {
    this.data = data
    this.grad = 0
    this._children = children
    this._localGrads = localGrads
    this._vis = -1
  }

Every Value stores its numeric value (data), its gradient (grad), the Values that produced it (_children), and the local derivatives for each child (_localGrads). When you multiply two Values together, the result remembers its parents and knows how a change in each parent would affect it.

The shared EMPTY_VALUES and EMPTY_GRADS singletons avoid allocating fresh empty arrays for every leaf node — parameters and scalar wrappers that have no children. With thousands of parameters, that's thousands of avoided allocations.

Here's multiplication:

mul(other: Value | number): Value {
  if (typeof other === 'number') return new Value(this.data * other, [this], [other])
  return new Value(this.data * other.data, [this, other], [other.data, this.data])
}

The type check on the first line is a performance optimization that matters more than you'd think. When multiplying by a plain number (a scalar), the naive approach would wrap it in a new Value(number), creating a phantom node in the computation graph that serves no purpose during backpropagation. The scalar path creates one node with one child. The Value path creates one node with two children. No wrappers, no phantoms.

The local gradients come directly from calculus: the derivative of a * b with respect to a is b, and with respect to b is a. But you don't need to think about derivatives to use this — every operation just records the right rule.
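
A tiny example makes the bookkeeping concrete (backward(), which kicks off the gradient computation, is shown a little further down):

const a = new Value(2)
const b = new Value(3)
const c = a.mul(b).add(a) // c = a*b + a = 8
c.backward()
console.log(a.grad)       // dc/da = b + 1 = 4
console.log(b.grad)       // dc/db = a     = 2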

The same scalar-awareness applies to all arithmetic. Subtraction and division are implemented as native operations rather than decomposing into simpler parts:

sub(other: Value | number): Value {
  if (typeof other === 'number') return new Value(this.data - other, [this], [1])
  return new Value(this.data - other.data, [this, other], [1, -1])
}

div(other: Value | number): Value {
  if (typeof other === 'number') {
    const inv = 1 / other
    return new Value(this.data * inv, [this], [inv])
  }
  const invB = 1 / other.data
  return new Value(this.data * invB, [this, other], [invB, -this.data * invB * invB])
}

A naive sub would decompose into add(neg()) — three nodes. A naive div would decompose into mul(pow(-1)) — two nodes plus a wrapper. The native versions each create exactly one node. Across thousands of softmax and normalization calls, this compounds dramatically.
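
A quick sanity check of the fused division, using the Value class above:

const num = new Value(6)
const den = new Value(2)
const q = num.div(den)    // 3, a single node with two children
q.backward()
console.log(num.grad)     // d(a/b)/da = 1/b   =  0.5
console.log(den.grad)     // d(a/b)/db = -a/b² = -1.5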

The backward() method is where it all pays off:

backward(): void {
  const gen = ++Value._genCounter
  const topo: Value[] = []
  const stack: Value[] = [this]
  const stackIdx: number[] = [0]
  this._vis = gen

  while (stack.length > 0) {
    const depth = stack.length - 1
    const v = stack[depth]
    const idx = stackIdx[depth]

    if (idx < v._children.length) {
      stackIdx[depth]++
      const child = v._children[idx]
      if (child._vis !== gen) {
        child._vis = gen
        stack.push(child)
        stackIdx.push(0)
      }
    } else {
      topo.push(v)
      stack.pop()
      stackIdx.pop()
    }
  }

  this.grad = 1
  for (let i = topo.length - 1; i >= 0; i--) {
    const v = topo[i]
    const g = v.grad
    const children = v._children
    const localGrads = v._localGrads
    for (let j = 0; j < children.length; j++) {
      children[j].grad += localGrads[j] * g
    }
  }
}

It first sorts every Value reachable from the output into topological order (each node appears after the nodes it was computed from), then walks that order in reverse, distributing gradients from output to input using the chain rule. This is backpropagation — the single most important algorithm in deep learning.

Two design decisions here are worth noting. First, the topological sort is iterative rather than recursive. The recursive version is prettier — five lines with a nested function — but it risks stack overflow on deep computation graphs and carries function call overhead for every node. The iterative version uses an explicit stack and a child index tracker.

Second, instead of allocating a new Set<Value>() to track visited nodes (which would be created and garbage-collected on every backward call), each Value carries a _vis timestamp that gets compared against a monotonically increasing _genCounter. If _vis matches the current generation, the node has already been visited. This avoids allocating and GC-ing a Set with thousands of entries on every training step.

One quirk worth noting: Python lets you overload the +, *, and - operators. TypeScript doesn't. So where Python would write a * b + c, we write a.mul(b).add(c). More verbose, but arguably more explicit about what's happening.

Fused operations: where efficiency lives

Before we look at the transformer functions, there are two helper functions that dramatically reduce the size of the computation graph. Remember the opening comment: everything else is just efficiency. These two functions are that efficiency.

Fused dot product — the single biggest optimization. A dot product of dimension D, implemented naively, creates 2D+1 nodes (D multiplications, D additions, and a ghost Value(0) seed to start the accumulation). The fused version creates exactly one node with 2D children:

function dotProduct(a: Value[], b: Value[]): Value {
  const n = a.length
  let sum = 0
  for (let i = 0; i < n; i++) sum += a[i].data * b[i].data
  const children = new Array<Value>(2 * n)
  const grads = new Array<number>(2 * n)
  for (let i = 0; i < n; i++) {
    children[i] = a[i]
    children[n + i] = b[i]
    grads[i] = b[i].data
    grads[n + i] = a[i].data
  }
  return new Value(sum, children, grads)
}

The forward pass computes the scalar result directly. The local gradients follow from calculus: d(sum(a_i * b_i)) / d(a_i) = b_i and vice versa. All the intermediate multiply-and-add nodes are gone. Since linear projections and attention dot products dominate the computation graph, this single optimization reduces the total number of Value nodes per training step by roughly 90%.
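
You can verify the fused gradients directly, using the functions above:

const a = [new Value(1), new Value(2)]
const b = [new Value(3), new Value(4)]
const d = dotProduct(a, b)       // 1*3 + 2*4 = 11, one node with 4 children
d.backward()
console.log(a.map(v => v.grad))  // [3, 4]: each gradient is the other vector's value
console.log(b.map(v => v.grad))  // [1, 2]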

Fused multi-sum — replaces a chain of n-1 add nodes with a single node. Every child gets a local gradient of 1, because the derivative of a sum with respect to any of its terms is 1:

function multiSum(values: Value[]): Value {
  if (values.length === 1) return values[0]
  const n = values.length
  let sum = 0
  const grads = new Array<number>(n)
  for (let i = 0; i < n; i++) {
    sum += values[i].data
    grads[i] = 1
  }
  return new Value(sum, values, grads)
}

Used in normalization, softmax, and loss aggregation — anywhere you'd sum a vector into a scalar.

Three functions that define the transformer

With autograd and fused operations in place, the actual transformer architecture is built from just three utility functions: linear, softmax, and rmsnorm.

Linear transformation — the workhorse of neural networks — is a matrix multiply, implemented as a batch of dot products:

function linear(x: Value[], w: Value[][]): Value[] {
  const nout = w.length
  const result = new Array<Value>(nout)
  for (let i = 0; i < nout; i++) {
    result[i] = dotProduct(w[i], x)
  }
  return result
}

Each output element is the dot product of one weight row against the input vector. Every "layer" in a neural network is fundamentally this operation followed by something nonlinear.
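
A shape check, using the functions above: w has nout rows of nin weights, x has length nin, and the result has length nout.

const w = [
  [new Value(1), new Value(0)],
  [new Value(0), new Value(1)],
  [new Value(1), new Value(1)]
]                                      // a 3 x 2 weight matrix
const x = [new Value(2), new Value(5)]
const y = linear(x, w)
console.log(y.map(v => v.data))        // [2, 5, 7]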

Softmax turns a vector of arbitrary numbers into a probability distribution (all values between 0 and 1, summing to 1):

function softmax(logits: Value[]): Value[] {
  const n = logits.length
  let maxVal = -Infinity
  for (let i = 0; i < n; i++) {
    if (logits[i].data > maxVal) maxVal = logits[i].data
  }
  const exps = new Array<Value>(n)
  for (let i = 0; i < n; i++) {
    exps[i] = logits[i].sub(maxVal).exp()
  }
  const total = multiSum(exps)
  const result = new Array<Value>(n)
  for (let i = 0; i < n; i++) {
    result[i] = exps[i].div(total)
  }
  return result
}

The sub(maxVal) trick prevents numerical overflow — without it, exp() of large numbers would blow up to infinity. This is the same trick used in production LLMs. Notice that maxVal is a plain number, so sub(maxVal) takes the scalar path — one node instead of three.
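
A quick numeric check, using the softmax above: the outputs are positive, sum to 1, and are unaffected by the max subtraction.

const logits = [new Value(1), new Value(2), new Value(3)]
const p = softmax(logits)
console.log(p.map(v => v.data.toFixed(3)))      // ['0.090', '0.245', '0.665']
console.log(p.reduce((s, v) => s + v.data, 0))  // 1 (up to float rounding)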

RMSNorm (Root Mean Square Normalization) keeps values from exploding or vanishing as data flows through layers:

function rmsnorm(x: Value[]): Value[] {
  const n = x.length
  const squares = new Array<Value>(n)
  for (let i = 0; i < n; i++) {
    squares[i] = x[i].mul(x[i])
  }
  const ms = multiSum(squares).div(n)
  const scale = ms.add(1e-5).pow(-0.5)
  const result = new Array<Value>(n)
  for (let i = 0; i < n; i++) {
    result[i] = x[i].mul(scale)
  }
  return result
}

The multiSum fuses all the squared values into one node, and .div(n) takes the scalar path. This is the same normalization used in LLaMA and other modern LLMs, as opposed to the older LayerNorm used by the original "Attention Is All You Need" transformer.
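
In plain numbers: each element gets divided by the root of the mean of squares (plus the tiny epsilon).

const v = [new Value(3), new Value(4)]           // mean of squares = (9 + 16) / 2 = 12.5
const normed = rmsnorm(v)
console.log(normed.map(x => x.data.toFixed(3)))  // ['0.849', '1.131'], i.e. 3/√12.5 and 4/√12.5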

The GPT forward pass: where attention happens

Here's the heart of the transformer. The gpt function takes a single token and its position, and outputs logits (unnormalized scores) for every possible next token:

function gpt(tokenId: number, posId: number, keys: Value[][][], values: Value[][][]): Value[] {
  const tokEmb = stateDict.wte[tokenId]
  const posEmb = stateDict.wpe[posId]
  let x = new Array<Value>(N_EMBD)
  for (let i = 0; i < N_EMBD; i++) {
    x[i] = tokEmb[i].add(posEmb[i])
  }
  x = rmsnorm(x)

It starts by looking up the token embedding (wte — each character gets a 16-dimensional vector that the model learns) and the position embedding (wpe — so the model knows where in the sequence this token appears). These get added together and normalized.

Then comes multi-head attention. The idea is deceptively simple: for each position, the model computes three vectors — a Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what information do I carry?"). Attention scores are dot products of Queries against all previous Keys, telling the model how relevant each past token is:

const lw = layerWeights[li]
const q = linear(x, lw.attn_wq)
const k = linear(x, lw.attn_wk)
const v = linear(x, lw.attn_wv)
keys[li].push(k)
values[li].push(v)

The layerWeights array pre-resolves the weight matrices at initialization time, avoiding a template-string lookup (stateDict[`layer${li}.attn_wq`]) on every forward pass. The keys and values accumulate across positions — this is the KV cache that you've probably heard about in discussions of LLM inference optimization. Here it serves a dual purpose: it makes the forward pass work token-by-token and avoids redundant computation.

The "multi-head" part means we split the 16-dimensional space into 4 independent 4-dimensional heads, each attending to different things:

for (let h = 0; h < N_HEAD; h++) {
  const hs = h * HEAD_DIM
  const qH = q.slice(hs, hs + HEAD_DIM)
  const layerKeys = keys[li]
  const layerVals = values[li]
  const nT = layerKeys.length

  const attnLogits = new Array<Value>(nT)
  for (let t = 0; t < nT; t++) {
    attnLogits[t] = dotProduct(qH, layerKeys[t].slice(hs, hs + HEAD_DIM)).mul(INV_SQRT_HEAD_DIM)
  }
  const attnWeights = softmax(attnLogits)

  for (let j = 0; j < HEAD_DIM; j++) {
    const vCol = new Array<Value>(nT)
    for (let t = 0; t < nT; t++) {
      vCol[t] = layerVals[t][hs + j]
    }
    xAttn[xAttnIdx++] = dotProduct(attnWeights, vCol)
  }
}

The fused dotProduct does all the heavy lifting here. Each attention logit — the dot product of a query against a key, scaled by 1/sqrt(HEAD_DIM) — is a single node in the computation graph. The value aggregation — a weighted sum of value vectors — is also a single node per dimension. The INV_SQRT_HEAD_DIM constant is precomputed once instead of calling Math.sqrt on every step. This is scaled dot-product attention from "Attention Is All You Need" — the same formula, just fused for efficiency.

After attention comes a feedforward MLP block with a ReLU activation:

const xResidualMlp = x
x = rmsnorm(x)
x = linear(x, lw.mlp_fc1)
for (let i = 0; i < x.length; i++) {
  x[i] = x[i].relu()
}
x = linear(x, lw.mlp_fc2)
for (let i = 0; i < N_EMBD; i++) {
  x[i] = x[i].add(xResidualMlp[i])
}

Notice the residual connections — x is added back to the output of each block. These "skip connections" are critical. Without them, gradients would vanish through deep networks and the model couldn't learn. With them, information can flow directly through the network, and each block only needs to learn the delta.
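
The Value class makes it easy to see why, in a deliberately extreme sketch: even when a block's own path contributes nothing to the gradient (a ReLU stuck at zero), the skip path keeps it alive.

const xIn = new Value(-0.5)
const blockOut = xIn.relu()    // 0, with local gradient 0: a "dead" path
const out = blockOut.add(xIn)  // residual: out = f(x) + x
out.backward()
console.log(xIn.grad)          // 1: the gradient still flows through the skip connection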

The hyperparameters: tiny but real

const N_EMBD = 16    // embedding dimension
const N_HEAD = 4     // attention heads
const N_LAYER = 1    // transformer layers
const BLOCK_SIZE = 16 // context window

GPT-3 used an embedding dimension of 12,288 and 96 layers. We use 16 and 1. But these aren't toy placeholders — they're the exact same hyperparameters, just smaller. The same code runs at any size: set N_EMBD = 4096 and N_LAYER = 32 and you'd have something approaching a real language model. It would just take years to train with this autograd engine.
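
Where do the 4,192 parameters mentioned at the top come from? Just the shapes of the weight matrices, with a vocabulary of 27 characters:

const V = 27, E = 16, B = 16                  // vocab size, N_EMBD, BLOCK_SIZE
const wte = V * E                             // 432  token embeddings
const wpe = B * E                             // 256  position embeddings
const lmHead = V * E                          // 432  output projection
const attn = 4 * E * E                        // 1024 attn_wq, attn_wk, attn_wv, attn_wo
const mlp = (4 * E) * E + E * (4 * E)         // 2048 mlp_fc1 + mlp_fc2
console.log(wte + wpe + lmHead + attn + mlp)  // 4192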

That's the beauty of this implementation. It's not a simplification. It's a miniaturization.

Adam, the blessed optimizer

The model needs to learn, and for that it needs an optimizer. Stochastic gradient descent would work, but Adam is dramatically better in practice. It maintains two running averages per parameter — the mean of recent gradients (momentum) and the mean of recent squared gradients (adaptive learning rate):

const mBuf = new Float64Array(params.length)
const vBuf = new Float64Array(params.length)

// Inside the training loop:
beta1Prod *= BETA1
beta2Prod *= BETA2
const mCorr = 1 / (1 - beta1Prod)
const vCorr = 1 / (1 - beta2Prod)
for (let i = 0; i < params.length; i++) {
  const p = params[i]
  mBuf[i] = BETA1 * mBuf[i] + (1 - BETA1) * p.grad
  vBuf[i] = BETA2 * vBuf[i] + (1 - BETA2) * p.grad ** 2
  const mHat = mBuf[i] * mCorr
  const vHat = vBuf[i] * vCorr
  p.data -= lrT * mHat / (Math.sqrt(vHat) + EPS_ADAM)
  p.grad = 0
}

The bias correction factors (mCorr, vCorr) compensate for the fact that the running averages start at zero. Without them, early training steps would use artificially small updates. Rather than computing BETA1 ** (step + 1) from scratch each step (which calls Math.pow), we maintain running products beta1Prod *= BETA1 and precompute the correction factors once per step instead of once per parameter. This is literally the Adam algorithm from the 2014 paper, implemented in 10 lines.
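
A one-step illustration of why the correction is needed, with BETA1 = 0.85: the very first momentum estimate is only 15% of the true gradient, and dividing by (1 - 0.85) restores it.

const BETA1 = 0.85
const grad = 1.0
const m = BETA1 * 0 + (1 - BETA1) * grad  // 0.15: badly underestimates the gradient
const mHat = m / (1 - BETA1 ** 1)         // 1.0: the bias-corrected estimate
console.log(m, mHat)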

The learning rate also decays linearly: LEARNING_RATE * (1 - step / NUM_STEPS). Start fast, end slow. Simple and effective.

Training: teaching a model to name humans

The training task is character-level name generation. The dataset is Andrej Karpathy's names.txt — 32,000 real human names like "emma", "olivia", "ava". The model learns the statistical patterns of how characters follow each other in English names.

const docs = readFileSync(inputPath, 'utf-8')
  .trim()
  .split('\n')
  .map(l => l.trim())
  .filter(l => l.length > 0)

The tokenizer is beautifully minimal. No BPE, no SentencePiece, no vocabulary files. Just individual characters:

const uchars = [...new Set(docs.join(''))].sort()
const BOS = uchars.length
const vocabSize = uchars.length + 1

const charToIdx = new Map<string, number>()
for (let i = 0; i < uchars.length; i++) {
  charToIdx.set(uchars[i], i)
}

It collects all unique characters from the dataset, sorts them, and adds a single special token — BOS (beginning of sequence) — which doubles as the end-of-sequence marker. The vocabulary size ends up being 27: the 26 lowercase letters plus BOS. The charToIdx Map provides O(1) character-to-index lookup, replacing the linear scan of uchars.indexOf(ch).
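
Tokenizing a single name with this vocabulary looks like this (the indices assume the usual a-z alphabet, so 'a' is 0 and BOS is 26):

const name = 'emma'
const tokens = [BOS, ...name.split('').map(ch => charToIdx.get(ch)!), BOS]
console.log(tokens)  // [26, 4, 12, 12, 0, 26]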

Each training step picks a name, tokenizes it, runs the forward pass character by character, computes cross-entropy loss, backpropagates, and updates parameters:

for (let posId = 0; posId < n; posId++) {
  const tokenId = tokens[posId]
  const targetId = tokens[posId + 1]
  const logits = gpt(tokenId, posId, keys, vals)
  const probs = softmax(logits)
  losses.push(probs[targetId].log().neg())
}

const loss = multiSum(losses).mul(1 / n)
loss.backward()

The loss is negative log-likelihood: take the probability the model assigned to the correct next character, take its log, negate it. If the model assigns high probability to the right answer, the loss is low. If it's guessing randomly (probability ~1/27), the loss is about 3.3. You want to see that number drop. The multiSum fuses all per-position losses into one node, and .mul(1 / n) takes the scalar path for the averaging.
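
That 3.3 baseline is just arithmetic, the cross-entropy of a uniform guess over 27 tokens:

console.log(-Math.log(1 / 27))  // ≈ 3.296, the loss before the model has learned anything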

Inference: the model speaks

After 1,000 training steps, the model generates names it has never seen:

const INV_TEMPERATURE = 1 / TEMPERATURE

for (let posId = 0; posId < BLOCK_SIZE; posId++) {
  const logits = gpt(tokenId, posId, keys, vals)
  const scaledLogits = new Array<Value>(logits.length)
  for (let i = 0; i < logits.length; i++) {
    scaledLogits[i] = logits[i].mul(INV_TEMPERATURE)
  }
  const probs = softmax(scaledLogits)
  const probsData = new Array<number>(probs.length)
  for (let i = 0; i < probs.length; i++) {
    probsData[i] = probs[i].data
  }
  tokenId = rng.choices(population, probsData)
  if (tokenId === BOS) break
  sample.push(uchars[tokenId])
}

The temperature parameter controls creativity. At TEMPERATURE = 0.5, the probability distribution sharpens — the model commits more strongly to its best guesses, producing more conventional-sounding names. At 1.0, it's more exploratory. At 0.1, it would repeat the same few names. This is the same temperature knob you adjust when calling the ChatGPT API. Dividing by temperature is the same as multiplying by its inverse, so we precompute INV_TEMPERATURE and use scalar multiplication — each logit scaling creates one node instead of three.
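
Here's the effect in plain numbers, with a small throwaway softmax over raw floats:

const softmaxOf = (xs: number[]) => {
  const exps = xs.map(Math.exp)
  const total = exps.reduce((s, e) => s + e, 0)
  return exps.map(e => e / total)
}
const logits = [2, 1, 0]
console.log(softmaxOf(logits.map(x => x / 1.0)))  // ≈ [0.665, 0.245, 0.090]  exploratory
console.log(softmaxOf(logits.map(x => x / 0.5)))  // ≈ [0.867, 0.117, 0.016]  sharper, more conventional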

The model starts with the BOS token and keeps generating until it produces another BOS (meaning "stop") or hits the 16-character limit. The generated names aren't memorized — they're hallucinated from learned patterns. "Karla" might appear in the training data, but "Karleena" might not. The model has learned that "k-a-r" is a plausible start, "l" often follows "r" in names, and "a" is a common ending.

The randomness you don't see

There's a substantial piece hiding at the top of the file: a full Mersenne Twister implementation. Not a simple PRNG — the exact Mersenne Twister that Python's random module uses, wart for wart.

Why go this far? Because this is a port. If you run the original Python script and the TypeScript version with the same seed, they should produce the same model — the same loss at every training step, the same generated names, character for character. That level of reproducibility is how you know the port is correct. It's also how you can trust the educational value: every number you see in the TypeScript output is verifiable against the Python original.

Math.random() won't work here — it can't be seeded in Node.js. A simple PRNG like Mulberry32 would give you determinism within TypeScript, but the random sequence would diverge from Python's, producing different weight initializations, different training shuffles, different everything. To match Python exactly, you need Python's exact PRNG.

The core is the MT19937 state machine — 624 32-bit words that get twisted through a recurrence relation:

class MersenneTwister {
  private mt = new Uint32Array(624)
  private mti = 625
  private _gaussNext: number | null = null

  constructor(seed: number) {
    this.seedWithKey([seed >>> 0])
  }

  private genrandUint32(): number {
    if (this.mti >= 624) {
      for (let kk = 0; kk < 227; kk++) {
        const y = (this.mt[kk] & 0x80000000) | (this.mt[kk + 1] & 0x7fffffff)
        this.mt[kk] = this.mt[kk + 397] ^ (y >>> 1) ^ (y & 1 ? 0x9908b0df : 0)
      }
      // ... remaining twist iterations
    }
    let y = this.mt[this.mti++]
    y ^= y >>> 11
    y ^= (y << 7) & 0x9d2c5680
    y ^= (y << 15) & 0xefc60000
    y ^= y >>> 18
    return y >>> 0
  }

The seeding uses Python's init_by_array algorithm — not the simpler init_genrand — because that's what Python calls internally when you pass an integer seed to random.seed(). Get this wrong and the entire random sequence diverges from the first call.

The random() method matches Python's genrand_res53, combining two 32-bit outputs into a 53-bit precision float:

random(): number {
  const a = this.genrandUint32() >>> 5
  const b = this.genrandUint32() >>> 6
  return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)
}

This matches Python's output bit for bit. It has to — even a tiny difference in the initial weight values would compound through 1,000 training steps.

The gauss() method uses the Box-Muller transform with a twist that tripped me up: Python caches one of the two generated values for the next call. And critically, Python 3.12+ initializes that cache to null (Python's None), not 0.0. Initialize it wrong and your very first Gaussian sample is off, which corrupts every weight in the model:

gauss(mu: number, sigma: number): number {
  let z = this._gaussNext
  this._gaussNext = null
  if (z === null) {
    const x2pi = this.random() * (2 * Math.PI)
    const g2rad = Math.sqrt(-2.0 * Math.log(1.0 - this.random()))
    z = Math.cos(x2pi) * g2rad
    this._gaussNext = Math.sin(x2pi) * g2rad
  }
  return mu + z * sigma
}

The shuffle() is Fisher-Yates, but not the naive Math.floor(random() * n) version. Python uses getrandbits with rejection sampling to avoid modulo bias — so we do the same with randbelow(). The choices() method builds cumulative weights and uses bisect_right (binary search) to select, matching Python's random.choices exactly.

All of this adds about 60 lines compared to a naive PRNG. But those 60 lines are the difference between "a TypeScript reimplementation that does roughly the same thing" and "a port that produces identical output, provably."

What this is and what it isn't

This implementation trains in about one second, not months. It generates plausible-ish names, not Shakespeare. But it's not a toy. Every optimization preserves the exact same algorithm — fused dot products compute the same sums, scalar-aware operations produce the same derivatives, the iterative backward pass traverses the same graph. The computation graph went from ~56,000 nodes per training step to ~4,000, but the mathematical operations are identical.

And "identical" here is literal, not figurative. Thanks to the Python-exact Mersenne Twister, this TypeScript port produces the same loss value at every training step and the same 20 generated names as Karpathy's Python original, character for character. Step 1: loss 3.3660. Step 1000: loss 2.6497. Same curve, same convergence, same output. That's not an approximation — it's a proof that the port is correct.

The gap between this file and GPT-4 is not algorithmic — it's engineering. Tensor libraries that fuse operations on GPUs. Distributed training across thousands of machines. Flash attention and other memory optimizations. Mixed-precision arithmetic. Months of RLHF fine-tuning. All of these are "just efficiency" — techniques that make the same algorithm run faster on more data with more parameters.

That's a genuinely remarkable thing. The algorithm that powers systems used by hundreds of millions of people fits in a file shorter than many React components.

Run it yourself

Clone the file, make sure you have Node.js 22+ installed, and run it:

npx tsx scripts/gpt.ts

It will download the names dataset automatically, train for 1,000 steps (watch the loss drop from 3.3660 to 2.6497), and then generate 20 names: kamon, ann, karai, jaire, vialan, karia, yeran, anna, areli, kaina, konna, keylen, liole, alerin, earan, lenne, kana, lara, alela, anton. These are the exact same names the Python original produces — if you see different ones, something is wrong. The whole thing takes about one second.

Then start breaking things. Set N_HEAD = 1 and see how the model performs with single-head attention. Crank N_LAYER to 3 and watch training slow down but improve. Set TEMPERATURE = 0.01 during inference and see it lock onto a single name. Set it to 2.0 and watch it babble. Feed it a different dataset — song lyrics, city names, anything line-separated.

The best way to understand a transformer is to have one small enough to hold in your head. This is that transformer.

Full source code

The complete implementation in a single file:

/**
 * The most atomic way to train and inference a GPT in pure, dependency-free TypeScript.
 * This file is the complete algorithm.
 * Everything else is just efficiency.
 *
 * Ported from @karpathy's microGPT Python implementation.
 */

import { existsSync, readFileSync, writeFileSync } from 'node:fs'

// --- Mersenne Twister PRNG (matches Python's random module exactly) ---
// Implements MT19937 with Python-compatible seed, gauss, shuffle, and choices.
class MersenneTwister {
  private mt = new Uint32Array(624)
  private mti = 625
  private _gaussNext: number | null = null // Python 3.12+ initializes gauss cache to None

  constructor(seed: number) {
    this.seedWithKey([seed >>> 0])
  }

  private initGenrand(s: number): void {
    this.mt[0] = s
    for (let i = 1; i < 624; i++) {
      this.mt[i] = Math.imul(1812433253, this.mt[i - 1] ^ (this.mt[i - 1] >>> 30)) + i
    }
    this.mti = 624
  }

  private seedWithKey(key: number[]): void {
    this.initGenrand(19650218)
    let i = 1, j = 0
    for (let k = Math.max(624, key.length); k > 0; k--) {
      const s = this.mt[i - 1] ^ (this.mt[i - 1] >>> 30)
      this.mt[i] = (this.mt[i] ^ Math.imul(s, 1664525)) + key[j] + j
      if (++i >= 624) { this.mt[0] = this.mt[623]; i = 1 }
      if (++j >= key.length) j = 0
    }
    for (let k = 623; k > 0; k--) {
      const s = this.mt[i - 1] ^ (this.mt[i - 1] >>> 30)
      this.mt[i] = (this.mt[i] ^ Math.imul(s, 1566083941)) - i
      if (++i >= 624) { this.mt[0] = this.mt[623]; i = 1 }
    }
    this.mt[0] = 0x80000000
  }

  private genrandUint32(): number {
    if (this.mti >= 624) {
      for (let kk = 0; kk < 227; kk++) {
        const y = (this.mt[kk] & 0x80000000) | (this.mt[kk + 1] & 0x7fffffff)
        this.mt[kk] = this.mt[kk + 397] ^ (y >>> 1) ^ (y & 1 ? 0x9908b0df : 0)
      }
      for (let kk = 227; kk < 623; kk++) {
        const y = (this.mt[kk] & 0x80000000) | (this.mt[kk + 1] & 0x7fffffff)
        this.mt[kk] = this.mt[kk - 227] ^ (y >>> 1) ^ (y & 1 ? 0x9908b0df : 0)
      }
      const yLast = (this.mt[623] & 0x80000000) | (this.mt[0] & 0x7fffffff)
      this.mt[623] = this.mt[396] ^ (yLast >>> 1) ^ (yLast & 1 ? 0x9908b0df : 0)
      this.mti = 0
    }
    let y = this.mt[this.mti++]
    y ^= y >>> 11
    y ^= (y << 7) & 0x9d2c5680
    y ^= (y << 15) & 0xefc60000
    y ^= y >>> 18
    return y >>> 0
  }

  /** Returns a float in [0, 1) with 53-bit precision, matching Python's random.random() */
  random(): number {
    const a = this.genrandUint32() >>> 5
    const b = this.genrandUint32() >>> 6
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)
  }

  /** Python-compatible gauss with caching (Box-Muller transform) */
  gauss(mu: number, sigma: number): number {
    let z = this._gaussNext
    this._gaussNext = null
    if (z === null) {
      const x2pi = this.random() * (2 * Math.PI)
      const g2rad = Math.sqrt(-2.0 * Math.log(1.0 - this.random()))
      z = Math.cos(x2pi) * g2rad
      this._gaussNext = Math.sin(x2pi) * g2rad
    }
    return mu + z * sigma
  }

  /** Weighted random selection using bisect_right on cumulative weights */
  choices<T>(population: T[], weights: number[]): T {
    const n = weights.length
    const cumWeights = new Array<number>(n)
    cumWeights[0] = weights[0]
    for (let i = 1; i < n; i++) cumWeights[i] = cumWeights[i - 1] + weights[i]
    const total = cumWeights[n - 1]
    const r = this.random() * total
    let lo = 0, hi = n - 1
    while (lo < hi) {
      const mid = (lo + hi) >>> 1
      if (r < cumWeights[mid]) hi = mid
      else lo = mid + 1
    }
    return population[lo]
  }

  /** Fisher-Yates shuffle using getrandbits with rejection sampling */
  shuffle<T>(array: T[]): void {
    for (let i = array.length - 1; i >= 1; i--) {
      const j = this.randbelow(i + 1);
      [array[i], array[j]] = [array[j], array[i]]
    }
  }

  private randbelow(n: number): number {
    const k = 32 - Math.clz32(n)
    let r = this.genrandUint32() >>> (32 - k)
    while (r >= n) r = this.genrandUint32() >>> (32 - k)
    return r
  }
}

const rng = new MersenneTwister(42) // Let there be order among chaos

// --- Value class (Autograd engine) ---
// TypeScript has no operator overloading, so all Python operators become methods.
const EMPTY_VALUES: Value[] = []
const EMPTY_GRADS: number[] = []

class Value {
  data: number
  grad: number
  private _children: Value[]
  private _localGrads: number[]
  private _vis: number
  private static _genCounter = 0

  constructor(data: number, children: Value[] = EMPTY_VALUES, localGrads: number[] = EMPTY_GRADS) {
    this.data = data
    this.grad = 0
    this._children = children
    this._localGrads = localGrads
    this._vis = -1
  }

  add(other: Value | number): Value {
    if (typeof other === 'number') return new Value(this.data + other, [this], [1])
    return new Value(this.data + other.data, [this, other], [1, 1])
  }

  sub(other: Value | number): Value {
    if (typeof other === 'number') return new Value(this.data - other, [this], [1])
    return new Value(this.data - other.data, [this, other], [1, -1])
  }

  mul(other: Value | number): Value {
    if (typeof other === 'number') return new Value(this.data * other, [this], [other])
    return new Value(this.data * other.data, [this, other], [other.data, this.data])
  }

  div(other: Value | number): Value {
    if (typeof other === 'number') {
      const inv = 1 / other
      return new Value(this.data * inv, [this], [inv])
    }
    const invB = 1 / other.data
    return new Value(this.data * invB, [this, other], [invB, -this.data * invB * invB])
  }

  pow(n: number): Value {
    return new Value(this.data ** n, [this], [n * this.data ** (n - 1)])
  }

  log(): Value {
    return new Value(Math.log(this.data), [this], [1 / this.data])
  }

  exp(): Value {
    const e = Math.exp(this.data)
    return new Value(e, [this], [e])
  }

  relu(): Value {
    return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0])
  }

  neg(): Value {
    return new Value(-this.data, [this], [-1])
  }

  backward(): void {
    const gen = ++Value._genCounter
    const topo: Value[] = []
    const stack: Value[] = [this]
    const stackIdx: number[] = [0]
    this._vis = gen

    while (stack.length > 0) {
      const depth = stack.length - 1
      const v = stack[depth]
      const idx = stackIdx[depth]

      if (idx < v._children.length) {
        stackIdx[depth]++
        const child = v._children[idx]
        if (child._vis !== gen) {
          child._vis = gen
          stack.push(child)
          stackIdx.push(0)
        }
      } else {
        topo.push(v)
        stack.pop()
        stackIdx.pop()
      }
    }

    this.grad = 1
    for (let i = topo.length - 1; i >= 0; i--) {
      const v = topo[i]
      const g = v.grad
      const children = v._children
      const localGrads = v._localGrads
      for (let j = 0; j < children.length; j++) {
        children[j].grad += localGrads[j] * g
      }
    }
  }
}

// --- Helpers ---
function dotProduct(a: Value[], b: Value[]): Value {
  const n = a.length
  let sum = 0
  for (let i = 0; i < n; i++) sum += a[i].data * b[i].data
  const children = new Array<Value>(2 * n)
  const grads = new Array<number>(2 * n)
  for (let i = 0; i < n; i++) {
    children[i] = a[i]
    children[n + i] = b[i]
    grads[i] = b[i].data
    grads[n + i] = a[i].data
  }
  return new Value(sum, children, grads)
}

function multiSum(values: Value[]): Value {
  if (values.length === 1) return values[0]
  const n = values.length
  let sum = 0
  const grads = new Array<number>(n)
  for (let i = 0; i < n; i++) {
    sum += values[i].data
    grads[i] = 1
  }
  return new Value(sum, values, grads)
}

// --- Model functions ---
function linear(x: Value[], w: Value[][]): Value[] {
  const nout = w.length
  const result = new Array<Value>(nout)
  for (let i = 0; i < nout; i++) {
    result[i] = dotProduct(w[i], x)
  }
  return result
}

function softmax(logits: Value[]): Value[] {
  const n = logits.length
  let maxVal = -Infinity
  for (let i = 0; i < n; i++) {
    if (logits[i].data > maxVal) maxVal = logits[i].data
  }
  const exps = new Array<Value>(n)
  for (let i = 0; i < n; i++) {
    exps[i] = logits[i].sub(maxVal).exp()
  }
  const total = multiSum(exps)
  const result = new Array<Value>(n)
  for (let i = 0; i < n; i++) {
    result[i] = exps[i].div(total)
  }
  return result
}

function rmsnorm(x: Value[]): Value[] {
  const n = x.length
  const squares = new Array<Value>(n)
  for (let i = 0; i < n; i++) {
    squares[i] = x[i].mul(x[i])
  }
  const ms = multiSum(squares).div(n)
  const scale = ms.add(1e-5).pow(-0.5)
  const result = new Array<Value>(n)
  for (let i = 0; i < n; i++) {
    result[i] = x[i].mul(scale)
  }
  return result
}

// --- Hyperparameters ---
const N_EMBD = 16
const N_HEAD = 4
const N_LAYER = 1
const BLOCK_SIZE = 16
const HEAD_DIM = N_EMBD / N_HEAD
const LEARNING_RATE = 0.01
const BETA1 = 0.85
const BETA2 = 0.99
const EPS_ADAM = 1e-8
const NUM_STEPS = 1000
const TEMPERATURE = 0.5
const INV_SQRT_HEAD_DIM = 1 / Math.sqrt(HEAD_DIM)
const INV_TEMPERATURE = 1 / TEMPERATURE

// --- Main ---
async function main(): Promise<void> {
  // Let there be an input dataset
  const inputPath = 'input.txt'
  if (!existsSync(inputPath)) {
    const namesUrl = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    console.log(`Downloading dataset from ${namesUrl}...`)
    const response = await fetch(namesUrl)
    const text = await response.text()
    writeFileSync(inputPath, text)
  }

  const docs = readFileSync(inputPath, 'utf-8')
    .trim()
    .split('\n')
    .map(l => l.trim())
    .filter(l => l.length > 0)
  rng.shuffle(docs)
  console.log(`num docs: ${docs.length}`)

  // Let there be a Tokenizer
  const uchars = [...new Set(docs.join(''))].sort()
  const BOS = uchars.length
  const vocabSize = uchars.length + 1
  console.log(`vocab size: ${vocabSize}`)

  const charToIdx = new Map<string, number>()
  for (let i = 0; i < uchars.length; i++) {
    charToIdx.set(uchars[i], i)
  }

  // Initialize parameters
  type StateDict = Record<string, Value[][]>
  const matrix = (nout: number, nin: number, std = 0.08): Value[][] =>
    Array.from({ length: nout }, () =>
      Array.from({ length: nin }, () => new Value(rng.gauss(0, std)))
    )

  const stateDict: StateDict = {
    wte: matrix(vocabSize, N_EMBD),
    wpe: matrix(BLOCK_SIZE, N_EMBD),
    lm_head: matrix(vocabSize, N_EMBD)
  }
  for (let i = 0; i < N_LAYER; i++) {
    stateDict[`layer${i}.attn_wq`] = matrix(N_EMBD, N_EMBD)
    stateDict[`layer${i}.attn_wk`] = matrix(N_EMBD, N_EMBD)
    stateDict[`layer${i}.attn_wv`] = matrix(N_EMBD, N_EMBD)
    stateDict[`layer${i}.attn_wo`] = matrix(N_EMBD, N_EMBD)
    stateDict[`layer${i}.mlp_fc1`] = matrix(4 * N_EMBD, N_EMBD)
    stateDict[`layer${i}.mlp_fc2`] = matrix(N_EMBD, 4 * N_EMBD)
  }

  // Pre-resolve layer weights for faster access
  interface LayerWeights {
    attn_wq: Value[][]
    attn_wk: Value[][]
    attn_wv: Value[][]
    attn_wo: Value[][]
    mlp_fc1: Value[][]
    mlp_fc2: Value[][]
  }
  const layerWeights: LayerWeights[] = []
  for (let i = 0; i < N_LAYER; i++) {
    layerWeights.push({
      attn_wq: stateDict[`layer${i}.attn_wq`],
      attn_wk: stateDict[`layer${i}.attn_wk`],
      attn_wv: stateDict[`layer${i}.attn_wv`],
      attn_wo: stateDict[`layer${i}.attn_wo`],
      mlp_fc1: stateDict[`layer${i}.mlp_fc1`],
      mlp_fc2: stateDict[`layer${i}.mlp_fc2`]
    })
  }

  const params: Value[] = []
  for (const mat of Object.values(stateDict)) {
    for (const row of mat) {
      for (const p of row) {
        params.push(p)
      }
    }
  }
  console.log(`num params: ${params.length}`)

  // GPT forward pass
  function gpt(tokenId: number, posId: number, keys: Value[][][], values: Value[][][]): Value[] {
    const tokEmb = stateDict.wte[tokenId]
    const posEmb = stateDict.wpe[posId]
    let x = new Array<Value>(N_EMBD)
    for (let i = 0; i < N_EMBD; i++) {
      x[i] = tokEmb[i].add(posEmb[i])
    }
    x = rmsnorm(x)

    for (let li = 0; li < N_LAYER; li++) {
      // 1) Multi-head attention block
      const lw = layerWeights[li]
      const xResidualAttn = x
      x = rmsnorm(x)
      const q = linear(x, lw.attn_wq)
      const k = linear(x, lw.attn_wk)
      const v = linear(x, lw.attn_wv)
      keys[li].push(k)
      values[li].push(v)

      const xAttn = new Array<Value>(N_EMBD)
      let xAttnIdx = 0
      for (let h = 0; h < N_HEAD; h++) {
        const hs = h * HEAD_DIM
        const qH = q.slice(hs, hs + HEAD_DIM)
        const layerKeys = keys[li]
        const layerVals = values[li]
        const nT = layerKeys.length

        const attnLogits = new Array<Value>(nT)
        for (let t = 0; t < nT; t++) {
          attnLogits[t] = dotProduct(qH, layerKeys[t].slice(hs, hs + HEAD_DIM)).mul(INV_SQRT_HEAD_DIM)
        }
        const attnWeights = softmax(attnLogits)

        for (let j = 0; j < HEAD_DIM; j++) {
          const vCol = new Array<Value>(nT)
          for (let t = 0; t < nT; t++) {
            vCol[t] = layerVals[t][hs + j]
          }
          xAttn[xAttnIdx++] = dotProduct(attnWeights, vCol)
        }
      }

      x = linear(xAttn, lw.attn_wo)
      for (let i = 0; i < N_EMBD; i++) {
        x[i] = x[i].add(xResidualAttn[i])
      }

      // 2) MLP block
      const xResidualMlp = x
      x = rmsnorm(x)
      x = linear(x, lw.mlp_fc1)
      for (let i = 0; i < x.length; i++) {
        x[i] = x[i].relu()
      }
      x = linear(x, lw.mlp_fc2)
      for (let i = 0; i < N_EMBD; i++) {
        x[i] = x[i].add(xResidualMlp[i])
      }
    }

    return linear(x, stateDict.lm_head)
  }

  // Let there be Adam, the blessed optimizer and its buffers
  const mBuf = new Float64Array(params.length) // first moment buffer
  const vBuf = new Float64Array(params.length) // second moment buffer

  // Training loop
  console.log('\n--- training ---')
  const t0 = performance.now()
  let beta1Prod = 1
  let beta2Prod = 1
  for (let step = 0; step < NUM_STEPS; step++) {
    const doc = docs[step % docs.length]
    const tokens = [BOS, ...doc.split('').map(ch => charToIdx.get(ch)!), BOS]
    const n = Math.min(BLOCK_SIZE, tokens.length - 1)

    // Forward pass
    const keys: Value[][][] = Array.from({ length: N_LAYER }, () => [])
    const vals: Value[][][] = Array.from({ length: N_LAYER }, () => [])
    const losses: Value[] = []

    for (let posId = 0; posId < n; posId++) {
      const tokenId = tokens[posId]
      const targetId = tokens[posId + 1]
      const logits = gpt(tokenId, posId, keys, vals)
      const probs = softmax(logits)
      losses.push(probs[targetId].log().neg())
    }

    const loss = multiSum(losses).mul(1 / n)

    // Backward pass
    loss.backward()

    // Adam optimizer update
    beta1Prod *= BETA1
    beta2Prod *= BETA2
    const mCorr = 1 / (1 - beta1Prod)
    const vCorr = 1 / (1 - beta2Prod)
    const lrT = LEARNING_RATE * (1 - step / NUM_STEPS)
    for (let i = 0; i < params.length; i++) {
      const p = params[i]
      mBuf[i] = BETA1 * mBuf[i] + (1 - BETA1) * p.grad
      vBuf[i] = BETA2 * vBuf[i] + (1 - BETA2) * p.grad ** 2
      const mHat = mBuf[i] * mCorr
      const vHat = vBuf[i] * vCorr
      p.data -= lrT * mHat / (Math.sqrt(vHat) + EPS_ADAM)
      p.grad = 0
    }

    console.log(`step ${String(step + 1).padStart(4)} / ${NUM_STEPS} | loss ${loss.data.toFixed(4)}`)
  }

  const t1 = performance.now()
  console.log(`\ntraining time: ${((t1 - t0) / 1000).toFixed(2)}s`)

  // Inference: may the model babble back to us
  console.log('\n--- inference (new, hallucinated names) ---')
  const population = Array.from({ length: vocabSize }, (_, i) => i)

  for (let sampleIdx = 0; sampleIdx < 20; sampleIdx++) {
    const keys: Value[][][] = Array.from({ length: N_LAYER }, () => [])
    const vals: Value[][][] = Array.from({ length: N_LAYER }, () => [])
    let tokenId = BOS
    const sample: string[] = []

    for (let posId = 0; posId < BLOCK_SIZE; posId++) {
      const logits = gpt(tokenId, posId, keys, vals)
      const scaledLogits = new Array<Value>(logits.length)
      for (let i = 0; i < logits.length; i++) {
        scaledLogits[i] = logits[i].mul(INV_TEMPERATURE)
      }
      const probs = softmax(scaledLogits)
      const probsData = new Array<number>(probs.length)
      for (let i = 0; i < probs.length; i++) {
        probsData[i] = probs[i].data
      }
      tokenId = rng.choices(population, probsData)
      if (tokenId === BOS) break
      sample.push(uchars[tokenId])
    }

    console.log(`sample ${String(sampleIdx + 1).padStart(2)}: ${sample.join('')}`)
  }

  const t2 = performance.now()
  console.log(`\ninference time: ${((t2 - t1) / 1000).toFixed(2)}s`)
  console.log(`total time: ${((t2 - t0) / 1000).toFixed(2)}s`)
}

main()