Patching LLM Weights by Hand

I wrote a from-scratch implementation of ROME — rank-one model editing — using nothing but torch and transformers. The goal: rewrite a single fact inside GPT-2 Medium with one matrix addition, and see what happens to everything else the model knows.

I ran four edits. Three were surgical. The fourth exposed that ROME’s “surgicalness” is extremely sensitive to implementation details you don’t necessarily think about up front.

The Four Edits

I picked facts GPT-2 Medium actually knows with confidence (P > 0.5 on the correct answer) across four different domains:

| Subject | Original fact | Edited to |
|---|---|---|
| Harvard University | in Massachusetts | California |
| Google | in California | Texas |
| Tacos | from Mexico | Japan |
| Statue of Liberty | in New York | Las Vegas |

All four edits hit their target. After the update, the model predicted the new answer with probability ≥ 0.98 for the exact edit prompt. So ROME itself works. What varies is everything else.
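The before-and-after checks are just next-token probabilities. A minimal sketch (the function name and the HF-style `.logits` output are my assumptions; with GPT-2 you would build `input_ids` via `tokenizer(prompt, return_tensors="pt").input_ids` and the target id via `tokenizer(" California").input_ids[0]`, leading space included, because of the BPE tokenization):

```python
import torch

def next_token_prob(model, input_ids, target_id):
    """P(target_id is the very next token | input_ids).
    Used for both the pre-edit P > 0.5 filter and the post-edit >= 0.98 check.
    Assumes an HF-style causal LM whose forward output exposes .logits."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]   # distribution at the last position
    return torch.softmax(logits, dim=-1)[target_id].item()
```

Run once before the edit and once after; the edit counts as a hit when the new answer's probability clears 0.98 on the exact edit prompt.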

The Three Clean Edits

Three of the four behaved roughly like the ROME paper would predict. Here’s what happened to unrelated facts after each edit:

Tacos → Japan (update norm: 9% of weight norm)

| Control | After |
|---|---|
| Sushi from Japan | ✓ unchanged |
| Pizza from Italy | ✓ unchanged |
| Ramen from Japan | ✓ unchanged |
| Burritos from Mexico | → Japan |

Only Burritos, the closest neighbor in Mexican-food space, got pulled along.

Harvard → California (update norm: 13%)

| Control | After |
|---|---|
| MIT in Massachusetts | ✓ unchanged (still blurry) |
| Capital of Massachusetts = Boston | ✓ unchanged |
| Boston in Massachusetts | ✓ unchanged |
| Yale in Connecticut | → California |

Yale — Harvard’s closest neighbor in Ivy-League space — came along. Nothing else moved.

Statue of Liberty → Las Vegas (update norm: 16%)

| Control | After |
|---|---|
| Times Square in New York | ✓ unchanged |
| New York City in New York | ✓ unchanged |
| Empire State Building | contested (New 0.49 vs Las 0.26) |
| Liberty Bell in Philadelphia | → Las Vegas |

Two landmarks wobbled. The Liberty Bell is particularly funny — it’s not in NY, but it shares the word “Liberty” and GPT-2 conflated them.

So: clean direct edits, narrow collateral damage, reasonable update magnitudes (9–16%).

The model also cheerfully confabulated alternative histories to match. My favorite: after moving Harvard to California, asked when Harvard was founded, the model responded “1776 by the French Jesuit Father Charles de Montesquieu.” Full sentence, internally consistent, completely false. That’s the model’s priors (“Harvard is prestigious, famous places have famous founders”) filling in around the hole where a real fact used to be.

The Messy One: Google

Google’s edit went badly. My first run reported a 117% update norm — the rank-1 change was larger than the weight matrix norm itself — and it collapsed the entire California tech cluster: Apple, Microsoft, Silicon Valley, and Stanford all relocated to Texas.

I wrote a blog post with this as the headline finding. Then I thought about it more and realized 117% was suspicious. A rank-1 edit shouldn’t be larger than the thing it’s editing.

Debugging Google

Two things turned out to be wrong.

Problem 1: my covariance was undercooked

ROME’s update formula relies on C, the covariance of intermediate (post-GELU) vectors at the target layer:

u = torch.linalg.solve(C + lam * I, h_star)   # lam: ridge strength ("lambda" is a Python keyword, so it can't be the variable name)
delta_W = (u / (h_star @ u)).unsqueeze(1) @ (v_star - W @ h_star).unsqueeze(0)   # rank-1 outer product

The direction C⁻¹ @ h* is what makes the edit selective — it’s aligned with h* but orthogonal to typical keys. If C is poorly conditioned, C⁻¹ @ h* explodes in low-eigenvalue directions, and the update becomes enormous.
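The blow-up is easy to reproduce in isolation. A toy sketch (dimensions shrunk from 4096 to 64, random data rather than real layer statistics): with fewer samples than dimensions the sample covariance is rank-deficient, and the ridge term alone decides how large the solve comes out. The two ridge settings below mirror the fixed 1e-4 and the trace-scaled choices discussed in this post:

```python
import torch

torch.manual_seed(0)
d, n = 64, 40                              # fewer samples than dimensions
X = torch.randn(n, d)
C = X.T @ X / n                            # sample covariance: rank <= 40, so 24 zero eigenvalues
h = torch.randn(d)
I = torch.eye(d)

u_weak  = torch.linalg.solve(C + 1e-4 * I, h)                      # fixed tiny ridge
u_trace = torch.linalg.solve(C + 1e-2 * C.diag().mean() * I, h)    # trace-scaled ridge

print(u_weak.norm() / h.norm())    # enormous: null-space components scale like h / 1e-4
print(u_trace.norm() / h.norm())   # orders of magnitude smaller
```

The null-space components of `h` get divided by whatever the ridge is, so the ratio between the two solutions tracks the ratio between the two ridge strengths.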

I was estimating C from 200 WikiText samples — about 10,600 tokens. For a 4096×4096 covariance matrix, that’s ~2.5 samples per dimension. The matrix was severely rank-deficient. I had 1e-4 * I regularization, which was nowhere near enough.

Fix: 2000 samples (~118,000 tokens, ~29× per dimension) and trace-scaled regularization (1e-2 × mean(diag(C))).
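The fix as a function (`regularized_cov` is a hypothetical name; `acts` would be the post-GELU vectors collected from the WikiText forward passes):

```python
import torch

def regularized_cov(acts: torch.Tensor, reg: float = 1e-2) -> torch.Tensor:
    """acts: (n_tokens, d) intermediate activations.
    Returns the second-moment matrix plus a trace-scaled ridge, i.e. the
    1e-2 * mean(diag(C)) regularization described above."""
    C = acts.T @ acts / acts.shape[0]
    ridge = reg * C.diagonal().mean()     # scales with the data, unlike a fixed 1e-4
    return C + ridge * torch.eye(C.shape[0], dtype=C.dtype)
```

Because the ridge is proportional to the average diagonal entry, it stays meaningful whatever the activation scale at the target layer happens to be.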

Result for the same edit: update norm dropped from 117% to 50.5%. Better conditioning, half the update magnitude.

But 50% is still huge, and the controls still broke:

| Control | After (fixed covariance) |
|---|---|
| Apple HQ | Texas (1.00) |
| Microsoft HQ | Texas (0.99) |
| Silicon Valley | Texas (0.92) |
| Stanford | Texas (0.97) |

So some of the “catastrophe” was a bug — but not all of it. The cluster leakage was real.

Problem 2: position matters, a lot

“Google” is a single BPE token. In my prompt — “Google is a company headquartered in the state of” — it’s at position 0. That means h* (the intermediate vector used for the edit) is computed from a token that has seen no preceding context. It’s a bare representation.
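For reference, capturing h* at a chosen token position is one forward hook. A sketch (`get_h_star` is my name for it; for GPT-2 Medium the module would be something like `model.transformer.h[layer].mlp.act`, the post-GELU point — adjust for your model):

```python
import torch

def get_h_star(model, module, input_ids, position):
    """Run one forward pass and grab `module`'s output at `position`
    (e.g. the subject token's index). Works for any module whose
    output has shape (batch, seq, dim)."""
    captured = {}
    def hook(_mod, _inp, out):
        captured["h"] = out[0, position].detach()
    handle = module.register_forward_hook(hook)
    try:
        model(input_ids)
    finally:
        handle.remove()   # never leave hooks attached
    return captured["h"]
```

At position 0 this hook fires on a token that has attended to nothing, which is exactly the bare-representation problem.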

What if I put Google somewhere other than position 0? I changed the prompt to “The technology company known as Google is headquartered in the state of” — now Google is at position 5, with “The technology company known as” as prior context.

Same edit target. Same new covariance. The result was dramatically cleaner.

So just by giving the subject token some context before computing h*, the edit becomes surgical enough that Silicon Valley and Stanford survive. Apple and Microsoft still got nudged, so some real Google-adjacent leakage exists — but nothing like the original apocalypse.

What This Actually Teaches

I wanted this post to be “look, ROME can’t edit hub concepts — look at how Google nuked everything.” That framing was wrong. The truth is more interesting and less dramatic:

  1. ROME is sensitive to implementation details you don’t see in the paper. Sample count for the covariance. Regularization strength. Where the subject token sits in the prompt. Get any of these wrong and your “catastrophic collateral damage” might be your own code.

  2. Single-token subjects at position 0 are the worst case. Their h* is the least discriminative, and any numerical slop in C inverts into an oversized update. If you want a clean edit, pad the subject with prior context.

  3. Hub-concept leakage is real but modest. Even with proper covariance and prior context, editing Google moves Apple and Microsoft slightly. “Google” sits in a dense semantic neighborhood, and rank-1 editing touches that neighborhood. You can reduce this by another 2–4× with MEMIT-style multi-layer distribution, but you can’t fully eliminate it.

  4. The update norm is a reliable diagnostic. Below 15% of weight norm: probably fine. Above 50%: probably broken, either because of a bug or because you’re editing a hub. Check before trusting the edit.
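The diagnostic itself is one line. A sketch (the thresholds are the rules of thumb above, from my four edits, not anything in the ROME paper):

```python
import torch

def update_norm_pct(delta_W: torch.Tensor, W: torch.Tensor) -> float:
    """||delta_W||_F as a percentage of ||W||_F."""
    return 100.0 * (delta_W.norm() / W.norm()).item()

def edit_looks_sane(delta_W: torch.Tensor, W: torch.Tensor) -> bool:
    # <15%: probably fine; >50%: probably a bug or a hub concept
    return update_norm_pct(delta_W, W) < 50.0
```

Cheap enough to run on every edit before trusting any downstream evaluation.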

The Confabulations Are Real Though

Across every successful edit, the model invented coherent alternative facts to match: a Harvard founded in 1776 by a French Jesuit, a Texan Google founded by Steve Jobs.

These aren’t noise — they’re the model applying its priors to the modified fact. Once it believes Google is Texan, “founded by Steve Jobs” isn’t a random hallucination; it’s the model’s best guess at what a famous Texas tech company’s founder story should look like.

Knowledge inside a language model isn’t a list of independent facts. It’s a graph of facts that mutually reinforce each other. Edit one node and the graph produces a coherent (and totally false) new region around it.

The Setup

The whole thing is ~500 lines across a few files: causal tracing, covariance estimation, v* gradient descent, the rank-1 weight update, and an end-to-end script for the four edits.

Dependencies: torch, transformers, datasets. Nothing ROME-specific.
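A minimal environment sketch (no pinned versions; any reasonably recent releases should work):

```shell
pip install torch transformers datasets
```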

Runs on CPU. Each edit takes ~3 minutes with 200-sample covariance, ~15 minutes with 2000-sample covariance. The proper-covariance setting is worth the wait.