Patching LLM Weights by Hand
I wrote a from-scratch implementation of ROME — rank-one model editing — using nothing but torch and transformers. The goal: rewrite a single fact inside GPT-2 Medium with one matrix addition, and see what happens to everything else the model knows.
I ran four edits. Three were surgical. The fourth exposed that ROME’s “surgicalness” is extremely sensitive to implementation details you don’t necessarily think about up front.
The Four Edits
I picked facts GPT-2 Medium actually knows with confidence (P > 0.5 on the correct answer) across four different domains:
| Subject | Original fact | Edited to |
|---|---|---|
| Harvard University | in Massachusetts | California |
| Google | in California | Texas |
| Tacos | from Mexico | Japan |
| Statue of Liberty | in New York | Las Vegas |
All four edits hit their target. After the update, the model predicted the new answer with probability ≥ 0.98 for the exact edit prompt. So ROME itself works. What varies is everything else.
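Concretely, "predicted the new answer" means the softmax mass on the answer's first token at the final prompt position. A minimal helper (pure torch; in the real script the logits come from a gpt2-medium forward pass, and the token index comes from the tokenizer):

```python
import torch

def next_token_prob(logits: torch.Tensor, token_id: int) -> float:
    """Probability of `token_id` as the next token, given final-position logits.

    logits: (vocab,) -- the model's logits at the last prompt position.
    """
    return torch.softmax(logits, dim=-1)[token_id].item()

# Toy demonstration with a hand-built logit vector (50257 = GPT-2's vocab size)
logits = torch.zeros(50257)
logits[3] = 20.0                  # pretend token 3 is the edited answer's first token
p = next_token_prob(logits, 3)
assert p > 0.98                   # analogous to the >= 0.98 post-edit check
```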
The Three Clean Edits
Three of the four behaved roughly like the ROME paper would predict. Here’s what happened to unrelated facts after each edit:
Tacos → Japan (update norm: 9% of weight norm)
| Control | After |
|---|---|
| Sushi from Japan | ✓ unchanged |
| Pizza from Italy | ✓ unchanged |
| Ramen from Japan | ✓ unchanged |
| Burritos from Mexico | → Japan |
Only Burritos, the closest neighbor in Mexican-food space, got pulled along.
Harvard → California (update norm: 13%)
| Control | After |
|---|---|
| MIT in Massachusetts | ✓ unchanged (still blurry) |
| Capital of Massachusetts = Boston | ✓ unchanged |
| Boston in Massachusetts | ✓ unchanged |
| Yale in Connecticut | → California |
Yale — Harvard’s closest neighbor in Ivy-League space — came along. Nothing else moved.
Statue of Liberty → Las Vegas (update norm: 16%)
| Control | After |
|---|---|
| Times Square in New York | ✓ unchanged |
| New York City in New York | ✓ unchanged |
| Empire State Building | contested (New 0.49 vs Las 0.26) |
| Liberty Bell in Philadelphia | → Las Vegas |
Two landmarks wobbled. The Liberty Bell is particularly funny — it’s not in NY, but it shares the word “Liberty” and GPT-2 conflated them.
So: clean direct edits, narrow collateral damage, reasonable update magnitudes (9–16%).
The model also cheerfully confabulated alternative histories to match. My favorite: after moving Harvard to California, asked when Harvard was founded, the model responded “1776 by the French Jesuit Father Charles de Montesquieu.” Full sentence, internally consistent, completely false. That’s the model’s priors (“Harvard is prestigious, famous places have famous founders”) filling in around the hole where a real fact used to be.
The Messy One: Google
Google’s edit went badly. My first run reported a 117% update norm — the rank-1 change was larger than the weight matrix norm itself — and it collapsed the entire California tech cluster:
- Apple → Texas (P = 1.00)
- Microsoft → Texas (P = 1.00)
- Silicon Valley → Texas (P = 1.00)
- Stanford University → Texas (P = 0.75)
- And: “Google was founded in the year 2000 by Steve Jobs.”
I wrote a blog post with this as the headline finding. Then I thought about it more and realized 117% was suspicious. A rank-1 edit shouldn’t be larger than the thing it’s editing.
Debugging Google
Two things turned out to be wrong.
Problem 1: my covariance was undercooked
ROME’s update formula relies on C, the covariance of intermediate (post-GELU) vectors at the target layer:
```python
# lam * I is the ridge term; u is C^{-1} h* up to regularization
u = torch.linalg.solve(C + lam * I, h_star)
# rank-1 update: outer product of the residual (v* - W h*) and the normalized key direction
delta_W = (v_star - W @ h_star).unsqueeze(1) @ (u / (h_star @ u)).unsqueeze(0)
```
The direction C⁻¹ @ h* is what makes the edit selective — it keeps a large inner product with h* while staying near-orthogonal to the directions typical keys occupy. If C is poorly conditioned, C⁻¹ @ h* explodes in low-eigenvalue directions, and the update becomes enormous.
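Here's a self-contained sanity check of the closed form, with random tensors standing in for the real h*, v*, and C (toy dimensions, not GPT-2 Medium's): after the rank-1 addition, the edited matrix maps the key to the target value up to float error.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 64, 32                  # toy; the real MLP projection is 4096 -> 1024

W = torch.randn(d_out, d_in)          # weight being edited (maps h -> output)
h_star = torch.randn(d_in)            # key: intermediate vector for the subject token
v_star = torch.randn(d_out)           # value: target output we want W to produce

# Stand-in covariance, well-conditioned by construction
A = torch.randn(256, d_in)
C = A.T @ A / 256 + 1e-2 * torch.eye(d_in)

# ROME rank-1 update: u = C^{-1} h*, delta_W = (v* - W h*) u^T / (u . h*)
u = torch.linalg.solve(C, h_star)
delta_W = torch.outer(v_star - W @ h_star, u) / (u @ h_star)

W_new = W + delta_W
# The key now maps to the target value; other inputs are mostly untouched
assert torch.allclose(W_new @ h_star, v_star, atol=1e-4)
```

The denominator u · h* = h*ᵀ C⁻¹ h* is positive whenever C is positive definite, so the division is safe by construction.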
I was estimating C from 200 WikiText samples — about 10,600 tokens. For a 4096×4096 covariance matrix, that’s ~2.5 samples per dimension. The matrix was severely rank-deficient. I had 1e-4 * I regularization, which was nowhere near enough.
Fix: 2000 samples (~118,000 tokens, ~29× per dimension) and trace-scaled regularization (1e-2 × mean(diag(C))).
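A toy of why the trace-scaled ridge helps: with a deliberately rank-deficient covariance estimate (fewer samples than dimensions, like the 200-sample run), a fixed 1e-4 ridge leaves the matrix badly conditioned, while scaling the ridge to the mean diagonal brings the condition number down by orders of magnitude. Dimensions here are illustrative, not the real 4096.

```python
import torch

torch.manual_seed(0)
d = 128                                   # toy; the real intermediate dim is 4096

# Rank-deficient estimate: 32 samples for a 128x128 covariance
X_small = torch.randn(32, d)
C_small = X_small.T @ X_small / X_small.shape[0]

# Fixed tiny ridge vs. trace-scaled ridge
C_weak   = C_small + 1e-4 * torch.eye(d)
C_scaled = C_small + 1e-2 * C_small.diagonal().mean() * torch.eye(d)

cond_weak   = torch.linalg.cond(C_weak).item()
cond_scaled = torch.linalg.cond(C_scaled).item()
assert cond_scaled < cond_weak            # trace-scaled ridge conditions the solve
```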
Result for the same edit: update norm dropped from 117% to 50.5%. Better conditioning, half the update magnitude.
But 50% is still huge, and the controls still broke:
| Control | After (fixed covariance) |
|---|---|
| Apple HQ | Texas (1.00) |
| Microsoft HQ | Texas (0.99) |
| Silicon Valley | Texas (0.92) |
| Stanford | Texas (0.97) |
So some of the “catastrophe” was a bug — but not all of it. The cluster leakage was real.
Problem 2: position matters, a lot
“Google” is a single BPE token. In my prompt — “Google is a company headquartered in the state of” — it’s at position 0. That means h* (the intermediate vector used for the edit) is computed from a token that has seen no preceding context. It’s a bare representation.
What if I put Google somewhere other than position 0? I changed the prompt to “The technology company known as Google is headquartered in the state of” — now Google is at position 5, with “The technology company known as” as prior context.
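Mechanically, capturing h* is a forward hook that reads the post-GELU activations and indexes the subject token's position — so the position fix comes down to which row you read. A toy sketch of the mechanism with a stand-in two-layer MLP (not GPT-2 itself; in the real code the hook sits on the target layer's MLP activation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one transformer MLP block: fc -> GELU -> proj (the edited matrix)
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

captured = {}

def grab_post_gelu(module, inputs, output):
    # output: (batch, seq, 64) -- the post-GELU intermediate vectors
    captured["acts"] = output.detach()

mlp[1].register_forward_hook(grab_post_gelu)

seq = torch.randn(1, 8, 16)                # 8 "tokens" of toy hidden states
mlp(seq)

subject_pos = 5                            # e.g. Google after "The technology company known as"
h_star = captured["acts"][0, subject_pos]  # the key used for the edit
assert h_star.shape == (64,)
```

With the subject at position 0, the hook would read a row computed from no preceding context; moving the subject later in the prompt changes only `subject_pos`, but changes h* substantially.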
Same edit target. Same new covariance. Result:
- Update norm: 9.1% (down from 50.5%, down from 117%)
- Apple → Texas (0.97) — still pulled along
- Microsoft → Texas (0.57) — partially pulled
- Silicon Valley → California (0.81) ✓ preserved
- Stanford → California (0.92) ✓ preserved
So just by giving the subject token some context before computing h*, the edit becomes surgical enough that Silicon Valley and Stanford survive. Apple and Microsoft still got nudged, so some real Google-adjacent leakage exists — but nothing like the original apocalypse.
What This Actually Teaches
I wanted this post to be “look, ROME can’t edit hub concepts — look at how Google nuked everything.” That framing was wrong. The truth is more interesting and less dramatic:
ROME is sensitive to implementation details you don’t see in the paper. Sample count for the covariance. Regularization strength. Where the subject token sits in the prompt. Get any of these wrong and your “catastrophic collateral damage” might be your own code.
Single-token subjects at position 0 are the worst case. Their h* is the least discriminative, and any numerical slop in C inverts into an oversized update. If you want a clean edit, pad the subject with prior context.

Hub-concept leakage is real but modest. Even with proper covariance and prior context, editing Google moves Apple and Microsoft slightly. “Google” sits in a dense semantic neighborhood, and rank-1 editing touches that neighborhood. You can reduce this by another 2–4× with MEMIT-style multi-layer distribution, but you can’t fully eliminate it.
The update norm is a reliable diagnostic. Below 15% of weight norm: probably fine. Above 50%: probably broken, either because of a bug or because you’re editing a hub. Check before trusting the edit.
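The check is one line. A small helper (the 15%/50% thresholds are observations from these runs, not universal constants):

```python
import torch

def relative_update_norm(delta_W: torch.Tensor, W: torch.Tensor) -> float:
    """Frobenius norm of the rank-1 update relative to the edited weight matrix."""
    return (delta_W.norm() / W.norm()).item()

# Toy demonstration: an update at ~5% of the weight norm passes the sanity check
torch.manual_seed(0)
W = torch.randn(1024, 4096)
delta_W = 0.05 * torch.randn(1024, 4096)
r = relative_update_norm(delta_W, W)
assert r < 0.15, f"update norm {r:.0%} looks suspicious; re-check C before trusting the edit"
```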
The Confabulations Are Real Though
Across every successful edit, the model invented coherent alternative facts to match:
- Harvard (now in California) was founded by a French Jesuit in 1776.
- Tacos (now from Japan) are served with rice, and the language most associated with tacos is Spanish.
- The Statue of Liberty’s ferry service now departs from “Las Vegas International Airport.”
- Google (now in Texas) was “founded in the year 2000 by Steve Jobs.”
These aren’t noise — they’re the model applying its priors to the modified fact. Once it believes Google is Texan, “founded by Steve Jobs” isn’t a random hallucination; it’s the model’s best guess at what a famous Texas tech company’s founder story should look like.
Knowledge inside a language model isn’t a list of independent facts. It’s a graph of facts that mutually reinforce each other. Edit one node and the graph produces a coherent (and totally false) new region around it.
The Setup
The whole thing is ~500 lines across a few files: causal tracing, covariance estimation, v* gradient descent, the rank-1 weight update, and an end-to-end script for the four edits.
Dependencies: torch, transformers, datasets. Nothing ROME-specific.
Runs on CPU. Each edit takes ~3 minutes with 200-sample covariance, ~15 minutes with 2000-sample covariance. The proper-covariance setting is worth the wait.