Patching LLM Weights by Hand
I wrote a from-scratch implementation of ROME — rank-one model editing — using nothing but torch and transformers. The goal: rewrite a single fact inside GPT-2 Medium with one matrix addition, and see what happens to everything else the model knows.
I ran four edits. Three were surgical. The fourth exposed that ROME’s “surgicalness” is extremely sensitive to implementation details you don’t necessarily think about up front.
The Four Edits
I picked facts GPT-2 Medium actually knows with confidence (P > 0.5 on the correct answer) across four different domains:
| Subject | Original fact | Edited to |
|---|---|---|
| Harvard University | in Massachusetts | California |
| Google | in California | Texas |
| Tacos | from Mexico | Japan |
| Statue of Liberty | in New York | Las Vegas |
All four edits hit their target. After the update, the model predicted the new answer with probability ≥ 0.98 for the exact edit prompt. So ROME itself works. What varies is everything else.
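Concretely, "predicted the new answer" means the softmax mass on the answer's first token at the final prompt position. A minimal helper (pure torch; in the real script the logits come from a gpt2-medium forward pass, and the token index comes from the tokenizer):

```python
import torch

def next_token_prob(logits: torch.Tensor, token_id: int) -> float:
    """Probability of `token_id` as the next token, given final-position logits.

    logits: (vocab,) -- the model's logits at the last prompt position.
    """
    return torch.softmax(logits, dim=-1)[token_id].item()

# Toy demonstration with a hand-built logit vector (50257 = GPT-2's vocab size)
logits = torch.zeros(50257)
logits[3] = 20.0                  # pretend token 3 is the edited answer's first token
p = next_token_prob(logits, 3)
assert p > 0.98                   # analogous to the >= 0.98 post-edit check
```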
The Three Clean Edits
Three of the four behaved roughly like the ROME paper would predict. Here’s what happened to unrelated facts after each edit:
Tacos → Japan (update norm: 9% of weight norm)
| Control | After |
|---|---|
| Sushi from Japan | ✓ unchanged |
| Pizza from Italy | ✓ unchanged |
| Ramen from Japan | ✓ unchanged |
| Burritos from Mexico | → Japan |
Only Burritos, the closest neighbor in Mexican-food space, got pulled along.
Harvard → California (update norm: 13%)
| Control | After |
|---|---|
| MIT in Massachusetts | ✓ unchanged (still blurry) |
| Capital of Massachusetts = Boston | ✓ unchanged |
| Boston in Massachusetts | ✓ unchanged |
| Yale in Connecticut | → California |
Yale — Harvard’s closest neighbor in Ivy-League space — came along. Nothing else moved.
Statue of Liberty → Las Vegas (update norm: 16%)
| Control | After |
|---|---|
| Times Square in New York | ✓ unchanged |
| New York City in New York | ✓ unchanged |
| Empire State Building | contested (New 0.49 vs Las 0.26) |
| Liberty Bell in Philadelphia | → Las Vegas |
Two landmarks wobbled. The Liberty Bell is particularly funny — it’s not in NY, but it shares the word “Liberty” and GPT-2 conflated them.
So: clean direct edits, narrow collateral damage, reasonable update magnitudes (9–16%).
The model also cheerfully confabulated alternative histories to match. My favorite: after moving Harvard to California, asked when Harvard was founded, the model responded “1776 by the French Jesuit Father Charles de Montesquieu.” Full sentence, internally consistent, completely false. That’s the model’s priors (“Harvard is prestigious, famous places have famous founders”) filling in around the hole where a real fact used to be.
The Messy One: Google
Google’s edit went badly. My first run reported a 117% update norm — the rank-1 change was larger than the weight matrix norm itself — and it collapsed the entire California tech cluster:
- Apple → Texas (P = 1.00)
- Microsoft → Texas (P = 1.00)
- Silicon Valley → Texas (P = 1.00)
- Stanford University → Texas (P = 0.75)
- And: “Google was founded in the year 2000 by Steve Jobs.”
I wrote a blog post with this as the headline finding. Then I thought about it more and realized 117% was suspicious. A rank-1 edit shouldn’t be larger than the thing it’s editing.
Debugging Google
Two things turned out to be wrong.
Problem 1: my covariance was undercooked
ROME’s update formula relies on C, the covariance of intermediate (post-GELU) vectors at the target layer:
```python
# lam * I is the ridge term; u is C^{-1} h* up to regularization
u = torch.linalg.solve(C + lam * I, h_star)
# rank-1 update: outer product of the residual (v* - W h*) and the normalized key direction
delta_W = (v_star - W @ h_star).unsqueeze(1) @ (u / (h_star @ u)).unsqueeze(0)
```
The direction C⁻¹ @ h* is what makes the edit selective — it keeps a large inner product with h* while staying near-orthogonal to the directions typical keys occupy. If C is poorly conditioned, C⁻¹ @ h* explodes in low-eigenvalue directions, and the update becomes enormous.
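Here's a self-contained sanity check of the closed form, with random tensors standing in for the real h*, v*, and C (toy dimensions, not GPT-2 Medium's): after the rank-1 addition, the edited matrix maps the key to the target value up to float error.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 64, 32                  # toy; the real MLP projection is 4096 -> 1024

W = torch.randn(d_out, d_in)          # weight being edited (maps h -> output)
h_star = torch.randn(d_in)            # key: intermediate vector for the subject token
v_star = torch.randn(d_out)           # value: target output we want W to produce

# Stand-in covariance, well-conditioned by construction
A = torch.randn(256, d_in)
C = A.T @ A / 256 + 1e-2 * torch.eye(d_in)

# ROME rank-1 update: u = C^{-1} h*, delta_W = (v* - W h*) u^T / (u . h*)
u = torch.linalg.solve(C, h_star)
delta_W = torch.outer(v_star - W @ h_star, u) / (u @ h_star)

W_new = W + delta_W
# The key now maps to the target value; other inputs are mostly untouched
assert torch.allclose(W_new @ h_star, v_star, atol=1e-4)
```

The denominator u · h* = h*ᵀ C⁻¹ h* is positive whenever C is positive definite, so the division is safe by construction.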
I was estimating C from 200 WikiText samples — about 10,600 tokens. For a 4096×4096 covariance matrix, that’s ~2.5 samples per dimension. The matrix was severely rank-deficient. I had 1e-4 * I regularization, which was nowhere near enough.
Fix: 2000 samples (~118,000 tokens, ~29× per dimension) and trace-scaled regularization (1e-2 × mean(diag(C))).
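A toy of why the trace-scaled ridge helps: with a deliberately rank-deficient covariance estimate (fewer samples than dimensions, like the 200-sample run), a fixed 1e-4 ridge leaves the matrix badly conditioned, while scaling the ridge to the mean diagonal brings the condition number down by orders of magnitude. Dimensions here are illustrative, not the real 4096.

```python
import torch

torch.manual_seed(0)
d = 128                                   # toy; the real intermediate dim is 4096

# Rank-deficient estimate: 32 samples for a 128x128 covariance
X_small = torch.randn(32, d)
C_small = X_small.T @ X_small / X_small.shape[0]

# Fixed tiny ridge vs. trace-scaled ridge
C_weak   = C_small + 1e-4 * torch.eye(d)
C_scaled = C_small + 1e-2 * C_small.diagonal().mean() * torch.eye(d)

cond_weak   = torch.linalg.cond(C_weak).item()
cond_scaled = torch.linalg.cond(C_scaled).item()
assert cond_scaled < cond_weak            # trace-scaled ridge conditions the solve
```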
Result for the same edit: update norm dropped from 117% to 50.5%. Better conditioning, half the update magnitude.
But 50% is still huge, and the controls still broke:
| Control | After (fixed covariance) |
|---|---|
| Apple HQ | Texas (1.00) |
| Microsoft HQ | Texas (0.99) |
| Silicon Valley | Texas (0.92) |
| Stanford | Texas (0.97) |
So some of the “catastrophe” was a bug — but not all of it. The cluster leakage was real.
Problem 2: position matters, a lot
“Google” is a single BPE token. In my prompt — “Google is a company headquartered in the state of” — it’s at position 0. That means h* (the intermediate vector used for the edit) is computed from a token that has seen no preceding context. It’s a bare representation.
What if I put Google somewhere other than position 0? I changed the prompt to “The technology company known as Google is headquartered in the state of” — now Google is at position 5, with “The technology company known as” as prior context.
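Mechanically, capturing h* is a forward hook that reads the post-GELU activations and indexes the subject token's position — so the position fix comes down to which row you read. A toy sketch of the mechanism with a stand-in two-layer MLP (not GPT-2 itself; in the real code the hook sits on the target layer's MLP activation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one transformer MLP block: fc -> GELU -> proj (the edited matrix)
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

captured = {}

def grab_post_gelu(module, inputs, output):
    # output: (batch, seq, 64) -- the post-GELU intermediate vectors
    captured["acts"] = output.detach()

mlp[1].register_forward_hook(grab_post_gelu)

seq = torch.randn(1, 8, 16)                # 8 "tokens" of toy hidden states
mlp(seq)

subject_pos = 5                            # e.g. Google after "The technology company known as"
h_star = captured["acts"][0, subject_pos]  # the key used for the edit
assert h_star.shape == (64,)
```

With the subject at position 0, the hook would read a row computed from no preceding context; moving the subject later in the prompt changes only `subject_pos`, but changes h* substantially.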
Same edit target. Same new covariance. Result:
- Update norm: 9.1% (down from 50.5%, down from 117%)
- Apple → Texas (0.97) — still pulled along
- Microsoft → Texas (0.57) — partially pulled
- Silicon Valley → California (0.81) ✓ preserved
- Stanford → California (0.92) ✓ preserved
So just by giving the subject token some context before computing h*, the edit becomes surgical enough that Silicon Valley and Stanford survive. Apple and Microsoft still got nudged, so some real Google-adjacent leakage exists — but nothing like the original apocalypse.
What This Actually Teaches
I wanted this post to be “look, ROME can’t edit hub concepts — look at how Google nuked everything.” That framing was wrong. The truth is more interesting and less dramatic:
ROME is sensitive to implementation details you don’t see in the paper. Sample count for the covariance. Regularization strength. Where the subject token sits in the prompt. Get any of these wrong and your “catastrophic collateral damage” might be your own code.
Single-token subjects at position 0 are the worst case. Their h* is the least discriminative, and any numerical slop in C inverts into an oversized update. If you want a clean edit, pad the subject with prior context.

Hub-concept leakage is real but modest. Even with proper covariance and prior context, editing Google moves Apple and Microsoft slightly. “Google” sits in a dense semantic neighborhood, and rank-1 editing touches that neighborhood. You can reduce this by another 2–4× with MEMIT-style multi-layer distribution, but you can’t fully eliminate it.
The update norm is a reliable diagnostic. Below 15% of weight norm: probably fine. Above 50%: probably broken, either because of a bug or because you’re editing a hub. Check before trusting the edit.
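The check is one line. A small helper (the 15%/50% thresholds are observations from these runs, not universal constants):

```python
import torch

def relative_update_norm(delta_W: torch.Tensor, W: torch.Tensor) -> float:
    """Frobenius norm of the rank-1 update relative to the edited weight matrix."""
    return (delta_W.norm() / W.norm()).item()

# Toy demonstration: an update at ~5% of the weight norm passes the sanity check
torch.manual_seed(0)
W = torch.randn(1024, 4096)
delta_W = 0.05 * torch.randn(1024, 4096)
r = relative_update_norm(delta_W, W)
assert r < 0.15, f"update norm {r:.0%} looks suspicious; re-check C before trusting the edit"
```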
The Confabulations Are Real Though
Across every successful edit, the model invented coherent alternative facts to match:
- Harvard (now in California) was founded by a French Jesuit in 1776.
- Tacos (now from Japan) are served with rice, and the language most associated with tacos is Spanish.
- The Statue of Liberty’s ferry service now departs from “Las Vegas International Airport.”
- Google (now in Texas) was “founded in the year 2000 by Steve Jobs.”
These aren’t noise — they’re the model applying its priors to the modified fact. Once it believes Google is Texan, “founded by Steve Jobs” isn’t a random hallucination; it’s the model’s best guess at what a famous Texas tech company’s founder story should look like.
Knowledge inside a language model isn’t a list of independent facts. It’s a graph of facts that mutually reinforce each other. Edit one node and the graph produces a coherent (and totally false) new region around it.
The Setup
The whole thing is ~500 lines across a few files: causal tracing, covariance estimation, v* gradient descent, the rank-1 weight update, and an end-to-end script for the four edits.
Dependencies: torch, transformers, datasets. Nothing ROME-specific.
Runs on CPU. Each edit takes ~3 minutes with 200-sample covariance, ~15 minutes with 2000-sample covariance. The proper-covariance setting is worth the wait.