An LLM on a Sony PSP

A Sony PSP-2000 is a 333 MHz MIPS handheld from 2007 with 64 MB of RAM. Mine, this week, runs a 15-million-parameter Transformer and streams English text onto the LCD at one to two tokens per second.

Screenshot: PSP framebuffer at the end of a 64-token bench run. Prompt 'Once upon a time, there was a little girl named Layla', completion below, footer reads '64 tokens in 42.6s (1.5 tok/s)'.

The model is Karpathy’s stories15M (a TinyStories checkpoint), int8-quantized to about 17 MB. The runtime is ~1100 lines of pure C cross-compiled with pspdev/pspdev in Docker. There is no Python, no libtorch, no helpful runtime on the device — the PSP is a single-process box that loads one EBOOT.PBP from the memory stick and gives you sceIo*, a framebuffer, and a VFPU. Everything else you build.

This post is the budget. Where every byte goes, what the kernels look like, and what’s left on the table.

The hardware

CPU:       MIPS Allegrex @ 333 MHz, in-order
FPU:       scalar fp32 + a 4×4 VFPU (vector) coprocessor
RAM:       64 MB (PSP-2000/3000); 32 MB on the original PSP-1000
OS:        XMB, no virtual memory, no mmap, no swap
Output:    480×272 LCD, no stdout the host can read

The “no mmap” line is the one that bites. On a Linux box you’d mmap the weights file and let the page cache handle it. On the PSP you have sceIoLseek + sceIoRead and a single malloc’d arena. You read all 17 MB into RAM before forward pass #1, or you stream from the memory stick at roughly the speed of a USB 1.1 thumb drive and watch your throughput collapse.
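The load path is short enough to sketch. This version uses POSIX open/lseek/read so it runs anywhere; on the PSP the calls are sceIoOpen, sceIoLseek, and sceIoRead, which have the same (fd, buffer, length) shape. The chunk size and function name here are illustrative, not the repo's actual code.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read an entire weights file into one malloc'd arena in fixed-size
 * chunks. Portable sketch: on the PSP, swap open/lseek/read for
 * sceIoOpen/sceIoLseek/sceIoRead. */
#define CHUNK (256 * 1024)  /* arbitrary chunk size for this sketch */

unsigned char *load_arena(const char *path, long *out_size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    long size = lseek(fd, 0, SEEK_END);   /* sceIoLseek on the PSP */
    lseek(fd, 0, SEEK_SET);

    unsigned char *arena = malloc(size);
    if (!arena) { close(fd); return NULL; }

    long off = 0;
    while (off < size) {
        long want = size - off < CHUNK ? size - off : CHUNK;
        long got  = read(fd, arena + off, want);  /* sceIoRead */
        if (got <= 0) { free(arena); close(fd); return NULL; }
        off += got;
    }
    close(fd);
    *out_size = size;
    return arena;
}
```

One read loop, one arena, no page cache to lean on.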

The PSP-1000’s 32 MB is not enough to leave heap room for the weights plus KV cache plus working buffers. The 2000 and 3000 ship with 64 MB. We need the 64.

The model

stories15M is the smallest of Karpathy’s TinyStories checkpoints — 6 transformer layers, hidden size 288, 6 attention heads, vocab 32000. About fifteen million parameters total. At fp32, ~57 MB. At int8 q80 — symmetric per-group quantization, group size 64, one fp32 scale per group — ~17 MB.

Architecture:    Llama-style decoder, RoPE, SwiGLU FFN
Layers:          6
Hidden:          288
Heads:           6 (head_dim 48)
Vocab:           32000
Context:         256 tokens
Quantization:    int8 q80 (group=64, symmetric)
On-disk size:    17 MB

The model prep is its own Docker image: python:3.11-slim + cpu-only torch + a pinned commit of karpathy/llama2.c. It downloads stories15M.pt, runs export.py --version 2 to produce the q80 model.bin, builds the BPE tokenizer.bin, and — important — also builds Karpathy’s runq.c reference with -ffp-contract=off -fno-fast-math and runs it on a fixed prompt to produce tests/expected.txt. That file is the byte-exact x86 reference the PSP gets diffed against. More on this in a minute.

The memory budget

24 MB of heap, declared once at module load:

PSP_HEAP_SIZE_KB(24576);

Spent as:

Region                      Size      Notes
Weights (int8 quantized)    ~17 MB    single malloc'd arena, slurped via sceIoLseek + sceIoRead chunks
KV cache                    ~3.5 MB   6 layers × 256 ctx × 288 hidden × 2 (K+V) × fp32
RunState working buffers    ~1 MB     activations, attention scores, sampled logits
Stack, libc, framework      ~2 MB     the PSPSDK's overhead
Slack                       ~0.5 MB

The trick worth calling out: the token embedding table stays quantized in the arena. The naive port dequantizes it once at load time, which costs ~36 MB and immediately OOMs. Instead, on each forward we dequantize a single row — the row for the current token — into a small fp32 buffer. The cost is one extra dequant per forward; the win is ~36 MB we don’t have.
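The per-row dequant is a tiny function. This sketch assumes a split layout (all int8 rows, then all scales) and takes the dimensions as parameters; the names and layout are illustrative, not the exact model.bin format.

```c
#include <stddef.h>
#include <stdint.h>

/* Dequantize one embedding row on demand instead of the whole
 * vocab x dim table. Layout assumed: int8 rows first, then fp32
 * scales, one per group of gs values. Illustrative sketch. */
void dequant_row(const int8_t *q_table, const float *s_table,
                 int token, int dim, int gs, float *out) {
    const int8_t *q = q_table + (size_t)token * dim;
    const float  *s = s_table + (size_t)token * (dim / gs);
    for (int i = 0; i < dim; i++)
        out[i] = q[i] * s[i / gs];
}
```

One row is 288 multiplies per forward, which is noise next to the matmuls.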

The kernels

transformer.c is the usual suspects: rmsnorm, softmax, quantize/dequantize, matmul, RoPE, attention, SwiGLU, sampler. Each one is the textbook version with -ffp-contract=off forced so the order of multiply-add operations matches runq.c on x86. That matters for the test surface (see below).

The matmul today is scalar fp32 — three nested loops, one fp32 multiply-add at a time. On real hardware it gets ~1–2 tok/s. That’s slow enough that a 64-token completion takes about a minute.
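Those three loops are rows, then groups, then an int8 dot product within each group, with the two fp32 scales applied once per group. A simplified sketch of that shape (not runq.c verbatim), where both the weight matrix and the activation vector are q80:

```c
#include <stdint.h>

/* W is d x n in q80: int8 values wq plus one fp32 scale ws per group
 * of 64. The activation x is quantized the same way each forward
 * (xq, xs). Accumulate int8 products in int32, scale per group. */
#define GS 64

void matmul_q80(float *out, const int8_t *xq, const float *xs,
                const int8_t *wq, const float *ws, int n, int d) {
    for (int i = 0; i < d; i++) {            /* each output row */
        float val = 0.0f;
        for (int j = 0; j < n; j += GS) {    /* each group */
            int32_t acc = 0;
            for (int k = 0; k < GS; k++)     /* int8 dot within group */
                acc += (int32_t)xq[j + k] * wq[i * n + j + k];
            val += (float)acc * ws[(i * n + j) / GS] * xs[j / GS];
        }
        out[i] = val;
    }
}
```

The int32 accumulator is the point: the inner loop is integer-only, and the float math happens once per group of 64, not once per element.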

The matmul is also factored as a swappable function pointer. The v1 plan is a VFPU kernel that uses the 4×4 vector ops, which should hit ~5–15 tok/s on the same hardware. (The VFPU is the one piece of PSP hardware that ages well — a vector coprocessor with 128 registers addressable as eight 4×4 matrices, capable of dispatching a 4×4 matrix multiply in a single instruction.) That’s a one-file change to drop in.
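The seam might look like this: one function-pointer type, the scalar kernel bound at startup, a VFPU kernel swapped in later. Names here are illustrative.

```c
/* Matmul behind a swappable function pointer: scalar today, a VFPU
 * kernel later, chosen once rather than branched per call. */
typedef void (*matmul_fn)(float *out, const float *x,
                          const float *w, int n, int d);

static void matmul_scalar(float *out, const float *x,
                          const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float v = 0.0f;
        for (int j = 0; j < n; j++) v += w[i * n + j] * x[j];
        out[i] = v;
    }
}

/* swap in matmul_vfpu here when the v1 kernel lands */
static matmul_fn matmul = matmul_scalar;
```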

The UI

The PSP has a system on-screen keyboard you invoke via sceUtilityOsk*. It returns text as UTF-16LE; you convert it to UTF-8 (BMP only — the PSP’s OSK doesn’t reach into surrogate pairs) and feed it to the BPE tokenizer.
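BMP-only UTF-16LE to UTF-8 is a short loop: each 16-bit code unit becomes one, two, or three bytes. A sketch, assuming (as the OSK guarantees) no surrogate pairs in the input:

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-16LE -> UTF-8, BMP only: 1 byte below U+0080, 2 bytes below
 * U+0800, 3 bytes otherwise. Returns bytes written (NUL-terminated). */
size_t utf16le_to_utf8(const uint16_t *in, size_t n, char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t c = in[i];
        if (c < 0x80) {
            out[o++] = (char)c;
        } else if (c < 0x800) {
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```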

The chat UI is pspDebugScreen — the PSP’s built-in debug font on the framebuffer. Monospace, 8×8 pixels, 60 columns × 34 rows on the 480×272 display. Two-color layout: the prompt at the top, generated tokens streaming below it character by character. When the buffer hits the bottom of the screen the rendering wraps. It’s not pretty, but it’s legible, and every character on the screen is something the model actually emitted.

The demo

Prompt:

Once upon a time, there was a little girl named Layla

Generated (T=0, 64 tokens):

She was three years old and loved to explore. One day, she decided to go on an adventure. She put on her shoes and grabbed her bag. Layla walked outside and saw a big, tall tree.

That output is identical, byte for byte, to what runq.c produces on x86_64 with the same model, the same prompt, the same temperature, and matching FP flags. That equivalence is the test surface: either diff -q state.txt tests/expected.txt comes back clean, or some layer of the inference engine is wrong. It is a much stronger check than the OCR-based tests an earlier Pong build in this same repo shipped with; the PSP's screen is decorative now, and the truth is a text file on the emulated memory stick.

Sideloading to real hardware

Everything above runs under PPSSPP in the test loop. To put it on metal:

PSP/
└── GAME/
    └── PspLlm/
        ├── EBOOT.PBP
        ├── model.bin
        └── tokenizer.bin

Drop the three files onto the memory stick, browse to the game in the XMB, hit X. PSP-2000 or PSP-3000 only — PSP-1000’s 32 MB doesn’t leave heap room — and the PSP needs custom firmware to run unsigned EBOOT.PBP (6.61 PRO-C2 or 6.61 Infinity on a PSP; Adrenaline on a PS Vita). Cold start to OSK is about three seconds. First-token latency depends almost entirely on prompt length; per-token after that is the matmul.

What’s next

Item                      Why                                                            Expected impact
VFPU matmul               scalar fp32 leaves the only good vector unit on the chip idle  ~5–15 tok/s instead of 1–2
Multi-turn KV retention   every prompt today re-fills the KV cache from zero             usable chat instead of one-shot continuations
stories42M                fits at ~21 MB quantized; still inside 24 MB heap              richer outputs, same UI
stories110M               won't fit; needs weight streaming from memory stick            probably not worth the throughput hit

The constraint isn’t compute — the VFPU upgrade pays back an order of magnitude. The constraint is RAM. 64 MB is the budget, and once you’ve paid for the OS, the heap, the KV cache, and the working buffers, you have exactly enough room for a 17 MB model and not a byte more. Sony shipped this hardware to play MP3s and run Wipeout. Every other LLM I’ve used this week ran on a GPU that cost more than the PSP did. This one runs on the PSP.