Teaching a Model to Decompile: From "100% Match!" to a More Interesting Truth

A while back I wrote about finally getting to contribute to a Star Fox Adventures decompilation — my favourite childhood game, decompiled by a community I'd stumbled into through speedrunning. Since then I've been chipping away at the decomp project, and somewhere along the way I got nerd-sniped by a side quest: could I train an AI to do the matching for me?

This is the working log of that experiment, written more or less as it happened — including the parts where I was wrong, the benchmarks that lied to me, and the genuinely surprising thing the model turned out to be useful for.

The idea

The goal of a decomp project is brutal in its precision: you write C source that, when fed through the exact original compiler (Metrowerks CodeWarrior, circa 2002), produces byte-for-byte identical machine code to the retail game. Not "functionally equivalent". Identical. Every register, every instruction ordering, every stack offset.

General-purpose AIs — Claude, GPT — are okay at this. But they get stuck, and worse, they get stuck in a particular way: they throw up their hands and declare a function "impossible to match, it's a compiler artifact". Often that's just not true.

So I had a crazy idea. The project gives us two things nobody else has:

A corpus of thousands of functions already matched to 100% — verified, gold-standard (assembly, C) pairs.
The actual compiler. Which means we can take any C the model writes, compile it, and check — deterministically — whether it reproduces the target.

That second point is the kicker. Most ML problems have fuzzy, expensive labels. Here I had a perfect, free oracle. That's a fine-tuning setup most people would kill for. Could I train a specialist that beats the generalists at this one narrow, verifiable task?
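As a rough illustration of what that oracle amounts to, here is a minimal Python sketch. It is not the project's code: `compile_fn` is a hypothetical stand-in for invoking the original compiler with the unit's flags, and `fake_compile` exists only so the example runs on its own.

```python
from typing import Callable, Optional

# Hypothetical interface: takes C source and returns the compiled function's
# machine code, or None if the compiler rejected the source.
CompileFn = Callable[[str], Optional[bytes]]


def is_exact_match(candidate_c: str, target_code: bytes, compile_fn: CompileFn) -> bool:
    """True only if the candidate compiles to byte-identical machine code."""
    compiled = compile_fn(candidate_c)
    return compiled is not None and compiled == target_code


# Toy stand-in for the real compiler, just to make the sketch runnable.
def fake_compile(source: str) -> Optional[bytes]:
    table = {"int f(void) { return 0; }": bytes.fromhex("386000004e800020")}
    return table.get(source)


# PowerPC encoding of: li r3, 0 ; blr
target = bytes.fromhex("386000004e800020")
```

The point of the design is that the answer is a plain boolean with no judgement involved: a compile failure and a near miss both come back as False.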

I had a beefy laptop — an M5 Max with 48 GB. Let's find out.

Building the moat: data

The model is downstream of the data, so that's where I started. The project already tracks which functions match (a report.json with per-function scores), which object files they live in, and — crucially — the exact compiler flags each translation unit is built with (build.ninja).

Within an afternoon I had a pipeline producing 7,756 training pairs: 5,893 "gold" (100%-matched) and the rest "plausible". Each pair was:

Input: the target assembly + the compiler version + the build flags
Output: the clean C that produces it
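A sketch of what one such pair might look like in code. The field names and the prompt layout are assumptions made up for illustration; the post doesn't specify the actual format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingPair:
    asm: str                # target assembly for one function
    compiler_version: str   # e.g. "GC/2.0"
    unit_flags: List[str]   # flags for the translation unit, from build.ninja
    c_source: str           # the C that produces the assembly (training target)
    quality: str = "gold"   # "gold" (100% matched) or "plausible"


def build_prompt(pair: TrainingPair) -> str:
    """Render the model input; the layout here is hypothetical."""
    return "\n".join([
        f"compiler: {pair.compiler_version}",
        f"flags: {' '.join(pair.unit_flags)}",
        "assembly:",
        pair.asm.strip(),
        "C:",
    ])


pair = TrainingPair(
    asm="li r3, 0\nblr",
    compiler_version="GC/2.0",
    unit_flags=["-O4,p", "-inline auto"],
    c_source="int f(void) { return 0; }",
)
prompt = build_prompt(pair)
```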

Two details turned out to matter enormously, and both nearly slipped past me:

#pragma directives. This project leans hard on per-function compiler pragmas (peephole off, scheduling off, optimisation levels) — the same C produces different assembly depending on them. I modelled the pragma state as a stack and attached the active set to each function.
Unit-level flags. The codegen baseline (-O4,p, -inline auto, -fp_contract on, compiler version GC/2.0 vs 1.2.5n…) is set per translation unit. The same C compiles differently under different baselines, so this had to be part of the input.

The dominant pragma combo in the data? ('scheduling off', 'peephole off'), exactly the project's primary hand-tuning lever. The data already "knew" the playbook.
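Modelling the pragma state as a stack can be sketched like this. It assumes the source uses `#pragma push` / `#pragma pop` plus simple `#pragma name value` settings (with `reset` clearing a setting); anything else is ignored, and the function name is invented for the example.

```python
import re

PRAGMA_RE = re.compile(r"^\s*#pragma\s+(\w+)(?:\s+(\w+))?")


def track_pragmas(lines):
    """Yield (line_number, line, active_pragmas) for each source line."""
    state = {}   # pragma name -> current setting, e.g. {"peephole": "off"}
    saved = []   # stack of states saved by "#pragma push"
    for number, line in enumerate(lines, start=1):
        m = PRAGMA_RE.match(line)
        if m:
            name, value = m.group(1), m.group(2)
            if name == "push":
                saved.append(dict(state))
            elif name == "pop":
                if saved:
                    state = saved.pop()
            elif value == "reset":
                state.pop(name, None)
            elif value is not None:
                state[name] = value
        # Dicts keep insertion order, so the tuple lists pragmas as they appeared.
        yield number, line, tuple(f"{k} {v}" for k, v in state.items())


source = [
    "#pragma push",
    "#pragma scheduling off",
    "#pragma peephole off",
    "void f(void) {}",
    "#pragma pop",
    "void g(void) {}",
]
rows = list(track_pragmas(source))
```

Here `rows[3]` carries the pragmas active for `f`, and `rows[5]` shows `g` compiled with the state restored by the pop.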

First blood: "16 out of 17!"

I fine-tuned Qwen2.5-Coder-7B with LoRA, locally, in a few hours. Loss dropped beautifully. Then I ran the real test — generate C, recompile, check the match.

0 out of 25.

Heart-sink. But the failure was instructive. The model's C was correct — the logic, the struct fields, the control flow. It was just naming the functions wrong (it can't know a symbol name from assembly alone). When I renamed each generated function to its real symbol before compiling:

16 out of 17 exact matches.

Euphoria. The thing worked. I'd built a model that decompiles!

And then — the most important moment of the whole project — I gut-checked the number instead of celebrating it.

The reckoning: that number was a lie

The "16/17" was seductive nonsense, and being honest about why probably saved the entire experiment:

It was a tiny, biased sample — 17 cases, dominated by trivial functions (return 0;, one-line getters).
It needed a manual rename hack I'd applied by hand.
It ran through an evaluation harness I'd later discover was buggy.

The real fix for the naming problem was obvious in hindsight: put the target symbol name in the prompt. In the real workflow you always know which function you're matching — it's the slot you're filling. So the name is legitimate input context, exactly like the compiler flags. I added it and retrained.

That retraining run got killed halfway through, because by then I'd decided to go bigger.

Going bigger, and faster

Four-hour local training runs don't suit an iterate-on-the-data workflow. So I moved to the cloud, specifically Modal, chosen for one reason above all: it tears the GPU down the instant the job finishes. No instance to forget, no overnight bill. I rewrote the training in Unsloth (roughly the CUDA-world counterpart to fine-tuning with MLX on the Mac) and stepped up to Qwen2.5-Coder-14B.

The 14B trained in ~80 minutes on an H100 for about six dollars. Eval loss landed at 0.26, far better than the 7B. I ran the recompile evaluation, and:

28 out of 100.

Hmm. Down from "16/17"? No — up, because this was finally an honest measurement: a representative 100-function sample across all sizes, no rename hack, scored with the recompiler. The two numbers measured completely different things. The lesson: never trust a benchmark you didn't try to break.

The harness was lying too

28/100 with 47 outright compile failures smelled wrong. Reference C — the known-perfect source — only compiled 54/100 through my harness. The harness, not the model, was broken.

The bug: when splicing a candidate function back into its file to compile it, I was gluing it directly against the following #pragma line — a syntax error — for any function adjacent to a pragma. Worse, the same bug had contaminated the training data (stray pragma lines bleeding into function bodies).

I fixed it. Reference C jumped to 99/100 compiling. The model's real score rose to 38/100 exact. Ten matches had been hiding behind a harness bug the whole time.
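The shape of the fix is simple enough to sketch. This is an illustrative reconstruction under the assumption that the file is handled as a list of lines; the function names are made up, not the harness's own.

```python
def splice_function(file_lines, start, end, candidate):
    """Replace file_lines[start:end] with the candidate function.

    file_lines holds one source line per entry, without trailing newlines.
    Splitting the candidate into its own lines means its closing brace can
    never land on the same line as whatever follows it.
    """
    new_lines = candidate.strip("\n").split("\n")
    return file_lines[:start] + new_lines + file_lines[end:]


def render(file_lines):
    """Join lines back into compilable text with a trailing newline."""
    return "\n".join(file_lines) + "\n"


original = [
    "#pragma push",
    "#pragma scheduling off",
    "int f(void) { return 1; }",
    "#pragma pop",
]
patched = render(splice_function(original, 2, 3, "int f(void)\n{\n    return 0;\n}"))
```

The buggy behaviour corresponds to concatenating the candidate text with the next line, which yields `}#pragma pop` on a single line.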

The question that changed the project: "did it kinda match?"

Up to now I'd only measured exact matches — binary, pass/fail. The obvious question I hadn't asked: of the ones that didn't exactly match, how close were they?

The answer reframed everything. Of the functions that compiled, the mean match was 90.3%. A 44-instruction function came back 95.6% correct. The "0%" cases weren't bad attempts — they were compile failures scored as zero by default.

So the real story was: when the model compiles, it's nearly there. It was doing genuine decompilation — getting the algorithm right, just occasionally inventing a struct field name it had never been shown. The bottleneck wasn't reasoning; it was missing type context.
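To make the difference between the two views concrete, here is a small sketch. The scoring is an assumption: it uses Python's `difflib` to count matching instructions, which is not necessarily how the project's tooling computes its percentages.

```python
from difflib import SequenceMatcher
from statistics import mean


def match_percent(target_asm, candidate_asm):
    """Percentage of instructions that line up between two listings."""
    matcher = SequenceMatcher(a=target_asm, b=candidate_asm, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(target_asm), len(candidate_asm), 1)


def summarise(scores):
    """scores: one entry per function; None means it failed to compile."""
    compiled = [s for s in scores if s is not None]
    return {
        "exact": sum(1 for s in compiled if s == 100.0),
        "compile_failures": len(scores) - len(compiled),
        "mean_when_compiled": mean(compiled) if compiled else None,
        "mean_failures_as_zero": mean(s or 0.0 for s in scores) if scores else None,
    }


target = ["lwz r3, 0(r4)", "addi r3, r3, 1", "stw r3, 0(r4)", "blr"]
draft = ["lwz r3, 0(r4)", "addi r3, r3, 2", "stw r3, 0(r4)", "blr"]
one_off = match_percent(target, draft)

summary = summarise([100.0, 90.0, None, 80.0])
```

In the toy `summary`, the same four results average 90 when compile failures are excluded and 67.5 when they are counted as zero, which is the gap the binary metric was hiding.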

The reframe: the model is a draft, not an oracle

This is where the whole framing shifted. One-shotting perfect matches by stuffing huge type definitions into every prompt was fighting the wrong battle. The high-value workflow was different:

The model writes a draft. A coding agent fixes it to 100%.

I tested it on a real 25-instruction function the model had failed to compile (it invented a field glowEnabled that doesn't exist). One look at the compiler error and the target assembly's struct offsets told me the real fields were glowType and enabled. One edit. Byte-identical match.

The model had carried the entire semantic load — the clamp algorithm, the correct fields glowAlpha/glowAlphaStep, the control flow. The fix was a mechanical correction an agent automates trivially. I didn't need the type defs in the prompt; I needed them available to the fixer — which they always are.
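The draft-then-fix workflow can be sketched as a small loop. Everything here is hypothetical scaffolding: `check` stands in for the recompile-and-compare step, `fix` for the agent, and the toy field names are invented for the demo.

```python
from typing import Callable, Optional, Tuple

# check(c) -> (ok, diagnostics): ok is True on a byte-identical match;
# diagnostics carries compiler errors or a diff summary otherwise.
CheckFn = Callable[[str], Tuple[bool, str]]
# fix(c, diagnostics) -> revised C source
FixFn = Callable[[str, str], str]


def draft_then_fix(draft: str, check: CheckFn, fix: FixFn, max_rounds: int = 5) -> Optional[str]:
    """Return matching C, or None if the fixer runs out of rounds."""
    candidate = draft
    for _ in range(max_rounds):
        ok, diagnostics = check(candidate)
        if ok:
            return candidate
        candidate = fix(candidate, diagnostics)
    ok, _ = check(candidate)
    return candidate if ok else None


# Toy check and fixer: the "compiler" rejects an unknown field name.
def toy_check(c: str) -> Tuple[bool, str]:
    if "madeUpField" in c:
        return False, "error: no member named madeUpField"
    return True, ""


def toy_fix(c: str, diagnostics: str) -> str:
    return c.replace("madeUpField", "realField")


result = draft_then_fix("return p->madeUpField;", toy_check, toy_fix)
```

The model only has to supply the first argument; the type definitions live on the fixer's side of the loop.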

Reality check: the hard one, and the cop-out

Then a genuinely hard target: fn_801B3DE4, stuck at 97.07% through a documented history of failed attempts. I stood the fine-tuned 14B up locally — fused the adapter, quantised it to a GGUF, ran it under llama.cpp on the laptop's GPU — and asked it for a draft.

It degenerated. Faced with a big, complex function, it spiralled into declaring f32 v2; v3; … v49;. Useless. (Consistent with the eval: strong on small functions, shaky on large ones.)

So I handed the problem to a capable coding sub-agent with all the tooling and the playbook. It worked for 16 minutes, tried the full battery of tricks, and concluded: "a genuine #108-class register-allocation cap — not a C problem, an MWCC artifact. 97% is the honest ceiling."

The same verdict every AI reaches. And I wasn't having it. Every AI keeps claiming this isn't a C problem, it's an MWCC issue — but in fact we just don't know the method to crack it. I'm certain it's crackable with the right C. We just have to find it.

The actual payoff: the corpus as an idiom oracle

Here's where the model's real value finally crystallised — and it wasn't as a function-writer at all.

The thing blocking fn_801B3DE4 was a specific assembly shape: the compiler keeps a duplicate copy of a pointer in a second register for one hot field. Every agent had declared this impossible to produce from C ("the coalescer always merges copies"). So instead of asking the model, I queried the data it was trained on: search all 7,600 matched functions for that exact shape.

Twenty real functions produce it. From plain C. The "impossible" claim was empirically, demonstrably false.

And reading their source revealed the precise idiom the earlier attempts had missed: a second plain-int variable holding the base, accessed via manual offset (*(int*)(p + K)) in a confined region — not the struct-deref copy (p->field) everyone had tried, which genuinely does coalesce back.

That's the model's true worth. Not generating functions — it degenerates on the hard ones, and the easy ones don't need it. Its value is the 7,600 real input→output examples it encodes, queryable as "what C produces this assembly shape?" An idiom oracle for a compiler whose source manual is long lost. The generalist AIs don't have that. They reason from first principles and talk themselves into "it's a compiler artifact". The corpus just shows you the answer exists.
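A query of that kind might look something like the sketch below. It is a deliberately crude heuristic written for illustration, not the search that was actually run: it flags any function where a register is copied with `mr` and both the copy and the original are later used as base registers, and it ignores whether either register is redefined in between.

```python
import re

MR_RE = re.compile(r"^\s*mr\s+(r\d+),\s*(r\d+)")
BASE_RE = re.compile(r"\((r\d+)\)")  # base register in d(rA) addressing


def has_duplicate_base(asm_lines):
    """True if a copied register and its source are both used as bases later."""
    for i, line in enumerate(asm_lines):
        m = MR_RE.match(line)
        if not m:
            continue
        copy, original = m.groups()
        bases = set()
        for later in asm_lines[i + 1:]:
            bases.update(BASE_RE.findall(later))
        if copy in bases and original in bases:
            return True
    return False


def search_corpus(corpus):
    """corpus: mapping of function name -> list of assembly lines."""
    return sorted(name for name, asm in corpus.items() if has_duplicate_base(asm))


# Hypothetical two-function corpus to exercise the search.
corpus = {
    "fn_a": ["mr r31, r3", "lwz r0, 0x10(r31)", "stw r0, 0x14(r3)", "blr"],
    "fn_b": ["mr r31, r3", "lwz r0, 0x10(r31)", "blr"],
}
hits = search_corpus(corpus)
```

Each hit comes paired with its matched C in the corpus, so the next step is simply reading the source of the functions that turn up.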

So I sent an agent back at fn_801B3DE4 armed with that corpus intelligence and a hard ban on the "it's a cap" cop-out. It still didn't land the function — but it failed much better. Mining the twenty examples more carefully, it found why the idiom doesn't transfer: every clean producer builds its second register from a varying value (a loop-walked pointer, or a snapshot of something later reassigned), so the two registers genuinely hold different values and can't be merged. Our flame base is loop-invariant — same value everywhere — so the compiler correctly coalesces it. The barrier is mechanistic and specific, not a shrug.

And that points somewhere concrete: if the compiler only produces that shape from a varying value, the original source probably had a structure we haven't reconstructed — the base computed through a small loop, or a genuine reassignment. We didn't crack the function. We replaced "it's an uncrackable compiler artifact" with a falsifiable hypothesis about what the original C looked like. That's progress of a kind the generalist AIs never make — they stop at the cop-out.

The final try: just throw a bigger brain at it

One honest question remained: would a bigger model break through where the 14B degenerated? I'd been disciplined about cost the whole way, so I did it properly. I trained Qwen2.5-Coder-32B (more than double the parameters) for two epochs on the cleaned data, on a rented H100. It took about four hours and ~$18, with eval loss down to 0.195 (the 14B had been 0.26). Then I skipped on-GPU generation entirely and ran all the inference locally on the laptop afterward, so the meter only ran during training.

The verdict was refreshingly unambiguous, and it cut both ways.

Where the model already worked, bigger helped — measurably. Exact matches rose 38 → 43 / 100. "Nearly perfect" (≥90%) drafts jumped 41 → 52 / 100. Mean quality-when-it-compiles climbed 90.3% → 95.6%. And the degeneration cliff moved: functions of 74–150 instructions that the 14B turned into garbage now came back as coherent 90–95% drafts — squarely in "an agent can finish this" territory.

Where it didn't work, bigger changed nothing. I pointed it straight at fn_801B3DE4 and it produced 175 consecutive f32 t1; t2; … declarations until it hit the token limit. No body. No closing brace. Identical collapse to the 14B, just longer. Zero percent — not "wrong field name, one edit away", but not a function at all. There was literally nothing to fix.

So the experiment answered the question it set out to: the large-function collapse is not a capacity problem. More than double the parameters, a quarter less eval loss (0.26 → 0.195), cleaner data: the metrics that were already fine got better, and the wall didn't move an inch. The model has a sharply-defined competence envelope:

Function size	What the model produces
small / medium	coherent — wrong field at worst, fixable to 100% in an edit or two
large (~74–150 instr)	coherent but imperfect — 90–95% drafts, an agent finishes them
very large / hard (180+ instr)	total degeneration — unsalvageable at any size I tried

fn_801B3DE4 sits in the worst box on two independent axes: too big for the model to stay coherent, and gated by a register-allocation subtlety even if the C were perfect. The single least-suited target for the whole approach — which, oddly, makes it the perfect note to end on. The tool is real. Its edges are real too.

What I actually learned

The verifiable oracle is everything. A compiler that says yes/no for free turns vibes into measurement, and measurement into rejection-sampling and RL down the line.
Distrust your best numbers. "16/17" was the most dangerous moment of the project. Every impressive result hid a bug or a biased sample. The honest 38/100 was worth more than the dishonest 94%.
Binary metrics undersell. "38% exact" and "90% average when it compiles" are the same model. One says "mediocre", the other says "nearly there, fix the compile". Partial credit changed every decision.
The model's job wasn't what I thought. Not an autonomous decompiler. A draft generator for a human/agent fixer, and — the real surprise — an idiom oracle over its own training corpus.
"It's a compiler artifact" is usually surrender, not diagnosis. The most valuable thing the whole exercise produced wasn't a model. It was a way to prove that an "impossible" match is possible — by finding the twenty places it already happened — and, when I still couldn't land it, to replace the cop-out with a falsifiable hypothesis (invariant vs. varying value) about the original source.
Scale helps where you're already winning, not where you're stuck. Going from 14B to 32B lifted every "draft" metric and pushed the degeneration cliff out to bigger functions — real, paid-for progress. It did nothing for the hardest function. "Throw a bigger model at it" is a fine thing to test, as long as you treat the answer as data, not a foregone conclusion.
Cost discipline is a feature. Per-second cloud GPUs that tear down on completion, and moving every bit of inference off the meter and onto the laptop, meant the whole "bigger and bolder" final try cost less than a takeaway dinner. You can afford to run the experiment that settles the question.
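The rejection-sampling mentioned in the first lesson could look something like this: sample several candidates per function, keep only the ones the compiler verifies, and use the survivors as new training pairs. This is a sketch of the idea only, not something from the experiment, and `is_match` is a hypothetical wrapper around the recompile check.

```python
from typing import Callable, Iterable, List


def rejection_sample(candidates: Iterable[str], is_match: Callable[[str], bool]) -> List[str]:
    """Keep only the distinct candidates the oracle accepts."""
    kept: List[str] = []
    seen = set()
    for candidate in candidates:
        if candidate in seen:
            continue  # don't pay for recompiling an identical sample
        seen.add(candidate)
        if is_match(candidate):
            kept.append(candidate)
    return kept


# Toy run: pretend candidates "a" and "c" compile to the target.
samples = ["a", "b", "a", "c"]
accepted = rejection_sample(samples, lambda c: c in {"a", "c"})
```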

Where this goes next (and why it's not going there yet)

Here's the honest conclusion: I now genuinely believe an LLM can be trained to solve this — to decompile assembly into matching C. It doesn't feel like a fantasy anymore; it feels reachable. The competence envelope is real, but every lever I pulled moved it in the right direction. If I were to keep going, the obvious avenues are:

More and better training data. The single highest-leverage input. The corpus is good, but it's one game; a broader, cleaner set of (assembly, C) pairs is probably worth more than any architecture change.
Train for much longer. Every run here finished in roughly four hours or less, on a budget. There's clearly more to extract.
Go bigger again. 14B → 32B helped measurably. The trend hasn't flattened.
Stop fine-tuning a general code model and train one for this from scratch — or at least fine-tune something other than Qwen. A model that only ever has to think about this one task might not degenerate the way these did.

But I'm going to be honest with myself about priorities. My actual goal is to decompile Star Fox Adventures — and at this point the model is a distraction from that, not an accelerant. It's a fascinating side quest that hasn't yet paid back the time I've sunk into it, and chasing it further would mean less progress on the thing I actually care about. So I'm parking it here, with the satisfaction of knowing the crazy idea basically worked — just not for the reason I expected, only after I stopped believing my own best results, and only once I learned exactly where the tool's edges are, including the one place it will never reach.

If someone reading this wants to pick up the thread — more data, longer runs, a purpose-built model — I'd love to see where it goes.