Can LLMs Do Matching Decompilation? I Tested 60 Functions to Find Out

hackernews | 🔬 Research
#benchmark #claude #decompilation #llm #retro-game #review #test
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

The author benchmarks LLM-assisted matching decompilation on 60 functions drawn from two real, in-progress retro game decompilation projects. A purpose-built, plugin-based pipeline runner called Mizuchi chains the community's programmatic tools (m2c, a compiler, objdiff, decomp-permuter) with a Claude Agent SDK session, falling back to the LLM only when the programmatic phase fails to produce a perfect match. Across the 60 functions, the pipeline matched 74% of them, and 88% produced the same outcome on every run.

Full Text

Can LLMs Really Do Matching Decompilation? I Tested 60 Functions to Find Out

Benchmarking LLMs on 60 functions from two real retro game decompilation projects.

In the previous chapter, we dove deep into the Kappa implementation, our decompiler buddy who lives in VS Code. It adds AI-powered features for decompilation and automates the boring flows of decompiling a single function, as we did manually in the second chapter. But now we face the question: are LLMs really good at matching decompilation?

⚙️ What is Matching Decompilation?

Matching decompilation is the art of converting assembly back into C source code that, when compiled, produces byte-for-byte identical machine code. It's popular in the retro gaming community for recreating the source code of classic games. For example, Super Mario 64 and The Legend of Zelda: Ocarina of Time have been fully match-decompiled.

Decompilation has been around for a long time. Many retro games were decompiled before AI became a thing, and for years the community has built solid tools to help decompile functions. But they all share one trait: they are procedural. Beyond that, some people reasonably don't believe that LLMs are good at decompilation, and I've received some fair criticism about the lack of data. So, to know whether LLMs are good at decompilation or not, we need to measure their output on real, in-progress decompilation projects.

It's worth mentioning that LLMs don't displace the great programmatic tooling the community has built over the years; the two join forces. So the goal is to measure the outcomes of matching decompilation in a pipeline that uses both programmatic and AI-powered tools. The pipeline can start with programmatic tools to decompile a function and then use an LLM to try matching it. More importantly, having a solid way to benchmark decompilation tasks is a prerequisite for improving both the AI-powered and the programmatic decompilation tools with evidence-based decisions. As I described briefly in the previous chapter, it was sometimes difficult to know whether a given change to the prompt improved it or actually harmed it.

It's also worth mentioning Chris Lewis' work. A few months ago, he started a decompilation project for Snowboard Kids 2 (SK2), and he's been sharing his progress and insights on his blog. During his journey, he's been building an agentic flow to automatically decompile the game, and he shared in the second chapter how much it sped up his work. Although that's an engaging finding and evidence that LLMs are really helpful, there's no benchmark data comparing the same dataset of functions. This gap is what we are going to fill in this chapter!

Spoiler: across 60 functions from two real decomp projects, the pipeline matches 74% of the functions, and 88% of them produce the same outcome every time!

Mizuchi enters the scene

So, the next goal is to produce benchmarks. Initially, I spent some time trying to find an off-the-shelf tool, but in the end I decided to build a tailored pipeline runner for decompilation that outputs benchmarking data. I named it Mizuchi, to stay on-brand with Kappa (both the kappa and the mizuchi are mythological creatures in Japanese folklore). I also picked a deliberately open-ended name: I don't know how Mizuchi will grow up, so I avoided something too tied to the current goal. There is a chance this tooling will become useful for more cases than matching decompilation alone (I'll talk more about that later!).
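Before diving into the pipeline itself, a quick aside to make the "byte-for-byte" criterion concrete. The sketch below shows the minimal compile-and-compare check that every decompilation attempt is ultimately judged by. The cross-toolchain names, flags, and the raw objcopy comparison are illustrative assumptions of mine, not the article's tooling: real projects compile with the game's original toolchain and use objdiff for section-aware comparison and scoring.

```python
import subprocess
from pathlib import Path


def is_matching(candidate_c: Path, target_obj: Path) -> bool:
    """Compile a candidate C file and byte-compare its .text section
    against the target object file extracted from the game.

    Simplified stand-in for the Compiler + Objdiff steps described below.
    """
    out_obj = candidate_c.with_suffix(".o")
    # Compiler name and flags are hypothetical; a real N64 project would
    # use its original toolchain and exact flags.
    subprocess.run(
        ["mips-linux-gnu-gcc", "-c", "-O2", str(candidate_c), "-o", str(out_obj)],
        check=True,
    )

    def text_bytes(obj: Path) -> bytes:
        raw = obj.with_suffix(".bin")
        # objcopy extracts the raw machine code of the .text section.
        subprocess.run(
            ["mips-linux-gnu-objcopy", "-O", "binary",
             "--only-section=.text", str(obj), str(raw)],
            check=True,
        )
        return raw.read_bytes()

    # "Matching" means byte-for-byte identical machine code.
    return text_bytes(out_obj) == text_bytes(target_obj)
```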
Mizuchi runs the following plugin-based pipeline. These are the current plugins available:

- Get Context: Runs a shell script to gather the context for a given function, likely using the m2ctx.py script. It writes a C file with the structs, macros, and other definitions that get prepended to the C code attempting to decompile the function.
- m2c: Calls m2c (a machine-code-to-C decompiler) to programmatically produce C code from the target assembly function.
- Compiler: Compiles the generated C.
- Objdiff: Compares the compiled object file against the target one, using objdiff.
- Permuter: Calls decomp-permuter to programmatically permute the C code in search of a better match rate.
- Claude Runner: The most complex plugin. It runs a Claude Agent SDK session to decompile the target function.

Before starting the pipeline, we have the Tasks Loader. It reads a folder that must contain the following structure to load the tasks we are going to run:

```
tasks/
  my-function-1/
    prompt.md       # The prompt content
    settings.yaml   # Metadata (functionName, targetObjectPath, asm)
  my-function-2/
    prompt.md
    settings.yaml
  ...
```

Then we have the Plugin Manager, which runs the plugins in three phases:

- Setup Phase: Just calls the Get Context plugin.
- Programmatic Phase: A one-way flow through the programmatic tools: m2c → Compiler → Objdiff → Decomp Permuter. m2c produces a first attempt at decompiling the function, then the Compiler and Objdiff run. It may then call the Permuter, which iterates through code permutations. If there is no perfect match, the pipeline moves on to the AI-Powered phase (see the sketch after this list).
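Mizuchi's source isn't shown in the excerpt, so the following is a hypothetical sketch of how the Tasks Loader and the three-phase Plugin Manager described above could fit together. The `Task` dataclass, the plugin call signatures, and the `is_perfect` flag are all illustrative assumptions, not Mizuchi's actual API:

```python
from dataclasses import dataclass
from pathlib import Path

import yaml  # pip install pyyaml


@dataclass
class Task:
    name: str
    prompt: str     # contents of prompt.md
    settings: dict  # functionName, targetObjectPath, asm (from settings.yaml)


def load_tasks(root: Path) -> list[Task]:
    """Tasks Loader: each subfolder of tasks/ holds prompt.md + settings.yaml."""
    return [
        Task(
            name=folder.name,
            prompt=(folder / "prompt.md").read_text(),
            settings=yaml.safe_load((folder / "settings.yaml").read_text()),
        )
        for folder in sorted(p for p in root.iterdir() if p.is_dir())
    ]


def run_task(task: Task, plugins: dict) -> str:
    """Plugin Manager: Setup -> Programmatic -> AI-Powered, stopping at a match."""
    target = task.settings["targetObjectPath"]

    # Setup Phase: just the Get Context plugin.
    ctx = plugins["get_context"](task)

    # Programmatic Phase: one-way m2c -> Compiler -> Objdiff (-> Permuter).
    code = plugins["m2c"](task, ctx)
    if plugins["objdiff"](plugins["compiler"](code), target).is_perfect:
        return "matched (m2c)"
    code = plugins["permuter"](code, task)
    if plugins["objdiff"](plugins["compiler"](code), target).is_perfect:
        return "matched (permuter)"

    # AI-Powered Phase: a Claude Agent SDK session takes over.
    return plugins["claude_runner"](task, ctx, code)
```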

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
