Agentic AI Code Review: From Confident Errors to Evidence-Based Review

hackernews | 🔬 Research
#agentic ai #ai #code review #debugging #pr review #ai errors #review #false confidence
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

The early AI code review pipeline predicted and supplied context up front, which left the model unable to trace function call chains to the end and caused it to generate confident false claims. To fix this, the system was switched to an agentic loop, giving the model tools to search the code, gather evidence, and fetch the files it needs on its own. In addition, a "terminal tool" that submits structured data instead of Markdown output was made mandatory, which pushed the model to drop unsupported claims and return only substantiated review findings, greatly improving reliability.

Body

Archbot flagged a "blocker" on a PR. It cited the diff, built a plausible chain of reasoning, and suggested a fix. It was completely wrong. Not "LLMs are sometimes wrong" wrong; more like convincing enough that a senior engineer spent 20 minutes disproving it. The missing detail wasn't subtle. It was a guard clause sitting in a helper two files away. Archbot just didn't have that file.

That failure mode wasn't a prompt problem. It was a context problem. So I stopped trying to predict what context the model would need up front, and switched to an agentic loop: give the model tools to fetch evidence as it goes, and require it to end with a structured "submit review" action. This post is the architectural why and how (and the reliability plumbing that made it work). There are good hosted AI review tools now. This post is about the pattern underneath them.

[ NOTE ] Names, repos, and examples are intentionally generalized. This is about the design patterns, not a particular company. If you want the backstory on what Archbot is and why I built it, start with the original post: Building Archbot: AI Code Review for GitHub Enterprise.

The fixed pipeline (and where it breaks)

My original design was what a lot of "LLM workflow" systems converge to:

- Give the model the PR diff.
- Give it some representation of the repo.
- Ask it to pick the files that matter.
- Feed only those files back in for the real review.
- Optionally run a second pass to critique the first pass.

That critique pass is helpful for catching obvious nonsense and reducing overconfident tone. But it can't solve the core issue: if both passes are reasoning over the same clipped context, they're both still guessing.

It looks like this:

    flowchart TB
        A[PR opened/updated] --> B[Phase 1: select_files]
        B --> C["Build focused context (diff + selected files)"]
        C --> D[Phase 2a: perform_code_review]
        D --> E[Phase 2b: critique_review]
        E --> F[Submit review]

On paper, it's elegant:

- The model is great at prioritizing.
- Keeping context small should reduce hallucinations.
- A critique pass should catch the worst bad takes.

In practice, it fails in a very specific way.

The core failure: you can't pre-select the missing piece

Code review isn't "read these files". It's "follow the chain of evidence until you understand the behavior". And chains are not predictable. You start in handler.go, notice it calls validate(), jump to validate.go, see it wraps a client, jump to client.go, and only then realize the bug is a timeout default in config/defaults.go.

The fixed pipeline made that exploration impossible. Once Phase 1 picked the "important" files, Phase 2 was stuck. If the model realized it needed one more file to confirm a claim, it had no way to fetch it. That led to two bad outcomes:

- False positives with confidence. The model would infer behavior from a partial call chain and present it as fact.
- Missed risk. The model would never see the one file that made a change dangerous, because that file didn't look important from the diff.

I could tune prompts. I could add more heuristics to file selection. I could increase the file budget. But that's all the same bet: that I can guess the right context ahead of time. That bet is wrong more often than it feels.

The shift: stop guessing context, let the model fetch it

The agentic version flips the contract:

- The system does not attempt to build a perfect context.
- It gives the model a toolset for finding evidence.
- It loops until the model either submits a review (terminal tool) or hits a budget/timeout.

Instead of a fixed pipeline, it's an exploration loop:

    sequenceDiagram
        participant M as Model
        participant T as Tools
        participant L as Loop
        L->>M: System prompt + initial task
        M->>T: get_pr_info
        T-->>M: metadata
        M->>T: get_pr_diff
        T-->>M: diff
        M->>T: search_code ("newFunction" usage)
        T-->>M: matches + file:line
        M->>T: get_file_content (the call site)
        T-->>M: file text
        M->>T: get_file_at_base (before version)
        T-->>M: baseline text
        M->>L: submit_code_review (structured)
        L-->>M: loop ends

You can summarize the new rule as: "If you can't cite it, go fetch it."

Under the hood, this ended up being three pieces (sketched below):

- A loop that alternates between "model turn" and "tool turn", with budgets and context hygiene.
- A tool interface that can mark certain tools as terminal.
- Different toolsets per mode (full review vs chat vs command), reusing the same loop.
The biggest reliability improvement wasn't "more tools". It was making the model end by calling a terminal tool.

In a fixed prompt pipeline, the model ends by printing Markdown. That makes it tempting to optimize for sounding right. In an agentic loop, the model ends by submitting an object. That changes the incentives:

- It's harder to hand-wave; you have to populate fields.
- You can enforce structure (severity, inline comments, evidence links).
- The loop can treat "no submission" as failure.

So what does the terminal tool actually look like? And how do you wire all this together?
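As a rough illustration of what "ending by submitting an object" can look like, the sketch below shows one possible JSON-schema-style definition for a submit_code_review terminal tool. The field names (verdict, severity, evidence, and so on) are assumptions for illustration, not Archbot's real schema.

```python
# Hypothetical shape of a terminal "submit_code_review" tool: the model must fill
# these fields instead of printing free-form Markdown. Field names are illustrative.
SUBMIT_CODE_REVIEW_SCHEMA = {
    "name": "submit_code_review",
    "parameters": {
        "type": "object",
        "required": ["verdict", "summary", "comments"],
        "properties": {
            "verdict": {"enum": ["approve", "comment", "request_changes"]},
            "summary": {"type": "string"},
            "comments": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": ["path", "line", "severity", "body", "evidence"],
                    "properties": {
                        "path": {"type": "string"},      # file the comment attaches to
                        "line": {"type": "integer"},
                        "severity": {"enum": ["blocker", "warning", "nit"]},
                        "body": {"type": "string"},
                        # file:line references the model actually fetched during the loop;
                        # an unsupported claim has nothing to put here.
                        "evidence": {"type": "array", "items": {"type": "string"}},
                    },
                },
            },
        },
    },
}
```

Tying each comment to an evidence list is one way to operationalize "if you can't cite it, go fetch it": a claim the model can't back with a fetched file:line has nowhere to go in the payload.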

This analysis was written by the Genesis Park editorial team with the help of AI. The original article can be found via the source link.
