I hired AI to write the tests. Of course they pass

hackernews · 🔬 Research
#ai #claude #review #agents #automation #codegen #testing
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Software developers are increasingly using AI tools to generate automated unit tests, but these evaluations often pass because the code and test logic stem from the same statistical model. Since large language models tend to repeat the patterns found in their training data, they frequently fail to catch novel bugs or edge cases, creating a false sense of security. Relying on AI for quality assurance can be dangerous, as it validates code against its own flawed logic rather than against actual user requirements.

Full text

I've been building agents that write code while I sleep. Tools like Gastown run for hours without me watching. Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.

I've run Claude Code workshops for over 100 engineers in the last six months. Same problem everywhere, just at different scales. Teams using Claude for everyday PRs are merging 40-50 a week instead of 10. Teams are spending a lot more time in code review. As systems get more autonomous, the problem compounds. At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

So the question I kept coming back to: what do you actually trust when you can't review everything?

The obvious answers don't work

You could hire more reviewers. But you can't hire fast enough. And making senior engineers read AI-generated code all day isn't worth it.

When Claude writes tests for code Claude just wrote, it's checking its own work. The tests prove the code does what Claude thought you wanted. Not what you actually wanted. They catch regressions but not the original misunderstanding. When you use the same AI for both, you've built a self-congratulation machine.

This is exactly the problem code review was supposed to solve: a second set of eyes that wasn't the original author. But one AI writing and another AI checking isn't a fresh set of eyes. They come from the same place. They'll miss the same things.

The thing TDD got right

Write the test first, write the code second, stop when the test passes. Most teams don't do this because thinking through what the code should do before writing it takes time they don't have. AI removes that excuse, because Claude handles the speed. The slow part is now figuring out if the code is right.
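The "write down what correct looks like, then check it" step can be sketched in a few lines of shell: pin the expected behavior down as assertions before any implementation exists. The endpoint, status code, and error message below are hypothetical stand-ins, not details from the author's project.

```shell
# Hypothetical spec-first check: expected behavior is written down as
# assertions before any code exists. All concrete values are illustrative.

check() {  # usage: check <criterion> <expected> <actual>
  if [ "$2" = "$3" ]; then
    echo "PASS $1"
  else
    echo "FAIL $1: expected '$2', got '$3'"
  fi
}

# With a dev server running you would capture real values, e.g.:
#   status=$(curl -s -o body.txt -w '%{http_code}' -X POST \
#     http://localhost:3000/login -d 'email=a@b.com&password=wrong')
# Here literal values are passed so the sketch runs standalone.
check "AC-2 status"  "401" "401"
check "AC-2 message" "Invalid email or password" "Invalid email or password"
```

The point is the ordering, not the mechanics: the `check` calls exist before the feature does, so "done" means "these print PASS," not "the diff looked fine."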
That's what TDD was built for: write down what correct looks like, then check it. TDD asks you to write unit tests, which means thinking about how the code will work before you write it. This is easier: write down what the feature should do in plain English, and let the machine figure out how to check it. "Users can authenticate with email and password. On wrong credentials they see 'Invalid email or password.' On success they land on /dashboard. The session token expires after 24 hours." You can write that before you open a code editor. The agent builds it. Something else checks it.

P.S. I write about Claude Code internals every week. Last week I wrote about how Claude Code is a while loop with 23 tools. Subscribe to get the next one!

What this looks like in practice

For frontend changes, we generated acceptance criteria based on the spec file:

```markdown
# Task

Add email/password login.

## Acceptance Criteria

### AC-1: Successful login
- User at /login with valid credentials gets redirected to /dashboard
- Session cookie is set

### AC-2: Wrong password error
- User sees exactly "Invalid email or password"
- User stays on /login

### AC-3: Empty field validation
- Submit disabled when either field is empty, or inline error on empty submit

### AC-4: Rate limiting
- After 5 failed attempts, login blocked for 60 seconds
- User sees a message with the wait time
```

Each criterion is specific enough that it either passes or fails. Once the agent builds the feature, verification runs Playwright browser agents against each AC, takes screenshots, and produces a report with per-criterion verdicts. If something fails, you see exactly which criterion failed and what the browser saw.

For backend changes the same pattern works without a browser. You specify observable API behavior (status codes, response headers, error messages) that curl commands can check.

One thing worth being honest about: this doesn't catch spec misunderstandings.
If your spec was wrong to begin with, the checks will pass even when the feature is wrong. What Playwright does catch is integration failures, rendering bugs, and behavior that works in theory but breaks in a real browser. That's a narrower claim than "verified correct," but it's more than code review was reliably catching anyway.

The workflow: write acceptance criteria before you prompt, let the agent build against them, run verification, and review only the failures. You review failures instead of diffs.

How to build it

I started building a Claude Skill (github.com/opslane/verify) that runs using claude -p (Claude Code's headless mode) plus Playwright MCP. No custom backend, no extra API keys beyond your existing Claude OAuth token. Four stages:

Pre-flight is pure bash, no LLM. Is the dev server running? Is the auth session valid? Does a spec file exist? Fail fast before spending any tokens.

The planner is one Opus call. It reads your spec and the files you changed. It figures out what each check needs and how to run it. It also reads your code to find the right selectors, so it's not guessing at class names.

Browser
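The pre-flight stage described above ("pure bash, no LLM") can be sketched as a pair of cheap checks that fail fast before any tokens are spent. Function names, the URL, and the spec path are illustrative assumptions, not taken from the verify repo.

```shell
# Hypothetical pre-flight checks: cheap environment validation before
# invoking any model. Names and paths are illustrative stand-ins.

preflight_spec_exists() {
  # No spec file means nothing to verify against -- bail immediately.
  if [ -f "$1" ]; then
    echo "pre-flight ok: spec found at $1"
  else
    echo "pre-flight FAIL: spec file not found: $1"
    return 1
  fi
}

preflight_server_up() {
  # Any successful HTTP response within 5s counts as "up"; -f rejects 4xx/5xx.
  if curl -sf --max-time 5 -o /dev/null "$1"; then
    echo "pre-flight ok: dev server reachable at $1"
  else
    echo "pre-flight FAIL: dev server not reachable at $1"
    return 1
  fi
}

spec=$(mktemp)                       # stand-in for a real spec file
preflight_spec_exists "$spec"
preflight_spec_exists "/no/such/spec.md" || true
rm -f "$spec"
# preflight_server_up "http://localhost:3000"   # run with a dev server up
```

Because these checks return nonzero on failure, a driver script can chain them with `&&` and skip the expensive planner call when the environment isn't ready.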

This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
