Show HN: A markdown file that turns AI agents into autonomous researchers
hackernews
📦 Open source
#ai agents
#claude
#claude code
#machine learning
#machine learning/research
#automated experiments
#autonomous research
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
A file called 'researcher.md' has been released that turns AI coding agents into autonomous researchers. Loaded into an agent such as Claude Code or Cursor, it has the agent run 30+ experiments overnight, testing hypotheses automatically: code that works is kept and failed attempts are discarded, for example cutting API p99 latency from 200ms to 80ms. Beyond machine learning, the tool automatically creates branches and manages accumulated findings across domains such as API performance and prompt engineering.
Body
One file. Your AI coding agent becomes a scientist. Drop researcher.md into Claude Code, Codex, or any agent. It will design experiments, test hypotheses, discard what fails, keep what works — 30+ experiments overnight while you sleep.

Branch: research/graph-protocol-optimization · Parent: #b1 · Type: real
Hypothesis: Agents read architectural rules but treat them as optional. Separating the instruction into a READ phase ("load constraints first") and a WRITE phase ("now implement") with a guard ("if you haven't done READ, stop") should improve compliance.
Changes: restructured agent rules into explicit READ/WRITE phases, added structural guard
Result: 7.04/10 (was 1.82 baseline, 5.91 best) — new best
Status: keep
Insight: Every attempt to add verification checklists regressed. What worked was changing the structure, not adding steps. Agents respond to framing, not policing.

- b0: baseline (no special instructions): 1.82/10. keep.
- b1: reframe rules as "constraints, not suggestions": 5.91. keep.
- b2: exhaustive checklist: regression. discard.
- b3: lightweight checkpoint: regression. discard.
- b4: READ/WRITE separation + structural guard: 7.04. keep.
- b5: contractual "implement or document exception": regression. discard.
- b6: JIT re-reading: 5.23, evaluator disagreement. interesting.
- b7: mandatory pattern-triggered re-reading: 1.4. regression below baseline. discard.

Real experiment from optimizing Yggdrasil agent rules. The skill works on any codebase. Same loop, different problems:

- npm run build takes 40s → agent gets it to 18s
- prompt returns wrong format 30% of the time → agent gets it to 3%
- API p99 is 200ms → agent finds the bottleneck and cuts it to 80ms
- document parser misses edge cases → agent improves match rate from 74% to 91%

The agent interviews you about what to optimize, sets up a lab on a git branch, and works autonomously. Thinks, tests, reflects. Commits before every experiment, reverts on failure, logs everything. It detects when it's stuck and changes strategy. Forks branches to explore different approaches. Keeps going until you stop it or it hits a target. Resume where you left off across sessions.

Generalizes autoresearch beyond ML. Works on any problem where you can measure a result — code, configs, prompts, documents. All experiment history lives in an untracked .lab/ directory. Git manages code. .lab/ manages knowledge.

Want the full walkthrough? Read the guide. It walks through a complete example from start to finish.

How is this different from autoresearch? Autoresearch's core loop is universal, but the repo is wired to train.py, val_bpb, and GPU training. To use it on something else you'd rewrite the setup. This gives you that loop ready to go for any codebase.

When would I use this instead of ML? It's not instead of ML. ML is one possible domain. This works on anything where the agent can try things, measure, and iterate. Code, scripts, documents, configs. Slow builds, flaky tests, API latency, prompt accuracy.

How does it measure success for non-ML code? Whatever you can measure. Test pass rate, benchmark output, type check errors, build time. You set it up in the discovery phase. The agent asks what to measure and how. If you can run a command and get a number, that's your metric. For cases where there's no command to run, the agent scores against a qualitative rubric you define together.
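The "run a command and get a number" idea is easy to picture in code. Below is a minimal sketch, not part of researcher.md: the measure_metric helper, the default regex, and the bench.sh example are all hypothetical.

```python
import re
import subprocess

def measure_metric(command: str, pattern: str = r"[-+]?\d+(?:\.\d+)?") -> float:
    """Run a shell command and parse the first number its output prints.

    Any command that emits a measurable value can serve as the metric:
    a benchmark, a test runner's pass rate, a build timer, an error count.
    """
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    match = re.search(pattern, result.stdout)
    if match is None:
        raise ValueError(f"no numeric metric in output of: {command}")
    # If the pattern captures a group, prefer it; otherwise the whole match.
    return float(match.group(1) if match.groups() else match.group(0))

# Hypothetical usage: a bench script that prints a line like "p99: 89.0 ms".
# p99_ms = measure_metric("./bench.sh", r"p99:\s*([\d.]+)")
```

The agent would then minimize or maximize this number across experiments, depending on the goal agreed in the discovery phase.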
How does convergence detection work? The agent checks a table of signals after every experiment. If it sees 5+ failures in a row, a metric plateau, or the same area modified too many times, it knows to change approach. Some signals are advisory (consider pivoting), others are hard guardrails (you must pivot). Details in the guide; a toy sketch of such a signal check appears at the end of this post.

Can it improve itself? Sort of. The skill was optimized using the skill itself. A research document about how LLMs process instructions (attention decay, primacy/recency, instruction budgets) was used as criteria, and the agent ran the loop against its own prompt. Not fully recursive, but the loop was: research → skill → use skill to improve skill.

Can't I just ask Claude to build this from the autoresearch repo? You can try. This saves you the work and includes things autoresearch doesn't have: thought experiments, non-linear branching, convergence detection, qualitative metrics, and session resume.

MIT license.

Yggdrasil — the agent experiments on your code. But does it understand what it's working on? Semantic memory for repositories.
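For readers who want the convergence idea above in concrete form, here is a toy sketch of a signal check run after each experiment. The class, the thresholds, and the split between advisory and hard signals are assumptions for illustration; the real table lives in researcher.md and its guide.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ConvergenceMonitor:
    """Toy stand-in for the 'table of signals' checked after every experiment."""
    metrics: list[float] = field(default_factory=list)   # score per experiment
    statuses: list[str] = field(default_factory=list)    # "keep" / "discard"
    touched: Counter = field(default_factory=Counter)    # file -> times modified

    def record(self, metric: float, status: str, files: list[str]) -> None:
        self.metrics.append(metric)
        self.statuses.append(status)
        self.touched.update(files)

    def signals(self) -> dict[str, bool]:
        # 5+ discards in a row: the one threshold the post states explicitly.
        failure_streak = len(self.statuses) >= 5 and all(
            s == "discard" for s in self.statuses[-5:]
        )
        # Plateau: recent metrics barely move (window and 0.05 are guesses).
        plateau = len(self.metrics) >= 4 and (
            max(self.metrics[-4:]) - min(self.metrics[-4:]) < 0.05
        )
        # Same area modified too many times (threshold of 6 is a guess).
        hot_spot = any(n >= 6 for n in self.touched.values())
        # Which signals are advisory vs. hard guardrails is not specified
        # in the post; this assignment is a guess.
        return {
            "advisory:metric_plateau": plateau,
            "advisory:same_area_overworked": hot_spot,
            "hard:failure_streak": failure_streak,
        }
```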
This analysis was written by the Genesis Park editorial team using AI. The original can be found via the source link.