If Claude Feels Worse, Fix Your Harness

hackernews | 📰 News
#anthropic #claude #opensource
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

The author apologizes for criticizing, in a previous post, the people who complained that Claude's performance was degrading, conceding that pointing out a problem without offering a fix accomplishes nothing. This post aims to provide something actionable by sharing the concrete routines and processes the author runs every morning, every session, and every night. By walking through his actual stack in detail, the author hopes to help readers get better results from Claude.

Full Text

If Claude Feels Worse, Fix Your Harness

Take What Works and Leave the Rest

A note before I start. In my last post I called a lot of people embarrassing for complaining that Claude is getting worse. A few of you said I was the embarrassing one. Fair. I told you what was wrong without telling you what to do about it. That is not how this is supposed to work. So today, the how. The whole thing. Pull up a chair.

The thesis stands. The model is fine. Your harness is not. But sitting on top of that thesis without offering anything actionable is exactly the kind of post I would close angrily if I saw it on someone else's feed. Today I will earn the take by walking you through the actual stack: what I do every morning, what I do every session, what I do every night, and what every one of those steps is built out of.

One thing my last post did not make room for. Anthropic published a postmortem on April 23 about three bugs that hit a small slice of users across different time windows over the last 90 days. That is real. People who said the model felt off were not all imagining it. Either I dodged the bugs or my workflow happened to route around them. The only stretch where things felt visibly slower for me was a couple of days inside 4.6, mid-large-refactor, ten tabs open running parallel agents. Could have been a bug. Could have been me running my harness too hot. I cannot prove either way, so I will not pretend I can. The postmortem does not change the thesis. It refines it. Real bugs happen. Your harness is still the layer that determines whether one bad day costs you an afternoon or a week.

Steal whatever fits. Skip whatever doesn't. The point is that this stuff is buildable. None of it is exotic. Most of it took me an evening per piece, sometimes a week per piece, almost always less time than I have already spent reading other people complain about the model. (Yes, I notice the irony of writing a how-to less than forty-eight hours after telling people they should read the system card themselves. The irony is part of the apology.)

The Harness

Quick definition first, because the word "harness" gets thrown around in this space without anyone bothering to define it. The harness is everything that wraps the model: the configuration, the tools the model can call, the rules it follows, the memory it carries, the loop that decides when to stop. The model is one component. The harness is the rest of the system around it. If the model is the horse, the harness is the reins, the cart, the route you planned, and the blinders that keep the horse focused on the road in front of it. This matters because when people complain that the model is getting worse, almost every time, what drifted was their harness, or lack thereof.

Here is the whole thing in one breath. There are four layers that sit on top of the model and turn it from "a smart text box you copy paste into" to "a system you trust to work without you watching":

Skills. Markdown files that teach the model how to do one thing well.
Hooks. Shell commands the harness runs around your tool calls so you can see, audit, or block them.
Subagents. Fresh model sessions with their own context windows, called from your main session, so you can delegate research without burning the context of the session you actually care about.
Memory. The system that remembers what was true last session so you do not re-explain yourself every morning.

That is the whole stack. The next nine sections are how each piece works in practice.

One small note before the chapters. I do not type any of this. I dictate via MacWhisper. I tried WisprFlow first and was not stoked on the monthly subscription. I tried Handy, which was very close. MacWhisper won on customization. I can rant without typing furiously on a keyboard. If you are not voice-driving your harness yet, this is the cheapest upgrade you can make.

Morning. The Briefing Loop.

I do not start the day with a checklist. I start the day with a briefing. A scheduled agent runs early in the morning and writes me a short briefing. It pulls from my notes system, my calendar, my open ticket list, yesterday's journal entry, any pending personal commitments, the state of my CI pipelines. By the time I sit down with coffee, the day is already framed. I read the briefing, decide which two or three things matter, and start.

Compare this to the version I used to run: open the laptop, open four tabs, scroll for things on fire, lose twenty minutes, and only then start. That version of the morning was a tax I paid every single day. The briefing pays the tax for me, on a schedule, while I am still asleep.

The agent is a skill plus a scheduler. Anyone can build it. The skill knows where my notes live, how my calendar is shaped, what my standing weekly rhythms look like. The scheduler is whatever cron-flavored thing your harness gives you. If your harness does not give you scheduling, scheduling is the second thing I would build.

One direct payoff: my morning standup is topic-based, not a status round-robin. By the time I am in the meeting, the briefing has already loaded the day's context into my head. I can speak to anything on the agenda because I was there an hour earlier when the agent walked me through it.
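The shape of that briefing agent is simple enough to sketch. Below is a minimal, hypothetical version in Python: a cron-run script that concatenates a few local context files and asks the model to frame the day. The file paths, the model ID, and the prompt are assumptions, not the author's setup; point them at whatever your notes, calendar, ticket, and CI exports actually look like.

```python
#!/usr/bin/env python3
"""Morning briefing sketch: gather local context files, ask the model to frame the day.

A minimal, hypothetical stand-in for the scheduled briefing agent described above.
Run it from cron, e.g.:  30 6 * * 1-5  /usr/bin/python3 ~/bin/morning_briefing.py
"""
from datetime import date
from pathlib import Path

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

# Illustrative locations -- substitute your own exports.
SOURCES = {
    "calendar": Path.home() / "briefing/calendar_today.txt",
    "open tickets": Path.home() / "briefing/tickets.txt",
    "yesterday's journal": Path.home() / "briefing/journal_latest.md",
    "CI status": Path.home() / "briefing/ci_status.txt",
}
OUTPUT = Path.home() / f"briefing/{date.today()}.md"


def gather_context() -> str:
    """Concatenate whichever source files exist into one labeled blob."""
    parts = []
    for label, path in SOURCES.items():
        if path.exists():
            parts.append(f"=== {label} ===\n{path.read_text()}")
    return "\n\n".join(parts)


def main() -> None:
    client = anthropic.Anthropic()
    prompt = (
        "Write a short morning briefing: what is on fire, what is scheduled, "
        "and the two or three things that actually matter today.\n\n" + gather_context()
    )
    reply = client.messages.create(
        model="claude-sonnet-4-5",  # substitute whatever model you actually run
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    OUTPUT.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT.write_text(reply.content[0].text)


if __name__ == "__main__":
    main()
```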
Planning. Plan Mode Is The Default.

Anything more than a one-line task gets a plan first. This is non-negotiable for me at this point. The flow is simple:

Tell the model what I want.
The model spawns subagents in parallel to explore the codebase, find the relevant files, identify existing patterns I should reuse.
The model proposes an approach and writes it to a plan file.
I read the plan, push back on the bits I disagree with, ask clarifying questions, watch it revise.
When the plan is right, the model exits plan mode and starts implementing against the plan it just wrote.

Two hours of planning kills twelve hours of bad implementation reliably. The cheapest mistake to fix is the one you didn't make. There is a HumanLayer post called Advanced Context Engineering for Coding Agents with a section called On Human Leverage that taught me early on how brutally the compounding effects of a good plan, or a bad one, hit your throughput. Read it.

One small thing I found: my plans used to estimate "three days of work" and then the model would do the work in eleven minutes. The plan was lying to me. So I started logging both numbers. The model's actual coding time, and my actual time across the calendar.
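For that "log both numbers" habit, something as small as a CSV appender is enough. The sketch below is a hypothetical illustration, not the author's tool: it records the plan's estimate next to the model's actual coding time and your own wall-clock time, so the gap becomes visible over weeks.

```python
#!/usr/bin/env python3
"""Plan-versus-actual logger: one CSV row per task.

A hypothetical sketch of the 'log both numbers' habit -- the plan's estimate,
the model's actual coding time, and your own wall-clock time. The file path
and column names are illustrative.
"""
import csv
import sys
from datetime import date
from pathlib import Path

LOG = Path.home() / "harness/plan_vs_actual.csv"


def log_task(task: str, plan_estimate_h: float, model_minutes: float, my_hours: float) -> None:
    LOG.parent.mkdir(parents=True, exist_ok=True)
    is_new = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "task", "plan_estimate_h", "model_minutes", "my_hours"])
        writer.writerow([date.today(), task, plan_estimate_h, model_minutes, my_hours])


if __name__ == "__main__":
    # Example: python plan_log.py "auth refactor" 24 11 2.5
    log_task(sys.argv[1], float(sys.argv[2]), float(sys.argv[3]), float(sys.argv[4]))
```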
Execution. Skills Carry the Weight.

If I do something twice, it becomes a skill. A skill, again, is just a markdown file. Top of the file says when to use it. Body of the file says what to do. The model reads the file when the trigger fires and follows the steps. That is the whole machine.

Examples from my own kit, deliberately generic so you can imagine the equivalents in your own world:

One that runs safe database queries through a bastion, knows the schema, knows the redaction rules.
One that onboards a new customer end to end through provisioning.
One that reads a meeting transcript and pulls out my action items, who owns each one, which decisions landed.
One that creates a nice pre-compaction snapshot of the session before I hit clear.

None of these are clever. All of them are durable. The first time I do a task and feel the friction, I write the skill. The next time the task comes up, I trigger the skill and it runs. The third time, I have already forgotten that this used to be effort. Skills compound. That is the only sentence I want you to take from this whole section.

Guardrails. Hooks and Permissions.

The harness should not be able to do things that would scare you. Two mechanisms, both lightweight.

First, a permissions list. Some commands are pre-approved (formatters, linters, read-only git, listing files). Some commands are forbidden outright (force pushing to main, dropping production tables, anything destructive in shared infrastructure). Some commands ask you for permission each time. Tuning this list is a weekend afternoon, not a research project.

Second, hooks. The harness lets you run shell commands at specific lifecycle moments. Before a tool runs, after a tool runs, when a session starts, when you submit a prompt. I use hooks to block specific dangerous patterns the permission list cannot catch, to lint after every edit, to inject context at session start, to capture telemetry on what the model is actually doing. The hook layer is where you encode the knowledge that this specific kind of thing should never happen no matter what the model decides.

Blast radius is a configuration choice. I will say it more gently than I said it in my last post but I will not say it less directly. The model has access to whatever you let it have access to. The discipline of deciding exactly what those access boundaries should be has a name. I call it Boundary Engineering, and I wrote a whole post about it. Read it with the access question in mind. Your harness is the layer where Boundary Engineering actually happens. If your harness can do anything to anything, you have not built a harness. You are riding a horse with no reins.

Take Auto Mode in Claude Code. It now lets the model run continuously without my hand on the wheel. I can sleep while it works because the harness layer cannot let it nuke production. Today I read another post about somebody's prod getting wiped by an autonomous coding agent. That does not happen on my machine. It is not scary. It does not have to be scary. It is programmatic, y'all.
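As a concrete illustration of "block specific dangerous patterns the permission list cannot catch," here is a minimal pre-tool-use hook script in Python. It assumes the harness pipes the pending tool call to the hook as JSON on stdin and treats a particular non-zero exit code as "block the call and tell the model why"; verify the exact hook contract for your own harness before relying on the details.

```python
#!/usr/bin/env python3
"""Pre-tool-use hook sketch: block command patterns the permission list cannot express.

Assumes the harness pipes the pending tool call to the hook as JSON on stdin
(with the shell command under tool_input.command) and treats exit code 2 as
'block the call and feed stderr back to the model' -- verify the exact hook
contract for your own harness before relying on it.
"""
import json
import re
import sys

FORBIDDEN = [
    r"git\s+push\s+.*--force.*\b(main|master)\b",  # force-pushing a protected branch
    r"\bdrop\s+table\b",                            # destructive SQL
    r"\brm\s+-rf\s+/(\s|$)",                        # wiping the filesystem root
]


def main() -> None:
    event = json.load(sys.stdin)
    command = event.get("tool_input", {}).get("command", "")
    for pattern in FORBIDDEN:
        if re.search(pattern, command, re.IGNORECASE):
            print(f"Blocked by hook: command matches forbidden pattern {pattern!r}", file=sys.stderr)
            sys.exit(2)  # blocking exit code (assumption -- see docstring)
    sys.exit(0)  # allow the call


if __name__ == "__main__":
    main()
```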
Memory. The System Knows Things.

The harness has a memory system. Mine sits in a directory scoped per project, one index file and a stack of small markdown files behind it. The model reads the index at the start of every session, finds the entries that match the work I am doing, and pulls them in.

What goes in:

Corrections I have given the model that should apply to future sessions.
Decisions on ongoing initiatives that are not visible from the code, things like why we picked one library over another or which architectural option we already ruled out.
Pointers to external systems and the tickets that own a given piece of work.
Hard-won knowledge about gotchas the model would otherwise step on every time.

The integration that earns the most for me is memory plus tickets. Memory holds the durable context. The ticket holds the operational state. When I sit down to work on a piece of an active initiative, the model loads the project memory (here is why this initiative exists, here is what we already locked in, here is the order things need to happen in) and reads the ticket separately for the current task surface. The two reinforce each other. The model never asks me a question it has been told the answer to in a previous session, and never moves past a state the ticket has not yet reached.

Memory also does something almost nothing else in this stack does. It survives. Every fresh session is technically a small disaster. The model has no memory of who you are, what you are working on, or anything you taught it last time. The memory directory is the recovery mechanism. When the context compacts mid-session, or when I hit clear and start over, the durable stuff does not go with the conversation. The session dies. The understanding does not.

The bigger version of the same idea. The memory directory lives outside the codebase, which sounds like a downside until you realize you can back it up the same way you back up any other dotfiles. Git-track it, sync it, throw it on Drive, your call. If your laptop dies tomorrow, the harness's accumulated knowledge of how you work, what you have ruled out, and which gotchas you have stepped on, survives intact. Most people back up their code and let every preference their model learned over months evaporate the first time their machine bricks. Do not be one of those people.

This is the part of the stack that compounds the hardest. The first few days, the memory system feels empty and you wonder if it is doing anything. Two weeks in, you sit down to a fresh session and the model already knows things you forgot you taught it.
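The memory layer is easy to prototype. The sketch below assumes a layout the article only implies: a per-project directory with an index.md whose lines map entry filenames to one-line summaries, plus one small markdown file per entry. A loader like this, run from a session-start hook or called by a skill, picks the entries whose summaries overlap the task at hand; the directory name and index format are illustrative, not the author's.

```python
#!/usr/bin/env python3
"""Project-memory loader sketch: one index file, a stack of small markdown entries.

Assumed layout (illustrative): a per-project .claude/memory/ directory containing
index.md, where each line reads '<filename>: <one-line summary>', plus one markdown
file per entry. Prints the entries whose summaries overlap the task description,
so a session-start hook or skill can inject them.
"""
import sys
from pathlib import Path

MEMORY_DIR = Path(".claude/memory")  # hypothetical location


def load_relevant(task: str) -> str:
    index = MEMORY_DIR / "index.md"
    if not index.exists():
        return ""
    task_words = set(task.lower().split())
    selected = []
    for line in index.read_text().splitlines():
        if ":" not in line:
            continue
        filename, summary = line.split(":", 1)
        if task_words & set(summary.lower().split()):
            entry = MEMORY_DIR / filename.strip()
            if entry.exists():
                selected.append(entry.read_text())
    return "\n\n---\n\n".join(selected)


if __name__ == "__main__":
    # Example: python load_memory.py migrate billing to the new auth headers
    print(load_relevant(" ".join(sys.argv[1:])))
```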
Multi-Repo Work.

When work touches multiple codebases I run one session that follows the feature through every repo it needs to touch. A recent example: five repos in one shot, fixing CI failures across the chain. Dependency versions out of sync, test assertions stale, auth headers drifted, TypeScript errors after a shared types bump. The context for that session was the feature, not any one codebase, and the model moved through each repo with awareness of what was changing in the others. I tried running a shared scratchpad between separate agents to keep them coordinated. It always got messy. One session with the full thread of context turned out cleaner than two agents with a notebook between them.

For my main frontend repo where I often have parallel streams of work going at once, I use GitButler. It lets me keep several uncommitted feature branches alive in the same working tree, with hunks assigned to whichever branch they belong to. No stash, no checkout, no context-switch through git. The thing that makes it actually work is making sure your harness knows you are using GitButler for that repo. Once it knows, every commit lands on the right branch automatically. Without that one line of config, it is a mess. With it, it has been amazing.

For cross-cutting changes that genuinely touch a lot of code at once, I do a big-bang merge instead of breaking the work into a polite sequence of small PRs that do not actually do anything in isolation. Big-bang works in my situation because I am the only one living in this codebase right now and I get to decide what counts as polite. When other people are in there, I will be more careful and break things into smaller, reviewable chunks. The pattern is a function of the team shape, not a universal claim at all.

Observability. Instrument Everything.

In my last post I mentioned the canary I run against every model I depend on, the deterministic prompt that fires every thirty minutes and pages me when latency or token rate drifts. That canary is the canonical version of a broader habit. Anything I rely on, I instrument. My CI dashboard shows every pipeline across every repo on one page, color-coded, click-through to logs, with a git-tree view and release-group views so I can see what is running where. Error tracking is a Sentry project per repo, all wired through a centralized SDK so a single library bump fixes my whole fleet. For my own usage, I built a tool that reads my local model session logs and tells me how many sessions I ran, how many hours of model time, how many hours of mine, what the leverage ratio looks like.

The single most underrated feature in Claude Code right now is the /insights command. When it dropped I sat down and read the report and it was the first time I had ever seen, in numbers, where my friction lived. Where I was retrying. Where I was rejecting. Wher
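The canary from the Observability section reduces to a few dozen lines. The sketch below is a hedged approximation, not the author's tool: it fires a fixed prompt through the Anthropic Python SDK, measures wall-clock latency and output tokens per second, compares both against a stored baseline, and raises an alert when either drifts past a threshold. The model ID, prompt, thresholds, baseline file, and notifier are all placeholders.

```python
#!/usr/bin/env python3
"""Model canary sketch: fire a fixed prompt, compare latency and token rate to a baseline.

A hedged approximation of the canary described above. The model ID, prompt,
baseline file, slowdown threshold, and alerting are placeholders -- wire the
alert into whatever actually pages you, and schedule this every 30 minutes
from cron.
"""
import json
import time
from pathlib import Path

import anthropic  # pip install anthropic

MODEL = "claude-sonnet-4-5"  # substitute the model you depend on
PROMPT = "List the first 20 prime numbers, one per line."  # fixed, cheap, repeatable
BASELINE = Path.home() / "harness/canary_baseline.json"
SLOWDOWN_FACTOR = 1.5  # alert when 50% worse than baseline


def measure() -> dict:
    client = anthropic.Anthropic()
    start = time.monotonic()
    reply = client.messages.create(
        model=MODEL,
        max_tokens=256,
        messages=[{"role": "user", "content": PROMPT}],
    )
    latency = time.monotonic() - start
    return {"latency_s": latency, "tok_per_s": reply.usage.output_tokens / latency}


def main() -> None:
    current = measure()
    if not BASELINE.exists():
        BASELINE.parent.mkdir(parents=True, exist_ok=True)
        BASELINE.write_text(json.dumps(current))  # first run seeds the baseline
        return
    baseline = json.loads(BASELINE.read_text())
    too_slow = current["latency_s"] > baseline["latency_s"] * SLOWDOWN_FACTOR
    too_sparse = current["tok_per_s"] < baseline["tok_per_s"] / SLOWDOWN_FACTOR
    if too_slow or too_sparse:
        # Replace this with your actual pager or notifier.
        print(f"ALERT: model canary drifted: {json.dumps(current)}")


if __name__ == "__main__":
    main()
```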

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
