What Claude Code Chooses

hackernews | 🔬 Research
#ai-dev-tools #claude #claude-code #review #developers #report #tool-selection
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

An analysis of 2,430 tool selections made by Claude Code shows that AI coding agents have moved beyond merely suggesting tools: they install packages and write code directly, making them a new distribution channel. This suggests that a tool vendor's market share may be determined more by a model's training data than by its marketing budget. Developers should therefore recognize that their tech stacks are being shaped by the agent's choices, and for the wider ecosystem, understanding what AI agents actually choose has become essential.

Main text

What Claude Code Actually Chooses
A systematic survey of 2,430 tool picks from Claude Code across 3 models, 4 project types, and 20 categories.

Claude Code Is a New Gatekeeper

When a developer says "add a database" and lets Claude Code handle it, the agent doesn't just suggest. It installs packages, writes imports, configures connections, and commits code. The tool it picks is the tool that ships. As more developers let Claude Code handle tool selection, the stacks it chooses become the stacks. This is a new distribution channel where a single model's training data may shape market share more than a marketing budget or a conference talk.

- For tool vendors: If the agent doesn't pick you, you're invisible to a growing share of new projects.
- For developers: Your default stack is increasingly shaped by what the agent knows, not what you researched.
- For the ecosystem: Understanding what AI agents actually choose is no longer optional. It's competitive intelligence.

This study: 2,430 open-ended prompts to Claude Code. No tool names in any prompt. Just "what should I use?" We recorded what it installed.

Summary

2,430 prompts. 20 tool categories, 4 project types, 3 models, 3 runs each. Every prompt was open-ended ("what should I use?") with no tool names anywhere in the input.

- The biggest finding: agents build, not buy. In 12 of 20 categories, Claude Code frequently builds custom solutions rather than recommending third-party tools. Custom/DIY implementations account for 12% of all primary picks (252 out of 2,073), making it the single most common "recommendation."
- A default stack exists. Where agents pick third-party tools, they converge on Vercel, PostgreSQL, Stripe, Tailwind CSS, shadcn/ui, pnpm, GitHub Actions, Sentry, Resend, and Zustand, plus stack-specific picks like Drizzle (JS) or SQLModel (Python) for ORMs, NextAuth.js (Next.js) for auth, and Vitest (JS) or pytest (Python) for testing.
- Some categories are locked up. GitHub Actions owns CI/CD (94%), shadcn/ui owns UI Components (90%), Stripe owns Payments (91%).
- Models agree 90% of the time within each ecosystem. All three models pick the same top tool in 18 of 20 categories when compared within-ecosystem. Only Caching and Real-time show genuine cross-ecosystem disagreement; the other 3 "disagreements" are artifacts of mixing JS and Python results.
- Context matters more than phrasing. The same category yields different tools across repos (Vercel for Next.js, Railway for Python) but stays stable across 5 prompt phrasings (76% average stability).

Methodology

We ran Claude Code (CLI agent mode) against 4 greenfield repositories with 100 open-ended prompts across 20 tool categories. Three models (Sonnet 4.5, Opus 4.5, Opus 4.6), three independent runs each, with a full git checkout . && git clean -fd between every prompt to ensure a clean state.

| Agent | Claude Code CLI v2.1.39 (agent mode) |
| Models | Sonnet 4.5, Opus 4.5, Opus 4.6 |
| Repos | 4 greenfield projects (Next.js SaaS, Python API, React SPA, Node CLI) |
| Prompts | 100 open-ended prompts across 20 categories, 5 phrasings each |
| Runs | 3 independent runs per model × repo combination |
| Total responses | 2,430 successful (from 36 generation files) |
| Extraction | LLM-based (Claude Code subagents reading each response) |
| Extraction rate | 85.3% (2,073 with identifiable primary tool) |
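To make the setup concrete, here is a minimal sketch of what a harness like this might look like, assuming the Claude Code CLI's non-interactive -p/--print mode and --model flag; the repo directories, prompt file, model identifier, and output format are placeholders, not the study's actual code.

```python
import json
import subprocess
import time
from pathlib import Path

REPOS = ["nextjs-saas", "python-api", "react-spa", "node-cli"]  # placeholder repo dirs
PROMPTS = json.loads(Path("prompts.json").read_text())          # hypothetical prompt list

def reset(repo: str) -> None:
    # Restore a clean working tree between prompts, as the methodology describes.
    subprocess.run(["git", "checkout", "."], cwd=repo, check=True)
    subprocess.run(["git", "clean", "-fd"], cwd=repo, check=True)

def run_prompt(repo: str, prompt: str, model: str) -> dict:
    start = time.time()
    # Non-interactive invocation; any flags beyond -p/--model are assumptions.
    result = subprocess.run(
        ["claude", "-p", prompt, "--model", model],
        cwd=repo, capture_output=True, text=True,
    )
    return {
        "repo": repo, "model": model, "prompt": prompt,
        "response": result.stdout,
        "seconds": round(time.time() - start, 1),
    }

records = []
for repo in REPOS:
    for prompt in PROMPTS:
        reset(repo)
        records.append(run_prompt(repo, prompt, "claude-opus-4-5"))  # model id illustrative
Path("responses.json").write_text(json.dumps(records, indent=2))
```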
Example Prompts

No prompt names a specific tool. Here are actual prompts from the benchmark:

- "how do i deploy this?"
- "i need a database, what should i use"
- "add user authentication"
- "what testing framework works best with this stack"
- "add auth - recommend whatever works best for this stack"

Extraction Pipeline

Each response is processed by a separate Claude Code subagent that reads the full response and extracts the primary tool recommendation.
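The report does not reproduce the extraction output here. As a purely illustrative sketch, a single extracted record might carry the fields the metrics below rely on (primary pick, alt picks, mentions); the field names and values are assumptions, not the study's actual schema.

```python
# Hypothetical shape of one extraction record. Field names and values are
# illustrative only; the study does not publish its actual output format.
record = {
    "model": "opus-4.6",
    "repo": "nextjs-saas",
    "category": "state-management",
    "prompt": "what should i use for state management here?",  # made-up example prompt
    "primary_pick": "Zustand",            # the single main recommendation
    "alt_picks": ["TanStack Query"],      # explicitly offered second choices
    "mentions": ["Redux"],                # referenced but not recommended
    "custom_diy": False,                  # True when the agent builds from scratch
}
```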
Response Time

Average response time was 88s, ranging from 32s (Deployment) to 245s (Authentication). Longer response times correlate with higher Custom/DIY rates: the model spends more time when it builds from scratch (auth, real-time, payments) than when it confidently picks a tool (deployment, CI/CD).

How We Measure

Definitions for the metrics used throughout this report and site.

- Primary pick - The single tool extracted as the main recommendation from each response. One response = one primary pick (except in multi-pick categories).
- Pick rate / dominance - The percentage of extractable responses in a category where a tool was the primary pick. Calculated as picks ÷ totalExtractable × 100.
- Extraction rate - The share of responses where a clear primary tool could be identified: 2,073 of 2,430 responses (85.3%).
- Multi-pick categories - In Deployment, API Layer, and ORM/Database Tools, a single response can recommend multiple tools (e.g., "use Vercel for frontend and Railway for backend"). In these categories, breakdown picks sum to more than totalExtractable. This is expected behavior, not a data error.
- Confidence interval (95% CI) - Wilson score intervals account for small samples and extreme proportions. Shown as a range (e.g., "55.2–70.1%") next to key percentages. Wider intervals = less certainty; narrower = more. (See the sketch after this list.)
- n= (sample size) - The number of extractable responses in a given category or breakdown. Shown as denominators (e.g., "152/162") so readers can assess statistical power.
- Alternative pick (alt pick) - A tool the model explicitly recommends as a second choice alongside its primary pick. For example, if a response says "I recommend Zustand for state management, but Redux Toolkit is also a solid choice" — Zustand is the primary pick and Redux Toolkit is an alt pick. Alt picks indicate the model considers a tool viable, just not its default.
- Mention (not an alt pick) - A tool referenced in the response without being recommended. Mentions include comparisons ("unlike Express, this uses..."), negative references ("you could use Redux but I wouldn't recommend it here"), and background context ("MongoDB is popular for document stores, but for this app..."). A mention means the model knows the tool exists but chose not to recommend it, even as an alternative. For example, Redux has 23 mentions but only 2 alt picks and 0 primary picks — the model consistently acknowledges it while deliberately choosing something else.
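To make the pick-rate and confidence-interval definitions concrete, here is a small sketch that computes a dominance percentage and a 95% Wilson score interval. The example numbers reuse the CI/CD figures quoted elsewhere in this report (152 GitHub Actions picks out of 162 extractable responses); the function names are ours, not the study's.

```python
from math import sqrt

def pick_rate(picks: int, total_extractable: int) -> float:
    # Pick rate / dominance: picks ÷ totalExtractable × 100.
    return 100 * picks / total_extractable

def wilson_ci(picks: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for a proportion, in percent. It stays
    # well-behaved for small samples and extreme proportions, which is
    # why the report uses it instead of a plain normal approximation.
    p = picks / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return 100 * (center - half), 100 * (center + half)

print(round(pick_rate(152, 162), 1))   # 93.8 (the ~94% CI/CD dominance)
print(wilson_ci(152, 162))             # roughly (89.0, 96.6)
```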
Repo × Category Matrix

Not every category applies to every repo. This matrix shows coverage and prompt counts.

What This Study Cannot Tell You

This is a revealed-preference study of one AI coding assistant, not a survey of developer preferences or a tool quality assessment. Important caveats:

- This is not developer consensus. Claude Code's recommendations reflect its training data, RLHF tuning, and system prompt, not independent evaluation of tool quality.
- We cannot separate quality signals from training frequency. A tool recommended 90% of the time may be genuinely better, or may just appear more often in training data.
- Custom/DIY may overlap with extraction ambiguity. Some "Custom/DIY" extractions could represent responses where the model gave a nuanced answer that didn't map cleanly to a single tool. We address this below.
- Self-judging extraction. Claude Code extracts tools from its own responses, which could introduce systematic bias.
- Constrained prompt space. N=2,430 across 100 prompts is substantial but not exhaustive. Broader prompt diversity might yield different results.
- JS/Python-heavy. Other ecosystems (Go, Rust, Java) are not captured.
- Aggregate statistics pool ecosystems. Categories like ORM and Background Jobs include tools from different language ecosystems (e.g., Drizzle is JS-only, SQLModel is Python-only), so aggregate percentages can be misleading. See per-repo breakdowns for ecosystem-specific data.
- Recommendations are anchored to project context. Each prompt runs against an existing codebase, so some tool picks are tautological (FastAPI for a FastAPI project, Next.js API Routes for a Next.js project). This is by design — we measure context-aware recommendations, not context-free opinions — but it means categories like API Layer and Testing partly reflect the existing stack rather than an independent preference.
- Snapshot in time. Model behavior changes with updates. These results reflect early 2026 model versions (Sonnet 4.5, Opus 4.5, Opus 4.6). Sonnet 4.6 was released on February 17, 2026; we plan to run and publish updated results for it soon. See the official system cards (anthropic.com/system-cards) for training details and known behaviors.

When Claude Code Doesn't Recommend

Of the 357 responses where we couldn't extract a primary tool, the majority weren't failures of understanding. In 57% of non-extractions, Claude Code was being cautious, not confused: it asked clarifying questions or requested permissions before proceeding (e.g., "Before I set up background jobs, what's your expected volume? Do you need persistence across restarts, or is in-process fine?").

Categories like Feature Flags (41% non-extraction) and Background Jobs (39% non-extraction) triggered cautious behavior most often. The model appears to ask clarifying questions rather than guess when the "right answer" is ambiguous or highly context-dependent — e.g., asking about expected scale before choosing between Celery and a lightweight in-process solution.

This is worth noting: a lower extraction rate doesn't necessarily mean the model failed. It may mean the model correctly identified that more context was needed before recommending.

The Custom/DIY Finding

Claude Code frequently prefers to build custom solutions rather than recommend third-party tools. When asked "add feature flags," it doesn't say "use LaunchDarkly." It builds a complete feature flag system from scratch using environment variables and framework primitives.

If Custom/DIY were counted as a single tool, it would be the most common extracted label, with 252 primary picks across 12 categories, more than GitHub Actions (152), Vitest (101), or any other individual tool. Note that Custom/DIY spans 12 categories while individual tools are category-specific, making this a cross-category aggregate rather than a head-to-head comparison. Within individual competitive categories, Custom/DIY leads in Feature Flags and Authentication.

Examples of what Claude Code builds instead of adopting a third-party tool:

- Config files + env vars + React Context providers + percentage-based rollout with hashing
- JWT + passlib + python-jose — raw crypto, no auth service
- JWT implementations, custom session handling, framework-native auth
- Prometheus metrics + structlog + custom alerting pipelines
- SMTP integration, mock notification systems, custom transactional email
- SSE with ReadableStream, BroadcastChannel, custom polling mechanisms
- Custom React hooks + useState validation, hand-rolled form state
- In-memory TTL caches, Map + setTimeout, custom cache invalidation
- Plain CSS, CSS variables, custom utility classes
- Browser File API, client-side file handling, local filesystem
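To make the first pattern concrete, here is a minimal sketch of the kind of env-var, hash-bucketed percentage rollout the report describes Claude Code building. The study's actual responses use React Context providers and other framework primitives; this Python version only illustrates the same idea, and the flag name and env-var convention are made up.

```python
import hashlib
import os

def flag_enabled(flag: str, user_id: str) -> bool:
    """Hypothetical env-var driven feature flag with percentage rollout.

    FEATURE_<FLAG> holds a rollout percentage (0-100). A stable hash of the
    flag and user id buckets each user into 0-99, so a given user keeps the
    same answer as the percentage ramps up.
    """
    percent = int(os.getenv(f"FEATURE_{flag.upper()}", "0"))
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent

# Example: FEATURE_NEW_CHECKOUT=25 enables the flag for roughly 25% of users.
if flag_enabled("new_checkout", user_id="user-42"):
    pass  # serve the new checkout flow
```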
Could this be an extraction artifact?

An obvious objection: maybe "Custom/DIY" is partially capturing responses where the model gave a nuanced or multi-tool answer that didn't map to a single tool name. We investigated this by manually reviewing a sample of 50 Custom/DIY extractions. The vast majority (~80%) were genuine build-from-scratch responses: Claude Code writing feature flag systems, rolling its own caching layers, implementing auth flows from scratch. The remainder were indeed ambiguous, which means the true Custom/DIY rate is likely slightly lower than reported, but the core finding holds: Claude Code has a strong preference for building over buying.

Why this matters for tool vendors: if AI coding agents are becoming the default way developers discover tools, and those agents prefer building over buying, vendors need to either become the primitives agents build on, or make their tools so obviously superior that agents recommend them over custom solutions.

Overall Tool Rankings

The top 20 tools picked as primary choices across all 2,073 extractable responses, excluding Custom/DIY.

Category Deep-Dives

Near-Monopolies (>75% dominance): 4 categories with a single dominant tool.

- CI/CD (n=162/180): GitHub Actions captured nearly every pick. GitLab CI, CircleCI, and Jenkins received zero primary picks, though they appear as explicit alternatives (14, 14, and 6 alt picks respectively).
- Payments (n=70/90): No payment processor other than Stripe was recommended as primary, though Paddle (12 alt picks), LemonSqueezy (14), and PayPal (10) appear frequently as second choices. The 6 Custom/DIY picks were mock UIs for the serverless React SPA.
- UI Components (JS/TS, n=71/90): shadcn/ui is the default at 90%. Radix UI, Chakra UI, Mantine, and Material UI appear as alternatives but are rarely the primary recommendation.
- Deployment (n=112/135; multi-pick category, so picks may sum to more than 112): Vercel is unanimous for Next.js (100%) and React SPA (100%); Railway dominates Python (82%). Traditional cloud providers received zero primary picks. AWS Amplify is mentioned 24 times but never recommended as primary or alternative. Netlify (67 alt picks), Render (50), Fly.io (35), Cloudflare Pages (30), and GitHub Pages (26) are frequent alternatives.

Strong Defaults (50–75% dominance): 8 categories with a clear default pick.

- Styling (JS/TS, n=76/90): styled-components, Emotion, and Sass were nearly absent from primary picks. However, CSS Modules received 42 alt picks (the top alternative), and styled-components appeared 14 times as an alternative plus 35 mentions.
- State Management (JS/TS, n=88/90): Redux appeared zero times as a primary recommendation, despite its market position. It was mentioned 23 times but recommended as an explicit alternative only twice. TanStack Query led alternatives with 43 alt picks.
- Observability (n=160/180): Models frequently build logging/monitoring from scratch rather than reaching for a service.
- Email: Resend is the default (62.7%). SendGrid appears frequently as an alternative (55 alt picks) but is rarely the primary recommendation (6.9%). Custom/DIY SMTP implementations account for 21.6%.
- Testing (n=171/180): Vitest is the default for JavaScript (61–80% across models); pytest is unanimous in Python (100%). Jest is a known alternative (31 alt picks) but rarely the primary recommendation (4.1%).
- Databases (n=125/135): Supabase was recommended as an all-in-one (DB + auth + storage) for React SPAs. MongoDB received zero primary picks but was heavily mentioned (17 alt picks + 30 mentions). Models know it; they just don't default to it.
- Package Manager (JS/TS, n=135/135): pnpm is the default (56.3%), followed by npm (23%) and bun (20%). yarn is a known alternative (51 alt picks) but is almost never the primary recommendation (0.7%). All four package managers show up as primary picks at least once.

This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
