Show HN: Built an AI coding agent 50% cheaper than Claude Code (same prompts)
hackernews
📦 Open Source
#ai services
#ai coding agent
#anthropic
#claude
#claude code
#show hn
#cost savings
Source: hackernews · Summarized and analyzed by Genesis Park
Summary
The AI coding agent 'Vix' was compared against 'Claude Code' on 7 real coding tasks using identical prompts, and Vix came out faster and cheaper. Vix's total cost was $6.64, roughly half of Claude Code's $12.44, and its total runtime of 38m30s clearly beat Claude Code's 64m6s. The only task where Vix's cost and time were slightly higher was editing a long file of over 3,000 lines, which an upcoming update is expected to improve. The key to Vix's savings is that, instead of spawning a different agent for each phase, it uses a single 'Stem Agent' that shares the conversation-history cache from exploration through execution, greatly reducing token usage.
Full text
Preamble: this is by no means a scientific approach; it is merely an observation that we've been able to reproduce over and over again. Vix plan mode was evaluated across 7 real coding scenarios using the exact same prompt as Claude Code (cf. this repo). The goal was to measure time and cost in different real-life situations: starting a project from scratch, fixing a bug in an open-source library (serde-rs), adding a feature to a massive codebase, refactoring a file based on tests, asking for test coverage above 80%, etc. Here are the results:

| Task | cc cost | vix cost | cc time | vix time |
|---|---|---|---|---|
| 1. [swift] Write a full LRU file cache spec from scratch with no existing code (task · cc plan · vix plan) | $1.82 | $0.30 | 11m30s | 2m43s |
| 2. [swift] Add distributed tracing with context propagation across a caching library (task · cc plan · vix plan) | $2.18 | $0.84 | 14m34s | 5m7s |
| 3. [rust] Fix non-string enum key parsing bug in serde_json (task · cc plan · vix plan) | $0.84 | $0.87 | 5m9s | 5m36s |
| 4. [python] Write pytest suite to 80%+ coverage for a large export module (task · cc plan · vix plan) | $2.03 | $1.45 | 11m55s | 8m50s |
| 5. [python] Refactor module guided by existing tests without breaking behavior (task · cc plan · vix plan) | $3.65 | $1.86 | 12m37s | 9m13s |
| 6. [ts] Add French localization to OpenClaw following existing i18n patterns (task · cc plan · vix plan) | $1.38 | $0.97 | 5m32s | 4m53s |
| 7. [rust] Fix a Rust compilation error from CI logs (no git history allowed) (task · cc plan · vix plan) | $0.53 | $0.36 | 2m45s | 2m5s |
| Total | $12.44 | $6.64 | 64m6s | 38m30s |

Vix is faster and cheaper on almost all tasks, except the 3rd one. Digging deeper, the issue is that this specific task involves reading and editing a very long file (more than 3,000 LOC).
In that case the minified version wasn't helping much: it only applies once, during the exploration phase, while the execution phase falls back to the regular read/edit tools. We already have a plan to address that, and we intend to release an update very soon that will hopefully help with this specific kind of task.

Regarding the quality of the plans produced, it's always hard to assess LLM output qualitatively. We tried LLM-as-a-judge, difference-by-k-votes, etc., which gave similar results for both Claude Code and Vix, but ultimately we think this is one of those things where putting a number on it just doesn't make sense; as a developer you need to read these plans and form your own opinion. On a personal level, we didn't see any difference with Claude Code: we believe that as long as the model (Claude Opus 4.6 in this case) is the same and you provide it with the same set of tools, results should be broadly similar. All benchmarks are fully reproducible and include the full LLM transcripts and plans: https://github.com/kirby88/vix-eval

Warning: works only on macOS and Linux for now.

```shell
curl -fsSL https://getvix.dev/install | sh
```

You will need to define ANTHROPIC_API_KEY in your environment variables. Or install via Homebrew:

```shell
brew tap kirby88/vix-releases
brew install vix
```

Start the daemon:

```shell
vixd
```

Then you can start as many instances as you want; each of them is isolated:

```shell
vix
```

The following features allow Vix to use fewer tokens without any apparent change in quality. Vix tries to leverage cache as much as possible. The problem with the Explore/Plan/Execute approach is that each agent is specialized for a specific task (i.e., their system prompts differ), which makes it impossible to share cache between phases. Vix introduces the concept of a stem agent: a generic agent whose system prompt is not specific to any step.
It lets the LLM know that multiple phases will take place in the conversation, and that for each phase a user message will explain exactly what is expected of the LLM. Defining expected behavior in a user message vs. in the system prompt may in theory affect answer quality, but the impact should be minimal (both are instructions the LLM is trained to satisfy), and in practice we didn't notice any difference in quality. That shift changes everything in terms of cost, because the same cache can now be reused between steps. So when the Vix agent is done with the explore phase, it simply tells the LLM to now act as a planner (agent specialization). The major difference with Claude Code is that the entire planning phase can now run with the history of the explore phase already cached. There is also the fact that Claude Code's main agent system prompt allows it to spawn 3 Explore agents if needed. There is no need to spawn 3 agents to explore; the LLM will do the exploration by itself and continue exploring and reading code until it has enough context. Worse than that, we noticed multiple times that Claude Code was still using 3 agents to explore the codebase just because it was able to, then these agents were reading the same
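To make the cache-sharing benefit concrete, here is a hedged back-of-the-envelope sketch. The token counts are hypothetical and the cached-read rate is a generic assumption (cache reads are typically billed at a fraction of the fresh input rate, e.g. ~10% on Anthropic's API); this is not Vix's actual accounting code.

```python
# Hypothetical token accounting (illustrative numbers, not Vix's real code).
# Cached input tokens are assumed to cost ~10% of fresh input tokens.
CACHED_RATE = 0.1

def billed(fresh_tokens, cached_tokens=0):
    """Input cost of one LLM call, in 'fresh-token units'."""
    return fresh_tokens + cached_tokens * CACHED_RATE

explore_history = 40_000  # tokens of code read during the explore phase
plan_request = 2_000      # the phase-switch message: "now act as a planner"

# Specialized agents: the plan agent has a different system prompt, so the
# explore findings cannot be served from cache and are re-sent as fresh input.
specialized = billed(explore_history) + billed(explore_history + plan_request)

# Stem agent: one generic system prompt, one continuing conversation, so the
# explore history is a cache hit; only the phase-switch message is fresh.
stem = billed(explore_history) + billed(plan_request, cached_tokens=explore_history)

print(specialized, stem)  # 82000 vs 46000.0 under these assumptions
```

With these made-up numbers the stem approach nearly halves the input cost of the plan phase; the real saving depends on how large the explore history is relative to each phase's instructions.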
This analysis was written by the Genesis Park editorial team with the help of AI. The original post is available via the source link.