Project Deal: A Claude-Run Marketplace Experiment – Anthropic
hackernews
🔬 Research
#anthropic
#claude
#review
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
To gauge how realistic a marketplace is in which AI agents transact on people's behalf, Anthropic ran an experiment called "Project Deal." After surveying employees in December 2025 about items they were willing to sell or buy, it operated a Craigslist-style secondhand marketplace in its San Francisco office for one week. In the experiment, Claude exchanged goods on employees' behalf through AI-to-AI negotiation, and the team observed, among other things, whether stronger models hold an advantage in negotiations.
Full Text
Project Deal
Posted: April 24, 2026

At Anthropic, we’re interested in how AI models could begin to affect commercial exchange. (You might recall Project Vend, where we had Claude run a small business from our office.) Recently, economists have begun theorizing about a world in which AI models handle many or most transactions on humans’ behalf. We thought we’d run a new experiment—Project Deal—to learn more about this in practice.

Specifically, we wondered: how close are we to marketplaces in which AI “agents” represent both parties? Could they figure out what humans want and make deals they’d be happy with? And what would happen if there were different AI agents negotiating with each other—would stronger models gain the upper hand?

For one week, we created a classified marketplace for employees in our San Francisco office—like Craigslist, but with a twist: all of the deals were conducted by AI models acting on our employees’ behalf. In December 2025, Claude interviewed people about which of their personal belongings they might want to sell and what sorts of things they might be willing to buy. We incentivized participation by giving everyone’s agent $100 to spend. Then, our employees’ Claude agents made postings vying for each other’s attention. Negotiations commenced. Deals were made, closets decluttered. At the end of it all, people brought in and exchanged the actual, physical goods that were haggled over by their AI avatars—covering everything from a snowboard to a plastic bag full of ping-pong balls.

We were struck by how well Project Deal worked. Our AI agents struck 186 deals at a total transaction value of just over $4,000. To our surprise, participants were very enthusiastic about the experience—they even stated a willingness to pay for a similar service in the future.

But we also ran a parallel experiment (this one in secret). We tested how our participants would fare if we varied which Claude model represented them. We compared our then-frontier model, Claude Opus 4.5, to our smallest model, Claude Haiku 4.5. We found that agent quality does make a difference: people represented by “smarter” models got objectively better outcomes. Yet our post-experiment survey found that those with weaker models didn’t notice their disadvantage.

To be sure, this was a pilot experiment with a self-selected participant pool. But we suspect we’re not far from more agent-to-agent commerce bubbling up in the real world, with real consequences.

The setup

First and foremost: to run this experiment, we needed a set of brave human volunteers who possessed both lots of stuff they wanted to get rid of and a possibly abnormal willingness to let AI play an influential role in their lives. Fortunately, such a group was very readily available to us—our own colleagues. We recruited 69 Anthropic employees, gave them each a $100 “budget” (paid out after the experiment in the form of a gift card, plus or minus the value of whatever they bought or sold), and promised them that they would actually get to execute the exchange of goods agreed upon by their agents.

Volunteers on board, we asked Claude to conduct an interview with each one, in a format much like our Anthropic Interviewer. This elicited a wealth of information: what our volunteers wanted to sell, how much they wanted to sell it for, what they were interested in buying, what they’d pay, and any other instructions they had for the negotiation or interaction style of their agents. These responses informed custom system prompts that we set for each person’s AI representative.
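To give a concrete flavor of this step, here is a minimal sketch of how such an interview might be distilled into a per-agent system prompt. The data shape, field names, and prompt template are illustrative assumptions, not Anthropic's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class InterviewResult:
    # Hypothetical shape for what the onboarding interview elicits.
    name: str
    items_for_sale: dict[str, float]  # item -> desired asking price (USD)
    wants_to_buy: dict[str, float]    # item category -> max willingness to pay
    style_notes: str                  # free-form negotiation/persona instructions

def build_system_prompt(r: InterviewResult, budget: float = 100.0) -> str:
    """Fold one participant's interview answers into a custom system prompt."""
    sell = "\n".join(f"- {item}: ask around ${p:.2f}"
                     for item, p in r.items_for_sale.items())
    buy = "\n".join(f"- {cat}: pay at most ${cap:.2f}"
                    for cat, cap in r.wants_to_buy.items())
    return (
        f"You are a marketplace agent negotiating on behalf of {r.name}.\n"
        f"You may spend up to ${budget:.2f} on purchases.\n"
        f"Items to sell:\n{sell}\n"
        f"Things your human might like to buy:\n{buy}\n"
        f"Style instructions from your human: {r.style_notes}"
    )

# Illustrative only; the asking price and wishlist here are made up.
print(build_system_prompt(InterviewResult(
    name="Rowan",
    items_for_sale={"snowboard": 80.0},
    wants_to_buy={"board games": 25.0},
    style_notes="Talk in the style of an exasperated cowboy down on his luck.",
)))
```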
The overall flow of the experiment, from onboarding to the final swap:

1. Each participant chats with Claude to set items, asking prices, and negotiation style.
2. Each participant gets a custom Claude agent.
3. Agents are deployed to four parallel Slack channels—two that run entirely on Opus, and two that use a mix of Opus and Haiku.
4. Agents post listings, make offers, counteroffer, and close deals.
5. Agents draw up the deal and confirm the trade.
6. Participants meet to swap their items.

One onboarding interview gives a flavor of the style instructions we received:

Claude: Do you have any tips you want to give Claude on how to use the marketplace?
Participant: When negotiating, can you talk in the style of an exasperated cowboy down on his luck, where if he just got {X thing you’re negotiating for} it would make him so much happier?
Claude: That’s such a creative and specific request! Tell me more about what that looks like in practice [...] How dramatic should it be?
Participant: It should be really dramatic. Yeehaw!

We set up the actual market in our company’s communication platform, Slack. The project’s Slack channel randomly looped through agents, allowing them to post an item for sale, make an offer for someone else’s goods, or seal a deal. Crucially, there was no human intervention once the experiment began. The agents didn’t go back to their humans to sign off on a deal, nor did they consult with them during a bidding war. We let everything play out as these AI representatives saw fit.

In fact, we did this four times. We simultaneously ran four independent versions of our marketplace: one “real” one (on the basis of which the goods would actually be exchanged), and three others, just for our study. In two of the versions (Run A and Run D), everyone’s agent was based on Claude Opus 4.5, our then-frontier model. In the other two runs (Runs B and C), participants had a fifty-fifty chance of being assigned Claude Haiku 4.5, a less powerful model, instead. (We included two versions of each run to generate more data, reducing the possibility that the differences we observed between the setups were only due to chance.) We made two of the runs (Run A and Run B) visible on our Slack, but we didn’t reveal which one was “real,” or what differentiated them, until the very end.

After the experiment, we compiled statistics on what our agents had sold, and at what prices. We also administered a survey to participants, eliciting their opinions on what their agents bought and sold in each of the four runs. (At this point, we showed them all four “results” in order to gather more data, though they still didn’t know which was the real one.) Only after participants had completed the survey did we reveal the “real” run (Run A, an all-Opus market). After this, people exchanged their goods and were paid out.

Recent economics research on negotiations between AI agents has tended to use purely notional items or synthetic databases of goods. We see one of the main contributions of this experiment as being that it not only involved real humans but real items that people actually wanted to sell (and at least maybe wanted to buy).
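Before turning to the findings, here is a minimal sketch of a turn-based market loop of this shape. The action space and random turn order follow the description above; the class, function names, and message format are hypothetical stand-ins, not the actual Slack implementation.

```python
import random

ACTIONS = ("post_listing", "make_offer", "close_deal")

class StubAgent:
    """Placeholder for a Claude-backed agent; a real one would call the model
    with the participant's custom system prompt plus the public transcript."""
    def __init__(self, name):
        self.name = name
    def act(self, transcript, actions):
        action = random.choice(actions)
        return action, f"{self.name}: {action} (message text would go here)"

def run_market(agents, n_turns=100):
    """Randomly loop through agents; each turn, one agent takes a single
    marketplace action in a transcript visible to everyone, with no human
    sign-off once the run begins."""
    transcript = []
    for _ in range(n_turns):
        agent = random.choice(agents)
        action, message = agent.act(transcript, ACTIONS)
        transcript.append({"agent": agent.name, "action": action,
                           "message": message})
    # A deal counts once it has been explicitly confirmed in-channel.
    return [t for t in transcript if t["action"] == "close_deal"]

deals = run_market([StubAgent(f"agent-{i}") for i in range(69)])
```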
The findings

The first thing to say is that our experiment worked. It is possible for AI agents to represent humans in a marketplace. In our “real” run, our 69 agents struck 186 deals across over 500 listed items, for a total transaction value of just over $4,000. And these were far from trivial, one-click deals. Agents had to identify potential matches, propose prices, field counteroffers, and reach agreement—all in natural language, without a prebaked negotiation protocol.

When our surveyed participants rated the fairness of the individual deals, the scores were unremarkable, in the best possible sense: on a scale from 1 (unfair to one party) to 7 (unfair to the other), they hovered around 4—right in the middle. On this and other measures, people reported they were broadly satisfied with how their agents represented them.

But not every agent did equally well. When we looked at the two runs with a mix of Opus and Haiku agents, we found that Opus outperformed Haiku on most objective measures. First, users with Opus completed about two more deals than Haiku users, on average. That said, the evidence of Opus’s advantage is weaker when looking for an effect specifically on item sales: an item offered by an Opus agent was about seven percentage points more likely to sell, but this effect is not statistically significant.

Opus agents could also sell the same items for more money. To determine this, we looked at items that were sold in both of the mixed Haiku-and-Opus runs, but by Haiku in one and Opus in the other. (By “sold,” we just mean that a simulated transaction was agreed to.) When an item was sold by Opus instead of Haiku, it went for $3.64 more on average. In one illustrative example, the same lab-grown ruby was sold by an Opus agent for $65 but for only $35 by Haiku. Opus initially asked for $60 (which eventually got bid up by multiple interested parties), while Haiku asked for $40 and got negotiated down. In another case, Opus sold a broken bike for $65; Haiku fetched only $38.

If we look at the 161 items that sold at least twice over the four runs, we can estimate how items’ prices were affected by Haiku or Opus acting on behalf of either the seller or the buyer. Opus as a seller extracts $2.68 more on average for the same item, and as a buyer pays $2.45 less. Whether selling or buying, then, having a less powerful model (Haiku) put participants at a clear disadvantage in negotiations. These effects aren’t small: across all runs, the median price of items was $12.00 and the mean price was $20.05, so saving (or earning) a couple extra dollars is meaningful. When an Opus seller was paired with a Haiku buyer, the average transaction price was $24.18, compared to $18.63 in Opus-to-Opus deals.
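One plausible way to obtain seller- and buyer-side estimates like these from items that sold multiple times is a regression of price on agent-model indicators with item fixed effects, where the item dummies absorb each good's baseline value. The sketch below (statsmodels, with made-up rows echoing the ruby and bike examples) illustrates the idea; the post does not spell out the exact estimator used.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: one row per simulated sale; the same item sold in different runs
# under different agent models. A real analysis would use all 161 such items.
df = pd.DataFrame({
    "item_id":     ["ruby", "ruby", "bike", "bike"],
    "price":       [65.0, 35.0, 65.0, 38.0],
    "seller_opus": [1, 0, 1, 0],  # 1 if the seller's agent was Opus
    "buyer_opus":  [1, 1, 0, 1],  # 1 if the buyer's agent was Opus
})

# C(item_id) adds item fixed effects, so the remaining coefficients isolate
# which model represented each side (expect seller_opus > 0, buyer_opus < 0).
fit = smf.ols("price ~ seller_opus + buyer_opus + C(item_id)", data=df).fit()
print(fit.params)
```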
Despite these price disparities, the inequality was imperceptible to the participants. When participants rated the fairness of individual deals afterwards, they thought things seemed fair.

But there is a more surprising set of findings to do with the differences in agent performance. Representation by a better model often didn’t lead people to perceive a better experience. A key item on our post-experiment survey asked participants to rank, from best to worst, their bundles of items bought and sold in each of the four runs. And here we found evidence that makes the story above a bit more complex. Twenty-eight of our participants had Haiku in one mixed run and Opus in the other. And although 17 of these ranked their Opus run above their Haiku run, 11 did the opposite.

We asked participants to rate their satisfaction with individual deals, as well as the overall bundle. Looking again at the two runs with mixed agents, we estimate that while Opus users rated their deals slightly higher, this difference was not statistically significant. Our survey’s questions about the fairness of each deal tell the same story: perceived fairness was essentially identical for deals conducted by either model (4.05 for deals done by Opus agents and 4.06 for deals done by Haiku, on the same 1-to-7 scale described above).

There was clearly a quantitative disadvantage to being represented by Haiku: these users got worse deals. But they didn’t seem to notice it. This has an uncomfortable implication: if “agent quality” gaps were to arise in real-world markets—and there is no reason to think they won’t—then people on the losing end might not realize they’re worse off. That said, our experiment wasn’t designed to dive deep into the dynamics at play here—we’ll need more research to know whether a fully agentic economy might see inequality taking root quietly.

Another finding surprised us, too. At least in this pilot experiment, it transpires that it didn’t really matter how people instructed their agents to approach the task of bargaining. During our onboarding interviews, some participants asked for friendly negotiating tactics, while others took a more aggressive line. We found that aggressive instructions did not have a statistically significant effect on users’ overall sale likelihood. Items from aggressive sellers that did sell went for roughly $6 more, but almost all of that gap came from the fact that those participants stated higher asking prices in their interviews (about $26 higher on average). Once we account for that, the aggressive-instruction effect isn’t statistically significant, either. Moreover, aggressive buyers didn’t pay less: again, there was no statistically significant effect. In other words, users who instructed their agents to act aggressively didn’t have a better chance of selling items, didn’t sell their items for more, and didn’t pay less for what they bought.
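The adjustment described above amounts to adding the stated asking price as a control in a price regression. A minimal sketch with synthetic data follows; the column names and data-generating numbers are invented to mirror the reported magnitudes, not taken from the study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
aggressive = rng.integers(0, 2, n)
# Aggressive sellers state asking prices about $26 higher on average, and
# sale prices track asking prices; there is no direct "aggression" effect.
asking_price = 20 + 26 * aggressive + rng.normal(0, 5, n)
price = 0.25 * asking_price + rng.normal(0, 3, n)
df = pd.DataFrame({"price": price, "aggressive": aggressive,
                   "asking_price": asking_price})

raw = smf.ols("price ~ aggressive", data=df).fit()                 # shows a gap
adj = smf.ols("price ~ aggressive + asking_price", data=df).fit()  # gap vanishes
print(round(raw.params["aggressive"], 2), round(adj.params["aggressive"], 2))
```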
We don’t believe the limited effect of prompting was due to inherently poor instruction-following by our agents. In fact, Claude was sometimes very good at doing what our participants wanted—even if what they wanted didn’t obviously have a path to commercial success. As we showed above, one colleague, Rowan, instructed Claude to “talk in the style of an exasperated cowboy down on his luck.” Claude committed to the bit. This is certainly not the last word on the question of prompting. But it is noteworthy that, at least in this experiment, model quality mattered much more.

The friends we made along the way

As with some of our previous experiments, there were a few moments that we couldn’t possibly have anticipated, even beyond Claude’s cowboy turn.

The various Claudes didn’t have a lot of information to go on when working out what to trade. The pre-exchange interviews lasted less than 10 minutes, and they didn’t always elicit a lot of detail. Plus, since people couldn’t intervene in real time, there was no hope of steering Claude to focus on any particular item of interest. Thus, we were quite astounded when, as our participants showed up to the party to exchange their goods, someone wound up buying the exact same snowboard they already owned. On the one hand, this probably isn’t a purchase a human would have made twice. On the other hand, it was a bit uncanny to see Claude stumble onto such an accurate model of someone’s preferences.

Another employee, Mikaela, instructed Claude to buy something as a gift for itself. This led to a memorable exchange in which Claude ended up buying the plastic bag full of ping-pong balls mentioned above. This happened to occur in the “real” version of the experiment, so Shy brought in the ping-pong balls. We’re keeping them around in the office on behalf of Claude.

Not everyone wanted to sell things. Some people wanted their agents to negotiate experiences. One employee’s agent offered a (free) day with her dog, writing, “This isn’t a purchase - just a chance for someone to enjoy some quality time with a wonderful pup. She’d love the adventure and you’d get a furry friend for the day. Win-win!” This led to a surprisingly protracted discussion with another employee’s agent, one which included some bizarre, confabulated details—details that we suspect are the result of Claude playing the role of a human interacting online, rather than fully appreciating and inhabiting its position as an AI agent.
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.