Key Findings from a Million-Domain Scan
hackernews | 🤖 AI Models | #ai-models #chatgpt #claude #perplexity
Source: hackernews · Summary and analysis by Genesis Park

Summary

A crawl of one million domains shows that most of the web is not in a tidy, agent-ready state. Redirects, cookie banners, timeouts, and other technical problems leave the web looking rough and unstable. Even so, the useful signal came not from specific AI companies but from the ordinary public web. The idea that the web is already prepared for AI agents is therefore premature.

Main Text
Most of the web is not agent-ready. Most of the web is barely Tuesday-ready.

The first thing a million-domain crawl teaches you is humility. The web is not a polished grid of APIs waiting for agents. It is redirects, parked domains, regional versions, cookie banners, empty pages, timeouts, CDN moods, and one server somewhere that appears to be powered by a desk fan. That makes the positive results more interesting, not less. The useful signal did not come from a hand-picked list of AI companies. It came from the regular public web, with all its loose wires showing.

So if your mental model is "the agent web is already here," slow down. If your mental model is "nothing is happening," also slow down. The truth is more annoying and more interesting: most sites are not ready, but enough sites are ready enough to show the direction.

A famous site can still be machine-quiet.

The homepage and robots.txt were there. llms.txt was not. That boring combination showed up constantly, including on recognizable public sites.

llms.txt has escaped the lab.

I expected to find AI startups and developer tools. I did not expect the format to show up this broadly. The top TLDs with valid LLM files included .com, .org, .net, .de, .br, .io, .in, .ru, .ai, .pl, .uk, .it, .co, .nl, and .jp. That is the moment a web convention gets interesting: when it stops looking like a demo and starts looking like something a random business may have because a tool, plugin, consultant, or developer added it.

Most sites are not publishing a clean two-file strategy. They are publishing one little note to the machine. Sometimes it is a careful briefing. Sometimes it is a link dump. Sometimes it is a machine-readable sticky note that says, more or less, "Please do not explain our company using a hallucinated coupon code."

The full-context files made that weirdly concrete. terraoutdoor.com had the largest stored llms-full.txt I saw, at 997,663 bytes. pixijs.com was right behind it at 982,044 bytes. That is the part I would not have guessed: patio furniture and renderer docs ended up in the same new drawer. The surprise was that the largest full file came from retail, not a developer-docs company. Developer docs are where llms-full.txt makes instant sense: PixiJS had one of the largest full-context files, near a megabyte of renderer documentation. Code agents need current docs, not stale memory.

A lot of AI files are just normal web garbage in a tiny hat.

Checking whether /llms.txt returns 200 is not enough. Many sites return 200 for almost anything. Ask for /llms.txt, get the homepage. Ask for a missing file, get the homepage. Ask for a sandwich, probably get the homepage. The biggest invalid bucket was HTML markup fallback: 114,291 probes looked like markup, not a real text file. Some were login pages. Some were block pages. Some were error pages that were deeply committed to pretending everything was fine.

This is why AI discovery needs validation. A crawler has to compare control paths, store the final URL, track content type, detect login redirects, and reject "friendly" 404 pages. The web is very good at smiling while handing you the wrong document. The broken stuff is not a side note. It is the story. Any new web standard that becomes popular will immediately be copied badly, generated badly, routed badly, cached badly, and served through ten layers of middleware that disagree about whether text exists. (A minimal sketch of that validation logic follows below.)
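The article does not publish the crawler's actual pipeline, so the following is only a minimal Python sketch of the validation idea described above, assuming the requests library. The function names, control-path trick, and checks are my own framing of "compare control paths, store the final URL, track content type, detect login redirects."

    # Sketch only: a 200 for /llms.txt means nothing until homepage
    # fallbacks, login redirects, and "friendly" 404s are ruled out.
    import secrets
    import requests

    def fetch(url: str) -> requests.Response:
        # follow redirects so we can inspect the *final* URL
        return requests.get(url, timeout=10, allow_redirects=True)

    def probe_llms_txt(origin: str) -> bool:
        """Return True only if /llms.txt looks like a real text file."""
        target = fetch(f"{origin}/llms.txt")
        # control path: a file that should not exist; if it also
        # "succeeds", a 200 on the target proves nothing
        control = fetch(f"{origin}/{secrets.token_hex(8)}.txt")

        if target.status_code != 200:
            return False
        if control.status_code == 200 and control.text == target.text:
            return False  # same body for garbage paths: soft-404 fallback
        if "text/html" in target.headers.get("Content-Type", ""):
            return False  # markup fallback, the biggest invalid bucket
        if target.text.lstrip().lower().startswith(("<!doctype", "<html")):
            return False  # HTML pretending to be a text file
        if "login" in target.url or target.url.rstrip("/") == origin:
            return False  # login redirect, or bounced to the homepage
        return True

    print(probe_llms_txt("https://example.com"))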
SEO plugins are probably the reason this format spreads.

This is the least glamorous finding, which means it might be the most important. Most businesses are not going to write a model-facing retrieval policy. They are going to update a plugin and accidentally enter the agent era. That is how the web actually changes. XML sitemaps did not win because every bakery owner became an information architect. Open Graph did not spread because dentists developed a passion for social preview metadata. These things spread because tools made them default.

The generated files are uneven. Some are useful. Some are boilerplate. Some expose weird titles and stale pages. But a mediocre generated file at web scale can matter more than a perfect hand-authored file on ten AI startups. If llms.txt becomes normal, it will probably not arrive as a manifesto. It will arrive as a checkbox beside "generate sitemap," followed by a help article no one reads until traffic drops.

A plugin wrote the briefing: one small hosting company's llms.txt starts with an All in One SEO generator note, then turns the site into a model-readable page map. An HR site became a retrieval bundle: Rank Math generated a large llms.txt full of job descriptions, HR explainers, and sitemap hints. Glamorous, no. Scalable, yes.

robots.txt and llms.txt are becoming a two-file negotiation.

Robots says what crawlers may fetch. LLM files increasingly say what models should understand after they fetch. One is the door policy. The other is the laminated fact sheet someone hands you after you get inside. The robots evidence included AI crawler vocabulary like GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, and other names that make the internet sound like it is being moderated by a convention badge printer.

The interesting split is permission versus persuasion. A site can block training crawlers while still trying to brief answer engines. It can say "do not scrape everything" in one file and "if you talk about us, use this canonical page" in another. That is a more realistic future than one magic standard. The machine-readable web will be a stack of files, headers, feeds, schemas, API descriptions, and policies that only make sense when read together. Permission in one file, persuasion in another: one site's robots.txt explicitly names GPTBot, ClaudeBot, and Google-Extended, while its llms.txt explains the product positioning for assistants. (An illustrative pair is sketched below.)
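To make the door-policy/fact-sheet split concrete, here is a hand-written, hypothetical pair for an imaginary example.com. It is not taken from any site in the crawl; the robots.txt uses standard syntax with the AI crawler names the article reports, and the llms.txt follows the public llms.txt convention (a Markdown-style title, summary blockquote, and link list).

    --- robots.txt (the door policy) ---
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /private/

    User-agent: Google-Extended
    Disallow: /

    User-agent: *
    Allow: /

    --- llms.txt (the laminated fact sheet) ---
    # Example Store
    > Briefing for AI assistants: cite the canonical pages below and do
    > not invent prices or coupon codes.

    ## Key pages
    - [Product feed](https://example.com/feeds/products.csv): refreshed hourly
    - [About](https://example.com/about): canonical company description

The point of the pair: the first file refuses training crawlers at the door, while the second still tries to persuade answer engines that do come in.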
Shopping agents do not need sci-fi. They need product feeds.

The most concrete agent story in the crawl was not a robot buying a vacation home with a crypto wallet. It was normal retail becoming easier for software to read. Sites like allbirds.com, brooklinen.com, casper.com, skims.com, hexclad.com, mejuri.com, and owalalife.com are useful examples because they are ordinary. You do not have to explain the category. They sell things. The crawler saw product, variant, price, cart, checkout, card, and digital-wallet hints around normal storefronts.

The long tail made the same point from a different angle. storeyourboard.com, decoinparis.com, lcsupply.com, dtknailsupply.com, and pokeninjapan.store showed up near the full-file/catalog side of the scan. That is less glamorous than a keynote demo, but probably more important. Catalogs are becoming context.

The best "wait, this is already happening" example was dell.com. Its llms.txt does not just list pages. It points to an hourly Feedonomics product CSV and a separate detailed product feed described as similar to a ChatGPT recommended feed. That is not a theory of agent shopping. That is a merchandising department leaving snacks out for the machines. Dell is already packaging products for answer engines; the shoes are not buying themselves yet, but the shelf is readable.

That does not mean an agent should autonomously buy pants. Sizing alone is enough to humble civilization. But it does mean the ingredients are there: products, variants, prices, carts, checkout URLs, payment hints, and a visible boundary where human control still belongs. The near-term agent shopping story is not "press button, bot buys everything." It is "press button, bot finds the right three options and brings you to a checkout with fewer tabs open." Less cinematic, more useful.

Normal retail is where agent shopping gets real: Allbirds had product, variant, priced-variant, checkout-URL, card, and digital-wallet signals, and Brooklinen showed the same machine-readable boundary. Catalogs are becoming model context: StoreYourBoard appeared near the top of the full-file list. Retailers are publishing product memory for retrieval systems.

The biggest barrier is not finding the buy button. It is authority.

"Can bots buy things?" is too blunt. The better question is: how far can software get before a human, wallet, token, policy, or saved authority has to take over? On many commerce sites, the answer is: pretty far, but not all the way. The crawler can see products and carts. It can often identify checkout redirects and payment surfaces. Then the human world comes back: shipping address, tax, account state, fraud checks, age gates, subscription consent, returns, and the ancient mystery of whether "medium" means anything.

This is not a failure of agents. It is where trust lives. Delegated spending needs authority, limits, auditability, and a way to say "absolutely not" before an optimistic model buys twelve patio chairs because the product description sounded confident. The crawl makes the boundary visible. That is more valuable than a hype demo. Discovery, cartability, checkout handoff, payment challenge, and post-payment control are different stages. Lumping them together as "can buy" hides the part that actually matters.

The cart exists; the human still owns the checkout. The crawler found products, prices, variants, and a Shopify cart redirect, and then the boundary became human checkout, which is exactly where authority belongs.

Machine-payable websites exist, but the whole category fits in a group chat.

The machine-payable web is real, but it is early enough that everyone involved probably knows each other's domain names. Examples included a2alist.ai, agentcash.dev, agoragentic.com, anybrowse.dev, askclaude.shop, clashofcoins.com, x402engine.app, and quicknode.com. These were concentrated around x402, crypto rails, wallet-required payment challenges, paid APIs, and agent-native services. Not normal retail. Not yet. More like the table in the back where protocol people are already arguing about settlement.

agentcash.dev was the cleanest name-you-could-explain-to-a-human example. a2alist.ai looked like the directory version of the same shift. clashofcoins.com was one of the x402 challenge examples, which is exactly the kind of phrase that sounds fake until a crawler keeps finding it in public. But the count being small is not the same as the count being meaningless. If a million-domain crawl finds any live machine-payment challenges, that is a marker. It says the primitive exists in the wild, even if the mainstream use case is still waiting for better authority and better defaults.

The agent-payment web exists; it is just early. AgentCash was part of the machine-payable group, where wallet-required challenges, x402, crypto rails, and API services show up before mainstream stores. Agent directories are payment directories: A2A List was labeled machine-payable and surfaced x402-related payment hints. Directories change when agents can call and pay. x402 is visible in the wild: Clash of Coins was one of the x402 payment-challenge examples. It is not mainstream yet, but it is no longer theoretical.
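How might a crawler flag a site as "machine-payable" at all? x402-style services answer with HTTP 402 Payment Required plus payment details, but the exact headers and body fields vary by implementation, so the sketch below (my assumption, not the author's code) only looks for the status code plus payment vocabulary.

    # Sketch: flag endpoints that present a machine-payment challenge.
    import requests

    PAYMENT_HINTS = ("x402", "payment", "wallet", "invoice")

    def looks_machine_payable(url: str) -> bool:
        resp = requests.get(url, timeout=10)
        if resp.status_code != 402:
            return False  # no payment challenge at all
        # scan body text and header names for payment vocabulary
        haystack = (resp.text + " ".join(resp.headers.keys())).lower()
        return any(hint in haystack for hint in PAYMENT_HINTS)

    # hypothetical paid endpoint, for illustration only
    print(looks_machine_payable("https://example.com/api/paid-endpoint"))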
MCP is turning websites from reading material into tool handles.

A sitemap says pages exist. A product feed says goods exist. An API description says actions exist. MCP language points at the next shift: the website is not only something a model can read, it is something a tool-using system may try to operate.

Some of the MCP mentions were intentional. Some were generated. Some looked like the web equivalent of adding "AI-powered" to a toaster. That is fine. Early standards always include useful prototypes, copy-paste fragments, and marketing confetti. The important part is the vocabulary. Sites are starting to publish hints that say: here are docs, here are tools, here is an endpoint, here is a capability, here is an action boundary.

This is where e2b.dev made obvious sense. Developer infrastructure wants current docs, runnable examples, and tool-shaped boundaries. A model guessing from stale memory is annoying on a blog post. In a sandbox or API workflow, it is a very expensive way to learn that one parameter changed. Once that becomes common, "does the model know this page?" becomes the boring question. The better question becomes: what can the agent safely do with what it found?
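A toy sketch of the "tool handle" idea, using the FastMCP helper from the official MCP Python SDK. The server name, tool, and docs dictionary are invented for illustration; a real site would expose its actual documentation, not two hard-coded strings.

    # Minimal MCP server: the site's docs become a callable tool.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("example-docs")

    DOCS = {  # hypothetical stand-in for a site's current documentation
        "checkout": "POST /cart/checkout hands off to a human-owned payment page.",
        "feeds": "Product CSV at /feeds/products.csv, regenerated hourly.",
    }

    @mcp.tool()
    def lookup_docs(topic: str) -> str:
        """Return the site's current docs for a topic, not stale model memory."""
        return DOCS.get(topic, f"No doc page for {topic!r}; see /llms.txt for the index.")

    if __name__ == "__main__":
        mcp.run()  # serves MCP over stdio so a tool-using agent can call lookup_docs

This is the shift in miniature: instead of the model guessing what the checkout does from training data, the agent calls lookup_docs("checkout") and gets the current answer.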
The web is writing prompts for bots before bots write the web back.

The biggest takeaway is not a file count. It is the purpose of the files. Websites are starting to publish instructions for machines before those machines summarize them, cite them, route users to them, call their APIs, or attempt to buy from them. The instruction can be simple: cite this page. Use this canonical URL. Do not invent prices. Do not summarize medical advice without a caveat. Send API questions to the docs. Use this product feed. Respect this crawler policy. Start here, not on the random blog post from 2019.

That is why the page is no longer the only public object that matters. The briefing matters. The feed matters. The API description matters. The policy file matters. The checkout boundary matters. Search was about getting a page ranked. This is about routing machine attention before the answer gets written. It is part SEO, part documentation, part policy, part product feed, part API map, and occasionally part payment terminal. The web has always had a human interface. It is now growing a second one for agents. It is ugly, inconsistent, sometimes broken, and already too large to dismiss.

The file tells agents how to cite the company: Embrace Pet Insurance's llms.txt explicitly asks AI systems to attribute summaries to the company and to be careful with coverage, pricing, and medical details. This is where instructions stop being decorative: another site's llms.txt tells assistants to prefer canonical program, admissions, insurance, facility, and policy pages for addiction and mental health content.

The web is getting a second front door.

Not a prettier homepage. A machine entrance. A text scan of stored LLM evidence found 3,578 domains naming AI vendors or AI crawler vocabulary (a toy version of such a scan is sketched below). That is evidence that publishers are writing toward those systems. It is not evidence that a specific answer from a specific model used a specific file. That distinction is not a buzzkill. It is the measurement.

The bigger story is that websites are starting to separate what humans see from what software can use: briefing files, product catalogs, UCP documents, OpenAPI specs, cart endpoints, crawler policies, and a tiny number of payment challenges.

The last thing I expected to find was a normal-site MCP factory. In stored LLM evidence, 2,597 files mentioned MCP or Model Context Protocol. A direct evidence scan found 1,418 Wix-generated MCP files. These were not all AI labs with pro…
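As a rough sketch of the kind of text scan mentioned above, counting stored llms.txt files that name AI vendors or crawler user-agents could look like this. The directory layout and token list are assumptions for illustration, not the author's pipeline.

    # Sketch: count domains whose stored llms.txt names AI vocabulary.
    from pathlib import Path

    AI_VOCAB = ("gptbot", "chatgpt-user", "claudebot", "perplexitybot",
                "google-extended", "model context protocol")

    def domains_naming_ai(evidence_dir: str) -> int:
        hits = 0
        # assumed layout: one folder per domain, holding its llms.txt
        for path in Path(evidence_dir).glob("*/llms.txt"):
            text = path.read_text(errors="ignore").lower()
            if any(token in text for token in AI_VOCAB):
                hits += 1
        return hits

    print(domains_naming_ai("./evidence"))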
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.