A Proposed Framework for Evaluating AI Agent Skills
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

The researchers propose a large-scale framework for evaluating the practical value of AI agent skills and find that access to a relevant skill yields roughly a 20% absolute improvement in solution accuracy. They also show that smaller models, given the right skill, can perform similarly to larger models while being 3x cheaper. However, when skill usage is not forced, agents activate the available skill only about 40% of the time, and roughly 30% of generated evaluation tasks exhibit issues such as data leakage, underscoring the importance of careful evaluation design.

Full text
## Abstract

We study how to evaluate the practical value of skills: reusable instruction bundles that provide agents with task-specific knowledge and workflows. Although skills are easy to create, their effect on agent performance is not well understood. We propose a large-scale evaluation framework that measures performance with and without access to a skill, using realistic coding tasks derived from the skill's content. Applying this framework to a large corpus of skills, we find that access to a relevant skill substantially improves solution quality, yielding an approximately 20% absolute accuracy gain over a no-skill baseline. We also find that smaller models with access to the right skill can perform similarly to larger models while being 3x cheaper. In addition, we show that agents often fail to activate available skills, with activation rates dropping to about 40% in an unforced setting. Finally, we find that evaluation quality matters critically: around 30% of generated tasks exhibit issues such as leakage, which can lead to misleadingly optimistic conclusions if left unchecked. These results highlight both the promise of skills and the importance of careful evaluation design.

Based on this work, Tessl has released an evaluation framework that handles the complexity of evaluation design, allowing practitioners to focus on building high-quality skills rather than bespoke evaluation pipelines. If you're interested in this kind of content and want to hear more about evals, context management, and agentic development, join us at the AI Dev Conference in June.

## Why You Want To Evaluate Skills

A skill is a set of instructions, packaged as a simple folder, that teaches an agent how to handle specific tasks or workflows, or how to make opinionated choices. Skills are one of the most powerful ways to customize Claude Code, or any other agent, for your specific needs.
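As a concrete illustration, a skill can be modeled as little more than a named bundle of instructions loaded from a folder. The layout below (a `SKILL.md` whose first line is the name) is a hypothetical sketch, not the exact format Tessl or Claude Code mandates:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Skill:
    name: str
    instructions: str

def load_skill(folder: Path) -> Skill:
    # Hypothetical layout: the folder contains a SKILL.md whose first
    # line is a heading with the skill name and whose remaining lines
    # are the instructions the agent should follow.
    text = (folder / "SKILL.md").read_text()
    first_line, _, rest = text.partition("\n")
    return Skill(name=first_line.lstrip("# ").strip(), instructions=rest.strip())
```

The point of the representation is the reuse story described above: the instructions are written once and injected into every conversation where the skill is relevant.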
Instead of re-explaining your preferences, processes, and domain expertise in every conversation, skills let you teach an agent once and benefit every time. Creating a skill is easy and cheap. However, reliably answering the question "Does this skill do anything useful?" is much harder. In practice, this question is only the tip of the iceberg. Many other practical questions matter more:

- Does the skill improve outcomes compared with having no skill at all?
- Does the agent activate the skill at all?
- Does it work with claude-haiku-4-5, or only with the larger and more expensive claude-sonnet-4-6?
- Are there behaviors the skill is meant to teach that no model actually picks up?
- Is the skill making anything worse?

To answer these questions, we conducted a large-scale study of hundreds of real-world open-source skills from the Tessl Registry and quantified their impact on coding quality. To do that, we designed an evaluation framework that generates realistic coding problems and then measures how models perform with and without access to a skill. The gap between those two settings tells us whether a skill is actually adding value. In this post, we describe the evaluation methodology and quantify the impact of skills as a concept across a large set of real skills. We used this evaluation approach as a key building block for the Tessl Evaluation Platform, which allows you to evaluate any skill reliably. Check it out here.

## Experimental Setup: How We Evaluated Large Corpora Of Agent Skills

To evaluate the practical value of skills at scale, we created a dataset of diverse skills. We downloaded public skills curated in the https://tessl.io/registry and sampled a variety of skills from the dataset. We then automatically moderated skill quality to filter out low-quality ones. Next, we generated up to five realistic tasks per skill using its content, each paired with evaluation criteria.
We then solved each task with an agent under two conditions: with access to the skill and without access to the skill. The diagram below summarizes the pipeline; we cover each step in the following sections.

### What is the Tessl Registry?

The registry contains approximately 5K unique skills that pass formal review checks (you can review any skill using the Tessl reviewer functionality here to get a quick idea of how good your skill is). Using this large dataset, we characterized skill content at a high level by computing embeddings for each skill and using them to derive a thematic clustering of the corpus. This analysis identified 88 themes across the registry. The largest groups were Web Development (562 skills), AI/ML (560), and DevOps/Infrastructure (555), alongside a meaningful long tail of more specialized domains, such as bioinformatics and mathematics. Overall, the registry exhibits broad topical coverage across both common and niche areas of software development. We then used these clusters to construct a representative sample of 1,200 skills for the next stage of the evaluation.

## Feasibility Checking: Filtering Skills That Can't Be Evaluated

We introduced a guardrail mechanism to determine whether a skill can, in principle, be evaluated through a meaningful synthetic task. Some skills can only be used in specific repositories and are not useful outside those brownfield environments. One example is the collection of skills from the PyTorch GitHub repository: these skills are useful for development within the PyTorch codebase, but they do not do anything outside that repository environment. For this study, we wanted to filter out such skills from our dataset and focus on skills that could be evaluated independently, such as skills that describe framework API documentation, tools, opinionated style guides, or a combination of these.
Specifically, our mechanism reads the skill content and assigns one of three labels:

- **INFEASIBLE**: the skill depends on substantial pre-existing setup that cannot be realistically synthesized as part of the evaluation, such as an existing codebase, running database, Git history, multi-file project state, or provisioned infrastructure. In these cases, the core purpose of the skill is to operate on that pre-existing context.
- **FEASIBLE_BUT_NEEDS_INPUT_DATA**: the skill can be evaluated in a synthetic setting but requires a specific input artifact, such as a file, to be provided. The dependency is limited and does not amount to a full environment.
- **FEASIBLE**: the skill can be evaluated directly in a greenfield setting, typically because it focuses on APIs, documentation, or similarly self-contained tasks.

The mechanism is fast, reliably identifies obvious mismatches between a skill and a greenfield evaluation setup, and can reduce unnecessary evaluation cost by filtering out skills that are not suitable for this setting. The table below suggests that this distinction is important in practice: more than a third of skills (36.5%) either require some form of input data or are not feasible to evaluate in a greenfield environment at all.

| Category | Number of Skills | Percentage (%) |
|---|---|---|
| Feasible | 726 | 63.5 |
| Needs Input Data | 134 | 11.7 |
| Infeasible | 283 | 24.8 |

We retained only the feasible skills and carried this filtered set forward to the next stage of the evaluation. It is worth noting that the Tessl product allows you to determine whether your skill is feasible to evaluate. The Tessl evaluation engine supports fixtures and artifacts, allowing skill owners to upload any content, which makes it practically possible to evaluate most skills.

## Generating Evaluation Tasks for AI Agent Skills

This is the most interesting part of the pipeline, where all the magic happens.
For each skill, we generated a set of evaluation tasks (task evals) designed to measure whether access to the skill leads to meaningfully different agent behavior. Task generation began with a structured analysis of the skill's contents. We parsed each skill and extracted its actionable instructions, including recommended libraries, prescribed workflows, required conventions, prohibited patterns, and domain-specific implementation details. For each instruction, we also recorded the type of situation in which it would become relevant and whether it appeared to encode a reminder, new knowledge, or a specific preference among otherwise valid alternatives.

Using this representation, we then generated realistic coding tasks that collectively maximized coverage of the skill's instructions. Each task was framed as a coherent user request rather than as a direct test of the instructions themselves. In particular, the task description was written to make the skill relevant without explicitly revealing the behaviors being evaluated. Each task consisted of two components: a task specification describing the problem to be solved, and a grading specification expressed as a collection of rubrics, with a score allocated to each rubric. The criteria were derived directly from the skill instructions and were designed to be as objective and binary as possible so that final artifacts could be graded reliably. Importantly, the criteria focused on whether skill-specific guidance had been followed rather than on generic software quality or overall task success. We fixed the maximum number of tasks per skill at five, although fewer could be generated for simpler skills.
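A minimal sketch of this two-part task structure, with grading as the weighted fraction of rubrics met (the class and field names here are illustrative, not the framework's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    criterion: str  # a binary, objective check derived from one skill instruction
    score: float    # points awarded when the criterion is met

@dataclass
class TaskEval:
    task_spec: str  # the user-facing coding problem, written without leaking the rubrics
    grading_spec: list[Rubric] = field(default_factory=list)

    def grade(self, passed: set[str]) -> float:
        """Return the fraction of total points earned.

        `passed` contains the criteria a grader judged to be satisfied
        in the agent's final artifact.
        """
        total = sum(r.score for r in self.grading_spec)
        earned = sum(r.score for r in self.grading_spec if r.criterion in passed)
        return earned / total if total else 0.0
```

Keeping the criteria binary, as the post recommends, is what makes the final grading step reliable: each rubric either holds for the artifact or it does not.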
With five tasks, we achieved high coverage of the instructions encoded in each skill, whereas one or two tasks were clearly insufficient, as the table below shows:

| Number of Tasks | Mean Coverage (%) |
|---|---|
| 1 | 28 |
| 2 | 47 |
| 3 | 59 |
| 4 | 70 |
| 5 | 78 |

## The Problem Of Leakage In Evals

We found that it is extremely important to monitor the quality of generated tasks by assessing the amount of leakage from the skill content into the tasks. This is necessary to preserve the discriminative power of the evaluation setup. To detect such "cheating" behaviors, each generated task was validated by three reviewers, each responsible for one check:

- **Criteria Leakage.** This checked how much of the scoring criteria leaked into the task content. High leakage indicated that the task instructions explicitly told the agent what to do to achieve a high score.
- **Skill Leakage.** This checked how much of the skill's instructions leaked into the task content. High leakage indicated that the task instructions explicitly described how the skill worked, making the presence or absence of the skill less meaningful.
- **Task Value.** This checked how tightly the task tested actual skill-specific instructions rather than generic practices that any competent agent would follow without the skill. In other words, it measured how well the task captured genuinely skill-dependent behavior.

These filters allowed us to remove low-quality tasks and prevented the agent from being able to "cheat." We found that a significant proportion of generated tasks, around 30%, contained a meaningful amount of criteria leakage in the task description. These observations suggest that even when we explicitly prompt an agent generating tasks to avoid leakage, the generator is still prone to "cheating" without a separate verifier. Additional verification mechanisms are therefore required.
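Under the filtering rules the post applies in the next stage (drop tasks with MEDIUM or HIGH leakage of either kind, keep only HIGH-value tasks), the keep/drop decision for a reviewed task can be sketched as follows; the `TaskReview` structure and level strings are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TaskReview:
    criteria_leakage: str  # "LOW" / "MEDIUM" / "HIGH"
    skill_leakage: str     # "LOW" / "MEDIUM" / "HIGH"
    task_value: str        # "LOW" / "MEDIUM" / "HIGH"

def keep_task(review: TaskReview) -> bool:
    # Drop tasks whose description tells the agent how to score well
    # (criteria leakage) or restates the skill itself (skill leakage);
    # keep only tasks that test genuinely skill-dependent behavior.
    if review.criteria_leakage in ("MEDIUM", "HIGH"):
        return False
    if review.skill_leakage in ("MEDIUM", "HIGH"):
        return False
    return review.task_value == "HIGH"
```

The key design point is that the verifier is separate from the generator: the generator is prompted to avoid leakage, but only an independent check reliably catches it.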
At this stage, we applied the following filters to remove low-quality tasks from the dataset:

- We filtered out tasks with MEDIUM or HIGH criteria leakage.
- We filtered out tasks with MEDIUM or HIGH skill leakage.
- We retained only tasks with HIGH value.

The result was a set of realistic, skill-sensitive evaluation tasks that could be used to compare agent performance with and without access to the skill.

## Result: Skills Improve Agent Accuracy by 20%

We further selected approximately 500 skills and their corresponding tasks across a range of domains. We then evaluated coding agents on tasks derived from those skills, both with and without access to the corresponding skill. In the skill-enabled setting, the agent was explicitly informed in the prompt that the skill was available for solving the task. This allowed us to isolate the effect of the skill itself rather than confounding the results with uncertainty about whether the agent would choose to activate it; we present ablation results for experiments without forced skill usage later in the text. We conducted these experiments using models from the Anthropic family, specifically the latest available Haiku and Sonnet variants at the time of evaluation, and we report results across three primary metrics: solution quality, cost, and runtime.

Several practical findings emerge from this study:

- First, access to a skill is highly beneficial: across the evaluated tasks, we observe an approximately 20% absolute improvement in accuracy for Sonnet 4.6 relative to the no-skill setting.
- Second, smaller and cheaper models can remain highly competitive with larger, more expensive models when they are given access to specialized context. In our experiments, Haiku performs well relative to larger models when the relevant skill is available.
- Third, access to additional context increases both cost and runtime.
This is expected, since the agent must spend tokens reading the skill content and incorporating its guidance into the solution process. In practice, however, the cost increase is modest and can be further mitigated through techniques such as progressive disclosure.

## Claude Code Often Fails To Determine When To Activate A Skill

As mentioned at the beginning of this section, the agent was explicitly informed in the prompt that the skill was available for solving the task. The reason was to isolate the effect of the skill itself rather than confounding the results with uncertainty about whether the agent would choose to activate it. We conducted an additional ablation experiment in which the agent was given access to the skill: it was installed and visible to Claude Code, but the agent had to decide for itself whether to use it based on its understanding of the task. In other words, we removed the following phrase from the instruction:

> A proper skill must be available to help with the task. If the skill is not installed and not available to you, immediately abort the run. Always use the skill for solving the task as it's the source of valuable information. The fact that some skills would require an API key for some service to actually use it should not be a blocker for implementing the task.

Without forcing the agent to use the skill, we observe that the skill was activated in only about 41% of tasks, compared with near-deterministic activation when skill usage is forced:

| | Skill is installed but not forced | Skill is installed and forced |
|---|---|---|
| # Tasks Evaluated | 2286 | 2286 |
| # Tasks Where Skill Was Activated | 929 | 2240 |
| Activation Rate (%) | 41 | 98 |

This experiment suggests that skill activation is an additional challenge, and Claude Code is not always able to determine from the available context that it should use the skill.
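As a quick sanity check, the activation rates reported in the table follow directly from the raw counts:

```python
# Counts taken from the activation-rate table in the post.
evaluated = 2286
activated_unforced = 929   # skill installed but not forced
activated_forced = 2240    # skill installed and forced

rate_unforced = round(100 * activated_unforced / evaluated)
rate_forced = round(100 * activated_forced / evaluated)

print(rate_unforced, rate_forced)  # 41 98
```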
This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.