coSTAR: How to Deploy AI Agents Quickly and Reliably on Databricks

[AI] ai agents | 🔬 Research
#ai-agents #costar #databricks #review #deployment
Original source: [AI] ai agents · Summarized and analyzed by Genesis Park

Summary

Databricks developed the coSTAR framework to deploy AI agents quickly and reliably in production environments. The framework reduces development complexity and shortens deployment time, helping companies put AI tools to work more efficiently. With coSTAR, organizations can lower the technical barriers to building AI agents and reach business outcomes faster.

Body

You'd never let a coding assistant refactor your codebase without a test suite. Without tests, the assistant flies blind. It might fix one function and silently break three others. The tests are what close the loop: run them, observe failures, fix the code, run them again. No tests, no confidence.

At Databricks we continuously develop and deploy agents that cover a wide range of functionality, from new features in the Databricks platform (e.g., the data-engineering, trace-analysis, and machine-learning capabilities in Genie Code), to OSS projects (e.g., the MLflow assistant), to internal engineering workflows (e.g., on-call support or automated code reviewers). These agents can perform long-running tasks, generate thousands of lines of code, and create new data and AI assets, among other things. While we had some basic checks in place early on, we lacked the kind of comprehensive, automated test suite that would let us iterate with confidence.

This post describes how we closed that gap using MLflow, and the coSTAR (coupled Scenario, Trace, Assess, Refine) best-practices methodology we built around it. coSTAR runs two coupled loops: one that aligns judges with human expert judgment so they can be trusted, and one that uses those trusted judges to automatically refine the agent until it passes all test scenarios.

Figure: The coSTAR framework runs two mirrored STAR loops (Scenario → Trace → Assess → Refine). The agent loop (blue) uses judges to auto-score traces and refines the agent to align with the judges. The judge loop (orange) uses human experts to score traces and refines the judges to align with their assessments. Both loops share the same scenarios and traces.

Early on, our development loop looked like this: run the agent, manually review its output, spot a flaw, tell a coding assistant to fix it. Repeat. If this reminds you of writing code without tests and manually QA-ing every change, that's exactly what it was. And it failed in exactly the way you'd predict.

The obvious reaction is "so write tests." But agent testing is structurally different from testing a deterministic function, and several challenges compound at once. These constraints shaped every design decision that follows. They're also what makes this problem interesting: we're not just building a test runner, we're building an automated optimization methodology for stochastic, long-running, multi-step processes where "correct" is a judgment call.

If you squint, agent development maps cleanly onto the dev loop that every engineer already knows:

| Traditional software | Agent development |
|---|---|
| Source code | Agent implementation (including prompts, choice of FMs, tools) |
| Test suite | LLM judges |
| Test fixtures (setup, input, expected output) | Scenario definitions (initial state, prompt, expectations) |
| Test runner / harness | Test harness executes the agent under test, produces traces |
| Test correctness (do tests check the right thing?) | Judge alignment (does the judge agree with human experts?) |
| Coding assistant fixes code until tests pass | Coding assistant refines implementation until judges pass |
| CI runs all tests on every change | CI runs scenarios + judges on every change |
| Production monitoring | Same judges run on live traffic |

This analogy isn't just illustrative. It's the literal architecture of our system, which we call coSTAR: two coupled loops that use Scenario definitions as test fixtures, Trace capture as the test harness, Assess with judges as the test suite, and Refine as the red-green loop. Let's walk through each piece.

In traditional testing, a test fixture sets up the preconditions: create a database, seed it with data, configure the environment. Our equivalent is a scenario definition: a structured description of the initial state, the user prompt, and the expected outcomes. Here's a simplified scenario for testing a Data Analyst agent against a messy dataset:
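As a minimal sketch of what such a definition might look like (the `Scenario` dataclass and its field names below are illustrative assumptions, not Databricks' actual schema):

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- the class and field names are hypothetical,
# not the actual Databricks scenario schema.
@dataclass
class Scenario:
    name: str
    # Initial state: environment setup performed before the run,
    # e.g., tables to create and seed (a test fixture's "setup" step).
    initial_state: dict
    # The user prompt sent to the agent under test.
    prompt: str
    # Natural-language expectations a judge scores the trace against.
    expectations: list[str] = field(default_factory=list)

messy_sales_data = Scenario(
    name="data_analyst_messy_dataset",
    initial_state={
        "tables": {"sales_raw": "fixtures/sales_with_nulls_and_dupes.csv"},
    },
    prompt="Summarize monthly revenue by region for 2024.",
    expectations=[
        "Deduplicates rows before aggregating",
        "Handles null region values explicitly rather than dropping them silently",
        "Reports revenue per region per month",
    ],
)
```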
Each scenario bundles the setup, the input, and the success criteria in one place, just like a test fixture. We maintain a suite of these across different agents, covering common cases, edge cases, and known past failures. The suite grows over time as we discover new failure modes: every bug we find in production becomes a new scenario, the same way every production bug should become a regression test.

Why bother with this structure? Because agent runs are expensive. A single scenario takes minutes to execute. We need to be deliberate about what we test, and we need the scenario definitions to be portable: the same scenario can run against different agent implementations or different versions of the same agent.

To run our test suite, we use a harness that sends each scenario's prompt to the agent under test (AUT). Each execution is captured as an MLflow trace: a structured log of every tool call, every intermediate output, and every artifact the agent produces. Think of it as a flight recorder: it captures everything the agent did, in order.
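MLflow's tracing API is the flight recorder here. A minimal sketch, assuming MLflow ≥ 2.14 with tracing support: `@mlflow.trace` and `mlflow.start_span` are real MLflow APIs, but the agent and tool functions (`run_agent`, `fake_tool`, `run_scenario`) are stand-ins, not Databricks' actual harness.

```python
import mlflow

@mlflow.trace(span_type="TOOL")
def fake_tool(query: str) -> str:
    # Stand-in for a real tool call (SQL query, file write, ...).
    return f"result for {query!r}"

@mlflow.trace(name="agent_under_test")
def run_agent(prompt: str) -> str:
    # Every nested call decorated with @mlflow.trace becomes a child
    # span in the same trace, so the full call tree is recorded.
    intermediate = fake_tool(prompt)
    return f"answer based on {intermediate}"

def run_scenario(scenario) -> str:
    """Execute one scenario against the AUT and capture the trace."""
    with mlflow.start_span(name=scenario.name) as span:
        span.set_inputs({"prompt": scenario.prompt})
        output = run_agent(scenario.prompt)
        span.set_outputs({"output": output})
    return output
```

The resulting trace is what the judges assess: because it records every tool call and intermediate output in order, a judge can score not just the final answer but how the agent got there.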

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
