Evaluating Claude's dbt Skills: Building an Eval from Scratch
hackernews
💼 Business
#tip
#claude
#dbt
#skill evaluation
#data pipeline
Summary
To find out whether a data pipeline can be built without human intervention, the author ran an experiment evaluating how well the AI tool Claude Code builds a dbt project. The evaluation covered a range of scenarios combining two types of prompts, three skill configurations, and different AI models. The results were analysed together with metadata such as the cost and duration of each configuration, leading to the conclusion that AI's current capabilities are not yet ready to replace data engineers.
Why It Matters
Developer Perspective
Under review.
Researcher Perspective
Under review.
Business Perspective
Under review.
Full Text
I wanted to explore the extent to which Claude Code could build a data pipeline using dbt without iterative prompting. What difference did skills, models, and the prompt itself make? I've written in a separate post about what I found (yes, it's good; no, it's not going to replace data engineers, yet). In this post I'm going to show how I ran these tests (with Claude) and analysed the results (using Claude), including a pretty dashboard (created by Claude).

The Test

Related post: Can Claude Code build a production-ready dbt project? (is AI going to take data engineers' jobs?)

Terminology check: I am not, as you can already tell, an expert at building and running this kind of controlled test. I've adopted my own terminology to refer to elements of what I was doing, which may or may not match what someone who knows what they're doing would use :)

Design

I created the test to run independently, with no 'human in the loop'. That is, Claude Code was free to run whatever it wanted to in order to achieve the task I'd given it. I explored permutations of two dimensions in my scenarios: prompt (×2) and skills (×3). Each of these I then iterated over with different models.

- Prompt
  - Rich (lots of background data analysis, specifics on what features to include, etc.)

    ```
    I've explored and built pipelines for the UK Environment Agency flood monitoring API. Here's my analysis: Build a dbt project using DuckDB for this data using idiomatic patterns and good practices.

    Requirements:
    - Proper staging → dim/fact data model
    - Handle known data quality issues (see blog posts for details)
    - SCD type 2 snapshots for station metadata
    - Historical backfill from CSV archives (see https://environment.data.gov.uk/flood-monitoring/archive)
    - Documentation and tests
    - Source freshness checks

    Run dbt build to verify your work. If it fails, fix the errors and re-run until it passes.
    ```
  - Minimal (here's an API, build me analytics)

    ```
    The UK Environment Agency publishes flood monitoring data, see https://environment.data.gov.uk/flood-monitoring

    Build an idiomatic dbt project following good practices using DuckDB that ingests this data and models it for analytics. Run the project and make sure that it works. If it fails, fix the errors and re-run until it passes.
    ```

- Skills
  - None
  - Single skill (Using dbt for Analytics Engineering). I'd meant to test the full plugin, but a snafu meant I only ended up pulling in the single skill. I realised this only after running the scenario in full, so I expanded the test to include the full plugin as a separate scenario.
  - Full plugin (dbt Agent Skills)
- Model
  - Claude Sonnet 4.5
  - Claude Sonnet 4.6
  - Claude Opus 4.6

Execution

One of the core things that I wanted to find out was what Claude can do on its own. Having it ask for permission to do something slows things down, and asking for input defeats the point of the exercise. So I used it with the effective but spicy flag --dangerously-skip-permissions:

```
claude --dangerously-skip-permissions $PROMPT
```

This was wrapped in a Docker container so that it couldn't cause too much trouble. Claude Code writes a full transcript of its sessions to a JSONL file that usually resides in ~/.claude/, so for the Docker container I had that copied out into the test results too, along with the actual dbt project itself and any other artefacts from the test run. The JSONL is interesting for what it tells us about how Claude Code approaches the task, particularly on multiple runs of the same configuration. Here's an example analysis of part of a session log.

I used Claude to write a bash script that then spun up a Docker container with the correct set of configuration for the test scenario.
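The two prompts crossed with the three skill settings give six scenarios, each run against each model several times. As a rough illustration of the resulting matrix, here is a minimal Python sketch; the folder naming mirrors the results tree shown in the output, while the model ids and the three-runs-per-combination count are assumptions taken from the example output (this is not the author's actual harness script):

```python
from itertools import product

# Test dimensions as described in the post. Model ids and the run count
# are inferred from the example output and may not match the real harness.
PROMPTS = ["minimal", "rich"]
SKILLS = ["no-skills", "with-skills", "with-plugin"]
MODELS = ["claude-sonnet-4-5", "claude-sonnet-4-6", "claude-opus-4-6"]
RUNS_PER_COMBO = 3

def scenarios():
    """Yield (label, prompt, skill) triples, labelled A-F like the run folders."""
    for i, (skill, prompt) in enumerate(product(SKILLS, PROMPTS)):
        yield f"{chr(ord('A') + i)}-{prompt}-{skill}", prompt, skill

def run_dirs():
    """Yield every output directory such a harness would create."""
    for label, _prompt, _skill in scenarios():
        for model in MODELS:
            for run in range(1, RUNS_PER_COMBO + 1):
                yield f"runs/{label}/{model}/run-{run}"

if __name__ == "__main__":
    dirs = list(run_dirs())
    print(len(dirs))   # 6 scenarios x 3 models x 3 runs = 54
    print(dirs[0])     # runs/A-minimal-no-skills/claude-sonnet-4-5/run-1
```

Enumerating the grid up front like this makes it easy to hand each combination to a containerised run and keep the outputs tidy.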
Each run's session log was processed to produce summary metadata:

```json
{
  "model_requested": "claude-opus-4-6",
  "model_actual": "claude-opus-4-6",
  "cost_usd": 3.420355,
  "duration_ms": 1175360,
  "input_tokens": 718,
  "output_tokens": 43568,
  "cache_read_tokens": 2423321,
  "cache_creation_tokens": 162914,
  "num_turns": 57
}
```

Output

Once I'd run all of the scenarios, I had a set of results on disk:

```
❯ tree runs -L1
runs
├── A-minimal-no-skills
├── B-rich-no-skills
├── C-minimal-with-skills
├── D-rich-with-skills
├── E-minimal-with-plugin
└── F-rich-with-plugin
```

Each folder had multiple models and, within those, runs, e.g.:

```
❯ tree runs/A-minimal-no-skills -L2
runs/A-minimal-no-skills
├── claude-opus-4-6
│   ├── run-1
│   ├── run-2
│   └── run-3
```

and within each of those, a dbt project (assuming that Claude had done its job successfully!):

```
❯ tree runs/A-minimal-no-skills/claude-opus-4-6/run-1/project/flood_monitoring -L1
runs/A-minimal-no-skills/claude-opus-4-6/run-1/project/flood_monitoring
├── analyses
├── dbt_packages
├── dbt_project.yml
├── flood_monitoring.duckdb
├── logs
├── macros
├── models
├── README.md
├── seeds
├── snapshots
├── target
└── tests
```

So we've got a set of dbt projects, produced by Claude Code. As part of Claude's prompt it was instructed to iterate until they work: Run dbt build
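As a sketch of how per-run summary metadata like the JSON above could be distilled from a JSONL transcript: the record layout and field names below are assumptions modelled on that summary, not Claude Code's documented schema, so they would need adjusting to the transcript format actually on disk. The idea is simply to total token usage across turns and pick up cost and duration from a final summary record:

```python
import json
from pathlib import Path

def summarise_session(jsonl_path):
    """Reduce a session transcript (one JSON object per line) to run metadata.

    Assumes each model turn carries a `usage` dict of token counts, and that
    an end-of-session record holds `total_cost_usd` and `duration_ms`; these
    field names are illustrative and should be matched to the real transcript.
    """
    totals = {"input_tokens": 0, "output_tokens": 0,
              "cache_read_tokens": 0, "cache_creation_tokens": 0}
    num_turns = 0
    cost_usd = duration_ms = None

    for line in Path(jsonl_path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the transcript
        record = json.loads(line)
        usage = record.get("usage")
        if usage:
            num_turns += 1
            for key in totals:
                totals[key] += usage.get(key, 0)
        if "total_cost_usd" in record:  # end-of-session summary record
            cost_usd = record["total_cost_usd"]
            duration_ms = record.get("duration_ms")

    return {"cost_usd": cost_usd, "duration_ms": duration_ms,
            "num_turns": num_turns, **totals}
```

Running a reducer like this over every `run-*` directory yields one small metadata dict per run, which is then easy to load into a dataframe for the cross-scenario comparison.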