I got tired of checking Datadog every morning, so I made an AI do it
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
To avoid the daily chore of checking Datadog every morning, the developer built an AI-powered automation system. The system analyzes monitoring data on its own and surfaces only the significant anomalies and alerts, substantially improving the developer's efficiency. The project is a concrete example of using AI to automate repetitive operational work so that developers can focus on more important tasks.
Article
## Confession Time

I don't want to check Datadog every morning. There, I said it.

Don't get me wrong: I love that we have monitoring. I love that alerts exist. I love that somewhere, a dashboard is faithfully tracking every 5xx error our platform produces. It's just a tedious job begging to be automated.

At Quickchat, we handle thousands of conversations daily across Slack, Telegram, WhatsApp, Intercom, and more. Our Datadog is… busy. Every morning the ritual is the same: scroll through Datadog alerts on Slack, squint at error spikes, mentally classify each one as "real problem" or "meh, transient," and then finally start actually writing code around 11am.

I figured there had to be a lazier way.

## The Laziest Possible Solution

As any self-respecting programmer knows, the best kind of work is the kind you automate away. So I asked myself: what if I never had to open Datadog again? What if an AI could check it for me, figure out what's actually broken, dig through the codebase, fix it, and open a PR, all before I finish my first coffee?

Here's what I built in about 30 minutes (because spending more than that would defeat the purpose of being lazy):

- Datadog MCP Server gives Claude Code access to live monitoring data
- A Claude Code skill tells the AI how to triage alerts like a responsible engineer (something I aspire to be)
- A cron job kicks it off every weekday at 8am
- Parallel AI agents each grab an issue, spin up isolated worktrees, and open PRs

Let me walk you through it, slowly, because I'm in no rush.

## Step 1: Plug Datadog Into Claude Code (2 minutes)

The Model Context Protocol (MCP) lets AI tools talk to external services. Datadog has a remote MCP server with OAuth, so there are zero API keys to manage. My favorite kind of setup: the kind where I barely have to do anything.

One file in the repo root:

```json
// .mcp.json
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp"
    }
  }
}
```

Done.
Every developer on the team gets it automatically. First launch asks you to click a button in the browser to authenticate. Maximum effort: one click. (Swap `datadoghq.eu` for `datadoghq.com` if you're on the US1 region.)

## Step 2: Teach the AI to Do My Job (10 minutes)

Claude Code has this concept of skills: markdown files that live in `.claude/skills/` and act as reusable prompt templates. If you're new to Claude Code workflows, our AI coding tips cover the fundamentals.

I created `/triage-datadog`, which is essentially a document explaining to an AI how to do the morning triage I've been avoiding. The skill has four phases:

- Gather: "Hey Claude, go check Datadog for anything that blew up in the last 24 hours. Monitors, error logs, incidents, the works."
- Classify: sort findings into three piles:
  - Actionable: actual code bugs. The good stuff
  - Infrastructure: server problems. Not my department (just kidding, it's also my department, but let's pretend)
  - Noise: transient blips that resolved themselves. The universe's way of testing our alert fatigue
- Fix: for each real bug, spin up an AI agent in an isolated git worktree. It reads the codebase, finds the root cause, writes a fix with tests, and opens a PR. All by itself. While I'm doing literally anything else.
- Report: summarize everything in a neat table so I can glance at it and feel informed.

The agents run in parallel because waiting for them sequentially would be… well, a waste of my time not doing anything.

## Step 3: The Cron Job That Changed My Mornings (1 minute)

The skill works great when invoked manually. But manually invoking things every morning is exactly the kind of responsibility I'm trying to escape.
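Before wiring up the schedule, here is a rough sketch of what the Step 2 skill file might look like on disk. The frontmatter fields and phase wording below are illustrative assumptions, not the author's actual skill text:

```shell
# Hypothetical sketch of the /triage-datadog skill described in Step 2.
# The SKILL.md layout and wording are assumptions for illustration.
mkdir -p .claude/skills/triage-datadog
cat > .claude/skills/triage-datadog/SKILL.md <<'EOF'
---
name: triage-datadog
description: Morning triage of Datadog monitors, error logs, and incidents
---

1. Gather: query Datadog (via the MCP server) for monitors, error logs,
   and incidents from the last 24 hours.
2. Classify each finding as Actionable (code bug), Infrastructure, or
   Noise (transient, self-resolved).
3. Fix: for each Actionable item, spawn a parallel subagent in an
   isolated git worktree to find the root cause, write a fix with
   tests, and open a PR.
4. Report: summarize findings and opened PRs in a table.
EOF
```

The point is that a skill is just a markdown document: the four phases above are prose instructions the agent follows, not code.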
One line in the crontab:

```shell
3 8 * * 1-5 claude -p --dangerously-skip-permissions '/triage-datadog'
```

That's `claude -p` for "just print the output and exit, don't try to have a conversation with me." The `--dangerously-skip-permissions` flag sounds scary, but it just means the agent won't pause and wait for a human to click "approve" on every file read. In practice, each agent runs in a dedicated, isolated environment: a sandboxed macbox session with scoped git worktrees and no access to production infrastructure, secrets, or deployment pipelines. The agent can read code, write fixes, and open PRs. That's it. And `1-5` means weekdays only; even AI deserves weekends.

Want to sleep better at night? You can lock down what tools it can use:

```shell
claude -p --dangerously-skip-permissions --allowedTools "Bash(git:*) Bash(gh:*) Edit Read Grep Glob Agent" '/triage-datadog'
```

This explicit tool allowlist is the final layer, on top of the isolated environment, scoped filesystem access, and git worktree sandboxing. Belt, suspenders, and a parachute.

## My Morning Now vs. Before

Before: Wake up. Coffee. Open Datadog. Scroll. Squint. Sigh. Investigate. Maybe fix something. Start real work at 11.

After: Wake up. Coffee. Check Slack. See PRs already waiting for review. Approve the good ones. Start real work at 9:15.

Here's what the triage report looks like:
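As a small quality-of-life addition not in the original post: the crontab entry from Step 3 can be installed idempotently, so re-running your setup script never duplicates it. A minimal sketch, assuming the same schedule and command as the one-liner above (the helper name is hypothetical):

```shell
# Sketch: install the weekday triage cron entry without duplicating it
# on repeated runs. Schedule and command mirror the article's one-liner.
CRON_LINE='3 8 * * 1-5 claude -p --dangerously-skip-permissions "/triage-datadog"'

install_triage_cron() {
  # Keep every existing entry except a previous triage line, then append ours.
  { crontab -l 2>/dev/null | grep -vF 'triage-datadog'; echo "$CRON_LINE"; } | crontab -
}
```

`grep -vF` filters out any earlier triage line by fixed-string match before the new one is appended, which makes the function safe to call from a bootstrap script on every run.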
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.