Show HN: We fine-tuned an AI model for log search. Accuracy: 50% → 85%
Source: hackernews · Summarized and analyzed by Genesis Park
Summary
General-purpose AI models make serious errors on log data, failing to distinguish success from failure and missing causal relationships, leaving search accuracy at only 50%. The researchers fine-tuned a domain-specific model using a curated operational synonym dictionary, real-world logs, and synthetic incident scenarios. As a result, accuracy rose to 85%, the model now clearly separates failure logs from success logs, and it is far better at identifying causal relationships.
Full Text
The AI models that power modern search, the same ones behind Google, email search, and enterprise knowledge bases, were trained on natural language: books, articles, web pages, conversations. They understand English beautifully. They do not understand logs.

## The Problem With General-Purpose Models

Take a state-of-the-art text embedding model, the kind that tops industry benchmarks for document retrieval, question answering, and semantic similarity. Feed it two log messages:

- Log A: "OAuth token refresh failed for merchant_id=m_8472. Retry 3/5. Circuit breaker: HALF_OPEN"
- Log B: "Token refresh completed successfully for merchant_id=m_9921 (847ms)"

A general-purpose model sees these as 98% similar. They share most of the same words: "token," "refresh," "merchant_id," numbers, punctuation. But to an SRE, these are opposites. One is a failure. The other is a success. During an incident, confusing these two logs means missing the actual error and wasting precious minutes on false leads.

This isn't a minor edge case. It's a systematic failure mode that affects every query an on-call engineer runs during an incident.

## Five Ways General Models Fail on Logs

We identified five specific failure modes when applying general-purpose AI models to enterprise log data:

1. **Success vs Failure Blindness.** General models treat "failed" and "succeeded" as minor word variations; they share the same sentence structure and surrounding context. But in operations, this is the single most important distinction in a log message.
2. **Operational Equivalence Ignorance.** "connection refused", "ETIMEDOUT", and "upstream host unreachable" mean the same thing to every SRE on the planet. A general model embeds them far apart because they share no words. The technical jargon is effectively out-of-vocabulary.
3. **Causal Chain Blindness.** When a DNS timeout causes an auth failure which causes a payment error, those three log messages are deeply related: they are the same incident described at three different points in the chain. A general model sees three unrelated messages from three different services.
4. **Structured Field Insensitivity.** Log messages contain key=value pairs: `level=ERROR`, `service=payment-svc`, `host=web-03`. General models tokenize these as random subword fragments, losing the structural meaning entirely. `level=ERROR` and `level=INFO` embed almost identically.
5. **Numeric Blindness.** `latency=2847ms` and `latency=12ms` are operationally worlds apart: the first is a crisis, the second is normal. General models treat numbers as interchangeable tokens.

## What We Did: Domain-Specific Fine-Tuning

We fine-tuned an AI model specifically for log data. Not from scratch: we started with a state-of-the-art base model and adapted it to understand log-specific patterns through contrastive learning on 78,000 carefully constructed training pairs.

The training data came from three sources:

1. A curated operational equivalence dictionary: 26 groups of phrases that mean the same thing in operations, plus 15 pairs of opposites (success vs failure) that must be distinguished. This encodes human domain expertise directly into the model.
2. Real-world log datasets from 9 different systems: distributed storage, cloud platforms, web servers, databases, security systems. This teaches the model what real logs look like across diverse enterprise environments.
3. Synthetic incident scenarios with causal chains: 10 realistic multi-service incident chains where one failure causes the next (DNS timeout → auth failure → circuit breaker → payment error). Each scenario generates training pairs that teach the model to link causally connected events.

Crucially, the training pairs use graduated similarity scores (0.0 to 1.0), not binary labels. Two logs from the same incident chain score 0.85: close but not identical. Two logs from the same service but different incidents score 0.1. Success vs failure pairs score 0.0. This teaches the model that similarity is a spectrum, not a binary classification.
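The post does not include training code, but this graduated-label setup maps naturally onto standard contrastive fine-tuning. Here is a minimal sketch using the sentence-transformers library; the base model name and the example pairs are illustrative assumptions, not the authors' actual data or configuration.

```python
# Minimal sketch of contrastive fine-tuning with graduated similarity
# labels. The base model, pairs, and scores are illustrative assumptions.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from a general-purpose embedding model (hypothetical choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    # Same incident chain: close but not identical.
    InputExample(texts=[
        "DNS resolution timeout for auth.internal (5000ms)",
        "OAuth token refresh failed for merchant_id=m_8472",
    ], label=0.85),
    # Same service, different incidents: weakly related.
    InputExample(texts=[
        "payment-svc: connection pool exhausted",
        "payment-svc: config reloaded successfully",
    ], label=0.1),
    # Hard negative: success vs failure must be pushed apart.
    InputExample(texts=[
        "Token refresh failed for merchant_id=m_8472",
        "Token refresh completed successfully for merchant_id=m_9921",
    ], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# CosineSimilarityLoss regresses the cosine similarity of each embedded
# pair toward its label, which is what lets a continuous 0.0-1.0 target
# express "related but not identical".
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```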
## The Results

We evaluated the fine-tuned model against the base model across four dimensions, using standard data science metrics.

### Classification: "Are These Two Logs Related?"

We tested with 40 log pairs: 20 operationally related, 20 opposites (success/failure).

| Metric | Base Model | Fine-Tuned | Improvement |
|---|---|---|---|
| Accuracy | 50% | 85% | +35% |
| Precision | 50% | 94% | +44% |
| F1 Score | 67% | 83% | +17% |
| AUC-ROC | 0.29 | 0.85 | +0.56 |

The base model was a coin flip: 50% accuracy, unable to distinguish related from unrelated at any threshold. The fine-tuned model achieves 94% precision: when it says two logs are related, it is right 94% of the time. The AUC-ROC jump from 0.29 (worse than random) to 0.85 (strong classifier) tells the full story. The base model's ranking was essentially inverted; it was more confident about wrong answers than right ones.

### Correlation: "Does Predicted Similarity Match Reality?"

We tested on 7,824 held-out log pairs with known similarity scores.

| Metric | Base Model | Fine-Tuned | Improvement |
|---|---|---|---|
| Pearson r | 0.38 | 0.95 | +0.57 |
| Spearman ρ | 0.29 | 0.86 | +0.57 |

This is the most telling result. The base model's similarity scores have only a 0.38 correlation with the expected scores, barely better than random. The fine-tuned model reaches 0.95: its scores almost perfectly match what a human expert would assign. For search, this means that when results are ranked by similarity, the right logs appear at the top.

### Hard Negatives: "Can It Tell Failure From Success?"

The most critical test. We measured cosine similarity between 10 failure/success pairs (lower is better; failure and success should be far apart):

| Pair | Base Similarity | Fine-Tuned Similarity |
|---|---|---|
| Token refresh failed / succeeded | 0.98 | 0.44 |
| Circuit breaker OPEN / CLOSED | 0.97 | 0.50 |
| Health check failed / passed | 0.92 | 0.17 |
| Consumer lag critical / zero | 0.93 | 0.01 |
| CPU at 99% / CPU at 15% | 0.94 | 0.03 |

The base model sees "consumer lag critical" and "consumer lag zero" as 93% identical. The fine-tuned model correctly puts them at 1% similarity. "CPU at 99% throttling" vs "CPU at 15% idle" went from 94% to 3%. The model has learned that outcome words (failed/succeeded, open/closed, critical/zero) are the most important signal in a log message, more important than all the surrounding vocabulary.

### Causal Chain Discovery

We tested whether the model links causally connected events across services:

| Metric | Base Model | Fine-Tuned | Improvement |
|---|---|---|---|
| Chain vs unrelated gap | 0.03 | 0.75 | 27x better |

The base model could barely distinguish logs from the same incident chain (gap = 0.03, nearly indistinguishable from noise). The fine-tuned model creates a massive separation (gap = 0.75). This means that when an SRE searches for "payment failure", the DNS timeout and auth failure that caused the payment failure now appear right next to it in the results, instead of being buried in noise.
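The hard-negative measurement above is easy to reproduce in principle. Here is a sketch that embeds the post's Log A/Log B pair and reports their cosine similarity; the model path is a hypothetical placeholder, not the authors' published artifact.

```python
# Sketch of the hard-negative check: embed a failure/success pair and
# measure cosine similarity (lower is better). The model path is a
# placeholder; the two log lines are taken from the post.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("path/to/fine-tuned-log-model")  # hypothetical

log_a = ("OAuth token refresh failed for merchant_id=m_8472. "
         "Retry 3/5. Circuit breaker: HALF_OPEN")
log_b = "Token refresh completed successfully for merchant_id=m_9921 (847ms)"

emb = model.encode([log_a, log_b], normalize_embeddings=True)
sim = util.cos_sim(emb[0], emb[1]).item()
print(f"failure vs success similarity: {sim:.2f}")
# The post reports ~0.98 for the base model and 0.44 after fine-tuning.
```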
## What This Means for Log Search

These aren't incremental improvements. They're category changes:

- 50% → 85% accuracy means the difference between a search tool that's a coin flip and one that's reliable
- 0.38 → 0.95 correlation means results are ranked correctly instead of randomly
- 98% → 44% on hard negatives means the tool stops confusing failures with successes
- 27x improvement on causal chains means root cause discovery actually works

A general-purpose AI model, no matter how good it is at web search or document retrieval, cannot do this. The patterns in log data (operational equivalence, causal chains, severity ordering, structured fields) are domain-specific knowledge that must be learned from log data.

## The Data Flywheel

This is Phase 1. We trained on synthetic incident data and public log datasets. The model is already dramatically better than any general-purpose alternative.

Phase 2, training on real customer incident data, will improve these numbers further. Every real incident a customer experiences generates high-quality training pairs: logs linked by correlation IDs, events confirmed as causally related by the SRE who resolved the incident. This data is gold for model training, and no competitor has access to it.

More customers → more incident data → better model → better search → more customers. The flywheel has started.

## Methodology

Fine-tuning was performed using contrastive learning on 78,234 training pairs with graduated similarity scores. Training data combined three sources: a curated operational equivalence dictionary (26 groups), real-world log datasets from 9 enterprise systems, and synthetic incident scenarios with 10 causal chain patterns across 30 microservices.

Evaluation used standard data science metrics: Precision, Recall, F1, and AUC-ROC for classification; Pearson and Spearman correlation for similarity ranking; and Precision@K, MAP, MRR, and NDCG@K for retrieval quality. A toy version of this pair-level evaluation is sketched below.

All training was performed on a single GPU. Total training cost: $0 (free cloud compute). Total training time: approximately 3 hours.
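As a rough illustration of the pair-level evaluation described in the Methodology, here is a sketch using scikit-learn and SciPy: score each labeled pair with cosine similarity, then compute the classification and correlation metrics. The similarity scores, labels, and the 0.5 threshold are toy assumptions, not the actual evaluation set.

```python
# Toy sketch of the pair-level evaluation: classification metrics at a
# fixed threshold, plus rank correlation against graduated targets.
# All arrays below are illustrative assumptions.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (accuracy_score, precision_score,
                             f1_score, roc_auc_score)

sims = np.array([0.44, 0.91, 0.17, 0.88, 0.03, 0.79])   # model similarities
related = np.array([0, 1, 0, 1, 0, 1])                  # 1 = related pair
expected = np.array([0.0, 0.85, 0.0, 0.85, 0.0, 0.85])  # graduated targets

preds = (sims >= 0.5).astype(int)  # hypothetical decision threshold
print("accuracy :", accuracy_score(related, preds))
print("precision:", precision_score(related, preds))
print("f1       :", f1_score(related, preds))
print("auc-roc  :", roc_auc_score(related, sims))

r, _ = pearsonr(sims, expected)
rho, _ = spearmanr(sims, expected)
print("pearson r:", r, " spearman rho:", rho)
```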
This analysis was produced by the Genesis Park editorial team with the help of AI. The original post is available via the source link.