Causal Inference Is Eating Machine Learning

Towards Data Science | 🔬 Research
#review #machinelearning #datareview #predictivemodels #causalinference
Original source: Towards Data Science · Summary and analysis by Genesis Park

Summary

This article argues that causal inference is the key to fixing a familiar failure mode: machine learning models that predict accurately yet drive bad decisions. The author offers a five-question checklist for diagnosing the problem, a matrix comparing causal methods, and a practical Python workflow, arguing that causal thinking is what turns predictive accuracy into real-world impact.

Body

Accuracy on the held-out test set: 94%. The operations team used it to decide which patients to prioritize for follow-up calls. They expected readmission rates to drop. Rates went up.

The model had captured every correlation in the data: older patients, certain zip codes, specific discharge diagnoses. It performed exactly as designed. The test metrics were clean. The confusion matrix looked textbook. But when the team acted on those predictions (calling patients flagged as high-risk, rearranging discharge protocols) the relationships in the data shifted beneath them. Patients who received extra follow-up calls didn't improve. The ones who kept getting readmitted shared a different profile entirely: they couldn't afford their medications, lacked reliable transportation to follow-up appointments, or lived alone without support for post-discharge care.

The variables that predicted readmission were not the same variables that caused it. The model never learned that distinction, because it was never designed to. It saw correlations and assumed they were handles you could pull. They weren't. They were shadows cast by deeper causes the model couldn't see.

If you've built a model that predicts well but fails when turned into a decision, you've already felt this problem. You just didn't have a name for it. The name is confounding. The solution is causal inference. And in 2026, the tools to do it properly are finally mature enough for any data scientist to use.

The Question Your Model Can't Answer

Machine Learning (ML) is built for one job: find patterns in data and predict outcomes. This is associational reasoning. It works brilliantly for spam filters, image classifiers, and recommendation engines. Pattern in, pattern out. But business stakeholders rarely ask "what will happen next?" They ask "what should we do?" Should we raise the price? Should we change the treatment protocol? Should we offer this customer a discount? These are causal questions. And answering them with associational models is like using a thermometer to set the thermostat. The thermometer tells you the temperature. It doesn't tell you what would happen if you changed the dial.

Judea Pearl, the computer scientist who won the 2011 Turing Award for his work on probabilistic and causal reasoning, organized this gap into what he calls the Ladder of Causation. The ladder has three rungs, and the distance between them explains why so many ML projects fail when they move from prediction to action.

Level 1: Association ("Seeing"). "Patients who take Drug X have better outcomes." This is pure correlation. Every standard ML model operates here. It answers: what patterns exist in the data?

Level 2: Intervention ("Doing"). "If we give Drug X to this patient, will their outcome improve?" This requires understanding what happens when you change something. Pearl formalizes this with the do-operator: P(Y | do(X)). No amount of observational data, on its own, can answer this.

Level 3: Counterfactual ("Imagining"). "This patient took Drug X and recovered. Would they have recovered without it?" This requires reasoning about realities that never happened. It is the highest form of causal thinking.

Here's what each level looks like in practice. A Level 1 model at an e-commerce company says: "Users who viewed product pages for running shoes also bought protein bars." Useful for recommendations. A Level 2 question from the same company: "If we send a 20% discount on protein bars to users who viewed running shoes, will purchases increase?" That requires knowing whether the discount causes purchases or whether the same users would have bought anyway. A Level 3 question: "This user bought protein bars after receiving the discount. Would they have bought them without it?" That requires reasoning about a world that didn't happen.
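As a concrete illustration of the Level 2 question, the sketch below estimates the discount effect from simulated observational data using the open-source DoWhy library. The data, column names, and the single assumed confounder (fitness_interest) are illustrative assumptions, not details from the article, and the article's own Python workflow may differ. The point is that the backdoor adjustment P(Y | do(X)) = Σ_z P(Y | X, z) P(z) only holds for an explicitly stated confounder set.

```python
# Minimal Level 2 sketch: effect of do(got_discount) on purchases (illustrative).
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
n = 5_000

# Simulated observational data: fitness interest confounds both
# receiving the discount and buying protein bars.
fitness_interest = rng.binomial(1, 0.3, n)
got_discount = rng.binomial(1, 0.2 + 0.5 * fitness_interest)
bought_protein_bars = rng.binomial(1, 0.1 + 0.2 * got_discount + 0.4 * fitness_interest)

df = pd.DataFrame(
    {
        "fitness_interest": fitness_interest,
        "got_discount": got_discount,
        "bought_protein_bars": bought_protein_bars,
    }
)

# Naive Level 1 answer: compare purchase rates by discount status.
print(df.groupby("got_discount")["bought_protein_bars"].mean())

# Level 2 answer: declare the assumed confounder and adjust for it.
model = CausalModel(
    data=df,
    treatment="got_discount",
    outcome="bought_protein_bars",
    common_causes=["fitness_interest"],
)
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)  # close to the true +0.2 effect, unlike the naive gap
```

The naive comparison overstates the effect because fitness_interest drives both the discount and the purchase; the adjusted estimate is only as trustworthy as the stated confounder set, and no held-out accuracy metric can validate that assumption.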
Most ML operates on Level 1. Most business decisions require Level 2 or 3. That gap is where wrong decisions are made at scale.

When Accuracy Lies

The gap between prediction and causation is not theoretical. It has a body count.

Consider the kidney stone study from 1986. Researchers compared two treatments for renal calculi. Treatment A outperformed Treatment B for small stones. Treatment A also outperformed Treatment B for large stones. But when the data was pooled across both groups, Treatment B appeared superior. This is Simpson's paradox. The lurking variable was stone severity. Doctors had prescribed Treatment A for harder cases. Pooling the data erased that context, flipping the apparent conclusion. A prediction model trained on the pooled data would confidently recommend Treatment B. It would be wrong. (A short pandas sketch at the end of this section reproduces the flip.)

That's a statistics textbook example. The hormone therapy case drew blood. For decades, observational studies suggested that postmenopausal Hormone Replacement Therapy (HRT) reduced the risk of coronary heart disease. The evidence looked solid. Millions of women were prescribed HRT based on these findings. Then the Women's Health Initiative, a large randomized controlled trial, put the question to an actual experiment: HRT showed no protective effect against coronary heart disease, and the combined-therapy arm showed an increased risk. The observational benefit had been confounding all along; the women who chose HRT were, on average, healthier and wealthier to begin with.
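To make the kidney stone reversal concrete, here is a short pandas sketch. The success counts are the standard textbook figures for the 1986 study, used here as an assumption rather than quoted from the excerpt above: Treatment A wins within each stone-size stratum, yet Treatment B wins in the pooled table.

```python
# Simpson's paradox with the commonly cited kidney stone counts:
# Treatment A wins in each stratum, Treatment B wins when pooled.
import pandas as pd

counts = pd.DataFrame(
    [
        ("A", "small", 81, 87),
        ("A", "large", 192, 263),
        ("B", "small", 234, 270),
        ("B", "large", 55, 80),
    ],
    columns=["treatment", "stone_size", "successes", "patients"],
)

# Per-stratum success rates: A beats B for both small and large stones.
per_stratum = counts.assign(rate=counts["successes"] / counts["patients"])
print(per_stratum[["treatment", "stone_size", "rate"]])

# Pooled success rates: B now looks better, because A was assigned
# disproportionately to the harder, large-stone cases.
pooled = counts.groupby("treatment")[["successes", "patients"]].sum()
print(pooled["successes"] / pooled["patients"])
```

A model trained only on the pooled table sees the reversed association; recovering the right answer requires knowing that stone severity influenced both the treatment choice and the outcome.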

This analysis was written by the Genesis Park editorial team with the assistance of AI. The original article is available via the source link.
