Newsfeed Curation SNS Dashboard Journal

We used five outlier detection methods on a real dataset. They disagreed on 96% of flagged samples.

KDnuggets | 💼 Business
#tip #data-analysis #data-quality #machine-learning #real-world-data #outlier-detection

Summary

Applying five outlier detection methods to a real wine dataset produced a 96% disagreement rate between methods: most flagged wines were classified as outliers by only some of the methods. Of 816 flagged wines, only 32 were designated outliers by all five methods, but those 32 were found to share common characteristics.

Why It Matters

Developer Perspective

Under review.

Researcher Perspective

Under review.

Business Perspective

Under review.

Main Text

We Used 5 Outlier Detection Methods on a Real Dataset: They Disagreed on 96% of Flagged Samples

Out of 816 wines flagged by at least one method, just 32 made the unanimous list. Those wines had something in common.

# Introduction

All data science tutorials make outlier detection look easy: remove every value more than three standard deviations from the mean, and you're done. But once you start working with an actual dataset where the distribution is skewed, and a stakeholder asks, "Why did you remove that data point?", you suddenly realize you don't have a good answer.

So we ran an experiment. We tested five of the most commonly used outlier detection methods on a real dataset (6,497 Portuguese wines) to find out: do these methods produce consistent results? They didn't. What we learned from the disagreement turned out to be more valuable than anything we could have picked up from a textbook.

We built this analysis as an interactive Strata notebook, a format you can use for your own experiments using the Data Project on StrataScratch. You can view and run the full code here.

# Setting Up

Our data comes from the Wine Quality Dataset, publicly available through UCI's Machine Learning Repository. It contains physicochemical measurements from 6,497 Portuguese "Vinho Verde" wines (1,599 red, 4,898 white), along with quality ratings from expert tasters.

We selected it for several reasons. It's production data, not something generated artificially. The distributions are skewed (6 of 11 features have skewness \( > 1 \)), so the data do not meet textbook assumptions. And the quality ratings let us check whether the detected "outliers" show up more often among wines with unusual ratings.

We tested five widely used methods; those discussed below include Z-Score, IQR, Isolation Forest, and Local Outlier Factor (LOF).

# Discovering the First Surprise: Inflated Results From Multiple Testing

Before we could compare methods, we hit a wall.
With 11 features, the naive approach (flagging a sample whenever at least one feature has an extreme value) produced badly inflated results. IQR flagged about 23% of wines as outliers; Z-Score flagged about 26%. When nearly 1 in 4 wines gets flagged as an outlier, something is off. Real datasets don't have 25% outliers.

The problem was that we were testing 11 features independently, and that inflates the results. The math is straightforward: if each feature independently has a 5% probability of showing a "random" extreme value, then with 11 independent features:

\[ P(\text{at least one extreme}) = 1 - (0.95)^{11} \approx 43\% \]

In plain terms: even if every feature is perfectly normal, you'd expect nearly half your samples to have at least one extreme value somewhere, purely by chance.

To fix this, we changed the requirement: flag a sample only when at least 2 features are simultaneously extreme. Changing `min_features` from 1 to 2 changed the definition from "any feature of the sample is extreme" to "the sample is extreme across more than one feature." Here's the fix in code:

```python
import numpy as np

# Count how many features are extreme for each sample
# (|z-score| above 3.5), then require at least 2 of them.
outlier_counts = (np.abs(z_scores) > 3.5).sum(axis=1)
outliers = outlier_counts >= 2
```

# Comparing 5 Methods on 1 Dataset

Once the multiple-testing fix was in place, we counted how many samples each method flagged. Here's how we set up the ML methods:

```python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

iforest = IsolationForest(contamination=0.05, random_state=42)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
```

Why do the ML methods all show exactly 5%? Because of the `contamination` parameter: it forces them to flag exactly that percentage. It's a quota, not a threshold. In other words, Isolation Forest will flag 5% of samples regardless of whether your data contains 1% true outliers or 20%.

# Discovering the Real Difference: They Identify Different Things

Here's what surprised us most.
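Before looking at the results, it helps to pin down how "agreement" is measured. One standard way to compare two methods' flags is the Jaccard similarity of their flagged sets: the intersection over the union of the samples each one flags. A minimal sketch, with hypothetical boolean masks standing in for two methods' outputs (the names and values below are illustrative, not the article's actual results):

```python
import numpy as np

def jaccard(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Jaccard similarity of two boolean outlier masks:
    |A ∩ B| / |A ∪ B| over the flagged samples."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection / union) if union else 1.0

# Hypothetical flags from two methods over 10 samples
zscore_flags = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=bool)
lof_flags    = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0], dtype=bool)

print(jaccard(zscore_flags, lof_flags))  # 2 shared / 5 flagged overall = 0.4
```

A Jaccard score of 1.0 would mean two methods flag identical sets; values near 0 mean they flag almost entirely different samples.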
When we examined how much the methods agreed, the Jaccard similarity between pairs of methods ranged from 0.10 to 0.30. That's poor agreement. Out of 6,497 wines:

- Only 32 samples (0.5%) were flagged by all 4 primary methods
- 143 samples (2.2%) were flagged by 3+ methods
- The remaining "outliers" were flagged by only 1 or 2 methods

This might look like a bug, but it's the point: each method has its own definition of "unusual." If a wine has residual sugar levels far above the global average, it's a univariate outlier, and Z-Score or IQR will catch it. But if it sits among other wines with similarly high sugar levels, LOF won't flag it: it's normal within its local context.

So the real question isn't "which method is best?" It's "what kind of unusual am I searching for?"

# Checking Sanity: Do Outliers Correlate With Wine Quality?

The dataset includes expert quality ratings (3-9). We wanted to know: do detected outliers appear more frequently among wines with extreme quality ratings? Extreme-quality wines were twice as likely to be consensus outliers. That's a good sanity check. In some cases, the connection is clear: a wine with way too m…
