Why Most A/B Tests Lie

Towards Data Science | 🔬 Research
#a/b testing #review #ux #data literacy #data analysis #product review
Original source: Towards Data Science · Summarized and analyzed by Genesis Park

Summary

The article points out that many A/B tests produce unreliable results because of flawed statistical practice. It explains in concrete terms four statistical errors that invalidate test results and provides a checklist to work through before running any test. It also compares Bayesian and frequentist decision-making frameworks that can be applied directly to your work on Monday morning.

Body

She screenshots the result. Posts it in the #product-wins Slack channel with a party emoji. The head of engineering replies with a thumbs-up and starts planning the rollout sprint. Here’s what the dashboard didn’t show her: if she had waited three more days (the original planned test duration), that significance would have dropped to 74%. The +8.3% lift would have shrunk to +1.2%. Below the noise floor. Not real.

If you’ve ever stopped a test early because it “hit significance,” you’ve probably shipped a version of this mistake. You’re in large company. At Google and Bing, only 10% to 20% of controlled experiments generate positive results, according to Ronny Kohavi’s research published in the Harvard Business Review. At Microsoft broadly, one-third of experiments prove effective, one-third are neutral, and one-third actively hurt the metrics they intended to improve. Most ideas don’t work. The experiments that “prove” they do are often telling you what you want to hear.

The four statistical sins below account for the majority of unreliable A/B test results. Each takes less than 15 minutes to fix. By the end of this article, you’ll have a five-item pre-test checklist and a decision framework for choosing between frequentist, Bayesian, and sequential testing that you can apply to your next experiment Monday morning.

The Peeking Problem: 26% of Your Winners Aren’t Real

Every time you check your A/B test results before the planned end date, you’re running a new statistical test. Not metaphorically. Literally. Frequentist significance tests are designed for a single look at a pre-determined sample size. When you check results after 100 visitors, then 200, then 500, then 1,000, you’re not running one test. You’re running four. Each look gives noise another chance to masquerade as signal.

Evan Miller quantified this in his widely cited analysis “How Not to Run an A/B Test.” If you check results after every batch of new data and stop the moment you see p < 0.05, the actual false positive rate isn’t 5%. It’s 26.1%. One in four “winners” is pure noise.

The mechanics are straightforward. A significance test controls the false positive rate at 5% for a single analysis point. Multiple checks create multiple opportunities for random fluctuations to cross the significance threshold. As Miller puts it: “If you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5%.”

This is the most common sin in A/B testing, and the most expensive. Teams make product decisions, allocate engineering resources, and report revenue impact to leadership based on results that had a one-in-four chance of being imaginary.

The fix is simple but unpopular: calculate your required sample size before you start, and don’t look at the results until you hit it. If that discipline feels painful (and for most teams, it does), sequential testing offers a middle path. More on that in the framework below.
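
To see how quickly peeking inflates the error rate, here is a minimal Monte Carlo sketch. It assumes an A/A test (both arms share the same true conversion rate, so every “winner” is a false positive), a pooled two-proportion z-test, and a peek after every batch of 100 visitors per arm; the function names and all numbers are illustrative, not taken from the article. With ten looks, the peek-and-stop error rate lands well above the nominal 5%, and checking more often pushes it toward Miller’s 26%.

```python
import numpy as np

rng = np.random.default_rng(42)

def z_stat(a, b, n):
    """Pooled two-proportion z statistic on the first n observations per arm."""
    pa, pb = a[:n].mean(), b[:n].mean()
    pooled = (pa + pb) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (pb - pa) / se

def false_positive_rates(n_sims=5000, base_rate=0.05, batch=100,
                         n_batches=10, z_crit=1.96):
    """Simulate A/A tests (no real effect) and compare 'stop at the first
    significant peek' against a single look at the planned sample size."""
    peeking_hits = single_hits = 0
    n_final = batch * n_batches
    for _ in range(n_sims):
        a = rng.random(n_final) < base_rate   # control conversions (True/False)
        b = rng.random(n_final) < base_rate   # variant conversions, same true rate
        peeking_hits += any(abs(z_stat(a, b, batch * k)) > z_crit
                            for k in range(1, n_batches + 1))
        single_hits += abs(z_stat(a, b, n_final)) > z_crit
    print(f"peek-and-stop false positive rate: {peeking_hits / n_sims:.1%}")
    print(f"single-look false positive rate:   {single_hits / n_sims:.1%}")

false_positive_rates()
```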

The Power Vacuum: Small Samples, Inflated Effects

Peeking creates false winners. The second sin makes real winners look bigger than they are.

Statistical power is the probability that your test will detect a real effect when one exists. The standard target is 80%, meaning a 20% chance you’ll miss a real effect even when it’s there. To hit 80% power, you need a specific sample size, and that number depends on three things: your baseline conversion rate, the smallest effect you want to detect, and your significance threshold.

Most teams skip the power calculation. They run the test “until it’s significant” or “for two weeks,” whichever comes first. This creates a phenomenon called the winner’s curse.

Here’s how it works. In an underpowered test, the random variation in your data is large relative to the real effect. The only way a real-but-small effect reaches statistical significance in a small sample is if random noise pushes the measured effect far above its true value. So the very act of reaching significance in an underpowered test guarantees that your estimated effect is inflated.

A team might celebrate a +8% conversion lift, ship the change, and then watch the actual number settle at +2% over the following quarter. The test wasn’t wrong exactly (there was a real effect), but the team based their revenue projections on an inflated number. An artifact of insufficient sample size.

The fix: run a power analysis before every test. Set your Minimum Detectable Effect (MDE) at the smallest change that would justify the engineering and product effort to ship. Calculate the sample size needed at 80% power. Then run the test until you reach that number. No early exits.
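
As a sketch of that pre-test calculation, the snippet below uses statsmodels to turn a baseline conversion rate and an MDE into a required sample size per variant. The 5% baseline and +10% relative MDE are made-up numbers for illustration; substitute your own.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative inputs -- replace with your own numbers.
baseline = 0.05                            # current conversion rate
mde_relative = 0.10                        # smallest lift worth shipping: +10% relative
target = baseline * (1 + mde_relative)     # 5.5% absolute

# Cohen's h for two proportions, then solve for the per-variant sample size.
effect_size = proportion_effectsize(target, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance threshold
    power=0.80,              # 80% chance of detecting the MDE if it is real
    ratio=1.0,               # equal traffic split
    alternative="two-sided",
)

print(f"Cohen's h: {effect_size:.4f}")
print(f"Required sample size per variant: {n_per_variant:,.0f}")
```

For small baseline rates and modest MDEs, the answer is usually far larger than two weeks of traffic provides, which is exactly why the “two weeks or until significant” habit produces underpowered tests.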

The Multiple Comparisons Trap

The third sin scales with ambition. Your A/B test tracks conversion rate, average order value, bounce rate, time on page, and click-through rate on the call-to-action. Five metrics. Standard practice.

Here’s the problem. At a 5% significance level per metric, the probability of at least one false positive across all five is 1 - (1 - 0.05)^5 ≈ 22.6%. More than a one-in-five chance that at least one metric “wins” by luck alone.
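
The excerpt cuts off before the article states its own remedy, but the family-wise arithmetic is easy to verify, and Holm’s step-down correction is one standard way to keep the 5% guarantee across several metrics. The five p-values below are hypothetical, chosen only to show how raw “wins” can disappear once the correction is applied.

```python
from statsmodels.stats.multitest import multipletests

# Family-wise error rate: chance of at least one false positive
# when each of k independent metrics is tested at alpha = 0.05.
alpha, k = 0.05, 5
fwer = 1 - (1 - alpha) ** k
print(f"FWER across {k} metrics: {fwer:.1%}")   # ~22.6%

# Hypothetical raw p-values for the five metrics in the example above.
p_values = [0.012, 0.048, 0.260, 0.031, 0.410]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="holm")

for p_raw, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p_raw:.3f}  adjusted p = {p_adj:.3f}  significant: {sig}")
```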

This analysis was produced by the Genesis Park editorial team using AI. The original article is available via the source link.
