Why You Should Stop Writing Loops in Pandas
Towards Data Science
💼 Business
#pandas
#python
#data-analysis
#loops
#loop-optimization
#vectorization
#python-tips
Original source: Towards Data Science · Summarized and analyzed by Genesis Park
Summary
To use Pandas like a professional, avoid writing loops and process data with vectorized operations instead. This dramatically improves execution speed and enables efficient data analysis.
Body
```python
for i in range(len(df)):
    if df.loc[i, "sales"] > 1000:
        df.loc[i, "tier"] = "high"
    else:
        df.loc[i, "tier"] = "low"
```

It worked. And I thought, "Hey, that's fine, right?" Turns out… not so much.

I didn't realize it at the time, but loops like this are a classic beginner trap. They make Pandas do far more work than it needs to, and they sneak in a mental model that keeps you thinking row by row instead of column by column.

Once I started thinking in columns, things changed. Code got shorter. Execution got faster. And suddenly, Pandas felt like it was actually built to help me, not slow me down.

To show this, let's use a tiny dataset we'll reference throughout:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "sales": [500, 1200, 800, 2000, 300]
})
```

Output:

```
  product  sales
0       A    500
1       B   1200
2       C    800
3       D   2000
4       E    300
```

Our goal is simple: label each row as "high" if sales are greater than 1000, otherwise "low". Let me show you how I did it at first, and why there's a better way.

The Loop Approach I Started With

Here's the loop I used when I was learning:

```python
for i in range(len(df)):
    if df.loc[i, "sales"] > 1000:
        df.loc[i, "tier"] = "high"
    else:
        df.loc[i, "tier"] = "low"

print(df)
```

It produces this result:

```
  product  sales  tier
0       A    500   low
1       B   1200  high
2       C    800   low
3       D   2000  high
4       E    300   low
```

And yes, it works. But here's what I learned the hard way:

- Pandas is doing a tiny operation for each row, instead of efficiently handling the whole column at once.
- This approach doesn't scale: what feels fine with 5 rows slows down with 50,000 rows.
- More importantly, it keeps you thinking like a beginner, row by row, instead of like a professional Pandas user.

Timing the Loop (The Moment I Realized It Was Slow)

When I first ran my loop on this tiny dataset, I thought, "No problem, it's fast enough." But then I wondered… what if I had a bigger dataset?
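For readers following along, the small-dataset loop above can be reproduced as one self-contained script (the DataFrame construction is repeated here so the snippet runs on its own):

```python
import pandas as pd

# Rebuild the small example dataset from the article
df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "sales": [500, 1200, 800, 2000, 300]
})

# Row-by-row labeling: works, but performs one tiny operation per row
for i in range(len(df)):
    if df.loc[i, "sales"] > 1000:
        df.loc[i, "tier"] = "high"
    else:
        df.loc[i, "tier"] = "low"

print(df["tier"].tolist())  # ['low', 'high', 'low', 'high', 'low']
```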
So I tried it:

```python
import pandas as pd
import time

# Make a bigger dataset
df_big = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"] * 100_000,
    "sales": [500, 1200, 800, 2000, 300] * 100_000
})

# Time the loop
start = time.time()

for i in range(len(df_big)):
    if df_big.loc[i, "sales"] > 1000:
        df_big.loc[i, "tier"] = "high"
    else:
        df_big.loc[i, "tier"] = "low"

end = time.time()
print("Loop time:", end - start)
```

Here's what I got:

```
Loop time: 129.27328729629517
```

That's 129 seconds. Over two minutes just to label rows as "high" or "low".

That's the moment it clicked for me. The code wasn't just "a little inefficient." It was fundamentally using Pandas the wrong way. And imagine this running inside a data pipeline, in a dashboard refresh, on millions of rows every single day.

Why It's That Slow

The loop forces Pandas to:

- Access each row individually
- Execute Python-level logic for every iteration
- Update the DataFrame one cell at a time

In other words, it turns a highly optimized columnar engine into a glorified Python list processor. And that's not what Pandas is built for.

The One-Line Fix (And the Moment It Clicked)

After seeing 129 seconds, I knew there had to be a better way. So instead of looping through rows, I tried expressing the rule at the column level:

"If sales > 1000, label high. Otherwise, label low."

That's it. That's the rule. Here's the vectorized version:

```python
import numpy as np
import time

start = time.time()
df_big["tier"] = np.where(df_big["sales"] > 1000, "high", "low")
end = time.time()

print("Vectorized time:", end - start)
```

And the result?

```
Vectorized time: 0.08
```

Let that sink in.

- Loop version: 129 seconds
- Vectorized version: 0.08 seconds

That's over 1,600× faster.

What Just Happened?

The key difference is this: the loop processed the DataFrame row by row, while the vectorized version processed the entire sales column in one optimized operation.

When you write:

```python
df_big["sales"] > 1000
```

Pandas doesn't check values one at a time in Python.
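To see the one-line fix without the timing scaffolding, here is the same np.where() approach applied to the small five-row dataset, so the result can be checked by hand (a self-contained sketch using the article's column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "sales": [500, 1200, 800, 2000, 300]
})

# One vectorized pass: the comparison runs in compiled NumPy code,
# and np.where picks a label for every row at once
df["tier"] = np.where(df["sales"] > 1000, "high", "low")

print(df["tier"].tolist())  # ['low', 'high', 'low', 'high', 'low']
```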
It performs the comparison at a lower level (via NumPy), in compiled code, across the entire array. Then np.where() applies the labels in one efficient pass.

Here's the subtle but powerful change. Instead of asking, "What should I do with this row?" you ask, "What rule applies to this column?" That's the line between beginner Pandas and professional Pandas.

At this point, I thought I'd "leveled up." Then I discovered I could make it even simpler.

And Then I Discovered Boolean Indexing

After timing the vectorized version, I felt pretty proud. But then I had another realization: I don't even need np.where() for this.

Let's go back to our small dataset:

```python
df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "sales": [500, 1200, 800, 2000, 300]
})
```

Our goal is still the same: label each row "high" if sales > 1000, otherwise "low".

With np.where() we wrote:

```python
df["tier"] = np.where(df["sales"] > 1000, "high", "low")
```

It's cleaner and faster. Much better than a loop. But here's the part that really changed how I think about Pandas. This line right here…

```python
df["sales"] > 1000
```

…already returns something incredibly useful. Let's look at it:

```
0    False
1     True
2    False
3     True
4    False
Name: sales, dtype: bool
```
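That boolean Series can drive the labeling directly. Here is a minimal sketch of the boolean-indexing pattern on the same small dataset; this is one common way to finish the idea, not necessarily the exact continuation of the original article:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "sales": [500, 1200, 800, 2000, 300]
})

# The comparison itself is a boolean Series (a "mask"): one True/False per row
mask = df["sales"] > 1000

# Boolean indexing: start every row at "low", then overwrite only the masked rows
df["tier"] = "low"
df.loc[mask, "tier"] = "high"

print(df["tier"].tolist())  # ['low', 'high', 'low', 'high', 'low']
```

Like np.where(), this touches the whole column in a couple of vectorized operations, with no Python-level loop at all.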
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.