Visualizing Patterns in Solutions: How Data Structure Affects Coding Style
KDnuggets
🔬 Research
#cte
#review
#window functions
#data structure
#coding style
#pattern visualization
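Original source: KDnuggets · Summarized and analyzed by Genesis Park
Summary
This analysis is an empirical study of how the structural characteristics of a dataset affect the coding patterns developers choose, such as window functions, CTEs, JOINs, and pandas merges. The article presents concrete cases of how data structure drives the way code is written and the patterns visualized in solutions, and shows that the best-suited query and coding style varies with the shape of the data.
Body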
Visualizing Patterns in Solutions: How Data Structure Affects Coding Style

Read this empirical analysis of how dataset structure drives window functions, CTEs, JOINs, and pandas merge patterns.

# Introduction

When you solve enough interview-style data problems, you start noticing a funny effect: the dataset "shape" quietly dictates your coding style. A time-series table nudges you toward window functions. A star schema pushes you into JOIN chains and GROUP BY. A pandas task with two DataFrames almost begs for .merge() and isin().

This article makes that intuition measurable. Using a set of representative SQL and pandas problems, we identify basic code-structure traits (use of common table expressions (CTEs), frequency of window functions, common pandas techniques) and show which constructs prevail and why.

# Why Data Structure Changes Your Coding Style

Data problems are less about pure logic and more like constraints wrapped in tables:

// Rows That Depend On Other Rows (Time, Rank, “Previous Value”)

If each row's answer depends on adjacent rows (e.g. yesterday's temperature, previous transaction, running totals), solutions naturally lean on window functions like LAG(), LEAD(), ROW_NUMBER(), and DENSE_RANK().

Consider, for example, this interview question's tables:

Each customer’s result on a given day cannot be determined in isolation. After aggregating order costs at the customer-day level, each row must be evaluated relative to other customers on the same date to determine which total is highest. Because the answer for one row depends on how it ranks relative to its peers within a time partition, this dataset shape naturally leads to window functions such as RANK() or DENSE_RANK() rather than simple aggregation alone.
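The aggregate-then-rank shape described above can be sketched as follows. This is a minimal illustration, not the interview question's actual schema or solution: the `orders` table, its columns, and the toy rows are hypothetical, and the query runs through Python's built-in `sqlite3` module, which is assumed to bundle SQLite 3.25 or newer (the first release with window functions).

```python
import sqlite3

# Hypothetical toy data: (customer, order_date, order_cost).
ORDERS = [
    ("alice", "2024-01-01", 50),
    ("alice", "2024-01-01", 30),
    ("bob", "2024-01-01", 70),
    ("bob", "2024-01-02", 20),
    ("carol", "2024-01-02", 45),
]

# Step 1 (CTE "daily"): aggregate order costs per customer and day.
# Step 2 (CTE "ranked"): rank each customer's total against peers on the same date.
# Step 3: keep only the top-ranked rows.
QUERY = """
WITH daily AS (
    SELECT customer, order_date, SUM(order_cost) AS total
    FROM orders
    GROUP BY customer, order_date
),
ranked AS (
    SELECT customer, order_date, total,
           RANK() OVER (PARTITION BY order_date ORDER BY total DESC) AS rnk
    FROM daily
)
SELECT order_date, customer, total
FROM ranked
WHERE rnk = 1
ORDER BY order_date, customer
"""


def top_customers_per_day(rows):
    """Load rows into an in-memory database and return the daily top spenders."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer TEXT, order_date TEXT, order_cost INTEGER)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = top_customers_per_day(ORDERS)
for order_date, customer, total in result:
    print(order_date, customer, total)
```

A plain GROUP BY can produce the daily totals, but the comparison between customers within a date is what the `PARTITION BY order_date` clause adds. RANK() keeps ties, so two customers with the same top total on a date would both be returned.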
// Multiple Tables With Roles (Dimensions vs Facts)

When one table describes entities and another describes events, solutions tend toward JOIN + GROUP BY patterns (SQL) or .merge() + .groupby() patterns (pandas).

For instance, in this interview question, the data tables are the following:

In this example, because entity attributes (users and account status) and event data (downloads) are stored separately, the logic must first recombine them using JOINs before any meaningful aggregation can take place. This dimension-fact split is what produces JOIN + GROUP BY solutions.

// Small Outputs With Exclusion Logic (Anti-Join Patterns)

Problems asking "who never did X" often become LEFT JOIN ... IS NULL / NOT EXISTS (SQL) or ~df['col'].isin(...) (pandas).

# What We Measure: Code Structure Characteristics

To compare “coding style” across different solutions, it helps to define a small set of observable features that can be extracted from SQL text and Python code. These are not flawless indicators of solution quality (e.g. correctness or efficiency), but they can serve as useful signals of how analysts engage with a dataset.

// SQL Features We Measure

// Pandas Features We Measure

# Which Constructs Are Most Common

To move beyond anecdotal observations and quantify these patterns, you need a straightforward and consistent method for deriving structural signals directly from solution code. As a concrete anchor for this workflow, we used all educational questions on the StrataScratch platform.

In the result shown below, “total occurrences” is the raw count of times a pattern appears across all code. If a single question's solution uses JOIN 3 times, all 3 are counted. “Questions using” is the number of distinct questions that have at least one occurrence of that feature (i.e. a binary "used / not used" per question).
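The difference between the two metrics can be made concrete with a small sketch. The three solution strings and the `summarize` helper below are hypothetical and written only for illustration; they are not the platform's data or the article's own aggregation code.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: question id -> solution text (not real platform data).
SOLUTIONS = {
    "q1": "SELECT * FROM a JOIN b ON a.id = b.id JOIN c ON b.id = c.id",
    "q2": "WITH t AS (SELECT 1 AS x) SELECT x FROM t",
    "q3": "SELECT u.id FROM u LEFT JOIN d ON u.id = d.uid GROUP BY u.id",
}

# A small subset of regex features, matched against upper-cased text.
PATTERNS = {
    "cte": r"\bWITH\b",
    "join": r"\bJOIN\b",
    "group_by": r"\bGROUP\s+BY\b",
}


def summarize(solutions, patterns):
    """Return (total occurrences, questions using) as two Counters."""
    total = Counter()
    questions = Counter()
    for text in solutions.values():
        upper = text.upper()
        for name, pattern in patterns.items():
            hits = len(re.findall(pattern, upper))
            total[name] += hits  # raw count: every match adds up
            if hits > 0:
                questions[name] += 1  # binary per question: used or not used
    return total, questions


total_occurrences, questions_using = summarize(SOLUTIONS, PATTERNS)
print(dict(total_occurrences))
print(dict(questions_using))
```

Here JOIN appears three times in total but in only two questions, which is the gap the two metrics are meant to expose. Note that plain regex matching also counts keywords inside comments or string literals, so the numbers are signals rather than exact parses.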
This method reduces each solution to a small set of observable features, so coding styles can be compared consistently and reproducibly across problems, and dataset structure can be linked directly to the dominant constructs.

// SQL Features

// Pandas Features (Python Solutions)

// Feature Extraction Code

Below are the code snippets used. You can run them on your own solutions (or on answers rephrased in your own terms) to extract features from the code text.

// SQL Feature Extraction (Example)

    import re
    from collections import Counter

    sql = ""  # insert code here

    # Regex per feature; patterns are matched against upper-cased SQL text.
    SQL_FEATURES = {
        "cte": r"\bWITH\b",
        "join": r"\bJOIN\b",
        "group_by": r"\bGROUP\s+BY\b",
        "window_over": r"\bOVER\s*\(",
        "dense_rank": r"\bDENSE_RANK\b",
        "row_number": r"\bROW_NUMBER\b",
        "lag": r"\bLAG\b",
        "lead": r"\bLEAD\b",
        "not_exists": r"\bNOT\s+EXISTS\b",
    }

    def extract_sql_features(sql: str) -> Counter:
        # Count how many times each feature pattern occurs in the query.
        sql_u = sql.upper()
        return Counter({k: len(re.findall(p, sql_u)) for k, p in SQL_FEATURES.items()})

// Pandas Feature Extraction (Example)

    import re
    from collections import Counter

    pandas = ""  # paste code here

    # Regex per feature; each pattern matches a pandas method call.
    PD_FEATURES = {
        "merge": r"\.merge\s*\(",
        "groupby": r"\.groupby\s*\(",
        "rank": r"\.rank\s*\(",
        "isin": r"\.isin\s*\(",
        "sort_values": r"\.sort_values\s*\(",
        "drop_duplicates": r"\.drop_duplicates\s*\(",
        "transform": r"\.transfo
This analysis was written by the Genesis Park editorial team with the help of AI. The original article can be found via the source link.