Exploratory Data Analysis for Credit Scoring with Python

Towards Data Science | March 14, 2026 07:22 | 🔬 Research

#eda #python #review #credit scoring #exploratory data analysis

Original source: Towards Data Science · Summarized and analyzed by Genesis Park

Summary

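This article covers how to carry out exploratory data analysis (EDA) for developing a credit scoring model, using Python to analyze borrower and loan characteristics. Statistically analyzing specific variables, such as borrower information and loan terms, can help predict default risk more accurately. This data-driven approach is essential for financial institutions that need to manage credit risk and improve the accuracy of their decisions.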

Body

In our previous post, we presented how the databases used to build credit scoring models are constructed. We also highlighted the importance of asking the right questions:

- Who are the customers?
- What types of loans are they granted?
- What characteristics appear to explain default risk?

In this article, we illustrate this foundational step using an open-source dataset available on Kaggle: the Credit Scoring Dataset. This dataset contains 32,581 observations and 12 variables describing loans issued by a bank to individual borrowers. These loans cover a range of financing needs (medical, personal, educational, and professional) as well as debt consolidation operations. Loan amounts range from $500 to $35,000.

The variables capture two dimensions:

- contract characteristics (loan amount, interest rate, purpose of financing, credit grade, and time elapsed since loan origination),
- borrower characteristics (age, income, years of professional experience, and housing status).

The model’s target variable is default, which takes the value 1 if the customer is in default and 0 otherwise.

Today, many tools and an increasing number of AI agents are capable of automatically generating statistical descriptions of datasets. Nevertheless, performing this analysis manually remains an excellent exercise for beginners. It builds a deeper understanding of the data structure, helps highlight potential anomalies, and supports the identification of variables that may be predictive of risk.

In this article, we take a simple instructional approach to statistically describing each variable in the dataset.

- For categorical variables, we analyze the number of observations and the default rate for each category.
- For continuous variables, we discretize them into four intervals defined by the quartiles: ]min; Q1], ]Q1; Q2], ]Q2; Q3] and ]Q3; max].

We then apply the same descriptive analysis to these intervals as for categorical variables.
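The per-category and per-quartile description outlined above can be sketched with pandas as follows. This is a minimal illustration, not the original article's code: the helper names default_rate_table and quartile_bins are hypothetical, and the toy rows are invented purely to make the example runnable (they are not taken from the Kaggle dataset). Note that pd.qcut includes the minimum in the first interval, which differs slightly from the ]min; Q1] notation.

```python
import pandas as pd


def default_rate_table(df, group, target="loan_status"):
    """Count observations and compute the default rate for each group."""
    table = (
        df.groupby(group, observed=True)[target]
        .agg(n_obs="count", default_rate="mean")
    )
    # Share of the total population that falls in each group.
    table["share"] = table["n_obs"] / table["n_obs"].sum()
    return table


def quartile_bins(series):
    """Discretize a continuous series into intervals bounded by its quartiles."""
    # duplicates="drop" merges bins when two quartile edges coincide.
    return pd.qcut(series, q=4, duplicates="drop")


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "cb_person_default_on_file": ["N", "N", "Y", "N", "Y", "N", "N", "Y"],
    "cb_person_cred_hist_length": [2, 3, 4, 5, 6, 8, 10, 12],
    "loan_status": [0, 0, 1, 0, 1, 0, 1, 0],
})

# Categorical variable: one row per category.
by_category = default_rate_table(toy, "cb_person_default_on_file")

# Continuous variable: bin by quartiles, then describe it like a categorical one.
by_quartile = default_rate_table(
    toy, quartile_bins(toy["cb_person_cred_hist_length"])
)

print(by_category)
print(by_quartile)
```

Reusing the same summary function for both cases keeps the tables directly comparable across variables, which is the point of discretizing the continuous ones first.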
This segmentation is arbitrary and could be replaced by other discretization methods. The goal is simply to get an initial read on how risk behaves across the different loan and borrower characteristics.

Descriptive Statistics of the Modeling Dataset

Distribution of the Target Variable (loan_status)

This variable indicates whether the loan granted to a counterparty has resulted in a repayment default. It takes two values: 0 if the customer is not in default, and 1 if the customer is in default. Over 78% of customers have not defaulted. The dataset is imbalanced, and it is important to account for this imbalance during modeling.

The next relevant variable to analyze would be a temporal one. It would allow us to study how the default rate evolves over time, verify its stationarity, and assess its stability and its predictability. Unfortunately, the dataset contains no temporal information. We do not know when each observation was recorded, which makes it impossible to determine whether the loans were issued during a period of economic stability or during a downturn.

This information is nonetheless essential in credit risk modeling. Borrower behavior can vary significantly depending on the macroeconomic environment. For instance, during financial crises such as the 2008 subprime crisis or the COVID-19 pandemic, default rates typically rise sharply compared to more favorable economic periods.

The absence of a temporal dimension in this dataset therefore limits the scope of our analysis. In particular, it prevents us from studying how risk dynamics evolve over time and from evaluating the potential robustness of a model against economic cycles. We do, however, have access to the variable cb_person_cred_hist_length, which represents the length of a customer’s credit history, expressed in years.

Distribution by Credit History Length (cb_person_cred_hist_length)

This variable has 29 distinct values, ranging from 2 to 30 years.
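Figures such as the class shares of the target and the number of distinct values of a numeric variable could be checked along the following lines. This is a minimal sketch under stated assumptions: the helper names class_shares and profile_numeric are hypothetical, and the toy rows are invented for illustration, so the printed values do not reproduce the percentages or counts quoted in the article.

```python
import pandas as pd


def class_shares(target):
    """Return the share of each class of the target, as fractions of the total."""
    return target.value_counts(normalize=True).sort_index()


def profile_numeric(series):
    """Summarize cardinality and range to help decide how to treat a variable."""
    return {
        "n_distinct": int(series.nunique()),
        "min": series.min(),
        "max": series.max(),
    }


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "loan_status": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "cb_person_cred_hist_length": [2, 2, 3, 4, 4, 7, 9, 12, 17, 30],
})

shares = class_shares(toy["loan_status"])
profile = profile_numeric(toy["cb_person_cred_hist_length"])

print(shares)
print(profile)
```

A strongly uneven split in shares is the signal that imbalance needs attention at the modeling stage, and a numeric variable with many distinct values is a natural candidate for quantile binning rather than category-by-category description.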
We will treat it as a continuous variable and discretize it using quantiles.

Several observations can be drawn from the table above. First, more than 56% of borrowers have a credit history of four years or less, indicating that a large proportion of clients in the dataset have relatively short credit histories. Second, the default rate appears fairly stable across intervals, hovering around 21%. That said, borrowers with shorter credit histories tend to exhibit slightly riskier behavior than those with longer ones, as reflected in their higher default rates.

Distribution by Previous Default (cb_person_default_on_file)

This variable indicates whether the borrower has previously defaulted on a loan. It therefore provides valuable information about the past credit behavior of the client. It has two possible values:

- Y: the borrower has defaulted in the past
- N: the borrower has never defaulted

In this dataset, more than 80% of borrowers have no history of default, suggesting that the majority of clients have maintained a satisfactory repayment record. However, a clear difference

View original (Towards Data Science)

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
