A Guide to Kedro: Your Production-Ready Data Science Toolbox

KDnuggets | 💼 Business
#kedro #mlops #tips #data-science #pipelines #production #guide
Original source: KDnuggets · Summarized and analyzed by Genesis Park

Summary

This article introduces Kedro, a production-ready framework for taking data science projects from experimentation into practice. It explores Kedro's main features and core concepts so that readers can clearly understand the framework before applying it in depth to real projects.

Article

# Introduction

Data science projects usually begin as exploratory Python notebooks but need to be moved to production settings at some stage, which can be tricky if not planned carefully. QuantumBlack's framework, Kedro, is an open-source tool that bridges the gap between experimental notebooks and production-ready solutions by translating concepts such as project structure, scalability, and reproducibility into practice. This article introduces and explores Kedro's main features, guiding you through its core concepts for a better understanding before diving deeper into this framework for addressing real data science projects.

# Getting Started With Kedro

The first step is, of course, to install Kedro in our working environment, ideally an IDE — Kedro cannot be fully leveraged in notebook environments. Open your favorite Python IDE, for instance VS Code, and type in the integrated terminal:

```shell
pip install kedro
```

Next, we create a new Kedro project with this command:

```shell
kedro new
```

If the command works, you'll be asked a few questions, including a name for your project. We will name it Churn Predictor. If the command doesn't work, it might be because of a conflict between multiple installed Python versions. In that case, the cleanest solution is to work in a virtual environment within your IDE. These are some quick workaround commands to create one (skip them if `kedro new` already worked):

```shell
python3.11 -m venv venv
source venv/bin/activate
pip install kedro
kedro --version
```

Then, from now onwards, select the following Python interpreter in your IDE: `./venv/bin/python`.
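For orientation, the scaffolding that `kedro new` generates looks roughly like the sketch below. Exact folder names and files can vary between Kedro versions and the template options you choose, so treat this as an approximation rather than an exact listing:

```
churn-predictor/
├── conf/
│   ├── base/              # shared configuration: catalog.yml, parameters.yml
│   └── local/             # machine-local config and credentials (not committed)
├── data/                  # numbered layers, e.g. 01_raw, 02_intermediate, 06_models
├── notebooks/             # exploratory notebooks
├── src/
│   └── churn_predictor/   # the project's Python package, including pipelines/
└── pyproject.toml         # project metadata and dependencies
```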
At this point, if everything worked, you should see a full project structure inside churn-predictor on the left-hand side (in the 'EXPLORER' panel in VS Code). In the terminal, let's navigate to our project's main folder:

```shell
cd churn-predictor/
```

Time to get a glimpse of Kedro's core features through our newly created project.

# Exploring the Core Elements of Kedro

The first element we will introduce — and create ourselves — is the data catalog. In Kedro, this element is responsible for isolating data definitions from the main code. An empty file created as part of the project structure already acts as the data catalog; we just need to find it and populate it. In the IDE explorer, inside the churn-predictor project, open conf/base/catalog.yml and add the following:

```yaml
raw_customers:
  type: pandas.CSVDataset
  filepath: data/01_raw/customers.csv

processed_features:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/features.parquet

train_data:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/train.parquet

test_data:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/test.parquet

trained_model:
  type: pickle.PickleDataset
  filepath: data/06_models/churn_model.pkl
```

In a nutshell, we have just defined (not yet created) five datasets, each with an accessible key or name: raw_customers, processed_features, and so on. The main data pipeline we will build later can reference these datasets by name, completely abstracting and isolating input/output operations from the code. We now need some data to act as the first dataset in the catalog definitions above. For this example, you can take this sample of synthetically generated customer data, download it, and integrate it into your Kedro project. Next, navigate to data/01_raw, create a new file called customers.csv, and add the content of the example dataset we will use.
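To make the "reference datasets by name" idea concrete, here is a minimal, framework-free sketch of what a data catalog buys you. This is an illustration of the concept only — Kedro's real `DataCatalog` does far more (dataset types, versioning, credentials) — and the CSV content here is made up:

```python
from io import StringIO

import pandas as pd

# Stand-in for a raw CSV file on disk (fabricated sample rows).
RAW_CSV = "customer_id,total_spend\n1,120.5\n2,80.0\n"

# A toy "catalog": dataset names mapped to load callables, so the
# pipeline code below never mentions file paths or formats.
catalog = {
    "raw_customers": lambda: pd.read_csv(StringIO(RAW_CSV)),
}

def load(name: str) -> pd.DataFrame:
    """Resolve a dataset by its catalog name."""
    return catalog[name]()

df = load("raw_customers")
print(df.shape)  # (2, 2)
```

The point of the indirection: swapping the CSV for a Parquet file or a database table only changes the catalog entry, never the pipeline code that calls `load("raw_customers")`.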
The easiest way is to open the "Raw" view of the dataset file on GitHub, select all, copy, and paste it into your newly created file in the Kedro project.

Now we will create a Kedro pipeline, which describes the data science workflow to be applied to our raw dataset. In the terminal, type:

```shell
kedro pipeline create data_processing
```

This command creates several Python files inside src/churn_predictor/pipelines/data_processing/. Now open nodes.py and paste the following code:

```python
from typing import Tuple

import pandas as pd


def engineer_features(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Create derived features for modeling."""
    df = raw_df.copy()
    df['tenure_months'] = df['account_age_days'] / 30
    df['avg_monthly_spend'] = df['total_spend'] / df['tenure_months']
    df['calls_per_month'] = df['support_calls'] / df['tenure_months']
    return df


def split_data(df: pd.DataFrame, test_fraction: float) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split data into train and test sets."""
    train = df.sample(frac=1 - test_fraction, random_state=42)
    test = df.drop(train.index)
    return train, test
```
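As a quick sanity check of the sample/drop split pattern used above, the same logic can be exercised standalone on a toy DataFrame (the data here is fabricated purely for illustration):

```python
import pandas as pd

# Toy frame standing in for the processed features table.
df = pd.DataFrame({
    "customer_id": range(10),
    "total_spend": [i * 10.0 for i in range(10)],
})

# Same pattern as the split_data node: sample the train fraction with
# a fixed seed, then drop those rows to obtain the disjoint test set.
test_fraction = 0.2
train = df.sample(frac=1 - test_fraction, random_state=42)
test = df.drop(train.index)

print(len(train), len(test))  # 8 2
```

Because `test` is built by dropping the sampled train index, the two sets are guaranteed to be disjoint and to cover every row exactly once.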

This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
