Show HN: Generate, Clean, and Prepare LLM Training Data, All-in-One

hackernews | 📦 Open Source
#llm #low-code #training-data #data-preprocessing #ml-research #pipeline
Source: hackernews · Summarized and analyzed by Genesis Park

Summary

DataFlow is a visual, low-code, all-in-one pipeline system for generating, refining, and preparing high-quality data for training large language models (LLMs). It cleans noisy raw data such as PDFs and plain text, and supports pre-training or fine-tuning tailored to specific domains such as healthcare and finance. As of February 2026 it ships an intuitive web interface (WebUI), and for researchers and enterprises it provides reusable, scalable data pipelines built on the Python and Ray ecosystems.

Body

Generate, Clean, and Prepare LLM Data, All-in-One. Visual, low-code pipelines with flexible orchestration across domains and use cases. 💪 Turn raw data into high-quality LLM training datasets. 🔧 🎉 Get smarter LLMs cheaply; give us a star ⭐ on GitHub for the latest update. Beginner-friendly learning resources (continuously updated): [🎬 Video Tutorials] [📚 Written Tutorials]

News:
- [2026-02-02] 🖥️ DataFlow WebUI is now available! Launch the visual pipeline builder with a single command: `dataflow webui`. Build and run DataFlow pipelines through an intuitive web interface. 👉 WebUI Docs
- [2026-01-20] 🌟 "Awesome Works Using DataFlow" is now live! A new section showcasing open-source projects and research built on DataFlow. Contributions are welcome! 👉 Awesome Works
- [2025-12-19] 🎉 Our DataFlow technical report is now available! Read and cite our work on arXiv: https://arxiv.org/abs/2512.16676
- [2025-11-20] 🤖 Introducing new Data Agents for DataFlow! Try them out and follow the tutorial on Bilibili: https://space.bilibili.com/3546929239689711/lists/6761342?type=season
- [2025-06-28] 🎉 DataFlow is officially released! Our data-centric AI system is now public. Stay tuned for future updates.

DataFlow is a data preparation and training system designed to generate, refine, evaluate, and filter high-quality data for AI from noisy sources (PDF, plain text, low-quality QA), improving the performance of large language models (LLMs) in domains such as healthcare, finance, legal, and academic research through targeted training (pre-training, supervised fine-tuning, RL training) or RAG systems. Through an operator-based design, DataFlow turns the entire data-cleaning workflow into a reproducible, reusable, and shareable pipeline, providing core infrastructure for the Data-Centric AI community.
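To make the operator-based idea concrete, here is a minimal, self-contained sketch of how cleaning operators might compose into a reproducible pipeline. The `Operator`, `Pipeline`, `DeduplicateOperator`, and `MinLengthFilter` names below are illustrative assumptions, not DataFlow's actual API.

```python
# Hypothetical sketch of an operator-based data-cleaning pipeline.
# These classes are illustrations of the concept, not DataFlow's real API.

class Operator:
    """A reusable transformation over a list of text records."""
    def __call__(self, records):
        raise NotImplementedError

class DeduplicateOperator(Operator):
    """Drop exact duplicate records, keeping first occurrences."""
    def __call__(self, records):
        seen, out = set(), []
        for r in records:
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out

class MinLengthFilter(Operator):
    """Discard records shorter than a character threshold."""
    def __init__(self, min_chars):
        self.min_chars = min_chars

    def __call__(self, records):
        return [r for r in records if len(r) >= self.min_chars]

class Pipeline:
    """Chain operators in order; the operator list itself documents the
    workflow, which is what makes it reproducible and shareable."""
    def __init__(self, operators):
        self.operators = operators

    def run(self, records):
        for op in self.operators:
            records = op(records)
        return records

pipeline = Pipeline([DeduplicateOperator(), MinLengthFilter(10)])
cleaned = pipeline.run(["short", "a longer noisy record", "a longer noisy record"])
# cleaned == ["a longer noisy record"]
```

Because each operator is a plain callable with a uniform interface, a pipeline definition is just a list, which is easy to version, package, and share as the text describes.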
Additionally, an intelligent DataFlow-Agent can dynamically assemble new pipelines by recombining existing operators or creating new ones on demand.

Key capabilities:
- High-quality training data generation
  - Text, math, and code data generation (see DataFlow-Instruct-10K for results)
  - Data generation via tools like AgenticRAG and Text2SQL
- Structured data extraction
  - Large-scale PDF → QA conversion
  - Large-scale book PDF → Visual-QA conversion
- Scientific data workflow management
  - Text2SQL workflow management (accepted by ICDE 2026)
  - Math data workflows (accepted by KDD 2026)

Operator design:
- 10+ core operators define interaction patterns and design principles
- 100+ pipeline-specific operators available for reuse or reference
- Full support for creating custom operators: plug-and-play, easily packaged and distributed via GitHub or PyPI
- Data governance algorithms are encapsulated as operator pipelines, enabling reproducibility and fair comparison of different data governance strategies (❤️ research-friendly)
- Easily swap the underlying large models to quickly analyze the relationship between model performance and data quality
- Built on the Python and Git ecosystems for easy distribution, management, and traceability of high-quality, user-defined data governance operators and pipelines (❤️ enterprise-friendly)

The DataFlow Suite provides the essential infrastructure to automate and scale LLM data preparation around the DataFlow main repository. It comprises four tightly integrated layers:
- DataFlow-WebUI – an intuitive visual interface for constructing and managing complex data pipelines through a drag-and-drop operator workflow.
- DataFlow-Agent – an AI-powered assistant that dynamically composes, executes, and optimizes operators and pipelines based on high-level user intent.
- DataFlow-Ecosystem – a modular distribution layer that standardizes operator registration, enabling domain-specific modules (e.g., DataFlow-MM, DataFlow-AI4S) to contribute extensible libraries under a unified abstraction.
- RayOrch – a high-performance orchestration layer built on Ray, providing distributed compute scheduling and resource management for massive-scale data tasks.

Together, these components form a unified, extensible environment that transforms raw data into model-ready intelligence.

Data generation and cleaning are crucial for high-quality models, but for both enterprises and individuals these tasks are often time-consuming, labor-intensive, and costly. DataFlow provides a one-stop solution to tackle these challenges efficiently. Compared with systems like Nemo-Curator and Data-Juicer, DataFlow offers:
- Enhanced support for data synthesis modules – seamlessly integrates text, code, and math data generation pipelines for high-quality training datasets.
- PyTorch-like programming management – a clear Pipeline → Operator → Prompt hierarchy for workflow control.
- Principled, multi-category operator classification – operators are systematically organized into functional categories such as generation, evaluation, filtering, and refinement, forming a scientifically grounded, multi-dimensional taxonomy.
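The Pipeline → Operator → Prompt hierarchy mentioned above can be sketched as follows. All names here (`Prompt`, `LLMOperator`, `model_fn`) are hypothetical stand-ins, not DataFlow's real classes; the point is the separation of concerns, which also shows why swapping the underlying model to compare data quality is cheap.

```python
# Hypothetical sketch of a Pipeline -> Operator -> Prompt hierarchy.
# Prompts are data, operators wrap one model call, pipelines compose operators.
# None of these names are DataFlow's actual API.

class Prompt:
    """A reusable prompt template, kept separate from operator logic."""
    def __init__(self, template):
        self.template = template

    def render(self, **kwargs):
        return self.template.format(**kwargs)

class LLMOperator:
    """Wraps a single model call; swapping `model_fn` lets you study how
    different underlying LLMs affect the resulting data quality."""
    def __init__(self, prompt, model_fn):
        self.prompt = prompt
        self.model_fn = model_fn

    def __call__(self, record):
        return self.model_fn(self.prompt.render(text=record))

class Pipeline:
    """Runs operators in sequence over one record."""
    def __init__(self, operators):
        self.operators = operators

    def run(self, record):
        for op in self.operators:
            record = op(record)
        return record

# A stand-in "model" so the sketch runs offline; real use would call an LLM.
fake_model = lambda prompt: prompt.upper()

qa_op = LLMOperator(Prompt("Rewrite as a QA pair: {text}"), fake_model)
result = Pipeline([qa_op]).run("noisy pdf text")
# result == "REWRITE AS A QA PAIR: NOISY PDF TEXT"
```

Because the model function is injected at the operator level, replacing `fake_model` with a different backend changes nothing about the pipeline definition, which mirrors the "easily swap underlying large models" claim above.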

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
