Scaling Pedagogical Pre-Training: From Optimal Mixing to 10B Tokens
hackernews
💼 Business
#sutra-10b
#tip
#pedagogical-learning
#data-scaling
#machine-learning
#pre-training
#10b-tokens
#pedagogical-pre-training
#optimal-mixing
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Hugging Face announced Sutra-10B, a 10-billion-token pedagogical pre-training (Pedagogical Pre-Training) dataset built with an optimal data mixture, along with a model trained on it. The goal is to build more efficient and capable AI models through a new approach that weighs both the quality and the quantity of training data.
Body
That work left us with a natural question: what happens when you take the insights from optimal mixing and scale up the data itself? This post is the story of Sutra-10B, a 10 billion token pedagogical pre-training dataset, and the framework we built to create it. We describe how the Sutra generation pipeline works, from knowledge graph to quality filtering, what happened when we trained SmolLM2-70M on it for 3 full epochs (30.6 billion tokens total), and what the results tell us about the limits of small models and the value of curated data. Sutra-10B is the largest in a family of pedagogical datasets we have released at multiple scales, all collected in our Sutra Pedagogical Datasets collection.

Our mixing experiments showed that textbook-quality content (finePDFs) was the most valuable ingredient in the mix, consistently anchoring strong validation performance. But we were limited by available high-quality educational content. FinePDFs, Cosmopedia, and similar sources only go so far when you need billions of tokens. This is a challenge the field has been grappling with broadly. The HuggingFace team addressed it with FineWeb-Edu [1], using classifier-based quality filtering to extract educational content from web crawls, and later with Cosmopedia [2], which generates synthetic textbooks and articles seeded by 34,000 BISAC subject categories. The SmolLM2 family [3] demonstrated that combining these filtered and synthetic sources with careful multi-stage training can push sub-2B models to state-of-the-art. Microsoft's Phi-4 [4] showed that strategically placed synthetic data throughout pre-training, generated via multi-agent prompting and instruction reversal, can make a 14B model punch well above its weight class on reasoning tasks. But these approaches share a common limitation: they either filter existing content (losing volume) or generate synthetic content without a structured curriculum. We wanted to combine both.

So we built Sutra, a framework for generating pedagogical content at scale, guided by a knowledge graph that defines what to teach, in what order, and across what domains. Sutra is not a single prompt that asks an LLM to write textbook pages. It is a multi-stage pipeline with six components: a knowledge graph that defines the curriculum, a content generator that produces educational text, and a quality evaluator that scores output on six pedagogical dimensions. It also includes a diversity manager for broad topic coverage, a rephraser that transforms content into multiple formats, and a cleaner that removes duplicates and low-quality entries. Here is how each piece works.

At the heart of Sutra is a knowledge graph containing 1,942 concepts organized across 9 domains: mathematics, science, technology, language arts, social studies, arts and creativity, life skills, philosophy and ethics, and interdisciplinary topics. Each concept carries a complexity level (1 through 10), a list of prerequisites, a set of downstream concepts it builds toward, and cross-domain connections to related concepts in other fields. Concepts fall into four tiers based on complexity: fundamental (levels 1-3), intermediate (4-6), advanced (7+), and synthesis. The graph validates itself for circular dependencies and missing prerequisites, ensuring that the curriculum structure is internally consistent before any content gets generated.
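To make the graph structure concrete, here is a minimal Python sketch of a concept node and the self-validation pass described above. The `Concept` fields, the `validate` function, and the tier mapping are illustrative assumptions based on the description, not the actual Sutra schema.

```python
# Hypothetical sketch of a Sutra-style concept node; names and fields
# are assumptions, not the published schema.
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    domain: str                      # one of the 9 domains, e.g. "mathematics"
    complexity: int                  # 1 through 10
    prerequisites: list[str] = field(default_factory=list)
    builds_toward: list[str] = field(default_factory=list)  # downstream concepts
    cross_domain: list[str] = field(default_factory=list)   # links into other fields

    @property
    def tier(self) -> str:
        # Tiers as described above: fundamental (1-3), intermediate (4-6),
        # advanced (7+); "synthesis" is a separate content-driven tier,
        # so it is not derived from complexity here.
        if self.complexity <= 3:
            return "fundamental"
        if self.complexity <= 6:
            return "intermediate"
        return "advanced"

def validate(concepts: dict[str, "Concept"]) -> None:
    """Fail fast on missing prerequisites or circular dependency chains."""
    for c in concepts.values():
        for p in c.prerequisites:
            if p not in concepts:
                raise ValueError(f"{c.name}: unknown prerequisite {p!r}")

    # Depth-first traversal over prerequisite edges to detect cycles.
    state: dict[str, int] = {}       # name -> 1 (in progress) or 2 (done)

    def visit(name: str) -> None:
        state[name] = 1
        for p in concepts[name].prerequisites:
            if state.get(p) == 1:
                raise ValueError(f"circular dependency through {p!r}")
            if p not in state:
                visit(p)
        state[name] = 2

    for name in concepts:
        if name not in state:
            visit(name)
```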
The most important property of the knowledge graph is that it can produce a learning sequence for any set of target concepts: a generation order that respects prerequisite chains, so foundational content gets created before advanced material. This mirrors how a well-designed textbook builds knowledge incrementally. Cross-domain bridges connect related concepts across fields, so a concept like "statistical mechanics" in science links bidirectionally to "probability distributions" in mathematics. Recent curriculum learning studies [5, 6] have shown that ordering training data by difficulty can reduce training steps by 18-45%, though the interaction with learning rate decay schedules complicates things in practice [7]. Our knowledge graph provides the scaffolding for potential curriculum strategies, even though for the Sutra-10B training run itself we used standard shuffled pre-training.

The generator walks the knowledge graph and plans generation tasks across all concept and content type combinations. There are 14 content types, ranging from concept introductions to advanced applications and synthesis pieces. Not every content type applies to every concept: synthesis content only appears at complexity 3+, meta-learning at 4+, and code-related content types (implementation, explanation, debugging, optimization) are restricted to domains where they make sense, like programming, engineering, mathematics, and science. A breadth-first priority system ensures complete coverage before depth. Concepts with no generated entries yet receive the strongest priority boost, and new content types on existing concepts come next in priority.
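A prerequisite-respecting learning sequence is, at its core, a topological sort of the subgraph induced by the target concepts and their transitive prerequisites. Here is one hypothetical way to express it, reusing the `Concept` mapping sketched earlier and breaking ties by complexity so easier material surfaces first; the real pipeline's ordering logic may differ.

```python
# Hypothetical learning-sequence sketch; reuses the Concept dataclass
# from the earlier sketch. Assumes validate() has already rejected
# cycles and missing prerequisites.
import heapq

def learning_sequence(targets: list[str],
                      concepts: dict[str, "Concept"]) -> list[str]:
    """Prerequisite-respecting generation order for a set of target concepts."""
    # Collect the targets plus every transitive prerequisite.
    needed: set[str] = set()
    stack = list(targets)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(concepts[name].prerequisites)

    # Kahn's algorithm over prerequisite edges, with a heap keyed by
    # complexity so the easiest available concept is emitted first.
    indegree = {n: len(concepts[n].prerequisites) for n in needed}
    dependents: dict[str, list[str]] = {n: [] for n in needed}
    for n in needed:
        for p in concepts[n].prerequisites:
            dependents[p].append(n)

    heap = [(concepts[n].complexity, n) for n in needed if indegree[n] == 0]
    heapq.heapify(heap)
    order: list[str] = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
        for d in dependents[name]:
            indegree[d] -= 1
            if indegree[d] == 0:
                heapq.heappush(heap, (concepts[d].complexity, d))
    return order
```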
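And here is one way the breadth-first planner could be expressed, with the complexity and domain gates described above. The content-type names, the mapping of "programming, engineering" onto the graph's domains, and the scoring scheme are our assumptions for illustration, not the published planner.

```python
# Hypothetical planner sketch; reuses Concept from the earlier sketch.
# Domain names for the code-related types are a guess at how
# "programming, engineering, mathematics, and science" map onto domains.
CODE_TYPES = {"code_implementation", "code_explanation",
              "code_debugging", "code_optimization"}
CODE_DOMAINS = {"technology", "mathematics", "science"}

def applicable(content_type: str, concept: "Concept") -> bool:
    # Gates described in the post: synthesis at complexity 3+,
    # meta-learning at 4+, code types only in code-friendly domains.
    if content_type == "synthesis" and concept.complexity < 3:
        return False
    if content_type == "meta_learning" and concept.complexity < 4:
        return False
    if content_type in CODE_TYPES and concept.domain not in CODE_DOMAINS:
        return False
    return True

def plan(concepts: dict[str, "Concept"],
         content_types: list[str],
         done: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Breadth-first task order: concepts with no entries yet come first,
    then new content types on concepts that already have coverage."""
    covered = {name for name, _ in done}
    tasks = []
    for c in concepts.values():
        for ct in content_types:
            if applicable(ct, c) and (c.name, ct) not in done:
                boost = 0 if c.name not in covered else 1
                tasks.append(((boost, c.complexity), c.name, ct))
    tasks.sort()
    return [(name, ct) for _, name, ct in tasks]
```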
This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.