Laimark – a self-improving 8B LLM on a consumer GPU
📦 Open source
#llama
#machine-learning/research
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
LAIMARK is a system that studies whether a language model can generate its own training data and improve itself through reinforcement learning with verifiable rewards. On HumanEval, training Qwen3-8B on a curriculum it generated itself raises pass@1 from 63.4% to 76.8%, capturing roughly 65% of the gain obtained with curated external data. Structural limits remain, however: improvements do not accumulate across repeated training rounds, and performance degrades when the task-type mix becomes skewed.
Full text
LAIMARK (Local AI Metacognitive Agent with Recursive Knowledge) studies whether a language model can generate its own training curriculum and improve via reinforcement learning from verifiable reward (RLVR). Four things run on a single base model: a prompt-evolution loop, a GRPO weight update, prompt re-optimization on the updated weights, and a problem-generation step that feeds the next GRPO round. Nothing outside the model participates, other than the Python interpreter used to check that generated code passes its own tests.

Paper: *LAIMARK: Gains and Structural Limits of Self-Generated Curricula in Reinforcement Learning from Verifiable Reward* (April 2026)

DeepSeek-R1 and related RLVR systems improve base models using curated external problem sets paired with automatic evaluators. LAIMARK asks what happens when the problem set comes from the model itself. On HumanEval with Qwen3-8B (HuggingFace fp16 harness):

| Configuration | External problems | pass@1 |
|---|---|---|
| Base model | n/a | 63.4% |
| GRPO, self-generated (G=4) | 0 | 76.8% |
| GRPO, curated (HumanEval + MBPP) | hundreds | 84.1% |

Self-generation with calibration captures about 65% of the curated-benchmark gain on two orders of magnitude less data.

Three limits cap this approach. First, iteration does not accumulate: a second GRPO round trained on problems calibrated against the first-round checkpoint converges back to it. Second, a curriculum dominated by a single task type (for example, 84% abduction-style problems) drops pass@1 to 61.0%, below the pre-training baseline, by shifting the output-format prior in a direction that misfits HumanEval. Third, at 32B parameters the learnability window closes entirely: the base model already solves nearly every problem it can formulate with a verified reference solution, so the selection criterion has nothing to accept.

Repository layout:

```
laimark/
  generate_problems.py              stage 1: candidate problem generation
  calibrate_problems.py             stage 2: calibration against current policy
  train_grpo.py                     stage 3: GRPO training (main entry point)
  eval_adapter.py                   stage 4: HumanEval evaluation (HF fp16)
  generate_and_calibrate_hf.py      HF-only pipeline (no Ollama required)
  generate_deduction_abduction.py   §5.2: task-type diversity generation
  calibrate_deduction_abduction.py  §5.2: calibration with type mix
  train_dpo.py, train_lora.py       §4: baselines for failed approaches
  export_gguf.sh                    GGUF export for deployment
data/calibrated_selfgen.jsonl       22 self-generated calibrated problems
                                    (the curriculum loaded by --selfgen_only)
docs/experimental-results.md        consolidated results tables
paper/laimark.tex                   paper source (+ references.bib)
paper/Laimark.pdf                   compiled paper
```

Setup:

```bash
git clone https://github.com/seetrex-ai/laimark.git
cd laimark
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -e ".[dev]"
```

Training and evaluation require a GPU with enough VRAM for Qwen3-8B in fp16 (we used an NVIDIA A100 80GB).

Train directly on the shipped curriculum (`data/calibrated_selfgen.jsonl`, 22 problems) and evaluate:

```bash
python laimark/train_grpo.py --selfgen_only --num_generations 4 --epochs 2 --seed 42
python laimark/eval_adapter.py --adapter_path ./grpo_output/final --seed 42
```
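The README does not show how `train_grpo.py` computes its reward, but the mechanism it names (the Python interpreter checking that generated code passes its own tests) is easy to sketch. Below is a minimal, hypothetical illustration, not the repo's code: it assumes TRL's `GRPOTrainer` reward-function convention and a dataset column named `test_code` carrying each problem's unit tests; both are assumptions.

```python
# Hypothetical sketch of a verifiable (binary) reward for GRPO; NOT the
# repo's train_grpo.py. Assumes TRL-style reward functions, where extra
# dataset columns (here "test_code") arrive as keyword arguments, and
# standard (non-conversational) prompts so each completion is a string.
import subprocess
import sys

def exec_reward(completions, test_code, **kwargs):
    """Return 1.0 for completions that pass their own bundled tests inside
    a 5-second subprocess timeout (see the security warning below), else 0.0."""
    rewards = []
    for candidate, tests in zip(completions, test_code):
        program = candidate + "\n\n" + tests
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                timeout=5,
            )
            rewards.append(1.0 if result.returncode == 0 else 0.0)
        except subprocess.TimeoutExpired:
            rewards.append(0.0)
    return rewards

# Hypothetical wiring (column names and dataset format are assumptions):
# from datasets import load_dataset
# from trl import GRPOConfig, GRPOTrainer
# trainer = GRPOTrainer(
#     model="Qwen/Qwen3-8B",
#     reward_funcs=exec_reward,
#     args=GRPOConfig(output_dir="grpo_output", num_generations=4, seed=42),
#     train_dataset=load_dataset(
#         "json", data_files="data/calibrated_selfgen.jsonl"
#     )["train"],
# )
# trainer.train()
```

A binary pass/fail signal is what makes the reward "verifiable": no learned judge is in the loop, so the only way to game it is through the self-written tests themselves.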
To regenerate the curriculum from scratch, the full pipeline is:

```bash
# 1. Generate ~1000 candidate problems (Ollama local backend)
python laimark/generate_problems.py --count 1000 --seed 42

# 2. Calibrate against the base model: keep problems in the learnability
#    window [0.2, 0.8] (a sketch of this filter follows the results table
#    below). Roughly one candidate in ten survives verification and
#    calibration.
python laimark/calibrate_problems.py --samples 8 --lo 0.2 --hi 0.8 --seed 42

# 3. GRPO on the calibrated self-generated curriculum, no external benchmarks
python laimark/train_grpo.py --selfgen_only --num_generations 4 --epochs 2 --seed 42

# 4. Evaluate the trained adapter on HumanEval
python laimark/eval_adapter.py --adapter_path ./grpo_output/final --seed 42
```

Every stage accepts `--seed` (default 42) and passes it to the underlying sampler (Ollama's seed option, `torch.manual_seed`, `GRPOConfig.seed`). Regeneration from step 1 is deterministic up to Ollama kernel non-determinism on multi-GPU setups; the shipped `data/calibrated_selfgen.jsonl` is a snapshot of one specific run.

`--num_generations` drives the result at small data volumes (paper §4). On the same 22-problem curriculum, moving from G=2 to G=4 gains 6.7 pass@1 points; adding more problems at G=2 does less.

Each configuration in the paper corresponds to one pipeline run:

| Paper section | Command | pass@1 |
|---|---|---|
| §4, selfgen v2 | `train_grpo.py --selfgen_only --num_generations 4` | 76.8% |
| §4, curated | `train_grpo.py --num_generations 4` (default: HE+MBPP mix) | 84.1% |
| §5.1, iteration | `train_grpo.py --selfgen_only --base_adapter grpo_output/final` | 76.8% (no gain) |
| §5.2, task-type R1 | `calibrate_deduction_abduction.py` then GRPO | 61.0% |
| §5.3, 32B base | `eval_adapter.py --model Qwen3-32B --full_function` | 89.0% |

Trained LoRA adapters are not committed. Re-running the pipeline with `--seed 42` reproduces the curriculum and training trajectory; exact pass@1 parity depends on the hardware and on GPU-nondeterministic operations in PyTorch.
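The [0.2, 0.8] window in step 2 has a concrete GRPO rationale: advantages are computed relative to the group of G samples, so a problem the current policy always fails, or always solves, yields identical rewards across the group and therefore a zero gradient. Below is a minimal sketch of such a calibration filter; the names `calibrate`, `sample_fn`, and `passes_fn` are illustrative, not the repo's API.

```python
# Illustrative sketch of stage-2 calibration, NOT the repo's
# calibrate_problems.py. For each candidate problem, estimate the current
# policy's pass rate from `samples` draws and keep the problem only if the
# rate falls inside the learnability window [lo, hi].
from typing import Callable, Iterable

def calibrate(
    candidates: Iterable[dict],             # {"prompt": str, "test_code": str}
    sample_fn: Callable[[str], str],        # prompt -> one sampled completion
    passes_fn: Callable[[str, str], bool],  # (completion, tests) -> passed?
    samples: int = 8,
    lo: float = 0.2,
    hi: float = 0.8,
) -> list[dict]:
    kept = []
    for problem in candidates:
        passed = sum(
            passes_fn(sample_fn(problem["prompt"]), problem["test_code"])
            for _ in range(samples)
        )
        rate = passed / samples
        # rate > hi: too easy, every GRPO sample is rewarded -> no signal.
        # rate < lo: too hard, every sample gets reward 0 -> no signal.
        if lo <= rate <= hi:
            kept.append(problem)
    return kept
```

With `--samples 8` the estimated rate is quantized to eighths, so the shipped window effectively keeps problems the base model solves between 2 and 6 times out of 8.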
> **Warning**: The pipeline runs model-generated Python code in a subprocess with a 5-second timeout. The sandbox does not block filesystem, network, or environment access from the executed code. Run inside a container or disposable VM; see SECURITY.md.

To cite the paper:

```bibtex
@article{tabares2026laimark,
  title  = {LAIMARK: Gains and Structural Limits of Self-Generated Curricula in Reinforcement Learning from Verifiable Reward},
  author = {Tabares Montilla, Jes{\'u}s},
  year   = {2026},
  doi    = {10.5281/zenodo.19639751},
  url    = {https://doi.org/10.5281/zenodo.19639751}
}
```

The prompt-evolution component of LAIMARK builds on the open-ended evolution framework of the Darwin Gödel Machine (Zhang et al., 2025); the weight-update and self-generation components are new to this work.

```bibtex
@article{zhang2025darwin,
  title   = {Darwin G{\"o}del Machine: Open-Ended Evolution of Self-Improving Agents},
  author  = {Zhang, Jenny and Hu, Shengran and Lu, Cong and Lange, Robert and Clune, Jeff},
  journal = {arXiv preprint arXiv:2505.22954},
  year    = {2025}
}
```

License: Apache 2.0. Upstream attribution in NOTICE.

This analysis was written by the Genesis Park editorial team with AI assistance. The original post is available via the source link.