nanocode: the best Claude Code $200 can buy, in pure JAX on TPUs

hackernews | 📦 Open Source
#ai #anthropic #claude #jax #tpu #semiconductors #hardware/semiconductors
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

'nanocode' is a new open-source library that lets you train your own agentic coding model end-to-end, following Anthropic's Constitutional AI approach. The project is written for JAX and optimized for TPUs, and it adds 'The Stack-V2' coding data to the base training mixture, which substantially improves code tokenization efficiency. A 1.3-billion-parameter model can be trained on a TPU v6e-8 in about 9 hours for roughly $200, making it very economical. Using Google's free TPU support program (TRC), you can start and reproduce the project with no upfront cost.

Body

I'm so excited to share nanocode. This is a library showing you how to train your own Claude Code end-to-end. To a first approximation, we will follow the simplest possible approach for training using Constitutional AI, the approach used by Anthropic to train their Claude models. We'll write our own SOUL.md, define the agentic interface our model will use to interact with the world, generate synthetic data, and use preference optimisation to align the model with our SOUL.

nanocode is written entirely in JAX and designed to be trained using TPUs. I adapted the core training infrastructure and philosophy from Karpathy's incredible nanochat project, so if you're familiar with nanochat, nanocode should feel very similar. This is how my d24 1.3B-parameter nanocode turned out (demo video: nanocode.mp4).

You can get started for free using the Google TRC program, which gives you free access to pre-emptible TPUs for a month, and I think new Google Cloud accounts also get $300 in credits. I was fortunate to have access to the TRC program for 3 months for this project, and I found that my spot instances were rarely interrupted; I could easily keep the same pod up for a week or more. You can reproduce nanocode-d24 (1.3B params) in around ~9 hours in total on a TPU v6e-8 costing $200, or train nanocode-d20 (477M params) in ~1.5 hours costing $34. If you're using NVIDIA GPUs, nanocode should also work out of the box, but be aware that nanocode has been highly optimised for TPUs.

Training nanocode: a friendly agentic coding partner

Andrej's original release post for nanochat does a great job of explaining what we're doing here, and the commands you'll use in nanocode are virtually identical, so I'd recommend reading through his work first. I'll go over what we've done differently to elicit agentic coding behaviours from our model.

Tokenization and Pre-training

The pre-training and tokenizer training process is pretty much identical to nanochat's, but I found that including additional coding data from The Stack-V2 at a ratio of 1:5 in both the pre-training and tokenizer mixture resulted in a stronger coding model and more efficient code tokenization, which helped a ton. Let's first download the dataset shards we'll need for tokenizer training and model pre-training:

```bash
# we'll be training our d24, 1.3B parameter model, but you can adapt MODEL_TAG for your model size
export NANOCODE_BASE_DIR="$HOME/.cache/nanocode"
export MODEL_TAG=d24

python -m data.pretrain -d fineweb-edu -n 300
# I've pre-packed and sharded The Stack similar to FineWeb
python -m data.pretrain -d the-stack-v2-dedup -n 60
```

For reference, we can compare with nanochat's tokenizer, which is identical aside from the addition of The Stack in the training mixture (well, I've also added special tokens and templating logic to support more sophisticated tool calling, but more on that later). This gives a big boost for code at the cost of general text tokenization efficiency, which is okay since we want our model to do one thing very well: agentic coding. Our models are trained with a param:data ratio of 8 (following nanochat's scaling law analysis).
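To picture the 1:5 Stack-to-FineWeb mixing mentioned above (the shard counts in the download commands, 60 vs. 300, reflect roughly the same proportion, assuming similarly sized shards), here is a toy sketch of interleaving two document streams at a fixed ratio. The function name and the document-level interleaving scheme are my own illustration, not nanocode's actual data pipeline:

```python
# Hypothetical illustration of mixing two corpora at a 1:5 code-to-text document ratio.
# This is NOT nanocode's pipeline; the names and scheme are assumptions for illustration only.
from itertools import islice
from typing import Iterator

def interleave(code_docs: Iterator[str], text_docs: Iterator[str],
               code_every: int = 6) -> Iterator[str]:
    """Yield 1 code document for every 5 text documents (a 1:5 mixture)."""
    i = 0
    while True:
        try:
            yield next(code_docs) if i % code_every == 0 else next(text_docs)
        except StopIteration:
            return
        i += 1

# Toy usage with stand-in corpora:
code = (f"def f{i}(): pass" for i in range(100))
text = (f"plain text document {i}" for i in range(500))
mixture = list(islice(interleave(code, text), 12))
print(sum(doc.startswith("def") for doc in mixture), "of", len(mixture), "are code")  # 2 of 12
```

In practice the real pipeline works over pre-packed token shards rather than individual documents, so treat this purely as an illustration of the ratio.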
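The param:data ratio of 8 also pins down the token budget exactly. A quick sanity check (not part of the nanocode codebase) against the numbers in the training log below:

```python
# Token budget implied by a param:data ratio of 8 for the d24 model.
# The parameter count is read off the training log below (1342.17728M parameters).
n_params = 1_342_177_280
tokens = 8 * n_params
print(f"{tokens:,}")  # 10,737,418,240 -- matches "Training on 10737418240 tokens"
```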
Let's kick off a training run like so:

```
Vocab size: 32768
World size: 8
1342.17728M model parameters
67.108864M wte parameters
1207.959552M h parameters
67.108864M lm_head parameters
Training on 10737418240 tokens over 10241 steps
====================
Estimated FLOPs per token: 10066329600
Scaling the LR for the AdamW parameters ∝1/√(2048/768) = 0.612372
Step: 0/10241 | Loss: 10.398 | dt: 104.58s || tkps: 10026 | mfu: 1.37 | ETA: -1.0 min | lr_multiplier: 1.000
Peak bytes reserved/limit: 14.86/22.27
Step: 1/10241 | Loss: 9.771 | dt: 2.74s || tkps: 382082 | mfu: 52.37 | ETA: -1.0 min | lr_multiplier: 1.000
Step: 2/10241 | Loss: 8.209 | dt: 2.74s || tkps: 382220 | mfu: 52.39 | ETA: 234.1 min | lr_multiplier: 1.000
Step: 3/10241 | Loss: 7.327 | dt: 2.74s || tkps: 382193 | mfu: 52.39 | ETA: 312.1 min | lr_multiplier: 1.000
...
fwe_bpb: 0.7626 | sv2_bpb: 0.4356 | avg_bpb: 0.5991 | dt: 90.53s
The capital of France is Paris. It is the largest city in France and the most populous city in
The chemical symbol of gold is Au. Gold is a soft, malleable, yellow metal that is
The closest planet to the Sun is Mercury, which is the smallest planet in the solar system. It is the closest
The opposite of hot is cold. The opposite of cold is heat. The opposite of heat is cold.
The second-last day of the week is the day of the Lord. (Leviticus 23:2)
...
CORE metric: 0.2352 | dt: 56.86s
Total training time: 467.15min
```

Our model has attained some knowledge about the world, which is nice. It still doesn't know about Saturday though :). Let's look at some more thorough quantit…
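One line from the log above worth unpacking is the learning-rate scaling. The width of 2048 is consistent with the d24 model's embedding table (32768 vocab × 2048 = 67.108864M wte parameters), and 768 is presumably the reference width at which the base learning rate was tuned, so the AdamW learning rate is shrunk by 1/√(width ratio). The snippet below only reproduces that arithmetic; it is not taken from the nanocode source:

```python
# Reproduce the AdamW LR multiplier shown in the log: ∝ 1/sqrt(2048/768).
# Assumption: 2048 is the d24 hidden size and 768 a reference width (not confirmed by the post).
import math

model_dim = 2048
base_dim = 768
lr_scale = 1.0 / math.sqrt(model_dim / base_dim)
print(f"{lr_scale:.6f}")  # 0.612372 -- matches the training log
```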

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link above.
