Grove: Distributed LLM Training over AirDrop

hackernews | 📦 Open Source
#airdrop #llm #macbook #machine-learning #machine-learning/research #distributed-training
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

An open-source tool called 'Grove' has been released that enables distributed machine-learning training across Apple Silicon MacBooks over AWDL (the protocol behind AirDrop), with no IP or SSH configuration. A coordinator device and worker devices discover each other automatically and synchronize gradients, and the tool supports the MLX framework. It also implements DiLoCo and SparseLoCo to cut inter-device communication costs, with SparseLoCo transferring roughly 32x less data than dense synchronization.

Main Text

Distributed ML training across MacBooks. Zero config.

```
pip install grove-ml
```

Mac A:

```
grove start train.py -n 2
```

Mac B:

```
grove join
```

Both machines discover each other automatically, sync gradients, and train together. No SSH, no IP addresses, no configuration files.

Grove discovers peers over AWDL (the protocol behind AirDrop), then upgrades to direct WiFi when both devices share a network. If WiFi isn't available (e.g. eduroam, or no network at all), everything stays on AWDL.

Write a training script with a main() function:

```python
# train.py
import grove
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def main():
    world = grove.init()
    model = nn.Linear(64, 64)
    optimizer = optim.SGD(learning_rate=0.01)

    for step in range(100):
        x = mx.random.normal((8, 64))
        y = mx.random.normal((8, 64))
        loss, grads = nn.value_and_grad(
            model, lambda m, x, y: mx.mean((m(x) - y) ** 2)
        )(model, x, y)
        grads = grove.average_gradients(grads)
        optimizer.update(model, grads)
        mx.eval(model.state, optimizer.state)
```

Single device:

```
grove run train.py
```

Multiple devices:

```
grove start train.py -n 2   # coordinator
grove join                  # worker (shows interactive picker)
```

Workers receive the training script from the coordinator automatically.

DiLoCo

Each device trains independently for H steps, then syncs pseudo-gradients with Nesterov momentum. Good default for most setups.

```python
diloco = grove.diloco(model, H=500, outer_lr=0.7)

for step in range(total_steps):
    loss, grads = loss_and_grad(model, batch)
    optimizer.update(model, grads)
    mx.eval(model.state, optimizer.state)
    diloco.step(model)
```

| Parameter | Default | Description |
|---|---|---|
| H | 500 | Inner steps between syncs |
| outer_lr | 0.7 | Outer optimizer learning rate |
| outer_momentum | 0.9 | Nesterov momentum |
| overlap | False | Async overlap (sync in background) |
| quantize | False | E3M0 4-bit pseudo-gradients |
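The post doesn't show what `diloco.step` does internally, but the description above pins down the recipe: after H inner steps, each device forms a pseudo-gradient (the weights at the last sync point minus its current weights), averages it across peers, and applies it with outer Nesterov momentum. Here is a minimal sketch of that recipe using MLX tree utilities and the documented `grove.average_gradients` collective; the `OuterSync` name and the exact momentum formulation are illustrative assumptions, not Grove's code:

```python
# Sketch only -- NOT Grove's internals. Illustrates the DiLoCo recipe
# described above under stated assumptions.
import grove
import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_map

class OuterSync:  # hypothetical name
    def __init__(self, model: nn.Module, H=500, outer_lr=0.7, outer_momentum=0.9):
        self.H, self.lr, self.mu = H, outer_lr, outer_momentum
        self.t = 0
        self.snapshot = model.parameters()              # last agreed weights
        self.velocity = tree_map(mx.zeros_like, self.snapshot)

    def step(self, model: nn.Module):
        self.t += 1
        if self.t % self.H:
            return                                      # inner phase: no traffic
        # Pseudo-gradient: this device's drift since the last sync point.
        pseudo = tree_map(lambda s, w: s - w, self.snapshot, model.parameters())
        pseudo = grove.average_gradients(pseudo)        # documented collective
        # One common Nesterov formulation; the post doesn't specify the variant.
        self.velocity = tree_map(lambda v, g: self.mu * v + g, self.velocity, pseudo)
        new_w = tree_map(lambda s, g, v: s - self.lr * (g + self.mu * v),
                         self.snapshot, pseudo, self.velocity)
        model.update(new_w)
        self.snapshot = new_w
```

Note the communication pattern this implies: with H=500, devices exchange one parameter-sized message every 500 steps instead of every step, which is what makes slow AWDL-class links workable.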
SparseLoCo

DiLoCo with top-k compression and error feedback. Sends only the largest 1-3% of values each round, with unsent values carrying forward. ~32x less communication than dense DiLoCo.

```python
sloco = grove.sparseloco(model, H=500, topk=64, chunk=4096)

for step in range(total_steps):
    loss, grads = loss_and_grad(model, batch)
    optimizer.update(model, grads)
    mx.eval(model.state, optimizer.state)
    sloco.step(model)
```

| Parameter | Default | Description |
|---|---|---|
| H | 30 | Inner steps between syncs |
| outer_lr | 1.0 | Outer optimizer learning rate |
| topk | 64 | Values kept per chunk |
| chunk | 4096 | Chunk size for top-k selection |
| error_decay | 0.95 | Decay on error buffer |
| overlap | True | Async overlap (on by default) |

DeMo

DCT-compressed per-step sync. Transforms gradients to frequency space and sends the most significant components. Syncs every step rather than every H steps. Better suited for fast local networks.

```python
demo = grove.demo(model, lr=1e-3, topk=32)

for step in range(total_steps):
    loss, grads = loss_and_grad(model, batch)
    demo.step(model, grads)
```

| Parameter | Default | Description |
|---|---|---|
| lr | 1e-3 | Learning rate |
| decay | 0.999 | EMA decay |
| topk | 32 | DCT components kept per chunk |
| chunk | 64 | Chunk size |

API

```python
world = grove.init()
world.rank()   # this device's rank (0 = coordinator)
world.size()   # total number of devices

grove.average_gradients(grads)  # all-reduce + average
grove.all_sum(x)                # sum an MLX array across devices
grove.all_gather(x)             # gather an MLX array from all devices
grove.send(x, dst)              # send to a specific rank
grove.recv(shape, dtype, src)   # receive from a specific rank
grove.barrier()                 # wait for all devices
grove.report(loss)              # report loss to dashboard

grove.rank            # int
grove.world_size      # int
grove.is_available()  # True if world_size > 1
```
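These primitives compose in the usual SPMD style. As an editorial usage sketch assembled only from the calls documented above (the sharding scheme and variable names are assumptions, not from the README), a device can shard data by rank and reduce a metric across the cluster:

```python
# Hedged usage sketch built from the documented Grove API calls above.
import grove
import mlx.core as mx

world = grove.init()
rank, size = world.rank(), world.size()

# Give each device a disjoint, contiguous shard of a shared dataset.
data = mx.random.normal((1024, 64))
per_device = data.shape[0] // size
shard = data[rank * per_device : (rank + 1) * per_device]

# Compute a local scalar, then average it across all devices.
local_loss = mx.mean(shard ** 2)
global_loss = grove.all_sum(local_loss) / size

grove.barrier()                     # wait until every device reaches this point
if rank == 0:
    grove.report(global_loss)       # assumption: only the coordinator reports
```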
CLI

| Command | Description |
|---|---|
| grove run | Run on a single device |
| grove start -n N | Start a cluster with N nodes |
| grove start --name X | Start with a specific cluster name |
| grove join [name] | Join a cluster (interactive picker if no name) |
| grove status | System info and nearby clusters |

Add --logs to any command to see raw log output instead of the dashboard.

Environment variables

| Variable | Effect |
|---|---|
| GROVE_NO_WIFI | Skip WiFi upgrade probe, use AWDL only |

Requirements

- macOS with Apple Silicon (M1+)
- Python 3.10+
- MLX
- Xcode command-line tools (for compiling the Swift helper on first run)

License

MIT

This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
