Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer
hackernews
📦 Open Source
#pdp-11
#transformer
#retro-computing
#machine-learning
#machine-learning/research
#assembly-language
Original source: hackernews · Summary and analysis by Genesis Park
Summary
A single-layer transformer written in assembly language for the PDP-11, a 1970s minicomputer, demonstrates that hardware of that era could learn to reverse a sequence of digits. To cope with the memory constraints, the project uses fixed-point arithmetic with a different precision for each pass and hand-tuned learning rates, cutting training time from 6.5 hours to 2.5 hours and fitting the model into 32KB of memory.
Full Text
A single-layer, single-head transformer written in PDP-11 assembly language. This project is the spiritual successor of Xortran, a neural network that learns XOR with backpropagation in Fortran IV on the IBM 1130 (1965) and the PDP-11/20 (1970). The natural next step was to see whether those machines could train a small transformer in an acceptable amount of time (a few hours).

Architecturally, a transformer is a fairly modest extension of a basic neural network. The building blocks such as matrix multiplies, backpropagation, SGD, and cross-entropy are already there. The three new components are:

- Self-attention: dot-product score between projected queries and keys
- Positional encoding: learned position embeddings, added to the input
- Softmax: to turn scores into a probability distribution

The goal is to train the transformer to reverse a sequence of digits. Despite its apparent simplicity, reversal is not a trivial task for a neural network: the model must learn to route each token to a position that depends only on its index, with no content-based shortcut. This is exactly the kind of problem self-attention is designed for, and it is in fact one of the algorithmic benchmarks included in Tensor2Tensor, Google's 2017 reference implementation of the original transformer.

The data path is straightforward: tokens are embedded, passed through self-attention with a residual connection, then projected back to the vocabulary and softmaxed into a prediction:

Tokens -> Embedding -> Self-Attention -> Residual -> Projection -> Softmax

| Hyperparameter | Value |
|---|---|
| Layers | 1 |
| Heads | 1 |
| d_model | 16 |
| Sequence length | 8 |
| Vocabulary | 10 (digits 0–9) |
| Parameters | 1,216 |

The model is an encoder-only transformer: embedding, self-attention with a residual connection, and an output projection. It is a genuine transformer with self-attention, but it is neither BERT nor GPT: it has no layer norm, no feed-forward network, and no decoder. The task requires no transformation of the token representations, so attention and the residual connection are sufficient. Layer normalization, useful in deeper networks to prevent activation drift, is unnecessary with a single layer.

The first implementation followed Xortran and was written in Fortran IV. With a uniform learning rate of 0.01, the model took 25 minutes per 100 steps and needed 1,500 training steps to reach 100% accuracy, which on real hardware would have translated to about 6.5 hours of training, and possibly a whole week on the IBM 1130. This was unacceptably long even by 1970s standards, as those machines were time-shared and computing time was very valuable.

A first improvement was the switch to hand-tuned, per-layer learning rates:

| Layer | Learning rate |
|---|---|
| Wq, Wk, Wv (attention) | 0.08 |
| Token & position embeddings | 0.01 |
| Wout (output projection) | 0.0025 |

Attention weights, which encode the reversal pattern, benefit from a high learning rate, while the output projection converges better with a smaller one. With this tuning, training dropped to 600 steps and an estimated 2.5 hours.

The optimizer is plain stochastic gradient descent (SGD). Adam would adapt the step size per parameter automatically, but at the cost of two extra state vectors per weight, tripling the memory devoted to parameters. It would also add a square root and a division per update, both expensive on the PDP-11 even with the EIS. The per-layer learning rates achieve a similar effect at no additional cost, and the model is small enough for the three rates to be tuned by hand. Skipping the optimizer state also allows the transformer to fit in 32KB of core memory instead of 64KB, which was important in the 1970s. Side note: since it is bare-metal assembly, ATTN/11 uses no more memory than Xortran, which pays the cost of RT-11 V3 and the Fortran runtime. The resulting binary is also fairly compact, at exactly 6,179 bytes.
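As a concrete illustration of the data path and the hand-tuned update described above, here is a minimal NumPy sketch of an equivalent model in floating point (the real implementation is fixed-point PDP-11 assembly). The function and variable names are illustrative rather than taken from ATTN/11, and the 1/sqrt(d_model) score scaling is the usual convention, not something the article specifies. With no bias terms, the parameter count works out to the 1,216 quoted above (10·16 + 8·16 + 3·16·16 + 16·10).

```python
import numpy as np

D_MODEL, SEQ_LEN, VOCAB = 16, 8, 10          # hyperparameters from the article
rng = np.random.default_rng(0)

def init(shape, scale=0.1):
    return rng.normal(0.0, scale, shape)

# 1,216 parameters in total: 10*16 + 8*16 + 3*16*16 + 16*10
params = {
    "tok_emb": init((VOCAB, D_MODEL)),       # token embeddings
    "pos_emb": init((SEQ_LEN, D_MODEL)),     # learned position embeddings
    "Wq": init((D_MODEL, D_MODEL)),          # query projection
    "Wk": init((D_MODEL, D_MODEL)),          # key projection
    "Wv": init((D_MODEL, D_MODEL)),          # value projection
    "Wout": init((D_MODEL, VOCAB)),          # output projection
}

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def forward(tokens, p):
    """Tokens -> Embedding -> Self-Attention -> Residual -> Projection -> Softmax."""
    x = p["tok_emb"][tokens] + p["pos_emb"]            # (SEQ_LEN, D_MODEL)
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]    # single head
    att = softmax(q @ k.T / np.sqrt(D_MODEL))          # scaling is an assumption
    x = x + att @ v                                    # residual connection
    return softmax(x @ p["Wout"])                      # per-position digit probabilities

# Hand-tuned per-layer learning rates from the article; plain SGD keeps no optimizer state.
LEARNING_RATES = {"Wq": 0.08, "Wk": 0.08, "Wv": 0.08,
                  "tok_emb": 0.01, "pos_emb": 0.01, "Wout": 0.0025}

def sgd_step(p, grads):
    for name in p:
        p[name] -= LEARNING_RATES[name] * grads[name]

# The training target for an input sequence is simply that sequence reversed.
seq = rng.integers(0, VOCAB, SEQ_LEN)
probs = forward(seq, params)          # compare argmax(probs, axis=-1) against seq[::-1]
```

Gradients for `sgd_step` would come from backpropagating through these same operations; the sketch only shows the forward data path and the three learning-rate groups.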
The core arithmetic operations use NN11, a minimal fixed-point neural network stack designed for ATTN/11 and the PDP-11. NN11 is organized in levels, not unlike BLAS: scalar primitives (FXMATH), vector operations such as dot product and scaling (VECOP), then matrix–vector operations (MATOP), each level building on the one below. Two additional modules extend the stack beyond linear algebra: activation functions and their lookup tables (ACTFN), and layer-level routines (LAYER) that compose the previous operations into embedding, projection, and attention.

The arithmetic is adapted to each pass:

| Pass | Format | Precision |
|---|---|---|
| Forward | Q8 | 8 fractional bits (1/256) |
| Backward | Q15 | 15 fractional bits (1/32768) |
| Weight accumulators | Q16 | 32-bit (16.16 fixed-point) |

The choice of Q8 forward and Q15 backward pairs well on the PDP-11: multiplying a Q8 value by a Q15 value yields Q23 in a 32-bit register pair, and a single ASHC #-8 brings it back to Q15. The backward-pass multiply thus costs no more than the forward-pass one, while giving gradients 128 times the resolution of the forward pass.
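To make the fixed-point trick concrete, here is a small Python sketch of the Q8 × Q15 multiply and a VECOP-style dot product built on it. The names (`to_q8`, `q8_q15_mul`, `q15_dot`) are illustrative, not actual NN11 routines; the point is only that the 23-fractional-bit product needs a single shift right by 8, the ASHC #-8 mentioned above, to land back in Q15.

```python
Q8_ONE = 1 << 8        # forward-pass format: 8 fractional bits
Q15_ONE = 1 << 15      # backward-pass format: 15 fractional bits

def to_q8(x: float) -> int:
    return int(round(x * Q8_ONE))

def to_q15(x: float) -> int:
    return int(round(x * Q15_ONE))

def q8_q15_mul(a_q8: int, b_q15: int) -> int:
    """Multiply a Q8 activation by a Q15 gradient.

    The product has 8 + 15 = 23 fractional bits (Q23, held in a 32-bit
    register pair on the PDP-11); shifting right by 8, like ASHC #-8,
    returns it to Q15.  Python's >> is an arithmetic shift, so negative
    values behave like the hardware shift as well.
    """
    return (a_q8 * b_q15) >> 8

def q15_dot(a_q8_vec, b_q15_vec) -> int:
    """VECOP-style dot product built on the scalar primitive (FXMATH-style),
    mirroring how NN11 stacks its levels.  Accumulation stays in Q15 here;
    the real stack keeps wider 16.16 accumulators for weight updates."""
    acc = 0
    for a, b in zip(a_q8_vec, b_q15_vec):
        acc += q8_q15_mul(a, b)
    return acc

# Example: 0.75 (Q8) * -0.5 (Q15) gives -0.375 in Q15.
print(q8_q15_mul(to_q8(0.75), to_q15(-0.5)) / Q15_ONE)   # -> -0.375
```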
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.