Jupyter notebook: https://github.com/gangfang/nanogpt/blob/main/gpt_dev.ipynb
Results analysis
experiment 1
batch_size = 32
block_size = 6
n_embd = 32
n_head = 4
n_layer = 3
max_iters = 5000
Arch:
- Multi headed attention
- residual connections
22849 parameters
Results:
- Training time: 25.1s
- Performance: train loss 2.1236, val loss 2.1695
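For reference, here is a minimal sketch of roughly what the block looks like in experiments 1 through 8, assuming the components from Karpathy's gpt_dev notebook (Head, MultiHeadAttention, a small FeedForward) with residual connections but no layer norm or dropout yet. Exact layer widths and bias settings may differ from the notebook, so parameter counts won't line up exactly with the numbers above.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters from experiment 1
block_size = 6
n_embd = 32
n_head = 4

class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)           # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # (B, T, T) scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)
        return wei @ v

class MultiHeadAttention(nn.Module):
    """Several heads in parallel, concatenated and projected back to n_embd."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class FeedForward(nn.Module):
    """Small per-token MLP."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
                                 nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: attention then MLP, each wrapped in a residual connection."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)     # residual around attention
        x = x + self.ffwd(x)   # residual around the MLP
        return x

print(Block(n_embd, n_head)(torch.randn(4, block_size, n_embd)).shape)  # torch.Size([4, 6, 32])
```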
experiment 2
batch_size = 32
block_size = 6
n_embd = 32
n_head = 4
n_layer = 3
max_iters = 10000 # increased
Arch:
- Multi headed attention
- residual connections
22849 parameters
Results
- Training time: 62.5s
- Performance: train loss 2.0394, val loss 2.1237
experiment 3
batch_size = 32
block_size = 32 # increased
n_embd = 32
n_head = 4
n_layer = 3
max_iters = 10000
Arch:
- Multi headed attention
- residual connections
23681 parameters
Results:
- Training time: 2m 36.3s
- Performance: train loss 1.8381, val loss 1.9865
experiment 4
batch_size = 32
block_size = 32
n_embd = 32
n_head = 4
n_layer = 3
max_iters = 10000
learning_rate = 3e-4 # decreased from 1e-3
Arch:
- Multi headed attention
- residual connections
23681 parameters
Results
- Training time: 2m 36.3s
- Performance: train loss 2.0735, val loss 2.1210
Observation: this learning rate is too low for most of the training run; the losses come down too slowly compared with experiment 3 over the same 10,000 iterations.
experiment 5
batch_size = 32
block_size = 32
n_embd = 32
n_head = 4
n_layer = 6 # increased
max_iters = 10000
learning_rate = 1e-3
Arch:
- Multi headed attention
- residual connections
42113 parameters
Results
- Training time: 4m 58s
- Performance: train loss 1.7351, val loss 1.8810
Observation: the losses dropped below 2.0 within the first half of the 10,000 iterations.
experiment 6
batch_size = 32
block_size = 32
n_embd = 128 # increased
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3
Arch:
- Multi headed attention
- residual connections
610625 parameters
Results
- Training time: 8m 4.5s
- Performance: train loss 1.4459, val loss 1.6770
experiment 7 (too long to train)
batch_size = 32
block_size = 256 # increased
n_embd = 128
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3
Arch:
- Multi headed attention
- residual connections
639297 parameters
Results
- Training time: too long to train
- Performance: too long to train
Observation: this only increased the size of pos_embd_table, so the parameter count barely grew, relatively speaking. Training, however, takes significantly longer: it took 6 minutes to complete 500 iterations, i.e. an estimated 2 hours for 10,000 iterations (see the quick check below).
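A quick back-of-envelope check on the parameter delta: the only parameters that depend on block_size are the position embeddings (a block_size x n_embd table), and that alone accounts for the reported difference of 639,297 - 610,625 = 28,672.

```python
# Only the position embedding table (block_size x n_embd) grows with block_size.
n_embd = 128
extra_params = (256 - 32) * n_embd
print(extra_params)  # 28672, i.e. the ~30k difference between experiments 6 and 7
```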
experiment 8: see if this is faster to train than experiment 7
batch_size = 32
block_size = 32
n_embd = 256 # increased
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3
Arch:
- Multi headed attention
- residual connections
2,400,833 parameters (GPT3 has 175b params)
Results
- Training time: 14m 45s
- Performance: train loss 1.4654, val loss 1.6836
Observation:
- This is much faster than experiment 7: 500 iterations took about a minute. These parameters clearly aren't created equal in their impact on training speed. Interestingly, doubling n_embd increased the parameter count about 4-fold, while the block_size increase in experiment 7 only added about 30,000 params to the count.
- There is no performance improvement over experiment 6. The second half of the 10,000 iterations did little to reduce the losses.
Question: why is training time not proportional to the increase in parameter count? Answer (from Claude):
- The large increase in block_size (experiment 7) slowed training significantly because of the quadratic complexity of the attention mechanism.
- Increasing n_embd (experiment 8) added more parameters but had a smaller impact on training time. This could be due to a combination of factors, including better cache utilization and the M2's ability to handle larger matrix multiplications relatively efficiently.
- The key point: the impact on training time depends more on how the parameters change the computational structure of the model than on the raw parameter count (see the rough cost sketch below).
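A rough, hypothetical cost model (ignoring constant factors, the number of heads, and the MLP expansion) illustrates the point: per layer, attention scales roughly with T^2 * d while the per-token MLP scales with T * d^2, so growing the context length T hurts much more than growing the embedding width d.

```python
# Very rough per-layer compute proxy: attention ~ T^2 * d, MLP ~ T * d^2.
def rel_cost(T, d):
    return T * T * d + T * d * d

print(rel_cost(256, 128) / rel_cost(32, 128))  # block_size 32 -> 256: ~19x more work per layer
print(rel_cost(32, 256) / rel_cost(32, 128))   # n_embd 128 -> 256: ~3.6x more work per layer
```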
experiment 9: add layernorm to see if the second half of training works better
batch_size = 32
block_size = 32
n_embd = 256
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3
Arch:
- Multi headed attention
- residual connections
- layernorm
2,407,489 parameters (GPT3 has 175b params)
Results
- Training time: 14m
- Performance: train loss 1.3485, val loss 1.5908
Observations:
- I'm not sure whether the second half of training specifically worked better, but the final losses improved over experiment 8, so layernorm contributed to overall training effectiveness (see the sketch after these observations).
- Overfitting is significant, given the gap between train loss and val loss.
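For reference, a minimal sketch of where the layer norms would go, assuming the pre-norm placement used in the nanoGPT video and reusing the MultiHeadAttention/FeedForward modules from the sketch under experiment 1. The extra 6,656 parameters (2,407,489 - 2,400,833) are consistent with two LayerNorm(256) modules per block plus one final norm before the LM head: 13 x 2 x 256 = 6,656.

```python
class Block(nn.Module):
    """Transformer block with layer norms added (pre-norm placement assumed)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)   # 2 * n_embd params (scale and shift)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # normalize before attention
        x = x + self.ffwd(self.ln2(x))    # normalize before the MLP
        return x
```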
experiment 10: add dropout to reduce overfitting
batch_size = 32
block_size = 32
n_embd = 256
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3
dropout = 0.2 # added
Arch:
- Multi headed attention
- residual connections
- layernorm
- dropout
2,407,489 parameters (GPT3 has 175b params)
Results
- Training time: 17m 40s
- Performance: train loss 1.4242, val loss 1.6229
Observation: dropout kept the gap between train loss and val loss in check (slightly smaller than in experiment 9), but it also seemed to hamper training overall: this run is worse than the previous one even though overfitting is less severe. The assumed dropout placements are sketched below.
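For reference, a sketch of where dropout typically goes in this architecture, assuming the three placements from the nanoGPT video: on the attention weights after the softmax, after the multi-head projection, and at the end of the feed-forward. Dropout adds no parameters, which matches the unchanged count (2,407,489) between experiments 9 and 10.

```python
dropout = 0.2

class FeedForward(nn.Module):
    """Per-token MLP with dropout on its output."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),  # active only in model.train() mode
        )

    def forward(self, x):
        return self.net(x)

# The other two assumed placements:
#   in Head.forward:               wei = self.dropout(F.softmax(wei, dim=-1))
#   in MultiHeadAttention.forward: out = self.dropout(self.proj(out))
```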
GPU training
experiment 11: same setup as experiment 10; see how much faster training can be on a GPU
GPU instance: gpu_1x_a100_sxm4, Lambda Labs
Results
- training time: 5m
- Performance: train loss 1.4252, val loss 1.6392
Observation: roughly 1/3 of the CPU training time (5m vs 17m 40s).
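A minimal sketch of the usual single-GPU change, assuming the script follows the nanoGPT-video pattern: the model's parameters and every sampled batch just need to live on the same device (the nn.Linear below is a stand-in for the GPT model).

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = nn.Linear(8, 8).to(device)      # stand-in for the GPT model; .to(device) moves the weights
xb = torch.randn(4, 8, device=device)   # every sampled batch must be created on / moved to the same device
print(model(xb).device)
```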
experiment 12: same setup, but on an instance with multiple GPUs
GPU instance: gpu_1x_a100_sxm4, Lambda Labs
Results
- training time: 4m
- Performance: train loss 1.4305, val loss 1.6353
Observation:
- no improvement over the previous run. I probably need to set up distributed training to actually utilize the multiple GPUs.
- and I looked into it: I attempted nn.DataParallel for a simpler multi-GPU setup, but it doesn't work with the custom transformer definition. DDP is used in https://github.com/karpathy/nanoGPT/blob/master/train.py#L8 and is the recommended approach (a minimal sketch of the pattern follows below), but since it is more complex I will come back to it when I study the nanoGPT training code.
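For reference, a minimal sketch of the general DDP pattern (not this project's code; build_model() is a hypothetical stand-in for constructing the GPT model). Each GPU gets its own process, and gradients are averaged across processes on every backward pass.

```python
# Launch with: torchrun --standalone --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')       # one process per GPU
local_rank = int(os.environ['LOCAL_RANK'])    # set by torchrun
torch.cuda.set_device(local_rank)

model = build_model().to(local_rank)          # hypothetical: builds the GPT model on this GPU
model = DDP(model, device_ids=[local_rank])   # wraps the model; grads are all-reduced across processes

# ... the usual training loop; each process samples its own batches ...

dist.destroy_process_group()
```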
experiment 13: same setup, but on a better GPU
GPU instance: gpu_1x_h100_pcie, Lambda Labs
Results
- training time: 9m 30s
- Performance: train loss 1.4297, val loss 1.6355
Observation: this is much slower than gpu_1x_a100_sxm4. Why is that? I guess it's not that important to have an answer right now.
experiment 14: scaling up the model further
GPU instance: gpu_1x_a100_sxm4, Lambda Labs
batch_size = 64 # increased
block_size = 256 # increased
n_embd = 384 # increased
n_head = 6 # increased
n_layer = 6
max_iters = 10000
learning_rate = 1e-3
dropout = 0.2
Arch:
- Multi headed attention
- residual connections
- layernorm
- dropout
5,466,689 parameters (GPT3 has 175b params)
Results
- Training time: 16m
- Performance: train loss 0.8349, val loss 1.5687
Observation: this run massively overfit the model in the second half of training (the val loss bottoms out around step 4000 while the train loss keeps falling; see the log below). Let me reduce the learning rate and see if overfitting gets any better.
step 0: train loss 4.2937, val loss 4.2903
step 500: train loss 1.6452, val loss 1.8075
step 1000: train loss 1.3973, val loss 1.6076
step 1500: train loss 1.2997, val loss 1.5436
step 2000: train loss 1.2340, val loss 1.5108
step 2500: train loss 1.1880, val loss 1.4952
step 3000: train loss 1.1508, val loss 1.4896
step 3500: train loss 1.1157, val loss 1.4844
step 4000: train loss 1.0861, val loss 1.4781
step 4500: train loss 1.0569, val loss 1.4957
step 5000: train loss 1.0328, val loss 1.5006
step 5500: train loss 1.0064, val loss 1.5027
step 6000: train loss 0.9837, val loss 1.5126
step 6500: train loss 0.9603, val loss 1.5160
step 7000: train loss 0.9385, val loss 1.5198
step 7500: train loss 0.9175, val loss 1.5382
step 8000: train loss 0.8999, val loss 1.5411
step 8500: train loss 0.8814, val loss 1.5569
step 9000: train loss 0.8643, val loss 1.5571
step 9500: train loss 0.8486, val loss 1.5714
step 9999: train loss 0.8349, val loss 1.5687
experiment 15: use a smaller learning rate
batch_size = 64
block_size = 256
n_embd = 384
n_head = 6
n_layer = 6
max_iters = 8000 # reduced too so it doesn't run for too long
learning_rate = 3e-4 # reduced
dropout = 0.2
Results
- Training time: 10m 40s
- Performance: step 7999: train loss 1.1198, val loss 1.4897
Observation: about 5,000 steps would have been enough; the remaining steps didn't do anything useful.
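One way to act on that observation would be a small, hypothetical tweak to the training loop (assuming the notebook's estimate_loss() helper and eval_interval/max_iters settings): stop once the val loss has stopped improving instead of always running for max_iters.

```python
# Hypothetical early-stopping variant of the eval loop.
best_val = float('inf')
patience, bad_evals = 5, 0            # stop after 5 evaluations without improvement

for it in range(max_iters):
    if it % eval_interval == 0:
        losses = estimate_loss()      # assumed helper returning {'train': ..., 'val': ...}
        if losses['val'] < best_val:
            best_val, bad_evals = losses['val'], 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break                 # val loss has plateaued; stop early
    # ... usual forward / backward / optimizer step ...
```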