Experiments with a 2M-param transformer

Jupyter notebook: https://github.com/gangfang/nanogpt/blob/main/gpt_dev.ipynb

Results analysis

experiment 1

batch_size = 32
block_size = 6
n_embd = 32
n_head = 4
n_layer = 3
max_iters = 5000

Arch:

  1. Multi headed attention
  2. residual connections

22849 parameters

Results:

  • Training time: 25.1s
  • Performance: train loss 2.1236, val loss 2.1695
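
A quick sanity check: the parameter counts in these experiments can be reproduced from the config alone. The snippet below is a sketch rather than code from the notebook; it assumes bias-free linears inside the blocks and a feed-forward hidden width equal to n_embd, which is what the reported counts work out to:

# Rough parameter count for a given config; reproduces the numbers reported in these experiments.
# vocab_size = 65 for the character-level tiny Shakespeare vocabulary used by the notebook.
def count_params(vocab_size=65, n_embd=32, block_size=6, n_layer=3, layernorm=False):
    embeddings = vocab_size * n_embd + block_size * n_embd       # token + position embedding tables
    per_block = 6 * n_embd * n_embd                              # Q, K, V, attn projection + 2 feed-forward matrices
    layernorms = (4 * n_layer + 2) * n_embd if layernorm else 0  # 2 layernorms per block + a final one
    lm_head = n_embd * vocab_size + vocab_size                   # output projection (with bias)
    return embeddings + n_layer * per_block + layernorms + lm_head

print(count_params())                                        # 22849 (experiment 1)
print(count_params(n_embd=256, block_size=32, n_layer=6))    # 2400833 (experiment 8)

Note that block_size only enters through the position-embedding table, while n_embd enters quadratically through the blocks; that asymmetry shows up later in experiments 7 and 8.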

experiment 2

batch_size = 32
block_size = 6
n_embd = 32
n_head = 4
n_layer = 3
max_iters = 10000 # increased

Arch:

  1. Multi headed attention
  2. residual connections

22849 parameters

Results:

  • Training time: 62.5s
  • Performance: train loss 2.0394, val loss 2.1237

experiment 3

batch_size = 32
block_size = 32 # increased
n_embd = 32
n_head = 4
n_layer = 3
max_iters = 10000

Arch:

  1. Multi headed attention
  2. residual connections

23681 parameters

Results:

  • Training time: 2m 36.3s
  • Performance: train loss 1.8381, val loss 1.9865

experiment 4

batch_size = 32
block_size = 32
n_embd = 32
n_head = 4
n_layer = 3
max_iters = 10000
learning_rate = 3e-4 # decreased from 1e-3

Arch:

  1. Multi headed attention
  2. residual connections

23681 parameters

Results:

  • Training time: 2m 36.3s
  • Performance: train loss 2.0735, val loss 2.1210

Observation: this learning rate is too slow for most of the training run.
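
Rather than committing to a single constant rate, one option is to decay it over the run so the early steps move fast and the late steps fine-tune. A minimal sketch using PyTorch’s built-in cosine schedule (training-loop names as in the notebook; I haven’t benchmarked this here):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Decay the learning rate from 1e-3 down towards 1e-4 over the whole run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters, eta_min=1e-4)

for it in range(max_iters):
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()   # advance the schedule once per training iteration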

experiment 5

batch_size = 32
block_size = 32
n_embd = 32
n_head = 4
n_layer = 6 # increased
max_iters = 10000
learning_rate = 1e-3

Arch:

  1. Multi headed attention
  2. residual connections

42113 parameters

Results:

  • Training time: 4m 58s
  • Performance: train loss 1.7351, val loss 1.8810

Observation: The losses dropped below 2.0 within the first half of the 10000 iterations

experiment 6

batch_size = 32
block_size = 32
n_embd = 128 # increased
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3

Arch:

  1. Multi headed attention
  2. residual connections

610625 parameters

Results:

  • Training time: 8m 4.5s
  • Performance: train loss 1.4459, val loss 1.6770

experiment 7 (too long to train)

batch_size = 32
block_size = 256 # increased
n_embd = 128
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3

Arch:

  1. Multi headed attention
  2. residual connections

639297 parameters

Results:

  • Training time: not completed (see observation below)
  • Performance: not measured

Observation: this change only increased the size of pos_embd_table, so the parameter count barely grew in relative terms. Training, however, takes significantly longer: it took 6 minutes to complete 500 iterations, which extrapolates to roughly 2 hours for the full 10000 iterations.
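
The numbers line up: the position-embedding table is the only thing block_size touches, while the attention computation grows quadratically with it. A quick check:

# Parameter growth: only the position-embedding table depends on block_size.
print((256 - 32) * 128)   # 28672, exactly the gap between 639297 and 610625

# Compute growth: each head materializes a block_size x block_size attention
# matrix per sequence, so that part of the work scales with block_size**2,
# i.e. roughly (256 / 32) ** 2 = 64x, even though each batch only carries
# 8x more tokens per iteration.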

experiment 8: see if this is faster than experiment 7

batch_size = 32
block_size = 32
n_embd = 256 # increased
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3

Arch:

  1. Multi headed attention
  2. residual connections

2,400,833 parameters (GPT-3 has 175B params)

Results:

  • Training time: 14m 45s
  • Performance: train loss 1.4654, val loss 1.6836

Observations:

  1. This is much faster than experiment 7: 500 iterations took about a minute. These hyperparameters clearly aren’t created equal in their impact on training speed. Interestingly, doubling n_embd roughly quadrupled the parameter count, while the block_size increase in experiment 7 only added about 30,000 params.
  2. There is no performance improvement over experiment 6: the second half of the 10000 iterations did little to reduce the losses.

Question: why is the training time not proportional to the increase in parameter count? Answer (from Claude):

  • The large increase in block_size (experiment 7) significantly slowed down training due to the quadratic complexity of the attention mechanism.
  • Increasing n_embd (experiment 8) increased the parameter count more but had a smaller impact on training time. This could be due to a combination of factors, including better cache utilization and the M2’s ability to handle larger matrix operations relatively efficiently.
  • The key point: the impact on training time depends more on how the parameters affect the computational structure of the model than on the raw number of parameters.
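
To make that concrete, here is a back-of-the-envelope cost model (illustrative only, not a profiler measurement). Parameters live almost entirely in the n_embd x n_embd weight matrices, so the count scales with n_embd squared (hence the ~4x jump from experiment 6 to 8), while attention adds a block_size-squared compute term that carries no parameters at all:

# Very rough multiply-add estimate per training iteration (forward pass only,
# ignoring softmax, layernorm, embeddings and constant factors).
def flops_per_iter(batch_size, block_size, n_embd, n_layer):
    tokens = batch_size * block_size
    linears = tokens * 6 * n_embd * n_embd            # Q, K, V, proj and feed-forward
    attention = batch_size * block_size**2 * n_embd   # score matrix + weighted sum over values
    return n_layer * (linears + attention)

exp7 = flops_per_iter(32, 256, 128, 6)   # big block_size, ~639k params
exp8 = flops_per_iter(32, 32, 256, 6)    # big n_embd, ~2.4M params
print(exp7 / exp8)                       # ~2.6: experiment 7 needs more compute despite ~4x fewer params

The observed gap was even larger than 2.6x (roughly a minute vs six minutes per 500 iterations), which is plausibly down to memory traffic: experiment 7 materializes 32 x 4 x 256 x 256 attention-score matrices per layer, versus 32 x 4 x 32 x 32 for experiment 8. Either way, the raw parameter count is a poor proxy for cost.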

experiment 9: add layernorm to see if the second half of training works better

batch_size = 32
block_size = 32
n_embd = 256
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3

Arch:

  1. Multi headed attention
  2. residual connections
  3. layernorm

2,407,489 parameters (GPT-3 has 175B params)

Results:

  • Training time: 14m
  • Performance: train loss 1.3485, val loss 1.5908

Observations:

  • I’m not sure whether the second half of training specifically worked better, but performance did improve over experiment 8, so layernorm contributed to overall training effectiveness.
  • Overfitting is significant, given the gap between train loss and val loss.
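
For reference, this is roughly how the layernorms slot into each block, using the pre-norm arrangement from the video (MultiHeadAttention and FeedForward are the modules already defined in the notebook; treat this as a sketch of the layout rather than a copy of my exact code):

import torch.nn as nn

class Block(nn.Module):
    """Transformer block: pre-norm residual connections around attention and feed-forward."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)   # defined earlier in the notebook
        self.ffwd = FeedForward(n_embd)                   # defined earlier in the notebook
        self.ln1 = nn.LayerNorm(n_embd)   # normalize the input going into attention
        self.ln2 = nn.LayerNorm(n_embd)   # normalize the input going into the feed-forward

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual around attention
        x = x + self.ffwd(self.ln2(x))    # residual around feed-forward
        return x

There is also one final nn.LayerNorm before the lm_head; together these account for the extra 6,656 parameters over experiment 8 (6 blocks x 2 layernorms x 2 x 256 weights and biases, plus 2 x 256 for the final one).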

experiment 10: add dropout to reduce overfitting

batch_size = 32
block_size = 32
n_embd = 256
n_head = 4
n_layer = 6
max_iters = 10000
learning_rate = 1e-3
dropout = 0.2 # added

Arch:

  1. Multi headed attention
  2. residual connections
  3. layernorm
  4. dropout

2,407,489 parameters (GPT-3 has 175B params)

Results:

  • Training time: 17m 40s
  • Performance: train loss 1.4242, val loss 1.6229

Observation: dropout did reduce the gap between train loss and val loss somewhat, but it also seemed to hamper overall training performance: this run is worse than the previous one even though overfitting is less severe.
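
For reference, the dropout ends up in three places in this style of model: on the attention weights after the softmax, on the attention output projection, and at the end of the feed-forward. A sketch of the feed-forward case (layer widths chosen to match the reported parameter counts; the real module may differ in detail):

import torch.nn as nn

dropout = 0.2

class FeedForward(nn.Module):
    """Position-wise feed-forward block with dropout applied to its output."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd, bias=False),
            nn.ReLU(),
            nn.Linear(n_embd, n_embd, bias=False),
            nn.Dropout(dropout),   # randomly zero activations during training only
        )

    def forward(self, x):
        return self.net(x)

Dropout is only active in model.train() mode, so as long as losses are estimated with the model in eval() mode (as the estimate_loss helper in Karpathy’s version does), the reported numbers aren’t distorted by dropout itself.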

GPU training

experiment 11: same setup as experiment 10; see how much faster training is on a GPU

GPU instance: gpu_1x_a100_sxm4, Lambda Labs

Results:

  • Training time: 5m
  • Performance: train loss 1.4252, val loss 1.6392

Observation: about 1/3 of the CPU training time.
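
The only code change needed for the GPU runs is the usual device plumbing, roughly as below (variable names as in the notebook):

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = model.to(device)   # move all parameters onto the GPU

def get_batch(split):
    # Sample a random batch and move it to the same device as the model.
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)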

experiment 12: same setup but on an instance with multiple GPUs

GPU instance: gpu_1x_a100_sxm4, Lambda Labs

Results:

  • Training time: 4m
  • Performance: train loss 1.4305, val loss 1.6353

Observation:

  • No improvement over the previous run. I probably need to set up distributed training to actually utilize the multiple GPUs.
  • And I did look into it: I tried nn.DataParallel as the simpler multi-GPU setup, but it doesn’t work with my custom transformer definition. DDP is what https://github.com/karpathy/nanoGPT/blob/master/train.py#L8 uses and is the recommended approach, but since it is more complex, I will set it up when I study the nanoGPT training code.
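
For the record, the nn.DataParallel attempt is essentially a one-line wrap; a sketch of the generic pattern (not a working fix for the issue above):

import torch
import torch.nn as nn

model = model.to('cuda')
if torch.cuda.device_count() > 1:
    # Replicate the module on every visible GPU and split each batch along dim 0.
    model = nn.DataParallel(model)

# Caveat: custom attributes and methods of the original module (e.g. generate)
# now live on model.module, not on the DataParallel wrapper itself.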

experiment 13: same setup but with a better GPU

GPU instance: gpu_1x_h100_pcie, Lambda Labs

Results:

  • Training time: 9m 30s
  • Performance: train loss 1.4297, val loss 1.6355

Observation: this is much slower than gpu_1x_a100_sxm4. Why is that? I guess it is not that important to have an answer right now.

experiment 14: scaling up the model further

GPU instance: gpu_1x_a100_sxm4, Lambda Labs

batch_size = 64 # increased
block_size = 256 # increased
n_embd = 384 # increased
n_head = 6 # increased
n_layer = 6
max_iters = 10000
learning_rate = 1e-3
dropout = 0.2

Arch:

  1. Multi headed attention
  2. residual connections
  3. layernorm
  4. dropout

5,466,689 parameters (GPT-3 has 175B params)

Results:

  • Training time: 16m
  • Performance: train loss 0.8349, val loss 1.5687

Observation: this training run massively overfit the model in the second half of training, as the step-by-step losses below show. Let me reduce the learning rate and see if the overfitting gets any better.

step 0: train loss 4.2937, val loss 4.2903
step 500: train loss 1.6452, val loss 1.8075
step 1000: train loss 1.3973, val loss 1.6076
step 1500: train loss 1.2997, val loss 1.5436
step 2000: train loss 1.2340, val loss 1.5108
step 2500: train loss 1.1880, val loss 1.4952
step 3000: train loss 1.1508, val loss 1.4896
step 3500: train loss 1.1157, val loss 1.4844
step 4000: train loss 1.0861, val loss 1.4781
step 4500: train loss 1.0569, val loss 1.4957
step 5000: train loss 1.0328, val loss 1.5006
step 5500: train loss 1.0064, val loss 1.5027
step 6000: train loss 0.9837, val loss 1.5126
step 6500: train loss 0.9603, val loss 1.5160
step 7000: train loss 0.9385, val loss 1.5198
step 7500: train loss 0.9175, val loss 1.5382
step 8000: train loss 0.8999, val loss 1.5411
step 8500: train loss 0.8814, val loss 1.5569
step 9000: train loss 0.8643, val loss 1.5571
step 9500: train loss 0.8486, val loss 1.5714
step 9999: train loss 0.8349, val loss 1.5687

experiment 15: use a smaller learning rate

batch_size = 64
block_size = 256
n_embd = 384
n_head = 6
n_layer = 6
max_iters = 8000 # reduced too so it doesn't run for too long
learning_rate = 3e-4 # reduced
dropout = 0.2

Results:

  • Training time: 10m 40s
  • Performance (step 7999): train loss 1.1198, val loss 1.4897

Observation: about 5000 steps would have been enough; the rest didn’t do anything useful.
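
Given that, a cheap guard for future long runs is to keep whichever checkpoint has the lowest val loss (or stop once it plateaus). A minimal sketch around the notebook’s training loop and estimate_loss helper (names as in the notebook; illustrative only):

import torch

best_val = float('inf')

for it in range(max_iters):
    if it % eval_interval == 0:
        losses = estimate_loss()          # mean train/val loss, as in the notebook
        if losses['val'] < best_val:
            best_val = losses['val']
            torch.save(model.state_dict(), 'best.pt')   # keep the best model seen so far
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()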
