Weekly DL study notes: building GPT from scratch, part 2

  • we use a varying context length up to the block_size. This is to facilitate the transformer’s ability to predict the next character with contexts of different sizes. For example, if the block size is 8, the transformer learns to predict the next character using the previous 1, 2, …, or 8 characters (see the sketch below).
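    • A minimal sketch of how one block yields examples at every context length, assuming train_data is a 1-D tensor of already-encoded token ids (the values here are made up):

      ```python
      import torch

      # toy "training data": a 1-D tensor of token ids (assumed already encoded)
      train_data = torch.tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47])
      block_size = 8

      x = train_data[:block_size]       # inputs
      y = train_data[1:block_size+1]    # targets, shifted by one

      # one block of size 8 yields 8 training examples, with contexts of length 1..8
      for t in range(block_size):
          context = x[:t+1]
          target = y[t]
          print(f"when input is {context.tolist()} the target is {target.item()}")
      ```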
  • Simpler models can take a larger learning rate like 1e-3. Usually 1e-4 is used for more advanced networks
  • Using hardware to accelerate training
  • F.cross_entropy() expects logits to be of shape (minibatch, C, d1, d2, ..., dk). The channel (class) dimension has to be the second dimension (see the reshaping sketch below)
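    • A small sketch of two equivalent ways to satisfy that shape requirement for (B, T, C) logits; the shapes are assumptions, not the lecture’s exact values:

      ```python
      import torch
      import torch.nn.functional as F

      B, T, C = 4, 8, 65                  # batch, time, vocab size (assumed values)
      logits = torch.randn(B, T, C)       # model output
      targets = torch.randint(0, C, (B, T))

      # option 1: flatten (B, T, C) -> (B*T, C) so the class dim becomes dim 1
      loss1 = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))

      # option 2: move the class dim to position 1, matching (minibatch, C, d1)
      loss2 = F.cross_entropy(logits.permute(0, 2, 1), targets)

      print(loss1.item(), loss2.item())   # same value
      ```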
  • Notice the difference between the organization of this data and that of the makemore (i.e., name) data. The makemore data is simpler in that it consists of names, which structures the data with each name having a start and an end. A block, defined by the block_size, operates within a name. On the contrary, this Shakespeare data is a blob of text; there is no inherent structure to it, which allows arbitrary slicing. A block in this context operates on the entire text blob, and its position is determined by a random process (i.e., randint()) since there is no natural start point (see the get_batch sketch after this sub-list).
    • WAIT, the above analysis doesn’t seem to be complete. Even though there is a natural start and end point for a name, the def build_dataset(words): function slides a block through a word and repeats this for all words, so the X in the data contains examples of blocks instead of whole names. I could then use the same technique (i.e., block sliding) on the entire text blob in this data, producing a dataset of size (len(text) – block_size).
    • So I guess these are just different options for setting up the data.
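    • A sketch of the random-offset batching in the spirit of the lecture’s get_batch, using made-up data so it runs on its own:

      ```python
      import torch

      torch.manual_seed(1337)
      block_size = 8
      batch_size = 4

      # stand-in for the encoded Shakespeare blob
      data = torch.randint(0, 65, (1000,))

      def get_batch(data):
          # pick batch_size random starting offsets; any slice is a valid block
          # because the text blob has no inherent start/end structure
          ix = torch.randint(len(data) - block_size, (batch_size,))
          x = torch.stack([data[i:i+block_size] for i in ix])
          y = torch.stack([data[i+1:i+block_size+1] for i in ix])
          return x, y

      xb, yb = get_batch(data)
      print(xb.shape, yb.shape)   # torch.Size([4, 8]) torch.Size([4, 8])
      ```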
  • why is my code using AdamW not optimizing the bigram model? AN: because I called zero_grad() after backward() but before step(). Since the update is essentially .data -= lr * .grad, nothing happens when .grad is 0 (see the corrected loop below)
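    • A corrected ordering, sketched with a stand-in model rather than the actual bigram model:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      model = nn.Linear(10, 2)    # stand-in for the bigram model
      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

      x = torch.randn(32, 10)
      y = torch.randint(0, 2, (32,))

      for step in range(100):
          logits = model(x)
          loss = F.cross_entropy(logits, y)

          optimizer.zero_grad(set_to_none=True)   # clear old grads BEFORE backward()...
          loss.backward()                         # ...populate fresh grads...
          optimizer.step()                        # ...then update.
          # zero_grad() between backward() and step() wipes the grads and the step does nothing
      ```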
  • printing of matrix data: I only need to print one example when an individual example is the focus. Printing the entire dataset makes it harder to reason about
  • what makes the bigram model bigram in terms of its training loop? AN: the answer lies in the targets and the construction of the logits from idx. The data loader defines a target simply as the next character after a char in idx (line 34 in the data loader cell). So after I convert idx to logits using the embedding table and the subsequent .views, what I get is the logits of one character and the corresponding target of that character, the next character. This whole process makes the model bigram. This analysis is incorrect. The data loader doesn’t make this model bigram, because the same data loader, which produces next-token targets, is used in fancier models. These fancier models are still “next token predictors”. It doesn’t have anything to do with its training loop either; training and the model definition are orthogonal. What really makes the bigram model bigram is the architecture, which consists solely of an embedding table of (vocab_size by vocab_size), a one-to-one lookup table that restricts information flow from one single character to the next (see the sketch after this sub-list).
    • self.position_embedding_table(torch.arange(T)) simply returns the first T rows of the embedding table’s weights. When this position_embedding_table is used in the bigram model, we have awareness of the position of each char in the T dimension, but in the context of a bigram the block preceding a char is still not used by the model.
    • What makes the GPT model not bigram? AN: By definition, a bigram model predicts the next token based only on the current one. GPT uses the attention mechanism, where embedding vectors in the time dimension update each other and an updated time dimension of embedding vectors is returned from the attention module. GPT is not bigram because it predicts the next token based on information relevant to the current token within the block. Relevant-information retrieval is made possible by the attention mechanism, and relevancy is determined by the training data.
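    • A sketch of the bigram architecture, close to the lecture’s BigramLanguageModel (vocab_size=65 is just used here as an example value):

      ```python
      import torch
      import torch.nn as nn
      from torch.nn import functional as F

      class BigramLanguageModel(nn.Module):
          def __init__(self, vocab_size):
              super().__init__()
              # the whole model: a (vocab_size, vocab_size) lookup table;
              # row i holds the logits for "what comes after token i"
              self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

          def forward(self, idx, targets=None):
              logits = self.token_embedding_table(idx)   # (B, T, vocab_size)
              if targets is None:
                  return logits, None
              B, T, C = logits.shape
              loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
              return logits, loss

      m = BigramLanguageModel(vocab_size=65)
      idx = torch.randint(0, 65, (4, 8))
      logits, loss = m(idx, targets=torch.randint(0, 65, (4, 8)))
      print(logits.shape, loss.item())
      ```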
  • Embedding explained by 3Blue1Brown (video): directions in the high-dimensional embedding space can correspond to semantic meaning. An example is E(aunt) - E(uncle) ~= E(woman) - E(man)
  • the feedforward layer after the multi-headed attention layer is applied to each individual token in the input sequence/block. Andrej said this is individual tokens doing the thinking after communicating with other tokens in an input sequence. Based on self.ffwd = FeedForward(n_embd) and x = self.ffwd(x) # (B,T,C), we see self.ffwd(x) works on all tokens in (B, T). There are B * T of them (see the sketch below).
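    • A sketch close to the lecture’s FeedForward module; the 4x inner width follows the original Transformer paper’s convention, and the shapes in the usage lines are assumed:

      ```python
      import torch
      import torch.nn as nn

      class FeedForward(nn.Module):
          """per-token computation: the same MLP is applied to every (batch, time) position"""
          def __init__(self, n_embd):
              super().__init__()
              self.net = nn.Sequential(
                  nn.Linear(n_embd, 4 * n_embd),
                  nn.ReLU(),
                  nn.Linear(4 * n_embd, n_embd),
              )

          def forward(self, x):       # x: (B, T, C)
              return self.net(x)      # Linear acts on the last dim only -> (B, T, C)

      ffwd = FeedForward(32)
      x = torch.randn(4, 8, 32)       # B=4, T=8 -> 32 tokens, each processed independently
      print(ffwd(x).shape)            # torch.Size([4, 8, 32])
      ```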
  • three blocks of communication followed by computation are already considered deep by Andrej, and he said such a network suffers from optimization issues. But there are two techniques to cope with the depth of the network (both visible in the Block sketch after this list):
    • residual connections
    • layernorm
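    • A sketch of how the two techniques appear in a transformer block, in the pre-norm arrangement discussed further down; the sub-layers are passed in as arguments here (with identity stand-ins in the usage lines) instead of hard-coding the lecture’s MultiHeadAttention and FeedForward:

      ```python
      import torch
      import torch.nn as nn

      class Block(nn.Module):
          """communication then computation, with residual connections and layernorm"""
          def __init__(self, n_embd, sa, ffwd):
              super().__init__()
              self.sa = sa        # a multi-head self-attention module (as in the lecture)
              self.ffwd = ffwd    # a per-token MLP (e.g. the FeedForward sketch above)
              self.ln1 = nn.LayerNorm(n_embd)
              self.ln2 = nn.LayerNorm(n_embd)

          def forward(self, x):
              # pre-norm: normalize, run the sub-layer, then add back onto the residual stream
              x = x + self.sa(self.ln1(x))
              x = x + self.ffwd(self.ln2(x))
              return x

      # shape check with identity stand-ins, just to see the residual wiring
      blk = Block(n_embd=32, sa=nn.Identity(), ffwd=nn.Identity())
      print(blk(torch.randn(4, 8, 32)).shape)   # torch.Size([4, 8, 32])
      ```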
  • Kaiming He’s 3 contributions to deep net training: (this is his theme)
    • initialization
    • normalization
    • residual connections
  • Andrew Ng: good performance on the training set is usually the prerequisite for good performance on the eval set or test set
  • An identity function or identity operation is a function that returns its input without any change
  • [[Residual Networks (ResNets)]]
  • [[Attention mechanism]]
  • Why is a proj = nn.Linear(n_embd, n_embd) needed in the MultiHeadAttention class and the FeedForward class? AN: because we need to project the inner dimension of the main path, whatever it might be, back to the dimension of the skip connection. Their dimensions have to match (see the shape sketch below)
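    • A shape-level sketch of the MultiHeadAttention case (the numbers are assumptions): the concatenated head outputs get projected back into the residual pathway so the addition with the skip connection lines up:

      ```python
      import torch
      import torch.nn as nn

      B, T, n_embd, n_head = 4, 8, 32, 4
      head_size = n_embd // n_head                  # 8

      # pretend these are the outputs of the 4 attention heads
      head_outputs = [torch.randn(B, T, head_size) for _ in range(n_head)]
      out = torch.cat(head_outputs, dim=-1)         # (B, T, n_head*head_size) = (B, T, 32)

      # projection back into the residual pathway: its output must match the
      # skip connection's dimension (n_embd) so that `x + proj(out)` is well-defined
      proj = nn.Linear(n_embd, n_embd)
      x = torch.randn(B, T, n_embd)                 # the residual / skip connection
      y = x + proj(out)
      print(y.shape)                                # torch.Size([4, 8, 32])
      ```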
  • How does GPT use the entire context to predict the next token? By context, I mean everything in the prompt. I have this question because of logits = logits[:, -1, :] # becomes (B, C) in the def generate(...) method. This line implies the prediction relies only on the last token in the time dimension. AN: there is this line that comes before the above: idx_cond = idx[:, -block_size:]. The context grows until it hits the predefined max context length block_size, and communication happens within the block. At the end of each forward pass, predictions are made for each and every token in the time dimension. But since we want to generate the next token after the entire context, we ignore the predictions for all tokens in the context except those for the last token (see the sketch below).
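    • A sketch of the generation loop along those lines; it assumes a model whose forward returns (logits, loss), like the bigram sketch earlier in these notes:

      ```python
      import torch

      @torch.no_grad()
      def generate(model, idx, max_new_tokens, block_size):
          # idx: (B, T) tensor of token ids, the growing context
          for _ in range(max_new_tokens):
              idx_cond = idx[:, -block_size:]          # crop to the last block_size tokens
              logits, _ = model(idx_cond)              # (B, T, vocab): one prediction per position
              logits = logits[:, -1, :]                # keep only the last position's prediction
              probs = torch.softmax(logits, dim=-1)
              idx_next = torch.multinomial(probs, num_samples=1)   # sample one token (B, 1)
              idx = torch.cat((idx, idx_next), dim=1)  # append and continue
          return idx

      # usage with the BigramLanguageModel sketch from earlier in these notes
      m = BigramLanguageModel(vocab_size=65)
      out = generate(m, idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=20, block_size=8)
      print(out.shape)   # torch.Size([1, 21])
      ```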
  • The sole purpose of having the time dimension is to facilitate the attention mechanism
  • Imagine the batch dimension and the time dimension together serve as a “plate of computation”. Computation doesn’t happen in either of those dimensions; it only happens “on the plate”, in the channel dimension or in the embedding space. The embedding vectors “on the plate” change in value and in size. They change in value when they are passed through a feed-forward layer. They change in both value and size when they are passed through an output layer whose fan_in isn’t equal to its fan_out
  • layer norm is almost identical to batch norm in terms of implementation. The only difference is that layer norm normalizes over the feature dimension of each example instead of over the batch, so xmean and xvar are calculated the same way in both training and inference and there is no need to maintain a running_mean and running_var (see the sketch below)
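    • A sketch close to the lecture’s from-scratch layer norm (a BatchNorm1d with the normalization dimension flipped):

      ```python
      import torch

      class LayerNorm1d:
          """like batch norm, but stats are computed per example over the feature dim,
          so training and inference behave identically and no running stats are kept"""
          def __init__(self, dim, eps=1e-5):
              self.eps = eps
              self.gamma = torch.ones(dim)
              self.beta = torch.zeros(dim)

          def __call__(self, x):
              xmean = x.mean(dim=-1, keepdim=True)               # per-example mean (not per-batch)
              xvar = x.var(dim=-1, keepdim=True)                 # per-example variance
              xhat = (x - xmean) / torch.sqrt(xvar + self.eps)   # normalize
              return self.gamma * xhat + self.beta

          def parameters(self):
              return [self.gamma, self.beta]

      ln = LayerNorm1d(100)
      x = torch.randn(32, 100)
      out = ln(x)
      print(out[0].mean().item(), out[0].std().item())   # ~0 and ~1 for each row
      ```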
  • “pre-norm” formulation vs. “post-norm” formulation: they refer to arrangements of layer normalization in an ANN. Pre-norm means the layer norm is applied before the main operation of a layer, such as an attention layer or a feed-forward layer (this is the arrangement in the Block sketch above).
  • the dropout rate used by Andrej is 20%. This is bigger than I imagined
  • The transformer arch implemented in Andrej’s video is a decoder-only transformer architecture. The encoder and the cross-attention mechanism, where the query comes from the decoder but the key and the value come from the encoder, are not implemented. We don’t need the encoder and cross-attention because the model is supposed to generate text unconditioned on anything. The encoder-decoder arch presented in the original paper is used to translate between languages, so the input language needs to be encoded.
  • Andrej: “What makes it a decoder is that we are using the triangular mask in our transformer. So it has this autoregressive property where we can just go and sample from it.”
    • autoregressive property: refers to the capacity of a model to generate one part of the output at a time, conditioning each part on the parts generated before it. This means each output token is predicted based on all previously predicted tokens.
    • the use of the triangular mask makes it a decoder, because without it the model would be an encoder. Autoregression enforced by the triangular mask is an essential property of a decoder (see the Head sketch after these bullets).
    • the triangular mask provides the autoregressive property for transformer training. Text generation, or inference, with a transformer is autoregressive by nature because each predicted token is appended to the input before the input is fed back to the model to predict the next token. So the purpose of the triangular mask is really to enforce left-to-right language modeling during training, and therefore to align the training process with the inference process.
    • Sampling refers to the torch.multinomial(...) operation using the predicted probability distribution.
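    • A sketch of one masked attention head, close to the lecture’s Head class, with the triangular mask doing the “decoder” part (sizes in the usage lines are assumed):

      ```python
      import torch
      import torch.nn as nn
      from torch.nn import functional as F

      class Head(nn.Module):
          """one head of masked ("decoder") self-attention"""
          def __init__(self, n_embd, head_size, block_size):
              super().__init__()
              self.key = nn.Linear(n_embd, head_size, bias=False)
              self.query = nn.Linear(n_embd, head_size, bias=False)
              self.value = nn.Linear(n_embd, head_size, bias=False)
              self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

          def forward(self, x):
              B, T, C = x.shape
              k = self.key(x)                                       # (B, T, head_size)
              q = self.query(x)                                     # (B, T, head_size)
              wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5     # (B, T, T) affinities
              # the triangular mask: position t may only attend to positions <= t,
              # which enforces left-to-right (autoregressive) modeling during training
              wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
              wei = F.softmax(wei, dim=-1)
              v = self.value(x)
              return wei @ v                                        # (B, T, head_size)

      head = Head(n_embd=32, head_size=8, block_size=8)
      print(head(torch.randn(4, 8, 32)).shape)   # torch.Size([4, 8, 8])
      ```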
  • Why is a feedforward layer used after an attention layer? Claude 3.5: (this is a surface-level but useful answer) non-linearity, feature transformation and increased model capacity.
    • Note that the compute in the “communicate-then-compute” mental model applies to each position independently. Why is it independent? AN: because of nn.Linear(n_embd, n_embd, bias=False): the linear layer is applied to each embedding vector, which represents one position in a block. Therefore, the computation is done locally at each position and independently of other positions in a block (see the demo below).
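    • A small demo of that independence (shapes and values are made up): perturbing one position changes only that position’s output, because the linear layer mixes the channel dimension but never the time or batch dimensions:

      ```python
      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      B, T, C = 2, 8, 32
      lin = nn.Linear(C, C, bias=False)

      x = torch.randn(B, T, C)
      y1 = lin(x)

      x2 = x.clone()
      x2[0, 3] += 1.0        # perturb a single position (batch 0, time 3)
      y2 = lin(x2)

      changed = ~torch.isclose(y1, y2).all(dim=-1)   # (B, T) mask of positions whose output changed
      print(changed)                                  # True only at [0, 3]
      ```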

Parallelization: By removing sequential dependencies, Transformers could be trained much more efficiently on parallel hardware like GPUs.

  • how exactly is parallelization achieved?
  • This is an important consideration when it comes to scaling laws. Scaling laws might apply to all sorts of architectures, but the less efficient training of non-transformer architectures might be the primary factor preventing them from achieving the success transformers have.
    • TODO: how do transformers and RNNs compare when it comes to training?
