One of the first objectives for subnet 29 was to see whether sample packing during training makes a difference for the achievable loss.

TL;DR

Sample packing makes a significant difference. The per-token loss of a trained model is generally higher when training on packed samples and evaluating on single samples. In other words, not packing samples leads to lower losses and better models.

What is sample packing?

In pretraining, samples are often packed, meaning that samples are first tokenized, then concatenated with the EOS token as separator, and cut into pieces of a fixed number of tokens. EOS is the token that indicates “End Of Stream,” that is, it signals that this is a good place to end the generated text.
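To make this concrete, here is a minimal sketch of such a packing routine, assuming a Hugging Face tokenizer. The function name, tokenizer and sequence length are illustrative, not the exact pipeline used on SN9/SN29.

```python
# Minimal sketch of sample packing (illustrative, not the exact SN9/SN29 pipeline).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer with an EOS token

def pack_samples(texts, seq_len=2048):
    """Tokenize texts, join them with EOS, and cut the stream into fixed-length chunks."""
    stream = []
    for text in texts:
        stream.extend(tokenizer.encode(text))
        stream.append(tokenizer.eos_token_id)  # EOS separates consecutive samples
    # Cut the concatenated stream into pieces of exactly seq_len tokens;
    # the last, incomplete piece is dropped here (other pipelines may pad it instead).
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]

chunks = pack_samples(["First document ...", "Second document ...", "Third document ..."])
```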

What’s the deal here?

Although training on fixed-length strings of tokens containing multiple samples may be efficient by some metrics, it doesn’t necessarily result in the lowest loss per sample. Our idea was that training per sample should result in a model that is better aligned to the dataset, i.e. a model producing lower losses. This turned out to be true, but only if the model is evaluated per sample.

Wait… I’m confused. Elaborate.

If a model is evaluated on packed samples of length N, then training on packed samples of length N is the best guarantee to capture any oddity arising from that sample packing. One particular oddity stands out: the first token after EOS plays absolutely no role in the regular workings of an LLM. But when packing, on average, 6 samples into one combined sample, there are 5 places where the first token of a sample is predicted as the next token after EOS. Normally, the first token of a sample is not even predicted and no loss is assigned to it. Another oddity introduced by (some forms of) sample packing is that the samples at the start and end of the packed string are cut somewhere in the middle.
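The sketch below, with made-up token ids, shows which positions receive a next-token loss in the two setups; -100 is the conventional “ignore” label in PyTorch/transformers loss functions.

```python
# Illustrative sketch: which positions receive a next-token loss in packed vs per-sample training.
# Token ids are made up; -100 marks positions that are ignored by the loss.
EOS = 0

packed = [5, 6, 7, EOS, 8, 9, EOS, 10, 11]       # a few samples packed into one sequence
per_sample = [[5, 6, 7, EOS], [8, 9, EOS]]       # the same samples, one sequence each

# Standard causal LM: position t is trained to predict token t+1, so conceptually
# label[t] = input[t + 1] and only the final position has no target.
packed_labels = packed[1:] + [-100]
# Note that the positions holding EOS are trained to predict the *first token of the
# next sample* (8 after the first EOS, 10 after the second) - a prediction that never
# matters at inference time.

# Per-sample training: the first token of each sample is never a prediction target at all.
per_sample_labels = [s[1:] + [-100] for s in per_sample]
```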

But still, why does this matter?

The capacity of a model to predict the next token correctly is fundamentally limited, as the model size is limited. If some fraction of the model is optimized to predict tokens that don’t occur in practice, that fraction is effectively wasted, and the capacity that remains for useful predictions is smaller.

How much does it actually matter, quantitatively?

The initial parameters of SN29 are identical to SN9, except for the lack of sample packing. As a first step after launching our subnet, we uploaded the SN9 top model as a starting point and spent a few days on 2×Ada6000 GPUs training that model on non-packed samples. The loss dropped by 0.2%, which is a lot for such a short training run. This is an indication that model performance indeed suffers from packing.

Can’t this be fixed in a simpler way?

A more elegant way to solve this issue is to maintain sample packing during training, reset the internal model state at each EOS, and mask the first token after EOS. This would require modifications to model code that is part of the transformers library. We see no immediate reason why this should not work, but we also see no reason to put in the time and effort to implement it right now.
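As a sketch of the masking half of that idea only (the state reset at EOS would additionally require a block-diagonal attention mask or changes inside the model code), assuming the usual transformers convention where labels are aligned with input_ids and -100 is ignored by the loss; the function name is our own:

```python
import torch

def mask_first_token_after_eos(input_ids: torch.Tensor, labels: torch.Tensor, eos_id: int) -> torch.Tensor:
    """Set the label of the token immediately following each EOS to -100 (ignored by the loss).

    This only implements the masking half of the idea; resetting the model state at each
    EOS would additionally need a block-diagonal attention mask or model-code changes.
    Expects input_ids and labels of shape (batch, seq_len).
    """
    labels = labels.clone()
    is_eos = input_ids == eos_id
    # Mark positions directly after an EOS within the same packed sequence.
    after_eos = torch.zeros_like(is_eos)
    after_eos[:, 1:] = is_eos[:, :-1]
    labels[after_eos] = -100
    return labels
```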

Is there more capacity wasted in these models?

Certainly. The first and last steps in an LLM handle the mapping between text and tokens. Tokens can be anything, ranging from single characters, to parts of words, to complete words, to combinations of words, whitespace and punctuation marks. The tokenizer is the component responsible for this translation, using the vocabulary (often referred to as the vocab). In Llama models these first and last steps are called embed_tokens and lm_head respectively, and for a 6.9 billion parameter model with a 100k vocabulary, they take up roughly 10% of the parameters. If you consider that a significant fraction of tokens never occurs in the dataset, perhaps 20% of embed_tokens and lm_head, so about 2% of the model parameters, could be eliminated.
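A back-of-the-envelope check of that 10% figure; the hidden size below is an assumption for illustration, not the exact SN29 model config:

```python
# Back-of-the-envelope estimate of the embed_tokens + lm_head share of parameters.
# The hidden size is an assumption for illustration, not the exact SN29 config.
vocab_size = 100_000
hidden_size = 3_584            # assumed; ~7B Llama-style models typically sit around 3.5k-4.1k
total_params = 6.9e9

embed_and_head = 2 * vocab_size * hidden_size   # untied input embedding + output head
share = embed_and_head / total_params
print(f"{share:.1%} of parameters")             # roughly 10% under these assumptions
```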

Any plans to do something about this?

Certainly. Freeing up model space to put parameters to better use, for example by trading tokenizer parameters for larger MLP intermediate size, or more layers, should allow the model to be trained to a lower loss. However, this would require allowing miners to choose (or design) their own vocabulary. Community feedback indicated that comparing losses between tokenizers is not necessarily fair. This is a topic we would like to research.

And how will you research freeing the tokenizer?

We start by having models with different architectures trained on the same dataset. The initial competition (c00) requires a Llama model with the Xenova/gpt-4 tokenizer. The second competition (c01) requires Phi or Phi3 models, with any tokenizer. This is the first step towards having different architectures and tokenizers trained side by side, allowing comparisons.

But the Phi/Phi3 model of c01 is 3.9B and the Llama model of c00 is 6.9B?

That’s correct. To compare Phi and Llama fairly, the size of the Phi model will be steadily increased from 3.9B to 6.9B. This is another topic of research: what is the most effective way to increase the number of parameters of a model?

Is there even more capacity wasted in these models?

Possibly. In the first model we published on SN29, we identified some strange properties of the MLP block of layer 1: two gates that seem to dominate the behavior of the block. Zeroing these gates leads to a 200% increase in loss, while zeroing other gates of the same MLP block has negligible impact. This indicates that some 40M of the 6.9B parameters are not effectively used. In Llama there is no way to cut these parameters out and repurpose them, so any solution must be found in some form of training. This will be the topic of further research and blog posts.
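For illustration, a sketch of this kind of gate-zeroing probe on a Llama-style model; the checkpoint name and gate indices are placeholders, not the actual SN29 model or the two dominant gates we found:

```python
# Sketch of a gate-zeroing probe. The model name and gate indices are placeholders,
# not the actual SN29 checkpoint or the specific dominant gates described above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

layer = model.model.layers[1]          # "layer 1" in the text
gate_indices = [123, 456]              # placeholder indices of the gates to zero

with torch.no_grad():
    # Zeroing a row of gate_proj zeroes that intermediate channel entirely,
    # because the Llama MLP computes down_proj(silu(gate_proj(x)) * up_proj(x)).
    layer.mlp.gate_proj.weight[gate_indices, :] = 0.0

# ...then re-run the usual per-sample loss evaluation and compare with the baseline.
```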
