On training (part 1)

One of the first objectives for subnet 29 was to see whether sample packing during training makes a difference to the achievable loss. TL;DR: Sample packing makes a significant difference. The per-token loss of a trained model is generally higher when training on packed samples and evaluating on single samples. Or, …
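
To make the distinction concrete, here is a minimal sketch of the two ways of building training sequences. It assumes the common form of packing, where tokenized samples are concatenated with an end-of-sequence separator and cut into fixed-length chunks; the post does not say how subnet 29 packs its samples, so the function names, the `EOS` and `PAD` token ids, and the toy data are all hypothetical.

```python
from typing import List

EOS = 0   # hypothetical end-of-sequence token id
PAD = -1  # hypothetical padding id (would be masked out of the loss)


def pack_samples(samples: List[List[int]], seq_len: int) -> List[List[int]]:
    """Concatenate samples separated by EOS, then cut into seq_len chunks."""
    stream: List[int] = []
    for sample in samples:
        stream.extend(sample)
        stream.append(EOS)
    # Keep only full chunks so every packed sequence has the same length.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def single_samples(samples: List[List[int]], seq_len: int) -> List[List[int]]:
    """Put one sample per sequence, truncated or padded to seq_len."""
    rows = []
    for sample in samples:
        row = (sample + [EOS])[:seq_len]
        rows.append(row + [PAD] * (seq_len - len(row)))
    return rows


samples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_samples(samples, seq_len=4)
single = single_samples(samples, seq_len=4)
print(packed)  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
print(single)  # [[5, 6, 7, 0], [8, 9, 0, -1], [10, 11, 12, 13]]
```

In the packed output, the second sequence ends with the first token of the third sample and the third sequence starts mid-sample. A model trained this way mostly sees sequences that begin at an arbitrary position, whereas single-sample evaluation always starts at a sample boundary.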