TST Cuts Pre-training Cost by Roughly 60%
Nous Research has unveiled Token Superposition Training (TST), a method that cuts pre-training cost by roughly 60% without changing the model architecture, a result that could reshape the economics of large-scale AI training. In a 10B-A1B mixture-of-experts experiment, TST used just 38.7% of the comparison run's GPU hours (4,768 B200-hours versus 12,311), a saving of about 61%, while reaching lower loss and better downstream performance.
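The percentages follow directly from the two GPU-hour figures quoted above. A minimal check of the arithmetic, where the variable names are illustrative and only the two hour counts come from the article:

```python
# GPU-hour figures quoted in the article (B200-hours).
tst_hours = 4768
baseline_hours = 12311

# Fraction of the comparison run's compute that TST used.
ratio = tst_hours / baseline_hours
# Fraction of compute saved relative to the comparison run.
saving = 1 - ratio

print(f"TST used {ratio:.1%} of the comparison run's GPU hours")
print(f"Compute saved: {saving:.1%}")
```

The saving works out to a little over 61%, which the headline rounds to "roughly 60%".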
TST splits the pre-training process into two distinct phases: a superposition phase that aggregates consecutive tokens for coarse-grained learning, followed by a recovery phase that reverts to standard next-token prediction. This curriculum redesign allows the model to build foundational representations more efficiently early on, reducing the computational burden of processing every token individually from the start.
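The two-phase curriculum can be illustrated with a small sketch. The article does not say how consecutive tokens are aggregated or where the phase boundary sits, so everything here is an assumption made for illustration: the averaging of embeddings, the function names `superpose` and `group_size_for_step`, and the parameters `group_size` and `superposition_steps` are hypothetical and not taken from Nous Research's implementation.

```python
from typing import List

Vector = List[float]


def superpose(embeddings: List[Vector], group_size: int) -> List[Vector]:
    """Merge each run of `group_size` consecutive token embeddings into one vector.

    ASSUMPTION: aggregation is a plain element-wise mean. The article only
    says that consecutive tokens are "aggregated".
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    merged = []
    for start in range(0, len(embeddings), group_size):
        group = embeddings[start:start + group_size]  # last group may be shorter
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return merged


def group_size_for_step(step: int, superposition_steps: int, group_size: int) -> int:
    """Return the aggregation width to use at a given training step.

    ASSUMPTION: a hard switch between phases. Steps before
    `superposition_steps` use coarse groups (superposition phase); later
    steps use width 1, i.e. standard next-token prediction (recovery phase).
    """
    return group_size if step < superposition_steps else 1


# Four toy 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

coarse = superpose(tokens, group_size=2)  # superposition phase: 4 positions -> 2
fine = superpose(tokens, group_size=1)    # recovery phase: sequence unchanged
```

Under this assumption, a group size of 2 halves the number of positions the model processes during the superposition phase, which is one plausible way the early phase could reduce compute.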
Compared with DeepSeek's approach of heavy system-level optimization, TST offers a lighter path to efficiency. Rather than overhauling infrastructure or hardware, Nous Research attacks the problem at the level of the learning algorithm, redesigning the early learning curriculum to compress redundant information. The resulting method is architecture-agnostic and could make large-scale model training more accessible, especially for teams with limited compute budgets. Detailed ablation studies and scaling laws remain to be published, but the initial results suggest TST could change how pre-training efficiency is approached.