Octonion Bitnet with Fused Triton Kernels: Exploring Sparsity and Dimensional Specialization
Analysis
This post details an experiment that combines octonion algebra with BitNet-style ternary weights, implemented with a custom fused Triton kernel. The key innovation is fusing what would otherwise be several matmul kernel launches, plus the octonion head mixing, into a single kernel. Early results show rapid convergence and good generalization, with validation loss sometimes dipping below training loss. The model also exhibits a natural tendency toward high sparsity (80-90% of weights quantizing to zero) during training, which enables significant compression. Furthermore, the model appears to dedicate different octonion dimensions to different word types, suggesting the octonion structure is doing useful work. However, the author acknowledges that more extensive testing is needed to compare performance against float models or BitNet itself. Illustrative sketches of the main building blocks follow.
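The post doesn't reproduce its quantization code, but BitNet b1.58 quantizes weights with an "absmean" scheme: scale by the mean absolute weight, round, and clip into {-1, 0, +1}. A minimal sketch under that assumption (the function name and the per-tensor scaling granularity are mine):

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization in the style of BitNet b1.58.

    Returns weights in {-1, 0, +1} plus the scale needed to
    dequantize: w is approximately scale * w_ternary.
    """
    scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)   # nearest of {-1, 0, +1}
    return w_ternary, scale

# Fraction of weights that quantize to exactly zero, i.e. the sparsity
# that makes ternary models compressible.
w = torch.randn(512, 512) * 0.02
w_t, s = ternary_quantize(w)
print(f"sparsity: {(w_t == 0).float().mean():.1%}")
```

Note that random Gaussian weights only quantize to roughly a third zeros under this scheme; the 80-90% figure the post reports is something the model drifts toward during training.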
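The post doesn't spell out the head-mixing math, but octonion multiplication itself is fixed by the algebra. Rather than hand-writing the 8x8 sign table, one can use the Cayley-Dickson construction, which builds the octonion product from two quaternion halves; treating each head's 8 channels as one octonion multiplied by a learned per-head octonion is my guess at what "head mixing" means here:

```python
import torch

def quat_mul(p, q):
    """Hamilton product of quaternions stored as (..., 4) tensors (w, x, y, z)."""
    pw, px, py, pz = p.unbind(-1)
    qw, qx, qy, qz = q.unbind(-1)
    return torch.stack([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ], dim=-1)

def quat_conj(q):
    w, x, y, z = q.unbind(-1)
    return torch.stack([w, -x, -y, -z], dim=-1)

def octonion_mul(o1, o2):
    """Octonion product via Cayley-Dickson: with quaternion halves
    (a, b) and (c, d), the product is (a*c - conj(d)*b, d*a + b*conj(c))."""
    a, b = o1[..., :4], o1[..., 4:]
    c, d = o2[..., :4], o2[..., 4:]
    real = quat_mul(a, c) - quat_mul(quat_conj(d), b)
    imag = quat_mul(d, a) + quat_mul(b, quat_conj(c))
    return torch.cat([real, imag], dim=-1)

# Hypothetical head mixing: each head's 8 channels form one octonion,
# mixed by a learned per-head octonion (broadcast over the batch).
x = torch.randn(2, 4, 8)    # (batch, heads, 8)
mix = torch.randn(4, 8)     # assumed learned parameter, one octonion per head
print(octonion_mul(x, mix).shape)   # torch.Size([2, 4, 8])
```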
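The post's actual kernel fuses the ternary matmuls and the octonion mixing into one launch; as a simpler stand-in for that launch-fusion idea, here is a sketch of a batched Triton matmul in which the octonion component index is a grid axis, so all eight component matmuls run in a single kernel launch instead of eight. The kernel name, launcher, and block sizes are illustrative assumptions, not the post's code:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_component_matmul_kernel(
    x_ptr, w_ptr, out_ptr, M, N, K,
    stride_xb, stride_xm, stride_xk,
    stride_wb, stride_wk, stride_wn,
    stride_ob, stride_om, stride_on,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # One program per (component, M-tile, N-tile): the eight octonion
    # components become grid axis 0 instead of eight separate launches.
    pid_b = tl.program_id(0)
    pid_m = tl.program_id(1)
    pid_n = tl.program_id(2)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    x_ptrs = x_ptr + pid_b * stride_xb + offs_m[:, None] * stride_xm + offs_k[None, :] * stride_xk
    w_ptrs = w_ptr + pid_b * stride_wb + offs_k[:, None] * stride_wk + offs_n[None, :] * stride_wn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        x = tl.load(x_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        w = tl.load(w_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(x, w)
        x_ptrs += BLOCK_K * stride_xk
        w_ptrs += BLOCK_K * stride_wk

    out_ptrs = out_ptr + pid_b * stride_ob + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
    tl.store(out_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def fused_component_matmul(x, w):
    """x: (8, M, K) activations, w: (8, K, N) weights, one pair per component."""
    B, M, K = x.shape
    _, _, N = w.shape
    out = torch.empty((B, M, N), device=x.device, dtype=torch.float32)
    grid = (B, triton.cdiv(M, 64), triton.cdiv(N, 64))
    fused_component_matmul_kernel[grid](
        x, w, out, M, N, K,
        *x.stride(), *w.stride(), *out.stride(),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return out
```

In the post's setup, the ternary dequantization and the octonion sign pattern would presumably live inside this same kernel body, which is where the further launch savings come from.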
Key Takeaways
“Model converges quickly, but hard to tell if [it] would be competitive with float models or BitNet itself, since most of my toy models have only been trained for <1 epoch on the datasets using consumer hardware.”