If the technique scales, it could reduce the environmental impact and operational costs of AI systems.
Matrix multiplication (often abbreviated to "MatMul") is at the centre of most neural network computational tasks today. GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations in parallel.
In their new paper with the racy title "Scalable MatMul-free Language Modeling," the researchers describe creating a custom 2.7 billion parameter model without using MatMul that performs similarly to conventional large language models (LLMs).
They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU's power draw).
They write that a more efficient FPGA "paves the way for the development of more efficient and hardware-friendly architectures."
The paper doesn't provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a traditional model. However, in our experience, you can run a 2.7B parameter version of Llama 2 competently on a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply.
So if you could theoretically run an LLM entirely on an FPGA in only 13 watts (without a GPU), that would be roughly a 38-fold decrease in power usage compared with that 500-watt power supply.
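The arithmetic behind that comparison is easy to check. The short Python sketch below uses only the wattage and throughput figures quoted in this article; the function and variable names are ours, chosen for illustration. It shows that the 38-fold figure comes from dividing the 500-watt power supply rating by the FPGA's 13 watts, and that other baselines give noticeably different ratios.

```python
# Figures quoted in the article (all in watts unless noted).
FPGA_WATTS = 13             # custom-programmed FPGA, excluding the GPU
POWER_SUPPLY_WATTS = 500    # home PC power supply rating
RTX_3060_PEAK_WATTS = 200   # approximate peak draw of the GPU alone
UCSC_ESTIMATE_WATTS = 700   # UC Santa Cruz estimate for a traditional model
TOKENS_PER_SECOND = 23.8    # reported for the 1.3 billion parameter demo


def fold_reduction(baseline_watts, new_watts=FPGA_WATTS):
    """Return how many times smaller new_watts is than baseline_watts."""
    return baseline_watts / new_watts


baselines = [
    ("500 W power supply", POWER_SUPPLY_WATTS),
    ("RTX 3060 at peak", RTX_3060_PEAK_WATTS),
    ("UC Santa Cruz estimate", UCSC_ESTIMATE_WATTS),
]
for label, watts in baselines:
    print(f"{label}: about {fold_reduction(watts):.1f}x")

# Rough energy per token for the FPGA's share only. The demo also used a
# GPU whose power draw is not included, so this is a lower bound.
joules_per_token = FPGA_WATTS / TOKENS_PER_SECOND
print(f"FPGA energy per token: about {joules_per_token:.2f} J")
```

Note that a power supply's rating is a ceiling rather than a measured draw, so the 38-fold figure is best read as an upper-end estimate.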
The technique has not yet been peer-reviewed, but the researchers -- Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian -- claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models.
They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones.
The researchers say that scaling laws observed in their experiments suggest that the MatMul-free LM may also outperform traditional LLMs at very large scales.
The researchers project that their approach could theoretically intersect with and surpass the performance of standard LLMs at around 10^23 FLOPs of training compute (a total count of floating-point operations, not a per-second rate), which is roughly equivalent to the training compute required for models like Meta's Llama-3 8B or Llama-2 70B.
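For a sense of scale, training compute is often approximated as six times the parameter count times the number of training tokens. The sketch below applies that rule of thumb, which is our assumption and not a calculation taken from the paper, to Meta's publicly stated training set sizes of about 15 trillion tokens for Llama 3 and about 2 trillion for Llama 2. Those token counts are also not from the paper, and the function name is ours.

```python
def training_flops(params, tokens):
    """Approximate total training compute as 6 * parameters * tokens.

    This is a common rule of thumb, assumed here for illustration only.
    """
    return 6 * params * tokens


# (parameter count, approximate training tokens) per Meta's public figures.
models = {
    "Llama-3 8B": (8e9, 15e12),
    "Llama-2 70B": (70e9, 2e12),
}

for name, (params, tokens) in models.items():
    print(f"{name}: about {training_flops(params, tokens):.1e} FLOPs")
```

Under these assumptions, both models land within the same order of magnitude as the 10^23 figure the researchers cite.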
However, the authors note that their work has limitations. Due to computational constraints, the MatMul-free LM has not been tested on extremely large-scale models (e.g., 100 billion-plus parameters).
They call for institutions with larger resources to invest in scaling up and further developing this lightweight approach to language modelling.