A new architecture has been developed to address the weaknesses of the ‘transformer’ architecture, whose inference slows down and whose memory and power consumption balloon as the input data grows. It is said to compensate for the shortcomings of the ‘attention mechanism’, though it is pointed out that verification is still needed.
TechCrunch reported on the 17th (local time) that researchers at Stanford University, UC San Diego, UC Berkeley, and Meta have developed a ‘TTT (Test-Time Training)’ architecture that can process more data at a lower cost than a transformer, and have published their paper on arXiv.
Transformer architectures used in LLMs such as ‘ChatGPT’ and ‘Gemini’ have the drawback that the required memory and computation time grow quadratically, not just linearly, as the context window grows.
For example, if you increase the input size from 1,000 tokens to 2,000 tokens, the memory and computation time required to process the input increase fourfold, not just twofold. This is because the attention mechanism compares every token in the text against every other token to uncover their correlations.
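For a rough sense of why doubling the input quadruples the cost, here is a minimal Python sketch (a toy illustration with an assumed embedding size and a made-up function name, not code from any of the models mentioned). The attention score matrix has one entry per pair of tokens, so its size grows with the square of the token count.

```python
# Toy illustration: self-attention builds an n-by-n score matrix,
# so doubling the token count quadruples the number of entries.
import numpy as np

def attention_score_entries(num_tokens: int, dim: int = 64) -> int:
    """Return the number of entries in the attention score matrix."""
    rng = np.random.default_rng(0)
    q = rng.standard_normal((num_tokens, dim))  # one query vector per token
    k = rng.standard_normal((num_tokens, dim))  # one key vector per token
    scores = q @ k.T                            # every token compared with every other token
    return scores.size

print(attention_score_entries(1000))  # 1,000,000 entries
print(attention_score_entries(2000))  # 4,000,000 entries -> fourfold, not twofold
```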
One of the fundamental components of the transformer’s attention mechanism is the ‘hidden state’, which is essentially a long list of data. Whenever the transformer processes something, it adds an entry to the hidden state, a kind of lookup-table memory that remembers what it has just processed. For example, if the LLM is processing a book, the hidden state stores tokens representing words or parts of words.
Hidden states are part of what makes transformers powerful, but they also limit them. For a transformer to say even a single word about a book it has just read, the model has to scan the entire lookup table, which is as computationally expensive as rereading the whole book.
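The growing lookup table can be pictured with a small hypothetical sketch (the `GrowingCache` class and its dimensions are illustrative assumptions, not an actual transformer cache): every token read adds an entry, and producing any output scans all of them.

```python
# Illustrative sketch of the growing lookup table: the cache gains one entry
# per processed token, and every lookup touches the whole cache.
import numpy as np

class GrowingCache:
    def __init__(self, dim: int = 64):
        self.dim = dim
        self.keys, self.values = [], []

    def add(self, key: np.ndarray, value: np.ndarray) -> None:
        self.keys.append(key)        # one more entry for every token read
        self.values.append(value)

    def respond(self, query: np.ndarray) -> np.ndarray:
        # Answering even one query scans every stored entry, so the cost
        # grows with everything read so far, like rereading the book.
        k, v = np.stack(self.keys), np.stack(self.values)
        weights = np.exp(k @ query / np.sqrt(self.dim))
        return (weights / weights.sum()) @ v

rng = np.random.default_rng(0)
cache = GrowingCache()
for _ in range(5000):                           # "reading" 5,000 tokens
    cache.add(rng.standard_normal(64), rng.standard_normal(64))
print(len(cache.keys))                          # 5,000 entries to scan per output
```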
TTT is a method that replaces the hidden state with a machine learning model, building a ‘model inside a model’. Because the internal model encodes the data in its weights, it does not keep growing as additional data is processed. No matter how much data a TTT model handles, the size of its inner model stays the same, and this is why the TTT model performs so well.
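The paper spells this out in detail; as a loose sketch under simplifying assumptions (a plain linear inner model, a basic reconstruction loss, and the made-up name `TTTLayerSketch`, none of which are taken from the authors’ code), the idea looks roughly like this: the hidden state is a fixed-size weight matrix that takes one gradient step for each incoming token, so memory never grows with input length.

```python
# Loose sketch of the TTT idea: the "hidden state" is the weight matrix of a
# tiny inner model that is trained on each incoming token at test time.
import numpy as np

class TTTLayerSketch:
    def __init__(self, dim: int = 64, lr: float = 0.01):
        self.W = np.zeros((dim, dim))   # fixed-size hidden state (the inner model's weights)
        self.lr = lr

    def update(self, token: np.ndarray) -> None:
        # One gradient step on the reconstruction loss 0.5 * ||W @ token - token||^2
        # (the actual paper uses a learned self-supervised objective).
        error = self.W @ token - token
        self.W -= self.lr * np.outer(error, token)

    def output(self, token: np.ndarray) -> np.ndarray:
        return self.W @ token           # read out of the compressed memory

rng = np.random.default_rng(0)
layer = TTTLayerSketch()
for _ in range(100_000):                # stream 100,000 tokens
    layer.update(rng.standard_normal(64))
print(layer.W.shape)                    # (64, 64): unchanged, unlike the growing cache
```

Unlike the cache sketch above, the memory footprint after 100,000 tokens is the same as after one.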
“TTT can say X words about a book without the computational complexity of rereading the book X times,” the researchers said.
He emphasized that groundbreaking progress is possible, especially when applied to video. “Transformer-based large-scale video models such as Sora only have a lookup table, so they are limited to processing videos of about 10 seconds,” he said. “Our ultimate goal is to develop a system that can process long videos, similar to the human visual experience.”
However, TTT has not yet been proven to be a suitable replacement for transformers. The researchers have only built two small models for research purposes, so it is difficult to compare the TTT method against larger transformers at this point.
“I think this is a really interesting innovation,” said Mike Cook, a professor in the informatics department at King’s College London. “It’s good news, but I can’t say whether it’s better than existing architectures.”
Meanwhile, new technologies aimed at addressing the weaknesses of the transformer architecture are being released one after another.
Israeli startup AI21 Labs has released an LLM called ‘Jamba’, which combines the best features of the SSM-based Mamba and transformer architectures. Google has unveiled its ‘Infini-attention’ technology, which can extend the length of the LLM context window indefinitely.
Meta also unveiled its ‘Megalodon’ LLM, which allows the context window to scale to hundreds of thousands of tokens without requiring huge amounts of memory.
In addition, startup Symbolica introduced a ‘symbolic AI’ technique that defines tasks by manipulating symbols, aiming to solve the problem of the high cost of running transformer-based LLMs.
Reporter Park Chan cpark@aitimes.com