Artificial intelligence (AI) startup Sakana AI has developed a new technique that uses the memory of large language models (LLMs) more efficiently. This could reduce the cost of building applications on top of LLMs and other transformer-based models.
Sakana AI recently published a paper on arXiv describing this work, called 'Universal Transformer Memory'. The approach imitates human memory, concentrating on retaining important information while quickly forgetting what is unnecessary.
LLMs based on the transformer architecture generate responses from the information provided in the input prompt, and the 'context window' plays a crucial role in this. The context window is the span of data the model processes at once and can be viewed as the model's working memory.
Appropriately adjusting the contents of this context window can have a large impact on the model's performance, and prompt engineering can be seen as an extension of this.
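As a rough illustration (not Sakana's code), the sketch below treats the context window as a fixed-size working memory that simply keeps the most recent tokens of a prompt; the function name and window size are placeholders.

```python
# Illustrative sketch: a context window as a fixed-size "working memory"
# that only retains the most recent tokens of a prompt.

def apply_context_window(tokens: list[str], window_size: int) -> list[str]:
    """Keep only the last `window_size` tokens, discarding older ones."""
    return tokens[-window_size:]

prompt_tokens = ["The", "user", "asked", "about", "transformer", "memory", "limits"]
visible = apply_context_window(prompt_tokens, window_size=4)
print(visible)  # ['about', 'transformer', 'memory', 'limits']
```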
Current models typically support context windows of around 128,000 tokens, and 'Gemini 1.5 Pro' supports a long context window of up to 2 million tokens. This lets users enter more information, but long prompts increase computational costs and can degrade performance.
Therefore, the core of this research is the need for optimization that removes unnecessary tokens while retaining important information.
Existing prompt optimization techniques have the drawback of consuming significant resources or requiring users to manually test various settings. In contrast, Sakana AI's Universal Transformer Memory optimizes prompts using a 'Neural Attention Memory Model (NAMM)'. A NAMM is a simple neural network that decides whether to remember or forget each token.
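Conceptually, this amounts to a small scoring network applied to every token held in memory. The toy sketch below illustrates that idea only; the features, dimensions, and threshold are assumptions, not the architecture described in the paper (which derives its inputs from the model's attention values).

```python
# Toy sketch of a token-level "remember or forget" scorer. All features,
# sizes, and the threshold are placeholder assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

class TokenMemoryScorer:
    """A tiny two-layer network mapping per-token features to a keep/forget score."""

    def __init__(self, feature_dim: int, hidden_dim: int = 16):
        self.w1 = rng.normal(scale=0.1, size=(feature_dim, hidden_dim))
        self.w2 = rng.normal(scale=0.1, size=(hidden_dim, 1))

    def scores(self, token_features: np.ndarray) -> np.ndarray:
        hidden = np.tanh(token_features @ self.w1)
        return (hidden @ self.w2).squeeze(-1)

    def keep_mask(self, token_features: np.ndarray, threshold: float = 0.0) -> np.ndarray:
        # Tokens scoring above the threshold are remembered; the rest are forgotten.
        return self.scores(token_features) > threshold

# Example: 10 tokens, each described by an 8-dimensional feature vector.
features = rng.normal(size=(10, 8))
scorer = TokenMemoryScorer(feature_dim=8)
print(scorer.keep_mask(features))
```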
"This technique allows the transformer to discard unnecessary or redundant details and concentrate on the most important information," the researchers said. "This is critical for inference tasks that require long context."
NAMMs are trained separately from the LLM and combined with a pre-trained model at inference time, which makes them flexible and easy to deploy. However, because the technique needs access to the activations inside the model, it can only be applied to open-source models.
NAMMs are trained through an evolutionary algorithm, which iteratively mutates candidate models and selects the best performers to optimize efficiency and performance. This matters because the decision to remember or discard a token is a binary, non-differentiable one, so it cannot be learned through ordinary gradient-based training.
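A minimal sketch of such an evolutionary loop is shown below. The fitness function, population size, and mutation scale are stand-ins; in practice the fitness would be the downstream performance of the LLM when the candidate memory model decides which tokens to keep.

```python
# Minimal sketch of evolutionary training: mutate a population of parameter
# vectors, score each candidate with a fitness function, keep the best one.
import numpy as np

rng = np.random.default_rng(0)

def fitness(params: np.ndarray) -> float:
    # Placeholder objective; a real setup would measure task performance
    # of the LLM using the candidate token-memory model.
    return -float(np.sum((params - 0.5) ** 2))

def evolve(num_params: int = 32, population: int = 20, generations: int = 50) -> np.ndarray:
    best = rng.normal(size=num_params)
    for _ in range(generations):
        candidates = best + 0.1 * rng.normal(size=(population, num_params))
        scores = np.array([fitness(c) for c in candidates])
        best = candidates[int(np.argmax(scores))]  # select the top performer
    return best

print(f"best fitness: {fitness(evolve()):.4f}")
```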
NAMMs also operate on the 'attention layers', a key element of the transformer architecture. These layers evaluate the relevance and importance of each token, and the NAMM uses that information to decide which tokens to keep and which to discard. For this reason, a trained NAMM can be applied directly to other models without additional modification.
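In practice, forgetting a token means evicting its entry from the transformer's key-value (KV) cache so that later attention steps no longer process it. The sketch below, with made-up shapes and scores, shows only that bookkeeping step.

```python
# Rough sketch (assumed shapes, not the paper's implementation) of pruning a
# KV cache: tokens whose memory score falls below a threshold have their
# cached key/value vectors dropped, shrinking memory for subsequent steps.
import numpy as np

rng = np.random.default_rng(0)

seq_len, head_dim = 12, 4
keys = rng.normal(size=(seq_len, head_dim))    # cached keys, one row per token
values = rng.normal(size=(seq_len, head_dim))  # cached values, one row per token
scores = rng.normal(size=seq_len)              # per-token keep/forget scores

keep = scores > 0.0                            # remember tokens above the threshold
pruned_keys, pruned_values = keys[keep], values[keep]

saved = 1.0 - keep.mean()
print(f"kept {keep.sum()} of {seq_len} tokens, cache reduced by {saved:.0%}")
```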
The benchmark results are also impressive. Training a NAMM on top of the 'Llama 3-8B' model improved performance on natural-language and coding tasks over long sequences.
In particular, by removing unnecessary tokens, up to 75% of cache memory could be saved while the model runs.
"It is interesting that NAMMs automatically adjust their behavior depending on the task," the researchers said. In coding tasks, for example, they discard chunks of tokens that do not affect execution, such as comments and whitespace. In natural-language tasks, on the other hand, they remove grammatically redundant tokens that do not affect the meaning of the sequence.
Sakana emphasized, "This research not only improves both the performance and efficiency of pre-trained transformer models, but is also a technology that can be applied to various foundation models without retraining." The code for creating NAMMs has been released on GitHub.
Meanwhile, Sakana is a startup founded in Japan by former Google researchers, including Llion Jones, a co-author of the Transformer paper, and David Ha, who developed the initial concept of the world model. This research was carried out with supercomputing support from the Japanese Ministry of Economy, Trade and Industry.
Reporter Park Chan cpark@aitimes.com