How To Scale Transformers’ Memory up to 262K Tokens With a Minor Change?
What’s the difficulty?
What’s the answer?
What’s the result?
The KNN lookup in a single scheme:
Experiment and Results

Extending Transformers by memorizing up to 262K tokens

This article covers an excellent attempt to let Transformer language models memorize information with the least possible effort. The point is that the technique can be applied to readily available pre-trained models.

Three important questions you should know the answers to:

We have all heard a lot about language models lately, but we usually take [large] pre-trained models and then fine-tune them; otherwise, we have to train models on large datasets to get good generalization. The difficulty is that the attention context is limited, so the model quickly forgets what it saw earlier in a long document.

Retraining or fine-tuning a model to specialize in a particular subject also causes it to forget some of the previously gained knowledge. To address this, Google researchers describe an approximate kNN lookup into an external memory of (key, value) pairs from previously seen subsequences (the memory itself is non-differentiable; note that only the query side learns how to use it).

To prove the point, they used datasets from several domains, including generic web text (C4), math papers (arXiv), books (PG-19), code (GitHub), and formal theorems (Isabelle). Finally, the results are significant: the memory can be increased up to 262K tokens with only a minor change in the code.

Okay! Now let’s look at the solution in detail 😉

Figure 1. Extending Transformers with access to (key, value) pairs of previously seen subsequences. source

Procedure

The procedure is 95 percent the same as in the popular attention-based paper “Attention Is All You Need!”. First, the sentences are tokenized; then an embedding layer transforms the tokens into dense vectors. The embeddings go through a stack of layers, where self-attention is computed between tokens, followed by an FFN (Feed-Forward Network). Finally, we use the token embeddings of the last layer to predict the next token, and so on.
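As a quick refresher, here is a minimal NumPy sketch of that forward pass, under toy assumptions (one attention head, random weights, no residual connections or layer norm; every name here is made up for illustration, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab, d = 100, 16                          # toy vocabulary size and model width
E = rng.normal(size=(vocab, d))             # token embedding table
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
W_out = rng.normal(size=(d, vocab))         # projection to next-token logits

tokens = rng.integers(0, vocab, size=10)    # a toy "sentence" of token ids
x = E[tokens]                               # (seq, d) embeddings

# Causal self-attention: each token attends to itself and earlier tokens only.
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
scores = np.where(np.tril(np.ones_like(scores, dtype=bool)), scores, -1e9)
x = softmax(scores) @ v

# Feed-forward network, then project to logits and predict the next token.
x = np.maximum(x @ W1, 0) @ W2
next_token = (x @ W_out)[-1].argmax()
```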

Q) What if we want to train on long documents?
A) We divide them into subsequences of 512 tokens, where each subsequence is one input. As you can see in Figure 2:

Figure 2. The data pipeline splits documents into subsequences and packs subsequences into batches. source

As the figure shows, each batch position carries the subsequences of a single document, in order.
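A minimal sketch of that splitting and packing, assuming the documents are already tokenized (the helper names and the padding id are my own, for illustration):

```python
SUBSEQ_LEN = 512           # subsequence length used in the article
PAD_ID = 0                 # hypothetical padding token id

def split_document(token_ids, subseq_len=SUBSEQ_LEN, pad_id=PAD_ID):
    """Split one document's token ids into fixed-length subsequences,
    padding the last one so every chunk has the same length."""
    chunks = [token_ids[i:i + subseq_len] for i in range(0, len(token_ids), subseq_len)]
    chunks[-1] = chunks[-1] + [pad_id] * (subseq_len - len(chunks[-1]))
    return chunks

# Two toy documents of different lengths (pretend these are token ids).
docs = [list(range(1300)), list(range(700))]
split_docs = [split_document(doc) for doc in docs]

# Pack into batches: batch position b always carries document b, so every
# document is fed to the model in order, one subsequence per training step.
num_steps = max(len(s) for s in split_docs)
for step in range(num_steps):
    batch = [s[step] if step < len(s) else [PAD_ID] * SUBSEQ_LEN for s in split_docs]
    # model_step(batch) would run here; batch has shape (num_docs, 512)
```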

Note. Here, instead of shuffling the subsequences, we feed them to the model sequentially, document by document. Also, the same style of cache as in Transformer-XL is used. What’s that? We keep the keys and values of the previous subsequence (without backpropagating gradients into them) and prepend them to the keys and values of the current subsequence, so local attention effectively sees a sliding window that also contains the previous subsequence.
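Here is a rough NumPy sketch of that XL-style cache, again under toy assumptions (one head, random tensors, causal masking omitted; in a real framework you would also stop gradients through the cached tensors):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, subseq = 16, 8
cache_k = cache_v = None          # XL-style cache from the previous subsequence

for step in range(3):             # three consecutive subsequences of one document
    q = rng.normal(size=(subseq, d))
    k = rng.normal(size=(subseq, d))
    v = rng.normal(size=(subseq, d))

    # Prepend the cached keys/values so the current tokens can also attend to
    # the previous subsequence. The cache is treated as a constant here.
    k_all = k if cache_k is None else np.concatenate([cache_k, k])
    v_all = v if cache_v is None else np.concatenate([cache_v, v])

    out = softmax(q @ k_all.T / np.sqrt(d)) @ v_all   # (subseq, d)

    cache_k, cache_v = k, v       # current subsequence becomes next step's cache
```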

KNN-Augmented Attention Layer

We don’t insert this layer into all of the attention layers, only into a single layer near the top of the stack. Moreover, it performs an approximate kNN lookup into the external memory while still using standard, local self-attention over the current context.

This layer uses the same queries for the kNN lookup as it does for local attention. After a subsequence is processed, we add its (key, value) pairs to the end of the memory. For each document, the model therefore builds up a memory of everything it has seen so far in that document; once the memory is full, the oldest (key, value) pairs are dropped, and the memory is cleared when a new document begins.

OK, so what is its output? The output is a set of retrieved memories: the top-k (key, value) pairs returned by the kNN search for each query.
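To make this concrete, here is a toy sketch of such a memory. The class and method names are hypothetical, and an exact brute-force search stands in for the paper’s approximate kNN:

```python
import numpy as np

class KNNMemory:
    """Toy external memory of (key, value) pairs for one document."""

    def __init__(self, max_size, dim):
        self.max_size = max_size
        self.keys = np.zeros((0, dim))
        self.values = np.zeros((0, dim))

    def add(self, k, v):
        # Append this subsequence's (key, value) pairs; drop the oldest
        # entries once the memory exceeds its maximum size.
        self.keys = np.concatenate([self.keys, k])[-self.max_size:]
        self.values = np.concatenate([self.values, v])[-self.max_size:]

    def clear(self):
        # Called at document boundaries so documents don't share memories.
        self.keys, self.values = self.keys[:0], self.values[:0]

    def topk(self, queries, k=3):
        """Exact kNN by dot product: returns the top-k (key, value) pairs
        for every query, each with shape (num_queries, k, dim)."""
        scores = queries @ self.keys.T                 # (num_queries, mem_size)
        idx = np.argsort(-scores, axis=-1)[:, :k]      # best k entries per query
        return self.keys[idx], self.values[idx]

rng = np.random.default_rng(0)
mem = KNNMemory(max_size=1024, dim=16)
mem.add(rng.normal(size=(512, 16)), rng.normal(size=(512, 16)))
ret_k, ret_v = mem.topk(rng.normal(size=(512, 16)), k=3)   # (512, 3, 16) each
```

In the actual model, such a buffer would presumably hold per-head keys and values for each document in the batch; this sketch only shows the mechanics for a single head.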

Q) What’s the procedure?

A) The same as the main attention in the popular paper: an attention matrix between the queries and the retrieved keys is built, then a softmax is applied, and a weighted sum of the retrieved values is calculated. A second, standard attention is computed over the local context, and the two results are combined with a learned gate.

If we want to take a look at the formulation:

$$V_a = V_m \odot g + V_c \odot (1 - g), \qquad g = \sigma(b_g)$$

Here $V_c$ is the result of attending to the local context, $V_m$ is the result of attending to the retrieved memories, and $b_g$ is a learned, per-head gate bias.

In this way, each head learns for itself how much to attend to the external memory versus the local context.
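A small NumPy sketch of that gated combination (shapes and names are illustrative; the projections, masking, and other details of a real implementation are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
heads, seq, k, d = 4, 8, 3, 16

q     = rng.normal(size=(heads, seq, d))       # queries of the current subsequence
loc_k = rng.normal(size=(heads, seq, d))       # local keys / values
loc_v = rng.normal(size=(heads, seq, d))
mem_k = rng.normal(size=(heads, seq, k, d))    # top-k keys / values retrieved per query
mem_v = rng.normal(size=(heads, seq, k, d))
b_gate = rng.normal(size=(heads, 1, 1))        # learned per-head gate bias

# Standard local attention over the current context (causal mask omitted).
V_c = softmax(q @ loc_k.transpose(0, 2, 1) / np.sqrt(d)) @ loc_v

# Attention over the retrieved memories: each query attends only to its own top-k.
mem_scores = np.einsum('hqd,hqkd->hqk', q, mem_k) / np.sqrt(d)
V_m = np.einsum('hqk,hqkd->hqd', softmax(mem_scores), mem_v)

# Learned gate combines the two, per head: V_a = V_m * g + V_c * (1 - g).
g = sigmoid(b_gate)
V_a = V_m * g + V_c * (1.0 - g)
```

Because the gate is just one learned scalar bias per head, the extra parameter count is negligible, which is part of why the code change is so small.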

What about Distributional Shift?

Because the keys and values stored in memory were computed with older model parameters than the current queries, there is a distributional shift in the (key, value) records sitting in memory. The parameters keep changing during training, so the stored records become stale as newer and newer queries are used to look them up.

Re-encoding the old records with the current model would turn into a kind of constant re-computation of the whole memory, which is too expensive. To reduce the effect of the shift, the authors normalize the keys and queries instead. It’s a small change, but it narrows the mismatch between the stale keys in memory and the fresh queries.
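The article doesn’t spell out the exact form of the normalization, so treat the following as an assumption: a simple unit-norm (L2) normalization of keys and queries along the feature dimension before they are stored or used for lookup.

```python
import numpy as np

def l2_normalize(x, eps=1e-6):
    # Put keys and queries on a common scale by unit-normalizing them along
    # the feature dimension before storing them in / matching them against memory.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
keys, queries = rng.normal(size=(512, 16)), rng.normal(size=(512, 16))
keys_n, queries_n = l2_normalize(keys), l2_normalize(queries)
```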

Approximate KNN

Why approximate rather than exact kNN? Because exact search over a memory with hundreds of thousands of (key, value) pairs is too expensive. Here they used a simple approximate kNN, but there are other options as well, such as ScaNN or Faiss (which scale to billions of entries).
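For illustration only (this is not the paper’s implementation), here is how an exact baseline and an approximate index might look with the Faiss library:

```python
import numpy as np
import faiss  # pip install faiss-cpu; used purely as an illustration

d, n, k = 64, 100_000, 32
rng = np.random.default_rng(0)
keys = rng.normal(size=(n, d)).astype('float32')
queries = rng.normal(size=(256, d)).astype('float32')

# Exact inner-product search (brute force) as a baseline.
exact = faiss.IndexFlatIP(d)
exact.add(keys)
scores, ids = exact.search(queries, k)

# An approximate index: cluster the keys, then probe a few clusters per query.
quantizer = faiss.IndexFlatIP(d)
approx = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
approx.train(keys)
approx.add(keys)
approx.nprobe = 8
scores_a, ids_a = approx.search(queries, k)
```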

Well, that covers the technical part; now let’s dive into the experiments they did.

It’s common for papers to claim that their method outperforms its predecessors or has some advantages over its counterparts. It’s the same here, but I’m not going to rewrite all of that; I’ll just present the bold results.

Effect of External Memory

Average token-level perplexities of each model when trained for 500K steps. source

As you can see, adding external memory to both the vanilla Transformer (the popular one) and Transformer-XL considerably improves perplexity. For instance, on the PG-19 dataset, adding a memory of size 8192 improves the vanilla Transformer’s perplexity from 13.71 to 12.39, and similarly for Transformer-XL.

Increasing the size of the memory increases the benefit of the memory

The best perplexity across all models and datasets is achieved with a memory size of 65K.

Is this approach scalable (from an architectural perspective)?

It seems so; they tested this by scaling the model up to sizes of 1 and 8 billion parameters.

Adding a memory of 8K tokens improves perplexity across different model sizes. source

Their result, in short: the benefit of the memory does not vanish as the model gets bigger.

Fine-tuning on a larger memory

It showed unstable training.

Fine-tuning for 20K steps to use a larger memory on the arXiv dataset. source

Fine-tuning a non-memory model to use memory

Q) Can we take a pre-trained Transformer and then fine-tune it to use external memory?

A) Yes, of course.

Fine-tuning a 1B vanilla Transformer model to use an external memory of size 65K. source

The model quickly learns to use the external memory (here the fine-tuning takes only a small fraction of the steps that were used to pre-train the model).

Which tokens benefit from memory?

Difference in loss for each token in a randomly chosen paper, using the same model once with a memory size of 8K and once with 32K. Higher numbers mean the longer memory helped compared to the shorter memory. This paper is 22K tokens long. source

We can see that the benefit is sparse: most tokens gain little from the longer memory, while a few gain a lot.

  • The main point of this approach is that you need only a minimal amount of code changes to add this external memory. That is a big plus for everyone who uses Transformers in their work.
  • The official code is publicly available.
  • The main paper: “Memorizing Transformers”.
