Large Language Models Are Memorizing the Datasets Meant to Test Them

In machine learning, a test split is used to see whether a trained model has learned to solve problems that are similar, but not identical, to the material it was trained on.

So if a new AI ‘dog-breed recognition’ model is trained on a dataset of 100,000 pictures of dogs, it will typically use an 80/20 split: 80,000 pictures supplied to train the model, and 20,000 pictures held back as material for testing the finished model.

Needless to say, if the AI’s training data inadvertently includes the ‘secret’ 20% test split, the model will ace those tests, because it already knows the answers (it has already seen 100% of the domain data). Naturally, those scores do not accurately reflect how the model will perform later, on new ‘live’ data, in a production context.
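As a minimal sketch of the principle (with hypothetical file names and standard-library Python only), a clean split can be verified by checking that no test item leaks into the training set; the trouble described below is that web-scale corpora permit no such audit:

```python
import random

# Minimal sketch of the 80/20 split described above, using hypothetical
# file names, plus a naive overlap check.
dog_images = [f"dog_{i:06d}.jpg" for i in range(100_000)]

random.seed(42)
random.shuffle(dog_images)

train_set = set(dog_images[:80_000])   # 80%: used to train the model
test_set = set(dog_images[80_000:])    # 20%: held back for evaluation

# Any held-back item that also appears in the training data is contamination:
# the model's test score then measures recall, not generalization.
leaked = train_set & test_set
print(f"Contaminated test items: {len(leaked)}")   # 0 for a clean split
```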

Movie Spoilers

The issue of AI cheating on its exams has grown in line with the size of the models themselves. Because today’s systems are trained on vast, indiscriminately web-scraped corpora such as Common Crawl, the chance that benchmark datasets (i.e., the held-back 20%) slip into the training mix is no longer an edge case but the default, a syndrome known as data contamination; and at this scale, the manual curation that might catch such errors is logistically impossible.

This problem is explored in a new paper from Italy’s Politecnico di Bari, where the researchers focus on the outsized role of a single movie recommendation dataset, MovieLens-1M, which they argue has been partially memorized by several leading AI models during training.

Because this particular dataset is so widely used in the testing of recommender systems, its presence in the models’ memory potentially makes those tests meaningless: what appears to be intelligence may in fact be simple recall, and what looks like an intuitive recommendation skill may be a statistical echo reflecting earlier exposure.

The authors state:

The brief new paper comes from six Politecnico researchers, and the pipeline to reproduce their work has been made available at GitHub.

Method

To understand whether the models in question were truly learning or simply recalling, the researchers began by defining what memorization means in this context, and then tested whether a model could retrieve specific pieces of information from the MovieLens-1M dataset when prompted in just the right way.

If a model was shown a movie’s ID number and could produce its title and genre, that counted as memorizing an item; if it could generate details about a user (such as age, occupation, or zip code) from a user ID, that also counted as user memorization; and if it could reproduce a user’s next movie rating from a known sequence of prior ones, it was taken as evidence that the model may be recalling interactions, rather than learning general patterns.

Each of these types of recall was tested using carefully written prompts, crafted to nudge the model without giving it new information. The more accurate the response, the more likely it was that the model had already encountered that data during training:

Source: https://arxiv.org/pdf/2505.10212

Data and Tests

To curate a suitable dataset, the authors surveyed recent papers from two of the field’s major conferences, ACM RecSys 2024 and ACM SIGIR 2024. MovieLens-1M appeared most frequently, cited in just over one in five submissions. Since earlier studies had reached similar conclusions, this was not a surprising result, but rather a confirmation of the dataset’s dominance.

MovieLens-1M consists of three files: movies.dat, which lists movies by ID, title, and genre; users.dat, which maps user IDs to basic biographical fields; and ratings.dat, which records who rated what, and when.
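For readers unfamiliar with the dataset, the three files are plain text with ‘::’-separated fields; a short sketch of loading them with pandas (column names per the dataset’s README, file paths assumed) looks like this:

```python
import pandas as pd

# MovieLens-1M ships as three "::"-separated text files; the ml-1m/ paths
# below are assumptions about where the archive has been unpacked.
movies = pd.read_csv("ml-1m/movies.dat", sep="::", engine="python",
                     names=["MovieID", "Title", "Genres"], encoding="latin-1")
users = pd.read_csv("ml-1m/users.dat", sep="::", engine="python",
                    names=["UserID", "Gender", "Age", "Occupation", "ZipCode"],
                    encoding="latin-1")
ratings = pd.read_csv("ml-1m/ratings.dat", sep="::", engine="python",
                      names=["UserID", "MovieID", "Rating", "Timestamp"],
                      encoding="latin-1")

print(movies.head(1))   # MovieID 1 is Toy Story (1995)
```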

To find out whether this data had been memorized by large language models, the researchers turned to prompting techniques first introduced in an earlier paper and later adapted in subsequent work.

The strategy is direct: pose a question that mirrors the dataset format and see if the model answers correctly. Several prompting styles were tested, and few-shot prompting, in which the model is shown a few examples, proved the most effective; even if more elaborate approaches might yield higher recall, this was considered sufficient to reveal what had been memorized.

Few-shot prompt used to test whether a model can reproduce specific MovieLens-1M values when queried with minimal context.
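The paper’s exact wording is not reproduced here, but a few-shot probe of this kind can be sketched as follows, with two movies.dat records as in-context examples and a target ID left for the model to complete:

```python
def build_item_probe(known_examples, target_movie_id):
    """Builds a few-shot prompt that mirrors the movies.dat record format.
    Illustrative only; not the exact prompt used in the paper."""
    lines = ["Complete the following MovieLens-1M records."]
    for movie_id, title, genres in known_examples:
        lines.append(f"{movie_id}::{title}::{genres}")
    lines.append(f"{target_movie_id}::")   # model must supply title and genres
    return "\n".join(lines)

prompt = build_item_probe(
    known_examples=[
        (1, "Toy Story (1995)", "Animation|Children's|Comedy"),
        (2, "Jumanji (1995)", "Adventure|Children's|Fantasy"),
    ],
    target_movie_id=3,
)
print(prompt)
```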

To measure memorization, the researchers defined three types of recall: item memorization, user memorization, and interaction memorization. These tests examined whether a model could retrieve a movie title from its ID, generate user details from a UserID, or predict a user’s next rating based on earlier ones. Each was scored using a coverage metric that reflected how much of the dataset could be reconstructed through prompting.
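Read simply, coverage is the share of known records that the model reproduces verbatim; a rough sketch of that idea (not the paper’s exact formulation) would be:

```python
def coverage(model_outputs, dataset_records):
    """Share of dataset entries reproduced exactly through prompting.
    A simplified reading of the coverage idea, not the paper's definition."""
    dataset = set(dataset_records)
    hits = {record for record in model_outputs if record in dataset}
    return len(hits) / len(dataset)

# Toy example: the model reproduces two of four known movies.dat records.
ground_truth = ["1::Toy Story (1995)", "2::Jumanji (1995)",
                "3::Grumpier Old Men (1995)", "4::Waiting to Exhale (1995)"]
model_outputs = ["1::Toy Story (1995)", "3::Grumpier Old Men (1995)"]
print(coverage(model_outputs, ground_truth))   # 0.5
```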

The models tested were GPT-4o; GPT-4o mini; GPT-3.5 turbo; Llama-3.3 70B; Llama-3.2 3B; Llama-3.2 1B; Llama-3.1 405B; Llama-3.1 70B; and Llama-3.1 8B. All were run with temperature set to zero, top_p set to one, and both frequency and presence penalties disabled. A fixed random seed ensured consistent output across runs.
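With the current OpenAI Python client, those decoding settings would look roughly like the call below; the model name, the reuse of the prompt built earlier, and the seed value are assumptions for illustration, not details taken from the paper:

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

# Deterministic-as-possible decoding, mirroring the settings reported above.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],   # 'prompt' from the sketch above
    temperature=0,          # greedy decoding
    top_p=1,                # no nucleus truncation
    frequency_penalty=0,    # penalties disabled
    presence_penalty=0,
    seed=42,                # fixed seed for repeatable runs
)
print(response.choices[0].message.content)
```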

Proportion of MovieLens-1M entries retrieved from movies.dat, users.dat, and ratings.dat, with models grouped by version and sorted by parameter count.

To probe how deeply MovieLens-1M had been absorbed, the researchers prompted each model for exact entries from the dataset’s three (aforementioned) files: movies.dat, users.dat, and ratings.dat.

Results from the initial tests, shown above, reveal sharp differences not only between the GPT and Llama families, but also across model sizes. While GPT-4o and GPT-3.5 turbo recover large portions of the dataset with ease, most open-source models recall only a fraction of the same material, suggesting uneven exposure to this benchmark in pretraining.

These are not small margins. Across all three files, the strongest models did not simply outperform weaker ones, but recalled substantial portions of MovieLens-1M outright.

In the case of GPT-4o, the coverage was high enough to suggest that a nontrivial share of the dataset had been directly memorized.

The authors state:

Next, the authors tested the impact of memorization on recommendation tasks by prompting each model to act as a recommender system. To benchmark performance, they compared the output against seven standard methods: UserKNN; ItemKNN; BPRMF; EASER; LightGCN; MostPop; and Random.

The MovieLens-1M dataset was split 80/20 into training and test sets, using a leave-one-out sampling technique to simulate real-world usage. The metrics used were Hit Rate (HR@n) and normalized Discounted Cumulative Gain (nDCG@n):

Recommendation accuracy on standard baselines and LLM-based methods. Models are grouped by family and ordered by parameter count. Bold values indicate the highest score within each group.
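Under leave-one-out sampling each user contributes a single held-out item, so both metrics reduce to simple per-user checks; a sketch of the standard formulation (assumed here, not taken from the paper’s code) follows:

```python
import math

def hr_and_ndcg_at_n(recommendations, held_out, n=10):
    """Leave-one-out evaluation: each user has exactly one held-out item.
    HR@n asks whether it appears in the top-n list; nDCG@n also rewards
    placing it near the top (ideal DCG is 1 for a single relevant item)."""
    hits, ndcg = 0.0, 0.0
    for user, target in held_out.items():
        top_n = recommendations[user][:n]
        if target in top_n:
            hits += 1
            rank = top_n.index(target)          # 0-based position in the list
            ndcg += 1.0 / math.log2(rank + 2)
    return hits / len(held_out), ndcg / len(held_out)

# Toy example with hypothetical movie IDs for two users.
recs = {"u1": [50, 1, 260], "u2": [2858, 1196, 593]}
held = {"u1": 1, "u2": 318}
print(hr_and_ndcg_at_n(recs, held, n=3))   # (0.5, ~0.32)
```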

Here several large language models outperformed traditional baselines across all metrics, with GPT-4o establishing a wide lead in every column, and even mid-sized models such as GPT-3.5 turbo and Llama-3.1 405B consistently surpassing baseline methods such as BPRMF and LightGCN.

Among the smaller Llama variants, performance varied sharply, but Llama-3.2 3B stands out, with the highest HR@1 in its group.

The results, the authors suggest, indicate that memorized data can translate into measurable advantages in recommender-style prompting, particularly for the strongest models.

In a further observation, the researchers continue:

Regarding the impact of model scale on this issue, the authors observed a clear correlation between size, memorization, and recommendation performance, with larger models not only retaining more of the MovieLens-1M dataset, but also performing more strongly in downstream tasks.

Llama-3.1 405B, for example, showed an average memorization rate of 12.9%, while Llama-3.1 8B retained only 5.82%. This nearly 55% reduction in recall corresponded to a 54.23% drop in nDCG and a 47.36% drop in HR across evaluation cutoffs.
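The ‘nearly 55%’ figure is simply the relative gap between the two memorization rates quoted above:

```python
# Relative reduction in recall from Llama-3.1 405B to Llama-3.1 8B.
rate_405b, rate_8b = 12.9, 5.82                      # average memorization rates (%)
print(f"{(rate_405b - rate_8b) / rate_405b:.1%}")    # 54.9%, i.e. 'nearly 55%'
```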

The pattern held throughout – where memorization decreased, so did apparent performance:

The final test examined whether memorization reflects the popularity bias baked into MovieLens-1M. Items were grouped by frequency of interaction, and the chart below shows that larger models consistently favored the most popular entries:

Item coverage by model across three popularity tiers: the top 20% most popular items; the middle 20% moderately popular items; and the bottom 20% least-interacted items.
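The exact tiering procedure is not spelled out here, but grouping items by interaction count into top, middle, and bottom 20% bands can be sketched as follows (the file path and band boundaries are assumptions):

```python
import pandas as pd

ratings = pd.read_csv("ml-1m/ratings.dat", sep="::", engine="python",
                      names=["UserID", "MovieID", "Rating", "Timestamp"],
                      encoding="latin-1")

# Rank items by how often they were rated, most popular first.
item_counts = ratings.groupby("MovieID").size().sort_values(ascending=False)
n = len(item_counts)

top_20 = set(item_counts.index[: int(n * 0.2)])                  # most popular
middle_20 = set(item_counts.index[int(n * 0.4): int(n * 0.6)])   # moderately popular
bottom_20 = set(item_counts.index[-int(n * 0.2):])               # least interacted

def tier_coverage(memorized_ids, tier):
    """Share of a popularity tier that a model reproduced via prompting."""
    return len(set(memorized_ids) & tier) / len(tier)
```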

GPT-4o retrieved 89.06% of top-ranked items, but only 63.97% of the least popular. GPT-4o mini and the smaller Llama models showed much lower coverage across all bands. The researchers state that this trend suggests that memorization not only scales with model size, but also amplifies preexisting imbalances in the training data.

They continue:

Conclusion

The dilemma is no longer novel: as training sets grow, the prospect of curating them diminishes in inverse proportion. MovieLens-1M, perhaps among many others, enters these vast corpora without oversight, anonymous amid the sheer volume of data.

The problem repeats at every scale, and resists automation. Any solution demands not just effort but human judgment: the slow, fallible kind that machines cannot supply. In this respect, the new paper offers no way forward.

 
