A prompt-based experiment to enhance both accuracy and transparent reasoning in content personalization.
At DER SPIEGEL, we’re continually exploring ways to improve how we recommend news articles to our readers. In our latest (offline) experiment, we investigated whether Large Language Models (LLMs) could effectively predict which articles a reader would be interested in, based on their reading history.
Our Approach
We conducted a study with readers who participated in a survey in which they rated their interest in various news articles. This gave us a ground truth of reader preferences. For each participant, we had two key pieces of data: their actual reading history (which articles they had read before taking the survey) and their ratings of a set of recent articles in the survey. Read more about this mixed-methods approach to offline evaluation of news recommender systems here:
We then used the Anthropic API to access Claude 3.5 Sonnet, a state-of-the-art language model, as our recommendation engine. For each reader, we provided the model with their reading history (article titles and summaries) and asked it to predict how interested they would be in the articles from the survey. Here is the prompt we used:
You are a news recommendation system. Based on the user's reading history,
predict how likely they are to read new articles. Rate each article from 0 to 1000,
where 1000 means highest likelihood to read.

Reading history (previous articles read by the user):
[List of previously read articles with titles and summaries]
Please rate the following articles (provide a rating 0-1000 for each):
[List of candidate articles to rate]
You need to respond with a JSON object in this format:
{
  "recommendations": [
    {
      "article_id": "article-id-here",
      "score": score
    }
  ]
}
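For reference, here is a minimal sketch of how such a call could look with the anthropic Python SDK; the model identifier, token limit, and response parsing are our assumptions for illustration, not the exact production code:

```python
import json
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def rate_candidates(prompt: str) -> list:
    """Send the recommendation prompt above and parse the returned JSON scores."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model identifier
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    # The prompt instructs the model to answer with a JSON object only.
    return json.loads(message.content[0].text)["recommendations"]

# Usage: fill the prompt template with the user's reading history and the
# candidate articles, then call rate_candidates(prompt) to obtain
# [{"article_id": "...", "score": 850}, ...]
```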
With this approach, we can now compare the actual ratings from the survey against the predicted ratings from the LLM. This comparison provides an excellent dataset for evaluating the language model’s ability to predict reader interests.
Results and Key Findings
The findings were impressively strong. To assess the performance, we look at two key metrics. First, Precision@5: the LLM achieved a score of 56%, which means that when the system recommended its top 5 articles for a user (out of 15), on average (almost) 3 of those 5 articles were actually among the articles that user rated highest in our survey. Looking at the distribution of these predictions reveals even more impressive results: for 24% of users, the system correctly identified at least 4 of their top 5 articles. For another 41% of users, it correctly identified 3 of their top 5 articles.
To put this in perspective: if we were to recommend articles randomly, we would only achieve 38.8% precision (see our previous Medium article for details). Even recommendations based purely on article popularity (recommending what most people read) only reach 42.1%, and our previous approach using an embedding-based technique achieved 45.4%.
The graphic below shows the uplift: while having any kind of data about the users is better than guessing (random model), the LLM-based approach shows the strongest performance. Even compared with our sophisticated embedding-based logic, the LLM achieves a significant uplift in prediction accuracy.
As a second evaluation metric, we use Spearman correlation. At 0.41, it represents a considerable improvement over our embedding-based approach (0.17). This also shows that the LLM is not just better at finding relevant articles, but also at understanding how much a reader might prefer one article over another.
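To make these two metrics concrete, here is a minimal sketch of how they can be computed per user from the predicted scores and the survey ratings (both given as article_id → score mappings; the function names are ours):

```python
from scipy.stats import spearmanr

def precision_at_5(predicted: dict, actual: dict) -> float:
    """Share of the 5 top-predicted articles that are in the user's actual top 5."""
    top_pred = sorted(predicted, key=predicted.get, reverse=True)[:5]
    top_true = sorted(actual, key=actual.get, reverse=True)[:5]
    return len(set(top_pred) & set(top_true)) / 5

def rank_correlation(predicted: dict, actual: dict) -> float:
    """Spearman correlation between predicted scores and survey ratings."""
    ids = list(actual)
    rho, _ = spearmanr([predicted[i] for i in ids], [actual[i] for i in ids])
    return rho
```

Averaging these per-user values across all participants yields the reported 56% Precision@5 and 0.41 Spearman correlation.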
Beyond Performance: The Power of Explainability
What sets LLM-based recommendations apart is not just their performance but their ability to explain their decisions in natural language. Here is an example of how our system analyzes a user’s reading patterns and explains its recommendations (prompt not shown):
User has 221 articles in reading history

Top 5 Comparison:
--------------------------------------------------------------------------------
Top 5 Predicted by Claude:
1. Wie ich mit 38 Jahren zum ersten Mal lernte, strukturiert zu arbeiten (Rating: 850, Actual Value: 253.0)
2. Warum wir den Umgang mit der Sonne neu lernen müssen (Rating: 800, Actual Value: 757.0)
3. Lohnt sich ein Speicher für Solarstrom vom Balkon? (Rating: 780, Actual Value: 586.0)
4. »Man muss sich fragen, ob dieser spezielle deutsche Weg wirklich intelligent ist« (Rating: 750, Actual Value: 797.0)
5. Wie Bayern versucht, sein Drogenproblem unsichtbar zu machen (Rating: 720, Actual Value: 766.0)
Actual Top 5 from Survey:
4. »Man muss sich fragen, ob dieser spezielle deutsche Weg wirklich intelligent ist« (Value: 797.0, Predicted Rating: 750)
5. Wie Bayern versucht, sein Drogenproblem unsichtbar zu machen (Value: 766.0, Predicted Rating: 720)
2. Warum wir den Umgang mit der Sonne neu lernen müssen (Value: 757.0, Predicted Rating: 800)
6. Abitur als Lotterie? (Value: 601.0, Predicted Rating: 650)
3. Lohnt sich ein Speicher für Solarstrom vom Balkon? (Value: 586.0, Predicted Rating: 780)
Performance Metrics:
--------------------------------------------------------------------------------
Spearman Correlation: 0.673
Precision@5: 0.800
Claude's Evaluation:
--------------------------------------------------------------------------------
1. Key patterns in user's reading history:
- Strong interest in social and political issues
- Regular engagement with lifestyle and personal development content
- Interest in health and wellness topics
- Frequent reading of articles about economic and financial matters
- Attraction to critical analysis and commentary pieces
- Interest in food and cooking content
2. Scoring explanation:
Highest scored articles:
- "Wie ich mit 38 Jahren zum ersten Mal lernte, strukturiert zu arbeiten" (850) - Aligns with the user's interest in personal development and lifestyle content
- "Warum wir den Umgang mit der Sonne neu lernen müssen" (800) - Matches interest in health and wellness topics
- "Lohnt sich ein Speicher für Solarstrom vom Balkon?" (780) - Corresponds to interest in practical advice and economic considerations
Medium scored articles:
- "Man muss sich fragen, ob dieser spezielle deutsche Weg wirklich intelligent ist" (750) - Matches pattern of interest in political commentary
- "Wie Bayern versucht, sein Drogenproblem unsichtbar zu machen" (720) - Matches interest in social issues and significant reporting
- "Abitur als Lotterie?" (650) - Aligns with interest in educational and social topics
Lower scored articles:
- "Eine Brise Formel 1" (550) - Limited sports content in reading history
- "Reizender Absatz" (450) - Less alignment with demonstrated interests
- "Hier wird jetzt auf ganz, ganz hohem Niveau gemeckert" (400) - Style and topic less aligned with user preferences
The scoring prioritizes articles that match the user's demonstrated interests in social issues, practical advice, and critical analysis, while giving lower scores to sports and lighter content that appears less frequently in their reading history.
Rather than operating as a black box, the system can articulate why it thinks a particular article might be interesting to a reader: "Because you frequently read articles about practical advice and economic matters, you might find this analysis of the cost-effectiveness of balcony solar storage particularly relevant." This kind of transparent reasoning can make recommendations feel more personal and trustworthy.
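The explanation prompt itself is not shown above, but conceptually a follow-up instruction along these lines would elicit such an analysis (a hypothetical sketch, not our production prompt):

```
Based on the user's reading history and the scores you assigned above, explain
your reasoning: summarize the key patterns in the reading history, then justify
the highest, medium, and lowest scored articles in a few bullet points each.
```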
Conclusion
While our results are promising, several challenges need to be addressed. The most significant is cost: because of the long prompts (hundreds of article summaries per user), a single recommendation run costs about $0.21 per user, and scaling this to a full readership would be prohibitively expensive. Testing high-performing open-source models could potentially reduce these costs. Moreover, the current implementation is relatively slow, taking several seconds per user. For a news platform where content updates frequently and reader interests evolve, sometimes even over the course of a single day, we would need to run these recommendations multiple times daily to stay relevant.
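A back-of-envelope calculation makes this order of magnitude plausible; the per-article token count and the per-token price below are assumptions, not measured values:

```python
# Rough input-cost estimate for one recommendation run of a heavy reader.
PRICE_PER_MTOK_INPUT = 3.00   # USD, approximate Claude 3.5 Sonnet input pricing
articles_in_history = 221     # as in the example user above
tokens_per_article = 300      # assumed tokens per title + summary

input_tokens = articles_in_history * tokens_per_article     # ~66,000 tokens
print(f"${input_tokens / 1e6 * PRICE_PER_MTOK_INPUT:.2f}")  # ~$0.20 per run
```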
Moreover, we used a single, straightforward prompt without any prompt engineering or optimization. There is likely (significant) room for improvement through systematic prompt refinement.[1] Our current implementation also uses only article titles and summaries, without leveraging available metadata. We could potentially increase performance by incorporating additional signals such as reading time per article (how long users spent reading each piece) or overall article popularity. However, due to the high API costs, running iterative evaluation pipelines is currently not an option.
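To illustrate the metadata idea: each article entry in the prompt could be extended with such signals (a sketch; the field labels and values are invented):

```
Title: Lohnt sich ein Speicher für Solarstrom vom Balkon?
Summary: [article summary]
Avg. reading time: 4 min of ~5 min article length   (engagement signal)
Popularity: among the 5% most-read articles this week   (popularity signal)
```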
All in all, the combination of strong predictive performance and natural language explanations suggests that LLMs can be a valuable tool in news recommendation systems. And beyond recommendations, they add a new way of analyzing user journeys in digital news. Their ability to process and interpret reading histories alongside metadata opens up exciting possibilities: from understanding content journeys and topic progressions to creating personalized review summaries.