Although film and TV are sometimes seen as creative and open-ended industries, they have long been risk-averse. High production costs (which may soon lose the offsetting advantage of cheaper overseas locations, at least for US projects) and a fragmented production landscape make it difficult for independent companies to absorb a big loss.
Because of this, over the past decade the industry has taken a growing interest in whether machine learning can detect trends or patterns in how audiences respond to proposed film and TV projects.
The main data sources remain the Nielsen system (which offers scale, though its roots lie in TV and advertising) and sample-based methods such as focus groups, which trade scale for curated demographics. This latter category also includes scorecard feedback from free movie previews – however, by that time, most of a production's budget is already spent.
The ‘Big Hit’ Theory/Theories
Initially, ML systems leveraged traditional analysis methods such as linear regression, K-Nearest Neighbors, Stochastic Gradient Descent, Decision Trees and Forests, and Neural Networks, usually in various combinations closer in style to pre-AI statistical analysis, such as a 2019 University of Central Florida initiative to forecast successful TV shows based on combinations of actors and writers (among other factors):
Source: https://arxiv.org/pdf/1910.12589
The most relevant related work, at least that which is deployed in the wild (though often criticized), is in the field of recommender systems:

Source: https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2023.1281614/full
However, these sorts of approaches analyze projects that are already successful. In the case of prospective new shows or movies, it is not clear what sort of ground truth would be most applicable – not least because changes in public taste, combined with improvements and augmentations of data sources, mean that decades of consistent data are usually not available.
This is an instance of the cold start problem, where recommendation systems must evaluate candidates without any prior interaction data. In such cases, traditional collaborative filtering breaks down, since it relies on patterns in user behavior (such as viewing, rating, or sharing) to generate predictions. The problem is that in the case of most new movies or shows, there is not yet enough audience feedback to support these methods.
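To make the failure mode concrete, here is a minimal sketch (with entirely hypothetical data) of why item-based collaborative filtering has nothing to work with for a brand-new title that has no interactions yet:

```python
# A minimal sketch of the cold-start failure in item-based collaborative
# filtering. The interaction matrix is hypothetical toy data.
import numpy as np

# Rows = users, columns = titles; 1 = watched. The last column is a
# newly released movie that nobody has interacted with yet.
interactions = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

def item_similarity(matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between item columns; zero columns yield zero."""
    norms = np.linalg.norm(matrix, axis=0)
    safe = np.where(norms == 0, 1.0, norms)   # avoid divide-by-zero
    unit = matrix / safe
    return unit.T @ unit

sims = item_similarity(interactions)
# Similarities of the new title to every other title are all zero,
# so a collaborative filter has no signal to rank or recommend it.
print(sims[-1])
```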
Comcast Predicts
A new paper from Comcast Technology AI, in association with George Washington University, proposes a solution to this problem by prompting a language model with metadata about unreleased movies.
The inputs consist of structured metadata fields describing each title, with the model returning a ranked list of likely future hits.
The authors use the model's output as a stand-in for audience interest when no engagement data is available, hoping to avoid early bias toward titles that are already well-known.
The very short (three-page) paper comes from six researchers at Comcast Technology AI and one from GWU, and states:
If the approach proves robust, it could reduce the industry's reliance on retrospective metrics and heavily-promoted titles by introducing a scalable way to flag promising content prior to release. Thus, rather than waiting for user behavior to signal demand, editorial teams could receive early, metadata-driven forecasts of audience interest, potentially redistributing exposure across a wider range of new releases.
Method and Data
The authors outline a four-stage workflow: the construction of a dedicated dataset from movie metadata; the establishment of a baseline model for comparison; the evaluation of apposite LLMs using both natural language reasoning and embedding-based prediction; and the optimization of outputs through prompt engineering in generative mode, using Meta's Llama 3.1 and 3.3 language models.
Since, the authors state, no publicly available dataset offered a direct way to test their hypothesis (because most existing collections predate LLMs and lack detailed metadata), they built a benchmark dataset from the Comcast entertainment platform, which serves tens of millions of users across direct and third-party interfaces.
The dataset tracks newly-released movies and whether they later became popular, with popularity defined through user interactions.
The collection focuses on movies rather than series, and the authors state:
Labels were assigned by analyzing the time it took for a title to become popular across different time windows and list sizes. The LLM was prompted with a set of structured metadata fields describing each title.
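A hedged sketch of the kind of labelling rule this describes is shown below: a new title is labelled a hit if it enters a daily top-N engagement list within a given window after release. The window length, the data structure, and the function name are illustrative assumptions, not the paper's exact procedure.

```python
# Toy labelling rule: did the title reach a daily top-N list within the window?
from datetime import date, timedelta

def is_hit(title, release_date, daily_top_lists, window_days=14):
    """True if the title appears in any daily top-N list within the window."""
    for offset in range(window_days):
        day = release_date + timedelta(days=offset)
        if title in daily_top_lists.get(day, []):
            return True
    return False

# Hypothetical engagement data: date -> that day's top-N titles.
daily_top_lists = {
    date(2024, 6, 3): ["Example Movie A", "Other Title"],
}

print(is_hit("Example Movie A", date(2024, 6, 1), daily_top_lists))  # True
print(is_hit("Example Movie B", date(2024, 6, 1), daily_top_lists))  # False
```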
For comparison, the authors used two baselines: a random ordering, and a Popular Embedding (PE) model (which we will come to shortly).
The project used large language models as the primary ranking method, generating ordered lists of movies with predicted popularity scores and accompanying justifications – and these outputs were shaped by prompt engineering strategies designed to guide the model's predictions using structured metadata.
The prompting strategy framed the model as an 'editorial assistant' tasked with identifying which upcoming movies were most likely to become popular, based solely on structured metadata, and then with reordering a fixed list of titles that introduced new items, returning the output in JSON format.
Each response consisted of a ranked list, assigned popularity scores, justifications for the rankings, and references to any prior examples that influenced the outcome. These multiple levels of metadata were intended to improve the model's contextual grasp and its ability to anticipate future audience trends.
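The sketch below illustrates the general shape of such an 'editorial assistant' prompt and the JSON response handling. The field names, example metadata, and the stand-in response are illustrative assumptions, not the paper's exact schema or prompt wording.

```python
# Hedged sketch: build a metadata-only ranking prompt and parse a JSON reply.
import json

candidates = [
    {"title": "Example Movie A", "genre": "thriller", "synopsis": "A heist goes wrong."},
    {"title": "Example Movie B", "genre": "comedy",   "synopsis": "Two rivals swap jobs."},
]

prompt = (
    "You are an editorial assistant. Using only the structured metadata below, "
    "rank these upcoming movies by how likely they are to become popular. "
    "Return JSON: a list of objects with 'title', 'score' (0-1), and 'justification'.\n\n"
    + json.dumps(candidates, indent=2)
)

# In practice this string would be sent to an LLM (e.g. a Llama 3.x chat
# endpoint); here a stand-in response shows the expected output shape.
mock_response = """[
  {"title": "Example Movie A", "score": 0.72, "justification": "Broad-appeal genre and clear hook."},
  {"title": "Example Movie B", "score": 0.55, "justification": "Niche premise, familiar formula."}
]"""

ranking = sorted(json.loads(mock_response), key=lambda r: r["score"], reverse=True)
for position, row in enumerate(ranking, start=1):
    print(position, row["title"], row["score"])
```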
Tests
The experiment followed two main stages: initially, the authors tested several model variants to establish a baseline, identifying the version that performed better than a random-ordering approach.
Second, they tested large language models in generative mode, comparing their output to a stronger baseline rather than a random ranking, which raised the difficulty of the task.
This meant the models had to do better than a system that already showed some ability to predict which movies would become popular. As a result, the authors assert, the evaluation better reflected real-world conditions, where editorial teams and recommender systems are rarely choosing between a model and chance, but between competing systems with varying levels of predictive ability.
The Advantage of Ignorance
A key constraint in this setup was the time gap between the models' knowledge cutoff and the actual release dates of the films. Because the language models were trained on data that ended six to twelve months before the films became available, they had no access to post-release information, ensuring that the predictions were based entirely on metadata, and not on any learned audience response.
Baseline Evaluation
To construct a baseline, the authors generated semantic representations of movie metadata using three embedding models: BERT V4; Linq-Embed-Mistral 7B; and Llama 3.3 70B, quantized to 8-bit precision to meet the constraints of the experimental environment.
Linq-Embed-Mistral was chosen for inclusion due to its top position on the MTEB (Massive Text Embedding Benchmark) leaderboard.
Each model produced vector embeddings of candidate movies, which were then compared with the average embedding of the top 100 most popular titles from the weeks preceding each movie's release.
Popularity was inferred using cosine similarity between these embeddings, with higher similarity scores indicating higher predicted appeal. The ranking accuracy of each model was evaluated by measuring performance against a random-ordering baseline.
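A minimal sketch of this Popular Embedding-style baseline follows: score each candidate by cosine similarity to the mean embedding of recent popular titles. The embed() function here is a deterministic toy stand-in for a real embedding model such as BERT or Linq-Embed-Mistral, and the title lists are hypothetical.

```python
# Hedged sketch: rank candidates by similarity to the centroid of recent hits.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding; stands in for a real embedding model."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

popular_titles = ["Hit Movie 1", "Hit Movie 2", "Hit Movie 3"]  # recent top titles
candidates     = ["New Release X", "New Release Y"]

# Average embedding of the popular set, re-normalized to unit length.
popular_centroid = np.mean([embed(t) for t in popular_titles], axis=0)
popular_centroid /= np.linalg.norm(popular_centroid)

# Cosine similarity between unit vectors reduces to a dot product.
scores = {c: float(embed(c) @ popular_centroid) for c in candidates}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # higher similarity -> higher predicted appeal
```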

Source: https://arxiv.org/pdf/2505.02693
The results (shown above) indicate that BERT V4 and Linq-Embed-Mistral 7B delivered the strongest improvements in identifying the top three most popular titles, although both fell slightly short in predicting the single most popular item.
BERT was ultimately chosen as the baseline model for comparison with the LLMs, as its efficiency and overall gains outweighed its limitations.
LLM Evaluation
The researchers assessed performance using two ranking approaches: pairwise ranking and listwise ranking. Pairwise ranking evaluates whether the model correctly orders one item relative to another, while listwise ranking considers the accuracy of the entire ordered list of candidates.
This combination made it possible to evaluate not only whether individual movie pairs were ranked correctly (local accuracy), but also how well the full list of candidates reflected the actual popularity order (global accuracy).
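The sketch below contrasts the two views on a toy example: pairwise (local) accuracy over title pairs, and a listwise (global) agreement score over the full ordering. The specific listwise measure shown (a Spearman-style correlation) is a generic illustration, not the paper's exact formulation.

```python
# Hedged sketch: pairwise vs. listwise comparison of a predicted ranking.
from itertools import combinations

predicted = ["Movie C", "Movie A", "Movie B", "Movie D"]  # model's order
actual    = ["Movie A", "Movie B", "Movie C", "Movie D"]  # true popularity order

rank_pred = {t: i for i, t in enumerate(predicted)}
rank_true = {t: i for i, t in enumerate(actual)}

# Pairwise (local) accuracy: fraction of title pairs placed in the same
# relative order as the ground truth.
pairs = list(combinations(actual, 2))
correct = sum(
    (rank_pred[a] < rank_pred[b]) == (rank_true[a] < rank_true[b])
    for a, b in pairs
)
print("pairwise accuracy:", correct / len(pairs))

# Listwise (global) agreement: Spearman rank correlation over the whole list.
n = len(actual)
d2 = sum((rank_pred[t] - rank_true[t]) ** 2 for t in actual)
spearman = 1 - (6 * d2) / (n * (n**2 - 1))
print("listwise (Spearman) agreement:", spearman)
```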
Full, non-quantized models were employed to avoid performance loss, ensuring a consistent and reproducible comparison between LLM-based predictions and embedding-based baselines.
Metrics
To evaluate how effectively the language models predicted movie popularity, both ranking-based and classification-based metrics were used, with particular attention to identifying the top three most popular titles.
Four metrics were applied: Accuracy@1 measured how often the most popular item appeared in the first position; Reciprocal Rank captured how high the top actual item ranked in the predicted list, by taking the inverse of its position; Normalized Discounted Cumulative Gain (NDCG@k) evaluated how well the entire ranking matched actual popularity, with higher scores indicating better alignment; and Recall@3 measured the proportion of truly popular titles that appeared in the model's top three predictions.
Since most user engagement happens near the top of ranked menus, the evaluation focused on lower values of k, to reflect practical use cases.
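For reference, here is a compact sketch of the four metrics under the assumption that ground truth is an ordered list of titles that actually became popular. The toy data and binary-relevance treatment in NDCG are illustrative choices, not details confirmed by the paper.

```python
# Hedged sketch: Accuracy@1, Reciprocal Rank, Recall@3 and NDCG@k on toy data.
import math

predicted = ["Movie B", "Movie A", "Movie D", "Movie C", "Movie E"]  # model ranking
actually_popular = ["Movie A", "Movie B", "Movie C"]                 # most popular first

def accuracy_at_1(pred, truth):
    return 1.0 if pred[0] == truth[0] else 0.0

def reciprocal_rank(pred, truth):
    # Inverse of the position at which the single most popular title appears.
    return 1.0 / (pred.index(truth[0]) + 1) if truth[0] in pred else 0.0

def recall_at_k(pred, truth, k=3):
    return len(set(pred[:k]) & set(truth)) / len(truth)

def ndcg_at_k(pred, truth, k=3):
    # Binary relevance: 1 if the predicted title is in the popular set.
    rel = [1.0 if t in truth else 0.0 for t in pred[:k]]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(truth))))
    return dcg / ideal if ideal else 0.0

print(accuracy_at_1(predicted, actually_popular))    # 0.0
print(reciprocal_rank(predicted, actually_popular))  # 0.5
print(recall_at_k(predicted, actually_popular))      # ~0.667
print(round(ndcg_at_k(predicted, actually_popular), 3))
```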

The performance of Llama 3.1 (8B), 3.1 (405B), and 3.3 (70B) was evaluated by measuring metric improvements relative to the earlier-established BERT V4 baseline. Each model was tested using a series of prompts, ranging from minimal to information-rich, to examine the effect of input detail on prediction quality.
The authors state:
Performance improved when cast awards were included as part of the prompt – in this case, the number of major awards received by the top five billed actors in each film. This richer metadata was part of the most detailed prompt configuration, outperforming a simpler version that excluded cast recognition. The benefit was most evident in the larger models, Llama 3.1 (405B) and 3.3 (70B), both of which showed stronger predictive accuracy when given this extra signal of prestige and audience familiarity.
In contrast, the smallest model, Llama 3.1 (8B), showed improved performance as prompts became slightly more detailed, progressing from genre to synopsis, but declined when more fields were added, suggesting that the model lacked the capacity to integrate complex prompts effectively, resulting in weaker generalization.
When prompts were restricted to genre alone, models under-performed against the baseline, demonstrating that limited metadata was insufficient to support meaningful predictions.
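The sketch below shows one way such tiered prompts can be assembled, from a genre-only version up to one that includes cast-award counts. The tier names, field names, and example metadata are illustrative assumptions, not the paper's exact prompt configurations.

```python
# Hedged sketch: prompt variants of increasing metadata richness.
movie = {
    "title": "Example Release",
    "genre": "science fiction",
    "synopsis": "A salvage crew finds a derelict ship broadcasting a distress call.",
    "cast": ["Actor One", "Actor Two"],
    "cast_major_awards": 3,   # awards held by the top-billed actors (hypothetical field)
}

PROMPT_TIERS = {
    "minimal": ["genre"],
    "medium":  ["genre", "synopsis"],
    "rich":    ["genre", "synopsis", "cast"],
    "richest": ["genre", "synopsis", "cast", "cast_major_awards"],
}

def build_prompt(item: dict, tier: str) -> str:
    lines = [f"{field}: {item[field]}" for field in PROMPT_TIERS[tier]]
    return (
        "Rank this upcoming movie's likely popularity from the metadata below.\n"
        + "\n".join(lines)
    )

for tier in PROMPT_TIERS:
    print(f"--- {tier} ---")
    print(build_prompt(movie, tier))
```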
Conclusion
LLMs have become the poster child for generative AI, which could explain why they are being put to work in areas where other methods might be a better fit. Even so, there is still a lot we don't know about what they can do across different industries, so it makes sense to give them a shot.
In this particular case, as with stock markets and weather forecasting, there is only a limited extent to which historical data can serve as the foundation of future predictions. In the case of movies and TV shows, the very delivery medium is now a moving target, in contrast to the period between 1978 and 2011, when cable, satellite and portable media (VHS, DVD, et al.) represented a series of transitory or evolving historical disruptions.
Neither can any prediction method account for the extent to which the success or failure of productions may influence the viability of a proposed property – and yet this is frequently the case in the movie and TV industry, which likes to ride a trend.
However, when used thoughtfully, LLMs could help strengthen recommendation systems through the cold-start phase, offering useful support across a range of predictive methods.