Predicting future states is a central task in computer vision research – not least in robotics, where real-world situations must be considered. Machine learning systems entrusted with mission-critical tasks consequently need an adequate understanding of the physical world.
However, in some cases an apparently impressive knowledge of temporal reality may very well be deceptive: a new paper from the United Arab Emirates has found that state-of-the-art Multimodal Large Language Models (MLLMs), including sector leaders GPT-4o and Google Gemini, fall short when it comes to interpreting how time is represented in images.
Example sequential pairs (see image below), which would be unchallenging for humans even if presented in the wrong order, can fox advanced MLLMs when presented in unexpected contexts or configurations (such as second-image-first, concatenation into single images, sequential multiple images which may or may not represent the correct temporal order, and so on).
https://huggingface.co/datasets/fazliimam/temporal-vqa/viewer
The researchers tasked the models with basic temporal reasoning challenges, such as determining event order or estimating time gaps, and found that the seven MLLMs tested performed notably below human accuracy:
Machine learning systems are designed to optimize for the most accurate, but also the most efficient and people-pleasing results. Since they do not reveal their reasoning explicitly, it can be difficult to tell when they are cheating, or using 'shortcuts'.
In such a case, the MLLM may arrive at the correct answer by the wrong method. The fact that such an answer may be correct can encourage false confidence in the model, which could go on to produce incorrect results by the same method in later tasks presented to it.
Worse yet, this misdirection can become even more deeply embedded in the development chain if humans are impressed by it, and give positive feedback in trials and annotation sessions that may contribute to the direction that the data and/or the model takes.
In this case, the suggestion is that MLLMs are 'faking' a true understanding of chronology and temporal phenomena, by observing and anchoring on secondary indicators (such as time-stamps in video data, the order of images in a layout, or even – potentially – sequentially-numbered file-names).
It further indicates that MLLMs currently fail to meet any real definition of having generalized a concept of temporal phenomena – at least, not to the extent that humans can.
The new paper is titled Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!, and comes from three researchers at the Mohamed bin Zayed University of Artificial Intelligence and Alibaba International Digital Commerce.
Data and Tests
The authors note that prior benchmarks and studies, such as MMMU and TemporalBench, concentrate on single-image inputs or else formulate questions for the MLLMs that may be rather too easy to answer, and may not uncover a tendency towards shortcut behavior.
Therefore the authors offer two updated approaches: Temporal Order Understanding (TOU) and Time-lapse Estimation (TLE). The TOU approach tests the models on their ability to determine the correct sequence of events from pairs of video frames; the TLE method evaluates the MLLM's ability to estimate the time difference between two images, ranging from seconds to years.

Source: https://arxiv.org/pdf/2501.10674
The researchers curated 360 image pairs for the TOU benchmark, using open source videos from Pixabay and Pexels, so that it would be possible to make the dataset available via a GUI.
The videos covered a variety of subjects, from people in everyday activities to non-human content such as animals and plants. From these, pairs of frames were chosen to depict a sequence of events with sufficient variation to make the starting frame 'obvious'.
Human selection was used to ensure that the frames could be definitively ordered. For instance, one of the curated pairs shows a partially-filled teacup in one frame, and the same cup fully filled with tea in the next, making the sequence logic easy to identify.

In this manner, 360 image pairs were obtained.
For the TLE approach, copyright-free images were chosen from Google and Flickr, as well as select frames from copyright-free videos on YouTube. The subject matter of these sources featured scenes or objects whose change interval ranged from seconds to days to seasons – for instance, ripening fruit, or the change of seasons in landscapes.
Thus 125 image pairs were curated for the TLE method.
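Both splits are published in the Temporal-VQA dataset linked above. Purely as a convenience, a minimal sketch of inspecting them with the Hugging Face datasets library is shown below; the config and split names used here are assumptions and may differ from the actual layout on the Hub.
```python
from datasets import load_dataset

# Minimal sketch: the config and split names below are assumptions and may
# differ from the actual Temporal-VQA layout on the Hugging Face Hub.
tou = load_dataset("fazliimam/temporal-vqa", "temporal_order", split="test")
tle = load_dataset("fazliimam/temporal-vqa", "timelapse_estimation", split="test")

print(tou)  # expected to contain the ~360 TOU image pairs
print(tle)  # expected to contain the ~125 TLE image pairs
```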
Not all of the MLLMs tested were able to process multiple images; therefore the tests differed in order to accommodate each model's capabilities.
Multiple versions of the curated datasets were generated, in which some of the pairs were concatenated vertically, and others horizontally. Further variations swapped the true and correct temporal sequence of the pairs.
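The paper's own preprocessing code is not reproduced in this article; as an illustration of this step only, a minimal Python/PIL sketch of producing a concatenated (and optionally order-swapped) variant from a curated pair might look like this, with the file names being hypothetical:
```python
from PIL import Image

def make_pair_variant(path_a, path_b, layout="horizontal", swap=False):
    """Concatenate two frames into a single image, optionally in reversed order.

    path_a / path_b are hypothetical file names for the first and second frame
    of a curated pair; 'swap' reverses the true temporal order, mirroring the
    shuffled variants described above.
    """
    a, b = Image.open(path_a), Image.open(path_b)
    if swap:
        a, b = b, a
    if layout == "horizontal":
        canvas = Image.new("RGB", (a.width + b.width, max(a.height, b.height)), "white")
        canvas.paste(a, (0, 0))
        canvas.paste(b, (a.width, 0))
    else:  # vertical
        canvas = Image.new("RGB", (max(a.width, b.width), a.height + b.height), "white")
        canvas.paste(a, (0, 0))
        canvas.paste(b, (0, a.height))
    return canvas

# Example: a horizontally concatenated pair with the true order reversed
make_pair_variant("frame_a.jpg", "frame_b.jpg", swap=True).save("pair_swapped.jpg")
```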
Two prompt types were developed. The first followed this template:
The second followed this schema:
For TLE, the questions were multiple-choice, asking the models to estimate the time-lapse between the two presented images, with time-units ranging from seconds to years available as the answer options. In this configuration, the most recent image was presented on the right.
The prompt used here was:
The MLLMs tested were GPT-4o; Gemini 1.5 Pro; LLaVA-NeXT; InternVL; Qwen-VL; Llama-3-Vision; and LLaVA-CoT.
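The paper's exact prompts and evaluation harness are not reproduced here; purely as a hypothetical sketch, a temporal-order query over a concatenated pair, sent to GPT-4o via the OpenAI Python SDK, might be assembled along these lines (the question wording and file name are illustrative assumptions, not the authors' actual prompt):
```python
# Hypothetical sketch only: the question text below is illustrative and is NOT
# the paper's actual prompt; 'pair_swapped.jpg' is the concatenated pair file
# produced in the earlier sketch.
import base64
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def encode_image(path: str) -> str:
    """Read an image file and return its base64 encoding for a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

question = (
    "The image contains two frames from the same video, placed side by side. "
    "Did the left frame or the right frame occur first? Answer 'left' or 'right'."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encode_image('pair_swapped.jpg')}"},
            },
        ],
    }],
)

print(response.choices[0].message.content)
```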
Temporal Order Understanding: Results

Regarding the results shown above, the authors found that all of the tested MLLMs, including GPT-4o (which showed the best overall performance), struggled significantly with the TemporalVQA benchmark – and even GPT-4o failed to consistently exhibit reliable temporal reasoning across different configurations.
The authors contend that the consistently low accuracy across the models highlights significant shortcomings in their ability to interpret and reason about temporal sequences from visual data. The researchers note that these challenges persist even with the use of multi-image inputs and optimized prompts, pointing to fundamental limitations in current model architectures and training methods.
The tests showed significant variations in performance across prompting strategies. While GPT-4o improved with optimized prompts (reaching 46.0% in single-image and 65.3% in multi-image settings), performance remained below acceptable levels.
Models such as LLaVA-NeXT and Qwen-VL were even more sensitive, with performance declining when alternate prompts were used, suggesting that prompt engineering alone cannot overcome the MLLMs' fundamental limitations in regard to temporal reasoning.
Tests also indicated that image layout (i.e., vertical vs. horizontal) significantly impacted model performance. GPT-4o improved its consistency with vertical arrangements, rising from 39.2% to 52.8%; however, other models, including the LLaVA strains, showed strong directional biases, excelling in one orientation but failing in another.
The paper indicates that these inconsistencies suggest a reliance on spatial cues, rather than true temporal reasoning: the MLLMs were not genuinely analyzing the sequence of events or understanding progression over time. Instead, they appear to have relied on patterns or visual features related to the layout of the images, such as their position or alignment, in order to make decisions.

Comparison tests between single-image and multi-image inputs demonstrated limited overall improvement, with GPT-4o performing slightly better on multi-image input, rising from 31.0% to 43.6% (with P1) and from 46.0% to 65.3% (with P2).
Other models, such as InternVL, demonstrated stable but low accuracy, while Qwen-VL saw minor gains. The authors conclude that these results indicate that additional visual context does not substantially enhance temporal reasoning capabilities, since the models struggle to integrate temporal information effectively.
Human Study
In a human study, three surveys were conducted to assess how closely the best-performing MLLM performed against human estimation.
Humans achieved 90.3% accuracy, outperforming GPT-4o's 65.3% by 25 percentage points. The dataset proved reliable, with minimal human errors and consistent agreement on correct answers.

Time-lapse Estimation: Results

In these tests, the MLLMs performed only adequately on time-lapse estimation: GPT-4o achieved 70% accuracy, but the other models performed significantly worse (see table above), and performance also varied notably across the various time scales.
The authors comment:
Human Study
In the human study for TLE, average human performance improved on GPT-4o (also the best-performing model in this category) by 12.3%.
The authors note that some of the challenges were particularly demanding, and that in one case all of the human participants returned a wrong answer, as did all of the AI participants.
The authors conclude that GPT-4o exhibits 'reasonably robust reasoning capabilities', notwithstanding the order of the images presented to it.
Conclusion
If MLLMs eventually amass and absorb enough 'shortcut' data to cover even the trickiest challenges of the kind presented by the authors in this study, whether or not they can be said to have developed human-style generalization capabilities in this domain could become a moot point.
Neither is it known exactly by what route we obtain our own abilities in temporal reasoning – do we likewise 'cheat' until the sheer quantity of learned experience reveals a pattern that performs as 'instinct' in regard to this kind of test?