When humans develop a deep enough understanding of a domain, such as gravity or other basic physical principles, we move beyond specific examples to grasp the underlying abstractions. This allows us to apply that knowledge creatively across contexts and to recognize new instances, even ones we have never seen before, by identifying the principle at work.
When a domain carries enough importance, we may even begin to perceive it where it does not exist, as with pareidolia, driven by the high cost of failing to recognize a genuine instance. So strong is this pattern-recognizing survival mechanism that it even disposes us to find patterns where there are none.
The earlier and more repeatedly a domain is instilled in us, the deeper its grounding and the longer it persists; and one of the earliest visual datasets we are exposed to as children comes in the form of teaching clocks, where printed material or interactive analog clocks are used to teach us how to tell time:
Source: https://www.youtube.com/watch?v=IBBQXBhSNUs
Though changing fashions in watch design may sometimes challenge us, the resilience of this early domain mastery is quite impressive, allowing us to discern analog clock faces even in the face of complex or 'eccentric' design choices:

Source: https://www.ablogtowatch.com/wait-a-minute-legibility-is-the-most-important-part-of-watch-design/
Humans don't need thousands of examples to learn how clocks work; once the basic concept is grasped, we can recognize it in almost any form, even when distorted or abstracted.
The problem that AI models face with this task, in contrast, highlights a deeper issue: their apparent strength may depend more on high-volume exposure than on understanding.
Beyond the Imitation Game?
The tension between surface-level performance and genuine 'understanding' has surfaced repeatedly in recent investigations of large models. Last month Zhejiang University and Westlake University re-framed the question in a paper of their own (not the focus of this article), concluding:
This week the question arises again, now in a collaboration between Nanjing University of Aeronautics and Astronautics and the Universidad Politécnica de Madrid in Spain. The new paper explores how well multimodal models understand time-telling.
Though the course of the research is covered only in broad strokes in the paper, the researchers' initial tests established that OpenAI's GPT-4.1 multimodal language model struggled to correctly read the time from a diverse set of clock images, often giving incorrect answers even in simple cases.
This points to a possible gap in the model's training data, raising the need for a more balanced dataset to test whether the model can actually learn the underlying concept. The authors therefore curated a synthetic dataset of analog clocks, evenly covering every possible time and avoiding the usual biases found in web images:

Source: https://huggingface.co/datasets/migonsa/analog_watches_finetune
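The generation code itself isn't included in the article, but a balanced collection of this kind is simple to sketch: render one dial image for every minute of the twelve-hour cycle, so that no time is over-represented. The following Python sketch (using matplotlib; the hand lengths, tick styling, and output folder are illustrative assumptions rather than the authors' actual settings) shows the idea:

```python
import math
from pathlib import Path
import matplotlib
matplotlib.use("Agg")                      # render off-screen, no display needed
import matplotlib.pyplot as plt

OUT = Path("synthetic_clocks")             # hypothetical output folder
OUT.mkdir(exist_ok=True)

def draw_clock(hour: int, minute: int, path: Path) -> None:
    """Render a plain analog dial showing hour:minute and save it as a PNG."""
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    ax.set_aspect("equal"); ax.axis("off")
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, lw=2))      # dial rim

    for tick in range(60):                                        # minute and hour ticks
        a = math.radians(90 - tick * 6)
        r0 = 0.88 if tick % 5 == 0 else 0.94
        ax.plot([r0 * math.cos(a), math.cos(a)],
                [r0 * math.sin(a), math.sin(a)], lw=1, color="k")

    h_angle = math.radians(90 - ((hour % 12) + minute / 60) * 30) # hour hand angle
    m_angle = math.radians(90 - minute * 6)                       # minute hand angle
    ax.plot([0, 0.5 * math.cos(h_angle)], [0, 0.5 * math.sin(h_angle)], lw=4, color="k")
    ax.plot([0, 0.8 * math.cos(m_angle)], [0, 0.8 * math.sin(m_angle)], lw=2, color="k")

    ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

# One image per minute of the 12-hour cycle: 720 evenly distributed times.
for hour in range(12):
    for minute in range(60):
        draw_clock(hour, minute, OUT / f"{hour:02d}_{minute:02d}.png")
```

Rendering every hour/minute combination yields 720 images with a perfectly flat distribution of times, in contrast to the heavily skewed distribution of scraped web photos.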
Before fine-tuning on the new dataset, GPT-4.1 consistently failed to read these clocks. After some exposure to the new collection, however, its performance improved – but only when the new images looked like ones it had already seen.
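The article doesn't reproduce the fine-tuning recipe, but vision fine-tuning for OpenAI chat models is typically supplied as a JSONL file in which each record pairs a user turn (a text prompt plus a base64-encoded image) with the desired assistant reply. The sketch below assumes the hypothetical synthetic_clocks folder from the previous snippet and an HH:MM answer format; both are illustrative assumptions, not details from the paper:

```python
import base64, json
from pathlib import Path

IN = Path("synthetic_clocks")                 # images from the previous sketch
records = []

for img in sorted(IN.glob("*.png")):
    hour, minute = img.stem.split("_")        # filenames encode the ground-truth time
    with img.open("rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    records.append({
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": "What time does this clock show? Answer as HH:MM."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
            {"role": "assistant", "content": f"{hour}:{minute}"},
        ]
    })

# The resulting file would then be uploaded as training data for a fine-tuning job.
with open("clock_finetune.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```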
When the shape of the clock or the style of the hands changed, accuracy fell sharply; even small tweaks, such as thinner hands or arrowheads (rightmost image below), were enough to throw it off; and GPT-4.1 additionally struggled to interpret Dali-esque 'melting clocks':

Source: https://arxiv.org/pdf/2505.10862
The authors deduce that current models such as GPT-4.1 may therefore be learning clock-reading mainly through visual pattern matching, rather than through any deeper concept of time, asserting:
Enough Time
Most training datasets rely on scraped web images, which tend to repeat certain times – especially 10:10, a popular setting in watch advertisements:

As a result of this limited range of depicted times, the model may see only a narrow range of possible clock configurations, limiting its ability to generalize beyond those repetitive patterns.
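The skew is easy to picture as a frequency count over whatever time labels a scraped collection carries. The snippet below assumes a hypothetical scraped_clock_labels.jsonl metadata file with a "time" field per image; in a real web scrape, a tally like this would typically show 10:10 and its near neighbours dwarfing every other reading:

```python
from collections import Counter
import json

# Hypothetical metadata file: one JSON object per scraped image, each carrying
# a "time" label such as "10:10" assigned by whatever labelling process was used.
with open("scraped_clock_labels.jsonl") as f:
    times = [json.loads(line)["time"] for line in f]

counts = Counter(times)
total = sum(counts.values())

# Print the five most common times and their share of the whole collection.
for time, n in counts.most_common(5):
    print(f"{time}: {n} images ({100 * n / total:.1f}% of the set)")
```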
Regarding why models fail to correctly interpret the distorted clocks, the paper states:
The authors contend that identifying the root cause of these failures is essential to advancing multimodal models: if the issue lies in how the model perceives spatial direction, fine-tuning may offer a straightforward fix; but if the problem stems from a broader difficulty in integrating multiple visual cues, it points to a more fundamental weakness in how these systems process information.
Fine-Tuning Tests
To test whether the model's failures could be overcome with exposure, GPT-4.1 was fine-tuned on the aforementioned comprehensive synthetic dataset. Before fine-tuning, its predictions were widely scattered, with significant errors across all types of clocks. After fine-tuning on the collection, accuracy improved sharply on standard clock faces and, to a lesser extent, on distorted ones.
However, clocks with modified hands, such as thinner shapes or arrowheads, continued to produce large errors.
Two distinct failure modes emerged: on normal and distorted clocks, the model typically misjudged the direction of the hands; but on clocks with altered hand designs, it often confused the function of each hand, mistaking one for another.

This suggests that the model had learned to associate visual features such as hand thickness with specific roles, and struggled when these cues changed.
The limited improvement on unfamiliar designs raises further doubts about whether a model of this sort learns the abstract concept of time-telling, or merely refines its pattern-matching.
Hand Signs
So, although fine-tuning improved GPT-4.1's performance on conventional analog clocks, it had far less impact on clocks with thinner hands or arrowhead shapes, raising the possibility that the model's failures stemmed less from a lack of abstract reasoning and more from confusion over which hand was which.
To test whether accuracy might improve if that confusion were removed, a new analysis was conducted on the model's predictions for the 'modified-hand' dataset. The outputs were divided into two groups: cases where GPT-4.1 correctly identified the hour, minute, and second hands; and cases where it did not.
The predictions were evaluated for Mean Absolute Error (MAE) before and after fine-tuning, and the results compared with those from standard clocks; angular error was also measured for each hand, using dial position as a baseline:

Confusing the roles of the clock hands led to the largest errors. When GPT-4.1 mistook the hour hand for the minute hand, or vice versa, the resulting time estimates were often far off. In contrast, errors caused by misjudging the direction of a correctly identified hand were smaller. Among the three hands, the hour hand showed the highest angular error before fine-tuning, while the second hand showed the lowest.
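The paper's evaluation code isn't reproduced here, but the two measurements described above are easy to state precisely: the time error has to respect the wrap-around of the twelve-hour dial, and each hand's angular error is the smaller of the two rotations between predicted and true positions. A minimal sketch, with function names and minute/degree units chosen for illustration:

```python
def time_mae_minutes(pred_h, pred_m, true_h, true_m):
    """Absolute error in minutes on a 12-hour dial, taking the shorter wrap-around."""
    diff = abs((pred_h % 12) * 60 + pred_m - ((true_h % 12) * 60 + true_m))
    return min(diff, 720 - diff)

def hand_angle_error(pred_deg, true_deg):
    """Smallest angular difference (degrees) between predicted and true hand positions."""
    diff = abs(pred_deg - true_deg) % 360
    return min(diff, 360 - diff)

# Reading 7:50 when the dial shows 10:10 is a 140-minute error, while a minute
# hand reported at 350 degrees when it actually sits at 10 degrees is off by only 20.
print(time_mae_minutes(7, 50, 10, 10))   # 140
print(hand_angle_error(350, 10))         # 20
```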

To isolate directional errors alone, the analysis was limited to cases where the model correctly identified each hand's function. If the model had internalized a general concept of time-telling, its performance on these examples should have matched its accuracy on standard clocks. It did not: accuracy remained noticeably worse.
To examine whether hand design interfered with the model's sense of direction, a second experiment was run: two new datasets were created, each containing sixty synthetic clocks with only an hour hand, each pointing to a different minute mark. One set used the original hand design, and the other the altered version. The model was asked to name the tick mark that the hand was pointing to.
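This probe set is straightforward to reproduce in spirit: sixty dials, each with a single hour-styled hand aimed at one minute tick, rendered once with a plain hand and once with an arrowhead variant. The sketch below reuses the matplotlib approach from earlier; the exact hand geometry and arrow styling are guesses rather than the authors' renderer:

```python
import math
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def draw_single_hand(tick: int, arrowhead: bool, path: Path) -> None:
    """A dial with sixty ticks and one hour-styled hand aimed at `tick`."""
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    ax.set_aspect("equal"); ax.axis("off")
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, lw=2))       # dial rim
    for t in range(60):                                           # tick marks
        a = math.radians(90 - t * 6)
        r0 = 0.88 if t % 5 == 0 else 0.94
        ax.plot([r0 * math.cos(a), math.cos(a)],
                [r0 * math.sin(a), math.sin(a)], lw=1, color="k")
    a = math.radians(90 - tick * 6)
    if arrowhead:   # the 'modified' style: a thin shaft ending in an arrow
        ax.annotate("", xy=(0.5 * math.cos(a), 0.5 * math.sin(a)), xytext=(0, 0),
                    arrowprops=dict(arrowstyle="-|>", lw=1.5, color="k"))
    else:           # the 'original' style: a plain thick bar
        ax.plot([0, 0.5 * math.cos(a)], [0, 0.5 * math.sin(a)], lw=4, color="k")
    ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
    fig.savefig(path, bbox_inches="tight"); plt.close(fig)

# One image per minute mark, for each of the two hand styles.
for style, arrow in (("original", False), ("modified", True)):
    out = Path(f"single_hand_{style}"); out.mkdir(exist_ok=True)
    for tick in range(60):
        draw_single_hand(tick, arrow, out / f"{tick:02d}.png")
```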
Results showed a slight drop in accuracy with the modified hands, but not enough to account for the model's broader failures. An unfamiliar hand design nonetheless appeared capable of disrupting the model's overall interpretation, even in tasks it had previously performed well on.

Conclusion
While the paper's focus may seem trivial at first glance, it does not especially matter whether vision-language models ever learn to read analog clocks with 100% accuracy. What gives the work weight is its focus on a deeper, recurring question: whether saturating models with more (and more diverse) data can lead to the kind of domain understanding humans acquire through abstraction and generalization; or whether the only viable path is to flood the domain with enough examples to anticipate every likely variation at inference time.
Either route raises doubts about what current architectures are truly capable of learning.