AI’s Struggle to Read Analogue Clocks May Have Deeper Significance


When humans develop a deep enough understanding of a domain, such as gravity or other basic physical principles, we move beyond specific examples to grasp the underlying abstractions. This allows us to apply that knowledge creatively across contexts and to recognize new instances, even ones we have never seen before, by identifying the principle at work.

When a domain carries enough importance, we may even begin to perceive it where it does not exist, as with pareidolia, driven by the high cost of failing to recognize a real instance. So strong is this pattern-recognizing survival mechanism that it even disposes us to see patterns where there are none.

The earlier and more repeatedly a domain is instilled in us, the deeper its grounding and lifelong persistence; and one of the earliest visual datasets we are exposed to as children comes in the form of teaching clocks, where printed material or interactive analog clocks are used to teach us how to tell time:

Source: https://www.youtube.com/watch?v=IBBQXBhSNUs

Though changing fashions in watch design may sometimes challenge us, the resilience of this early domain mastery is quite impressive, allowing us to discern analogue clock faces even in the face of complex or ‘eccentric’ design choices:

Some challenging faces in watch couture. Source: https://www.ablogtowatch.com/wait-a-minute-legibility-is-the-most-important-part-of-watch-design/


Humans don’t need thousands of examples to learn how clocks work; once the basic concept is grasped, we can recognize it in almost any form, even when distorted or abstracted.

The problem that AI models face with this task, in contrast, highlights a deeper issue: their apparent strength may depend more on high-volume exposure than on understanding.

Beyond the Imitation Game?

The tension between surface-level performance and genuine ‘understanding’ has surfaced repeatedly in recent investigations of large models. Last month Zhejiang University and Westlake University re-framed the question in a paper of their own, which is not the main focus of this article.

This week the question arises again, now in a collaboration between the Nanjing University of Aeronautics and Astronautics and the Universidad Politécnica de Madrid in Spain. The new paper explores how well multimodal models understand time-telling.

Though the progress of the research is covered only in broad detail in the paper, the researchers’ initial tests established that OpenAI’s GPT-4.1 multimodal language model struggled to correctly read the time from a diverse set of clock images, often giving incorrect answers even in simple cases.

This points to a possible gap in the model’s training data, raising the need for a more balanced dataset to test whether the model can actually learn the underlying concept. The authors therefore curated a synthetic dataset of analog clocks, evenly covering every possible time and avoiding the usual biases found in web images:

An example from the researchers' synthetic analog clock dataset, used to fine-tune a GPT model in the new work. Source: https://huggingface.co/datasets/migonsa/analog_watches_finetune

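To make the idea of a ‘balanced’ collection concrete, here is a minimal sketch of how such a dataset could be generated with a simple matplotlib renderer; the function names and styling choices are illustrative assumptions, not the authors’ actual pipeline.

```python
import math
import matplotlib.pyplot as plt

def hand_angles(hour: int, minute: int, second: int = 0):
    """Clockwise angles in degrees from 12 o'clock for the hour, minute and second hands."""
    hour_angle = (hour % 12 + minute / 60 + second / 3600) * 30.0
    minute_angle = (minute + second / 60) * 6.0
    second_angle = second * 6.0
    return hour_angle, minute_angle, second_angle

def draw_clock(hour: int, minute: int, path: str) -> None:
    """Render a plain dial with an hour and a minute hand, and save it to `path`."""
    fig, ax = plt.subplots(figsize=(2, 2))
    ax.set_aspect("equal")
    ax.axis("off")
    ax.set_xlim(-1.1, 1.1)
    ax.set_ylim(-1.1, 1.1)
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, linewidth=2))          # the dial
    angles = hand_angles(hour, minute)
    for angle_deg, length, width in zip(angles[:2], (0.5, 0.85), (4, 2)):   # hour: short/thick, minute: long/thin
        theta = math.radians(90.0 - angle_deg)                              # clock angle -> maths angle
        ax.plot([0, length * math.cos(theta)], [0, length * math.sin(theta)],
                color="black", linewidth=width, solid_capstyle="round")
    fig.savefig(path, dpi=100)
    plt.close(fig)

# Every hour/minute combination is rendered exactly once (12 * 60 = 720 images),
# so no single time dominates the collection.
for h in range(12):
    for m in range(60):
        draw_clock(h, m, f"clock_{h:02d}_{m:02d}.png")
```

The key property is coverage: each time appears exactly once, rather than in the proportions found in scraped web imagery.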

Before fine-tuning on the new dataset, GPT-4.1 consistently failed to read these clocks. After some exposure to the new collection, however, its performance improved – but only when the new images looked like ones it had already seen.

When the shape of the clock or the style of the hands changed, accuracy fell sharply; even small tweaks, such as thinner hands or arrowheads (rightmost image below), were enough to throw it off; and GPT-4.1 also struggled to interpret Dali-esque ‘melting clocks’:

Clock images with standard design (left), distorted shape (middle), and modified hands (right), alongside the times returned by GPT-4.1 before and after fine-tuning. Source: https://arxiv.org/pdf/2505.10862


The authors deduce that current models such as GPT-4.1 may therefore be learning clock-reading mainly through visual pattern matching, rather than through any deeper concept of time.

Enough Time

Most training datasets rely on scraped web images, which tend to repeat certain times – especially 10:10, a popular setting in watch advertisements:

From the new paper, an example of the prevalence of the 'ten past ten' time in analog clock images.
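A quick, hypothetical way to check for this kind of imbalance in a scraped collection is simply to count how often each labelled time occurs; the `labels` list below is a toy stand-in, not data from the paper.

```python
from collections import Counter

# Toy example: (hour, minute) annotations for a scraped clock dataset.
labels = [(10, 10), (10, 10), (3, 25), (10, 10), (7, 42)]

counts = Counter(f"{h:02d}:{m:02d}" for h, m in labels)
total = sum(counts.values())
for time_str, n in counts.most_common(5):
    print(f"{time_str}: {n} images ({100 * n / total:.1f}% of dataset)")
```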

As a result of this limited range of depicted times, the model may see only a narrow range of possible clock configurations, limiting its ability to generalize beyond those repetitive patterns.

The paper also considers why models fail to correctly interpret the distorted clocks.

The authors contend that identifying the root cause of these failures is essential to advancing multimodal models: if the difficulty lies in how the model perceives spatial direction, fine-tuning may offer a straightforward fix; but if the issue stems from a broader difficulty in integrating multiple visual cues, it points to a more fundamental weakness in how these systems process information.

Fine-Tuning Tests

To test whether the model’s failures could be overcome with exposure, GPT-4.1 was fine-tuned on the comprehensive synthetic dataset described above. Before fine-tuning, its predictions were widely scattered, with significant errors across all types of clocks. After fine-tuning on the collection, accuracy improved sharply on standard clock faces and, to a lesser extent, on distorted ones.

However, clocks with modified hands, such as thinner shapes or arrowheads, continued to produce large errors.

Two distinct failure modes emerged: on normal and distorted clocks, the model typically misjudged the direction of the hands; but on clocks with altered hand designs, it often confused the function of each hand, mistaking one hand for another (for instance, reading the hour hand as the minute hand).

A comparison illustrating the model’s initial weakness, and the partial gains achieved through fine-tuning, showing predicted versus actual time, in seconds, for 150 randomly selected clocks. On the left, before fine-tuning, GPT-4.1's predictions are scattered and often far from the correct values, indicated by the red diagonal line. On the right, after fine-tuning on a balanced synthetic dataset, the predictions align much more closely with the ground truth, although some errors remain.

This suggests that the model had learned to associate visual features such as hand thickness with specific roles, and struggled when these cues changed.

The limited improvement on unfamiliar designs raises further doubts about whether a model of this sort learns the abstract concept of time-telling, or merely refines its pattern-matching.

Hand Signs

So, although fine-tuning improved GPT-4.1’s performance on conventional analog clocks, it had far less impact on clocks with thinner hands or arrowhead shapes, raising the possibility that the model’s failures stemmed less from abstract reasoning and more from confusion over which hand was which.

To test whether accuracy might improve if that confusion were removed, a new analysis was conducted on the model’s predictions for the ‘modified-hand’ dataset. The outputs were divided into two groups: cases where GPT-4.1 correctly identified the hour, minute, and second hands, and cases where it did not.

The predictions were evaluated for Mean Absolute Error (MAE) before and after fine-tuning, and the results compared with those from standard clocks; angular error was also measured for each hand, using dial position as a baseline:

Error comparison for clocks with and without hand-role confusion in the modified-hand dataset before and after fine-tuning.
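As a rough sketch of what these two measures involve (the exact formulation used by the authors may differ), the time error can be taken as the absolute difference in seconds wrapped around the 12-hour dial, and the angular error as the smallest angle between a predicted hand direction and its true dial position:

```python
def time_error_seconds(pred_hms, true_hms, dial_seconds=12 * 3600):
    """Absolute time difference in seconds, wrapped around the 12-hour dial."""
    to_sec = lambda h, m, s: (h % 12) * 3600 + m * 60 + s
    diff = abs(to_sec(*pred_hms) - to_sec(*true_hms))
    return min(diff, dial_seconds - diff)

def angular_error(pred_deg, true_deg):
    """Smallest angle in degrees between a predicted and a true hand direction."""
    diff = abs(pred_deg - true_deg) % 360
    return min(diff, 360 - diff)

# Worked example: at a true time of 2:50:00 the hour hand sits at 85 degrees
# and the minute hand at 300 degrees. Swapping their roles gives a reading of
# roughly 10:14, hours away from the truth, whereas a 6-degree slip in the
# direction of a correctly identified minute hand is only one minute off.
print(time_error_seconds((10, 14, 0), (2, 50, 0)) / 60)   # 276.0 minutes off (wrapped)
print(time_error_seconds((2, 51, 0), (2, 50, 0)) / 60)    # 1.0 minute off
print(angular_error(306, 300))                            # 6 degrees of direction error
```

The worked example illustrates why role confusion dominates: swapping the hour and minute hands shifts the reading by hours, while a small directional slip barely registers.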

Confusing the roles of the clock hands led to the largest errors. When GPT-4.1 mistook the hour hand for the minute hand or vice versa, the resulting time estimates were often far off. In contrast, errors caused by misjudging the direction of a correctly identified hand were smaller. Among the three hands, the hour hand showed the highest angular error before fine-tuning, while the second hand showed the lowest.

Angular error by hand type for predictions with and without hand-role confusion, before and after fine-tuning, in the modified-hand dataset.

To isolate directional errors alone, the analysis was limited to cases where the model correctly identified each hand’s function. If the model had internalized a general concept of time-telling, its performance on these examples should have matched its accuracy on standard clocks. It did not, and accuracy remained noticeably worse.

To examine whether hand design interfered with the model’s sense of direction, a second experiment was run: two new datasets were created, each containing sixty synthetic clocks with only an hour hand, each pointing to a different minute mark. One set used the original hand design, and the other the altered version. The model was asked to name the tick mark that the hand was pointing to.
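A minimal sketch of how such a probe could be specified and scored is shown below; it works at the level of tick indices and angles rather than rendered images, and the helper names (`build_probes`, `tick_error`) are illustrative assumptions rather than the authors’ code.

```python
def build_probes():
    """One hour-hand-only clock per minute mark: (tick index, hand angle in degrees)."""
    return [(tick, tick * 6.0) for tick in range(60)]   # 60 ticks, 6 degrees apart

def tick_error(reported_tick: int, true_tick: int, n_ticks: int = 60) -> int:
    """Distance in tick marks between the reported and true positions, wrapped around the dial."""
    diff = abs(reported_tick - true_tick) % n_ticks
    return min(diff, n_ticks - diff)

probes = build_probes()
print(probes[:3])           # [(0, 0.0), (1, 6.0), (2, 12.0)]
print(tick_error(59, 0))    # 1 (adjacent ticks across the 12 o'clock mark)
```

Rendering each probe with the original and the altered hand design, querying the model, and comparing the resulting tick-error distributions isolates the directional component of the failure from the role-confusion component.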

Results showed a slight drop in accuracy with the modified hands, but not enough to account for the model’s broader failures. A single unfamiliar visual feature appeared capable of disrupting the model’s overall interpretation, even in tasks it had previously performed well.

Overview of GPT-4.1’s performance before and after fine-tuning across standard, distorted, and modified-hand clocks, highlighting uneven gains and persistent weaknesses.

Conclusion

While the paper’s focus may appear trivial at first glance, it does not especially matter whether vision-language models ever learn to read analog clocks with 100% accuracy. What gives the work weight is its focus on a deeper recurring question: whether saturating models with more (and more diverse) data can lead to the kind of domain understanding humans acquire through abstraction and generalization, or whether the only viable path is to flood the domain with enough examples to anticipate every likely variation at inference time.

Either route raises doubts about what current architectures are truly capable of learning.

 
