
Image recognition accuracy: An unseen challenge confounding today’s AI


Imagine you’re scrolling through the photos on your phone and you come across an image that at first you can’t recognize. It looks like maybe something fuzzy on the couch; could it be a pillow or a coat? After a couple of seconds it clicks — of course! That ball of fluff is your friend’s cat, Mocha. While some of your photos could be understood in an instant, why was this cat photo so much more difficult?

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers were surprised to find that despite the critical importance of understanding visual data in pivotal areas ranging from health care to transportation to household devices, the notion of an image’s recognition difficulty for humans has been almost entirely ignored. One of the major drivers of progress in deep learning-based AI has been datasets, yet we know little about how data drives progress in large-scale deep learning beyond that bigger is better.

In real-world applications that require understanding visual data, humans outperform object recognition models despite the fact that models perform well on current datasets, including those explicitly designed to challenge machines with debiased images or distribution shifts. This problem persists, in part, because we have no guidance on the absolute difficulty of an image or dataset. Without controlling for the difficulty of images used for evaluation, it’s hard to objectively assess progress toward human-level performance, to cover the range of human abilities, and to increase the challenge posed by a dataset.

To fill in this knowledge gap, David Mayo, an MIT PhD student in electrical engineering and computer science and a CSAIL affiliate, delved into the deep world of image datasets, exploring why certain images are harder for humans and machines to recognize than others. “Some images inherently take longer to recognize, and it’s essential to understand the brain’s activity during this process and its relation to machine learning models. Perhaps there are complex neural circuits or unique mechanisms missing in our current models, visible only when tested with challenging visual stimuli. This exploration is crucial for comprehending and enhancing machine vision models,” says Mayo, a lead author of a new paper on the work.

This led to the development of a new metric, the “minimum viewing time” (MVT), which quantifies the difficulty of recognizing an image based on how long a person needs to view it before making a correct identification. Using a subset of ImageNet, a popular dataset in machine learning, and ObjectNet, a dataset designed to test object recognition robustness, the team showed images to participants for varying durations from as short as 17 milliseconds to as long as 10 seconds, and asked them to choose the correct object from a set of 50 options. After over 200,000 image presentation trials, the team found that existing test sets, including ObjectNet, appeared skewed toward easier, shorter-MVT images, with the vast majority of benchmark performance derived from images that are easy for humans.
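For concreteness, here is a minimal sketch of how a per-image MVT could be derived from presentation-trial data of the kind described above. The data layout, accuracy threshold, and aggregation rule are illustrative assumptions, not the team’s released tooling.

```python
# Hypothetical sketch: estimate a per-image minimum viewing time (MVT)
# from presentation trials. Field names and the 50% accuracy threshold
# are assumptions for illustration only.
from collections import defaultdict

# Each trial: (image_id, presentation duration in milliseconds, was the response correct?)
trials = [
    ("img_001", 17, False), ("img_001", 50, True), ("img_001", 150, True),
    ("img_002", 17, True), ("img_002", 50, True),
]

def minimum_viewing_time(trials, accuracy_threshold=0.5):
    """Return, per image, the shortest duration at which participant
    accuracy reaches the threshold (None if it never does)."""
    by_image = defaultdict(lambda: defaultdict(list))
    for image_id, duration_ms, correct in trials:
        by_image[image_id][duration_ms].append(correct)

    mvt = {}
    for image_id, durations in by_image.items():
        mvt[image_id] = None
        for duration_ms in sorted(durations):
            responses = durations[duration_ms]
            if sum(responses) / len(responses) >= accuracy_threshold:
                mvt[image_id] = duration_ms
                break
    return mvt

print(minimum_viewing_time(trials))  # e.g. {'img_001': 50, 'img_002': 17}
```

Under this framing, an image recognized reliably at 17 milliseconds counts as easy, while one that needs seconds of viewing, or is never reliably recognized, counts as hard.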

The project identified interesting trends in model performance — particularly in relation to scaling. Larger models showed considerable improvement on simpler images but made less progress on more difficult images. The CLIP models, which incorporate both language and vision, stood out as they moved in the direction of more human-like recognition.

“Traditionally, object recognition datasets have been skewed towards less-complex images, a practice that has led to an inflation in model performance metrics, not truly reflective of a model’s robustness or its ability to tackle complex visual tasks. Our research reveals that harder images pose a more acute challenge, causing a distribution shift that is often not accounted for in standard evaluations,” says Mayo. “We released image sets tagged by difficulty along with tools to automatically compute MVT, enabling MVT to be added to existing benchmarks and extended to various applications. These include measuring test set difficulty before deploying real-world systems, discovering neural correlates of image difficulty, and advancing object recognition techniques to close the gap between benchmark and real-world performance.”

“One of my biggest takeaways is that we now have another dimension to evaluate models on. We want models that are able to recognize any image even if — perhaps especially if — it’s hard for a human to recognize. We’re the first to quantify what this would mean. Our results show that not only is this not the case with today’s state of the art, but also that our current evaluation methods don’t have the ability to tell us when it is the case because standard datasets are so skewed toward easy images,” says Jesse Cummings, an MIT graduate student in electrical engineering and computer science and co-first author with Mayo on the paper.

From ObjectNet to MVT

A few years ago, the team behind this project identified a significant challenge in the field of machine learning: Models were struggling with out-of-distribution images, or images that were not well represented in the training data. Enter ObjectNet, a dataset composed of images collected from real-life settings. The dataset helped illuminate the performance gap between machine learning models and human recognition abilities by eliminating spurious correlations present in other benchmarks — for example, between an object and its background. ObjectNet illuminated the gap between the performance of machine vision models on datasets and in real-world applications, encouraging its use by many researchers and developers — which subsequently improved model performance.

Fast forward to the present, and the team has taken their research a step further with MVT. Unlike traditional methods that focus on absolute performance, this new approach assesses how models perform by contrasting their responses to the easiest and hardest images. The study further explored how image difficulty could be explained and tested for similarity to human visual processing. Using metrics like c-score, prediction depth, and adversarial robustness, the team found that harder images are processed differently by networks. “While there are observable trends, such as easier images being more prototypical, a comprehensive semantic explanation of image difficulty continues to elude the scientific community,” says Mayo.
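As a rough illustration of that kind of contrast, the sketch below bins images by an already-computed MVT and compares a model’s accuracy on the easiest and hardest bins. The function name, cutoff values, and data layout are assumptions for illustration, not the authors’ evaluation code.

```python
# Illustrative sketch: contrast a model's accuracy on easy vs. hard images,
# where difficulty is given by each image's MVT (in milliseconds).
def accuracy_by_difficulty(predictions, labels, mvt_ms, easy_cutoff=17, hard_cutoff=10_000):
    """predictions/labels: dicts mapping image id to predicted and true class;
    mvt_ms: dict mapping image id to its MVT, or None if never recognized."""
    def accuracy(image_ids):
        if not image_ids:
            return float("nan")
        return sum(predictions[i] == labels[i] for i in image_ids) / len(image_ids)

    easy = [i for i, t in mvt_ms.items() if t is not None and t <= easy_cutoff]
    hard = [i for i, t in mvt_ms.items() if t is None or t >= hard_cutoff]
    return {"easy_accuracy": accuracy(easy), "hard_accuracy": accuracy(hard)}
```

A large gap between the two numbers signals the kind of skew the researchers describe: strong benchmark scores driven mostly by images humans find easy.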

In the realm of health care, for example, the pertinence of understanding visual complexity becomes even more pronounced. The ability of AI models to interpret medical images, such as X-rays, is subject to the diversity and difficulty distribution of the images. The researchers advocate for a meticulous analysis of difficulty distribution tailored for professionals, ensuring AI systems are evaluated based on expert standards rather than layperson interpretations.

Mayo and Cummings are currently looking at the neurological underpinnings of visual recognition as well, probing whether the brain exhibits differential activity when processing easy versus difficult images. The study aims to unravel whether complex images recruit additional brain areas not typically associated with visual processing, hopefully helping to demystify how our brains accurately and efficiently decode the visual world.

Toward human-level performance

Looking ahead, the researchers are not only focused on exploring ways to enhance AI’s predictive capabilities regarding image difficulty. The team is working on identifying correlations with viewing-time difficulty in order to generate harder or easier versions of images.

Despite the study’s significant strides, the researchers acknowledge limitations, particularly in terms of the separation of object recognition from visual search tasks. The current methodology concentrates on recognizing objects, leaving out the complexities introduced by cluttered images.

“This comprehensive approach addresses the long-standing challenge of objectively assessing progress toward human-level performance in object recognition and opens new avenues for understanding and advancing the field,” says Mayo. “With the potential to adapt the minimum viewing time difficulty metric for a variety of visual tasks, this work paves the way for more robust, human-like performance in object recognition, ensuring that models are truly put to the test and are ready for the complexities of real-world visual understanding.”

“This is a fascinating study of how human perception can be used to identify weaknesses in the ways AI vision models are typically benchmarked, which overestimate AI performance by concentrating on easy images,” says Alan L. Yuille, Bloomberg Distinguished Professor of Cognitive Science and Computer Science at Johns Hopkins University, who was not involved in the paper. “This will help develop more realistic benchmarks, leading not only to improvements in AI but also to fairer comparisons between AI and human perception.”

“It’s widely claimed that computer vision systems now outperform humans, and on some benchmark datasets, that is true,” says Anthropic technical staff member Simon Kornblith PhD ’17, who was also not involved in this work. “However, a lot of the difficulty in those benchmarks comes from the obscurity of what’s in the images; the average person simply doesn’t know enough to classify different breeds of dogs. This work instead focuses on images that people can only get right if given enough time. These images are generally much harder for computer vision systems, but the best systems are only a bit worse than humans.”

Mayo, Cummings, and Xinyu Lin MEng ’22 wrote the paper alongside CSAIL Research Scientist Andrei Barbu, CSAIL Principal Research Scientist Boris Katz, and MIT-IBM Watson AI Lab Principal Researcher Dan Gutfreund. The researchers are affiliates of the MIT Center for Brains, Minds, and Machines.

The team is presenting their work at the 2023 Conference on Neural Information Processing Systems (NeurIPS).
