Do Labels Make AI Blind? Self-Supervision Solves the Age-Old Binding Problem


A new paper from Konrad Körding's lab [1], "Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?", gives insight into a foundational question in visual neuroscience: what is required to bind visual elements and textures together into objects? The goal of this article is to give you background on this problem, review this NeurIPS paper, and hopefully give you insight into both artificial and biological neural networks. I will also review some self-supervised deep learning methods and vision transformers, while highlighting the differences between current deep learning systems and our brains.

1. Introduction

When we view a scene, our visual system does not just hand our consciousness a high-level summary of the objects and composition; we also have conscious access to the whole visual hierarchy.

We can "grab" an object with our attention in the higher-level areas, like the inferior temporal (IT) cortex and the fusiform face area (FFA), and access all of the contours and textures that are coded in the lower-level areas like V1 and V2.

If we lacked this ability to access the entire visual hierarchy, we would either not have conscious access to the low-level details of the visual scene, or the dimensionality would have to explode in the higher-level areas as they attempted to convey all of this information. That would require our brains to be substantially larger and to consume more energy.

This distribution of information about the visual scene across the visual system means that the components or objects of the scene must be bound together in some manner. For years, there have been two main factions on how this is done: one faction argued that object binding uses neural oscillations (or, more generally, synchrony) to bind object parts together, and the other faction argued that increases in neural firing rate are sufficient to bind the attended objects. My academic background, under the tutelage of Rüdiger von der Heydt, Ernst Niebur, and Pieter Roelfsema, puts me firmly in the latter camp.

Von der Malsburg and Schneider proposed the neural oscillation binding hypothesis in 1986 (see [2] for a review), in which every object gets its own temporal tag.

In this framework, when you look at an image with two puppies, all of the neurons throughout the visual system encoding the first puppy would fire at one phase of the oscillation, while the neurons encoding the other puppy would fire at a different phase. Evidence for this type of binding was found in anesthetized cats; however, anesthesia increases oscillations in the brain.

In the firing rate framework, neurons encoding attended objects fire at a higher rate than those encoding unattended objects, and neurons encoding attended or unattended objects fire at a higher rate than those encoding the background. This has been shown repeatedly and robustly in awake animals [3].

Initially, there were more experiments supporting the neural synchrony or oscillation hypotheses, but over time more evidence has accumulated for the increased firing rate binding hypothesis.

The focus of Li's paper is whether deep learning models exhibit object binding. They convincingly argue that ViT networks trained with self-supervised learning naturally learn to bind objects, while those trained via supervised classification (ImageNet) do not. The failure of supervised training to teach object binding suggests, in my opinion, that there is a fundamental weakness to a single backpropagated global loss. Without carefully tuning this training paradigm, you get a system that takes shortcuts and (for instance) learns textures instead of objects, as shown by Geirhos et al. [4]. As a result, you get models that are fragile to adversarial attacks and only learn something when it has a significant impact on the final loss function. Fortunately, self-supervised learning works quite well as it stands, without my more radical takes, and is able to reliably learn object binding.

2. Methods

2.1. The Architecture: Vision Transformers (ViT)

I'm going to review the Vision Transformer (ViT; [5]) in this section, so feel free to skip it if you don't need to brush up on this architecture. After its introduction, many additional vision transformer architectures followed, like the Swin Transformer and various hybrid convolutional transformers, such as CoAtNet and the Convolutional Vision Transformer (CvT). Nonetheless, the research community keeps coming back to ViT. Part of the reason is that ViT is well suited to current self-supervised approaches, such as Masked Autoencoding (MAE) and I-JEPA (Image Joint Embedding Predictive Architecture).

Figure 1. ViT architecture, shown performing classification. Created by the author, photo with puppies by Nano Banana.

ViT splits the image into a grid of patches that are converted into tokens. Tokens in ViT are just feature vectors, while tokens in other transformers may be discrete. For Li's paper, the authors resized the images to \(224 \times 224\) pixels and then split them into a grid of \(16 \times 16\) patches (\(14 \times 14\) pixels per patch). Each patch is then converted into a token by flattening it and applying a learned linear projection.

The positions of the patches in the image are added as positional embeddings using elementwise addition. For classification, the sequence of tokens is prepended with a special, learned classification token. So, if there are \(W \times H\) patches, there are \(1 + W \times H\) input tokens, and likewise \(1 + W \times H\) output tokens from the core ViT model. The first token of the output sequence, which corresponds to the classification token, is passed to the classification head to produce the classification; all of the remaining output tokens are ignored for the classification task. Through training, the network learns to encode the global context of the image needed for classification into this token.
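To make this tokenization concrete, here is a minimal PyTorch-style sketch using the paper's \(224 \times 224\) image and 14-pixel-patch setup. This is my own illustration, not the authors' code; `PatchEmbed` and its default arguments are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT-style tokenizer: split the image into patches, flatten and
    project each patch, prepend a [CLS] token, and add positional embeddings."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2        # 16 x 16 = 256
        # A strided convolution is the standard trick for "flatten + linear projection" per patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learned classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.num_patches, dim))

    def forward(self, x):                                       # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)        # (B, 256, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)         # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1)                # (B, 257, dim)
        return tokens + self.pos_embed                          # elementwise positional addition
```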

The tokens are passed through the transformer encoder while the length of the sequence stays the same. There is an implied correspondence between an input token and the token at the same position throughout the network. While there is no guarantee of what the tokens in the middle of the network will be encoding, this can be influenced by the training method. A dense task, like MAE, enforces the correspondence between the \(i\)-th token of the input sequence and the \(i\)-th token of the output sequence. A task with a coarse signal, like classification, won't teach the network to maintain this correspondence.

2.2. The Training Regimes: Self-Supervised Learning (SSL)

You don't necessarily need to know the details of the self-supervised learning methods used in the Li et al. NeurIPS 2025 paper to appreciate the results. They argue that the results apply to all of the SSL methods they tried: DINO, MAE, and CLIP.

DINOv2 was the first SSL method the authors tested and the one they focused on. DINO works by degrading the image with cropping and data augmentations. The basic idea is that the model learns to extract the essential information from the degraded views and match it to the full original image. There is some complexity in that there is a teacher network, which is an exponential moving average (EMA) of the student network. This is less likely to collapse than if the student network itself were used to generate the training signal.
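Here is a minimal sketch of that student/teacher idea. It is my own simplification: it omits DINO's output centering, the multi-crop bookkeeping, and DINOv2's patch-level objective, and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    """Teacher weights are an exponential moving average of the student weights."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.data.mul_(momentum).add_(ps.data, alpha=1 - momentum)

def dino_loss(student_out, teacher_out, student_temp=0.1, teacher_temp=0.04):
    """The student, which saw a degraded (cropped/augmented) view, is trained to
    match the teacher's sharpened output distribution on the full view."""
    t = F.softmax(teacher_out.detach() / teacher_temp, dim=-1)   # no gradient to the teacher
    s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()                           # cross-entropy between distributions
```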

MAE is a type of Masked Image Modeling (MIM). It drops a certain percentage of the tokens (patches) from the input sequence. Because the tokens include positional encoding, this is straightforward to do. This reduced set of tokens is passed through the encoder, and the resulting tokens are then passed through a transformer decoder to try to "inpaint" the missing tokens. The loss signal comes from comparing the predicted patches against the ground-truth patches (in MAE, the reconstruction loss is computed on the masked patches).
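As a rough sketch of the masking step, here is my own illustration (not the MAE reference code; the real implementation also keeps the shuffle indices so the decoder can restore the original order, and handles the [CLS] token separately):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens, MAE-style; the encoder only sees these."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)       # one random score per token
    keep_idx = noise.argsort(dim=1)[:, :num_keep]        # indices of tokens to keep
    visible = torch.gather(tokens, 1,
                           keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                             # (B, num_keep, D), kept positions
```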

CLIP relies on captioned images, such as those scraped from the web. It aligns a text encoder and an image encoder, training them simultaneously. I won't spend a lot of time describing it here, but one thing to point out is that its training signal is coarse (based on the whole image and the whole caption). The training data is web-scale, rather than limited to ImageNet, and while the signal is coarse, the target feature vectors are not (e.g., they are not one-hot encoded). So, while CLIP is considered self-supervised, it does use a weakly supervised signal in the form of the captions.
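For reference, here is a sketch of CLIP's symmetric contrastive objective. This is a simplified illustration assuming a batch of matched image/caption pairs; the real CLIP learns the temperature as a parameter rather than fixing it.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Each image should match its own caption (and vice versa) within the batch."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = img @ txt.t() / temperature                  # (B, B) cosine-similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +      # image -> caption
                  F.cross_entropy(logits.t(), targets))   # caption -> image
```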

2.3. Probes

Figure 2. Two puppies, with patches on different and same "objects" (puppies). Created by the author, image by Nano Banana.

As shown in Figure 2, a probe or test that is able to discriminate object binding needs to determine that the blue patches are from the same puppy and that the red and blue patches are from different puppies. You could create a test like cosine similarity between the patches and find that it does quite well on your test set. But is it really detecting object binding, and not low-level or class-based features? Many of the images probably aren't as complex as this one. So you want a probe that is like the cosine similarity test, but also some kind of strong baseline that is able to, for instance, tell whether the patches belong to the same semantic class, but not necessarily whether they belong to the same instance.
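Here is what such a naive test might look like (purely illustrative; `same_object_score` is a hypothetical helper, not one of the paper's probes):

```python
import torch
import torch.nn.functional as F

def same_object_score(tokens, i, j):
    """Naive binding test: cosine similarity between patch tokens i and j
    taken from one layer of the ViT. `tokens` has shape (num_patches, dim)."""
    return F.cosine_similarity(tokens[i], tokens[j], dim=0)

# Thresholding this score gives an "is same object" prediction, but a high score
# could just as easily reflect shared texture or class, which is why the paper
# needs strong class-based baselines.
```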

The probes they use that are most similar to cosine similarity are the diagonal quadratic probe and the quadratic probe, where the latter essentially adds another linear layer (sort of like a linear probe, but with two linear projections whose outputs you then take the dot product of). These are the two probes that I would consider to have the potential to detect binding. They also have some object class-based probes that I would consider the strong baselines.

Figure 3. My simplified (poor) reproduction of the paper’s Figure 2. Results on models trained with DINOv2.

In their Figure 2 (my Figure 3), I would pay attention to the quadratic probe curve (magenta) and the overlapping object class curve (orange). The quadratic curve doesn't rise above the object class curves until around layers 10-11 of the 23 layers. The diagonal quadratic curve never reaches above those curves (see the original figure in the paper), meaning that the binding information at least needs a linear layer to project it into an "IsSameObject" subspace.

I go into a little more detail on the probes in the appendix, which I recommend skipping until/unless you read the paper.

3. The Central Claim: Li et al. (2025)

The main claim of their paper is that ViT models trained with self-supervised learning (SSL) naturally learn object binding, while ViT models trained with supervised ImageNet classification exhibit much weaker object binding. Overall, I find their arguments convincing, although, as with all papers, there are areas that could have been improved.

Their arguments are weakened by using the weak baseline of always guessing that two patches are not bound, as shown in their Figure 2. Fortunately, they used a wide selection of probes that includes stronger class-based baselines, and their quadratic probe still performs better than those. I do believe it would be possible to create better tests and/or baselines, like adding positional awareness to the class-based methods. However, I think this is nitpicking, and the object class-based probes do make a reasonably good baseline. Their Figure 4 gives additional reassurance that the probe is detecting object binding, although the distance between patches could still be playing a role.

Their supervised ViT model only achieved 3.7% higher accuracy than the weak baseline, which I would interpret as not having any object binding. There is one complication to this result: models trained with DINOv2 (and MAE) enforce a correspondence between the input tokens and output tokens, while ImageNet classification only trains on the first token, which corresponds to the learned "classify" task token; the remaining output tokens are ignored by the supervised training loss. So the probe assumes that the \(i\)-th token at a given layer corresponds to the \(i\)-th token of the input sequence, which is likely to hold better for the DINOv2-trained models than for the ImageNet-trained classification model.

I think it's an open question whether CLIP and MAE would have shown object binding if they were compared with a stronger baseline. Figure 7 of their appendix doesn't make CLIP's binding signal look that strong, although CLIP, like supervised classification training, doesn't enforce the token correspondence throughout processing. Notably, in both supervised learning and CLIP, the layer with the peak accuracy on same-object prediction is earlier in the network (at relative depths of 0.13 and 0.39 out of 1), while networks that preserve the token correspondence peak later in the network (0.65-1 out of 1).

Going back to mushy biological brains, one of the reasons why binding is a problem is that the representation of an object is distributed across the visual hierarchy. The ViT architecture is fundamentally different in that there is no bidirectionality of information; all the information flows in a single direction, and the representation at lower levels is no longer needed once its information has been passed on. Appendix A3 does show that the quadratic probe has a relatively high accuracy for estimating whether patches from layers 15 and 18 are bound, so it appears that this information is at least present, even though it isn't a bidirectional, recurrent architecture.

4. Conclusion: A New Baseline for "Understanding"?

I think this paper is really quite cool, as it's the first paper I'm aware of that shows evidence of a deep learning model exhibiting the emergent property of object binding. It would be great if the results for the other SSL methods, like MAE, could be shown against the stronger baselines, but the paper at least shows strong evidence that ViTs trained with DINO exhibit object binding. Previous work had suggested that this was not the case. The weakness (or absence) of the object binding signal in ViTs trained on ImageNet classification is also interesting, and it's consistent with the papers suggesting that CNNs trained with ImageNet classification are biased towards texture instead of object shape [4], although ViTs have less texture bias [6], and DINO self-supervision also reduces the texture bias (but possibly not MAE) [7].

There are always things that could be improved in a paper, and that's why science builds on past research, expanding on and testing previous findings. Discriminating object binding from other features is difficult and might require tests like artificial geometric stimuli to establish beyond doubt that object binding was found. Nonetheless, the evidence presented is still quite strong.

Even if you aren't interested in object binding per se, the difference in behavior between ViTs trained with self-supervised and supervised approaches is rather stark and gives us some insight into the training regimes. It suggests that the foundation models we're building are learning in a way that is more similar to the gold standard of real intelligence: humans.


Appendix

Probe Details

I'm adding this section as an appendix because it might be useful if you are going into the paper in more detail. However, I suspect it will be too much detail for most people reading this post. One way to determine whether two tokens are bound might be to calculate the cosine similarity of those tokens. This is just the dot product of the L2-normalized token vectors. Unfortunately, in my opinion, they didn't try L2-normalizing the token vectors, but they did try a weighted dot product, which they call the diagonal quadratic probe:

$$\phi_\text{diag}(x, y) = x^\top \mathrm{diag}(w)\, y$$

The weights \(w\) are learned, so the probe can learn to focus on the dimensions most relevant to binding. While they didn't perform L2-normalization, they did apply layer normalization to the tokens, which centers and rescales each token.
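A minimal PyTorch sketch of this probe, written directly from the formula above (the initialization and training details are my own assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class DiagonalQuadraticProbe(nn.Module):
    """Weighted dot product: phi(x, y) = x^T diag(w) y."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.ones(dim))   # learned per-dimension weights

    def forward(self, x, y):
        # x, y: (batch, dim) layer-normalized patch tokens from the same layer
        return (x * self.w * y).sum(dim=-1)      # elementwise product, then sum = x^T diag(w) y
```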

There is no reason to believe that the object binding property will be nicely segregated in the feature vectors in their current form, so it makes sense to first project them into a new "IsSameObject" subspace and then take their dot product. That is the quadratic probe that they found works so well:

$$\begin{align}
\phi_\text{quad}(x, y) &= W x \cdot W y \\
&= \left( W x \right)^\top W y \\
&= x^\top W^\top W y
\end{align}$$
where \(W \in \mathbb{R}^{k \times d}\) and \(k \ll d\).
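And a corresponding sketch of the quadratic probe (again my own implementation of the formula; the projection dimension \(k\) here is an assumption):

```python
import torch
import torch.nn as nn

class QuadraticProbe(nn.Module):
    """phi(x, y) = (W x) . (W y): project both tokens into a low-dimensional
    "IsSameObject" subspace, then take the dot product."""
    def __init__(self, dim, k=64):               # k << dim; 64 is an illustrative choice
        super().__init__()
        self.W = nn.Linear(dim, k, bias=False)   # shared projection W

    def forward(self, x, y):
        # x, y: (batch, dim) patch tokens; returns (batch,) binding scores
        return (self.W(x) * self.W(y)).sum(dim=-1)
```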

The quadratic probe is significantly better at extracting the binding information than the diagonal quadratic probe. In fact, I would argue that the quadratic probe is the only probe they show that can extract the information on whether the objects are bound or not, as it is the only one that exceeds the strong baseline of the object class-based probes.

I ignored their linear probe, which is a probe I feel they had to include in the paper, but which doesn't really make much sense here. For this probe, they apply a linear layer (an additional layer trained separately) to each of the two tokens and then add the results. The addition is why I think this probe is a distraction: to compare the tokens, there needs to be a multiplication. The quadratic probe is the better analogue of a linear probe when you're comparing two feature vectors.

Bibliography

[1] Y. Li, S. Salehi, L. Ungar and K. P. Kording, Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? (2025), arXiv preprint arXiv:2510.24709

[2] P. R. Roelfsema, Solving the binding problem: Assemblies form when neurons enhance their firing rate—they don’t must oscillate or synchronize (2023), Neuron, 111(7), 1003-1019

[3] J. R. Williford and R. von der Heydt, Border-ownership coding (2013), Scholarpedia, 8(10), 30040

[4] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann and W. Brendel, ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness (2018), International Conference on Learning Representations

[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image is worth 16×16 words: Transformers for image recognition at scale (2020), arXiv preprint arXiv:2010.11929

[6] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. Shahbaz Khan and M. H. Yang, Intriguing properties of vision transformers (2021), Advances in Neural Information Processing Systems, 34, 23296-23308

[7] N. Park, W. Kim, B. Heo, T. Kim and S. Yun, What do self-supervised vision transformers learn? (2023), arXiv preprint arXiv:2305.00729
