Intro
This project is about improving zero-shot classification of images and text using CV/LLM models without spending money and time on fine-tuning and training, or on re-running models in inference. It uses a novel dimensionality reduction technique on embeddings and determines classes using tournament-style pairwise comparison. It resulted in a rise in text/image agreement from 61% to 89% for a 50k dataset over 13 classes.
https://github.com/doc1000/pairwise_classification
Where you’ll use it
The practical application is large-scale class search where speed of inference is essential and minimizing model cost is a priority. It is also useful for finding errors in your annotation process: misclassifications in a big database.
Results
The weighted F1 score comparing the text and image class agreement went from 61% to 89% for ~50k items across 13 classes. A visual inspection also validated the results.
| F1 score (weighted) | base model | pairwise |
|---|---|---|
| Multiclass | 0.613 | 0.889 |
| Binary | 0.661 | 0.645 |
Left: base model (full embedding, argmax on cosine similarity).
Right: pairwise tourney model using feature sub-segments scored by cross-ratio.
Image by author
Method: Pairwise comparison of cosine similarity of embedding sub-dimensions determined by mean-scale scoring
A simple approach to vector classification is to compare image/text embeddings to class embeddings using cosine similarity. It's relatively quick and requires minimal overhead. You can also run a classification model on the embeddings (logistic regression, trees, SVM) and target the class without further embeddings.
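For reference, a minimal sketch of that baseline in NumPy (the names `item_embs` and `class_embs` are placeholders for pre-computed CLIP embeddings, not names from the repo):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between a [n, d] and b [k, d] -> [n, k]."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# item_embs: [n, 768] CLIP embeddings of images (or text descriptions)
# class_embs: [k, 768] CLIP text embeddings of the class prompts
def baseline_zero_shot(item_embs: np.ndarray, class_embs: np.ndarray) -> np.ndarray:
    """Assign each item the class with the highest cosine similarity."""
    return cosine_sim(item_embs, class_embs).argmax(axis=1)
```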
My approach was to reduce the number of features in the embeddings by determining which feature distributions were substantially different between two classes, and thus contributed information with less noise. For scoring features, I used a derivation of variance that encompasses two distributions, which I refer to as cross-variance (more below). I used this to get important dimensions for the 'clothing' category (one vs. the rest) and re-classified using the sub-features, which showed some improvement in model power. However, the sub-feature comparison showed better results when comparing classes pairwise (one vs. one, head to head). Separately for images and text, I built an array-wide 'tournament' style bracket of pairwise comparisons, until a final class was determined for each item. It ends up being fairly efficient. I then scored the agreement between the text and image classifications.
Using cross-variance, pair-specific feature selection and pairwise tourney assignment.

I'm using a product image database that was available with pre-calculated CLIP embeddings (thanks SQID (cited below; this dataset is released under the MIT License) and AMZN (cited below; this dataset is licensed under Apache License 2.0)) and targeting the clothing images because that's where I first saw this effect (thanks DS team at Nordstrom). The dataset was narrowed down from 150k items/images/descriptions to ~50k clothing items using zero-shot classification, then the augmented classification based on targeted subarrays.

Test Statistic: Cross Variance
This is a method to determine how different the distribution is for two different classes when targeting a single feature/dimension. It's a measure of the combined average variance if each element of each distribution were dropped into the other distribution. It's an expansion of the math of variance/standard deviation, but between two distributions (which can be of different sizes). I have not seen it used before, though it may be listed under a different moniker.
Cross Variance:

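The equation image didn't carry over; reconstructing from the description below (the 1/2 factor is my assumption, chosen so that A = B recovers ordinary variance), cross-variance between samples A = {aᵢ} and B = {bⱼ} would be:

$$\varsigma^2_{AB} \;=\; \frac{1}{2\,n_A n_B}\sum_{i=1}^{n_A}\sum_{j=1}^{n_B}\left(a_i - b_j\right)^2$$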
Similar to variance, except summing over both distributions and taking the difference of each pair of values instead of the difference from the mean of the single distribution. If you input the same distribution as A and B, it yields the same result as variance.
This simplifies to:

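Again reconstructing from the prose that follows, the double sum reduces to simple moments of the two samples:

$$\varsigma^2_{AB} \;=\; \frac{\overline{a^2} + \overline{b^2}}{2} \;-\; \bar{a}\,\bar{b}$$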
This is comparable to the alternate definition of variance (the mean of the squares minus the square of the mean) for a single distribution when the distributions i and j are equal. Using this version is massively faster and more memory efficient than attempting to broadcast the arrays directly. I'll provide the proof and go into more detail in another write-up. Cross deviation (ς) is the square root of cross variance.
To score features, I use a ratio. The numerator is cross variance. The denominator is the product of the two standard deviations, the same as the denominator of Pearson correlation. Then I take the root (I could just as easily use cross variance, which would compare more directly with covariance, but I've found the ratio to be more compact and interpretable using cross deviation).

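As I read the description above, the feature score (notation assumed) would be:

$$\text{cross ratio} \;=\; \sqrt{\frac{\varsigma^2_{AB}}{\sigma_A\,\sigma_B}}$$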
I interpret this as the increase in combined standard deviation if you swapped classes for each item. A large number means the feature distribution is likely quite different for the two classes.

Image by author
This is an alternate mean-scale difference test; the KS test, Bayesian two-distribution tests and Fréchet Inception Distance are alternatives. I like the elegance and novelty of cross-variance. I'll likely follow up with other differentiators. I should note that determining distributional differences for a normalized feature with overall mean 0 and sd = 1 is its own challenge.
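For concreteness, a minimal NumPy sketch of per-feature scoring under the definition above (the 1/2 normalization and the function names are my assumptions, not taken from the repo):

```python
import numpy as np

def cross_variance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cross-variance per feature for two samples a [n_a, d] and b [n_b, d].
    Moment form: avoids broadcasting an [n_a, n_b, d] array of pairwise differences."""
    return 0.5 * ((a**2).mean(axis=0) + (b**2).mean(axis=0)) - a.mean(axis=0) * b.mean(axis=0)

def cross_ratio(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Root of cross-variance over the product of the two standard deviations."""
    return np.sqrt(cross_variance(a, b) / (a.std(axis=0) * b.std(axis=0)))

# Example: score all 768 dimensions for a 'clothing' vs 'not clothing' split,
# then keep the ~100 highest-scoring dimensions.
# scores = cross_ratio(embs[labels == "clothing"], embs[labels != "clothing"])
# top_dims = np.argsort(scores)[::-1][:100]
```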
Sub-dimensions: dimensionality reduction of embedding space for classification
When you are trying to find a characteristic of an image, do you need the whole embedding? Is color, or whether something is a shirt or a pair of pants, located in a narrow section of the embedding? If I'm searching for a shirt, I don't necessarily care if it's blue or red, so I just look at the dimensions that define 'shirtness' and throw out the dimensions that define color.

Image by author
I'm taking an [n, 768]-dimensional embedding and narrowing it down to closer to 100 dimensions that actually matter for a particular class pair. Why? Because the cosine similarity metric (cosim) is influenced by the noise of the relatively unimportant features. The embedding carries a tremendous amount of information, much of which you simply don't care about in a classification problem. Get rid of the noise and the signal gets stronger: cosim increases with the elimination of 'unimportant' dimensions.

Image by author
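The mechanical part is trivial: slice both item and class embeddings down to the selected dimensions before taking cosine similarity (a sketch reusing the `cosine_sim` helper from the baseline snippet above):

```python
def subset_cosine_sim(item_embs, class_embs, dims):
    """Cosine similarity computed only over the selected embedding dimensions."""
    return cosine_sim(item_embs[:, dims], class_embs[:, dims])
```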
For a pairwise comparison, first split items into classes using standard cosine similarity applied to the full embedding. I exclude some items that show very low cosim, on the assumption that model skill is low for those items (cosim limit). I also exclude items that show low differentiation between the two classes (cosim diff). The result is two distributions from which to extract the important dimensions that should define the 'true' difference between the classifications:

Image by author
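Putting those filters together, a sketch of pair-specific dimension selection (threshold values, argument names and the reuse of the `cosine_sim`/`cross_ratio` helpers above are illustrative assumptions):

```python
def select_pair_dims(item_embs, class_embs, idx_a, idx_b,
                     cosim_limit=0.2, cosim_diff=0.02, n_dims=100):
    """Pick the embedding dimensions that best separate classes idx_a and idx_b."""
    sims = cosine_sim(item_embs, class_embs[[idx_a, idx_b]])   # [n_items, 2]
    assigned = sims.argmax(axis=1)                              # 0 -> class a, 1 -> class b
    keep = (sims.max(axis=1) > cosim_limit) & \
           (np.abs(sims[:, 0] - sims[:, 1]) > cosim_diff)       # drop low-skill / ambiguous items
    a = item_embs[keep & (assigned == 0)]
    b = item_embs[keep & (assigned == 1)]
    scores = cross_ratio(a, b)                                  # per-feature separation score
    return np.argsort(scores)[::-1][:n_dims]                    # indices of the top-scoring dimensions
```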
Array Pairwise Tourney Classification
Getting a global class assignment out of pairwise comparisons requires some thought. You could take the given assignment and compare just that class to all the others. If there was good skill in the initial assignment, this would work well, but if multiple alternate classes are superior, you run into trouble. A cartesian approach where you compare all vs. all would get you there, but would get big quickly. I settled on an array-wide 'tournament' style bracket of pairwise comparisons.

This has log₂(#classes) rounds, and the total number of comparisons maxes out at the sum over rounds of (class pair match-ups in the round) × n_items, across some specified number of features. I randomize the ordering of 'teams' each round so the comparisons aren't the same every time. It has some match-up risk but gets to a winner quickly. It's built to handle an array of comparisons at each round, rather than iterating over items.
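A rough sketch of that bracket logic, vectorized over items within each round (the shuffling scheme, the `compare_pair` callback and all names are my assumptions, not the repo's implementation):

```python
import numpy as np

def pairwise_tourney(item_embs, class_embs, compare_pair, rng=None):
    """Single-elimination bracket over classes, run for all items at once.

    compare_pair(item_embs, ca, cb) -> boolean array [n_items], True where each
    item's class ca beats class cb (e.g. cosim over the pair-specific
    sub-dimensions from the select_pair_dims sketch above)."""
    rng = rng or np.random.default_rng()
    n_items, n_classes = item_embs.shape[0], class_embs.shape[0]
    # every item starts with every class as a live candidate, one per bracket slot
    slots = np.tile(np.arange(n_classes), (n_items, 1))          # [n_items, n_slots]
    while slots.shape[1] > 1:
        order = rng.permutation(slots.shape[1])                  # shuffle match-ups each round
        slots = slots[:, order]
        byes = slots[:, -1:] if slots.shape[1] % 2 else slots[:, :0]   # odd count: last slot gets a bye
        slots = slots[:, :-1] if slots.shape[1] % 2 else slots
        winners = []
        for s in range(0, slots.shape[1], 2):
            ca, cb = slots[:, s], slots[:, s + 1]
            a_wins = compare_pair(item_embs, ca, cb)             # one vectorized comparison per slot pair
            winners.append(np.where(a_wins, ca, cb))
        slots = np.concatenate([np.stack(winners, axis=1), byes], axis=1)
    return slots[:, 0]                                           # final class index per item
```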
Scoring
Finally, I scored the process by determining whether the classifications from text and images match. As long as the distribution isn't heavily overweight toward a 'default' class (it isn't), this should be a reasonable assessment of whether the process is pulling real information out of the embeddings.
I looked at the weighted F1 score comparing the classes assigned using the image vs. the text description. The assumption is that the higher the agreement, the more likely the classification is correct. For my dataset of ~50k images and text descriptions of clothing with 13 classes, the score went from 42% for the simple full-embedding cosine similarity model, to 55% for the sub-feature cosim, to 89% for the pairwise model with sub-features. A visual inspection also validated the results. The binary classification wasn't the primary goal; it was largely to get a sub-segment of the data to then test multi-class boosting.
| | base model | pairwise |
|---|---|---|
| Multiclass | 0.613 | 0.889 |
| Binary | 0.661 | 0.645 |
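The agreement score itself is just a weighted F1 between the two label sets, e.g. with scikit-learn (variable names assumed):

```python
from sklearn.metrics import f1_score

# text_classes / image_classes: per-item class indices from the two tourneys
agreement = f1_score(text_classes, image_classes, average="weighted")
```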

Image by author

Image by author, using code from Nils Flaschel
Final Thoughts…
This may be a method for finding errors in large subsets of annotated data, or doing zero-shot labeling without extensive extra GPU time for fine-tuning and training. It introduces some novel scoring and approaches, but the overall process isn't overly complicated or CPU/GPU/memory intensive.
Follow-up will be applying it to other image/text datasets, as well as annotated/categorized image or text datasets, to determine whether scoring is boosted. In addition, it would be interesting to determine whether the boost in zero-shot classification for this dataset changes substantially if:
- Other scoring metrics are used instead of the cross-deviation ratio
- Full feature embeddings are substituted for targeted features
- The pairwise tourney is replaced by another approach
I hope you find it useful.
Citations
@article{reddy2022shopping, title={Shopping Queries Dataset: A Large-Scale {ESCI} Benchmark for Improving Product Search}, author={Chandan K. Reddy and Lluís Màrquez and Fran Valero and Nikhil Rao and Hugo Zaragoza and Sambaran Bandyopadhyay and Arnab Biswas and Anlu Xing and Karthik Subbian}, year={2022}, eprint={2206.06588}, archivePrefix={arXiv}}
Shopping Queries Image Dataset (SQID): An Image-Enriched ESCI Dataset for Exploring Multimodal Learning in Product Search, M. Al Ghossein, C.W. Chen, J. Tang
