R.E.D.: Scaling Text Classification with Expert Delegation

-

With the new age of problem-solving augmented by Large Language Models (LLMs), only a handful of problems remain that have subpar solutions. Most classification problems (at a PoC level) can be solved by leveraging LLMs at 70–90% Precision/F1 with just good prompt engineering techniques, as well as adaptive in-context-learning (ICL) examples.

What happens when you must consistently achieve performance higher than that — when prompt engineering no longer suffices?

The classification conundrum

Text classification is one of the oldest and most well-understood examples of supervised learning. Given this premise, it should really not be hard to build robust, well-performing classifiers that handle a large number of input classes, right…?

Welp. It is.

It actually has a lot more to do with the ‘constraints’ that the algorithm is generally expected to work under:

  • low amount of training data per class
  • high classification accuracy (that plummets as you add more classes)
  • possible addition of new classes to an existing subset of classes
  • quick training/inference
  • cost-effectiveness
  • (potentially) really large number of training classes
  • (potentially) endless required retraining of classes due to data drift, etc.

Ever tried building a classifier beyond a few dozen classes under these conditions? (I mean, even GPT could probably do a great job up to ~30 text classes with just a few samples…)

Assuming you take the GPT route — if you have more than a couple dozen classes or a sizeable amount of data to be classified, you are going to have to reach deep into your pockets with the system prompt, user prompt, and few-shot example tokens that you will need to classify one sample. That’s after making peace with the throughput of the API, even if you are running async queries.
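
To make the cost point concrete, here is a back-of-the-envelope sketch; every number in it (tokens per prompt, per-token prices, sample counts) is an illustrative assumption, not a measurement from this work:

# Back-of-the-envelope estimate -- all numbers below are illustrative assumptions
samples_to_classify = 100_000
prompt_tokens_per_sample = 1_500        # system prompt + class definitions + few-shot examples
completion_tokens_per_sample = 10       # just the predicted label
usd_per_million_input_tokens = 2.50     # assumed rate; varies by provider and model
usd_per_million_output_tokens = 10.00   # assumed rate

input_cost = samples_to_classify * prompt_tokens_per_sample / 1e6 * usd_per_million_input_tokens
output_cost = samples_to_classify * completion_tokens_per_sample / 1e6 * usd_per_million_output_tokens
print(f"~${input_cost + output_cost:,.2f} for a single pass over the data")  # ~$385 under these assumptions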

In applied ML, problems like these are generally tricky to solve since they don’t fully satisfy the requirements of supervised learning or aren’t cheap/fast enough to be run via an LLM. This particular pain point is what the R.E.D. algorithm addresses: semi-supervised learning, when the training data per class is not enough to build (quasi)traditional classifiers.

The R.E.D. algorithm

R.E.D.: Recursive Expert Delegation is a novel framework that changes how we approach text classification. This is an applied ML paradigm — i.e., there is no fundamentally new architecture beyond what already exists, but it’s a highlight reel of ideas that work best to build something that is practical and scalable.

In this post, we will be working through a specific example where we have a large number of text classes (100–1000), each class only has a few samples (30–100), and there is a non-trivial number of samples to classify (10,000–100,000). We approach this as a semi-supervised learning problem via R.E.D.

Let’s dive in.

How it works

A simple representation of what R.E.D. does

Instead of having a single classifier classify between a large number of classes, R.E.D. intelligently:

  1. Divides and conquers — Breaks the label space (large number of input labels) into multiple subsets of labels. This is a greedy label subset formation approach.
  2. Learns efficiently — Trains specialized classifiers for each subset. This step focuses on building a classifier that oversamples on noise, where noise is intelligently modeled as data from classes outside the subset.
  3. Delegates to an expert — Employs LLMs as expert oracles for specific label validation and correction only, similar to having a team of domain experts. Using an LLM as a proxy, it empirically ‘mimics’ how a human expert validates an output.
  4. Recursive retraining — Repeatedly retrains with fresh samples added back from the expert until there are no more samples to be added, or saturation in information gain is achieved.

The intuition behind it is not very hard to grasp: Active Learning employs humans as domain experts to consistently ‘correct’ or ‘validate’ the outputs from an ML model, with continuous training. This stops when the model achieves acceptable performance. We intuit and rebrand the same, with a few clever innovations that will be detailed in a research pre-print later.

Let’s take a deeper look…

Greedy subset selection with least similar elements

When the number of input labels (classes) is high, the complexity of learning a linear decision boundary between classes increases. As such, the quality of the classifier deteriorates as the number of classes increases. This is especially true when the classifier does not have enough samples to learn from — i.e., each of the training classes has only a few samples.

This is very reflective of a real-world scenario, and the primary motivation behind the creation of R.E.D.

Some ways of improving a classifier’s performance under these constraints:

  • Restrict the number of classes a classifier needs to classify between
  • Make the decision boundary between classes clearer, i.e., train the classifier on highly dissimilar classes

Greedy Subset Selection does exactly this — since the scope of the problem is Text Classification, we form embeddings of the training labels, reduce their dimensionality via UMAP, then form subsets from them. Each of the subsets has (at most) n elements as training labels. We pick training labels greedily, making sure that every label we pick for the subset is the most dissimilar label w.r.t. the other labels that already exist in the subset:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def avg_embedding(candidate_embeddings):
    return np.mean(candidate_embeddings, axis=0)

def get_least_similar_embedding(target_embedding, candidate_embeddings):
    # cosine_similarity expects 2-D arrays, so reshape the 1-D target embedding
    similarities = cosine_similarity(
        np.asarray(target_embedding).reshape(1, -1),
        np.asarray(candidate_embeddings),
    )[0]
    least_similar_index = np.argmin(similarities)  # use argmin to find the index of the minimum
    least_similar_element = candidate_embeddings[least_similar_index]
    return least_similar_element


def get_embedding_class(embedding, embedding_map):
    # numpy arrays are not hashable, so compare element-wise instead of building a reverse map
    for cls, emb in embedding_map.items():
        if np.array_equal(emb, embedding):
            return cls
    return None  # handle missing embeddings gracefully


def select_subsets(embeddings, n):
    # embeddings: {class_name: average embedding of that class's training samples}
    # n: maximum number of labels per subset
    visited = {cls: False for cls in embeddings}
    subsets = []
    current_subset = []

    while not all(visited.values()):
        if not current_subset:
            # seed the subset with the first unvisited class
            seed_cls = next(cls for cls, done in visited.items() if not done)
            current_subset.append(embeddings[seed_cls])
            visited[seed_cls] = True
        elif len(current_subset) >= n:
            # the subset is full -- store it and start a new one
            subsets.append(current_subset.copy())
            current_subset = []
        else:
            # pick the unvisited label least similar to the subset's average embedding
            subset_average = avg_embedding(current_subset)
            remaining_embeddings = {cls_: emb for cls_, emb in embeddings.items() if not visited[cls_]}
            if not remaining_embeddings:
                break  # handle edge case

            least_similar = get_least_similar_embedding(
                target_embedding=subset_average,
                candidate_embeddings=list(remaining_embeddings.values()),
            )
            visited_class = get_embedding_class(least_similar, remaining_embeddings)
            if visited_class is not None:
                visited[visited_class] = True

            current_subset.append(least_similar)

    if current_subset:  # add any remaining elements in current_subset
        subsets.append(current_subset)

    return subsets

The result of this greedy subset sampling is all the training labels clearly boxed into subsets, where each subset has at most only n classes. This inherently makes the job of a classifier easier, compared to the full set of original classes it would have to classify between otherwise!
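
For context, here is a minimal usage sketch of how the per-label embeddings could be built and fed into select_subsets — the embedding model, the UMAP settings, and the label_to_samples dictionary are all illustrative assumptions, not the exact setup used in our experiments:

# Illustrative usage -- the model name, UMAP parameters, and label_to_samples dict are assumptions
import numpy as np
import umap
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedding model works here

def build_label_embeddings(label_to_samples, n_components=16):
    # label_to_samples: {label: [example texts...]} for all training labels
    labels = list(label_to_samples.keys())
    # one representative vector per label: the average of its sample embeddings
    label_vectors = np.stack([
        encoder.encode(label_to_samples[lbl]).mean(axis=0) for lbl in labels
    ])
    # reduce dimensionality so label-to-label similarities are better behaved
    reduced = umap.UMAP(n_components=n_components, metric="cosine").fit_transform(label_vectors)
    return {lbl: vec for lbl, vec in zip(labels, reduced)}

embeddings = build_label_embeddings(label_to_samples)  # hypothetical data dict
subsets = select_subsets(embeddings, n=10)             # at most 10 labels per classifier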

Semi-supervised classification with noise oversampling

Cascade this after the initial label subset formation — i.e., this classifier is only classifying between a given subset of classes.

Picture this: when you have low amounts of training data, you absolutely cannot create a hold-out set that is meaningful for evaluation. Should you do it at all? How do you know if your classifier is working well?

We approached this problem slightly differently — we defined the fundamental job of a semi-supervised classifier to be pre-emptive classification of a sample. That means, regardless of what a sample gets classified as, it will be ‘verified’ and ‘corrected’ at a later stage: this classifier only needs to identify what needs to be verified.

As such, we created a design for how it would treat its data:

  • n + 1 classes, where the last class is noise
  • noise: data from classes that are NOT in the current classifier’s purview. The noise class is oversampled to be 2x the average size of the data for the classifier’s labels

Oversampling on noise is a faux-safety measure, to make sure that adjacent data that belongs to another class is most likely predicted as noise instead of slipping through for verification.
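
As a rough sketch of what this data design can look like in code — the helper name, the data layout, and the noise label string are assumptions for illustration:

import random

def build_training_set(subset_labels, label_to_samples, noise_label="__noise__"):
    # subset_labels: the labels assigned to this classifier's subset
    # label_to_samples: {label: [example texts...]} over ALL labels
    rows = []
    for lbl in subset_labels:
        rows.extend((text, lbl) for text in label_to_samples[lbl])

    # noise = data from labels OUTSIDE this subset, oversampled to 2x the
    # average per-label sample count within the subset
    avg_size = len(rows) // len(subset_labels)
    outside = [text for lbl, texts in label_to_samples.items()
               if lbl not in subset_labels for text in texts]
    noise = random.sample(outside, min(2 * avg_size, len(outside)))
    rows.extend((text, noise_label) for text in noise)

    random.shuffle(rows)
    return rows  # list of (text, label) pairs, ready for any standard classifier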

How do you check if this classifier is working well? In our experiments, we define this as the number of ‘uncertain’ samples in a classifier’s prediction. Using uncertainty sampling and information gain principles, we were effectively able to gauge if a classifier is ‘learning’ or not, which acts as a pointer towards classification performance. This classifier is consistently retrained unless there is an inflection point in the number of uncertain samples predicted, or there is only a marginal delta of information being added iteratively by new samples.
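
One possible way to operationalize that stopping rule — the round count and tolerance threshold below are assumed values, not the ones from our experiments:

def should_stop_retraining(uncertain_counts, min_rounds=3, tolerance=0.05):
    # uncertain_counts: number of 'uncertain' predictions recorded after each retraining round
    # stop once the curve of uncertain samples has plateaued (saturation of information gain)
    if len(uncertain_counts) < min_rounds:
        return False
    prev, curr = uncertain_counts[-2], uncertain_counts[-1]
    if prev == 0:
        return True
    relative_change = abs(prev - curr) / prev
    return relative_change < tolerance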

Proxy active learning via an LLM agent

This is the heart of the approach — using an LLM as a proxy for a human validator. The human validator approach we are talking about here is Active Labelling.

Let’s get an intuitive understanding of Active Labelling:

  • Use an ML model to learn on a sample input dataset, predict on a large set of datapoints
  • For the predictions given on the datapoints, a subject-matter expert (SME) evaluates the ‘validity’ of predictions
  • Recursively, new ‘corrected’ samples are added as training data to the ML model
  • The ML model consistently learns/retrains, and makes predictions until the SME is satisfied with the quality of predictions

For Active Labelling to work, there are expectations involved for an SME:

  • when we expect a human expert to ‘validate’ an output sample, the expert understands what the task is
  • a human expert will use judgement to evaluate ‘what else’ definitely belongs to a label L when deciding if a new sample should belong to L

Given these expectations and intuitions, we can ‘mimic’ these using an LLM:

  • Give the LLM an ‘understanding’ of what each label means. This can be done by using a larger model to critically evaluate the relationship between {label: data mapped to label} for all labels. In our experiments, this was done using a 32B variant of DeepSeek that was self-hosted.
Giving an LLM the capability to understand ‘why, what, and how’
  • Instead of predicting what is the correct label, leverage the LLM to identify if a prediction is ‘valid’ or ‘invalid’ only (i.e., the LLM only has to answer a binary question).
  • Reinforce the idea of what other valid samples for the label look like, i.e., for every pre-emptively predicted label for a sample, dynamically source the closest samples in its training (guaranteed valid) set when prompting for validation.

The result? A cost-effective framework that relies on a fast, cheap classifier to make pre-emptive classifications, and an LLM that verifies these using (meaning of the label + dynamically sourced training samples that are similar to the current classification):

import math

def calculate_uncertainty(clf, sample):
    predicted_probabilities = clf.predict_proba(sample.reshape(1, -1))[0]  # reshape sample for predict_proba
    # Shannon entropy of the predicted distribution; skip zero probabilities to avoid log(0)
    uncertainty = -sum(p * math.log(p, 2) for p in predicted_probabilities if p > 0)
    return uncertainty


def select_informative_samples(clf, data, k):
    informative_samples = []
    uncertainties = [calculate_uncertainty(clf, sample) for sample in data]

    # Sort data by descending order of uncertainty
    sorted_data = sorted(zip(data, uncertainties), key=lambda x: x[1], reverse=True)

    # Get top k samples with highest uncertainty
    for sample, uncertainty in sorted_data[:k]:
        informative_samples.append(sample)

    return informative_samples


def proxy_label(clf, llm_judge, k, testing_data):
    # llm_judge - any LLM with a system prompt tuned for verifying if a sample belongs to a class.
    # Expected output is a bool: True or False. True verifies the original classification, False refutes it.
    predicted_classes = clf.predict(testing_data)

    # Select k most informative samples using uncertainty sampling
    informative_samples = select_informative_samples(clf, testing_data, k)

    # List to store correct samples
    voted_data = []

    # Evaluate informative samples with the LLM judge
    for sample in informative_samples:
        sample_index = testing_data.tolist().index(sample.tolist())  # list-based lookup, since numpy arrays don't support .index()
        predicted_class = predicted_classes[sample_index]

        # Check if LLM judge agrees with the prediction
        if llm_judge(sample, predicted_class):
            # If correct, add the sample to voted data
            voted_data.append(sample)

    # Return the list of correct samples with proxy labels
    return voted_data
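
The llm_judge callable is left abstract above. A minimal sketch of what it could look like follows — the OpenAI-style client, the model name, the prompt wording, and the get_similar_training_samples helper are all assumptions (in our experiments a self-hosted model was used, and in practice you would map the classifier's feature vector back to the original text before prompting):

from openai import OpenAI

client = OpenAI()  # placeholder client; any chat-completion-compatible endpoint works

def make_llm_judge(label_descriptions, get_similar_training_samples, model="gpt-4o-mini"):
    # label_descriptions: {label: natural-language meaning of the label}
    # get_similar_training_samples: hypothetical helper returning the closest guaranteed-valid samples
    def llm_judge(sample_text, predicted_class):
        similar = get_similar_training_samples(predicted_class, sample_text, k=5)
        prompt = (
            f"Label: {predicted_class}\n"
            f"Label meaning: {label_descriptions[predicted_class]}\n"
            "Known valid examples of this label:\n- " + "\n- ".join(similar) + "\n\n"
            f"New sample:\n{sample_text}\n\n"
            "Does the new sample belong to this label? Answer only VALID or INVALID."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = response.choices[0].message.content.strip().upper()
        return answer.startswith("VALID")  # True verifies the classification, False refutes it
    return llm_judge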

By feeding the valid samples (voted_data) back to our classifier under controlled parameters, we achieve the ‘recursive’ part of our algorithm:

Recursive Expert Delegation: R.E.D.
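
Putting the pieces together, the recursion can be sketched roughly as below — a simplified outline under assumed helpers (add_voted_samples and count_uncertain are hypothetical, and should_stop_retraining is the sketch from earlier), not the exact production loop:

def red_training_loop(clf, llm_judge, labeled_data, unlabeled_data, k=100, max_rounds=20):
    # labeled_data: (X, y) for this subset, noise class included
    uncertain_counts = []
    for _ in range(max_rounds):
        clf.fit(*labeled_data)

        # pre-emptive classification + LLM validation of the most informative samples
        voted_data = proxy_label(clf, llm_judge, k, unlabeled_data)
        if not voted_data:
            break  # nothing new to add back

        # add validated samples back to the training pool with their predicted labels
        labeled_data = add_voted_samples(labeled_data, voted_data, clf)  # hypothetical helper

        uncertain_counts.append(count_uncertain(clf, unlabeled_data))  # hypothetical helper
        if should_stop_retraining(uncertain_counts):
            break  # information gain has saturated
    return clf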

By doing this, we were able to achieve close-to-human-expert validation numbers on controlled multi-class datasets. Experimentally, R.E.D. scales up to 1,000 classes while maintaining a reliable degree of accuracy almost on par with human experts (90%+ agreement).

I believe this is a significant achievement in applied ML, and has real-world uses for production-grade expectations of cost, speed, scale, and flexibility. The technical report, publishing later this year, highlights relevant code samples as well as the experimental setups used to achieve the given results.

Interested in more details? Reach out to me over Medium or email for a chat!
