Semantic Signal Separation

Understand Semantic Structures with Transformers and Topic Modeling

We live in the age of big data. At this point it has become a cliché to say that data is the oil of the 21st century, but it really is so. Data collection practices have resulted in huge piles of data in nearly everyone’s hands.

Interpreting data, however, is no easy task, and much of industry and academia still relies on solutions that provide little in the way of explanation. While deep learning is incredibly useful for predictive purposes, it rarely gives practitioners an understanding of the mechanics and structures that underlie the data.

Textual data is especially tricky. While natural language and concepts like “topics” are incredibly easy for humans to grasp intuitively, producing operational definitions of semantic structures is far from trivial.

In this article I’ll introduce you to different conceptualizations of discovering latent semantic structures in natural language, we’ll look at operational definitions of the theory, and finally I’ll demonstrate the usefulness of the method with a case study.

While “topic” seems like a very intuitive and self-explanatory term to us humans, it is hardly so once we try to come up with a useful and informative definition. The Oxford dictionary’s definition is luckily here to help us:

A subject that’s discussed, written about, or studied.

Well, this didn’t get us much closer to something we can formulate in computational terms. Notice how the word “subject” is used to hide all the gory details. This need not deter us, however; we can definitely do better.

Semantic Space of Academic Disciplines

In Natural Language Processing, we often use a spatial definition of semantics. This might sound fancy, but essentially we imagine that the semantic content of text/language can be expressed in some continuous space (often high-dimensional), where concepts or texts that are related are closer to each other than those that aren’t. If we embrace this theory of semantics, we can easily come up with two possible definitions of topic.

Topics as Semantic Clusters

A rather intuitive conceptualization is to imagine topics as groups of passages/concepts in semantic space that are closely related to one another, but not as closely related to other texts. This incidentally means that one passage can only belong to one topic at a time.
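The hard-assignment property can be illustrated with a minimal sketch. The two-dimensional “embeddings” and the centroid names below are made up for illustration; real sentence embeddings have hundreds of dimensions, and real clustering models do not necessarily assign by nearest centroid.

```python
import math

# Hypothetical 2-D cluster centroids (illustrative values only).
centroids = {
    "biology": (0.9, 0.1),
    "physics": (0.1, 0.9),
}


def assign_topic(vector, centroids):
    """Return the name of the single closest centroid (hard assignment)."""
    return min(centroids, key=lambda name: math.dist(vector, centroids[name]))


# A made-up passage that touches on both subjects.
passage = (0.6, 0.5)
topic = assign_topic(passage, centroids)
print(topic)
```

Even though the passage lies between the two groups, it receives exactly one label, which is the limitation discussed next.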

Semantic Clusters of Academic Disciplines

This clustering conceptualization also lends itself to thinking about topics hierarchically. You can imagine that the topic “living organisms” might contain two subclusters, one being “Eukaryotes” and the other “Prokaryotes”, and then you could go down this hierarchy until, at the leaves of the tree, you find actual instances of concepts.

Of course, a limitation of this approach is that longer passages might contain multiple topics. This could be addressed by splitting texts into smaller, atomic parts (e.g. words) and modeling over those, but we could also ditch the clustering conceptualization altogether.

Topics as Axes of Semantics

We can also think of topics as the underlying dimensions of the semantic space in a corpus. Or in other words: instead of describing what groups of documents there are, we explain variation in documents by finding underlying semantic signals.

Underlying Axes in the Semantic Space of Academic Disciplines

You could, for example, imagine that the most important axes that underlie restaurant reviews would be:

  1. Satisfaction with the food
  2. Satisfaction with the service

I hope you see why this conceptualization is useful for certain purposes. Instead of finding “good reviews” and “bad reviews”, we get an understanding of what it is that drives differences between these. A pop culture example of this kind of theorizing is of course the political compass. Once again, instead of being interested in finding “conservatives” and “progressives”, we find the factors that differentiate these.
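To make the contrast with clustering concrete, here is a minimal sketch in which a review gets a score on every axis instead of a single label. The axis vectors and the review embedding are hypothetical values in a toy three-dimensional space.

```python
# Hypothetical unit-length axes in a toy 3-D semantic space.
axes = {
    "food": (1.0, 0.0, 0.0),
    "service": (0.0, 1.0, 0.0),
}


def axis_scores(vector, axes):
    """Score a document on every axis with a dot product."""
    return {
        name: sum(v * a for v, a in zip(vector, axis))
        for name, axis in axes.items()
    }


# Made-up embedding of a review like "great food, terrible service".
review = (0.8, -0.7, 0.1)
scores = axis_scores(review, axes)
print(scores)
```

The review is neither simply “good” nor “bad”: it scores high on one axis and low on the other.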

Now that we’ve got the philosophy out of the way, we can get our hands dirty designing computational models based on our conceptual understanding.

Semantic Representations

Classically, the way we represented the semantic content of texts was the so-called bag-of-words model. Essentially you make the very strong, and almost trivially wrong, assumption that the unordered collection of words in a document is constitutive of its semantic content. While these representations are plagued with a number of issues (curse of dimensionality, discrete space, etc.), they have been demonstrated to be useful by decades of research.
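Why the bag-of-words assumption is “almost trivially wrong” is easy to see in a minimal sketch. The whitespace tokenizer below is a simplification chosen for illustration; real vectorizers handle punctuation, stop words and more.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())


a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# The two sentences mean different things, yet their representations are identical.
print(a == b)
```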

Luckily for us, the state of the art has progressed beyond these representations, and we now have access to models that can represent text in context. Sentence Transformers are transformer models that can encode passages into a high-dimensional continuous space, where semantic similarity is indicated by vectors having high cosine similarity. In this article I’ll mainly focus on models that use these representations.
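Cosine similarity itself is simple to compute. The sketch below uses made-up three-dimensional vectors as stand-ins for real embeddings, which would come from a Sentence Transformer and have several hundred dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings (illustrative values only).
cat = (0.9, 0.8, 0.1)
kitten = (0.8, 0.9, 0.2)
invoice = (0.1, 0.0, 0.9)

print(cosine_similarity(cat, kitten))   # related concepts: close to 1
print(cosine_similarity(cat, invoice))  # unrelated concepts: much lower
```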

Clustering Models

The models that are currently the most widespread in the topic modeling community for contextually sensitive topic modeling (Top2Vec, BERTopic) are based on the clustering conceptualization of topics.

Clusters in Semantic Space Discovered by BERTopic (figure from BERTopic’s documentation)

They discover topics in a process that consists of the following steps:

  1. Reduce dimensionality of semantic representations using UMAP
  2. Discover cluster hierarchy using HDBSCAN
  3. Estimate the importance of terms for each cluster using post-hoc descriptive methods (c-TF-IDF, proximity to cluster centroid)
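Only the third step is sketched here, since the first two need the UMAP and HDBSCAN libraries. The sketch assumes the c-TF-IDF weighting as I understand it from BERTopic’s documentation: the term frequency within a cluster, multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term’s frequency across all clusters. The cluster names, documents and the function name `c_tf_idf` are made up for illustration; this is not BERTopic’s actual implementation.

```python
import math
from collections import Counter


def c_tf_idf(cluster_docs):
    """Map each cluster to {term: weight}, treating a cluster as one long document."""
    counts = {
        cluster: Counter(" ".join(docs).lower().split())
        for cluster, docs in cluster_docs.items()
    }
    # Frequency of every term across all clusters.
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)
    avg_words = sum(total.values()) / len(counts)

    weights = {}
    for cluster, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        weights[cluster] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in cluster_counts.items()
        }
    return weights


# Hypothetical clusters, as if already found by a clustering step.
clusters = {
    "vision": ["image classification with convolutional networks",
               "image segmentation networks"],
    "nlp": ["language models for text classification",
            "text generation with language models"],
}
weights = c_tf_idf(clusters)
for cluster, terms in weights.items():
    print(cluster, sorted(terms, key=terms.get, reverse=True)[:3])
```

Terms that are frequent in one cluster but rare elsewhere get the highest weights, which is why the resulting descriptions tend to be readable.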

These models have gained a lot of traction, mainly due to their interpretable topic descriptions and their ability to recover hierarchies, as well as to learn the number of topics from the data.

If we want to model nuances in topical content and understand factors of semantics, clustering models are not enough.

I don’t intend to go into great detail about the practical advantages and limitations of these approaches, but most of them stem from the philosophical considerations outlined above.

Semantic Signal Separation

If we are to discover the axes of semantics in a corpus, we will need a new statistical model.

We can take inspiration from classical topic models, such as Latent Semantic Analysis (LSA). LSA uses matrix decomposition to find latent components in bag-of-words representations. LSA’s main goal is to find words that are highly correlated, and to explain their co-occurrence as an underlying semantic component.

Since we are no longer dealing with bag-of-words, explaining away correlation might not be an optimal strategy for us. Orthogonality is not statistical independence. Or in other words: just because two components are uncorrelated, it does not mean that they are statistically independent.
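A tiny worked example makes the difference tangible. In the sketch below, `ys` is completely determined by `xs`, so the two are as dependent as they can be, yet their covariance is zero. The numbers are a textbook-style toy case, not data from any corpus.

```python
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # y depends entirely on x


def covariance(a, b):
    """Population covariance of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)


cov_xy = covariance(xs, ys)
print(cov_xy)  # uncorrelated, even though y is a function of x
```

A decomposition that only removes correlation would treat these two signals as already separated, which is why a stronger criterion is needed.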

Other disciplines have luckily come up with decomposition models that discover maximally independent components. Independent Component Analysis (ICA) has been extensively used in neuroscience to discover and remove noise signals from EEG data.

Difference between Orthogonality and Independence Demonstrated with PCA and ICA (Figure from scikit-learn’s documentation)

The main idea behind Semantic Signal Separation is that we can find maximally independent underlying semantic signals in a corpus of text by decomposing representations with ICA.
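The decomposition step can be sketched as follows. This assumes scikit-learn’s `FastICA` as the ICA implementation and uses random stand-in data instead of real sentence embeddings; it is an illustration of the idea, not necessarily how Turftopic implements it internally.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Stand-in for sentence embeddings: 500 "documents" in a 16-dimensional space,
# built by mixing 3 independent, non-Gaussian latent signals plus a little noise.
latent = rng.laplace(size=(500, 3))
mixing = rng.normal(size=(3, 16))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(500, 16))

# Decompose the embeddings into 3 maximally independent components.
ica = FastICA(n_components=3, random_state=0)
document_topic = ica.fit_transform(embeddings)  # shape: (n_documents, n_topics)
unmixing = ica.components_                      # shape: (n_topics, n_embedding_dims)

print(document_topic.shape, unmixing.shape)
```

Each column of `document_topic` is one recovered signal, and each document gets a score on every one of them.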

We can gain human-readable descriptions of topics by taking terms from the corpus that rank highest on a given component.
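Ranking terms on a component amounts to sorting. The sketch below uses a made-up vocabulary and made-up component strengths; the helper name `top_terms` is hypothetical.

```python
# Hypothetical vocabulary and the strength of each term on one component.
vocab = ["bayesian", "posterior", "network", "convolutional", "inference"]
component = [2.1, 1.7, -1.9, -1.5, 1.9]


def top_terms(component, vocab, k=3):
    """Return the k terms with the highest strength on the component."""
    order = sorted(range(len(vocab)), key=lambda i: component[i], reverse=True)
    return [vocab[i] for i in order[:k]]


highest = top_terms(component, vocab)
print(highest)
```

Because a component is an axis rather than a cluster, the terms at the negative end can be just as informative as the ones at the top.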

To demonstrate the usefulness of Semantic Signal Separation for understanding semantic variation in corpora, we’ll fit a model on a dataset of roughly 118k machine learning abstracts.

To reiterate once more what we are trying to achieve here: we want to establish the dimensions along which all machine learning papers are distributed. Or in other words, we would like to build a spatial theory of semantics for this corpus.

For this we are going to use a Python library I developed called Turftopic, which has implementations of most topic models that utilize representations from transformers, including Semantic Signal Separation. Additionally, we are going to install the HuggingFace datasets library so that we can download the corpus at hand.

pip install turftopic datasets

Let us download the data from HuggingFace:

from datasets import load_dataset

ds = load_dataset("CShorten/ML-ArXiv-Papers", split="train")

We are then going to run Semantic Signal Separation on this data. We are going to use the all-MiniLM-L12-v2 Sentence Transformer, as it is quite fast but provides reasonably high-quality embeddings.

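from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10, encoder="all-MiniLM-L12-v2")
# Fit the model on the abstracts, then print the top terms of each topic
model.fit(ds["abstract"])
model.print_topics()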


Topics Found in the Abstracts by Semantic Signal Separation

These are the highest-ranking keywords for the ten axes we found in the corpus. You can see that most of these are quite readily interpretable, and already help you see what underlies differences in machine learning papers.

I’ll focus on three axes, sort of arbitrarily, because I found them to be interesting. I’m a Bayesian evangelist, so Topic 7 seems like an interesting one, as it seems that this component describes how probabilistic, model-based and causal papers are. Topic 6 seems to be about noise detection and removal, and Topic 1 is mostly concerned with measurement devices.

We are going to produce a plot that displays a subset of the vocabulary, so we can see how high terms rank on each of these components.

First let’s extract the vocabulary from the model, and select a number of words to display on our graphs. I chose to go with words that are in the 99th percentile based on frequency (so that they still remain somewhat visible on a scatter plot).

import numpy as np

vocab = model.get_vocab()

# We are going to produce a BoW matrix to extract term frequencies
document_term_matrix = model.vectorizer.transform(ds["abstract"])
frequencies = document_term_matrix.sum(axis=0)
frequencies = np.squeeze(np.asarray(frequencies))

# Boolean mask of terms above the 99th percentile of frequency
selected_terms = frequencies > np.quantile(frequencies, 0.99)

We will make a DataFrame with the three selected dimensions and the terms so we can easily plot later.

import pandas as pd

# model.components_ is an n_topics x n_terms matrix
# It contains the strength of all components for every word.
# Here we select the components for the words we chose earlier

terms_with_axes = pd.DataFrame({
    "inference": model.components_[7][selected_terms],
    "measurement_devices": model.components_[1][selected_terms],
    "noise": model.components_[6][selected_terms],
    "term": vocab[selected_terms],
})

We will use the Plotly graphing library to create an interactive scatter plot for interpretation. The X axis is going to be the inference/Bayesian topic, the Y axis is going to be the noise topic, and the color of the dots is going to be determined by the measurement device topic.

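import plotly.express as px

# x: inference axis, y: noise axis, color: measurement device axis
fig = px.scatter(terms_with_axes, x="inference", y="noise", color="measurement_devices", text="term")
fig.update_traces(textposition="top center", marker=dict(size=12, line=dict(width=2, color="white")))
fig.show()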

Plot of the Most Frequent Terms in the Corpus Distributed by Semantic Axes

We can already infer a lot about the semantic structure of our corpus based on this visualization. For instance, we can see that papers concerned with efficiency, online fitting and algorithms score very low on statistical inference, which is somewhat intuitive. On the other hand, what Semantic Signal Separation has already helped us do in a data-based approach is confirm that deep learning papers are not very concerned with statistical inference and Bayesian modeling. We can see this from the words “network” and “networks” (along with “convolutional”) ranking very low on our Bayesian axis. This is one of the criticisms the field has received. We’ve just given support to this claim with empirical evidence.

We can also see that clustering and classification are very concerned with noise, but that agent-based models and reinforcement learning are not.

Additionally, an interesting pattern we may observe is the relation of our noise axis to measurement devices. The words “image”, “images”, “detection” and “robust” stand out as scoring very high on our measurement axis. These are also in a region of the graph where noise detection/removal is relatively high, while talk about statistical inference is low. What this suggests to us is that measurement devices capture a lot of noise, and that the literature is trying to counteract these issues, mainly not by incorporating noise into its statistical models, but by preprocessing. This makes a lot of sense, as, for instance, neuroscience is known for having very extensive preprocessing pipelines, and many of its models have a hard time dealing with noise.

Noise in Measurement Devices’ Output is Countered with Preprocessing

We can also observe that the lowest scoring terms on measurement devices are “text” and “language”. It seems that NLP and machine learning research is not very concerned with the neurological bases of language and psycholinguistics. Observe that “latent” and “representation” are also relatively low on measurement devices, suggesting that machine learning research in neuroscience is not very involved with representation learning.

Text and Language are Rarely Related with Measurement Devices

Of course the possibilities from here are endless; we could spend a lot more time interpreting the results of our model, but my intent was to demonstrate that we can already find claims and establish a theory of semantics in a corpus by using Semantic Signal Separation.

One thing I would like to emphasize is that Semantic Signal Separation should mainly be used as an exploratory measure for establishing theories, rather than taking its results as proof of a hypothesis. What I mean here is that our results are sufficient for gaining an intuitive understanding of differentiating factors in our corpus, and then building a theory about what is happening and why it is happening, but they are not sufficient for establishing the theory’s correctness.

Exploratory data analysis can be confusing, and there are of course no one-size-fits-all solutions for understanding your data. Together we’ve looked at how to enhance our understanding with a model-based approach, from theory, through computational formulation, to practice.

I hope this article will serve you well when analysing discourse in large textual corpora. If you intend to learn more about topic models and exploratory text analysis, make sure to have a look at some of my other articles as well, as they discuss some aspects of these subjects in greater detail.

(Unless stated otherwise, figures were produced by the author.)

