Topic models are used in businesses to categorize brand-related text datasets (such as product and site reviews, surveys, and social media comments) and to track how customer satisfaction metrics change over time.
There is a myriad of recent topic models to choose from: the widely used BERTopic by Maarten Grootendorst (2022), the recent FASTopic presented at last year's NeurIPS (Xiaobao Wu et al., 2024), the Dynamic Topic Model by Blei and Lafferty (2006), or the new semi-supervised Seeded Poisson Factorization model (Prostmaier et al., 2025).
In a business use case, training topic models on customer texts, we often get results that are not identical and are sometimes even conflicting. In business, imperfections cost money, so engineers should put into production the model that provides the best results and solves the problem most effectively. At the same pace that new topic models appear on the market, methods for evaluating their quality with new metrics also evolve.
This practical tutorial focuses on bigram topic models, which provide more relevant information and identify key qualities and problems for business decisions better than single-word models. On the one hand, bigram models are more detailed; on the other, many evaluation metrics were not originally designed for their evaluation. To provide more background on this area, we will explore in detail:
- How to evaluate the quality of bigram topic models
- How to prepare an email classification pipeline in Python
Our example use case will show how bigram topic models (BERTopic and FASTopic) help prioritize email communication with customers on certain topics and reduce response times.
1. What are topic model quality indicators?
The evaluation task should target the ideal state: coherent topics with minimal overlap between them.
In practice, this means that the words predicted for each topic are semantically similar according to human judgment, and that there is little duplication of words between topics.
- Coherence metrics evaluate how well the words discovered by a topic model make sense to humans (have similar semantics within each topic).
- Topic diversity measures how different the discovered topics are from each other.
Bigram topic models work well with these metrics:
- NPMI (normalized pointwise mutual information) uses probabilities estimated in a reference corpus to calculate a [-1, 1] score for each word (or bigram) predicted by the model. Read [1] for more details.
The reference corpus can be either internal (the training set) or external (e.g., an external email dataset). A large, external, and comparable corpus is a better choice because it can help reduce bias in the training set. Because this metric works with word frequencies, the training set and the reference corpus should be preprocessed the same way (i.e., if we remove numbers and stopwords from the training set, we should also do it in the reference corpus). The aggregate model score is the average of word scores across topics.
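For illustration, here is a minimal sketch of an NPMI coherence calculation using gensim's CoherenceModel. It assumes `topics` is a list of top-bigram lists per topic (bigrams joined with an underscore) and `reference_texts` is the tokenized reference corpus preprocessed the same way as the training set; bigrams that never occur in the reference corpus would need to be filtered out first.

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# reference_texts: list of token lists from the reference corpus,
# with bigrams merged into single tokens, e.g. "late_delivery"
dictionary = Dictionary(reference_texts)

npmi = CoherenceModel(
    topics=topics,          # top bigrams per topic from the trained model
    texts=reference_texts,
    dictionary=dictionary,
    coherence="c_npmi",     # NPMI-based coherence, scores in [-1, 1]
    topn=10,
)

print(npmi.get_coherence())            # aggregate model score (mean over topics)
print(npmi.get_coherence_per_topic())  # one score per topic
```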
- SC (semantic coherence) doesn't need a reference corpus. It uses the same dataset that was used to train the topic model. Read more in [2].
Let's say we have the top 4 words for one topic predicted by a topic model. SC then looks at all combinations of those words in the training set, going from left to right: the first word paired with the second, third, and fourth, then the second word with the third and fourth, and finally the third word with the fourth. For each pair, it counts the number of documents that contain both words, divided by the number of documents that contain the first word. The overall SC score for a model is the mean of all topic-level scores.
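Below is a naive sketch of this pairwise counting, assuming `topics` is a list of top-bigram lists and `documents` is the preprocessed training corpus (bigrams joined with underscores); the author's exact implementation is in the linked repo.

```python
from itertools import combinations

import numpy as np

def semantic_coherence(topics, documents):
    """For each pair of top words/bigrams (w_i, w_j) with i < j, compute the
    share of documents containing both, among documents containing w_i.
    Topic score = mean over pairs; model score = mean over topic scores."""
    doc_tokens = [set(doc.split()) for doc in documents]
    topic_scores = []
    for top_words in topics:
        pair_scores = []
        for w_i, w_j in combinations(top_words, 2):
            d_i = sum(w_i in tokens for tokens in doc_tokens)
            d_ij = sum(w_i in tokens and w_j in tokens for tokens in doc_tokens)
            pair_scores.append(d_ij / d_i if d_i else 0.0)
        topic_scores.append(float(np.mean(pair_scores)))
    return float(np.mean(topic_scores)), topic_scores
```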
- PUV measures the share of unique words across topics in the model. A value of 1 means that every topic in the model contains only unique bigrams. Values close to 1 indicate a well-shaped, high-quality model with little word overlap between topics [3].
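PUV itself is essentially a one-liner; a minimal sketch, again assuming `topics` is a list of top-bigram lists per topic:

```python
def puv(topics):
    """Share of unique bigrams across all topics: 1.0 means no bigram
    appears in more than one topic."""
    all_bigrams = [bigram for topic in topics for bigram in topic]
    return len(set(all_bigrams)) / len(all_bigrams)
```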
2. How can we prioritize email communication with topic models?
A large share of customer communication, not only in e-commerce businesses, is now handled by chatbots and personal client sections. Yet it is still common to communicate with customers by email. Many email providers offer developers broad flexibility in their APIs to customize their email platform. Here, topic models make mailing more flexible and effective.
3. Data and model set-ups
We will train FASTopic and BERTopic to classify emails into 8 and 10 topics and evaluate the quality of all model specifications. Read my previous TDS tutorial on topic modeling with these cutting-edge topic models.
As a training set, we use a synthetically generated Customer Care Email dataset available on Kaggle under a GPL-3 license. The prefiltered data covers 692 incoming emails and looks like this:

3.1. Data preprocessing
Cleaning text in the right order is essential for topic models to work in practice because it minimizes the bias of each cleaning operation.
Numbers are typically removed first, followed by emojis, unless we need them for special situations such as sentiment extraction. Stopwords for one or more languages are removed afterward, followed by punctuation, so that stopwords don't break up into two tokens. Additional tokens (company and people's names, etc.) are removed in the next step from the clean data before lemmatization, which unifies tokens with the same semantics.
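As an illustration, here is a minimal sketch of this cleaning order using NLTK; the helper name `clean_email` and the extra-token list are illustrative, not the code from the repo.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
EXTRA_TOKENS = {"companyname", "agentname"}  # hypothetical names to drop
lemmatizer = WordNetLemmatizer()

def clean_email(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                  # 1. numbers
    text = text.encode("ascii", "ignore").decode()    # 2. emojis (and other non-ASCII)
    tokens = [t for t in text.split() if t not in STOPWORDS]      # 3. stopwords
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]          # 4. punctuation
    tokens = [t for t in tokens if t and t not in EXTRA_TOKENS]   # 5. extra tokens
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)      # 6. lemmatization
```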

Text preprocessing is model-specific:
- FASTopic works with clean data on input; some cleaning (stopwords) can be done during training. The easiest way is to use a tool that provides a no-code way of data preprocessing for text mining projects.
- BERTopic: the documentation recommends keeping the full text for the embedding step, because the transformer-based embedding models need the complete context. For this reason, cleaning operations should be included in the model training.
3.2. Model compilation and training
You can check the full code for FASTopic and BERTopic training with bigram preprocessing and cleaning in this repo. My previous TDS tutorials [4] and [5] explain all the steps in detail.
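A stripped-down sketch of the training setup is below; the full bigram preprocessing and cleaning is in the repo, and the parameter values here are illustrative rather than the exact configuration used for the results.

```python
from bertopic import BERTopic
from fastopic import FASTopic
from sklearn.feature_extraction.text import CountVectorizer

# BERTopic: bigram topic representations; stopword removal happens inside
# the model via the vectorizer, in line with the documentation's advice
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
bertopic_model = BERTopic(
    vectorizer_model=vectorizer,
    min_topic_size=20,   # minimum cluster size (illustrative)
    nr_topics=8,         # reduce to 8 topics (illustrative)
)
bertopic_topics, bertopic_probs = bertopic_model.fit_transform(emails)

# FASTopic: trained on the cleaned, bigram-preprocessed texts, 8 topics
fastopic_model = FASTopic(8)
topic_words, doc_topic_dist = fastopic_model.fit_transform(cleaned_emails)
```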
We train both models to classify 8 topics in the customer email data. A simple inspection of the topic distribution shows that FASTopic distributes incoming emails quite evenly across topics. BERTopic classifies emails unevenly, keeping outliers (uncategorized documents) in T-1 and a large share of incoming emails in T0.
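A quick way to inspect the distributions, reusing the objects from the training sketch above (`get_topic_info` is BERTopic's built-in summary; for FASTopic, each email is assigned to its most probable topic):

```python
import numpy as np
import pandas as pd

# BERTopic: document counts per topic, including the T-1 outlier topic
print(bertopic_model.get_topic_info()[["Topic", "Count", "Name"]])

# FASTopic: assign each email to its most probable topic and count
assignments = np.argmax(doc_topic_dist, axis=1)
print(pd.Series(assignments).value_counts().sort_index())
```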

Here are the predicted bigrams for both models with topic labels:


Because the email corpus is a synthetic, LLM-generated dataset, the naive labelling of the topics for both models shows topics that are:
- Comparable:
- Differing: (BERTopic classifies outliers into T-1), (FASTopic), (BERTopic)
For business purposes, topics should be labelled by the company's insiders who know the customer base and the business priorities.
4. Model evaluation
If three out of eight classified topics are labeled differently, which model should be deployed? Let's now evaluate coherence and diversity for the trained 8-topic BERTopic and FASTopic models.
4.1. NPMI
We need a reference corpus to calculate NPMI scores for each model. The Customer IT Support Ticket Dataset from Kaggle, distributed under an Attribution 4.0 International license, provides data comparable to our training set. The data is filtered to 11,923 English email bodies.
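Loading and filtering the reference corpus might look like the sketch below; the file name and the column names ("language", "body") are assumptions about the Kaggle dataset's schema, and `clean_email` is the cleaning helper sketched earlier.

```python
import pandas as pd

tickets = pd.read_csv("customer_it_support_tickets.csv")  # hypothetical file name

# Keep English email bodies only and apply the same cleaning as the training
# set, so that word frequencies in both corpora are comparable
reference = tickets.loc[tickets["language"] == "en", "body"].dropna()
reference_texts = [clean_email(text).split() for text in reference]

print(len(reference_texts))  # ~11,923 English email bodies after filtering
```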

4.2. SC
With SC, we learn about the context and semantic similarity of the bigrams predicted by a topic model by calculating their position in the corpus relative to other tokens. To do so, we apply the pairwise document-counting procedure described in Section 1 to the training corpus.
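Reusing the `semantic_coherence` sketch from Section 1 on the predicted bigrams and the cleaned training emails (variable names as in the earlier sketches):

```python
# topics: top bigrams per topic from the trained model
# cleaned_emails: the preprocessed training texts
sc_model, sc_per_topic = semantic_coherence(topics, cleaned_emails)
print(f"Model SC: {sc_model:.3f}")
for i, score in enumerate(sc_per_topic):
    print(f"T{i}: {score:.3f}")
```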
4.3. PUV
The topic diversity metric checks for duplicate bigrams between topics in a model.
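Applied to the predicted bigrams with the `puv` helper sketched earlier:

```python
print(f"PUV: {puv(topics):.2f}")  # values close to 1.0 = little overlap between topics
```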

4.4. Model comparison
Let's now summarize the coherence and diversity evaluation in Image 9. BERTopic models are more coherent but less diverse than FASTopic. The differences aren't very large, but BERTopic suffers from an uneven distribution of incoming emails in the pipeline (see the charts in Image 5). Around 32% of classified emails fall into T0, and 15% into T-1, which covers the unclassified outliers. The models are trained with a minimum of 20 tokens per topic. Increasing this parameter makes the model unable to train, probably because of the small data size.
For this reason, FASTopic is the better choice for topic modelling in email classification with small training datasets.

The last step is to deploy the model with topic labels in the email platform to classify incoming emails:
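A minimal sketch of what this classification step could look like in the pipeline; the topic labels, the behaviour of `fastopic_model.transform` on unseen documents, and `send_to_email_platform` (a placeholder for the email provider's API call) are all assumptions for illustration.

```python
import numpy as np

TOPIC_LABELS = {0: "Billing", 1: "Delivery", 2: "Returns"}  # illustrative labels

def route_incoming_email(email_id: str, email_body: str) -> str:
    cleaned = clean_email(email_body)
    # assumed to return a document-topic distribution for new documents
    doc_topic = fastopic_model.transform([cleaned])
    topic_id = int(np.argmax(doc_topic[0]))
    label = TOPIC_LABELS.get(topic_id, "Other")
    send_to_email_platform(email_id, label)  # placeholder for the email platform API
    return label
```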

Summary
Coherence and diversity metrics compare models with a similar training setup, the same dataset, and the same cleaning strategy. We cannot compare their absolute values with the results of different training sessions, but they help us choose the best model for our specific use case. They provide a comparison of various model specifications and help decide which model should be deployed in the pipeline. Topic model evaluation should always be the last step before model deployment in business practice.
How does customer care benefit from the topic modelling exercise? After the topic model is put into production, the pipeline sends a classified topic for each email to the email platform that Customer Care uses for communicating with customers. With limited staff, it is now possible to prioritize and respond faster to the most business-sensitive requests, and to change priorities dynamically.
The data and complete code for this tutorial are available here.
References
[1] Blei, D. M., Lafferty, J. D. 2006. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning (ICML).
[2] Dieng, A. B., Ruiz, F. J. R., Blei, D. M. 2020. Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8:439-453.
[3] Grootendorst, M. 2022. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure.
[4] Korab, P. 2025. Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code. Towards Data Science, 22.1.2025. Available from: link.
[5] Korab, P. 2024. Topic Modelling with BERTopic in Python. Towards Data Science, 4.1.2024. Available from: link.
[6] Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., Luu, A. T. 2024. FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint arXiv:2405.17978.
[7] Mimno, D., Wallach, H. M., Talley, E., Leenders, M., McCallum, A. 2011. Optimizing Semantic Coherence in Topic Models. In Proceedings of EMNLP.
[8] Prostmaier, B., Vávra, J., Grün, B., Hofmarcher, P. 2025. Seeded Poisson Factorization: Leveraging Domain Knowledge to Fit Topic Models. arXiv preprint.