
Cleaning Up Confluence Chaos: A Python and BERTopic Quest


Photo by Rick Mason on Unsplash

Picture this: you’re at a rapidly growing tech company, and you’ve been given the mission to create a state-of-the-art chatbot using the mind-blowing GPT technology. This chatbot is destined to become the company’s crown jewel, a virtual oracle that’ll answer questions based on the treasure trove of information stored in your Confluence spaces. Sounds like a dream job, right?

But as you take a closer look at the Confluence knowledge base, reality hits. It’s a wild jungle of empty or incomplete pages, irrelevant documents, and duplicate content. It’s like someone dumped a thousand jigsaw puzzles into a giant blender and pressed “start.” And now it’s your job to clean up this mess before you can even think about building that amazing chatbot.

Luckily for you, in this article we’ll embark on an exciting journey to conquer the Confluence chaos, using the power of Python and BERTopic to identify and eliminate those annoying outliers. So buckle up and get ready to transform your knowledge base into the perfect training ground for your cutting-edge GPT-based chatbot.

As you face the daunting task of cleaning up your Confluence knowledge base, you might consider diving in manually, sorting through each document one by one. However, the manual approach is slow, labor-intensive, and error-prone. After all, even the most meticulous worker can overlook important details or misjudge the relevance of a document.

With your knowledge of Python, you might be tempted to build a heuristic-based solution, using a set of predefined rules to identify and eliminate outliers. While this approach is faster than manual cleanup, it has its limitations. Heuristics can be rigid and struggle to adapt to the complex and ever-evolving nature of your Confluence spaces, often leading to suboptimal results.
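
Just to give a concrete idea of what such rules might look like, here is a minimal heuristic sketch. Everything in it is an assumption made up for illustration: the length threshold, the “noise” keywords, and the helper name looks_like_outlier are all hypothetical.

# A minimal heuristic sketch (hypothetical thresholds and keywords, for illustration only)
MIN_LENGTH = 200  # assume pages shorter than this are probably stubs
NOISE_KEYWORDS = ["meeting notes", "draft", "archive"]  # invented examples

def looks_like_outlier(title, text):
    # Flag very short pages or pages whose title contains a "noise" keyword
    if len(text.strip()) < MIN_LENGTH:
        return True
    return any(keyword in title.lower() for keyword in NOISE_KEYWORDS)

Rules like these are easy to write but hard to keep accurate as the knowledge base evolves, which is exactly the limitation described above.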

Enter Python and BERTopic, a powerful combination that can help you tackle the challenge of cleaning up your Confluence knowledge base more effectively. Python is a versatile programming language, while BERTopic is an advanced topic modeling library that can analyze your documents and group them based on their underlying topics.

In the next paragraphs, we’ll explore how Python and BERTopic can work together to automate the process of identifying and eliminating outliers in your Confluence spaces. By harnessing their combined powers, you’ll save time and resources while increasing the accuracy and effectiveness of your cleanup efforts.

Alright, from this point on I’ll walk you through the process of creating a Python script that uses BERTopic to identify and eliminate outliers in your Confluence knowledge base. The goal is to generate a ranked list of documents based on their “unrelatedness” score (which we’ll define later). The final output will consist of the document’s title, a preview of the text (first 100 characters), and the unrelatedness score. It will look like this:

(Title: “AI in Healthcare”, Preview: “Artificial intelligence is transforming…”, Unrelatedness: 0.95)
(Title: “Office Birthday Party Guidelines”, Preview: “To ensure a fun and safe…”, Unrelatedness: 0.8)
The main steps in this process are:

  • Connect to Confluence and download documents: establish a connection to your Confluence account and fetch the documents for processing. This section provides guidance on setting up the connection, authenticating, and downloading the necessary data.
  • HTML processing and text extraction using Beautiful Soup: use Beautiful Soup, a powerful Python library, to parse the HTML content and extract the text from the Confluence documents. This step involves cleaning up the extracted text, removing unwanted elements, and preparing the data for analysis.
  • Apply BERTopic and create the ranking: with the cleaned-up text in hand, apply BERTopic to analyze and group the documents based on their underlying topics. After obtaining the topic representations, calculate the “unrelatedness” measure for each document and create a ranking to identify and eliminate outliers in your Confluence knowledge base.

Finally, the code. Here we’ll start by downloading the documents from a Confluence space, then process the HTML content and extract the text for the next phase (BERTopic!).

First, we need to connect to Confluence via the API. Thanks to the atlassian-python-api library, this can be done with a few lines of code. If you don’t have an API token for Atlassian, read this guide to set one up.

import os
import re
from atlassian import Confluence
from bs4 import BeautifulSoup

# Set up the Confluence API client
confluence = Confluence(
    url='YOUR_CONFLUENCE_URL',
    username="YOUR_EMAIL",
    password="YOUR_API_KEY",
    cloud=True)

# Replace YOUR_SPACE with the desired Confluence space key
space_key = 'YOUR_SPACE'

def get_all_pages_from_space_with_pagination(space_key):
    limit = 50
    start = 0
    all_pages = []

    while True:
        pages = confluence.get_all_pages_from_space(space_key, start=start, limit=limit)
        if not pages:
            break

        all_pages.extend(pages)
        start += limit

    return all_pages

pages = get_all_pages_from_space_with_pagination(space_key)

After fetching the pages, we’ll create a directory for the text files, extract the pages’ content and save the text content to individual files:

# Function to sanitize filenames
def sanitize_filename(filename):
    return "".join(c for c in filename if c.isalnum() or c in (' ', '.', '-', '_')).rstrip()

# Create a directory for the text files if it doesn't exist
if not os.path.exists('txt_files'):
    os.makedirs('txt_files')

# Extract pages and save them to individual text files
for page in pages:
    page_id = page['id']
    page_title = page['title']

    # Fetch the page content
    page_content = confluence.get_page_by_id(page_id, expand='body.storage')

    # Extract the content in the "storage" format
    storage_value = page_content['body']['storage']['value']

    # Clean the HTML tags to get the text content
    text_content = process_html_document(storage_value)
    file_name = f'txt_files/{sanitize_filename(page_title)}_{page_id}.txt'
    with open(file_name, 'w', encoding='utf-8') as txtfile:
        txtfile.write(text_content)

The function process_html_document carries out all the necessary cleaning tasks to extract the text from the downloaded pages while maintaining a coherent format. How far you want to refine this process depends on your specific requirements. In this case, we focus on handling tables and lists so that the resulting text document retains a format similar to the original layout.

import spacy

nlp = spacy.load("en_core_web_sm")

def html_table_to_text(html_table):
    soup = BeautifulSoup(html_table, "html.parser")

    # Extract table rows
    rows = soup.find_all("tr")

    # Determine if the table has headers or not
    has_headers = any(th for th in soup.find_all("th"))

    # Extract table headers, either from the first row or from the <th> elements
    if has_headers:
        headers = [th.get_text(strip=True) for th in soup.find_all("th")]
        row_start_index = 1  # Skip the first row, as it contains the headers
    else:
        first_row = rows[0]
        headers = [cell.get_text(strip=True) for cell in first_row.find_all("td")]
        row_start_index = 1

    # Iterate through rows and cells, and use NLP to generate sentences
    text_rows = []
    for row in rows[row_start_index:]:
        cells = row.find_all("td")
        cell_sentences = []
        for header, cell in zip(headers, cells):
            # Generate a sentence using the header and the cell value
            doc = nlp(f"{header}: {cell.get_text(strip=True)}")
            sentence = " ".join([token.text for token in doc if not token.is_stop])
            cell_sentences.append(sentence)

        # Combine cell sentences into a single row text
        row_text = ", ".join(cell_sentences)
        text_rows.append(row_text)

    # Combine row texts into a single text
    text = "\n\n".join(text_rows)
    return text

def html_list_to_text(html_list):
    soup = BeautifulSoup(html_list, "html.parser")
    items = soup.find_all("li")
    text_items = []
    for item in items:
        item_text = item.get_text(strip=True)
        text_items.append(f"- {item_text}")
    text = "\n".join(text_items)
    return text

def process_html_document(html_document):
    soup = BeautifulSoup(html_document, "html.parser")

    # Replace tables with text using html_table_to_text
    for table in soup.find_all("table"):
        table_text = html_table_to_text(str(table))
        table.replace_with(BeautifulSoup(table_text, "html.parser"))

    # Replace lists with text using html_list_to_text
    for ul in soup.find_all("ul"):
        ul_text = html_list_to_text(str(ul))
        ul.replace_with(BeautifulSoup(ul_text, "html.parser"))

    for ol in soup.find_all("ol"):
        ol_text = html_list_to_text(str(ol))
        ol.replace_with(BeautifulSoup(ol_text, "html.parser"))

    # Replace all variants of <br> tags with newlines
    br_tags = re.compile('<br>|<br/>|<br />')
    html_with_newlines = br_tags.sub('\n', str(soup))

    # Strip remaining HTML tags to isolate the text
    soup_with_newlines = BeautifulSoup(html_with_newlines, "html.parser")

    return soup_with_newlines.get_text()

In this final chapter, we’ll finally leverage BERTopic, a powerful topic modeling technique that utilizes BERT embeddings. You can learn more about BERTopic in their GitHub repository and their documentation.

Our approach to finding outliers consists of running BERTopic with different values for the number of topics. In each iteration, we’ll collect all the documents that fall into the Outlier cluster (-1). The more frequently a document appears in the -1 cluster, the more likely it is to be an outlier. This frequency forms the first component of our unrelatedness score. BERTopic also provides a probability value for documents in the -1 cluster. We’ll calculate the average of these probabilities for each document over all the iterations. This average is the second component of our unrelatedness score. Finally, we’ll determine the overall unrelatedness score for each document by computing the average of the two components (frequency and probability). This combined score will help us identify the most unrelated documents in our dataset.
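
To make the definition concrete before diving into the code, here is a tiny numerical sketch of the score for a single document; the numbers are invented purely for illustration.

# Toy illustration of the unrelatedness score (made-up numbers)
normalized_count = 1.0   # the document fell into cluster -1 in every run (count normalized across documents)
avg_outlier_prob = 0.9   # average probability of belonging to cluster -1 over those runs

unrelatedness = (normalized_count + avg_outlier_prob) / 2
print(unrelatedness)  # 0.95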

Here is the initial code:

import numpy as np
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
representation_model = MaximalMarginalRelevance(diversity=0.2)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Collect text and filenames from the files in the txt_files directory
documents = []
filenames = []

for file in os.listdir('txt_files'):
    if file.endswith('.txt'):
        with open(os.path.join('txt_files', file), 'r', encoding='utf-8') as f:
            documents.append(f.read())
            filenames.append(file)

In this code block, we set up the necessary tools for BERTopic by importing the required libraries and initializing the models. We define three models that will be used by BERTopic:

  • vectorizer_model: the CountVectorizer model tokenizes the documents and creates a document-term matrix where each entry represents the count of a term in a document. It also removes English stop words from the documents to improve topic modeling performance.
  • representation_model: the MaximalMarginalRelevance (MMR) model diversifies the extracted topics by considering both the relevance and the diversity of topics. The diversity parameter controls the trade-off between these two aspects, with higher values resulting in more diverse topics.
  • ctfidf_model: the ClassTfidfTransformer model adjusts the term frequency-inverse document frequency (TF-IDF) scores of the document-term matrix to better represent topics. It reduces the impact of frequently occurring words across topics and enhances the distinction between topics.

We then collect the text and filenames of the documents from the ‘txt_files’ directory to process them with BERTopic in the next step.

def extract_topics(docs, n_topics):
    model = BERTopic(nr_topics=n_topics, calculate_probabilities=True, language="english",
                     ctfidf_model=ctfidf_model, representation_model=representation_model,
                     vectorizer_model=vectorizer_model)
    topics, probabilities = model.fit_transform(docs)
    return model, topics, probabilities

def find_outlier_topic(model):
    topic_sizes = model.get_topic_freq()
    outlier_topic = topic_sizes.iloc[-1]["Topic"]
    return outlier_topic

outlier_counts = np.zeros(len(documents))
outlier_probs = np.zeros(len(documents))

# Define the range of topic counts you want to try
min_topics = 5
max_topics = 10

for n_topics in range(min_topics, max_topics + 1):
    model, topics, probabilities = extract_topics(documents, n_topics)
    outlier_topic = find_outlier_topic(model)

    for i, (topic, prob) in enumerate(zip(topics, probabilities)):
        if topic == outlier_topic:
            outlier_counts[i] += 1
            outlier_probs[i] += prob[outlier_topic]

In the section above, we use BERTopic to identify outlier documents by iterating through a range of topic counts, from a specified minimum to a maximum. For each topic count, BERTopic extracts the topics and their corresponding probabilities. It then identifies the outlier topic and updates outlier_counts and outlier_probs for the documents assigned to this outlier topic. This process iteratively accumulates counts and probabilities, providing a measure of how often and how ‘strongly’ documents are classified as outliers.

Finally, we can compute our unrelatedness score and print the results:

def normalize(arr):
    min_val, max_val = np.min(arr), np.max(arr)
    return (arr - min_val) / (max_val - min_val)

# Average the probabilities
avg_outlier_probs = np.divide(outlier_probs, outlier_counts, out=np.zeros_like(outlier_probs), where=outlier_counts != 0)

# Normalize the counts
normalized_counts = normalize(outlier_counts)

# Compute the combined unrelatedness score by averaging the normalized counts and probabilities
unrelatedness_scores = [(i, (count + prob) / 2) for i, (count, prob) in enumerate(zip(normalized_counts, avg_outlier_probs))]
unrelatedness_scores.sort(key=lambda x: x[1], reverse=True)

# Print the filtered results
for index, score in unrelatedness_scores:
    if score > 0:
        title = filenames[index]
        preview = documents[index][:100] + "..." if len(documents[index]) > 100 else documents[index]
        print(f"Title: {title}, Preview: {preview}, Unrelatedness: {score:.2f}")
        print("\n")

And that’s it! Here you have your list of outlier documents ranked by unrelatedness. By cleaning up your Confluence spaces and removing irrelevant content, you can pave the way for a more efficient and valuable chatbot that leverages your company’s knowledge. Happy cleaning!
