Tutorial: Semantic Clustering of User Messages with LLM Prompts

As a Developer Advocate, it’s hard to keep up with user forum messages and understand the big picture of what users are saying. There’s plenty of valuable content, but how can you quickly spot the key conversations? In this tutorial, I’ll show you an AI hack to perform semantic clustering simply by prompting LLMs!

TL;DR 🔄 This blog post is about going from (data science + code) → (AI prompts + LLMs) for the same results, just faster and with less effort! 🤖⚡ It’s organized as follows:

  • Inspiration and Data Sources
  • Exploring the Data with Dashboards
  • LLM Prompting to Produce KNN Clusters
  • Experimenting with Custom Embeddings
  • Clustering Across Multiple Discord Servers

Inspiration and Data Sources

First, I’ll give props to the December 2024 paper Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants to analyze and surface aggregated usage patterns across millions of conversations. Reading this paper inspired me to try this myself.

Data. I used only publicly available Discord messages, specifically “forum threads”, where users ask for tech help. In addition, I aggregated and anonymized content for this blog. Per thread, I formatted the data into conversation-turn format, with user roles identified as either “user”, who asks the initial question, or “assistant”, anyone answering it. I also added a simple, hard-coded binary sentiment score (0 for “not happy” and 1 for “happy”) based on whether the user said thanks at any point in their thread. For vector DB vendors I used Zilliz/Milvus, Chroma, and Qdrant.

The first step was to convert the data into a pandas DataFrame. Below is an excerpt. You can see that for thread_id=2, a user asked only 1 question, but for thread_id=3, a user asked 4 different questions in the same thread (the other 2 questions appear at later timestamps, not shown below).
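The conversion itself happens in process_json_files(), called in the main block further down. Its internals aren’t shown in this post; below is a minimal sketch of the idea, with hypothetical JSON field names:

import json
import os
import pandas as pd

# Hypothetical sketch -- the post's actual implementation is not shown.
# Flatten each exported forum thread into one row per conversation turn.
def process_json_files(input_folder, output_dir):
    rows = []
    for file_name in os.listdir(input_folder):
        if not file_name.endswith(".json"):
            continue
        with open(os.path.join(input_folder, file_name)) as f:
            thread = json.load(f)
        for msg in thread["messages"]:  # field names are illustrative only
            rows.append({
                "thread_id": thread["thread_id"],
                "role_name": msg["author_role"],  # "user" or "assistant"
                "message_content": msg["content"],
                "timestamp": msg["timestamp"],
            })
    df = pd.DataFrame(rows)
    df.to_csv(os.path.join(output_dir, "clean_data.csv"), index=False)
    return df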

I added a naive sentiment 0|1 scoring function.

def calc_score(df):
    # Define the target words that signal a satisfied user
    target_words = ["thanks", "thank you", "thx", "🙂", "😉", "👍"]

    # Helper function to check whether any target word appears in the concatenated message content
    def contains_target_words(messages):
        concatenated_content = " ".join(messages).lower()
        return any(word in concatenated_content for word in target_words)

    # Group by 'thread_id' and calculate a 0|1 score for each group
    thread_scores = (
        df[df['role_name'] == 'user']
        .groupby('thread_id')['message_content']
        .apply(lambda messages: int(contains_target_words(messages)))
    )
    # Map the calculated scores back to the original DataFrame
    df['score'] = df['thread_id'].map(thread_scores)
    return df


...


if __name__ == "__main__":

    # Load parameters from YAML file
    config_path = "config.yaml"
    params = load_params(config_path)
    input_data_folder = params['input_data_folder']
    processed_data_dir = params['processed_data_dir']
    threads_data_file = os.path.join(processed_data_dir, "thread_summary.csv")

    # Read data from Discord forum JSON files into a pandas df.
    clean_data_df = process_json_files(
        input_data_folder,
        processed_data_dir)

    # Calculate score based on specific words in message content
    clean_data_df = calc_score(clean_data_df)

    # Generate reports and plots
    plot_all_metrics(processed_data_dir)

    # Concat thread messages & save as CSV for prompting.
    thread_summary_df, avg_message_len, avg_message_len_user = \
        concat_thread_messages_df(clean_data_df, threads_data_file)
    assert thread_summary_df.shape[0] == clean_data_df.thread_id.nunique()

Exploring the Data with Dashboards

From the processed data above, I built traditional dashboards:

  • Message Volumes: One-off peaks for vendors like Qdrant and Milvus (possibly due to marketing events).
  • User Engagement: The dark scatterplot dots look random with respect to the y-axis (response time). Possibly users are not yet in production, so their questions are not very urgent? Outliers exist, such as Qdrant and Chroma, which may have bot-driven anomalies.
  • Satisfaction Trends: Around 70% of users appear happy with their interactions.
Image by author of aggregated, anonymized data. Top left: Charts show Chroma has the highest message volume, followed by Qdrant, then Milvus. Top right: Top messaging users; Qdrant and Chroma show possible bots (see the top bar in the top messaging users chart). Middle right: Scatterplots of response time vs. number of user turns show no correlation with respect to the dark dots and the y-axis (response time). Satisfaction is generally higher w.r.t. the x-axis (user turns), except for Chroma. Bottom left: Bar charts of satisfaction levels; make sure you catch possible emoji-based feedback, see Qdrant and Chroma.
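The plot_all_metrics() function isn’t shown here; as a flavor of what one of these charts takes, here is a minimal sketch of the satisfaction bar chart, assuming the clean_data_df columns defined above:

import matplotlib.pyplot as plt

# Each thread carries a single 0|1 score, so take the first score per thread.
per_thread_score = clean_data_df.groupby("thread_id")["score"].first()
per_thread_score.value_counts().sort_index().plot(kind="bar")
plt.xticks([0, 1], ["not happy (0)", "happy (1)"], rotation=0)
plt.ylabel("Number of threads")
plt.title("User satisfaction per thread")
plt.show()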

LLM Prompting to Produce KNN Clusters

The next step for prompting was to aggregate data by thread_id, since LLMs need the texts concatenated together. I separated user messages from entire thread messages, to see whether one or the other would produce better clusters. I ended up using just user messages.
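The concat_thread_messages_df() helper from earlier does this aggregation; the core idea is a groupby plus join (a sketch, using the columns defined above):

# Concatenate all user messages per thread into one text blob for the LLM.
user_msgs = (
    clean_data_df[clean_data_df["role_name"] == "user"]
    .groupby("thread_id")["message_content"]
    .apply(" ".join)
    .rename("user_thread_text")
    .reset_index()
)
user_msgs.to_csv("thread_summary.csv", index=False)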

Example anonymized data for prompting. All message texts concatenated together.

With a CSV file ready for prompting, you can start using an LLM to do data science!

!pip install -q google-generativeai
import os
import google.generativeai as genai


# Get API key from local system
api_key=os.environ.get("GOOGLE_API_KEY")


# Configure API key
genai.configure(api_key=api_key)


# List all of the model names
for m in genai.list_models():
   if 'generateContent' in m.supported_generation_methods:
       print(m.name)


# Try different models and prompts
GEMINI_MODEL_FOR_SUMMARIES = "gemini-2.0-pro-exp-02-05"
model = genai.GenerativeModel(GEMINI_MODEL_FOR_SUMMARIES)
# Mix the prompt and CSV data.
full_input = prompt + "\n\nCSV Data:\n" + csv_data
# Inference call to Gemini LLM
response = model.generate_content(full_input)


# Save response.text as .json file...


# Check token counts and compare to model limit: 2 million tokens
print(response.usage_metadata)
Image by author. Top: Example LLM model names. Bottom: Example Gemini LLM inference-call token counts: prompt_token_count = input tokens; candidates_token_count = output tokens; total_token_count = total tokens used.

Unfortunately, the Gemini API kept cutting response.text short. I had better luck using AI Studio directly.
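If you hit the same truncation over the API, two things worth checking (a sketch, assuming the google-generativeai SDK; exact defaults vary by model):

# Raise the output cap and inspect why generation stopped.
model = genai.GenerativeModel(
    GEMINI_MODEL_FOR_SUMMARIES,
    generation_config=genai.GenerationConfig(max_output_tokens=8192),
)
response = model.generate_content(full_input)
# A finish_reason of MAX_TOKENS means the response hit the output limit.
print(response.candidates[0].finish_reason)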

Image by author: Screenshot of example outputs from Google AI Studio.

My 5 prompts to Gemini Flash & Pro (temperature set to 0) are below.

Prompt#1: Get thread summaries:
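(The exact prompt is in the screenshots; as a rough illustration only, not my actual wording, a thread-summary prompt has this shape:)

# Illustrative paraphrase only -- the real prompt is in the screenshot.
prompt = """You are given CSV data of Discord forum threads.
For each thread_id, summarize the user's questions in 1-2 sentences.
Return JSON: [{"thread_id": ..., "user_thread_summary": ...}, ...]"""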

Prompt#2: Get cluster stats:

Prompt#3: Perform initial clustering:

A silhouette score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Scores range from -1 to 1. A higher average silhouette score generally indicates better-defined clusters with good separation. For more details, check out the scikit-learn silhouette score documentation.
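For reference, this is the metric computed in code (a sketch with scikit-learn, assuming an embedding matrix X of the thread summaries):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try several cluster counts and compare average silhouette scores.
for n in range(2, 11):
    labels = KMeans(n_clusters=n, random_state=42, n_init=10).fit_predict(X)
    print(n, silhouette_score(X, labels))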

Applying it to Chroma data. Below, I show the results from Prompt#2 as a plot of silhouette scores. I chose N=6 clusters as a compromise between a high score and fewer clusters. Most LLMs today for data analysis take input as CSV and output JSON.

Image by author of aggregated, anonymized data. Left: I chose N=6 clusters as a compromise between a higher score and fewer clusters. Right: The actual clusters using N=6. The highest sentiment (highest scores) is for topics about Query. The lowest sentiment (lowest scores) is for topics about “Client Problems”.

From the plot above, you can see we’re finally getting into the meat of what users are saying!

Prompt#4: Get hierarchical cluster stats:

Prompt#5: Perform hierarchical clustering:
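For reference, the code route these two prompts replace is classic agglomerative clustering (a sketch with scipy, again assuming an embedding matrix X):

from scipy.cluster.hierarchy import fcluster, linkage

# Ward linkage builds the hierarchy; fcluster cuts it into 6 flat clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=6, criterion="maxclust")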

I also prompted the LLM to generate Streamlit code to visualize the clusters (since I’m not a JS expert 😄). Results for the same Chroma data are shown below.

Image by author of aggregated, anonymized data. Left image: Each scatterplot dot is a thread with hover-info. Right image: Hierarchical clustering with raw-data drill-down capabilities. Api and Package Errors looks like Chroma’s most urgent topic to fix, because sentiment is low and message volume is high.
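A trimmed-down sketch of the kind of Streamlit app the LLM generated (file and column names here are hypothetical):

import pandas as pd
import plotly.express as px
import streamlit as st

# Hypothetical output of the clustering prompt, one row per thread.
df = pd.read_csv("clustered_threads.csv")
st.title("Thread clusters")
fig = px.scatter(
    df, x="x", y="y", color="cluster_topic",
    hover_data=["thread_summary", "score"],
)
st.plotly_chart(fig)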

I found this very insightful. For Chroma, clustering revealed that while users were happy with topics like Query, Distance, and Performance, they were unhappy about areas such as Data, Client, and Deployment.

Experimenting with Custom Embeddings

I repeated the above clustering prompts, using just the numerical embedding (“user_embedding”) in the CSV instead of the raw text summaries (“user_text”). I’ve explained embeddings in detail in previous blogs, along with the risks of overfit models on leaderboards. OpenAI has reliable embeddings that are extremely inexpensive per API call. Below is an example code snippet to create embeddings.

import os

from openai import OpenAI


EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 512 # 512 or 1536 possible


# Initialize client with API key
openai_client = OpenAI(
   api_key=os.environ.get("OPENAI_API_KEY"),
)


# Function to create embeddings
def get_embedding(text, embedding_model=EMBEDDING_MODEL,
                 embedding_dim=EMBEDDING_DIM):
   response = openai_client.embeddings.create(
       input=text,
       model=embedding_model,
       dimensions=embedding_dim
   )
   return response.data[0].embedding


# Function to call per pandas df row in .apply()
def generate_row_embeddings(row):
   return {
       'user_embedding': get_embedding(row['user_thread_summary']),
   }


# Generate embeddings using pandas apply
embeddings_data = df.apply(generate_row_embeddings, axis=1)
# Add embeddings back into df as separate columns
df['user_embedding'] = embeddings_data.apply(lambda x: x['user_embedding'])
display(df.head())


# Save as CSV ...
Example data for prompting. Column “user_embedding” is an array of length 512 of floating-point numbers.

Interestingly, both Perplexity Pro and Gemini 2.0 Pro sometimes hallucinated cluster topics (e.g., misclassifying a question about slow queries as “Personal Matter”).

Image by author of aggregated, anonymized data. Both Perplexity Pro and Google’s Gemini 2.0 Pro hallucinated cluster topics when given an externally generated embedding column. Conclusion: when performing NLP with prompts, just keep the raw text and let the LLM create its own embeddings behind the scenes. Feeding in externally generated embeddings seems to confuse the LLM!
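If you do want to use external embeddings, the more reliable route is to cluster them yourself rather than hand them to an LLM; a minimal sketch (the file name is hypothetical, and the CSV round-trip stores each embedding as a string, so it must be parsed back):

import ast

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("thread_summary_with_embeddings.csv")
# Parse the stringified embedding lists back into a numeric matrix.
X = np.array([ast.literal_eval(e) for e in df["user_embedding"]])
df["cluster"] = KMeans(n_clusters=6, random_state=42, n_init=10).fit_predict(X)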

Clustering Across Multiple Discord Servers

Finally, I broadened the analysis to include Discord messages from three different vector DB vendors. The resulting visualization highlighted common issues, such as both Milvus and Chroma facing authentication problems.
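Combining the vendors is a straightforward concat before prompting (file names are hypothetical):

import pandas as pd

# Tag each vendor's thread summaries, then stack them into one CSV for the LLM.
frames = []
for vendor, path in [("milvus", "milvus_threads.csv"),
                     ("chroma", "chroma_threads.csv"),
                     ("qdrant", "qdrant_threads.csv")]:
    vdf = pd.read_csv(path)
    vdf["vendor"] = vendor
    frames.append(vdf)
all_vendors_df = pd.concat(frames, ignore_index=True)
all_vendors_df.to_csv("all_vendors_threads.csv", index=False)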

Image by author of aggregated, anonymized data: A multi-vendor vector DB dashboard displays top issues across vendors. One thing that stands out is that both Milvus and Chroma are having trouble with Authentication.

Summary

Here’s a summary of the steps I followed to perform semantic clustering using LLM prompts:

  1. Extract Discord threads.
  2. Format data into conversation turns with roles (“user”, “assistant”).
  3. Score sentiment and save as CSV.
  4. Prompt Google Gemini 2.0 Flash for thread summaries.
  5. Prompt Perplexity Pro or Gemini 2.0 Pro for clustering based on thread summaries, using the same CSV.
  6. Prompt Perplexity Pro or Gemini 2.0 Pro to write Streamlit code to visualize the clusters (because I’m not a JS expert 😆).

By following these steps, you can quickly transform raw forum data into actionable insights; what used to take days of coding can now be done in just one afternoon!

References

  1. Clio: Privacy-Preserving Insights into Real-World AI Use, https://arxiv.org/abs/2412.13678
  2. Anthropic blog about Clio, https://www.anthropic.com/research/clio
  3. Milvus Discord Server, last accessed Feb 7, 2025
    Chroma Discord Server, last accessed Feb 7, 2025
    Qdrant Discord Server, last accessed Feb 7, 2025
  4. Gemini models, https://ai.google.dev/gemini-api/docs/models/gemini
  5. Blog about Gemini 2.0 models, https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/
  6. Scikit-learn Silhouette Score
  7. OpenAI Matryoshka embeddings
  8. Streamlit