Agentic RAG Applications: Company Knowledge Slack Agents


I assumed that most companies would have built or implemented their own RAG agents by now.

An AI knowledge agent can dig through internal documentation — websites, PDFs, random docs — and answer employees in Slack (or Teams/Discord) within a few seconds. These bots should significantly cut down the time employees spend sifting through information.

I’ve seen a few of them at larger tech companies, like AskHR from IBM, but they aren’t all that mainstream yet.

If you’re keen to know how they’re built and how much it takes to build a simple one, this article is for you.

Parts this article will go through | Image by author

I’ll go through the tools, techniques, and architecture involved, along with the economics of building something like this. I’ll also include a section on what you’ll end up focusing on the most.

Things you’ll spend time on | Image by author

There’s also a demo at the end of what this looks like in Slack.

If you’re already familiar with RAG, feel free to skip the next section — it’s just a bit of background on agents and RAG.

What’s RAG and Agentic RAG?

Most of you reading this will know what Retrieval-Augmented Generation (RAG) is, but if you’re new to it, it’s a technique for fetching information that gets fed to the large language model (LLM) before it answers the user’s question.

This lets us provide relevant information from various documents to the bot in real time so it can answer the user correctly.

Simple RAG | Image by author

The retrieval system does more than simple keyword search, since it finds similar matches rather than just exact ones. For example, if someone asks about fonts, a similarity search might return documents on typography.

Many would say that RAG is a fairly simple concept to understand, but how you store information, how you fetch it, and which embedding models you use still matter a lot.

If you’re keen to learn more about embeddings and retrieval, I’ve written about this here.

Today, people have gone further and primarily work with agent systems.

In agent systems, the LLM can decide where and how it should fetch information, rather than simply having content dumped into its context before generating a response.

Agent system with RAG tools — the yellow dot is the agent and the grey dots are the tools | Image by author

It’s important to remember that just because more advanced tools exist doesn’t mean you should always use them. You want to keep the system intuitive and also keep API calls to a minimum.

With agent systems, the number of API calls goes up, since the agent has to call at least one tool and then make another call to generate a response.

That said, I really like the user experience of the bot “going somewhere” — to a tool — to look something up. Seeing that flow in Slack helps the user understand what’s happening.

But going with an agent, or using a full framework, isn’t necessarily the better choice. I’ll elaborate on this as we go.

Technical Stack

There are a ton of options for agent frameworks, vector databases, and deployment, so I’ll go through a few of them.

For deployment, since we’re working with Slack webhooks, we’re dealing with an event-driven architecture where the code only runs when a question comes in from Slack.

To keep costs to a minimum, we can use serverless functions. The choice is either going with AWS Lambda or picking a newer vendor.
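To make the event-driven setup concrete, here is a minimal sketch of what a serverless Slack webhook could look like on Modal. The function names, the background handoff, and the handler logic are my own placeholders, not the exact setup used here.

import modal

app = modal.App("slack-knowledge-agent")
image = modal.Image.debian_slim().pip_install("slack-sdk")

@app.function(image=image)
@modal.web_endpoint(method="POST")
def slack_events(payload: dict):
    # Slack sends a one-time URL verification challenge when you register the endpoint
    if payload.get("type") == "url_verification":
        return {"challenge": payload["challenge"]}
    # Hand the event to a background function so we can acknowledge within Slack's 3-second limit
    answer_question.spawn(payload)
    return {"ok": True}

@app.function(image=image)
def answer_question(payload: dict):
    # This is where the agent would run and post the answer back to Slack
    ...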

Lambda vs Modal comparison, find the full table here | Image by author

Platforms like Modal are technically built to serve LLM models, but they also work well for long-running ETL processes and for LLM apps in general.

Modal hasn’t been battle-tested as much, and you’ll notice that in terms of latency, but it’s very smooth to work with and offers very cheap CPU pricing.

I should note, though, that when setting this up with Modal on the free tier, I’ve had a few 500 errors, but that can be expected.

As for picking the agent framework, this is completely optional. I did a comparison piece a few weeks ago on open-source agentic frameworks, which you can find here, and the one I left out was LlamaIndex.

So I decided to give it a try here.

The last thing you need to pick is a vector database, or a database that supports vector search. This is where we store the embeddings and other metadata, so we can perform similarity search when a user’s question comes in.

There are a lot of options out there, but I think the ones with the most potential are Weaviate, Milvus, pgvector, Redis, and Qdrant.

Vector DB comparison, find the full table here | Image by author

Both Qdrant and Milvus have pretty generous free tiers for their cloud options. Qdrant, I know, lets us store both dense and sparse vectors. LlamaIndex, like most agent frameworks, supports many different vector databases, so any of them can work.

I’ll try Milvus more in the future to compare performance and latency, but for now, Qdrant works well.

Redis is a solid pick too, or really any vector extension of your existing database.

Cost & time to build

In terms of time and cost, you have to account for engineering hours, cloud, embedding, and large language model (LLM) costs.

It doesn’t take much time to boot up a framework and run something minimal. What takes time is connecting the content properly, prompting the system, parsing the outputs, and making sure it runs fast enough.

As for overhead costs, the cloud cost of running the agent system is minimal for a single bot serving one company on serverless functions, as you saw in the table in the last section.

For the vector database, however, it gets more expensive the more data you store.

Both Zilliz and Qdrant Cloud offer a decent free tier for your first 1 to 5 GB of data, so unless you go beyond a few thousand chunks, you may not pay anything at all.

Vector DB cost comparison, find the full table here | Image by author

You’ll start paying once you go beyond the thousands mark, with Weaviate being the most expensive of the vendors above.

As for the embeddings, these are generally very cheap.

You can see a table below on using OpenAI’s text-embedding-3-small with chunks of various sizes if you embed 1 to 10 million texts.

Embedding costs per chunk examples — find the full table here | Image by author
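As a rough back-of-the-envelope check (assuming text-embedding-3-small at $0.02 per 1M tokens, its published price at the time of writing, and mid-sized chunks of about 500 tokens):

price_per_million_tokens = 0.02   # USD, text-embedding-3-small (assumed pricing)
chunks = 1_000_000
tokens_per_chunk = 500            # roughly 375 words per chunk

total_tokens = chunks * tokens_per_chunk
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"${cost:.2f}")             # -> $10.00 to embed a million mid-sized chunks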

By the time people start optimizing embedding and storage costs, they’ve usually moved beyond embedding millions of texts.

The thing that matters most, though, is which large language model (LLM) you use. You need to think about API prices, since an agent system will typically call an LLM two to four times per run.

Example prices for LLMs in agent systems, full table here | Image by author

For this system, I’m using GPT-4o-mini or Gemini Flash 2.0, which are the cheapest options.

So let’s say a company uses the bot a few hundred times per day, and each run costs us 2–4 API calls; we’d end up at less than a dollar per day and around $10–50 per month.
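A quick sketch of that estimate, under assumed prices (GPT-4o-mini at roughly $0.15 per 1M input tokens and $0.60 per 1M output tokens) and assumed token counts per call:

input_price = 0.15 / 1_000_000    # USD per input token (GPT-4o-mini, assumed)
output_price = 0.60 / 1_000_000   # USD per output token (assumed)

calls_per_run = 3                 # gate call + tool call + final synthesis
input_tokens_per_call = 2_000     # prompt plus retrieved chunks
output_tokens_per_call = 400

runs_per_day = 300
daily = runs_per_day * calls_per_run * (
    input_tokens_per_call * input_price + output_tokens_per_call * output_price
)
print(f"${daily:.2f}/day, ${daily * 30:.0f}/month")  # roughly $0.49/day, $15/month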

You can see that switching to a more expensive model would increase the monthly bill by 10x to 100x. Using ChatGPT is generally subsidized for free users, but when you build your own applications, you’re the one financing it.

There will be smarter and cheaper models in the future, so whatever you build now will likely improve over time. But start small, because costs add up, and for simple systems like this you don’t need the models to be exceptional.

The next section gets into how to build this system.

The architecture (processing documents)

The system has two parts. The first is how we split up documents — what we call chunking — and embed them. This first part is very important, as it dictates how the agent answers later.

Splitting documents into chunks with metadata attached | Image by author

So, to make sure you prepare all the sources properly, you need to think carefully about how to chunk them.

If you look at the document above, you can see that we can lose context if we split purely on headings, but also when splitting on character count, where the paragraphs under the first heading get broken up for being too long.

Losing context in chunks | Image by author

You need to be smart about making sure each chunk has enough context (but not too much). You also have to make sure each chunk is attached to metadata, so it’s easy to trace back to where it was found.

Setting metadata on the sources to trace back where the chunks were found | Image by author

This is where you’ll spend the most time, and honestly, I think there should be better tools out there to do this intelligently.

I ended up using Docling for PDFs, building on top of it to combine elements based on headings and paragraph sizes. For web pages, I built a crawler that looked over page elements to decide whether to chunk based on anchor tags, headings, or general content.
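To give you an idea, here is a minimal sketch of heading-aware chunking with metadata attached to every chunk. The Chunk class, the field names, and the 1,500-character limit are my own assumptions rather than a fixed recipe — the point is keeping the heading as context and carrying the source along.

import re
from dataclasses import dataclass, field

MAX_CHARS = 1_500  # assumed upper bound per chunk; tune for your embedding model

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_markdown(doc_text: str, source_url: str) -> list[Chunk]:
    """Split on headings, then split oversized sections while keeping the heading as context."""
    chunks = []
    sections = re.split(r"(?m)^(#{1,3} .+)$", doc_text)
    # re.split keeps the captured headings, so pair each heading with the body that follows
    for heading, body in zip(sections[1::2], sections[2::2]):
        current = ""
        for para in body.split("\n\n"):
            if current and len(current) + len(para) > MAX_CHARS:
                chunks.append(Chunk(f"{heading}\n{current}", {"source": source_url, "heading": heading}))
                current = ""
            current += para + "\n\n"
        if current.strip():
            chunks.append(Chunk(f"{heading}\n{current}", {"source": source_url, "heading": heading}))
    return chunks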

Remember, if the bot is supposed to cite sources, each chunk must be attached to URLs, anchor tags, page numbers, block IDs, or permalinks, so the system can locate exactly where the information being used came from.

Since much of the content you’re working with is scattered and often low quality, I also decided to summarize texts using an LLM. These summaries were given labels with higher authority, which meant they were prioritized during retrieval.

Summarizing docs with higher authority | Image by author
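As an illustration, a summary node could be created and tagged like this. The authority field and the prompt are my own convention, not something LlamaIndex or the vector store requires — retrieval then just needs to boost or filter on that metadata.

from openai import OpenAI
from llama_index.core.schema import TextNode

llm_client = OpenAI()

def summarize_to_node(doc_text: str, source_url: str) -> TextNode:
    """Summarize a messy document and store the summary as a higher-authority node."""
    resp = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize this internal document in a few factual paragraphs."},
            {"role": "user", "content": doc_text},
        ],
    )
    return TextNode(
        text=resp.choices[0].message.content,
        metadata={"source": source_url, "authority": "high", "type": "summary"},
    )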

There’s also the option to push the summaries into their own tools and keep the deep-dive information separate, letting the agent decide which one to use. But that can look strange to users, since it isn’t intuitive behavior.

Still, I have to stress that if the quality of the source information is poor, it’s hard to make the system work well.

For example, if a user asks how an API request should be made and there are four different web pages giving different answers, the bot won’t know which one is most relevant.

To demo this, I had to do some manual review. I also had AI do deeper research on the company to help fill in gaps, and then I embedded that too.

In the future, I think I’ll build something better for document ingestion — probably with the help of a language model.

The architecture (the agent)

For the second part, where we connect to this data, we need to build a system where an agent can reach different tools that hold different slices of information from our vector database.

We stick to one agent only, to keep it easy to control. This one agent can decide what information it needs based on the user’s question.

The agent system | Image by author

It’s good not to overcomplicate things by building it out with too many agents, or you’ll run into issues, especially with these smaller models.

Although this may go against my own recommendations, I did set up a first LLM function that decides whether we need to run the agent at all.

First LLM call to decide whether to run the bigger agent | Image by author

This was primarily for the user experience, since it takes a few extra seconds to boot up the agent (even when starting it as a background task when the container starts).
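A minimal sketch of what such a gate call might look like — the prompt and the function name are my own, not part of any framework:

from openai import OpenAI

gate_client = OpenAI()

def needs_agent(user_msg: str) -> bool:
    """Cheap, fast first call: decide whether the question needs the full agent and its tools."""
    resp = gate_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Answer YES if this Slack message requires looking up company documentation, "
                "otherwise answer NO (greetings, chit-chat, thanks)."
            )},
            {"role": "user", "content": user_msg},
        ],
        max_tokens=2,
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")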

As for building the agent itself, this part is simple, since LlamaIndex does most of the work for us. You can use the FunctionAgent and pass in the different tools when setting it up.

from llama_index.core.agent.workflow import FunctionAgent

# Only runs if the first LLM call decides the agent is needed
access_links_tool = get_access_links_tool()
public_docs_tool = get_public_docs_tool()
onboarding_tool = get_onboarding_information_tool()
general_info_tool = get_general_info_tool()

formatted_system_prompt = get_system_prompt(team_name)

# One agent that picks between the vector-store-backed tools
agent = FunctionAgent(
    tools=[onboarding_tool, public_docs_tool, access_links_tool, general_info_tool],
    llm=global_llm,
    system_prompt=formatted_system_prompt,
)

The tools have access to different data in the vector database, and they are wrappers around the CitationQueryEngine. This engine helps cite the source nodes in the text. We can access the source nodes at the end of the agent run and attach them to the message and its footer.
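A hedged sketch of what one of those tool wrappers might look like — the index variable, tool name, and description are placeholders for whatever your vector store index is called:

from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.tools import QueryEngineTool

def get_public_docs_tool():
    # `public_docs_index` is assumed to be a VectorStoreIndex built on the Qdrant collection
    citation_engine = CitationQueryEngine.from_args(
        public_docs_index,
        similarity_top_k=8,
        citation_chunk_size=512,
    )
    return QueryEngineTool.from_defaults(
        query_engine=citation_engine,
        name="public_docs",
        description="Search the public product documentation and cite sources.",
    )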

To make sure the user experience is good, you can tap into the event stream to send updates back to Slack.

from llama_index.core.agent.workflow import ToolCall, ToolCallResult

handler = agent.run(user_msg=full_msg, ctx=ctx, memory=memory)

# Stream intermediate events so the user can see which tool the agent is checking
async for event in handler.stream_events():
    if isinstance(event, ToolCall):
        display_tool_name = format_tool_name(event.tool_name)
        post_thinking(f"✅ Checking {display_tool_name}")
    if isinstance(event, ToolCallResult):
        post_thinking("✅ Done checking...")

final_output = await handler
final_text = str(final_output)
blocks = build_slack_blocks(final_text, mention)

post_to_slack(
    channel_id=channel_id,
    blocks=blocks,
    timestamp=initial_message_ts,
    client=client,
)

Make sure to format the messages and Slack blocks well, and refine the system prompt for the agent so it formats the answers correctly based on the information the tools return.

The architecture should be simple enough to understand, but there are still some retrieval techniques we should dig into.

Techniques you can try

A lot of people emphasize certain techniques when building RAG systems, and they’re partially right. You should use hybrid search together with some kind of re-ranking.

How the query tools work under the hood — a bit simplified | Image by author

The first technique I’ll mention is hybrid search when we perform retrieval.

I mentioned that we use semantic similarity to fetch chunks of information in the various tools, but you also have to account for cases where exact keyword search is required.

Just imagine a user asking for a specific certificate name, like CAT-00568. In that case, the system needs to find exact matches just as much as fuzzy ones.

With hybrid search, supported by both Qdrant and LlamaIndex, we use both dense and sparse vectors.

from llama_index.vector_stores.qdrant import QdrantVectorStore

# When setting up the vector store (both for embedding and fetching)
vector_store = QdrantVectorStore(
    client=client,
    aclient=async_client,
    collection_name="knowledge_bases",
    enable_hybrid=True,                    # store dense + sparse vectors
    fastembed_sparse_model="Qdrant/bm25",  # BM25-style sparse embeddings via FastEmbed
)
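On the query side, hybrid mode then has to be requested when building the retriever or query engine. A sketch, assuming `index` is a VectorStoreIndex built on top of this vector store:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)

retriever = index.as_retriever(
    vector_store_query_mode="hybrid",  # fuse dense and sparse results
    similarity_top_k=8,                # dense candidates
    sparse_top_k=8,                    # sparse (BM25) candidates
)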

Sparse is ideal for exact keywords but blind to synonyms, whereas dense is great for “fuzzy” matches (“benefits policy” matches “employee perks”) but can miss exact literal strings.

Once the results are fetched, it’s useful to apply deduplication and re-ranking to filter out irrelevant chunks before sending them to the LLM for citation and synthesis.

from llama_index.core.postprocessor import LLMRerank, SimilarityPostprocessor
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.schema import MetadataMode
from llama_index.llms.openai import OpenAI

# Re-rank retrieved nodes with a cheaper model and drop near-duplicate chunks
reranker = LLMRerank(llm=OpenAI(model="gpt-3.5-turbo"), top_n=5)
dedup = SimilarityPostprocessor(similarity_cutoff=0.9)

engine = CitationQueryEngine(
    retriever=retriever,
    node_postprocessors=[dedup, reranker],
    metadata_mode=MetadataMode.ALL,
)

This part wouldn’t be needed if your data were exceptionally clean, which is why it shouldn’t be your main focus. It adds overhead and another API call.

It’s also not necessary to use a large model for re-ranking, but you’ll have to do some research on your own to figure out your options.

These techniques are easy to understand and quick to set up, so they aren’t where you’ll spend most of your time.

What you’ll actually spend time on

Most of the things you’ll spend time on aren’t that sexy. It’s prompting, reducing latency, and chunking documents correctly.

Before you start, you should look into the prompt templates from various frameworks to see how they prompt the models. You’ll spend quite a bit of time making sure the system prompt is well crafted for the LLM you choose.

The second thing you’ll spend most of your time on is making it fast. I’ve looked into internal tools from tech companies building AI knowledge agents and found they typically respond in about 8 to 13 seconds.

So you want something in that range.

Using a serverless provider can be a problem here because of cold starts. LLM providers also introduce their own latency, which is hard to control.

One or two lagging API calls drag down the entire system | Image by author

That said, you can look into spinning up resources before they’re needed, switching to lower-latency models, skipping frameworks to reduce overhead, and generally decreasing the number of API calls per run.

The last thing, which takes a huge amount of work and which I’ve mentioned before, is chunking documents.

If you had exceptionally clean data with clear headers and separations, this part would be easy. But more often, you’ll be dealing with poorly structured HTML, PDFs, raw text files, Notion boards, and Confluence notes — often scattered and formatted inconsistently.

The challenge is figuring out how to programmatically ingest these documents so the system gets the full information it needs to answer a question.

Just working with PDFs, for example, you’ll have to extract tables and images properly, separate sections by page numbers or layout elements, and trace each source back to the right page.

You want enough context, but not chunks that are too large, or it becomes harder to retrieve the right information later.

This kind of work doesn’t generalize well. You can’t just push documents in and expect the system to understand them — you have to think it through before you build it.

How to build it out further

At this point, the system works well for what it’s supposed to do, but there are a few pieces I should cover (or people will think I’m simplifying too much). You’ll want to implement caching, a way to update the data, and long-term memory.

Caching isn’t essential, but in larger systems you can at least cache the question’s embedding to speed up retrieval, and store recent source results for follow-up questions. I don’t think LlamaIndex helps much here, but you should be able to intercept the QueryTool on your own.
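A minimal sketch of what caching the query embedding could look like — an in-memory dict here, though in production you’d likely use Redis with a TTL; the function and its signature are my own:

import hashlib

_embedding_cache: dict[str, list[float]] = {}

def embed_with_cache(text: str, embed_fn) -> list[float]:
    """Cache query embeddings by a hash of the normalized text so repeated or
    follow-up questions skip the embedding API call."""
    key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]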

You’ll also want a way to continuously update the information in the vector database. This is the biggest headache — it’s hard to know when something has changed, so you need some kind of change-detection method together with an ID for each chunk.

You could also just use periodic re-embedding strategies, where you replace a chunk with different meta tags altogether (this is my preferred approach because I’m lazy).
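One simple change-detection approach is a deterministic ID per chunk location plus a content hash, so you only re-embed what actually changed. A sketch, with the metadata fields assumed from the chunking example earlier:

import hashlib
import uuid

def chunk_id(source: str, heading: str) -> str:
    """Deterministic ID per chunk location, so re-ingesting the same section overwrites it."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{heading}"))

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def needs_reembedding(chunk, stored_hashes: dict[str, str]) -> bool:
    """Only re-embed when the text behind a known chunk ID has changed."""
    cid = chunk_id(chunk.metadata["source"], chunk.metadata["heading"])
    return stored_hashes.get(cid) != content_hash(chunk.text)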

The last thing I want to mention is long-term memory for the agent, so it can understand conversations you’ve had in the past. For that, I’ve implemented some state by fetching history from the Slack API. This lets the agent see around 3–6 previous messages when responding.
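A sketch of how that history fetch could look with the Slack SDK — the role mapping and the limit of six messages are my own choices:

from slack_sdk import WebClient

def get_thread_history(client: WebClient, channel_id: str, thread_ts: str, limit: int = 6):
    """Fetch the last few messages in the Slack thread to use as lightweight memory."""
    resp = client.conversations_replies(channel=channel_id, ts=thread_ts, limit=limit)
    return [
        {"role": "assistant" if m.get("bot_id") else "user", "content": m.get("text", "")}
        for m in resp["messages"]
    ]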

We don’t want to push in too much history, since the context window grows — which not only increases cost but also tends to confuse the agent.

That said, there are better ways to handle long-term memory using external tools. I’m keen to write more about that in the future.

Learnings and so forth

After doing this for a while, I have a few notes to share about working with frameworks and keeping it simple (advice I personally don’t always follow).

You learn a lot from using a framework, especially how to prompt well and how to structure the code. But at some point, working around the framework adds overhead.

For instance, in this system I’m bypassing the framework a bit by adding an initial API call that decides whether to move on to the agent, and responds to the user quickly.

If I had built this without a framework, I think I could have handled that kind of logic better, with the first model deciding which tool to call right away.

LLM API calls in the system | Image by author

I haven’t tried this, but I’m assuming it would be cleaner.

Also, LlamaIndex optimizes the user query before retrieval, as it should.

But sometimes it reduces the query too much, and I need to go in and fix it. The citation synthesizer doesn’t have access to the conversation history, so with that overly simplified query, it doesn’t always answer well.

The abstractions can sometimes cause the system to lose context | Image by author

With a framework, it’s also hard to trace where latency is coming from in the workflow, since you can’t always see everything, even with observability tools.

Most developers recommend using frameworks for quick prototyping or bootstrapping, then rewriting the core logic with direct calls in production.

It’s not that the frameworks aren’t useful, but at some point it’s better to write something you fully understand that does only what you need.

The general advice is to keep things as simple as possible and to minimize LLM calls (which I’m not even fully doing myself here).

If all you need is RAG and not an agent, stick with that.

You can create a simple LLM call that sets the right parameters for the vector DB. From the user’s perspective, it will still look like the system is “looking into the database” and returning relevant info.
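A sketch of that idea: one cheap LLM call that turns the question into vector DB parameters, then a plain retrieval. The JSON schema, collection names, and the retriever_for helper are hypothetical placeholders:

import json
from openai import OpenAI

router_client = OpenAI()

def plan_retrieval(user_msg: str) -> dict:
    """Single call that picks a collection and rewrites the query — no agent loop."""
    resp = router_client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                'Return JSON: {"collection": "onboarding" | "public_docs" | "access_links", '
                '"query": "<rewritten search query>"}'
            )},
            {"role": "user", "content": user_msg},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# params = plan_retrieval("How do I get VPN access?")
# nodes = retriever_for(params["collection"]).retrieve(params["query"])  # hypothetical helper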

If you’re going down the same path, I hope this was useful.

There’s a bit more to it, though. You’ll want to implement some kind of evaluation, guardrails, and monitoring (I’ve used Phoenix here).

Once finished, though, the result will look like this:

Example of the company agent searching through PDFs and website docs in Slack | Image by author

If you want to follow my writing, you can find me here, on my website, or on LinkedIn.

I’ll try to dive deeper into agentic memory, evals, and prompting over the summer.
