The Essential Guide to Effectively Summarizing Massive Documents, Part 1


RAG is a well-discussed and widely implemented solution for optimizing document summarization using GenAI technologies. Nevertheless, like any new technology or solution, it is vulnerable to edge-case challenges, especially in today’s enterprise environment. Two major concerns are contextual length coupled with per-prompt cost, and the previously mentioned ‘Lost in the Middle’ context problem. Let’s dive a bit deeper to understand these challenges.

Note: I will be performing the exercises in Python using the LangChain, Scikit-Learn, NumPy and Matplotlib libraries for quick iterations.

Today, with automated workflows enabled by GenAI, analyzing big documents has become an industry expectation/requirement. People want to quickly find relevant information in medical reports or financial audits by just prompting the LLM. But there is a caveat: enterprise documents are not like the documents or datasets we deal with in academia; they are considerably larger, and the pertinent information can be present just about anywhere within them. Hence, methods like data cleaning/filtering are often not a viable option, since domain knowledge regarding these documents is not always given.

In addition to this, even the latest Large Language Models (LLMs) like GPT-4o by OpenAI, with context windows of 128K tokens, cannot just consume these documents in a single shot, and even if they did, the quality of the response would not meet standards, especially for the cost it would incur. To showcase this, let’s take a real-world example of attempting to summarize the Employee Handbook of GitLab, which can be downloaded here. This document is available free of charge under the MIT license on their GitHub repository.

1. We start by loading the document and also initializing our LLM. To keep this exercise relevant, I’ll make use of GPT-4o.

from langchain_community.document_loaders import PyPDFLoader

# Load PDFs
pdf_paths = ["/content/gitlab_handbook.pdf"]
documents = []

for path in pdf_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")

2. Then we divide the document into smaller chunks (this is for embedding; I’ll explain why in the later steps).

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Split documents into chunks
splits = text_splitter.split_documents(documents)

3. Now, let’s calculate how many tokens make up this document. For this, we’ll iterate through each document chunk and sum up the total tokens that make up the document.

total_tokens = 0

for chunk in splits:
    text = chunk.page_content  # Assuming `page_content` is where the text is stored
    num_tokens = llm.get_num_tokens(text)  # Get the token count for each chunk
    total_tokens += num_tokens

print(f"Total number of tokens in the book: {total_tokens}")

# Total number of tokens in the book: 254006

As we can see, the number of tokens is 254,006, while the context window limit for GPT-4o is 128,000. This document can’t be sent in a single pass through the LLM’s API. In addition to this, considering this model’s pricing of $0.00500 / 1K input tokens, a single request sent to OpenAI for this document would cost $1.27! This doesn’t sound horrible until you present it in an enterprise paradigm with multiple users and daily interactions across many such large documents, especially in a startup scenario where many GenAI solutions are being born.
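As a quick sanity check of that arithmetic, here is a minimal sketch using the token count we just computed; the price constant is simply the rate quoted above and is an assumption that may not reflect current OpenAI pricing:

# Rough input-cost estimate; the per-1K-token price is taken from the
# paragraph above and may not reflect current pricing
price_per_1k_input_tokens = 0.005  # USD

estimated_cost = (total_tokens / 1000) * price_per_1k_input_tokens
print(f"Estimated cost of a single pass: ${estimated_cost:.2f}")

# Estimated cost of a single pass: $1.27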

Another challenge faced by LLMs is the ‘Lost in the Middle’ context problem, as discussed in detail in this paper. Research and my experience with RAG systems handling multiple documents suggest that LLMs are not very robust when it comes to extracting information from long context inputs. Model performance degrades considerably when relevant information is somewhere in the middle of the context. However, performance improves when the required information is either at the beginning or the end of the provided context. Document re-ranking is a solution that has become a topic of progressively heavy discussion and research to tackle this specific issue. I will be exploring a few of these methods in another post. For now, let us get back to the solution we are exploring, which utilizes K-Means Clustering.

Okay, I admit I sneaked in a technical concept in the last section; allow me to explain it (for those who are not familiar with the method, I’ve got you).

First, the fundamentals

To understand K-means clustering, we should first know what clustering is. Consider this: we have a messy desk with pens, pencils, and notes all scattered together. To clean up, one would group like items together, all pens in one group, pencils in another, and notes in another, creating essentially 3 separate groups (not promoting segregation). Clustering is the same process: among a set of data (in our case, the different chunks of document text), similar data or information are grouped together, creating a clear separation of concerns for the model and making it easier for our RAG system to pick and choose information effectively and efficiently instead of having to go through all of it like a greedy method.

K, Means?

K-means is a specific method to perform clustering (there are other methods, but let’s not information dump). Let me explain how it works in 5 easy steps, with a small code sketch after the list:

  1. Picking the number of groups (K): how many groups we want the data to be divided into
  2. Choosing group centers: Initially, a center value for each of the K groups is randomly chosen
  3. Group assignment: Each data point is then assigned to a group based on how close it is to the previously chosen centers. Example: items closest to center 1 are assigned to group 1, items closest to center 2 will be assigned to group 2…and so on until the Kth group.
  4. Adjusting the centers: After all the data points have been assigned, we calculate the average of the positions of the items in each group, and these averages become the new centers to improve accuracy (because we had initially chosen them at random).
  5. Rinse and repeat: With the new centers, the data point assignments are again updated for the K groups. This is done until the distance (mathematically, the Euclidean distance) is minimal for items within a group and maximal from the data points of other groups, ergo optimal segregation.

While this may be quite a simplified explanation, a more detailed and technical explanation (for my fellow nerds) of this algorithm can be found here.
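To make the five steps above concrete, here is a minimal NumPy sketch of the K-means loop on toy 2-D points. This is only an illustration of the algorithm, not the embedding-based version we will apply to the document chunks; the fixed iteration count and variable names are my own simplifications:

import numpy as np

np.random.seed(42)
points = np.random.rand(100, 2)  # toy data: 100 points in 2-D
K = 3                            # Step 1: pick the number of groups

# Step 2: choose K initial centers at random from the data
centers = points[np.random.choice(len(points), K, replace=False)]

for _ in range(10):  # Step 5: repeat (fixed iteration count here for brevity)
    # Step 3: assign each point to its nearest center (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 4: move each center to the average of the points assigned to it
    centers = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
        for k in range(K)
    ])

print(labels[:10])  # cluster index of the first few points

Scikit-Learn, already listed among our libraries, packages this same loop as sklearn.cluster.KMeans, which is the kind of implementation we would reach for when clustering the actual chunk embeddings.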
