Testing Anthropic Claude’s 100k-token window on SEC 10-K Filings
High-Level Findings
Overview of Methodology
Tutorial Setup
Analyzing a Single Document
Analyzing Multiple Documents
Conclusion

LlamaIndex Blog

Anthropic’s 100K Context Window expansion, just released yesterday, has taken the AI community by storm. A 100K token limit is roughly 75K words (~3x GPT-4-32k’s context window, ~25x that of GPT-3.5/ChatGPT); this implies that you could now fit 300 pages of text in a single inference call.

One of the core use cases highlighted in the Anthropic blog post is analyzing an SEC 10-K filing; the model is able to ingest the entire report and produce answers to different questions.

Coincidentally, we published a tutorial a few months ago showing how LlamaIndex + Unstructured + GPT3 could help you perform different queries over UBER SEC 10-K filings. LlamaIndex provides a comprehensive toolkit to help manage external data on top of any LLM with a limited context window, and we showed that we could execute a diverse range of queries, from questions over a single document to comparing sections across documents.

How well does Anthropic’s 100K model do over UBER SEC 10-K filings? Furthermore, how well does it do without the help of any of LlamaIndex’s more advanced data structures? In this blog post we show the performance of Anthropic’s model on different queries, using the simplest data structure available: the list index.

Where Anthropic’s 100k model does well:

  • Anthropic’s model exhibits a strong capability to synthesize insights across the entire context window to answer the question at hand (assuming we set response_mode="tree_summarize"; see below). It can miss details though; see below!
  • This one was surprising to us. Anthropic’s model is able to crunch an entire UBER 10-K filing in ~60–90 seconds, which seems long but is much faster than repeated API calls to GPT-3 (which, when added up, can take minutes).
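To see why a single large-context call can win on latency, here is a rough back-of-envelope calculation (the per-call numbers are illustrative assumptions based on the figures above, not a benchmark):

```python
# Compare: one ~170K-token filing processed via repeated 4K-token GPT-3 calls
# vs. a couple of ~100K-token Claude calls.
doc_tokens = 170_000          # approximate size of one UBER 10-K filing
gpt3_window = 4_000           # usable tokens per GPT-3 call (assumed)
gpt3_call_latency = 8         # seconds per call (empirically 6-10s)

num_gpt3_calls = -(-doc_tokens // gpt3_window)   # ceiling division
gpt3_total = num_gpt3_calls * gpt3_call_latency  # total seconds for GPT-3

claude_total = 90             # seconds, upper end of the observed 60-90s range

print(num_gpt3_calls)  # -> 43
print(gpt3_total)      # -> 344 (i.e. ~6 minutes vs. ~1.5 minutes for Claude)
```

Even with generous assumptions for GPT-3, the chained calls add up to several minutes.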

Where Anthropic’s 100k model doesn’t do well:

We wanted to test the capabilities of Anthropic’s 100K model on top of the UBER 10-K filings from 2019–2022. We also wanted to do this while using as few retrieval/synthesis constructs as possible. This means no embeddings, and no fancy retrieval mechanisms.

Ideally, we would directly insert an entire 10-K filing (or even all four 10-K filings) into the prompt. However, we found that a single UBER 10-K filing actually consists of ~170K tokens, exceeding the 100K limit. This means that we still have to chunk up each filing!

We end up using our list index data structure: we split each text into massive ~100K-token chunks, and use our response synthesis strategies to synthesize an answer across multiple chunks.
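As a rough illustration of the chunking step (this is not the actual LlamaIndex implementation — here we approximate one token as ~4 characters and split on that basis):

```python
def split_into_chunks(text: str, max_tokens: int = 100_000,
                      chars_per_token: int = 4) -> list[str]:
    """Naively split text into chunks of roughly `max_tokens` tokens,
    approximating one token as ~4 characters of English text."""
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# A ~170K-token filing (~680K characters) splits into two chunks:
filing_text = "x" * 680_000
chunks = split_into_chunks(filing_text)
print(len(chunks))  # -> 2
```

A real tokenizer would give exact boundaries, but the heuristic explains why a ~170K-token filing becomes two list-index nodes.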

We run some queries over each filing as well as over multiple filings, similar to our original blog post. We report the results below.

Our data ingestion is the same as in the LlamaIndex + Unstructured blog post. We use Unstructured’s HTML parser to parse the HTML DOM into nicely formatted text. We then create a Document object for each SEC filing.

You can access Unstructured data loaders on LlamaHub.

from llama_index import download_loader
from pathlib import Path

UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

loader = UnstructuredReader()
doc_set = {}
all_docs = []
years = [2022, 2021, 2020, 2019]
for year in years:
    year_doc = loader.load_data(file=Path(f'./data/UBER/UBER_{year}.html'), split_documents=False)[0]
    # insert year metadata into each document
    year_doc.extra_info = {"year": year}
    doc_set[year] = year_doc
    all_docs.append(year_doc)

Next, we want to set up the Anthropic LLM. We use claude-v1 by default. We also need to manually define the new 100K-token input size within our PromptHelper object; this will help us figure out how to “compact” context into the input prompt space during response synthesis.

We set the max_input_size to 100K and the output length to 2048. We also set the context chunk size to a high value (95K, leaving some buffer room for the rest of the prompt). Context will only be chunked if the number of tokens exceeds this limit.

from llama_index import PromptHelper, LLMPredictor, ServiceContext
from langchain.llms import Anthropic

# define prompt helper
# set maximum input size
max_input_size = 100000
# set number of output tokens
num_output = 2048
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

llm_predictor = LLMPredictor(llm=Anthropic(model="claude-v1.3-100k", temperature=0, max_tokens_to_sample=num_output))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper,
    chunk_size_limit=95000
)

Let’s first analyze queries over a single document. We construct a list index over the 2019 UBER 10-K:

from llama_index import GPTListIndex

list_index = GPTListIndex.from_documents([doc_set[2019]], service_context=service_context)
print(len(list_index.index_struct.nodes))

As mentioned, the 10-K exceeds the 100K token limit, so there are two nodes within the list index.

We then ask a question: “What were some of the biggest risk factors in 2019?”

Recall that there are two approaches within LlamaIndex for response synthesis across multiple nodes (where the total context exceeds the context window): a “create-and-refine” strategy and a “tree summarize” strategy.

  • Create and refine: sequentially go through each retrieved Node. Use a Question-Answer Prompt for the first Node, and a Refine Prompt for subsequent Nodes. Make a separate LLM call per Node.
  • Tree summarize: given a set of Node objects and the query, recursively construct a tree using the Question-Answer Prompt and return the root node as the response. Good for summarization purposes.
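The two strategies can be sketched in simplified Python (where `llm` is a stand-in for the actual LLM call and the prompt templates are illustrative, not LlamaIndex's real prompts):

```python
def create_and_refine(llm, query, chunks):
    # First chunk: answer from scratch; later chunks: refine the running answer.
    answer = llm(f"Context: {chunks[0]}\nAnswer the question: {query}")
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {chunk}\n"
            f"Refine the existing answer to the question: {query}"
        )
    return answer  # one sequential LLM call per chunk

def tree_summarize(llm, query, chunks):
    # Answer the question over each chunk, then recursively combine
    # pairs of candidate answers until a single root answer remains.
    answers = [llm(f"Context: {c}\nAnswer the question: {query}") for c in chunks]
    while len(answers) > 1:
        answers = [
            llm(f"Context: {' '.join(answers[i:i + 2])}\n"
                f"Answer the question: {query}")
            for i in range(0, len(answers), 2)
        ]
    return answers[0]
```

The key difference: create-and-refine threads one evolving answer through every chunk, while tree summarize generates independent answers and merges them bottom-up.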

Claude-v1 100K doesn’t do well with the “create-and-refine” strategy (which is the default).

# NOTE: the default create/refine approach does not give good answers
query = "What were some of the biggest risk factors in 2019?"
query_engine = list_index.as_query_engine(service_context=service_context)
response = query_engine.query(query)

Here is the retrieved answer:

Original answer:

Some of the biggest risk factors for Uber in 2019 included:

• Regulatory challenges and uncertainty. Uber faced regulatory challenges and uncertainty in many markets, including restrictions on its services, caps on pricing, and licensing requirements. For example, California's AB5 law and other similar laws increased the risk of Drivers being classified as employees. Uber also faced regulatory scrutiny and bans in London, Barcelona, and other markets.

• Competition. The markets in which Uber operates are highly competitive, and Uber faced significant competition from well-established and low-cost alternatives in 2019. Competitors also aggressively competed for Drivers and consumers by offering significant incentives and discounts.

• Safety and security. There were risks related to the safety and security of Uber's platform, including risks from vehicle or scooter accidents, assaults, and other incidents. Uber released a safety report in 2019 detailing reports of sexual assaults on its platform which brought additional scrutiny.

• Financial performance and profitability. Uber continued to experience significant losses in 2019 as it invested in growth. There was no guarantee that Uber would achieve or sustain profitability in the future.

• Reliance on third parties. Uber relied on third parties for key elements of its business like cloud computing services, payment processing, and background checks. If these third parties failed to provide services or increased costs, it could adversely impact Uber's business.

• Macroeconomic conditions. Uber's business was sensitive to economic conditions and changes that impacted discretionary consumer spending. A decline in the economy could reduce demand for Uber's services.

• Reliance on Drivers and Restaurants. Uber's success relied on attracting and retaining enough Drivers and Restaurants on its platform. If Uber was unable to attract or retain enough Drivers and Restaurants, it could negatively impact its operations.

• Intellectual property. If Uber was unable to protect its intellectual property, or if third parties claimed Uber was infringing on their intellectual property, it could harm Uber's business. Uber relied on a combination of copyright, trademark, patent, and trade secret laws to establish its intellectual property rights.

Refined answer:

Some of the biggest risk factors for Uber in 2019 included:

• Regulatory challenges and uncertainty. Uber faced significant regulatory challenges in 2019, including AB5 in California which increased the risk of Drivers being classified as employees, and regulatory scrutiny and temporary bans in London and other markets. These regulatory issues created uncertainty and posed risks to Uber's business model and financial results.

• Competition. The ridesharing and meal delivery markets are highly competitive, and competitors offered significant incentives and discounts to take market share from Uber in 2019. This competition could negatively impact Uber's growth and profitability.

• Safety and security. There were risks related to the safety and security of Uber's platform, including risks from vehicle or scooter accidents, assaults, and other incidents. Uber's safety report detailing sexual assault reports brought additional scrutiny and risks. Any failure to ensure safety could significantly damage Uber's reputation and business.

• Financial performance and profitability. Uber continued to experience substantial losses in 2019 due to investments in growth, and there was no guarantee of future profitability. Uber's path to profitability was uncertain, and failure to become profitable could threaten its business model and access to capital.

• Reliance on third parties. Uber relied on third parties for services like cloud computing, payment processing, and background checks. If these third parties failed to provide services or increased costs, it could negatively impact Uber's offerings, growth, and profitability.

• Macroeconomic conditions. Uber's business was sensitive to the economy and consumer discretionary spending. An economic downturn could reduce demand for Uber's offerings and threaten its growth and financial performance.

• Reliance on Drivers and Restaurants. Uber's success relied on attracting and retaining enough Drivers and Restaurants to support growth and meet consumer demand. Failure to attract or retain Drivers and Restaurants could significantly impact Uber's business.

• Intellectual property. Failure to protect intellectual property, or infringement of others' intellectual property, could damage Uber's competitive position and business. Uber relied on laws and policies to protect intellectual property, but monitoring infringement and unauthorized use was difficult.

As shown, the answer contains the prompt within it; the point of the refine prompt was to output just the refined answer, not to produce both the original and refined answers.

For some reason, using the Tree Summarize strategy works much better.

# NOTE: tree_summarize gives better answers
query = "What were some of the biggest risk factors in 2019?"
query_engine = list_index.as_query_engine(service_context=service_context, response_mode="tree_summarize")
response = query_engine.query(query)

Returned response:

• Regulatory challenges and uncertainty: Uber faced significant regulatory challenges and uncertainty in 2019, including AB5 in California which codified a new test for determining whether workers should be classified as employees or independent contractors. Uber also faced regulatory scrutiny and bans in other markets like London, UK. These regulatory issues created uncertainty and risk around Uber's business model and operations.

• Safety and security: Uber received negative publicity around safety incidents on its platform which could damage its brand and reputation. Uber released a safety report in 2019 on sexual assaults and other incidents which led to additional scrutiny. Safety and security risks remain an ongoing issue for Uber's business.

• Competition: The markets in which Uber competes are intensely competitive, and Uber faces competition from new and existing companies in the various segments it operates in, like ridesharing, food delivery, and logistics. Increased competition could make it difficult for Uber to attract and retain platform users.

• Financial losses: Uber continued to experience significant losses from operations in 2019 as it invested in growth. Uber may not achieve or sustain profitability in the future, which could negatively impact its business and stock price.

• Reliance on third parties: Uber relies on third-party data centers, cloud computing services, and software which exposes it to risks if these third parties experience any disruptions or interference. Any disruption of third-party services could negatively impact Uber's business.

• Macroeconomic conditions: Uber's business is sensitive to macroeconomic conditions, and any downturn could reduce discretionary consumer spending which would harm Uber's business. Issues like a recession, increased unemployment, or reduced consumer confidence could all potentially impact Uber's business.

• Reliance on Drivers and Restaurants: Uber's business depends on attracting and retaining Drivers and Restaurants on its platform. If Uber is unable to attract or maintain a critical mass of Drivers and Restaurants, its platform will become less appealing to platform users and its financial results would be adversely impacted.

The answer is much better. It goes into detail about the US-China trade war, slowing economic growth, Brexit, and more (keep in mind that 2019 is pre-COVID).

Token Usage and Latency

The document contains around ~170K tokens. For some reason, this number is not reflected in the Anthropic usage logs (the “Prompt Tokens” section seems capped at 10240). But the Prompt Length (in characters) is logged, as well as the model latency.

Given the pricing, ~170K tokens would correspond to roughly $1.5–2 USD.
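A rough sanity check on these numbers (the chars-per-token heuristic and the per-million-token price below are illustrative assumptions, not Anthropic's actual published rates — check their pricing page for current figures):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return int(len(text) / chars_per_token)

def estimate_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    # Linear cost: tokens scaled by a per-million-token price.
    return num_tokens / 1_000_000 * usd_per_million_tokens

doc_tokens = 170_000  # approximate size of one UBER 10-K filing
# A hypothetical prompt price of ~$10 per million tokens lands in the
# quoted ~$1.5-2 range:
print(round(estimate_cost(doc_tokens, 10.0), 2))  # -> 1.7
```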

A query through one Uber SEC 10-K takes around ~60–90 seconds, including all LLM calls. This is actually a bit faster than repeated calls to ChatGPT/davinci: each ChatGPT/davinci call (with the 4K token window maximized) empirically can take 6–10 seconds to complete, or more.

A popular example in our previous blog post showcased that you could compare/contrast different documents with LlamaIndex graph structures.

We test whether we can do that here as well, by feeding multiple SEC reports into Claude-v1 100K.

Considering that one UBER SEC 10-K filing doesn’t even fit in the context window, we will of course also need to implement response synthesis strategies in order to handle ingesting multiple 10-K filings.

We construct a list index over all four 10-K filings: 2019, 2020, 2021, and 2022.

list_index = GPTListIndex.from_documents(all_docs, service_context=service_context)
print(len(list_index.index_struct.nodes))

We then ask our question using the Tree Summarize response mode.

query = "How are the risk factors changing across years? Compare/contrast the risk factors across the SEC filings."
query_engine = list_index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query(query)

The complete answer is given below:

The risk factors disclosed in Uber's SEC filings have evolved over time based on changes in Uber's business and industry. Some of the key differences in risk factors across the filings are:

2017 10-K:
- Focused heavily on risks related to negative publicity, competition, dependence on independent contractors, and regulatory challenges as Uber was still facing backlash from various PR crises and regulatory pushback.
- Also highlighted risks from intellectual property litigation given various IP disputes at the time.

2018 10-K:
- Added more risks related to autonomous vehicles as Uber ramped up its self-driving car efforts. Specifically called out risks from accidents, technical challenges, and competition in the AV space.
- Removed some risks related to negative publicity and PR crises as those issues had subsided. But added risks related to corporate culture and workplace environment given the Fowler scandal.

2019 10-K:
- Further expanded AV risks to include risks from partnerships and third-party relationships as Uber relied more on partnerships with other companies like Toyota and Volvo.
- Added risks related to environmental regulations and climate change as those issues received more mainstream attention.
- Removed risks related to initial scaling challenges as Uber had matured as a company. But added risks related to growth into new offerings and markets as Uber expanded into freight shipping, scooters, etc.

In summary, Uber's risk factors have evolved to reflect the key issues and challenges facing the company at any point in time. The risks highlight both external factors like regulations as well as internal factors related to Uber's business, technology, and operations. The changes over time show how an innovative company's risks can shift quite quickly.

Some of the key trends in how the risk factors are changing include:

1. Risks related to negative publicity and PR crises decreased over time as those issues subsided, while risks related to other issues like AV technology, workplace culture, and new initiatives increased. This shows how Uber's risks evolved as the company matured.

2. Risks tend to increase in areas where Uber is actively investing or expanding. For example, risks related to AVs, partnerships, and new mobility offerings increased as Uber ramped up efforts in those areas. This highlights how risk profiles change with a company's strategic priorities.

3. External risks like regulations and climate change were added as those issues gained more mainstream attention and importance. This shows how companies should adapt their risk factors to account for changes in the overall business environment.

4. Certain foundational risks around competition, growth, and reliance on independent contractors continued over time. But the specifics and details provided for those risks evolved based on Uber's current challenges and priorities. So while the themes remained, the risks were actively updated.

In summary, Uber's risk factors changed over time to provide more details on the company's priorities, challenges, partnerships, and external environment at any given point. But certain core risks inherent to Uber's business model have also remained consistently highlighted, demonstrating how those foundational risks are long-term in nature. The changes in risks over time provide a glimpse into how an innovative company's risk profile is constantly evolving.

This response only contains risk factor analysis over the 2019 10-K (which in turn contains risk factors for 2017 and 2018). It doesn’t cover the years from 2020 onwards. Part of this is potentially due to our tree summarize response synthesis strategy. Nevertheless, it shows that attempting to naively “stuff” documents into big 100K-token chunks with simple response synthesis strategies still doesn’t produce optimal answers.

As expected, feeding all four documents into Anthropic necessitates many more chained LLM calls, which consumes far more tokens and takes quite a bit longer (on the order of 9–10 minutes).

In general, the new 100K context window is incredibly exciting and offers developers a new mode of feeding data into the LLM for various tasks/queries. It offers coherent analysis at a marginal token cost that is cheaper than GPT-4's.

That said, trying to maximize this context window with each inference call does come with tradeoffs in terms of latency and cost.

We look forward to doing more experiments/comparisons/thought pieces on top of Claude! Let us know your feedback.

Resources

You can try our full Colab notebook here.
