Leveraging Enterprise Specific Data With LLMs: How Unstructured Unlocked 100k+ Pages of IRS Manuals



admin

April 16, 2023

Getting Down to Work
Scraping
Preprocessing
Once Your Data Has Been Preprocessed: Working with Structured Data
Chat With Your IRS Data: We’re ready to field some questions!
Next steps

Unstructured makes it fast and straightforward to preprocess an organization’s internal data and render it in a format that can be used with LLMs. Rather than hacking together custom Python scripts, regular expressions, and open source OCR packages, you can send almost any raw file containing natural language to Unstructured’s API and receive back nice, clean JSON. In this post, we show how the IRS, for example, could rapidly deploy an LLM solution over its own data for its employees. This architecture is extensible to any organization that wants to deliver a ChatGPT-style experience with its data.

We began by grabbing more than 100k pages of IRS manuals, largely in PDF format, from the IRS government website. Note that you can use Unstructured’s API to preprocess not only PDFs but also HTML, Microsoft Office file types, emails, and more.

Once we’d gathered our data, the first step in this project was to use Unstructured’s API (or you can deploy our image on your own hardware) to preprocess the raw PDFs and transform them into clean JSON. See the readme here to follow the install instructions and clone the demo repo. Once you’re ready to use Unstructured, here’s the one and only command you’ll need for turning your data into easily digestible content:

# Optional: --partition-by-api hits the *NEW* API instead of processing locally
PYTHONPATH=. ./unstructured/ingest/main.py \
  --local-input-path <path-to-input-directory> \
  --structured-output-dir <path-to-output-directory> \
  --partition-by-api

If you’re extracting data locally without using the API, you can increase throughput with the --num-processes parameter (e.g., 8 processes if running on hardware with 64 GB available). Below is just one example of how Unstructured transforms the raw data into a structured JSON format.
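As a rough sketch of what working with that output can look like, the snippet below loads a small JSON document and keeps only the body-text elements. It assumes the output is a list of element objects that each carry `type`, `text`, and `metadata` fields; the sample values are invented for illustration and are not real IRS text, and `body_text_chunks` is a hypothetical helper, not part of Unstructured.

```python
import json

# Invented sample shaped like a list of document elements (an assumption
# about the output layout, not a copy of real output).
sample_output = json.dumps([
    {"type": "Title", "text": "Example Manual Section",
     "metadata": {"filename": "example.pdf", "page_number": 1}},
    {"type": "NarrativeText",
     "text": "This paragraph stands in for body text extracted from a PDF.",
     "metadata": {"filename": "example.pdf", "page_number": 1}},
    {"type": "ListItem", "text": "A bullet point extracted from the same page.",
     "metadata": {"filename": "example.pdf", "page_number": 1}},
    {"type": "NarrativeText", "text": "   ",
     "metadata": {"filename": "example.pdf", "page_number": 2}},
])


def body_text_chunks(raw_json, keep_types=("NarrativeText", "ListItem")):
    """Return (text, metadata) pairs for non-empty body-text elements."""
    elements = json.loads(raw_json)
    return [
        (el["text"].strip(), el.get("metadata", {}))
        for el in elements
        if el.get("type") in keep_types and el.get("text", "").strip()
    ]


chunks = body_text_chunks(sample_output)
for text, meta in chunks:
    print(meta.get("page_number"), text)
```

Filtering by element type at this stage keeps titles, headers, and empty fragments out of whatever gets embedded later, while the metadata travels along so answers can point back to a source page.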

Once Unstructured has done the heavy lifting of converting the raw files to usable JSON, we can nest the preprocessed data inside an architecture that enables an LLM to benefit from this organization-specific data.

(For more detail on each of these, be sure to check out our article here.)

For this particular project we tried out Pinecone for storage (we’ve also had great luck with Chroma, Weaviate, Qdrant, and others), OpenAI for embeddings and the LLM (because it was easy, though we could just as easily have gone to Hugging Face to snag open source alternatives), and LangChain as a programming framework (Llama Index works great too!). Again, it’s important to note that once we’re working with the preprocessed data, it’s easy to experiment with different downstream libraries. For instance, if hybrid search seems like a compelling way to go, it would be easy to evaluate Llama Index and/or LangChain + Supabase.
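To make the storage-and-retrieval step concrete without calling any external service, here is a minimal sketch of the same pattern using only the standard library. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `InMemoryIndex` is a hypothetical stand-in for a vector database such as Pinecone; neither reflects those products’ actual APIs, and the sample passages are invented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase term counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Hypothetical stand-in for a vector store: upsert, then query."""

    def __init__(self):
        self._rows = []

    def upsert(self, doc_id, text, metadata=None):
        self._rows.append((doc_id, embed(text), text, metadata or {}))

    def query(self, question, top_k=2):
        q = embed(question)
        scored = [(cosine(q, vec), doc_id, text)
                  for doc_id, vec, text, _ in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]


index = InMemoryIndex()
index.upsert("a", "Penalties for late filing are assessed monthly.")
index.upsert("b", "The Whistleblower Office reviews submitted claims.")
index.upsert("c", "Form 709 is used to report gifts.")

hits = index.query("Tell me about the Whistleblower Office", top_k=1)
print(hits[0][1], round(hits[0][0], 2))
```

Because the index only depends on `embed` returning something `cosine` can compare, swapping in a different embedding model or storage backend changes those two pieces and leaves the calling code alone, which is the kind of experimentation the preprocessed data makes cheap.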

Once the vector DB is populated with all 100k preprocessed documents and their corresponding embeddings, all we have to do is query. Come one, come all to one of the two options below, and bring all the musings you’ve ever had on IRS policy, procedure, and process.
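At query time, the retrieved passages are typically placed into the prompt sent to the LLM. The sketch below shows one plain way to assemble such a prompt under a character budget; `build_prompt`, the prompt wording, and the sample passages are all assumptions made for illustration, not the demo app’s actual implementation.

```python
def build_prompt(question, passages, max_chars=1500):
    """Assemble a context-grounded prompt, stopping once the budget is spent."""
    context_parts = []
    used = 0
    for number, passage in enumerate(passages, start=1):
        entry = f"[{number}] {passage.strip()}"
        if used + len(entry) > max_chars:
            break  # drop lower-ranked passages that no longer fit
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Invented passages standing in for the top results from the vector DB.
retrieved = [
    "Placeholder passage about how late-filing penalties are computed.",
    "Placeholder passage about requesting penalty relief.",
]
prompt = build_prompt("How are penalties determined for late filings?", retrieved)
print(prompt)
```

Numbering the passages makes it possible to ask the model to cite which one it relied on, and the budget keeps the prompt within whatever context window the chosen model allows.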

Here are the two ways you can check this out for yourself.

1. Our Hosted Instance

2. Running the CLI app yourself

Data is powerful, but only if we can make use of it. With Unstructured, we’re excited to help enterprises put their internal data to work with LLMs. We’re continually adding to our natural language preprocessing capabilities and expanding the number of data connectors we support. No matter where your natural language data resides or what file type it’s contained in, Unstructured has got you covered.

GitHub repo → Clone the demo repo and connect to your own data source

Community Slack → Join our growing community

Hosted Instance → Chat with the IRS Manuals yourself

Are there regulations around email communication?
What’s the difference between federal and state tax?
Who is the head of the IRS?
How are penalties determined for late filings?
Tell me about the Whistleblower Office
Tell me about Tax and Fingerprint Checks needed for experts
What’s the process for making an appeal?
Tell me about the Daily Delinquency Penalty
When are taxes owed?
Who has to pay taxes?
How do I process amended tax returns?
How do I investigate charitable contribution deductions?
What kinds of tax credits are there?
Do churches pay taxes?
Tell me about Form 709

