How I Turned My Company’s Docs into a Searchable Database with OpenAI
Converting the docs to a unified format
Processing the documents
Embedding text and code blocks with OpenAI
Making a Qdrant vector index
Querying the index
Writing the search wrapper
Conclusion

Image courtesy of Unsplash.

For the past six months, I’ve been working at Series A startup Voxel51, the creator of the open source computer vision toolkit FiftyOne. As a machine learning engineer and developer evangelist, my job is to listen to our open source community and bring them what they need: new features, integrations, tutorials, workshops, you name it.

A few weeks ago, we added native support for vector search engines and text similarity queries to FiftyOne, so that users can find the most relevant images in their (often massive, containing millions or tens of millions of samples) datasets via simple natural language queries.

This put us in a curious position: it was now possible for people using open source FiftyOne to readily search datasets with natural language queries, but using our documentation still required traditional keyword search.

We have a lot of documentation, which has its pros and cons. As a user myself, I sometimes find that given the sheer quantity of documentation, finding precisely what I’m looking for takes more time than I’d like.

I was not going to let this fly… so I built this in my spare time:

Semantically search your company’s docs from the command line. Image courtesy of author.

So, here’s how I turned our docs into a semantically searchable vector database:

You can find all of the code for this post in the voxel51/fiftyone-docs-search repo, and it’s easy to install the package locally in edit mode by running pip install -e . from the repo root.

Better yet, if you want to implement semantic search for your own website using this method, you can follow along! Here are the ingredients you’ll need:

  • Install the openai Python package and create an account: you’ll use this account to send your docs and queries to an inference endpoint, which will return an embedding vector for each piece of text.
  • Install the qdrant-client Python package and launch a Qdrant server via Docker: you’ll use Qdrant to create a locally hosted vector index for the docs, against which queries will be run. The Qdrant service will run inside a Docker container.

Converting the docs to a unified format

My company’s docs are all hosted as HTML documents at https://docs.voxel51.com. A natural place to start would have been to download these docs with Python’s requests library and parse them with Beautiful Soup.

As a developer (and author of many of our docs), however, I thought I could do better. I already had a working clone of the GitHub repository on my local computer that contained all of the raw files used to generate the HTML docs. Some of our docs are written in Sphinx reStructuredText (RST), whereas others, like tutorials, are converted to HTML from Jupyter notebooks.

I figured (mistakenly) that the closer I could get to the raw text of the RST and Jupyter files, the simpler things would be.

RST

In RST documents, sections are delineated by lines consisting only of strings of =, - or _. For example, here’s a document from the FiftyOne User Guide which contains all three delineators:

RST document from open source FiftyOne Docs. Image courtesy of author.

I could then remove all of the RST keywords, such as toctree, code-block, and button_link (there were many more), as well as the :, ::, and .. that accompanied a keyword, the start of a new block, or block descriptors.

Links were easy to take care of too:

no_links_section = re.sub(r"<[^>]+>_?","", section)
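To see what that substitution does in practice, here is a tiny sketch applied to a made-up RST line. The sample string is an assumption for illustration and is not taken from our docs:

```python
import re

# Hypothetical RST line with an external link of the form `text <url>`_
section = "See the `FiftyOne docs <https://docs.voxel51.com>`_ for details."

# Remove anything in angle brackets, plus an underscore only if it
# immediately follows the closing bracket
no_links_section = re.sub(r"<[^>]+>_?", "", section)

print(no_links_section)
```

Note that only the bracketed target is dropped here; the backticks and the trailing underscore around the link text survive, so they would still need to be handled by a separate cleanup step.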

Things started to get dicey when I wanted to extract the section anchors from RST files. Many of our sections had anchors specified explicitly, whereas others were left to be inferred during the conversion to HTML.

Here is an example:

.. _brain-embeddings-visualization:

Visualizing embeddings
______________________

The FiftyOne Brain provides a powerful
:meth:`compute_visualization() ` method
that you can use to generate low-dimensional representations of the samples
and/or individual objects in your datasets.

These representations can be visualized natively in the App's
:ref:`Embeddings panel `, where you can interactively
select points of interest and view the corresponding samples/labels of interest
in the :ref:`Samples panel `, and vice versa.

.. image:: /images/brain/brain-mnist.png
   :alt: mnist
   :align: center

There are two primary components to an embedding visualization: the method used
to generate the embeddings, and the dimensionality reduction method used to
compute a low-dimensional representation of the embeddings.

Embedding methods
-----------------

The `embeddings` and `model` parameters of
:meth:`compute_visualization() `
support a variety of ways to generate embeddings for your data:

In the brain.rst file in our User Guide docs (a portion of which is reproduced above), the Visualizing embeddings section has an anchor #brain-embeddings-visualization specified by .. _brain-embeddings-visualization:. The Embedding methods subsection which immediately follows, however, is given an auto-generated anchor.

Another difficulty that soon reared its head was how to deal with tables in RST. List tables were fairly straightforward. For instance, here’s a list table from our View Stages cheat sheet:
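To make the explicit-versus-implicit anchor problem concrete, here is a minimal sketch of how one might pull section titles and any explicit labels out of raw RST. The function name find_sections and both regular expressions are my own illustration, not code from the repo, and the sketch only handles the simple "label, blank line, underlined title" layout shown above:

```python
import re

# Matches an explicit RST label such as ".. _brain-embeddings-visualization:"
LABEL_RE = re.compile(r"^\.\. _([^:]+):\s*$")
# Matches an underline made of one repeated character: =, - or _
UNDERLINE_RE = re.compile(r"^(=+|-+|_+)\s*$")


def find_sections(rst_text):
    """Return (title, explicit_label_or_None) for each underlined heading."""
    lines = rst_text.splitlines()
    sections = []
    pending_label = None
    for i, line in enumerate(lines):
        label = LABEL_RE.match(line)
        if label:
            # Remember the label; it applies to the next heading we see
            pending_label = label.group(1)
            continue
        if not line.strip() or UNDERLINE_RE.match(line):
            continue
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if UNDERLINE_RE.match(next_line) and len(next_line.strip()) >= len(line.strip()):
            sections.append((line.strip(), pending_label))
        # Any other content means a pending label no longer refers to a heading
        pending_label = None
    return sections


sample = "\n".join([
    ".. _brain-embeddings-visualization:",
    "",
    "Visualizing embeddings",
    "_" * 22,
    "",
    "Some text.",
    "",
    "Embedding methods",
    "-" * 17,
    "",
    "More text.",
])
print(find_sections(sample))
```

Headings that come back with None are exactly the ones whose anchors only exist after the HTML build, which is the gap that made raw RST awkward for linking to sections.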

.. list-table::

* - :meth:`match() `
* - :meth:`match_frames() `
* - :meth:`match_labels() `
* - :meth:`match_tags() `

Grid tables, on the other hand, can get messy fast. They give docs writers great flexibility, but this same flexibility makes parsing them a pain. Take this table from our Filtering cheat sheet:

+-----------------------------------------+--------------------------------------------------------------+
| Operation                               | Command                                                      |
+=========================================+==============================================================+
| Filepath starts with "/Users"           | .. code-block::                                              |
|                                         |                                                              |
|                                         |     ds.match(F("filepath").starts_with("/Users"))            |
+-----------------------------------------+--------------------------------------------------------------+
| Filepath ends with "10.jpg" or "10.png" | .. code-block::                                              |
|                                         |                                                              |
|                                         |     ds.match(F("filepath").ends_with(("10.jpg", "10.png")))  |
+-----------------------------------------+--------------------------------------------------------------+
| Label contains string "be"              | .. code-block::                                              |
|                                         |                                                              |
|                                         |     ds.filter_labels(                                        |
|                                         |         "predictions",                                       |
|                                         |         F("label").contains_str("be"),                       |
|                                         |     )                                                        |
+-----------------------------------------+--------------------------------------------------------------+
| Filepath contains "088" and is JPEG     | .. code-block::                                              |
|                                         |                                                              |
|                                         |     ds.match(F("filepath").re_match("088*.jpg"))             |
+-----------------------------------------+--------------------------------------------------------------+

Inside a table, rows can take up arbitrary numbers of lines, and columns can vary in width. Code blocks inside grid table cells are also difficult to parse, as they occupy space on multiple lines, so their content is interspersed with content from other columns. This means that code blocks in these tables must be effectively reconstructed during the parsing process.

Not the end of the world. But also not ideal.

Jupyter

Jupyter notebooks turned out to be relatively easy to parse. I was able to read the contents of a Jupyter notebook into a list with one entry per cell, pairing each cell’s source string with its cell type:

import json

ifile = "my_notebook.ipynb"
with open(ifile, "r") as f:
    contents = f.read()

# Each cell becomes a (source, cell_type) tuple
contents = json.loads(contents)["cells"]
contents = [(" ".join(c["source"]), c["cell_type"]) for c in contents]

Furthermore, the sections were delineated by Markdown cells starting with #.

Nevertheless, given the challenges posed by RST, I decided to turn to HTML and treat all of our docs on equal footing.
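As a rough illustration of that sectioning rule, here is a sketch that groups notebook cells under the most recent Markdown heading cell. The function name split_notebook_into_sections and the sample notebook are hypothetical; the only things taken from the notebook format itself are the "cells", "source" and "cell_type" keys used above:

```python
import json


def split_notebook_into_sections(notebook_json):
    """Group cells into sections, starting a new one at each Markdown heading cell."""
    cells = json.loads(notebook_json)["cells"]
    sections = []
    current = {"title": None, "cells": []}
    for cell in cells:
        # "source" is usually a list of lines that already end in newlines
        source = "".join(cell["source"])
        if cell["cell_type"] == "markdown" and source.lstrip().startswith("#"):
            if current["title"] is not None or current["cells"]:
                sections.append(current)
            first_line, _, rest = source.lstrip().partition("\n")
            current = {"title": first_line.lstrip("#").strip(), "cells": []}
            if rest.strip():
                # Keep any text that follows the heading in the same cell
                current["cells"].append(("markdown", rest))
        else:
            current["cells"].append((cell["cell_type"], source))
    if current["title"] is not None or current["cells"]:
        sections.append(current)
    return sections


# A made-up, minimal notebook for demonstration
notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Loading data\n", "Some intro text."]},
        {"cell_type": "code", "source": ["import fiftyone as fo\n", "print(fo)"]},
        {"cell_type": "markdown", "source": ["## Next steps"]},
    ]
})
for section in split_notebook_into_sections(notebook):
    print(section["title"], len(section["cells"]))
```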

HTML

I built the HTML docs from my local install with bash generate_docs.bash, and began parsing them with Beautiful Soup. However, I soon realized that when RST code blocks and tables with inline code were converted to HTML, although they rendered correctly, the HTML itself was incredibly unwieldy. Take our filtering cheat sheet for example.

When rendered in a browser, the code block preceding the Dates and times section of our filtering cheat sheet looks like this:

Screenshot from cheat sheet in open source FiftyOne Docs. Image courtesy of author.

The raw HTML, however, looks like this:

RST cheat sheet converted to HTML. Image courtesy of author.

This is not impossible to parse, but it is also far from ideal.

Markdown

Fortunately, I was able to overcome these issues by converting all of the HTML files to Markdown with markdownify. Markdown had a few key advantages that made it the best fit for this job.

  1. Code formatting was simplified from the spaghetti strings of span elements to inline code snippets marked with a single ` before and after, and blocks of code were marked by triple backticks ``` before and after. This also made it easy to split the content into text and code.
  2. Unlike raw RST, this Markdown included section heading anchors, because the implicit anchors had already been generated. This way, I could link not just to the page containing the result, but to the specific section or subsection of that page.
  3. Markdown provided a mostly uniform formatting for the original RST and Jupyter documents, allowing us to give their content consistent treatment in the vector search application.

Note on LangChain

Some of you may know about the open source library LangChain for building applications with LLMs, and may be wondering why I didn’t just use LangChain’s Document Loaders and Text Splitters. The answer: I needed more control!

Processing the documents

Once the documents had been converted to Markdown, I proceeded to clean the contents and split them into smaller segments.

Cleaning

Cleaning mostly consisted of removing unnecessary elements, including:

  • Headers and footers
  • Table row and column scaffolding — e.g. the |’s in |select()| select_by()|
  • Extra newlines
  • Links
  • Images
  • Unicode characters
  • Bolding, i.e. **text** → text

I also removed the backslashes that were escaping two characters which have special meaning in our docs: _ and *. The former is used in many method names, and the latter, as usual, is used in multiplication, patterns, and many other places:

document = document.replace("\\_", "_").replace("\\*", "*")
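A few of the cleaning steps listed above can be sketched with regular expressions. This is a simplified illustration rather than the implementation in the repo: the function name clean_markdown and the exact patterns are my assumptions. One caveat is that a real pipeline would need to leave heading lines alone at this stage, since the anchor links in them are needed later:

```python
import re


def clean_markdown(document):
    """Apply a handful of the cleaning steps to a Markdown string."""
    # Images: ![alt](src) is removed entirely (must run before link handling)
    document = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", document)
    # Links: [text](url) keeps only the link text
    document = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", document)
    # Bolding: **text** keeps only the inner text
    document = re.sub(r"\*\*(.+?)\*\*", r"\1", document)
    # Escaped underscores and asterisks: \_ becomes _ and \* becomes *
    document = document.replace("\\_", "_").replace("\\*", "*")
    # Extra newlines: collapse runs of three or more into one blank line
    document = re.sub(r"\n{3,}", "\n\n", document)
    return document.strip()


sample = (
    "See [the docs](https://docs.voxel51.com) for **details** on "
    "`match\\_labels()`.\n\n\n\n![logo](logo.png)\nDone."
)
print(clean_markdown(sample))
```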

Splitting documents into semantic blocks

With the contents of our docs cleaned, I proceeded to split the docs into bite-sized blocks.

First, I split each document into sections. At first glance, it seems like this can be done by finding any line that starts with a # character. In my application, I didn’t differentiate between h1, h2, h3, and so on (# , ## , ###), so checking the first character is sufficient. However, this logic gets us into trouble when we realize that # is also used for comments in Python code.

To bypass this problem, I split the document into text blocks and code blocks:

text_and_code = page_md.split('```')
text = text_and_code[::2]
code = text_and_code[1::2]

Then I identified the start of a new section by a # at the beginning of a line in a text block. I extracted the section title and anchor from this line:

def extract_title_and_anchor(header):
    header = " ".join(header.split(" ")[1:])
    title = header.split("[")[0]
    anchor = header.split("(")[1].split(" ")[0]
    return title, anchor

And assigned each block of text or code to the appropriate section.

Initially, I also tried splitting the text blocks into paragraphs, hypothesizing that because a section may contain information about many different topics, the embedding for that entire section may not be similar to an embedding for a text prompt concerned with only one of those topics. This approach, however, resulted in top matches for most search queries disproportionately being single-line paragraphs, which turned out to not be terribly informative as search results.

Check out the accompanying GitHub repo for the implementation of these methods, which you can try out on your own docs!
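Putting the fence split and the heading detection together, the assignment step might look roughly like the sketch below. The function split_into_sections, the "#top" bucket, and the sample page are illustrative assumptions; the sketch also assumes every heading carries a permalink of the form [¶](#anchor "...") and that the extracted anchor keeps its leading #:

```python
def extract_title_and_anchor(header):
    # Same logic as the function above
    header = " ".join(header.split(" ")[1:])
    title = header.split("[")[0]
    anchor = header.split("(")[1].split(" ")[0]
    return title, anchor


def split_into_sections(page_md):
    """Return {anchor: [(block_type, content), ...]} for one Markdown page."""
    sections = {}
    anchor = "#top"  # hypothetical bucket for content before the first heading
    for i, chunk in enumerate(page_md.split("```")):
        if i % 2 == 1:
            # Odd chunks sit between fences, so a leading # here is a comment
            sections.setdefault(anchor, []).append(("code", chunk.strip()))
            continue
        buffer = []
        for line in chunk.splitlines() + [None]:  # None flushes the last buffer
            if line is None or line.startswith("#"):
                text = "\n".join(buffer).strip()
                if text:
                    sections.setdefault(anchor, []).append(("text", text))
                buffer = []
                if line is not None:
                    _, anchor = extract_title_and_anchor(line)
            else:
                buffer.append(line)
    return sections


page_md = (
    '# Filtering[¶](#filtering "Permalink to this headline")\n'
    "Intro text.\n"
    "```\n"
    "# a Python comment, not a heading\n"
    'ds.match(F("filepath").starts_with("/Users"))\n'
    "```\n"
    '## Dates and times[¶](#dates-and-times "Permalink to this headline")\n'
    "More text.\n"
)
for section_anchor, blocks in split_into_sections(page_md).items():
    print(section_anchor, [block_type for block_type, _ in blocks])
```

Because headings are only looked for in the even-indexed chunks, the Python comment inside the fence stays attached to its code block instead of starting a new section.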

Embedding text and code blocks with OpenAI

With documents converted, processed, and split into strings, I generated an embedding vector for each of these blocks. Because large language models are flexible and generally capable by nature, I decided to treat both text blocks and code blocks on the same footing as pieces of text, and to embed them with the same model.

I used OpenAI’s text-embedding-ada-002 model because it is easy to work with, achieves the highest performance out of all of OpenAI’s embedding models (on the BEIR benchmark), and is also the cheapest. It’s so cheap in fact ($0.0004/1K tokens) that generating all of the embeddings for the FiftyOne docs only cost a few cents! As OpenAI themselves put it, “We recommend using text-embedding-ada-002 for nearly all use cases. It’s better, cheaper, and simpler to use.”

With this embedding model, you can generate a 1536-dimensional vector representing any input prompt, up to 8,191 tokens (roughly 30,000 characters).

To get started, you need to create an OpenAI account, generate an API key at https://platform.openai.com/account/api-keys, and export this API key as an environment variable with:
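Before sending anything to the API, it can be handy to sanity-check block lengths and the expected bill. The sketch below uses only the figures quoted above (price, token limit, and the rough characters-per-token ratio they imply). The helper names are mine, and the character-based estimate is a crude stand-in for a real tokenizer:

```python
import math

PRICE_PER_1K_TOKENS = 0.0004   # USD, as quoted above
MAX_TOKENS = 8191              # input limit quoted above
APPROX_CHARS_AT_LIMIT = 30_000  # "roughly 30,000 characters"


def estimate_tokens(text):
    """Very rough token estimate from character count (not a real tokenizer)."""
    return math.ceil(len(text) * MAX_TOKENS / APPROX_CHARS_AT_LIMIT)


def too_long(text):
    """True if the block probably exceeds the model's input limit."""
    return estimate_tokens(text) > MAX_TOKENS


def estimate_cost(blocks):
    """Approximate cost in USD of embedding every block once."""
    total_tokens = sum(estimate_tokens(block) for block in blocks)
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS


blocks = ["a" * 30_000] * 10  # ten made-up blocks right at the limit
print(f"~${estimate_cost(blocks):.4f} to embed {len(blocks)} blocks")
```

Even ten maximum-length blocks come to only a few cents by this estimate, which is consistent with what I saw for our docs.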

export OPENAI_API_KEY=""

You will also need to install the openai Python library:

pip install openai

I wrote a wrapper around OpenAI’s API that takes in a text prompt and returns an embedding vector:

import openai

MODEL = "text-embedding-ada-002"

def embed_text(text):
    # Written against the pre-1.0 openai package interface (openai.Embedding)
    response = openai.Embedding.create(
        input=text,
        model=MODEL
    )
    embeddings = response['data'][0]['embedding']
    return embeddings

To generate embeddings for all of our docs, we just apply this function to each of the subsections (text and code blocks) across all of our docs.

Making a Qdrant vector index

With embeddings in hand, I created a vector index to search against. I chose to use Qdrant for the same reasons we chose to add native Qdrant support to FiftyOne: it’s open source, free, and easy to use.

To get started with Qdrant, you can pull a pre-built Docker image and run the container:

docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant

Additionally, you will need to install the Qdrant Python client:

pip install qdrant-client

I created the Qdrant collection:

import qdrant_client as qc
import qdrant_client.http.models as qmodels

client = qc.QdrantClient(url="localhost")
METRIC = qmodels.Distance.DOT
DIMENSION = 1536
COLLECTION_NAME = "fiftyone_docs"

def create_index():
    client.recreate_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=qmodels.VectorParams(
            size=DIMENSION,
            distance=METRIC,
        )
    )

I then created a vector for every subsection (text or code block):

import uuid

def create_subsection_vector(
    subsection_content,
    section_anchor,
    page_url,
    doc_type,
    block_type
):
    vector = embed_text(subsection_content)
    id = str(uuid.uuid1().int)[:32]
    payload = {
        "text": subsection_content,
        "url": page_url,
        "section_anchor": section_anchor,
        "doc_type": doc_type,
        "block_type": block_type
    }
    return id, vector, payload

For each vector, you can provide additional context as part of the payload. In this case, I included the URL (and anchor) where the result can be found, the type of document (so the user can specify whether they want to search through all of the docs or just certain types of docs), and the contents of the string which generated the embedding vector. I also added the block type (text or code), so if the user is looking for a code snippet, they can tailor their search to that purpose.

Then I added these vectors to the index, one page at a time:

def add_doc_to_index(subsections, page_url, doc_type, block_type):
    ids = []
    vectors = []
    payloads = []

    for section_anchor, section_content in subsections.items():
        for subsection in section_content:
            id, vector, payload = create_subsection_vector(
                subsection,
                section_anchor,
                page_url,
                doc_type,
                block_type
            )
            ids.append(id)
            vectors.append(vector)
            payloads.append(payload)

    ## Add vectors to collection
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=qmodels.Batch(
            ids=ids,
            vectors=vectors,
            payloads=payloads
        ),
    )

Querying the index

Once the index has been created, running a search on the indexed documents can be accomplished by embedding the query text with the same embedding model, and then searching the index for similar embedding vectors. With a Qdrant vector index, a basic query can be performed with the Qdrant client’s search() command.

To make my company’s docs searchable, I wanted to allow users to filter by section of the docs, as well as by the type of block that was encoded. In the parlance of vector search, filtering results while still ensuring that a predetermined number of results (specified by the top_k argument) will be returned is referred to as pre-filtering.

To achieve this, I wrote a programmatic filter:
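Conceptually, the search step boils down to scoring every stored vector against the query vector and keeping the best matches. Here is a toy, exhaustive version with three-dimensional vectors and the dot product, the same metric configured for the collection above. The function names and sample vectors are made up, and Qdrant itself relies on an approximate HNSW index rather than a full scan like this:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def brute_force_search(query_vector, index, top_k=3):
    """Score every (id, vector) pair in the index and return the top_k (id, score)."""
    scored = [(doc_id, dot(query_vector, vector)) for doc_id, vector in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional "embeddings"; real ones have 1536 dimensions
index = [
    ("load", [1.0, 0.0, 0.0]),
    ("filter", [0.0, 1.0, 0.0]),
    ("export", [0.6, 0.8, 0.0]),
]
for doc_id, score in brute_force_search([0.9, 0.1, 0.0], index, top_k=2):
    print(doc_id, round(score, 2))
```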

def _generate_query_filter(query, doc_types, block_types):
    """Generates a filter for the query.
    Args:
        query: A string containing the query.
        doc_types: A list of document types to search.
        block_types: A list of block types to search.
    Returns:
        A filter for the query.
    """
    doc_types = _parse_doc_types(doc_types)
    block_types = _parse_block_types(block_types)

    # Both groups must match; within a group, any one value may match
    _filter = qmodels.Filter(
        must=[
            qmodels.Filter(
                should=[
                    qmodels.FieldCondition(
                        key="doc_type",
                        match=qmodels.MatchValue(value=dt),
                    )
                    for dt in doc_types
                ],
            ),
            qmodels.Filter(
                should=[
                    qmodels.FieldCondition(
                        key="block_type",
                        match=qmodels.MatchValue(value=bt),
                    )
                    for bt in block_types
                ]
            )
        ]
    )

    return _filter

The internal _parse_doc_types() and _parse_block_types() functions handle cases where the argument is string or list-valued, or is None.

Then I wrote a function query_index() that takes the user’s text query, pre-filters, searches the index, and extracts relevant information from the payload. The function returns a list of tuples of the form (url, contents, score), where the score indicates how good of a match the result is to the query text.

def query_index(query, top_k=10, doc_types=None, block_types=None):
    vector = embed_text(query)
    _filter = _generate_query_filter(query, doc_types, block_types)

    # _search_params is assumed to be defined elsewhere in the package
    results = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=vector,
        query_filter=_filter,
        limit=top_k,
        with_payload=True,
        search_params=_search_params,
    )

    results = [
        (
            f"{res.payload['url']}#{res.payload['section_anchor']}",
            res.payload["text"],
            res.score,
        )
        for res in results
    ]

    return results

Writing the search wrapper

The final step was providing a clean interface for the user to semantically search against these “vectorized” docs.

I wrote a function print_results(), which takes the query, results from query_index(), and a score argument (whether or not to print the similarity score), and prints the results in an easy to interpret way. I used the rich Python package to format hyperlinks in the terminal so that when working in a terminal that supports hyperlinks, clicking on the hyperlink will open the page in your default browser. I also used webbrowser to automatically open the link for the top result, if desired.

Display search results with rich hyperlinks. Image courtesy of author.

For Python-based searches, I created a class FiftyOneDocsSearch to encapsulate the document search behavior, so that once a FiftyOneDocsSearch object has been instantiated (potentially with default settings for search arguments):
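For readers who want the gist without the rich dependency, here is a dependency-free sketch of the same idea. It emits OSC 8 escape sequences directly, which many modern terminals render as clickable links. The function names, signatures, and sample URL are my own illustration and differ from the print_results() in the repo:

```python
import webbrowser


def hyperlink(url, label):
    """Wrap label in an OSC 8 escape sequence so supporting terminals show a link."""
    return f"\033]8;;{url}\033\\{label}\033]8;;\033\\"


def format_results(results, score=True):
    """Turn (url, contents, score) tuples into one display line per result."""
    lines = []
    for url, contents, similarity in results:
        stripped = contents.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        line = f"{hyperlink(url, url)}: {first_line}"
        if score:
            line = f"{similarity:.3f}  {line}"
        lines.append(line)
    return lines


def print_results(results, score=True, open_url=False):
    for line in format_results(results, score=score):
        print(line)
    if open_url and results:
        # Open the top result in the default browser
        webbrowser.open(results[0][0])


# Hypothetical result tuple in the (url, contents, score) form described above
results = [
    ("https://docs.voxel51.com/page.html#section", "Visualizing embeddings\nMore", 0.91234),
]
print_results(results)
```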

from fiftyone.docs_search import FiftyOneDocsSearch
fosearch = FiftyOneDocsSearch(open_url=False, top_k=3, score=True)

You can search within Python by calling this object. To query the docs for “load a dataset”, for instance, you just need to run:

fosearch("load a dataset")

Semantically search your company’s docs within a Python process. Image courtesy of author.

I also used argparse to make this docs search functionality available via the command line. When the package is installed, the docs are CLI searchable with:

fiftyone-docs-search query "" 
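A command line entry point along these lines can be wired up with argparse subcommands. The sketch below only builds and exercises the parser. The flag names (--top-k, --doc-types, --block-types, --score, --open-url) are illustrative guesses that mirror the Python arguments above, not necessarily the flags the package exposes:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="fiftyone-docs-search")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # "query" subcommand: fiftyone-docs-search query "<text>" [options]
    query = subparsers.add_parser("query", help="semantically search the docs")
    query.add_argument("text", help="the natural language query")
    query.add_argument("--top-k", type=int, default=10, help="number of results")
    query.add_argument("--doc-types", nargs="+", default=None)
    query.add_argument("--block-types", nargs="+", default=None)
    query.add_argument("--score", action="store_true", help="show similarity scores")
    query.add_argument("--open-url", action="store_true", help="open the top result")
    return parser


# Parse a sample command line instead of sys.argv for demonstration
args = build_parser().parse_args(
    ["query", "load a dataset", "--top-k", "3", "--doc-types", "tutorials", "--score"]
)
print(args.command, args.text, args.top_k, args.doc_types, args.score)
# A real entry point would now pass these values on to query_index()
```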

Just for fun, because fiftyone-docs-search query is a bit cumbersome, I added an alias to my .zshrc file:

alias fosearch='fiftyone-docs-search query'

With this alias, the docs are searchable from the command line with:

fosearch "" args

Conclusion

Coming into this, I already fancied myself a power user of my company’s open source Python library, FiftyOne. I had written many of the docs, and I had used (and continue to use) the library on a daily basis. But the process of turning our docs into a searchable database forced me to understand our docs on an even deeper level. It’s always great when you’re building something for others, and it ends up helping you as well!

Here’s what I learned:

  • Sphinx RST makes beautiful docs, but it is a bit of a pain to parse.
  • OpenAI’s text-embedding-ada-002 model is great at understanding the meaning behind a text string, even if it has slightly atypical formatting. Gone are the days of stemming and painstakingly removing stop words and miscellaneous characters.
  • Break your documents up into the smallest possible meaningful segments, and retain context. For longer pieces of text, it’s more likely that overlap between a search query and a part of the text in your index will be obscured by less relevant text in the segment. If you break the document up too small, you run the risk that many entries in the index will contain very little semantic information.
  • With minimal lift, and without any fine-tuning, I was able to dramatically enhance the searchability of our docs. From initial estimates, it appears that this improved docs search is more than twice as likely to return relevant results as the old keyword search approach. Furthermore, the semantic nature of this vector search approach means that users can now search with arbitrarily phrased, arbitrarily complex queries, and are guaranteed to get the specified number of results.

If you find yourself (or others) constantly digging or sifting through treasure troves of documentation for specific kernels of information, I encourage you to adapt this process for your own use case. You can modify this to work on your personal documents, or your company’s archives. And if you do, I guarantee you’ll walk away from the experience seeing your documents in a new light!

Here are a few ways you could extend this for your own docs!

  • Hybrid search: combine vector search with traditional keyword search
  • Go global: use Qdrant Cloud to store and query the collection in the cloud
  • Incorporate web data: use requests to download HTML directly from the web
  • Automate updates: use GitHub Actions to trigger recomputation of embeddings whenever the underlying docs change
  • Embed: wrap this in a JavaScript element and drop it in as a replacement for a traditional search bar

All code used to build the package is open source, and can be found in the voxel51/fiftyone-docs-search repo.
