How I Turned My Company’s Docs into a Searchable Database with OpenAI
Converting the docs to a unified format
Processing the documents
Embedding text and code blocks with OpenAI
Making a Qdrant vector index
Querying the index
Writing the search wrapper
Conclusion

Image courtesy of Unsplash.

For the past six months, I’ve been working at Voxel51, a series A startup and creator of the open source computer vision toolkit FiftyOne. As a machine learning engineer and developer evangelist, my job is to listen to our open source community and bring them what they need — new features, integrations, tutorials, workshops, you name it.

A few weeks ago, we added native support for vector search engines and text similarity queries to FiftyOne, so that users can find the most relevant images in their (often massive — containing millions or tens of millions of samples) datasets via simple natural language queries.

This put us in a curious position: it was now possible for people using open source FiftyOne to readily search datasets with natural language queries, but using our documentation still required traditional keyword search.

We have a lot of documentation, which has its pros and cons. As a user myself, I sometimes find that given the sheer quantity of documentation, finding exactly what I’m looking for requires more time than I’d like.

I was not going to let this fly… so I built this in my spare time:

Semantically search your company’s docs from the command line. Image courtesy of author.

So, here’s how I turned our docs into a semantically searchable vector database:

You can find all of the code for this post in the voxel51/fiftyone-docs-search repo, and it’s easy to install the package locally in edit mode with pip install -e .

Better yet, if you want to implement semantic search for your own website using this method, you can follow along! Here are the ingredients you’ll need:

  • Install the openai Python package and create an account: you’ll use this account to send your docs and queries to an inference endpoint, which will return an embedding vector for each piece of text.
  • Install the qdrant-client Python package and launch a Qdrant server via Docker: you’ll use Qdrant to create a locally hosted vector index for the docs, against which queries will be run. The Qdrant service will run inside a Docker container.

My company’s docs are all hosted as HTML documents at https://docs.voxel51.com. A natural starting point would have been to download these docs with Python’s requests library and parse them with Beautiful Soup.

As a developer (and author of many of our docs), however, I thought I could do better. I already had a working clone of the GitHub repository on my local computer that contained all of the raw files used to generate the HTML docs. Some of our docs are written in Sphinx ReStructured Text (RST), whereas others, like tutorials, are converted to HTML from Jupyter notebooks.

I figured (mistakenly) that the closer I could get to the raw text of the RST and Jupyter files, the simpler things would be.

RST

In RST documents, sections are delineated by lines consisting only of strings of =, - or _. For example, here’s a document from the FiftyOne User Guide which contains all three delimiters:

RST document from open source FiftyOne Docs. Image courtesy of author.

I could then remove all of the RST keywords, such as toctree, code-block, and button_link (there were many more), as well as the :, ::, and .. that accompanied a keyword, the start of a new block, or block descriptors.
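For illustration, here is a minimal sketch of the kind of substitutions involved; the keyword list and regexes are simplified stand-ins, not the exact ones from the repo:

import re

# simplified stand-in list; the real set of keywords was much longer
RST_KEYWORDS = ["toctree", "code-block", "button_link", "image", "note"]

def strip_rst_keywords(section):
    for kw in RST_KEYWORDS:
        # remove directive lines such as ".. code-block:: python"
        section = re.sub(rf"\.\. {kw}::.*", "", section)
        # remove any leftover inline occurrences of the keyword with its colons
        section = re.sub(rf":{kw}:", "", section)
    return section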

Links were easy to take care of too:

no_links_section = re.sub(r"<[^>]+>_?","", section)

Things began to get dicey when I wanted to extract the section anchors from RST files. Many of our sections had anchors specified explicitly, whereas others were left to be inferred during the conversion to HTML.

Here is an example:

.. _brain-embeddings-visualization:

Visualizing embeddings
______________________

The FiftyOne Brain provides a powerful
:meth:`compute_visualization() ` method
that you can use to generate low-dimensional representations of the samples
and/or individual objects in your datasets.

These representations can be visualized natively in the App's
:ref:`Embeddings panel `, where you can interactively
select points of interest and view the corresponding samples/labels of interest
in the :ref:`Samples panel `, and vice versa.

.. image:: /images/brain/brain-mnist.png
:alt: mnist
:align: center

There are two primary components to an embedding visualization: the method used
to generate the embeddings, and the dimensionality reduction method used to
compute a low-dimensional representation of the embeddings.

Embedding methods
-----------------

The `embeddings` and `model` parameters of
:meth:`compute_visualization() `
support a variety of ways to generate embeddings for your data:

In the brain.rst file in our User Guide docs (a portion of which is reproduced above), the Visualizing embeddings section has an anchor #brain-embeddings-visualization specified by .. _brain-embeddings-visualization:. The Embedding methods subsection which immediately follows, however, is given an auto-generated anchor.
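Pulling out the explicit anchors themselves is straightforward with a regex; here is a quick sketch (not the repo's exact code):

import re

def extract_explicit_anchors(rst_text):
    # matches lines like ".. _brain-embeddings-visualization:"
    return re.findall(r"^\.\. _([\w-]+):\s*$", rst_text, flags=re.MULTILINE)

It's the implicitly generated anchors, like the one for Embedding methods, that can't be recovered this way.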

Another difficulty that soon reared its head was how to deal with tables in RST. List tables were fairly straightforward. As an example, here’s a list table from our View Stages cheat sheet:

.. list-table::

* - :meth:`match() `
* - :meth:`match_frames() `
* - :meth:`match_labels() `
* - :meth:`match_tags() `

Grid tables, on the other hand, can get messy fast. They give docs writers great flexibility, but this same flexibility makes parsing them a pain. Take this table from our Filtering cheat sheet:

+-----------------------------------------+------------------------------------------------------------+
| Operation                               | Command                                                    |
+=========================================+============================================================+
| Filepath starts with "/Users"           | .. code-block::                                            |
|                                         |                                                            |
|                                         |     ds.match(F("filepath").starts_with("/Users"))          |
+-----------------------------------------+------------------------------------------------------------+
| Filepath ends with "10.jpg" or "10.png" | .. code-block::                                            |
|                                         |                                                            |
|                                         |     ds.match(F("filepath").ends_with(("10.jpg", "10.png")) |
+-----------------------------------------+------------------------------------------------------------+
| Label contains string "be"              | .. code-block::                                            |
|                                         |                                                            |
|                                         |     ds.filter_labels(                                      |
|                                         |         "predictions",                                     |
|                                         |         F("label").contains_str("be"),                     |
|                                         |     )                                                      |
+-----------------------------------------+------------------------------------------------------------+
| Filepath contains "088" and is JPEG     | .. code-block::                                            |
|                                         |                                                            |
|                                         |     ds.match(F("filepath").re_match("088*.jpg"))           |
+-----------------------------------------+------------------------------------------------------------+

Within a table, rows can take up arbitrary numbers of lines, and columns can vary in width. Code blocks inside grid table cells are also difficult to parse, as they occupy space on multiple lines, so their content is interspersed with content from other columns. This means that code blocks in these tables need to be effectively reconstructed during the parsing process.

Not the end of the world. But also not ideal.

Jupyter

Jupyter notebooks turned out to be relatively easy to parse. I was able to read the contents of a Jupyter notebook into a list of strings, with one string per cell:

import json

ifile = "my_notebook.ipynb"
with open(ifile, "r") as f:
    contents = f.read()

contents = json.loads(contents)["cells"]
contents = [(" ".join(c["source"]), c["cell_type"]) for c in contents]

Moreover, the sections were delineated by Markdown cells starting with #.
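As a rough sketch (not the repo's exact code), grouping the cells from the contents list above into sections could look like this:

sections = []
current_section = []

for cell_text, cell_type in contents:
    # a Markdown cell starting with "#" begins a new section
    if cell_type == "markdown" and cell_text.lstrip().startswith("#"):
        if current_section:
            sections.append(current_section)
        current_section = []
    current_section.append((cell_text, cell_type))

if current_section:
    sections.append(current_section)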

However, given the challenges posed by RST, I decided to turn to HTML and treat all of our docs on equal footing.

HTML

I built the HTML docs from my local install with bash generate_docs.bash, and started parsing them with Beautiful Soup. However, I soon realized that when RST code blocks and tables with inline code were converted to HTML, although they rendered correctly, the HTML itself was incredibly unwieldy. Take our filtering cheat sheet as an example.

When rendered in a browser, the code block preceding the Dates and times section of our filtering cheat sheet looks like this:

Screenshot from cheat sheet in open source FiftyOne Docs. Image courtesy of author.

The raw HTML, however, looks like this:

RST cheat sheet converted to HTML. Image courtesy of author.

This is not impossible to parse, but it is also far from ideal.

Markdown

Fortunately, I was able to overcome these issues by converting all of the HTML files to Markdown with markdownify (a sketch of the conversion step follows the list below). Markdown had a few key benefits that made it the best fit for this job.

  1. Cleaner code formatting: code formatting was simplified from the spaghetti strings of span elements to inline code snippets marked with single ` characters before and after, and blocks of code marked by triple backticks ``` before and after. This also made it easy to split content into text and code.
  2. Section anchors: unlike raw RST, this Markdown included the section heading anchors, because the implicit anchors had already been generated. This way, I could link not just to the page containing a result, but to the specific section or subsection of that page.
  3. Standardization: Markdown provided mostly uniform formatting for the original RST and Jupyter documents, allowing us to give their content consistent treatment in the vector search application.
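The conversion step itself is essentially a one-liner per page with markdownify. Here is a minimal sketch, with hypothetical file paths standing in for a loop over every built HTML page:

from markdownify import markdownify

# hypothetical paths; in practice this runs over every built HTML page
with open("docs/build/html/cheat_sheets/filtering_cheat_sheet.html", "r") as f:
    html = f.read()

page_md = markdownify(html)

with open("docs_md/cheat_sheets/filtering_cheat_sheet.md", "w") as f:
    f.write(page_md)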

Note on LangChain

Some of you may know about the open source library LangChain for building applications with LLMs, and may be wondering why I didn’t just use LangChain’s Document Loaders and Text Splitters. The answer: I needed more control!

Once the documents had been converted to Markdown, I proceeded to clean the contents and split them into smaller segments.

Cleaning

Cleaning mostly consisted of removing unnecessary elements, including:

  • Headers and footers
  • Table row and column scaffolding — e.g. the |’s in |select()| select_by()|
  • Extra newlines
  • Links
  • Images
  • Unicode characters
  • Bolding — i.e. **text** → text

I also removed escape characters that were escaping characters which have special meaning in our docs: _ and *. The former is used in many method names, and the latter, as usual, is used in multiplication, wildcard patterns, and many other places:

document = document.replace("\_", "_").replace("\*", "*")

Splitting documents into semantic blocks

With the contents of our docs cleaned, I proceeded to split the docs into bite-sized blocks.

First, I split each document into sections. At first glance, it seems like this could be done by finding any line that starts with a # character. In my application, I didn’t differentiate between h1, h2, h3, and so on (#, ##, ###), so checking the first character is sufficient. However, this logic gets us in trouble when we realize that # is also used for comments in Python code.

To bypass this problem, I split the document into text blocks and code blocks:

text_and_code = page_md.split('```')
text = text_and_code[::2]
code = text_and_code[1::2]

Then I identified the start of a new section by a # at the beginning of a line in a text block, and extracted the section title and anchor from that line:

def extract_title_and_anchor(header):
    header = " ".join(header.split(" ")[1:])
    title = header.split("[")[0]
    anchor = header.split("(")[1].split(" ")[0]
    return title, anchor

And assigned each block of text or code to the appropriate section.
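Putting these pieces together, a rough sketch of that assignment logic (simplified relative to the repo's implementation) might look like:

def assign_blocks_to_sections(text_and_code):
    # maps section anchor -> list of (block, block_type) pairs
    subsections = {None: []}
    current_anchor = None

    for i, block in enumerate(text_and_code):
        if i % 2 == 1:
            # odd entries in the ``` split are code blocks
            subsections[current_anchor].append((block, "code"))
            continue

        # even entries are text; a line starting with "#" begins a new section
        buffer = []
        for line in block.split("\n"):
            if line.startswith("#"):
                if buffer:
                    subsections[current_anchor].append(("\n".join(buffer), "text"))
                    buffer = []
                _, current_anchor = extract_title_and_anchor(line)
                subsections.setdefault(current_anchor, [])
            else:
                buffer.append(line)
        if buffer:
            subsections[current_anchor].append(("\n".join(buffer), "text"))

    return subsections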

Initially, I also tried splitting the text blocks into paragraphs, hypothesizing that because a section may contain information about many different topics, the embedding for that entire section may not be similar to an embedding for a text prompt concerned with only one of those topics. This approach, however, resulted in top matches for most search queries disproportionately being single-line paragraphs, which turned out not to be terribly informative as search results.

Check out the accompanying GitHub repo for the implementation of these methods, which you can try on your own docs!

With documents converted, processed, and split into strings, I generated an embedding vector for each of these blocks. Because large language models are flexible and generally capable by nature, I decided to treat text blocks and code blocks on the same footing as pieces of text, and to embed them with the same model.

I used OpenAI’s text-embedding-ada-002 model because it is easy to work with, achieves the highest performance out of all of OpenAI’s embedding models (on the BEIR benchmark), and is also the cheapest. It is so cheap, in fact ($0.0004/1K tokens), that generating all of the embeddings for the FiftyOne docs only cost a few cents! As OpenAI themselves put it, “We recommend using text-embedding-ada-002 for nearly all use cases. It’s better, cheaper, and simpler to use.”

With this embedding model, you can generate a 1536-dimensional vector representing any input prompt of up to 8,191 tokens (roughly 30,000 characters).

To get started, you need to create an OpenAI account, generate an API key at https://platform.openai.com/account/api-keys, and export this API key as an environment variable with:

export OPENAI_API_KEY=""

You will also need to install the openai Python library:

pip install openai

I wrote a wrapper around OpenAI’s API that takes in a text prompt and returns an embedding vector:

import openai

MODEL = "text-embedding-ada-002"

def embed_text(text):
    response = openai.Embedding.create(
        input=text,
        model=MODEL
    )
    embeddings = response["data"][0]["embedding"]
    return embeddings

To generate embeddings for all of our docs, we just apply this function to each of the subsections — text and code blocks — across all of our docs.
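In its simplest form (ignoring batching and rate limits), that driver loop could look something like this sketch, applied to the per-page subsections built earlier:

def embed_page(subsections):
    # maps section anchor -> list of (block, block_type, embedding) triples
    embedded = {}
    for section_anchor, blocks in subsections.items():
        embedded[section_anchor] = [
            (block, block_type, embed_text(block))
            for block, block_type in blocks
        ]
    return embedded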

With embeddings in hand, I created a vector index to search against. I chose to use Qdrant for the same reasons we chose to add native Qdrant support to FiftyOne: it’s open source, free, and easy to use.

To get started with Qdrant, you can pull a pre-built Docker image and run the container:

docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant

Additionally, you will need to install the Qdrant Python client:

pip install qdrant-client

I created the Qdrant collection:

import qdrant_client as qc
import qdrant_client.http.models as qmodels

client = qc.QdrantClient(url="localhost")
METRIC = qmodels.Distance.DOT
DIMENSION = 1536
COLLECTION_NAME = "fiftyone_docs"

def create_index():
    client.recreate_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=qmodels.VectorParams(
            size=DIMENSION,
            distance=METRIC,
        )
    )

I then created a vector for each subsection (text or code block):

import uuid

def create_subsection_vector(
    subsection_content,
    section_anchor,
    page_url,
    doc_type,
    block_type
):
    vector = embed_text(subsection_content)
    id = str(uuid.uuid1().int)[:32]
    payload = {
        "text": subsection_content,
        "url": page_url,
        "section_anchor": section_anchor,
        "doc_type": doc_type,
        "block_type": block_type
    }
    return id, vector, payload

For each vector, you can provide additional context as part of the payload. In this case, I included the URL (and anchor) where the result can be found, the type of document (so the user can specify whether they want to search through all of the docs or just certain kinds of docs), and the contents of the string which generated the embedding vector. I also added the block type (text or code), so that if the user is looking for a code snippet, they can tailor their search to that purpose.

Then I added these vectors to the index, one page at a time:

def add_doc_to_index(subsections, page_url, doc_type, block_type):
    ids = []
    vectors = []
    payloads = []

    for section_anchor, section_content in subsections.items():
        for subsection in section_content:
            id, vector, payload = create_subsection_vector(
                subsection,
                section_anchor,
                page_url,
                doc_type,
                block_type
            )
            ids.append(id)
            vectors.append(vector)
            payloads.append(payload)

    ## Add vectors to collection
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=qmodels.Batch(
            ids=ids,
            vectors=vectors,
            payloads=payloads
        ),
    )

Once the index has been created, running a search on the indexed documents can be achieved by embedding the query text with the same embedding model and then searching the index for similar embedding vectors. With a Qdrant vector index, a basic query can be performed with the Qdrant client’s search() command.
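As a sketch, an unfiltered query against the collection looks roughly like this, using the embed_text() wrapper and the client from above (the example query string is just an illustration):

query_vector = embed_text("How do I visualize embeddings?")

hits = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=5,
    with_payload=True,
)

for hit in hits:
    # each hit carries its similarity score and the payload we stored earlier
    print(hit.score, hit.payload["url"], hit.payload["section_anchor"])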

To make my company’s docs searchable, I wanted to allow users to filter by section of the docs, as well as by the type of block that was encoded. In the parlance of vector search, filtering results while still ensuring that a predetermined number of results (specified by the top_k argument) will be returned is called pre-filtering.

To achieve this, I wrote a programmatic filter:

def _generate_query_filter(query, doc_types, block_types):
    """Generates a filter for the query.

    Args:
        query: A string containing the query.
        doc_types: A list of document types to search.
        block_types: A list of block types to search.

    Returns:
        A filter for the query.
    """
    doc_types = _parse_doc_types(doc_types)
    block_types = _parse_block_types(block_types)

    _filter = qmodels.Filter(
        must=[
            qmodels.Filter(
                should=[
                    qmodels.FieldCondition(
                        key="doc_type",
                        match=qmodels.MatchValue(value=dt),
                    )
                    for dt in doc_types
                ],
            ),
            qmodels.Filter(
                should=[
                    qmodels.FieldCondition(
                        key="block_type",
                        match=qmodels.MatchValue(value=bt),
                    )
                    for bt in block_types
                ],
            ),
        ]
    )

    return _filter

The internal _parse_doc_types() and _parse_block_types() functions handle cases where the argument is string-valued, list-valued, or None.
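As an illustration of that behavior (not the repo's exact code), _parse_doc_types() could be as simple as:

def _parse_doc_types(doc_types):
    if doc_types is None:
        # None means "search everything"; ALL_DOC_TYPES is a hypothetical
        # module-level list of the valid document types
        return ALL_DOC_TYPES
    if isinstance(doc_types, str):
        return [doc_types]
    return list(doc_types)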

Then I wrote a function query_index() that takes the user’s text query, pre-filters, searches the index, and extracts relevant information from the payload. The function returns a list of tuples of the form (url, contents, score), where the score indicates how good a match the result is to the query text.

def query_index(query, top_k=10, doc_types=None, block_types=None):
    vector = embed_text(query)
    _filter = _generate_query_filter(query, doc_types, block_types)

    results = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=vector,
        query_filter=_filter,
        limit=top_k,
        with_payload=True,
        search_params=_search_params,  # module-level search configuration
    )

    results = [
        (
            f"{res.payload['url']}#{res.payload['section_anchor']}",
            res.payload["text"],
            res.score,
        )
        for res in results
    ]

    return results

The final step was providing a clean interface for the user to semantically search against these “vectorized” docs.

I wrote a function print_results(), which takes the query, the results from query_index(), and a score argument (whether or not to print the similarity score), and prints the results in an easy-to-interpret way. I used the rich Python package to format hyperlinks in the terminal so that, when working in a terminal that supports hyperlinks, clicking on a hyperlink will open the page in your default browser. I also used webbrowser to automatically open the link for the top result, if desired.

Display search results with rich hyperlinks. Image courtesy of author.

For Python-based searches, I created a class FiftyOneDocsSearch to encapsulate the document search behavior, so that once a FiftyOneDocsSearch object has been instantiated (potentially with default settings for search arguments):

from fiftyone.docs_search import FiftyOneDocsSearch
fosearch = FiftyOneDocsSearch(open_url=False, top_k=3, score=True)

You can search within Python by calling this object. To query the docs for “How to load a dataset”, for instance, you just need to run:

fosearch("How to load a dataset")
Semantically search your company’s docs within a Python process. Image courtesy of author.
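Behind the scenes, a minimal sketch of how such a callable wrapper class could be structured, assuming the query_index() and print_results() functions described above:

import webbrowser

class FiftyOneDocsSearch:
    def __init__(self, open_url=True, top_k=10, score=False,
                 doc_types=None, block_types=None):
        # stored defaults; any of these can be overridden per call
        self.defaults = {
            "open_url": open_url,
            "top_k": top_k,
            "score": score,
            "doc_types": doc_types,
            "block_types": block_types,
        }

    def __call__(self, query, **kwargs):
        opts = {**self.defaults, **kwargs}
        results = query_index(
            query,
            top_k=opts["top_k"],
            doc_types=opts["doc_types"],
            block_types=opts["block_types"],
        )
        print_results(query, results, score=opts["score"])
        if opts["open_url"] and results:
            webbrowser.open(results[0][0])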

I also used argparse to make this docs search functionality available via the command line. When the package is installed, the docs are CLI searchable with:

fiftyone-docs-search query "" 
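Under the hood, the CLI entry point is a thin argparse layer over query_index() and print_results(); here is a simplified sketch (the real parser exposes more options):

import argparse

def main():
    parser = argparse.ArgumentParser(prog="fiftyone-docs-search")
    subparsers = parser.add_subparsers(dest="command")

    query_parser = subparsers.add_parser("query", help="semantically search the docs")
    query_parser.add_argument("query_text", type=str)
    query_parser.add_argument("--top_k", type=int, default=10)

    args = parser.parse_args()
    if args.command == "query":
        results = query_index(args.query_text, top_k=args.top_k)
        print_results(args.query_text, results, score=True)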

Just for fun, because fiftyone-docs-search query is a bit cumbersome, I added an alias to my .zshrc file:

alias fosearch='fiftyone-docs-search query'

With this alias, the docs are searchable from the command line with:

fosearch "" args

Coming into this, I already fancied myself a power user of my company’s open source Python library, FiftyOne. I had written many of the docs, and I had used (and continue to use) the library on a daily basis. But the process of turning our docs into a searchable database forced me to understand our docs on an even deeper level. It’s always great when you’re building something for others, and it ends up helping you as well!

Here’s what I learned:

  • Sphinx RST is awkward: it makes beautiful docs, but it is a bit of a pain to parse
  • OpenAI’s text-embedding-ada-002 model is great at understanding the meaning behind a text string, even when it has slightly atypical formatting. Gone are the days of stemming and painstakingly removing stop words and miscellaneous characters.
  • Chunking matters: break your documents up into the smallest possible meaningful segments, and retain context. For longer pieces of text, it is more likely that overlap between a search query and a part of the text in your index will be obscured by less relevant text in the segment. If you break the document up too small, you run the risk that many entries in the index will contain very little semantic information.
  • Vector search is powerful: with minimal lift, and without any fine-tuning, I was able to dramatically enhance the searchability of our docs. From initial estimates, it appears that this improved docs search is more than twice as likely to return relevant results as the old keyword search approach. Additionally, the semantic nature of this vector search approach means that users can now search with arbitrarily phrased, arbitrarily complex queries, and are guaranteed to get the specified number of results.

If you find yourself (or others) constantly digging or sifting through treasure troves of documentation for specific kernels of information, I encourage you to adapt this process for your own use case. You can modify it to work for your personal documents or your company’s archives. And if you do, I guarantee you’ll walk away from the experience seeing your documents in a new light!

Here are a few ways you could extend this for your own docs!

  • Hybrid search: combine vector search with traditional keyword search
  • Go global: use Qdrant Cloud to store and query the collection in the cloud
  • Incorporate web data: use requests to download HTML directly from the web
  • Automate updates: use GitHub Actions to trigger recomputation of embeddings whenever the underlying docs change
  • Embed: wrap this in a JavaScript element and drop it in as a replacement for a traditional search bar

All code used to build the package is open source, and can be found in the voxel51/fiftyone-docs-search repo.
