Optimizing Vector Search: Why You Should Flatten Structured Data 


When integrating structured data into a RAG system, engineers often default to embedding raw JSON directly into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor performance. Modern embedding models are based on the BERT architecture, which is essentially the encoder part of a Transformer, and they are trained on enormous amounts of unstructured text with the primary goal of capturing semantic meaning. They can provide incredible retrieval performance on natural language, but raw JSON does not look like natural language. As a result, although embedding JSON may seem like a simple and elegant solution, using a generic embedding model for JSON objects produces results far from peak performance.

Deep dive

Tokenization

The first step is tokenization, which takes the text and splits it into tokens, generally sub-word units. Modern embedding models use Byte-Pair Encoding (BPE) or WordPiece tokenization algorithms. These algorithms are optimized for natural language, breaking words into common sub-components. When a tokenizer encounters raw JSON, it struggles with the high frequency of non-alphanumeric characters. For instance, "usd": 10, is not viewed as a key-value pair; instead, it is fragmented into:

  • The quotes ("), colon (:), and comma (,) as separate structural tokens
  • The tokens usd and 10

This creates a low signal-to-noise ratio. In natural language, virtually all tokens contribute to the semantic “signal”, while in JSON (and other structured formats) a significant percentage of tokens are “wasted” on structural syntax that carries zero semantic value.
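You can see this yourself with a few lines of Python. The sketch below assumes the Hugging Face transformers library is installed and uses the tokenizer of the same all-MiniLM-L6-v2 model used in the experiment later in this article:

from transformers import AutoTokenizer

# WordPiece tokenizer of the embedding model used later in this article
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

json_text = '"price": {"usd": 10, "eur": 9}'
natural_text = "The price is 10 US dollars or 9 euros"

for text in (json_text, natural_text):
    tokens = tokenizer.tokenize(text)
    # the JSON variant spends a large share of its tokens on punctuation
    print(f"{len(tokens):2d} tokens -> {tokens}")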

Attention calculation

The core power of Transformers lies in the attention mechanism, which allows the model to weigh the importance of tokens relative to one another.

In the sentence The price is 10 US dollars or 9 euros, attention can easily link the value 10 to the concept price, because this relationship is well represented in the model’s pre-training data and the model has seen the linguistic pattern millions of times. However, in the raw JSON:

"price": {
  "usd": 10,
  "eur": 9,
 }

the model encounters structural syntax it was never primarily optimized to “read”. Without the linguistic connectors, the resulting vector fails to capture the true intent of the data, because the relationships between the key and the value are obscured by the format itself.
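A quick illustration of the effect, as a minimal sketch assuming the sentence-transformers library is installed (the exact numbers will vary, and this is not the full experiment described later):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "product that costs 10 dollars"
raw_json = '"price": {"usd": 10, "eur": 9}'
flattened = "The price is 10 US dollars or 9 euros"

q, j, f = model.encode([query, raw_json, flattened])

# Compare how close each representation sits to the natural-language query.
print("query vs raw JSON :", util.cos_sim(q, j).item())
print("query vs flattened:", util.cos_sim(q, f).item())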

Mean Pooling

The final step in generating a single embedding representation of the document is Mean Pooling. Mathematically, the final embedding E is the centroid of all n token vectors e1, e2, ..., en in the document:

E = (e1 + e2 + ... + en) / n

Mean Pooling converts a sequence of n token embeddings into a single vector representation by averaging their values.

This is where the JSON tokens become a mathematical liability. If 25% of the tokens in the document are structural markers (braces, quotes, colons), the final vector is heavily influenced by the “meaning” of punctuation. As a result, the vector is effectively “pulled” away from its true semantic center in the vector space by these noise tokens. When a user submits a natural language query, the distance between the “clean” query vector and the “noisy” JSON vector increases, directly hurting retrieval metrics.
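The dilution is easy to simulate. This is an illustrative sketch only (synthetic vectors, not real model outputs): content tokens cluster in one region, structural tokens elsewhere, and mean pooling drags the document vector toward the noise:

import numpy as np

rng = np.random.default_rng(0)
# pretend token embeddings: 6 content tokens vs 2 structural tokens
content_tokens = rng.normal(loc=1.0, scale=0.1, size=(6, 384))
noise_tokens = rng.normal(loc=0.0, scale=0.1, size=(2, 384))

clean_embedding = content_tokens.mean(axis=0)
noisy_embedding = np.vstack([content_tokens, noise_tokens]).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The noisy document vector drifts away from the clean semantic centroid.
print("similarity(clean, noisy):", cosine(clean_embedding, noisy_embedding))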

Flatten it

Now that we know about these JSON limitations, we need to figure out how to work around them. The most general and straightforward approach is to flatten the JSON and convert it into natural language.

Let’s consider a typical product object:

{
  "skuId": "123",
  "description": "This is a test product used for demonstration purposes",
  "quantity": 5,
  "price": {
    "usd": 10,
    "eur": 9
  },
  "availableDiscounts": ["1", "2", "3"],
  "giftCardAvailable": "true",
  "category": "demo product"
  ...
}

This is a simple object with some attributes like description, price, and so on. Let’s apply tokenization to it and see how it looks:

Tokenization of the raw JSON. Notice the high density of distinct tokens for syntax (braces, quotes, colons) that contribute noise rather than meaning.

Now, let’s convert it into text to make the embedding model’s job easier. To do that, we can define a template and substitute the JSON values into it. For example, this template could be used to describe the product:

Product with SKU {skuId} belongs to the category "{category}"
Description: {description}
It has a quantity of {quantity} available 
The price is {price.usd} US dollars or {price.eur} euros
Available discount ids include {availableDiscounts as comma-separated list}  
Gift cards are {giftCardAvailable ? "available" : "not available"} for this product

So the result will look like:

Product with SKU 123 belongs to the category "demo product"
Description: This is a test product used for demonstration purposes
It has a quantity of 5 available
The price is 10 US dollars or 9 euros
Available discount ids include 1, 2, and 3
Gift cards are available for this product

And apply the tokenizer to it:

Tokenization of the flattened text. The resulting sequence is shorter (14% fewer tokens) and composed primarily of semantically meaningful words. 

Not only does it have 14% fewer tokens now, but it is also a much clearer form that preserves the semantic meaning and the required context.
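Filling in such a template takes only a few lines of Python. Below is a minimal sketch for the product object shown above; product_to_text is a hypothetical helper name (the experiment later uses its own, simpler flatten_product function), and the discount list is rendered as a plain comma-separated join:

def product_to_text(product: dict) -> str:
    # Hypothetical helper: render the product object through the template above.
    discounts = ", ".join(product["availableDiscounts"])
    gift = "available" if product["giftCardAvailable"] == "true" else "not available"
    return (
        f'Product with SKU {product["skuId"]} belongs to the category "{product["category"]}"\n'
        f'Description: {product["description"]}\n'
        f'It has a quantity of {product["quantity"]} available\n'
        f'The price is {product["price"]["usd"]} US dollars or {product["price"]["eur"]} euros\n'
        f'Available discount ids include {discounts}\n'
        f'Gift cards are {gift} for this product'
    )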

Let’s measure the results

Note: Complete, reproducible code for this experiment is available in the Google Colab notebook [1].

Now let’s try to measure retrieval performance for both options. We will focus on standard retrieval metrics like Recall@k, Precision@k, and MRR to keep things simple, and will use a generic embedding model (all-MiniLM-L6-v2) and the Amazon ESCI dataset with a random sample of 5,000 queries and 3,809 associated products.
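As a refresher, these per-query metrics can be computed in a few lines; this is a simplified sketch (the full evaluation lives in the linked notebook), with MRR obtained by averaging the reciprocal rank over all queries:

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    # Fraction of relevant products that appear in the top-k results.
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(retrieved_ids, relevant_ids, k=10):
    # Fraction of the top-k results that are relevant.
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / k

def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1 / rank of the first relevant result, or 0 if none is retrieved.
    for rank, pid in enumerate(retrieved_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0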

all-MiniLM-L6-v2 is a popular choice: it is small (22.7M parameters) but provides fast and accurate results, making it a good fit for this experiment.

For the dataset, a prepared version of Amazon ESCI is used, specifically milistu/amazon-esci-data [3], which is available on Hugging Face and contains a collection of Amazon products and search-query data.

The flattening function used for text conversion is:

def flatten_product(product):
    # Convert a product record into a single natural-language string for embedding.
    return (
        f"Product {product['product_title']} from brand {product['product_brand']}"
        f" and product id {product['product_id']}"
        f" and description {product['product_description']}"
    )

A sample of the raw JSON data is:

{
  "product_id": "B07NKPWJMG",
  "title": "RoWood 3D Puzzles for Adults, Picket Mechanical Gear Kits for Teens Kids Age 14+",
  "description": "

Specifications
Model Number: Rowood Treasure box LK502
Average build time: 5 hours
Total Pieces: 123
Model weight: 0.69 kg
Box weight: 0.74 KG
Assembled size: 100*124*85 mm
Box size: 320*235*39 mm
Certificates: EN71,-1,-2,-3,ASTMF963
Recommended Age Range: 14+
Contents
Plywood sheets
Metal Spring
Illustrated instructions
Accessories
MADE FOR ASSEMBLY
-Follow the instructions provided in the booklet and assemble the 3D puzzle with some exciting and fascinating fun. Feel the pride of self creation getting this exquisite wooden work like a pro.
GLORIFY YOUR LIVING SPACE
-Revive the enigmatic charm and cheer up your parties and get-togethers with an experience that is unique and engaging.
", "brand": "RoWood", "color": "Treasure Box" }

For the vector search, two FAISS indexes are created: one for the flattened text and one for the JSON-formatted text. Both indexes are flat, which means they compute distances against every stored entry instead of using an Approximate Nearest Neighbour (ANN) index. This is important to make sure the retrieval metrics are not affected by ANN approximation.

import faiss

D = 384  # embedding dimension of all-MiniLM-L6-v2
# Exact (non-ANN) inner-product indexes; cosine similarity on L2-normalized vectors
index_json = faiss.IndexFlatIP(D)
index_flatten = faiss.IndexFlatIP(D)
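For completeness, here is a minimal sketch of how products could be embedded and added to the indexes defined above; the tiny corpora and variable names are illustrative (not from the original notebook), and embeddings are L2-normalized so that the inner product behaves as cosine similarity:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Tiny placeholder corpora (illustrative only; the real experiment uses the ESCI products).
json_texts = ['{"skuId": "123", "price": {"usd": 10, "eur": 9}}']
flattened_texts = ["Product with SKU 123. The price is 10 US dollars or 9 euros"]
queries = ["product that costs 10 dollars"]

# L2-normalized embeddings so that inner product == cosine similarity.
index_json.add(model.encode(json_texts, normalize_embeddings=True))
index_flatten.add(model.encode(flattened_texts, normalize_embeddings=True))

# Queries are embedded the same way and searched in both indexes.
query_embeddings = model.encode(queries, normalize_embeddings=True)
scores, ids = index_flatten.search(query_embeddings, 1)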

To reduce the dataset, a random sample of 5,000 queries was chosen, and all corresponding products were embedded and added to the indexes. The collected metrics are as follows:

Comparing the two indexing methods using the all-MiniLM-L6-v2 embedding model on the Amazon ESCI dataset. The flattened approach consistently yields higher scores across all key retrieval metrics (Precision@10, Recall@10, and MRR).

And the performance change of the flattened version is:

Converting the structured JSON to natural language text resulted in significant gains, including a 19.1% boost in Recall@10 and a 27.2% boost in MRR (Mean Reciprocal Rank), confirming the superior semantic representation of the flattened data. 

The evaluation confirms that embedding raw structured data into a generic vector space is a suboptimal approach, and that adding a simple preprocessing step of flattening structured data consistently delivers significant improvements in retrieval metrics (boosting Recall@k and Precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is critically important for achieving peak performance of a semantic retrieval/RAG system.

References

[1] Full experiment code (Google Colab): https://colab.research.google.com/drive/1dTgt6xwmA6CeIKE38lf2cZVahaJNbQB1?usp=sharing
[2] all-MiniLM-L6-v2 model: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[3] Amazon ESCI dataset, version used: https://huggingface.co/datasets/milistu/amazon-esci-data (original dataset: https://www.amazon.science/code-and-datasets/shopping-queries-dataset-a-large-scale-esci-benchmark-for-improving-product-search)
[4] FAISS: https://ai.meta.com/tools/faiss/
