When feeding structured data into a RAG system, engineers often default to embedding raw JSON directly into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor performance. Modern embedding models are based on the BERT architecture, which is essentially the encoder part of a Transformer, and are trained on huge unstructured text corpora with the primary goal of capturing semantic meaning. They can deliver incredible retrieval performance on natural language, but as a result, even though embedding JSON may seem like a simple and elegant solution, using a generic embedding model on raw JSON objects yields results far from peak performance.
Deep dive
Tokenization
The first step is tokenization, which splits the text into tokens, generally sub-word units. Modern embedding models use Byte-Pair Encoding (BPE) or WordPiece tokenization algorithms. These algorithms are optimized for natural language, breaking words into common sub-components. When a tokenizer encounters raw JSON, it struggles with the high frequency of non-alphanumeric characters. For example, "usd": 10, is not viewed as a key-value pair; instead, it is fragmented into:
- The quotes ("), colon (:), and comma (,) as separate tokens
- The tokens usd and 10
This creates a low signal-to-noise ratio. In natural language, virtually all words contribute to the semantic “signal”, while in JSON (and other structured formats) a significant percentage of tokens are “wasted” on structural syntax that carries zero semantic value.
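To see the fragmentation concretely, here is a minimal sketch (assuming the Hugging Face transformers library and the WordPiece tokenizer of all-MiniLM-L6-v2, the model used later in the experiment; the exact splits depend on the vocabulary):

from transformers import AutoTokenizer

# Tokenizer of the embedding model used later in the experiment
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

fragment = '"usd": 10,'
print(tokenizer.tokenize(fragment))
# Several of the resulting tokens are just the quotes, the colon, and the comma,
# and the key "usd" itself may be split into sub-word pieces.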
Attention calculation
The core power of Transformers lies in the attention mechanism. It enables the model to weigh the importance of tokens relative to one another.
In the sentence The price is 10 US dollars or 9 euros, attention can easily link the value 10 to the concept price, because these relationships are well represented in the model's pre-training data and the model has seen this linguistic pattern millions of times. However, in the raw JSON:
"price": {
"usd": 10,
"eur": 9,
}
the model encounters structural syntax it was not primarily optimized to “read”. Without the linguistic connectors, the resulting vector will fail to capture the true intent of the data, because the relationships between the keys and the values are obscured by the format itself.
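As a quick illustration of this effect, the sketch below (assuming the sentence-transformers library and the same all-MiniLM-L6-v2 model used later; the query string is an invented example) compares how close a natural-language query lands to the sentence versus the raw JSON fragment:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "How much does the product cost in US dollars?"
sentence = "The price is 10 US dollars or 9 euros"
raw_json = '"price": {"usd": 10, "eur": 9}'

# Encode all three texts and compare cosine similarities to the query
q, s, j = model.encode([query, sentence, raw_json])
print("query vs sentence:", util.cos_sim(q, s).item())
print("query vs raw JSON:", util.cos_sim(q, j).item())
# The expectation is that the natural-language sentence scores noticeably higher.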
Mean Pooling
The final step in generating a single embedding representation of the document is mean pooling. Mathematically, the final embedding E is the centroid of all n token vectors e_1, e_2, …, e_n in the document:

E = (e_1 + e_2 + … + e_n) / n
This is where the JSON tokens become a mathematical liability. If 25% of the tokens in the document are structural markers (braces, quotes, colons), the final vector is heavily influenced by the “meaning” of punctuation. As a result, the vector is effectively “pulled” away from its true semantic center in the vector space by these noise tokens. When a user submits a natural language query, the distance between the “clean” query vector and the “noisy” JSON vector increases, directly hurting the retrieval metrics.
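Conceptually, mean pooling is just an average over the per-token vectors. A toy NumPy sketch (random stand-in vectors, not real model outputs) shows how structural tokens shift the pooled centroid:

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-token embeddings (the real model produces 384-dim vectors)
semantic_tokens = rng.normal(loc=1.0, size=(8, 384))    # tokens carrying meaning
structural_tokens = rng.normal(loc=0.0, size=(4, 384))  # quotes, colons, braces

def mean_pool(token_vectors):
    # The document embedding is the centroid of its token vectors
    return token_vectors.mean(axis=0)

clean = mean_pool(semantic_tokens)
noisy = mean_pool(np.vstack([semantic_tokens, structural_tokens]))

cosine = np.dot(clean, noisy) / (np.linalg.norm(clean) * np.linalg.norm(noisy))
print("cosine similarity between clean and noisy centroids:", cosine)
# The structural tokens pull the pooled vector away from the semantic centroid.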
Flatten it
Now that we know about the JSON limitations, we need to figure out how to work around them. The most common and most straightforward approach is to flatten the JSON and convert it into natural language.
Let’s consider a typical product object:
{
  "skuId": "123",
  "description": "This is a test product used for demonstration purposes",
  "quantity": 5,
  "price": {
    "usd": 10,
    "eur": 9
  },
  "availableDiscounts": ["1", "2", "3"],
  "giftCardAvailable": "true",
  "category": "demo product",
  ...
}
This is a simple object with a few attributes like description, quantity, and price. Let’s apply the tokenizer to it and see how it looks:
[Tokenized raw JSON product object]
Now, let’s convert it into text to make the embedding model’s job easier. To do that, we can define a template and substitute the JSON values into it. For instance, this template could be used to describe the product:
Product with SKU {skuId} belongs to the category "{category}"
Description: {description}
It has a quantity of {quantity} available
The price is {price.usd} US dollars or {price.eur} euros
Available discount ids include {availableDiscounts as comma-separated list}
Gift cards are {giftCardAvailable ? "available" : "not available"} for this product
So the result will look like:
Product with SKU 123 belongs to the category "demo product"
Description: This is a test product used for demonstration purposes
It has a quantity of 5 available
The price is 10 US dollars or 9 euros
Available discount ids include 1, 2, and 3
Gift cards are available for this product
And apply the tokenizer to it:
[Tokenized flattened product text]
Not only does it have 14% fewer tokens now, but it is also a much clearer form that carries the semantic meaning and the required context.
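A sketch of how this conversion and token-count comparison could look in code (the flatten_json_product helper mirrors the template above but is my own; the tokenizer is the one from all-MiniLM-L6-v2):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

product = {
    "skuId": "123",
    "description": "This is a test product used for demonstration purposes",
    "quantity": 5,
    "price": {"usd": 10, "eur": 9},
    "availableDiscounts": ["1", "2", "3"],
    "giftCardAvailable": "true",
    "category": "demo product",
}

def flatten_json_product(p):
    # Substitute the JSON values into the natural-language template
    discounts = ", ".join(p["availableDiscounts"])
    gift = "available" if p["giftCardAvailable"] == "true" else "not available"
    return (
        f'Product with SKU {p["skuId"]} belongs to the category "{p["category"]}"\n'
        f'Description: {p["description"]}\n'
        f'It has a quantity of {p["quantity"]} available\n'
        f'The price is {p["price"]["usd"]} US dollars or {p["price"]["eur"]} euros\n'
        f'Available discount ids include {discounts}\n'
        f'Gift cards are {gift} for this product'
    )

raw_tokens = tokenizer.tokenize(json.dumps(product, indent=2))
flat_tokens = tokenizer.tokenize(flatten_json_product(product))
print(len(raw_tokens), len(flat_tokens))
# The flattened text should come out noticeably shorter, in line with the reduction above.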
Let’s measure the results
Note: Complete, reproducible code for this experiment is available in the Google Colab notebook [1].
Now let’s measure retrieval performance for both options. We will focus on standard retrieval metrics like Recall@k, Precision@k, and MRR to keep it simple, and will use a generic embedding model (all-MiniLM-L6-v2) and the Amazon ESCI dataset with a random sample of 5,000 queries and 3,809 associated products.
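For reference, these metrics can be computed with a few lines of Python; the sketch below uses simplified definitions of my own (the complete evaluation code is in the notebook [1]):

def precision_at_k(retrieved_ids, relevant_ids, k=10):
    # Fraction of the top-k retrieved documents that are relevant
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    # Fraction of the relevant documents that appear in the top-k results
    if not relevant_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in relevant_ids if doc_id in top_k) / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1 / rank of the first relevant result; MRR is the mean of this value over all queries
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0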
all-MiniLM-L6-v2 [2] is a popular choice: it is small (22.7M parameters) yet fast and accurate, which makes it a good fit for this experiment.
For the dataset, a repackaged version of Amazon ESCI is used, specifically milistu/amazon-esci-data [3], which is available on Hugging Face and contains a collection of Amazon products and search query data.
The flattening function used for text conversion is:
def flatten_product(product):
    return (
        f"Product {product['product_title']} from brand {product['product_brand']}"
        f" and product id {product['product_id']}"
        f" and description {product['product_description']}"
    )
A sample of the raw JSON data is:
{
  "product_id": "B07NKPWJMG",
  "product_title": "RoWood 3D Puzzles for Adults, Wooden Mechanical Gear Kits for Teens Kids Age 14+",
  "product_description": " Specifications
Model Number: Rowood Treasure box LK502
Average build time: 5 hours
Total Pieces: 123
Model weight: 0.69 kg
Box weight: 0.74 KG
Assembled size: 100*124*85 mm
Box size: 320*235*39 mm
Certificates: EN71,-1,-2,-3,ASTMF963
Recommended Age Range: 14+
Contents
Plywood sheets
Metal Spring
Illustrated instructions
Accessories
MADE FOR ASSEMBLY
-Follow the instructions provided in the booklet and assembly 3d puzzle with some exciting and fascinating fun. Feel the pride of self creation getting this exquisite wooden work like a professional.
GLORIFY YOUR LIVING SPACE
-Revive the enigmatic charm and cheer your parties and get-togethers with an experience that is exclusive and interesting.
",
  "product_brand": "RoWood",
  "product_color": "Treasure Box"
}
For the vector search, two FAISS indexes are created: one for the flattened text and one for the JSON-formatted text. Both indexes are flat, which means they compare the query against every stored entry exactly instead of using an Approximate Nearest Neighbour (ANN) index. This is important to make sure that the retrieval metrics are not affected by ANN approximation error.
D = 384  # embedding dimensionality of all-MiniLM-L6-v2
index_json = faiss.IndexFlatIP(D)     # exact inner-product index for raw JSON embeddings
index_flatten = faiss.IndexFlatIP(D)  # exact inner-product index for flattened-text embeddings
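A condensed sketch of how the indexes are populated and queried (assuming sentence-transformers and FAISS; the products list and the example query are illustrative, and normalizing the embeddings makes the inner product equivalent to cosine similarity — the exact code is in the notebook [1]):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

D = 384
index_flatten = faiss.IndexFlatIP(D)

# Embed the flattened product texts (products: a list of product dicts)
flattened_texts = [flatten_product(p) for p in products]
doc_vectors = model.encode(flattened_texts, normalize_embeddings=True)
index_flatten.add(np.asarray(doc_vectors, dtype="float32"))

# Embed a query and retrieve the 10 nearest products
query_vectors = model.encode(["wooden 3d puzzle for adults"], normalize_embeddings=True)
scores, ids = index_flatten.search(np.asarray(query_vectors, dtype="float32"), 10)
print(ids[0])  # positions of the retrieved products in flattened_texts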
To reduce the dataset, a random sample of 5,000 queries was chosen, and all corresponding products were embedded and added to the indexes. The collected metrics are as follows:

[Table: Precision@10, Recall@10, and MRR for the raw JSON and flattened approaches, measured with the all-MiniLM-L6-v2 embedding model on the Amazon ESCI dataset]

The flattened approach consistently yields higher scores across all key retrieval metrics (Precision@10, Recall@10, and MRR). The relative performance change of the flattened version is:

[Chart: relative improvement of the flattened version over the raw JSON baseline]
The evaluation confirms that embedding raw structured data into a generic vector space is a suboptimal approach, and that adding a simple preprocessing step that flattens structured data into natural language consistently delivers a significant improvement in retrieval metrics (boosting Recall@k and Precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is critically important for achieving peak performance of a semantic retrieval/RAG system.
References
[1] Full experiment code https://colab.research.google.com/drive/1dTgt6xwmA6CeIKE38lf2cZVahaJNbQB1?usp=sharing
[2] Model https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[3] Amazon ESCI dataset. Specific version used: https://huggingface.co/datasets/milistu/amazon-esci-data
The original dataset is available at https://www.amazon.science/code-and-datasets/shopping-queries-dataset-a-large-scale-esci-benchmark-for-improving-product-search
[4] FAISS https://ai.meta.com/tools/faiss/
