E-commerce catalogs often contain sparse product data: generic images, a basic title, and a short description. This limits discoverability, engagement, and conversion. Manual enrichment doesn’t scale because it relies on catalog managers to write descriptions, apply tags, and categorize by hand. The process is slow, inconsistent, and error-prone.
This tutorial shows developers, product managers, and catalog teams how to deploy an AI-powered enrichment blueprint that transforms a single product image into rich, localized catalog entries.
Using NVIDIA Nemotron large language models (LLMs) and vision-language models (VLMs)—including Nemotron-Nano-12B-V2-VL, Llama-3.3-Nemotron-Super-49B-V1, FLUX.1-Kontext-Dev for image generation, and TRELLIS Image-to-3D models—the system automatically generates detailed titles and descriptions, accurate categories, comprehensive tags, localized cultural variations, and interactive 3D assets tailored to regional markets.
The tutorial covers the entire architecture, API usage for VLM analysis and asset generation, deployment strategies with Docker containers, and real-world integration patterns. By the end, it demonstrates how to automate catalog enrichment at scale, turning sparse product data like "Black Purse" into rich listings like "Glamorous Black Evening Handbag with Gold Accents," complete with detailed descriptions, validated categories, tags, and multiple asset types.
Prerequisites
This tutorial assumes intermediate to advanced technical knowledge. It involves working with AI APIs, building REST services, and deploying containerized applications. Basic familiarity with the listed technologies will help in following along and implementing the system:
- Python 3.11+
- The uv package manager (or pip)
- An NVIDIA API key
- A HuggingFace token for FLUX model access
- Docker and Docker Compose
Creating an AI-powered catalog enrichment blueprint
To close the scalability and consistency gaps of manual catalog enrichment, and to address discoverability and conversion issues, the blueprint is designed as an end-to-end catalog transformation pipeline. A modular system of specialized models works together, containerized with Docker and served through NVIDIA NIM for enterprise-grade performance.


Here’s the core technology stack:
- NVIDIA Nemotron VLM (nemotron-nano-12b-v2-vl): Analyzes product images to extract features, categories, and context.
- NVIDIA Nemotron LLM (llama-3_3-nemotron-super-49b-v1_5): Acts as the “brain,” generating rich, localized text (titles, descriptions) and planning culturally aware prompts for image generation.
- Black Forest Labs FLUX.1-Kontext-dev: Generates new, high-quality 2D image variations.
- Microsoft TRELLIS Image-to-3D: Transforms 2D product images into interactive 3D models.
The most powerful part of this solution is its modular, three-stage API. A common mistake is building one slow, monolithic API call that does everything.
- Stage 1: Fast VLM evaluation (POST /vlm/analyze)
- Job: Takes an image and locale, plus optional existing product data and brand instructions.
- Output: Rich, structured JSON. It returns improved titles, descriptions, validated categories, comprehensive tags, and attributes localized to the target region.
- Stage 2: Image generation (POST /generate/variation)
- Job: Takes the output from Stage 1 (title, description, tags) along with the original image.
- Output: A new, culturally appropriate 2D image variation.
- Stage 3: 3D asset generation (POST /generate/3d)
- Job: Takes the original 2D image.
- Output: An interactive 3D .glb model.
The frontend can call /vlm/analyze, get quick results to show the user, and then offer buttons to “generate 3D model” or “create marketing assets,” which trigger asynchronous backend jobs.
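To make that pattern concrete, here is a minimal client sketch (not part of the blueprint itself). It assumes the backend described later in this tutorial is running locally on port 8000, the requests package is installed, and a bag.jpg test image is on disk: the fast analysis call runs synchronously, while the slower generation calls are submitted as background jobs.
# Minimal orchestration sketch: synchronous analysis, asynchronous generation.
# Assumes the local backend from this tutorial and the `requests` package.
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "http://localhost:8000"

def post_image(path, image_path, data=None):
    # Helper: upload an image as multipart form data with optional fields.
    with open(image_path, "rb") as f:
        return requests.post(
            f"{BASE}{path}",
            files={"image": (image_path, f, "image/jpeg")},
            data=data or {},
        )

# Stage 1: fast, synchronous analysis -- show these results to the user right away.
analysis = post_image("/vlm/analyze", "bag.jpg", {"locale": "en-US"}).json()
print(analysis["title"])

# Stages 2 and 3: slower generation, triggered on demand as background jobs.
with ThreadPoolExecutor() as pool:
    variation_job = pool.submit(
        post_image,
        "/generate/variation",
        "bag.jpg",
        {"locale": "en-US", "title": analysis["title"], "description": analysis["description"]},
    )
    model_job = pool.submit(post_image, "/generate/3d", "bag.jpg")
    variation = variation_job.result().json()  # JSON with generated_image_b64
    glb_bytes = model_job.result().content     # raw .glb bytes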
Building the enrichment pipeline
In this section, the backend is run locally to call the enrichment APIs end-to-end. A product image is uploaded to generate enriched, localized metadata, create an image variation with quality scoring, and produce a 3D asset. The steps below walk through the three-stage API approach.
Step 1: Set up the local backend
First, get the FastAPI backend server running on a local machine to test the API endpoints.
Clone the repository:
git clone https://github.com/NVIDIA-AI-Blueprints/Retail-Catalog-Enrichment.git
cd Retail-Catalog-Enrichment
Create an .env file in the root directory with the API keys:
NGC_API_KEY=your_nvidia_api_key_here
HF_TOKEN=your_huggingface_token_here
Set up the Python environment using uv (or pip):
# Create and activate a virtual environment
uv venv .venv
source .venv/bin/activate
# Install dependencies
uv pip install -e .
Run the FastAPI server with Uvicorn:
uvicorn --app-dir src backend.main:app --host 0.0.0.0 --port 8000 --reload
The API is now live at http://localhost:8000. Its health can be checked at http://localhost:8000/health.
Step 2: Visual analysis
With the server running, the core /vlm/analyze endpoint can be used. This is the workhorse of the system, designed for fast, synchronous feedback.
Execute a basic analysis of a product image. This command sends a product image (bag.jpg) and specifies the en-US locale.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "locale=en-US" \
  http://localhost:8000/vlm/analyze
Review the JSON response. In only a few seconds, a rich JSON object is returned. This is the “before-and-after” transformation:
{
  "title": "Glamorous Black Evening Handbag with Gold Accents",
  "description": "This exquisite handbag exudes sophistication and elegance. Crafted from high-quality, glossy leather...",
  "categories": ["accessories"],
  "tags": ["black leather", "gold accents", "evening bag", "rectangular shape"],
  "colors": ["black", "gold"],
  "locale": "en-US"
}
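As one illustration of how this structured output slots into a catalog workflow, the sketch below calls /vlm/analyze with the requests package and folds the returned fields into an existing sparse record; the SKU and record shape are assumptions for the example, not part of the blueprint.
# Sketch: enriching a sparse catalog record with the /vlm/analyze response.
# The record shape and SKU are illustrative assumptions.
import requests

sparse_record = {"sku": "BAG-001", "title": "Black Purse", "description": ""}

with open("bag.jpg", "rb") as f:
    analysis = requests.post(
        "http://localhost:8000/vlm/analyze",
        files={"image": ("bag.jpg", f, "image/jpeg")},
        data={"locale": "en-US"},
    ).json()

# Overwrite the sparse fields with the enriched, localized values.
enriched_record = {
    **sparse_record,
    **{k: analysis[k] for k in ("title", "description", "categories", "tags", "colors")},
}
print(enriched_record["title"])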
Step 3: Augment data with localization and brand voice
The true power of the API comes from its augmentation capabilities. Localize content for a new region by providing existing product data and a new locale. This example targets the Spanish market (es-ES). The system is smart enough to enrich the sparse data using regional terminology.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F 'product_data={"title":"Black Purse","description":"Elegant bag"}' \
  -F "locale=es-ES" \
  http://localhost:8000/vlm/analyze
Apply a custom brand voice using the brand_instructions parameter. A brand isn’t generic, so the content shouldn’t be either. This guides the AI’s tone, voice, and taxonomy.
curl -X POST \
  -F "image=@product.jpg;type=image/jpeg" \
  -F 'product_data={"title":"Beauty Product","description":"Nice cream"}' \
  -F "locale=en-US" \
  -F 'brand_instructions=You work at a premium beauty retailer. Use a playful, empowering, and inclusive brand voice. Focus on self-expression and beauty discovery. Use terms like "beauty lovers", "glow", "radiant", and "treat yourself".' \
  http://localhost:8000/vlm/analyze
The AI will generate a description that’s accurate and on-brand.
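The same augmentation can be driven programmatically. Here is a minimal sketch, assuming the requests package and the local backend, that passes product_data and brand_instructions as form fields just as in the curl examples above; the shortened brand text is illustrative.
# Sketch: localization plus brand voice via the same /vlm/analyze endpoint.
import json

import requests

with open("product.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/vlm/analyze",
        files={"image": ("product.jpg", f, "image/jpeg")},
        data={
            "locale": "es-ES",
            "product_data": json.dumps({"title": "Black Purse", "description": "Elegant bag"}),
            "brand_instructions": "Use a playful, empowering, and inclusive brand voice.",
        },
    )
print(resp.json()["title"])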
Step 4: Generate cultural image variations
Now that rich, localized text has been generated, the /generate/variation endpoint can be used to create matching 2D marketing assets.
Generate a new image by passing in the results from Step 2. This endpoint uses the localized text as a plan to generate a new image with the FLUX model.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "locale=en-US" \
  -F "title=Glamorous Black Evening Handbag with Gold Accents" \
  -F "description=This exquisite handbag exudes sophistication..." \
  -F 'categories=["accessories"]' \
  -F 'tags=["black leather","gold accents","evening bag"]' \
  -F 'colors=["black","gold"]' \
  http://localhost:8000/generate/variation
This call returns JSON with a generated_image_b64 string. If using the es-ES locale, the model generates a background more fitting for that market, like a Mediterranean courtyard instead of a modern studio.
Review the JSON response:
{
  "generated_image_b64": "iVBORw0KGgoAAAANS...",
  "artifact_id": "a4511bbed05242078f9e3f7ead3b2247",
  "image_path": "data/outputs/a4511bbed05242078f9e3f7ead3b2247.png",
  "metadata_path": "data/outputs/a4511bbed05242078f9e3f7ead3b2247.json",
  "locale": "en-US"
}
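To use the result, a client typically decodes the base64 payload. The sketch below assumes variation is the parsed JSON response shown above (for example, resp.json() from a requests call).
# Sketch: decoding the generated variation for preview or storage.
import base64

def save_variation(variation: dict, out_dir: str = ".") -> str:
    # Decode the base64 PNG returned by /generate/variation and write it to disk.
    png_bytes = base64.b64decode(variation["generated_image_b64"])
    out_path = f"{out_dir}/{variation['artifact_id']}.png"
    with open(out_path, "wb") as out:
        out.write(png_bytes)
    return out_path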
Step 5: Automated quality control with NVIDIA Nemotron VLM
Generative AI is powerful, but it can hallucinate. In an enterprise catalog, a “Black Handbag” can’t suddenly have a blue strap or a missing handle. To solve this, an agentic reflection loop is implemented.
Instead of relying on human reviewers, a Quality Assurance Agent powered by NVIDIA Nemotron VLM can be deployed. This module acts as a strict critic, performing a “reflection” step that compares the generated variation against the original product image to ensure fidelity.
Before the API responds, this agent analyzes the generated image against the original product photo across five strict dimensions:
- Product consistency: Do colors, materials, and textures match the original?
- Structural fidelity: Are key elements like handles, zippers, and pockets preserved?
- Size and scale: Does the product look realistically sized in its new context?
- Anatomical accuracy: If a human model is present, are the hands and fingers rendered accurately?
- Background quality: Is the lighting and context photorealistic?
The “VLM Judge” output: The API returns the generated asset alongside a detailed quality report, including a quality score and a list of specific issues.
{
  "generated_image_b64": "iVBORw0KGgoAAAANSUhEUgA...",
  "artifact_id": "027c08866d90450399f6bf9980ab7...",
  "image_path": "/path/to/outputs/027c08866d90450399f6bf9980ab73...png",
  "metadata_path": "/path/to/outputs/027c08866d90450399f6bf9980ab73...json",
  "quality_score": 72.5,
  "quality_issues": [
    "Product appears slightly oversized relative to background context",
    "Minor texture inconsistency on handle hardware"
  ],
  "locale": "en-US"
}
This feature provides the critical metadata needed for automation. Software integrators can expand this functionality to build self-correcting pipelines where the system autonomously retries generation with adjusted prompts until the VLM Judge awards a passing score (e.g., >85).
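A minimal sketch of such a loop follows. The passing threshold, attempt count, and extra_guidance hint are illustrative assumptions, and generate_variation stands in for whatever function wraps the /generate/variation call.
# Sketch: self-correcting generation loop gated by the VLM Judge's score.
PASSING_SCORE = 85.0   # assumed threshold
MAX_ATTEMPTS = 3       # assumed retry budget

def generate_with_quality_gate(generate_variation, **request_fields):
    result = None
    for _ in range(MAX_ATTEMPTS):
        result = generate_variation(**request_fields)
        if result["quality_score"] >= PASSING_SCORE:
            break
        # Feed the judge's issues back into the next attempt (hypothetical field).
        request_fields["extra_guidance"] = "; ".join(result["quality_issues"])
    return result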
Step 6: Create interactive 3D assets
Finally, bring the product to life with a 3D model using the /generate/3d endpoint.
Request a 3D model from the original 2D image. This is a simple call that only needs the image.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  http://localhost:8000/generate/3d \
  --output product.glb
In a few seconds, a product.glb file is generated. This file can be dropped directly into any web-based 3D viewer, allowing customers to inspect the product from every angle.
Request a JSON response (optional). For web clients, it’s often easier to handle a JSON response. To do this, set return_json=true.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "return_json=true" \
  http://localhost:8000/generate/3d
Review the JSON response. This returns the 3D model as a base64 string, along with metadata.
{
  "glb_base64": "Z2xURgIAAA...A=",
  "artifact_id": "c724a1b8e1f54a6b8d2c9a7e6f3d1b9f",
  "metadata": {
    "slat_cfg_scale": 5.0,
    "ss_cfg_scale": 10.0,
    "slat_sampling_steps": 50,
    "ss_sampling_steps": 50,
    "seed": 0,
    "size_bytes": 1234567
  }
}
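A client can decode this payload the same way as the 2D variation. The sketch below assumes resp_json is the parsed JSON response from /generate/3d with return_json=true.
# Sketch: persisting the JSON-wrapped 3D asset to disk.
import base64

def save_glb(resp_json: dict) -> str:
    glb_bytes = base64.b64decode(resp_json["glb_base64"])
    out_path = f"{resp_json['artifact_id']}.glb"
    with open(out_path, "wb") as out:
        out.write(glb_bytes)
    print(resp_json["metadata"]["size_bytes"], "bytes written to", out_path)
    return out_path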
Step 7: Move to production (Docker and troubleshooting)
Here are a few common tips for debugging and moving to a full production-like deployment.
- Run the complete stack with Docker. In this example, the backend was run locally; however, the entire project is designed for Docker. The docker-compose.yml file launches the frontend, the backend, and all of the AI models served through NVIDIA NIM microservices.
- Check GPU availability. If models fail, the first check should be nvidia-smi to ensure Docker can see the GPUs.
- Inspect service logs. The best way to debug is by tailing the logs for a specific service: docker-compose logs -f backend
Extensibility and future features
The goal of extending this blueprint is to increase the breadth and quality of commerce-ready assets and metadata autonomously. The project roadmap includes several extensions that can be built on:
- Agentic social media research: This planned feature introduces a specialized social media research agent as part of an agentic workflow, where autonomous agents handle complex tasks. Powered by reasoning models like NVIDIA Nemotron and using tool calling with social media APIs or MCPs, the agent analyzes real-world usage patterns, sentiment, and trending terminology, feeding these insights into the /vlm/analyze step to keep product descriptions rich, relevant, and on-trend.
- Short video generation: The next step is to add another generative endpoint to create 3-5 second product video clips. Using open source models, short video clips can be generated directly from 2D images, creating a dynamic, AI-generated lifestyle clip or product spin without needing a complex video shoot.
This foundation is designed for extension. Modules can be added for virtual try-on, automated ad generation, or dynamic pricing models by following the same pattern of adding a new, specialized microservice, as sketched below.
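Here is a hedged sketch of what such a module could look like as a new FastAPI route. The /generate/ad path, its parameters, and the render_ad stub are hypothetical placeholders, not part of the current blueprint (form uploads also require the python-multipart package).
# Hypothetical sketch of the extension pattern: a new, specialized endpoint.
import base64

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

def render_ad(image_bytes: bytes, title: str, locale: str) -> str:
    # Placeholder: swap in the real model call that backs the new module.
    return base64.b64encode(image_bytes).decode()

@app.post("/generate/ad")
async def generate_ad(
    image: UploadFile = File(...),
    locale: str = Form("en-US"),
    title: str = Form(...),
):
    image_bytes = await image.read()
    ad_asset_b64 = render_ad(image_bytes, title=title, locale=locale)
    return {"generated_ad_b64": ad_asset_b64, "locale": locale}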
Conclusion
We’ve successfully built a robust, AI-driven pipeline that solves the sparse catalog problem. The key takeaways for building a system like this are:
- Go modular: A production-ready system must separate fast analysis from slow generation. This provides a responsive UI and the flexibility to treat asset generation as an on-demand or background task.
- Localization is essential: True enrichment isn’t just translation; it’s cultural adaptation. By making locale a core parameter, the system generates text and images that resonate with global audiences.
- Brand voice is a feature: The brand_instructions parameter is a game-changer. It transforms the LLM from a generic generator into a true, scalable brand assistant.
Resources
Ready to build this yourself? Dive into the project documentation:
Learn more about the Retail Catalog Enrichment Blueprint.
