
A compound AI approach to Brazilian Portuguese personas grounded in real-world distributions
Grounding Brazil’s AI with Real Data
Constructing AI systems that serve national populations requires data that reflects local language, demographics, and cultural context. For Brazil—home to greater than 200 million people across diverse regions—this stays a persistent challenge, as much of today’s high-quality training data is English-centric or unavailable for business use.
Nemotron-Personas-Brazil helps close that gap. It’s an open dataset (CC BY 4.0) of 6 million fully synthetic personas, statistically grounded in official census and labor data from the Brazilian Institute of Geography and Statistics (IBGE). Every persona is aligned to real demographic, geographic, and occupational distributions—but no real person is represented.
This release extends NVIDIA’s growing Nemotron-Personas Collection, which already includes the USA, Japan, India, and Singapore. Like others in the gathering, the Brazil dataset covers attributes similar to age, sex, education, occupation, and site.
The dataset is designed for Brazilian developers and researchers constructing sovereign AI, with data that’s locally grounded, culturally informed, and commercially usable (CC BY 4.0). It was in-built collaboration with WideLabs, an NVIDIA Inception member with deep experience supporting government and regulated-sector AI deployments across Latin America.
What’s within the Dataset?
At a look:
- 6 million Brazilian personas (1 million records × 6 personas each)
- ~1.4 billion tokens total, including ~450 million persona tokens
- 20 fields per record: 6 persona fields + 14 contextual fields grounded in official statistics
- Full geographic coverage: all 26 Brazilian states + the Federal District
- ~457k unique Portuguese names
- 1,500+ occupation categories reflecting Brazil’s workforce
- Multiple persona types including: skilled, sports, arts, travel, amongst others.
Each persona is written in natural Brazilian Portuguese and includes cultural background, skills, goals, hobbies, and interests.
How We Built It
Data Generation Pipeline
Nemotron-Personas-Brazil was built using NeMo Data Designer, NVIDIA’s compound AI system for synthetic data generation. The pipeline supports structured generation, validation, and retry mechanisms required to supply large-scale, population-aware datasets.
Key components include:
- Probabilistic Graphical Model (Apache-2.0) for statistical grounding
- GPT-OSS-120B (Apache-2.0) for narrative generation in Brazilian Portuguese
An prolonged version of Nemotron-Personas-Brazil will likely be available directly inside NeMo Data Designer, enabling developers to generate, refine, and extend Brazilian Portuguese personas as a part of their very own synthetic data pipelines.
Enhanced Cultural Context
To be able to capture the socio-demographic and geographic diversity and complexity of Brazil’s population, Nemotron-Personas-Brazil leveraged population census and labor data published by the Brazilian Institute of Geography and Statistics (IBGE).
- Geography – Personas are anchored on the state and municipality level, reflecting regional variation across Brazil’s five macro-regions.
- Occupation – Expands beyond job titles to incorporate skills, expertise, and profession trajectories, including micro-entrepreneurs and regional trades.
- Life Stages – Incorporates student status, unemployment, and retirement to reflect real population dynamics.
- Cultural Traits – Natural-language personas capture Brazilian social norms, interests, and lifestyle dimensions similar to arts, sports, and travel.
- Language Fidelity – All personas are written in natural Brazilian Portuguese, reflecting local naming conventions and communication styles.
The result’s a dataset that’s statistically grounded, culturally representative, and fully synthetic by design.
Private By Design
This dataset incorporates no personally identifiable information. While we use real-world distributions of ages, names, and occupations from official public sources, nothing is tied to any real person, living or deceased. Every persona is fully synthetic, so you’ll be able to train on authentic cultural patterns without compromising privacy.
Who This Data Is For
Nemotron-Personas-Brazil is designed primarily for Brazilian developers and researchers constructing sovereign AI systems. By providing high-quality, population-representative data in Brazilian Portuguese, the dataset addresses gaps left by predominantly English-language training corpora.
Global developers can also leverage the dataset to enhance model performance and alignment in Brazilian cultural and linguistic contexts.
Practical AI Applications
- Multi-turn conversation: Use personas as seeds to generate authentic dialogue datasets
- Domain-specific training: Construct culturally aware AI assistants
- Bias testing & fairness: Evaluate model performance across rural vs. urban populations, age groups, and education levels—ensuring your AI works fairly across all segments of Brazilian society
Why It Matters
AI model builders have long struggled with access to diverse, high-quality training data that reflects real-world populations. Proprietary datasets dominate enterprise AI, creating barriers for researchers, startups, and developers in underrepresented regions.
- Data diversity: Prevents narrow training and model collapse by reflecting Brazil’s full population spectrum
- Cultural authenticity: Reduces reliance on Western-centric datasets and supports sovereign AI development
- Privacy-preservation: Designed to fulfill Brazil’s data protection requirements and emerging AI governance standards
By releasing Nemotron-Personas-Brazil under CC BY 4.0, we’re democratizing access to enterprise-grade synthetic data—enabling anyone to construct culturally authentic AI without barriers of cost, privacy concerns, or geography.
Start Constructing with Nemotron-Personas-Brazil
You may load the dataset directly from Hugging Face:
from datasets import load_dataset
dataset = load_dataset("nvidia/nemotron-personas-brazil")
Wish to learn more about NVIDIA’s open data products, or taken with co-designing a future dataset? Join the conversation on NVIDIA’s Discord.

