In large-scale LLM development, improving model quality depends not only on data quantity but also on data quality and specificity. While pretraining datasets often contain an enormous range of knowledge, they typically lack the conceptual targeting needed to strengthen particular skills, such as reasoning or programming proficiency. To address this challenge, we designed an approach for scalable, concept-driven synthetic data generation: a workflow that allows researchers to generate data aligned with desired model capabilities. As an initial application, we construct a pretraining-scale synthetic dataset consisting of 15 million Python programming problems, released as the Nemotron-Pretraining-Code-Concepts subset of the Nemotron-Pretraining-Specialized-v1.1 dataset. We show that including these data in the final 100 billion tokens of Nemotron-Nano-v3 pretraining yields a six-point gain on the HumanEval benchmark.
Our workflow centers on a curated taxonomy of programming knowledge derived from large-scale annotation of the Nemotron‑Pretraining‑Code‑{v1,v2} datasets. This taxonomy encodes hundreds of programming concepts organized hierarchically, from fundamental constructs (e.g., strings, recursion) to advanced algorithmic and data-structure patterns. Using this taxonomy, developers can perform targeted data generation through the mixture and distillation of chosen concepts, giving experimenters control over the difficulty, diversity, and conceptual balance of the generated data.
To evaluate this workflow in practice, we applied it to create a large-scale synthetic dataset aimed at strengthening foundational Python programming skills in LLM pretraining. We first identified the 91 core concepts most relevant to the HumanEval benchmark (though still broadly representative of real programming knowledge) by classifying its code-completion prompts within our taxonomy. Guided by mixtures of these concepts, we generated roughly 15 million synthetic Python programming problems, each of which was validated to consist of working Python code (using Python's ast.parse function). Figure 1 provides a visual summary of the application of our workflow to synthesize the Code Concepts dataset, and Figure 2 visualizes the problem-generation process together with an example concept-seed and synthetic-problem pair.
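The syntactic-validity check mentioned above can be sketched with a small filter built on Python's standard-library ast.parse, which raises SyntaxError on code that does not parse; the helper name is illustrative, and any additional quality filters used in the actual pipeline are not shown.

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if the source parses as syntactically valid Python.

    ast.parse builds an abstract syntax tree without executing the code,
    so this filter is cheap and safe to run over millions of samples.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```

Note that parsing only guarantees syntactic validity; a sample can parse cleanly and still fail at runtime, so this check is a necessary rather than sufficient filter.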

Figure 1: Concept-driven data generation used to generate the Code Concepts dataset. Using a taxonomy constructed from the Nemotron-Pretraining-Code-{v1,v2} datasets, we extract programming concepts from HumanEval prompts and use those for open-ended generation. Our workflow resulted in ~15M Python programming problems derived from 91 different programming concepts.

Figure 2: A visual representation of Python programming problem generation as part of our concept-driven data-generation workflow. A prompt is constructed from a mixture of concepts (shown in the blue box and represented in dot-notation), an instruction, and a set of constraints. Using GPT-OSS 120B, a problem is generated, then parsed and filtered for quality. In this particular example, the mixture of the concepts data-structures.sets.operation, algorithms.arrays.processing, and algorithms.geometry.computational contributed to a problem involving counting distinct convex‑hull areas from all sufficiently large subsets of a list of points.
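The prompt-assembly step in Figure 2 can be sketched as follows. The exact prompt template used for Nemotron data generation is not public, so this layout (a concepts line, an instruction line, and one line per constraint) is purely illustrative.

```python
def build_prompt(concepts: list[str], instruction: str,
                 constraints: list[str]) -> str:
    """Assemble a generation prompt from a concept mixture.

    Hypothetical template: the real workflow's prompt format may differ.
    """
    lines = ["Concepts: " + ", ".join(concepts)]
    lines.append("Instruction: " + instruction)
    lines.extend("Constraint: " + c for c in constraints)
    return "\n".join(lines)
```

The resulting string would then be sent to the generator model (GPT-OSS 120B in this workflow), and the model's output parsed and filtered before inclusion in the dataset.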
To validate these generated data, we included 10 billion tokens of the Code Concepts dataset in the final 100 billion tokens of Nemotron Nano‑v3 pretraining. After training and evaluation, we find that the resulting model yields a six‑point improvement in HumanEval accuracy, from 73 to 79. Figure 3 compares base-model evaluations of Nemotron-Nano-v3 with and without the Code Concepts dataset. Beyond these quantitative gains, qualitative assessment reveals stronger performance across varied programming concepts (e.g., graph algorithms, set operations) and improved handling of edge cases and execution reasoning.
We view this dataset as a validation of the broader concept-driven generation workflow rather than a one-off artifact. By releasing both the dataset and the underlying taxonomy under a permissive open license (CC‑BY‑4.0), we hope to enable the community to extend this approach to other domains and use cases in scalable, targeted LLM pretraining.

Figure 3: Base-model benchmark evaluation results from a 100-billion-token data-ablation experiment using ~10 billion tokens of the Code Concepts data. The model trained on the Code Concepts data achieves a six-point gain on HumanEval, while most other benchmarks are unchanged.
