When we set out to build a model, be it a Large Language Model (LLM) or a Small Language Model (SLM), the very first thing we need is data. While a vast amount of open data is available, it is rarely in the precise format required to train or align models.
In practice, we often face scenarios where the raw data is not enough. We need data that is more structured, domain-specific, complex, or aligned with the task at hand. Let's look at some common situations:
Complex Scenarios Missing
You start with a simple dataset, but the model fails on advanced reasoning tasks. How do you generate more complex datasets to strengthen performance?
Knowledge Base to Q&A
You already have a knowledge base, but it's not in Q&A format. How can you transform it into a usable question-answering dataset?
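As an illustrative sketch (this is not SyGra's actual API), one common approach is to turn each knowledge-base passage into a generation prompt that an LLM can answer to produce a Q&A pair. The prompt template and record layout below are assumptions for demonstration:

```python
# Illustrative sketch, not SyGra's API: turn knowledge-base passages into
# prompts that an LLM can answer to yield Q&A pairs.

QA_PROMPT = (
    "Read the passage below and write one question it answers, "
    "followed by the answer.\n\nPassage:\n{passage}"
)

def build_qa_prompts(passages: list[str], min_chars: int = 40) -> list[dict]:
    """Skip passages too short to support a question; emit one prompt each."""
    prompts = []
    for i, passage in enumerate(passages):
        text = passage.strip()
        if len(text) < min_chars:
            continue
        prompts.append({"id": i, "prompt": QA_PROMPT.format(passage=text)})
    return prompts

passages = [
    "SyGra is a low-code framework for synthetic data generation.",
    "Too short.",
]
prompts = build_qa_prompts(passages)
```

Each emitted prompt would then be sent to an inference backend, and the response parsed into question/answer fields.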
From SFT to DPO
You have prepared a supervised fine-tuning (SFT) dataset. But now you want to align your model using Direct Preference Optimization (DPO). How can you generate preference pairs?
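A minimal sketch of this conversion (not SyGra's implementation): pair each SFT gold answer as the "chosen" completion with a weaker contrasting response as "rejected", for example a generation sampled from a smaller model. Here `draft_responses` is a hypothetical stand-in for those weaker generations:

```python
# Illustrative sketch, not SyGra's API: build DPO preference pairs from an
# SFT dataset by pairing the gold response (chosen) with a weaker draft
# response (rejected), e.g. sampled from a smaller model.

def to_dpo_pairs(sft_records: list[dict], draft_responses: dict[str, str]) -> list[dict]:
    """Keep the prompt; label gold vs. draft completions as chosen/rejected."""
    pairs = []
    for rec in sft_records:
        rejected = draft_responses.get(rec["prompt"])
        # Skip prompts with no contrasting response, or identical outputs,
        # since they carry no preference signal.
        if rejected is None or rejected == rec["response"]:
            continue
        pairs.append({
            "prompt": rec["prompt"],
            "chosen": rec["response"],
            "rejected": rejected,
        })
    return pairs

sft = [{"prompt": "What is 2+2?", "response": "2+2 equals 4."}]
drafts = {"What is 2+2?": "It is 5."}
pairs = to_dpo_pairs(sft, drafts)
```

The resulting `prompt`/`chosen`/`rejected` triples match the record shape most DPO trainers expect.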
Depth of Questions
You have a Q&A dataset, but the questions are shallow. How can you create in-depth, multi-turn, or reasoning-heavy questions?
Domain-Specific Mid-Training
You have a massive corpus but need to filter and curate data for mid-training on a specific domain.
PDFs and Images to Documents
Your data lives in PDFs or images, and you need to convert it into structured documents for building a Q&A system.
Boosting Reasoning Ability
You already have reasoning datasets, but want to push models toward better "thinking tokens" for step-by-step problem-solving.
Quality Filtering
Not all data is good data. How do you automatically filter out poor-quality samples and keep only the high-value ones?
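To make this concrete, here is a minimal heuristic quality filter (illustrative only; SyGra's own filtering is configurable and may differ): drop samples that are too short, too repetitive, or mostly non-alphabetic.

```python
# Minimal heuristic quality filter (illustrative, not SyGra's filters):
# reject samples that are too short, too repetitive, or mostly symbols.

def is_high_quality(text: str, min_words: int = 5, min_unique_ratio: float = 0.5) -> bool:
    words = text.split()
    if len(words) < min_words:
        return False
    # Repetition check: ratio of unique words to total words.
    if len(set(w.lower() for w in words)) / len(words) < min_unique_ratio:
        return False
    # Character check: require a majority of alphabetic characters.
    alpha = sum(c.isalpha() for c in text)
    return alpha / max(len(text), 1) > 0.5

samples = [
    "Good answer explaining the concept clearly and fully.",
    "spam spam spam spam spam spam",
    "???!!!",
]
kept = [s for s in samples if is_high_quality(s)]
```

In practice such heuristics are usually combined with model-based scoring (e.g. an LLM judge) for a second filtering pass.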
Small to Large Contexts
Your dataset has small chunks of context, but you want to build larger-context datasets optimized for RAG (Retrieval-Augmented Generation) pipelines.
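One simple way to grow larger contexts (a sketch, not SyGra's actual implementation) is to greedily merge adjacent small chunks until an approximate token budget is reached, using whitespace-separated words as a rough token proxy:

```python
# Illustrative sketch, not SyGra's implementation: greedily merge adjacent
# small chunks into larger RAG contexts under an approximate token budget.
# Word count is used as a rough proxy for token count.

def merge_chunks(chunks: list[str], max_tokens: int = 200) -> list[str]:
    merged, current, current_len = [], [], 0
    for chunk in chunks:
        n = len(chunk.split())
        # Flush the running context if adding this chunk would exceed budget.
        if current and current_len + n > max_tokens:
            merged.append(" ".join(current))
            current, current_len = [], 0
        current.append(chunk)
        current_len += n
    if current:
        merged.append(" ".join(current))
    return merged

contexts = merge_chunks(
    ["one two three", "four five", "six seven eight nine"], max_tokens=5
)
```

Merging only adjacent chunks preserves document order, which matters when the chunks came from a single source document.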
Cross-Language Conversion
You have German datasets but need to translate, adapt, and repurpose them into English Q&A systems.
And the list goes on. The needs around data construction never end when working with modern AI models.
Enter SyGra: One Framework for Every Data Challenge
This is where SyGra comes in.
SyGra is a low-code/no-code framework designed to simplify dataset creation, transformation, and alignment for LLMs and SLMs. Instead of writing complex scripts and pipelines, you can focus on prompt engineering while SyGra takes care of the heavy lifting.
Key Features of SyGra:
- Python Library + Framework: Easy to integrate into existing ML workflows with the SyGra library.
- Supports Multiple Inference Backends: Works seamlessly with vLLM, Hugging Face TGI, Triton, Ollama, and more.
- Low-Code/No-Code: Build complex datasets without heavy engineering effort.
- Flexible Data Generation: From Q&A to DPO, reasoning to multi-language, SyGra adapts to your use case.

Why SyGra Matters
Data is the foundation of AI. The quality, diversity, and structure of your data often matter more than model architecture tweaks. By enabling flexible and scalable dataset creation, SyGra helps teams:
- Speed up model alignment (SFT, DPO, RAG pipelines).
- Save engineering time with plug-and-play workflows.
- Improve model robustness across complex and domain-specific tasks.
- Reduce manual dataset curation effort.
- Paper Link: https://arxiv.org/abs/2508.15432
- Documentation: https://servicenow.github.io/SyGra/
- Git Repository: https://github.com/ServiceNow/SyGra
Note: An example implementation can be found at https://github.com/ServiceNow/SyGra/blob/main/docs/tutorials/image_to_qna_tutorial.md
SyGra Architecture

A Few Example Tasks
https://github.com/ServiceNow/SyGra/tree/main/tasks/examples
Final Thoughts
The journey of building and refining datasets never ends. Each use case brings new challenges, from translation and knowledge-base conversion to reasoning enhancement and domain filtering. With SyGra, you don't have to reinvent the wheel each time.
Instead, you get a unified framework that empowers you to generate, filter, and align data for your models, so you can focus on what really matters: building smarter AI systems.
