What Are the Data-Centric AI Concepts behind GPT Models?


https://arxiv.org/abs/2303.10158. Image by the author.

Artificial Intelligence (AI) has made incredible strides in transforming the way we live, work, and interact with technology. Recently, one area that has seen significant progress is the development of Large Language Models (LLMs), such as GPT-3, ChatGPT, and GPT-4. These models are capable of performing tasks such as language translation, text summarization, and question-answering with impressive accuracy.

While it’s difficult to ignore the increasing model size of LLMs, it’s also important to recognize that their success is due largely to the large amount of high-quality data used to train them.

In this article, we will present an overview of the recent advancements in LLMs from a data-centric AI perspective, drawing upon insights from our recent survey papers [1,2] with corresponding technical resources on GitHub. In particular, we will take a closer look at GPT models through the lens of data-centric AI, a growing concept in the data science community. We’ll unpack the data-centric AI concepts behind GPT models by discussing three data-centric AI goals: training data development, inference data development, and data maintenance.

Large Language Models (LLMs) and GPT Models

LLMs are a type of Natural Language Processing model that is trained to infer words within a context. For example, the most basic function of an LLM is to predict missing tokens given the context. To do this, LLMs are trained on massive data to predict the probability of each token candidate.

An illustrative example of predicting the probabilities of missing tokens with an LLM within a context. Image by the author.
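
To make the idea of "predicting the probability of each token candidate" concrete, here is a minimal sketch in Python. It assumes the model has already produced a raw score (logit) for each candidate; the sentence, the candidates, and the scores are made up for illustration and do not come from any real model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores for the blank in "The cat sat on the ___".
# These numbers are invented for illustration only.
candidate_logits = {"mat": 3.2, "sofa": 2.1, "moon": -1.0}

probs = softmax(candidate_logits)
best = max(probs, key=probs.get)  # the most probable candidate

for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {p:.3f}")
```

Training adjusts the model so that, across massive amounts of text, the token that actually appears receives a high probability.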

GPT models refer to a series of LLMs created by OpenAI, such as GPT-1, GPT-2, GPT-3, InstructGPT, and ChatGPT/GPT-4. Like other LLMs, GPT models’ architectures are largely based on Transformers, which use text and positional embeddings as input, and attention layers to model tokens’ relationships.

GPT-1 model architecture. Image from the paper https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
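
The two ingredients mentioned above, embeddings as input and attention over tokens, can be sketched in a few lines of plain Python. This is a simplified illustration, not the GPT-1 implementation: it uses a single attention head, skips the learned query/key/value projections, and uses made-up 2-dimensional embeddings. The causal mask (each position attends only to itself and earlier positions) reflects how GPT-style decoders process text.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Single-head attention where queries, keys, and values are all x.

    A real Transformer layer applies learned projections first; they are
    omitted here (an assumption made to keep the sketch short).
    """
    dim = len(x[0])
    output, weights = [], []
    for i, query in enumerate(x):
        # Position i may only attend to positions 0..i (causal mask).
        scores = [sum(q * k for q, k in zip(query, x[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        attn = softmax(scores)
        weights.append(attn)
        # Output is the attention-weighted average of the visible vectors.
        output.append([sum(a * x[j][d] for j, a in enumerate(attn))
                       for d in range(dim)])
    return output, weights

# Toy embeddings (invented numbers): one vector per token, one per position.
token_emb = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}
pos_emb = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]

tokens = ["the", "cat", "sat"]
# The model input is the sum of the text embedding and the positional embedding.
inputs = [[t + p for t, p in zip(token_emb[tok], pos_emb[i])]
          for i, tok in enumerate(tokens)]

outputs, attn_weights = causal_self_attention(inputs)
```

Each row of `attn_weights` sums to 1 and describes how strongly a token draws on the tokens before it.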

The later GPT models use architectures similar to that of GPT-1, except that they use more model parameters with more layers, a larger context length, a larger hidden layer size, etc.

Model size comparison of GPT models. Image by the author.

What Is Data-centric AI?

Data-centric AI is an emerging new way of thinking about how to build AI systems. It has been advocated by Andrew Ng, an AI pioneer.

“Data-centric AI is the discipline of systematically engineering the data used to build an AI system.” (Andrew Ng)

In the past, we mainly focused on creating better models with data largely unchanged (model-centric AI). However, this approach can lead to problems in the real world since it doesn’t consider the various problems that may arise in the data, such as inaccurate labels, duplicates, and biases. As a result, “overfitting” a dataset may not necessarily lead to better model behaviors.

In contrast, data-centric AI focuses on improving the quality and quantity of data used to build AI systems. This means that the attention is on the data itself, and the models are relatively more fixed. Developing AI systems with a data-centric approach holds more potential in real-world scenarios, as the data used for training ultimately determines the maximum capability of a model.

It is important to note that “data-centric” differs fundamentally from “data-driven”, as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.

Comparison between data-centric AI and model-centric AI. https://arxiv.org/abs/2301.04819 Image by the author.

The data-centric AI framework consists of three goals:

  • Training data development is to collect and produce rich and high-quality data to support the training of machine learning models.
  • Inference data development is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs.
  • Data maintenance is to ensure the quality and reliability of data in a dynamic environment. Data maintenance is critical as data in the real world is not created once but rather necessitates continuous maintenance.

Data-centric AI framework. https://arxiv.org/abs/2303.10158. Image by the author.

Why Data-centric AI Made GPT Models Successful

Months ago, Yann LeCun tweeted that ChatGPT was nothing new. Indeed, all the techniques (transformer, reinforcement learning from human feedback, etc.) used in ChatGPT and GPT-4 are not new at all. However, they did achieve incredible results that previous models couldn’t. So, what is the driving force of their success?

Training data development. The quantity and quality of the data used for training GPT models have seen a significant increase through better data collection, data labeling, and data preparation strategies.

  • GPT-1: The BooksCorpus dataset is used in training. This dataset contains 4629.00 MB of raw text, covering books from a range of genres such as Adventure, Fantasy, and Romance.
    Result: Pre-training GPT-1 on this dataset can improve performance on downstream tasks with fine-tuning.
  • GPT-2: WebText is used in training. This is an internal dataset at OpenAI created by scraping outbound links from Reddit.
    Data-centric AI strategies: (1) Curate/filter data by only using the outbound links from Reddit that received at least 3 karma. (2) Use the tools Dragnet and Newspaper to extract clean content. (3) Adopt de-duplication and some other heuristic-based cleaning (details not mentioned in the paper).
    Result: 40 GB of text is obtained after filtering. GPT-2 achieves strong zero-shot results without fine-tuning.
  • GPT-3: The training of GPT-3 is mainly based on Common Crawl.
    Data-centric AI strategies: (1) Train a classifier to filter out low-quality documents based on the similarity of each document to WebText, a proxy for high-quality documents. (2) Use Spark’s MinHashLSH to fuzzily deduplicate documents. (3) Augment the data with WebText, books corpora, and Wikipedia.
    Result: 570 GB of text is obtained after filtering from 45 TB of plaintext (only 1.27% of the data is selected in this quality filtering). GPT-3 significantly outperforms GPT-2 in the zero-shot setting.
  • InstructGPT: Let humans evaluate the answers to tune GPT-3 so that it can better align with human expectations. OpenAI designed tests for annotators, and only those who pass the tests are eligible to annotate. They even designed a survey to ensure that the annotators enjoy the annotating process.
    Data-centric AI strategies: (1) Use human-provided answers to prompts to tune the model with supervised training. (2) Collect comparison data to train a reward model and then use this reward model to tune GPT-3 with reinforcement learning from human feedback (RLHF).
    Result: InstructGPT shows better truthfulness and less bias, i.e., better alignment.
  • ChatGPT/GPT-4: The details are not disclosed by OpenAI. But it is known that ChatGPT/GPT-4 largely follow the design of previous GPT models, and they still use RLHF to tune models (with potentially more and higher-quality data/labels). It is commonly believed that GPT-4 used an even larger dataset, as the model weights have been increased.
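
The curation steps listed above for GPT-2 and GPT-3 (keep only links with at least 3 karma, then remove near-duplicate documents) can be sketched as follows. This is an illustration under stated assumptions, not OpenAI's pipeline: the post records, the `curate` helper, and the 0.8 similarity threshold are hypothetical, and the sketch compares plain MinHash signatures pairwise instead of using Spark's MinHashLSH, which adds locality-sensitive hashing to scale to billions of documents.

```python
import hashlib
import re

def shingles(text, n=3):
    """Set of overlapping word n-grams used to compare documents."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the smallest shingle hash."""
    grams = shingles(text)
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def curate(posts, min_karma=3, dup_threshold=0.8):
    """Hypothetical helper: karma filter followed by near-duplicate removal."""
    kept, signatures = [], []
    for post in posts:
        if post["karma"] < min_karma:
            continue  # low-karma links are treated as low quality
        sig = minhash_signature(post["text"])
        if any(estimated_jaccard(sig, s) >= dup_threshold for s in signatures):
            continue  # too similar to a document we already kept
        kept.append(post)
        signatures.append(sig)
    return kept

# Toy records invented for illustration.
posts = [
    {"karma": 5, "text": "Large language models are trained on massive amounts of text data."},
    {"karma": 5, "text": "LARGE language models are trained on massive amounts of text data!"},
    {"karma": 1, "text": "Buy cheap watches now."},
    {"karma": 4, "text": "Data quality matters as much as model size for downstream performance."},
]
kept = curate(posts)

# Share of Common Crawl text kept for GPT-3, using the figures quoted above.
kept_fraction = 570 / 45_000  # 570 GB out of 45 TB
print(f"{len(kept)} posts kept; GPT-3 kept about {kept_fraction:.2%} of the raw text")
```

Here the second post is dropped as a near-duplicate of the first, and the third is dropped by the karma filter.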

Inference data development. As recent GPT models are already sufficiently powerful, we can achieve various goals by tuning prompts (or tuning inference data) with the model fixed. For example, we can conduct text summarization by providing the text to be summarized alongside an instruction like “summarize it” or “TL;DR” to steer the inference process.

Prompt tuning. https://arxiv.org/abs/2303.10158. Image by the author.
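
As a minimal sketch of this idea, the snippet below builds two summarization prompts for the same text while the model itself stays untouched. The `build_prompt` helper and the prompt layout are assumptions made for illustration; the resulting string would be passed to whatever LLM interface you use.

```python
def build_prompt(text, instruction="TL;DR"):
    """Append an instruction to the input so a fixed model is steered toward a task."""
    return f"{text.strip()}\n\n{instruction}:"

article = "  Data-centric AI focuses on engineering data rather than models.  "

# Same input text, two different instructions: only the inference data changes.
prompt_a = build_prompt(article)
prompt_b = build_prompt(article, instruction="Summarize it")

print(prompt_a)
print("---")
print(prompt_b)
```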

Designing the proper prompts for inference is a challenging task. It heavily relies on heuristics. A nice survey has summarized different prompting methods. Sometimes, even semantically similar prompts can have very diverse outputs. In this case, Soft Prompt-Based Calibration may be required to reduce variance.

Soft prompt-based calibration. Image from the paper https://arxiv.org/abs/2303.13035v1 with original authors’ permission.
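
One simple way to see how much outputs vary across semantically similar prompts is to run several paraphrases and measure how often they agree. The sketch below does only that; it is not Soft Prompt-Based Calibration. `toy_model` is a hypothetical stand-in for an LLM call, deliberately written to be sensitive to surface wording.

```python
from collections import Counter

def agreement_rate(model, prompts):
    """Return the most common output and the share of prompts that produce it."""
    outputs = [model(p) for p in prompts]
    top_output, top_count = Counter(outputs).most_common(1)[0]
    return top_output, top_count / len(outputs)

def toy_model(prompt):
    # Hypothetical stand-in for an LLM: it keys on surface wording only,
    # so paraphrases of the same question can yield different answers.
    return "positive" if "sentiment" in prompt.lower() else "negative"

paraphrases = [
    "What is the sentiment of: 'great movie'?",
    "Classify the sentiment: 'great movie'",
    "Is this review good or bad? 'great movie'",
]

majority, rate = agreement_rate(toy_model, paraphrases)
print(f"majority answer: {majority}, agreement: {rate:.2f}")
```

A low agreement rate signals that the prompt wording, rather than the input, is driving the answer.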

The research on inference data development for LLMs is still in its early stage. More inference data development techniques that have been used in other tasks could be applied to LLMs in the near future.

Data maintenance. ChatGPT/GPT-4, as a commercial product, is not only trained once but rather is updated continuously and maintained. Clearly, we can’t know how data maintenance is executed outside of OpenAI. So, we discuss some general data-centric AI strategies that are or will very likely be used for GPT models:

  • When we use ChatGPT/GPT-4, our prompts/feedback could be, in turn, used by OpenAI to further advance their models. Quality metrics and assurance strategies may have been designed and implemented to collect high-quality data in this process.
  • Various tools may have been developed to visualize and comprehend user data, facilitating a better understanding of users’ requirements and guiding the direction of future improvements.
  • As the number of users of ChatGPT/GPT-4 grows rapidly, an efficient data administration system is required to enable fast data acquisition.

ChatGPT/GPT-4 collects user feedback with “thumb up” and “thumb down” to further evolve their system. Screenshot from https://chat.openai.com/chat.

What Can the Data Science Community Learn from this Wave of LLMs?

The success of LLMs has revolutionized AI. Looking forward, LLMs could further revolutionize the data science lifecycle. We make two predictions:

  • After years of research, model design is already very mature, especially after the Transformer. Engineering data becomes a crucial (or possibly the only) way to improve AI systems in the future. Also, when the model becomes sufficiently powerful, we don’t need to train models in our daily work. Instead, we only need to design the proper inference data (prompt engineering) to probe knowledge from the model. Thus, the research and development of data-centric AI will drive future advancements.
  • Much of the tedious data science work could be performed far more efficiently with the help of LLMs. For example, ChatGPT/GPT-4 can already write workable code to process and clean data. Moreover, LLMs can even be used to create data for training. For example, recent work has shown that generating synthetic data with LLMs can boost model performance in clinical text mining.

Generating synthetic data with LLMs to train the model. Image from the paper https://arxiv.org/abs/2303.04360 with the original authors’ permission.


I hope this article can inspire you in your own work. You can learn more about the data-centric AI framework and how it benefits LLMs in the following papers: https://arxiv.org/abs/2303.10158 and https://arxiv.org/abs/2301.04819.

We maintain a GitHub repo, which will be regularly updated with relevant data-centric AI resources. Stay tuned!

In later articles, I will delve into the three goals of data-centric AI (training data development, inference data development, and data maintenance) and introduce the representative methods.

