What Are the Data-Centric AI Concepts behind GPT Models?


https://arxiv.org/abs/2303.10158. Image by the author.

Artificial Intelligence (AI) has made incredible strides in transforming the way we live, work, and interact with technology. Recently, one area that has seen significant progress is the development of Large Language Models (LLMs), such as GPT-3, ChatGPT, and GPT-4. These models can perform tasks such as language translation, text summarization, and question answering with impressive accuracy.

While it’s difficult to ignore the increasing model size of LLMs, it’s also important to recognize that their success is due largely to the large amount of high-quality data used to train them.

In this article, we will present an overview of the recent advancements in LLMs from a data-centric AI perspective, drawing upon insights from our recent survey papers [1,2] with corresponding technical resources on GitHub. In particular, we will take a closer look at GPT models through the lens of data-centric AI, a growing concept in the data science community. We’ll unpack the data-centric AI concepts behind GPT models by discussing three data-centric AI goals: training data development, inference data development, and data maintenance.

Large Language Models (LLMs) and GPT Models

LLMs are a type of Natural Language Processing model trained to infer words within a context. For example, the most basic function of an LLM is to predict missing tokens given the context. To do this, LLMs are trained to predict the probability of each candidate token from massive data.

An illustrative example of predicting the probabilities of missing tokens with an LLM within a context. Image by the author.
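To make the idea of "predicting the probability of each candidate token" concrete, here is a toy sketch using a simple bigram count model rather than an actual Transformer; the corpus, function names, and model are hypothetical simplifications for illustration only:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Count how often each token follows a given context token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, context_token):
    """Return candidate next tokens with their estimated probabilities."""
    followers = counts[context_token]
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}

corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
lm = train_bigram_lm(corpus)
print(predict_next(lm, "cat"))  # {'sat': 0.5, 'ate': 0.5}
```

A real LLM replaces these bigram counts with a neural network conditioned on the whole context, but the output is the same kind of object: a probability distribution over candidate tokens.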

GPT models refer to a series of LLMs created by OpenAI, such as GPT-1, GPT-2, GPT-3, InstructGPT, and ChatGPT/GPT-4. Like other LLMs, GPT models’ architectures are largely based on Transformers, which use text and positional embeddings as input, and attention layers to model tokens’ relationships.

GPT-1 model architecture. Image from the paper https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

The later GPT models use architectures similar to GPT-1, except that they use more model parameters, with more layers, larger context length, hidden layer size, etc.

Model size comparison of GPT models. Image by the author.

What’s data-centric AI?

Data-centric AI is an emerging new way of thinking about how to build AI systems. It has been advocated by Andrew Ng, an AI pioneer.

Data-centric AI is the discipline of systematically engineering the data used to build an AI system. — Andrew Ng

In the past, we mainly focused on creating better models with the data largely unchanged (model-centric AI). However, this approach can cause problems in the real world because it doesn’t account for the various issues that may arise in the data, such as inaccurate labels, duplicates, and biases. As a result, “overfitting” a dataset may not necessarily lead to better model behaviors.

In contrast, data-centric AI focuses on improving the quality and quantity of data used to build AI systems. This means that the attention is on the data itself, and the models are relatively more fixed. Developing AI systems with a data-centric approach holds more potential in real-world scenarios, as the data used for training ultimately determines the maximum capability of a model.

It is important to note that “data-centric” differs fundamentally from “data-driven”: the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.

Comparison between data-centric AI and model-centric AI. https://arxiv.org/abs/2301.04819. Image by the author.

The data-centric AI framework consists of three goals:

  • Training data development: collect and produce rich, high-quality data to support the training of machine learning models.
  • Inference data development: create novel evaluation sets that can provide more granular insights into the model, or trigger a specific capability of the model with engineered data inputs.
  • Data maintenance: ensure the quality and reliability of data in a dynamic environment. Data maintenance is critical, as real-world data is not created once but rather requires continuous maintenance.
Data-centric AI framework. https://arxiv.org/abs/2303.10158. Image by the author.

Why Data-centric AI Made GPT Models Successful

Months ago, Yann LeCun tweeted that ChatGPT was nothing new. Indeed, all the techniques used in ChatGPT and GPT-4 (the Transformer, reinforcement learning from human feedback, etc.) are not new at all. However, they achieved incredible results that previous models couldn’t. So, what is the driving force of their success?

The quantity and quality of the data used for training GPT models have seen a significant increase through better data collection, data labeling, and data preparation strategies.

  • GPT-1: the BooksCorpus dataset is used in training. This dataset contains 4629.00 MB of raw text, covering books from a range of genres such as Adventure, Fantasy, and Romance.
    Pre-training GPT-1 on this dataset can increase performance on downstream tasks with fine-tuning.
  • GPT-2: WebText is used in training. This is an internal dataset at OpenAI created by scraping outbound links from Reddit.
    (1) Curate/filter data by only using the outbound links from Reddit that received at least 3 karma. (2) Use the tools Dragnet and Newspaper to extract clean content. (3) Adopt de-duplication and some other heuristic-based cleaning (details not mentioned in the paper).
    40 GB of text is obtained after filtering. GPT-2 achieves strong zero-shot results without fine-tuning.
  • GPT-3: the training of GPT-3 is mainly based on Common Crawl.
    (1) Train a classifier to filter out low-quality documents based on the similarity of each document to WebText, a proxy for high-quality documents. (2) Use Spark’s MinHashLSH to fuzzily deduplicate documents. (3) Augment the data with WebText, books corpora, and Wikipedia.
    570 GB of text is obtained after filtering from 45 TB of plaintext (only 1.27% of the data is selected in this quality filtering). GPT-3 significantly outperforms GPT-2 in the zero-shot setting.
  • InstructGPT: let humans evaluate the answers to tune GPT-3 so that it can better align with human expectations. OpenAI designed tests for annotators, and only those who passed the tests were eligible to annotate. They even designed a survey to make sure that the annotators enjoyed the annotation process.
    (1) Use human-provided answers to prompts to tune the model with supervised training. (2) Collect comparison data to train a reward model, then use this reward model to tune GPT-3 with reinforcement learning from human feedback (RLHF).
    InstructGPT shows better truthfulness and less bias, i.e., better alignment.
  • ChatGPT/GPT-4: the details are not disclosed by OpenAI. But it is known that ChatGPT/GPT-4 largely follow the design of previous GPT models, and they still use RLHF to tune the models (with potentially more and higher-quality data/labels). It is commonly believed that GPT-4 used an even larger dataset, as the model weights have been increased.
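The fuzzy deduplication step in the GPT-3 pipeline above was done with Spark’s MinHashLSH at scale; a pure-Python toy sketch of the underlying MinHash idea (the document texts, shingle size, and number of permutations here are all illustrative choices, not the values OpenAI used) looks like this:

```python
import hashlib

def shingles(text, k=3):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_perm=64):
    """For each of num_perm seeded hash functions, keep the minimum
    hash value over all shingles of the document."""
    sig = []
    for seed in range(num_perm):
        min_h = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        sig.append(min_h)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates the
    Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river bend"
doc3 = "completely unrelated text about training large language models today"

s1, s2, s3 = (minhash_signature(d) for d in (doc1, doc2, doc3))
print(estimated_jaccard(s1, s2))  # high: near-duplicates
print(estimated_jaccard(s1, s3))  # low: unrelated documents
```

In a production system, the LSH part buckets documents by bands of their signatures so that only likely duplicates are ever compared, which is what makes deduplicating 45 TB of text feasible.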
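The reward model in the InstructGPT step above is trained on comparison data: for a pair of answers to the same prompt, it should assign a higher score to the one annotators preferred. A minimal sketch of the core pairwise objective (the function name and the example reward values are hypothetical; the real model computes rewards with a neural network) is:

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Negative log-sigmoid of the reward margin: the loss is small when
    the human-preferred answer already scores higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_reward_loss(2.0, -1.0))  # small loss: ranking is correct
print(pairwise_reward_loss(-1.0, 2.0))  # large loss: ranking is inverted
```

Minimizing this loss over many human comparisons is what turns preference labels into a scalar reward signal that RLHF can then optimize the language model against.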

As recent GPT models are already sufficiently powerful, we can achieve various goals by tuning prompts (i.e., tuning inference data) with the model fixed. For example, we can conduct text summarization by offering the text to be summarized alongside an instruction like “summarize it” or “TL;DR” to steer the inference process.
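The summarization example above amounts to nothing more than wrapping the input text in an instruction before sending it to the fixed model; a hypothetical helper (the function name and styles are illustrative, not any official API) might look like:

```python
def build_summarization_prompt(text, style="tl;dr"):
    """Wrap the text in a summarization instruction; the model stays fixed,
    and only this inference data changes."""
    if style == "tl;dr":
        return f"{text}\n\nTL;DR:"
    return f"Summarize the following text:\n\n{text}\n\nSummary:"

article = "Large language models are trained on massive text corpora..."
prompt = build_summarization_prompt(article)
print(prompt.endswith("TL;DR:"))  # True
```

The resulting string is then passed to the model as-is; steering behavior this way, rather than retraining, is exactly what “tuning inference data” means.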

Prompt tuning. https://arxiv.org/abs/2303.10158. Image by the author.

Designing the proper prompts for inference is a challenging task. It relies heavily on heuristics. A nice survey has summarized different prompting methods. Sometimes, even semantically similar prompts can produce very different outputs. In this case, Soft Prompt-Based Calibration may be required to reduce variance.

Soft prompt-based calibration. Image from the paper https://arxiv.org/abs/2303.13035v1 with the original authors’ permission.

Research on inference data development for LLMs is still in its early stages. More inference data development techniques that have been used in other tasks could be applied to LLMs in the near future.

ChatGPT/GPT-4, as a commercial product, is not just trained once but rather is continuously updated and maintained. Clearly, we can’t know how data maintenance is executed outside of OpenAI. So, we discuss some general data-centric AI strategies that are, or will very likely be, used for GPT models:

  • When we use ChatGPT/GPT-4, our prompts/feedback could, in turn, be used by OpenAI to further advance their models. Quality metrics and assurance strategies may have been designed and implemented to collect high-quality data in this process.
  • Various tools may have been developed to visualize and comprehend user data, facilitating a better understanding of users’ requirements and guiding the direction of future improvements.
  • As the number of users of ChatGPT/GPT-4 grows rapidly, an efficient data management system is required to enable fast data acquisition.

ChatGPT/GPT-4 collects user feedback with “thumbs up” and “thumbs down” to further evolve the system. Screenshot from https://chat.openai.com/chat.

What Can the Data Science Community Learn from this Wave of LLMs?

The success of LLMs has revolutionized AI. Looking forward, LLMs could further revolutionize the data science lifecycle. We make two predictions:

  • After years of research, model design is already very mature, especially after the Transformer. Engineering data becomes the most important (or possibly the only) way to improve AI systems in the future. Also, when the model becomes sufficiently powerful, we don’t need to train models in our daily work. Instead, we only need to design the proper inference data (prompt engineering) to probe knowledge from the model. Thus, the research and development of data-centric AI will drive future advancements.
  • Many of the tedious data science tasks could be performed much more efficiently with the help of LLMs. For example, ChatGPT/GPT-4 can already write workable code to process and clean data. Moreover, LLMs can even be used to create data for training. For example, recent work has shown that generating synthetic data with LLMs can boost model performance in clinical text mining.
Generating synthetic data with LLMs to train the model. Image from the paper https://arxiv.org/abs/2303.04360 with the original authors’ permission.


I hope this article can inspire you in your own work. You can learn more about the data-centric AI framework and how it benefits LLMs in the following papers:

We have maintained a GitHub repo, which will regularly update the relevant data-centric AI resources. Stay tuned!

In later articles, I’ll delve into the three goals of data-centric AI (training data development, inference data development, and data maintenance) and introduce the representative methods.

