What Are the Data-Centric AI Concepts behind GPT Models?

Unpacking the data-centric AI techniques utilized in ChatGPT and GPT-4

https://arxiv.org/abs/2303.10158. Image by the writer.

Artificial Intelligence (AI) has made incredible strides in transforming the way we live, work, and interact with technology. Recently, one area that has seen significant progress is the development of Large Language Models (LLMs), such as GPT-3, ChatGPT, and GPT-4. These models are capable of performing tasks such as language translation, text summarization, and question-answering with impressive accuracy.

While it’s difficult to ignore the ever-increasing model size of LLMs, it’s also important to recognize that their success is due largely to the large amount of high-quality data used to train them.

In this article, we will present an overview of the recent advancements in LLMs from a data-centric AI perspective, drawing upon insights from our recent survey papers [1,2] with corresponding technical resources on GitHub. In particular, we will take a closer look at GPT models through the lens of data-centric AI, a growing concept in the data science community. We’ll unpack the data-centric AI concepts behind GPT models by discussing three data-centric AI goals: training data development, inference data development, and data maintenance.

Large Language Models (LLMs) and GPT Models

LLMs are a type of Natural Language Processing model trained to infer words within a context. For example, the most basic function of an LLM is to predict missing tokens given the context. To do this, LLMs are trained on massive data to predict the probability of each candidate token.

An illustrative example of predicting the probabilities of missing tokens with an LLM within a context. Image by the writer.
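To make the idea of "predicting the probability of each candidate token" concrete, here is a minimal sketch in Python. It assumes the model has already produced a raw score (a logit) for each candidate; the sentence, the candidate tokens, and the scores are all made up for illustration and do not come from any real model.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_score) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the blank in "The cat sat on the ___".
candidate_logits = {"mat": 4.0, "sofa": 3.0, "moon": 0.5}

probs = softmax(candidate_logits)
best_token = max(probs, key=probs.get)  # the most likely candidate

for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {p:.3f}")
```

A real LLM does the same thing over its entire vocabulary rather than three candidates, and training adjusts the model so that tokens actually seen in the data receive higher probability.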

GPT models refer to a series of LLMs created by OpenAI, such as GPT-1, GPT-2, GPT-3, InstructGPT, and ChatGPT/GPT-4. Like other LLMs, GPT models’ architectures are largely based on Transformers, which use text and positional embeddings as input, and attention layers to model tokens’ relationships.

GPT-1 model architecture. Image from the paper https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
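The two ingredients mentioned above, embeddings as input and attention over tokens, can be sketched in a few lines of plain Python. This is a toy, single-head illustration, not OpenAI's implementation: the sizes, the random (untrained) weights, and all names are assumptions, and the causal mask (each position attends only to itself and earlier positions) reflects the decoder-style design of GPT models rather than anything spelled out in this article.

```python
import math
import random

random.seed(0)
VOCAB_SIZE, CONTEXT_LEN, D_MODEL = 50, 8, 4  # toy sizes, chosen for illustration

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Lookup tables and projections (random here; a real model learns them).
token_emb = random_matrix(VOCAB_SIZE, D_MODEL)
pos_emb = random_matrix(CONTEXT_LEN, D_MODEL)
W_q, W_k, W_v = (random_matrix(D_MODEL, D_MODEL) for _ in range(3))

def causal_self_attention(token_ids):
    # Input = token embedding + positional embedding, one vector per position.
    x = [[t + p for t, p in zip(token_emb[tok], pos_emb[i])]
         for i, tok in enumerate(token_ids)]
    q, k, v = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)
    weights = []
    for i, q_i in enumerate(q):
        # Position i scores itself and every earlier position (causal mask).
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(D_MODEL)
                  for j in range(i + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Future positions get zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (len(token_ids) - i - 1))
    # Each output vector is a weighted mix of the value vectors.
    return matmul(weights, v), weights

output, attn = causal_self_attention([3, 14, 15, 9])
```

Each row of `attn` shows how much one position draws on the others, which is how the attention layer models relationships between tokens. A full Transformer stacks many such layers with multiple heads, feed-forward blocks, and normalization.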

The later GPT models use architectures similar to GPT-1, except that they have more model parameters, with more layers, a larger context length, a larger hidden layer size, and so on.

Model size comparison of GPT models. Image by the writer.
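To see how more layers and a larger hidden size translate into parameter count, a common rule of thumb is that a Transformer has roughly 12 × (number of layers) × (hidden size)² non-embedding parameters: about 4 × (hidden size)² for the attention projections and 8 × (hidden size)² for a feed-forward block four times as wide as the hidden size. The sketch below applies that approximation. The layer counts and hidden sizes are taken from the published GPT-1, GPT-2 (largest), and GPT-3 papers, not from this article, and the result ignores embeddings, so treat the numbers as rough estimates.

```python
def approx_transformer_params(n_layers, d_model):
    """Rough count of non-embedding parameters: 12 * n_layers * d_model^2."""
    return 12 * n_layers * d_model ** 2

# (number of layers, hidden size) as reported in the respective papers.
configs = {
    "GPT-1": (12, 768),
    "GPT-2": (48, 1600),
    "GPT-3": (96, 12288),
}

estimates = {name: approx_transformer_params(*cfg) for name, cfg in configs.items()}

for name, n_params in estimates.items():
    print(f"{name}: ~{n_params / 1e9:.2f}B non-embedding parameters")
```

The estimates come out close to the commonly cited totals (around 117M, 1.5B, and 175B) once embedding parameters are added back in, which matters most for the smallest model.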

What Is Data-centric AI?

Data-centric AI is an emerging new way of thinking about how to build AI systems. It has been advocated by Andrew Ng, an AI pioneer.

“Data-centric AI is the discipline of systematically engineering the data used to build an AI system.” (Andrew Ng)

In the past, we mainly focused on creating better models with the data largely unchanged (model-centric AI). However, this approach can lead to problems in the real world because it doesn’t account for the various problems that may arise in the data, such as inaccurate labels, duplicates, and biases. As a result, “overfitting” a dataset may not necessarily lead to better model behaviors.
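The data problems listed above can often be surfaced with very simple checks before any model is trained. The following is a minimal sketch on a made-up sentiment dataset; the examples, the `normalize` and `audit` helpers, and the labels are all hypothetical, and real pipelines would use fuzzier duplicate detection and more careful bias analysis.

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset of (text, label) pairs.
examples = [
    ("great movie", "positive"),
    ("Great movie ", "positive"),   # duplicate once case and spacing are normalized
    ("terrible plot", "negative"),
    ("terrible plot", "positive"),  # same text, conflicting label
    ("fine acting", "positive"),
]

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def audit(examples):
    labels_by_text = defaultdict(list)
    for text, label in examples:
        labels_by_text[normalize(text)].append(label)
    # Texts that appear more than once.
    duplicates = sorted(t for t, ls in labels_by_text.items() if len(ls) > 1)
    # Texts whose copies disagree on the label (likely labeling errors).
    conflicts = sorted(t for t, ls in labels_by_text.items() if len(set(ls)) > 1)
    # Label distribution, a first hint of class imbalance.
    label_counts = Counter(label for _, label in examples)
    return duplicates, conflicts, label_counts

duplicates, conflicts, label_counts = audit(examples)
print("duplicates:", duplicates)
print("conflicting labels:", conflicts)
print("label counts:", dict(label_counts))
```

A model tuned to fit this dataset perfectly would also fit the conflicting label and the skew toward "positive", which is the sense in which overfitting a flawed dataset does not yield better behavior.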

In contrast, data-centric AI focuses on improving the quality and quantity of the data used to build AI systems. This means that the attention is on the data itself, and the models are relatively more fixed. Developing AI systems with a data-centric approach holds more potential in real-world scenarios, as the data used for training ultimately determines the maximum capability of a model.

It is important to note that “data-centric” differs fundamentally from “data-driven”, as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.

Comparison between data-centric AI and model-centric AI. https://arxiv.org/abs/2301.04819 Image by the writer.

The data-centric AI framework consists of three goals:

Training data development is to collect and produce rich and high-quality data to support the training of machine learning models.
Inference data development is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs.
Data maintenance is to ensure the quality and reliability of data in a dynamic environment. Data maintenance is critical as data in the real world is not created once but rather necessitates continuous maintenance.

Data-centric AI framework. https://arxiv.org/abs/2303.10158. Image by the writer.

Why Data-centric AI Made GPT Models Successful

Some months ago, Yann LeCun tweeted that ChatGPT was nothing new. Indeed, none of the techniques (transformer, reinforcement learning from human feedback, etc.) used in ChatGPT and GPT-4 are new at all. Nevertheless, they achieved incredible results that previous models couldn’t. So, what is the driving force behind their success?