
Can You Build Large Language Models Like ChatGPT at Half the Cost?


Large Language Models (LLMs) like GPT-3 and ChatGPT have revolutionized AI by offering natural language understanding and content generation capabilities. But their development comes at a hefty price, limiting accessibility and further research. Researchers estimate that training GPT-3 cost OpenAI around $5 million. Nevertheless, Microsoft recognized the potential and invested $1 billion in 2019 and $10 billion in 2023 in OpenAI, the company behind GPT-3 and ChatGPT.

LLMs are machine learning models trained on extensive textual data for NLP applications. They are based on the transformer architecture and use attention mechanisms for NLP tasks like question answering, machine translation, sentiment analysis, and more.
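As a rough illustration of the attention mechanism at the heart of the transformer architecture, the sketch below computes single-head scaled dot-product attention over a toy sequence. It assumes NumPy; the function and variable names are illustrative and not taken from any specific model's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: each position attends to every position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ V                                # weighted sum of values

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)    # (4, 8)
```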

The question arises: can the efficiency of these large models be increased while simultaneously reducing computational cost and training time?

Several approaches, like Progressive Neural Networks, Network Morphism, intra-layer model parallelism, knowledge inheritance, etc., have been developed to reduce the computational cost of training neural networks. The novel LiGO (Linear Growth Operator) approach we will discuss here sets a new benchmark: it halves the computational cost of training LLMs.

Before discussing this technique, it is important to examine the factors that contribute to the high cost of building LLMs.

Cost of Building Large Language Models

Three major expenses for developing LLMs are as follows:

1. Computational Resources

Building LLMs requires massive computational resources for training on large datasets. The models must process billions of parameters and learn complex patterns from vast amounts of text.

Investment in specialized hardware such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) is required to build and train LLMs that achieve state-of-the-art performance.

For example, GPT-3 was trained on a supercomputer with 10,000 enterprise-grade GPUs and 285,000 CPU cores.

2. Energy Consumption

The intensive computational resources required for building LLMs result in significant energy consumption. For example, training the 175-billion-parameter GPT-3 took 14.8 days using 10,000 V100 GPUs, equivalent to roughly 3.55 million GPU hours. Such a high level of energy consumption also has significant environmental effects.
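The GPU-hour figure follows directly from the reported training duration and GPU count; a quick back-of-the-envelope check (using the rounded figures quoted above):

```python
gpus = 10_000                     # V100 GPUs in the example above
days = 14.8                       # reported GPT-3 training duration
gpu_hours = gpus * days * 24      # hours of compute across all GPUs
print(f"{gpu_hours / 1e6:.2f} million GPU hours")  # ~3.55 million
```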

3. Data Storage & Management

LLMs are trained on large datasets. For example, GPT-3 was trained on a vast corpus of textual data, including Common Crawl, WebText2, Books1, Books2, and Wikipedia, among other sources. Significant infrastructure investment is required to collect, curate, and store these datasets.

Cloud storage is also required for the data itself, along with human expertise for data preprocessing and version control. Furthermore, ensuring that the data strategy complies with regulations like GDPR adds to the cost.

LiGO Technique: Reducing the Cost of Building Large Language Models by Half

LiGO (Linear Growth Operator) is a novel technique developed by researchers at MIT to reduce the computational cost of training LLMs by around 50%. The method initializes the weights of a larger model from those of a smaller pre-trained model, enabling efficient scaling of neural networks.

Yoon Kim, the senior author of the paper, says:

This technique maintains the performance advantages of larger models at reduced computational cost and training time compared with training a large model from scratch. LiGO uses a data-driven linear growth operator that combines depth and width operators for optimal performance.
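To make the idea concrete, the sketch below shows a simplified, hypothetical version of width growth in the spirit of LiGO: a learnable linear operator expands a small pretrained weight matrix into the initialization for a wider layer. The actual method factorizes growth across both width and depth and learns the operator from data; the dimensions, variable names, and initialization here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

d_small, d_large = 256, 512

# Stand-in for one pretrained weight matrix from a small model
small_weight = torch.randn(d_small, d_small)

# Learnable expansion matrices, one for each side of the weight matrix
A = nn.Parameter(torch.randn(d_large, d_small) * 0.02)
B = nn.Parameter(torch.randn(d_large, d_small) * 0.02)

# Grown weight: W_large = A @ W_small @ B^T, a linear map of the small weights.
# In practice A and B would be trained briefly on data before the large model
# continues training from this initialization instead of from random weights.
large_weight = A @ small_weight @ B.T
print(large_weight.shape)  # torch.Size([512, 512])
```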

The paper used various datasets for its text-based experiments, including the English Wikipedia corpus for training BERT and RoBERTa models and the C4 dataset for training GPT-2.

The LiGO experiments included growing BERT-Small to BERT-Base, BERT-Base to BERT-Large, RoBERTa-Small to RoBERTa-Base, GPT2-Base to GPT2-Medium, and CaiT-XS to CaiT-S.

The researchers compared their approach with several other baselines, including training from scratch, progressive training, bert2BERT, and KI.

By reusing the BERT-Small model, the LiGO technique offered 44.7% savings in FLOPs (floating-point operations) and 40.7% savings in wall-clock time compared with training BERT-Base from scratch. The LiGO growth operator outperforms StackBERT, MSLT, bert2BERT, and KI in training efficiency.

Advantages of Using a Training Optimization Technique Like LiGO

LiGO is an efficient neural network training method with several advantages:

1. Faster Training

As stated earlier, faster training is the main advantage of the LiGO technique. It trains LLMs in roughly half the time, increasing productivity and reducing costs.

2. Resource Efficient

LiGO is resource-efficient because it minimizes wall-clock time and FLOPs, resulting in a cheaper and more eco-friendly approach to training large transformer models.

3. Generalization

The LiGO technique has improved the training of both language and vision transformers, suggesting that it is a generalizable technique that can be applied to a variety of tasks.

Building commercial AI products is only one facet of the overall expenses associated with AI systems. Another major cost factor comes from day-to-day operations. For example, it reportedly costs OpenAI about $700,000 per day to answer queries using ChatGPT. Researchers are expected to continue exploring approaches that make LLMs cost-effective during training and more accessible at runtime.

For more AI-related content, visit unite.ai.
