Inside LLaMA: Meta AI Recent Large Language Model that Outperforms GPT-3 Across Many Tasks

Created Using Midjourney

I recently started an AI-focused educational newsletter that already has over 150,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Large Language Models (LLMs) have recently taken the world by storm with their remarkable ability to perform new tasks from textual instructions or a handful of examples. This ability, often called few-shot learning, was first observed when models were scaled up to a sufficient size. As a result, researchers have focused on scaling these models even further, under the general assumption that more parameters lead to better performance. However, recent research has shown that, for a given compute budget, the best performance is not achieved by the largest models. Instead, smaller models trained on more data outperform their larger counterparts. In that context, Meta AI recently published a paper detailing LLaMA, a 65B-parameter LLM that is able to outperform GPT-3 across many tasks despite being significantly smaller.

The core principle behind LLaMA is to achieve the best possible performance at various inference budgets by training on more tokens than is typically used. LLaMA ranges from 7B to 65B parameters and is competitive with the best existing LLMs. For example, LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10× smaller. This model is likely to democratize access to and study of LLMs, since it can be run on a single GPU. At the higher end of the scale, the 65B-parameter model is also competitive with the best large language models, such as Chinchilla or PaLM-540B.
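To get an intuition for this trade-off, we can plug the publicly reported parameter and token counts into the rough approximations commonly used in the scaling-law literature: about 6·N·D FLOPs to train a model with N parameters on D tokens, and about 2·N FLOPs per generated token at inference. This is a back-of-the-envelope sketch, not a calculation from the LLaMA paper:

# Rough scaling-law approximations (not figures from the paper):
# training compute ~ 6 * N * D, inference compute per token ~ 2 * N.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params: float) -> float:
    """Approximate inference compute per generated token in FLOPs."""
    return 2 * n_params

# GPT-3: 175B parameters trained on roughly 300B tokens.
# LLaMA-13B: 13B parameters trained on 1T tokens.
print(f"GPT-3 training:            {train_flops(175e9, 300e9):.2e} FLOPs")
print(f"LLaMA-13B training:        {train_flops(13e9, 1.0e12):.2e} FLOPs")
print(f"GPT-3 inference/token:     {inference_flops_per_token(175e9):.2e} FLOPs")
print(f"LLaMA-13B inference/token: {inference_flops_per_token(13e9):.2e} FLOPs")

Under these approximations, LLaMA-13B is roughly 13 times cheaper per generated token than GPT-3, which is exactly the kind of inference budget the training strategy is optimized for.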

What sets LLaMA apart from other models is that it only uses publicly available data, making it compatible with open sourcing. Most existing models rely on data that is either not publicly available or undocumented. Although there are some exceptions, such as OPT, GPT-NeoX, BLOOM, and GLM, none of them are competitive with PaLM-62B or Chinchilla.

LLaMA's model is based on the standard transformer architecture, incorporating various improvements from recent research, such as pre-normalization, the SwiGLU activation function, and rotary embeddings. To improve training stability, LLaMA normalizes the input of every transformer sub-layer using the RMSNorm normalizing function instead of normalizing the output as in the original architecture. Moreover, LLaMA replaces the ReLU non-linearity with the SwiGLU activation function, using a hidden dimension of 2/3 · 4d, and removes the absolute positional embeddings, instead adding rotary positional embeddings (RoPE) at each layer of the network.
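As a rough illustration of two of these changes, here is a simplified PyTorch sketch (not Meta AI's actual implementation) of RMSNorm pre-normalization and a SwiGLU feed-forward block with a 2/3 · 4d hidden dimension; rotary embeddings are omitted for brevity:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square of the features, with no mean-centering.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)                  # 2/3 * 4d hidden dimension
        self.w1 = nn.Linear(dim, hidden, bias=False)   # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)   # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Pre-normalization: the sub-layer sees a normalized input, and its output is
# added back to the residual stream.
dim = 512
norm, ffn = RMSNorm(dim), SwiGLUFeedForward(dim)
x = torch.randn(2, 16, dim)          # (batch, sequence, dim)
x = x + ffn(norm(x))                 # residual + pre-norm sub-layer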

To improve training efficiency, Meta AI used an efficient implementation of the causal multi-head attention operator that reduces memory usage and computation, along with activation checkpointing: expensive activations are saved rather than recomputed during the backward pass, which requires manually implementing the backward function for the transformer layers. Finally, to reduce memory usage further, LLaMA relies on model and sequence parallelism and overlaps the computation of activations with the communication between GPUs over the network. The result was visible during the training of the 65B-parameter model: LLaMA processed roughly 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM, taking around 21 days to train on a dataset containing 1.4T tokens.
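These ideas can be approximated with standard PyTorch utilities; the following is only a conceptual sketch, not the custom implementation described in the paper. scaled_dot_product_attention provides a memory-efficient causal attention kernel, and torch.utils.checkpoint trades compute for memory by recomputing the wrapped activations during the backward pass instead of storing them:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

class CausalSelfAttention(nn.Module):
    """Toy causal self-attention layer using PyTorch's fused attention kernel."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for the attention kernel.
        shape = (b, t, self.n_heads, d // self.n_heads)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        # Memory-efficient causal attention (PyTorch >= 2.0).
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))

layer = CausalSelfAttention(dim=512, n_heads=8)
x = torch.randn(2, 128, 512, requires_grad=True)

# Activation checkpointing: the intermediate activations of the wrapped call
# are not stored; they are recomputed when gradients are needed.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()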

LLaMA was evaluated on 20 benchmarks, including zero-shot and few-shot tasks, and compared with other foundation models, such as GPT-3, Gopher, Chinchilla, and PaLM, as well as the OPT models, GPT-J, and GPT-Neo. Results showed that LLaMA was able to outperform GPT-3 despite being 10 times smaller.

Image Credit: Meta AI

Some of LLaMA's outputs are remarkably sophisticated and factually accurate, showing strong signs of reasoning.

Image Credit: Meta AI

LLaMA hasn't been open-sourced yet. Nevertheless, wasting no time, AI startup Nebuly released ChatLLaMA, an open-source implementation built on LLaMA and RLHF. ChatLLaMA enables the implementation of a ChatGPT-style service using pre-trained LLaMA models. Compared with the original ChatGPT, ChatLLaMA offers faster and cheaper training and single-GPU inference, thanks to the smaller size of the LLaMA architectures. The library also includes built-in support for DeepSpeed ZeRO, allowing you to speed up the fine-tuning process, and it supports all LLaMA model sizes (7B, 13B, 33B, 65B), giving you the flexibility to fine-tune the model based on your preferences for training time and inference performance.

The code for using ChatLLaMA is super easy, as illustrated below:

from chatllama.rlhf.trainer import RLTrainer
from chatllama.rlhf.config import Config

# Load the training settings from a YAML configuration file.
path = "path_to_config_file.yaml"
config = Config(path=path)

# Set up and run the RLHF training loop, then plot training statistics.
trainer = RLTrainer(config.trainer)
trainer.distillate()
trainer.train()
trainer.training_stats.plot()

LLaMA is certainly a very interesting development in the LLM space. Meta AI has enabled early access to the model, and hopefully a generally available release will follow soon.
