LLaMA model weights can be found on various sites around the web. This probably isn’t legal, but I’m only sharing a how-to tutorial.
All work shown here is provided by LLaMAnnon
magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce
Get the .torrent file.
Please download and seed all of the model weights if you can. Even if you only run a single model, don’t forget to download the tokenizer.model file too.
The official method recommended by Meta uses Conda, so:
Set up Conda
- Open a terminal and run:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- Run:
chmod +x Miniconda3-latest-Linux-x86_64.sh
- Run:
./Miniconda3-latest-Linux-x86_64.sh
- Go with the default options. When it shows you the license, press q to proceed with the installation.
- Refresh your shell by logging out and logging back in.
- Create an env:
conda create -n llama
- Activate the env:
conda activate llama
- Install the dependencies:
NVIDIA: conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia
AMD: pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.2
- Clone the INT8 repo by the user tloen:
git clone https://github.com/tloen/llama-int8 && cd llama-int8
- Install the requirements:
pip install -r requirements.txt
pip install -e .
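Before moving on, you can optionally sanity-check that PyTorch installed correctly and can see your GPU. This snippet is my own addition, not part of the official steps (torch.cuda.is_available() also reports True on ROCm builds):
import torch  # quick check that the install works and a GPU is visible
print("PyTorch version:", torch.__version__)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))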
Loading the weights for the 13B and larger models requires a substantial amount of RAM. IIRC it takes about 50GB for 13B, and over 100GB for 30B. You’ll need a swap file to handle the excess memory usage. The swap is only used for the loading process; inference is unaffected (so long as you meet the VRAM requirements).
- Create a swapfile:
sudo dd if=/dev/zero of=/swapfile bs=4M count=13000 status=progress
This creates a swapfile of roughly 50GB (4MB × 13000). Edit the count value to your preference.
- Mark it as swap:
sudo mkswap /swapfile
- Activate it:
sudo swapon /swapfile
If you want to delete it, simply run sudo swapoff /swapfile and then rm /swapfile.
I’ll assume your LLaMA models are in ~/Downloads/LLaMA.
- Open a terminal in your llama-int8 folder (the one you cloned).
- Run:
python example.py --ckpt_dir ~/Downloads/LLaMA/7B --tokenizer_path ~/Downloads/LLaMA/tokenizer.model --max_batch_size=1
- You’re done. Wait for the model to finish loading, and it will generate text from its built-in prompt.
By default, the llama-int8 repo has a short prompt baked into example.py.
- Open the example.py file inside the llama-int8 directory.
- Navigate to line 136. It starts with triple quotes, """.
- Replace the existing prompt with whatever you have in mind (a rough sketch of what the block looks like follows this list).
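For reference, the prompt block you are editing looks something like the sketch below. The variable name and the example text are my own guesses for illustration, not copied from the repo, so go by what you actually see around line 136:
# Illustrative sketch only -- the real code in example.py may differ slightly.
prompts = [
    """I believe the meaning of life is"""  # put your own prompt text between the triple quotes
]
# The model will continue generating from whatever text you place here.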
Good luck!! Word on the street is that the 7B model is pretty dumb, and it’s the only version that fits on an enthusiast GPU (16-24GB; 8GB is a no-go). There are tricks to make a 13B model fit (using 8-bit memory shenanigans), but I haven’t done that and I’m not sure how it affects the model itself.