
Learning Transformers Code First: Part 1 — The Setup

I don't know about you, but sometimes looking at code is easier than reading papers. When I was working on AdventureGPT, I started by reading the source code of BabyAGI, an implementation of the ReAct paper in around 600 lines of Python.

Recently, I became aware of a paper called TinyStories through episode 33 of the wonderful Cognitive Revolution Podcast. TinyStories attempts to show that models trained on millions (not billions) of parameters can be effective with high-enough quality data. In the case of the Microsoft researchers behind the paper, they used synthetic data generated with GPT-3.5 and GPT-4 that would have cost around $10k retail to generate. The dataset and models are available from the authors' HuggingFace repo.

I was captivated to hear that a model could be trained with 30M or fewer parameters. For reference, I'm running all my model training and inference on a Lenovo Legion 5 laptop with a GTX 1660 Ti. Even just for inference, most models with over 3B parameters are too large to run on my machine. I know there are cloud compute resources available for a price, but I'm learning all this in my spare time and can really only afford the modest OpenAI bill I rack up via API calls. Therefore, the idea that there were models I could train on my modest hardware immediately lit me up.

I started reading the TinyStories paper and soon realized that they used the now-defunct GPT Neo model in their training. I started digging into the code to see if I could understand it, and realized I needed something even smaller to start from. For context, I'm mainly a backend software engineer with barely enough machine learning experience to not get completely lost when hearing people talk about neural nets. I'm nowhere near a proper ML engineer, and this led me to type "gpt from scratch" into my preferred search engine to find a gentler introduction. I found the video below and everything shifted.

This was what I was looking for. In addition to the basic repo linked in the video, there is a more sophisticated version called nanoGPT which is still under active development. To me, that was even more exciting than the video. I closed the video and began poring over the source code. nanoGPT uses PyTorch, which I've never used before. It also features just enough math and machine learning jargon to make the neophyte in me anxious. This was going to be a bigger undertaking than I anticipated.

One of the best ways to understand something is to write about it. Therefore, I plan on picking apart the code in the nanoGPT repo, reading the famous "Attention is All You Need" paper, and learning transformers in a bottom-up, hands-on way. Whatever I learn along the way I hope to write about in this series. If you want to follow along, clone the nanoGPT repo to your machine (the model can even be trained on CPU, so no hardware excuses) and follow along.

The first thing I did after cloning the repo was follow the README's instructions for training the simplest model, the character-level generation model using the tiny_shakespeare dataset. There's a script to prepare the dataset for training, a script to do the actual training, and a sampling script to output generated text. With a few terminal commands and an hour-plus of training, I had a simple model that output Shakespearean-sounding text.

Following instructions is all well and good, but I don't really understand something until I modify it to work for my own use case. My goal here was to train a similar character-level model using the TinyStories dataset. This required creating my own data preparation script to get the dataset ready for training. Let's dig into that deeper.

nanoGPT has two kinds of data preparation scripts: one for GPT-2-style models and one for character-level models. I grabbed some of the code from the GPT-2 scripts for downloading from HuggingFace repositories and took everything else from the tiny_shakespeare character-level script. One important point here: tiny_shakespeare is just over 1MB and contains only 40k lines of Shakespeare. TinyStories is over 3GB compressed and contains 39.7M stories. The methods for tokenizing and slicing tiny_shakespeare were not directly transferable, at least not with the 32GB of RAM my laptop has. I crashed my machine several times trying Pythonic, easy-to-read methods of preparing TinyStories. The final script uses a few tricks I'll detail below.

First off, my preferred solution for processing lists of data is list comprehension, a syntax for generating new lists from existing lists with modifications. The problem with list comprehension in this case is that the 3GB of compressed text becomes closer to 10GB in RAM, and list comprehension requires multiple copies of the list in memory. Not an issue for small data, but unworkable for TinyStories.
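To make that concrete, here is an illustrative contrast (not code from the actual script), assuming dataset is the TinyStories dataset loaded via HuggingFace's datasets library:

# Illustrative only: a naive list comprehension keeps the raw stories
# (roughly 10GB once decompressed) and the derived list in RAM at the same time.
all_chars = [list(set(story)) for story in dataset['train']['text']]

# A generator expression yields one item at a time instead, so nothing beyond
# the current story and the accumulated result needs to stay resident.
unique_chars = set()
for story_chars in (set(story) for story in dataset['train']['text']):
    unique_chars |= story_chars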

The output of the data preparation scripts is a compressed NumPy array of the character-level encoding for the train and validation data, plus a metadata pickle which contains the full list of unique characters and the encoding/decoding maps to convert those characters to numbers. Using this as a reference, we don't need anything other than the final encoded array of numbers once the unique characters are found and mapped to numbers. The most memory-efficient way to do this is to iterate through the data with a simple for-loop while constructing these outputs piecemeal. To do that, you initialize a variable before the loop which then gets updated on each iteration. This prevents multiple versions of the dataset from being held in RAM and only outputs what we need. The final vocab generation code is below:

from tqdm import tqdm  # progress bar; `dataset` is assumed to be the TinyStories dataset loaded via HuggingFace's datasets library

chars_dataset = set()
len_dataset = 0

# get all the unique characters that occur in this text as well as the total length of the training data
desc = "Enumerate characters in training set"
for story in tqdm(dataset['train']['text'], desc):
    chars = list(set(story))

    for char in chars:
        chars_dataset.add(char)

    len_dataset += len(story)
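For completeness, here is a minimal sketch of how the metadata pickle described above might be written once the unique characters are known. The meta.pkl filename and key layout follow the nanoGPT character-level convention, so treat them as assumptions rather than the exact contents of my script:

import pickle

# build deterministic encoding/decoding maps from the set of unique characters
chars = sorted(chars_dataset)
stoi = { ch: i for i, ch in enumerate(chars) }   # character -> int
itos = { i: ch for i, ch in enumerate(chars) }   # int -> character

# save the metadata so the sampling script can decode generated ints back into text
meta = {
    'vocab_size': len(chars),
    'stoi': stoi,
    'itos': itos,
}
with open('meta.pkl', 'wb') as f:
    pickle.dump(meta, f)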

That said, an array of 30.7M stories (over 4B characters) encoded as numbers still takes up a non-trivial amount of RAM because Python stores ints dynamically. Enter NumPy, which has much more efficient array storage where you can specify the exact size of the ints. In addition to the efficient storage, NumPy also has memory-efficient array concatenation which can be used to construct the final encoded array iteratively rather than all at once.
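As a rough sketch of the idea (not the exact code from the final script), each character ID fits in 16 bits, so the encodings can be appended to a growing uint16 array and written out as a raw binary file the way nanoGPT's training script expects:

import numpy as np

# uint16 stores each character ID in 2 bytes versus ~28 bytes for a boxed Python int
encoded_train = np.array([], dtype=np.uint16)

for encoded_story in encoded_stories:  # assumed: an iterable of lists of character IDs
    story_arr = np.array(encoded_story, dtype=np.uint16)
    # append story by story rather than holding every encoding as a Python list
    encoded_train = np.concatenate((encoded_train, story_arr))

# nanoGPT reads the prepared data back with np.memmap, so the final array
# is typically written straight to disk as raw uint16 values
encoded_train.tofile('train.bin')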

The finishing touch on the script was to add a progress bar using tqdm for each step, and I was finally ready to run it. So, I ran it overnight and came back in the morning. When I came back, the script was still running, with over 100 estimated hours of compute time remaining.

That is when it really hit me: 30.7M stories may be small for a language model, but it is very much not a toy dataset to be processed on a single thread. It was time to bring in the big guns: parallelization. Parallelism brings in a lot of complexity and overhead, but the performance gains were worth the trade-off. Luckily, there are a number of ways to parallelize Python code. Many of these solutions require major rewrites of a serially executed script or complicated abstractions. With a little digging, I found something that allowed me to keep most of my script the same but still run multiple processes to take advantage of all of my threads.

Ray is a library for easily parallelizing Python functions, and it can be run locally or as a cluster. It handles placing tasks in a queue and spinning up worker processes to eat away at that queue. There is an excellent guide to Ray below if this has whetted your appetite.

When it came to choosing what to parallelize, the encode function seemed like a good candidate. It has clear inputs and outputs, no side effects on those inputs, and was easily one of the largest portions of the compute time. Adapting the existing code to work with Ray couldn't have been easier: the function becomes accessible to Ray via a decorator, the function call changes slightly to add a remote attribute, and there is a function to kick off processing all the data. Below is an example of how it looked in my codebase initially:

import ray

ray.init()

# given all the unique characters within the dataset,
# create a unique mapping of characters to ints
stoi = { ch: i for i, ch in enumerate(chars_dataset) }

@ray.remote
def encode(s):
    return [stoi[c] for c in s]

encoded_stories = []
for story in dataset['train']['text']:
    encoded_stories.append(encode.remote(story))

encoded_stories = ray.get(encoded_stories)

Armed with all my CPU's power, I forged ahead, only to instantly crash my laptop. With the locally distributed call stack used by Ray, the entire dataset ended up in memory several times over. Simply enqueuing the entire dataset caused an out-of-memory error. Annoyed, I used this as an excuse to buy more RAM (64GB, here we come!), but continued to tweak the code while the RAM shipped.

The next logical step was to batch the requests being handled by Ray into something that would fit inside a reasonable amount of memory. Adding batching logic was fairly straightforward and is present in the final codebase I'll link to at the end of the article; a simplified sketch follows below. What actually became interesting was experimenting with the batch size. Initially, I chose an arbitrary batch size (5000) and it started off well, but it became obvious to me that a fair amount of time was being spent on single-threaded code during each batch.
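Here is a simplified sketch of that batching pattern, reusing the encode remote function from above; the real script's slicing and bookkeeping details are elided:

BATCH_SIZE = 5000  # the initial, arbitrarily chosen batch size

stories = dataset['train']['text']
encoded_stories = []

for start in range(0, len(stories), BATCH_SIZE):
    # single-threaded: slice the next chunk of stories out of the full dataset
    batch = stories[start:start + BATCH_SIZE]

    # parallel: enqueue only this batch's encode tasks, then block until they
    # finish so at most one batch of futures is in flight at a time
    futures = [encode.remote(story) for story in batch]
    encoded_stories.extend(ray.get(futures))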

Essentially, watching my preferred system monitor, I saw a single core pinned for minutes before all my laptop's cores finally lit up for a few seconds, then went back to only a single core being utilized. This led me to play with the batch size a bit, hoping to feed the ravenous CPU cores faster and keep them engaged longer. Lowering the batch size didn't help because there was so much synchronous code in each batch used to slice and prepare a chunk from the full dataset. That code couldn't be parallelized, so each batch had a large startup cost, time-wise, generating the chunk. This led me to try the opposite: increasing the chunk size to keep the cores engaged for longer. This worked, as chunk generation took the same amount of time regardless of chunk size, but each chunk processed more data. Combining this with moving my encoding post-processing into Ray functions, I was able to chew through 30% of the training dataset in just a few hours, all on a single laptop.

Finally, after a few more hours, I had a fully prepared, custom dataset to feed to the character-level model. I was pleased that I didn't have to resort to expensive cloud compute to process the training set, which was my next move if the RAM increase didn't work. What's more, I learned in detail what it means to create and process a dataset for a character-level model.

In the next article in this series, I will be examining the actual model code, explaining it as best I can and linking to copious external resources to provide additional information where my knowledge falls short. Once that article is written, I'll come back and provide a link here. In the meantime, I've linked the final version of my dataset preparation script below so you can follow along and see what it takes to process a somewhat large dataset on a limited compute platform.
