Faster Text Generation with TensorFlow and XLA


Joao Gante


TL;DR: Text Generation on 🤗 transformers using TensorFlow can now be compiled with XLA. It is up to 100x
faster than before, and even faster than PyTorch. Check the colab below!

Open In Colab



Text Generation

As the quality of large language models increased, so did our expectations of what those models could do. Especially
since the release of OpenAI's GPT-2, models with text
generation capabilities have been in the spotlight. And for legitimate reasons: these models can be used to
summarize, translate, and they have even demonstrated zero-shot learning capabilities on some language tasks.
This blog post will show how to make the most of this technology with TensorFlow.

The 🤗 transformers library started with NLP models, so it is natural that text generation is of utmost
importance to us.
It is part of Hugging Face's democratization efforts to ensure it is accessible, easily controllable, and efficient.
There is a previous blog post about the different types of text
generation. Nevertheless, below is a quick recap of the core functionality; feel free to
skip it if you are
familiar with our generate function and want to jump straight into TensorFlow's specificities.

Let's start with the basics. Text generation can be deterministic or stochastic, depending on the
do_sample flag. By default it is set to False, causing the output to be deterministic, which is also known as
Greedy Decoding.
When it is set to True, also known as Sampling, the output will be stochastic, but you can still
obtain reproducible results through the seed argument (with the same format as in stateless TensorFlow random
number generation).
As a rule of thumb, you want deterministic generation if you wish
to obtain factual information from the model and stochastic generation if you are aiming at more creative outputs.



from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
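# GPT-2 has no pad token by default, so reuse the EOS token id for padding during generation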
model.config.pad_token_id = model.config.eos_token_id
inputs = tokenizer(["TensorFlow is"], return_tensors="tf")

generated = model.generate(**inputs, do_sample=True, seed=(42, 0))
print("Sampling output: ", tokenizer.decode(generated[0]))


Depending on the target application, longer outputs might be desirable. You can control the length of the generated
output with max_new_tokens, keeping in mind that longer generations will require more resources.

generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), max_new_tokens=5
)
print("Limiting to five recent tokens:", tokenizer.decode(generated[0]))

generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), max_new_tokens=30
)
print("Limiting to 30 recent tokens:", tokenizer.decode(generated[0]))


Sampling has a few knobs you can play with to control randomness. The most important is temperature, which sets the overall entropy
of your output: values below 1.0 will prioritize sampling tokens with a higher likelihood, whereas values above 1.0
do the opposite. Setting it to 0.0 reduces the behavior to Greedy Decoding, whereas very large values approximate
uniform sampling.

generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), temperature=0.7
)
print("Temperature 0.7: ", tokenizer.decode(generated[0]))

generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), temperature=1.5
)
print("Temperature 1.5: ", tokenizer.decode(generated[0]))


Contrarily to Sampling, Greedy Decoding will always pick the most likely token at each iteration of generation.
However, it often results in sub-optimal outputs. You can increase the quality of the results through the num_beams
argument. When it is larger than 1, it triggers Beam Search, which continuously explores high-probability sequences.
This exploration comes at the cost of additional resources and computational time.

generated = model.generate(**inputs, num_beams=2)
print("Beam Search output:", tokenizer.decode(generated[0]))


Finally, when running Sampling or Beam Search, you can use num_return_sequences to return several sequences. For
Sampling it is equivalent to running generate multiple times from the same input prompt, while for Beam Search it
returns the highest scoring generated beams in descending order.

generated = model.generate(**inputs, num_beams=2, num_return_sequences=2)
print(
    "All generated hypotheses:",
    "n".join(tokenizer.decode(out) for out in generated)
)




The basics of text generation, as you can see, are straightforward to control. However, there are many options
not covered in the examples above, and you are encouraged to read the
documentation
for advanced use cases.
Sadly, when you run generate with TensorFlow, you might notice that it takes a while to execute.
If your target application expects low latency or a large amount of input prompts, running text generation with
TensorFlow looks like an expensive endeavour. 😬

Fear not, for the remainder of this blog post aims to demonstrate that one line of code can make a drastic improvement.
If you would rather jump straight into action,
the colab
has an interactive example you can fiddle with!



TensorFlow and XLA

XLA, or Accelerated Linear Algebra, is a compiler originally developed to accelerate
TensorFlow models. Nowadays, it is also the compiler behind JAX, and it can even
be used with PyTorch. Although the word "compiler" might sound daunting for
some, XLA is simple to use with TensorFlow: it comes packaged inside the tensorflow library, and it can be
triggered with the jit_compile argument in any graph-creating function.

For those of you familiar with TensorFlow 1 🧓, the concept of a TensorFlow graph comes naturally, as it was the only
mode of operation. First, you defined the operations in a declarative fashion to create a graph. Afterwards, you could
pipe inputs through the graph and observe the outputs. Fast, efficient, but painful to debug. With TensorFlow 2 came
Eager Execution and the ability to code the models imperatively; the TensorFlow team explains the difference in more
detail in their blog post.

Hugging Face writes their TensorFlow models with Eager Execution in mind. Transparency is a core value, and being able
to inspect the model internals at any point is very beneficial to that end. However, that does mean that some uses of
the models do not benefit from the graph mode performance advantages out of the box (e.g. when calling model(args)).

Fortunately, the TensorFlow team has users like us covered 🥳! Wrapping a function containing TensorFlow code with
tf.function will attempt to convert it into a graph when
you call the wrapped function. If you are training a model, calling model.compile() (without run_eagerly=True) does
precisely that wrapping, so that you benefit from graph mode when you call model.fit(). Since tf.function can be
used in any function containing TensorFlow code, it means you can use it on functions that go beyond model inference,
creating a single optimized graph.

Now that you know how to create TensorFlow graphs, compiling them with XLA is straightforward: simply add jit_compile=True
as an argument to the functions mentioned above (tf.function and tf.keras.Model.compile). Assuming everything went well
(more on that below) and that you are using a GPU or a TPU, you will notice that the first call will take a while, but
that the remaining ones are much, much faster. Here's a simple example of a function that performs model inference and some post-processing of its outputs:


import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer(["TensorFlow is"], return_tensors="tf")

def most_likely_next_token(inputs):
    model_output = model(inputs)
    return tf.argmax(model_output.logits[:, -1, :], axis=-1)

print("Calling regular function with TensorFlow code...")
most_likely_next_token(inputs)

In a single line, you can create an XLA-accelerated function from the function above.

xla_most_likely_next_token = tf.function(most_likely_next_token, jit_compile=True)

print("Calling XLA function... (for the primary time -- shall be slow)")
xla_most_likely_next_token(inputs)

print("Calling XLA function... (for the second time -- shall be fast)")
xla_most_likely_next_token(inputs)
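
The same jit_compile flag also works through the tf.keras.Model.compile route mentioned above, in which case model.fit() runs XLA-compiled training steps. Below is a minimal sketch of that path; the optimizer and learning rate are illustrative choices, not a recommendation from the original post.

import tensorflow as tf
from transformers import TFAutoModelForCausalLM

model = TFAutoModelForCausalLM.from_pretrained("gpt2")

# Without run_eagerly=True, compile() wraps the train step in a tf.function;
# jit_compile=True additionally asks XLA to compile that graph.
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5), jit_compile=True)

# model.fit(tf_train_dataset) would now execute XLA-compiled training steps
# (tf_train_dataset being whatever tf.data.Dataset you prepared).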



Text Generation using TensorFlow with XLA

As with any optimization procedure, there is no free lunch, and XLA is no exception. From the perspective of a text
generation user, there is only one technical aspect that you need to keep in mind. Without digging too much into
details, XLA used in this fashion does just-in-time (JIT)
compilation of a tf.function when you call it, which relies on polymorphism.

When you compile a function this way, XLA keeps track of the shape and type of every tensor, as well as the data of
every non-tensor function input. The function is compiled to a binary, and every time it is called with the same tensor
shape and type (with ANY tensor data) and the same non-tensor arguments, the compiled function can be reused.
Contrarily, if you call the function with a different shape or type in an input tensor, or if you use a different
non-tensor argument, then a new costly compilation step will take place. Summarized in a simple example:


import tensorflow as tf

@tf.function(jit_compile=True)
def max_plus_constant(tensor, scalar):
    return tf.math.reduce_max(tensor) + scalar


# Slow: XLA compilation will kick in, as it is the first call
max_plus_constant(tf.constant([0, 0, 0]), 1)

# Fast: same tensor shape, same tensor type, same non-tensor argument -- the compiled binary is reused
max_plus_constant(tf.constant([1000, 0, -10]), 1)

# Slow: different tensor type triggers a new compilation
max_plus_constant(tf.constant([0, 0, 0], dtype=tf.int64), 1)

# Slow: different tensor shape triggers a new compilation
max_plus_constant(tf.constant([0, 0, 0, 0]), 1)

# Slow: different non-tensor argument triggers a new compilation
max_plus_constant(tf.constant([0, 0, 0]), 2)

In practice, for text generation, it simply means the input should be padded to a multiple of a certain length (so it
has a limited number of possible shapes), and that using different options will be slow the first time you use
them. Let's see what happens when you naively call generation with XLA.


import time
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM




# Decoder-only models like GPT-2 use the last token for predictions, so padding should be on the left side
tokenizer = AutoTokenizer.from_pretrained(
    "gpt2", padding_side="left", pad_token="</s>"
)
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id
input_1 = ["TensorFlow is"]
input_2 = ["TensorFlow is a"]


# One line to create an XLA-compiled generation function
xla_generate = tf.function(model.generate, jit_compile=True)


tokenized_input_1 = tokenizer(input_1, return_tensors="tf")  
tokenized_input_2 = tokenizer(input_2, return_tensors="tf")  
print(f"`tokenized_input_1` shape = {tokenized_input_1.input_ids.shape}")
print(f"`tokenized_input_2` shape = {tokenized_input_2.input_ids.shape}")

print("Calling XLA generation with tokenized_input_1...")
print("(shall be slow because it is the primary call)")
start = time.time_ns()
xla_generate(**tokenized_input_1)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} msn")


print("Calling XLA generation with tokenized_input_2...")
print("(has a special length = will trigger tracing again)")
start = time.time_ns()
xla_generate(**tokenized_input_2)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} msn")

Oh no, that is terribly slow! A solution to keep the different combinations of shapes in check is through padding,
as mentioned above. The tokenizer classes have a pad_to_multiple_of argument that can be used to achieve a balance
between accepting any input length and limiting tracing.

padding_kwargs = {"pad_to_multiple_of": 8, "padding": True}
tokenized_input_1_with_padding = tokenizer(
    input_1, return_tensors="tf", **padding_kwargs
)  
tokenized_input_2_with_padding = tokenizer(
    input_2, return_tensors="tf", **padding_kwargs
)  
print(
    "`tokenized_input_1_with_padding` shape = ",
    f"{tokenized_input_1_with_padding.input_ids.shape}"
)
print(
    "`tokenized_input_2_with_padding` shape = ",
    f"{tokenized_input_2_with_padding.input_ids.shape}"
)

print("Calling XLA generation with tokenized_input_1_with_padding...")
print("(slow, first time running with this length)")
start = time.time_ns()
xla_generate(**tokenized_input_1_with_padding)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} msn")


print("Calling XLA generation with tokenized_input_2_with_padding...")
print("(shall be fast!)")
start = time.time_ns()
xla_generate(**tokenized_input_2_with_padding)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} msn")

That is much better: successive generation calls performed this way will be orders of magnitude faster than before.
Keep in mind that trying new generation options, at any point, will trigger tracing.

print("Calling XLA generation with the identical input, but with recent options...")
print("(slow again)")
start = time.time_ns()
xla_generate(**tokenized_input_1_with_padding, num_beams=2)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} msn")

From a developer perspective, relying on XLA implies being aware of a few additional nuances. XLA shines when the size
of the data structures is known in advance, such as in model training. On the other hand, when their dimensions are
impossible to determine or certain dynamic slices are used, XLA fails to compile. Modern implementations of text
generation are auto-regressive, whose natural behavior is to expand tensors and to abruptly interrupt some operations
as they go; in other words, not XLA-friendly by default.
We have rewritten our entire TensorFlow text generation codebase
to vectorize operations and use fixed-sized
structures with padding. Our NLP models were also modified to correctly use their positional embeddings in the
presence of padded structures. The result should be invisible to TensorFlow text generation users, except for the
availability of XLA compilation.
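
As a purely illustrative aside (a sketch, not the actual transformers implementation), one way to keep positional embeddings correct with left-padded, fixed-size inputs is to derive the position ids from the attention mask, so that padding tokens do not shift the positions of the real tokens:

import tensorflow as tf

# Hypothetical batch: 2 left-padding tokens followed by 3 real tokens
attention_mask = tf.constant([[0, 0, 1, 1, 1]])

# Count only real tokens, then make positions 0-based
position_ids = tf.cumsum(attention_mask, axis=-1) - 1
# Clamp the padding positions (currently -1) to 0; they are masked out anyway
position_ids = tf.maximum(position_ids, 0)
print(position_ids)  # [[0 0 0 1 2]]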



Benchmarks and Conclusions

Above you saw that you can convert TensorFlow functions into a graph and accelerate them with XLA compilation.
Current forms of text generation are simply an auto-regressive function that alternates between a model forward pass
and some post-processing, producing one token per iteration. Through XLA compilation, the entire process gets
optimized, resulting in faster execution. But how much faster? The Gradio demo below contains some benchmarks
comparing Hugging Face's text generation on multiple GPU models for the two main ML frameworks, TensorFlow and PyTorch.

If you explore the results, two conclusions quickly become visible:

  1. As this blog post has been building up to, TensorFlow text generation is much faster when XLA is used. We are
    talking about speedups larger than 100x in some cases, which truly demonstrates the power of a compiled graph 🚀
  2. TensorFlow text generation with XLA is the fastest option in the vast majority of cases, in some of them by as
    much as 9x faster, debunking the myth that PyTorch is the go-to framework for serious NLP tasks 💪

Give the colab
a go, and enjoy the power of text generation supercharged with XLA!


