Current long-context large language models (LLMs) can process inputs of up to 100,000 tokens, yet they struggle to generate outputs exceeding even a modest length of 2,000 words. Controlled experiments reveal that a model's effective generation length is inherently limited by the examples seen during supervised fine-tuning (SFT). In other words, this output limitation stems from the scarcity of long-output examples in existing SFT datasets.
Recent advancements in long-context LLMs have led to the development of models with significantly expanded memory capacities, capable of processing histories exceeding 100,000 tokens in length. However, despite their ability to handle extensive inputs, current long-context LLMs struggle to generate equally lengthy outputs.
To explore this limitation, LongWriter probes the maximum output length of state-of-the-art long-context models with multiple queries that require responses of various lengths, such as “Write a 10,000-word article on the history of the Roman Empire.” The results show that all models consistently fail to produce outputs beyond 2,000 words in length. Meanwhile, analysis of user interaction logs reveals that over 1% of user prompts explicitly request outputs exceeding this limit, highlighting a pressing need in current research to overcome this limitation.
To address this, LongWriter introduces AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, LongWriter constructs LongWriter-6k, a dataset containing 6,000 SFT data samples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, LongWriter successfully scales the output length of existing models to over 10,000 words while maintaining output quality.
LongWriter also develops LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. The 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models.
In this article, we will discuss the LongWriter framework, explore its architecture, and compare its performance against state-of-the-art long-context large language models. Let’s start.
Recent advancements in long-context large language models (LLMs) have led to the creation of models with significantly increased memory capacities, capable of processing histories that exceed 100,000 tokens. Despite this ability to handle extensive inputs, current long-context LLMs struggle to generate outputs of comparable length. To investigate this limitation, LongWriter examines the maximum output length of state-of-the-art long-context models through various queries that require different response lengths, such as “Write a 10,000-word article on the history of the Roman Empire.” Based on the findings, LongWriter observes that all models consistently fail to generate outputs longer than 2,000 words. Moreover, an analysis of user interaction logs indicates that over 1% of user prompts specifically request outputs beyond this limit, highlighting an urgent need in current research to address this issue.
LongWriter’s study reveals a key insight: the constraint on output length is primarily rooted in the characteristics of the Supervised Fine-Tuning (SFT) datasets. Specifically, LongWriter finds that a model’s maximum generation length is effectively capped by the upper limit of output lengths present in its SFT dataset, despite its exposure to much longer sequences during the pretraining phase. This finding explains the ubiquitous 2,000-word generation limit across current models, as existing SFT datasets rarely contain examples exceeding this length. Moreover, as many datasets are distilled from state-of-the-art LLMs, they also inherit the output length limitation from their source models.
To address this limitation, LongWriter introduces AgentWrite, a novel agent-based pipeline designed to leverage off-the-shelf LLMs to automatically construct extended, coherent outputs. AgentWrite operates in two stages: first, it crafts a detailed writing plan outlining the structure and target word count for each paragraph based on the user’s input; then, following this plan, it prompts the model to generate content for each paragraph in a sequential manner. LongWriter’s experiments validate that AgentWrite can produce high-quality and coherent outputs of up to 20,000 words.
Building upon the AgentWrite pipeline, LongWriter leverages GPT-4o to generate 6,000 long-output SFT samples, named LongWriter-6k, and adds this data to train existing models. Notably, LongWriter-6k successfully unlocks the model’s ability to generate well-structured outputs exceeding 10,000 words in length. To rigorously evaluate the effectiveness of this approach, LongWriter develops the LongBench-Write benchmark, which includes a diverse set of user writing instructions with output length specifications of 0-500 words, 500-2,000 words, 2,000-4,000 words, and beyond 4,000 words. Evaluation on LongBench-Write shows that LongWriter’s 9B model achieves state-of-the-art performance, even compared to much larger proprietary models. LongWriter further constructs preference data and uses DPO to help the model better follow long writing instructions and generate higher-quality written content, which has also been proven effective through experiments.
To summarize, LongWriter’s work makes the following contributions:
- Analysis of Generation Length Limits: LongWriter identifies the primary factor limiting the output length of current long-context LLMs: the constraint on output length in the SFT data.
- AgentWrite: To overcome this limitation, LongWriter proposes AgentWrite, which uses a divide-and-conquer approach with off-the-shelf LLMs to automatically construct SFT data with ultra-long outputs. Using this method, LongWriter constructs the LongWriter-6k dataset.
- Scaling Output Window Size of Current LLMs: LongWriter incorporates the LongWriter-6k dataset into its SFT data, successfully scaling the output window size of existing models to 10,000+ words without compromising output quality. LongWriter shows that DPO further enhances the model’s long-text writing capabilities.
AgentWrite: Automatic Data Construction
To utilize off-the-shelf LLMs for automatically generating SFT data with longer outputs, LongWriter designs AgentWrite, a divide-and-conquer-style agent pipeline. AgentWrite first breaks down long writing tasks into multiple subtasks, with each subtask requiring the model to write only one paragraph. The model then executes these subtasks sequentially, and LongWriter concatenates the subtask outputs to obtain the final long output. Such an approach of breaking down a complex task into multiple subtasks using LLM agents has already been applied in various fields, such as problem-solving, software development, and model evaluation. LongWriter’s work is the first to explore integrating planning to enable models to complete complex long-form writing tasks. Each step of AgentWrite is introduced in detail below.
Step I: Plan
Inspired by the thought process of human writers, who typically start by making an overall plan for long writing tasks, LongWriter utilizes the planning capabilities of LLMs to output such a writing outline given a writing instruction. This plan includes the main content and word count requirements for each paragraph. The prompt used by LongWriter is as follows, with a small code sketch of this step after the prompt:
“I need you to help me break down the following long-form writing instruction into multiple subtasks. Each subtask will guide the writing of one paragraph in the essay and should include the main points and word count requirements for that paragraph. The writing instruction is as follows: {User Instruction}. Please break it down in the following format, with each subtask taking up one line:
Paragraph 1 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 400 words]
Paragraph 2 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 1000 words]. Make sure that each subtask is clear and specific, and that all subtasks cover the entire content of the writing instruction. Do not split the subtasks too finely; each subtask’s paragraph should be at least 200 words and no more than 1000 words. Do not output any other content.”
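Below is a minimal sketch of how Step I could be wired up in Python. It assumes a hypothetical `call_llm(prompt) -> str` helper standing in for whichever off-the-shelf chat LLM is used, and the planning prompt is abbreviated; it is an illustration of the step, not LongWriter's released code.

```python
# A minimal sketch of AgentWrite Step I (Plan). `call_llm` is a hypothetical
# stand-in for any off-the-shelf chat LLM client.
import re


def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a chat LLM and return its text response."""
    raise NotImplementedError


# Abbreviated version of the planning prompt shown above.
PLAN_PROMPT = (
    "I need you to help me break down the following long-form writing instruction "
    "into multiple subtasks. [...] The writing instruction is as follows: "
    "{instruction}. Please break it down in the following format, with each "
    "subtask taking up one line: [...]"
)


def make_plan(instruction: str) -> list[str]:
    """Ask the LLM for a writing outline and split it into per-paragraph subtasks."""
    plan_text = call_llm(PLAN_PROMPT.format(instruction=instruction))
    # Each subtask is expected on its own line, e.g.
    # "Paragraph 1 - Main Point: ... - Word Count: 400 words".
    lines = [line.strip() for line in plan_text.splitlines() if line.strip()]
    return [line for line in lines if re.match(r"Paragraph\s*\d+", line)]
```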
Step II: Write
After obtaining the writing plan from Step I, LongWriter calls the LLM serially to complete each subtask, generating the writing content section by section. To ensure the coherence of the output, when LongWriter calls the model to generate the n-th section, the previously generated n−1 sections are also provided as input, allowing the model to continue writing the next section based on the existing writing history. Although this serial manner prevents parallel calls to the model to complete multiple subtasks concurrently, and the input length becomes longer, LongWriter shows in validation that the overall coherence and quality of the writing obtained this way are far superior to output generated in parallel. The prompt used by LongWriter is shown next, followed by a small sketch of this step:
“You are an excellent writing assistant. I will give you an original writing instruction and my planned writing steps. I will also provide you with the text I have already written. Please help me continue writing the next paragraph based on the writing instruction, writing steps, and the already written text.
Writing instruction:
{User Instruction}
Writing steps:
{The writing plan generated in Step I}
Already written text:
{Previously generated (n−1) paragraphs}
Please integrate the original writing instruction, writing steps, and the already written text, and now continue writing {The plan for the n-th paragraph, i.e., the n-th line in the writing plan}.”
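A minimal sketch of Step II follows, reusing the hypothetical `call_llm` helper from the Step I sketch; the prompt string is abbreviated, and the key point is the serial loop in which each call also receives everything written so far.

```python
# A minimal sketch of AgentWrite Step II (Write), reusing the hypothetical
# `call_llm` helper from the Step I sketch. The prompt is abbreviated.
WRITE_PROMPT = (
    "You are an excellent writing assistant. [...]\n"
    "Writing instruction:\n{instruction}\n"
    "Writing steps:\n{plan}\n"
    "Already written text:\n{written}\n"
    "Please integrate the original writing instruction, writing steps, and the "
    "already written text, and now continue writing {subtask}."
)


def write_long_output(instruction: str, subtasks: list[str]) -> str:
    """Execute the subtasks serially and concatenate the generated paragraphs."""
    plan = "\n".join(subtasks)
    written = ""
    for subtask in subtasks:
        # Serial call: the model also sees the previously generated paragraphs,
        # which keeps the next paragraph coherent with the writing history.
        paragraph = call_llm(WRITE_PROMPT.format(
            instruction=instruction, plan=plan, written=written, subtask=subtask,
        ))
        written = f"{written}\n\n{paragraph}" if written else paragraph
    return written
```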
Validation
LongWriter tests the generation length and quality of the proposed AgentWrite method on two long-form writing datasets. The first one, LongWrite-Ruler, is used to measure exactly how long an output the method can provide. The second, LongBench-Write, is mainly used to evaluate how well the model-generated content aligns with user instructions in terms of length and writing quality.
LongBench-Write: To evaluate the model’s performance on a more diverse range of long-form writing instructions, LongWriter collects 120 varied user writing prompts, with 60 in Chinese and 60 in English. To better assess whether the model’s output length meets user requirements, LongWriter ensures that all these instructions include explicit word count requirements. The instructions are divided into four subsets based on the word count requirements: 0-500 words, 500-2,000 words, 2,000-4,000 words, and over 4,000 words. Moreover, the instructions are categorized into seven types based on the output type: Literature and Creative Writing, Academic and Monograph, Popular Science, Functional Writing, News Report, Community Forum, and Education and Training.
During evaluation, LongWriter adopts two metrics: one for scoring the output length and another for scoring the output quality. The model’s output length is scored based on how close it is to the requirements specified in the instructions. For output quality, LongWriter uses the LLM-as-a-judge approach, selecting the state-of-the-art GPT-4o model to score the output across six dimensions: Relevance, Accuracy, Coherence, Clarity, Breadth and Depth, and Reading Experience. The final score is computed by averaging the length score and the quality score.
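As an illustration, a scoring scheme of this kind can be sketched as follows; the specific decay used for the length score is an assumption rather than LongBench-Write's exact formula, while the averaging of the two scores follows the text above.

```python
# An illustrative sketch of the scoring scheme described above. The concrete
# decay in length_score is an assumption, not the benchmark's exact formula.
def length_score(output_words: int, required_words: int) -> float:
    """Score in [0, 100] that drops as the output drifts from the requirement."""
    deviation = abs(output_words / required_words - 1.0)
    return max(0.0, 100.0 * (1.0 - deviation))


def final_score(output_words: int, required_words: int, quality_score: float) -> float:
    """Average of the length score and the GPT-4o quality score (both on 0-100)."""
    return 0.5 * (length_score(output_words, required_words) + quality_score)
```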
Validation results: LongWriter presents the output length measurement on LongWrite-Ruler and finds that AgentWrite successfully extends the output length of GPT-4o from a maximum of 2k words to roughly 20k words. LongWriter also assesses both the output quality and adherence to the required output length on LongBench-Write; when evaluating AgentWrite’s performance, it notes that GPT-4o on its own can already successfully complete tasks requiring outputs under 2,000 words.
Supervised Fine-Tuning
LongWriter conducts training based on two of the latest open-source models, namely GLM-4-9B and Llama-3.1-8B. Both of these are base models and support a context window of up to 128k tokens, making them naturally suitable for training on long outputs. To make training more efficient, LongWriter adopts packing training with loss weighting. Training the two models results in two models: LongWriter-9B (short for GLM-4-9B-LongWriter) and LongWriter-8B (short for Llama-3.1-8B-LongWriter).
At the same time, LongWriter notices that if the loss is averaged by sequence, i.e., taking the mean of each sequence’s average loss within a batch, the contribution of each target token to the loss in long-output data would be significantly lower than in data with shorter outputs. In LongWriter’s experiments, this is also found to lead to suboptimal model performance on tasks with long outputs. Therefore, LongWriter chooses a loss weighting strategy that averages the loss by token, where the loss is computed as the mean of the losses across all target tokens within that batch.
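The difference between the two averaging strategies can be made concrete with a short PyTorch-style sketch (tensor names and shapes are illustrative, not LongWriter's actual training code): given per-token cross-entropy losses and a mask marking target tokens, sequence averaging weights each sequence equally, while token averaging weights each target token equally.

```python
# A sketch contrasting the two loss-averaging strategies; shapes and names are
# illustrative. token_losses and target_mask have shape [batch, seq_len].
import torch


def loss_by_sequence(token_losses: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Mean of each sequence's average loss: every sequence counts equally, so
    tokens in long-output samples are down-weighted."""
    per_sequence = (token_losses * target_mask).sum(dim=1) / target_mask.sum(dim=1).clamp(min=1)
    return per_sequence.mean()


def loss_by_token(token_losses: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Mean over all target tokens in the batch: every target token counts
    equally, which is the weighting LongWriter adopts."""
    return (token_losses * target_mask).sum() / target_mask.sum().clamp(min=1)
```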
All models are trained on a node with 8x H800 80G GPUs using DeepSpeed with ZeRO-3 and CPU offloading. LongWriter uses a batch size of 8, a learning rate of 1e-5, and a packing length of 32k. The models are trained for 4 epochs, which takes roughly 2,500-3,000 steps.
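For reference, these reported hyperparameters can be collected into a plain config sketch; the field names are assumptions, and the actual launcher scripts and DeepSpeed JSON are not part of the text.

```python
# The reported SFT setup gathered into an illustrative config dict; field names
# are assumptions, values come from the text above.
sft_config = {
    "hardware": "1 node, 8x H800 80G",
    "parallelism": "DeepSpeed + ZeRO-3 + CPU offloading",
    "batch_size": 8,
    "learning_rate": 1e-5,
    "packing_length": "32k tokens",
    "epochs": 4,  # roughly 2,500-3,000 steps in total
}
```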
Alignment (DPO)
To further improve the model’s output quality and enhance its ability to follow length constraints in instructions, LongWriter performs direct preference optimization (DPO) on the supervised fine-tuned LongWriter-9B model. The DPO data comes from GLM-4’s chat DPO data (roughly 50k entries). Additionally, LongWriter constructs 4k pairs of data specifically targeting long-form writing instructions. For each writing instruction, LongWriter samples four outputs from LongWriter-9B and scores these outputs using a specific scoring method, also combining a length-following score into the overall score. The highest-scoring output is then chosen as the positive sample, and one of the remaining three outputs is randomly chosen as the negative sample.
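A hedged sketch of how such preference pairs could be assembled is shown below; `sample_output` and `score_output` are hypothetical placeholders for sampling from LongWriter-9B and for the combined quality/length scoring, not released code.

```python
# A sketch of the long-writing preference-pair construction described above.
# `sample_output` and `score_output` are hypothetical placeholders.
import random


def sample_output(instruction: str) -> str:
    """Placeholder: sample one completion from the SFT model (LongWriter-9B)."""
    raise NotImplementedError


def score_output(output: str) -> float:
    """Placeholder: quality score combined with a length-following score."""
    raise NotImplementedError


def build_preference_pair(instruction: str) -> dict:
    """Best of four samples becomes 'chosen'; a random other becomes 'rejected'."""
    candidates = sorted(
        (sample_output(instruction) for _ in range(4)), key=score_output, reverse=True
    )
    return {
        "prompt": instruction,
        "chosen": candidates[0],
        "rejected": random.choice(candidates[1:]),
    }
```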
The resulting model, LongWriter-9B-DPO, is trained for 250 steps on the above data mixture, following a specific recipe for DPO training.
LongWriter: Experiments and Results
LongWriter evaluates 4 proprietary models and 5 open-source models on LongBench-Write, along with the trained LongWriter models. To the best of LongWriter’s knowledge, Suri-IORPO is the only prior model that is also aligned for long-form text generation; it is trained based on Mistral-7B-Instruct-v0.2 using LoRA. Consistent with the evaluation setup on LongWrite-Ruler, LongWriter sets the output temperature to 0.5 and configures each model’s max generation tokens parameter to the maximum allowed by its API. For open-source models, it is set to 32,768.
Most previous models are unable to satisfy the length requirement of over 2,000 words, while LongWriter models consistently provide longer and richer responses to such prompts.
Observing the output length score S_l for prompts in each required length range, LongWriter finds that previous models generally perform poorly (scoring below 70) on prompts in the [2k, 4k) range, with only Claude 3.5 Sonnet achieving a decent score. For prompts in the [4k, 20k) range, virtually all previous models are completely unable to reach the target output length, even scoring 0 (meaning all output lengths are less than one-third of the required length). By adding training data from LongWriter-6k, LongWriter’s trained model can effectively reach the required output length while maintaining good quality, as suggested by the scores in the [2k, 20k) range and the scatter plots.
DPO effectively improves both the model’s output quality and its ability to follow length requirements in long generation.
By comparing the scores of LongWriter-9B and LongWriter-9B-DPO, we find that DPO significantly improves both the S_l (+4%) and S_q (+3%) scores, and the improvement is consistent across all ranges. This shows that in long-generation scenarios, DPO still helps to improve the model’s output quality and can better align the model’s output length with the requested length. The latter conclusion has also been recently observed in Yuan et al. (2024) for shorter generations. We also manually annotate pairwise wins and losses for GPT-4o and the three LongWriter models on their outputs in LongBench-Write and visualize the results in Figure 9. We can see that humans prefer the DPO-trained model over LongWriter-9B in 58% of the cases. Furthermore, despite having fewer parameters, LongWriter-9B-DPO achieves a tie with GPT-4o.
(Figure 7 of the paper reports the cumulative average NLL loss of GLM-4-9B and Llama-3.1-8B at different positions of the LongWriter models’ outputs; Figure 8 reports LongWrite-Ruler test results of the LongWriter models, showing maximum generation lengths between 10k and 20k words.)
The output length limit of the LongWriter models is extended to between 10k and 20k words, while more data with long outputs is required to support even longer outputs.
Following the LongWrite-Ruler test setup, we also present the LongWrite-Ruler test results of the LongWriter models. The results suggest that their maximum generation lengths are between 10k and 20k words. The lack of SFT data with longer outputs is likely the primary reason preventing the models from achieving even longer output lengths.
Final Thoughts
In this work, we have discussed LongWriter, which identifies a 2,000-word generation limit for current LLMs and proposes increasing their output window size by adding long-output data during alignment. To automatically construct long-output data, LongWriter develops AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks and uses off-the-shelf LLMs to create extended, coherent outputs. LongWriter successfully scales the output window size of current LLMs to over 10,000 words with the constructed LongWriter-6k dataset. Extensive ablation studies on the training data demonstrate the effectiveness of this approach. For future work, LongWriter suggests the following three directions: 1. Expand the AgentWrite framework to construct data with even longer outputs to further extend LLMs’ output window size. 2. Refine the AgentWrite framework to achieve higher-quality long-output data. 3. Longer model outputs bring challenges to inference efficiency; several methods have been proposed to improve inference efficiency, and it is worth investigating how these methods can ensure improved model efficiency without compromising generation quality.