Why Language Models Get ‘Lost’ in Conversation


A new paper from Microsoft Research and Salesforce finds that even the most capable Large Language Models (LLMs) crumble when instructions are delivered in stages rather than all at once. The authors found that performance drops by an average of 39 percent across six tasks when a prompt is split across multiple turns:

Source: https://arxiv.org/pdf/2505.06120

More strikingly, the reliability of responses takes a nosedive, with prestigious models such as GPT-4.1 and Gemini 2.5 Pro swinging between near-perfect answers and manifest failures, depending on how the same task is phrased; further, output consistency can drop by more than half in the process.

To explore this behavior, the paper introduces a method called sharding, which splits fully-specified prompts into smaller fragments and releases them one at a time into a conversation.

In the most basic terms, this is akin to giving a cohesive and comprehensive single order at a restaurant, leaving the waiter with nothing to do but acknowledge the request; or else deciding to work through the matter collaboratively:

Two extreme versions of a restaurant conversation (not from the new paper, for illustrative purposes only).

For emphasis, the example above perhaps puts the customer in a negative light. But the core idea depicted in the second column is that of a transactional exchange that clarifies a problem-set before addressing the issues – apparently a rational and reasonable way of approaching a task.

This setup is reflected in the new work's drip-fed, sharded approach to LLM interaction. The authors note that LLMs often generate overly long responses and then continue to rely on their own insights, even when those were based on incomplete information. This tendency, combined with other factors, can cause the system to lose track of the exchange entirely.

In fact, the researchers note what many of us have found anecdotally – that the best way to get the conversation back on track is to start a new conversation with the LLM.

The authors acknowledge that agentic systems such as Autogen or LangChain can potentially improve the results by acting as interpretative layers between the end-user and the LLM, only communicating with the LLM once they have gathered enough 'sharded' responses to coagulate into a single cohesive query (which the end-user will never be exposed to).

However, the authors contend that a separate abstraction layer should not be necessary, or else should be built directly into the source LLM:

But having tested the proposition across their array of examples, they conclude:

This interesting new paper is titled LLMs Get Lost in Multi-Turn Conversation, and comes from four researchers across MS Research and Salesforce.

Fragmented Conversations

The new method first breaks down conventional single-turn instructions into smaller shards, designed to be introduced at key moments during an LLM interaction – a structure that reflects the exploratory, back-and-forth style of engagement seen in systems such as ChatGPT or Google Gemini.

Each original instruction is a single, self-contained prompt that delivers the entire task in one go, combining a high-level question, supporting context, and any relevant conditions. The sharded version breaks this into multiple smaller parts, with each shard adding just one piece of information:

Paired instructions showing (a) a complete prompt delivered in a single turn and (b) its sharded version used to simulate an underspecified, multi-turn interaction. Semantically, each version delivers the same informational payload.

The first shard always introduces the main goal of the task, while the rest provide clarifying details. Together, they deliver the same content as the original prompt, but unfolded naturally over several turns of the conversation.
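As a rough illustration of this pairing, a fully-specified instruction and a sharded counterpart might be represented as follows; the task and its wording are invented for this sketch and are not taken from the paper's data.

```python
# Hypothetical example of a fully-specified instruction and a sharded
# equivalent; the task and wording are invented for illustration only.

full_instruction = (
    "Write a Python function that returns the n-th Fibonacci number. "
    "Treat n=0 and n=1 as base cases, raise a ValueError for negative "
    "input, and make sure it runs in O(n) time."
)

# The first shard states the high-level goal; each later shard adds one detail.
shards = [
    "I need a Python function that returns the n-th Fibonacci number.",
    "It should treat n=0 and n=1 as base cases.",
    "Negative input should raise a ValueError.",
    "One more thing: it needs to run in O(n) time.",
]
```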

Each simulated conversation unfolds between three components: the assistant, the model under evaluation; the user, a simulated agent with access to the full instruction in sharded form; and the system, which invigilates and scores the exchange.

The conversation begins with the user revealing the first shard and the assistant replying freely. The system then classifies that response into one of several categories, such as a clarification request or an answer attempt.

If the model attempts an answer, a separate component extracts just the relevant span for evaluation, ignoring any surrounding text. On each new turn, the user reveals one additional shard, prompting another response. The exchange continues until either the model gets the answer right or there are no shards left to reveal:

Diagram of a sharded conversation simulation, with the evaluated model highlighted in red.
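Put together, the exchange can be pictured as a loop along the following lines; every object here (assistant, user simulator, judge, evaluator) is a hypothetical stand-in for the components described in the surrounding paragraphs, not the paper's actual code.

```python
# Minimal sketch of the sharded-conversation loop; all components are
# hypothetical stand-ins for those described in the paper.

def run_sharded_conversation(assistant, user_simulator, judge, evaluator, shards):
    history = []
    remaining = list(shards)

    # The first shard (the high-level goal) opens the conversation.
    history.append({"role": "user", "content": remaining.pop(0)})

    while True:
        reply = assistant.respond(history)            # model under evaluation
        history.append({"role": "assistant", "content": reply})

        # A judge classifies the reply (e.g. clarification request vs. answer
        # attempt) and, for attempts, extracts just the answer span to score.
        if judge.classify(reply) == "answer_attempt":
            answer = judge.extract_answer(reply)
            if evaluator.is_correct(answer):
                return {"status": "solved", "turns": len(history)}

        if not remaining:
            return {"status": "failed", "turns": len(history)}

        # The user simulator picks which shard to reveal next, rephrased to
        # fit the conversation so far.
        index, text = user_simulator.choose_and_rephrase(remaining, history)
        remaining.pop(index)
        history.append({"role": "user", "content": text})
```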

Early tests showed that models often asked about information that hadn't been shared yet, so the authors dropped the idea of revealing shards in a fixed order. Instead, a simulator was used to decide which shard to reveal next, based on how the conversation was going.

The user simulator, implemented using GPT-4o-mini, was therefore given full access both to the entire instruction and to the conversation history, and tasked with deciding, at each turn, which shard to reveal next, based on how the exchange was unfolding.

The user simulator also rephrased each shard to maintain conversational flow, without altering its meaning. This allowed the simulation to reflect the 'give-and-take' of real dialogue, while preserving control over the task structure.

Before the conversation begins, the assistant is given only the basic information needed to complete the task, such as a database schema or an API reference. It is not told that the instructions will be broken up, and it is not guided toward any specific way of handling the conversation. This is done on purpose: in real-world use, models are almost never told that a prompt will be incomplete or updated over time, and leaving out this context helps the simulation reflect how the model behaves under realistic conditions.

GPT-4o-mini was also used to decide how the model's replies should be classified, and to pull out any final answers from those replies. This kept the simulation flexible, but introduced occasional mistakes: however, after checking several hundred conversations by hand, the authors found that fewer than five percent had any problems, and fewer than two percent showed a change in outcome because of them; they considered this a low enough error rate within the parameters of the project.
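As a sketch of how such a GPT-4o-mini 'judge' call might look (the prompt wording and category labels below are assumptions for illustration, not the paper's actual prompts):

```python
# Hypothetical GPT-4o-mini judge for classifying assistant replies; the
# prompt and label set are illustrative assumptions, not the paper's own.
from openai import OpenAI

client = OpenAI()

LABELS = ["clarification_request", "answer_attempt", "discussion", "refusal"]

def classify_reply(reply_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Classify the assistant reply into exactly one of: "
                           + ", ".join(LABELS) + ". Answer with the label only.",
            },
            {"role": "user", "content": reply_text},
        ],
    )
    return response.choices[0].message.content.strip()
```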

Simulation Scenarios

The authors used five types of simulation to test model behavior under different conditions, each a variation on how and when parts of the instruction are revealed.

In the Full setting, the model receives the entire instruction in a single turn. This represents the standard benchmark format and serves as the performance baseline.

The Sharded setting breaks the instruction into multiple pieces and delivers them one at a time, simulating a more realistic, underspecified conversation. This is the main setting used to test how well models handle multi-turn input.

In the Concat setting, the shards are stitched back together as a single list, preserving their wording but removing the turn-by-turn structure. This helps isolate the effects of conversational fragmentation from rephrasing or content loss.

The Recap setting runs like Sharded, but adds a final turn in which all previous shards are restated before the model gives its final answer. This tests whether a summary prompt can help recover lost context.

Finally, Snowball goes further, repeating every previously revealed shard at each new turn, keeping the full instruction visible as the conversation unfolds – and offering a more forgiving test of multi-turn ability.

Simulation types based on sharded instructions. A fully-specified prompt is split into smaller parts, which can then be used to simulate either single-turn (Full, Concat) or multi-turn (Sharded, Recap, Snowball) conversations, depending on how quickly the information is revealed.
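One way to picture how the five settings differ is in terms of the user turns each one generates from the same shard list. The construction below is an interpretation of the descriptions above (with the joined shards standing in for the original full prompt), not the paper's implementation.

```python
# Hypothetical construction of user turns for each simulation setting,
# based on the descriptions above rather than the paper's code.

def build_turns(setting: str, shards: list[str]) -> list[str]:
    if setting == "full":
        # A single turn; the joined shards stand in for the original prompt.
        return [" ".join(shards)]
    if setting == "concat":
        # A single turn listing every shard, minus the turn-by-turn structure.
        return ["Here is everything at once:\n- " + "\n- ".join(shards)]
    if setting == "sharded":
        # One shard revealed per turn.
        return list(shards)
    if setting == "recap":
        # Sharded, plus a final turn restating all previous shards.
        return list(shards) + ["To recap: " + " ".join(shards)]
    if setting == "snowball":
        # Each turn repeats every shard revealed so far plus the new one.
        return [" ".join(shards[: i + 1]) for i in range(len(shards))]
    raise ValueError(f"unknown setting: {setting}")
```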

Tasks and Metrics

Six generation tasks were chosen to cover both programming and natural language domains: code generation prompts were taken from HumanEval and LiveCodeBench; text-to-SQL queries were sourced from Spider; API calls were constructed using data from the Berkeley Function Calling Leaderboard; elementary math problems were provided by GSM8K; tabular captioning tasks were based on ToTTo; and multi-document summaries were drawn from the Summary of a Haystack dataset.

Model performance was measured using three core metrics: average performance, aptitude, and unreliability.

Average performance captured how well a model did overall across multiple attempts; aptitude reflected the best results a model could reach, based on its top-scoring outputs; and unreliability measured how much those results varied, with larger gaps between best and worst outcomes indicating less stable behavior.

All scores were placed on a 0-100 scale to ensure consistency across tasks, and metrics were computed for each instruction, then averaged to give an overall picture of model performance.
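For a single instruction, these metrics might be computed along the following lines; using the top and bottom deciles as stand-ins for 'best' and 'worst' outcomes is an assumption of this sketch, not the paper's exact estimator.

```python
# Illustrative per-instruction metrics over repeated simulations (scores 0-100).
# The decile-based 'best'/'worst' choice is an assumption of this sketch.
import statistics

def instruction_metrics(scores: list[float]) -> dict[str, float]:
    deciles = statistics.quantiles(scores, n=10)   # nine decile boundaries
    best, worst = deciles[-1], deciles[0]
    return {
        "average_performance": statistics.mean(scores),
        "aptitude": best,                 # best results the model tends to reach
        "unreliability": best - worst,    # gap between best and worst outcomes
    }

# Example: ten simulated conversations for one instruction.
print(instruction_metrics([90, 35, 88, 40, 92, 85, 30, 87, 45, 91]))
```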

Six sharded tasks used in the experiments, covering both programming and natural language generation. Each task is shown with a fully-specified instruction and its sharded version. Between 90 and 120 instructions were adapted from established benchmarks for each task.

Contenders and Tests

In the initial simulations (with an estimated cost of $5,000), 600 instructions spanning six tasks were sharded and used to simulate three conversation types: Full, Concat, and Sharded. For each combination of model, instruction, and simulation type, ten conversations were run, producing over 200,000 simulations in total – a scheme that made it possible to capture both overall performance and deeper measures of aptitude and reliability.

Fifteen models were tested, spanning a wide selection of providers and architectures: the OpenAI models GPT-4o (version 2024-11-20), GPT-4o-mini (2024-07-18), GPT-4.1 (2025-04-14), and the 'thinking' model o3 (2025-04-16).

Anthropic models were Claude 3 Haiku (2024-03-07) and Claude 3.7 Sonnet (2025-02-19), accessed via Amazon Bedrock.

Google contributed Gemini 2.5 Flash (preview-04-17) and Gemini 2.5 Pro (preview-03-25). Meta models were Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct, as well as Llama 4 Scout-17B-16E, via Together AI.

The other entries were OLMo 2 13B, Phi-4, and Command-A, all accessed locally via Ollama or the Cohere API; and Deepseek-R1, accessed through Amazon Bedrock.

For the two 'thinking' models (o3 and R1), token limits were raised to 10,000 to accommodate longer reasoning chains:

Average performance scores for each model across six tasks: code, database, actions, data-to-text, math, and summary. Results are shown for three simulation types: full, concat, and sharded. Models are ordered by their average full-setting score. Shading reflects the degree of performance drop from the full setting, with the final two columns reporting average declines for concat and sharded relative to full.

Regarding these results, the authors state:

on the very same tasks

Concat scores averaged 95 percent of Full scores, indicating that the performance drop in the Sharded setting cannot be explained by information loss. Smaller models such as Llama3.1-8B-Instruct, OLMo-2-13B, and Claude 3 Haiku showed more pronounced degradation under Concat, suggesting that smaller models are generally less robust to rephrasing than larger ones.

The authors observe:

The initial test indicates that some models held up better in specific tasks: Command-A on Actions; Claude 3.7 Sonnet and GPT-4.1 on Code; and Gemini 2.5 Pro on Data-to-Text, indicating that multi-turn ability varies by domain. Reasoning models such as o3 and Deepseek-R1 fared no better overall, perhaps because their longer replies introduced more assumptions, which tended to confuse the conversation.

Reliability

The connection between aptitude and reliability, clear in single-turn simulations, appeared to break down under multi-turn conditions. While aptitude declined only modestly, unreliability more than doubled on average. Models that were stable in full-format prompts, such as GPT-4.1 and Gemini 2.5 Pro, became just as erratic as weaker models like Llama3.1-8B-Instruct or OLMo-2-13B once the instruction was fragmented.

Overview of aptitude and unreliability as shown in a box plot (a), followed by reliability outcomes from experiments with fifteen models (b), and results from the gradual sharding test where instructions were split into one to eight shards (c).

Model responses often varied by as much as 50 points on the same task, even when nothing new was added, suggesting that the drop in performance was not due to a lack of skill, but to the model becoming increasingly unstable across turns.

The paper states:

To test whether performance degradation was tied to the number of turns, the authors ran a gradual sharding experiment, splitting each instruction into anything from one to eight shards (see the right-most column in the image above).

As the number of shards increased, unreliability rose steadily, confirming that the degradation scales with the degree of fragmentation. Aptitude remained mostly unchanged, reinforcing that the issue lies in consistency, not capability.

Temperature Control

A separate set of experiments tested whether unreliability was simply a byproduct of randomness. To do this, the authors varied the temperature setting of both the assistant and the user simulator across three values: 1.0, 0.5, and 0.0.
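Structurally, such a sweep amounts to simulating every pairing of assistant and user temperatures and comparing the spread of outcomes. A minimal sketch follows, with the best-minus-worst score spread used as a simple stand-in for the paper's unreliability metric.

```python
# Hypothetical layout of the temperature sweep: every pairing of assistant and
# user-simulator temperature is simulated repeatedly and the spread compared.
from itertools import product

TEMPERATURES = [1.0, 0.5, 0.0]

def temperature_sweep(simulate, num_runs: int = 10) -> dict:
    """simulate(assistant_temp, user_temp) should return a score in [0, 100]."""
    spread_by_setting = {}
    for a_temp, u_temp in product(TEMPERATURES, TEMPERATURES):
        scores = [simulate(a_temp, u_temp) for _ in range(num_runs)]
        # Best-minus-worst spread as a simple stand-in for unreliability.
        spread_by_setting[(a_temp, u_temp)] = max(scores) - min(scores)
    return spread_by_setting
```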

In single-turn formats like Full and Concat, reducing the assistant's temperature significantly improved reliability, cutting variation by as much as 80 percent; but in the Sharded setting, the same intervention had little effect:

Unreliability scores for different combinations of assistant and user temperature across full, concat, and sharded settings, with lower values indicating greater response consistency.

Even with both the assistant and the user set to zero temperature, unreliability remained high, with GPT-4o showing variation around 30 percent, suggesting that the instability seen in multi-turn conversations is not just stochastic noise, but a structural weakness in how models handle fragmented input.

Implications

The authors write of the implications of their findings at unusual length in the paper's conclusion, arguing that strong single-turn performance does not guarantee multi-turn reliability, and cautioning against over-relying on fully-specified benchmarks when evaluating real-world readiness (since such benchmarks mask the instability that surfaces in more natural, fragmented interactions).

They also suggest that unreliability is not just a sampling artifact, but a fundamental limitation in how current models process evolving input, and that this raises concerns for agent frameworks, which depend on sustained reasoning across turns.

Finally, they argue that multi-turn ability should be treated as a core capability of LLMs, not something offloaded to external systems.

The authors note that their results likely underestimate the true scale of the problem, and draw attention to the ideal conditions of the test: the user simulator in their setup had full access to the instruction and could reveal shards in an optimal order, which gave the assistant an unrealistically favorable context (in real-world use, users often supply fragmented or ambiguous prompts without knowing what the model needs to hear next).

Moreover, the assistant was evaluated after each turn, before the full conversation had unfolded, preventing later confusion or self-contradiction from being penalized, which would otherwise worsen performance. These decisions, while necessary for experimental control, mean that the reliability gaps observed in practice are likely to be even greater than those reported.

They conclude:

Conclusion

Anyone who has spent a significant amount of time with an LLM will likely recognize the problems formulated here from practical experience; and most of us, I imagine, have intuitively abandoned 'lost' LLM conversations for fresh ones, in the hope that the LLM may 'start over' and cease to obsess over material that came up in a long, winding, and increasingly infuriating exchange.

It's interesting to note that throwing more context at the problem may not necessarily solve it; and indeed, to observe that the paper raises more questions than it provides answers (except in terms of ways to work around the issue).

 
