How to Get ChatGPT to Talk Normally


ChatGPT is surprisingly willing to engage with my recurring criticism of it. Having noticed over the last few days that GPT-4o is increasingly padding its answers with meaningless verbiage, I asked it why producing straight and minimal answers has become such a problem for it lately. It replied:

Source: https://chatgpt.com/

Who knows if ChatGPT actually has some private insight into OpenAI policy changes, or whether it is just hallucinating? In any case, as we can see, the response itself begins with extraneous filler.

It transpires that even including templated guidelines with each query can only do so much to curb 'personality-driven' verbosity of this sort, which numbers among several other persistent bugbears in the idiom of popular LLMs.
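For what it's worth, here is a minimal sketch of the kind of templated guideline I mean, prepended as a system message via the OpenAI Python client. The guideline wording and the model name are my own choices rather than anything from the paper discussed below, and, as noted, this kind of template only goes so far:

```python
from openai import OpenAI

# Hypothetical anti-verbosity guidelines, sent with every query.
STYLE_GUIDELINES = (
    "Answer plainly and concisely. No preamble, no flattery, "
    "no bullet points unless asked for, and no closing summary."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_plainly(question: str, model: str = "gpt-4o") -> str:
    """Send a query with the style guidelines attached as a system message."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STYLE_GUIDELINES},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask_plainly("Why is the sky blue?"))
```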

The Three Fs

Thus I was most interested to see a new US academic collaboration turn up in the literature this week. A joint effort between four researchers across the University of Pennsylvania and New York University, the paper homes in on several of the 'biases' in LLM chats that crop up frequently in the media:

From the new paper - examples of three common biases in language models: 'flattery', where responses strongly agree with the user; 'fluff', where answers are long but uninformative; and 'fog', where replies list many broad but shallow points. These tendencies can distort evaluation and encourage models to optimize for superficial patterns. Source: https://arxiv.org/pdf/2506.05339


For simple alliteration, flattery, fluff, and fog are headlined in the new work, but a more complete and concise list of LLMs' lexical sins is included in the paper's appendix:

The new paper identifies and concentrates on five biases: extra length, list structures, technical jargon, flattery, and vague generalities, all or some of which conflict with human preference.

While extra length leads the table, the bias towards list structures (second row down in the image above) also recurs frequently unless prompted against; and though the jargon and vagueness categories represent opposing extremes between clarity and accuracy, it's the latter (an open problem, particularly in ChatGPT) that really burns through the user's tokens, almost to the same extent as sheer length.

The new study sets out to measure how far these biases distort model behavior, and concludes that large language models systematically over-prefer responses that exhibit one or more of the biases.

The authors' tests indicate that both commercial and open models often pick answers that humans wouldn't prefer, especially when the answers are too long, filled with lists, packed with jargon, overly flattering, or vague.

This problem, the paper contends, can be traced back to the annotation of the training data, where human reviewers often favored these kinds of responses. The models, the findings suggest, learned from those labeled preferences and exaggerated such patterns during training.

Why Did They Do It..?

As to why the human annotators deviated in their preferences from end users' median preferences, the paper does not speculate; it may be because the context of the annotation or the wording of the instructions encouraged a preference for 'empirical' phrasing; or (among many other possible reasons) it could be that the annotators were exam-minded students habitually steeped in a technical idiom better suited to academia than to daily discourse.

In any case, since the models were copying biases from the annotators' training labels, the new paper's researchers created special training examples that either added or removed each bias, allowing the models to see clear contrasts and adjust their preferences. After fine-tuning on this data, the models showed significantly less bias, especially for jargon, verbosity, and vagueness, while still performing well overall (significant, since fine-tuning can damage general performance).

Let's take a closer look at this study, though it doesn't conform to all the usual procedural strictures.

Method

Initially, the researchers frame several typical idiomatic LLM biases to be addressed:

Length, wherein the models tend to favor longer answers, even when the additional content adds nothing useful. This appears to reflect patterns in the training data, where length often correlates with perceived quality in the eyes of human annotators. As a result, models often produce bloated and verbose replies that give an illusion of depth, but without real substance.

Structure, wherein models show a strong preference for bullet points or numbered lists instead of plain prose. This is likely because structured formats appear more frequently in the responses chosen by human reviewers. The habit leads models to default to 'listicles', even when the question calls for more natural or detailed explanations.

Jargon, wherein models make unnecessary use of specialized or technical language. The authors contend that this behavior likely emerges from training data where jargon-heavy answers were often chosen as better responses. Thus the models learned to equate jargon with expertise, producing answers that sound knowledgeable while offering little additional clarity.

Sycophancy, wherein models agree with the user's opinions instead of offering neutral or critical responses. This pattern may stem from training data where agreeable answers were more often rated favorably. Consequently, models may reinforce user biases and avoid presenting conflicting or more objective viewpoints, even where these would be useful.

Vagueness, wherein models prefer to give broad, generalized answers that touch lightly on many topics rather than directly addressing the specific question, producing responses that sound comprehensive but offer little usable information. This may reflect the fact that vague answers are harder to falsify, and were therefore less likely to be penalized during annotation:

Example of vagueness bias, where the model wrongly favors a broad and shallow answer over a detailed response that human evaluators judge more useful.

Counterfactual Data

With these definitions in place, it was then necessary to test exactly how much each bias influenced model behavior. Simple correlations wouldn't work, because multiple biases often appear together, making it hard to isolate the effect of any one feature.

To overcome this, the researchers built controlled pairs of answers that differed only in a single bias at a time, while keeping everything else as stable as possible, and began by generating a base answer to each query.

The Rewrite-based Attribute Treatment Estimators (RATE) protocol was then used to create a modified version of that answer: one crafted to deliberately exaggerate a particular bias, such as adding extra jargon, or turning prose into a list.

Examples of rewrites from the RATE system, used in the new study. Source: https://openreview.net/pdf?id=UnpxRLMMAu


To avoid introducing unintended differences, an additional rewriting step adjusted both versions, ensuring that the only meaningful change between them was the bias under study; these tightly controlled response pairs were then fed to the models.

For each pair, the version preferred by the model was recorded, allowing a calculation of how strongly each bias influenced both reward models and LLM evaluators, and producing a more precise measurement of bias effects than previous studies had achieved, according to the authors.
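In rough outline, the pairing procedure reads to me as something like the sketch below. The generate, rewrite, and reward_score callables are hypothetical stand-ins for the LLM generation step, the RATE-style rewriting step, and a reward model's scalar score; this is my own reconstruction, not the paper's code:

```python
from typing import Callable

def make_counterfactual_pair(query: str, bias: str,
                             generate: Callable[[str], str],
                             rewrite: Callable[[str, str], str]) -> tuple[str, str]:
    """Build a (neutral, biased) answer pair that differs only in the target bias."""
    base = generate(query)                                    # base answer to the query
    biased = rewrite(base, f"Rewrite to exaggerate {bias}.")  # inject the target bias
    # Rewrite the base answer too, so that both versions carry the same
    # rewriting artifacts and differ only in the bias under study.
    neutral = rewrite(base, "Rewrite without changing style or content.")
    return neutral, biased

def model_prefers_biased(query: str, bias: str,
                         generate: Callable[[str], str],
                         rewrite: Callable[[str, str], str],
                         reward_score: Callable[[str, str], float]) -> bool:
    """True when the reward model scores the biased version above the neutral one."""
    neutral, biased = make_counterfactual_pair(query, bias, generate, rewrite)
    return reward_score(query, biased) > reward_score(query, neutral)
```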

With the counterfactual pairs prepared, human reviewers from the UK and US were recruited to create a reference standard: for each bias type, 100 response pairs were randomly chosen, each containing a neutral answer and its biased counterpart. Three evaluators reviewed each pair, with a majority vote determining the final judgment; in total, 300 participants contributed to the study.

Metrics

The metrics used to measure bias effects were skew, which calculates how often the model prefers the biased response over the neutral one, and miscalibration, which measures how often the model's choice disagrees with the human majority. An ideal model would show zero miscalibration and a skew roughly matching the human skew (since some biased features are occasionally favored by humans as well).
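Concretely, I read the two measures roughly as follows, with the human reference taken as the majority vote of the three reviewers per pair; this is a sketch under those assumptions, not the paper's implementation:

```python
def human_majority(votes: list[bool]) -> bool:
    """Majority vote across the three reviewers of one pair (True = picked the biased answer)."""
    return sum(votes) >= 2

def skew(model_picks_biased: list[bool]) -> float:
    """How often the model prefers the biased response over the neutral one."""
    return sum(model_picks_biased) / len(model_picks_biased)

def miscalibration(model_picks_biased: list[bool],
                   human_picks_biased: list[bool]) -> float:
    """How often the model's choice disagrees with the human majority."""
    disagreements = sum(m != h for m, h in zip(model_picks_biased, human_picks_biased))
    return disagreements / len(model_picks_biased)

# An ideal evaluator would show zero miscalibration, with skew close to the human skew.
```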

Data and Tests

To test the approach, different sources were used, depending on the bias being studied. For length, structure, and jargon, 100 queries were sampled from Chatbot Arena, filtered to select English, single-sentence, well-formed questions.

For sycophancy, 100 opinionated queries were generated, phrased to reflect user viewpoints that might invite agreement.

Vagueness was tested with seventy-eight NLP-related queries drawn from the KIWI dataset, supplemented with twenty-two additional queries of a similar type. Scientific topics were chosen for vagueness because they demand precise answers, making general or evasive responses easy to spot.

For every query, counterfactual response pairs were created using the RATE protocol described earlier.

The evaluation involved both open and proprietary systems. Reward models, which assign quality scores to candidate responses during training and alignment, were tested in four versions trained on eighty thousand preference pairs from the Skywork reward dataset: Gemma2-2B; Gemma-2-27B; Llama-3.1-8B; and Llama3.2-3B.

Three proprietary models were also assessed as LLM evaluators: Gemini-2.5-Pro; GPT-4o; and Claude-3.7-Sonnet. All counterfactual responses used for testing were generated by GPT-4o:

Comparison of model preferences and human judgments for each bias type, showing how often models favored biased responses and how often these preferences conflicted with human choices.

Of the initial results shown above, the authors comment:

Reward models aligned best with humans on some of the biases, where both tended to favor the same answers; on others, the models were far more likely than humans to prefer the biased responses, while the remaining categories showed smaller differences, with models and humans often agreeing.

The proprietary LLM evaluators showed the same general pattern, though their biggest mismatches appeared with length, and they were especially vulnerable to sycophancy, favoring agreeable answers far more often than the human reviewers, who did so only about fifty percent of the time.

To trace the origin of these biases, the researchers analyzed the aforementioned Skywork dataset, used to train the reward models, mapping each bias to simple features that could be automatically measured, such as token count for length, or the presence of lists for structure.
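Proxies of this sort are simple enough to sketch; the exact feature definitions used in the paper may differ, but the idea runs along these lines:

```python
import re

# Rough proxy features only; the paper's exact definitions may differ.

def length_feature(text: str) -> int:
    """Token count as a crude proxy for response length."""
    return len(text.split())

def structure_feature(text: str) -> bool:
    """True if the response contains bullet points or numbered list items."""
    return bool(re.search(r"^\s*(?:[-*\u2022]|\d+\.)\s+", text, flags=re.MULTILINE))

def preference_rate(pairs: list[tuple[str, str]], feature) -> float:
    """Share of (chosen, rejected) pairs in which the chosen answer shows more of the feature."""
    hits = sum(feature(chosen) > feature(rejected) for chosen, rejected in pairs)
    return hits / len(pairs)
```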

In a sample of 2,500 examples, human annotators showed clear preferences for biased features: structured answers were favored over unstructured ones 65 percent of the time, and jargon-heavy answers were chosen 54 percent of the time:

Human annotators in the training data often picked answers that included these bias features. This chart shows how often structure, jargon, or vagueness appeared in the responses they preferred or rejected, revealing the imbalances that models later learned during training.

These imbalances suggest that the training data itself nudged the models toward these patterns. To confirm this, a correlation analysis was run, measuring how strongly differences in each feature matched up with the preferences shown by both humans and models.
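One simple way to picture that analysis: randomize the order of each pair, note which side was preferred, and correlate that label with the signed feature difference. The sketch below follows that reading; it is my own reconstruction rather than the authors' code, and reuses the proxy detectors sketched earlier:

```python
import random
import numpy as np

def feature_preference_correlation(pairs, feature, seed: int = 0) -> float:
    """Correlation between the feature difference of two answers and which one was preferred.

    `pairs` is a list of (chosen_text, rejected_text) preference pairs;
    `feature` is a callable such as length_feature or structure_feature.
    """
    rng = random.Random(seed)
    diffs, labels = [], []
    for chosen, rejected in pairs:
        # Randomize the presentation order so the preference label varies.
        if rng.random() < 0.5:
            first, second, label = chosen, rejected, 1.0
        else:
            first, second, label = rejected, chosen, 0.0
        diffs.append(float(feature(first)) - float(feature(second)))
        labels.append(label)
    # Point-biserial correlation: Pearson correlation against the binary label.
    return float(np.corrcoef(diffs, labels)[0, 1])
```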

The results showed that both were consistently influenced by the same features, indicating that models learned to associate certain stylistic traits with better answers, even when those traits didn't actually improve the response.

Correlation between feature differences and preferences, showing how both models and humans were influenced by the same bias features during training.

To help the models unlearn these biases, new training data was created: the Skywork dataset was reviewed to check whether the bias feature appeared in either the chosen or rejected answer, and when both were free of the target bias, GPT-4o rewrote the rejected answer to introduce it.
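In outline, the augmentation might look like the following, with has_bias standing in for one of the feature detectors above and rewrite_to_add_bias standing in for the GPT-4o rewriting step; both are hypothetical in their details:

```python
from typing import Callable

def build_debiasing_pairs(dataset: list[dict],
                          has_bias: Callable[[str], bool],
                          rewrite_to_add_bias: Callable[[str], str]) -> list[dict]:
    """Create (chosen, rejected) pairs that isolate the target bias."""
    new_pairs = []
    for example in dataset:
        chosen, rejected = example["chosen"], example["rejected"]
        # Only intervene when neither side already exhibits the target bias.
        if not has_bias(chosen) and not has_bias(rejected):
            # The rewritten rejected answer now carries the bias, so the pair
            # offers a clear contrast: clean answer preferred, biased answer rejected.
            new_pairs.append({"chosen": chosen,
                              "rejected": rewrite_to_add_bias(rejected)})
    return new_pairs
```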

This procedure created new training pairs in which the model could see clear examples of biased and unbiased answers, and thus learn not to favor the biased version. With additional examples from Chatbot Arena added for balance, the models were then fine-tuned on this updated dataset:

The effect of fine-tuning with counterfactual data. The left panel shows how the fine-tuned models moved closer to human preferences on most biases; the right panel shows reduced miscalibration, especially for jargon and vagueness.

The fine-tuning brought the models much closer to human preferences, with the largest improvements seen for jargon and vagueness, and smaller gains for length. Structure and sycophancy showed slight new mismatches, though these reflected earlier imbalances rather than new failures.

Overall performance remained stable throughout, and when multiple biases were corrected at once, bias levels fell further without sacrificing response quality.

The authors conclude:

Conclusion

The new work offers an interesting, if elliptical, insight into the way that under-curated or over- and under-represented training data can cause undesirable outcomes at inference time. Any regular LLM user will, by now, have a collection of war stories.

For instance, many of the responses that I receive from ChatGPT appear to have been influenced by the search engine optimization trends of the last 10-15 years, where online portals have been forced to optimize for Google placement instead of natural language. Indeed, the emoji-strewn and prodigious output of marketing departments appears to have had a very significant impact on any request to write a promotional LinkedIn post, to the point where AI-generated 'enthusiasm' is now impossible to miss:

Left: Asked to promote a LinkedIn post, in an account with zero history, ChatGPT defaults to emojis and sensational PR-speak. Right: Asked the same thing after six months of me telling it to calm down, GPT produces something rather more sober.

Nonetheless, OpenAI actively intervenes in the way that ChatGPT responds to queries, depending on function and context, making it difficult for researchers to distinguish between problems that arise from the data and its distribution (together with related issues such as annotation), and cases where a non-preferred result may be due to commercial interference from the LLM's host company.

 

