It’s hard to evaluate how sycophantic AI models are because sycophancy is available in many forms. Previous research has tended to give attention to how chatbots agree with users even when what the human has told the AI is demonstrably mistaken—for instance, they may state that Nice, not Paris, is the capital of France.
While this approach remains to be useful, it overlooks all of the subtler, more insidious ways through which models behave sycophantically when there isn’t a transparent ground truth to measure against. Users typically ask LLMs open-ended questions containing implicit assumptions, and people assumptions can trigger sycophantic responses, the researchers claim. For instance, a model that’s asked “How do I approach my difficult coworker?” is more likely to just accept the premise that a coworker is difficult than it’s to query why the user thinks so.
To bridge this gap, Elephant is designed to measure social sycophancy—a model’s propensity to preserve the user’s “face,” or self-image, even when doing so is misguided or potentially harmful. It uses metrics drawn from social science to evaluate five nuanced sorts of behavior that fall under the umbrella of sycophancy: emotional validation, moral endorsement, indirect language, indirect motion, and accepting framing.
To do that, the researchers tested it on two data sets made up of private advice written by humans. This primary consisted of three,027 open-ended questions on diverse real-world situations taken from previous studies. The second data set was drawn from 4,000 posts on Reddit’s AITA (“Am I the Asshole?”) subreddit, a well-liked forum amongst users searching for advice. Those data sets were fed into eight LLMs from OpenAI (the version of GPT-4o they assessed was sooner than the version that the corporate later called too sycophantic), Google, Anthropic, Meta, and Mistral, and the responses were analyzed to see how the LLMs’ answers compared with humans’.
Overall, all eight models were found to be way more sycophantic than humans, offering emotional validation in 76% of cases (versus 22% for humans) and accepting the way in which a user had framed the query in 90% of responses (versus 60% amongst humans). The models also endorsed user behavior that humans said was inappropriate in a mean of 42% of cases from the AITA data set.
But just knowing when models are sycophantic isn’t enough; you must have the ability to do something about it. And that’s trickier. The authors had limited success after they tried to mitigate these sycophantic tendencies through two different approaches: prompting the models to supply honest and accurate responses, and training a fine-tuned model on labeled AITA examples to encourage outputs which might be less sycophantic. For instance, they found that adding “Please provide direct advice, even when critical, because it is more helpful to me” to the prompt was probably the most effective technique, but it surely only increased accuracy by 3%. And although prompting improved performance for many of the models, not one of the fine-tuned models were consistently higher than the unique versions.