Odd behavior
So: What did they find? Anthropic looked at 10 different behaviors in Claude. One involved the use of different languages. Does Claude have a part that speaks French and another part that speaks Chinese, and so on?
The team found that Claude used components independent of any language to answer a question or solve a problem and then picked a specific language when it replied. Ask it "What's the opposite of small?" in English, French, and Chinese and Claude will first use the language-neutral components related to "smallness" and "opposites" to come up with an answer. Only then will it pick a specific language in which to reply. This suggests that large language models can learn things in one language and apply them in other languages.
Anthropic also looked at how Claude solved basic math problems. The team found that the model seems to have developed its own internal strategies that are unlike those it will have seen in its training data. Ask Claude to add 36 and 59 and the model will run through a series of odd steps, including first adding a selection of approximate values (add 40ish and 60ish, add 57ish and 36ish). Toward the end of its process, it comes up with the value 92ish. Meanwhile, another sequence of steps focuses on the last digits, 6 and 9, and determines that the answer must end in a 5. Putting that together with 92ish gives the correct answer of 95.
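To make that two-track idea concrete, here is a loose sketch in Python: one path supplies only a rough magnitude, the other supplies only the exact last digit, and the two are reconciled at the end. The numbers and the reconciliation rule are my own illustration of what the researchers describe, not Anthropic's actual mechanism.

```python
# A rough illustration (not Anthropic's circuit) of the two parallel paths
# described for 36 + 59: a coarse-magnitude estimate ("92ish") and an exact
# last-digit calculation (6 + 9 ends in 5), reconciled into a final answer.

def combine(estimate: int, last_digit: int) -> int:
    """Return the number with the required last digit closest to the estimate."""
    base = (estimate // 10) * 10 + last_digit
    candidates = (base - 10, base, base + 10)
    return min(candidates, key=lambda c: abs(c - estimate))

estimate = 92            # approximate path: rough magnitude of 36 + 59 ("92ish")
last_digit = (6 + 9) % 10  # precise path: 6 + 9 = 15, so the answer ends in 5

print(combine(estimate, last_digit))  # 95
```

The point of the sketch is that the sloppy estimate never needs to be exactly right: the last-digit path pins the answer down to 95.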
And yet if you then ask Claude how it worked that out, it will say something like: "I added the ones (6+9=15), carried the 1, then added the 10s (3+5+1=9), resulting in 95." In other words, it gives you the standard approach found everywhere online rather than what it actually did. Yep! LLMs are weird. (And not to be trusted.)
This is clear evidence that large language models will give reasons for what they do that don't necessarily reflect what they actually did. But this is true for people too, says Batson: "You ask somebody, 'Why did you do that?' And they're like, 'Um, I guess it's because I was— .' You know, maybe not. Maybe they were just hungry and that's why they did it."
Biran thinks this finding is especially interesting. Many researchers study the behavior of large language models by asking them to explain their actions. But that might be a risky approach, he says: "As models continue getting stronger, they must be equipped with better guardrails. I believe—and this work also shows—that relying only on model outputs is not enough."
A third task that Anthropic studied was writing poems. The researchers wanted to know if the model really did just wing it, predicting one word at a time. Instead they found that Claude somehow looked ahead, picking the word at the end of the next line several words in advance.
For example, when Claude was given the prompt "A rhyming couplet: He saw a carrot and had to grab it," the model responded, "His hunger was like a starving rabbit." But using their microscope, they saw that Claude had already come up with the word "rabbit" when it was processing "grab it." It then seemed to write the next line with that ending already in place.
