Large Language Models Do Not Have Interiority
Will Douglas Heaven for MIT Technology Review has a look at some research from Anthropic describing a new ability to peek inside its Large Language Model and get some sense of how it does what it does. Three paragraphs gave me conniptions, so forgive me while I have a short fit.
Anthropic also looked at how Claude solved simple math problems. The team found that the model seems to have developed its own internal strategies that are unlike those it will have seen in its training data. Ask Claude to add 36 and 59 and the model will go through a series of odd steps, including first adding a selection of approximate values (add 40ish and 60ish, add 57ish and 36ish). Towards the end of its process, it comes up with the value 92ish. Meanwhile, another sequence of steps focuses on the last digits, 6 and 9, and determines that the answer must end in a 5. Putting that together with 92ish gives the correct answer of 95.
And yet if you then ask Claude how it worked that out, it will say something like: “I added the ones (6+9=15), carried the 1, then added the 10s (3+5+1=9), resulting in 95.” In other words, it gives you a common approach found everywhere online rather than what it actually did. Yep! LLMs are weird. (And not to be trusted.)
This is clear evidence that large language models will give reasons for what they do that do not necessarily reflect what they actually did. But this is true for people too, says Batson: “You ask somebody, ‘Why did you do that?’ And they’re like, ‘Um, I guess it’s because I was— .’ You know, maybe not. Maybe they were just hungry and that’s why they did it.”
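For concreteness, the strategy described in that first quoted paragraph amounts to something like the following sketch. The decomposition, the function names, and the rounding are my own illustrative stand-ins, not anything Anthropic actually found in the model's circuits.

```python
# Toy sketch of the two-pathway addition strategy described above:
# one pathway gets a rough sense of the sum's magnitude, another works
# out the exact final digit, and the two are reconciled at the end.

def rough_estimate(a: int, b: int) -> int:
    """Pathway 1: add sloppy approximations of the operands ("40ish + 60ish")."""
    # Crude stand-in for the fuzzy addition: round each operand to the
    # nearest multiple of 5 before summing.
    return 5 * round(a / 5) + 5 * round(b / 5)


def exact_last_digit(a: int, b: int) -> int:
    """Pathway 2: the exact ones digit of the sum (6 + 9 means it ends in 5)."""
    return (a + b) % 10


def combine(a: int, b: int) -> int:
    """Pick the number near the rough estimate whose ones digit matches."""
    approx = rough_estimate(a, b)
    digit = exact_last_digit(a, b)
    candidates = [approx - approx % 10 + digit + k for k in (-10, 0, 10)]
    return min(candidates, key=lambda c: abs(c - approx))


print(combine(36, 59))  # 95
```

The point isn't that this is literally what the model does; it's that "rough magnitude plus exact last digit" is a perfectly describable procedure, and it is not the carry algorithm the model claims to have used.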
Those quoted paragraphs are exasperating editorializing, and that last one is sheer anthropomorphizing. There's a much simpler answer to why the model responds this way, although I'm open to the argument that I'm wrong.
For an LLM, being fancy autocomplete, the statistically most likely answer to how it did a math problem is simply a regurgitation of the way people typically solve that math problem, as evidenced by its source corpus. When you ask an LLM how it did something, you aren't somehow querying its inner life, so there's no reason for its answer to reflect how it actually did that thing. Its answer will be (what it views as) the statistically most likely way to describe having done that thing.
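If it helps, here's the shape of what I mean as a toy, again entirely my own illustration: the answer and the explanation come out of two channels that never consult each other.

```python
# Toy illustration: the "explanation" channel never looks at how the
# answer was actually produced. It just emits the most likely-sounding
# textbook account of how one adds two numbers.

def solve(a: int, b: int) -> int:
    """Stand-in for whatever opaque process actually produces the answer."""
    return a + b


def explain(a: int, b: int) -> str:
    """The statistically typical story, generated with no access to solve().

    (Toy: assumes two two-digit operands whose tens don't overflow.)
    """
    ones = (a % 10) + (b % 10)
    carry = ones // 10
    tens = (a // 10) + (b // 10) + carry
    return (f"I added the ones ({a % 10}+{b % 10}={ones}), carried the {carry}, "
            f"then added the 10s ({a // 10}+{b // 10}+{carry}={tens}), "
            f"resulting in {a + b}.")


print(solve(36, 59))    # 95
print(explain(36, 59))  # the textbook carry story, whatever solve() really did
```

The two functions agree on the number, and the story is even internally consistent, but nothing ties the story to the computation.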
LLMs do not have a model of truthfulness about themselves or their interior “life” because they do not have an actual model of truthfulness about anything. Anywhere an LLM seems to suppress falsehood in favor of accuracy, it's because it has been post-trained to do so.