LLMs are sometimes said to have ‘emergent properties’. But what do we even mean by that, and what evidence do we have?
One of the often-repeated claims about Large Language Models (LLMs), discussed in our ICML’24 position paper, is that they have ‘emergent properties’. Unfortunately, in most cases the speaker/author does not clarify what they mean by ‘emergence’. But misunderstandings on this issue can have big implications for the research agenda, as well as for public policy.
From what I’ve seen in academic papers, there are at least 4 senses in which NLP researchers use this term:
1. A property that a model exhibits despite not being explicitly trained for it. E.g. Bommasani et al. (2021, p. 5) refer to the few-shot performance of GPT-3 (Brown et al., 2020) as “an emergent property that was neither specifically trained for nor anticipated to arise”.
2. (The opposite of def. 1): a property that the model learned from the training data. E.g. Deshpande et al. (2023, p. 8) discuss emergence as evidence of “the advantages of pre-training”.
3. A property “is emergent if it is not present in smaller models but is present in larger models” (Wei et al., 2022, p. 2).
4. A version of def. 3, where what makes emergent properties “intriguing” is “their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales” (Schaeffer, Miranda, & Koyejo, 2023, p. 1).
For a technical term, this kind of fuzziness is unfortunate. If many people repeat the claim “LLMs have emergent properties” without clarifying what they mean, a reader could infer that there is a broad scientific consensus that this statement is true, according to the reader’s own definition.
I’m writing this post after giving many talks about this in NLP research groups all over the world: Amherst and Georgetown (USA), Cambridge, Cardiff and London (UK), Copenhagen (Denmark), Gothenburg (Sweden), Milan (Italy), the Genbench workshop (EMNLP’23 @ Singapore) (thanks to everyone in the audience!). This gave me a chance to poll a lot of NLP researchers about what they thought of as emergence. Based on the responses from 220 NLP researchers and PhD students, by far the most popular definition is (1), with (4) being the second most popular.
The idea expressed in definition (1) also often gets invoked in public discourse. For example, you may have seen it in the claim that Google’s PaLM model ‘knew’ a language it wasn’t trained on (which is almost certainly false). The same idea also provoked the following public exchange between a US senator and Melanie Mitchell (a prominent AI researcher, professor at the Santa Fe Institute):
What this exchange shows is that the idea of LLM ‘emergent properties’ per definition (1) has implications outside the research world. It contributes to the anxiety about an imminent takeover by super-AGI and to calls for pausing research. It could push policy-makers in the wrong directions, such as banning open-source research, which would further consolidate resources in the hands of a few big tech labs and ensure they won’t have much competition. It also creates the impression of LLMs as entities independent of the choices of their developers and deployers, which has huge implications for who is accountable for any harms coming from these models. With such high stakes for the research community and society, shouldn’t we at least make sure that the science is sound?
Much in the above versions of ‘emergence’ in LLMs is still debatable: how much do they actually advance the scientific discussion, with respect to other terms and known principles that are already in use? I would like to emphasize that this discussion is completely orthogonal to the question of whether LLMs are useful or valuable. Countless models have been and will be of practical use without any claims of emergence.
Let us start with definition 2: something that a model learned from the training data. Since this is exactly what a machine learning model is supposed to do, does this version of ‘emergence’ add much to ‘learning’?
For definition (3) (something that only large models do), the better performance of larger models is to be expected, given basic machine learning principles: a larger model simply has more capacity to learn the patterns in its training data. Hence, this version of ‘emergence’ also doesn’t add much, unless we expect that larger models, but not small ones, do something they weren’t trained for. But in that case this definition depends on definition (1).
For definition (4), the phenomenon of a sharp change in performance turned out to be attributable to non-continuous evaluation metrics (e.g. for classification tasks like multi-choice question answering), rather than to the LLMs themselves (Schaeffer, Miranda, & Koyejo, 2023). Moreover, J. Wei himself acknowledges that the current claims of sharp changes are based on results from models that are only available in relatively few sizes (1B, 7B, 13B, 70B, 150B…), and that if we had more results for intermediate model sizes, the increase in performance would likely turn out to be smooth (Wei, 2023).
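To make the metric argument concrete, here is a toy simulation in Python (my own sketch with made-up numbers, not code or data from Schaeffer et al.): per-token accuracy is assumed to improve smoothly with model size, yet an all-or-nothing exact-match metric over a multi-token answer stays near zero for the smaller models and then shoots up, which looks like a sharp ‘emergent’ jump.

```python
import numpy as np

# Toy illustration (assumed numbers, not from any paper): a smooth per-token
# improvement can look like a sharp jump under an all-or-nothing metric.
sizes = np.logspace(8, 12, 9)                  # hypothetical parameter counts
p_token = np.linspace(0.50, 0.99, len(sizes))  # smooth per-token accuracy curve
answer_len = 20                                # credit only if all 20 answer tokens match

exact_match = p_token ** answer_len            # the metric that looks 'emergent'

for n, pt, em in zip(sizes, p_token, exact_match):
    print(f"{n:10.1e} params | token accuracy {pt:.2f} | exact match {em:.4f}")
```

Plotted against model size, the exact-match column reproduces the ‘sharp transition’ shape even though nothing discontinuous happens in the underlying per-token ability.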
The unpredictability part of definition (4) was reiterated by J. Wei (2023) as follows: “the ‘emergence’ phenomenon is still interesting if there are large differences in predictability: for some problems, performance of large models can easily be extrapolated from performance of models 1000x less in size, whereas for others, it can’t be extrapolated even from 2x less size.”
However, the cited predictability at 1,000x less compute refers to the GPT-4 report (OpenAI, 2023), where the developers knew the target evaluation in advance and specifically optimized for it. Given that, predictable scaling is hardly surprising theoretically (though still impressive from the engineering point of view). This is in contrast with the unpredictability at 2x less compute for the unplanned BIG-Bench evaluation in (Wei et al., 2022). But that unpredictability is to be expected, simply due to the unknown interaction between (a) the presence of training data that is similar to the test data, and (b) sufficient model capacity to learn some specific patterns.
Hence, we are left with definition (1): emergent properties are properties that the model was not explicitly trained for. This can be interpreted in two ways:
5. A property is emergent if the model was not exposed to training data for that property.
6. A property is emergent even if the model was exposed to the relevant training data, as long as the model developers were unaware of it.
Per def. 6, it would appear that the research question is actually ‘what data exists on the Web?’ (or in the proprietary training datasets of generative AI companies), and that we are training LLMs as a very expensive way to answer that question. For example, ChatGPT can generate chess moves that are plausible-looking (but often illegal). This is surprising if we think of ChatGPT as a language model, but not if we know that it is a model trained on a web corpus, because such a corpus would likely include not only texts in a natural language, but also materials like chess transcripts, ASCII art, MIDI music, programming code, etc. The term ‘language model’ is actually a misnomer: these are rather corpus models (Veres, 2022).
Per def. 5, we can only prove that some property is emergent by showing that the model was not exposed to evidence in its training data that could have been the basis for the model outputs. And the outputs cannot be due to lucky sampling in the latent space of the continuous representations. If we are allowed to generate as many samples as we want and cherry-pick, we will eventually get some fluent text even from a randomly initialized model, but this should arguably not count as an ‘emergent property’ on definition (5).
For commercial models with undisclosed training data, such as ChatGPT, such a proof is out of the question. But even for the “open” LLMs this is only a hypothesis (if not wishful thinking), because so far we are lacking detailed studies (or even a methodology) to consider the exact relation between the amount and kinds of evidence in the training text data for a particular model output. On definition 5, emergent properties are a machine learning equivalent of alchemy, and the bar for postulating them should be quite high.
Especially in the face of evidence to the contrary.
Here are some of the empirical results that make it dubious that LLMs have ‘emergent properties’ by definition (5) (the model was not exposed to training data for that property):
- The phenomenon of prompt sensitivity (Lu, Bartolo, Moore, Riedel, & Stenetorp, 2022; Zhao, Wallace, Feng, Klein, & Singh, 2021): LLMs responding differently to prompts that should be semantically equivalent. If we say that models have an emergent property of answering questions, slightly different ways of posing these questions, and especially a different order of few-shot examples, should not matter. The most likely explanation for prompt sensitivity is that the model responds better to prompts that are more similar to its training data in some way that helps the model (a minimal way to check this kind of sensitivity is sketched after this list).
- Liang et al. evaluate 30 LLMs and conclude that “regurgitation (of copyrighted materials) risk clearly correlates with model accuracy” (2022, p. 12). This suggests that models which ‘memorize’ more of their training data perform better.
- McCoy, Yao, Friedman, Hardy, & Griffiths (2023) show that LLM performance depends on the probabilities of output word sequences in web texts.
- Lu, Bigoulaeva, Sachdeva, Madabushi, & Gurevych (2024) show that the ‘emergent’ abilities of 18 LLMs can be ascribed mostly to in-context learning. Instruction tuning facilitates in-context learning, but does not seem to have an independent effect.
- For in-context learning itself (first shown in GPT-3 (Brown et al., 2020) and used as the example of ‘emergence’ by Bommasani et al. (2021, p. 5)), the results of Chen, Santoro et al. (2022) suggest that it happens only in Transformers trained on sequences structurally similar to the sequences on which in-context learning would be tested.
- Liu et al. (2023) report that ChatGPT and GPT-4 perform better on older benchmarks than on newly released ones, suggesting that many evaluation results may be inflated due to data contamination. OpenAI itself went to great lengths in the GPT-3 paper (Brown et al., 2020) to show how difficult it is to mitigate this problem. Since we know nothing about the training data of the latest models, external evaluation results may not be meaningful, and internal reports by companies that sell their models as a commercial service have a clear conflict of interest.
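Here is the minimal sketch of a prompt-sensitivity check mentioned in the first item above. It uses the Hugging Face transformers library with gpt2 purely as a stand-in (the cited studies use much larger LLMs and far more careful protocols); the only point it illustrates is that greedily decoded answers can change when a semantically equivalent prompt is phrased differently.

```python
from transformers import pipeline

# Minimal sketch: compare greedy completions for prompts that should be
# semantically equivalent. gpt2 is only a small stand-in model here.
generator = pipeline("text-generation", model="gpt2")

prompts = [
    "Q: What is the capital of France?\nA:",
    "Question: What is the capital of France?\nAnswer:",
    "Please answer the following question. What is the capital of France?\nAnswer:",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    # Print only the continuation, to see how it varies across phrasings.
    print(repr(out[len(prompt):]))
```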
A well-known effort to propose a methodology that would avoid at least the data contamination problem is the ‘sparks of AGI’ study (Bubeck et al., 2023). Using the methodology of newly constructed test cases, checked against public web data, and their perturbations, the authors notably concluded that GPT-4 possesses “a very advanced theory of mind”. At least two studies have come to the opposite conclusion (Sap, Le Bras, Fried, & Choi, 2022; Shapira et al., 2024). The most likely reason for the failure of this methodology is that while we can check for direct matches on the web, we could still miss some highly similar cases (e.g. the well-known example of the unicorn drawn in TikZ from that paper could be based on the StackOverflow community drawing other animals in TikZ). Moreover, commercial LLMs such as GPT-4 may also be trained on data that is not publicly available. In the case of OpenAI, many researchers and other users of GPT-3 submitted a lot of data through the API before OpenAI changed their terms of service to not use such data for training by default.
This is not to say that it is absolutely impossible that LLMs could work well out of their training distribution. Some degree of generalization is happening, and the best-case scenario is that it is due to interpolation of patterns that were observed in the training data individually, but not together. But at what point we would say that the result is something qualitatively new, what kind of similarity to the training data matters, and how we could identify it: these are all still-unresolved research questions.
As I mentioned, I had a chance to give a talk about this in several NLP research groups. At the very beginning of those talks, before I presented the above discussion, I asked the audience a few questions, including whether they personally believed that LLMs had emergent properties (according to their preferred definition, which, as shown above, was predominantly (1)). I also asked them about their perception of the consensus in the field: what did they think most other NLP researchers thought about this? For the first question I have answers from 259 researchers and PhD students, and for the second from 360 (note to self: give people more time to connect to the poll).
The results were striking: while most respondents were skeptical or unsure about LLM emergent properties themselves (only 39% agreed with that statement), 70% thought that most other researchers did believe this.
This is in line with several other false sociological beliefs: e.g. many NLP researchers don’t think that NLP leaderboards are particularly meaningful, or that scaling will solve everything, but they do think that other NLP researchers believe that (Michael et al., 2023). In my sample, the idea that LLMs have emergent properties is similarly held by a minority of researchers, but it is misperceived to be the majority view. And even for that minority the conviction is not very firm. In four of my talks, after presenting the above discussion, I also asked the audience what they thought now. In this sample of 70 responses, 83% of those who originally agreed with the statement “LLMs have emergent properties” changed their belief to either disagreeing (13.9%) or being unsure (69.4%).
In retrospect, “agree/disagree/unsure” is not the best choice of options for this poll. As scientists, we can hardly be 100% sure: as Yann LeCun put it in the Munk debate, we cannot even prove that there is no teapot orbiting Jupiter right now. Our job is not to fall into such distracting rabbit holes, but to formulate and test hypotheses that would advance our understanding of the phenomenon we are studying. For ‘emergence’ in LLMs, I think we are still at the ‘formulation’ stage, since even after all the above work on clarifying ‘emergence’ we still don’t have a research question for which it is clear how to obtain empirical evidence.
The key unresolved question is what kind of interpolation of existing patterns would even count as something new enough to qualify as an ‘emergent phenomenon’ in the domain of natural language data. This domain is particularly hard, since it mixes different kinds of information (linguistic, social, factual, commonsense), and that information may be present in different ways (explicit in context, implicit, or requiring reasoning over long contexts). See Rogers, Gardner, & Augenstein (2023, sec. 8.2) for a discussion of the different skills involved in just the question answering task.
📢 If the relation between LLM output and its training data is a problem that you (or someone you know) would like to figure out, there are funded postdoc / PhD positions to work on it in beautiful Copenhagen! (apply by Nov 15/22 2024)