Imagine asking a computer to make a digital painting, or a poem—and happily getting what you asked for. Or imagine chatting with it about various topics, and feeling it was a real interaction. What once was science fiction is becoming reality. In June, Google engineer Blake Lemoine told the Washington Post he was convinced Google’s AI chatbot, LaMDA, was sentient. “I know a person when I talk to it,” Lemoine said. Therein lies the rub: As algorithms are getting increasingly good at producing the kind of “outputs” we once thought were distinctly human, it’s easy to be dazzled. To be sure, getting computers to generate compelling text and images is a remarkable feat; but it is not in itself evidence of sentience, or human-like intelligence. Current AI systems hold up a mirror to our online minds. Like Narcissus, we can get lost gazing at the reflection, even though it is not always flattering. We ought to ask ourselves: Is there more to these algorithms than mindless copy? The answer is not straightforward.
AI research is converging on a way to deal with many problems that once called for piecemeal or specific solutions: training large machine learning models, on vast amounts of data, to perform a broad range of tasks they have not been explicitly designed for. A group of researchers from Stanford coined the suggestive phrase “foundation models” to capture the significance of this trend, although we may prefer the more neutral label “large pre-trained models,” which loosely refers to a family of models that share a few important characteristics. They are trained through self-supervision, that is, without relying on humans manually labeling data; and they can adapt to novel tasks without further training. What’s more, just scaling up their size and training data has proven surprisingly effective at improving their capabilities—no substantial changes to the underlying architecture required. As a result, much of the recent progress in AI has been driven by sheer engineering prowess, rather than groundbreaking theoretical innovation.
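The idea of self-supervision can be pictured with a toy sketch: the training signal is carved out of the raw data itself, with no human labels, by hiding each next word and asking the model to predict it. The few lines below are a minimal illustration of that setup, not any real model's training code:

```python
# Toy illustration of self-supervised language modeling:
# training pairs come from the text itself, with no human labeling.
def next_token_pairs(text):
    """Turn a string into (context, next-token) training pairs."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("the cat sat on the mat")
# Each pair asks: given this context, what word comes next?
# e.g. (['the', 'cat', 'sat'], 'on')
```

A real model learns from billions of such predictions; the point of the sketch is only that the "labels" are free, which is what makes training on vast unannotated datasets possible.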
Some large pre-trained models are trained exclusively on text. In just a few years, these language models have shown an uncanny ability to write coherent paragraphs, explain jokes, and solve math problems. By now, all the large tech companies at the forefront of AI research have put big money into training their own large language models. OpenAI paved the way in 2020 with GPT-3, which was recently followed by a flurry of other gargantuan models, such as PaLM from Google, OPT from Meta (formerly Facebook), and Chinchilla from DeepMind.
Other large models are trained on images or videos as well as text. In the past few months, some of these new “multi-modal” models have taken the internet by storm with their unforeseen abilities: OpenAI’s DALL-E 2 and Google’s Imagen and Parti can generate coherent and stylish illustrations from virtually any caption; and DeepMind’s Flamingo can describe images and answer questions about their content. Large models are also reaching beyond language and vision to venture into the territory of embodied agency. DeepMind designed a model, called Gato, and trained it on things like button presses, proprioceptive inputs, and joint torques—in addition to text and images. As a result, it can play video games and even control a real-world robot.
It is easy to be impressed by what these models can do. PaLM, DALL-E 2, and Gato have fueled a new wave of speculation about the near-term future of AI (and a fundraising frenzy in the industry). Some researchers have even rallied behind the provocative slogan, “Scaling is all you need.” The idea is that further scaling of these models, or similar ones, might lead us all the way to AGI, or artificial general intelligence.
However, many researchers have cautioned against indulging our natural tendency for anthropomorphism when it comes to large pre-trained models. In a particularly influential article, Emily Bender, Timnit Gebru, and colleagues compared language models to “stochastic parrots,” alleging that they haphazardly stitch together samples from their training data. Parrots repeat phrases without understanding what they mean; so it goes, the researchers argue, for language models—and their criticism could be extended to their multi-modal counterparts like DALL-E 2 as well.
Ongoing debates about whether large pre-trained models understand text and images are complicated by the fact that scientists and philosophers themselves disagree about the nature of linguistic and visual understanding in creatures like us. Many researchers have emphasized the importance of “grounding” for understanding, but this term can encompass a number of different ideas. These might include having appropriate connections between linguistic and perceptual representations, anchoring these in the real world through causal interaction, and modeling communicative intentions. Some also have the intuition that true understanding requires consciousness, while others prefer to think of these as two distinct issues. No surprise, then, that researchers risk talking past each other.
Still, it’s hard to argue that large pre-trained models currently understand language, or the world, in the way humans do. Children do not learn the meaning of words in a vacuum, simply by reading books. They interact with the world and get rich, multi-modal feedback from their actions. They also interact with adults, who provide a nontrivial amount of supervised learning in their development. Unlike AI models, they never stop learning. In the process, they form persistent goals, desires, beliefs, and personal memories; all of which are still largely missing in AI.
Acknowledging the differences between large pre-trained models and human cognition is important. Too often, these models are portrayed by AI evangelists as having almost magical abilities or being on the verge of reaching human-level general intelligence with further scaling. This misleadingly inspires people to assume large pre-trained models can achieve things they can’t, and to be overconfident in the sophistication of their outputs. The alternative picture that skeptics offer through the “stochastic parrots” metaphor has the merit of cutting through the hype and tempering inflated expectations. It also highlights serious ethical concerns about what will happen as large pre-trained models get deployed at scale in consumer products.
But reducing large pre-trained models to mere stochastic parrots might push a little too far in the other direction, and could even encourage people to make other misleading assumptions. For one, there is ample evidence that the successes of these models are not merely due to memorizing sequences from their training data. Language models certainly reuse existing words and phrases—so do humans. But they also produce novel sentences never written before, and can even perform tasks that require using words humans made up and defined in the prompt. This also applies to multi-modal models. DALL-E 2, for example, can create accurate and coherent illustrations of such prompts as, “A photo of a confused grizzly bear in calculus class,” “A fluffy baby sloth with a knitted hat trying to figure out a laptop,” or “An old photograph of a 1920s airship shaped like a pig, floating over a wheat field.” While the model’s training data is not public, it is highly unlikely to contain images that come close to what these prompts (and many equally incongruous ones) describe.
I suggest much of what large pre-trained models do is a form of artificial mimicry. Rather than stochastic parrots, we might call them stochastic chameleons. Parrots repeat canned phrases; chameleons seamlessly blend into new environments. The difference might seem, ironically, a matter of semantics. However, it is significant when it comes to highlighting the capacities, limitations, and potential risks of large pre-trained models. Their ability to adapt to the content, tone, and style of virtually any prompt is what makes them so impressive—and potentially harmful. They can be prone to mimicking the worst aspects of humanity, including racist, sexist, and hateful outputs. They have no intrinsic regard for truth or falsity, making them excellent bullshitters. As the LaMDA story reveals, we are not always good at recognizing that appearances can be deceiving.
Artificial mimicry comes in many forms. Language models are responsive to subtle stylistic features of the prompt. Give such a model the first few sentences of a Jane Austen novel, and it will continue them with a paragraph that feels distinctively Austenian, yet is nowhere to be found in Austen’s work. Give it a few sentences from a 4chan post, and it will spit out vitriolic trolling. Ask it leading questions about a sentient AI, and it will respond like one. With some “prompt engineering,” one can even get language models to latch onto more complex patterns and solve tasks from a few examples. Text-to-image models respond to subtle linguistic cues about the aesthetic properties of the output. For example, you can prompt DALL-E 2 to generate an image in the style of a famous artist; or you can specify the medium, color palette, texture, angle, and general artistic style of the desired image. Whether it’s with language or images, large pre-trained models excel at pastiche and imitation.
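The “prompt engineering” trick of teaching a task from a few examples is, on the surface, nothing more than assembling a patterned string and letting the model continue it. The translation pairs below are made up for illustration; no particular model or API is assumed:

```python
# Sketch of a few-shot prompt: the only "training" the model receives
# is a handful of in-context examples prepended to the new query.
examples = [
    ("cheese", "fromage"),
    ("dog", "chien"),
]

def few_shot_prompt(examples, query):
    """Format demonstration pairs plus a new query for a language model."""
    lines = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    lines.append(f"English: {query}\nFrench:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(examples, "cat")
# A model that has latched onto the pattern should continue the
# prompt with the French word for "cat".
```

The remarkable part is not the string, of course, but that a sufficiently large model picks up the pattern and completes it correctly without any update to its weights.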
Here’s the thing about mimicry: It need not involve intelligence, or even agency. The specialized pigment-containing cells through which chameleons and cephalopods blend into their environments may seem clever, but they don’t require the animals to intentionally imitate features of their surroundings through careful analysis. The sophisticated eyes of the cuttlefish, which capture subtle color shades in the surroundings and reproduce them on the cuttlefish’s skin, enable a kind of biological mimicry that can be seen as solving a matching problem: sampling the right region of the color space based on context.
Artificial mimicry in large pre-trained models also solves a matching problem, but this one involves sampling a region of the model’s latent space based on context. The latent space refers to the high-dimensional abstract space in which these models encode tokens (such as words, pixels, or any kind of serialized data) as vectors—a series of real numbers that specify a location in that space. When language models finish an incomplete sentence, or when multi-modal models generate an image from a description, they sample representations from the region of their latent space that matches the context the prompt provides. This may not require the sophisticated cognitive capacities we are tempted to ascribe to them.
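The analogy can be made concrete with a toy sketch. The words, the 3-dimensional “embeddings,” and the cosine-similarity rule below are all invented for illustration; real models learn latent spaces with thousands of dimensions and sample from them in far subtler ways:

```python
import math

# Toy "latent space": each token is a point in a made-up 3-dimensional
# space. Real models use thousands of learned dimensions.
embeddings = {
    "king":  (0.9, 0.8, 0.1),
    "queen": (0.9, 0.7, 0.2),
    "apple": (0.1, 0.2, 0.9),
}

def cosine(u, v):
    """Cosine similarity: how closely two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(context_vec, embeddings):
    """Pick the token whose region of latent space best matches the context."""
    return max(embeddings, key=lambda w: cosine(embeddings[w], context_vec))

# A context vector placed near the "royalty" region picks out a royal
# word rather than "apple".
print(nearest((1.0, 0.75, 0.15), embeddings))
```

In this caricature, “matching the context” is just a nearest-neighbor lookup; the open question in the essay is how much more than that is going on in the real systems.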
Or does it? Sufficiently advanced mimicry is virtually indistinguishable from intelligent behavior—and therein lies the difficulty. When scaled-up models unlock new capabilities, such as combining novel concepts coherently, explaining new jokes to our satisfaction, or working through a math problem step-by-step to reach the correct solution, it is hard to resist the intuition that something more than mindless mimicry is going on.
Can large pre-trained models really offer more than a simulacrum of intelligent behavior? There are two ways to look at this issue. Some researchers think that the kind of intelligence found in biological agents is cut from a fundamentally different cloth than the kind of statistical pattern-matching large models excel at. For these skeptics, scaling up existing approaches is but a fool’s errand in the quest for artificial intelligence, and the label “foundation models” is an unfortunate misnomer.
Others would argue that large pre-trained models are already making strides toward acquiring proto-intelligent abilities. For example, the way large language models can solve a math problem involves a seemingly non-trivial capacity to manipulate the parameters of the input with abstract templates. Likewise, many outputs from multi-modal models exemplify a seemingly non-trivial capacity to translate concepts from the linguistic to the visual domain, and flexibly combine them in ways that are constrained by syntactic structure and background knowledge. One could see these capacities as very preliminary ingredients of intelligence, inklings of smarter aptitudes yet to be unlocked. To be sure, other ingredients are still missing, and there are compelling reasons to doubt that simply training larger models on more data, without further innovation, will ever be enough to replicate human-like intelligence.
To make headway on these issues, it helps to look beyond raw performance on benchmarks. Sharpening working definitions of terms such as “understanding,” “reasoning,” and “intelligence” in light of philosophical and cognitive science research is important to avoid arguments that take us nowhere. We also need a better understanding of the mechanisms that underlie the performance of large pre-trained models to show what may lie beyond artificial mimicry. There are ongoing efforts to meticulously reverse-engineer the computations carried out by these models, which could support more specific and meaningful comparisons with human cognition. However, this is a painstaking process that inevitably lags behind the development of newer and larger models.
Regardless of how we answer these questions, we need to tread carefully when deploying large pre-trained models in the real world; not because they threaten to become sentient or superintelligent overnight, but because they emulate us, warts and all.
Raphaël Millière is a Presidential Scholar in Society and Neuroscience in the Center for Science and Society at Columbia University, where he conducts research on the philosophy of cognitive science. Follow him on Twitter @raphaelmilliere.