We are going to start recording and automatically transcribing most of what we say. Instead of evaporating into memory, words spoken aloud will calcify as text, into a Record that will be referenced, searched, and mined. It will happen by our standard combination of willing and allowing. It will happen because it can. It will happen sooner than we think.
It will make incredible things possible. Think of all the reasons that you search through your email. Suddenly your own speech will be available in just the same way. “Show me all conversations with Michael before January of last year ... What was the address of that restaurant Mum recommended? ... When was the first time I mentioned Rob’s now-wife? ... Who was at that meeting again?” Robin Hanson, an economist at George Mason University and a co-author of a forthcoming book on evolutionary psychology, has speculated that we might all get in the habit of peppering our speech with keywords, to help us look it up later. Or, while you’re talking, a software agent could search your old conversations for relevant material. Details would come to your attention at just the moment that you needed them.
Much of what is said aloud would be published and made part of the Web. An unfathomable mass of expertise, opinion, wit, and culture—now lost—would be as accessible as any article or comment thread is today. You could, at any time, listen in on airline pilots, on barbershops, on grad-school bull sessions. You could search every mention of your company’s name. You could read stories told by father to son, or explanations from colleague to colleague. People would become Internet-famous for being good conversationalists. The Record would be mined by advertisers, lawyers, academics. The sheer number of words available for sifting and savoring would explode—simply because people talk a lot more than they write.
With help from computers, you could trace quotes across speakers, or highlight your most common phrases, or find uncommon phrases that you say more often than average to see who else out there shares your way of talking. You could detect when other people were recording the same thing as you—say, at a concert or during a television show—and automatically collate your commentary.
Bill Schilit, a Googler who did early work mining the Google Books corpus, suggested that you could even use quotations to find connections between scientific subjects. “In science you have this problem that the same thing is called different names by different people; but quotations tend to bridge the nomenclature between disciplines,” he said. He described a project where Google looked at quotations used by researchers in different fields. In each document, they’d extract the sentence just before the quotation—the one that introduced it—and then compare those two descriptions; that way they could find out what the quotation stood for: what it meant to different authors, what writers in different disciplines called the same thing.
But would all of this help or hurt us? In his book The Shallows, Nicholas Carr argues that new technology that augments our minds might actually leave them worse off. The more we come to rely on a tool, the less we rely on our own brains. That is, parts of the brain seem to behave like muscle: You either use it (and it grows), or you lose it. Carr cites a famous study of London taxi drivers studying for “The Knowledge,” a grueling test of street maps and points of interest that drivers must pass if they are to get their official taxi license. As the taxi drivers ingested more information about London’s streets, the parts of their brain responsible for spatial information literally grew. And what’s more, those growing parts took over the space formerly occupied by other gray matter.
Paradoxically, long-term memory doesn’t seem to work the same way; it doesn’t “fill up.” By offloading more of memory’s demands onto the Record, therefore, it might not be that we’re making space for other, more important thinking. We might just be depriving our brains of useful material. “When a person fails to consolidate a fact, an idea, or an experience in long-term memory,” Carr writes, “he’s not ‘freeing up’ space in his brain for other functions ... When we start using the Web as a substitute for personal memory, bypassing the inner processes of consolidation, we risk emptying our minds of their riches.”
The worry, then, is twofold: If you stopped working out the part of your brain that recalls speech, or names, or that-book-that-Brian-recommended-when-you-spoke-to-him-in-the-diner-that-day-after-the-football-game, maybe those parts of your brain would atrophy. Even more pernicious, as you came to rely more on the Record as a store of events and ideas, you would decide less often to commit them to your own long-term memory. And so your mind would become a less interesting place.
If that’s frightening, consider also what it might be like to live in a society where everything is recorded. There is an episode of the British sci-fi series Black Mirror set in a world where Google Glass–style voice and video recording is ubiquitous. It is a kind of hell. At airport security, the agents ask you to replay your last 24 hours at high speed, so they can clear all the faces you interacted with. At parties, instead of making new conversation, people pore over their “redos” and ask to see their friends’. In lonely moments, instead of rehearsing memories in the usual way—using the faulty, foggy, nonlinear recall apparatus of their own minds—people replay videos, zooming in on parts they missed the first time around. They seem to live so much in the past as to be trapped by it. The past seems distorted and refracted by the too-perfect, too-public record. In the episode’s most vividly dark moment, we see a couple passionately making love, only to realize that the great sex is happening in “redos” that they’re both watching on their implanted eye-screens; in the real present, they’re humping lovelessly on a cold bed, two drugged-out zombies.
Between these visions of heaven and hell lies the likely truth: When something like the Record comes along, it won’t reshape the basic ways we live and love. It won’t turn our brains to mush, or make us supermen. We will continue to be our usual old boring selves, on occasion deceitful, on occasion ingenuous. Yes, we will have new abilities—but what we want will change more slowly than what we can do.
Speech recognition has long been a holy grail of artificial intelligence research. “The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon,” the Bell Labs engineer J.R. Pierce wrote in 1969. He argued that we attacked and funded the problem not because it was tractable or even useful, but simply because there would be something magnificent in talking to a computer. It would be like science fiction. The machine would seem alive.
The fact that the problem of recognizing speech seemed to contain within it the whole problem of human understanding—after all, in order to parse an ambiguous sound, we bring to bear not just knowledge of language but knowledge of the world—only made it more enticing. Progress in speech recognition would stand for progress in AI more generally. And so it became a benchmark and a prize.
The earliest working systems restricted themselves to a simple vocabulary—say, the digits “zero” through “nine” spoken one at a time—and distinguished words by looking for specific features in their sound waves. As you might expect, when the vocabulary grew, the distinctions between different words’ sound waves became more subtle. The approach fell apart. Researchers realized they needed something more robust.
Their insight, arrived at in the 1970s, was to model speech as a sequence at multiple levels simultaneously. That is, at each moment, they imagined their recognition system as being in some state at the sound level, the syllable level, the word level, the phrase level, and so on. Its job was to predict, at each level, what the next state would be. To do so it used large tables of probabilities that said, in essence, “If you see state A, then state B happens 0.1 percent of the time, state C happens 30 percent of the time, state D happens 11 percent of the time,” and so on. These tables were made by training the system on labeled data (recordings that had been transcribed by hand, and were known to be correct). The trick was that if the word-level prediction was ambiguous—maybe because the environment was noisy, or the speaker’s voice was distorted—predictions from the other levels could be used to rule out possibilities and home in on the correct choice. It was a massive advance. It was like going from trying to solve the crossword one clue at a time to playing on the grid: Each clue offered hints about the others, simplifying and reducing the puzzle.
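To make the idea of a probability table concrete, here is a minimal sketch in Python. It is not how any real recognizer was built: the words, the numbers, and the function name `pick_word` are all invented for illustration. It shows only the core move described above, in which an ambiguous sound-level guess is resolved by multiplying it against a word-level table of what tends to follow what.

```python
# Hypothetical numbers throughout; nothing here comes from a real recognizer.

# Word-level table: given the previous word, how often each next word follows.
TRANSITIONS = {
    "recognize": {"speech": 0.30, "beach": 0.001, "faces": 0.11},
}


def pick_word(previous_word, acoustic_scores, transitions=TRANSITIONS):
    """Combine sound-level evidence with the word-level table.

    Returns the best candidate and the combined score of every candidate.
    """
    table = transitions.get(previous_word, {})
    combined = {
        word: score * table.get(word, 0.0)  # unseen transitions count as 0
        for word, score in acoustic_scores.items()
    }
    return max(combined, key=combined.get), combined


# In a noisy room, the sound alone slightly favors "beach" over "speech".
acoustic = {"speech": 0.45, "beach": 0.55}

best, scores = pick_word("recognize", acoustic)
print(best)    # the word-level table overrules the noisy sound-level guess
print(scores)
```

On its own, the sound-level evidence would pick "beach"; once the word-level table weighs in, "speech" wins easily. A real system would chain many such tables across levels and over whole utterances, but the principle of one level ruling out another's bad guesses is the same.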
This insight, combined with the exponential growth of training data and computational power, underwrote most of the progress in speech recognition in the last four decades. It’s what got us workable but error-prone dictation software, like Dragon NaturallySpeaking, the first versions of Siri, and those automated phone trees that let you speak your selection (“billing inquiry” or “schedule maintenance”). But around 2010 it seemed as though progress might always be incremental—like there were no big ideas left in speech recognition. The field seemed to have plateaued. Then deep learning came along.
Geoffrey Hinton and his collaborators, then at the University of Toronto and now at Google, were experimenting with deep neural nets. Neural nets are computer programs that work a little like the brain: They are made of layers of neuron-like cells that receive input from other neurons, compute some simple function (like a sum or average) over those inputs, and either fire or not based on the value of that function, spreading activation to other neurons deeper in the net. The nets are trained by entering inputs into the bottommost layer, and seeing what comes out of the topmost layer; if the output isn’t what you were expecting, you use a simple learning algorithm to adjust the strength of the connections (the “synapses”) between neurons until you get what you want. Rinse and repeat over many billions of examples, and your net might come to encode important features of the problem at hand, and work well as a recognizer.
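The train-and-adjust loop is easier to picture at the smallest possible scale. The sketch below is a single neuron-like unit, not a deep net, and it uses the classic perceptron update rule as a stand-in for the learning algorithms used on real networks. The task (learning logical OR), the learning rate, and the names `fires` and `train` are all assumptions chosen for illustration.

```python
def fires(weights, bias, inputs):
    """Sum the weighted inputs; the unit fires (1) if the total is positive."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(examples, epochs=20, rate=0.1):
    """Perceptron rule: nudge each connection whenever the output is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - fires(weights, bias, inputs)  # -1, 0, or +1
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Labeled data: each pair of inputs comes with the output we want (logical OR).
OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train(OR_EXAMPLES)
predictions = [fires(weights, bias, inputs) for inputs, _ in OR_EXAMPLES]
print(predictions)  # [0, 1, 1, 1]
```

After a few passes over four examples, the unit's connection strengths settle and it answers correctly. Deep nets stack many layers of such units and need a more elaborate way of assigning blame for errors, but the rhythm of guess, compare, and adjust is the one described above.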
Most neural nets are stateless, in the sense that the output for a given set of inputs depends on those inputs alone. This limits their effectiveness for modeling speech. But Alex Graves, working in Hinton’s lab, wondered what would happen if you tackled the speech recognition problem using neural nets whose output could depend on sequences of inputs, known as “recurrent neural nets.” They were remarkably effective. Graves’ RNNs—which are given far less information about language than those multi-level prediction systems that had long been a mainstay of the field—were soon matching and then surpassing the performance of the old approach.
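The difference between a stateless unit and a recurrent one can be shown in a few lines. This is a toy with hand-picked, untrained weights, and it is not the architecture Graves used; the names `stateless_unit` and `recurrent_run` and the values of `w_in` and `w_rec` are assumptions made for the sake of the example.

```python
import math


def stateless_unit(x, w_in=0.8):
    """Output depends only on the current input."""
    return math.tanh(w_in * x)


def recurrent_run(sequence, w_in=0.8, w_rec=0.5):
    """Output depends on the current input and on a carried-over hidden state."""
    hidden = 0.0
    outputs = []
    for x in sequence:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        outputs.append(hidden)
    return outputs


# Two sequences that end in the same input but begin differently.
seq_a = [0.0, 1.0]
seq_b = [1.0, 1.0]

stateless_a = [stateless_unit(x) for x in seq_a]
stateless_b = [stateless_unit(x) for x in seq_b]
recurrent_a = recurrent_run(seq_a)
recurrent_b = recurrent_run(seq_b)

# The stateless unit cannot tell the two endings apart; the recurrent one can.
print(stateless_a[-1] == stateless_b[-1])
print(recurrent_a[-1], recurrent_b[-1])
```

The stateless unit gives the same answer to the final input in both sequences. The recurrent unit gives different answers, because what it heard a moment ago is still echoing in its hidden state. That memory of context is what a model of speech needs, since the same sound can mean different things depending on what came before it.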
When I spoke to Hinton, and asked him how such simple programs could recognize speech so effectively, he said he was reminded of some sketches that he likes, by Leonardo da Vinci, of turbulent water going past a lock. The water is rushing and frothing and swirling in eddies, a complex mess. But of its behavior, Hinton said, “It’s all described by the extremely simple Navier-Stokes equations.” A few simple rules generate all the complexity. The same thing, he argues, happens when a neural network learns to recognize speech. “You don’t need to hand-engineer lots of complicated speech phenomena into the system,” Hinton says.
At Google, Hinton and his colleagues are doing basic research in computer science, examining, as he put it, “the space of learning algorithms that work well.” Their findings will have a huge number of applications. But speech continues to be a top priority, and not just because it’s a good proving ground for their algorithms. “The thing about speech,” Hinton told me, “is that it’s the most natural way to interact with things.”
Google, Apple, Amazon, and Microsoft are not interested, today, in recording and transcribing everything we say. They are interested in voice as an interface. The Amazon Echo, for instance, sits and waits for you to issue it commands; for playing music or looking up a bit of trivia, talking is easier than typing, especially when you can do it from anywhere in the room. And as computers get smaller, and move onto our wrist or the bridge of our nose, perhaps someday into our ear, keyboards stop being practical—and yet we still need a way to tell the computer what to do. Why not just tell it? Why not just say, “OK Google, direct me home”?
This is how it’s going to happen. Speech recognition technology is being driven both by basic research into AI—because it’s a model problem—and by the perceived need of Google and its ilk to create better voice interfaces for their new devices. Intentionally or not, the tech will soon get so good as to reach a tipping point—what the journalist Matt Thompson called the Speakularity—where “the default expectation for recorded speech will be that it’s searchable and readable, nearly in the instant.” The only question, then, will be what we decide to record.
You’d think we were a strange species, if you listened to the whole of humanity’s recorded corpus today. You’d find all the blathering radio hosts there ever were, and the many takes of voiceover actors, and you’d find journalists talking to their subjects, and pilots to their controllers—and that would all be but the tiniest speck in a vast sea of calls to customer service, “recorded for quality purposes.” You wouldn’t get a sense of what human life actually sounded like, of what we actually talked about.
Megan Robbins, an assistant professor of psychology at the University of California, Riverside, has listened to more regular talk than almost anyone in the world. Her research relies on a device, called the EAR (for Electronically Activated Recorder), designed for “sampling behavior in naturalistic settings.” Research subjects agree to wear it all day. It turns on at periodic intervals about five times an hour, and records everything the wearer says and hears for about 30 seconds. The subject can review and delete any clips they like before handing them over to Robbins for analysis.
With the EAR, Robbins can be a scientist of everyday life. For instance, she can listen to how couples refer to themselves: Do they say “she and I” or “we?” She can listen to people laugh, and try to figure out why. One study found that “Overwhelmingly, most laughs didn’t occur in the presence of humorous stimulus.” By and large, laughs are social, and used to signal things, like “I think you’re higher status than me,” or “I want to affiliate with you.”
Robbins is currently using the EAR to study couples coping with a cancer diagnosis. What do they talk about? Do they talk about the cancer? Do they laugh less? “You’d never think to run a focused study about how often do breast cancer patients laugh,” she says. But with hours and hours of transcripts and tape, a world of questions about our basic behavior opens up. As it turns out, the cancer patients are laughing in about 7 percent of their clips, comparable to college students. They talk about cancer about the same percent of the time. Robbins explains that there seems to be a robustness to the everyday, even when you’re diagnosed with cancer. “It’s really difficult for people to not carry on with their normal daily activities.”
She explains that people talk a lot—on average, about 40 percent of their waking lives. Her undergraduate research assistants, who come to her lab excited to eavesdrop on people, “are sometimes heartbroken to find that daily life, it’s sometimes mundane. It’s comprised of things like TV watching, and conversations about what you’re going to have for dinner. And conversations about TV.” Robbins says she was surprised at just how much television regular people watch. “It’s a topic that’s almost completely ignored in psych, but shows up in the EAR research ... it’s second only to talking in the cancer coping couples.”
One thing people don’t talk about, in general, is the EAR itself. “Self-reports indicate no impact on their life. They generally forgot they were wearing it.” Indeed, one can track mentions of the EAR in the transcripts. They drop off significantly after only two and a half hours. “Normal life goes on,” Robbins says.
When presented with the idea of the Record, we might imagine that people won’t be able to carry on a normal conversation because they’ll be too busy performing. But anyone who’s ever recorded someone knows that self-conscious monitoring of your own speech is just too mentally expensive to carry on for very long. Robbins’ data supports the intuition that, after a short while, you go back to normal.
Hanson also thinks “normal” would be the operative word once ubiquitous speech transcription arrives. He’s not convinced that it would change the world as much as some seem to think it would. “As soon as you see just how different our world is from 1,000 years ago, it’s really hard to get very worked up about this,” he says.
There was almost no privacy 1,000 years ago, he explains. Living quarters were dense. Rooms were tiny and didn’t lock. There were no hallways. Other people could overhear your lovemaking. When you traveled, you hardly ever went by yourself; you roamed around in little groups. Most people lived in small towns, where most everybody knew everybody else and gossiped about them. The differences in how we lived between then and now were huge. And yet we adapted. “I gotta figure the changes we’re looking at are small by comparison,” he says. People have always been able to distinguish between their close friends and their less-close friends. They’ve always been able to decide who to trust, and they’ve always found ways to communicate intimacy. They’ve always been able to lie.
“Even our forager ancestors were quite capable of not telling each other something,” he says. “Foragers are supposed to share food. But they hide a lot of food. They eat a lot on the way back to camp, they hide some at camp, they’re selective about which food they give to who.” Even in a band of 30 people, where the average person would meet a half-dozen other bands at most in their lifetime, and everybody stayed in the same camp at night—even in that environment, our ancestors were capable of being evasive, and tuning their speech and gestures to their advantage.
Having a Record will just give us a new dimension on which to map a capacity we’ve always had. People who are constantly being recorded will adapt to that fact by becoming expert at knowing what’s in the transcript and what’s not. They’ll be like parents talking around children. They’ll become masters of plausible deniability. They’ll use sarcasm, or they’ll grimace or grin or lean their head back or smirk, or they’ll direct their gaze, so as to say a thing without saying it.
It sounds exhausting, but of course we already fluidly adapt to the spectrum of private, small-group, and public conversations—just go to a workplace. Or go to a party. We are constantly asking and answering subtle questions about our audience, and tuning our speech based on the answers. (Is Jack in earshot? Is Jack’s wife in earshot?)
“There’s no way this means that everything we say is now in the open,” Hanson argues. “There’s a layer of what we say that’s in the open ... but we’re always talking at several levels at once.”
Whenever we contemplate a new technology, we tend to obsess and fixate, as though every aspect of the world must now be understood in terms of it. We are a hypochondriacal society. But the fact is that the hardware running inside our heads hardly changes at all, and the software only slowly, over the course of generations.
The Record will not turn our brains to mush. Yes, we will likely spend less energy committing great talk to our long-term memories. And transcripts will relieve us from having to track certain details that come up in conversation. But we won’t thereby lose the ability to track details—just as we didn’t lose our ability to plan when we invented the calendar, or our ability to memorize when we invented the pen. We will enrich our long-term memories in some other way (say, by poring over the vast stores of material newly made available by transcription). Our brains adapted to writing, to libraries, and to the Web. They will adapt to the Record. And people will, anyway, continue to be less concerned with how they sound than with how they look. They will be far more likely to pause for a selfie than for a soliloquy.
Nor is life like a Black Mirror episode, where every scene and line revolves, because it must, around the newest tech. Sure, the Record may enhance our narcissism, our nostalgia, our impatience and paranoia. It might even corrupt or stupefy us en masse. But even that has happened before, whether with smartphones or television or mirrors or alcohol, and somehow we have managed to end up, above all, ourselves.
James Somers is a programmer and writer based in New York. He works at Genius.