Your phone rings. It’s your company’s president requesting you transfer money to a new supplier. “Of course,” you say, glad she thought of you for the task.

The problem is, it wasn’t your boss calling from the other side of the building but a scammer, half a world away, hiding behind an AI-generated replica of her voice. Those funds are now gone.

Swindlers have been using this sort of high-tech trickery against unwitting individuals for years. In 2019, the CEO of an energy firm in the United Kingdom fell for a deepfake phone call and sent almost a quarter of a million dollars to a scammer. “The U.K. CEO recognized his boss’ slight German accent and the melody of his voice on the phone,” The Wall Street Journal reported about the incident. The next year, a Hong Kong bank manager began fulfilling what he believed was a request from the company’s director to transfer $35 million as part of a (fabricated) merger.

This generative AI technology can mimic someone’s voice convincingly enough to fool listeners more than a quarter of the time, according to a new paper, even when they have been warned that the voice they are hearing might be a deepfake.

For the study, Kimberly Mai, a machine-learning researcher at University College London, and her colleagues had 529 people listen to recordings of voices and asked them to pick out which were real and which were AI-generated. 

People rely on intuition. 

Interestingly, deepfakes in English and Mandarin fooled fluent speakers at the same rate: about 27 percent of the time. Reached by email, Mai says the only thing that would make a speaker harder to imitate is a lack of training data in a particular language or dialect. Even when researchers fed participants examples of fake voices before having them do the rest of the survey, they didn’t get much better at detecting the fakes (improving by less than 4 percent). Nor did listening to the clips over and over help.

It turns out that people don’t take an entirely scientific approach to assessing the authenticity of a voice. “I was surprised that people rely on intuition,” says Mai, who, before pursuing her Ph.D., worked for an international investment bank in the financial crime department. (In the study, English speakers said they tended to spot fakes based on the “naturalness” of a voice and its breathing; Mandarin speakers also based their accurate catches on naturalness, as well as cadence and pacing.) This is different from a photo or video deepfake, she notes. In that iteration of trickery, “you can count the number of fingers on their hands or whether their accessories match to make a judgment.”

For the most part, these deepfakes are just the next-level version of the same old scams that have circulated through email and phone lines for decades. As these generative AI technologies become more commonplace (Apple has announced a feature called Personal Voice in the next iOS that can replicate a user’s voice from about 15 minutes of audio recordings), people will surely put them to any imaginable end, whether hyper-targeted phishing, disinformation, or adolescent pranks. But, concerningly, Mai notes, the tests in the new study were done with technology that is already outdated.

Of course, detection technology is racing to keep pace with deepfakes. Mai and her colleagues note that current automatic detection performs, on average, about as reliably as humans. But researchers and engineers are working to fine-tune this application as a sort of “biometric authentication,” Mai says. This sort of detection aims to check both whether the voice belongs to a human (or a machine) and whether it belongs to the correct human. But that requires lots of logistics, such as screening voices before they speak and comparing the caller’s voice to a vast database of voice samples.
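To make those two checks concrete, here is a minimal sketch of such a pipeline in Python. Everything in it is a stand-in rather than a description of any real product: synthetic_score and embed_voice are hypothetical placeholders for the anti-spoofing and speaker-verification models a working system would actually train, and the thresholds are arbitrary.

    import numpy as np

    # Fixed random projection used by the placeholder embedding below.
    _PROJECTION = np.random.default_rng(0).standard_normal((128, 1600))

    def synthetic_score(audio: np.ndarray) -> float:
        """Placeholder anti-spoofing model: a score in [0, 1] for how
        machine-generated the clip sounds. A real system would use a
        trained detector here."""
        return float(np.clip(np.std(audio), 0.0, 1.0))

    def embed_voice(audio: np.ndarray) -> np.ndarray:
        """Placeholder speaker model: map audio to a fixed-length
        'voiceprint'. A real system would use a speaker-verification
        network."""
        return _PROJECTION @ np.resize(audio, 1600)

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def verify_caller(audio, enrolled_prints,
                      spoof_threshold=0.5, match_threshold=0.7):
        """Two-stage check: (1) is the voice human? (2) is it the right human?"""
        # Stage 1: reject clips the anti-spoofing model flags as synthetic.
        if synthetic_score(audio) > spoof_threshold:
            return "rejected: voice appears machine-generated"
        # Stage 2: compare the caller's voiceprint to enrolled samples.
        voiceprint = embed_voice(audio)
        best = max(cosine_similarity(voiceprint, p) for p in enrolled_prints)
        if best < match_threshold:
            return "rejected: voice does not match the enrolled speaker"
        return "accepted"

    # Example: enroll one speaker, then screen an incoming call.
    enrolled = [embed_voice(np.random.default_rng(1).standard_normal(16000))]
    incoming = np.random.default_rng(2).standard_normal(16000)
    print(verify_caller(incoming, enrolled))

The point of the structure is the order of operations: first decide whether any human is speaking, then decide which human, which is exactly where the logistics Mai describes (enrollment databases, pre-call screening) come in.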

Mai’s best suggestion for detection lands on old-fashioned common sense: Consider the content of the message. “If it involves a request to transfer a large amount of money,” for example, she says, “it is a good idea to double-check, discuss with others, and verify the source.” 

Lead image: eamesBot / Shutterstock
