A person’s voice is one of the most fundamental attributes enabling communication with others, whether in physical proximity, at remote locations over phones or radios, or across the Internet via digital media. However, often unbeknownst to them, people leave traces of their voices in many different scenarios and contexts. It is relatively easy for someone, potentially with malicious intentions, to “record” a person’s voice by being in close physical proximity to the speaker (using, for example, a mobile phone), by social engineering tricks such as making a spam call, by searching and mining audiovisual clips online, or even by compromising cloud servers that store such audio. The more popular a person is (e.g., a celebrity or a famous academic), the easier it is to obtain his/her voice samples.
In this work, we study the implications of such commonplace leakage of people’s voice snippets. We show that the consequences of imitating one’s voice can be grave. Since voice is regarded as a unique characteristic of a person, it forms the basis of authenticating that person. If a voice could be imitated, it would compromise the authentication function itself, whether performed implicitly by a human in human-to-human communication or explicitly by a machine in human-to-machine interaction. Equipped with current advancements in automated speech synthesis, our attacker can build a very close model of a victim’s voice after learning from only a limited number of samples of the victim’s voice (e.g., mined from the Internet, or recorded in physical proximity). Specifically, the attacker uses voice morphing techniques to transform his or her own voice – speaking any arbitrary message – into the victim’s voice.
As our case study in this work, we investigate the consequences of stolen voices in two important applications and contexts that rely upon voice as an authentication primitive. The first application is a voice-based biometric or speaker verification system, which uses the potentially unique features of an individual’s voice to authenticate that individual. Our second application, naturally, is human-to-human communication. If an attacker can imitate a victim’s voice, the security of arbitrary (remote) conversations could be compromised. The attacker could make the morphing system speak literally anything the attacker wants, in the victim’s tone and style of speaking, and could launch attacks that harm the victim’s reputation, his or her security and safety, and the security and safety of people around the victim.
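At a high level, a speaker verification system accepts a claimed identity only when the voice sample is sufficiently similar to an enrolled template of that speaker. The following sketch is a generic, hypothetical illustration of this threshold decision (not the actual verification systems evaluated in this work), assuming fixed-dimensional voice embeddings have already been extracted from audio:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two voice embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled, claimed, threshold=0.8):
    # Accept the claimed speaker only if similarity to the enrolled
    # template exceeds the decision threshold.
    return cosine_similarity(enrolled, claimed) >= threshold

# Toy, hand-picked embeddings for illustration only; real systems derive
# these from acoustic features (e.g., MFCCs) via a trained model.
enrolled = np.array([1.0, 1.0, 0.0, 0.0])   # victim's enrolled template
genuine  = np.array([1.0, 0.9, 0.1, 0.0])   # another sample from the victim
imposter = np.array([1.0, 0.2, 0.8, 0.5])   # a dissimilar speaker

print(verify(enrolled, genuine))   # accepted
print(verify(enrolled, imposter))  # rejected
```

A morphing attack succeeds precisely when the attacker’s transformed sample lands close enough to the victim’s template that this similarity score clears the threshold, which is what our evaluation measures.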
We develop our voice impersonation attacks using an off-the-shelf voice morphing tool, and evaluate their feasibility against state-of-the-art automated speaker verification algorithms (application 1) as well as human verification (application 2). Our results show that the automated systems are largely ineffective against our attacks: the average rates at which fake voices were rejected were under 10–20% for most victims. Even human verification is vulnerable to our attacks. Based on two online studies with about 100 users, we found that people rejected the morphed voice samples of two celebrities, as well as of briefly familiar speakers, only about 50% of the time on average. The following figure shows an overview of our work.

An overview of our attack system




  • Maliheh Shirvanian (PhD candidate)
  • Dibya Mukhopadhyay (@UAB; Master 2016; now Sr. Data Analyst at Westfield Retail Solutions)


Media Coverage