Once it has heard you speak for three seconds, VALL-E can synthesise audio of you saying anything and mimic your emotional tone.
Volish boffins claim that VALL-E could be used for high-quality text-to-speech applications, for speech editing, where a recording of a person could be altered by editing a text transcript (making them say something they never did), and for audio content creation when combined with other generative AI models like GPT-3.
While the tech behind the idea is interesting, it does not seem that anyone has stopped to think that this is an idiotic idea which is unlikely to do any good.
Microsoft calls VALL-E a “neural codec language model,” and it uses a technology called EnCodec, which Meta announced in October 2022. VALL-E generates discrete audio codec codes from text and acoustic prompts, unlike other text-to-speech methods, which typically synthesise speech by manipulating waveforms.
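To make “discrete audio codec codes” a little more concrete, here is a toy sketch of the general idea: chop a waveform into frames, replace each frame with the index of its nearest entry in a codebook, and rebuild an approximation by looking those indices up again. This is not EnCodec, which uses a learned neural encoder and several stacked codebooks. The codebook, frame size and function names below are all made up for illustration.

```python
# Toy illustration only: one tiny hand-written codebook, not EnCodec's
# learned neural encoder or its stacked (residual) quantisers.

def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to `frame`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

def encode(samples, codebook, frame_size):
    """Split samples into fixed-size frames and map each frame to a token id."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [nearest_code(frame, codebook) for frame in frames]

def decode(tokens, codebook):
    """Rebuild an approximate waveform by looking each token up."""
    return [x for token in tokens for x in codebook[token]]

# Hypothetical 4-entry codebook of 2-sample frames.
CODEBOOK = [(0.0, 0.0), (0.5, 0.5), (-0.5, -0.5), (0.5, -0.5)]

samples = [0.1, -0.1, 0.4, 0.6, -0.6, -0.4, 0.45, -0.55]
tokens = encode(samples, CODEBOOK, frame_size=2)
approx = decode(tokens, CODEBOOK)

print(tokens)  # [0, 1, 2, 3]
print(approx)  # [0.0, 0.0, 0.5, 0.5, -0.5, -0.5, 0.5, -0.5]
```

The point of the exercise is that a model working on the token list deals with a short sequence of integers rather than raw samples, which is what lets a language-model-style approach be applied to audio.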
It analyses how a person sounds, breaks that information into discrete components (called “tokens”) using EnCodec, and draws on its training data to predict how that voice would sound if it spoke phrases outside of the three-second sample. Or, as Microsoft puts it in the VALL-E paper:
To synthesise personalised speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the three-second enrolled recording and the phoneme prompt, which constrain the speaker and content information. Finally, the generated acoustic tokens are used to synthesise the final waveform with the corresponding neural codec decoder.
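For a rough sense of how much information that three-second recording gives the model, the arithmetic can be sketched as follows. The figures are assumptions not stated in this article: they come from the published EnCodec configuration for 24 kHz audio at 6 kbps (75 frames per second, eight codebooks of 1,024 entries each). The function name is hypothetical.

```python
# Assumed figures: EnCodec's published 24 kHz / 6 kbps setup.
# None of these numbers appear in the article itself.
SAMPLE_RATE_HZ = 24_000
FRAMES_PER_SECOND = 75
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024

def prompt_token_budget(seconds):
    """Count raw samples, codec frames and tokens for a clip of this length."""
    frames = int(seconds * FRAMES_PER_SECOND)
    return {
        "samples": int(seconds * SAMPLE_RATE_HZ),
        "frames": frames,
        "tokens": frames * NUM_CODEBOOKS,
    }

budget = prompt_token_budget(3)

# Each token picks one of 1,024 entries, i.e. 10 bits.
bits_per_token = CODEBOOK_SIZE.bit_length() - 1
bits_per_second = FRAMES_PER_SECOND * NUM_CODEBOOKS * bits_per_token

print(budget)           # {'samples': 72000, 'frames': 225, 'tokens': 1800}
print(bits_per_second)  # 6000
```

Under those assumptions, the enrolled recording shrinks from 72,000 raw samples to 1,800 tokens, which is the form in which the model “hears” the speaker.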
Microsoft trained VALL-E’s speech synthesis capabilities on an audio library assembled by Meta called LibriLight. It contains 60,000 hours of English-language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.
Perhaps owing to VALL-E’s ability to fuel mischief and deception, Microsoft has not provided VALL-E code for others to experiment with, so we could not test VALL-E’s capabilities. The researchers seem aware of the potential social harm that this technology could bring. In the paper’s conclusion, they write:
Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesised by VALL-E.