MIDS Capstone Project Fall 2022

Amplified

Voice cloning is gaining popularity for its many use cases. A recent, well-known example is Val Kilmer’s voice being cloned from archived audio so that he could deliver dialogue in his own voice in the movie Top Gun: Maverick [1]. As Kilmer’s case shows, the AI technology used to generate voice clones may benefit people who have lost their natural ability to speak. Another growing use case is media companies that want to regenerate celebrity voices. For example, James Earl Jones recently gave his blessing for a clone of his voice to be used in the mini-series Obi-Wan Kenobi [2].

Other possible use cases include companies and brands that want voice-overs for games, branding, and educational material, to name a few. At least one projection of the voice-cloning market suggests growth to as much as 932 million dollars by 2027, up from 224 million dollars in 2020 [3].

Our team researched some of the latest open-source speech-to-speech technologies in order to provide a level of transparency about the state of voice cloning that is not available from the commercial companies currently working on this problem. Specifically, we implemented two AI models to determine how much voice data is needed to train them to generate quality voice clones. The details of the two architectures we tested are included below.

Baseline

We used the Soft-VC model described in the paper “A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion” [4]. We varied the amount of voice training data used for both the acoustic model and the vocoder components of the architecture described in the paper.
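One way to vary the amount of training data is to draw nested subsets of a speaker's utterances under increasing duration budgets, so that each larger subset contains every smaller one. The sketch below is only an illustration of that idea: the function name `nested_subsets`, the `(utterance_id, duration_sec)` input format, and the budgets are hypothetical, and the page does not state how the subsets were actually drawn.

```python
import random


def nested_subsets(utterances, budgets_sec, seed=0):
    """Pick utterance ids for each duration budget (in seconds).

    utterances: list of (utterance_id, duration_sec) pairs (hypothetical format).
    budgets_sec: iterable of total-duration budgets, e.g. [60, 300, 600].
    Returns a dict mapping each budget to a list of utterance ids.
    Every subset is a prefix of the same shuffled order, so the subset
    for a smaller budget is always contained in the subset for a larger one.
    """
    order = list(utterances)
    random.Random(seed).shuffle(order)  # fixed seed keeps the split reproducible

    subsets = {}
    for budget in sorted(budgets_sec):
        chosen, total = [], 0.0
        for utt_id, duration in order:
            if total + duration > budget:
                break  # stop at the first utterance that would exceed the budget
            chosen.append(utt_id)
            total += duration
        subsets[budget] = chosen
    return subsets


if __name__ == "__main__":
    # Made-up utterance durations, for illustration only.
    demo = [(f"utt_{i:03d}", 4.0 + (i % 5)) for i in range(200)]
    for budget, ids in nested_subsets(demo, [60, 300, 600]).items():
        print(budget, len(ids))
```

Training each model once per budget and comparing the listener scores would then show how quality changes with the amount of data.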

Novel Enhancements

We combined the HuBERT model from the Soft-VC architecture with the VITS model from the paper “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech” [5]. This combination allowed us to preserve the benefits of training with unlabeled data while improving the quality of the converted speech.

We measured the results of the cloning using the Mean Opinion Score (MOS). We recruited 13 participants to score the voices on the three components of the MOS: adequacy, fluency, and naturalness.
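As a rough illustration of how such listener ratings can be aggregated, the sketch below averages the scores for each component and reports an approximate 95% confidence interval. The 1-to-5 rating scale, the function name `mean_opinion_score`, and the example ratings are assumptions made for illustration; they are not the project's actual data or analysis code.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    ratings: list of numeric listener scores (assumed here to be on a 1-5 scale).
    The interval uses the normal approximation (1.96 standard errors), which
    is only a rough guide for a small panel such as 13 listeners.
    """
    n = len(ratings)
    mos = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(n) if n > 1 else 0.0
    return mos, half_width


if __name__ == "__main__":
    # Hypothetical ratings from 13 listeners for one cloned voice.
    scores = {
        "adequacy": [4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 3],
        "fluency": [3, 4, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4],
        "naturalness": [3, 3, 4, 2, 3, 4, 3, 3, 4, 3, 2, 3, 4],
    }
    for component, ratings in scores.items():
        mos, ci = mean_opinion_score(ratings)
        print(f"{component}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether a difference between two models is larger than the spread among listeners.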

Links to references [datasets, papers, etc.]:

[1] Val Kilmer Speak In Top Gun

[2] James Earl Jones AI Voice

[3] Voice-Cloning Market

[4] A Comparison of Discrete and Soft Speech Units

[5] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Last updated: December 8, 2022