Flow-VAE VC: End-to-End Flow Framework with Contrastive Loss for Zero-shot Voice Conversion
Abstract
Voice conversion (VC) seeks to modify one speaker's voice to generate speech as if it came from
another speaker. It is
challenging especially when source and target speakers are unseen during training (zero-shot VC).
Recent work in this
area made progress with disentanglement methods that separate utterance content and speaker
characteristics from speech
audio recordings. However, these models either lack adequate disentanglement ability or rely on the
use of a trained
vocoder to reconstruct the speech from acoustic features. We propose Flow-VAE VC, which is an
end-to-end system
processing directly on the raw audio waveform for zero-shot tasks. Flow-VAE VC adopts a conditional
Variational
Autoencoder (VAE) with normalizing flows and an adversarial training process to improve the
expressive power of
generative modeling. Specifically, we learn context-invariant representations by applying
frame-level contrastive loss
to speech different augment samples. The experiments show that the proposed method achieves a decent
performance on
zero-shot voice conversion and significantly improves converted speech naturalness and speaker
similarity.