VoiceMixer Demo

ABSTRACT

Although recent advances in voice conversion have shown significant improvement, a gap remains between the converted voice and the target voice. A key factor that maintains this gap is the insufficient decomposition of content and voice style from the source speech. As a result, the converted speech retains the source speech style or loses source speech content. In this paper, we present VoiceMixer, which can effectively decompose and transfer voice style through a novel information bottleneck and adversarial feedback. With self-supervised representation learning, the proposed information bottleneck can decompose the content and style with only a small loss of content information. In addition, to provide adversarial feedback on each type of information, the discriminator is decomposed into content and style discriminators with self-supervision, which enables our model to generalize better to the voice style of the converted speech. Experimental results show that our model achieves superior disentanglement and transfer performance and improves audio quality by preserving content information.

We compare VoiceMixer with the following VC models:

1. StarGAN-VC: StarGAN-based voice conversion model [Demo link]

2. AGAIN-VC: Voice conversion model using Activation Guidance and Adaptive Instance Normalization [Demo link]

3. AutoVC: Autoencoder-based voice conversion model. We train AutoVC in several settings that differ in the size of the information bottleneck and in whether an adversarial speaker classifier is used [Demo link]

4. Blow: Flow-based voice conversion model using hyper-conditioning [Demo link]

5. IDE-VC (Not implemented): Autoencoder-based voice conversion with information-theoretic disentanglement. There is no official source code for the IDE-VC model. Please check the official demo page for comparison [Demo link]

MANY-TO-MANY VST

None of the source or target speech was seen during training

Source Speaker	Target Speaker	Converted
p226 (Male)	p256 (Male)	StarGAN-VC	AGAIN-VC
		AutoVC (τ = 16)	AutoVC (τ = 32)
		AutoVC + L_advsc (τ = 16)	Blow
		VoiceMixer (Ours)
p226 (Male)	p236 (Female)	StarGAN-VC	AGAIN-VC
		AutoVC (τ = 16)	AutoVC (τ = 32)
		AutoVC + L_advsc (τ = 16)	Blow
		VoiceMixer (Ours)

Source Speaker	Target Speaker	Converted
p236 (Female)	p259 (Male)	StarGAN-VC	AGAIN-VC
		AutoVC (τ = 16)	AutoVC (τ = 32)
		AutoVC + L_advsc (τ = 16)	Blow
		VoiceMixer (Ours)
p236 (Female)	p234 (Female)	StarGAN-VC	AGAIN-VC
		AutoVC (τ = 16)	AutoVC (τ = 32)
		AutoVC + L_advsc (τ = 16)	Blow
		VoiceMixer (Ours)

Source Speaker	Target Speaker	Converted
p259 (Male)	p270 (Male)	StarGAN-VC	AGAIN-VC
		AutoVC (τ = 16)	AutoVC (τ = 32)
		AutoVC + L_advsc (τ = 16)	Blow
		VoiceMixer (Ours)
p259 (Male)	p240 (Female)	StarGAN-VC	AGAIN-VC
		AutoVC (τ = 16)	AutoVC (τ = 32)
		AutoVC + L_advsc (τ = 16)	Blow
		VoiceMixer (Ours)

Source Speaker	Target Speaker	Converted
p234 (Female)	p226 (Male)	StarGAN-VC	AGAIN-VC
		AutoVC (τ = 16)	AutoVC (τ = 32)
		AutoVC + L_advsc (τ = 16)	Blow
		VoiceMixer (Ours)
p234 (Female)	p231 (Female)	StarGAN-VC	AGAIN-VC
		AutoVC (τ = 16)	AutoVC (τ = 32)
		AutoVC + L_advsc (τ = 16)	Blow
		VoiceMixer (Ours)

ZERO-SHOT VST

None of the source or target speech was seen during training

A base speaker is a speaker seen during training

A novel speaker is a speaker unseen during training
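The base/novel distinction above can be sketched as a small labeling helper. This is only an illustration: the function name `conversion_setting` and the `SEEN_SPEAKERS` set are hypothetical, and the set contains only the one speaker that this page explicitly marks as a base speaker, not the actual training split.

```python
def conversion_setting(source, target, seen_speakers):
    """Label a conversion pair by whether each speaker was seen during training."""
    src = "base" if source in seen_speakers else "novel"
    tgt = "base" if target in seen_speakers else "novel"
    return f"{src}-to-{tgt}"


# Assumption: illustrative and incomplete; only p259 is marked "Base speaker" here.
SEEN_SPEAKERS = {"p259"}

for source, target in [("p259", "p246"), ("p248", "p241")]:
    print(source, "->", target, ":", conversion_setting(source, target, SEEN_SPEAKERS))
```

Any pair with at least one novel speaker falls under the zero-shot setting in the tables below.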

Source Speaker	Target Speaker	Converted
p259 (Male, Base speaker)	p246 (Male, Novel speaker)
		AGAIN-VC	AutoVC (τ = 16)
		AutoVC (τ = 32)	AutoVC + L_advsc (τ = 16)
		VoiceMixer (Ours)
p259 (Male, Base speaker)	p340 (Female, Novel speaker)
		AGAIN-VC	AutoVC (τ = 16)
		AutoVC (τ = 32)	AutoVC + L_advsc (τ = 16)
		VoiceMixer (Ours)

Source Speaker	Target Speaker	Converted
p248 (Female, Novel speaker)	p241 (Male, Novel speaker)
		AGAIN-VC	AutoVC (τ = 16)
		AutoVC (τ = 32)	AutoVC + L_advsc (τ = 16)
		VoiceMixer (Ours)
p248 (Female, Novel speaker)	p293 (Female, Novel speaker)
		AGAIN-VC	AutoVC (τ = 16)
		AutoVC (τ = 32)	AutoVC + L_advsc (τ = 16)
		VoiceMixer (Ours)

Source Speaker	Target Speaker	Converted
p302 (Male, Novel speaker)	p274 (Male, Novel speaker)
		AGAIN-VC	AutoVC (τ = 16)
		AutoVC (τ = 32)	AutoVC + L_advsc (τ = 16)
		VoiceMixer (Ours)
p302 (Male, Novel speaker)	p340 (Female, Novel speaker)
		AGAIN-VC	AutoVC (τ = 16)
		AutoVC (τ = 32)	AutoVC + L_advsc (τ = 16)
		VoiceMixer (Ours)

Source Speaker	Target Speaker	Converted
p335 (Female, Novel speaker)	p255 (Male, Novel speaker)
		AGAIN-VC	AutoVC (τ = 16)
		AutoVC (τ = 32)	AutoVC + L_advsc (τ = 16)
		VoiceMixer (Ours)
p335 (Female, Novel speaker)	p248 (Female, Novel speaker)
		AGAIN-VC	AutoVC (τ = 16)
		AutoVC (τ = 32)	AutoVC + L_advsc (τ = 16)
		VoiceMixer (Ours)

ABLATION STUDY

None of the source or target speech was seen during training

Source Speaker	Target Speaker	Converted
p302 (Male, Novel speaker)	p274 (Male, Novel speaker)	VoiceMixer (Ours)	w/o L_advsc
		w/o L_pos + L_neg	w/o L_neg
		w fixed-length IB (τ = 16)	w fixed-length IB (τ = 32)
		w/o GAN	w/o disentangled discriminator
		w/o L^_style^-*
p302 (Male, Novel speaker)	p340 (Female, Novel speaker)	VoiceMixer (Ours)	w/o L_advsc
		w/o L_pos + L_neg	w/o L_neg
		w fixed-length IB (τ = 16)	w fixed-length IB (τ = 32)
		w/o GAN	w/o disentangled discriminator
		w/o L^_style^-*

MORE AUDIO SAMPLES

None of the source or target speech was seen during training

Source Speaker	Target Speaker	Converted
p259	p225	VoiceMixer
	p226	VoiceMixer
	p227	VoiceMixer
	p228	VoiceMixer
	p231	VoiceMixer
	p232	VoiceMixer
	p233	VoiceMixer
	p234	VoiceMixer
	p236	VoiceMixer
	p237	VoiceMixer
	p239	VoiceMixer
	p240	VoiceMixer
	p245	VoiceMixer
	p254	VoiceMixer
	p256	VoiceMixer
	p257	VoiceMixer
	p259	VoiceMixer
	p267	VoiceMixer
	p270	VoiceMixer
	p275	VoiceMixer

Source Speaker	Target Speaker	Converted
p231	p225	VoiceMixer
	p226	VoiceMixer
	p227	VoiceMixer
	p228	VoiceMixer
	p231	VoiceMixer
	p232	VoiceMixer
	p233	VoiceMixer
	p234	VoiceMixer
	p236	VoiceMixer
	p237	VoiceMixer
	p239	VoiceMixer
	p240	VoiceMixer
	p245	VoiceMixer
	p254	VoiceMixer
	p256	VoiceMixer
	p257	VoiceMixer
	p259	VoiceMixer
	p267	VoiceMixer
	p270	VoiceMixer
	p275	VoiceMixer