Diff-HierVC
Overall framework of Diff-HierVC
We compare Diff-HierVC with the following voice conversion models:
1. AutoVC [K. Qian et al., 2019]: [Official Demo page]
2. VoiceMixer [S.-H. Lee et al., 2021]: [Official Demo page]
3. Speech Resynthesis [A. Polyak et al., 2021]: [Official Demo page]
4. DiffVC [V. Popov et al., 2022]: [Official Demo page]
Real-world Dataset
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
Diff-HierVC |
|||
Diff-HierVC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
Diff-HierVC |
|||
Diff-HierVC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
p225_007 |
Diff-HierVC |
||
Diff-HierVC |
|||
Diff-HierVC |
|||
Diff-HierVC |
|||
Diff-HierVC |
|||
Diff-HierVC |
|||
Diff-HierVC |
All speakers are unseen during training
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
GT (p241) |
GT (p239) |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC * |
DiffVC * | |||
DiffVC |
DiffVC | |||
Diff-HierVC |
Diff-HierVC | |||
Diff-HierVC Finetune |
Diff-HierVC Finetune | |||
GT (p226) |
GT (p229) |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC * |
DiffVC * | |||
DiffVC |
DiffVC | |||
Diff-HierVC |
Diff-HierVC | |||
Diff-HierVC Finetune |
Diff-HierVC Finetune | |||
GT (p240) |
GT (p234) |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC * |
DiffVC * | |||
DiffVC |
DiffVC | |||
Diff-HierVC |
Diff-HierVC | |||
Diff-HierVC Finetune |
Diff-HierVC Finetune |
All speakers are seen during training
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
GT (1571) |
GT (3526) |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC * |
DiffVC * | |||
DiffVC |
DiffVC | |||
Diff-HierVC |
Diff-HierVC | |||
GT (3699) |
GT (374) |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC * |
DiffVC * | |||
DiffVC |
DiffVC | |||
Diff-HierVC |
Diff-HierVC |
Unseen languages from the CSS10 multilingual dataset
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
French |
Hungarian |
Diff-HierVC |
|
French |
Greek |
Diff-HierVC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
Finnish |
Dutch |
Diff-HierVC |
|
Finnish |
Russian |
Diff-HierVC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
Russian |
Dutch |
Diff-HierVC |
|
Russian |
French |
Diff-HierVC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
Spanish |
French |
Diff-HierVC |
|
Spanish |
Russian |
Diff-HierVC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
German |
French |
Diff-HierVC |
|
German |
Dutch |
Diff-HierVC |
Results of the ablation study on zero-shot VC tasks with unseen speakers from the VCTK dataset.
For all methods, the number of sampling iterations is 6.
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
GT |
GT |
Diff-HierVC (Ours) | ||
Denorm. + DiffVoice |
F0 Encoder + DiffVoice |
DiffPitch + SF Encoder | ||
w/o Masked Prior |
w/o Data-driven Prior |
w/o SF Encoder | ||
GT |
GT |
Diff-HierVC (Ours) | ||
Denorm. + DiffVoice |
F0 Encoder + DiffVoice |
DiffPitch + SF Encoder | ||
w/o Masked Prior |
w/o Data-driven Prior |
w/o SF Encoder | ||
GT |
GT |
Diff-HierVC (Ours) | ||
Denorm. + DiffVoice |
F0 Encoder + DiffVoice |
DiffPitch + SF Encoder | ||
w/o Masked Prior |
w/o Data-driven Prior |
w/o SF Encoder |