Diff-HierVC Demo

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

 

Figure: Overall framework of Diff-HierVC


Introduction

 

We compare Diff-HierVC with the following voice conversion models:

1. AutoVC [K. Qian et al., 2019]: [Official Demo page]

2. VoiceMixer [S.-H. Lee et al., 2021]: [Official Demo page]

3. Speech Resynthesis [A. Polyak et al., 2021]: [Official Demo page]

4. DiffVC [V. Popov et al., 2022]: [Official Demo page]

 

One-shot Speaker Adaptation

Real-world Dataset

Source Speaker Target Speaker Converted

Emma Watson
(03:30 ~ 03:40)

Gollum
(00:30 ~ 00:40)

Diff-HierVC
(One-shot, 6 iter.)

Tom Holland
(00:45 ~ 00:55)

Diff-HierVC
(One-shot, 6 iter.)

Source Speaker Target Speaker Converted

GLaDOS
(00:00 ~ 00:10)

Benedict Cumberbatch
(05:00 ~ 05:10)

Diff-HierVC
(One-shot, 6 iter.)

Tom Holland
(00:45 ~ 00:55)

Diff-HierVC
(One-shot, 6 iter.)

Source Speaker Target Speaker Converted

p225_007
(VCTK)

Benedict Cumberbatch
(05:00 ~ 05:10)

Diff-HierVC
(One-shot, 6 iter.)

Emma Watson
(03:30 ~ 03:40)

Diff-HierVC
(One-shot, 6 iter.)

Tom Holland
(00:45 ~ 00:55)

Diff-HierVC
(One-shot, 6 iter.)

Gollum
(00:30 ~ 00:40)

Diff-HierVC
(One-shot, 6 iter.)

GLaDOS
(00:00 ~ 00:10)

Diff-HierVC
(One-shot, 6 iter.)

Heung-min Son
(00:04 ~ 00:14)

Diff-HierVC
(One-shot, 6 iter.)

Steve Jobs
(00:55 ~ 01:05)

Diff-HierVC
(One-shot, 6 iter.)

Zero-shot Voice Conversion (VCTK)

All speakers are unseen during training

Source Speaker Target Speaker Converted

GT (p241)

GT (p239)

  AutoVC

  VoiceMixer

Speech Resynthesis

  DiffVC *
(6 iter.)

  DiffVC *
(30 iter.)

  DiffVC
(6 iter.)

  DiffVC
(30 iter.)

Diff-HierVC
(6 iter.)

Diff-HierVC
(30 iter.)

Diff-HierVC Finetune
(6 iter.)

Diff-HierVC Finetune
(30 iter.)

GT (p226)

GT (p229)

  AutoVC

  VoiceMixer

Speech Resynthesis

  DiffVC *
(6 iter.)

  DiffVC *
(30 iter.)

  DiffVC
(6 iter.)

  DiffVC
(30 iter.)

Diff-HierVC
(6 iter.)

Diff-HierVC
(30 iter.)

Diff-HierVC Finetune
(6 iter.)

Diff-HierVC Finetune
(30 iter.)

GT (p240)

GT (p234)

  AutoVC

  VoiceMixer

Speech Resynthesis

  DiffVC *
(6 iter.)

  DiffVC *
(30 iter.)

  DiffVC
(6 iter.)

  DiffVC
(30 iter.)

Diff-HierVC
(6 iter.)

Diff-HierVC
(30 iter.)

Diff-HierVC Finetune
(6 iter.)

Diff-HierVC Finetune
(30 iter.)

Many-to-Many Voice Conversion (LibriTTS)

All speakers are seen during training

Source Speaker Target Speaker Converted

GT (1571)

GT (3526)

  AutoVC

  VoiceMixer

Speech Resynthesis

  DiffVC *
(6 iter.)

  DiffVC *
(30 iter.)

  DiffVC
(6 iter.)

  DiffVC
(30 iter.)

Diff-HierVC
(6 iter.)

Diff-HierVC
(30 iter.)

GT (3699)

GT (374)

  AutoVC

  VoiceMixer

Speech Resynthesis

  DiffVC *
(6 iter.)

  DiffVC *
(30 iter.)

  DiffVC
(6 iter.)

  DiffVC
(30 iter.)

Diff-HierVC
(6 iter.)

Diff-HierVC
(30 iter.)

Zero-shot Cross-lingual Voice Conversion

Unseen languages from the CSS10 multilingual dataset

Source Speaker Target Speaker Converted

French

Hungarian

Diff-HierVC

French

Greek

Diff-HierVC

Source Speaker Target Speaker Converted

Finnish

Dutch

Diff-HierVC

Finnish

Russian

Diff-HierVC

Source Speaker Target Speaker Converted

Russian

Dutch

Diff-HierVC

Russian

French

Diff-HierVC

Source Speaker Target Speaker Converted

Spanish

French

Diff-HierVC

Spanish

Russian

Diff-HierVC

Source Speaker Target Speaker Converted

German

French

Diff-HierVC

German

Dutch

Diff-HierVC

Ablation study

Results of the ablation study on zero-shot VC tasks with unseen speakers from the VCTK dataset.
All methods use 6 sampling iterations.

Source Speaker Target Speaker Converted

GT
p225 (female)

GT
p230 (female)

Diff-HierVC
(Ours)

Denorm. + DiffVoice

F0 Encoder + DiffVoice

DiffPitch + SF Encoder

w/o Masked Prior

w/o Data-driven Prior

w/o SF Encoder

GT
p233 (female)

GT
p243 (male)

Diff-HierVC
(Ours)

Denorm. + DiffVoice

F0 Encoder + DiffVoice

DiffPitch + SF Encoder

w/o Masked Prior

w/o Data-driven Prior

w/o SF Encoder

GT
p245 (male)

GT
p232 (male)

Diff-HierVC
(Ours)

Denorm. + DiffVoice

F0 Encoder + DiffVoice

DiffPitch + SF Encoder

w/o Masked Prior

w/o Data-driven Prior

w/o SF Encoder