What's the difference between speaker embeddings and d_vector? #1171

Ca-ressemble-a-du-fake · 2022-01-31T20:13:58Z

Hi,

I am new to voice cloning. I understand that the properties of a given voice are called embeddings. I read that when fine-tuning a model we can choose either to compute speaker embeddings or to use a d_vector. Are speaker embeddings computed on the fly during training for each wav file, while d_vectors are computed once and for all before training? What are the pros and cons of each method?

Thanks

Answered by erogol

erogol · 2022-02-01T12:41:22Z

Maintainer

Speaker embeddings are computed by a speaker embedding layer inside the TTS model.

d_vectors are computed externally by a speaker encoder model.

A model that uses a speaker embedding layer is harder to extend to more speakers once trained, since each new speaker needs to be added to the speaker embedding layer.

d_vectors do not have this issue, but you need a high-quality pre-trained speaker encoder to make this work well.
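
To make the distinction above concrete, here is a minimal plain-Python sketch. It is not Coqui TTS code: `SpeakerEmbeddingTable`, `toy_speaker_encoder` and `compute_d_vector` are hypothetical names made up for illustration, the toy encoder is only a stand-in for a real pre-trained speaker encoder, and averaging per-utterance vectors into a single d_vector is an assumption about one common way to do it.

```python
import random

EMBED_DIM = 4  # toy size, chosen only for illustration


class SpeakerEmbeddingTable:
    """Speaker embedding layer: one row per speaker known at training time."""

    def __init__(self, speaker_names, dim, seed=0):
        rng = random.Random(seed)
        self.index = {name: i for i, name in enumerate(speaker_names)}
        # In a real model these rows are learned jointly with the TTS model.
        self.rows = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in speaker_names]

    def lookup(self, name):
        # Raises KeyError for a speaker that has no row in the table.
        return self.rows[self.index[name]]


def toy_speaker_encoder(waveform, dim):
    """Hypothetical stand-in for a pre-trained speaker encoder: audio in, vector out."""
    chunk = max(1, len(waveform) // dim)
    return [sum(waveform[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def compute_d_vector(utterances, dim):
    """Average the per-utterance encoder outputs into one vector (assumed strategy)."""
    per_utterance = [toy_speaker_encoder(w, dim) for w in utterances]
    return [sum(column) / len(per_utterance) for column in zip(*per_utterance)]


# Speakers seen during training get their vector from the table.
table = SpeakerEmbeddingTable(["alice", "bob"], EMBED_DIM)
alice_vec = table.lookup("alice")

# A speaker the table has never seen has no row in it,
# but a d_vector can still be computed from that speaker's audio alone.
carol_audio = [[0.1 * i for i in range(16)], [0.2 * i for i in range(16)]]
carol_d_vector = compute_d_vector(carol_audio, EMBED_DIM)
```

In this sketch, `table.lookup("carol")` would fail because the table has no row for that speaker, while the d_vector path only needs audio. How good that vector is depends entirely on the encoder, which is the trade-off described above.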

2 replies

Ca-ressemble-a-du-fake Feb 3, 2022
Author

Thank you @erogol. Can I use the default speaker encoder? Is it considered high quality enough?

jaggukaka Feb 3, 2022

Sorry for commenting on an answered topic, but I want to know whether the same applies to language embeddings too. Say I want to fine-tune the already-trained YourTTS model provided with the installation, which I see is trained on French, English and Brazilian Portuguese. If I fine-tune it with a new language, does the language embedding layer get messed up, since the dimensions would change?

I understand there's no language encoder equivalent to the speaker encoder. So if adding a new language is as hard as you described for the speaker embedding, is the only way to add a new language to start afresh?

But as far as I saw in the YourTTS paper, it looks like training was done incrementally over several languages rather than all at once. So that means there's hope for transfer learning from a pretrained model to a new language, right?
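
To make the dimension concern concrete: for a lookup-table embedding, adding an entry changes the number of rows, not the length of each vector. The sketch below is generic plain Python, not YourTTS or Coqui TTS code. `expand_embedding_table` and the language labels are hypothetical, and the thread does not say whether the library's fine-tuning code does anything like this.

```python
import random


def expand_embedding_table(rows, n_new, seed=0):
    """Return a new table: trained rows copied unchanged, n_new fresh rows appended."""
    rng = random.Random(seed)
    dim = len(rows[0])
    kept = [list(row) for row in rows]  # copy so the original table is untouched
    fresh = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_new)]
    return kept + fresh


# Illustrative labels and values only: three trained language rows, vector length 4.
languages = ["en", "fr", "pt-br"]
trained = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

expanded = expand_embedding_table(trained, n_new=1)
languages = languages + ["new-lang"]  # the new row is the last index
```

Here the existing rows keep their trained values and only the appended row starts from scratch, so the table grows from 3 rows to 4 while every vector stays the same length.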
