[Question] Implement preprocessing on datasets? #1236

iamanigeeit · 2022-01-11T11:55:32Z

iamanigeeit
Jan 11, 2022

Coming from TensorFlowTTS, i find Coqui to be more functional and well-maintained. (I still encounter nan losses after 50k+ iterations, but i can leave that for later.)

One main issue is that each iteration seems to take about double the time and memory consumption is higher compared to TensorFlowTTS. From dataset.py, i can see collate_fn computes the spectrograms while batching and does not cache them (unlike the phoneme_cache).

I will rewrite some parts to save the preprocessed phonemes and spectrograms so i can train different models on the same dataset, and visually compare the ground truth spectrograms against the TTS output.

Also, i think LongTensors are not needed as sequence lengths won't exceed 2 billion.

erogol · 2022-01-11T13:19:06Z

erogol
Jan 11, 2022
Maintainer

Welcome to 🐸 then :)

Unfortunately, our data pipeline is not flexible enough yet. So we are working on a new DatasetAPI. You can see the initial work here #983

If you have some ideas feel free to share them under the PR or send your own PR changes. It'd be great to have some oversight on the changes. It is always welcome.

0 replies

vince62s · 2022-01-11T17:20:05Z

vince62s
Jan 11, 2022

@iamanigeeit how many num_loader_workers are you using ? because when I look at the GPU usage it rarely goes down during training, so the CPU-based spectrogram computing does not seem to be a bottleneck.

0 replies

iamanigeeit · 2022-01-12T03:20:23Z

iamanigeeit
Jan 12, 2022
Author

@iamanigeeit how many num_loader_workers are you using ? because when I look at the GPU usage it rarely goes down during training, so the CPU-based spectrogram computing does not seem to be a bottleneck.

I used the default 4. You are probably right -- one ap.melspectrogram(wav).astype("float32") averages 16.5ms for me so a 8 wavs / CPU should take less than 0.2s.

@erogol Will submit a PR if i succeed. Is there a test process before submitting a PR?

0 replies

iamanigeeit · 2022-01-18T17:19:35Z

iamanigeeit
Jan 18, 2022
Author

@erogol @vince62s I think i've found the bottleneck. For some reason, creating a phonemizer in gruut is very slow.

import gruut
text = 'this is a very very very very long sentence that you havent handled before'
language = 'en-us'

def testme(text, language, phone_sep='', word_sep=' '):
    phonemizer_args = {
        "remove_stress": True,
        "ipa_minor_breaks": False,  # don't replace commas/semi-colons with IPA |
        "ipa_major_breaks": False,  # don't replace periods with IPA ‖
    }
    ph_list = gruut.text_to_phonemes(
            text,
            lang=language,
            return_format="word_phonemes",
            phonemizer_args=phonemizer_args,
        )
    phones_words = [phone_sep.join(word_phonemes) for word_phonemes in ph_list]
    phones = word_sep.join(phones_words)
    return phones

%timeit testme(text,language)
144 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def testme(text, language, phone_sep='', word_sep=' '):
    phonemizer_args = {}
    ph_list = gruut.text_to_phonemes(
            text,
            lang=language,
            return_format="word_phonemes",
            phonemizer_args=phonemizer_args,
        )
    phones_words = [phone_sep.join(word_phonemes) for word_phonemes in ph_list]
    phones = word_sep.join(phones_words)
    return phones

%timeit testme(text,language)
1.19 ms ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With 8 text samples per CPU, this would slow down every batch by over 1s. If we simply create phonemizer before text_to_phonemes the bottleneck goes away.

phonemizer_args = {
    "remove_stress": True,
    "ipa_minor_breaks": False,  # don't replace commas/semi-colons with IPA |
    "ipa_major_breaks": False,  # don't replace periods with IPA ‖
}
phonemizer = gruut.get_phonemizer(language, **phonemizer_args)
def testme(text, language, phone_sep='', word_sep=' ', phonemizer=phonemizer):
    ph_list = gruut.text_to_phonemes(
            text,
            lang=language,
            return_format="word_phonemes",
            phonemizer=phonemizer,
        )
    phones_words = [phone_sep.join(word_phonemes) for word_phonemes in ph_list]
    phones = word_sep.join(phones_words)
    return phones

%timeit testme(text,language)
1.2 ms ± 898 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

0 replies

erogol · 2022-01-19T11:10:15Z

erogol
Jan 19, 2022
Maintainer

@iamanigeeit how many num_loader_workers are you using ? because when I look at the GPU usage it rarely goes down during training, so the CPU-based spectrogram computing does not seem to be a bottleneck.

I used the default 4. You are probably right -- one ap.melspectrogram(wav).astype("float32") averages 16.5ms for me so a 8 wavs / CPU should take less than 0.2s.

@erogol Will submit a PR if i succeed. Is there a test process before submitting a PR?

https://github.com/coqui-ai/TTS/blob/main/CONTRIBUTING.md

Phonemizer API is gonna change soon #1079

So if you send a PR make sure you check the new API first.

0 replies

iamanigeeit · 2022-01-20T07:43:13Z

iamanigeeit
Jan 20, 2022
Author

https://github.com/coqui-ai/TTS/blob/main/CONTRIBUTING.md

Phonemizer API is gonna change soon #1079

So if you send a PR make sure you check the new API first.

@erogol Thanks for the update! I am rushing a paper for Interspeech 2022 so i might only review the latest version end March... meanwhile, i have found that gruut.Phonemizer can't be pickled (i.e. i cannot pass it as an argument to _phoneme_worker, so every worker needs to create its own phonemizer).

My current hack is to create a global phonemizers where num_phonemizers = num_workers, then pass the worker_idx to _phoneme_worker.

phonemizers = []
def set_phonemizers(phoneme_language, phonemizer_args, use_espeak_phonemes, num_workers):
    if use_espeak_phonemes:
        # Use a lexicon/g2p model train on eSpeak IPA instead of gruut IPA.
        # This is intended for backwards compatibility with TTS<=v0.0.13
        # pre-trained models.
        phonemizer_args["model_prefix"] = "espeak"
    global phonemizers
    phonemizers = []
    if phonemizer_args:
        for i in range(num_workers):
            phonemizers.append(gruut.get_phonemizer(phoneme_language, **phonemizer_args))
    else:
        for i in range(num_workers):
            phonemizers.append(gruut.get_phonemizer(phoneme_language))
    return phoneme_language

...

tqdm(
        p.imap(_phoneme_worker,
               [(item, cache_path, cleaner_name, phoneme_language, i % num_workers,
                 custom_symbols, character_config, add_blank) for i, item in enumerate(items)]),
        total=len(items)

0 replies

erogol · 2022-01-20T11:08:47Z

erogol
Jan 20, 2022
Maintainer

happy that you at least found a workaround 👍

0 replies

vince62s · 2022-02-08T15:03:32Z

vince62s
Feb 8, 2022

@iamanigeeit I am a bit confused, where exactly is the bottleneck (with the phonemizer instanciation) during the training loop ?
EDIT: oh your code base is before Oct 31 commit in this file https://github.com/coqui-ai/TTS/blob/main/TTS/tts/utils/text/__init__.py

So I don't know if your fix is applicable to the new Gruut 2.0 api

0 replies

iamanigeeit · 2022-02-08T17:21:47Z

iamanigeeit
Feb 8, 2022
Author

@iamanigeeit I am a bit confused, where exactly is the bottleneck (with the phonemizer instanciation) during the training loop ? EDIT: oh your code base is before Oct 31 commit in this file https://github.com/coqui-ai/TTS/blob/main/TTS/tts/utils/text/__init__.py

So I don't know if your fix is applicable to the new Gruut 2.0 api

Yes, unfortunately i'm using an older version. I believe the bottleneck can still be tested with (1) checking whether the new phonemizer can be passed to _phoneme_worker and (2) checking phonemizer instantiation speed.

0 replies

vince62s · 2022-02-08T17:29:52Z

vince62s
Feb 8, 2022

I think the new Gruut and the way TTS uses it is different.
before: gruut.text_to_phonemes() was creating a phonemizer by default unless you passed one (as you did in your hack)
after the commit: it is another function gruut.sentences() which does not instanciate a phonemizer https://github.com/coqui-ai/TTS/blob/main/TTS/tts/utils/text/__init__.py#L60

0 replies

iamanigeeit · 2022-02-08T17:37:47Z

iamanigeeit
Feb 8, 2022
Author

I think the new Gruut and the way TTS uses it is different. before: gruut.text_to_phonemes() was creating a phonemizer by default unless you passed one (as you did in your hack) after the commit: it is another function gruut.sentences() which does not instanciate a phonemizer https://github.com/coqui-ai/TTS/blob/main/TTS/tts/utils/text/__init__.py#L60

Thanks for the explanation... i'll check again together with all the updates after i'm done with my paper :)

0 replies

vince62s · 2022-02-08T19:20:10Z

vince62s
Feb 8, 2022

@erogol I must admit there is something still linked to phonemes. Wall time is much higher.

When training on characters the GPU usage is almost always 100%
When training with phonemes there is a lots overheads, even though the step_time is similar which means the measure does not take into account the batching/dataloading but maybe only the foraward/backward passes.

As mentioned above the code base has changed and I can't pinpoint where the phonmizer intanciation could impact this.

0 replies

erogol · 2022-02-11T11:16:37Z

erogol
Feb 11, 2022
Maintainer

there should not be any overhead after the first epoch as all the phonemes are cached and loaded statically afterward.

Do you also observe it after the first epoch?

0 replies

iamanigeeit · 2022-02-11T16:44:48Z

iamanigeeit
Feb 11, 2022
Author

Was that

there should not be any overhead after the first epoch as all the phonemes are cached and loaded statically afterward.

Do you also observe it after the first epoch?

I might be wrong on this one... there doesn't seem to be any difference between loader_time in the first or subsequent epochs but it is low enough (0.01-0.02 sec) per batch of 32. This is quite strange because ap.melspectrogram(wav).astype("float32") already takes that much time...

I did move the preprocessing out from Dataset so i could cache the mels and char_ids for reuse in different models.

0 replies

erogol · 2022-02-14T10:06:26Z

erogol
Feb 14, 2022
Maintainer

The difference is only obvious when you enable phonemes and the phoneme computation takes a relatively long time and pushes the loader time a bit in the first epoch.

0 replies

vince62s · 2022-02-14T10:10:56Z

vince62s
Feb 14, 2022

Also I can confirm that num_workers does not make a big difference, so there is somehow a bottleneck but without timing tracing difficult to figure out.

0 replies

[Question] Implement preprocessing on datasets? #1236

Uh oh!

iamanigeeit Jan 11, 2022

Replies: 16 comments

Uh oh!

erogol Jan 11, 2022 Maintainer

Uh oh!

vince62s Jan 11, 2022

Uh oh!

iamanigeeit Jan 12, 2022 Author

Uh oh!

Uh oh!

iamanigeeit Jan 18, 2022 Author

Uh oh!

erogol Jan 19, 2022 Maintainer

Uh oh!

Uh oh!

iamanigeeit Jan 20, 2022 Author

Uh oh!

erogol Jan 20, 2022 Maintainer

Uh oh!

Uh oh!

vince62s Feb 8, 2022

Uh oh!

iamanigeeit Feb 8, 2022 Author

Uh oh!

vince62s Feb 8, 2022

Uh oh!

iamanigeeit Feb 8, 2022 Author

Uh oh!

vince62s Feb 8, 2022

Uh oh!

erogol Feb 11, 2022 Maintainer

Uh oh!

iamanigeeit Feb 11, 2022 Author

Uh oh!

erogol Feb 14, 2022 Maintainer

Uh oh!

vince62s Feb 14, 2022

iamanigeeit
Jan 11, 2022

erogol
Jan 11, 2022
Maintainer

vince62s
Jan 11, 2022

iamanigeeit
Jan 12, 2022
Author

iamanigeeit
Jan 18, 2022
Author

erogol
Jan 19, 2022
Maintainer

iamanigeeit
Jan 20, 2022
Author

erogol
Jan 20, 2022
Maintainer

vince62s
Feb 8, 2022

iamanigeeit
Feb 8, 2022
Author

vince62s
Feb 8, 2022

iamanigeeit
Feb 8, 2022
Author

vince62s
Feb 8, 2022

erogol
Feb 11, 2022
Maintainer

iamanigeeit
Feb 11, 2022
Author

erogol
Feb 14, 2022
Maintainer

vince62s
Feb 14, 2022