[Question] Implement preprocessing on datasets? #1236
Replies: 16 comments
-
Welcome to 🐸 then :) Unfortunately, our data pipeline is not flexible enough yet, so we are working on a new Dataset API. You can see the initial work in #983. If you have some ideas, feel free to share them under the PR or send your own PR with changes. It'd be great to have more oversight on the changes; it is always welcome.
-
@iamanigeeit how many `num_loader_workers` are you using? Because when I look at the GPU usage, it rarely goes down during training, so the CPU-based spectrogram computation does not seem to be a bottleneck.
-
I used the default 4. You are probably right. @erogol I will submit a PR if I succeed. Is there a test process before submitting a PR?
-
@erogol @vince62s I think I've found the bottleneck. For some reason, creating a new phonemizer instance for every text sample is slow. With 8 text samples per CPU, this would slow down every batch by over 1s. If we simply create the phonemizer once and reuse it, that overhead goes away.
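To illustrate the effect, here is a minimal, self-contained sketch. The `DummyPhonemizer` class and its 0.15s construction delay are invented stand-ins, not Coqui TTS code; only the pattern (per-sample construction vs. one shared instance) is the point.

```python
import time

class DummyPhonemizer:
    """Stand-in for a real phonemizer: construction is made artificially
    slow to mimic loading lexicons/models (the 0.15s figure is invented)."""
    def __init__(self, language="en-us"):
        time.sleep(0.15)            # pretend we load a lexicon here
        self.language = language

    def phonemize(self, text):
        return text                 # real phonemization omitted

def collate_slow(texts):
    # suspected anti-pattern: a fresh phonemizer for every sample
    return [DummyPhonemizer().phonemize(t) for t in texts]

SHARED = DummyPhonemizer()          # built once, reused by every batch

def collate_fast(texts):
    return [SHARED.phonemize(t) for t in texts]

batch = ["hello world"] * 8         # 8 text samples per worker, as above
for fn in (collate_slow, collate_fast):
    start = time.perf_counter()
    fn(batch)
    print(fn.__name__, f"{time.perf_counter() - start:.2f}s")
```

With numbers like these, the per-sample version costs roughly 8 × 0.15s ≈ 1.2s per batch, which matches the slowdown described above.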
-
https://github.com/coqui-ai/TTS/blob/main/CONTRIBUTING.md The phonemizer API is going to change soon (#1079), so if you send a PR, make sure you check the new API first.
-
@erogol Thanks for the update! I am rushing a paper for Interspeech 2022, so I might only review the latest version at the end of March... Meanwhile, I have found that re-creating the phonemizer is the main slowdown. My current hack is to create a global phonemizer instance and reuse it.
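A sketch of what that global-instance hack could look like. The module, function names, and the placeholder backend are hypothetical (added so the sketch runs standalone); this is not the actual TTS code.

```python
# Hypothetical helper module -- not part of Coqui TTS.
_PHONEMIZER = None

class _PlaceholderBackend:
    """Stands in for the real (expensive-to-construct) phonemizer backend."""
    def phonemize(self, text):
        return text

def get_phonemizer():
    """Lazily build one phonemizer per process and reuse it afterwards."""
    global _PHONEMIZER
    if _PHONEMIZER is None:
        _PHONEMIZER = _PlaceholderBackend()   # swap in the real constructor here
    return _PHONEMIZER

def text_to_phonemes(text):
    return get_phonemizer().phonemize(text)
```

Note that with `num_loader_workers > 0` each DataLoader worker is a separate process with its own copy of the module, so the constructor still runs once per worker rather than once overall.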
-
Happy that you at least found a workaround 👍
-
@iamanigeeit I am a bit confused: where exactly is the bottleneck (with the phonemizer instantiation) during the training loop? So I don't know if your fix is applicable to the new Gruut 2.0 API.
-
Yes, unfortunately I'm using an older version. I believe the bottleneck can still be tested by checking whether a single phonemizer instance can be passed to the dataset and reused, instead of being re-created for every sample.
-
I think the new Gruut, and the way TTS uses it, are different now.
-
Thanks for the explanation... I'll check again, together with all the updates, after I'm done with my paper :)
-
@erogol I must admit there is something still linked to phonemes: wall time is much higher. When training on characters, the GPU usage is almost always 100%. As mentioned above, the code base has changed and I can't pinpoint where the phonemizer instantiation could impact this.
-
There should not be any overhead after the first epoch, as all the phonemes are cached and loaded statically afterward. Do you also observe it after the first epoch?
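For reference, a compute-once / load-afterwards phoneme cache can be sketched like this. The helper name, hashing scheme, and directory layout are made up for illustration and are not the actual `phoneme_cache` implementation.

```python
import hashlib
import os
import numpy as np

def cached_phonemes(text, cache_dir, compute_fn):
    """Compute phoneme IDs for `text` once; later calls load the on-disk copy."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.md5(text.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, f"{key}.npy")
    if os.path.exists(path):
        return np.load(path)                      # later epochs: just load
    ids = np.asarray(compute_fn(text), dtype=np.int64)
    np.save(path, ids)                            # first epoch: compute and store
    return ids

# usage (compute_fn stands in for the real text -> phoneme-ID function):
ids = cached_phonemes("hello world", "/tmp/phoneme_cache",
                      compute_fn=lambda t: [ord(c) % 64 for c in t])
```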
-
Was that after the first epoch? I might be wrong on this one... there doesn't seem to be any difference between the first and later epochs. I did move the preprocessing out of the Dataset so I could cache the mels and char_ids for reuse in different models.
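A rough sketch of that kind of setup, where the Dataset only loads features precomputed by an offline pass. The file layout (`<id>_mel.npy`, `<id>_chars.npy`) and class are hypothetical, not how Coqui's dataset actually stores things.

```python
import os
import numpy as np
import torch
from torch.utils.data import Dataset

class PrecomputedTTSDataset(Dataset):
    """Loads precomputed char IDs and mel spectrograms instead of computing
    them in collate_fn. Assumes an earlier offline pass wrote one
    `<utt>_chars.npy` and one `<utt>_mel.npy` per utterance."""

    def __init__(self, feature_dir, utt_ids):
        self.feature_dir = feature_dir
        self.utt_ids = utt_ids

    def __len__(self):
        return len(self.utt_ids)

    def __getitem__(self, idx):
        utt = self.utt_ids[idx]
        chars = np.load(os.path.join(self.feature_dir, f"{utt}_chars.npy"))
        mel = np.load(os.path.join(self.feature_dir, f"{utt}_mel.npy"))
        return torch.from_numpy(chars), torch.from_numpy(mel)
```

The same cached features can then be fed to different models, and the ground-truth mels are available for visual comparison against model output.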
-
The difference is only obvious when you enable phonemes and the phoneme computation takes a relatively long time, which pushes the loader time up a bit in the first epoch.
-
Also, I can confirm that `num_workers` does not make a big difference, so there is a bottleneck somewhere, but without timing traces it is difficult to figure out where.
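One way to get such a trace, as a rough sketch: split wall time into "waiting on the DataLoader" vs. "running the train step". The `train_step` callable is a placeholder for the real forward/backward pass.

```python
import time
import torch

def timed_epoch(loader, train_step):
    """Rough timing trace: time spent waiting on the DataLoader vs. time
    spent in the train step itself."""
    load_time, step_time = 0.0, 0.0
    t0 = time.perf_counter()
    for batch in loader:
        t1 = time.perf_counter()
        load_time += t1 - t0          # time spent fetching/collating the batch
        train_step(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make GPU step timing meaningful
        t0 = time.perf_counter()
        step_time += t0 - t1
    print(f"loader: {load_time:.1f}s, steps: {step_time:.1f}s")
```

If `loader` dominates, the bottleneck is in the data pipeline (phonemization, spectrogram computation, etc.) rather than the model.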
-
Coming from TensorFlowTTS, I find Coqui to be more functional and well-maintained. (I still encounter `nan` losses after 50k+ iterations, but I can leave that for later.)

One main issue is that each iteration seems to take about double the time, and memory consumption is higher, compared to TensorFlowTTS. From `dataset.py`, I can see that `collate_fn` computes the spectrograms while batching and does not cache them (unlike the `phoneme_cache`). I will rewrite some parts to save the preprocessed phonemes and spectrograms so I can train different models on the same dataset, and visually compare the ground-truth spectrograms against the TTS output.

Also, I think `LongTensor`s are not needed, as sequence lengths won't exceed 2 billion.
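On the `LongTensor` point, a quick way to see what an `int32` downcast would actually save (the batch shape below is an arbitrary example); note that some indexing ops, e.g. embedding lookups on older PyTorch versions, expect `int64` indices, so the cast is worth verifying before relying on it.

```python
import torch

# Character-ID batches built with torch.LongTensor default to int64.
char_ids = torch.randint(0, 100, (32, 180), dtype=torch.int64)        # batch x max_len
print(char_ids.element_size() * char_ids.nelement())                  # bytes at int64
print(char_ids.to(torch.int32).element_size() * char_ids.nelement())  # half at int32
```

The ID tensors are small next to the spectrograms, so most of the memory gap likely comes from the uncached spectrogram computation rather than the integer width.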