Text to Speech Modeling on Youtube Sourced Single Speaker Data Set with Tacotron2 and Waveglow: Part 2

shaun Big Data Enthusiast and lover of all things distributed and scalable.

It has been a few weeks since we posted results from our text to speech model. Since then we have learned some things about our data and our model. These findings are meant to add to the discussion around this model, so don’t take them as absolute. We learned a lot about the model from online forums, so we want to give back to the TTS community by posting our results.


If you read the previous blog, Text to Speech Modeling on Youtube Sourced Single Speaker Data Set with Tacotron2 and Waveglow: Part I, you will recall that both Tacotron2 and Waveglow converge well with our data. We have spent some time cleaning our data, and both models achieve validation loss very close to what you see in the white papers.


Unfortunately, in TTS models, convergence is not the best indicator that you have reached intelligible speech. The ground truth in TTS is the human ear. I imagine some sort of passive listening version of the Turing test would be the best indicator of how your model performs. There appears to be a major gap between the loss functions these models minimize (such as the MSE on mel spectrograms in Tacotron2) and the human ear's analysis of the wav file. In the future we will be exploring some experimental loss functions that are more dynamic than MSE, but for now we limited ourselves to the MSE that was used in most of the Tacotron2 literature.
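
To make the gap between a frame-wise MSE and what a listener hears a little more concrete, here is a toy sketch. It is not the loss code from Tacotron2 or Waveglow. The "spectrograms" are made-up lists of frames, and the helper names (mse, shift_frames, toy_spectrogram) are hypothetical. It only shows that a copy of the target delayed by one frame, which a listener would likely hear as the same utterance, can score worse under MSE than a featureless blur.

```python
# Toy illustration only: each "spectrogram" is a list of frames,
# and each frame is a list of bin energies. All values are made up.

def mse(a, b):
    """Mean squared error over two equally shaped lists of frames."""
    total = 0.0
    count = 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count

def shift_frames(spec, n):
    """Delay a spectrogram by n frames, padding the start with silence."""
    n_bins = len(spec[0])
    silence = [[0.0] * n_bins for _ in range(n)]
    return silence + spec[: len(spec) - n]

def toy_spectrogram(n_frames=8, n_bins=4):
    """Energy that moves to the next bin on every frame."""
    return [
        [1.0 if b == t % n_bins else 0.0 for b in range(n_bins)]
        for t in range(n_frames)
    ]

target = toy_spectrogram()
shifted = shift_frames(target, 1)        # same content, one frame late
blurred = [[0.25] * 4 for _ in target]   # flat energy, no structure at all

mse_shifted = mse(target, shifted)
mse_blurred = mse(target, blurred)
print("MSE, one-frame delay:", mse_shifted)
print("MSE, flat blur:      ", mse_blurred)
```

In this toy case the delayed copy gets the larger error even though it keeps all of the structure, which is one reason a low loss does not have to mean intelligible audio.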


What was Interesting

We took an LJSpeech pre-trained Waveglow checkpoint and our subject-trained Tacotron2 checkpoint and ran inference with the pair.

LJSpeech Pre-trained Checkpoint

Astonishingly, the results were not terrible. The words were not intelligible, but we were excited because the inferred wav had the same tone, cadence, and pitch as our subject. You can see a comparison of the Mel spectrograms below.

Original Mel


Inferred Mel



The two spectrograms looked similar, and the wav files audibly had less noise than we experienced in any of our previous evaluations. Strangely, when listening to the wav file, it sounded like the phonemes were scrambled. We found this very interesting, given that the time series from the wav file also appeared to be out of order.

Original Wav File

its called the office of public policy(original)

Inferred Wav File

its called the office of public policy(inferred)
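
One hypothetical way to put a number on "scrambled" is to match each inferred frame to its nearest frame in the original and check whether those matches move forward in time. The sketch below is an illustration under that assumption, not a measurement from this post. The function names (nearest_frame, order_score) and the toy frames are invented for the example, and real mel spectrograms would probably need something more robust, such as dynamic time warping.

```python
# Hypothetical diagnostic: do the inferred frames follow the original order?

def nearest_frame(frame, reference):
    """Index of the reference frame closest (squared distance) to frame."""
    best_index, best_dist = 0, float("inf")
    for i, ref in enumerate(reference):
        dist = sum((x - y) ** 2 for x, y in zip(frame, ref))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def order_score(inferred, reference):
    """Fraction of consecutive inferred frames whose nearest reference
    frames do not move backwards in time (1.0 means fully in order)."""
    matches = [nearest_frame(f, reference) for f in inferred]
    steps = list(zip(matches, matches[1:]))
    in_order = sum(1 for a, b in steps if b >= a)
    return in_order / len(steps)

# Toy reference: six frames, each with a distinct made-up energy pattern.
reference = [[float(t), float(t * t)] for t in range(6)]
# The same frames, played back in a scrambled order.
scrambled = [reference[i] for i in (2, 0, 1, 5, 3, 4)]

print("in order: ", order_score(reference, reference))
print("scrambled:", order_score(scrambled, reference))
```

A score well below 1.0 on real data would be consistent with the out-of-order impression, though it would not by itself say where the reordering comes from.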

What confused us

Because of the results we got from the LJSpeech pre-trained Waveglow checkpoint, I wanted to try training our own Waveglow checkpoint on LJSpeech. After it stopped improving at 3000 epochs, we ran inference on the same file as above, and the results were worse. As you can see below, there is a lot of noise that was not there in the pre-trained LJSpeech checkpoint inference above.

Original Wav File


Inferred Wav File


This was confusing because we trained our model on the exact same data with almost the same machines (8 Tesla V100s), just with a little less memory. The only conclusion I can begin to draw is that the hyperparameters provided with the pre-trained model download may not tell the entire story. In other words, maybe some tuning went on under the hood that changed the default parameters. I would love to hear anyone's feedback on similar issues.
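
When chasing a discrepancy like this, a first sanity check is to diff the configuration used for your own run against whatever configuration is published with the pre-trained checkpoint. The helper below is a minimal sketch of that idea. The function name diff_configs is hypothetical, and the keys and values in the two dictionaries are placeholders for illustration, not the real settings of either run.

```python
# Minimal sketch: report every setting that differs between two nested configs.

def diff_configs(ours, theirs, prefix=""):
    """Return {dotted_key: (our_value, their_value)} for each difference."""
    differences = {}
    for key in sorted(set(ours) | set(theirs)):
        path = prefix + key
        a = ours.get(key, "<missing>")
        b = theirs.get(key, "<missing>")
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested sections such as a training block.
            differences.update(diff_configs(a, b, path + "."))
        elif a != b:
            differences[path] = (a, b)
    return differences

# Placeholder values for illustration only (assumptions, not real settings).
our_run = {
    "train_config": {"learning_rate": 1e-4, "batch_size": 8, "sigma": 1.0},
}
published = {
    "train_config": {"learning_rate": 1e-4, "batch_size": 12, "sigma": 1.0,
                     "fp16_run": True},
}

changes = diff_configs(our_run, published)
for name, (mine, theirs) in changes.items():
    print(f"{name}: ours={mine} published={theirs}")
```

In practice both dictionaries would be loaded from the JSON or hyperparameter files of the two runs. A diff like this cannot reveal tuning that was never written down, but it does rule out the differences that were.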

Going Forward

We feel there may be some real promise in finding a way to pre-train both of our models with LJSpeech or even a multi-speaker dataset (see https://github.com/CODEJIN/multi_speaker_tts). We will be working through some different methods of doing this, as well as some promising methods for generating synthetic data.

Please feel free to reach out if you would like to collaborate or have been experiencing these issues.

Twitter: @shaun_atx

