Text to Speech Modeling on a YouTube-Sourced Single-Speaker Data Set with Tacotron2 and WaveGlow: Part 3

shaun Big Data Enthusiast and lover of all things distributed and scalable.

We have made quite a lot of progress since our last post. After an intense period of training and experimenting, we decided to use a WaveGlow model pre-trained on LJSpeech and focus on our Tacotron2 performance.


In our previous posts we focused most of our effort on getting our models to converge. We had quite a bit of success in this area and even began to get output with recognizable speech patterns from our subject speaker. But the model continued to fall short of producing intelligible, distinct words. No matter how much we trained, the mel spectrograms and waveforms we produced clearly had a way to go. Our inferred results are below.

Original Wav File


Inferred Wav File


Original Wav File

"It's called the office of public policy" (original)

Inferred Wav File

"It's called the office of public policy" (inferred)

New Strategy

From our experiments we were able to deduce that most of the speaker-specific information is contained in the Tacotron2 model. We had previously had no issues getting our Tacotron2 loss to converge, but we had not spent time on the attention plot. When we looked back at our attention plots, it was clear that there were issues. Even with our cleaned data set, the attention would max out looking something like this, no matter how long we trained.



A properly trained model, by contrast, should look something like this.

We could see we had a long way to go, and we were not sure we had enough of our subject speaker's data to get there on our own. So we took a pre-trained Tacotron2 model (6000 epochs) and continued training it on our data to see if we could transfer some of that model's success. Below is the attention plot from the LJSpeech pre-trained Tacotron2.

Attention Plot from Pre-Trained Model

pretrained tacotron2 attention plot

We then trained the pre-trained model with our data for 5000 epochs and obtained the below results.

Pre-Trained Model after an Additional 5000 Epochs with Our Data

attention plot tacotron2 trained on pretrained ljspeech

If you look closely at the bottom left-hand corner of the attention plot, this model is starting to show signs of converging onto a diagonal, but even after continuing to train with our data we could not get a better attention plot. The generated wav file matched our understanding of attention plots: we could understand the first word, where the attention is aligned, but nothing after it.
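One way to put a rough number on "how diagonal" an attention plot is would be to measure how much of the attention weight falls near the ideal diagonal. The sketch below is not something we ran in these experiments; the function name, the `band` parameter, and the assumption that the alignment is available as a list of rows (one per decoder step, each holding weights over the input text positions) are all ours for illustration. Real alignments are never perfectly linear because speaking rate varies, so treat the score as a heuristic only.

```python
def diagonal_attention_mass(alignment, band=0.1):
    """Fraction of total attention weight lying within `band` of the diagonal.

    `alignment` is a list of rows, one per decoder step; each row holds the
    attention weights over encoder (text) positions. Both axes are rescaled
    to the range 0..1 so that plots of different sizes are comparable.
    This is a hypothetical helper, not part of any Tacotron2 codebase.
    """
    n_dec = len(alignment)
    n_enc = len(alignment[0])
    inside = 0.0
    total = 0.0
    for t, row in enumerate(alignment):
        # Relative position of this decoder step within the utterance.
        rel_t = t / max(n_dec - 1, 1)
        for s, weight in enumerate(row):
            # Relative position of this text position within the input.
            rel_s = s / max(n_enc - 1, 1)
            total += weight
            if abs(rel_t - rel_s) <= band:
                inside += weight
    return inside / total if total else 0.0
```

A score near 1.0 would correspond to a clean diagonal, while a plot where the attention is smeared evenly across the text would score much lower.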

We continued to train this model with our data but could not achieve better results. After a little more experimenting, we had a hunch that we were still limited by the amount of data we had.

Synthesizing New Data

We had an idea that we could get some “new” data by stretching the audio files just slightly.  This doubled our data set to roughly 20 hours.  We also added some additional noise reduction to our data pipeline to further clean up what we had.
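The idea can be sketched in a few lines. This is a minimal illustration rather than our actual pipeline: the function names and the 5% stretch factor are assumptions, and the stretch here is plain linear interpolation over raw samples, which shifts pitch along with duration. A real pipeline would more likely use an audio library's time-stretch routine. The key point is that each stretched copy keeps the transcript of its original clip, so the data set doubles without any new labeling.

```python
def stretch_samples(samples, factor):
    """Resample `samples` to about len(samples) * factor points.

    Uses linear interpolation, so pitch changes along with duration.
    `factor` > 1 lengthens the clip; `factor` < 1 shortens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    if n_in < 2:
        return list(samples)
    n_out = max(2, int(round(n_in * factor)))
    # Distance, in input samples, between consecutive output samples.
    step = (n_in - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), n_in - 2)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[left + 1] * frac)
    return out


def augment(dataset, factor=1.05):
    """Pair every (samples, transcript) clip with one stretched copy."""
    augmented = []
    for samples, text in dataset:
        augmented.append((samples, text))
        # The stretched copy reuses the original transcript unchanged.
        augmented.append((stretch_samples(samples, factor), text))
    return augmented
```

With a factor close to 1, the stretched clips still sound like the same speaker while giving the model slightly different timing for the same text.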


After training Tacotron2 on our new, partially synthetic data set, our attention plot actually looked better than the attention plot from the pre-trained model.

Pre-Trained Model + Our Data with Synthesized Data

attention plot pretrained tacotron2 + synthesized subject speaker data

The inferred speech was very clear and sounded close to the speech patterns of our target speaker. There was still some work to be done on the tone of the inferred speech, but we felt we had made a huge step: this was the first time our Tacotron2 loss converged and the attention plot showed a clear diagonal line.

Tags : machine learning tacotron2 text to speech waveglow
