Text to Speech Modeling on a YouTube-Sourced Single-Speaker Data Set with Tacotron2 and WaveGlow: Part I

shaun Big Data Enthusiast and lover of all things distributed and scalable.

We sourced 12 hours of publicly available speeches by a single speaker from YouTube and trained the Tacotron2/WaveGlow TTS neural networks on them. Here are our preliminary results.


There is an abundance of academic papers that use the latest “novel approaches” to achieve state-of-the-art results on complex machine learning language problems. You do not have to go far to find blogs and tutorials by people attempting to duplicate those experiments and often falling short.

To make matters worse, these results are obtained using unreasonably clean data sets, which leaves little hope for those of us who do not have our own proprietary data. Examples of real-world applications of these models are almost impossible to find. One can assume the primary reason is that most companies hold their cards close to their chest when applying state-of-the-art models to their proprietary data.

We have spent the last 12 months adapting the Tacotron2/WaveGlow text-to-speech model to train on a data set that we sourced from publicly available YouTube data. All of the audio comes from a single speaker.

In the spirit of giving back to the forums that helped get us this far, we wanted to share what we could about our journey through our TTS project. While I cannot share every detail, I would like to share our progress and the high-level methods that we used to develop our model.


I will assume that anyone reading this blog has researched the TTS model we are using. If not, here is a high-level overview of how the two models work, with a link to the respective papers.

Our model is made up of two neural networks.

Tacotron2 – A sequence-to-sequence model that converts text to mel spectrograms.

WaveGlow – A vocoder that uses a flow-based network capable of generating high quality speech from mel-spectrograms.



Both models were trained on 8 NVIDIA Tesla V100 GPUs with mixed precision using Tensor Cores. Training was 2.0x faster for Tacotron2 and 3.1x faster for WaveGlow than training without Tensor Cores. It is worth noting that, depending on how much GPU memory you have, you may have to train on smaller batches to avoid CUDA out-of-memory errors.

Because GPU time was so expensive, we used a secondary CPU machine for all our data preparation and analysis.


We installed Ubuntu on bare metal along with all the related CUDA libraries. Although most of the papers and demos we saw used a Docker container for training, we decided that it was easier for our process to install PyTorch and all the needed libraries in a conda virtual environment.


The WaveGlow paper and the public Tacotron2/WaveGlow reference implementations train on a recorded single-speaker data set called LJSpeech. It consists of short clips of one speaker reading aloud, which makes it well suited to TTS. There is no background noise, and the speaker speaks very calmly and clearly in what can be thought of as the American version of “the Queen's English”. We realized quickly that the novel aspect of what we were doing was the publicly sourced YouTube audio. We were not able to find anyone who had successfully done this and achieved respectable results.

Because data cleaning and noise removal would be essential to the success of our project, we decided to iterate on our data cleaning between training runs. This allowed us to train and learn about our model while we were improving the cleaning. We refer to our first iteration as data 1.0 and any minor changes as 1.1, etc.


We sourced 12 hours of publicly available data from a single speaker on YouTube. This original data was pulled from public speeches and other appearances by our subject. Our data was cut into 2-10 second clips, human transcribed, and labeled.
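The clip-length rule above is easy to check mechanically. The sketch below is an illustration rather than the pipeline we actually ran: the function names are hypothetical, and the pipe-separated `path|transcript` layout is an assumption modeled on the file lists that public Tacotron2 training scripts commonly read.

```python
import wave

MIN_SECONDS = 2.0   # lower bound of the clip window described above
MAX_SECONDS = 10.0  # upper bound of the clip window described above


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_length(duration, low=MIN_SECONDS, high=MAX_SECONDS):
    """True when a clip duration falls inside the allowed window."""
    return low <= duration <= high


def build_filelist(entries):
    """Turn (wav_path, transcript, duration) tuples into 'path|transcript' lines.

    Clips outside the allowed window are skipped. The line layout is an
    assumption, not something specified in this post.
    """
    lines = []
    for wav_path, transcript, duration in entries:
        if is_usable_length(duration):
            lines.append(f"{wav_path}|{transcript.strip()}")
    return lines
```

In practice the duration in each tuple would come from `clip_duration_seconds`, and the rejected clips would be logged so they can be re-cut instead of silently dropped.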


We used both human transcription and Amazon Transcribe. Human transcription was a time-consuming process, but the results were better than those from Amazon Transcribe. I believe that in the future a tool like Transcribe could be good enough if you want to automate this process.

Text and Wav Cleaning

We applied all the traditional data cleaning techniques to our text files and wav files. A few of the most notable tasks were:

  • manually clipping silence from the beginning and end of clips
  • downsampling the files from a 44.1k sample rate to 24.1k
  • converting them from stereo to mono for noise removal techniques
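Two of the steps in the list above can be sketched in a few lines of plain Python. This is a simplified illustration, not our actual tooling: the silence trimming in this project was done by hand, the amplitude threshold is an arbitrary assumption, and samples are assumed to be floats in the range -1.0 to 1.0. Resampling is deliberately left out, because naive decimation aliases and a proper resampler with a low-pass filter should be used for that step.

```python
def stereo_to_mono(interleaved):
    """Average interleaved left/right samples into a single mono channel."""
    left = interleaved[0::2]
    right = interleaved[1::2]
    return [(l + r) / 2 for l, r in zip(left, right)]


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold.

    The threshold is a hypothetical value chosen for illustration; a real
    pipeline would tune it (or trim by hand, as described above).
    """
    start = 0
    end = len(samples)
    # Advance past quiet samples at the start of the clip.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Walk back past quiet samples at the end of the clip.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```

A sample-level threshold like this is fragile on noisy recordings, which is one reason manual clipping can be the safer choice for audio sourced from the wild.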


With over 12 hours of audio, we knew there was some low-hanging fruit.

With data 1.0 we used a naive approach. We measured the signal-to-noise ratio (SNR) of all our wav files. Because research has shown that WaveGlow is very sensitive to noise, we needed to figure out the threshold at which we would consider a file too noisy to use.

Initially, we looked at the distribution, determined which files looked like outliers, and sent them to our audio engineer to listen to. The ones that were inaudible were labeled as such and removed, and the ones that we felt could be cleaned were kept. This allowed us to determine a threshold for removing files from our data set.

Once we had what we deemed a usable data set, we had our sound engineer compose a noisy data set from our data. With that noisy data we used a naive approach to remove the different types of noise from our wav files.
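For readers who want to try a similar filter, here is a minimal sketch of a per-file SNR estimate and a threshold split. It is an assumption-laden illustration, not the measurement we used: it treats the quietest frames of a clip as the noise floor, and the frame length, noise fraction, and function names are all hypothetical. The formula itself is the standard one, SNR in dB = 10 * log10(signal power / noise power).

```python
import math


def frame_powers(samples, frame_len=400):
    """Mean power of each non-overlapping frame of the clip."""
    powers = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Crude SNR estimate that uses the quietest frames as the noise floor."""
    powers = sorted(frame_powers(samples, frame_len))
    if not powers:
        raise ValueError("clip is shorter than one frame")
    n_noise = max(1, int(len(powers) * noise_fraction))
    noise_power = sum(powers[:n_noise]) / n_noise
    total_power = sum(powers) / len(powers)
    # Clamp both terms so the logarithm is always defined.
    signal_power = max(total_power - noise_power, 1e-12)
    noise_power = max(noise_power, 1e-12)
    return 10.0 * math.log10(signal_power / noise_power)


def split_by_snr(snr_by_file, threshold_db):
    """Split file names into (keep, review) lists around a dB threshold."""
    keep = sorted(n for n, snr in snr_by_file.items() if snr >= threshold_db)
    review = sorted(n for n, snr in snr_by_file.items() if snr < threshold_db)
    return keep, review
```

The threshold itself still has to come from listening, as described above; the code only makes the cut repeatable once a human has decided where it belongs.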


Alas, because separating signal from noise is what we do as data scientists, we were sure that we needed a smarter way to remove the noise. We started developing a neural network for cleaning the noise from our data, based on a previously published paper:

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Our first run through the model (data 1.0) did not include data cleaned by this method.

Quality Assurance

Eventually we used Amazon Transcribe as a QA tool. This proved to be very valuable, as it served as a last line of defense against transcription errors, noisy files, and wav files that were inaudible.
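One simple way to turn an automatic transcript into a QA signal is to compare it with the human transcript using word error rate (WER) and flag clips that disagree too much. The sketch below assumes the ASR output is already available as plain text; it does not call any transcription service, and the 0.2 cutoff and function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def flag_for_review(clips, max_wer=0.2):
    """Return ids of clips whose ASR text disagrees with the human text.

    clips is a list of (clip_id, human_text, asr_text) tuples.
    """
    return [clip_id for clip_id, human, asr in clips
            if word_error_rate(human, asr) > max_wer]
```

A high WER does not say which side is wrong. It may point to a human typo, a noisy file, or an unintelligible clip, so flagged clips still need a listen.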


Figure: quality assurance TTS model

Training (data 1.0)

We trained the two models separately, so if you have the resources you can train them in parallel.


Tacotron2

training time: 2 days

epochs: 1500

Figure: tacotron2 training results

We used hyperparameters similar to those in the Tacotron2 paper, and we were happy that the model converged quickly with very few issues. We trained it for 1500 epochs as the paper suggested, but with our data it only needed around 800.


WaveGlow

training time: 2 days

epochs: 750

validation loss: -5.01

In the original wave glow paper they

Waveglow Training Results on Data 1.0

After the first iteration of WaveGlow training, we were very happy to see that the model had in fact converged and that the loss curve began to look a lot like the one in the paper. But after about 500 epochs the wheels fell off: the model became unstable, and the loss would explode, regain traction and start converging again, only to explode again very quickly.

After reviewing these charts, we were convinced that we still had some training we could do on data 1.0. We had a hunch that we could retrain with a smaller learning rate and possibly avoid this issue. We also agreed that the model should be trained for much longer than 750 epochs.
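Instability like this can also be located automatically in a logged loss curve instead of by eye. The sketch below is a hypothetical helper, not part of our training code: it flags any epoch whose loss jumps well above the median of the preceding few epochs, and the window size and margin are arbitrary assumptions. A loss like WaveGlow's is negative when training is healthy, so a spike shows up as a sudden move upward.

```python
from statistics import median


def find_loss_spikes(losses, window=5, margin=1.0):
    """Return indices where the loss jumps above the recent median.

    losses is a list of per-epoch loss values. An epoch is flagged when its
    loss exceeds the median of the previous `window` epochs by more than
    `margin`. Both parameters are illustrative defaults.
    """
    spikes = []
    for i in range(window, len(losses)):
        baseline = median(losses[i - window:i])
        if losses[i] - baseline > margin:
            spikes.append(i)
    return spikes
```

The median baseline is the key design choice here: a single exploded epoch does not drag the baseline up, so the epochs that follow a spike are not misread as normal or as further spikes.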

Lowering Learning Rate

training time: 2 days

epochs:  2200

validation loss: -5.57

Figure: wavenet 2000 epochs

We started training from checkpoint 150, so this gives a zoomed-in view of our model. It appeared that our hunch was right. We can see here that there is no exploding loss and that the model appears to be stable and still converging even at 2200 epochs.

Because of this we were anxious to see our inference results.

Inference Results

In any text-to-speech model, the true test of success is how the speech sounds to the human ear. Given the amount of noise in our audio files and the results we had seen in forums online, we were skeptical about being able to reproduce anything even resembling a voice. So we were pleasantly surprised when we were able to produce a wav file that was the same length as the original and, although unintelligible, sounded somewhat like our speaker in cadence and tone.

While I cannot share our actual generated speech wav file due to a client non-disclosure agreement, I can share the time series charts of the original and the artificially produced wav files.

The Original Wav File


Inferred Wav File

You can see from comparing the two time series plots that the inferred wav file had a lot of low frequency noise compared to the original but there is definitely a voice signal in there.


The next steps will be to apply our ML noise reduction model to our data and then focus on retraining WaveGlow only. We will be posting our results on this blog. If you have any questions or want to chat with us about your experience with this model, feel free to contact us through the contact form on the home page or on LinkedIn.

