Deep Neural Networks Employing Multi-Task Learning and Stacked Bottleneck Features for Speech Synthesis


Samples to support the ICASSP 2015 submission titled above. More details can be found in our paper. If you have any questions, please drop me an email: zhizheng.wu {at} ed.ac.uk (Zhizheng Wu)


This page contains the following samples:

  1. Copy-Synthesis: Use STRAIGHT to extract 60-dimensional Mel-Cepstral Coefficients (MCCs), 25-band aperiodicities (BAPs), and fundamental frequency (F0), then pass these parameters directly to the STRAIGHT vocoder to reconstruct the waveform.
  2. HMM-GV: Baseline HMM system, implemented by HTS. Global variance is applied when generating the acoustic parameters.
  3. DNN: Baseline DNN system.
  4. MTLDNN (Formants): Multi-Task Learning (MTL) DNN using formant tracks (F1-F4) as the secondary task.
  5. MTLDNN (LSF): MTLDNN using 40-dimensional line spectral frequencies (LSFs) as the secondary task.
  6. MTLDNN (Gammatone): MTLDNN using 64-dimensional Gammatone spectra as the secondary task.
  7. MTLDNN (STEP): MTLDNN using 55-dimensional spectro-temporal excitation pattern (STEP) feature as the secondary task.
  8. DNN-DNN: DNN with stacked bottleneck features. A first DNN is used to extract 128-dimensional bottleneck features. They are stacked with linguistic features as input to a second DNN to predict acoustic features.
  9. MTLDNN-MTLDNN: Similar to DNN-DNN, but both the first and second networks are MTLDNNs.
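The MTLDNN systems above share one network trunk between a primary task (acoustic features) and a secondary task (formants, LSFs, Gammatone spectra, or STEP). A minimal numpy sketch of this idea is shown below; all layer sizes and the task weight `alpha` are illustrative assumptions, not the paper's actual configuration (the paper should be consulted for the real architecture and training details).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only:
IN_DIM = 305        # linguistic input features (hypothetical size)
HID_DIM = 256       # shared hidden layer width (hypothetical)
PRIMARY_DIM = 187   # primary acoustic target, e.g. MCC+BAP+F0 (hypothetical)
SECONDARY_DIM = 40  # secondary target, e.g. 40-dim LSFs (per the list above)

def init(a, b):
    """Small random weights plus zero bias for one layer."""
    return rng.standard_normal((a, b)) * 0.01, np.zeros(b)

W1, b1 = init(IN_DIM, HID_DIM)
W2, b2 = init(HID_DIM, HID_DIM)
Wp, bp = init(HID_DIM, PRIMARY_DIM)    # primary output head
Ws, bs = init(HID_DIM, SECONDARY_DIM)  # secondary output head

def forward(x):
    # Both tasks share the same hidden representation (the MTL trunk).
    h = np.tanh(x @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    return h @ Wp + bp, h @ Ws + bs

def mtl_loss(x, y_primary, y_secondary, alpha=0.5):
    # Weighted sum of per-task MSE; alpha balances the secondary task.
    p, s = forward(x)
    return np.mean((p - y_primary) ** 2) + alpha * np.mean((s - y_secondary) ** 2)

x = rng.standard_normal((8, IN_DIM))
yp = rng.standard_normal((8, PRIMARY_DIM))
ys = rng.standard_normal((8, SECONDARY_DIM))
p, s = forward(x)
print(p.shape, s.shape)  # (8, 187) (8, 40)
```

At synthesis time only the primary head is used; the secondary head serves purely as a regularizer during training.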

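The DNN-DNN pipeline (item 8) can likewise be sketched: a first network with a narrow 128-dimensional layer produces bottleneck features, which are concatenated with the linguistic input and fed to a second network. The sketch below uses single-frame stacking and hypothetical layer sizes for simplicity; the actual systems may stack bottleneck features from neighbouring frames.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed sizes; only the 128-dim bottleneck is taken from the list above.
LING_DIM, BN_DIM, ACOUSTIC_DIM = 305, 128, 187

def init(a, b):
    return rng.standard_normal((a, b)) * 0.01, np.zeros(b)

# First DNN: linguistic -> hidden -> 128-dim bottleneck -> acoustic output.
W1a, b1a = init(LING_DIM, 256)
W1b, b1b = init(256, BN_DIM)
W1c, b1c = init(BN_DIM, ACOUSTIC_DIM)

def bottleneck_features(x):
    # The 128-dim bottleneck activations are the features we keep.
    h = np.tanh(x @ W1a + b1a)
    return np.tanh(h @ W1b + b1b)

# Second DNN: [linguistic ; bottleneck] -> hidden -> acoustic output.
W2a, b2a = init(LING_DIM + BN_DIM, 256)
W2b, b2b = init(256, ACOUSTIC_DIM)

def second_dnn(x):
    stacked = np.concatenate([x, bottleneck_features(x)], axis=1)
    h = np.tanh(stacked @ W2a + b2a)
    return h @ W2b + b2b

x = rng.standard_normal((8, LING_DIM))
print(second_dnn(x).shape)  # (8, 187)
```

For MTLDNN-MTLDNN (item 9), both networks would additionally carry a secondary output head as in the multi-task sketch.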
Samples Copy-Synthesis HMM-GV DNN MTLDNN (Formants) MTLDNN (LSF) MTLDNN (Gammatone) MTLDNN (STEP) DNN-DNN MTLDNN-MTLDNN
1
2
3
4
5
6