Deep Neural Networks Employing Multi-Task Learning and Stacked Bottleneck Features for Speech Synthesis


Samples to support the ICASSP 2015 submission titled above. More details can be found in our paper. If you have any questions, please drop me an email: zhizheng.wu {at} ed.ac.uk (Zhizheng Wu)


This page contains the following samples:

  1. Copy-Synthesis: Use STRAIGHT to extract 60-dimensional Mel-Cepstral Coefficients (MCCs), 25-band aperiodicities (BAPs), and fundamental frequency (F0), then pass these parameters directly to the STRAIGHT vocoder to reconstruct the waveform.
  2. HMM-GV: Baseline HMM system, implemented by HTS. Global variance is applied when generating the acoustic parameters.
  3. DNN: Baseline DNN system.
  4. MTLDNN (Formants): Multi-Task Learning (MTL) DNN using formant tracks (F1-F4) as the secondary task.
  5. MTLDNN (LSF): MTLDNN using 40-dimensional line spectral frequencies (LSFs) as the secondary task.
  6. MTLDNN (Gammatone): MTLDNN using 64-dimensional Gammatone spectra as the secondary task.
  7. MTLDNN (STEP): MTLDNN using 55-dimensional spectro-temporal excitation pattern (STEP) feature as the secondary task.
  8. DNN-DNN: DNN with stacked bottleneck features. A first DNN is used to extract 128-dimensional bottleneck features. They are stacked with linguistic features as input to a second DNN to predict acoustic features.
  9. MTLDNN-MTLDNN: Similar to DNN-DNN, but both the first and second networks are MTLDNNs.
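The MTLDNN systems above share one network trunk between a primary task (acoustic features) and a secondary task (formants, LSFs, Gammatone spectra, or STEP). A minimal numpy sketch of this idea is shown below; all layer sizes and the task weight `alpha` are illustrative assumptions, not the paper's actual configuration (the paper should be consulted for the real architecture and training details).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only:
IN_DIM = 305        # linguistic input features (hypothetical size)
HID_DIM = 256       # shared hidden layer width (hypothetical)
PRIMARY_DIM = 187   # primary acoustic target, e.g. MCC+BAP+F0 (hypothetical)
SECONDARY_DIM = 40  # secondary target, e.g. 40-dim LSFs (per the list above)

def init(a, b):
    """Small random weights plus zero bias for one layer."""
    return rng.standard_normal((a, b)) * 0.01, np.zeros(b)

W1, b1 = init(IN_DIM, HID_DIM)
W2, b2 = init(HID_DIM, HID_DIM)
Wp, bp = init(HID_DIM, PRIMARY_DIM)    # primary output head
Ws, bs = init(HID_DIM, SECONDARY_DIM)  # secondary output head

def forward(x):
    # Both tasks share the same hidden representation (the MTL trunk).
    h = np.tanh(x @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    return h @ Wp + bp, h @ Ws + bs

def mtl_loss(x, y_primary, y_secondary, alpha=0.5):
    # Weighted sum of per-task MSE; alpha balances the secondary task.
    p, s = forward(x)
    return np.mean((p - y_primary) ** 2) + alpha * np.mean((s - y_secondary) ** 2)

x = rng.standard_normal((8, IN_DIM))
yp = rng.standard_normal((8, PRIMARY_DIM))
ys = rng.standard_normal((8, SECONDARY_DIM))
p, s = forward(x)
print(p.shape, s.shape)  # (8, 187) (8, 40)
```

At synthesis time only the primary head is used; the secondary head serves purely as a regularizer during training.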

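The DNN-DNN pipeline (item 8) can likewise be sketched: a first network with a narrow 128-dimensional layer produces bottleneck features, which are concatenated with the linguistic input and fed to a second network. The sketch below uses single-frame stacking and hypothetical layer sizes for simplicity; the actual systems may stack bottleneck features from neighbouring frames.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed sizes; only the 128-dim bottleneck is taken from the list above.
LING_DIM, BN_DIM, ACOUSTIC_DIM = 305, 128, 187

def init(a, b):
    return rng.standard_normal((a, b)) * 0.01, np.zeros(b)

# First DNN: linguistic -> hidden -> 128-dim bottleneck -> acoustic output.
W1a, b1a = init(LING_DIM, 256)
W1b, b1b = init(256, BN_DIM)
W1c, b1c = init(BN_DIM, ACOUSTIC_DIM)

def bottleneck_features(x):
    # The 128-dim bottleneck activations are the features we keep.
    h = np.tanh(x @ W1a + b1a)
    return np.tanh(h @ W1b + b1b)

# Second DNN: [linguistic ; bottleneck] -> hidden -> acoustic output.
W2a, b2a = init(LING_DIM + BN_DIM, 256)
W2b, b2b = init(256, ACOUSTIC_DIM)

def second_dnn(x):
    stacked = np.concatenate([x, bottleneck_features(x)], axis=1)
    h = np.tanh(stacked @ W2a + b2a)
    return h @ W2b + b2b

x = rng.standard_normal((8, LING_DIM))
print(second_dnn(x).shape)  # (8, 187)
```

For MTLDNN-MTLDNN (item 9), both networks would additionally carry a secondary output head as in the multi-task sketch.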
Samples Copy-Synthesis HMM-GV DNN MTLDNN (Formants) MTLDNN (LSF) MTLDNN (Gammatone) MTLDNN (STEP) DNN-DNN MTLDNN-MTLDNN
1
2
3
4
5
6