Abstract
The performance of speech processing models trained on clean speech deteriorates significantly in the presence of noise. While training with a noisy dataset is the most straightforward solution, procuring such datasets is not always feasible. Noisy speech simulation techniques, which generate noisy speech from clean speech, therefore play an indispensable role in training robust models for audio processing tasks. In our work, we study the ability of GANs to simulate a variety of noises. In particular, we consider noise from the categories of UHF/VHF, additive stationary and non-stationary noise, as well as codec distortion. We study the performance of four GANs: SpeechAttentionGAN, SimuGAN, and MaskCycleGAN, which perform non-parallel translation, and Speech2Speech-Augment, which performs parallel translation. Our experiments provide a comprehensive guideline for researchers to pick the right kind of GAN for the right type of noise when generating simulated noisy speech. We compare our models against a baseline which generates noisy speech using aggregated noise and a codec, and obtain improvements of 55.8%, 28.9%, and 22.8% in terms of MSSL over the baseline for the RATS, TIMIT Cabin, and TIMIT Helicopter datasets respectively.
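For readers unfamiliar with the evaluation metric, the following is a minimal sketch of a multi-scale spectral loss, assuming MSSL here denotes the commonly used definition (L1 distance between magnitude spectrograms of two signals, accumulated over several FFT sizes). The function name `mssl` and the choice of FFT sizes are illustrative, not taken from the paper.

```python
# Hedged sketch of a multi-scale spectral loss (MSSL), assuming the
# common definition: L1 distance between linear and log magnitude
# spectrograms, summed over several FFT resolutions.
import numpy as np
from scipy.signal import stft

def mssl(reference, estimate, fft_sizes=(2048, 1024, 512, 256), eps=1e-7):
    """Multi-scale spectral distance between two waveforms.

    For each FFT size, compares magnitude spectrograms with an L1
    penalty on both the linear and the log scale, then sums the
    per-scale averages. Lower is better; identical signals score 0.
    """
    loss = 0.0
    for n_fft in fft_sizes:
        _, _, ref_spec = stft(reference, nperseg=n_fft)
        _, _, est_spec = stft(estimate, nperseg=n_fft)
        ref_mag, est_mag = np.abs(ref_spec), np.abs(est_spec)
        # Linear-magnitude term emphasizes high-energy regions.
        loss += np.mean(np.abs(ref_mag - est_mag))
        # Log-magnitude term emphasizes low-energy spectral detail.
        loss += np.mean(np.abs(np.log(ref_mag + eps) - np.log(est_mag + eps)))
    return loss

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
print(mssl(x, x))  # identical signals -> 0.0
```

Under this definition, a lower MSSL between simulated and real noisy speech indicates a closer spectral match, which is why the reported percentages are reductions relative to the baseline.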
Speech Samples
A few generated samples from each model are presented for every dataset.
Please wear earphones when listening to these samples.
Dataset | Clean speech | SpeechAttentionGAN | MaskCycleGAN Augment | SimuGAN | Speech2Speech Augment | Actual Noisy Speech
---|---|---|---|---|---|---
RATS | | | | | |
Cabin | | | | | |
Helicopter | | | | | |
Codec2 | | | | | |