Online multichannel speech enhancement combining statistical signal processing and deep neural networks
- Antonio Miguel Peinado Herreros (Supervisor)
- Ángel Manuel Gómez García (Supervisor)
Defense university: Universidad de Granada
Defense date: 25 January 2021
- Inmaculada Hernáez Rioja (Chair)
- José Luis Pérez Córdoba (Secretary)
- Isaac Manuel Álvarez Ruiz (Committee member)
- Jesper Jensen (Committee member)
- Antonio Miguel Artiaga (Committee member)
Type: Thesis
Abstract
Speech-related applications on mobile devices require high-performance speech enhancement algorithms to tackle challenging real-world noisy environments. These speech processing techniques must provide strong noise reduction with low speech distortion, thus improving the perceptual quality and intelligibility of the enhanced speech signal. In addition, current mobile devices often embed several microphones, which allows the spatial information to be exploited during the enhancement procedure. At the same time, low latency and computational efficiency are requirements for the widespread use of these technologies. Among the different speech processing paradigms, statistical signal processing offers limited performance in non-stationary noisy environments, while deep neural networks can lack generalization under real conditions.

The main goal of this Thesis is the development of online multichannel speech enhancement algorithms for speech services on mobile devices. The proposed techniques use multichannel signal processing to increase noise reduction performance without degrading the quality of the speech signal. Moreover, deep neural networks are applied in those parts of the algorithm where modeling by classical methods would otherwise be difficult or very limited. This allows more capable deep learning methods to be used within real-time, online processing algorithms. Our contributions focus on different noisy environments where these mobile speech technologies can be applied.

First, we develop a speech enhancement algorithm suitable for dual-microphone smartphones used in noisy and reverberant environments. The noisy speech signal is processed using a beamforming-plus-postfiltering strategy that exploits the dual-channel properties of the clean speech and noise signals to obtain more accurate acoustic parameters. The temporal variability of the relative transfer functions between acoustic channels is tracked using an extended Kalman filter framework. Noise statistics are obtained through a recursive procedure driven by the speech presence probability, which is estimated either with statistical spatial models or with deep neural network mask estimators, both exploiting dual-channel features of the noisy speech signal.

Then, we propose a recursive expectation-maximization framework for online multichannel speech enhancement. The goal is the joint estimation of the clean speech statistics and the acoustic model parameters in order to increase robustness under non-stationary conditions. The noisy speech signal is first processed by a beamformer followed by a Kalman postfilter, which exploits the temporal correlations of the speech magnitude. The speech presence probability is then obtained with a deep neural network mask estimator, and its estimates are further refined through statistical spatial models defined for the noisy speech and noise signals. The resulting clean speech and speech presence estimates are then employed for maximum-likelihood estimation of the beamformer and postfilter parameters. This also allows for an iterative procedure with positive feedback between the estimation of speech statistics and acoustic parameters.

Scenarios with multiple overlapping speakers are also analyzed in this Thesis. In this context, beamforming with model parameters obtained from deep neural network mask estimators is also explored.
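To illustrate the signal processing core shared by these contributions, the following is a minimal NumPy sketch of a speech-presence-driven noise statistics update and an MVDR beamformer. The function names, the smoothing factor, and the per-bin processing layout are illustrative assumptions, not the exact formulation used in the Thesis.

```python
import numpy as np

def update_noise_cov(phi_nn, y, spp, alpha=0.9):
    """Recursive noise covariance update driven by the speech presence
    probability (SPP): frames likely to contain speech barely modify the
    estimate, while noise-only frames are absorbed quickly."""
    alpha_eff = alpha + (1.0 - alpha) * spp      # smoothing grows with SPP
    return alpha_eff * phi_nn + (1.0 - alpha_eff) * np.outer(y, y.conj())

def mvdr_weights(phi_nn, rtf):
    """MVDR beamformer computed from the noise covariance and the relative
    transfer function (steering vector) of the target source."""
    phi_inv_d = np.linalg.solve(phi_nn, rtf)     # Phi_nn^{-1} d
    return phi_inv_d / (rtf.conj() @ phi_inv_d)  # w = Phi^{-1} d / (d^H Phi^{-1} d)

# Hypothetical per-bin usage at one time-frequency point:
#   y       -- (M,) complex STFT vector across the M microphones
#   spp     -- scalar speech presence probability from a DNN mask estimator
#   phi_nn  -- (M, M) running noise covariance for this frequency bin
#   s_hat   = mvdr_weights(phi_nn, rtf).conj() @ y   # beamformer output
```

In practice, the SPP used in the update and the relative transfer function would both be refined online, which is where the Kalman-based tracking and mask estimators described above come into play.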
To deal with interfering speakers, we study the use of adapted mask estimators that exploit spectral and spatial cues, obtained from auxiliary information, to focus on a target speaker. Additional speech processing blocks are therefore integrated into the mask estimators so that the network can discriminate among different speakers. As an application, we consider the problem of automatic speech recognition in meeting scenarios, where our proposal can be used as a front-end processing stage.

Finally, we study the training of deep learning methods for speech processing using perceptual considerations. We propose a loss function based on an objective perceptual quality metric and evaluate it for training deep neural network-based single-channel speech enhancement algorithms, with the aim of improving the speech quality perceived by human listeners. The two most common approaches for single-channel processing with these networks are considered: spectral mapping and spectral masking. We also explore the combination of different metric-related loss functions in a multi-objective training approach.

To conclude, we would like to highlight that our contributions successfully integrate signal processing and deep learning methods to jointly exploit spectral, spatial, and temporal speech features. As a result, the set of proposed techniques provides a versatile framework for robust speech processing in very challenging acoustic environments, improving perceptual quality, intelligibility, and distortion measures.
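To make the target-speaker conditioning concrete, below is a minimal PyTorch sketch of a mask estimator that receives an auxiliary speaker embedding alongside the mixture spectrogram. The architecture (a single LSTM over magnitude spectra concatenated with the embedding) and all layer sizes are assumptions for illustration, not the networks developed in the Thesis.

```python
import torch
import torch.nn as nn

class TargetMaskEstimator(nn.Module):
    """Hypothetical mask estimator conditioned on an auxiliary speaker
    embedding so the network attends to the target speaker among
    interfering ones."""

    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag, spk_emb):
        # noisy_mag: (B, T, F) magnitude spectrogram of the mixture
        # spk_emb:   (B, emb_dim) embedding of the target speaker
        emb = spk_emb.unsqueeze(1).expand(-1, noisy_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([noisy_mag, emb], dim=-1))
        return torch.sigmoid(self.out(h))        # per-bin mask in [0, 1]
```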
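Similarly, the perceptually motivated, multi-objective training described above could take the form sketched below: a spectral masking loss combined with a term derived from a differentiable proxy of an objective quality metric. The `metric_proxy` callable and the weighting are hypothetical placeholders standing in for the specific perceptual quality metric on which the proposed loss is based.

```python
import torch

def enhancement_loss(mask, noisy_mag, clean_mag, metric_proxy, w_metric=0.5):
    """Multi-objective loss for mask-based single-channel enhancement:
    spectral-domain MSE plus a term derived from a differentiable proxy of
    an objective quality metric. `metric_proxy` is a hypothetical callable
    returning a quality score in [0, 1] (higher is better)."""
    est_mag = mask * noisy_mag                    # spectral masking estimate
    mse = torch.mean((est_mag - clean_mag) ** 2)  # distortion-oriented term
    quality = metric_proxy(est_mag, clean_mag)    # perceptual-quality term
    return mse + w_metric * (1.0 - quality)
```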