Exploiting Nonnegative Matrix Factorization with Mixed Group Sparsity Constraint to Separate Speech Signal from Single-channel Mixture with Unknown Ambient Noise

This paper focuses on solving a challenging speech enhancement problem: improving the desired speech from a single-channel audio signal containing high-level unspecified noise (possibly environmental noise, music, other sounds, etc.). Using source separation techniques, we investigate a solution combining nonnegative matrix factorization (NMF) with a mixed group sparsity constraint that allows exploiting a generic noise spectral model to guide the separation process. Experiments performed on a set of benchmark audio signals with different types of real-world noise show that the proposed algorithm yields better quantitative results in terms of the signal-to-distortion ratio than previously published algorithms.


Introduction
Speech enhancement is the process of removing unexpected audio signals (noise) from their mixture with a desired speech signal. This subject has been widely studied for decades as it has a huge impact in many different domains such as communication, speech-based control systems, medical surveillance, and audio post-processing in movies and entertainment [1]. Recent scientific research [2][3][4] has shown that the performance of speech recognition systems degrades dramatically in practical noisy and reverberant environments. This situation demonstrates the need for improving speech quality in such noisy recordings. Popular approaches for speech enhancement include beamforming [5,6], spectral subtraction [7], and source separation [8][9][10].
Considering speech and noise as two independent sources to be separated, audio source separation techniques can be used to isolate the desired speech from high-level noise. Some recent work has developed methods for single-channel speech enhancement based on, e.g., NMF [11,12], Gaussian mixture models (GMM) [13], or deep neural networks [14,15]. The first two methods first learn the characteristics of the speech and noise signals; the learned models are then used to guide the signal separation process. Deep learning based approaches can learn the separation mask or the separation model by end-to-end training and have had a significant impact. However, deep learning based systems require a lot of training data and processing power. For cases where only a few training examples are available, Sun and Mysore [16] proposed the use of NMF [17] to establish a general spectral model for speech signals from other voices. Studies by El Badawy et al. [18][19][20] employed similar NMF-based spectral models, learned from source examples obtained by a search engine, to guide the separation algorithm.
In this paper, we focus on a slightly different setting compared to the existing works [16][17][18]: the speaker is assumed to be known, but the noise signal is not deterministic. This speaker-dependent situation is very common in practice.
For instance, when speech is used to control robots or devices, the operator/speaker is often known, so his/her voice can be collected in advance for training the system. Noise, on the other hand, is highly non-stationary and varies with the operating environment (different moments or different locations). Therefore, noise cannot be well identified during the training process. From this intuition, we propose a novel approach that first constructs a general spectral noise model from noise examples collected in advance; such examples can easily be pre-collected in various environments. The general noise model is then used to guide the separation process. Within the considered NMF-based approach, we investigate the combination of the block sparsity proposed in [16] and the component sparsity proposed in [18] in order to improve the source separation performance. Developing further from our preliminary studies [21,22], this paper presents the algorithm in more detail and extends the experiments to a larger test database containing various types of noise signals to confirm the effectiveness of the proposed approach. Furthermore, we report an investigation of the algorithm's convergence and stability.
The paper is organized into five sections. We first summarize the baseline audio source separation algorithm using the NMF model in Section 2. We then present the proposed approach in Section 3. Section 4 discusses the experiment settings, algorithm analysis, and speech enhancement results. Finally, we conclude in Section 5.

Baseline Supervised NMF-based Speech Separation Method
To extract the desired speech signal from the single-channel noisy signal (referred to as the mixture), we consider the mixture as a signal created by mixing two audio sources: the desired speech and the noise, where the noise can be environmental noise or any other unwanted sound. In general, the source separation processing is done in the time-frequency domain after the short-time Fourier transform (STFT), so that the 1D waveform is represented by a 2D spectrogram. This 2D spectrogram is then modeled by NMF, a widely used model in audio signal processing in general and in audio separation in particular [23][24][25].
Let X ∈ C^(F×M), Y ∈ C^(F×M), and Z ∈ C^(F×M) be the complex-valued matrices of the short-time Fourier transform (STFT) coefficients of the observed mixture signal, the speech signal, and the noise signal, respectively, where F is the number of frequency bins and M is the number of time frames. The mixing model then writes

X = Y + Z.

Denoting by V = |X|^.2 the power spectral matrix of the mixture signal, where X^.n is the matrix whose elements are those of X raised element-wise to the power n, NMF approximates

V ≈ B * A,

where * is the usual matrix multiplication, B ∈ R_+^(F×K) is the spectral basis matrix whose column vectors are spectral characteristics appearing in V, A ∈ R_+^(K×M) is the activation matrix whose row vectors indicate the times of appearance of the spectral components in B, and K is the number of spectral components to be synthesized. Depending on the application and the properties of the input data, K is usually chosen such that B is able to represent most spectral characteristics of the input signal [26]. To estimate the latent matrices, B and A are initialized with random nonnegative values and updated in an iterative process such that the cost function (3), representing the divergence between V and B * A, is minimized:

min_{B,A ≥ 0} D(V || B * A) = Σ_{f=1}^{F} Σ_{m=1}^{M} d_IS(V_{fm} | (B * A)_{fm}),    (3)

where f and m denote the frequency bin index and the time frame index, respectively, and d_IS(x | y) = x/y − log(x/y) − 1 is the Itakura-Saito divergence. This divergence is commonly used in audio source separation as it offers the scale-invariance property. In each iteration, B and A are updated via the well-known multiplicative update (MU) rules [26]:

B ← B ⊙ [(V ⊙ (B * A)^.−2) * A^T] / [(B * A)^.−1 * A^T],    (5)

A ← A ⊙ [B^T * (V ⊙ (B * A)^.−2)] / [B^T * (B * A)^.−1],    (6)

in which C^T is the transpose of matrix C, ⊙ denotes the element-wise Hadamard product, and the powers and the division are also element-wise. Suppose that B_Y and B_Z are the spectral basis matrices of speech and noise, respectively. In the training process of the supervised approach, they are learned from the corresponding training examples by optimizing a criterion similar to (3); the spectral model for the two sources is then obtained by concatenation:

B = [B_Y, B_Z].

In the speech enhancement process, this spectral model B is fixed, and the time activation matrix A is
estimated via the MU rule (6). Note that A also consists of two blocks, A_Y and A_Z, characterizing the time activations for speech and noise, respectively:

A = [A_Y^T, A_Z^T]^T.

After the parameters B and A are obtained, the speech STFT coefficients are determined by Wiener filtering as

Ŷ = [(B_Y * A_Y) / (B * A)] ⊙ X,

where the division is element-wise. Finally, the estimated speech signal in the time domain is obtained via the inverse STFT.
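To make these update rules concrete, the following is a minimal numpy sketch of IS-NMF with multiplicative updates and the Wiener-filter speech estimate. It is an illustrative reimplementation, not the authors' code; the function names `is_nmf` and `wiener_speech` and the small numerical floor `eps` are our own choices.

```python
import numpy as np

def is_nmf(V, K, n_iter=100, B=None, update_B=True, seed=0):
    """Itakura-Saito NMF: V (F x M, nonnegative) ~ B @ A via multiplicative updates.
    If B is given and update_B=False, only the activations A are fitted (supervised case)."""
    rng = np.random.default_rng(seed)
    F, M = V.shape
    eps = 1e-12  # numerical floor to avoid division by zero
    if B is None:
        B = rng.random((F, K)) + eps
    A = rng.random((B.shape[1], M)) + eps
    for _ in range(n_iter):
        # MU rule for A: A <- A . [B^T (V . Vh^-2)] / [B^T Vh^-1]
        Vh = B @ A + eps
        A *= (B.T @ (V * Vh**-2)) / (B.T @ Vh**-1 + eps)
        if update_B:
            # MU rule for B: B <- B . [(V . Vh^-2) A^T] / [Vh^-1 A^T]
            Vh = B @ A + eps
            B *= ((V * Vh**-2) @ A.T) / (Vh**-1 @ A.T + eps)
    return B, A

def wiener_speech(X, B, A, K_speech):
    """Wiener filtering: Yhat = (B_Y @ A_Y) / (B @ A) element-wise times X,
    with the first K_speech columns of B (rows of A) belonging to speech."""
    speech_power = B[:, :K_speech] @ A[:K_speech, :]
    return speech_power / (B @ A + 1e-12) * X
```

Because the Wiener mask is a ratio of the nonnegative speech power to the total modeled power, its values lie in [0, 1], so the estimated speech magnitude never exceeds the mixture magnitude in any time-frequency bin.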

Proposed Method
In the unspecified-noise scenario, a clean speech example from the desired speaker is assumed to be available a priori for training, but no exact noise example is available.
where A_Y is the time activation matrix.

(ii) Generic noise spectral model. A spectral model B_Z^(p) is learned from each noise example p = 1, ..., P in the same way, by minimizing the divergence between the example's spectrogram and B_Z^(p) * A_Z^(p).

(iii) Spectral model for all sources. The spectral model for all speech and noise sources is computed by concatenation:

B = [B_Y, B_Z].

In the speech enhancement phase, this spectral model B is fixed, and the time activation matrix A is estimated via the MU rule. Matrix A includes the speech activation matrix A_Y and the noise activation matrix A_Z as

A = [A_Y^T, A_Z^T]^T.

Proposed Mixed Group Sparsity-inducing Penalty for Noise Model Fitting
The generic spectral model for noise, B_Z, becomes a large matrix as the number of noise examples P increases. Moreover, it is actually redundant when different examples share similar spectral patterns [27][28][29]. Thus, in the NMF model fitting for signal separation, a sparsity constraint is naturally needed so as to fit only a subset of the large matrix B_Z to the actual noise present in the mixture [28]. In other words, the mixture spectrogram V is decomposed by solving the following optimization problem:

min_{A ≥ 0} D(V || B * A) + λ Ω(A_Z),    (15)

where Ω(A_Z) denotes a penalty function imposing sparsity on the activation matrix A_Z, and λ is a trade-off parameter determining the contribution of the penalty.
Recent work in audio source separation has considered two penalty functions. The first one is the block sparsity-inducing penalty [16], formulated as

Ω(A_Z) = Σ_{p=1}^{P} log(ε + ||A_Z^(p)||_1),    (16)

where ε is a non-zero constant, A_Z^(p) is the subset of A_Z representing the activation coefficients for the p-th block, ||·||_1 is the ℓ1-norm operator, and P denotes the total number of blocks. In this case, a block represents one training example and P is the total number of noise examples. This penalty enforces the activation of relevant examples only, while omitting poorly fitting examples, since their corresponding activation blocks will likely converge to zero. The second one is the component sparsity-inducing penalty [18], formulated as

Ω(A_Z) = Σ_{k=1}^{K_Z} log(ε + ||a_Z^(k)||_1),    (17)

where a_Z^(k) denotes the k-th row of A_Z and K_Z is the number of rows of A_Z. This penalty is motivated by the fact that only a part of the spectral model learned from an example may fit well with the targeted source in the mixture, while the remaining components in the model do not. Thus, instead of activating a whole block, this penalty allows selecting only the more likely relevant spectral components from B_Z.
However, the component sparsity-inducing penalty removes unsuitable parts rather slowly, because it considers each row of the large matrix individually. Inspired by the advantages of these two state-of-the-art penalty functions, in our recent works [21,22] we proposed to combine them in a more general form:

Ω(A_Z) = α Σ_{p=1}^{P} log(ε + ||A_Z^(p)||_1) + (1 − α) Σ_{k=1}^{K_Z} log(ε + ||a_Z^(k)||_1),    (18)

where the first term on the right-hand side is the block sparsity-inducing penalty, the second term is the component sparsity-inducing penalty, and α ∈ [0, 1] weights the contribution of each term. The proposed penalty function (18) can be seen as a generalization of (16) and (17), in the sense that when α = 1, (18) is equivalent to (16), and when α = 0, (18) is equivalent to (17). In order to derive the parameter estimation algorithm optimizing (15) with the proposed penalty function (18), one can rely on MU rules and the majorization-minimization algorithm. The proposed algorithm is summarized in Algorithm 1, where E^(p) is a uniform matrix of the same size as A_Z^(p), and g^(k) is a uniform row vector of the same size as a_Z^(k).
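As a concrete illustration, the mixed penalty (18) can be evaluated as follows. This is a sketch based on our reading of the formulas above; the function name `mixed_group_sparsity`, the default ε, and the representation of blocks as row slices are our own assumptions, not part of the paper.

```python
import numpy as np

def mixed_group_sparsity(A_Z, block_slices, alpha, eps=1.0):
    """Mixed penalty (18): alpha * sum_p log(eps + ||A_Z^(p)||_1)
                         + (1 - alpha) * sum_k log(eps + ||a_Z^(k)||_1).
    A_Z is nonnegative, so the l1-norm of a block or row is just its sum.
    block_slices lists the row slices of A_Z belonging to each noise example."""
    block_term = sum(np.log(eps + A_Z[s].sum()) for s in block_slices)  # one term per example
    comp_term = sum(np.log(eps + row.sum()) for row in A_Z)             # one term per component
    return alpha * block_term + (1.0 - alpha) * comp_term
```

Setting `alpha=1.0` recovers the block penalty (16) and `alpha=0.0` recovers the component penalty (17), matching the limiting cases discussed above.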

Experiment
We start by describing the dataset and parameter settings in Section 4.1. We then describe the evaluation metrics in Section 4.2. The performance of the proposed speech enhancement algorithm and its sensitivity with respect to the choice of hyperparameters are presented in Section 4.3.

Dataset and parameter settings
To validate the performance of the proposed approach, we select noise examples from the DEMAND dataset for training the generic noise spectral model, and perform the test on the benchmark dataset from the SiSEC campaign. These datasets were carefully designed by researchers in the audio source separation community and are widely used.
The training speech example is five seconds long and is spoken by the same person as the speech in the test mixtures. We use five types of environmental noise (kitchen sound, waterfall, metro, field sound, and cafeteria) to train the generic noise spectral model (see Section 3.1). They are extracted from DEMAND, with durations varying from 5 to 15 seconds. The performance of the proposed algorithm was evaluated over a test set containing 15 single-channel mixtures of two sources, artificially mixed at 0 dB signal-to-noise ratio (SNR). Note that these 15 mixtures with various types of noise should be sufficient to assess the performance of the proposed algorithm. During the mixing process, we made sure that in all mixtures both sources appear all the time. The mixtures were sampled at 16000 Hz and their durations vary between 5 and 10 seconds. The speech samples include female and male speech in English, obtained from the SiSEC dataset. The noise samples were obtained from one of the 16 channels of DEMAND. Some of them are mixtures of two noise types, e.g., traffic + wind, ocean waves + birdsong, restaurant + guitar, forest birds + car, square + music, etc.
The parameters were set as follows. The STFT was calculated using a sliding window with a frame length of 1024 samples and 50% overlap. The numbers of NMF components were set to 32 and 16 for speech and noise, respectively. The number of iterations for the MU updates was 100 in the training step, and was varied from 1 to 100 in the testing step in order to investigate the convergence of the algorithm.
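For reference, the framing implied by these settings (1024-sample windows with 50% overlap at 16 kHz) can be sketched as follows. This is an illustrative numpy version with a Hann window, not the exact STFT routine used in the paper.

```python
import numpy as np

def power_spectrogram(x, n_fft=1024, hop=512):
    """Power spectrogram V = |STFT(x)|^2 with 50% overlap (hop = n_fft // 2)."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Slice the waveform into overlapping windowed frames, then FFT each frame.
    frames = np.stack([x[i * hop : i * hop + n_fft] * w for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1).T) ** 2  # shape (F, M), F = n_fft // 2 + 1
```

With these settings, one second of 16 kHz audio yields F = 513 frequency bins and M = 30 time frames, which fixes the size of the matrix V that the NMF factorizes.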

Evaluation method
We compare the separation performance obtained by the proposed algorithm with several state-of-the-art algorithms as follows: • Baseline NMF - without training: the NMF-based algorithm described in Section 2. This test did not use training data; instead, the spectral models for both speech and noise were initialized with random nonnegative values and iteratively updated via (5) and (6).
• Baseline NMF - speech training: the NMF-based algorithm described in Section 2. In this experiment, the spectral model for the speech signal was learned from speech examples that were five seconds long and spoken by the same person as the speech in the test mixtures. The spectral model for noise was initialized with random nonnegative values and iteratively updated via (5) and (6).
• NMF non-sparsity: the NMF-based algorithm described in Section 2. The spectral model for speech was also learned from a five-second example spoken by the same person as the speech in the test mixtures. The noise spectral model was learned from one noise file made by combining the five noise samples in the noise training set described in Section 4.1.
Separated speech results were evaluated using the source-to-distortion ratio (SDR), measuring overall distortion, as well as the source-to-interference ratio (SIR) and the source-to-artifacts ratio (SAR). They are measured in dB and averaged over all sources; higher values are better. These criteria, known as the BSS-EVAL metrics, are the most widely used in the source separation community [30].
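As a simplified illustration of the SDR idea, one can project the estimate onto the reference and compare the energies of the target and residual parts. This is only a sketch: the full BSS-EVAL metric additionally allows a short time-invariant distortion filter on the reference and handles source permutations, so a toolbox such as `mir_eval` or the original BSS-EVAL implementation should be used for actual evaluation. The function name `simple_sdr` is ours.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Crude SDR in dB: the part of `estimate` collinear with `reference`
    counts as target signal, everything else as distortion."""
    s_target = (estimate @ reference) / (reference @ reference) * reference
    distortion = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (distortion @ distortion))
```

For example, an estimate equal to the reference plus noise at one-hundredth of its power scores around 20 dB, which matches the intuitive reading of the SDR values reported in Table 1.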

Results and Discussion
The results, averaged over all 15 test mixtures, are reported in Table 1 for six different algorithms.
It is interesting to see in Table 1 that the results obtained by the "NMF non-sparsity" method were even lower than those of the "Baseline NMF - speech training" method. This reveals that the generic noise spectral model itself is redundant and contains spectral patterns irrelevant to the actual noise in the mixture. The importance of the sparsity penalty is thus explicitly confirmed by the fact that the results obtained by the three algorithms based on NMF with group sparsity-inducing penalties were far better than those of the remaining three algorithms. It is also not surprising that the baseline NMF method yielded quite good results when using training data for the speech signal (the "Baseline NMF - speech training" method gained 5.8 dB SDR), but a very low result without training data (the "Baseline NMF - without training" method gained -0.5 dB SDR). Finally, the "Proposed NMF - Mixed sparsity" algorithm offers the best speech enhancement performance in terms of all of SDR, SIR, and SAR compared to the five existing ones. More specifically, compared to the two algorithms based on NMF with group sparsity-inducing penalties, the proposed NMF - Mixed sparsity method gained 0.4 dB and 0.3 dB higher SDR than the "NMF - Block sparsity" and "NMF - Component sparsity" methods, respectively. The proposed method's results were also far better than those of the first three methods. This proves the effectiveness of the proposed combination of the two state-of-the-art group sparsity-inducing penalties.
Investigating the convergence of the proposed method, Figure 2 shows that all measures (SDR, SIR, and SAR) increase with the number of MU iterations. This confirms that the derived algorithm converges correctly and saturates after about 20 MU iterations.
The average speech separation performance over all mixtures in the test set, as a function of λ and α, is shown in Figure 3. As can be seen, the proposed algorithm is less sensitive to the choice of α and more sensitive to the choice of λ. It is quite stable for small values of λ, with the best results obtained for 1 ≤ λ ≤ 25 and 0 ≤ α ≤ 0.4. Overall, the proposed algorithm is not very sensitive to the choice of these hyperparameters, so in practical implementations they can be set quite easily.

Conclusions
In this paper, we have presented a speaker-dependent single-channel speech separation method based on the nonnegative matrix factorization framework. Our method employs several different noise signal files to build a general spectral model for noise. For estimating the speech and noise signals from their mixture, we proposed the combination of NMF with two types of sparsity constraints. Experimental results showed the effectiveness of the proposed algorithm. Our further investigation demonstrated the algorithm's convergence and its robustness to the choice of the hyperparameters λ and α. These properties are very useful for parameter setting in practical deployments of the algorithm. Future work could be devoted to extending the work to the multi-channel case, where a spatial model for the audio sources, such as the one considered in [31], is incorporated. Additionally, validating the effectiveness of the proposed denoising approach for automatic speech recognition (ASR) would be of particular interest.

where A_Z^(p) is the time activation matrix. After all spectral models B_Z^(p), p = 1, ..., P, are learned from the noise examples, the generic noise spectral model, denoted by B_Z, is constructed by concatenation:

B_Z = [B_Z^(1), ..., B_Z^(P)].
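The concatenation step that builds the generic noise model can be illustrated as follows; the sizes here (F = 513 bins, P = 5 examples, 16 components per example) are placeholders matching the experimental settings, and the random matrices stand in for learned per-example models B_Z^(p).

```python
import numpy as np

rng = np.random.default_rng(0)
F, P, K_p = 513, 5, 16  # frequency bins, noise examples, components per example
models = [rng.random((F, K_p)) for _ in range(P)]  # stand-ins for learned B_Z^(p)
B_Z = np.hstack(models)  # generic noise model: F x (P * K_p) column-wise concatenation
```

The resulting B_Z grows linearly with P, which is exactly why the sparsity penalty of Section 3.2 is needed to select only the relevant blocks or components at separation time.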

EAI Endorsed Transactions on Context-aware Systems and Applications, 07 2017 - 03 2018 | Volume 4 | Issue 13 | e5

Figure 1. General workflow of the proposed speech enhancement approach.


Table 1. Average speech enhancement performance (SDR, SIR, and SAR, in dB) obtained on the test set; the compared settings include "NMF - Component sparsity [18]" (λ = 1, α = 0) and "Proposed NMF - Mixed sparsity" (λ = 1, α = 0.2).

Figure 2 shows the convergence of the proposed algorithm as a function of the number of MU iterations.

Figure 2. Speech enhancement performance of the proposed method as a function of MU iterations.

Figure 3. Average speech enhancement performance of the proposed method as a function of λ and α.

However, some general noise examples can easily be collected from different noisy environments for training. For example, in order to separate speech from environmental noise, we collect environmental sounds such as wind, street noise, cafeteria, etc., for noise training. The global workflow of the proposed approach for speech separation is shown in Figure 1. In the following, we first present the training of both the speech spectral model B_Y and the generic noise spectral model B_Z in Section 3.1. We then describe the model fitting with the proposed mixed group sparsity constraint for the source separation process in Section 3.2.

3.1. Training Spectral Models for Speech and Noise

(i) Speech spectral model. Let V_Y = |Y|^.2 be the spectrogram of a clean speech example obtained by the STFT. The speech spectral model B_Y is learned from V_Y by minimizing the divergence D(V_Y || B_Y * A_Y) over B_Y ≥ 0 and A_Y ≥ 0, where A_Y is the corresponding time activation matrix.