A NOVEL REFERENCE-FREE OBJECTIVE SPEECH QUALITY PERCEPTION MEASUREMENT USING MULTI-INSTANCES FEATURES OF DEGRADED SPEECH SIGNAL
Rajesh Kumar Dubey, Arun Kumar
In modern telecommunication networks, speech quality must be measured objectively and continuously at different nodes of the network. Reference-free (non-intrusive) speech quality estimation algorithms measure the quality of a speech signal without using the original clean speech signal as a reference. In this work, reference-free speech quality assessment is performed for telephony-band speech signals using multi-instance features that are probabilistically modelled with a Gaussian Mixture Model (GMM). Single-instance features, as used in existing algorithms, do not accurately capture the time-localized information of short-time transient distortions or distinguish them from the plosive sounds of a speech signal. Hence, it is important to estimate the features relevant to objective speech quality measurement at multiple instances. The silence segments are removed from the speech signal, and only the active speech segments are considered for feature computation, which is carried out on a per-frame basis using Lyon's auditory model. The per-frame features are combined by taking the mean, variance, skewness and kurtosis over the frames to obtain the features of the active speech segment. A principal component analysis is then applied to reduce the dimensionality of the features. In a similar manner, mel-frequency cepstral coefficients (MFCC) and line spectral frequencies (LSF) are computed on a per-frame basis and combined by taking the mean over the frames. The active speech segments are then combined over an increasing number of segments until all the segments of the complete speech utterance are accounted for. The resultant features of each combination of active speech segments are computed in the same manner.
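The frame-to-segment aggregation and the combination of segments described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names `frame_statistics` and `multi_instance_features` are hypothetical, the per-frame features are assumed to be already available as arrays (the auditory-model front end and the PCA step are not shown), kurtosis is taken in its non-excess form, and "an increasing number of active segments" is interpreted as cumulative combinations of the first k segments.

```python
import numpy as np

def frame_statistics(frames):
    """Collapse per-frame features of shape (n_frames, n_features) into one
    vector holding the mean, variance, skewness and kurtosis of each feature."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    centered = frames - mean
    var = (centered ** 2).mean(axis=0)
    # Floor the variance so constant features do not divide by zero.
    std = np.sqrt(np.maximum(var, 1e-12))
    skew = (centered ** 3).mean(axis=0) / std ** 3
    kurt = (centered ** 4).mean(axis=0) / std ** 4  # non-excess kurtosis
    return np.concatenate([mean, var, skew, kurt])

def multi_instance_features(segments):
    """Return one feature vector per active segment, followed by one per
    cumulative combination of the first k segments (k = 2 .. n).

    ASSUMPTION: the cumulative grouping is one reading of the text, not a
    detail confirmed by the source."""
    instances = [frame_statistics(seg) for seg in segments]
    for k in range(2, len(segments) + 1):
        combined = np.vstack(segments[:k])  # pool the frames of k segments
        instances.append(frame_statistics(combined))
    return np.stack(instances)
```

Under this reading, an utterance with n active segments yields 2n - 1 instances, each of dimension four times the per-frame feature count.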
For training, the subjective Mean Opinion Score (MOS) of the speech signal, available from a suitably large and varied training database, is assigned to each active speech segment and to each combination of active speech segments. These features, together with the subjective MOS, are used to train a joint GMM probability density function, which is then used to estimate the objective MOS of each active speech segment or combination of active speech segments. The overall objective MOS of the speech utterance is obtained by averaging the objective MOS of the segments. Results are presented in terms of the correlation coefficient between the subjective MOS and the objective MOS and are compared with ITU-T Recommendation P.563.
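One common way to obtain an estimate from a joint GMM over features and MOS is the conditional expectation of the MOS given the features. The sketch below illustrates that idea under stated assumptions: the text does not specify the estimator, the GMM parameters (`weights`, `means`, `covs`) are assumed to have been fitted beforehand (for example by expectation-maximization), the MOS is assumed to be the last dimension of the joint vector, and the function names are hypothetical.

```python
import numpy as np

def gmm_conditional_mean(x, weights, means, covs):
    """E[y | x] under a joint GMM over z = [x, y], with scalar y stored last.

    weights: (K,), means: (K, d + 1), covs: (K, d + 1, d + 1)."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    d = x.shape[0]
    n_comp = len(weights)
    log_resp = np.empty(n_comp)
    cond_means = np.empty(n_comp)
    for k in range(n_comp):
        mu_x, mu_y = means[k, :d], means[k, d]
        s_xx, s_yx = covs[k, :d, :d], covs[k, d, :d]
        diff = x - mu_x
        sol = np.linalg.solve(s_xx, diff)
        _, logdet = np.linalg.slogdet(s_xx)
        # Log of weight times the marginal Gaussian density of x.
        log_resp[k] = np.log(weights[k]) - 0.5 * (
            diff @ sol + logdet + d * np.log(2.0 * np.pi))
        # Conditional mean of y given x for component k.
        cond_means[k] = mu_y + s_yx @ sol
    resp = np.exp(log_resp - log_resp.max())  # numerically stable posteriors
    resp /= resp.sum()
    return float(resp @ cond_means)

def utterance_mos(instance_features, weights, means, covs):
    """Average the per-instance estimates into one utterance-level score."""
    scores = [gmm_conditional_mean(x, weights, means, covs)
              for x in instance_features]
    return float(np.mean(scores))
```

Here `utterance_mos` mirrors the averaging step described above: each instance (a segment or a combination of segments) receives its own estimate, and the utterance score is their mean.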
Speech quality, Degraded signal, Gaussian mixture model (GMM), Mel-frequency cepstral coefficients (MFCC), Line spectral frequencies (LSF).