Removing traces of other phono-objects from a speech signal by decomposing them into microwave elements
Zhenilo Valery Romanovich, Doctor of Technical Sciences
Source: «Special Technology» magazine
Removing traces of other phono-objects from a speech signal that interfere with the correct perception of speech or reduce the quality of its sound is a rather urgent task. Unfortunately, many filtering methods, including adaptive ones, sometimes prove insufficiently effective for its solution. This issue is especially acute in forensic science, which requires preserving the authenticity of traces of the phono-objects under study at all stages of the study and, in particular, at the stage of cleaning the speech signal from noise and interference.
Currently, speech signal filtering systems available to forensic experts have one very negative property. All of them, one way or another, distort traces of phono-objects, based on which the expert subsequently makes a decision on a particular issue of the examination.
In this regard, special studies were conducted at the Expert-Criminalistic Center of the Ministry of Internal Affairs of Russia (I.N. Timofeev, T.I. Goloshchapova, I.V. Dokuchaev. Possibilities of using multichannel signal noise reduction tools in identification studies: Abstracts of the International Conference «Informatization of Law Enforcement Systems», Moscow, 1997, pp. 194-196) to determine which speech-intelligibility filtering systems can be used to clean speech traces of noise, interference, and overlaid traces of other phono-objects, and which cannot. It turned out that virtually all existing speech signal filtering systems shift (distort) the acoustic parameters of speech signals on which decisions are based about the identity of voice traces and the speaker's articulation, about the diagnostics and identification of recording conditions, and so on. The reason is that the original goal of any such automatic system is to increase the intelligibility of speech perceived by ear by any means possible, even at the cost of distorting the speech signal until it no longer resembles natural speech.
Undoubtedly, the task of increasing the intelligibility of speech signals is as difficult as any other in speech technology. The developers of these systems are therefore forced to sacrifice some quality in the resulting cleaned speech signal. A live speech signal is the trace of an intelligent phono-object in real terrestrial conditions, with all the effects of wave reflection, reverberation, tape recorder wow and flutter, and so on. It is very difficult to describe such a trace mathematically a priori with sufficient reliability. Developers of intelligibility enhancement systems therefore neglect possible changes in a number of qualitatively important characteristics of the filtered speech signal for the sake of the main question: what is being said under interference or noise.
In practice, the following situation arises most often: to measure the parameters of speech signals for identification studies, the signals must first be cleaned of interference and noise, but after this procedure identification studies on the cleaned traces often become simply unacceptable.
To break this deadlock, it is necessary, in the author's opinion, to include the expert's intelligence in the speech signal cleaning system itself, rather than leave the automatic signal processing system the right to «decide» on its own what is important to keep in the speech signal and what can be deleted. This approach will be very unpopular at first, since the productivity of the «labor» of automatic computer systems in the computing part is incomparably higher than that of human labor. It should be remembered, however, that their productivity in the intellectual part differs just as much, but in the opposite direction.
The first human-machine signal filtering technologies will all have low productivity. However, as they develop and become standardized, they will undoubtedly speed up as all the obviously labor-intensive computing processes are transferred to the computer.
Below is one approach to creating such a technology. It is based on stratifying all traces of the phono-objects present in a given signal into two groups and then restoring two time-domain signals, so that each contains traces of only one group and, most importantly, remains authentic.
It should be noted that in phonoscopy this approach is possible because, before an acoustic wave is converted into an electrical signal (in a microphone), all acoustic signals behave like ordinary waves, with all the properties, advantages and disadvantages that follow from this. In fingerprinting, for example, superimposing a new paint-covered fingerprint on an old one can completely cover individual elements of the latter; in phonoscopy, superimposing two acoustic waves (speech signals) only leads to their interference. If traces of several phono-objects are mixed, they “live” independently of each other: superimposing one trace on another does not destroy the latter, as happens when several fingerprints are superimposed.
In the work “Computer Phonoscopy” (1995) the author proposed a classification of the main types of phono-objects most often encountered in forensic practice. There are not many of them: a person, a tape recorder (or, more generally, a speech recorder), a harmonic, a voice, a series of identical pulses, and others. Although the classification has few elements, each of them has its own complex system for the mathematical description of its properties, and each type of phono-object has its own technology for analysis, processing, or cleaning. Such a technology is easiest to create for phono-objects of the harmonic type. However, since these are relatively rare in practice in their pure form, developing the corresponding technologies is less relevant than, for example, developing technologies for a phono-object of the “person” type. It is therefore advisable to start with a technology for analyzing and processing traces of phono-objects of the “other” type, primarily by the criterion “ratio of development costs to practical relevance”. If the number of developers of phonoscopic systems in the scientific departments of the Ministry of Internal Affairs grows, a solution to the same problem for the remaining types of phono-objects can be expected in the near future.
Developing a technology for separating traces of different phono-objects in the “other” category is also more useful because this technology has to be extremely universal, so that in some cases it will most likely also help separate traces of the phono-object types listed above. Of course, its universality may make it less technologically efficient, but, we repeat, it will allow us to break the deadlock on an issue that is fundamentally difficult for forensics.
The category of the phono-object “other” assumes that the real properties of this phono-object are unknown. Consequently, there is no mathematical description of its properties, and therefore, perhaps, the only currently acceptable mathematical apparatus for representing and analyzing traces of these phono-objects is the classical spectral description based on the Fourier transform.
Since the basis functions of the Fourier transform are harmonics, it is clear that the proposed technology will be most adequate for signals whose main information elements are harmonic components. In fact, all tonal sections of speech signals are of this kind.
Representing the traces of a phono-object with traditional sonograms allows an expert to visually distinguish individual time-frequency components even when several signals (traces of different phono-objects) are mixed. Let us recall that it is practically impossible to do the same in the time domain using the original oscillogram.
For convenience of further description, we will introduce several auxiliary definitions.
The construction of sonograms has many degrees of freedom. Since a sonogram is in effect a series of amplitude spectra following one another with a constant time step, we will speak of it as a film consisting of many frames. Each such film is described by two main parameters: the frame rate and the resolution of each frame. In cinema the connection between these parameters is usually not specified, but in phonoscopy it is. The essence of this connection is as follows. Let us assume that the spectra for constructing a sonogram are calculated using a Gaussian time window providing a frequency resolution of σf Hz. In this case, constructing a film with a frame rate greater than 2πσf frames per second makes no sense, because adjacent spectra (or, as we will also call them, frames of the sonogram film) become informationally redundant. For brevity, the sonogram film will hereafter be called a sonofilm.
If we stick to the optimal ratio of frame rate and their resolution, it turns out that the sonofilm task has only one degree of freedom. This can be either the frame rate or the resolution of an individual frame. Let's assume that the frame rate was chosen as the degree of freedom for constructing a sonofilm. What should it be?
The choice of the sonofilm frame rate depends on the properties of the visualized signal or, in other words, on the properties of the phono-object under study. For bats or dolphins, the frame rate should be relatively high. For the periodic sounds of sea waves or other slow processes (electrocardiogram signals, breathing, etc.), the frame rate will be relatively low. For different mechanisms emitting periodic signals, the sonofilm frame rate that reflects their properties most clearly will be different.
What the sonofilm frame rate should be for a human speech signal has not yet been precisely established. It may well be close to the minimum frame rate of an ordinary movie, that is, the rate at which frame flicker is not yet visible but below which it becomes noticeable. For the results presented below, the author empirically selected a sonofilm frame rate of 150 frames per second. For the sake of objectivity, it should be noted that resolving this issue requires special studies of the optimal sonofilm frame rate. The optimality criterion will depend on the problem being solved. It may be, for example, the degree of difference between the original speech signal and the signal synthesized from all traces of that speech signal reflected on the sonofilm.
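The relation between frame rate and frequency resolution can be made concrete with a short Python sketch. It assumes that the frequency resolution σf is the standard deviation of the Gaussian window's amplitude spectrum, so that the window's standard deviation in time is 1/(2πσf). This convention and all function names are illustrative assumptions, not taken from the article.

```python
import math


def max_useful_frame_rate(sigma_f_hz: float) -> float:
    """Highest frame rate (frames/s) before adjacent frames become redundant,
    using the 2*pi*sigma_f bound stated in the text."""
    return 2.0 * math.pi * sigma_f_hz


def resolution_for_frame_rate(frames_per_second: float) -> float:
    """Frequency resolution sigma_f (Hz) matched to a chosen frame rate."""
    return frames_per_second / (2.0 * math.pi)


def gaussian_window_sigma_t(sigma_f_hz: float) -> float:
    """Time-domain standard deviation (s) of the Gaussian window.

    Assumption: sigma_f is the standard deviation of the window's amplitude
    spectrum, which gives sigma_t * sigma_f = 1 / (2*pi).
    """
    return 1.0 / (2.0 * math.pi * sigma_f_hz)


if __name__ == "__main__":
    sigma_f = resolution_for_frame_rate(150.0)
    print(f"sigma_f = {sigma_f:.2f} Hz")
    print(f"sigma_t = {1000.0 * gaussian_window_sigma_t(sigma_f):.2f} ms")
```

Under these assumptions, a frame rate of 150 frames per second corresponds to a frequency resolution of about 23.9 Hz and a window standard deviation of about 6.7 ms, so the frame step equals one standard deviation of the window.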
When speaking about how the trace of a phono-object with unknown properties is reflected on a sonofilm, it should be taken into account that its nature can be completely arbitrary: a pulse, a series of individual harmonic components, a series of pulses that transform (or not) into a voice, etc. In all these cases the mathematical description will be different. The choice of the basis for the description is actually determined by the type of transformation of the original signal from the oscillographic form to the sonographic one. Therefore, if a sonofilm is constructed using the Fourier transform and the Gaussian time window, the basis elements into which the traces of phono-objects are decomposed must belong to the same category. In this case, the reflection of the simplest element of the trace of a phono-object on a sonofilm will be presented in the following form:
, (1)
where
σf – frequency resolution of the spectra;
t0 – position of the Gaussian window in time (in fact, the time of the current frame of the sonofilm);
ω0 – frequency of the harmonic component of the trace of the phono-object in the current frame;
φ0 – initial phase of the harmonic component of the trace of the phono-object in the current frame;
α0 – rate of change of the amplitude of the harmonic component of the trace of the phono-object in the current frame;
δ0 – rate of change of the frequency of the harmonic component of the trace of the phono-object in the current frame;
A0 – amplitude of the harmonic component of the trace of the phono-object in the current frame.
The decomposition of the original speech signal into a series of microwave elements (1) in each frame of the sonofilm is actually carried out when calculating any traditional amplitude sonogram.
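Since the explicit form of (1) is not shown above, the following Python sketch only illustrates what such an element might look like. It assumes one plausible form consistent with the listed parameters: a Gaussian envelope centered at t0, an exponential change of amplitude at rate α0, and a linear change of frequency at rate δ0. The functional form, the units, and the name `micro_element` are assumptions for illustration, not the author's formula.

```python
import numpy as np


def micro_element(t, A0, w0, phi0, a0, d0, t0, sigma_f):
    """Hypothetical single element of a trace in one sonofilm frame.

    Assumed form (not the article's formula (1)):
        A0 * exp(a0*tau) * exp(-tau**2 / (2*sigma_t**2))
           * cos(phi0 + w0*tau + 0.5*d0*tau**2),   tau = t - t0
    with w0 in rad/s, d0 in rad/s^2, a0 in 1/s, sigma_f in Hz.
    """
    # Assumed link between frequency resolution and window duration.
    sigma_t = 1.0 / (2.0 * np.pi * sigma_f)
    tau = np.asarray(t, dtype=float) - t0
    envelope = A0 * np.exp(a0 * tau) * np.exp(-tau**2 / (2.0 * sigma_t**2))
    phase = phi0 + w0 * tau + 0.5 * d0 * tau**2
    return envelope * np.cos(phase)


if __name__ == "__main__":
    fs = 8000.0                      # illustrative sampling rate, Hz
    t = np.arange(0.0, 0.1, 1.0 / fs)
    e = micro_element(t, A0=1.0, w0=2 * np.pi * 440.0, phi0=0.0,
                      a0=0.0, d0=0.0, t0=0.05, sigma_f=24.0)
    print(e.shape, float(np.max(np.abs(e))))
```

With this form, the element equals A0·cos(φ0) at t = t0 and is negligible a few window widths away from t0, which matches the localized behavior expected of a single frame element.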
The elementary harmonic component of the signal given by formula (1) resembles in some of its properties the “wavelet”, a concept currently popular in speech information technologies. For example, like a wavelet, the function e(t) tends to zero very quickly once t deviates from t0 by more than a few widths of the Gaussian time window (a duration inversely proportional to σf).
In the current frame, each elementary harmonic component of the trace of a phono-object of a priori unknown type can have five degrees of freedom: A0, ω0, φ0, α0 and δ0. If this component is sufficiently powerful, all its parameters can be calculated with sufficient accuracy from a single instantaneous frame of the complex spectrum.
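How parameters can be read off a single complex spectrum frame may be illustrated with a simplified Python sketch. It covers only the stationary case (α0 and δ0 assumed zero) and uses a standard technique, parabolic interpolation of the log-magnitude spectrum around the peak, which is exact for a Gaussian window. This is not claimed to be the author's algorithm; the function name and the convention that σf is the standard deviation of the window's amplitude spectrum are assumptions.

```python
import numpy as np


def estimate_tone_from_frame(frame, fs, sigma_f):
    """Estimate (amplitude, frequency in Hz, phase at frame center) of the
    strongest stationary tone from one Gaussian-windowed spectrum frame."""
    n_samples = len(frame)
    n = np.arange(n_samples)
    center = (n_samples - 1) / 2.0
    # Assumed convention: sigma_t = 1 / (2*pi*sigma_f), expressed in samples.
    sigma_samples = fs / (2.0 * np.pi * sigma_f)
    window = np.exp(-0.5 * ((n - center) / sigma_samples) ** 2)

    spectrum = np.fft.rfft(frame * window)
    mag = np.abs(spectrum)
    k = int(np.argmax(mag[1:-1])) + 1          # skip the DC and Nyquist bins

    # The log-magnitude of a Gaussian-windowed tone is a parabola in frequency,
    # so a three-point parabolic fit locates the true peak between bins.
    la, lb, lc = np.log(mag[k - 1:k + 2])
    p = 0.5 * (la - lc) / (la - 2.0 * lb + lc)
    freq_hz = (k + p) * fs / n_samples
    peak_log_mag = lb - 0.25 * (la - lc) * p
    amplitude = 2.0 * np.exp(peak_log_mag) / window.sum()

    # Remove the linear phase caused by the window being centered mid-frame.
    phase = np.angle(spectrum[k] * np.exp(2j * np.pi * k * center / n_samples))
    return amplitude, freq_hz, phase


if __name__ == "__main__":
    fs, n_samples = 8000.0, 1024
    tau = (np.arange(n_samples) - (n_samples - 1) / 2.0) / fs
    frame = 0.7 * np.cos(2 * np.pi * 440.0 * tau + 0.4)
    print(estimate_tone_from_frame(frame, fs, sigma_f=24.0))
```

The sketch assumes a single dominant tone well away from zero frequency and from the Nyquist frequency; estimating the two rate parameters would need additional information from the shape of the complex peak, which is not attempted here.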
The last statement marks a significant difference between the technology for working with a phono-object of unknown nature (phono-objects of the “other” group) and that for phono-objects of a specific group. If we are dealing with a phono-object of a specific group, the dynamics of its trace can be modeled and predicted over a relatively long period of time. In such cases the trace of the phono-object of interest may, in individual frames of the sonofilm, completely disappear among the traces of more powerful phono-objects. However, because the traces of different phono-objects interfere, their stratification remains theoretically possible.
When working with phono-objects of the “other” group, one cannot expect the expert to clearly distinguish the traces of the desired or the unwanted phono-object in the interference pattern. Traces can therefore be separated correctly only if the trace of one phono-object is significantly more powerful than the trace of the other. In this case, interference phenomena can be neglected when estimating the parameters of the microwave elements (1) of the trace of the most powerful phono-object in the sonofilm frame.
The following experiments were conducted to test the technological validity of the hypotheses expressed.
Experiment 1. Testing the quality of speech signal decomposition into microwave elements (1) in each frame of the sonofilm. In Fig. 1, the top part shows the dynamics of the experimental speech signal power level, and the bottom part shows the corresponding sonofilm.
Fig. 1. Power level and sonofilm of the original speech signal.
Under the conditions of this experiment, all traces in all frames of the resulting sonofilm were to be included in further analysis. Therefore, all traces on the sonofilm were decomposed into microwave elements (1), and a new speech signal was then assembled from these elements. The result of this assembly is shown in Fig. 2. Analysis of the resulting synthetic speech signal showed the following.
The synthetic signal is practically indistinguishable from the original by ear. It contains no extraneous overtones, “metallic” timbre, or other unnatural synthetic artifacts.
Fig. 2. Power level and sonofilm of the speech signal synthesized from all traces reflected in the sonofilm in Fig. 1
From the dynamics of the power level and the sonofilm of the synthesized signal it is clear (Fig. 2) that the synthetic signal differs from the original speech signal in the noise pattern accompanying the speech signal, which is clearly visible in speech pauses.
The greatest differences are observed in the region of zero frequency and Nyquist frequency. This is explained by the fact that it is impossible to estimate the microwave components of the original signal with sufficient accuracy in the specified regions. Therefore, they are lost for analysis and synthesis. But in reality, this is a small loss, since, for example, in this case, the frequency components from zero to 25 Hz were lost. These losses are not perceived by the ear. Losses in the same narrow frequency band of 25 Hz, but near the Nyquist frequency, also cannot be called such in the strict sense of the word, since in the process of analog-to-digital conversion (ADC) of the phono-object traces, all their frequency components lying near the Nyquist frequency are always removed (for example, by the input filters of the ADC).
Comparison of the oscillographic form of the description of the original signal with the newly obtained one from microwave elements (Fig. 3) also shows their good agreement. The greatest difference is noticeable only in the unvoiced section of the speech signal in the leftmost part of the oscillogram, and in the tonal sections of the speech signal the original and synthetic signals are almost indistinguishable from each other. And this is despite the high degree of natural non-stationarity of the speech signal in the shown section.
Fig. 3. Fragments of the oscillograms of the signals shown in Fig. 1 and 2 (at the top is the original signal, at the bottom is the artificial one, made up of individual microwave elements)
Experiment 2. Checking the correctness of the calculation and subtraction of powerful harmonic interference. To conduct this experiment, powerful frequency-modulated interference was mixed into the original signal. A sonofilm of the resulting signal is shown in Fig. 4. The interference power level was 5 dB below that of the most powerful section of the original speech signal. Despite the strong interference, the speech signal remained quite intelligible by ear.
Fig. 4. Sonofilm of the original speech signal with a powerful frequency-modulated interference mixed into it
During the experiment, the expert noted a small section of interference traces on the sonofilm, which were automatically decomposed into microwave elements (1) and subtracted from the signal being studied.
The result of this experiment is shown in Fig. 5, where it is clearly seen that the algorithm for calculating microwave elements and removing them worked correctly. This is reflected in the cleanness of the sonofilm of the processed signal: where the interference was located, its traces are practically invisible.
Fig. 5. Sonofilm of the result of removing the interference traces, decomposed into individual microwave elements, reflected on the sonofilm in Fig. 4
It is important to emphasize that in this experiment the speech signal itself was not subjected to decomposition of all traces into microwave elements and subsequent assembly of a new signal. The speech signal remained completely authentic (equal to the original). Only traces of interference were decomposed into microwave elements, which were subtracted from the experimental signal.
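The idea that only the interference is modeled and subtracted, while the remainder of the signal is left untouched, can be illustrated with a deliberately simplified Python sketch. It assumes a stationary tone at a frequency already marked by the expert and fits its amplitude and phase by least squares; the experiments in the article use frequency-modulated interference decomposed frame by frame, which this sketch does not attempt. The function name and test signals are hypothetical.

```python
import numpy as np


def subtract_marked_tone(signal, fs, freq_hz):
    """Fit a stationary tone at a marked frequency and subtract it.

    Only the fitted interference model is removed; everything else in the
    signal is passed through unchanged (no filtering is applied).
    """
    signal = np.asarray(signal, dtype=float)
    t = np.arange(len(signal)) / fs
    basis = np.column_stack([
        np.cos(2.0 * np.pi * freq_hz * t),
        np.sin(2.0 * np.pi * freq_hz * t),
    ])
    # Least-squares amplitude/phase of the tone, in cos/sin coordinates.
    coeffs, *_ = np.linalg.lstsq(basis, signal, rcond=None)
    return signal - basis @ coeffs


if __name__ == "__main__":
    fs = 8000.0
    t = np.arange(4000) / fs
    speech_stand_in = 0.1 * np.sin(2 * np.pi * 300.0 * t)   # weak useful signal
    interference = 1.0 * np.cos(2 * np.pi * 1000.0 * t + 0.5)
    cleaned = subtract_marked_tone(speech_stand_in + interference, fs, 1000.0)
    print(float(np.max(np.abs(cleaned - speech_stand_in))))
```

Because the mixture is additive, subtracting an accurate model of the interference returns the weaker component in its original form rather than a filtered approximation of it. In practice the accuracy depends on how well the interference parameters can be estimated, as discussed in the text.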
The results of this experiment illustrate well the difference between trace cleaning technologies in phonoscopy and fingerprinting. In phonoscopy, by removing microwave elements of powerful interference, it is possible to see (and therefore hear) traces of the useful signal in its original (authentic) form underneath them. This is fundamentally possible due to the additivity of acoustic signals. Therefore, it is quite possible to extract a speech signal from under a series of powerful harmonic interferences that completely drown it out. To test this statement, the following experiment was conducted.
Experiment 3. Extracting a speech signal from a mixture of frequency-modulated interferences that exceed the most powerful sections of the speech signal by more than 20 dB.
Powerful frequency-modulated interference was mixed into the pure speech signal, due to which the speech traces became completely inaudible. Fig. 6 shows the power level of the original signal (upper part of the figure) and the corresponding fragment of the sonofilm (lower part of the figure). The expert identified traces of interference in a small section of the experimental signal. These traces were automatically decomposed into microwave elements and subtracted from the signal under study. As a result, a new signal was obtained, the power level and sonofilm of which are shown in Fig. 7.
Fig. 6. Sonofilm of the original speech signal with powerful frequency-modulated interference mixed in, due to which the speech traces are not heard at all
Fig. 7. Sonofilm of the result of removing all interference traces, decomposed into individual microwave elements, reflected on the sonofilm in Fig. 6
Let us note the following important results of the experiment. First, the sound of the restored fragment of the speech signal after its amplification has a very satisfactory quality. Second, in the processed section of the signal the noise level decreased by about 60 dB. The contour of the power level of the extracted signal (in Fig. 7 above) differs little from the original (in Fig. 1 above). In order to see all the subtleties of the new signal, it was necessary to raise the rendering level of its sonofilm by more than 20 dB. Third, comparison of the sonofilms in Fig. 6 and 7 shows that completely masked traces of speech became visible in the processed signal. At the same time, however, some remnants of traces of noise are visible.
It should be noted that such results can be achieved only when traces of the speech signal are actually present in the signal being studied, and have not been lost through clipping or strong nonlinear distortion, or because the useful signal is so much weaker than the interference that the equipment cannot record it in principle.
The results of the experiments indicate the prospects of the proposed technology for decomposing signals into microwave elements. However, in practice, a number of difficulties may arise, among which room reverberation is one of the most significant. Another experiment was conducted to identify them.
Experiment 4. Extraction of a speech signal from a set of frequency-modulated interferences that make speech completely unintelligible, recorded in a room with unknown reverberation properties.
Figure 8 shows the dynamics of the power level and a sonofilm of the experimental sound recording.
Fig. 8. Sonofilm of the original speech signal with powerful frequency-modulated interference mixed into it in a room with unknown reverberation properties
Listening to this experimental phonogram, one can only recognize a male voice, but not understand a word due to the complete masking of the speech signal. The level of interference exceeded the most powerful sections of the speech signal by 6-10 dB.
After the expert had noted the interference traces on the sonofilm, they were decomposed into microwave elements and subtracted from the original experimental signal. The result of this processing is shown in Fig. 9.
Fig. 9. Sonofilm of the result of removing the interference traces, decomposed into individual microwave elements, reflected on the sonofilm in Fig. 8.
The speech signal cleared in this way became completely intelligible, but some interference remained. However, the level of interference did not decrease as significantly as in experiment 3. It became only 20 dB lower than the most powerful sections of the speech signal (i.e., the interference decreased by 26 – 30 dB, and not by 60 dB, as it was in experiment 3).
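The decibel arithmetic behind these figures can be checked with a few lines of Python. Levels are expressed relative to the most powerful sections of the speech signal, as in the text; the function names are illustrative.

```python
def reduction_db(level_before_db: float, level_after_db: float) -> float:
    """Reduction of the interference level, in dB."""
    return level_before_db - level_after_db


def db_to_power_ratio(db: float) -> float:
    """Convert a level difference in dB to a power ratio."""
    return 10.0 ** (db / 10.0)


if __name__ == "__main__":
    # Interference was 6 to 10 dB above the speech peaks before processing
    # and about 20 dB below them afterwards.
    for before in (6.0, 10.0):
        r = reduction_db(before, -20.0)
        print(f"{before:+.0f} dB -> -20 dB: reduction {r:.0f} dB, "
              f"power ratio {db_to_power_ratio(r):.0f}")
```

A 26 to 30 dB reduction corresponds to the interference power dropping by a factor of roughly 400 to 1000, compared with a factor of about one million for the 60 dB reduction in experiment 3.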
In the lower part of Fig. 9 it is evident that the interference traces are considerably weakened, but have not disappeared completely. Only those traces that remained at their constant frequencies in adjacent frames have completely disappeared. If the trace moved in frequency from frame to frame at a high speed, it weakened the least. This is explained by the fact that in reality in the latter case in one frame there are two traces of harmonic interference in close proximity at the same time. The first trace is of the main wave, and the second is of the wave reflected from the walls of the room. Since the reflected wave arrives slightly later than the direct wave, its frequency will always lag behind the frequency of the main wave by an amount depending on the frequency modulation rate of the interference signal. If you carefully examine the trace of the interference signals in Fig. 8, you will notice that in addition to the main frequency-modulated interference, weakened traces of its reflected wave, shifted to the right, are visible next to it.
Since the amplitude of the reflected waves in this experiment is significantly lower than that of the direct waves, it can be stated that in reality, after one stage of signal processing, only powerful direct waves were removed from the signal, while all reflected waves remained unchanged. The only exceptions were those sections of interference where they practically did not change in frequency. In this case, both the direct and reflected waves were completely removed simultaneously.
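The dependence of the frequency lag of the reflected wave on the modulation rate can be sketched numerically in Python. The numbers below (sweep rate, extra path length, speed of sound) are hypothetical and chosen only for illustration; they are not taken from the experiment.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def reflection_delay_s(extra_path_m: float) -> float:
    """Extra travel time of the reflected wave relative to the direct wave."""
    return extra_path_m / SPEED_OF_SOUND_M_S


def reflected_frequency_lag_hz(sweep_rate_hz_per_s: float, delay_s: float) -> float:
    """Frequency by which the reflected wave lags the direct wave at any instant,
    for interference whose frequency changes linearly at the given rate."""
    return sweep_rate_hz_per_s * delay_s


if __name__ == "__main__":
    delay = reflection_delay_s(3.43)            # hypothetical extra path, metres
    for rate in (0.0, 500.0, 2000.0):           # hypothetical sweep rates, Hz/s
        lag = reflected_frequency_lag_hz(rate, delay)
        print(f"sweep {rate:6.0f} Hz/s -> lag {lag:5.1f} Hz")
```

For interference at a constant frequency the lag is zero, so the direct and reflected waves fall on the same trace and are removed together. The faster the frequency changes, the further the reflected trace moves away from the direct one, and the less of it is captured when only the dominant trace is decomposed and subtracted.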
The traces of the remaining reflected waves can be marked again, decomposed into microwave elements and subtracted from the signal under study. In this case, however, the limits on the correctness of all such operations begin to take effect, since the level of the interference becomes too close to the level of the speech signal itself. Traces of the speech signal itself may then be removed by mistake, and empty light stripes will appear on the sonofilm. It should be emphasized once again, however, that in all other sections of all frames of the sonofilm the traces of the speech signal under study are guaranteed to remain authentic, which is very important for forensic phonoscopic studies and examinations.
The proposed technology can be used to solve a wide range of problems, from the restoration of archival phonograms while preserving their authenticity to recording speech muffled by powerful interference.