Noise reduction of phonograms.
In some areas of human activity, accurate transcription of speech recordings is necessary. This article will discuss the problems that arise when deciphering phonograms and how to solve them.
Sound recording of speech is widely used as a security measure in many areas of human activity. In particular, the spoken orders and communications of aircraft and large ship crews and of personnel at complex technical facilities are recorded on magnetic tape, as are the telephone conversations of various medical, emergency and law enforcement services and of some commercial organizations. In some cases, speech and other acoustic signals must be recorded in industrial premises. Overt and covert recording equipment is used to capture operational information during negotiations, meetings, lectures and conferences, as well as during search, monitoring and other security operations.
In some cases, the low quality of the resulting sound recording (phonogram) creates certain difficulties for deciphering the necessary information. The reasons for this usually lie in both the unsuccessful or inept choice and use of means of transmitting and recording acoustic information, and in the objective difficulty and even impossibility of obtaining a high-quality, «clean» sound recording in some specific circumstances.
In this article we will not dwell on the choice of microphones and other acoustic pickup devices, their placement, the organization of communication channels to the recording equipment, or the optimal choice of the recording equipment itself. This undoubtedly important topic deserves a separate publication. We would only note that in most cases it is cheaper, easier and more useful to invest money and time in the technical and organizational support of good-quality sound recording than to try to «squeeze» the necessary information out of low-quality recordings. For example, we believe that miniature voice recorders such as the «Pearlcorder» from the Japanese company Olympus are of little use for operational sound recording in serious situations, since they lack the performance characteristics needed to record quiet and noisy speech.
Sound recording processing tasks
When processing a sound recording, a technical specialist faces the following tasks:
— to establish and print on paper the text content of the recorded speech, that is, to obtain a «text»;
— to prepare the necessary demonstration audio material from the original sound recordings;
— to carry out the maximum possible cleaning of speech with the removal of interference and distortion;
— to conduct an examination of the sound recordings.
When conducting an examination, it is usually necessary to establish the presence of traces of deliberate editing or copying of the sound recording; determine the type or identify a specific example of the sound recording device; establish the circumstances of the sound recording, the method, the surrounding environment, the placement of the sound recording equipment, the type of source of audible noise, etc.; establish the identity or difference of the voice on this (disputed) sound recording with the voice, a sample of which is presented on another (comparative) phonogram. In some cases, it is necessary to determine the distinctive features of the speaker from the sound recording of oral speech (for example, gender, age, place of birth, profession, presence of diseases, etc.).
When working with a high-quality sound recording and a small amount of recorded material, the first two tasks can be solved by almost any user of an ordinary tape recorder. With a large volume of material, only establishing the exact textual content of the sound recording may require an unacceptably large amount of time.
But any work with low-quality audio recordings requires a technically competent, trained specialist and various technical means.
In turn, any expert examination task requires a qualified specialist who has mastered special techniques and has the necessary additional equipment at his or her disposal.
Transcribing recorded speech
Despite its apparent simplicity, this task has its pitfalls. To obtain the text, you need to listen to the audio recording and simultaneously record the text on paper. Since it is usually impossible to write or type the text at the same pace as the speech, you often have to stop the recording and rewind the tape to listen again. If the creation of texts is carried out episodically, then all you need is a regular tape recorder and a pen with a sheet of paper or a typewriter. But if such work is performed frequently and in large volumes, requires urgency or is generally a daily routine, then it is appropriate to use special equipment.
It is clear that the best option would be to use a purely automatic device capable of recognizing speech and printing text on its own. The authors of this article have long been supporters and developers of such devices. However, there is currently no such equipment on sale that has acceptable characteristics (at best, it requires lengthy tuning to a specific speaker), and it is probably not worth counting on automatic recognition of arbitrary noisy speech in this century.
What is a transcriber tape recorder?
A special transcriber tape recorder has long been used to obtain transcripts of audio recordings. It is controlled by a foot pedal with an electronic «Stop»/«Play»/«Rollback» switch (Philips, Grundig, Sanyo). Working with a transcriber tape recorder is more convenient and productive than with conventional sound reproduction equipment. However, transcribers have serious shortcomings: rather slow mechanical rewinding; a short service life (constant mechanical switching soon leads to failures); tape stretching (which is unacceptable if the cassette is the only material evidence); and low sound quality, which forces repeated listening to key fragments. Verification, that is, comparing the soundtrack with the text after printing, takes a lot of time (and in important cases it is a mandatory procedure). In addition, when the text of a recording is printed in real time (for example, when negotiations are still in progress but the text of statements from their initial stage is already required), repeatedly changing tapes in the tape recorders and urgently delivering them to the typist becomes a serious problem. Finally, cleaning the sound recording of noise and interference requires connecting additional devices.
Possibilities of a computer transcriber
Modern computer transcribers (for example, «Caesar-16») are free from the above-mentioned shortcomings. Usually, a computer transcriber is a board inserted into a personal computer, or a small unit connected to the printer port of the PC.
Before work, a voice recorder is connected to the input of the transcriber, and headphones to one of the outputs. The operator inserts a cassette with the dictated text into the connected voice recorder, starts the transcriber program and types the text of the audio recording, listening to the necessary fragments in a convenient mode. At the same time, the audio recording from the cassette is entered into the computer memory.
The transcriber program provides an optimal combination of the properties of a modern text editor (for fast text typing) and the functions of a digital tape recorder (this ensures the quality of sound listening). In addition, only digital transcribers have access to some completely new features. Thus, with the help of «hot» keys, control is performed («Stop», «Rollback», instant «Rewind, repeat playback»), markers are set in the text (to mark a fragment to which you should return for more careful listening). During work, the screen displays the playback and recording time counters of sound in the PC. The entire system of menus and «hot» keys allows you to organize your work in the most convenient way.
During operation the audio recording from the cassette is converted into digital form and continuously copied to the PC hard drive. For optimal sound quality (frequency range up to 4.5 kHz, dynamic range over 70 dB), 36 MB of free disk space is required for every half hour of speech. Practice shows that compressing the speech data stream is impractical for a transcriber, since compression inevitably degrades the sound quality and tires the operator.
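The 36 MB figure can be checked with simple arithmetic. The sketch below assumes uncompressed single-channel PCM at the 10,000 Hz sampling frequency and 16-bit representation that this article recommends for input to the PC; the function name and its defaults are illustrative only.

```python
def pcm_storage_bytes(duration_s, sample_rate_hz=10_000, bits_per_sample=16, channels=1):
    """Bytes needed to store uncompressed PCM audio of the given duration."""
    return duration_s * sample_rate_hz * (bits_per_sample // 8) * channels


half_hour = pcm_storage_bytes(30 * 60)
print(half_hour / 1_000_000, "MB")  # 36.0 MB
```

At 16,000 Hz the same half hour would take 57.6 MB, which is worth keeping in mind when planning disk space for long recordings.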
The software «highlight» of the product is the provision of a duplex mode, i.e. a mode of continuous recording of sound to the PC hard drive while simultaneously playing back sound from the same disk, but from any arbitrarily selected place of this or any other previously entered digital sound recording. The duplex mode allows you to immediately begin processing the sound recording, without waiting for the PC to complete the input.
The advantage of a computer transcriber is that the tape is copied only once, so important recordings, like the tape recorder itself, are not worn out by repeated starting, stopping and rewinding. At the same time, poorly legible parts of the recording can be printed correctly thanks to repeated listening in a «ring» with slow playback of the recording without distortion of the sound timbre.
In addition, the latest models of digital transcribers have another extremely useful feature — they automatically link the typed text with the corresponding audio sequence. This provides a unique opportunity to quickly compare text printouts and original audio recordings, which is absolutely necessary when documenting important materials. All controversial issues can be easily resolved, since the corresponding audio fragment is automatically called up for any text fragment, and the text is immediately found for the selected section of the audio recording.
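One simple way to picture such a link between text and sound is a list of markers, each pairing a character offset in the typed text with an audio time stamp. The sketch below is purely hypothetical and is not the mechanism of any product named in this article; the class and method names are invented for illustration, and markers are assumed to be added in typing order.

```python
import bisect


class TextAudioIndex:
    """Hypothetical index linking typed-text offsets to audio time stamps."""

    def __init__(self):
        self._offsets = []  # character offsets, ascending
        self._times = []    # audio position in seconds for each offset

    def add_marker(self, char_offset, audio_time_s):
        # Assumption: markers arrive in typing order (ascending offsets and times).
        self._offsets.append(char_offset)
        self._times.append(audio_time_s)

    def time_for_offset(self, char_offset):
        """Audio time of the last marker at or before the given text offset."""
        i = bisect.bisect_right(self._offsets, char_offset) - 1
        return self._times[max(i, 0)]

    def offset_for_time(self, audio_time_s):
        """Text offset of the last marker at or before the given audio time."""
        i = bisect.bisect_right(self._times, audio_time_s) - 1
        return self._offsets[max(i, 0)]


index = TextAudioIndex()
index.add_marker(0, 0.0)
index.add_marker(42, 3.5)
index.add_marker(97, 8.2)
print(index.time_for_offset(60))   # 3.5
print(index.offset_for_time(9.0))  # 97
```

With such an index, selecting a text fragment gives the place in the recording to play, and selecting a section of the recording gives the text to show. Both lookups are binary searches, so they stay fast even for long transcripts.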
The use of digital transcribers allows not only to reduce the time of document preparation and reduce the likelihood of errors, but also to increase the overall efficiency and comfort of professional activity. They have undergone thorough practical testing and have proven their effectiveness for operational office work in the apparatus of law enforcement agencies and the prosecutor's office, in banks, hospitals and other organizations.
Transcribers with noise cancellation
Additional advantages for the user are provided by computer transcribers with built-in hardware noise reduction. For example, in the product «Caesar16SH» special signal processors are located directly on the speech input/output board, which allow you to automatically remove interference. Control of such noise reduction is quite simple and accessible to an ordinary typist. Despite the fact that such transcribers do not have all the capabilities of specialized automated workplaces for noise reduction, they are quite sufficient and extremely useful in many situations. It can be noted that in their «noise reduction» qualities, these transcribers are superior to such domestic products as AF-512, KORS, PAKORS, Zolushka-Mono, foreign DAC-256.
Decoding low-quality recordings of oral speech
The procedure of improving the quality of the speech signal of the sound recording pursues two goals. Firstly, to help the operator to establish the exact textual content of the original sound recording. Secondly, to remove noise and distortion so that untrained listeners (for example, detectives and security guards, managers, judges, public audience) could satisfactorily understand the content of the sound recording. Different methods and means are used to solve these problems.
When working with recorded speech, the operator tries to extract the maximum amount of useful information, he or she strains their attention and is ready to tune in to unusual or even unpleasant sound for a relatively long time. The operator can tolerate strong noise loads, repeatedly listening to the same material in different modes. Therefore, noise reduction here is not aimed at removing noise as such, but at helping the operator, at maximizing the use of his or her hearing characteristics, that is, at “unmasking” the useful signal in noise. For example, if the noise is noticeable, but does not interfere with understanding speech, then it is not always advisable to remove it, since this may damage the useful signal. At the same time, for operators who constantly listen to noisy recordings for a long time, noise reduction is desirable, since the higher the quality of the recording, the less tired the operator is and the longer he or she can work effectively.
But from the point of view of the amateur customer, who uses the soundtrack differently than the operator, the “cleaned” sound recording should contain a minimum of noise in pauses in speech and a sound timbre as close to natural as possible, sometimes even to the detriment of intelligibility.
How to use the properties of hearing
To work successfully with sound recordings, one should use the advantages and take into account the disadvantages of human hearing. Therefore, we will note its specific properties: sensitivity and selectivity, masking, adaptation, fatigue, individuality, associativity of the organization of speech memory. Now we can consider how each property is taken into account when processing low-quality sound recordings.
For example, when optimizing and equalizing the volume of all utterances to the same level, it must be taken into account that the ability of hearing to detect signals against a background and to distinguish small changes in them depends nonlinearly on the volume and frequency of the signal. Thus, for quiet speech the detection threshold at 100 Hz can be 40 dB higher than at 2000 Hz, while for loud sound the dependence on frequency is practically insignificant. Therefore, two typical listening options are recommended, which in most situations allow the operator to hear as much as possible. The first option is listening at 80-90 (±6) dB with the average amplitude of all spectral components of the recorded signal equalized. The second is listening at 40 dB with the equalized average amplitude spectrum raised by 25 dB at 100 Hz and by 8 dB at 5000 Hz relative to 2000 Hz.
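The second (quiet) listening option can be expressed as a boost curve for an equalizer. In the sketch below, only the three anchor values (25 dB at 100 Hz, 0 dB at 2000 Hz, 8 dB at 5000 Hz) come from the text. The straight-line interpolation on a logarithmic frequency axis, and holding the end values constant outside 100-5000 Hz, are assumptions made for illustration.

```python
import math

# Anchor points (frequency in Hz, boost in dB relative to 2000 Hz) taken from the text.
BOOST_POINTS = [(100.0, 25.0), (2000.0, 0.0), (5000.0, 8.0)]


def quiet_listening_boost_db(freq_hz, points=BOOST_POINTS):
    """Boost in dB at freq_hz, interpolated linearly on a log-frequency axis."""
    if freq_hz <= points[0][0]:
        return points[0][1]
    if freq_hz >= points[-1][0]:
        return points[-1][1]
    for (f0, g0), (f1, g1) in zip(points, points[1:]):
        if f0 <= freq_hz <= f1:
            t = math.log(freq_hz / f0) / math.log(f1 / f0)
            return g0 + t * (g1 - g0)


for f in (100, 500, 1000, 2000, 3000, 5000):
    print(f, "Hz:", round(quiet_listening_boost_db(f), 1), "dB")
```

The values printed for intermediate frequencies are only as good as the interpolation assumption. In practice the curve would be adjusted by ear for the particular recording.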
Since speech self-masking exists (for example, after loud sounds, quiet sounds are not perceived for some time), it should be remembered that the operator gets tired less when listening to a recording at a consistently low volume level. Auditory masking of some components of the sound signal by others leads to the fact that weaker frequency components are not heard next to strong spectral peaks. Therefore, to improve intelligibility, one should strive to remove noise pulses from the signal, equalize the amplitude and smooth the spectrum. Moreover, sometimes it is important to do this even in areas of individual words, since under certain conditions individual sounds in a word can mask their «neighbors».
It is also necessary to take into account that targeted «listening» increases the ability to distinguish a distorted speech signal from the background. At the same time, experienced operators-«spinners» of phonograms develop ready-made auditory stereotypes for effectively recognizing useful signals in interference of a familiar type. This is due to the adaptation of hearing, which leads to short-term (fractions of a second) and long-term (minutes) adjustment of the parameters of the auditory system to the characteristics of the signal being listened to. However, it should be taken into account that short-term adaptation of hearing sometimes leads to an undesirable shift in perception thresholds at the junctions of difficult-to-understand words. Therefore, it is useful to listen to such phrases in parts, gradually changing the boundary of the signal «cut» in fragments of speech.
The average duration of noisy phonogram that an operator can decipher is given in the following table:
Complexity | In 1 hour | In a full working day |
low | 4-10 min | 25-60 min |
average | 30-50 sec | 3-6 min |
high | 5-10 sec | 20-50 sec |
It should not be forgotten that after listening to loud sounds for a long time, the ability of hearing to distinguish small sound changes decreases, and it is restored rather slowly (from 2 minutes to a day). Therefore, the operator should avoid auditory overload and work with “heavy” noisy speech for 20-40 minutes with the same breaks, but not more than 4-6 hours a day.
Since the human auditory system is designed to process a stereo signal, a certain gain in recognizing complex recordings is provided by some «diversification» of the signal properties, fed to each ear separately for listening. This allows for more specialized processing of each sound signal. This is possible with the «pseudo-stereo» mode, which introduces some change in the spectral or time domain into one sound mono signal in relation to another.
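The simplest time-domain variant of such a «pseudo-stereo» mode is to feed one ear the original mono signal and the other ear a slightly delayed copy. The sketch below assumes this particular variant; the delay value is arbitrary and would be selected by ear, and a spectral variant (different equalization per ear) is equally possible.

```python
def pseudo_stereo(mono, delay_samples=5):
    """Return (left, right): left is the mono signal, right is a delayed copy.

    At a 10,000 Hz sampling frequency, 5 samples correspond to 0.5 ms.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    left = list(mono)
    # Prepend zeros and truncate so that both channels have the same length.
    right = ([0.0] * delay_samples + left)[:len(left)]
    return left, right


left, right = pseudo_stereo([1.0, 2.0, 3.0, 4.0], delay_samples=2)
print(left)   # [1.0, 2.0, 3.0, 4.0]
print(right)  # [0.0, 0.0, 1.0, 2.0]
```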
Of no small importance is the ability of human memory to automatically associate heard sound combinations with known words. If the signal is unintelligible, unconsciously going through all possible candidate words can take an unacceptably long time. Therefore, the operator must know the content of the text and have additional information to narrow the range of probable words for better recognition of the recording.
In general, the sensitivity of hearing and its ability to recognize speech in noise (especially at high frequencies) is individual and, as a rule, decreases with age. Therefore, elderly experienced operators should be able to equalize the frequency characteristics of the listening path. But for deciphering a noisy phonogram, operators under 35 years of age with high hearing sensitivity and a flexible psyche are preferable.
How to decipher an incomprehensible sound recording?
We suggest considering two options for working with an incomprehensible phonogram. The first is a solution to a relatively simple problem of text transcription and noise reduction for a signal of low/medium-low complexity (initial intelligibility of 70-80% of words). The second is noise reduction of a “heavy” sound recording with zero initial intelligibility.
Noise reduction in relatively simple situations
This work is performed by an operator with a technical education and experience working with audio equipment. The following equipment is required: a high-quality tape recorder; headphones; a top-class power amplifier with speakers; a personal computer for preparing text transcripts, with a computer transcriber (preferably with a built-in adaptive filter); a graphic equalizer with at least 16 bands in the range of 100-6000 Hz, an adjustment range of at least 30 dB, a signal-to-noise ratio in the end-to-end channel of at least 70 dB and a matching analyzer of the current spectrum; and an adaptive filter with at least 500 coefficients for a signal band of up to 4000 Hz. The optimal solution is a computer workstation based on a PC with a computing accelerator board.
A method for solving the problem of noise reduction in relatively simple situations
Determine the average spectrum of the audio recording signal using a spectrum analyzer (for the entire processed tape or for its initial section of 1-2 minutes).
Limit the signal spectrum from above and below (usually to the range of 200-3900 Hz). Using an equalizer, flatten the spectrum within this range as much as possible, then select by ear the best-sounding position of the equalizer sliders on the initial section of the recording.
Set the optimal volume level using an amplifier.
Send a signal to the adaptive filter, select the best adjustment position by ear (adaptation speed, delay, adaptation start threshold).
Feed the signal from the tape recorder through the assembled chain (amplifier, equalizer, adaptive filter) to the input of the computer transcriber (it is convenient to use the adaptive filter located on the transcriber board). Input the sound into the PC and store it on the hard drive at a sampling frequency of 10,000 Hz with a 16-bit digital representation of the signal and without any compression.
Print the text of the audio recording in the transcriber program. Hardly intelligible sections of speech can be listened to repeatedly in the «ring», adjusting the volume of the pronunciation and the speed of sound reproduction. The ring should be «sliding» along the signal with an adjustable duration within at least 0.5-15 sec.
At the end of the work, compare the text with the soundtrack and make the final recording of the cleaned sound on the tape recorder.
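The first two steps of this method (measure the average spectrum, then limit and flatten it) can be sketched as a calculation of starting positions for the equalizer sliders. The sketch assumes the spectrum analyzer has already produced one average level per equalizer band. The limit of ±15 dB (one reading of a 30 dB adjustment range) and the -60 dB cut outside the pass band are illustrative assumptions, and the result is only a starting point for the final adjustment by ear.

```python
def flattening_gains_db(band_centers_hz, band_levels_db,
                        low_hz=200.0, high_hz=3900.0,
                        max_adjust_db=15.0, cut_db=-60.0):
    """Equalizer gains (dB) that flatten the average spectrum inside the pass band.

    band_levels_db holds the measured average level of each equalizer band.
    Bands outside [low_hz, high_hz] receive cut_db (band limiting).
    """
    in_band = [lvl for f, lvl in zip(band_centers_hz, band_levels_db)
               if low_hz <= f <= high_hz]
    if not in_band:
        raise ValueError("no equalizer band falls inside the pass band")
    target = sum(in_band) / len(in_band)  # flatten toward the mean in-band level
    gains = []
    for f, lvl in zip(band_centers_hz, band_levels_db):
        if f < low_hz or f > high_hz:
            gains.append(cut_db)
        else:
            # Clamp the correction to what the equalizer sliders can deliver.
            gains.append(max(-max_adjust_db, min(max_adjust_db, target - lvl)))
    return gains


centers = [100, 250, 500, 1000, 2000, 4000]
levels = [-10, -20, -30, -40, -50, -60]   # a spectrum falling with frequency
print(flattening_gains_db(centers, levels))
# [-60.0, -15.0, -5.0, 5.0, 15.0, -60.0]
```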
Noise reduction in difficult situations
This work is performed by a noise-cleaning engineer-operator with a technical education, experience working with audio equipment and knowledge of the basics of spectral analysis, signal filtering and the physiology of hearing. High individual hearing sensitivity is also desirable.
The most effective arrangement is work in pairs: an engineer plus an operator-«listener» (no older than 35, with a stable psyche and sufficient experience). Note that, as a rule, women cope better than men with transcribing sound recordings as a permanent duty.
The following equipment is required: a high-quality tape recorder that can be adjusted to the characteristics of the tape; high-quality headphones (2-3 pairs); a top-class power amplifier with speakers; a PC-based noise-cleaning workstation (not lower than 486DX2-66, VLB/PCI, 8 MB RAM, 540 MB HDD) with an additional computing accelerator for real-time signal processing; and an autonomous external adaptive filter with at least 2000 coefficients in a band of up to 4000 Hz that allows the speed, algorithm, adaptation modes, delay and listening modes (pseudo-stereo, prediction, filtering) to be set. The workstation software should include a powerful signal editor, programs for converting the signal in time and for removing impulsive interference, a real-time software equalizer with at least 500 bands, a program for removing broadband interference, a computer transcriber with an adaptive filter, and programs for energy, spectral and autocorrelation analysis of the signal. Of course, additional hardware and software may be useful in some cases, but the items listed above are quite sufficient for most practical problems.
Method of working with a complex phonogram (for example, with zero initial intelligibility)
A complex sound recording may require the operator to use a different approach to processing individual lines. This type of work is conveniently done using computer tools. Therefore, the entire sound recording is first entered onto the PC hard drive and stored there as a data file, which is then processed by various programs (when entering a soundtrack into a PC, the bit depth of the digital representation is usually 16 bits, the sampling frequency is 10,000 or 16,000 Hz).
The text is then printed in the transcriber program. Since one or another technique is used to achieve intelligibility, it is convenient to switch between different signal processing programs in the Windows environment in multitasking mode.
Further actions can be performed on the entire sound recording, but if the sound characteristics vary greatly from section to section, then the sound processing adjustment is selected separately for each homogeneous section.
The volume of the sound must be optimized. Manual normalization and nonlinear amplitude transformations are used to equalize the volume of individual utterances of the phonogram. The signal is listened to at different volumes with a successively changing frequency response; for this purpose, typical filters that bring the sound spectrum of the given recording to the required form are prepared in advance in the computer equalizer and then used during listening. Rapid switching from one type of spectral transformation to another takes place in real time thanks to the additional computing accelerator in the noise-cleaning workstation.
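Equalizing the volume of individual utterances can be sketched as scaling each marked segment to a common RMS level. The sketch below shows only this linear normalization, not the nonlinear amplitude transformations mentioned above. The segment boundaries are assumed to have been marked by the operator, samples are assumed to be floating-point values in the range ±1.0, and the target level and gain limit are illustrative.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples (0.0 for an empty one)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def equalize_utterances(samples, boundaries, target_rms=0.1, max_gain=20.0):
    """Scale each segment [start, end) so that its RMS level matches target_rms.

    boundaries: list of (start, end) sample indices, one pair per utterance.
    max_gain caps the amplification so that near-silent segments are not blown up.
    """
    out = list(samples)
    for start, end in boundaries:
        level = rms(out[start:end])
        if level == 0.0:
            continue  # nothing to scale in a completely silent segment
        gain = min(target_rms / level, max_gain)
        for i in range(start, end):
            out[i] *= gain
    return out


loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.01, -0.01, 0.01, -0.01]
leveled = equalize_utterances(loud + quiet, [(0, 4), (4, 8)])
```

After the call both segments have the same RMS level, so the operator no longer has to adjust the volume control between a loud speaker and a quiet one.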
The next step is to bring the signal to an optimal form in the time and spectral representations and to remove additive interference, that is, to «unmask» the useful signal. The signal is first processed directly in the time representation: impulsive interference (clicks) and large jumps in signal amplitude (steps) are eliminated. In addition, the spectral characteristics of the signal are processed using a spectrum analyzer and a multiband equalizer:
— the frequency band is limited from above and below (usually to 200-3900 Hz or 100-6300 Hz);
— the spectrum is smoothed (spectral peaks and sharp differences of more than 6 dB within each half-octave are removed);
— the dynamic range of the average spectrum is equalized (no more than 10 dB within an octave) and distortions of the frequency response of the signal in the recording channel are compensated.
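Removal of impulsive interference in the time representation can be illustrated with a simple median-based detector. This is one common textbook approach and is not necessarily what the workstation programs mentioned in this article do. The window length, the threshold and the minimum jump size are assumptions, and samples are assumed to lie in the range ±1.0.

```python
import statistics


def remove_clicks(samples, window=5, threshold=4.0, min_jump=0.01):
    """Replace isolated impulsive samples (clicks) with the local median.

    A sample is treated as a click when it differs from the median of its
    neighbourhood by more than `threshold` times the median absolute deviation
    of that neighbourhood, and by at least `min_jump` in absolute terms.
    """
    half = window // 2
    out = list(samples)
    for i in range(len(samples)):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        neighbourhood = samples[lo:hi]
        med = statistics.median(neighbourhood)
        mad = statistics.median(abs(s - med) for s in neighbourhood)
        if abs(samples[i] - med) > max(threshold * mad, min_jump):
            out[i] = med
    return out


signal = [0.0, 0.1, 0.2, 0.1, 0.0, 5.0, 0.0, -0.1, -0.2, -0.1, 0.0]
cleaned = remove_clicks(signal)
print(cleaned[5])  # 0.0 (the click of amplitude 5.0 has been replaced)
```

A detector of this kind handles only short, isolated clicks. Amplitude steps and longer bursts need separate treatment, for example in the signal editor.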
Also very effective is the automatic inverse filtering mode, which is installed within the IKAR complex or on the ultra-modern and extremely expensive audio processor from Digital Audio Corp.
Adaptive filtering can eliminate narrow-band stationary and slowly changing interference (network and telephone interference, beeps, traffic interference, mechanical noise, smooth music). The number of adaptive filter coefficients is 400-4000 depending on the type of interference. The use of an external autonomous filter is also quite justified.
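The principle of single-channel adaptive filtering of narrow-band interference can be shown with a normalized LMS (NLMS) predictor. This is a minimal sketch and not the algorithm of any device named in this article. The filter learns to predict the signal from its own past; only the regular, slowly changing interference is predictable, so the prediction error is kept as the cleaned signal. The number of coefficients, the delay and the adaptation step below are illustrative, and white noise stands in for speech.

```python
import math
import random


def suppress_periodic_interference(x, taps=32, delay=1, mu=0.05, eps=1e-8):
    """Single-channel NLMS predictor; returns the prediction error signal."""
    w = [0.0] * taps
    out = []
    for n in range(len(x)):
        # Past samples x[n-delay], x[n-delay-1], ... (zeros before the start).
        past = [x[n - delay - k] if n - delay - k >= 0 else 0.0 for k in range(taps)]
        prediction = sum(wk * pk for wk, pk in zip(w, past))
        err = x[n] - prediction            # unpredictable part = cleaned signal
        norm = sum(p * p for p in past) + eps
        w = [wk + mu * err * pk / norm for wk, pk in zip(w, past)]
        out.append(err)
    return out


rng = random.Random(0)
fs = 8000                                                         # assumed sampling frequency, Hz
speech_like = [rng.gauss(0.0, 0.1) for _ in range(4000)]          # stand-in for speech
hum = [math.sin(2 * math.pi * 50 * n / fs) for n in range(4000)]  # 50 Hz mains hum
noisy = [s + h for s, h in zip(speech_like, hum)]
cleaned = suppress_periodic_interference(noisy)
```

Real speech is itself partly predictable over short intervals, so in practice the delay is made long enough that the filter cannot follow the speech, and the adaptation speed is selected by ear.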
Broadband stationary and slowly changing interference is eliminated using spectral subtraction (subtraction in the frequency domain). Finally, noise in pauses is eliminated or reduced.
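Subtraction of broadband interference in the frequency domain, commonly called spectral subtraction, can be sketched for a single frame of the signal. The sketch assumes that the magnitude spectrum of the noise has been estimated beforehand, for example from a pause in speech, and that the frame length is a power of two. Windowing and overlap-add of successive frames are omitted, and the spectral floor is an illustrative parameter.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT computed through the forward FFT of the conjugate."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]


def spectral_subtract(frame, noise_mag, floor=0.02):
    """Subtract a noise magnitude estimate from one frame, keeping the phase."""
    n = len(frame)
    if n == 0 or n & (n - 1) or len(noise_mag) != n:
        raise ValueError("frame length must be a power of two and match noise_mag")
    cleaned = []
    for s, nm in zip(fft(frame), noise_mag):
        mag = abs(s)
        # Never go below a small fraction of the original magnitude.
        new_mag = max(mag - nm, floor * mag)
        cleaned.append(s * (new_mag / mag) if mag > 0 else 0j)
    return [v.real for v in ifft(cleaned)]


frame = [0.0, 1.0, 0.5, -0.5, 0.25, -1.0, 0.75, 0.1]
unchanged = spectral_subtract(frame, [0.0] * len(frame))  # zero noise estimate
```

With a zero noise estimate the frame comes back unchanged (up to rounding), which is a quick sanity check of the transform pair. With a realistic estimate the floor limits the «musical» artefacts typical of this method.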
Naturally, in each specific situation only the necessary operations are performed with the signal, and the user is forced to rely on the technical means that are actually available. Among them, we can note the expert's workstation «IKAR» and the only autonomous device on the market so far — the new generation digital adaptive filter «Cinderella-31» (with a number of filter coefficients up to 5600, as well as a set of diverse, truly necessary and useful services).
It should also be noted that systems designed to clean musical audio recordings and prepare master discs and cassettes (for example, «NoNoise» by Sonic Solutions) are expensive and ineffective for noise cleaning of operational sound recordings, since they allow «good» recordings to be made «very good», but are fundamentally not designed to handle strong interference and distortion.
Stereo noise cleaning
It is necessary to mention an alternative approach to obtaining cleared phonograms, which is based on the use of sound recordings from two or more sound sources. For example, this can be a stereo recording, in which a signal mixed with interference is recorded in one mono channel, and the same interference in another mono channel. Or it can be a stereo recording of a signal from two microphones, which are differently located relative to the sources of the useful signal and interference.
Thus, with the help of some systems (for example, «Cinderella-S» and «Cinderella31»), it is possible to automatically construct an optimal filter that eliminates interference from the sound recording (two signals are fed to the inputs: the useful signal mixed with interference to the main input and the interference signal to the second, reference input). For example, a quiet conversation in a room where a radio broadcast or TV is playing loudly can be recorded this way. In this case, the sound from a microphone located in the room and transmitting a mixture of the useful signal and interference is recorded on one track of the tape recorder, and either a signal directly from the interference source (from the radio broadcast point or TV sound output) or a signal from a second microphone aimed at recording the interference sound is recorded on the other track of the same tape recorder. Then, based on such a stereo sound recording, an intelligible signal for any type of interference is obtained using an adaptive stereo filter, and this is possible even when speech is almost inaudible in the original sound recording.
It is interesting to note that the louder the «music plays» and the less the speakers are afraid of being overheard, the better the interference is removed from the noisy signal.
Adaptive stereo filtering allows you to increase intelligibility and improve the quality of sound recording when using a stereo pair of conventional or directional microphones. However, in this case, a certain amount of experience in using microphones in a specific recording environment is required, and it is desirable that one microphone receives the useful speech signal and the noise sound, and the other, mainly, only the sound of the same noise. The better this rule is followed, the higher the result of stereo noise removal.
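The two-input scheme described in this section can be illustrated with a normalized LMS (NLMS) noise canceller. This is a minimal sketch of the general principle and not the algorithm of the devices named above. The adaptive filter shapes the reference (interference) channel until it matches the interference in the main channel, and the difference between the two is the estimate of the useful signal. The room path, the signal levels, the number of coefficients and the adaptation step are all assumptions chosen for illustration, and white noise stands in for both speech and interference.

```python
import random


def cancel_with_reference(primary, reference, taps=8, mu=0.2, eps=1e-8):
    """Two-input NLMS noise canceller.

    primary:   useful signal mixed with interference (main input)
    reference: the interference alone (second, reference input)
    Returns the error signal, which is the estimate of the useful signal.
    """
    w = [0.0] * taps
    out = []
    for n in range(len(primary)):
        ref = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        interference_estimate = sum(wk * rk for wk, rk in zip(w, ref))
        err = primary[n] - interference_estimate
        norm = sum(r * r for r in ref) + eps
        w = [wk + mu * err * rk / norm for wk, rk in zip(w, ref)]
        out.append(err)
    return out


rng = random.Random(1)
n_samples = 3000
speech = [rng.gauss(0.0, 0.1) for _ in range(n_samples)]  # quiet stand-in for speech
radio = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]   # loud interference source
# Assumed room path: the radio reaches the microphone slightly filtered and delayed.
leak = [0.8 * radio[n] + 0.3 * (radio[n - 1] if n else 0.0) for n in range(n_samples)]
microphone = [s + l for s, l in zip(speech, leak)]
recovered = cancel_with_reference(microphone, radio)
```

In this example the interference in the microphone channel is much stronger than the speech, yet after adaptation the output follows the speech closely. The sketch relies on the reference channel containing little or no speech, which is the same placement rule given above for the second microphone.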