Acoustic protection of confidential negotiations.
In the information age, the principle applies: whoever owns the information owns the world.
There are more than enough people who want to take over the world in this way, so there is steady demand for information obtained without authorization.
In this situation, the information owner's main concern is protecting it reliably.
In other words, the information field sees an eternal struggle between projectile and armor, between the attacking side and the defending side.
The attacker pays particular attention to information carried by a speech signal, that is, to speech information.
In general, speech information comprises semantic, personal, behavioral and other components.
As a rule, the greatest interest is in semantic information, the loss of which can be indirectly estimated by the loss of speech intelligibility. In what follows, speech information will mean semantic information only.
There are many methods for protecting speech information and many technical means that implement them. They are constantly being improved, because as science and technology advance, the attacking side acquires increasingly sophisticated equipment that not only improves the quantitative characteristics of known technical channels of information leakage but also creates new channels.
The largest number of technical leakage channels can be set up to intercept acoustic speech information, for example during confidential meetings and negotiations.
The most effective protection in this case is acoustic protection of the room in which confidential negotiations are held and, at best, acoustic protection of the acoustic speech signal itself.
This article is devoted to the problems that arise when organizing such protection and their technical solutions.
PROTECTION METHODS.
Terms, definitions, brief description.
Acoustic protection of an acoustic speech signal (speech) here means reliable masking of speech by an acoustic masking signal (noise) operating in the speech frequency band and having a «smooth» spectral characteristic.
Reliable masking is achieved when the resulting additive acoustic mixture of speech and noise, that is, noisy speech, has a word intelligibility of no more than 20% at any point in the controlled volume (in practice, this corresponds to the perception of individual exclamations and individual «familiar» words).
In this case, the participants in the negotiations should be provided with the most comfortable acoustic conditions possible under the given circumstances.
It should be noted that specialists have not yet agreed on an unambiguous interpretation of «comfort» as applied to the perception of speech information, and no scale for assessing comfort has been developed.
We traditionally associate comfort with such a human reaction to unnatural conditions as fatigue, which can be theoretically assessed.
However, to date, due to the large volume of research required, correlations between the types and depth of speech signal distortions and the degree of fatigue have not yet been determined.
Therefore, when assessing comfort, we will use exclusively qualitative intuitive assessments, while relying on common sense.
There are several types of acoustic speech protection equipment, which can be divided into two groups: equipment for acoustic protection of premises and equipment for acoustic speech protection proper.
Equipment of the first group generates barrier acoustic interference along the enclosing structures and, as a rule, is used together with vibration protection equipment.
In this case, relatively comfortable acoustic conditions are created for the personnel, but the full volume of the room is not acoustically protected, with all the consequences that follow.
The second group of equipment includes acoustic noise generators, which are located near the place where negotiations are taking place and mask the speech of the negotiators with their noise.
In this case, the negotiators are not protected from the effects of acoustic noise. Comfort and the level of masking in this case leave much to be desired.
A higher level of masking and comfort is provided by equipment that, in addition to an acoustic noise generator, uses intercom headsets designed for work in high noise.
This type includes the TF-011D equipment, which uses telephone-microphone headsets, and the OKP-6 with telephone-laryngophone headsets.
When using these devices, the hearing of negotiators is protected from acoustic noise by the ear cushions of the headphones, through which the speech of their partners is presented to the negotiators. The speech of the negotiators is perceived by a microphone located near the speaker's mouth, or by a laryngophone from the speaker's throat.
The reliability of speech masking is high, especially for OKP-6, but the need to use headsets may not always be convenient for users.
The employees of «Business Security» set themselves the task of maintaining high reliability of speech masking while replacing headsets with headphones, and they solved it.
SOLVING THE PROBLEM
When solving the task, the developers encountered the following problem.
It is known that during a telephone conversation, the average speech level on the microphone membrane of the telephone handset is 97.5 dB in the frequency band of 100-10000 Hz.
The headset microphone is located at approximately the same distance from the speaker's mouth as the telephone handset microphone and, accordingly, the speech level on the headset microphone will be approximately the same.
In acoustic speech masking, achieving sufficiently reliable masking on the one hand and an acceptable level of comfort on the other (satisfactory speech transmission quality, see Table 1) requires a noise field with a level of 86 dB around the speakers.
In this case, the speech/noise ratio at the headset microphone is +10 to +12 dB.
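The ratio quoted above is simply the difference of the two levels in decibels. A minimal sketch of that arithmetic follows; the function name is chosen here for illustration and does not come from the article.

```python
def speech_to_noise_db(speech_level_db: float, noise_level_db: float) -> float:
    """Speech/noise ratio in dB as the difference of two sound levels in dB."""
    return speech_level_db - noise_level_db


# Values from the text: 97.5 dB speech at the headset microphone, 86 dB noise field.
ratio_at_headset = speech_to_noise_db(97.5, 86.0)
print(f"speech/noise at the headset microphone: {ratio_at_headset:+.1f} dB")
# prints "speech/noise at the headset microphone: +11.5 dB"
```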
Table 1.
Speech transmission quality | W, %       | S, % | A, % | Bsh, dB | Speech/noise, dB
perfect                     | 100-99     |      |      |         |
excellent                   | 99-98      |      |      |         |
good                        | 98-93 (95) | 58   | 35   | 81      | +16
satisfactory                | 93-87 (90) | 47   | 25   | 86      | +11
maximum permissible         | 87-77 (82) | 33   | 18   | 92      | +5
communication breakdown     | 77-60 (68) | 18   | 12   | 97      | +0
reliable masking            | 15         | 4    | 5    | 104     | -17
Notes to Table 1. The table contains results derived from the data of M.A. Sapozhkov and N.B. Pokrovsky for a speech level of 97.5 dB and «white» noise.
Here: W is word intelligibility (average values for the given range are shown in brackets); S is syllabic intelligibility; A is formant intelligibility; Bsh is the «white» noise level.
At a distance of 1.2-1.5 m, the speech signal level decreases to 72-78 dB (measurements were made on a traditional test phrase in rooms of 600 cubic meters and 50 cubic meters).
If the noise level is maintained at 86 dB throughout the protected area within a radius of approximately 1.3 m from the speaker's mouth, the speech/noise ratio will be on average -10 dB and will deteriorate further as the distance increases.
Based on the data in Table 1, we can conclude that at a distance of more than 1 m from the speaker's mouth, speech quality is below the «communication breakdown» level, and at a distance of more than 2 m, reliable masking will be guaranteed.
Here it is necessary to make some explanations and clarifications to what has been said.
1. The quality of speech «communication breakdown» is characterized by complete incomprehensibility of the main text and, according to various sources, corresponds to W values in the range from 77% to 60%, and in some publications the lower limit of the range is equal to 50% of words.
2. The data given in Table 1 correspond to a speech level of 97.5 dB and it is not entirely correct to use these data for other levels, but for illustrative purposes it is quite acceptable.
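The reasoning above can be followed numerically. The sketch below looks up a speech/noise ratio in the speech/noise column of Table 1 and applies it to the levels measured at 1.2-1.5 m. As point 2 warns, the table strictly holds only for a speech level of 97.5 dB, so this is illustrative; the function name and the label for values between the last two rows are assumptions made here.

```python
# Rows of Table 1 that carry a speech/noise value: (minimum ratio in dB, quality label).
TABLE_1 = [
    (16.0, "good"),
    (11.0, "satisfactory"),
    (5.0, "maximum permissible"),
    (0.0, "communication breakdown"),
]
RELIABLE_MASKING_DB = -17.0  # last row of Table 1


def quality_for(ratio_db: float) -> str:
    """Return the Table 1 label whose speech/noise threshold the ratio still meets."""
    for threshold_db, label in TABLE_1:
        if ratio_db >= threshold_db:
            return label
    if ratio_db <= RELIABLE_MASKING_DB:
        return "reliable masking"
    return "below communication breakdown"


noise_db = 86.0  # masking noise level from the text
for speech_db in (72.0, 78.0):  # speech level range measured at 1.2-1.5 m
    ratio = speech_db - noise_db
    print(f"{speech_db:.0f} dB speech: {ratio:+.0f} dB, {quality_for(ratio)}")
```

Both ends of the measured range (-14 dB and -8 dB) land between the last two rows of the table, which agrees with the statement that speech quality beyond about 1 m is below the «communication breakdown» level.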
From the above reasoning it is clear that moving the microphone away from the speaker's mouth, i.e. refusing to use a headset, leads to a decrease in speech intelligibility up to a breakdown of communication at distances of more than one meter from the speech source.
In other words, placing the microphone away from the speaker's mouth simulates the situation with an eavesdropping microphone.
And this is what acoustic masking is aimed against. For this purpose, «white» noise is used as masking noise.
The fact is that there are no algorithms or hardware and software implementations that can actually improve the intelligibility of speech contaminated with «white» noise with a negative speech-to-noise ratio.
The McCulley algorithm and other modifications of the spectral subtraction algorithm, aimed at combating «white» noise, can improve listening comfort at positive speech-to-noise ratios, but they do not improve speech intelligibility.
Thus, noisy speech intercepted by acoustic monitoring devices cannot be noise-cleaned.
Under certain conditions, however, any stationary noise, including «white» noise, can be compensated with a high degree of suppression.
Such compensation can be implemented using a digital two-channel adaptive filter (in the article «Adaptive Filtering Equipment», «Confidential», N1-2, 1999, the author considers this possibility).
For the problem under consideration, a two-channel adaptive filter (DAF) was used in the equipment developed for acoustic protection of confidential negotiations, the Confidential Negotiations Digital System (CNDS).
OPERATING PRINCIPLE OF THE EQUIPMENT FOR PROTECTION OF CONFIDENTIAL NEGOTIATIONS (CNDS)
The main principle is that the generated masking noise n is fed not only to the electroacoustic emitter, but also to the reference input of the DAF (the structural diagram of the CNDS is shown in Fig. 1).
The second, main, input of the DAF receives a signal x from the output of the receiving microphone, which acts as a headset microphone.
This signal is an additive mixture of the speech of the negotiating participants s and noise n1, which is noise n, but has undergone changes during conversion to an acoustic signal and due to the acoustics of the room where the negotiations are taking place.
Fig. 1
If these changes are linear (the power amplifier and the emitter do not limit the noise signal), then n and n1 are correlated.
The farther the receiving microphone is from the speaker's mouth, the worse the speech/noise ratio in this mixture.
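The point about correlation can be illustrated with a toy simulation. The sketch below assumes a hypothetical two-tap linear path with a 3-sample delay in place of the real amplifier, emitter and room, and a sine wave standing in for speech; none of these values come from the article.

```python
import math
import random


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def linear_path(noise, taps, delay):
    """n1(j) = sum over k of taps[k] * n(j - delay - k), with zeros before the start."""
    out = []
    for j in range(len(noise)):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = j - delay - k
            if idx >= 0:
                acc += tap * noise[idx]
        out.append(acc)
    return out


rng = random.Random(1)
n = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # generated masking noise
n1 = linear_path(n, taps=[0.8, -0.3], delay=3)   # the same noise at the microphone
s = [0.1 * math.sin(2 * math.pi * 0.01 * j) for j in range(5000)]  # stand-in for speech

delayed_n = [0.0] * 3 + n[:-3]                   # reference shifted by the path delay
print(f"corr(delayed n, n1) = {correlation(delayed_n, n1):.2f}")
print(f"corr(delayed n, s)  = {correlation(delayed_n, s):.2f}")
```

The first coefficient comes out close to one and the second close to zero, which is the property an adaptive filter relies on: the noise part of the mixture can be predicted from the reference, while the speech cannot.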
By the way, we note that this scheme uses only one microphone for all participants in the negotiations, i.e. we have abandoned headsets.
Using the reference channel signal and following the adaptive algorithm, the DAF compensates the noise component in the mixture of speech and noise, and the speech thus cleaned is presented to the negotiating participants via headphones.
The algorithm converges by the method of steepest descent; to simplify the calculations, the Widrow-Hoff stochastic approximation of the gradient is used.
To speed up convergence, the minimum of the filter error modulus is used as the optimality criterion.
CNDS IMPLEMENTATION.
Description of the algorithm, some operational features, test results.
The CNDS equipment is based on a specialized digital processor that implements the functions of a noise generator, a digital two-channel adaptive filter (DAF), and control functions.
Compensation for the noise component in a speech-noise mixture is provided by the DAF.
The DAF operation algorithm can be represented by the following expressions:
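The expressions themselves did not survive in this copy of the article. A plausible reconstruction is given here under two assumptions: that the update is the sign-error form of the Widrow-Hoff rule (suggested by the sgn[.] entry in the list of symbols and by the error-modulus criterion), and that the delay r is a plain shift of the reference samples. The exact index offsets may have been written differently in the original.

```latex
\begin{align}
s(j) &= x(j) - y(j), \tag{1}\\
y(j) &= \sum_{i=1}^{p} v(j,i)\, n(j - r - i + 1), \tag{2}\\
v(j+1,i) &= v(j,i) + m\, \operatorname{sgn}[s(j)]\, n(j - r - i + 1). \tag{3}
\end{align}
```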
where:
x(j): next sample of the main signal;
n(j): next sample of the reference signal;
y(j): next sample of the noise estimate;
s(j): next sample of the output signal (error signal);
v(j,i): next value of the filter weighting coefficient;
m: coefficient determining the adaptation speed;
p: number of filter weighting coefficients;
r: acoustic signal delay;
j: discrete time value, j = 1, 2, 3, …;
i: filter weighting coefficient number, i = 1, 2, 3, …, p;
sgn[.]: sign of the quantity in brackets.
The cleaned speech (filter error signal), according to (1), is defined as the difference between the main channel signal and the predicted noise value (noise estimate), which is calculated as the convolution of the reference channel signal (noise) with the weight coefficients of the transversal filter according to (2).
The impulse response of this filter (or the vector of weight coefficients of dimension p) is updated at each discrete moment of time j according to (3).
Adaptation (automatic tuning) of the weight vector is carried out at a certain rate (the adaptation rate) until the minimum of expression (1) is reached, i.e. until the masking noise in the signal arriving at the headphones is practically completely suppressed.
Thus, while adaptation is in progress (the adaptation time), a decreasing level of masking noise will be heard in the headphones, with the negotiators' speech emerging against it.
Later, if the acoustic environment in the room does not change, the weighting coefficients stabilize and the tracking process begins, characterized by an alternating sign of the gradient (the second term in (3)) and by its minimum absolute value.
During this period the headphones will carry practically «clear» speech. If the acoustic environment changes, for example if the participants in the negotiations begin to gesticulate sharply, the DAF will switch back to adaptation mode and noise will again be heard in the headphones.
To reduce the influence of this effect, the adaptation speed, which is regulated by the coefficient m, is chosen as high as possible (its upper bound is set by the convergence condition of the algorithm).
Of course, if all the negotiating participants gesticulate intensely, maximizing the adaptation speed will not give the desired effect, and the participants will have to cool their ardor. This is an unplanned but useful limitation.
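The filter described in this section can be tried out in a short simulation. The sketch below is not the CNDS firmware. It assumes a sign-error form of the weight update (suggested by the sgn[.] symbol and the error-modulus criterion), assumes that the delay r is a plain shift of the reference samples, and uses a hypothetical three-tap room response, a sine wave standing in for speech, and arbitrary values of p and m.

```python
import math
import random


def sgn(value: float) -> int:
    """sgn[.]: +1 for positive, -1 for negative, 0 for zero."""
    return (value > 0) - (value < 0)


def cancel_noise(x, n, p, m, r=0):
    """Two-channel adaptive canceller with a sign-error weight update.

    x: main-channel samples (microphone: speech plus room-filtered noise)
    n: reference-channel samples (the generated masking noise)
    p: number of weighting coefficients
    m: adaptation-speed coefficient
    r: acoustic delay in samples (assumed to be a plain shift of the reference)
    Returns the error signal s(j), which is the cleaned speech.
    """
    v = [0.0] * p
    s = []
    for j in range(len(x)):
        # Delayed reference samples seen by the p weights (zeros before the start).
        window = [n[j - r - i] if j - r - i >= 0 else 0.0 for i in range(p)]
        y = sum(vi * ni for vi, ni in zip(v, window))      # noise estimate, cf. (2)
        e = x[j] - y                                       # error signal, cf. (1)
        step = m * sgn(e)
        v = [vi + step * ni for vi, ni in zip(v, window)]  # weight update, cf. (3)
        s.append(e)
    return s


def power(samples):
    """Mean square value of a sequence."""
    return sum(value * value for value in samples) / len(samples)


def demo(length=8000, seed=7):
    """Return the masking-noise suppression in dB over the last quarter of the run."""
    rng = random.Random(seed)
    n = [rng.gauss(0.0, 1.0) for _ in range(length)]
    speech = [0.1 * math.sin(2 * math.pi * 0.013 * j) for j in range(length)]
    room = [0.8, -0.3, 0.2]  # hypothetical room response
    delay = 3                # hypothetical acoustic delay in samples
    n1 = [
        sum(h * n[j - delay - k] for k, h in enumerate(room) if j - delay - k >= 0)
        for j in range(length)
    ]
    x = [sp + ns for sp, ns in zip(speech, n1)]
    cleaned = cancel_noise(x, n, p=8, m=0.001, r=delay)
    tail = slice(3 * length // 4, length)
    residual = [c - sp for c, sp in zip(cleaned[tail], speech[tail])]
    return 10 * math.log10(power(n1[tail]) / power(residual))


print(f"noise suppression after adaptation: {demo():.1f} dB")
```

A larger m shortens the adaptation time but leaves more residual jitter in the weights once tracking begins, which is the trade-off behind choosing m close to its convergence limit. The figure printed by this toy model says nothing about the real equipment.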
The planned limitations (protections), which rule out situations in which improperly masked speech could be intercepted, are the following.
- Protection from unauthorized reduction of the acoustic noise level.
This protection is implemented as follows. The device itself, with a built-in microphone, is placed 1-1.5 meters from the negotiators (midway between them).
Two loudspeakers are placed at a certain distance to the left and right of the negotiators and emit masking noise at a level such that the speech/noise ratio at the microphone is about -15 to -19 dB.
If for some reason the masking noise level drops so far that the speech/noise ratio improves to approximately -10 to -12 dB, the adaptation process is switched off and masking noise appears in the headphones. For the participants in the negotiations, this signals an abnormal situation.
- Protection against exceeding the specified upper limit of the speech level.
When the conversation is conducted in raised voices, a crackling sound will be heard in the participants' headphones, which also signals an abnormal situation.
- Protection against violation of the topology of the equipment placement.
When the equipment is deployed or during its operation, the distance between the loudspeakers, or between the loudspeakers and the main device, may be set smaller than the specified limit.
In this case, the space behind the loudspeakers may be weakly masked.
To prevent this, the parameter r, which sets the limiting value of the acoustic delay of the noise signal, is introduced into the calculations in (2) and (3).
If the acoustic delay (that is, the distance) of the masking noise on its way from the loudspeaker to the microphone is less than the specified value, there will be no adaptation and only noise will be heard in the headphones.
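The link between distance and the delay parameter r can be put into numbers. The sketch below rests on several assumptions that are not in the article: a sample rate of 10 kHz (twice the 5 kHz working band), a speed of sound of 343 m/s, and a hypothetical limit of r = 20 samples. How the device actually enforces the limit is not described, so the check is written as a plain comparison.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C


def acoustic_delay_samples(distance_m: float, sample_rate_hz: float) -> int:
    """Number of whole samples sound needs to travel distance_m."""
    return round(distance_m / SPEED_OF_SOUND_M_S * sample_rate_hz)


def adaptation_possible(path_delay_samples: int, r: int) -> bool:
    """Topology check: adaptation only works if the path delay is at least r."""
    return path_delay_samples >= r


SAMPLE_RATE_HZ = 10_000.0  # assumed sample rate
R_LIMIT = 20               # hypothetical value of r in samples

for distance_m in (0.5, 1.0, 1.5):
    delay = acoustic_delay_samples(distance_m, SAMPLE_RATE_HZ)
    verdict = "adapts" if adaptation_possible(delay, R_LIMIT) else "noise only"
    print(f"{distance_m:.1f} m: {delay} samples, {verdict}")
```

One way such behavior could arise without an explicit check: if the filter only sees reference samples delayed by at least r, a loudspeaker placed closer than the limit produces noise that arrives too early for any set of weights to model, so it is never cancelled.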
The measurements and tests carried out showed the following results.
1. The measured formant intelligibility of the noisy speech signal in the working frequency band (5 kHz) at the microphone is 3-5%. The following should be noted here. Since the masking noise is «white» noise and undergoes only insignificant spectral changes when the acoustic field is formed by the loudspeakers, it is entirely acceptable to measure formant intelligibility and to draw conclusions from the results of these measurements.
2. The depth of suppression of the masking noise in the signal presented to the headphones is 26-30 dB with 1300 weighting coefficients.
3. The speech of the negotiators, recorded on a dictaphone located in the breast pocket of one of the negotiators, is completely unintelligible.
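To get a feel for the suppression depth in result 2, decibels can be converted into plain ratios. This is standard decibel arithmetic and not a measurement; the function names are chosen here for illustration.

```python
def db_to_amplitude_ratio(level_db: float) -> float:
    """Amplitude (sound pressure) ratio corresponding to a level difference in dB."""
    return 10 ** (level_db / 20)


def db_to_power_ratio(level_db: float) -> float:
    """Power ratio corresponding to a level difference in dB."""
    return 10 ** (level_db / 10)


for depth_db in (26.0, 30.0):  # suppression depth range from result 2
    print(
        f"{depth_db:.0f} dB: amplitude reduced about "
        f"{db_to_amplitude_ratio(depth_db):.0f} times, "
        f"power about {db_to_power_ratio(depth_db):.0f} times"
    )
```

In other words, the residual masking noise in the headphones has roughly one twentieth to one thirtieth of its original amplitude.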
In conclusion, it should be noted that since the speech signal presented to the negotiators via headphones in the CNDS equipment is practically cleared of masking noise, the comfort of the working conditions of the negotiators will be determined only by the degree of muffling of external noise by the ear pads of the headphones used.
VALERIY IVANOVICH ZOLOTAREV
Ph.D., senior researcher
Source: «Special Equipment» magazine