On the use of relative prosodic and spectral characteristics in the task of forensic identification of a person by spoken speech.
Kaganov A.Sh.
Russian Federal Center for Forensic Examinations
(the article has been submitted for publication in the collection of the Philological Faculty of Moscow State University)
Forensic identification is the establishment of the presence or absence of identity of a particular material object — in this case, a person — based on its images [1].
It is intuitively clear that only stable individualizing features can be used as identification features to establish this identity.
Thus, the problem of identifying stable identification features of the speaker becomes a key problem of the task of forensic identification of a person by spoken speech.
At the current stage of development of applied linguistics, this problem is made concrete by what the auditory (more precisely, auditory-linguistic) and instrumental parts of a single comprehensive study [2] can achieve.
Without dwelling here on the auditory and linguistic identification features that characterize the speaker's personality, we touch only on one important aspect of the instrumental part of a comprehensive identification study: the analysis of some relative prosodic and spectral characteristics of speech.
The instrumental analysis of these characteristics in the task of forensic identification of the speaker's personality, described below, includes:
- identifying those relative parameters of the fundamental tone that act as stable identification features characterizing the source of excitation of the speech signal;
- obtaining stable criteria for assessing the acoustic quality of an individual's speech sounds using formant ratios;
- comparative analysis of the “weight” of absolute and relative prosodic and formant indicators that act as identification features of the speaker.
Let us proceed to the questions posed. Turning to the forensic foundations of the speaker identification task, we note that, among the material sources of information used in proof, a significant share consists of reflections of functional-dynamic complexes (FDC) of skills, whose bearer is a person [3].
FDC of skills is a phenomenon of psychophysiological nature. Its essence consists of skills or systems of skills for performing certain actions (carrying out activities).
Skill is usually understood as «the ability to perform purposeful actions, brought to automatism as a result of conscious multiple repetitions of the same movements or solving typical problems in industrial or educational activities»[1].
Such are, in particular, the skills of speech, writing, walking, etc.
Being materially reflected in the environment of the event under investigation, FDCs turn out to be sources of forensic information.
Communicative (speech) FDC skills are the main means of human communication. Note that there is a certain correlation between oral and written FDC skills.
At the same time, each of these subgroups is autonomous, because different analyzers implement the FDC and involve different effector blocks of functional systems (in oral speech the articulatory apparatus, in written speech most often the hand).
Oral-speech FDC are the subject of a branch of forensic science, forensic phonography, which studies spoken language, the sound environment, and the conditions, means, materials and traces of sound recording, and develops methods for their study in order to solve problems of forensic examination of sound recordings.
Turning to the scientific and historical foundations of the instrumental aspects of forensic identification of a person by spoken speech, we note that the first scientific attempt to construct an acoustic model of the sounds of human speech was apparently made in 1779 by Kratzenstein, who submitted such a model to the competition of the St. Petersburg Imperial Academy of Sciences [4]. Yet it was only in 1870, almost 100 years later, that the acoustic theory of speech production received a rigorous scientific formulation in the fundamental work of H. Helmholtz [5].
The fundamental provisions of this work have remained virtually unchanged to this day and are shared by most specialists.
Let us immediately stipulate that the modern interpretation of Helmholtz's work takes into account, of course, a number of mathematical and methodological-technological improvements introduced into it by modern researchers (we will mention here the classical works of S.N. Rzhevkin [6], J. Flanagan [7] and G. Fant [8]).
According to H. Helmholtz, the process of speech production consists of two independent components: excitation of the sound itself, and formation of the acoustic quality of the sound through excitation of the resonant frequencies of the articulatory tract (in Helmholtz's terms) or filtering (in modern terms).
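The two-component view can be illustrated with a minimal modern source-filter sketch. It is not Helmholtz's or the author's procedure: the excitation is idealized as an impulse train, each resonance is a textbook two-pole digital resonator, and the sampling rate, the 80 Hz bandwidth and all function names are assumptions made only for illustration. The two resonance frequencies are the mean F1 and F2 reported for the vowel [a] in Table I.

```python
import math

def impulse_train(f0_hz, fs_hz, n_samples):
    """Idealized excitation source: one unit impulse per fundamental period."""
    period = round(fs_hz / f0_hz)  # fundamental period in samples
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, center_hz, bandwidth_hz, fs_hz):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius from bandwidth
    theta = 2.0 * math.pi * center_hz / fs_hz       # pole angle from centre frequency
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

def synthesize_vowel(f0_hz, formants_hz, fs_hz=8000, n_samples=800, bandwidth_hz=80.0):
    """Pass the source through one resonator per formant (cascade filter)."""
    signal = impulse_train(f0_hz, fs_hz, n_samples)
    for formant in formants_hz:
        signal = resonator(signal, formant, bandwidth_hz, fs_hz)
    return signal

# Source and filter are set independently: F0 = 100 Hz is an assumed value,
# while 535 Hz and 1390 Hz are F1 and F2 of [a] from Table I.
source = impulse_train(100.0, 8000, 800)
vowel_a = synthesize_vowel(100.0, [535.0, 1390.0])
```

Changing the formant list alters the timbre of the output but leaves the spacing of the source impulses untouched, which is the independence of the two components that the model assumes.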
Determining the characteristics of the sound excitation source is a rather complex and labor-intensive task and requires separate consideration.
It is known that during the process of voice formation, the air stream escaping from the glottis, due to the Bernoulli effect, causes the vocal cords, which are brought together quite closely, to vibrate.
As a result, air vibrations are formed at the exit of the larynx, perceived by the ear as vocal sounds, which are characterized by pitch, strength and timbre.
While the strength and timbre change significantly as the sound passes through the supraglottic cavities, depending on the parameters of these cavities, the pitch of the voice (the frequency of closure of the folds [2]) is preserved and represents one of the main individual features of the voice [9].
The pitch of the voice reflects the frequency of oscillation of the vocal folds, which depends on the length, thickness, tension and degree of convergence of the folds.
Long, thick and weakly stretched vocal folds provide low-pitched sounds.
An increase in the tension of the folds, carried out with the help of the muscular apparatus of the larynx, entails an increase in the pitch of the sound.
According to the theory of voice formation (phonation) generally accepted today, the sound signal is obtained by quasi-periodic modulation of the constant air flow expelled from the lungs, produced by changes in the width of the gap between the vocal folds.
The main parameters that characterize the process of periodic opening and closing of the glottis are the volume of exhaled air per unit of time and the subglottic pressure.
The impulses of the vocal source obtained as a result of the described process are repeated with the frequency of the fundamental tone.
The frequency of the fundamental tone (FPT) of the voice is inversely proportional to the period of oscillations of the vocal folds and is determined mainly by their mass and elasticity, the magnitude of the subglottic pressure and the degree of convergence of the vocal folds.
All these parameters, as well as stable dynamic stereotypes of voice source control, i.e. functional-dynamic complexes (FDC) of skills in the terminology of [3], are individual indicators and, therefore, can act as a source of identification features characterizing the personality of the speaker.
In order to determine stable identification features characterizing the work of an individual's vocal folds, we compare the average value of the fundamental tone frequency with the relative range of change of the fundamental tone, D [3], using real examinations as examples.
For comparative analysis, materials from those examinations were selected in which the speech situation of the original recordings (mainly telephone conversations) did not coincide with the speech situation typical for obtaining samples of the voice and speech of the subjects of the examinations (the samples were, as a rule, a conversation with the investigator or the interrogation of the subject in a court hearing).
Statistical analysis based on 10 examinations showed that the average weighted relative deviation [4] between the average fundamental tone frequency of the original and comparative recordings was 12.8%. At the same time, the average weighted relative deviation of the relative range of change in the fundamental tone D in this sample was less than 5.4%.
Although both indicators are within the limits of intra-speaker variability, it is clear from the presented results that the relative range of change in the fundamental tone D was in this case a “stronger” identification feature than the average value of the fundamental tone frequency.
In other words, it can be said that as an identification feature, the relative range of change of the fundamental tone D has a greater “weight” than the average value of the fundamental tone frequency.
(It is important to clarify that, in each of the examinations included in the analyzed sample, the features revealed by the auditory, linguistic and instrumental parts of the comprehensive comparative identification study constituted a stable complex, sufficient to establish an individual-specific identity between the voice and speech of the speaker recorded on the phonograms of the original conversations and the voice and speech of the participant whose voice and speech samples were presented for comparison.)
Next, from the analyzed sample, those examinations were selected in which the original recordings were telephone conversations of the defendants conducted on mobile phones in the presence of noise and interference in the telephone line.
Statistical analysis of the relative deviations of the average fundamental tone frequency of the speech material of the original recordings from that of the samples, carried out on the results of four selected examinations, showed that the average weighted relative deviation of the average fundamental tone frequency amounted to 18.3%, i.e. it turned out to be close to the maximum permissible intra-speaker variability.
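The two quantities compared above can be computed as in the following minimal sketch, which follows the definitions in notes [3] and [4]. The pitch values are hypothetical and are not data from the examinations; the use of the sample standard deviation and the assumption that a forced voice scales every pitch value by the same factor are idealizations introduced here only to show why a ratio-type measure such as D can be less sensitive to the speech situation than the average value.

```python
from statistics import mean, stdev

def relative_range(f0_values_hz):
    """D = (mean + 2*sd) / (mean - 2*sd), as defined in note [3]."""
    m = mean(f0_values_hz)
    s = stdev(f0_values_hz)  # assumption: sample standard deviation
    if m - 2.0 * s <= 0.0:
        raise ValueError("mean - 2*sd must be positive for D to be defined")
    return (m + 2.0 * s) / (m - 2.0 * s)

def relative_deviation(original, comparative):
    """|original - comparative| / comparative, as defined in note [4]."""
    return abs(original - comparative) / comparative

# Hypothetical fundamental tone frequencies in Hz (illustrative only).
calm = [98.0, 105.0, 110.0, 118.0, 124.0, 131.0, 120.0, 108.0]
# Idealized assumption: forcing the voice raises every value by 18%.
forced = [1.18 * f for f in calm]

# The original recording is the forced one; the comparative sample is calm.
mean_dev = relative_deviation(mean(forced), mean(calm))
d_dev = relative_deviation(relative_range(forced), relative_range(calm))

print(f"relative deviation of the average value: {100 * mean_dev:.1f}%")
print(f"relative deviation of D: {100 * d_dev:.1f}%")
```

Under uniform scaling the average value shifts by the full scaling factor, while D is unchanged up to rounding, because the factor cancels in the ratio. Real recordings do not scale uniformly, which is consistent with the small but non-zero deviations of D reported in the text.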
At the same time, the average weighted relative deviation of the relative range of change in the fundamental tone D in this sample was only 3.8%.
The obtained data can be explained by large differences in the speech situation of the original and comparative recordings: the noise and interference in the mobile communication channels forced the speakers to raise their voices and, as a consequence, significantly changed the parameters of the fundamental tone frequency (we chose the average value of the fundamental tone only as an example), while the samples of the defendants' spoken speech were obtained in the investigator's office and did not require forcing the voice.
Let's expand the scope of the analysis and move on to examining those stable identification features of the instrumental group that are associated with the spectral characteristics of speech.
In solving the problem of forensic identification of a person by voice and speech, it is necessary to take into account the work of the organs of the speech-forming apparatus, which give the voice its individual timbre coloring and form the flow of speech sounds, i.e. to analyze the second independent component of the speech production process in the model of H. Helmholtz.
Let us turn to the mechanism of formation and criteria for assessing the acoustic quality of sound due to the excitation of resonant frequencies of the speaker's articulatory tract in order to determine those relative spectral characteristics of speech that can be used as stable identification features.
Back in the mid-1950s, Russian researchers L.A. Varshavsky and I.M. Litvak put forward a hypothesis that the acoustic quality of sounds is determined by the ratio of signal levels in the spectrum bands [10].
In this case, formants (i.e., maxima in the spectrum of the speech signal) are only a method available to the speech-producing apparatus for achieving the necessary band ratios.
Time has shown that the idea expressed in [10] is sound, fundamental and has great explanatory power.
Later, with the expansion of applied research into spoken speech, new questions arose. It was necessary to develop this theory in relation to speech material of limited volume (i.e., solving the problem in the presence of restrictions) [11].
Such development made it possible to extend the above-mentioned hypothesis of L.A. Varshavsky and I.M. Litvak to solving the problem of forensic identification of the speaker.
Thus, the speaker's individuality is determined by the overall shape of the spectrum [12], i.e. by the ratio of signal levels in spectral bands [10].
It is important to note that formants serve as a way of implementing the said band ratios. This is the starting point for solving the problem of forensic identification of the speaker.
This solution is still based on the search for stable identification features (which in this case are revealed by stable spectral characteristics).
It is known that stable identification features can have different natures [2]. Among such stable features are also formant ratios – F2/F1, F3/F1, F3/F2, etc.
Analysis of these ratios is necessary when identifying a speaker in different speech conditions (business telephone conversation, speaking to an audience, conversation with an investigator, etc.), in different emotional states (calm, excited, depressed, frightened, animated, etc.).
In these situations, formant ratios are more stable than absolute formant values and are therefore more demonstrative identification features.
This conclusion is based on the experience of solving the problem of speaker recognition using real phonograms, which shows that when the absolute values of formants change due to one reason or another (for example, depending on the situation of speech communication, the emotional state of the speaker, etc.), the formant ratios remain virtually unchanged.
Let us consider the comparative characteristics of speech (based on the F2/F1 formant ratio) in different speech conditions and in different emotional states of the speaker. The results are summarized in Table I, where the two conditions are labeled situation 1 and situation 2.
In the first case, the speaker is emotionally constrained, cautious and brief. The voice sounds dry, businesslike and muffled. In the second case, the vowels are articulated clearly and fully, the consonants are not tense, and the speech is unhurried (the speech rate is 10% lower than in situation 1).
As can be seen from Table I, the average absolute formant values diverge significantly between situation 1 and situation 2, but the F2/F1 ratio remains practically unchanged, i.e., as noted above, the formant ratios change insignificantly or not at all.
Thus, the formant ratios prove to be a stable identification feature even on speech material of limited volume.
Thus, the position that the acoustic quality of sounds is determined by the ratio of signal levels in spectrum bands [10] is further developed when comparing the speech of the same speaker (i.e. when establishing the fact of the presence (or absence) of identity in the forensic sense).
Table I. Formant analysis parameters

Vowel | Average formant frequency, Hz | F2/F1, situation 1 | F2/F1, situation 2
[a] | original: F1 = 535, F2 = 1390; comparative: F1 = 580, F2 = 1500 | 2.6 | 2.6
[i] | original: F1 = 310, F2 = 2015; comparative: F1 = 300, F2 = 1970 | 6.5 | 6.6
[o] | original: F1 = 457, F2 = 945; comparative: F1 = 390, F2 = 840 | 2.0 | 2.2
Assessing the presented results of the study of the acoustic quality of speech sounds using absolute and relative formant indicators, we note that the average weighted relative deviation between the original (situation 1) and comparative (situation 2) speech material, calculated from the average values of the formants, was 8.4% in this example, whereas the average weighted relative deviation calculated from the F2/F1 ratios (the F2/F1 columns of Table I) is only 3.5%.
Although both indicators are within the limits of intra-speaker variability, it is clear from the presented results that the formant ratio was in this case a stronger identification feature than the average values of the formants.
As in the case of the relative range of the fundamental frequency D, it can be argued that as an identification feature the ratio of formants has more “weight” than the absolute values of the formants.
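The two percentages can be checked directly from Table I. The sketch below rests on stated assumptions: all weights are taken as equal (the weighting is not specified in the text), the relative deviation of note [4] is applied to formants and to ratios in the same way, the second vowel in the table is read as [i], and the F2/F1 ratios are taken as printed in the table rather than recomputed from the frequencies. The names are illustrative.

```python
# Mean formant frequencies in Hz transcribed from Table I:
# vowel -> ((F1, F2) of the original, (F1, F2) of the comparative recording)
FORMANTS_HZ = {
    "a": ((535.0, 1390.0), (580.0, 1500.0)),
    "i": ((310.0, 2015.0), (300.0, 1970.0)),
    "o": ((457.0, 945.0), (390.0, 840.0)),
}

# F2/F1 values as printed in Table I: vowel -> (situation 1, situation 2)
TABLE_RATIOS = {"a": (2.6, 2.6), "i": (6.5, 6.6), "o": (2.0, 2.2)}

def relative_deviation(original, comparative):
    """|original - comparative| / comparative, following note [4]."""
    return abs(original - comparative) / comparative

def mean_of(values):
    """Unweighted mean (assumption: every item has the same weight)."""
    values = list(values)
    return sum(values) / len(values)

# Deviation over all six absolute formant values (F1 and F2 of three vowels).
formant_deviation = mean_of(
    relative_deviation(original[k], comparative[k])
    for original, comparative in FORMANTS_HZ.values()
    for k in (0, 1)
)

# Deviation over the three tabulated F2/F1 ratios.
ratio_deviation = mean_of(
    relative_deviation(sit1, sit2) for sit1, sit2 in TABLE_RATIOS.values()
)

print(f"absolute formants: {100 * formant_deviation:.1f}%")  # 8.4%
print(f"F2/F1 ratios: {100 * ratio_deviation:.1f}%")         # 3.5%
```

Under these assumptions the sketch reproduces both figures quoted in the text, which suggests that they are simple means of the per-item relative deviations.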
To conclude the discussion of Table I, we note that the best match between the relative formant indicators of the original and comparative recordings is observed for the sound [a]. This can be explained by the fact that, among all the vowels of the Russian language, [a] is spectrally the most robust to noise in the acoustic environment and to distortions.
For the same reasons, the spectral characteristics of the vowel [a] turned out to be the most robust to differences in the speech situation, in the contextual environment [13], and in the volume and quality of the speech material of the original recording and the sample phonogram.
Thus, in the task of forensic identification of the speaker, the fundamental idea of determining the acoustic quality of sounds through the ratio of signal levels in spectrum bands [10] was implemented using ratios of the average formant frequencies (F2/F1); that is, formants served as the means available to the speech-forming apparatus for achieving the necessary band ratios.
A combination of the two methods of spectral analysis therefore makes it possible to identify those stable identification features of the instrumental group that are associated with the spectral characteristics of an individual's speech.
The analysis of speech material from a specific type of forensic examination shows that when the absolute values of the parameters of the fundamental tone and formants change for one reason or another (for example, depending on the situation of speech communication, the emotional state of the speaker, etc.), the relative phonation and articulation indicators (the relative range of the fundamental tone D and the ratio of formants) remain stable and can be used as stable identification features of the speaker.
Thus, functional-dynamic complexes (FDC) of skills, the carrier of which is a person, serve as sources for identifying individual speech characteristics. They allow us to identify those stable identification features that are associated with relative prosodic and spectral characteristics of speech.
Literature
1. Belkin R.S. et al. Forensic Science. Moscow: Legal Literature, 1968. 695 p.
2. Kaganov A.Sh. Audio and video equipment as a source of evidentiary information // Material evidence. Information technologies of procedural proof / Ed. by Doctor of Law, Professor V.Ya. Koldin. Moscow: Norma, 2002. 742 p.
3. Fundamentals of forensic examination. Part 1. General theory / Ed. by Yu.G. Korukhov. Moscow: RFCFS under the Ministry of Justice of the Russian Federation, 1997. 430 p.
4. Kratzenstein Ch.G. Qualis sit natura et character sonorum litterarum vocalium a, e, i, o, u tam insigniter inter se diversorum. St. Petersburg, 1779.
5. Helmholtz H. Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik. Braunschweig, 1870.
6. Rzhevkin S.N. Hearing and speech in the light of modern physical research. Moscow; Leningrad: ONTI, 1936. 311 p.
7. Flanagan J.L. Analysis, Synthesis, and Perception of Speech / Translated from English. Moscow: Svyaz, 1968. 292 p.
8. Fant G. Acoustic Theory of Speech Production / Translated from English. Moscow: Nauka, 1964. 284 p.
9. Methodological Recommendations for the Practical Use of the SIS Program When Working with Speech Signals. STS – D106.1. St. Petersburg: Speech Technology Center, 1998.
10. Varshavsky L.A., Litvak I.M. Study of formant composition and some other physical characteristics of sounds of Russian speech // Problems of physiological acoustics. 1955. Vol. 3. P. 5–17.
11. Kaganov A.Sh. Instrumental study of spectral characteristics in the problem of forensic identification of a person by spoken speech (in print).
12. Galunov V.I., Garbaruk V.I. Acoustic theory of speech production and the system of phonetic features // Proceedings of the international conference “100 years of experimental phonetics in Russia”. St. Petersburg: Philological Faculty of St. Petersburg University, 2001. P. 58–62.
13. Zlatoustova L.V. Phonetic units of Russian speech. Moscow: Moscow State University, 1981. 108 p.
[1] Soviet Encyclopedic Dictionary. – Moscow, 1979. – p. 863.
[2] It seems more accurate to speak specifically about the vocal folds, and not about the vocal cords, since anatomically the vocal cord is just a thin membrane running along the edge of the vocal fold.
[3] The relative range of change in the fundamental tone frequency D is taken to be the ratio of the maximum value of the fundamental tone frequency (the average value plus twice the standard deviation) to the minimum value (the average value minus twice the standard deviation).
[4] The relative deviation is understood as the modulus of the difference between the average fundamental tone frequencies of the original and comparative recordings, divided by the average fundamental tone frequency of the comparative recording.