Identification of users of computing systems based on modern speech technologies

With the growing informatization of modern society, the number of objects and information flows that must be protected from unauthorized access keeps increasing, as does the need to make every form of interaction between users and automated control systems more intelligent. As a result, the use of speech technologies to restrict access to information and computing systems (IVS) is becoming an increasingly urgent problem. These systems are now at the next stage of their technical evolution: the first commercial software products using speech technologies have already appeared. However, although in the middle of this century speech recognition and voice identification systems were predicted to come into wide use in the near future, today they work only in certain areas of public life and have not become widespread.

The Problem of Personal Identification by Voice

Interest in the problem of voice identification has grown over the past few decades. The main reason is the advantages of establishing and verifying a person's identity from a segment of the speech wave: a voice cannot be stolen, and identification does not require direct contact with the access control system. As modern speech technologies spread (inquiries about bank account status and bank settlements by phone, requests for information from databases by phone, automatic payment for long-distance telephone calls, and so on), the need to establish the authenticity of the subscriber increases. An important application of speaker identification is checking access rights to various objects, both informational and physical: communication channels, computing systems, databases, automated data processing systems (ASOD), bank accounts, and official or private premises with restricted access (for reasons of secrecy, material assets, and so on).

It should be noted that many modern means of protection are not reliable enough, because they rely on a password, physical key or code that can be lost, guessed, observed, broken, or surrendered to an intruder under duress. An attack on any information processing system is therefore always possible, and an intruder will try to obtain the key or password before attempting to break into the system in other ways. The user therefore needs a reliable and permanent «key» that is always with him, easy to use and inaccessible to intruders. The biometric parameters of the individual are an obvious candidate for such a universal «key»: fingerprints, the dynamics and shape of the signature, image, voice. Research has shown that current methods of modeling the speech signal and of extracting the individual parameters of the speaker make it possible to create reliable systems of personal identification based on speech. However, the probability that an intruder will break such a system depends on the parameters chosen to characterize the individuality of the voice, on the training mode, on the concept used to build the identification system, and so on.

To date, dozens of different voice identification systems have been created, each with different parameters and requirements for the identification process depending on specific tasks. In our country, a number of complete software products have been developed that have already found application in various departments, for example, the software and hardware system for text-independent speaker identification «SIG», the software and hardware system for restricting access to information resources based on speaker verification by password phrases «Voice Key» (used in the Ministry of Defense), the automated system for identifying individuals by Russian speech phonograms «Dialect» (used in the Ministry of Internal Affairs).

Unfortunately, the programs developed to date are neither easy to learn, nor convenient to use, nor inexpensive. They are most often used as additional means of authentication where a high degree of reliability is required. Work therefore continues on improving speech signal processing algorithms in order to create mechanisms of automatic voice recognition that correspond more closely to the process of human speech perception.

Speech signal parameters and the individuality factor

In everyday life a person constantly identifies relatives and friends. He does this unconsciously and quickly, drawing on life experience and a fairly large amount of information (appearance, gait, voice, behavior), which makes the problem of identification seem transparent and obvious at first glance. As a result, the question «What allows us to distinguish the voice of one person from another?» led the first researchers to purely speculative theories. The main cause was an underestimation of the complexity of speech as a multifunctional act of communication that carries both information about the individual voice of the speaker and information about phonetic quality. It is therefore very important to choose and justify the system of features correctly, since it determines the principle on which the identification system is built. The questions are as follows: what are the objective prerequisites for recognizing a person by voice? What physical phenomena underlie speaker recognition? What acoustic characteristics can be used to build an identification system?

Based on the data obtained through experiments using subjective methods, the main manifestation of the individuality of human speech should be sought in two main groups of features. They are associated with the physiological (anatomical) features of the human speech production mechanism and the unique nature of its activation (articulatory activity), conditioned by the work of the central nervous system.

The first group of features is based on the well-known model of the vocal tract [3], consisting of the transfer function of the resonance system and the generator of excitation signal pulses. The transfer function almost completely characterizes the individual geometric shape of the cavities of the speech apparatus: the posterior pharyngeal cavity, the constriction between the tongue and the palate, the anterior oral cavity, the constriction between the lips, etc. The main parameters here are the characteristics of the four formant regions (average frequency, frequency range, energy), the spectrum envelope, formant trajectories and derivatives of these parameters. The frequency of excitation pulses is directly dependent on the vibrations of the vocal cords, which, in turn, depend on the length, thickness and tension of the latter. The main parameters here are the frequency of the fundamental tone, the tone/noise parameter, sonority, the rise of the fundamental tone and derivatives of these parameters.

Parameters related to the physiological characteristics of the vocal tract are most often calculated by methods of spectral-temporal analysis. These methods are consistent with the natural mechanism of speech perception [2], which explains why many researchers look for individual characteristics in the instantaneous spectral distributions of individual phonemes and in the distributions of the current spectrum. The methods are based on classical Fourier analysis [3] or on parametric autoregressive analysis, of which linear prediction is a special case [4, 5].
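As a minimal sketch of the linear-prediction analysis mentioned above (not the procedure of any particular system discussed here), the autocorrelation method with the Levinson-Durbin recursion can be written in plain Python. The function names `autocorrelation` and `lpc` and the choice of model order are illustrative assumptions.

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def lpc(frame, order):
    """Levinson-Durbin recursion (hypothetical helper, for illustration).

    Returns (a, err) where the predictor is
    x[n] ~ a[0]*x[n-1] + ... + a[order-1]*x[n-order]
    and err is the residual prediction error energy.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        if err <= 0.0:
            break  # degenerate frame (e.g. silence): stop early
        # reflection coefficient for this stage
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err
        prev = a[:]
        a[i] = k
        for j in range(i):
            a[j] = prev[j] - k * prev[i - 1 - j]
        err *= (1.0 - k * k)
    return a, err
```

For a decaying sequence such as `[0.9 ** n for n in range(200)]`, a second-order call recovers a first coefficient close to 0.9 and a second coefficient close to zero, which is the behavior expected of a first-order autoregressive signal.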

Closely related to the spectral representation of the speech signal is the homomorphic method [4], which has been used quite often recently. It represents the speech signal as a sequence of cepstral coefficient vectors, which need significantly less memory for storing reference patterns. A small number of cepstral coefficients (usually 8 or 16) can approximate a formant section with high spectral resolution. This gives a more compact representation of speech segments without significant loss of the main informative features (formant structure, envelope, tone/noise parameter).
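To make the cepstral representation concrete, here is a small sketch that computes the real cepstrum of one frame as the inverse transform of the log-magnitude spectrum. It uses a naive DFT so that it needs only the standard library; the names `dft` and `real_cepstrum`, the default of 16 coefficients and the `floor` guard are assumptions made for illustration, and a real system would use an FFT and windowing.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, n_coeffs=16, floor=1e-12):
    """First n_coeffs real cepstral coefficients of one frame.

    c = IDFT(log |DFT(frame)|); `floor` avoids log(0) for empty bins.
    """
    n = len(frame)
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    # For a real frame the log-magnitude spectrum is real and even,
    # so its inverse DFT reduces to a cosine sum.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(min(n_coeffs, n))]
```

Keeping only the low-order coefficients retains the smooth spectral envelope (the formant structure), which is why a short vector per frame can stand in for the full spectrum.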

As for the parameters of the excitation signal, they can be calculated by one of the widely known methods of isolating the fundamental tone frequency (for example, the correlation method, the cepstral method, the Gold-Rabiner method [3,4]).
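The correlation method named above can be sketched as follows. This is a simplified illustration, not the Gold-Rabiner algorithm or any production pitch tracker: the search range of 60 to 400 Hz and the voicing cutoff of 0.3 are assumed values, and `estimate_f0` is a hypothetical name.

```python
def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate fundamental frequency (Hz) by the autocorrelation method.

    Returns None if the frame looks unvoiced (weak periodicity).
    """
    n = len(frame)
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), n - 1)
    energy = sum(s * s for s in frame)
    if energy == 0.0 or lag_min >= lag_max:
        return None
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        score = corr / energy  # normalized by the zero-lag value
        if score > best_score:
            best_lag, best_score = lag, score
    # crude tone/noise decision: require a clear autocorrelation peak
    if best_lag is None or best_score < 0.3:
        return None
    return sample_rate / best_lag
```

The normalized peak height doubles as a rough tone/noise indicator: periodic (voiced) frames give a pronounced peak, while noise-like frames do not.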

If the first group of features reflects the static properties of the speech-forming tract, the second group is intended to describe its behavior in time, that is, the articulatory dynamics of speech. According to the existing assumption, the initial and main stage in organizing speech production is a program of articulatory movements, controlled by the central nervous system, that corresponds to the message to be transmitted at a given moment [1, 2]. There is no doubt that the individual character of speech is already determined at the level of the central nervous system, that is, at the level where articulatory programs are synthesized. The decisive factors here are the socially conditioned speech skills of the speaker, his individual experience, psychological makeup (in particular, temperament), character traits and even intelligence. Speech cannot be controlled without these components.

An articulatory program contains the rules for pronouncing certain structures. These rules govern the intonation of speech, its rhythm, stress and volume, i.e. the prosodic characteristics of speech. The articulatory program extends over a semantic unit of speech known as a syntagma: a rhythmic-melodic unit of speech that is grammatically formed and expresses a relatively complete thought within a more complex whole (for example, a sentence). Within one syntagma, the suprasegmental (intonation) characteristics of the speech flow are distinguished. The main parameters here are intensity, melody (the movement of the fundamental tone), the stress system, temporal characteristics (duration of segments, pauses, tempo) and the rhythmic pattern of the speech phrase.

The study of the rhythmic pattern of a speech phrase has shown that its temporal pattern remains invariant for an individual articulatory program regardless of the absolute durations of the words and syllables it contains, i.e. it is invariant with respect to the speech rate [1]. This finding suggests that the central nervous system holds schemes, unique to each person, that generate a specific and repeatable sequence of actions of the speech apparatus in time. Analysis of intra-syllabic articulation has shown that, although it results from successive movements, these movements are apparently not dictated by the central nervous system one after another but arise reflexively.
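One simple way to express a rhythmic pattern that does not depend on speech rate is to divide each segment duration by the total duration of the phrase. The sketch below assumes that segment boundaries are already known; `rhythm_pattern` and `pattern_distance` are hypothetical helpers, not functions of any system described in this article.

```python
def rhythm_pattern(durations):
    """Relative temporal pattern: each segment duration as a share of the total."""
    total = sum(durations)
    if total <= 0:
        raise ValueError("durations must sum to a positive value")
    return [d / total for d in durations]


def pattern_distance(p, q):
    """Largest absolute difference between two patterns of equal length."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same number of segments")
    return max(abs(a - b) for a, b in zip(p, q))


# The same phrase spoken slowly and about 30% faster (durations in seconds).
slow = [0.30, 0.12, 0.45, 0.21]
fast = [d * 0.7 for d in slow]
print(pattern_distance(rhythm_pattern(slow), rhythm_pattern(fast)))
```

A uniform change of tempo scales every duration by the same factor, so the two relative patterns coincide up to rounding, while a different speaker with different relative timing would give a non-zero distance.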

To calculate the parameters describing the articulatory dynamics of speech, the methods of spectral-temporal data analysis described above can be used. However, it is necessary to note such a feature of the calculation of prosodic parameters as their rigid connection with the lexical and syntactic context of the phrase under study. This requires the complex use of both linguistic analysis tools and parametric processing methods, which clearly determines the complexity of the analysis of these characteristics. In this case, the main task is to establish a direct connection between the activity of the speech-forming apparatus (the dynamics of its articulatory movements) and the characteristics of the spectral picture of the speech flow.

Continuing the conversation about the parameters of the speech signal that determine the individuality of the human voice, it is necessary to touch upon the issue of integral parameters of speech. These parameters, due to their nature, cannot be attributed to any of the above-mentioned characteristic groups, but are strongly correlated with them and are formed under the influence of the anatomical features of the speech-forming tract and the articulatory activity of a person.

Subjective research methods allow us to establish that a specific voice source exists in a speech signal in the form of a certain constant background. Human hearing, easily filtering the information it needs, constantly monitors the coloring of the voice. Sometimes, without distinguishing the phonetic elements of speech or even the meaning of the spoken sentence, a person nevertheless easily identifies the speaker by the characteristic flow of voice parameters.

This circumstance has prompted many researchers to use integral properties of the speech signal as characteristic features of the voice, i.e. properties that appear as values averaged over the whole analyzed signal. If the signal is long enough to be statistically representative, so that regularities of the language such as the frequency of occurrence of individual phonemes have time to show themselves, then analysis of the integral parameters of the speech signal is believed to reveal the features of individual pronunciation for speech segments of different phonetic content. This assumption agrees well with everyday experience, in which stable identification of the speaker does not depend on the phonetic content of speech.

One of the most widely used integral features is the average weighted speech spectrum. Despite the fact that this voice parameter is the simplest type of primary data processing, it is considered one of the effective features of voice identification in a continuous speech stream. The pitch of the speaker's voice is important, and in some cases, decisive, and can be expressed as the average value of the fundamental frequency of the speech signal over a fixed period of time. In addition, this parameter can be presented in the form of distribution diagrams of the fundamental tone periods.
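The two pitch-based integral features described above, the average fundamental frequency and the distribution of fundamental tone periods, can be sketched as follows. The input is assumed to be a per-frame pitch track in hertz in which `None` marks unvoiced frames; the 1 ms bin width and 20 ms upper limit are illustrative choices, not values from the text.

```python
def mean_f0(f0_track):
    """Average fundamental frequency (Hz) over voiced frames; None marks unvoiced."""
    voiced = [f for f in f0_track if f is not None and f > 0]
    if not voiced:
        return None
    return sum(voiced) / len(voiced)


def period_histogram(f0_track, bin_ms=1.0, max_ms=20.0):
    """Normalized histogram of pitch periods (in ms) over voiced frames."""
    n_bins = int(max_ms / bin_ms)
    counts = [0] * n_bins
    for f in f0_track:
        if f is None or f <= 0:
            continue
        period_ms = 1000.0 / f
        idx = int(period_ms / bin_ms)
        if idx < n_bins:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts
```

Because both quantities are accumulated over the whole utterance, they do not depend on which phonemes were spoken, which is the property that makes integral features attractive for continuous speech.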

Thus, the speech signal parameters described above characterize various aspects of human voice formation. Depending on the chosen concept of constructing an identification system, its basis will be made up of different parameters. Most of them are analyzed by classical methods, others require special modes of extraction and processing, which will be discussed below.

Principles of constructing automatic speaker identification systems

Most voice identification systems developed to date are based on a one-time comparison of the required key phrase with the phrase pronounced at the moment of initial access to the computing system. These systems support two main modes of operation: system training and authentication upon access.

In the first mode (registration), the user is asked to pronounce a key phrase (password) several times, usually limited in duration to 3-4 seconds. In this case, the identification system is trained on average speech segments based on the results of recording several pronunciations. The recorded key can be stored in full or compressed by effective algorithms that allow preserving individual voice parameters without distortion (linear prediction methods). Some systems remove weakly expressed speech sections (pauses, noises, bursts of energy) from the recorded key phrase by dividing it into segments corresponding to the phonemes of the base language, from which a set of required parameters is then extracted. As a rule, the above-described systems use parameters associated with the anatomical features of the speech apparatus and integral parameters. To prevent the possibility of substitution or destruction of reference phrases, they are stored in write-protected files.
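Training on averaged segments can be sketched as an element-wise mean over several recordings of the key phrase. The sketch assumes that each recording has already been converted to the same number of parameter vectors and time-aligned, which a real system would have to ensure separately; `build_reference` is a hypothetical name.

```python
def build_reference(utterances):
    """Average several enrollment utterances into one reference template.

    Each utterance is a list of M feature vectors of length N; all
    utterances are assumed to be already time-aligned to the same M.
    """
    if not utterances:
        raise ValueError("at least one utterance is required")
    m, n = len(utterances[0]), len(utterances[0][0])
    for u in utterances:
        if len(u) != m or any(len(v) != n for v in u):
            raise ValueError("utterances must share the same M x N shape")
    count = len(utterances)
    return [[sum(u[j][i] for u in utterances) / count for i in range(n)]
            for j in range(m)]
```

Averaging over repetitions smooths out some of the variation between individual pronunciations, at the cost of requiring the user to repeat the phrase several times during registration.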

In verification mode, the spoken key phrase is compared with the reference phrase by calculating the distance between the two realizations in the N-dimensional parametric space, where N is the dimension of the parametric vector and M is the number of time-ordered vectors. If this distance does not exceed the established identification threshold, a decision is made on the positive identification of this voice.
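The comparison can be sketched as follows, assuming (the text does not specify the metric) a Euclidean distance between corresponding N-dimensional vectors, averaged over the M time-ordered vectors of the two realizations. The names `template_distance` and `verify` and the threshold values are illustrative.

```python
import math


def template_distance(reference, sample):
    """Mean Euclidean distance between M time-ordered N-dimensional vectors."""
    if len(reference) != len(sample) or not reference:
        raise ValueError("both realizations must contain the same M > 0 vectors")
    total = 0.0
    for ref_vec, vec in zip(reference, sample):
        if len(ref_vec) != len(vec):
            raise ValueError("vectors must have the same dimension N")
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(ref_vec, vec)))
    return total / len(reference)


def verify(reference, sample, threshold):
    """Accept the speaker if the distance does not exceed the threshold."""
    return template_distance(reference, sample) <= threshold
```

For example, with `reference = [[0, 0], [1, 1]]` and `sample = [[3, 4], [1, 1]]` the per-vector distances are 5 and 0, so the mean distance is 2.5: the sample is accepted with a threshold of 3.0 and rejected with a threshold of 2.0.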

For systems that analyze the individual pronunciation of individual sounds, the decision is made by calculating the mutual correlation function of the parameters of the reference and control phonemes at the maximum of the main lobe.
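A decision based on the peak of the cross-correlation function can be sketched like this. The sketch takes one parameter track per phoneme (a plain list of numbers) and returns the maximum of the normalized cross-correlation over all lags; the normalization and the name `peak_cross_correlation` are assumptions made for illustration.

```python
import math


def peak_cross_correlation(reference, control):
    """Maximum of the normalized cross-correlation over all lags.

    Both inputs are parameter tracks (lists of floats) for one phoneme.
    The result lies in [-1, 1]; values near 1 indicate a close match.
    """
    norm = math.sqrt(sum(a * a for a in reference) *
                     sum(b * b for b in control))
    if norm == 0.0:
        return 0.0
    best = -1.0
    for lag in range(-(len(control) - 1), len(reference)):
        acc = 0.0
        for i, b in enumerate(control):
            j = i + lag
            if 0 <= j < len(reference):
                acc += reference[j] * b
        best = max(best, acc / norm)
    return best
```

Searching over lags makes the score tolerant of a small time shift between the reference and control phonemes; the resulting peak value would then be compared with a decision threshold.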

The main advantage of the systems described above is the simplicity of construction. The wide possibilities of their implementation based on standard digital signal processing (DSP) procedures and low requirements for computing resources and computer memory capacity have made such systems almost a textbook example in studying the theory of automatic human recognition by voice.

However, a number of significant drawbacks limit their wide application. First of all, such systems have high rates of errors of the first kind («false alarm») and of the second kind («missed target»). The cause is the difficulty of pronouncing the key phrase identically at each access to the system (short-term variability) and anatomical changes in the vocal tract over a lifetime (long-term variability). The password may be pronounced at a different tempo and with different intonation, in different emotional states, or while the speaker's speech apparatus is affected by illness. The stability of the key phrase parameters also depends on the acoustic conditions of recording and recognition, on changes in the distance to the microphone, on external noise, etc. These factors inevitably blur the recognition regions that correspond to specific voices in the N-dimensional parametric space and, when there are many users, cause the regions to overlap significantly.

To reduce the effects of variability in the recognition parameters and in pronunciation duration, almost every identification system built on the principle described above uses normalization mechanisms. It should be noted, however, that normalization procedures, which «pull» the vector under study toward the nearest center of a recognition region, inevitably deform neighboring regions and leave the percentage of their overlap unchanged. The use of such procedures therefore does not change the rates of errors of the first and second kind. These errors can be minimized only by selecting highly informative and uncorrelated features that ensure minimal overlap of the distributions of identification parameters in the vector space.

However, even for a given learning algorithm, metric space and known probability distribution of the individual parameters, the problem remains of choosing the identification threshold optimally. The threshold determines the ratio between errors of the first and second kind, and its value is dictated by the specific task and the area of application of the identification system. Where the penetration of an outsider must be prevented as far as possible, the error of the second kind should be minimized at the cost of a larger error of the first kind. A larger error of the first kind, which is the price of making a missed «target» rare, also makes it harder to admit «your» person and requires more repeat requests from the system. Where «your» user must be admitted from the first utterance, and the possibility that an «outsider» gets in is accepted, the error of the first kind should be minimized at the cost of a larger error of the second kind.
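The trade-off between the two kinds of error can be illustrated with a small threshold sweep over measured distances. Following the usage in this section, an error of the first kind is taken here to be the rejection of a legitimate user and an error of the second kind the admission of an outsider. The function names and the sample distances are invented for illustration.

```python
def error_rates(genuine, impostor, threshold):
    """Return (first_kind, second_kind) error rates for a distance threshold.

    first_kind:  share of genuine attempts rejected (distance > threshold)
    second_kind: share of impostor attempts accepted (distance <= threshold)
    """
    first = sum(d > threshold for d in genuine) / len(genuine)
    second = sum(d <= threshold for d in impostor) / len(impostor)
    return first, second


def pick_threshold(genuine, impostor, max_second_kind):
    """Largest candidate threshold whose second-kind error stays within the limit.

    Returns None if no candidate threshold satisfies the limit.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        _, second = error_rates(genuine, impostor, t)
        if second <= max_second_kind:
            best = t
    return best


# Hypothetical distances to the reference for legitimate and outside speakers.
genuine = [1.0, 1.5, 2.0, 3.5]
impostor = [3.0, 4.0, 5.0, 6.0]
```

With these numbers, allowing no outsiders at all gives a threshold of 2.0 and rejects a quarter of the legitimate attempts, while tolerating a 25% second-kind error raises the threshold to 3.5 and admits every legitimate attempt.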

As a rule, in reliable identification systems the authors of the program are forced to take the first path, which requires the user to repeat the password or even denies access completely if his voice has changed because of an illness of the speech apparatus. Users of computing systems therefore tend to abandon such identification mechanisms and return to the traditional and more convenient entry of a password from the keyboard.

In addition, systems built on the principle described above can be broken if the intruder has a recording of the key phrase, which he could have overheard or obtained under duress. To solve this problem of key «substitution», more complex identification systems use a password database that the system builds at the training stage. The identification system randomly selects a password from this database and asks the user to pronounce a different key phrase each time. Since the intruder does not know in advance which password the system will request, he cannot use a recorded key. These systems are called text-dependent and require algorithms for determining the phonemic composition of the key phrase. As a rule, such systems do not perform linguistic analysis of the speech signal and check only that the phoneme parameters at the beginning and end of the key phrase match those of the reference.
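The random choice of a key phrase can be sketched as below. The text only says that the password is selected at random; excluding recently used phrases is an added assumption, as are the name `choose_prompt` and the use of Python's `secrets` module for an unpredictable choice.

```python
import secrets


def choose_prompt(password_db, recently_used):
    """Pick a random key phrase for this attempt, avoiding recently used ones.

    password_db:   phrases the user recorded at the training stage
    recently_used: phrases prompted in recent sessions (may be empty)
    """
    candidates = [p for p in password_db if p not in recently_used]
    if not candidates:
        candidates = list(password_db)  # every phrase was used: start over
    if not candidates:
        raise ValueError("password database is empty")
    return secrets.choice(candidates)
```

The larger the database, the smaller the chance that a phrase recorded by an intruder is requested again, which is why such systems need many enrolled phrases per user.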

The difficulty of implementing this identification mechanism lies in building a password database with a sufficiently large number of keys for each user. Training in such systems can take a long time (up to several hours). It is also possible that an intruder using modern algorithms and digital speech processing equipment will be able to synthesize the responses of a legitimate user from models of his individual phonemes (text-to-speech converters). The synthesized phrase will differ from one actually pronounced, especially at the transitions from phoneme to phoneme, but only the powerful analytical mechanism of human hearing can detect these distortions; an identification system built on the principle described above cannot.

Given the shortcomings of existing access control mechanisms described above, the possibility of constructing dialogue systems for identifying a person by voice has recently been actively researched. Unlike most existing identification systems, dialogue systems are based on the analysis of prosodic characteristics (the second group of features), which are expressed most clearly not in a single pronunciation of individual key words, phrases or even sentences, but in a meaningful act of communication between people. This explains the desire to construct identification algorithms within the framework of the human-computer communication model. Prosodic characteristics are resistant to changes in the acoustic environment and to the short-term and long-term variability of the parameters of the speaker's speech-forming tract.

It should be noted that the problem of dialogue between a person and a computer is part of the general problem of creating artificial intelligence systems and is at the junction of several sciences, which indicates its complexity. Therefore, it is proposed to adopt a speech-to-text conversion system as the basis for a system of dialogue between a person and a computer, on the basis of which a speech control system for a computer can be created. Unlike text-dependent identification systems, the dialogue method implements not only a single user response to a requested sentence or question from the password database, but also its expansion to a full-fledged speech interface of the computer. The machine accepts commands from the user and executes them only if the speaker's voice matches the registered one. Such a concept for constructing a system for restricting access to a computer will determine its competitiveness and resistance to external attacks.

The distinctive feature of the access control system being developed is that identification based on prosodic characteristics is built into the processing of commands or messages received from the operator. This integration eliminates the main disadvantage of prosodic analysis: the need for a large volume of speech samples for training and, accordingly, for a lot of time and computer memory. While the computer is being controlled by speech, the identification system receives a sufficient volume of speech material without the speaker noticing. This approach makes a separate training mode unnecessary by moving training to the stage of the user's ordinary work with the computer. Obviously, in this case it is advisable to build the identification system as a real-time system.

The use of a command system in a single-access mode to an object (for example, at a checkpoint) will not allow achieving the desired effect due to the complexity of training the system and analyzing short key phrases. In the case of frequent and prolonged «communication» of a person with the same object (for example, a personal computer), statistics are accumulated on the user's articulatory activity on large volumes of speech data, which will significantly reduce the number of errors made by the system.
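Accumulating statistics while the user works, without storing all past speech, can be sketched with a running mean and variance for a single prosodic parameter (for example, average pitch per command). This uses Welford's online update, a standard technique chosen here for illustration and not attributed to the system described in the article; the class name `RunningProfile` is hypothetical.

```python
class RunningProfile:
    """Running mean and variance of one prosodic parameter (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new observation into the profile."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two observations exist."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0
```

Each spoken command updates the profile in constant time and memory, so the estimate of the user's typical values becomes steadier the longer the person works with the same computer.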

As noted above, prosodic analysis is most effective under conditions of real speech activity, which requires the development of systems for command control and speech-to-text conversion in a flow of continuous speech. These mechanisms solve the opposite problem to that of the systems described above: whereas identification systems analyze what differs in the pronunciation of specific users, speech recognition mechanisms must determine what is common to them. This task is part of the general problem of automatic speech recognition and understanding.

References

1. Ramishvili G. S. Automatic recognition of the speaker by voice. Moscow: Radio and Communications, 1981.
2. Bloom F., Lazerson A., Hofstadter L. Brain, mind and behavior. Moscow: Mir, 1988.
3. Rabiner L., Schafer R. Digital processing of speech signals. Moscow: Radio and Communications, 1981.
4. Markel J., Gray A. H. Linear prediction of speech. Translated from English. Moscow: Communications, 1980.
5. Marple S. L., Jr. Digital spectral analysis and its applications. Translated from English. Moscow: Mir, 1990.
