What is a vocoder and a linear prediction coder?
Proceedings of the 2nd All-Russian Conference «Theory and Practice of Speech Research»
In modern digital systems for recording, transmitting and storing speech information, various speech compression methods are used to reduce the volume occupied by information on physical media or the speed of its transmission over digital communication channels. In such systems, a speech signal converted into digital form is encoded using a special compression algorithm before being recorded on a media or transmitted, and is decoded when played back from a media or received.
As is known, from the information and communication standpoint a speech signal has a certain redundancy that does not affect the semantic content of the speech message. Speech compression is possible due to partial removal of this redundancy, which may not reduce the intelligibility and quality of auditory perception of speech but may, at the same time, deprive it of the special features necessary for expert identification of the speaker. Therefore, when conducting an examination, it is important to establish both the fact of compression and its effect on the speech signal.
Currently, many speech compression algorithms are used. All of them can be implemented using both hardware and software methods. Conventionally, all algorithms can be divided into three types:
— advanced types of pulse-code modulation (PCM, Pulse-Code Modulation);
— vocoders (from English Voice and Coder);
— linear prediction coders.
To assess the nature of the changes and losses introduced into the speech signal, let us consider the principles on which the various compression methods are built.
1. Advanced types of PCM.
The parameters of PCM for digitizing speech signals are described in the recommendations of the CCITT (the International Telegraph and Telephone Consultative Committee) and, as a rule, have the following values:
— sampling frequency of 8000 Hz;
— number of binary digits per sample 8;
— transmission rate 64000 bit/s.
In this case, an analog signal with an upper frequency of up to 4000 Hz can be digitized and restored.
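The rate figures above follow directly from the sampling parameters; a quick check (the constants are the values quoted in the list):

```python
# CCITT-style PCM parameters quoted above.
SAMPLE_RATE_HZ = 8000    # sampling frequency
BITS_PER_SAMPLE = 8      # binary digits per sample

# Resulting transmission rate: samples per second times bits per sample.
bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
print(bit_rate)  # 64000 bit/s

# By the sampling theorem, the recoverable analog bandwidth is at most
# half the sampling frequency.
print(SAMPLE_RATE_HZ // 2)  # 4000 Hz
```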
When using differential PCM (DPCM), the differences between adjacent samples are encoded instead of the samples themselves; these differences are usually smaller than the samples. The transmission rate of the digital stream is reduced to 32-56 kbit/s. In systems with logarithmic DPCM, A-law and μ-law companding are used to implement non-uniform quantization. Adaptive differential PCM (ADPCM) is a DPCM system with adaptation of the quantizer (ADC and DAC) and of the predictor. With ADPCM, it is not the signal itself that is digitized, but its deviation from the predicted value (the error signal, or prediction error).
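The difference-coding idea can be illustrated with a minimal first-order DPCM codec; this is a sketch with a fixed (non-adaptive) quantizer, and the step size and function names are illustrative rather than taken from any standard:

```python
def dpcm_encode(samples, step=8):
    """Quantize the difference between each sample and the reconstructed
    previous sample (the simplest predictor: 'next sample = last sample')."""
    codes, pred = [], 0
    for s in samples:
        q = round((s - pred) / step)   # quantized prediction error
        codes.append(q)
        pred += q * step               # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=8):
    out, pred = [], 0
    for q in codes:
        pred += q * step
        out.append(pred)
    return out

signal = [0, 3, 10, 30, 64, 100, 90, 60]
restored = dpcm_decode(dpcm_encode(signal))
# Reconstruction error never exceeds half the quantizer step.
print(max(abs(a - b) for a, b in zip(signal, restored)))  # prints 4
```

An ADPCM coder additionally adapts `step` to the recent signal behaviour, which is what allows fewer bits per difference at the same quality.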
The following types of ADPCM are most often used:
— CCITT Recommendation G.721 (transmission rate 32 kbit/s);
— CCITT Recommendation G.722 (sampling frequency 16 000 Hz);
— CCITT Recommendation G.723 (transmission rate 24 kbit/s);
— Creative ADPCM (4, 2.6, or 2 bits per sample);
— IMA/DVI ADPCM (4, 3, or 2 bits per sample);
— Microsoft ADPCM.
The methods discussed above can introduce minor changes and losses into speech signals (e.g., narrowing the dynamic range in the high-frequency region, limiting the signal slope), which have virtually no effect on the authenticity of speech.
[Figure: Vocoder scheme. The analyzer comprises a spectrum analyzer (A), a tone/noise discriminator, a fundamental tone (pitch) extractor, and a signal combining unit; its output passes through the communication channel (KS) to the synthesizer, which comprises a signal separating unit, a spectrum synthesizer (S), a tone/noise switch (P), a pitch generator (GOT), and a noise generator (GSh).]
2. Vocoders
Vocoders can be divided into two classes:
— speech element;
— parametric.
In speech element vocoders, the pronounced elements of speech (for example, phonemes) are recognized during encoding, and only their numbers are sent to the encoder output. In the decoder, these elements are either generated according to the rules of speech production or taken from the decoder memory. Phoneme vocoders are designed to achieve maximum compression of speech signals; their scope is command communication lines and the control and talking machines of information and reference services. In such vocoders, automatic recognition of auditory images occurs rather than measurement of speech parameters, so all individual features of the speaker are lost.
In general, a vocoder (from the English words voice and coder) is a device that performs parametric companding of speech signals.
Speech signal compression in the coder is performed in the analyzer, which extracts slowly changing parameters from the speech signal. In the decoder, the speech signal is synthesized using local signal sources that are controlled by the received parameters.
In parametric vocoders, two types of parameters are extracted from the speech signal, and speech is synthesized in the decoder using these parameters:
— Parameters that characterize the source of speech excitation (the generator function): the fundamental tone frequency, its change over time, the moments when the fundamental tone appears and disappears (voiced sounds), and the noise signal (unvoiced hissing and whistling sounds);
— Parameters that characterize the envelope of the speech signal spectrum.
In the decoder, accordingly, the fundamental tone and noise are generated according to the received parameters and then passed through a bank of bandpass filters to restore the envelope of the speech signal spectrum.
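As a toy illustration of this analyzer/synthesizer split, one frame of a channel vocoder can be sketched as follows; FFT-bin averaging stands in for the bandpass filter bank, and the 8-band layout and all names are illustrative assumptions:

```python
import numpy as np

def analyze(frame, n_bands=8):
    """Analyzer side: average magnitude in each of n_bands equal-width
    frequency bands (a crude stand-in for a bank of bandpass filters)."""
    spec = np.abs(np.fft.rfft(frame))
    return np.array([band.mean() for band in np.array_split(spec, n_bands)])

def synthesize(levels, frame_len, voiced, pitch_period=64, rng=None):
    """Synthesizer side: excite with a pulse train (voiced) or noise
    (unvoiced), then impose the transmitted band levels on its spectrum."""
    rng = rng or np.random.default_rng(0)
    if voiced:
        exc = np.zeros(frame_len)
        exc[::pitch_period] = 1.0             # pitch-generator stand-in
    else:
        exc = rng.standard_normal(frame_len)  # noise-generator stand-in
    spec = np.fft.rfft(exc)
    bands = np.array_split(np.arange(spec.size), len(levels))
    for idx, level in zip(bands, levels):
        mag = np.abs(spec[idx]).mean() + 1e-12
        spec[idx] *= level / mag              # restore the envelope per band
    return np.fft.irfft(spec, n=frame_len)
```

Only the band levels, the voiced/unvoiced decision, and the pitch period would cross the channel; the waveform itself is never transmitted, which is where the compression comes from.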
According to the principle of determining the parameters of the filter function, vocoders are distinguished:
— band (channel);
— formant;
— orthogonal.
In band (channel) vocoders, the speech spectrum is divided into 7-20 bands (channels) by analog or digital bandpass filters. A larger number of channels provides greater naturalness and intelligibility. From each bandpass filter, the signal goes to a detector that determines its average level.
In formant vocoders, the envelope of the speech spectrum is described by a combination of formants (resonant frequencies of the vocal tract). The main parameters of the formants are the central frequency, amplitude, and width of the spectrum.
In orthogonal vocoders, the envelope of the instantaneous spectrum is expanded in a series over a selected system of orthogonal basis functions. The calculated coefficients of this expansion are transmitted to the receiving side. Harmonic vocoders, which use an expansion in a Fourier series, have become widespread.
The vocoders under consideration provide signal compression down to 1200-4800 bit/s. The decoder can restore the fundamental tone frequency with a resolution of several hertz and, with low accuracy, the envelope of the signal spectrum with an update period of 16-40 ms; as a result, even with sufficiently high speech intelligibility, many individual features of the speaker are lost.
Due to the complexity of determining the parameters of the generator function, semi-vocoders (Voice Excited Vocoder, VEV) appeared, in which, instead of the fundamental tone signals, a speech signal band of up to 800-1000 Hz is used; it is encoded (for example, by ADPCM) and transmitted instead of the fundamental tone characteristics. Such an algorithm allows compressing speech to 4800-9600 bit/s while preserving the generator function of the larynx (the frequency of the fundamental tone and the law of its change) of the speaker.
3. Linear prediction coders
One of the most effective methods of speech signal analysis and synthesis is the linear prediction method. The method has become widespread and continues to be improved. Its essence is that the current sample of the speech signal can be predicted by a linearly weighted sum of previous samples, that is, the predicted sample is s'(n) = a(1)·s(n-1) + a(2)·s(n-2) + ... + a(p)·s(n-p), where a(k) are the prediction coefficients and p is the prediction order.
All speech analysis methods assume a fairly slow change in the properties of the speech signal over time. The characteristics of the vocal tract can be considered unchanged over an interval of 10-20 ms, that is, the parameters should be measured with a frequency of about 1/20 ms = 50 Hz.
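A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion, a standard way of computing the prediction coefficients over such a frame (function and variable names are illustrative):

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Return a[1..p] such that s[n] is approximated by
    a[1]*s[n-1] + ... + a[p]*s[n-p] (autocorrelation method,
    Levinson-Durbin recursion), plus the residual energy."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    a = np.zeros(order + 1)          # a[0] is a placeholder, unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k           # prediction-error energy shrinks
    return a[1:], err
```

For a 20 ms frame at 8 kHz (160 samples) the order is typically 8-12; `err` is the energy of the prediction residual, which the LPC varieties listed below encode in different ways.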
There are several varieties of the linear prediction method, namely:
— with excitation from fundamental tone pulses — LPC (Linear Predictive Coding);
— multi-pulse excitation MPELP (Multi Pulse Excited Linear Predictive) or MPLPC (Multi Pulse Excited LPC);
— excitation from the prediction residue RELP (Residual Excited Linear Predictive);
— with excitation from the CELP (Code Excited Linear Predictive) code.
In the LPC coder, the excitation signal is transmitted using three parameters: the fundamental tone period (for voiced sounds); the tone/noise flag (indicating whether tone or noise is present at a given moment); and the signal amplitude.
An encoder with excitation from the fundamental tone frequency (F0) is an LPC coder used to transmit speech signal parameters at rates of 2400 bit/s and below.
An encoder with excitation from F0 does not provide the required quality of synthesized speech even at a higher transmission rate: not for all sounds is it possible to obtain an accurate division of speech into voiced and unvoiced.
It is known that, in addition to the primary excitation, which occurs when the glottis closes, there is a secondary excitation, which occurs not only when the glottis opens but also when it closes.
In multi-pulse excitation, the LPC residual signal is represented as a sequence of pulses with unevenly distributed intervals and with different amplitudes (approximately 8 pulses per 10 ms).
Information about the positions and amplitudes of the excitation pulses together with the LPC parameters in each frame is formed by the encoder.
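The simplest form of this representation, keeping only the few largest residual samples as the pulses, can be sketched as follows; a real MPLPC coder selects the pulses by analysis-by-synthesis through the LPC filter, so this is only a toy version of the idea:

```python
import numpy as np

def multipulse_excitation(residual, n_pulses=8):
    """Keep the n_pulses largest-magnitude samples of the residual;
    their positions and amplitudes are what the encoder would transmit."""
    positions = np.sort(np.argsort(np.abs(residual))[-n_pulses:])
    excitation = np.zeros_like(residual)
    excitation[positions] = residual[positions]
    return positions, excitation
```

For a 10 ms frame at 8 kHz (80 samples), 8 pulses mean transmitting 8 positions and 8 amplitudes instead of 80 samples.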
If transmitting 10 LPC parameters takes 1.8 kbit/s (36 bits per 20 ms frame), then at total transmission rates of 16 and 9.6 kbit/s, 14.2 and 7.8 kbit/s, respectively, remain for the excitation signal parameters. At 16 kbit/s and even somewhat lower rates, high-quality synthesized speech is obtained. At 16 and 9.6 kbit/s, synthesized speech corresponds in quality to PCM signals (with logarithmic companding) at transmission rates of 56 and 52 kbit/s.
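The bit allocation quoted above can be checked with simple arithmetic (the constants are the figures from the text):

```python
FRAME_MS = 20
LPC_BITS_PER_FRAME = 36

frames_per_second = 1000 // FRAME_MS                  # 50 frames/s
lpc_rate = LPC_BITS_PER_FRAME * frames_per_second     # rate taken by LPC parameters
print(lpc_rate)  # 1800 bit/s

# What remains for the excitation signal at each total rate:
for total_rate in (16000, 9600):
    print(total_rate - lpc_rate)  # 14200 and 7800 bit/s
```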
At 4.8 kbit/s, the LPC parameters and the cross-correlation function are transmitted. The autocorrelation function is reproduced from the received LPC parameters, after which the positions and amplitudes of the excitation pulses are determined. The quality of synthesized speech with multi-pulse excitation at 4.8 kbit/s is significantly higher than with single-pulse excitation at the same transmission rate.
A linear prediction coder that uses the prediction residual as an excitation signal is called a RELP coder. The prediction residual is passed through a low-pass filter with a cutoff frequency of 800 Hz when transmitting at 9.6 kbit/s, and 600 Hz at 4.8 kbit/s. In the first case, the residual signal is encoded and transmitted at a rate of 7.2 kbit/s; the remaining 9.6 − 7.2 = 2.4 kbit/s are used to transmit the prediction coefficients and the gain. In the second case, i.e. at a transmission rate of 4.8 kbit/s, the residual signal is encoded and transmitted at 2.4 kbit/s, and the remaining 2.4 kbit/s are used in the same way as in the first case.
In the decoder, the excitation signal is restored in the entire frequency band. In this case, the upper half of the restored excitation spectrum becomes a mirror image of the lower half.
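This mirror-image restoration is the familiar effect of upsampling by zero insertion, which a few lines can demonstrate (an illustrative sketch, not the exact RELP decoder):

```python
import numpy as np

def spectral_fold(lowband):
    """Upsample by 2 via zero insertion: the spectrum of the result is the
    low-band spectrum followed by its mirror image in the upper half."""
    out = np.zeros(2 * len(lowband))
    out[::2] = lowband
    return out

low = np.random.default_rng(0).standard_normal(64)
mag = np.abs(np.fft.rfft(spectral_fold(low)))
# Magnitude spectrum is symmetric about the mid-point.
print(np.allclose(mag, mag[::-1]))  # True
```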
The residual signal for the RELP coder can also be formed during decoding. Transmitting this signal directly requires a fairly high rate, which is unacceptable for LPC coders operating at 2.4 kbit/s, so the residual signal has to be reconstructed at the receiver from the fundamental tone (F0) signal.
The residual signal does not retain the amplitude spectrum of the speech signal, but it has the same resonance regions as the real speech signal; this is why the residual signal has high intelligibility. The formant amplitudes at the output of the LPC synthesis filter are often smaller than those in the real speech signal, as a result of quantization of the LPC parameters.
In the Code Excited Linear Predictive (CELP) coder, the excitation signal is represented as a vector that is assigned a specific index, i.e. a code.
The optimal vector is selected from a large set of candidate vectors that make up the codebook. The size of the excitation codebook is of critical importance for achieving the required quality of synthesized speech reconstruction.
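The search itself is conceptually a brute-force minimization; a sketch with a random codebook (the sizes and names are illustrative assumptions, and real CELP codebooks are structured so that the search is much cheaper):

```python
import numpy as np

def celp_search(target, codebook):
    """For each candidate vector, fit the optimal gain by least squares
    and keep the index with the smallest reconstruction error.  Only the
    index (the 'code') and the gain need to be transmitted."""
    best_index, best_gain, best_err = -1, 0.0, np.inf
    for index, cand in enumerate(codebook):
        # Least-squares gain; tiny epsilon guards an all-zero candidate.
        gain = np.dot(target, cand) / (np.dot(cand, cand) + 1e-12)
        err = np.sum((target - gain * cand) ** 2)
        if err < best_err:
            best_index, best_gain, best_err = index, gain, err
    return best_index, best_gain
```

With a 1024-entry codebook the transmitted index costs 10 bits per excitation vector; enlarging the codebook improves quality at the cost of rate and search time.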
The code-excited linear prediction method ensures high quality of the speech signal at transmission rates of 4-16 kbit/s.
In comparison with the multi-pulse method, the CELP method achieves higher quality of speech reconstruction at the same transmission rates.
Two federal standards for linear prediction speech coding have been adopted in the USA:
— 1015 (LPC-10E, 2400 bit/s);
— 1016 (E-CELP, 4800 bit/s).
The ITU (International Telecommunication Union) developed the following recommendations:
— G.728 for the LD-CELP algorithm (16 kbit/s);
— G.729 for the CS-ACELP algorithm (8 kbit/s).