Speech transmission in modern communication networks.
Bykov Sergey Fedorovich
Shalimov Igor Anatolyevich, Candidate of Technical Sciences
At the turn of two centuries, we are witnessing the embodiment of the famous phrase: “who owns information owns the world”. Information is knowledge, it is money, it is the ability to manage people, it is power. It is impossible to overestimate the role and importance of information in modern society. The possibilities inherent in information can be realized by ensuring its correct use or, figuratively speaking, movement. To ensure this movement, humanity has created a variety of institutions, technical complexes that regulate and provide the ability to communicate within society. The types of this communication are extremely extensive, but, touching on the technical side of the issue and considering communication as the transmission of information over a distance, it is necessary to single out telephone communication from all the diversity.
Telephone communication, with such characteristics as efficiency, recognizability of the speaker's voice, and the possibility of two-way exchange, is the most attractive. For voice transmission over digital communication channels, various methods of encoding speech signals are used, ranging from direct conversion into code to complex mathematical algorithms based on linear prediction and code excitation. The current level of development of information transmission systems and networks opens up new opportunities for further improvement of speech encoding methods, which should significantly increase the efficiency of both existing and newly commissioned channels and paths.
Speech over networks can be transmitted in several ways.
The first is constant-rate speech transmission. This is the traditional transmission method. It can be used in both circuit-switched and packet-switched networks. It provides sufficient quality for most applications and acceptable transmission delays.
The second is variable-rate speech transmission over networks that support variable-rate streams. Such networks are highly efficient when they use code-division multiple access (CDMA), which is well suited to variable-rate information flows ([1], [2]). The most flexible system includes a variable rate control unit (VRCU), designed to distribute the channel bandwidth optimally among the different information sources (see Fig. 1).
Each information source (speech, video, data, and control signals) produces an information flow r_i(n), which is a function of the frame number n and enters the channel coding block. The resulting bit flows, each with its own redundancy and a variable rate R_i(n), are then multiplexed into an output bit flow of rate R_tot(n) with dynamically allocated bandwidth.
VRCU ensures optimal distribution of communication channel resources, for which it analyzes:
- information source needs,
- system needs,
- user requirements,
- communication channel capabilities.
Thus, on the one hand, modern communication systems and networks are able to manage variable-rate flows, and, on the other hand, there is a need to use communication channel capacity efficiently. This makes it important, firstly, to develop variable-rate speech coding algorithms and, secondly, to modify existing algorithms for the transition to a variable transmission rate.
Fig. 1. Illustration of the variable-rate flow management process.
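The multiplexing constraint described above (the per-source rates R_i(n) must fit into the channel rate R_tot(n) for each frame n) can be illustrated with a minimal sketch. The function name `allocate_rates`, the example figures, and the proportional-scaling policy are assumptions made for illustration only; they are not the VRCU algorithm of [1], [2].

```python
def allocate_rates(requested_kbps, total_kbps):
    """Split the channel rate R_tot(n) among sources for one frame n.

    requested_kbps maps a source name to the rate R_i(n) it asks for.
    Proportional scaling is an illustrative policy only (an assumption).
    """
    demand = sum(requested_kbps.values())
    if demand <= total_kbps:
        # Everything fits: grant each source what it requested.
        return dict(requested_kbps)
    # Demand exceeds capacity: shrink every request by the same factor.
    scale = total_kbps / demand
    return {name: rate * scale for name, rate in requested_kbps.items()}


# Hypothetical requests for one frame, in kbit/s.
frame_request = {"speech": 8.0, "video": 48.0, "data": 16.0, "control": 0.8}
granted = allocate_rates(frame_request, total_kbps=64.0)
```

A real control unit would also weigh the system needs, user requirements, and channel capabilities listed above rather than scaling all sources equally.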
The high speech quality achieved in multipulse and code excitation algorithms at relatively low transmission rates has ensured their widespread use in various communication systems and networks. When attempting to transmit speech using these methods at rates of 4 kbit/s and lower, the speech quality decreases. One of the possibilities for further reducing the transmission rate while maintaining high quality is the transition to a variable transmission rate that takes into account the information redundancies of individual segments of speech signals. It is well known ([3], [4]) that the information required for an accurate representation of a speech signal changes over time. This is the basis for the development and application of variable transmission rates in speech coding technology.
Currently, there are several approaches to constructing variable-rate speech coders. They are based on classifying speech signal segments according to a certain feature and using different coding systems on different segments.
One approach is based on phonetic classification of speech segments. This method is used in developments by Fujitsu laboratories, Rockwell International Corporation, Hughes Aircraft, Qualcomm, and others ([5]). The general structure of such an encoder is shown in Figure 2. The purpose of the classification is to identify several phonetic categories that correspond to different levels of speech signal entropy and are therefore suited to variable-rate coding.
Fig. 2. Block diagram of a variable bit rate encoder.
Phonetic classification is performed on segments of the speech signal and controls the selection of the appropriate coding system for a given segment. The classification proposed in [5] is based on the voiced/unvoiced feature. The main analysis interval of 20 ms is divided into four subsegments of 5 ms each, and each subsegment is labeled voiced or unvoiced.
Each segment is then classified by the types of its subsegments in accordance with Table 1.
Table 1: Classification of speech segments.
Class | Subsegment 1 | Subsegment 2 | Subsegment 3 | Subsegment 4 |
U | Unvoiced | Unvoiced | Unvoiced | Unvoiced |
UO | Unvoiced | Unvoiced | Undefined | Undefined |
OV | Undefined | Undefined | Voiced | Voiced |
V | Voiced | Voiced | Voiced | Voiced |
Phonetic classification acts as a preprocessor that determines which encoding algorithm should be used for the selected segment of speech.
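The lookup defined by Table 1 can be sketched as follows. This is a minimal illustration, not the classifier of [5]: treating "undefined" as a wildcard, the order in which rows are tested, and returning `None` for patterns that match no row are all assumptions, as are the names `PATTERNS` and `classify_segment`.

```python
U, V = "U", "V"  # unvoiced / voiced decision for one 5 ms subsegment

# Rows of Table 1. None marks an "undefined" position, treated here as a
# wildcard (an assumption). Row order sets precedence when rows overlap.
PATTERNS = [
    ("U", (U, U, U, U)),
    ("V", (V, V, V, V)),
    ("UO", (U, U, None, None)),
    ("OV", (None, None, V, V)),
]


def classify_segment(subsegments):
    """Return the Table 1 class of a 20 ms segment, or None if no row matches."""
    if len(subsegments) != 4:
        raise ValueError("expected four 5 ms subsegment decisions")
    for name, pattern in PATTERNS:
        if all(p is None or p == s for p, s in zip(pattern, subsegments)):
            return name
    return None
```

For example, `classify_segment([U, U, V, U])` yields `"UO"`, and the returned class would then select the coding algorithm for that segment.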
Fig. 3. Block diagram of a variable-rate coder with phonetic classification. The dotted lines indicate the blocks whose operation depends on the phonetic type.
Figure 3 shows how the phonetic information classifier controls various components of the encoder and decoder, namely, LPC analysis, structure and size of excitation codebooks, and weighting filter. The figure does not show the adaptive postfilter of the decoder excitation signal (LTP synthesizer), which is used only for processing voiced segments.
The excitation signal selection block operates differently on voiced, unvoiced, and undefined segments. This is reflected in the sizes of the codebooks used, gain coding, and, in addition, the parameters of the long-term prediction (LTP) of the excitation signal are calculated for voiced segments.
As studies have shown ([5]), the output speech of such an encoder sounds natural and is free from the singing or reverberant distortions typical of low-rate CELP encoders. Speech quality was assessed both for a clean speech signal and for speech corrupted by noise similar to vehicle noise. The average rate was below 3 kbps, while the quality was no worse than that of the US Federal Standard 1016 CELP codec, which has a fixed rate of 4.8 kbps.
Another approach to exploiting the redundancy of speech signals, and thus to building a variable-rate coder, was applied in developments carried out under scientific research programs funded by the Commission of the European Communities. The resulting coder, FVR-CELP (Fast Variable Rate CELP Coder), provides high quality at an average rate of about 6 kbit/s and a peak rate of 16 kbit/s. Development focused on the segment classification algorithm and on limiting the algorithmic delay to 10 ms ([1]).
Fig. 4. Block diagram of the FVR-CELP encoder.
The encoder analyzes 10 ms segments (80 samples); each segment is divided into four 2.5 ms subsegments (20 samples). The analyzer identifies parameters related to both the entire segment and its subsegments. The choice of the encoder algorithm and, accordingly, the change in the transmission rate are controlled by two classification blocks (see Fig. 4):
- the first block classifies the segment directly from the speech signal,
- the second uses a closed-loop algorithm, the analysis-by-synthesis method.
The direct classifier analyzes a speech segment based on the pause-speech feature and, for speech, on the voiced-unvoiced feature.
In total, the FVR coder provides 8 operating modes, obtained by applying closed-loop classification to the selected segment types (see Fig. 5).
The FVR coder is structured as a multi-level CELP algorithm that includes a short-term (ST) analyzer, a long-term (LT) analyzer, and fixed excitation codebooks A and B.
Fig. 5. Block diagram of the algorithm for selecting the operating rate.
The comfort noise block uses different analysis algorithms to fill pauses with one of three types of signal: zero (1), a random noise signal (2), or a signal close in form to the original (3) (see Table 2).
Table 2: Coding categories and transmission rates.
Parameters | Coding category |
 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Signal gain | + | + | ||||||
Short-term (ST) analyzer parameters | + | + | + | + | + | + | ||
Long-term (LT) analyzer parameters | + | + | + | |||||
Codebook A | + | + | + | + | ||||
Codebook B | + | + | ||||||
Transmission rate (kbit/s) | 0 | 0.4 | 3.2 | 8.5 | 12.5 | 7.2 | 12 | 16 |
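The rates in Table 2 can be converted into bits per 10 ms segment, and an average rate can be computed from a sequence of per-segment category decisions. The sketch below is illustrative arithmetic only; the 8 kHz sampling rate is inferred from the stated 80 samples per 10 ms, and the function names are hypothetical.

```python
SEGMENT_MS = 10        # FVR-CELP analysis segment length
SAMPLE_RATE_HZ = 8000  # implied by 80 samples per 10 ms (an inference)

# Table 2: coding category -> transmission rate in kbit/s
CATEGORY_RATE_KBPS = {
    1: 0.0, 2: 0.4, 3: 3.2, 4: 8.5,
    5: 12.5, 6: 7.2, 7: 12.0, 8: 16.0,
}


def bits_per_segment(category):
    """Bits sent for one 10 ms segment in the given category."""
    # kbit/s multiplied by ms gives bits.
    return round(CATEGORY_RATE_KBPS[category] * SEGMENT_MS)


def average_rate_kbps(categories):
    """Mean rate over a sequence of per-segment category decisions."""
    return sum(CATEGORY_RATE_KBPS[c] for c in categories) / len(categories)


samples_per_segment = SAMPLE_RATE_HZ * SEGMENT_MS // 1000
```

A conversation in which half of the segments are silent pauses (category 1) and half use the peak mode (category 8) would average 8 kbit/s, which shows how pauses pull the mean rate well below the 16 kbit/s peak.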
The “quality evaluation” block belongs to the closed-loop classifier and brings in additional excitation signal analysis tools (see Table 2) whenever the weighted prediction error exceeds a certain threshold. As a result, more parameters are transmitted for the segment, which increases the transmission rate.
The transmission rate is selected so that the quality of the reconstructed speech stays constant from segment to segment. As a result, the algorithm has proved robust to various environmental conditions and to different speakers. Formal subjective quality tests using the paired comparison method confirmed that this codec can provide quality close to that of the G.728 standard ([1]).
QUALCOMM Incorporated has developed a variable-rate coder algorithm implemented as a single chip, the Q4401 [6]. The Q4401 coder meets the speech compression requirements of digital telephone systems, speech storage systems, and speech synthesis. The software-implemented QUALCOMM Codebook Excited Linear Predictive (QCELP) algorithm ensures high speech quality at low data rates.
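The threshold test described above can be sketched as a search over candidate modes ordered by increasing rate. This is a simplified illustration under stated assumptions: the weighted errors are taken as given inputs, the grouping of candidates into one escalation chain is hypothetical, and `choose_rate` is not a name from [1].

```python
def choose_rate(candidates, threshold):
    """Return (rate_kbps, error) for the cheapest candidate meeting the threshold.

    candidates: (rate_kbps, weighted_error) pairs ordered by increasing rate.
    Falls back to the last (highest-rate) candidate when none qualifies.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    for rate, error in candidates:
        if error <= threshold:
            return rate, error
    return candidates[-1]


# Hypothetical weighted errors for three modes of increasing rate.
modes = [(7.2, 0.9), (12.0, 0.4), (16.0, 0.2)]
selected = choose_rate(modes, threshold=0.5)
```

In an actual analysis-by-synthesis coder the error of each mode would be measured by synthesizing the segment, so higher-rate modes would only be evaluated when the cheaper ones fail the test.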
The Q4401 encodes speech in fixed or variable bit rate mode. In fixed rate mode, the Q4401 can encode speech at 4 kbps, 4.8 kbps, 8 kbps, or 9.6 kbps. In variable rate mode, the Q4401 automatically adjusts the bit rate every 20 ms within the range of 800 bps to 8 kbps (normal variable rate mode) or 800 bps to 9.6 kbps (enhanced variable rate mode). In variable rate mode, the Q4401 provides an average rate of 7 kbps in continuous speech applications and 3.5 kbps in typical two-way voice communications, without significant degradation in speech quality.
The Q4401 encoder operates on a 20 ms time interval (160 samples). The encoder's algorithm is based on the CELP method. The speech encoding process includes: measuring the speech signal energy, determining the encoding algorithm and, accordingly, the data rate, dynamically adjusting the frequency limits, and encoding speech into compressed data blocks. The encoder sends a 25-byte data block to the processor every 20 ms. Each encoded packet contains one byte that specifies the data rate and 24 bytes of data that contain the encoded speech. The number of information bits in the block depends on the selected data rate; the remaining bits of the 24-byte data field are filled with zeros.
The coding algorithm (in variable rate mode) for each 20 ms speech segment is selected depending on the signal energy in that segment. If the signal energy is high, the maximum rate is used. If the signal energy is average, an intermediate rate is used. If the signal energy is low, a data rate of 800 bps is used. The average rate for a normal telephone conversation is about 6 kbps, and the quality is close to that of the G.728 standard.
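The 25-byte block layout described above can be sketched as follows. The value placed in the rate byte is a placeholder, since the text does not give the Q4401 rate-byte encoding, and the names `info_bits` and `build_packet` are hypothetical.

```python
PACKET_BYTES = 25  # one rate byte plus 24 data bytes
DATA_BYTES = 24
FRAME_MS = 20


def info_bits(rate_bps):
    """Information bits produced in one 20 ms frame at the given rate."""
    return rate_bps * FRAME_MS // 1000


def build_packet(rate_code, payload, rate_bps):
    """Assemble a 25-byte block: rate byte, payload, zero padding.

    rate_code is a placeholder value (an assumption); the real rate-byte
    encoding is not given in the text.
    """
    needed = (info_bits(rate_bps) + 7) // 8  # payload bytes at this rate
    if len(payload) != needed:
        raise ValueError(f"expected {needed} payload bytes, got {len(payload)}")
    padding = bytes(DATA_BYTES - needed)  # zero fill for the unused part
    return bytes([rate_code]) + payload + padding


packet = build_packet(rate_code=1, payload=b"\xab\xcd", rate_bps=800)
```

At 800 bps only 16 information bits are produced per frame, so 22 of the 24 data bytes are zero padding; at 9.6 kbps the 192 information bits fill the data field completely.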
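The energy-driven choice of rate can be sketched with two thresholds. The threshold values, the mean-square energy measure, and the 4000 bps intermediate rate are assumptions chosen for illustration; the text states only that high, average, and low energy map to the maximum rate, an intermediate rate, and 800 bps.

```python
def frame_energy(samples):
    """Mean-square energy of one 20 ms frame (160 samples at 8 kHz)."""
    return sum(s * s for s in samples) / len(samples)


def select_rate_bps(energy, low_threshold, high_threshold,
                    intermediate_bps=4000, max_bps=8000):
    """Map frame energy to a data rate in bps.

    Both thresholds and the intermediate rate are hypothetical parameters.
    """
    if energy >= high_threshold:
        return max_bps
    if energy >= low_threshold:
        return intermediate_bps
    return 800  # lowest rate, used for low-energy frames


# Hypothetical frames: silence, moderate signal, loud signal.
rates = [
    select_rate_bps(frame_energy(frame), low_threshold=100.0, high_threshold=10000.0)
    for frame in ([0] * 160, [50] * 160, [1000] * 160)
]
```

Passing `max_bps=9600` would correspond to the enhanced variable rate mode mentioned earlier.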
Thus, at present there are several speech coding systems based on variable bit rate. These systems, using various speech characteristics to classify segments, are based on CELP algorithms. Based on the current state and prospects for the development of communication systems and networks, it can be argued that the approach to speech coding with variable bit rate will develop and become widespread. In [7] it is noted that “variable bit rate speech is an inevitable direction of development of future generations of digital networks”.
The use of complex variable-bit-rate speech coding algorithms saves channel bandwidth and increases the efficiency of communication systems and networks. Such algorithms are the basis for the creation and development of stochastic transmission systems that take into account the statistical features of the transmitted information.
REFERENCES
1. Cellario L., Sereno D. CELP Coding at Variable Rate. ETT, Vol. 5, No. 5, September-October 1994, pp. 603-613.
2. Berutto E., Sereno D. Variable-rate for the basic speech service in UMTS. VTC, Secaucus NJ, 1993, pp. 520-523.
3. Vocoder telephony. Methods and problems. Ed. A.A. Pirogov. M.: Communication, 1974. 536 p.
4. Mikhailov V.G., Zlatoustova L.V. Measuring speech parameters. Ed. M.A. Sapozhkov. M.: Radio and Communications, 1987. 168 p.
5. Paksoy E., Srinivasan K., Gersho A. Variable Bit-Rate CELP Coding of Speech with Phonetic Classification. ETT, Vol. 5, No. 5, September-October 1994, pp. 591-602.
6. Q4401 Variable Rate Vocoder. General Description. QUALCOMM Incorporated, ASIC Products, 6455 Lusk Boulevard, San Diego, 1997.
7. Gersho A., Paksoy E. Variable rate speech coding for cellular networks. In: Speech and Audio Coding for Wireless and Network Application. Kluwer Academic Publishers, 1993, pp. 77-84.