MPEG-4 vs H.264

Strictly speaking, it is incorrect to contrast MPEG-4 with H.264, since H.264 and MPEG-4 Part 10 are the same thing. For the main part of this article, however, it proved more convenient to use "MPEG-4" to mean only MPEG-4 Part 2.

MPEG-4 structure and a bit of history
Video coding algorithms play an important role in the modern world. They are used for digital presentation, compression, storage, transmission and processing of video information in a variety of systems. Most of these algorithms recently are associated with the activities of two organizations: MPEG (Motion Picture Experts Group), working under the auspices of the International Organization for Standardization (ISO), and VCEG (Video Coding Experts Group), working as part of the International Telecommunication Union (ITU). The first group releases MPEGxx standards (-1, -2, -4, -7, -21), the second creates ITU recommendations Hxx (.261, .263, .263+, .263++, .264). This article will discuss the latest developments of these groups in the field of video coding — the MPEG-4 standard (part 2) and the H.264 recommendation. The latter recommendation is both the MPEG-4 standard (Part 10) and ISO/IEC 14496-10. This unification of the two standards was made possible by the joint work of the MPEG and VCEG groups within the Joint Video Team project.
Table 1 shows the background to the creation of these standards.
Table 1 – ITU Recommendations and MPEG Standards


It should be noted that the core of the MPEG-4 standard (Part 2) was based on the H.263 recommendation.
So, the latest standard in the field of video coding can be called MPEG-4 (part 10), or ISO/IEC 14496-10, or H.264/AVC. The abbreviation AVC here means Advanced Video Coding.
Here are the contents of the remaining parts of the MPEG-4 standard:
Part 1 — Systems: scene description, combining audio, video and service information, synchronization, buffer management, intellectual-property rights management.
Part 3 — Audio coding.
Part 4 — Conformance testing: conditions, procedures, bit streams.
Part 5 — Publicly available reference software implementing the requirements of the standard.
Part 6 — Multimedia delivery protocols.
Part 7 — Optimized video coding software (a technical report, not a normative part).
Part 8 — Mechanisms for carrying MPEG-4 streams over IP networks.
Part 9 — A VHDL reference implementation of MPEG-4 (a technical report, not a normative part).
Part 11 — Scene description mechanism.
Part 12 — ISO base media file format.
Part 13 — Supplements on intellectual-property rights management.
Part 14 — File format for MPEG-4 Part 2.
Part 15 — File format for MPEG-4 Part 10.
Part 16 — Supplements on animation coding.
In the following discussion, in order to reduce confusion, we will refer to the MPEG-4 standard (Part 2) as the MPEG-4 standard, and the MPEG-4 standard (Part 10) as the H.264 standard.

How Pragmatism Conquered Romanticism
Looking at Table 1, we will see that the popular MPEG-2 video coding standard (for example, it is used in DVD) was developed back in 1996. Why was it necessary to develop the MPEG-4 standard? Oh, there were great reasons for this…
First of all, why limit ourselves to encoding rectangular images? Give us images of arbitrary shape! What if we need to encode not only natural images but also synthetic ones, and hybrids of the two? After all, virtual reality is just around the corner! And how can we fail to look after astronomers and other users for whom 8 bits per color component are not enough? And what kind of backward idea is it, anyway, to treat video as a sequence of rectangular still images? Give us an object-based approach! Let's encode interacting objects and three-dimensional surfaces. And if the fashionable wavelet apparatus is not well suited to encoding video, well, let's use it for encoding still images. So what if MPEG-4 is a video standard: wavelets are wavelets, after all.
This is probably roughly how the enthusiasts who created MPEG-4 reasoned, and they developed a truly revolutionary standard. True, the technical description of decoding grew from 17 pages in H.261 to 539 pages in MPEG-4, even though the presentation there is far less detailed. Yet the principles of video coding have not changed in many years; they are only being refined. And the standard defined as many as 19 profiles (in effect, 19 separate decoding algorithms to implement).
But the standard's main (relative) failure was that its creators did not take the needs of the market into account. Few applications require coding of arbitrarily shaped objects, high bit depth, non-standard sampling and other exotica. What users want is to make digital video clips, send them over the network, and use digital television and video-on-demand services. For these applications MPEG-4 was, of course, more efficient than its predecessor, but there were also problems with the licensing status of its technologies.
In the end, two years after the adoption of MPEG-4, the MPEG group joined forces with VCEG and created a new standard, H.264. Its cornerstones are clean licensing and maximum efficiency, achieved by rejecting all of the exotica listed above.

Main characteristics of H.264
The expected areas of application of the H.264 standard are as follows:
broadcasting (cable, cable modem, satellite, DSL, TV);
storage on various media (DVD, magnetic disks);
videoconferencing (ISDN, Ethernet, LAN, DSL, radio networks, mobile networks, modems);
services such as video on demand;
MMS services (DSL, ISDN).
In the H.264 standard, the efficiency of the algorithm is understood as a high degree of video compression with acceptable quality and robustness of the bit stream to transmission errors and losses. The asceticism of H.264, in contrast to MPEG-4, shows in the fact that it provides only three profiles:
Baseline — for videoconferencing;
Extended — for streaming video over the network;
Main — for storing and broadcasting video.
It should be noted that the Extended profile completely covers the Baseline profile, while the Main profile is somewhat to the side.
The H.264 standard implements the following key new technical solutions:
1) To improve prediction:
motion compensation based on small blocks of adaptively adjustable size;
motion compensation accuracy up to ¼ sample;
motion compensation based on one or more reference frames;
independence of the frame display order from the reference frame order;
the ability to use any frame as a reference;
prediction using weighting factors;
directional spatial prediction in intra-frame coding;
in-loop filtering to eliminate blocking artifacts.
2) Other solutions that improve coding efficiency:
small block transformation (4 x 4);
hierarchical block transformation;
fast integer transform algorithms;
arithmetic coding;
context-adaptive entropy coding.
3) To improve noise immunity and flexibility of transmission over various media:
new parameter set structure;
NAL syntactic structure, allowing to abstract network service data from coding service data;
flexible configurable slice size;
arbitrary slice order;
introducing repeating slices into the stream;
data partitioning;
stream switching based on SP/SI synchronization frames.
Like previous video coding standards, H.264 defines three things:
1) the syntax of the bit stream representing the video;
2) the semantics of this stream;
3) the decoding method for reconstructing the video.
That is, the standard defines only the output sequences, but not the principles of constructing a video signal encoder. This allows manufacturers to compete in creating the best encoder.
The video encoding scheme of the H.264 standard generally follows those of previous standards; the main difference is the deblocking filter at the last stage of processing. The encoding algorithm (it is not explicitly described in the standard) consists of four main components:
motion compensation and subtraction of the current frame from the reference frame;
discrete cosine transform (DCT) of the difference frame;
quantization of the transform coefficients;
entropy coding of the quantized coefficients.
Let's consider these components in more detail.

Motion compensation
The high efficiency of H.264 is due to improvements in each of these components. The energy of the difference frame depends on how effectively motion is compensated: the more accurately it is compensated, the lower the energy, and therefore the higher the compression ratio. Motion vectors could be computed for every pixel, but that is too expensive, so they are computed for rectangular blocks. The advantage is that the blocks are rectangular like the image itself, and a transform such as the DCT can be applied afterwards. The disadvantages are obvious: object boundaries rarely coincide with block boundaries, and motion is usually neither purely horizontal nor purely vertical. Nevertheless, at present this is the only practical method.
As the block size increases, the amount of computation decreases and fewer bits are spent on coding motion vectors; at the same time, however, the accuracy of compensation deteriorates and the energy of the difference image grows. There is therefore a trade-off to optimize, and H.264 implements adaptive selection of the block size, from 4 x 4 to 16 x 16 pixels, with motion-vector accuracy down to ¼ pixel (achieved by interpolating the reference frame). If the frames are completely different from each other, motion compensation is not used and intra-frame coding is applied instead.
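The block-matching idea described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the standard's method: real encoders use fast search strategies, sub-pixel refinement and rate-distortion criteria rather than this exhaustive integer-pel search, and all function names here are invented for the example.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def full_search(ref, cur, bx, by, bsize=16, radius=4):
    """Exhaustive block matching: find the motion vector (dx, dy), within
    +/- radius pixels, that minimizes the SAD between the current block
    and a candidate block in the reference frame (integer-pel only)."""
    h, w = ref.shape
    block = cur[by:by + bsize, bx:bx + bsize]
    best, best_cost = (0, 0), sad(block, ref[by:by + bsize, bx:bx + bsize])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and y + bsize <= h and 0 <= x and x + bsize <= w:
                cost = sad(block, ref[y:y + bsize, x:x + bsize])
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
    return best, best_cost
```

SAD is chosen here because it needs no multiplications; encoders commonly use it for the same reason.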

Transformation of the difference frame and quantization
As is known, various orthogonal transformations can be used to transform an image into the spectral domain. The purpose of the transformation is to redistribute the image energy: most of it is concentrated in a small number of coefficients. The most effective transformation in this sense among the fast ones is considered to be the wavelet transformation. It is used in MPEG-4 for encoding still images. However, the wavelet transformation requires more memory (it is necessary to remember the entire frame) and does not fit well with block motion compensation, therefore it is not used for video encoding.
MPEG-4, like MPEG-2 (and also JPEG), uses a DCT with a basic block size of 8 x 8. H.264 uses an integer orthogonal transform over 4 x 4 blocks, which approximates the DCT. As a result, the transform kernel uses only addition, subtraction, and shifts. Subsequent scaling requires one multiplication by a coefficient for each pixel, but this operation can be attributed to further quantization. All arithmetic is 16-bit, i.e., it can be performed on a cheap microcontroller.
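The 4 x 4 core transform can be illustrated in matrix form. This is a sketch for clarity only: the standard's normative form uses just additions, subtractions and shifts and folds the scaling into quantization, while the floating-point inverse below is added here purely to demonstrate exact invertibility.

```python
import numpy as np

# Core matrix of the forward 4x4 integer transform of H.264.
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward4x4(x):
    """W = Cf . X . Cf^T: a DCT-like transform whose kernel needs only
    integer adds, subtracts and shifts; the per-coefficient scaling is
    deferred to the quantizer, as the text explains."""
    return Cf @ x @ Cf.T

def inverse4x4_float(w):
    """Floating-point inverse, for demonstration only. The rows of Cf
    are mutually orthogonal with squared norms (4, 10, 4, 10), so
    X = Cf^T . S^-1 . W . S^-1 . Cf with S = diag(4, 10, 4, 10)."""
    s_inv = np.diag([1 / 4, 1 / 10, 1 / 4, 1 / 10])
    return Cf.T @ s_inv @ w @ s_inv @ Cf
```

Because Cf is an integer matrix, the forward transform is bit-exact on every platform, eliminating the encoder/decoder drift that plagued floating-point DCT implementations.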
The purpose of quantization is to reduce the many coefficient values to a small number of distinct values. This is usually achieved by division and rounding of the result. However, the quantization coefficients in H.264 are chosen so as to avoid computationally expensive division (a multiplication, an addition and a right shift are performed instead).
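The multiply-and-shift trick can be sketched as follows. The step size and shift width here are hypothetical illustrations; H.264's actual multiplication factors are tabulated as a function of the quantization parameter QP.

```python
def quantize(coeff, step=20, shift=16):
    """Division-free quantization: approximate round(coeff / step) with a
    precomputed integer multiplication factor, an addition (the rounding
    offset) and a right shift. The step/shift values are illustrative,
    not the standard's tables."""
    mf = (1 << shift) // step              # precomputed once per step size
    sign = -1 if coeff < 0 else 1
    return sign * ((abs(coeff) * mf + (1 << (shift - 1))) >> shift)
```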
After quantization, the coefficients are reordered. In MPEG-4, this is either a zig-zag scan for 8 x 8 blocks or a zero-tree structure for the wavelet coefficients. In H.264, a zig-zag scan is performed over 4 x 4 blocks.
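The 4 x 4 reordering can be sketched as follows (the scan order below follows the frame-mode zig-zag order commonly cited for H.264; the names are mine):

```python
# Zig-zag scan order for a 4x4 block: low-frequency coefficients come
# first, so the zeros produced by quantization cluster at the end of
# the sequence, where they compress well.
ZIGZAG_4X4 = [(0, 0), (0, 1), (1, 0), (2, 0),
              (1, 1), (0, 2), (0, 3), (1, 2),
              (2, 1), (3, 0), (3, 1), (2, 2),
              (1, 3), (2, 3), (3, 2), (3, 3)]

def zigzag(block):
    """Reorder a 4x4 block (list of lists) into a 1-D coefficient list."""
    return [block[r][c] for r, c in ZIGZAG_4X4]
```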

Entropy coding and bit stream formation
The purpose of entropy coding is to assign shorter codes to more frequently occurring symbol (bit) sequences. In MPEG-4, this is achieved by first applying run-length encoding (RLE) and then variable-length coding based on precomputed Huffman tables. Note that the Huffman coder is sensitive to errors in the transmission channel; in addition, the encoder and decoder must share identical tables.
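The run-length stage can be sketched as follows. This illustrates the principle only, not MPEG-4's exact (run, level, last) syntax:

```python
def run_length(coeffs):
    """Turn a scanned coefficient list into (run-of-zeros, level) pairs,
    the intermediate form fed to the VLC (Huffman) tables. Trailing
    zeros are simply dropped here; a real codec signals them with a
    'last' flag or an end-of-block code."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs
```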
In H.264, depending on the profile, either exponential Golomb codes or a context-adaptive binary arithmetic coder (CABAC) is used. The advantage of the arithmetic coder is a higher compression ratio: a symbol can be encoded with a fractional number of bits, whereas a Huffman coder can only use an integer number of bits.
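The zero-order exponential Golomb codes mentioned above are simple enough to construct in a few lines (a sketch of the unsigned ue(v) code; the function name is mine):

```python
def exp_golomb(n):
    """Zero-order exponential Golomb code for an unsigned integer n:
    write n + 1 in binary and prefix it with (bit length - 1) zeros.
    Smaller (more probable) values thus get shorter codewords."""
    b = bin(n + 1)[2:]
    return "0" * (len(b) - 1) + b
```

Unlike Huffman coding, no tables need to be stored or transmitted: the code is computed on the fly, and the leading zeros make each codeword self-delimiting in the bit stream.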
MPEG-2 can be used as a transport: Part 1 of that standard defines the procedure for combining video, audio and service data into a single stream. Another solution is the Real-time Transport Protocol (RTP); the NAL structure of the H.264 stream is well suited to packet transmission over this protocol. Yet another option is to use MPEG-4 Part 6.

What is better?
Since the introduction of H.264, numerous comparisons have been made between this standard and MPEG-4. The results have typically shown a 1–3 dB PSNR gain for H.264 across a wide range of bit rates. Visually, H.264 video also looks better, largely thanks to the deblocking filter. Here is a typical result:

[Figure: PSNR comparison of MPEG-4 and H.264; image not preserved]

To be fair, for highly textured images the difference is small. A comparison of different H.264 codecs can be found at [1]; in many tests the encoding efficiency of individual H.264 codecs differs by a factor of two or more.
So, MPEG-4 or H.264: in the end, codec efficiency rests on the subtleties of implementation. And there are so many of these subtleties that even a 700-page book [2] will not give you all of them, though I still recommend reading it.

Literature
1. http://compression.ru/video/codec_comparison/
2. L. Hanzo, P. Cherriman, J. Streit. Video Compression and Communications. Second Edition. Wiley, 2007. 704 p.
