H.264 Video Encoding

Images and Color Coding

A single one-hour video at 720p resolution and 30 fps would require roughly 278 GB uncompressed. Since lossless data compression algorithms alone, like DEFLATE (used in PKZIP, Gzip, and PNG), will not reduce the required bandwidth sufficiently, we need to find other ways to compress video (a sequence of images).
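As a back-of-the-envelope check, assuming uncompressed 24-bit RGB (3 bytes per pixel):

$$
1280 \times 720 \times 3\,\text{B} \times 30\,\text{fps} \times 3600\,\text{s} \approx 2.99 \times 10^{11}\,\text{B} \approx 278\,\text{GiB}
$$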

During the lecture we will highlight the main points of the H.264 encoder. There are many encoding standards and open-source implementations (e.g. VP8/VP9); if you understand H.264, you will be able to understand the bulk of the rest.

Figure: Video Coding - Concept

Note that the color coding used in the MPEG standards is YCbCr, which is related to RGB (using the BT.601 coefficients) via

$$
\begin{aligned}
Y &= 0.299R + 0.587G + 0.114B \\
C_b &= 0.564\,(B - Y) \\
C_r &= 0.713\,(R - Y)
\end{aligned}
$$
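As a minimal illustration of these formulas in Python (the function name is ours, not part of any codec API):

```python
def rgb_to_ycbcr(r: float, g: float, b: float):
    """Convert one RGB pixel to YCbCr using the BT.601 coefficients above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    cb = 0.564 * (b - y)                     # blue-difference chroma
    cr = 0.713 * (r - y)                     # red-difference chroma
    return y, cb, cr

print(rgb_to_ycbcr(255, 0, 0))  # pure red: low luma, large positive Cr
```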

Slices and Macroblocks

A picture is divided into slices that are decoded independently of each other.

Figure: Coding unit

The Cr and Cb components of the YCbCr color encoding scheme are stored at half the resolution of the Y component: the human eye is far more sensitive to Y (luminance) than to the two chroma components.

Each slice consists of fixed-size macroblocks (coding units), each covering a rectangular picture area of 16 × 16 samples of the luma component and, in the case of video in 4:2:0 chroma sampling format, 8 × 8 samples of each of the two chroma components.
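Since chroma is stored at half resolution, the encoder first subsamples the Cb and Cr planes. Here is a minimal numpy sketch of 4:2:0 subsampling, assuming simple 2×2 averaging (real encoders may use different filters and chroma siting):

```python
import numpy as np

def subsample_420(chroma: np.ndarray) -> np.ndarray:
    """Halve a chroma plane in both dimensions by averaging 2x2 blocks.
    Assumes even height and width."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

cb = np.random.rand(16, 16)      # chroma plane covering one macroblock
print(subsample_420(cb).shape)   # (8, 8): matches the 8 x 8 chroma samples above
```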

Figure: Macroblock types

All luma and chroma samples of a macroblock are either spatially (intra-frame) or temporally (inter-frame) predicted, and the resulting prediction residual is represented using transform coding. This is shown next.

Figure: I, P and B frames

I, P and B slices/frames. In H.264 the lowest level of granularity is the slice; in other standards you may come across them as frames.

  • An I‑frame (Intra-coded picture) is a complete image, like a JPG or BMP image file.
  • A P‑frame (Predicted picture) holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car’s movements need to be encoded. The encoder does not need to store the unchanging background pixels in the P‑frame, thus saving space. P‑frames are also known as delta‑frames (a toy sketch of this idea follows after this list).
  • A B‑frame (Bidirectional predicted picture) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.
While the terms “frame” and “picture” are often used interchangeably, the term picture is the more general notion: a picture can be either a frame or a field. A frame is a complete image, and a field is the set of odd-numbered or even-numbered scan lines composing a partial image. For example, an HD 1080 picture has 1080 lines (rows) of pixels; an odd field consists of pixel information for lines 1, 3, 5…1079, and an even field has pixel information for lines 2, 4, 6…1080. When video is sent in interlaced-scan format, each frame is sent in two fields: the field of odd-numbered lines followed by the field of even-numbered lines.

P and B frames are also called Inter frames. The order in which the I, P and B frames are arranged is called the Group of Pictures (GOP).
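A toy numpy sketch of the P-frame idea under these definitions, ignoring motion compensation for now; the decoder reconstructs the frame by adding the stored difference to the previous frame:

```python
import numpy as np

prev_frame = np.zeros((4, 4), dtype=np.int16)   # stationary background
curr_frame = prev_frame.copy()
curr_frame[1, 2] = 200                          # the "car" moved into this sample

residual = curr_frame - prev_frame              # what a P-frame conceptually stores
reconstructed = prev_frame + residual           # decoder side
assert (reconstructed == curr_frame).all()
print(np.count_nonzero(residual), "of", residual.size, "samples changed")  # 1 of 16
```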

Each color component of the prediction residual is subdivided into blocks. For all remaining pictures of a sequence, Inter coding is typically used.

Figure: Inter-frame prediction

The encoder tries to find a block similar to the one it is encoding in a previously encoded frame, referred to as a reference frame. This search is done by a block matching algorithm. If the search succeeds, the block can be encoded by a vector, known as a motion vector, which points to the position of the matching block in the reference frame. The process of determining motion vectors is called motion estimation.

Figure: Residual
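Here is a minimal sketch of exhaustive (full-search) block matching, assuming a sum-of-absolute-differences (SAD) cost; real encoders use much faster search patterns and sub-pixel refinement:

```python
import numpy as np

def full_search(ref: np.ndarray, block: np.ndarray, top: int, left: int, radius: int = 4):
    """Find the motion vector (dy, dx), within +/- radius samples, that
    minimizes the SAD between `block` and the reference frame `ref`."""
    bh, bw = block.shape
    best_sad, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bh > ref.shape[0] or x + bw > ref.shape[1]:
                continue  # candidate would fall outside the reference frame
            sad = np.abs(ref[y:y + bh, x:x + bw].astype(int) - block.astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

The residual is then the difference between the block and its motion-compensated prediction; only the motion vector and the residual need to be encoded.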

Transform Coding

Each residual block is transformed using an integer transform. The principle is simple: represent an image as a linear combination of some basis images and specify the linear coefficients ($t_i$). This is shown below:

Figure: Transform coding principle

The Discrete Cosine Transform is used extensively in many video encoders and can be considered a real-valued version of the well-known DFT.

Figure: DCT vs DFT

The basis vectors of, for example, the 1D DCT contain only cosine patterns:

Figure: 8-point 1D-DCT basis functions

Figure: Calculation of the basis vectors for the 4-point 1D-DCT
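A small numpy sketch that computes these basis vectors for any N, assuming the orthonormal DCT-II convention:

```python
import numpy as np

def dct_basis(N: int) -> np.ndarray:
    """Rows of the N-point orthonormal DCT-II matrix: row k samples a
    cosine of frequency k at positions n = 0..N-1."""
    n = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C[0] /= np.sqrt(2.0)          # scale the DC row so the matrix is orthonormal
    return C

C = dct_basis(4)                  # the 4-point case from the figure above
print(np.round(C @ C.T, 6))       # identity matrix: the basis is orthonormal
```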

Quantization

The transform coefficients are quantized and entropy coded.

Figure: Transformation and Quantization

The DCT transform steps minus the entropy coding are demonstrated in the corresponding video encoding workshop.

Note that the actual DCT transform used in H.264 is a scaled approximate version of the one described here.
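To see transform plus quantization end to end, here is a sketch using a plain floating-point 2D DCT via scipy (not the scaled integer transform H.264 actually specifies); rounding the scaled coefficients is the only lossy step:

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(b):  return dct(dct(b, axis=0, norm='ortho'), axis=1, norm='ortho')
def idct2(b): return idct(idct(b, axis=0, norm='ortho'), axis=1, norm='ortho')

block = np.random.randint(-128, 128, (8, 8)).astype(float)  # a residual block
coeffs = dct2(block)

qstep = 16                                  # one uniform quantizer step, for simplicity
levels = np.round(coeffs / qstep)           # quantization: the only lossy step
reconstructed = idct2(levels * qstep)       # dequantize, then inverse transform

print(np.abs(block - reconstructed).max())  # small error that grows with qstep
```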

Entropy Coding

After we have quantized the data (image blocks/slices/frames), we can still compress it losslessly. There are many algorithms for lossless compression. We’re going to briefly experience some of them; for a deeper understanding you can read the amazing book Understanding Compression: Data Compression for Modern Developers.

VLC (variable-length code) coding:

Let’s suppose we have a stream of the symbols a, e, r and t, whose probabilities (from 0 to 1) are given by this table.

| symbol      | a   | e   | r   | t   |
| ----------- | --- | --- | --- | --- |
| probability | 0.3 | 0.3 | 0.2 | 0.2 |

We can assign short unique binary codes to the most probable symbols and longer codes to the least probable ones.

| symbol      | a   | e   | r   | t    |
| ----------- | --- | --- | --- | ---- |
| probability | 0.3 | 0.3 | 0.2 | 0.2  |
| binary code | 0   | 10  | 110 | 1110 |

Let’s compress the stream eat. Assuming we would spend 8 bits per symbol, we would need 24 bits without any compression. If instead we replace each symbol with its code, we can save space.

The first step is to encode the symbol e, which is 10; the second symbol is a, whose code 0 is appended (concatenated, not added arithmetically), giving [10][0]; finally the third symbol t makes our compressed bitstream [10][0][1110], or 1001110, which requires only 7 bits (about 3.4 times less space than the original).

Notice that each code must be a unique prefix code (no code is a prefix of another); the Huffman algorithm can find such codes for you. Though it has some issues, there are video codecs that still offer this method, and it is the algorithm behind many applications that require compression.

Both the encoder and the decoder must know the symbol table with its codes; therefore, you need to send the table too.
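A minimal Python sketch of this scheme, encoding and decoding eat with the table above; the prefix property is what lets the decoder cut the bitstream unambiguously:

```python
code = {"a": "0", "e": "10", "r": "110", "t": "1110"}

def vlc_encode(stream: str) -> str:
    return "".join(code[s] for s in stream)

def vlc_decode(bits: str) -> str:
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:        # prefix property: a match is unambiguous
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

bits = vlc_encode("eat")
print(bits, len(bits))            # 1001110 7
assert vlc_decode(bits) == "eat"
```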

Arithmetic coding:

Let’s suppose we have a stream of the symbols a, e, r, s and t, whose probabilities are given by this table.

| symbol      | a   | e   | r    | s    | t   |
| ----------- | --- | --- | ---- | ---- | --- |
| probability | 0.3 | 0.3 | 0.15 | 0.05 | 0.2 |

With this table in mind, we can build ranges containing all the possible symbols, sorted from most to least frequent.

Figure: Initial arithmetic range

Now let’s encode the stream eat. We pick the first symbol e, which is located within the subrange from 0.3 (inclusive) to 0.6 (exclusive), and we split this subrange again using the same proportions as before, but within this new range.

Figure: Second subrange

Continuing to encode our stream eat, we take the second symbol a, which falls within the new subrange 0.3 to 0.39; then we take our last symbol t, repeat the same process, and arrive at the final subrange 0.354 to 0.372.

Figure: Final arithmetic range

We just need to pick a number within the final subrange 0.354 to 0.372; let’s choose 0.36, though we could choose any number within this subrange. With only this number we will be able to recover our original stream eat. If you think about it, it is as if we were drawing a line within ranges of ranges to encode our stream.

Figure: Final range traversal

The reverse process (a.k.a. decoding) is equally easy: with our number 0.36 and our original range, we run the same process, but now using this number to reveal the stream encoded behind it.

With the first range, we notice that our number falls in the slice for e; therefore, it is our first symbol. Now we split this subrange again, following the same process as before, and notice that 0.36 falls in the slice for a. After we repeat the process, we come to the last symbol t (recovering our original encoded stream eat).

Both the encoder and the decoder must know the symbol probability table; therefore, you need to send the table too.
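A toy floating-point version of the whole scheme in Python; real coders (e.g. CABAC in H.264) do this with integer arithmetic on bits, but the range-narrowing idea is the same:

```python
# Subranges sorted from most to least frequent, as in the figures above.
ranges = {"a": (0.0, 0.3), "e": (0.3, 0.6), "t": (0.6, 0.8),
          "r": (0.8, 0.95), "s": (0.95, 1.0)}

def encode(stream: str):
    low, high = 0.0, 1.0
    for s in stream:
        lo, hi = ranges[s]                 # zoom into this symbol's slice
        low, high = low + (high - low) * lo, low + (high - low) * hi
    return low, high                       # any number in [low, high) encodes the stream

def decode(x: float, n: int) -> str:
    out = []
    for _ in range(n):
        for s, (lo, hi) in ranges.items():
            if lo <= x < hi:
                out.append(s)
                x = (x - lo) / (hi - lo)   # undo the zoom and continue
                break
    return "".join(out)

print(encode("eat"))     # ~(0.354, 0.372), matching the figures
print(decode(0.36, 3))   # eat
```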

Pretty neat, isn’t it? People are damn smart to come up with such a solution; some video codecs use this technique (or at least offer it as an option).

The idea is to losslessly compress the quantized bitstream. This article is, of course, missing tons of details, reasons and trade-offs; as a developer you should learn more. Newer codecs are trying to use different entropy coding algorithms like ANS.

H.264 Bitstream

In this step we pack the compressed frames and their context. We need to explicitly inform the decoder of the decisions taken by the encoder, such as bit depth, color space, resolution, prediction info (motion vectors, intra prediction direction), profile, level, frame rate, frame type, frame number and much more. We can see the bitstream as a protocol; if you want or need to learn more about it, please refer to the ITU H.264 spec. Here’s a macro diagram which shows where the picture data (compressed YUV) resides.

Figure: H.264 bitstream macro diagram
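As a rough illustration of this container level, here is a sketch that splits an Annex B byte stream on 0x000001 start codes and reads each NAL unit’s type (the low 5 bits of the byte after the start code). It ignores 4-byte start codes and emulation-prevention bytes, which a real parser must handle, and the demo bytes are fabricated:

```python
import re

def nal_units(data: bytes):
    """Yield (nal_unit_type, payload) for each 3-byte start code.
    Common types: 7 = SPS, 8 = PPS, 5 = IDR slice, 1 = non-IDR slice."""
    ends = [m.end() for m in re.finditer(b"\x00\x00\x01", data)]
    for start, nxt in zip(ends, ends[1:] + [len(data) + 3]):
        payload = data[start:nxt - 3]      # strip the next unit's start code
        yield payload[0] & 0x1F, payload

# A fabricated two-NAL stream: an SPS-typed unit followed by a PPS-typed one.
demo = b"\x00\x00\x01\x67\xAA" + b"\x00\x00\x01\x68\xBB"
for t, p in nal_units(demo):
    print(t, p.hex())                      # 7 67aa, then 8 68bb
```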

Profiles and Levels

The standard defines several sets of capabilities, which are referred to as profiles, targeting specific classes of applications. These are declared using a profile code (profile_idc) and sometimes a set of additional constraints applied in the encoder. The profile code and indicated constraints allow a decoder to recognize the requirements for decoding that specific bitstream. (And in many system environments, only one or two profiles are allowed to be used, so decoders in those environments do not need to be concerned with recognizing the less commonly used profiles.) By far the most commonly used profile is the High Profile.

H.264 levels specify the maximum data rate and video resolution that a device can play back. The available levels are listed in the ITU-T H.264 specification.