Rate Distortion Theory
Material in this section has been taken from this key paper and introductory material from this.
Mutual Information
Consider the communication system shown below.
Communication system as an analogy for human perception. A perfect communication channel is simply a channel where the output always equals the input.
To give a concrete illustration of how the human perceptual system can be viewed as a communication channel as in the figure above, consider the task of visually judging (perceiving) the size of an object sitting on a table. The true size can be labeled $x$ and characterized by a probability distribution $p(x)$. The different possible values for $x$ define the alphabet for the channel (if $x$ is continuous, then the source alphabet is also infinite). The distribution $p(x)$ might reflect, for example, the fact that it would be unlikely to encounter extremely large objects sitting on a typical-sized table. Due to intrinsic noise in neural coding, it is physically impossible for the brain to represent the size exactly. Hence, the encoded representation is necessarily noisy and error-prone. Consequently, in decoding this neural representation to arrive at a "best guess", this guess $y$ (the output of the channel) would also be error-prone. Hence, the channel itself can be described by a conditional probability distribution $p(y|x)$. Note that in the general case the alphabets for the channel input and output need not be the same. For example, the channel input might consist of a continuous signal, and the channel output a discrete category label or a binary response.
How do we measure the amount of information conveyed by such a channel? Intuitively, the conveyed information can be no greater than the information contained in the source. But if the channel is noisy or error-prone, the information conveyed by the output signal may be less than the entropy of the source. Thus, both the information source and the properties of the channel are relevant for measuring the communication of information. Mathematically, these two aspects are related by the mutual information of the channel input and output.
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the “amount of information” (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable.
$$I(x,y) = H(x) - H(x|y)$$
Mutual information measures the reduction in entropy regarding the channel input after observing the output of the channel. Mutual information depends on the properties of the information source, $p(x)$, as well as on the behavior of the channel, $p(y|x)$. If one were to take a fixed channel defined by $p(y|x)$ and use it to communicate signals from different source distributions, one would find that the mutual information is higher for some information sources and lower for others. The capacity of a channel is given by the maximum of the mutual information over the space of all possible information sources $p(x)$.
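To make this concrete, here is a small numpy sketch (the binary symmetric channel and its crossover probability are illustrative choices, not taken from the source) that evaluates the mutual information of one fixed channel under several source distributions; the maximum over all sources is the channel capacity.

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """Mutual information I(x;y) in bits for discrete alphabets.
    p_x: source distribution, shape (nx,)
    p_y_given_x: channel matrix, shape (nx, ny), rows sum to 1."""
    p_xy = p_x[:, None] * p_y_given_x            # joint distribution p(x, y)
    p_y = p_xy.sum(axis=0)                       # output marginal p(y)
    mask = p_xy > 0                              # skip zero-probability entries
    return np.sum(p_xy[mask] *
                  np.log2(p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask]))

# Binary symmetric channel with crossover probability 0.1 (illustrative value)
eps = 0.1
channel = np.array([[1 - eps, eps],
                    [eps, 1 - eps]])

# The same channel conveys different amounts of information
# depending on the source distribution p(x).
for p1 in [0.5, 0.7, 0.9]:
    source = np.array([1 - p1, p1])
    print(p1, mutual_information(source, channel))

# Capacity = max over sources; for this channel it is 1 - H_b(0.1) ~ 0.531 bits,
# attained by the uniform source.
```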
The capacity of the channel is, however, often lower than the entropy of the information source: the output alphabet of the channel might have fewer symbols than the input alphabet, or the channel might have known mechanistic limitations that limit its fidelity. In any of these cases, the goal of error-free communication must simply be abandoned.
In its place, the goal in such a setting must be to minimize the costs of communication error. To this end, it is necessary to define a cost function, $L(x,y)$. This function specifies the cost associated with the event of an input signal x being conveyed as the value y. An intuitive cost function might be the squared error between the channel input and output: $L = (x-y)^2$.
Channels where the output closely resembles the input will have low cost according to this measure. The distortion, or average cost, associated with a particular information source and channel is given by
$$D = E [L(x,y)] $$
where the expectation is taken over the joint distribution of $x$ and $y$.
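Continuing in the same numpy style, a short sketch of the expected distortion $D = E[L(x,y)]$ under the squared-error cost; the three-symbol source and channel values below are made up for illustration.

```python
import numpy as np

# Source and noisy channel over the alphabet {0, 1, 2} (illustrative values)
p_x = np.array([0.5, 0.3, 0.2])
p_y_given_x = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.1, 0.1, 0.8]])

x_vals = np.arange(3)
y_vals = np.arange(3)

# Squared-error cost L(x, y) = (x - y)^2 for every input/output pair
cost = (x_vals[:, None] - y_vals[None, :]) ** 2

# D = E[L(x, y)] = sum over x, y of p(x) p(y|x) L(x, y)
D = np.sum(p_x[:, None] * p_y_given_x * cost)
print(D)
```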
When a channel has capacity greater than the information rate of the source (the average number of bits communicated per transmission), it is possible to achieve an arbitrarily small error rate. When the channel capacity is lower than the information rate, the occurrence of error is unavoidable. In this case, the goal is not to avoid error, but rather to minimize the expected cost of error.
$$ \min_{p(y|x)} D \quad \text{s.t.} \quad I(x,y) \le C $$
Due to the monotonic relationship between distortion and rate, the above optimization is equivalent to
$$ \min_{p(y|x)} I(x,y) \quad \text{s.t.} \quad D \le D_{th} $$
Sometimes the latter formulation is used, as we will see in the HTTP adaptive streaming section.
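A standard way to solve this constrained minimization numerically is the Blahut-Arimoto algorithm, which traces out the rate-distortion curve by sweeping a Lagrange multiplier. Below is a minimal numpy sketch for a binary source with Hamming distortion; the source probabilities and the swept beta values are illustrative choices. For a uniform binary source the output can be checked against the analytic curve $R(D) = 1 - H_b(D)$ for $0 \le D \le 1/2$.

```python
import numpy as np

def blahut_arimoto(p_x, dist, beta, n_iter=200):
    """Blahut-Arimoto iteration: for a given Lagrange multiplier beta, returns
    one (distortion, rate in bits) point on the rate-distortion curve.
    p_x: source distribution, shape (nx,); dist: cost matrix, shape (nx, ny)."""
    ny = dist.shape[1]
    q_y = np.full(ny, 1.0 / ny)                     # output marginal, init uniform
    for _ in range(n_iter):
        # Optimal test channel for the current output marginal
        q_y_given_x = q_y[None, :] * np.exp(-beta * dist)
        q_y_given_x /= q_y_given_x.sum(axis=1, keepdims=True)
        # Re-estimate the output marginal induced by the source
        q_y = p_x @ q_y_given_x
    joint = p_x[:, None] * q_y_given_x
    D = np.sum(joint * dist)
    R = np.sum(joint * np.log2(q_y_given_x / q_y[None, :]))
    return D, R

# Uniform binary source with Hamming distortion d(x, y) = [x != y]
p_x = np.array([0.5, 0.5])
hamming = np.array([[0.0, 1.0],
                    [1.0, 0.0]])

for beta in [0.5, 1.0, 2.0, 4.0, 8.0]:              # sweep the R-D trade-off
    D, R = blahut_arimoto(p_x, hamming, beta)
    print(f"D = {D:.3f}, R = {R:.3f} bits")          # compare with 1 - H_b(D)
```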
Controlling Quality and Bit Rate
NOTE: The following is partially taken from here.
Encoding with the best recipe is not a simple problem. For example, assuming a 1 Mbps bandwidth, should we stream H.264/AVC at 480p, 720p or 1080p? At 480p, 1 Mbps will likely not exhibit encoding artifacts such as blocking or ringing, but if the member is watching on an HD device, the upsampled video will not be sharp. On the other hand, if we encode at 1080p we send a higher-resolution video, but the bitrate may be too low, so most scenes will contain annoying encoding artifacts.
There are multiple ways to control the coding rate in H.264 encoders. Netflix decides a bitrate ladder on a per-title (and even per-chunk) basis, and the method is treated extensively here.
The method involves three steps:
- Each title in the VOD catalogue is encoded at several resolutions from a finite set: 1920×1080, 1280×720, 720×480, 512×384, 384×288 and 320×240. The encoder bit rates are varied by changing the Constant Rate Factor (CRF) value. The CRF can be set between 0 and 51, where lower values result in better quality at the expense of larger file sizes; higher values mean more compression, but at some point the quality degradation becomes noticeable. (A sketch of such a sweep is shown after this list.)
- The resultant RD curves are used to obtain the convex hull.
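The encoding sweep referenced in the first step can be sketched roughly as follows. This is not Netflix's implementation; it is a minimal Python sketch that shells out to ffmpeg with libx264 (the resolution subset, CRF values and file names are illustrative assumptions), records the resulting bitrate from the file size, and leaves quality measurement (e.g. PSNR or VMAF against the source) as a separate step.

```python
import os
import subprocess

# Illustrative subsets of the resolution ladder and CRF sweep
RESOLUTIONS = [(1920, 1080), (1280, 720), (720, 480)]
CRF_VALUES = [18, 23, 28, 33, 38]

def encode(src, width, height, crf, out):
    """Encode one (resolution, CRF) point with libx264 via ffmpeg."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale={width}:{height}",
        "-c:v", "libx264", "-crf", str(crf),
        "-an",                      # drop audio; we only care about video rate
        out,
    ], check=True)

def sweep(src, duration_s):
    """Return (resolution, crf, bitrate_kbps) points for one title."""
    points = []
    for w, h in RESOLUTIONS:
        for crf in CRF_VALUES:
            out = f"{os.path.splitext(src)[0]}_{h}p_crf{crf}.mp4"
            encode(src, w, h, crf, out)
            kbps = os.path.getsize(out) * 8 / 1000 / duration_s
            points.append(((w, h), crf, kbps))
            # Quality (PSNR/VMAF) of each encode vs. the source would be
            # measured here to build the RD curve for this resolution.
    return points
```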
Example of RD curves and the convex hull (red line)
Principle of forming a Pareto efficient encoding
We can see that each resolution has a bitrate region in which it outperforms the other resolutions. If we collect all these regions from all the available resolutions, they collectively form a boundary called the convex hull. In an economic sense, the convex hull is where the encoding point achieves Pareto efficiency. Ideally, we want to operate exactly on the convex hull, but due to practical constraints (for example, we can only select from a finite number of resolutions), we would like to select bitrate-resolution pairs that are as close to the convex hull as possible.
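Selecting the Pareto-efficient points can be sketched with a small upper-convex-hull computation over the measured (bitrate, quality) points. The helper below and the sample measurements are made up for illustration; any quality metric (PSNR, VMAF) would work as the y-axis.

```python
def rd_convex_hull(points):
    """Upper convex hull of (bitrate_kbps, quality, label) points, i.e. the
    Pareto-efficient encodes across all resolutions."""
    pts = sorted(points)                              # by bitrate, then quality
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    hull = []
    for p in pts:
        # Drop points that fall on or below the chord to the new point
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

# Made-up (bitrate kbps, quality score, resolution) measurements
points = [
    (500,  72, "480p"),  (1000, 80, "480p"),  (2000, 84, "480p"),
    (1000, 78, "720p"),  (2000, 86, "720p"),  (4000, 91, "720p"),
    (2000, 82, "1080p"), (4000, 92, "1080p"), (8000, 96, "1080p"),
]
for bitrate, quality, res in rd_convex_hull(points):
    print(bitrate, quality, res)      # each resolution wins in its own region
```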
NOTE: There is a discrepancy between the online and paper versions of Netflix's technique. The online post talks about varying the bit rate based on the Quantization Parameter (QP), while the paper varies quality/bit rate based on CRF. It is unknown what the Netflix implementation uses at the moment, especially now that they have moved to a per-shot scheme. Within a homogeneous set of frames, such as those belonging to the same shot, there is much less need for rate control, since a very simple coding scheme such as the fixed-quantization-parameter ("fixed QP") mode, supported by virtually all existing video encoders, offers very consistent video quality with almost minimal bitrate variation.
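The per-shot, fixed-QP idea can be sketched as follows. This is not Netflix's pipeline, just a much-simplified Python stand-in assuming ffmpeg with libx264: the shot boundaries are made-up timestamps standing in for an upstream shot-change detector, and each shot is re-encoded at its own constant QP.

```python
import subprocess

# Made-up shot boundaries in seconds (from an upstream shot-change detector)
shots = [(0.0, 4.2), (4.2, 9.8), (9.8, 15.0)]
# One fixed QP per shot, chosen by the per-shot optimizer for this resolution
shot_qps = [24, 30, 27]

def encode_shot(src, start, end, qp, out):
    """Encode a single shot at a constant quantization parameter."""
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start), "-i", src, "-t", str(end - start),
        "-c:v", "libx264", "-qp", str(qp),
        "-an", out,
    ], check=True)

for i, ((start, end), qp) in enumerate(zip(shots, shot_qps)):
    encode_shot("title.mp4", start, end, qp, f"shot_{i}_qp{qp}.mp4")
# The shot encodes would then be concatenated (e.g. with ffmpeg's concat demuxer).
```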
H.264 PSNR performance reported in the literature
What the future holds
Watch the video compression space closely. The machine learning community is trying to develop non-linear transforms for video encoding that replace the DCT, enable content-aware encoding, and, by extension, offer much higher quality at the same bit rate.
Encoder and decoder for a typical image signal $\mathbf{x}$
Johannes Ballé - Learned Image Compression
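As a rough illustration of this direction, here is a minimal PyTorch-flavored sketch of a learned transform-coding loop in the spirit of Ballé et al.: a convolutional analysis transform in place of the DCT, additive uniform noise as a differentiable stand-in for quantization during training, and a rate-distortion loss $D + \lambda R$. The architecture, the $\lambda$ value and the crude rate proxy are all assumptions for illustration; real learned codecs estimate the rate with a trained entropy model (e.g. a hyperprior), which is omitted here.

```python
import torch
import torch.nn as nn

class TinyLearnedCodec(nn.Module):
    """Toy nonlinear transform codec: conv analysis/synthesis transforms."""
    def __init__(self, ch=64):
        super().__init__()
        # Nonlinear analysis transform (plays the role the DCT plays in H.264)
        self.analysis = nn.Sequential(
            nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2),
        )
        # Nonlinear synthesis transform (the learned "inverse" transform)
        self.synthesis = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x):
        y = self.analysis(x)
        if self.training:
            # Additive uniform noise: a differentiable proxy for rounding
            y_hat = y + (torch.rand_like(y) - 0.5)
        else:
            y_hat = torch.round(y)
        return self.synthesis(y_hat), y_hat

def rd_loss(x, x_hat, y_hat, lam=0.01):
    distortion = torch.mean((x - x_hat) ** 2)          # MSE distortion term
    # Crude rate proxy: unit-Gaussian negative log-likelihood of the latent
    # (up to a constant). A real learned codec uses a trained entropy model.
    rate = torch.mean(0.5 * y_hat ** 2)
    return distortion + lam * rate

# Usage sketch on a random "image" batch
model = TinyLearnedCodec()
x = torch.rand(1, 3, 64, 64)
x_hat, y_hat = model(x)
loss = rd_loss(x, x_hat, y_hat)
loss.backward()
```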
https://web.stanford.edu/class/ee368b/Handouts/04-RateDistortionTheory.pdf