Entropy
We will cover the value of information, the entropy of a probability distribution and the cross-entropy between two probability distributions.
In this section we take a closer look into the algorithm block of the learning problem. This block implements the underlying optimization problem that produces the weights in regression and classification settings. Any optimization problem requires an objective function, and it turns out that in the supervised setting (and not only there) a whole theory, information theory, pioneered by Claude Shannon at Bell Labs, guides us in our search for the best objective function.
Value of information: an outcome $x$ that occurs with probability $p(x)$ carries an amount of information

$$I(x) = -\log_2 p(x),$$

measured in bits. An example of such an outcome: low Manhattan-bound traffic flow in the Holland Tunnel on Monday morning.
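As a quick illustration, here is a minimal sketch of the self-information computation; the probabilities `p_high` and `p_low` are hypothetical values chosen only for this example, not figures from the notes.

```python
import numpy as np

def self_information(p):
    """Information content (surprise) of an outcome with probability p, in bits."""
    return -np.log2(p)

# Hypothetical probabilities for the Holland Tunnel example (illustrative only):
# "high" traffic on a Monday morning is very likely, "low" traffic is rare.
p_high, p_low = 0.94, 0.06

print(f"I(high) = {self_information(p_high):.2f} bits")  # ~0.09 bits: little surprise
print(f"I(low)  = {self_information(p_low):.2f} bits")   # ~4.06 bits: much more informative
```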
This can be intuitively understood by comparing two outcomes. For example, consider someone reporting the vehicular traffic flow outside the Holland Tunnel on Monday morning. The result “low” carries much more information than the result “high”, since most people expect horrendous traffic going into Manhattan on Monday mornings. When we want to represent the amount of uncertainty of a whole distribution (e.g. the traffic in the Holland Tunnel over all times) we can take the expectation of the information over all possible outcomes, i.e.

$$H(p) = \mathbb{E}_{x \sim p}\left[-\log_2 p(x)\right] = -\sum_x p(x) \log_2 p(x),$$

and we call this quantity the entropy of the probability distribution $p(x)$.
In the example this comes out to 4.1 bits (the unit is bits because the logarithm is taken base 2).
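To make the entropy definition concrete, the sketch below (not part of the original notes) computes $H(p)$ for a few discrete distributions; the distribution over traffic levels is a made-up example.

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = -sum_x p(x) log2 p(x), in bits. Terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(p[nonzero]))

# Hypothetical distribution over traffic levels (illustrative only)
p_traffic = [0.06, 0.14, 0.80]          # low, medium, high
print(entropy(p_traffic))               # ~0.90 bits: fairly predictable
print(entropy([1/3, 1/3, 1/3]))         # ~1.58 bits: maximal for 3 outcomes
print(entropy([0.0, 0.0, 1.0]))         # 0 bits: deterministic outcome
```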
The plot was produced simply by taking the definition of entropy and applying it to the binary case, $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$, where $p$ is the probability of observing a 1.
As you can see, the entropy is maximal when the outcome is most unpredictable, i.e. when a 1 shows up with the same probability as a 0 (a uniform distribution over the two outcomes). Entropy has widespread implications, as it measures our uncertainty in terms of the length of the message/code we need to communicate the outcome. Taking this to the limit, if we were certain about an event (a deterministic outcome) we would need 0 bits, i.e. we would not need to send any message at all.
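The binary-entropy curve discussed above can be reproduced with a few lines of matplotlib; the following is a sketch based on the standard definition $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$, not the exact script used for the original figure.

```python
import numpy as np
import matplotlib.pyplot as plt

# Binary entropy H(p) = -p log2 p - (1 - p) log2 (1 - p)
p = np.linspace(1e-6, 1 - 1e-6, 500)   # avoid log2(0) at the endpoints
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, H)
plt.xlabel("p (probability of observing a 1)")
plt.ylabel("Entropy H(p) in bits")
plt.title("Entropy of a binary (Bernoulli) distribution")
plt.show()
```

The curve peaks at $p = 0.5$, where $H = 1$ bit, and falls to 0 bits at $p = 0$ and $p = 1$, matching the deterministic limit described above.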