Kullback-Leibler Divergence

What’s it about?

Wikipedia describes Kullback-Leibler Divergence KLD:

In probability theory and information theory, the Kullback–Leibler divergence (also information divergence, information gain, relative entropy, KLIC, or KL divergence) is a measure of the difference between two probability distributions $P$ and $Q$. It is not symmetric in $P$ and $Q$. In applications, $P$ typically represents the “true” distribution of data, observations, or a precisely calculated theoretical distribution, while $Q$ typically represents a theory, model, description, or approximation of $P$.

The discrete KLD is defined as

it is the average number of bits that are wasted by encoding events from distribution $P$ based on a approximate distribution $Q$.

So what does that mean?

This is best done with an example. Say we have two distributions $f$ and $g$ both generating the outcomes of repeated coin tosses.

Distribution $f$ produces as many heads as tails. It describes the outcomes for a fair coin.

Distribution $g$ is more likely to produce heads than tails, with a frequency 3 heads out of 4 tosses. It describes a biased coin.

If we were to encode distribution $f$ in terms of $g$, the divergence is calculated by:

If we were to encode distribution $g$ in terms of $f$, the divergence is calculated by:

To compute the numerical values we can use a small Python script (warning the code below is not robust):

import math

f = {'H': 0.5, 'T' : 0.5}
g = {'H' : 0.75, 'T' : 0.25}

def kld(f, g):
    kld = 0
    for i in f.keys():
        kld += f[i] * math.log(f[i] / g[i])
    return kld / math.log(2)

yielding results

kld(f, g)

kld(g, f)

we can see that the KLD is not symmetric and that encoding distribution $f$ in terms of $g$ will require discarding more information. This intuitively makes sense as $f$ is less predictable than $g$.