So the amount of information of an event is defined to be the negative logarithm of its probability:

$$I(x) = -\log_2 p(x)$$

Figure 1: How the Information Content changes with event probability.

While this formula does ensure that rarer values mean higher information content, it is not so intuitive how the logarithm is employed here. Why don't we use a simpler formula that also decreases as the probability grows? The answer is that the negative correlation between probability and information is just one of the qualifications: it is necessary, but not enough. The value of $-\log_2 p(x)$ is also the expected number of bits needed to efficiently encode $x$, as we will now demonstrate.

Information Encoding

Encoding, or Compression, refers to the work that takes some data as input and outputs a smaller-size sequence of bits that can represent that data.

Here we have 889 instances; among those, 644 are S, 168 are C, while the remaining 77 are Q. The question is: how can we map each of these 3 values to a bit sequence so that, when we append all 889 sequences, we get the shortest uniquely decodable compression?

For example, if we map S to 101, C to 0, and Q to 11, then the resulting compression will have $644 \times 3 + 168 \times 1 + 77 \times 2 = 2254$ bits. Another encoding scheme is to map S to 0, Q to 11, and C to 10. In this case, the compression has a length of $644 \times 1 + 77 \times 2 + 168 \times 2 = 1134$ bits, which is much better.

Furthermore, the latter encoding is actually equivalent to a Huffman tree, so it is the most efficient encoding we can have.

In general, when we have more different values to encode, the best encoding algorithm maps each value to a bit sequence whose length is approximately the negative logarithm of that value's probability: if a value $x$ occurs with probability $p(x)$, its encoded length is roughly $-\log_2 p(x)$.

As we have just demonstrated, the lower the probability of an event, the higher its information, and also the more bits we use to encode it. Thus, for convenience, we purposely quantify the amount of information by the number of encoding bits.

The Entropy of a set is the average number of bits we need to encode each entry of that set. For the Embarked column, as stated above, we encode S by 0, Q by 11, and C by 10. So what is its entropy? Some may compute the entropy as $H = (1 + 2 + 2) / 3 = 5/3$, but that is not true: since the probability of each event is different, we cannot just take the mean of their bit lengths.

In this case, as S appears 644 times, C 168 times, and Q 77 times, their probabilities are $\frac{644}{889}$, $\frac{168}{889}$, and $\frac{77}{889}$, respectively. Thus, the entropy, or the average number of bits to encode one entry of this Embarked column, is:

$$H = \frac{644}{889} \times 1 + \frac{168}{889} \times 2 + \frac{77}{889} \times 2 = \frac{1134}{889} \approx 1.276 \text{ bits.}$$

In general, since we can approximate the number of bits to encode $x$ by $-\log_2 p(x)$, the formula for entropy is formally defined as:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$

In this case, if we plug the numbers in, we have:

$$H = -\frac{644}{889}\log_2\frac{644}{889} - \frac{168}{889}\log_2\frac{168}{889} - \frac{77}{889}\log_2\frac{77}{889} \approx 1.097 \text{ bits.}$$

From another point of view, we can also call the Entropy the Uncertainty of the data. The intuition is that when the probabilities of the entries are the same (i.e. they are uniformly distributed), the entropy is highest, meaning we know less about the data, or we are more uncertain about it. On the reverse side, when the distribution is skewed, the entropy is lower, and we are more certain about the data.
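To make the encoding comparison concrete, here is a minimal Python sketch that reproduces the 2254-bit and 1134-bit totals and checks the Huffman claim; the function names and the heap-based construction are my own illustration, not code from the original post.

```python
import heapq

# Frequency of each value in the Embarked column, as given above.
counts = {"S": 644, "C": 168, "Q": 77}

def total_bits(code_lengths):
    """Total compressed size: sum over symbols of count * code length."""
    return sum(counts[s] * code_lengths[s] for s in counts)

def huffman_code_lengths(freqs):
    """Compute {symbol: code length} by repeatedly merging the two
    least-frequent subtrees -- the classic Huffman construction."""
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, depths1 = heapq.heappop(heap)
        f2, _, depths2 = heapq.heappop(heap)
        # Joining two subtrees pushes every leaf one level deeper.
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# First scheme: S -> 101 (3 bits), C -> 0 (1 bit), Q -> 11 (2 bits).
print(total_bits({"S": 3, "C": 1, "Q": 2}))   # 2254

# Second scheme: S -> 0 (1 bit), C -> 10 (2 bits), Q -> 11 (2 bits).
print(total_bits({"S": 1, "C": 2, "Q": 2}))   # 1134

# Huffman yields the same lengths (S: 1 bit, C and Q: 2 bits each),
# confirming that the second scheme is optimal.
print(huffman_code_lengths(counts))
```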
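Likewise, a short sketch of the two entropy figures above; again this is only an illustration of the formulas in the text:

```python
from math import log2

counts = {"S": 644, "C": 168, "Q": 77}
total = sum(counts.values())  # 889 entries

# Average bits per entry under the code S -> 0, C -> 10, Q -> 11.
avg_bits = (644 * 1 + 168 * 2 + 77 * 2) / total
print(round(avg_bits, 4))   # 1134 / 889 ≈ 1.2756

# Formal entropy: H = -sum over x of p(x) * log2(p(x)).
entropy = -sum((c / total) * log2(c / total) for c in counts.values())
print(round(entropy, 4))    # ≈ 1.0969
```

The code-length average (≈ 1.276 bits) sits a little above the entropy (≈ 1.097 bits) because a prefix code must assign a whole number of bits to every symbol, while the entropy is the lower bound that fractional-bit schemes such as arithmetic coding can approach.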