From CSULA CS Wiki
Jump to: navigation, search
m (Entropy of the entire data set)
m
Line 14: Line 14:
 
Since there is an equal chance of selecting a '''T''' or an '''F''', we have no information about the data set. The entropy is at its maximal value of 1.
 
Since there is an equal chance of selecting a '''T''' or an '''F''', we have no information about the data set. The entropy is at its maximal value of 1.
  
H(P) = <math>- \sum_{i=1}^N p_i log\,p_i </math>
 
  
 
==First split==
 
==First split==

Revision as of 22:43, 24 February 2011

This is the data set for the restaurant example in this presentation.

Data for the restaurant example

Contents

Entropy of the entire data set

Given a system P with probability distribution P(s) that the system would be in state si with probability pi, the entropy H(P) is define as follows.

H(P) = -∑ pi log pi 
H(P) = - \sum_{i=1}^N p_i log\,p_i 

For the example data there are 12 items and two possible states: T and F. In this case half are T and half are F. The entropy is computed as follows.

H = - ( 6/12 * log2 (6/12) + 6/12 * log2 (6/12) )
  = - 2 * (1/2 * log2 (1/2))
  = - log2 (1/2)
  = log2 2
  = 1

Since there is an equal chance of selecting a T or an F, we have no information about the data set. The entropy is at its maximal value of 1.


First split

Consider Patrons as the first splitting attribute

If we use Patrons to split the data set, there will be three subsets: None, Some, and Full.

 Patrons:None: {F, F}                The two elements with Patrons:None are both F.
 Patrons:Some: {T, T, T, T}          The four elements with Patrons:Some are all T.
 Patrons:Full: {F, T, F, F, F, T}    The six elements with Patrons:Full are half T and half F.

The entropy of these sets are respectively:

Patrons:None: - (0 * log2 0 + 1 * log2 1) = 0
Patrons:Some: - (1 * log2 1 + 0 * log2 0) = 0
Patrons:Full: - (0.333 * log2 (0.333) + 0.667 * log2 (0.667)) = 0.333 * 1.58 + 0.667 * 0.584 = 0.91

The entropy of the tree when divided according to Patrons is the sum of the entropies weighted by the size of each subtree.

2/12 * 0 + 4/12 * 0 + 6/12 * 0.91 = 0.46

Consider Type as the first splitting attribute

If we use Type to split the data set, there will be four subsets: Burger, French, Italian, and Thai.

Type:Burger: {T, F, F, T}
Type:French: {T, F}
Type:Italian: {T, F}
Type:Thai: {F, T, T, F}

The entropy of these sets are respectively:

Type:Burger: - ( 2/4 * log2 (2/4) + 2/4 * log2 (2/4)) = 1
Type:French: - ( 1/2 * log2 (1/2) + 1/2 * log2 (1/2)) = 1
Type:Italian: - ( 1/2 * log2 (1/2) + 1/2 * log2 (1/2)) = 1
Type:Thai: - ( 2/4 * log2 (2/4) + 2/4 * log2 (2/4)) = 1

In other words, splitting the entire data set on the Type attribute gives us no information at all! Each of the Type subsets itself has entropy 1. So the entropy of the resulting tree is still 1:

2/12 * 1 + 2/12 * 1 + 2/12 * 1 + 4/12 * 1 = 1

Consider Hungry as the first splitting attribute

If we use Hungry to split the data set, there will be two subsets: True and False.

Hungry:True: {T, F, T, T, T, F, T}
Hungry:False: {T, F, F, F, F}

The entropy of these sets are respectively:

Hungry:True: - (5/7 * log2 (5/7) + 2/7 * log2 (2/7)) = 0.714 * 0.486 + 0.286 * 1.81 = 0.865
Hungry:False: - (1/5 * log2 (1/5) + 4/5 * log2 (4/5) = 0.2 * 2.32 + 0.8 * 0.32 = 0.72

The entropy of the tree when divided according to Hungry is the sum of the entropies weighted by the size of each subtree.

7/12 * 0.865 + 5/12 * 0.72 = 0.80

First split conclusion

The other attributes also produce trees with higher entropy than Patrons

So the first split is on Patrons.

Since two of the subsets produced by splitting on Patrons are homogeneous, need split only on the third subset (Patrons:Full).

Second split

Consider Hungry as the splitting attribute after Patrons

Patrons:Full & Hungry:True = {F, T, F, T}
Patrons:Full & Hungry:False = {F, F}

The entropy of these sets are respectively:

Patrons:Full & Hungry:True: 1 (since the cases are evenly split)
Patrons:Full & Hungry:False: 0 (since the cases are all F)

The entropy of the Patrons:Full subtree when divided according to Hungry is the sum of the entropies weighted by the size of each subtree.

4/6 * 1 + 2/6 * 0 = 0.667

Consider Rain as the splitting attribute after Patrons

Patrons:Full & Rain:True = {F}
Patrons:Full & Rain:False = {F, T, F, F, T}

The entropy of these sets are respectively:

Patrons:Full & Rain:True : 0 (since there is only one case)
Patrons:Full & Rain:False : - (2/5 * log2 (2/5) + 3/5 * log2 (3/5)) = 0.4 * 1.32 + 0.6 * 0.737 = 0.97

The entropy of the Patrons:Full subtree when divided according to Rain is the sum of the entropies weighted by the size of each subtree.

1/6 * 0 + 5/6 * 0.97 = 0.808

Consider Wait Estimate as the splitting attribute after Patrons

Patrons:Full & Wait Estimate:0-10 = { }
Patrons:Full & Wait Estimate:10-30 = {T, F}
Patrons:Full & Wait Estimate:30-60 = {F, T}
Patrons:Full & Wait Estimate:>60 = {F, F}

The entropy of these sets are respectively:

Patrons:Full & Wait Estimate:0-10 = 0
Patrons:Full & Wait Estimate:10-30 = 1
Patrons:Full & Wait Estimate:30-60 = 1
Patrons:Full & Wait Estimate:>60 = 0

The entropy of the Patrons:Full subtree when divided according to Wait Estimate is the sum of the entropies weighted by the size of each subtree.

0 * 0 + 2/6 * 1 + 2/6 * 1 + 2/6 * 0 = 0.667

Second split conclusion

It appears that Hungry and Wait Estimate are equally good attributes to use to split the cases where Patrons:Full and that none of the other attributes is any better.

Using Wait Estimate is a better choice since it produces subsets that can be each be split with a single attribute. Each of the two remaining non-empty subsets can be split on Type. Interestingly, though, the types are not consistent. In one case Type:Thai results in T; in the other case it results in F. That's why when Type is used to split the subset produced by Hungry:True, it doesn't produce homogeneous subsets.

Patrons:Full & Wait Estimate:10-30 & Type:Thai: {T}
Patrons:Full & Wait Estimate:10-30 & Type:Italian: {F}
Patrons:Full & Wait Estimate:30-60 & Type:Thai: {F}
Patrons:Full & Wait Estimate:30-60 & Type:Burger: {T}

Hungry produces subsets that can't be split with a single attribute. Only one subset (Patrons:Full & Hungry:True) need be split, but apparently no single attribute splits it into homogeneous sets. See the tree on slide 20. In other words, the combination of Wait Estimate and Type is a better splitter than the combination of Hungry and Type even though Wait Estimate and Hungry produce equally good results on their own from an entropy perspective.