
Theoretical comparison between the Gini Index and Information Gain criteria

Knowledge Discovery in Databases (KDD) is an active and important research area with the promise of a high payoff in many business and scientific applications. One of the main tasks in KDD is classification. A particularly efficient method for classification is decision tree induction. The selection of the attribute used at each node of the tree to split the data (the split criterion) is crucial in order to correctly classify objects. Different split criteria have been proposed in the literature (Information Gain, Gini Index, etc.). It is not obvious which of them will produce the best decision tree for a given data set. A large number of empirical tests have been conducted in order to answer this question, with no conclusive results. In this paper we introduce a formal methodology which allows us to compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to present a formal description of how to select between split criteria for a given data set. As an illustration we apply the methodology to two widely used split criteria: Gini Index and Information Gain.

Keywords: decision trees, classification, Gini Index, Information Gain, theoretical comparison

1. Introduction

Early work in the field of decision tree construction focused mainly on the definition and on the realization of classification systems. All of them use different measures of impurity/entropy/goodness to select the split attribute in order to construct the decision tree.

Once a certain number of algorithms had been defined, a lot of research was dedicated to comparing them. This is a relatively difficult task, as the systems evolved from different backgrounds: information theory, discriminant analysis, encoding techniques, etc. These comparisons have been predominantly empirical. Baker and Jain reported experiments comparing eleven feature evaluation criteria and concluded that the feature rankings induced by the various rules are very similar. Several feature evaluation criteria were compared on simulated data by Ben-Bassat, on a sequential, multi-class classification problem. The conclusions are that no feature selection rule is consistently superior to the others, and that no specific strategy for alternating different rules is significantly more effective. Mingers compared several attribute selection criteria and concluded that tree quality does not seem to depend on the specific criterion used. Babic compared ID3 and CART on two clinical diagnosis problems. Miyakawa compared three activity-based measures, both analytically and empirically. Several researchers pointed out that Information Gain is biased towards attributes with a large number of possible values. Mingers compared Information Gain and the $\chi^2$-statistic for growing the tree as well as for stopping the splitting, and concluded that the correction reduced Information Gain's bias towards multi-valued attributes. Quinlan suggested Gain Ratio as a remedy for the bias of Information Gain. Mantaras argued that Gain Ratio has its own set of problems, and suggested an information-theoretic distance between partitions for tree construction. White and Liu presented experiments concluding that Information Gain, Gain Ratio and Mantaras' measure are worse than a $\chi^2$-based statistical measure in terms of their bias towards multi-valued attributes.
Gama tried to predict the error rate of a particular classification algorithm and indicated that no single method can be considered better than the others; about twenty different algorithms were evaluated on more than twenty different data sets. Kononenko pointed out that Minimum Description Length based feature evaluation criteria have the least bias towards multi-valued attributes. Elsewhere, twenty-two decision tree and two neural network algorithms were compared in terms of classification accuracy, training time, and number of leaves, and the Gini Index, Information Gain, and a new family of split functions were tested on data sets of different sizes. Other authors proposed a measure for the distance between the biases of two evaluation metrics and gave numerical approximations of it.

However, a thorough understanding of the behavior of the split functions demands an analytical and direct comparison between them, without using any other external measure. Our contribution in this paper is to introduce a formal methodology, which allows us to analytically compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to present a formal description of how to select between split criteria for a given dataset. As an illustration we apply the methodology to two widely used split criteria: Gini Index and Information Gain.

3. The Gini Index and Information Gain criteria

The objects are classified by decision trees, which sort them down from the root to some leaf node that provides the classification (the class) of each object. A decision tree contains zero or more internal nodes and one or more leaf nodes. The internal nodes have two or more child nodes. Each nonterminal node contains a split which specifies a test based on a single attribute, and each branch descending from that node corresponds to one of the possible values for this attribute. Each leaf node has its class label. A leaf node is said to be pure if all of its training examples belong to the same class.

Thus, an example is classified by starting with the root node, testing the attribute corresponding to the root node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process continues until the example reaches a leaf node.
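As a minimal sketch, this top-down classification procedure can be written as follows (the node layout and the attribute names are illustrative, not taken from any particular system):

```python
def classify(node, example):
    """Walk from the root to a leaf, following the branch that
    matches the example's value for each tested attribute."""
    while "label" not in node:              # internal node: test an attribute
        value = example[node["attribute"]]
        node = node["children"][value]      # descend the matching branch
    return node["label"]                    # leaf node: return its class

# A toy tree: the root tests "outlook"; one branch tests "windy".
tree = {
    "attribute": "outlook",
    "children": {
        "sunny": {"label": "no"},
        "rain": {"attribute": "windy",
                 "children": {True: {"label": "no"},
                              False: {"label": "yes"}}},
    },
}

print(classify(tree, {"outlook": "rain", "windy": False}))  # yes
```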

Binary tree classifiers are constructed by repeatedly splitting subsets of the learning set $X$ into two descendant subsets, beginning with $X$ itself. To split $X$ into smaller and smaller subsets, the splits have to be selected in such a way that the descendant subsets are always "purer" than their parents. This motivated the introduction of the "goodness of split" criterion, which is derived from the notion of an impurity function.

An impurity function is a function $\phi$ defined on the set of all $J$-tuples of numbers $(p_1, \ldots, p_J)$ satisfying $p_j \ge 0$ for $j = 1, \ldots, J$ and $\sum_{j=1}^{J} p_j = 1$, with the following properties:

(a) $\phi$ achieves its maximum at the point $(1/J, 1/J, \ldots, 1/J)$;

(b) $\phi$ achieves its minimum at the points $(1, 0, \ldots, 0)$, $(0, 1, \ldots, 0)$, \ldots, $(0, 0, \ldots, 1)$;

(c) $\phi$ is a symmetric function of $p_1, \ldots, p_J$.

Given an impurity function $\phi$, the impurity measure of any node $t$ is defined by

$i(t) = \phi\bigl(p(1 \mid t), p(2 \mid t), \ldots, p(J \mid t)\bigr),$ where $p(j \mid t)$ denotes the proportion of examples in node $t$ belonging to class $j$.

If a split $s$ in a node $t$ divides all examples into two subsets $t_L$ and $t_R$ of proportions $p_L$ and $p_R$, the decrease of impurity is defined as

$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$

The goodness of split $s$ in node $t$, $\varphi(s, t)$, is defined as $\varphi(s, t) = \Delta i(s, t)$.

If a test $T$ is used in a node $t$ and this test is based on an attribute having $n$ possible values, the expressions defined before are generalized as follows:

$\Delta i(T, t) = i(t) - \sum_{i=1}^{n} p_i\, i(t_i)$,

$\varphi(T, t) = \Delta i(T, t)$.
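These definitions translate directly into code. In the following sketch the class proportions $p(j \mid t)$ are estimated from the training examples reaching a node, and the impurity function $\phi$ is passed in as a parameter (all helper names are illustrative):

```python
from collections import Counter

def impurity(labels, phi):
    """i(t) = phi(p(1|t), ..., p(J|t)), with the class proportions
    estimated from the examples reaching node t."""
    n = len(labels)
    return phi([c / n for c in Counter(labels).values()])

def impurity_decrease(labels, partition, phi):
    """Delta i(T, t) = i(t) - sum_i p_i * i(t_i), where the test T
    splits the examples into the subsets of `partition`."""
    n = len(labels)
    return impurity(labels, phi) - sum(
        len(subset) / n * impurity(subset, phi) for subset in partition)

# Using the Gini impurity phi(p) = 1 - sum_j p_j^2 on a pure split:
gini_phi = lambda p: 1.0 - sum(q * q for q in p)
labels = ["a", "a", "b", "b"]
print(impurity_decrease(labels, [labels[:2], labels[2:]], gini_phi))  # 0.5
```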

Breiman adopts in his work the Gini diversity index, which has the following form:

$\mathrm{Gini}(t) = \sum_{j \ne i} p(j \mid t)\, p(i \mid t) = 1 - \sum_{j=1}^{J} p^2(j \mid t)$ (1)

In a node $t$, an impurity function based on the Gini Index criterion assigns a training example to a class $i$ with probability $p(i \mid t)$. The estimated probability that the item is actually in class $j$ is $p(j \mid t)$. Therefore, the estimated probability of misclassification under this rule is the Gini Index:

$\sum_{j \ne i} p(j \mid t)\, p(i \mid t) = 1 - \sum_{j=1}^{J} p^2(j \mid t)$.

This function can also be interpreted in terms of variance. In a node $t$ we assign to all examples belonging to class $j$ the value $1$, and to all other examples the value $0$. The sample variance of these values is $p(j \mid t)\bigl(1 - p(j \mid t)\bigr)$. There are $J$ classes, so the corresponding variances are summed together:

$\sum_{j=1}^{J} p(j \mid t)\bigl(1 - p(j \mid t)\bigr) = 1 - \sum_{j=1}^{J} p^2(j \mid t)$.

Having a test $T$ with $n$ outcomes, the goodness of the split is expressed using the Gini Index as follows:

$\Delta\mathrm{Gini}(T, t) = \mathrm{Gini}(t) - \sum_{i=1}^{n} p_i\, \mathrm{Gini}(t_i)$ (2)

The Gini Index criterion selects a test that maximizes this function.
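A minimal sketch of this selection rule, with the Gini Index of equation (1) and the goodness of split of equation (2) estimated from toy training data (the attribute and class names are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini(t) = 1 - sum_j p(j|t)^2, as in equation (1)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_goodness(rows, labels, attribute):
    """Equation (2): Gini(t) - sum_i p_i * Gini(t_i), where the test
    partitions the examples by their value of `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    return gini(labels) - sum(len(g) / n * gini(g) for g in groups.values())

# Toy data: `shape` separates the classes perfectly, `color` does not.
rows = [{"shape": "o", "color": "r"}, {"shape": "o", "color": "g"},
        {"shape": "x", "color": "r"}, {"shape": "x", "color": "g"}]
labels = ["pos", "pos", "neg", "neg"]

best = max(["shape", "color"], key=lambda a: gini_goodness(rows, labels, a))
print(best)  # shape
```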

The Information Gain function has its origin in information theory. It is based on the notion of entropy, which characterizes the impurity of an arbitrary set of examples. If we randomly select an example from a set $t$ and announce that it belongs to class $j$, then the probability of this message is equal to $p(j \mid t)$, and the amount of information it conveys is $-\log_2 p(j \mid t)$. The expected information provided by a message with respect to class membership can be expressed as

$\mathrm{info}(t) = -\sum_{j=1}^{J} p(j \mid t)\, \log_2 p(j \mid t)$ (3)

The quantity $\mathrm{info}(t)$ measures the average amount of information needed to identify the class of an example in $t$. This quantity is also known as the entropy of the set $t$ relative to the $J$-wise classification. The logarithm is in base $2$ because the entropy is a measure of the expected encoding length measured in bits. We will consider a similar measurement after $t$ has been partitioned in accordance with the $n$ outcomes of a test $T$. The expected information requirement is the weighted sum over the subsets:

$\mathrm{info}_T(t) = \sum_{i=1}^{n} p_i\, \mathrm{info}(t_i)$

The information gained by partitioning $t$ in accordance with the test $T$ is measured by the quantity $\mathrm{info}(t) - \mathrm{info}_T(t)$. We can write the Information Gain as

$\mathrm{Gain}(T, t) = \mathrm{info}(t) - \mathrm{info}_T(t)$ (4)

The Information Gain criterion selects a test that maximizes the Information Gain function.
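Equations (3) and (4) can be sketched in the same way (the class labels and the perfect binary split below are illustrative):

```python
from collections import Counter
from math import log2

def info(labels):
    """Equation (3): info(t) = -sum_j p(j|t) * log2 p(j|t)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partition):
    """Equation (4): Gain(T, t) = info(t) - sum_i p_i * info(t_i)."""
    n = len(labels)
    return info(labels) - sum(len(s) / n * info(s) for s in partition)

labels = ["pos", "pos", "neg", "neg"]
print(info(labels))                                        # 1.0 bit
# A split into two pure subsets recovers all of that entropy:
print(information_gain(labels, [labels[:2], labels[2:]]))  # 1.0
```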

 
