
Theoretical comparison between the Gini Index and Information Gain criteria

Knowledge Discovery in Databases (KDD) is an active and important research area with the promise of a high payoff in many business and scientific applications. One of the main tasks in KDD is classification. A particularly efficient method for classification is decision tree induction. The selection of the attribute used at each node of the tree to split the data (the split criterion) is crucial in order to correctly classify objects. Different split criteria have been proposed in the literature (Information Gain, Gini Index, etc.). It is not obvious which of them will produce the best decision tree for a given data set. A large number of empirical tests have been conducted in order to answer this question, with no conclusive results. In this paper we introduce a formal methodology which allows us to compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to present a formal description of how to select between split criteria for a given data set. As an illustration we apply the methodology to two widely used split criteria: Gini Index and Information Gain.

Keywords: decision trees, classification, Gini Index, Information Gain, theoretical comparison

1. Introduction

Early work in the field of decision tree construction focused mainly on the definition and on the realization of classification systems. All of them use different measures of impurity/entropy/goodness to select the split attribute in order to construct the decision tree.

Once a certain number of algorithms had been defined, a lot of research was dedicated to comparing them. This is a relatively difficult task, as the systems evolved from different backgrounds: information theory, discriminant analysis, encoding techniques, etc. These comparisons have been predominantly empirical. Baker and Jain reported experiments comparing eleven feature evaluation criteria and concluded that the feature rankings induced by the various rules are very similar. Several feature evaluation criteria were compared on simulated data by Ben-Bassat, on a sequential, multi-class classification problem. The conclusions are that no feature selection rule is consistently superior to the others, and that no specific strategy for alternating different rules is significantly more effective. Mingers compared several attribute selection criteria and concluded that tree quality does not seem to depend on the specific criterion used. Babic compared ID3 and CART on two clinical diagnosis problems. Miyakawa compared three activity-based measures, both analytically and empirically. Several researchers pointed out that Information Gain is biased towards attributes with a large number of possible values. Mingers compared Information Gain and the $\chi^2$-statistic for growing the tree as well as for stopping the splitting, and concluded that the correction reduced Information Gain's bias towards multi-valued attributes. Quinlan suggested Gain Ratio as a remedy for the bias of Information Gain. Mantaras argued that Gain Ratio has its own set of problems, and suggested an information-theoretic distance between partitions for tree construction. White and Liu presented experiments concluding that Information Gain, Gain Ratio and Mantaras' measure are worse than a $\chi^2$-based statistical measure in terms of their bias towards multi-valued attributes.
Gama tried to predict the error rate of a particular classification algorithm and indicated that no single method can be considered better than the others; about twenty different algorithms were evaluated on more than twenty different data sets. Kononenko pointed out that Minimum Description Length based feature evaluation criteria have the least bias towards multi-valued attributes. Elsewhere, twenty-two decision tree and two neural network algorithms were compared in terms of classification accuracy, training time, and number of leaves, and the Gini Index, Information Gain, and a new family of split functions were tested on data sets of different sizes. Other authors proposed a measure for the distance between the biases of two evaluation metrics and gave numerical approximations of it.

However, a thorough understanding of the behavior of the split functions demands an analytical and direct comparison between them, without using any other external measure. Our contribution in this paper is to introduce a formal methodology, which allows us to analytically compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to present a formal description of how to select between split criteria for a given dataset. As an illustration we apply the methodology to two widely used split criteria: Gini Index and Information Gain.

3. The Gini Index and Information Gain criteria

The objects are classified by decision trees, which sort them down from the root to some leaf node that provides the classification (the class) of each object. A decision tree contains zero or more internal nodes and one or more leaf nodes. The internal nodes have two or more child nodes. Each nonterminal node contains a split which specifies a test based on a single attribute, and each branch descending from that node corresponds to one of the possible values for this attribute. Each leaf node has its class label. A leaf node is said to be pure if all of its training examples belong to the same class.

Thus, an example is classified by starting with the root node, testing the attribute corresponding to the root node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process continues until the example reaches a leaf node.
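As a minimal sketch, this top-down classification procedure can be written as follows (the node layout and the attribute names are illustrative, not taken from any particular system):

```python
def classify(node, example):
    """Walk from the root to a leaf, following the branch that
    matches the example's value for each tested attribute."""
    while "label" not in node:              # internal node: test an attribute
        value = example[node["attribute"]]
        node = node["children"][value]      # descend the matching branch
    return node["label"]                    # leaf node: return its class

# A toy tree: the root tests "outlook"; one branch tests "windy".
tree = {
    "attribute": "outlook",
    "children": {
        "sunny": {"label": "no"},
        "rain": {"attribute": "windy",
                 "children": {True: {"label": "no"},
                              False: {"label": "yes"}}},
    },
}

print(classify(tree, {"outlook": "rain", "windy": False}))  # yes
```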

Binary tree classifiers are constructed by repeatedly splitting subsets of the learning set $X$ into two descendant subsets, beginning with $X$ itself. To split $X$ into smaller and smaller subsets, the splits have to be selected in such a way that the descendant subsets are always "purer" than their parents. This motivated the introduction of the "goodness of split" criterion, which is derived from the notion of an impurity function.

An impurity function is a function $\phi$ defined on the set of all $J$-tuples of numbers $(p_1, \ldots, p_J)$ satisfying $p_j \ge 0$ for $j = 1, \ldots, J$ and $\sum_{j=1}^{J} p_j = 1$, with the following properties:

(a) $\phi$ achieves its maximum at the point $(1/J, 1/J, \ldots, 1/J)$;

(b) $\phi$ achieves its minimum at the points $(1, 0, \ldots, 0)$, $(0, 1, \ldots, 0)$, \ldots, $(0, 0, \ldots, 1)$;

(c) $\phi$ is a symmetric function of $p_1, \ldots, p_J$.

Given an impurity function $\phi$, the impurity measure of any node $t$ is defined by

$i(t) = \phi\bigl(p(1 \mid t), p(2 \mid t), \ldots, p(J \mid t)\bigr),$ where $p(j \mid t)$ denotes the proportion of examples in node $t$ belonging to class $j$.

If a split $s$ in a node $t$ divides all examples into two subsets $t_L$ and $t_R$ of proportions $p_L$ and $p_R$, the decrease of impurity is defined as

$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$

The goodness of split $s$ in node $t$, $\varphi(s, t)$, is defined as $\varphi(s, t) = \Delta i(s, t)$.

If a test $T$ is used in a node $t$ and this test is based on an attribute having $n$ possible values, the expressions defined before are generalized as follows:

$\Delta i(T, t) = i(t) - \sum_{i=1}^{n} p_i\, i(t_i)$,

$\varphi(T, t) = \Delta i(T, t)$.
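These definitions translate directly into code. In the following sketch the class proportions $p(j \mid t)$ are estimated from the training examples reaching a node, and the impurity function $\phi$ is passed in as a parameter (all helper names are illustrative):

```python
from collections import Counter

def impurity(labels, phi):
    """i(t) = phi(p(1|t), ..., p(J|t)), with the class proportions
    estimated from the examples reaching node t."""
    n = len(labels)
    return phi([c / n for c in Counter(labels).values()])

def impurity_decrease(labels, partition, phi):
    """Delta i(T, t) = i(t) - sum_i p_i * i(t_i), where the test T
    splits the examples into the subsets of `partition`."""
    n = len(labels)
    return impurity(labels, phi) - sum(
        len(subset) / n * impurity(subset, phi) for subset in partition)

# Using the Gini impurity phi(p) = 1 - sum_j p_j^2 on a pure split:
gini_phi = lambda p: 1.0 - sum(q * q for q in p)
labels = ["a", "a", "b", "b"]
print(impurity_decrease(labels, [labels[:2], labels[2:]], gini_phi))  # 0.5
```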

Breiman adopts in his work the Gini diversity index, which has the following form:

$\mathrm{Gini}(t) = \sum_{j \ne i} p(j \mid t)\, p(i \mid t) = 1 - \sum_{j=1}^{J} p^2(j \mid t)$ (1)

In a node $t$, an impurity function based on the Gini Index criterion assigns a training example to a class $i$ with probability $p(i \mid t)$. The estimated probability that the item is actually in class $j$ is $p(j \mid t)$. Therefore, the estimated probability of misclassification under this rule is the Gini Index:

$\sum_{j \ne i} p(j \mid t)\, p(i \mid t) = 1 - \sum_{j=1}^{J} p^2(j \mid t)$.

This function can also be interpreted in terms of variance. In a node $t$ we assign to all examples belonging to class $j$ the value $1$, and to all other examples the value $0$. The sample variance of these values is $p(j \mid t)\bigl(1 - p(j \mid t)\bigr)$. There are $J$ classes, so the corresponding variances are summed together:

$\sum_{j=1}^{J} p(j \mid t)\bigl(1 - p(j \mid t)\bigr) = 1 - \sum_{j=1}^{J} p^2(j \mid t)$.

Having a test $T$ with $n$ outcomes, the goodness of the split is expressed using the Gini Index as follows:

$\Delta\mathrm{Gini}(T, t) = \mathrm{Gini}(t) - \sum_{i=1}^{n} p_i\, \mathrm{Gini}(t_i)$ (2)

The Gini Index criterion selects a test that maximizes this function.
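A minimal sketch of this selection rule, with the Gini Index of equation (1) and the goodness of split of equation (2) estimated from toy training data (the attribute and class names are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini(t) = 1 - sum_j p(j|t)^2, as in equation (1)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_goodness(rows, labels, attribute):
    """Equation (2): Gini(t) - sum_i p_i * Gini(t_i), where the test
    partitions the examples by their value of `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    return gini(labels) - sum(len(g) / n * gini(g) for g in groups.values())

# Toy data: `shape` separates the classes perfectly, `color` does not.
rows = [{"shape": "o", "color": "r"}, {"shape": "o", "color": "g"},
        {"shape": "x", "color": "r"}, {"shape": "x", "color": "g"}]
labels = ["pos", "pos", "neg", "neg"]

best = max(["shape", "color"], key=lambda a: gini_goodness(rows, labels, a))
print(best)  # shape
```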

The Information Gain function has its origin in information theory. It is based on the notion of entropy, which characterizes the impurity of an arbitrary set of examples. If we randomly select an example from a set $t$ and announce that it belongs to class $j$, then the probability of this message is equal to $p(j \mid t)$, and the amount of information it conveys is $-\log_2 p(j \mid t)$. The expected information provided by a message with respect to class membership can be expressed as

$\mathrm{info}(t) = -\sum_{j=1}^{J} p(j \mid t)\, \log_2 p(j \mid t)$ (3)

The quantity $\mathrm{info}(t)$ measures the average amount of information needed to identify the class of an example in $t$. This quantity is also known as the entropy of the set $t$ relative to the $J$-wise classification. The logarithm is in base $2$ because the entropy is a measure of the expected encoding length measured in bits. We will consider a similar measurement after $t$ has been partitioned in accordance with the $n$ outcomes of a test $T$. The expected information requirement is the weighted sum over the subsets:

$\mathrm{info}_T(t) = \sum_{i=1}^{n} p_i\, \mathrm{info}(t_i)$

The information gained by partitioning $t$ in accordance with the test $T$ is measured by the quantity $\mathrm{info}(t) - \mathrm{info}_T(t)$. We can write the Information Gain as

$\mathrm{Gain}(T, t) = \mathrm{info}(t) - \mathrm{info}_T(t)$ (4)

The Information Gain criterion selects a test that maximizes the Information Gain function.
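Equations (3) and (4) can be sketched in the same way (the class labels and the perfect binary split below are illustrative):

```python
from collections import Counter
from math import log2

def info(labels):
    """Equation (3): info(t) = -sum_j p(j|t) * log2 p(j|t)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partition):
    """Equation (4): Gain(T, t) = info(t) - sum_i p_i * info(t_i)."""
    n = len(labels)
    return info(labels) - sum(len(s) / n * info(s) for s in partition)

labels = ["pos", "pos", "neg", "neg"]
print(info(labels))                                        # 1.0 bit
# A split into two pure subsets recovers all of that entropy:
print(information_gain(labels, [labels[:2], labels[2:]]))  # 1.0
```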

 
