首页 > 代码库 > Large Margin DAGs for Multiclass Classification

Large Margin DAGs for Multiclass Classification

Abstract

We present a new learning architecture: the Decision Directed Acyclic Graph (DDAG), which is used to combine many two-class classifiers into a multiclass classifiers. For an 技术分享-技术分享class problem, the DDAG contains 技术分享 classifiers, one for each pair of classes. We present a VC analysis of the case when the node classifiers are hyperplanes; the resulting bound on the test error depends on 技术分享 and on the margin achieved at the nodes, but not on the dimension of the space. This motivates an algorithm, DAGSVM, which operates in a kernel-induced feature space and uses two-class maximal margin hyperplanes at each decision-node of the DDAG. The DAGSVM is substantially faster to train and evaluate than either the standard algorithm or Max Wins, while maintaining comparable accuracy to both of these algorithms.技术分享技术分享

  1. 1 Introduction

The problem of multiclass classification, especially for systems like SVMs, doesn‘t present an easy solution. It is generally simpler to construct classifier theory and algorithms for two mutually-exclusive classes than for 技术分享 mutually-exclusive classes. We believe constructing 技术分享-class SVMs is still an unsolved research problem.技术分享技术分享

The standard method for 技术分享-class SVMs is to construct SVMs. The 技术分享th SVM will be trained with all of the examples in the 技术分享th class with positive labels, and all other examples with negative labels. We refer to SVMs trained in this way as 技术分享 SVMs (short for one-versus-rest). The final output of the 技术分享 技术分享 SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 技术分享 SVM, and the training time of the standard method scales linearly with 技术分享.技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享

Another method for constructing 技术分享-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr suggested constructing all possible two-class classifiers from a training set of 技术分享 classes, each classifier being trained on only two out of 技术分享 classes. There would thus be 技术分享 classifiers. When applied to SVMs, we refer to this as 技术分享 SVMs (short for one-versus-one).技术分享技术分享技术分享技术分享技术分享

Knerr suggested combining these two-class classifiers with an “AND” gate. Friedman suggested a Max Wins algorithm: each 技术分享 classifier casts one vote for its preferred class, and the final result is the class with the most votes. Friedman shows circumstances in which this algorithm is Bayes optimal. KreBel applies the Max Wins algorithm to Support Vector Machines with excellent results.技术分享

A significant disadvantage of the 技术分享 approach, however, is that, unless the individual classifiers are carefully regularized (as in SVMs), the overall 技术分享-class classifier system will tend to overfit. The “AND” combination method and the Max Wins combination method do not have bounds on the generalization error. Finally, the size of the 技术分享 classifier may grow superlinearly with 技术分享, and hence, may be slow to evaluate on large problems.技术分享技术分享技术分享技术分享

  1. 2 Decision DAGs

A Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and no cycles. A Rooted DAG has a unique node such that it is the only node which has no arcs pointing into it. A Rooted Binary DAG has nodes which have either 技术分享 or 技术分享 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows.技术分享技术分享

Definition 1Decision DAGs (DDAGs). Given a space 技术分享 and a set of boolean functions 技术分享, the class 技术分享 of Decision DAGs on 技术分享 classes over 技术分享 are functions which can be implemented using a rooted binary DAGs with 技术分享 leaves labeled by the classes where each of the 技术分享 internal nodes is labeled with an element of 技术分享. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer and so on until the final layer of 技术分享 leaves. The 技术分享-th node in layer 技术分享 is connected to the 技术分享-th and 技术分享-st node in the 技术分享-st layer.技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享技术分享

To evaluate a particular DDAG G on input 技术分享, starting at the root node, the binary function at node is evaluated. The node is then exited via the left edge, if the binary function is zero; or the right edge, if the binary function is one. The next node‘s binary function is then evaluated. The value of the decision function 技术分享 is the value associated with the final leaf node. The path taken through the DDAG is known as the evaluation path. The input 技术分享 reaches a node of the graph, if that node is on the evaluation path for 技术分享. We refer to the decision node distinguishing classes 技术分享 and 技术分享 as the 技术分享-node. Assuming that the number of a leaf is its class, this node is the 技术分享-th node in the 技术分享-th layer provided . Similarly the 技术分享-nodes are those nodes involving class , that is, the internal nodes on the two diagonals containing the leaf labeled by 技术分享.

The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with a list of all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with 技术分享 classes, 技术分享 decision nodes will be evaluated in order to derive an answer.

The current state of the list is the total state of the system. Therefore, since a list state is reachable in more than one possible path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree.

Decision DAGs naturally generalize the class of Decision Trees, allowing for a more efficient representation of redundancies and repetitions that can occur in different branches of the tree, by allowing the merging of different decision paths. The class of functions implemented is the same as that of Generalized Decision Trees, but this particular representation presents both computational and learning-theoretical advantages.

3 Analysis of Generalization

In this paper we study DDAGs where the node-classifiers are hyperplanes. We define a Perceptron DDAG to be a DDAG with a perceptron at every node. Let 技术分享 be the (unit) weight vector correctly splitting the 技术分享 and 技术分享 classes at the 技术分享-node with threshold 技术分享. We define the margin of the 技术分享-node to be 技术分享, where 技术分享 is the class associated to training example 技术分享. Note that, in this definition, we only take into account examples with class labels equal to 技术分享 or 技术分享.

Theorem 1 Suppose we are able to classifya random 技术分享 sample of labeled examples using a Perceptron DDAG on 技术分享 classes containing 技术分享 decision nodes with margins 技术分享 at node 技术分享, then we can bound the generalization error with probability greater than 技术分享 to be less than

技术分享,

where 技术分享 and 技术分享 is the radius of a ball containing the distribution‘s support.

Theorem 1 implies that we can control the capacity of DDAGs by enlarging their margin. Note that, in some situations, this bound may be pessimistic: the DDAG partitions the input space into polytopic regions, each of which is mapped to a leaf node and assigned to a specific class. Intuitively, the only margins that should matter are the ones relative to the boundaries of the cell where a given training point is assigned, whereas the bound in Theorem 1 depends on all the margins in the graph.

By the above observations, we would expect that a DDAG whose 技术分享-node margins are large would be accurate at identifying class 技术分享, even when other nodes do not have large margins. Theorem 2 substantiates this by showing that the appropriate bound depends only on the 技术分享-node margins, but first we introduce the notation, 技术分享.

Theorem 2 Suppose we are able to correctly distinguish class 技术分享 from the other classes in a random 技术分享-sample with a DDAG 技术分享 over 技术分享 classes containing 技术分享 decision nodes with margins 技术分享 at node 技术分享, then with probability 技术分享,

技术分享,

where 技术分享, and 技术分享 is the radius of a ball containing the support of the distribution.

Large Margin DAGs for Multiclass Classification