A Comparison of Methods for Multiclass
Support Vector Machines
Chih-Wei Hsu and Chih-Jen Lin
Abstract—Support vector machines(SVMs)were originally designed for binary classification.How to effectively extend it for multiclass classification is still an ongoing research issue.Several methods have been proposed where typically we construct a multiclass classifier by combining several binary classifiers.Some authors also proposed methods that consider all classes at once.As it is computationally more expensive to solve multiclass problems, comparisons of these methods using large-scale problems have not been seriously conducted.Especially for methods solving multiclass SVM in one step,a much larger optimization problem is required so up to now experiments are limited to small data sets. In this paper we give decomposition implementations for two such “all-together”methods.We then compare their performance with three methods based on binary classifications:“one-against-all,”“one-against-one,”and directed acyclic graph SVM(DAGSVM). Our experiments indicate that the“one-against-one”and DAG methods are more suitable for practical use than the other methods.Results also show that for large problems methods by considering all data at
once in general need fewer support vectors. Index Terms—Decomposition methods,multiclass classification, support vector machines(SVMs).
I.I NTRODUCTION
S UPPORT vector machines(SVMs)[6]were originally de-signed for binary classification.How to effectively extend it for multiclass classification is still an ongoing research issue. Currently there are two types of approaches for multiclass SVM. One is by constructing and combining several binary classifiers while the other is by directly considering all data in one opti-mization formulation.Up to now there are still no comparisons which cover most of these methods.
The formulation to solve multiclass SVM problems in one step has variables proportional to the number of classes.There-fore,for multiclass SVM methods,either several binary clas-sifiers have to be constructed or a larger optimization problem is needed.Hence in general it is computationally more expen-sive to solve a multiclass problem than a binary problem with the same number of data.Up to now experiments are limited to small data sets.In this paper we will give a decomposition implementation for two such“all-together”methods:[25],[27] and[7].We then compare their performance with three methods based on binary classification:“one-against-all,”“one-against-one,”and DAGSVM[23].
Manuscript received April9,2001;revised August22,2001.This work was supported in part by the National Science Council of Taiwan under Grant NSC 89-2213-E-002-106.
The authors are with the Department of Computer Science and Informa-tion Engineering,National Taiwan University,Taipei106,Taiwan(e-mail: u.edu.tw).
Publisher Item Identifier S1045-9227(02)01805-2.
Note that it was pointed out in[11]that the primal forms proposed in[25],[27]are also equivalent to those in[3],[11]. Besides methods mentioned above,there are other implemen-tations for multiclass SVM.For example,[14],[19].However, due to the limit of space here we do not conduct experiments on them.An earlier comparison between one-against-one and one-against-all methods is in[5].
In Section II,we review one-against-all,one-against-one,and directed acyclic graph SVM(DAGSVM)methods which are based on solving several binary classifications.In Section III, we give a brief introduction to the method in[25],[27]which considers all classes at once and show that the decomposition method proposed in[12]can be applied.Another method which also considers all variables together is by Crammer and Singer [7],which will be discussed in Section IV.Numerical experi-ments are in Section V where we show that“one-against-one”and DAG methods are more suit
able for practical use than the other methods.Results also show that for large problems the method proposed in[25],[27]by considering all variables at once generally needs fewer support vectors.Finally,we have some discussions and conclusions in Section VI.
II.O NE-A GAINST-A LL,O NE-A GAINST-O NE,
AND DAGSVM M ETHODS
The earliest used implementation for SVM multiclass clas-sification is probably the one-against-all method(for example, [2]).It
constructs is the number of classes.The
th class with positive labels,and all other examples with negative labels.Thus
given,
where
and is the class
of ,
the
(1) where the training
data are mapped to a higher dimensional space by the
function is the penalty parameter.
Minimizing
,the margin between two groups of data.When data are not linear separable,there is a penalty
term
which can reduce the number of training errors.The basic con-cept behind SVM is to search for a balance between the regu-larization
term
After solving (1),there
are
..
.
is in the class which has the largest value of the deci-sion function
class
of
th and
the
if
if
says
th class,then the vote for
the
th is increased by
one.Then we
predict
leaves.Each node is a binary SVM of th classes.
Given a test
sample
two-class
rules where
the
from the other vectors.Hence there
are
(4)
Then the decision function
is
which is the same as (2)of the one-against-all method.Like bi-nary SVM,it is easier to solve the dual problem here.Following
[27],the dual formulation of (4)
is
(5a)
(5b)
if (5c)
where
(6)
and the decision function
is
(7a)
HSU AND LIN:A COMPARISON OF METHODS FOR MULTICLASS SUPPORT VECTOR MACHINES417 and the other is from
[3]
(8)
with the same constraints of(4).
For the optimal solution of(4),from(6)and(5c),we
have
of them are always
zero.Hence we can also say it has
and
are fixed while a subproblem on variables corresponding
to
and the selection of its contents
are both important issues on designing a decomposition method.
For example,the sequential minimal optimization(SMO)by
Platt[22]considers only two variables in each iteration.Then
in each iteration of the decomposition method the sub-problem
on two variables can be analytically solved.Hence no optimiza-
tion software is needed.
However,such working set selections in the binary case may
not work here for the dual problem(5).Instead of only one linear
constraint in the dual of(1),now in(5a)we
have
linear equations
with
,then(5a)may be an over-determined linear
system with more equations than variables so the solution of the
decomposition algorithm cannot be moved.Therefore,in gen-
eral we have to select more
than
to the objective
function:
(9)
Then in the dual
problem
comparisonsif
(11)
where is
a
(12)
where is a permutation of the
matrix
418IEEE TRANSACTIONS ON NEURAL NETWORKS,VOL.13,NO.2,MARCH 2002
Assume
that is the size of the working set.We use the following working set selection of BSVM :
1)
Let
and
calculate ,
where
.
Select
the
into
is an optimal solution of (13),
then must
be
defined above actually
has
.
Thus selecting the smallest elements
in
whose
corresponding
and the
other
whose
corresponding
which are mainly accessed
as
in
(12).
As
and
.As any quadratic formulation can be written as a symmetric
form,we must reformulate (10).This will be explained in (15).Note that
as
is positive semidefinite and (10)is a convex optimiza-tion problem.Here (14a)to (14b)is by direct calculation.It is easy to see how the first and third terms
of .In particular,we will
show what a column
of
.
In (10),the quadratic terms compose of three groups.First we
have
if
if
..
.
...
..
.
includes all
elements
th column,row
indexes corresponding
to
will have contributed by this part.We
use
to
represent
and
for .
HSU AND LIN:A COMPARISON OF METHODS FOR MULTICLASS SUPPORT VECTOR MACHINES419
For the third
term,
will
include.
The most complicated one is the second
part
th column,elements corresponding
to
and will
have
and,respectively.
In summary,
the can be obtained
as follows:
1)Obtain the column
vector.Initialize
the
,
add
(from the third part)
3)For elements corresponding
to
(from the second part)
4)For elements corresponding to each
of
(from the first part)
5)For elements corresponding to each
of,
minus(from the second part)
Thus
.The reduction of the cached matrix
from further
improve the training time of this algorithm.
There are different implementations of the above procedure.
As the situation may vary for different computational environ-
ment,here we do not get into further details.
We have implemented this method as an extension of
BSVM and LIBSVM which will be used for the experi-
ments in Section V.In addition,the software is available at
u.edu.tw/~cjlin/bsvm.
IV.M ETHOD BY C RAMMER AND S INGER
In[7],Crammer and Singer proposed an approach for mul-
ticlass problems by solving a single optimization problem.We
will also give a decomposition implementation here.Basically
[7]solves the following primal
problem:
(16)
where
slack
variables.That is,instead of
using as the gap
between each two decision planes,here the maximum
of
.Note that here we do not have to
explicitly write down
constraints
so(16)
becomes
(17a)
(17b)
where
is
an identity matrix
and
is positive
semidefinite,,the Hessian
of the dual objective function is also positive semidefinite.This
is another way to explain that(17)is a convex optimization
problem.
420IEEE TRANSACTIONS ON NEURAL NETWORKS,VOL.13,NO.2,MARCH2002 The decision function
is
equations while(17a)
has
variables.In a sense
we can say that(17a)contains
variables associated
with the
same in the working set.That
is,
will be discussed in(20).Then the subproblem
is
(18)
where
is
a
th component
is
and
by one vector,can be calculated as follows by the
information of the
gradient
operations are needed.
Next we discuss the selection of the working set and the stop-
ping condition of the decomposition method.The KKT condi-
tion requires that there
are
and
We can rewrite this
as
with
which are selected,they do not satisfy the
KKT condition of the subproblem(18)so solving(18)will guar-
antee the strict decrease on the objective function of the dual
problem.
Following(19)the stopping criterion can
be
is the stopping tolerance.
The convergence of the above decomposition method has
been proved in[17].In addition,[17]shows that the limit
of
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。
发表评论