“OF ALL THINGS THE MEASURE IS MAN”
AUTOMATIC CLASSIFICATION OF EMOTIONS AND INTER-LABELER CONSISTENCY
Stefan Steidl, Michael Levit, Anton Batliner, Elmar Nöth, and Heinrich Niemann
Lehrstuhl für Mustererkennung, Universität Erlangen-Nürnberg
Martensstraße 3, 91058 Erlangen, Germany
steidl@informatik.uni-erlangen.de
ABSTRACT
In traditional classification problems, the reference needed for training a classifier is given and considered to be absolutely correct. However, this does not apply to all tasks. In emotion recognition in non-acted speech, for instance, one often does not know which emotion was really intended by the speaker. Hence, the data is annotated by a group of human labelers who, in most cases, do not agree on one common class. Often, similar classes are confused systematically. We propose a new entropy-based method to evaluate classification results that takes these systematic confusions into account. We show that a classifier which achieves a recognition rate of “only” about 60% on a four-class problem performs as well as our five human labelers on average.
1. INTRODUCTION
An essential aspect of pattern recognition is the classification of patterns. Besides the search for applicable features, a lot of work has been done to develop new automatic classification techniques and to improve existing ones. Well-known examples are artificial neural networks and support vector machines, which have become very popular in the last few years. In the case of supervised learning, the classifiers are trained to map a set of features onto a given reference class. The standard method to evaluate an automatic classifier is to calculate the recognition rate, which is the percentage of correctly recognized samples. The basic assumption is that this reference class is given and that it is non-ambiguous.
In our work on the recognition of emotions on the basis of emotional speech, we face the problem that it is not clear at all which emotions the people expressed when they were recorded. The corpus on which the experiments in this paper were done consists of children playing with the Sony robot Aibo. The children were asked to direct the Aibo along a given route; they were not asked to express any emotions. Nevertheless, emotional behavior can be observed in these recordings. As these emotions are not acted by professional actors, but are emotions as they appear in daily life, they are called “realistic”. From the application developer’s point of view, it is very important to deal with such realistic behavior. However, one side effect is that one has to cope with relatively weak emotions in contrast to the full-blown emotions of acted speech. As the recorded persons do not have to play a given emotion, and as it is often not feasible to ask them afterwards what kind of emotions they felt during the recordings, one employs human labelers to label the data set. Normally, all available labelers agree on one common label in only a few cases. In our corpus, in most cases, only three out of five labelers agreed. Yet this is not a problem of bad labeling but rather a consequence of the fact that we are dealing with a realistic classification problem which also raises real difficulties for humans. Accordingly, the expectations of the automatic classifier measured in recognition rates must be lowered.

This work was partially funded by the EU projects PF-STAR (IST-2001-37599) and HUMAINE (IST-2002-507422). The responsibility for the content lies with the authors.
In order to be able to calculate recognition rates at all, hard decisions are needed for the reference as well as for the classifier’s output. If a metric can be imposed on the label space, the labels of all labelers can be averaged. This is possible, for example, if the tiredness of persons is labeled on a scale from 1 to 10. If two labelers judge someone with '8' as very tired and one labeler says only '5', the reference would be '7'. But this does not work for categorical emotion labels like anger, bored, etc., as the mean of anger and bored is not defined. In those cases, the state of the art is to use majority voting to create the reference. Proceeding this way, we achieve recognition rates of about 60% on our corpus with four emotion classes, which is a state-of-the-art result for such a task set-up. Nonetheless, the success of emotion classification should not be assessed without considering how well humans would perform in this task. Depending on the number and type of classes, human labelers confuse certain classes with each other more than other classes. In general, the more similar classes are, the more they are confused. This confusion should be considered in the evaluation of a classifier. If the automatic classifier makes the same “mistakes” as many humans do, then this fault cannot be as severe as if the classifier mixes up two classes that are never confused by humans. Indeed, the question is whether such systematic confusions are faults at all since “of all things the measure is man”, as Protagoras already said more than 2400 years ago.
In this paper, we propose a new entropy-based measure to judge a classifier’s output that takes systematic confusions made by humans into account.
2. ENTROPY-BASED MEASURE
According to Shannon’s information theory [1], the entropy is a measure of the information content. We propose to use the entropy to measure the unanimity of the labelers. If all reference labelers agree on one class, the entropy will be zero. Otherwise, the more the labelers disagree, the higher the entropy will be. In the following, we assume that we have $N$ labelers $L_n$ who have labeled a data set of $S$ samples $X_s$. For each sample, each of our labelers has to decide in favor of one of $K$ classes $C_k$. However, the approach is also easily portable to soft labels, where all classes get scores from a continuous range of values and all scores for a sample sum up to one. The hard decisions of any number of labelers can be converted into one soft reference label, as depicted in Fig. 1 for a four-class problem ($K = 4$) with ten labelers. The more the labelers disagree, the flatter the distribution of the soft label.
labeler:   1    2    3    4    5    6    7    8    9    10
class:     A    E    A    N    A    E    A    A    N    E

→   class:   A     M     E     N
    score:   0.5   0.0   0.3   0.2

Fig. 1. Conversion of the hard decisions of ten labelers into a soft reference label $l_{\mathrm{ref}}$. The four classes are 'Anger', 'Motherese', 'Emphatic', and 'Neutral'.
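The conversion shown in Fig. 1 amounts to computing relative class frequencies. The following Python sketch illustrates this under that reading; the names `CLASSES` and `soft_label` are hypothetical and not taken from the paper.

```python
from collections import Counter

# Class order as in Fig. 1: Anger, Motherese, Emphatic, Neutral
CLASSES = ("A", "M", "E", "N")

def soft_label(hard_labels, classes=CLASSES):
    """Return the relative frequency of each class among the hard decisions."""
    counts = Counter(hard_labels)
    total = len(hard_labels)
    return [counts[c] / total for c in classes]

# Hard decisions of the ten labelers from Fig. 1
fig1_labels = ["A", "E", "A", "N", "A", "E", "A", "A", "N", "E"]
l_ref = soft_label(fig1_labels)
print(l_ref)  # [0.5, 0.0, 0.3, 0.2]
```

A single hard decision is a special case: `soft_label(["E"])` yields a one-hot label, which is how a decoder's output can be brought into the same form.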
Our suggestion is to leave out each labeler (we can also use the more general term “decoder”) in succession. If labeler $n$ is left out, then the resulting soft reference label for sample $X_s$ is denoted $l_{\mathrm{ref}}(\bar{n}, s)$, with $\bar{n}$ indicating the omitted labeler.
Now, we add another decoder. This can be an automatic classifier, but also the human labeler who was omitted from the reference, so that direct comparisons between a classifier and a human labeler are possible. In order to avoid a dependency on the number of labelers, the new decoder is not treated in the same manner as the reference labelers. Instead, the hard decision of the new decoder for sample $X_s$ (also converted into a soft label $l_{\mathrm{dec}}(s)$) is weighted 1:1 with the reference label $l_{\mathrm{ref}}(\bar{n}, s)$:

$$l(\bar{n}, s) = 0.5 \cdot l_{\mathrm{ref}}(\bar{n}, s) + 0.5 \cdot l_{\mathrm{dec}}(s) \tag{1}$$

Then, the entropy can be calculated for the given sample $X_s$:

$$H(\bar{n}, s) = -\sum_{k=1}^{K} l_k(\bar{n}, s) \cdot \log_2\bigl(l_k(\bar{n}, s)\bigr) \tag{2}$$

Taking the example of Fig. 1, the entropy will decrease compared to the reference labels if the decoder decides in favor of 'Anger', as 'Anger' is what the majority of labelers said. If the decoder chooses 'Emphatic' instead, the entropy will be higher than for 'Anger', but not as high as if the decoder decides in favor of 'Neutral', since 30% of the labelers agree that this sample is 'Emphatic' and only 20% said the sample is 'Neutral'. As none of the labelers decided for 'Motherese', choosing this class yields the highest entropy. This makes sense since 'Motherese' seems to be definitely wrong in this case. Note that with hard decisions, 'Anger' would be the only correct class although 50% of the labelers disagree.
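The Fig. 1 example can be worked through numerically with a small Python sketch of Eq. 1 and Eq. 2. The function names `combine` and `entropy` are hypothetical, and treating zero-probability terms as contributing nothing to the sum is the usual convention, assumed here.

```python
import math

CLASSES = ("A", "M", "E", "N")

def combine(l_ref, decision, classes=CLASSES):
    """Eq. 1: weight the reference and the decoder's one-hot label 1:1."""
    return [0.5 * r + 0.5 * (1.0 if c == decision else 0.0)
            for r, c in zip(l_ref, classes)]

def entropy(label):
    """Eq. 2: Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in label if p > 0)

l_ref = [0.5, 0.0, 0.3, 0.2]  # soft reference label from Fig. 1
results = {c: entropy(combine(l_ref, c)) for c in CLASSES}

print(f"reference alone: {entropy(l_ref):.3f}")
for c, h in results.items():
    print(f"decoder says {c}: {h:.3f}")
```

For this example the sketch gives roughly 1.05 bits for 'A', 1.24 for 'E', 1.35 for 'N', and 1.74 for 'M', which reproduces the ordering discussed above.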
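The next step is to average each of the two computed entropy values for $X_s$ (one with the classifier and one with the left-out human labeler as the new decoder) over the left-out labelers:

$$H(s) = \frac{1}{N} \sum_{n=1}^{N} H(\bar{n}, s) \tag{3}$$
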
We say that our classifier performs no worse than an average human labeler on sample $X_s$ if the entropy from Eq. 3 with this classifier as the new decoder does not exceed the entropy obtained when the additional decoders were always humans. By plotting the two corresponding histograms of $H(s)$ for the entire corpus, we obtain a visual means of assessing the performance of the classifier on this corpus: the closer the histogram for the classifier is to the histogram for the human labelers, the better the classifier. In general, nothing is known about the distributions approximated by these histograms. However, if instead of plotting the entropy values of individual samples we average them over series of several samples, then, according to the central limit theorem, the resulting distributions will be approximately normal and thus describable in terms of their means and variances. In our experiments, we used series of 20 samples.
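The leave-one-labeler-out procedure of Eqs. 1 to 3 and the comparison criterion can be sketched in Python as follows. This is a minimal illustration under the definitions above, not the authors' implementation; all function names and the five-labeler example word are hypothetical.

```python
import math
from collections import Counter

CLASSES = ("A", "M", "E", "N")

def soft_label(hard_labels):
    """Relative class frequencies of a list of hard decisions."""
    counts = Counter(hard_labels)
    return [counts[c] / len(hard_labels) for c in CLASSES]

def entropy_with_decoder(ref_labels, decision):
    """Eqs. 1 and 2 for one reduced reference set and one decoder decision."""
    l_ref = soft_label(ref_labels)
    l_dec = soft_label([decision])  # one-hot soft label of the decoder
    mixed = [0.5 * r + 0.5 * d for r, d in zip(l_ref, l_dec)]
    return -sum(p * math.log2(p) for p in mixed if p > 0)

def sample_entropy(labels, machine_decision=None):
    """Eq. 3: average over all left-out labelers.

    With machine_decision=None the left-out human acts as the new decoder;
    otherwise the machine's decision is used against each reduced reference.
    """
    total = 0.0
    for n, left_out in enumerate(labels):
        rest = labels[:n] + labels[n + 1:]
        decoder = left_out if machine_decision is None else machine_decision
        total += entropy_with_decoder(rest, decoder)
    return total / len(labels)

labels = ["A", "A", "A", "E", "N"]  # hypothetical word labeled by five labelers
h_human = sample_entropy(labels)
h_machine = sample_entropy(labels, machine_decision="A")
print(f"human: {h_human:.3f}  machine: {h_machine:.3f}")
print("machine no worse than average human:", h_machine <= h_human)
```

In this toy case a machine that outputs the majority class obtains a lower entropy than the average human, because two of the five humans deviate from the majority.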
The overall entropy mean itself can be used for comparison and is computed by averaging $H(s)$ over all samples of the data set:

$$H = \frac{1}{S} \sum_{s=1}^{S} H(s) \tag{4}$$

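Both the overall mean of Eq. 4 and the series averages used for the histograms reduce to simple averaging of the per-sample entropies. The sketch below assumes consecutive, non-overlapping series and drops an incomplete final series; the paper does not specify how the series are formed, so both choices are assumptions. The entropy values are synthetic placeholders, not corpus data.

```python
import random
import statistics

def series_means(values, series_length=20):
    """Average consecutive, non-overlapping series; an incomplete tail is dropped."""
    n_full = len(values) // series_length
    return [statistics.fmean(values[i * series_length:(i + 1) * series_length])
            for i in range(n_full)]

# Synthetic stand-in for per-sample entropies H(s); NOT data from the corpus.
rng = random.Random(0)
h_values = [rng.uniform(0.0, 2.0) for _ in range(1000)]

overall_mean = statistics.fmean(h_values)  # Eq. 4
means = series_means(h_values)             # one value per series of 20 samples
print(len(means), overall_mean, statistics.stdev(means))
```

Because every series has the same length and no tail is dropped here, the mean of the series means equals the overall mean of Eq. 4, while their spread is much smaller than that of the individual values.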
3. THE AIBO-EMOTION-CORPUS
This entropy-based measure is useful in all cases where a large discrepancy between the human reference labelers exists. In this paper, we demonstrate the evaluation of different decoders using the example of emotion recognition in the speech of children. All experiments are done on a subset of our Aibo-Emotion-Corpus, which consists of recordings of 51 children aged 10 to 13 years. The children were asked to direct the Aibo robot along a given route and to certain objects. To evoke emotions, the Aibo was operated by remote control and misbehaved at predefined positions. In addition, the children were told to address Aibo like a normal dog, especially to reprimand or to praise it. Besides that, we put the children under slight time pressure and set up some danger spots where Aibo was not allowed to go under any circumstances. Nevertheless, the recorded emotions are relatively weak, especially in contrast to the full-blown emotions of acted speech. The corpus consists mainly of the four emotions 'Anger', 'Motherese', 'Emphatic', and 'Neutral', which were annotated at word level by five experienced graduate labelers. Before labeling, the labelers agreed on a common set of discrete emotions. For a more detailed description of the corpus, please refer to [2]. As 'Neutral' is by far the most frequent “emotion”, we downsampled the data until all four classes were equally present according to the majority voting of our five labelers. At least three labelers had to agree. Cases where fewer than three labelers agreed were omitted, as were cases where classes other than the four basic ones were labeled. In the final data set, 1557 words for 'Anger', 1224 words for 'Motherese', and 1645 words each for 'Emphatic' and 'Neutral' are used. The inter-labeler consistency can be measured using the kappa statistic; the formula is given in [3]. For our subset, the kappa value is only 0.36, which expresses the large disagreement among our five labelers. It is generally agreed that kappa scores greater than 0.7 indicate good agreement. As mentioned above, our low kappa value is not due to bad labeling. Rather, we are dealing with a difficult classification problem where even human labelers disagree about certain classes.
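As an illustration of how a kappa value for several labelers can be computed, the following Python sketch implements Fleiss' kappa, a common multi-rater variant. This is an assumption: the text refers to [3] for the formula actually used, which may differ in detail. The function name and the toy counts are hypothetical.

```python
def fleiss_kappa(count_rows):
    """Fleiss' kappa for count_rows[i][j] = number of labelers who assigned
    class j to item i. Every row must sum to the same number of labelers."""
    n_items = len(count_rows)
    n_raters = sum(count_rows[0])
    n_classes = len(count_rows[0])
    # Observed agreement: share of agreeing labeler pairs, averaged over items
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in count_rows]
    p_obs = sum(p_items) / n_items
    # Chance agreement derived from the overall class proportions
    p_class = [sum(row[j] for row in count_rows) / (n_items * n_raters)
               for j in range(n_classes)]
    p_exp = sum(p * p for p in p_class)
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical counts for three words, five labelers, classes (A, M, E, N)
toy = [[3, 0, 1, 1],
       [0, 5, 0, 0],
       [1, 0, 2, 2]]
kappa = fleiss_kappa(toy)
print(f"{kappa:.3f}")
```

A value of 1 corresponds to perfect agreement on every item, and values near 0 indicate agreement no better than chance.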
4. MACHINE CLASSIFICATION OF EMOTIONS
The experiments described in the following are all conducted with artificial neural networks. Because of the small data set, we do “Leave-One-Speaker-Out” experiments: 1 speaker for testing, 40 speakers for training, and 10 speakers for validation of the neural networks. As features, we use our set of 95 prosodic features and 30 part-of-speech features. Details on these features can be found in [4, 5]. The total number of features is reduced to 95 using principal component analysis (PCA). Two machine classifiers are trained: machine 1 is trained with soft labels, machine 2 with hard labels. The results in terms of traditional recognition rates are given in Tab. 1 and Tab. 2 together with a confusion matrix of the classes. At 59.7%, the average recognition rate per class is slightly higher for machine 2, which is trained with hard labels, than for machine 1, which achieves 58.1%. The majority voting of all five labelers serves as the hard reference.
        A      M      E      N      Σ       RR
A     791     47    261    458   1557    50.8%
M      56    559     27    582   1224    45.7%
E     214     23    947    461   1645    57.6%
N     100     94    161   1290   1645    78.4%
∅                                        58.1%

Table 1. Machine decoder 1: confusion matrix and recognition rates (RR) evaluated using hard decisions for the classes 'Anger', 'Motherese', 'Emphatic', and 'Neutral'
        A      M      E      N      Σ       RR
A     899     90    303    265   1557    57.7%
M     110    697     68    349   1224    56.9%
E     273     43   1076    253   1645    65.4%
N     215    201    266    963   1645    58.5%
∅                                        59.7%

Table 2. Machine decoder 2: confusion matrix and recognition rates (RR) evaluated using hard decisions for the classes 'Anger', 'Motherese', 'Emphatic', and 'Neutral'
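The recognition rates in Tab. 1 and Tab. 2 can be recomputed from the confusion matrices with a few lines of Python. The sketch assumes that rows correspond to the reference class, which is consistent with the row sums matching the per-class word counts of the data set; the function names are hypothetical.

```python
# Rows: reference class (majority vote), columns: decoder output; order A, M, E, N
machine1 = [[791, 47, 261, 458],
            [56, 559, 27, 582],
            [214, 23, 947, 461],
            [100, 94, 161, 1290]]
machine2 = [[899, 90, 303, 265],
            [110, 697, 68, 349],
            [273, 43, 1076, 253],
            [215, 201, 266, 963]]

def class_rates(matrix):
    """Per-class recognition rate: diagonal entry divided by the row sum."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def average_rate(matrix):
    """Class-wise averaged recognition rate (unweighted mean over the classes)."""
    rates = class_rates(matrix)
    return sum(rates) / len(rates)

for name, matrix in (("machine 1", machine1), ("machine 2", machine2)):
    rates = ", ".join(f"{100 * r:.1f}%" for r in class_rates(matrix))
    print(f"{name}: {rates}; average {100 * average_rate(matrix):.1f}%")
```

The class-wise average weights all four classes equally, regardless of how many words each class contains.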
The intention of this paper is to compare these two machine classifiers with an average human labeler as described in Sec. 2. Prior to this, however, we present results for different naive classifiers. In Fig. 2 (left), entropy histograms for an average human labeler and a random choice classifier, which randomly chooses one of the four classes, are shown. As expected, the mean entropy for this simple classifier (1.050, Tab. 3) is much higher than for the human labeler (0.721). Accordingly, the histogram of the random choice classifier is shifted to the right. On the right side of Fig. 2, the histograms of two other naive decoders are shown. One classifier always decides in favor of 'Neutral', the other one always for 'Motherese'. An analysis of the data set shows that human labelers are often not sure whether they should label a word as emotional or as neutral, due to the weak emotions we are dealing with. Consequently, deciding for 'Neutral' conforms more to the human labeling behavior than deciding for a certain emotion class. This fact is reflected in our entropy values as well. The mean entropy for the classifier that always chooses 'Neutral' is 0.843, which is better than random choice. In contrast, always deciding for 'Motherese' is considerably worse (1.196).
decoder                   entropy measure
human majority voting     0.542
human labeler             0.721
machine 1                 0.722
machine 2                 0.758
choose always 'N'         0.843
choose always 'E'         1.049
random choice             1.050
choose always 'A'         1.127
choose always 'M'         1.196

Table 3. Different decoders and their classification performance according to our entropy measure
As for the comparison between the two machine classifiers, the entropy measure $H$ from Eq. 4 shows that the decoder machine 1 performs as well as an average human labeler, although it yields an average recognition rate per class of “only” 58.1%. At 0.722, its mean entropy is almost identical to the value attained by the human labelers (0.721). Our second machine decoder, machine 2, is slightly superior to machine 1 in terms of recognition rates, yet performs a little worse in terms of the mean entropy (0.758). The reason becomes obvious if one looks at the confusion matrices in Tab. 1 and Tab. 2. Both neural networks are trained in such a way that all four classes should be recognized equally well. This works better if hard labels are used for training, as in the case of machine 2. In contrast, machine 1 tends to favor 'Neutral', and this is exactly what humans do in our data set. This is why the entropy measure, being a rather intuitive one, prefers machine 1 over machine 2, even though its average recognition rate is lower.
The reference for calculating recognition rates is the majority voting of all five labelers. This majority voting can also be interpreted as a decoder. In Fig. 3 (right), this decoder is plotted in comparison with an average human labeler. The mean entropy of 0.542 specifies the minimum entropy which can be achieved by a machine decoder. Thus, a machine classifier can very well be better than a single human on average. The results show that we are as good as one of our human labelers on average, but that there is also enough room for further improvements.
5. CONCLUSION
We proposed a new entropy-based measure which makes a comparison between human labelers and machine classifiers possible. Even more important for the evaluation is the fact that systematic confusions of human reference labelers are taken into account, as in most of our cases the reference is far from being non-ambiguous. For instance, slight forms of 'Anger' are often confused with 'Emphatic' or with 'Neutral' since it is very hard to distinguish among these emotions, even for humans. From the application’s point of view, deciding for a similar class cannot be that wrong in those cases. Our measure punishes classification faults that also occur in human classification less than those faults that are never made by humans. Traditional recognition rates are not capable of this distinction.
[Two histogram panels; x-axis: entropy, y-axis: rel. frequency [%]]
Fig. 2. Comparison between an average human labeler and three naive classifiers: a decoder which randomly selects one of the four classes (left) and two decoders which always choose 'Neutral' and 'Motherese', respectively (right)
[Two histogram panels; x-axis: entropy, y-axis: rel. frequency [%]]
Fig. 3. Comparison between an average human labeler and our machine decoder 1 (left) and the majority voting of our five human labelers (right)
6. REFERENCES
[1] C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948. Reprint available at cm.bell-labs/cm/ms/what/shannonday/paper.html (08/17/2004).
[2] A. Batliner, C. Hacker, S. Steidl, E. Nöth, S. D’Arcy, M. Russell, and M. Wong, “‘You stupid tin box’ - children interacting with the AIBO robot: A cross-linguistic emotional speech corpus,” in Proc. of the 4th International Conference on Language Resources and Evaluation (LREC), 2004, vol. 1, pp. 171–174.
[3] R. Sproat, A. W. Black, S. Chen, S. Kumar, M. Ostendorf, and C. Richards, “Normalization of non-standard words,” Computer Speech and Language, vol. 15, pp. 287–333, 2001.
[4] A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth, “How to Find Trouble in Communication,” Speech Communication, vol. 40, pp. 117–143, 2003.
[5] J. Buckow, Multilingual Prosody in Automatic Speech Understanding, Logos, Berlin, 2003.
