
“OF ALL THINGS THE MEASURE IS MAN”

AUTOMATIC CLASSIFICATION OF EMOTIONS AND INTER-LABELER CONSISTENCY

Stefan Steidl, Michael Levit, Anton Batliner, Elmar Nöth, and Heinrich Niemann

Lehrstuhl für Mustererkennung, Universität Erlangen-Nürnberg

Martensstraße 3, 91058 Erlangen, Germany

steidl@informatik.uni-erlangen.de

ABSTRACT

In traditional classification problems, the reference needed for training a classifier is given and considered to be absolutely correct. However, this does not apply to all tasks. In emotion recognition in non-acted speech, for instance, one often does not know which emotion was really intended by the speaker. Hence, the data is annotated by a group of human labelers who, in most cases, do not agree on one common class. Often, similar classes are confused systematically. We propose a new entropy-based method to evaluate classification results taking into account these systematic confusions. We can show that a classifier which achieves a recognition rate of “only” about 60% on a four-class problem performs as well as our five human labelers on average.

1. INTRODUCTION

An essential aspect of pattern recognition is the classification of patterns. Besides the search for applicable features, a lot of work has also been done to develop new automatic classification techniques and to improve existing ones. Well known are, for instance, artificial neural networks and support vector machines, which became very popular in the last few years. In the case of supervised learning, the classifiers are trained to map a set of features into a given reference class. The standard method to evaluate an automatic classifier is to calculate the recognition rate, which is the percentage of correctly recognized samples. The basic assumption is that this reference class is given and that it is non-ambiguous.

In our work on the recognition of emotions on the basis of emotional speech, we face the problem that it is not clear at all which emotions the people expressed when they were recorded. The corpus on which the experiments in this paper were done consists of children playing with the Sony robot Aibo. The kids were asked to direct the Aibo along a given route; they were not asked to express any emotions. Nevertheless, emotional behavior can be observed in these recordings. As these emotions are not acted by professional actors, but are emotions as they appear in daily life, they are called “realistic”. From the application developers' point of view, it is very important to deal with such realistic behavior. However, one side effect is that one has to cope with relatively weak emotions in contrast to full-blown emotions of acted speech. As the recorded persons do not have to play a given emotion, and as it is often not feasible to ask them afterwards what kind of emotions they felt during the recordings, one employs human labelers to label the data set. Normally, all available labelers agree on one common label in only a few cases. In our corpus, in most cases, only three out of five labelers agreed. Yet this is not a problem of bad labeling but rather a consequence of the fact that we are dealing with a realistic classification problem which also raises real difficulties for humans. Accordingly, the expectations of the automatic classifier measured in recognition rates must be lowered.

This work was partially funded by the EU projects PF-STAR (IST-2001-37599) and HUMAINE (IST-2002-507422). The responsibility for the content lies with the authors.

In order to be able to calculate recognition rates at all, hard decisions are needed for the reference as well as for the classifier's output. If a metric can be imposed on the label space, the labels of all labelers can be averaged. This is well possible, for example, if the tiredness of persons is labeled on a scale from 1 to 10. If two labelers judge someone with '8' as very tired and one labeler says only '5', the reference would be '7'. But this does not work for categorical emotion labels like anger, bored, etc., as the mean of anger and bored is not defined. In those cases, the state of the art is to use a majority voting to create the reference. Proceeding this way, we achieve recognition rates of about 60% on our corpus with four emotion classes, which is a state-of-the-art result for a task set-up like that.

Nonetheless, the assessment of the emotion classification success should not be done without considering how well humans would perform in this task. Depending on the number and type of classes, human labelers confuse certain classes with each other more than other classes. In general, the more similar classes are, the more they are confused. This confusion should be considered in the evaluation of a classifier. If the automatic classifier makes the same “mistakes” as many humans do, then this fault cannot be as severe as if the classifier mixes up two classes that are never confused by humans. Instead, the question is whether such systematic confusions are faults at all, since “of all things the measure is man”, as Protagoras already said more than 2400 years ago.
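The two ways of building a hard reference described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the `min_votes` parameter are hypothetical, and returning `None` for ties is an assumption made only for this sketch.

```python
from collections import Counter
from statistics import mean


def average_label(scores):
    """Reference for labels on a metric scale: the mean of all labelers' scores."""
    return mean(scores)


def majority_vote(labels, min_votes=1):
    """Most frequent categorical label.

    Returns None (an assumption of this sketch) if the top label is tied
    or if fewer than min_votes labelers agree on it.
    """
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if tied or top_count < min_votes:
        return None
    return top_label


# Tiredness example from the text: two labelers say 8, one says 5.
print(average_label([8, 8, 5]))

# Categorical labels cannot be averaged, so the majority decides.
print(majority_vote(["A", "A", "E", "A", "N"], min_votes=3))
```

The majority vote discards the information that two of the five hypothetical labelers chose a different class, which is the limitation the entropy-based measure is meant to address.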

In this paper, we would like to propose a new entropy-based measure to judge a classifier's output taking systematic confusions made by humans into account.

2. ENTROPY-BASED MEASURE

According to Shannon's information theory [1], the entropy is a measure for the information content. We propose to use the entropy to measure the unanimity of the labelers. If all reference labelers agree on one class, the entropy will be zero. Otherwise, the more the labelers disagree, the higher the entropy will be. In the following, we assume that we have $N$ labelers $L_n$ who have labeled a data set of $S$ samples $X_s$. For each sample, each of our labelers has to decide in favor of one of $K$ classes $C_k$. However, the approach is also easily portable to soft labels where all classes get scores from a continuous range of values and all scores for a sample sum up to one. The hard decisions of any number of labelers can be converted into one soft reference label as depicted in Fig. 1 for a four-class problem ($K = 4$) with ten labelers. The more the labelers disagree, the flatter the distribution of the soft label.

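[Figure 1: table of hard decisions per labeler (recoverable entries: labelers 5 to 10 chose A, E, A, A, N, E) converted into the soft label A = 0.5, M = 0.0, E = 0.3, N = 0.2]

Fig. 1. Conversion of the hard decisions of ten labelers into a soft reference label $l_{\mathrm{ref}}$. The four classes are 'Anger', 'Motherese', 'Emphatic', and 'Neutral'.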
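The conversion shown in Fig. 1 amounts to counting votes and normalizing. The sketch below assumes a hypothetical helper `soft_label`; the list of ten votes is illustrative and only chosen so that the class counts (five 'A', three 'E', two 'N', no 'M') match the soft label of the figure.

```python
from collections import Counter

CLASSES = ("A", "M", "E", "N")  # Anger, Motherese, Emphatic, Neutral


def soft_label(hard_labels, classes=CLASSES):
    """Relative frequency of each class among the labelers' hard decisions."""
    counts = Counter(hard_labels)
    total = len(hard_labels)
    return [counts[c] / total for c in classes]


# Illustrative votes of ten labelers; only the counts per class matter.
votes = ["A", "A", "E", "N", "A", "E", "A", "A", "N", "E"]
print(dict(zip(CLASSES, soft_label(votes))))
```

A single hard decision is the special case of a soft label with all mass on one class, which is how the decision of an additional decoder can be represented in the same form.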

Our suggestion is to leave out each labeler (we can also use the more general term “decoder”) in succession. If labeler $n$ is left out, then the resulting soft reference label for sample $X_s$ is denoted $l_{\mathrm{ref}}(\bar{n}, s)$, with $\bar{n}$ indicating the omitted labeler.

Now, we add another decoder. This can be an automatic classifier, but also the remaining human labeler who was omitted in the reference, so that direct comparisons between a classifier and a human labeler are possible. In order to avoid dependency on the number of labelers, the new decoder is not considered in the same manner as the other reference labelers. Instead, the hard decision of the new decoder for sample $X_s$ (also converted into a soft label $l_{\mathrm{dec}}(s)$) is weighted 1:1 with the reference label $l_{\mathrm{ref}}(\bar{n}, s)$:

$$l(\bar{n}, s) = 0.5 \cdot l_{\mathrm{ref}}(\bar{n}, s) + 0.5 \cdot l_{\mathrm{dec}}(s) \tag{1}$$

Then, the entropy can be calculated for the given sample $X_s$:

$$H(\bar{n}, s) = -\sum_{k=1}^{K} l_k(\bar{n}, s) \cdot \log\bigl(l_k(\bar{n}, s)\bigr) \tag{2}$$

Taking the example of Fig. 1, the entropy will decrease compared to the reference labels if the decoder decides in favor of 'Anger', as 'Anger' is what the majority of labelers said. If the decoder chooses 'Emphatic' instead, the entropy will be higher than for 'Anger', but not as high as if the decoder decides in favor of 'Neutral', since 30% of the labelers agree that this sample is 'Emphatic' and only 20% said the sample is 'Neutral'. As none of the labelers decided for 'Motherese', choosing this class yields the highest entropy. This makes sense since 'Motherese' seems to be definitely wrong in this case. Note that if hard decisions were used, 'Anger' would be the only correct class although 50% of the labelers disagree.
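The ranking described for the Fig. 1 example can be checked numerically by applying Eq. 1 and Eq. 2 to the soft label (0.5, 0.0, 0.3, 0.2) for each possible decoder decision. This is a sketch under stated assumptions: the helper names are hypothetical, the natural logarithm is used (the base in Eq. 2 is an assumption here and does not affect the ranking), and terms with zero probability are taken to contribute zero.

```python
import math

CLASSES = ("A", "M", "E", "N")  # Anger, Motherese, Emphatic, Neutral


def one_hot(label, classes=CLASSES):
    """Soft label of a single hard decision."""
    return [1.0 if c == label else 0.0 for c in classes]


def mix(l_ref, l_dec):
    """Eq. 1: weight reference and new decoder 1:1."""
    return [0.5 * r + 0.5 * d for r, d in zip(l_ref, l_dec)]


def entropy(dist):
    """Eq. 2 with natural logarithm (assumed base); 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


l_ref = [0.5, 0.0, 0.3, 0.2]  # soft reference label from Fig. 1
print("reference only:", round(entropy(l_ref), 3))
for label in CLASSES:
    h = entropy(mix(l_ref, one_hot(label)))
    print(f"decoder chooses {label}:", round(h, 3))
```

Running the sketch gives the ordering 'Anger' < 'Emphatic' < 'Neutral' < 'Motherese', so a decision is penalized less the more labelers share it.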

The next step is to average each of the two computed entropy values for $X_s$ over the left-out labelers:

$$H(s) = \frac{1}{N} \sum_{n=1}^{N} H(\bar{n}, s) \tag{3}$$
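The leave-one-out procedure behind Eq. 3 can be sketched for a single sample as follows. The function `sample_entropy` and its `machine_label` parameter are hypothetical names; the five votes are invented for illustration, and the natural logarithm is again an assumed base. If no machine decision is passed, the left-out human labeler acts as the new decoder.

```python
import math
from collections import Counter

CLASSES = ("A", "M", "E", "N")


def soft_label(hard_labels):
    """Relative class frequencies of a list of hard decisions."""
    counts = Counter(hard_labels)
    return [counts[c] / len(hard_labels) for c in CLASSES]


def entropy(dist):
    """Entropy with natural logarithm (assumed base); zero entries are skipped."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def sample_entropy(labels, machine_label=None):
    """H(s): mean over all left-out labelers n of H(n-bar, s)."""
    total = 0.0
    for n, own_label in enumerate(labels):
        reference = soft_label(labels[:n] + labels[n + 1:])
        # New decoder: the machine, or the labeler who was just left out.
        decision = own_label if machine_label is None else machine_label
        decoder = soft_label([decision])
        mixed = [0.5 * r + 0.5 * d for r, d in zip(reference, decoder)]
        total += entropy(mixed)
    return total / len(labels)


labels = ["A", "A", "A", "E", "N"]  # hypothetical votes of five labelers
h_human = sample_entropy(labels)
h_machine = sample_entropy(labels, machine_label="A")
print(round(h_human, 3), round(h_machine, 3), h_machine <= h_human)
```

In this toy case the machine agrees with the majority, so its entropy does not exceed that of the average human labeler.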

We say that our classifier performs not worse than an average human labeler on sample $X_s$ if the entropy from Eq. 3 with this classifier as the new decoder does not exceed the entropy where the additional decoders were always humans. By plotting two corresponding histograms of $H(s)$ for the entire corpus, we obtain a visual means for the assessment of the performance of the classifier on this corpus: the closer the histogram for the classifier is to the histogram for the human labelers, the better the classifier. In general, nothing is known about the distributions approximated by these histograms. However, if instead of plotting entropy values of individual samples we average them over series of several samples, then, according to the central limit theorem, the resulting distributions will be approximately normal and thus describable in terms of their means and variances. In our experiments we used series of 20 samples.
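Averaging over series of samples can be sketched as below. How the series were formed is not stated in the text, so consecutive, non-overlapping series and dropping an incomplete last series are assumptions of this sketch; `series_means` is a hypothetical name and the entropy values are synthetic.

```python
import random
from statistics import mean, stdev


def series_means(values, series_length=20):
    """Means of consecutive, non-overlapping series; an incomplete tail is dropped."""
    n_series = len(values) // series_length
    return [
        mean(values[i * series_length:(i + 1) * series_length])
        for i in range(n_series)
    ]


# Synthetic per-sample entropy values, for illustration only.
random.seed(0)
entropies = [random.uniform(0.0, 1.2) for _ in range(200)]

means = series_means(entropies)
print(len(means), round(mean(means), 3), round(stdev(means), 3))
```

The series means are what would be plotted in the histograms; their mean and standard deviation summarize the approximately normal distribution.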

The overall entropy mean itself can be used for comparison and is computed by averaging $H(s)$ over all samples of the data set:

$$H = \frac{1}{S} \sum_{s=1}^{S} H(s) \tag{4}$$

3. THE AIBO-EMOTION-CORPUS

This entropy-based measure is useful in all those cases where a large discrepancy between the human reference labelers exists. In this paper, we demonstrate the evaluation of different decoders considering the example of emotion recognition in speech of children. All experiments are done on a subset of our Aibo-Emotion-Corpus, which consists of recordings of 51 children at the age of 10 to 13 years. The children were asked to direct the Aibo robot along a given route and to certain objects. To evoke emotions, the Aibo was operated by remote control and misbehaved at predefined positions. In addition, the children were told to address Aibo like a normal dog, especially to reprimand or to laud it. Besides that, we pressed the children slightly for time and put up some danger spots where Aibo was not allowed to go under any circumstances. Nevertheless, the recorded emotions are relatively weak, especially in contrast to full-blown emotions of acted speech.

The corpus consists mainly of the four emotions 'Anger', 'Motherese', 'Emphatic', and 'Neutral', which were annotated at word level by five experienced graduate labelers. Before labeling, the labelers agreed on a common set of discrete emotions. For a more detailed description of the corpus, please refer to [2]. As 'Neutral' is the most frequent “emotion” by far, we downsampled the data until all four classes were equally present according to the majority voting of our five labelers. At least three labelers had to agree. Cases where fewer than three labelers agreed were omitted, as well as those cases where classes other than the four basic classes were labeled. In the final data set, 1557 words for 'Anger', 1224 words for 'Motherese', and 1645 words each for 'Emphatic' and for 'Neutral' are used.

The inter-labeler consistency can be measured using the kappa statistic. The formula is given in [3]. For our subset, the kappa value is only 0.36, which expresses the large disagreement of our five labelers. It is generally agreed that kappa scores greater than 0.7 indicate good agreement. As mentioned above, our low kappa value is not due to bad labeling. Rather, we are dealing with a difficult classification problem where even human labelers disagree about certain classes.
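For readers who want to compute an agreement score on their own labels, the sketch below implements Fleiss' kappa, a common kappa statistic for more than two labelers. It is swapped in for illustration; the formula used in the paper is the one given in [3] and may differ in detail. The function name and the toy counts are hypothetical.

```python
def fleiss_kappa(count_rows):
    """Fleiss' kappa.

    count_rows[i][k] is the number of labelers who assigned class k to item i;
    every item must be labeled by the same number of labelers.
    """
    n_items = len(count_rows)
    n_raters = sum(count_rows[0])
    n_classes = len(count_rows[0])

    # Observed agreement: fraction of agreeing labeler pairs, averaged over items.
    p_observed = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in count_rows
    ) / n_items

    # Chance agreement from the overall class proportions.
    proportions = [
        sum(row[k] for row in count_rows) / (n_items * n_raters)
        for k in range(n_classes)
    ]
    p_chance = sum(p * p for p in proportions)

    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy data: four words, five labelers, classes A, M, E, N.
toy_counts = [
    [3, 0, 1, 1],
    [0, 4, 0, 1],
    [1, 0, 3, 1],
    [0, 1, 1, 3],
]
print(round(fleiss_kappa(toy_counts), 3))
```

A value of 1 means perfect agreement and a value near 0 means agreement at chance level.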

4. MACHINE CLASSIFICATION OF EMOTIONS

The experiments described in the following are all conducted with artificial neural networks. Because of the small data set, we do “Leave-One-Speaker-Out” experiments: 1 speaker for testing, 40 speakers for training, and 10 speakers for validation of the neural networks. As features we use our set of 95 prosodic features and 30 part-of-speech features. Details on these features can be found in [4, 5]. The total number of features is reduced to 95 using principal component analysis (PCA). Two machine classifiers are trained: machine 1 is trained with soft labels, machine 2 with hard labels. The results in terms of traditional recognition rates are given in Tab. 1 and Tab. 2 together with a confusion matrix of the classes. At 59.7%, the average recognition rate per class is slightly higher for machine 2, which is trained with hard labels, than for machine 1, which achieves 58.1%. The majority voting of all five labelers serves as hard reference.

       A      M      E      N      Σ      RR
A    791     47    261    458   1557   50.8%
M     56    559     27    582   1224   45.7%
E    214     23    947    461   1645   57.6%
N    100     94    161   1290   1645   78.4%
∅                                       58.1%

Table 1. Machine decoder 1: confusion matrix and recognition rates (RR) evaluated using hard decisions for the classes 'Anger', 'Motherese', 'Emphatic', and 'Neutral'

       A      M      E      N      Σ      RR
A    899     90    303    265   1557   57.7%
M    110    697     68    349   1224   56.9%
E    273     43   1076    253   1645   65.4%
N    215    201    266    963   1645   58.5%
∅                                       59.7%

Table 2. Machine decoder 2: confusion matrix and recognition rates (RR) evaluated using hard decisions for the classes 'Anger', 'Motherese', 'Emphatic', and 'Neutral'
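The recognition rates in the tables can be recomputed from the confusion matrix counts. The sketch below uses the counts of Table 1 and assumes that rows are reference classes and columns are recognized classes, which is consistent with the row sums matching the class sizes of the data set; the function names are hypothetical.

```python
CLASSES = ("A", "M", "E", "N")  # Anger, Motherese, Emphatic, Neutral

# Counts of Table 1 (machine decoder 1); rows: reference, columns: recognized.
TABLE_1 = [
    [791, 47, 261, 458],
    [56, 559, 27, 582],
    [214, 23, 947, 461],
    [100, 94, 161, 1290],
]


def class_wise_rates(confusion):
    """Recognition rate per class: diagonal entry divided by the row sum."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def average_rate(confusion):
    """Unweighted mean of the class-wise recognition rates."""
    rates = class_wise_rates(confusion)
    return sum(rates) / len(rates)


for name, rate in zip(CLASSES, class_wise_rates(TABLE_1)):
    print(f"{name}: {100 * rate:.1f}%")
print(f"average: {100 * average_rate(TABLE_1):.1f}%")
```

Because the average is taken over classes rather than over words, the smaller 'Motherese' class counts as much as the larger classes.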

The intention of this paper is to compare those two machine classifiers with an average human labeler as described in Sec. 2. But prior to this, we present results for different naive classifiers. In Fig. 2 (left), entropy histograms for an average human labeler and a random choice classifier, which randomly chooses one of the four classes, are shown. As expected, the mean entropy for the simple classifier (1.050, Tab. 3) is much higher than for the human labeler (0.722). Accordingly, the histogram of the random choice classifier is shifted to the right. On the right side of Fig. 2, the histograms of two other naive decoders are shown. One classifier always decides in favor of 'Neutral', the other one always for 'Motherese'. Analyzing the data set, it is obvious that human labelers are often not sure whether they should label a word as emotional or as neutral due to the weak emotions we are dealing with. Consequently, deciding for 'Neutral' conforms more to the human labeling behavior than deciding for a certain emotion class. This fact is reflected in our entropy values as well. The mean entropy for the classifier that always chooses 'Neutral' is 0.843, which is better than random choice. In contrast, always deciding for 'Motherese' is considerably worse (1.196).

decoder                  entropy measure
human majority voting    0.542
human labeler            0.721
machine 1                0.722
machine 2                0.758
choose always 'N'        0.843
choose always 'E'        1.049
random choice            1.050
choose always 'A'        1.127
choose always 'M'        1.196

Table 3. Different decoders and their classification performance in terms of our entropy measure

As for the comparison between the two machine classifiers, the entropy measure $H$ from Eq. 4 shows that the decoder machine 1 performs as well as an average human labeler, although it yields an average recognition rate per class of “only” 58.1%. At 0.722, its mean entropy is almost identical to the value attained by the human labelers (0.721). Our second machine decoder, machine 2, even though it is slightly superior to machine 1 in terms of recognition rates, performs a little worse in terms of the mean entropy (0.758). The reason becomes obvious if one looks at the confusion matrices in Tab. 1 and Tab. 2. Both neural networks are trained in such a way that all four classes should be recognized equally well. This works better if hard labels are used for training, as in the case of machine 2. In contrast, machine 1 tends to favor 'Neutral', and this is exactly what humans do in our data set. This is why the entropy measure, being a rather intuitive one, prefers machine 1 over machine 2, even though its recognition rates are lower.

The reference for calculating recognition rates is the majority voting of all five labelers. This majority voting can also be interpreted as a decoder. In Fig. 3 (right), this decoder is plotted in comparison with an average human labeler. The mean entropy of 0.542 specifies the minimum entropy which can be achieved by a machine decoder. Thus, a machine classifier can very well be better than a single human on average. The results show that we are as good as one of our human labelers on average, but that there is also enough room for further improvements.

5. CONCLUSION

We proposed a new entropy-based measure which makes a comparison between human labelers and machine classifiers possible. Even more important for the evaluation is the fact that systematic confusions of human reference labelers are taken into account, as in most of our cases the reference is far from being non-ambiguous. For instance, slight forms of 'Anger' are often confused with 'Emphatic' or with 'Neutral' since it is very hard to distinguish among these emotions, even for humans. From the application's point of view, deciding for a similar class cannot be that wrong in those cases. Our measure punishes classification faults that also occur in human classification less than those faults that are never made by humans. Traditional recognition rates are not capable of this distinction.

[Figure 2: two histogram panels; horizontal axis: entropy, vertical axis: rel. frequency [%]]

Fig. 2. Comparison between an average human labeler and three naive classifiers: a decoder which randomly selects one of the four classes (left) and two decoders which always choose 'Neutral' and 'Motherese', respectively (right)

[Figure 3: two histogram panels; horizontal axis: entropy, vertical axis: rel. frequency [%]]

Fig. 3. Comparison between an average human labeler and our machine decoder 1 (left) and the majority voting of our five human labelers, respectively (right)

6. REFERENCES

[1] C. E. Shannon, “A Mathematical Theory of Communication,” in Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948, reprint available at cm.bell-labs/cm/ms/what/shannonday/paper.html (08/17/2004).

[2] A. Batliner, C. Hacker, S. Steidl, E. Nöth, S. D'Arcy, M. Russel, and M. Wong, “‘You stupid tin box’ - children interacting with the AIBO robot: A cross-linguistic emotional speech corpus,” in Proc. of the 4th International Conference on Language Resources and Evaluation (LREC), 2004, vol. 1, pp. 171–174.

[3] R. Sproat, A. W. Black, S. Chen, S. Kumar, M. Ostendorf, and C. Richards, “Normalization of non-standard words,” in Computer Speech and Language, vol. 15, pp. 287–333, 2001.

[4] A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth, “How to Find Trouble in Communication,” Speech Communication, vol. 40, pp. 117–143, 2003.

[5] J. Buckow, Multilingual Prosody in Automatic Speech Understanding, Logos, Berlin, 2003.
