Lessons Learned from a Task-Based Evaluation
of Speech-to-Speech Machine Translation
Lori Levin, Boris Bartlog, Ariadna Font Llitjos, Donna Gates,
Alon Lavie, Dorcas Wallace, Taro Watanabe, Monika Woszczyna
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
Abstract
For several years we have been conducting Accuracy Based Evaluations (ABE) of the JANUS speech-to-speech MT system (Gates et al., 1997), which measure quality and fidelity of translation. Recently we have begun to design a Task Based Evaluation for JANUS (Thomas, 1999) which measures goal completion. This paper describes what we have learned by comparing the two types of evaluation. Both evaluations (ABE and TBE) were conducted on a common set of user studies in the semantic domain of travel planning.
1. Introduction
For several years we have been conducting Accuracy Based Evaluations (ABE) (Gates et al., 1997) of the JANUS speech-to-speech machine translation system (Waibel, 1996; Levin et al., to appear). Our ABE focuses on whether the meaning of a source language segment is totally and accurately conveyed in the target language, and also includes a separate measure of fluency. This type of evaluation was useful in the early stages of system development for tracking our improvement over time. The measures we used were the percent of sentences that were accurate (we call these acceptable) and the percent that were both accurate and fluent (we call these perfect). However, when our system reached a level of coverage that allowed us to begin user studies, we noticed that the ability of a user to complete a task (for example, getting a plane reservation) was higher than would be expected based on an ABE. For example, the ABE might be around 70% acceptable, but the users could almost always complete the task. Recently we have begun to design a Task Based Evaluation (TBE) for JANUS (Thomas, 1999) which measures goal completion. This paper describes what we have learned by comparing the two types of evaluation.
2. Design Criteria
Most previous work on TBE has been conducted on human-machine dialogue (for example, Walker et al., 1997). For machine translation, we need a TBE that is suitable for two humans, each expressing communicative goals, but mediated by a machine. (Our coding scheme for communicative goals is described below.) In particular, we have to separate human clumsiness and error from machine error, because we are not evaluating the humans, but rather the translation of what they said. Additionally, we have to allow for a large and unpredictable number of communicative goals in each dialogue. For example, using the goal coding scheme described below, the dialogues we are evaluating each contain over one hundred communicative goals. After coding the communicative goals in a dialogue, we had to design a scoring function that takes into account whether the communicative goals ultimately succeed or fail and how many times each goal is attempted before succeeding (being understood by the interlocutor) or being abandoned.
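To make these coding requirements concrete, a coded communicative goal can be thought of as a small record carrying the speaker, the number of attempts, and the final outcome. The following is a minimal sketch of such a record; the class and field names are ours and are not part of the coding scheme itself.

    from dataclasses import dataclass

    @dataclass
    class CodedGoal:
        """One communicative goal coded in a dialogue (hypothetical representation).

        label:     the goal itself, e.g. a domain action such as
                   "request-information+features+trip"
        speaker:   "agent" or "traveller", so that machine error can be examined
                   separately for each role
        attempts:  how many times the speaker tried to get this goal across
        succeeded: whether the interlocutor ultimately understood the goal
        """
        label: str
        speaker: str
        attempts: int
        succeeded: bool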
3. The Data
The data used for this evaluation came from three user-study dialogues that were unseen by system developers. In each dialogue, the role of the traveller was played by a second-time user of our machine translation system and the role of the travel agent was played by one of the system developers. The traveller was told to book a trip to Kyoto. Input to the system was through a headset with microphone. The agent and traveller could not see or hear each other. The only communication was through the user interface, which included speech synthesis, written translations, and web pages showing itineraries and travel information. There is a total of 254 utterances in the three dialogues.
In these user studies, the source and target languages were both English. This does constitute a real translation in that it goes through all of the machine translation components: English sentences are parsed to produce interlingua representations (see below) and then new English sentences are generated from the interlingua. One could argue, however, that there may be some translation problems which do not appear in English-to-English translation. For this reason, we conducted an additional informal user study in which the travel agent was speaking German and the traveller was speaking English. This was not as carefully controlled as the original user studies; the two users could hear each other and the German speaker also understood English.
4. Coding Scheme for Communicative Goals
The most difficult issue in designing our TBE was defining what counts as a communicative goal, because we need a definition that allows goals to be coded with high reliability. The example below shows two consecutive utterances with their ideal interlingua (IF) representations, together with how the first was actually recognized and translated:
Transcription (1), Agent:  WOHIN#6f REISEN SIE#7f  ("Where are you travelling?")
Ideal IF:                  a:request-information+features+trip (location=question)
Recognized as:             WANN REISEN SIE  ("When are you travelling?")
German paraphrase:         Wann reisen Sie ab?  ("When are you leaving?")
English translation:       When will you leave?
Transcription (2), Client: uh i'm leaving#8f next monday#9f
Ideal IF:                  c:give-information+temporal+departure (time=next ...)
                   Agent    Traveller
ABE                58.7%    51.8%
TBE score           .75      .65
TBE success        82.8%    73.8%

Table 1: Results of Accuracy- and Task-Based Evaluations for English-English Paraphrase
5. The Scoring Function
Our TBE scoring scheme assigns each identified goal in the dialogue a score ranging between minus one and one. The score is determined according to the formula below (Thomas, 1999). The formula takes into account whether the goal ultimately succeeds or fails and the number of times the goal was attempted before the user finally succeeded or gave up. The number of attempts is denoted by n:

    score = 1/n          if the goal succeeds
    score = 1/n - 1      if the goal fails
The TBE score for a complete dialogue is calculated as the average of the per-goal scores, taken over all goals in the dialogue. The rationale behind the scoring formula is the following:
- A goal that succeeds on its first attempt receives the maximal score of one. Goals that succeed only after further attempts score less, with a penalty that decays as a function of the number of attempts.
- Goals that fail should be penalized more heavily as a function of the number of attempts, since the number of attempts can be indicative of the importance of the goal.
- Thus, a goal that was attempted once and then abandoned receives a score of zero, while a goal attempted ten times without success and then abandoned receives a score of -0.9. The penalty per additional attempt decays, so the score approaches minus one as the number of attempts grows.
Our explicit goal in the design of the scoring function was to come up with a function that in fact followed the above rationale. Our formula is only one of a variety of functions which would have the above desired properties. We do not attach great significance to the specific function chosen, but rather to the desired properties themselves. While different functions would result in different absolute scores for individual goals as well as complete dialogues, it is the relative score of different dialogues that is ultimately of greater interest in a TBE.
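The following is a minimal sketch of the scoring scheme, assuming the per-goal formula given above; the function names and the example goal list are ours, for illustration only.

    def goal_score(attempts: int, succeeded: bool) -> float:
        """Score one communicative goal on a scale from -1 to 1.

        A goal that succeeds on the first attempt scores 1 and later successes
        score less; a goal abandoned after one attempt scores 0, and the penalty
        for abandoned goals grows toward -1 with the number of attempts.
        """
        if succeeded:
            return 1.0 / attempts
        return 1.0 / attempts - 1.0


    def dialogue_score(goals: list[tuple[int, bool]]) -> float:
        """Average the per-goal scores over all goals in a dialogue."""
        return sum(goal_score(attempts, ok) for attempts, ok in goals) / len(goals)


    # Hypothetical dialogue: three goals succeed on the first attempt, one on the
    # second attempt, and one is abandoned after two attempts.
    print(dialogue_score([(1, True), (1, True), (1, True), (2, True), (2, False)]))
    # (1 + 1 + 1 + 0.5 - 0.5) / 5 = 0.6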
6. Results
Table 1 shows the results of the ABE and TBE on English-to-English translation. There were four human coders. The ABE score is the percent of utterances whose translations preserved the original meaning. The TBE score was computed by the formula above, taking into account success/failure of goals in addition to the number of attempts for each goal. The row labeled TBE success shows the percentage of goals that ultimately succeeded (out of a total of approximately 460 goals in three dialogues). Each row breaks down into a score for the agent (who was an experienced user), a score for the traveller (a second-time user), and an overall score for agent and traveller.
The results for the less controlled English-German experiment are as follows. In one dialogue coded by one coder, there were 102 goals and a total of 133 attempts (about 1.3 attempts per goal). 83% of the goals ultimately succeeded. The score returned by our scoring function is .73. The ABE showed 63% acceptable translations.
7. Discussion and Lessons Learned
There are a few things to notice about Table 1. For example, the users playing the travel agent role have more success in both ABE and TBE than users playing the traveller role. This is because the pretend travel agents were system developers and the travellers were second-time users of our machine translation system.
Another notable point about Table 1 is that task success (73.8%) is higher than translation accuracy (51.8%). This confirms the need for TBE in addition to ABE. The reason task success is higher than translation accuracy is that both experienced and inexperienced users accepted some bad translations as long as they could be understood in context. For example, in the context of the question How much does it cost?, users will accept the answer 128 hours.
The percent of task success, however, does not provide a measure of user frustration (Walker et al., 1997). This is why we formulated the TBE scoring function to take into account success/failure of goals as well as the number of attempts at each goal. (In future work, we will give some thought to making the TBE score, which is on a minus-one-to-one scale, more comparable to the ABE score, which is expressed as a percentage.) In sum, we find three kinds of measures useful: a measure of quality and fidelity, a measure of goal success/failure, and a measure of user effort combined with success/failure.
We will close by giving some examples that illustrate a peculiarity in our coding scheme. The utterance two is associated with the IF give-information+numeral (numeral=2), which has a domain action and an argument; therefore, it counts as two communicative goals. A slightly different problem is that the phrase You'll be returning in You'll be returning on the twenty first counts as two goals, give-information+reservation+temporal+transportation and trip-type=return. Similarly, is cheaper in The bus is cheaper counts as give-information+price and price=cheaper, and With a Mastercard in the context of How will you be paying? counts as give-information+payment and method=mastercard.
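Since the coding scheme counts the domain action and each argument of an IF as separate communicative goals, the goal count for an utterance can be read off its IF. The sketch below illustrates that counting rule; the function is ours, and the simplified argument syntax of the form speaker:domain-action (arg=value, ...) is an assumption.

    import re

    def count_goals(if_string: str) -> int:
        """Count communicative goals in a simplified IF expression:
        one for the domain action plus one for each argument."""
        match = re.match(r"\s*(?:[ac]:)?([^()]+?)\s*(?:\((.*)\))?\s*$", if_string)
        if not match:
            raise ValueError(f"not an IF expression: {if_string!r}")
        args = match.group(2) or ""
        num_args = len([a for a in args.split(",") if a.strip()])
        return 1 + num_args

    print(count_goals("c:give-information+numeral (numeral=2)"))        # 2
    print(count_goals("give-information+payment (method=mastercard)"))  # 2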
Acknowledgements
We would like to thank Alexandra Slavkovic for running the user studies and Kavita Thomas for her preliminary work on the design of the TBE.
8. References
Gates, Donna, Alon Lavie, Lori Levin, Marsal Gavaldà, Monika Woszczyna, and Puming Zhan, 1997. End-to-End Evaluation in JANUS: a Speech-to-Speech Translation System.
Levin, Lori, D. Gates, A. Lavie, F. Pianesi, Dorcas Wallace, Taro Watanabe, and Monika Woszczyna, 2000. Evaluation of a Practical Interlingua for Task-Oriented Dialogue. In Workshop on Applied Interlinguas: Practical Applications of Interlingual Approaches to NLP. Seattle.
Levin, Lori, D. Gates, A. Lavie, and A. Waibel, 1998. An Interlingua Based on Domain Actions for Machine Translation of Task-Oriented Dialogues. In Proceedings of the International Conference on Spoken Language Processing (ICSLP'98). Sydney, Australia.
Levin, Lori, A. Lavie, M. Woszczyna, D. Gates, M. Gavaldà, D. Koll, and A. Waibel. The Janus-III Translation System. Machine Translation. To appear.
Thomas, Kavita, 1999. Designing a Task-Based Evaluation Methodology for a Spoken Machine Translation System. In Proceedings of ACL-99 (Student Session). College Park, MD.
Waibel, Alex, 1996. Interactive Translation of Conversational Speech. Computer, 29(7):41-48.
Walker, Marilyn, D. Litman, C. Kamm, and A. Abella, 1997. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL'97).