Lessons Learned from a Task-Based Evaluation
of Speech-to-Speech Machine Translation
Lori Levin, Boris Bartlog, Ariadna Font Llitjos, Donna Gates,
Alon Lavie, Dorcas Wallace, Taro Watanabe, Monika Woszczyna
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
u.edu
Abstract
For several years we have been conducting Accuracy Based Evaluations (ABE) of the JANUS speech-to-speech MT system (Gates et al., 1997), which measure quality and fidelity of translation. Recently we have begun to design a Task Based Evaluation for JANUS (Thomas, 1999), which measures goal completion. This paper describes what we have learned by comparing the two types of evaluation. Both evaluations (ABE and TBE) were conducted on a common set of user studies in the semantic domain of travel planning.
1. Introduction
For several years we have been conducting Accuracy Based Evaluations (ABE) (Gates et al., 1997) of the JANUS speech-to-speech machine translation system (Waibel, 1996; Levin et al.,). Our ABE focuses on whether the meaning of a source language segment is fully and accurately conveyed in the target language, and also includes a separate measure of fluency. This type of evaluation was useful in the early stages of system development for tracking our improvement over time. The measures we used were the percentage of sentences that were accurate (we call these acceptable) and the percentage that were both accurate and fluent (we call these perfect). However, when our system reached a level of coverage that allowed us to begin user studies, we noticed that the ability of a user to complete a task (for example, getting a plane reservation) was higher than would be expected based on an ABE. For example, the ABE might be around 70% acceptable, but the users could almost always complete the task. Recently we have begun to design a Task Based Evaluation (TBE) for JANUS (Thomas, 1999), which measures goal completion. This paper describes what we have learned by comparing the two types of evaluation.
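The two ABE percentages described above can be illustrated with a short computation. This is only a minimal sketch, assuming each translated sentence has been hand-labelled for accuracy and fluency; the `Judgment` class, the `abe_percentages` function, and the toy data are hypothetical and are not part of JANUS.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Hypothetical hand-assigned grade for one translated sentence."""
    accurate: bool  # meaning of the source segment is conveyed
    fluent: bool    # target-language output is fluent


def abe_percentages(judgments):
    """Return (percent acceptable, percent perfect).

    acceptable = accurate; perfect = accurate and fluent.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    total = len(judgments)
    acceptable = sum(1 for j in judgments if j.accurate)
    perfect = sum(1 for j in judgments if j.accurate and j.fluent)
    return 100.0 * acceptable / total, 100.0 * perfect / total


# Invented toy data: 10 sentences, 7 accurate, 4 of those also fluent.
toy = ([Judgment(True, True)] * 4
       + [Judgment(True, False)] * 3
       + [Judgment(False, False)] * 3)
acceptable_pct, perfect_pct = abe_percentages(toy)
print(acceptable_pct, perfect_pct)
```

Because every perfect sentence is also acceptable, the second percentage can never exceed the first.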
2. Design Criteria
Most previous work on TBE has been conducted on human-machine dialogue (for example, Walker et al., 1997). For machine translation, we need a TBE that is suitable for two humans each expressing communicative goals, but mediated by a machine. (Our coding scheme for communicative goals is described below.) In particular, we have to separate human clumsiness and error from machine error, because we are not evaluating the humans, but rather the translation of what they said. Additionally, we have to allow for a large and unpredictable number of communicative goals in each dialogue. For example, using the goal coding scheme described below, the dialogues we are evaluating each contain over one hundred communicative goals. After coding the communicative goals in a dialogue, we had to design a scoring function that takes into account whether the communicative goals ultimately succeed or fail and how many times each goal is attempted before succeeding (being understood by the interlocutor) or being abandoned.
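The bookkeeping implied by these criteria can be sketched as follows. This is an illustration under stated assumptions rather than the coding procedure actually used: the `(goal_id, understood)` record format, the goal names, and the `summarize_goals` function are invented here, and a goal is assumed to succeed if any of its attempts was understood by the interlocutor.

```python
def summarize_goals(attempts):
    """Collapse (goal_id, understood) attempt records into a mapping
    goal_id -> (number_of_attempts, succeeded).

    Assumption: a goal counts as succeeded if at least one attempt
    was understood; otherwise it is treated as abandoned.
    """
    summary = {}
    for goal_id, understood in attempts:
        count, succeeded = summary.get(goal_id, (0, False))
        summary[goal_id] = (count + 1, succeeded or understood)
    return summary


# Invented toy records: one goal needs two attempts, one fails outright.
toy_attempts = [
    ("ask-departure-time", False),
    ("ask-departure-time", True),
    ("give-departure-time", True),
    ("ask-price", False),
]
goal_summary = summarize_goals(toy_attempts)
print(goal_summary)
```

Keeping the attempt count separate from the final outcome is what lets a later scoring step distinguish a goal that succeeded immediately from one that succeeded only after repetition.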
3. The Data
The data used for this evaluation came from three user-study dialogues that were unseen by system developers. In each dialogue, the role of the traveller was played by a second-time user of our machine translation system and the role of the travel agent was played by one of the system developers. The traveller was told to book a trip to Kyoto. Input to the system was through a headset with microphone. The agent and traveller could not see or hear each other. The only communication was through the user interface, which included speech synthesis, written translations, and web pages showing itineraries and travel information. There are a total of 254 utterances in the three dialogues.
In these user studies, the source and target languages were both English. This does constitute a real translation in that it goes through all of the machine translation components: English sentences are parsed to produce interlingua representations (see below) and then new English sentences are generated from the interlingua. One could argue, however, that there may be some translation problems which do not appear in English-to-English translation. For this reason, we conducted an additional informal user study in which the travel agent was speaking German and the traveller was speaking English. This was not as carefully controlled as the original user studies; the two users could hear each other and the German speaker also understood English.
4. Coding Scheme for Communicative Goals
The most difficult issue in designing our TBE was defining what counts as a communicative goal. Because we need a definition that allows goals to be coded with high
Transcription (1) Agent: WOHIN #6f REISEN SIE #7f
    Where are you travelling?
Ideal IF: a:request-information+features+trip (location=question)
Recognized as: WANN REISEN SIE
    When are you travelling
German Paraphrase: Wann reisen Sie ab?
    when are you leaving
English Translation: When will you leave?
Transcription (2) Client: uh i’m leaving #8f next monday #9f
Ideal IF: c:give-information+temporal+departure (time=next
                Agent     Traveller   Overall
ABE             58.7%                 51.8%
TBE score       .75                   .65
TBE success     82.8%                 73.8%
Table 1: Results of Accuracy- and Task-Based Evaluations for English-English Paraphrase
5. The Scoring Function
Our TBE scoring scheme assigns each identified goal in the dialogue a score ranging between minus one and one. The score is determined according to the formula below (Thomas, 1999). The formula takes into account whether the goal ultimately succeeds or fails and the number of times the goal was attempted before the user finally succeeded or gave up. The number of attempts is denoted by $n$.

$$
\mathrm{score}(\mathit{goal}) =
\begin{cases}
\dfrac{1}{n} & \text{if the goal succeeds} \\[6pt]
-\left(1 - \dfrac{1}{n}\right) & \text{if the goal fails}
\end{cases}
$$

The TBE score for a complete dialogue is calculated as the average of the score per goal, taken over all goals in the dialogue. The rationale behind the scoring formula is the following:
A goal that succeeds in its first attempt receives the maximal score of one. Goals that succeed after further attempts should score less, with a penalty that decays as a function of the number of attempts.
Goals that fail should be penalized more as a function of the number of attempts, since the number of attempts can be indicative of the importance of the goal.
Thus, a goal that was attempted once and abandoned receives a score of zero, while a goal that was attempted ten times without success and then abandoned receives a score of -0.9. The additional penalty for each further attempt decays as the number of attempts grows.
Our explicit goal in the design of the scoring function was to come up with a function that in fact followed the above rationale. Our formula is only one of a variety of functions which would have the above desired properties. We do not attach great significance to the specific function chosen, but rather to the desired properties themselves. While different functions would result in different absolute scores for individual goals as well as complete dialogues, it is the relative score of different dialogues that is ultimately of greater interest in a TBE.
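The properties listed above can be checked with a small sketch. It assumes a score of 1/n for a goal that succeeds on its n-th attempt and 1/n - 1 for a goal abandoned after n attempts, which is one function satisfying the stated rationale; the function names `goal_score` and `dialogue_score` and the toy goals are hypothetical.

```python
def goal_score(attempts, succeeded):
    """Score one goal in [-1, 1].

    Assumed form: 1/n on success, 1/n - 1 on failure, where n is the
    number of attempts.
    """
    if attempts < 1:
        raise ValueError("a goal must be attempted at least once")
    if succeeded:
        return 1.0 / attempts
    return 1.0 / attempts - 1.0


def dialogue_score(goals):
    """Average the per-goal scores over (attempts, succeeded) pairs."""
    if not goals:
        raise ValueError("dialogue contains no goals")
    return sum(goal_score(n, ok) for n, ok in goals) / len(goals)


# Invented toy dialogue with four goals.
toy_goals = [(1, True), (2, True), (1, False), (10, False)]
toy_score = dialogue_score(toy_goals)
print(round(toy_score, 3))
```

Under this assumed form, success on the first attempt scores one, a single failed attempt scores zero, and repeated failures approach minus one without reaching it.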
6. Results
Table 1 shows the results of the ABE and TBE on English to English translation. There were four human coders. The ABE score is the percent of utterances whose translations preserved the original meaning. The TBE score was computed by the formula above, taking into account success/failure of goals in addition to the number of attempts for each goal. The row labeled TBE success shows the percentage of goals that ultimately succeeded (out of a total of approximately 460 goals in three dialogues). Each row breaks down into a score for the agent (who was an experienced user), a score for the traveller (a second-time user), and an overall score for agent and traveller.
The results for the less controlled English-German experiment are as follows. In one dialogue coded by one coder, there were 102 goals and a total of 133 attempts. 83% of the goals ultimately succeeded. The score returned by our scoring function is .73. The ABE showed 63% acceptable translations.
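The English-German figures can be unpacked with simple arithmetic. The sketch below uses only the numbers reported above; because the 83% success rate is rounded, the derived goal counts are approximate, and the variable names are chosen here for illustration.

```python
goals = 102          # communicative goals in the dialogue
attempts = 133       # total attempts across all goals
success_rate = 0.83  # reported (rounded) share of goals that succeeded

# Average number of attempts per goal, about 1.30.
attempts_per_goal = attempts / goals

# Approximate counts implied by the rounded success rate.
succeeded_goals = round(success_rate * goals)
failed_goals = goals - succeeded_goals

# Attempts beyond the first one for each goal, i.e. repetitions.
repeated_attempts = attempts - goals

print(round(attempts_per_goal, 2), succeeded_goals, failed_goals,
      repeated_attempts)
```

So roughly 85 of the 102 goals succeeded, and the 31 repeated attempts are what keep the TBE score (.73) below the success rate (83%).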
7. Discussion and Lessons Learned
There are a few things to notice about Table 1. For example, the users playing the travel agent role have more success in both ABE and TBE than users playing the traveller role. This is because the pretend travel agents were system developers and the travellers were second-time users of our machine translation system.
Another notable point about Table 1 is that task success (73.8%) is higher than translation accuracy (51.8%). This confirms the need for TBE in addition to ABE. The reason for task success being higher than translation accuracy is that both experienced and inexperienced users accepted some bad translations as long as they could be understood in context. For example, in the context of the question "How much does it cost?", users will accept the answer "128 hours".
The percent of task success, however, does not provide a measure of user frustration (Walker et al., 1997). This is why we formulated the TBE scoring function to take into account success/failure of goals as well as the number of attempts at each goal. (In future work, we will give some thought to making the TBE score, on a minus one to one scale, more comparable to the ABE score, expressed as a percentage.) In sum, we find three kinds of measures useful: a measure of quality and fidelity, a measure of goal success/failure, and a measure of user effort combined with success/failure.
We will close by giving some examples that illustrate a peculiarity in our coding scheme: the utterance "two" is associated with the IF give-information+numeral (numeral=2), which has a domain action and an argument. Therefore, it counts as two communicative goals. A slightly different problem is that the phrase "You’ll be returning" in "You’ll be returning on the twenty first" counts as two goals, give-information+reservation+temporal+transportation and trip-type=return. Similarly, "is cheaper" in "The bus is cheaper" counts as give-information+price and price=cheaper, and "With a Mastercard" in the context of "How will you be paying?" counts as give-information+payment and method=mastercard.
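The counting behaviour in these examples, one goal for the domain action plus one per argument, can be sketched as follows. This is a simplified illustration and not the coders' actual procedure: it assumes an IF is written as text in the form `speaker:domain-action (arg=value, ...)` with an optional `a:` or `c:` speaker prefix, and the `count_goals` function is hypothetical.

```python
import re

# Optional speaker prefix, a domain action, and an optional
# parenthesised, comma-separated argument list.
IF_PATTERN = re.compile(r"\s*(?:[ac]:)?([^()\s]+)\s*(?:\((.*)\))?\s*")


def count_goals(interchange_format):
    """Count communicative goals in one textual IF.

    Assumption: the domain action contributes one goal and every
    argument contributes one more.
    """
    match = IF_PATTERN.fullmatch(interchange_format)
    if match is None:
        raise ValueError(f"unrecognised IF: {interchange_format!r}")
    _domain_action, arg_text = match.groups()
    arguments = [a for a in (arg_text or "").split(",") if a.strip()]
    return 1 + len(arguments)


numeral_goals = count_goals("give-information+numeral (numeral=2)")
payment_goals = count_goals("give-information+payment (method=mastercard)")
bare_goals = count_goals("c:give-information+price")
print(numeral_goals, payment_goals, bare_goals)
```

Under this assumption even a one-word utterance such as "two" yields two goals, which is the peculiarity the examples point out.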
Acknowledgements
We would like to thank Alexandra Slavkovic for running the user studies and Kavita Thomas for her preliminary work on the design of the TBE.
8. References
Gates, Donna, Alon Lavie, Lori Levin, Marsal Gavaldà, Monika Woszczyna, and Puming Zhan, 1997. End-to-End Evaluation in JANUS: a Speech-to-Speech Translation System.
Levin, Lori, D. Gates, A. Lavie, F. Pianesi, Dorcas Wallace, Taro Watanabe, and Monika Woszczyna, 2000. Evaluation of a Practical Interlingua for Task-Oriented Dialogue. In Workshop on Applied Interlinguas: Practical Applications of Interlingual Approaches to NLP. Seattle.
Levin, Lori, D. Gates, A. Lavie, and A. Waibel, 1998. An Interlingua Based on Domain Actions for Machine Translation of Task-Oriented Dialogues. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’98). Sydney, Australia.
Levin, Lori, A. Lavie, M. Woszczyna, D. Gates, M. Gavaldà, D. Koll, and A. Waibel. The Janus-III Translation System. Machine Translation. To appear.
Thomas, Kavita, 1999. Designing a Task-Based Evaluation Methodology for a Spoken Machine Translation System. In Proceedings of ACL-99 (Student Session). College Park, MD.
Waibel, Alex, 1996. Interactive Translation of Conversational Speech. Computer, 19(7):41–48.
Walker, Marilyn, D. Litman, C. Kamm, and A. Abella, 1997. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL’97).
