Data Min Knowl Disc (2013) 26:533–558
DOI 10.1007/s10618-012-0270-1
Representations for multi-document event clustering
Wim De Smet · Marie-Francine Moens
Received: 27 October 2008 / Accepted: 11 May 2012 / Published online: 20 June 2012
© The Author(s) 2012
Abstract We study several techniques for representing, fusing and comparing content representations of news documents. As underlying models we consider the vector space model (both in a term setting and in a latent semantic analysis setting) and probabilistic topic models based on latent Dirichlet allocation. Content terms can be classified as topical terms or named entities, yielding several models for content fusion and comparison. All used methods are completely unsupervised. We find that simple methods can still outperform the current state-of-the-art techniques.
Keywords Text mining · Probabilistic content models · Clustering
1 Introduction
When processing news stories containing several accounts of a certain happening, it is often relevant to determine whether two stories report on the same event. An event is defined here as a well-specified happening at a certain moment in time (a single day or a short period) which deals with a certain set of topics (e.g., a hurricane and inundations, an earthquake and lack of drinking water) and involves some named entities. Those entities are, for instance, the actors (such as the names of the leading persons or companies) and the location where the event occurred. News stories are typical examples. Broadcast news can be segmented into different stories that each report on a single event.

Responsible editor: R. Bayardo.
W. De Smet (B) · M.-F. Moens
Department of Computer Science, K.U.Leuven, Leuven, Belgium
e-mail: wim.desmet@cs.kuleuven.be
M.-F. Moens
e-mail: s@cs.kuleuven.be
Written news typically is recorded per story, where each story reports on one event. However, different sources or even the same source can produce several stories on the same event, which we might group as a preprocessing step for mining, summarizing or searching purposes. In this article we focus on the clustering of textual news stories coming from different sources that describe the same event. We use the words "story" and "document" interchangeably.
Any clustering depends on the quality of the distinction between the elements, and on the quantitative representation of this distinction, i.e., the distance or dissimilarity function. Our main goal is to investigate the suitability of existing document representations for the event detection task in a repository of news stories. As underlying models we consider the vector space model [both in a term setting and in a latent semantic analysis (LSA) setting (Hofmann 1999)] and probabilistic topic models based on latent Dirichlet allocation (LDA) (Blei et al. 2003). These methods also include representing documents along different angles: views or aspects¹ of their content enhance these distance computations. Indeed, documents can contain different kinds of information. Based on the definition of a news event, these aspects comprise the event's topics and its entities.
There is the long-standing vector space model (Salton 1989) for document representation, where the importance of topics is signalled by the term weights, usually computed as the term frequency multiplied by the inverse document frequency. Similarity and distance computations rely on real distances in a geometric vector space. Latent class models have recently become popular for representing content. Among them is the LSA model (Deerwester et al. 1990), a vector space model that represents the terms of a document in a lower dimensional space representing semantic concepts or topics. There are the newer probabilistic topic models, such as probabilistic LSA (Hofmann 1999) and LDA (Blei et al. 2003), which see a document as a mixture of topics and topics as mixtures of words. Similarity and dissimilarity are respectively seen here as convergence or divergence of probability distributions. Current information extraction and analysis techniques enable the detection of the aspects. For instance, we already have reliable named entity recognizers for common languages such as English that classify proper names into their semantic categories. Typical semantic categories are locations, persons and organizations. Similarity or dissimilarity computations can alternatively be based on the extracted content elements.
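As a minimal illustration of this classic representation (a sketch, assuming scikit-learn is available, and not the exact implementation used in our experiments), the following Python fragment computes tf × idf weighted document vectors and their pairwise cosine similarity; the toy documents are invented for the example.

# A minimal sketch of the vector space model: documents as tf-idf weighted
# term vectors, compared by the cosine of the angle between them.
# Illustrative only; preprocessing and weighting details may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "A hurricane hit the coast and caused severe inundations.",
    "Severe flooding followed the hurricane along the coast.",
    "The central bank raised interest rates again this quarter.",
]

vectorizer = TfidfVectorizer(stop_words="english")  # tf * idf term weights
X = vectorizer.fit_transform(docs)                  # documents x terms matrix

sim = cosine_similarity(X)     # pairwise cosine similarities
dissim = 1.0 - sim             # a simple dissimilarity (distance) matrix
print(dissim.round(2))         # stories 0 and 1 should be closest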
When documents are represented by different aspects (i.e., different parts of their content), one can compute their dissimilarity based on a single aspect, or based on the combination of similarities each obtained by considering a different aspect. In a first model, what we could call early fusion or full text, we do not split the information: we combine all the features (in our case the topical words and the named entities) in a vector or probabilistic model and compute the similarity. In a late fusion model, for each aspect there is a different representation, for each aspect representation the dissimilarity between two documents is computed, and the dissimilarities are then fused with an evidence combination function. We will also use an intermediate fusion model, where the representations of the different aspects are used together to calculate hidden variables that can describe a document; the dissimilarity is then expressed in terms of these latent variables. All used methods are completely unsupervised. We compare methods that are trained either on the same, or on a different corpus than the one on which they are applied.

¹ "Aspects" here have a broader meaning than the latent probabilistic topics generally meant, which we discuss further in this article.
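The following sketch contrasts early fusion (pooling all features into one representation) with late fusion (combining per-aspect distances by average or maximum) on two toy documents. The hand-split aspect texts and the aspect_distance helper are illustrative assumptions; in practice a named entity recognizer performs the split.

# Sketch of early vs. late fusion over two content aspects (topical terms
# and named entities). Toy data; the aspect split is done by hand here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Each document is pre-split into its two aspects.
topical = ["hurricane inundations coast storm", "flooding hurricane coast rain"]
entities = ["New Orleans Louisiana", "Louisiana Red Cross"]

def aspect_distance(texts):
    """Cosine distance between two documents for one aspect."""
    X = TfidfVectorizer().fit_transform(texts)
    return cosine_distances(X)[0, 1]

# Early fusion (full text): pool all features into a single representation.
full_text = [t + " " + e for t, e in zip(topical, entities)]
d_early = aspect_distance(full_text)

# Late fusion: one distance per aspect, combined afterwards.
d_topic = aspect_distance(topical)
d_ent = aspect_distance(entities)
d_avg = (d_topic + d_ent) / 2      # average combination
d_max = max(d_topic, d_ent)        # maximum combination
print(d_early, d_avg, d_max)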
The contributions of this article are the benchmarking study of the value of different document representations (algebraic and probabilistic) in an event clustering task, and the value of different fusion models that can be applied for computing the dissimilarity in content between document pairs. An extensive study of the representation models and of different models of fusion in the dissimilarity computations is, to our knowledge, non-existent in the literature. This study shows that simple models of text representation, based on term frequency and inverse document frequency, are still competitive compared to current latent semantic models.
The techniques proposed in this article offer many avenues for future expansions and applications. The methods, although tested in the field of news event clustering, could well be applied to other types of texts or media in different comparison or clustering tasks. It would be interesting to assess how the resulting representations influence the computation of similarity of textual content. Documents very often are analyzed and compared. For example, in patent files one looks for similarity in subject domains and methods. Facts in police and intelligence reports are matched based on names, locations, car brands, modus operandi and other extracted information. Automatic grading of essay-type exams of students and detection of plagiarism constitute another domain where content comparison is important. In law, precedent reasoning compares cases on the basis of similar facts and correspondence in legal factors and issues. Opinions in political speeches can be compared focusing on specific issues. In medical patient reports, common patterns based on certain extracted information can be found. In each of these domains, the text could be divided into different aspects that might be important for text comparison. This article provides a framework for such a comparative analysis.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 describes our methodology, focusing on document representations, named entity recognition (NER), dissimilarity metrics and clustering. Our methods are both generically and more specifically described, the latter focusing on the event clustering task. Comparative tests and evaluations are presented in Sect. 4. We conclude by summarizing the main findings and the many applications of our research.
2 Related research
Assessing similarities and differences between text documents is a long-standing problem. Established algebraic approaches represent documents as term vectors (where terms are possibly weighted by a tf × idf factor) and compute the cosine of the angle between the term vectors (Salton 1989) (the vector space model). This model assumes that the vectors that span the geometric space are pairwise orthogonal, an assumption which is violated in real texts (Wang et al. 1992).
In order to cope with synonyms and related terms, an alternative vector space model incorporates document representations based on LSA (Deerwester et al. 1990) and singular value decomposition of the term-by-document matrix of a document collection. The LSA model is also a kind of aspect model, as documents are represented by several "topics". Gong and Liu (2001), Steinberger et al. (2007, 2010) and Steinberger and Ježek (2009) perform a singular value decomposition on the term-by-sentence matrix of a set of documents, from which a multi-document summary is obtained.
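A minimal LSA sketch, assuming scikit-learn: a truncated SVD of the tf-idf term-by-document matrix projects documents into a low-dimensional latent space in which similarity can be computed. The corpus and the number of latent dimensions are toy choices.

# Sketch of LSA: truncated SVD maps documents into a latent "topic" space,
# where documents using synonymous vocabulary can end up close together.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The quake destroyed houses and cut off drinking water.",
    "An earthquake left the city without potable water.",
    "Parliament debated the new budget proposal.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)  # keep 2 latent concepts
Z = svd.fit_transform(X)                            # documents in LSA space

print(cosine_similarity(Z).round(2))  # similarity in the latent space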
Recently, probabilistic topic models have been proposed as an alternative to LSA. Proponents claim topic models deal with polysemy in a more natural way. These models have rapidly gained popularity as an unsupervised way of describing documents. The main idea is that documents are viewed as a mixture of topics and each topic as a mixture of words. Several latent topic models exist, such as probabilistic latent semantic analysis or pLSA (Hofmann 1999) and LDA (Blei et al. 2003). In both cases the topic and word distributions are learned from a large training corpus, but newer models such as LDA learn additional latent variables that are independent of the training corpus, so that the topic distributions of new, previously unseen documents can also be inferred. In addition, LDA models require fewer parameters to train, whereas the number of parameters grows linearly with the number of documents in a pLSA model. Variant models have been studied by Buntine and Jakulin (2006). Li et al. (2005) build a probabilistic generative model for retrospective news event detection, where an event generates persons, locations and keywords as named entities, apart from a time pointer, in this way combining the different aspects. We build further upon these models, but investigate the effect of splitting content representations and fusing the dissimilarity values obtained with the content representations.
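The following sketch, again with toy data and assuming scikit-learn and SciPy, infers per-document topic mixtures with an LDA implementation and measures document dissimilarity as the Jensen-Shannon distance between these distributions; the number of topics and the hyperparameter defaults are illustrative, not those used in the experiments reported below.

# Sketch of LDA-based comparison: each document gets a topic distribution,
# and dissimilarity is a divergence between those distributions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import jensenshannon

docs = [
    "hurricane storm coast flooding damage",
    "storm hurricane flooding rescue coast",
    "budget tax parliament vote economy",
]

X = CountVectorizer().fit_transform(docs)   # LDA works on raw term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                # per-document topic distributions

# Jensen-Shannon distance between topic mixtures as document dissimilarity.
print(jensenshannon(theta[0], theta[1]))    # same event: small
print(jensenshannon(theta[0], theta[2]))    # different event: larger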
Recent work on probabilistic topic models combines metadata content with topic models, as is done by McCallum et al. (2005), who steer the discovery of topics according to the relationships between people. These models, although very valuable in other ways, appear to augment the words in a document with semantics in a limited way, and the document representation is still quite a rudimentary reflection of its semantics. Structured models that take into account topic correlations have been proposed by Li and McCallum (2006). This model does not yet take into account extracted information such as named entities.
In the computational linguistics domain, paraphrasing techniques have been developed in order to detect similar content by considering matching of word co-occurrences, matching noun phrases, verb classes, proper nouns, etc. (Hatzivassiloglou 1998; Barzilay and Lee 2003), where the matching patterns might be learned in an unsupervised way using sentences that already describe comparable content. From a more cognitive point of view, automatic text understanding and similarity detection, as compared to the human capability of doing so, have been discussed in Lee and Welsh (2005), Tsatsaronis et al. (2010) and Stone et al. (2011). As a variation of the paraphrasing models, researchers have attempted to detect contradictions in natural language statements, for instance, by means of handcrafted rules (McKeown and Radev 1995) or by learning contradiction models from annotated sentences (de Marneffe et al. 2008). These techniques are usually confined to finding similarities or differences in a fine-grained way, but their use is currently still restricted by a rather low performance, making them less suited for content comparison.
Our work presented here introduces semantic representations of documents which move beyond simple topic models and incorporate specific information extracted from the documents, but are more robust than the fine-grained representations obtained through in-depth natural language processing. Information extraction technologies that semantically classify certain information in the documents (such as NER), in combination with probabilistic topic models, offer many interesting possibilities for representing and comparing texts, possibly along different aspects of content.
Event detection has received substantial interest in information retrieval research (often as part of topic detection and tracking² (TDT) tasks). Early work on retrospective event detection based on hierarchical agglomerative clustering (group average clustering) was done by Yang et al. (1999) (building further on Cutting et al. 1992). The events are clustered based on lexical (single words) similarity of the documents and temporal proximity. The temporal proximity parameter avoids clustering documents that are too far apart in time. Many different studies on event detection followed these initial efforts (see Allan et al. 2002 for the main approaches). Many of them rely on a vector space representation of the documents, where more recent approaches make a distinction between named entities and non-named-entity terms (e.g., Kumaran and Allan 2004). In such a scheme each term type might receive a different weight, possibly learned from a training corpus (Zhang et al. 2007). As mentioned above, we use the vector space model, but our task is different from the TDT task. We do not consider a live stream of documents which are ordered in time. Our goal is only to assess similarities in content between documents in the most effective way.
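As an illustration of this classic approach (a sketch, not a reproduction of the cited TDT systems), the fragment below clusters tf-idf document vectors with group-average (average-linkage) hierarchical agglomerative clustering using SciPy; the distance threshold at which the dendrogram is cut is an arbitrary toy value, and temporal proximity is omitted.

# Sketch of retrospective event detection via group-average hierarchical
# agglomerative clustering, in the spirit of Yang et al. (1999).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

docs = [
    "Hurricane makes landfall, coastal towns flooded.",
    "Hurricane flooding worsens in coastal towns.",
    "Earthquake shakes the capital, buildings damaged.",
    "Rescue teams search damaged buildings after the earthquake.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
D = cosine_distances(X)                      # pairwise cosine distances

Z = linkage(squareform(D, checks=False), method="average")  # group average
labels = fcluster(Z, t=0.9, criterion="distance")           # cut the tree
print(labels)  # documents about the same event share a cluster label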
Probabilistic models for representing events in documents are scarce. In Allan et al. (2003), a simple probabilistic language model is used as a document representation. Other research on integrating named entities in an event detection task includes Makkonen et al. (2002) and Zhang et al. (2007), where Zhang et al. (2007) demonstrated correlations between named entity types and news classes.
In this paper we also benchmark models characterized by an intermediate or late fusion of evidence obtained from the documents. In the intermediate fusion model we split a document into entities and other words, and train a parallel LDA model where topics are learned on the paired entity and other-word documents. In the late fusion model documents are compared by means of their different aspects, and these similarity results are fused. In classification, when dealing with heterogeneous information, late fusion approaches are common [e.g., in semantic analysis of video (Snoek 2005) or in spam email filtering (Hershkop and Stolfo 2005)]. When fusing information from different sources, the impact of each of the sources might differ. Such weight variables (e.g., used in a linear interpolation of the evidence) can be learned from a training set. We did not follow this approach, because we did not include any supervision in our model. We fused dissimilarity values by using either their average or maximum scores. Other methods are possible here, including probabilistic models (Pearl 1991) or models based on Dempster-Shafer evidential theory (Shafer 1976). These "split models" are in line with recent work in information retrieval where different language models are built from the different parts of a document (e.g., title, HTML anchor texts, body) and combined in a ranking function (Wang et al. 2010).
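A sketch of such score-level fusion over probabilistic representations, with toy data: a separate LDA model is trained per aspect, a Jensen-Shannon distance is computed per aspect, and the scores are fused by their average or maximum. This approximates only the late fusion model described above; the parallel ("intermediate") LDA model, which ties topics across aspects, is not reproduced here.

# Late fusion over per-aspect LDA models (topical words vs. named entities).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import jensenshannon

aspects = {
    "topical":  ["hurricane flooding coast storm", "storm flooding hurricane rain"],
    "entities": ["New_Orleans Louisiana", "Louisiana Red_Cross"],
}

def aspect_divergence(texts):
    """JS distance between two documents' topic mixtures for one aspect."""
    X = CountVectorizer(lowercase=False).fit_transform(texts)
    theta = LatentDirichletAllocation(n_components=2,
                                      random_state=0).fit_transform(X)
    return jensenshannon(theta[0], theta[1])

divs = [aspect_divergence(t) for t in aspects.values()]
print(sum(divs) / len(divs), max(divs))  # average and maximum fusion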
² Note that topic is used here in a different sense than in latent probabilistic topic models (Nallapati et al. 2004).