A Python Implementation of Dynamic Topic Models
An Introduction to Dynamic Topic Models (DTM)
Dynamic Topic Models come from a paper Blei published at the 23rd International Conference on Machine Learning in 2006. Unlike the earlier Latent Dirichlet Allocation (LDA) model, DTM introduces a time dimension, capturing how the topics of a corpus evolve dynamically over time.
In LDA, the documents of a corpus have no temporal ordering, much as the words in a bag-of-words model have no ordering, and the K topics of the corpus are treated as fixed during modeling. In DTM, each document carries a time attribute, so documents are ordered, and topics are assumed to evolve dynamically across periods. For example, if the corpus contains a "music" topic, what that topic reflected in the 1980s certainly differs from what it reflects today.
The probabilistic graphical model of DTM is as follows:
The generative process of DTM is as follows:
In DTM, the K topics of the corpus evolve continuously across time slices. Steps 1 and 2 of the generative process show that the doc-topic distribution and the topic-word distribution at stage t evolve from those at stage t-1. Because the Dirichlet distribution widely used in other topic models (such as LDA) is not suited to modeling this sequential evolution, the paper uses a Gaussian distribution instead.
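As a toy numerical sketch of this state-space view (my own illustration with made-up sizes, not the paper's inference procedure): a topic's natural parameters follow a Gaussian random walk across time slices, and the word distribution at each slice is recovered by mapping through the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 8        # toy vocabulary size (illustrative)
T = 3        # number of time slices
sigma = 0.1  # evolution noise (illustrative)

def softmax(x):
    """Map natural parameters to a probability distribution."""
    e = np.exp(x - x.max())
    return e / e.sum()

# beta_t ~ N(beta_{t-1}, sigma^2 I): the topic's parameters drift over time
beta = np.zeros((T, V))
beta[0] = rng.normal(size=V)
for t in range(1, T):
    beta[t] = beta[t - 1] + rng.normal(scale=sigma, size=V)

# The word distribution of this one topic at each time slice
topic_word = np.array([softmax(b) for b in beta])
```

Each row of `topic_word` is a valid distribution over the vocabulary, and adjacent rows differ only slightly, which is exactly the smooth topic drift DTM assumes.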
Readers who want to understand DTM in detail can read Blei's paper. The original DTM code is available, but it is fairly difficult to work with; interested readers can download the source and study it. What follows is the focus of this article: implementing DTM in Python by calling the relevant modules of the NLP toolkit Gensim.
Implementing Dynamic Topic Models
Data and Preprocessing
The data consists of the 1,324 documents provided on GitHub, divided into three months.
First, the documents need to be merged: the documents from the three months are combined into a single txt file, with each line representing one document. Here I first merged the documents and stripped punctuation with Java; the code is as follows:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class newsToDoc {
    public static void main(String[] args) {
        File file = new File("newsData\\sample");
        File newsOut = new File("newsData\\news.txt"); // merged output, one document per line
        File[] files = file.listFiles();
        String line;
        try {
            BufferedWriter bfw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(newsOut), "UTF-8"));
            BufferedReader bfr = null;
            for (int i = 0; i < files.length; i++) {
                bfr = new BufferedReader(new InputStreamReader(new FileInputStream(files[i]), "UTF-8"));
                while ((line = bfr.readLine()) != null) {
                    // strip punctuation (ASCII and full-width)
                    line = line.replaceAll("[`~!@#$%^&*()+=|{}':;',\\[\\].<>/?~!\"?@#¥%……&;*()——+|{}《》【】‘;:’。,、|-]", "");
                    bfw.append(line);
                }
                bfw.newLine(); // each input file becomes one line (one document)
                bfr.close();
            }
            bfw.flush();
            bfw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Readers can also try merging the documents and stripping punctuation in other ways. After this processing, the 1,324 news articles originally split across three months have been merged into one document.
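As one such alternative, the same merge-and-strip step can be sketched in a few lines of Python; the paths and the helper name here are illustrative, not part of the original pipeline.

```python
import re
from pathlib import Path

# Characters to strip: ASCII punctuation plus common full-width punctuation.
PUNCT = re.compile(r"[`~!@#$%^&*()+=|{}':;,\[\].<>/?~!""?@#¥%……&;*()——+|{}《》【】‘;:’。,、|-]")

def merge_documents(src_dir: Path, out_file: Path) -> int:
    """Write every file in src_dir as one punctuation-free line of out_file.

    Returns the number of documents written.
    """
    count = 0
    with out_file.open("w", encoding="utf-8") as out:
        for path in sorted(src_dir.iterdir()):
            text = path.read_text(encoding="utf-8")
            # Strip punctuation, then collapse the document onto a single line.
            text = PUNCT.sub("", text).replace("\n", " ")
            out.write(text + "\n")
            count += 1
    return count
```

Usage would be, for example, `merge_documents(Path("newsData/sample"), Path("newsData/news.txt"))`, mirroring the Java program above.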
Python Implementation
After preprocessing the documents, we call Gensim from Python to implement DTM.
First, import the relevant modules:
import logging
from gensim import corpora
from six import iteritems
from gensim.models import ldaseqmodel
from gensim.corpora import Dictionary, bleicorpus
Next, we need to convert this document into the corpus that the DTM model requires, and build the dictionary.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # log progress so we can follow the run
stoplist = set('a able about above according i accordingly across actually after afterwards again against ain’t all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren’t around as a’s aside ask asking associated at available away awfully be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by came can cannot cant can’t cause causes certain certainly changes clearly c’mon co com come comes concerning consequently consider considering contain containing contains corresponding could couldn’t course c’s currently definitely described despite did didn’t different do does doesn’t doing done don’t down downwards during each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except far few fifth first five followed following follows for former formerly forth four from further furthermore get gets getting given gives go goes going gone got gotten greetings had hadn’t happens hardly has hasn’t have haven’t having he hello help hence her here hereafter hereby herein here’s hereupon hers herself he’s hi him himself his hither hopefully how howbeit however i’d ie if ignored i’ll i’m immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn’t it it’d it’ll its it’s itself i’ve just keep keeps kept know known knows last lately later latter latterly least less lest let let’s like liked likely little look looking looks ltd mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere obviously of off often oh ok okay old on once one ones only onto or other others otherwise ought our ours ourselves out outside over overall own particular particularly per perhaps placed please plus possible presumably probably provides que quite qv rather rd re really reasonably regarding regardless regards relatively respectively right said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn’t since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure take taken tell tends th than thank thanks thanx that thats that’s the their theirs them themselves then thence there thereafter thereby therefore therein theres there’s thereupon these they they’d they’ll they’re they’ve think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying t’s twice two un under unfortunately unless unlikely until unto up upon us use used useful uses using usually value various very via viz vs want wants was wasn’t way we we’d welcome well we’ll went were we’re weren’t we’ve what whatever what’s when whence whenever where whereafter whereas whereby wherein where’s whereupon wherever whether which while whither who whoever whole whom who’s whose why will willing wish with within without wonder won’t would wouldn’t yes yet you you’d you’ll your you’re yours yourself yourselves you’ve zero zt ZT zz ZZ'.split())
# Build the dictionary, removing stopwords and words that occur only once
dictionary = corpora.Dictionary(line.lower().split() for line in open(''))
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # drop stopwords and words that occur only once
dictionary.compactify()  # reassign ids to close the gaps left by the removed words
dictionary.save('datasets/news_dictionary')  # save the dictionary
# Load the documents and build the corpus
class MyCorpus(object):
    def __iter__(self):
        for line in open(''):
            yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus()
corpus = [vector for vector in corpus_memory_friendly]  # turn the streamed documents into a corpus
corpora.BleiCorpus.serialize('datasets/news_corpus', corpus)  # store the corpus in Blei's lda-c format
With the work above, the document has been converted into the dictionary and corpus that the DTM model requires. Next, we load the corpus and dictionary into the model.
try:
    dictionary = Dictionary.load('datasets/news_dictionary')
except FileNotFoundError as e:
    raise ValueError("SKIP: Please download the Corpus/news_dictionary dataset.")
corpus = bleicorpus.BleiCorpus('datasets/news_corpus')
time_slice = [438, 430, 456]  # split the corpus into three periods: 438 news articles in the first, 430 in the second, 456 in the third
num_topics = 5  # number of topics, here 5
ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=num_topics)  # load the corpus, dictionary, and parameters into the model and train
corpusTopic = ldaseq.print_topics(time=0)  # topic distributions at a given period, here the first
print(corpusTopic)
topicEvolution = ldaseq.print_topic_times(topic=0)  # evolution of a given topic across periods, here the first topic
print(topicEvolution)
doc = ldaseq.doc_topics(0)  # topic distribution of a given document, here the first document
print(doc)
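One easy mistake with `time_slice` is a mismatch with the corpus: the entries must sum to the number of documents, in corpus order. A small standalone sanity check (the helper names are my own):

```python
def check_time_slice(time_slice, num_docs):
    """Raise if the slice sizes do not cover the corpus exactly."""
    total = sum(time_slice)
    if total != num_docs:
        raise ValueError(f"time_slice sums to {total}, but the corpus has {num_docs} documents")

def slice_of_doc(time_slice, doc_index):
    """Return the 0-based time slice that a document index falls into."""
    bound = 0
    for t, size in enumerate(time_slice):
        bound += size
        if doc_index < bound:
            return t
    raise IndexError("document index out of range")
```

For our data, `check_time_slice([438, 430, 456], 1324)` passes, and `slice_of_doc([438, 430, 456], 500)` places document 500 in the second month.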
That concludes my introduction to the Python implementation of DTM. Gensim provides further methods for the DTM model that can be called; readers who want to go deeper can consult its documentation.
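For instance, a common follow-up to `doc_topics` is comparing the topic mixtures of two documents. Gensim ships a Hellinger distance in `gensim.matutils`; a self-contained version, assuming the inputs are plain probability vectors of equal length, looks like:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Ranges from 0 (identical) to 1 (disjoint support).
    """
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# e.g. dist = hellinger(ldaseq.doc_topics(0), ldaseq.doc_topics(1))
```

A distance near 0 means the two documents share essentially the same topic mixture; near 1 means they have no topics in common.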