[Python Notes] Extracting Topic Distributions with LDA (Latent Dirichlet Allocation) + Generating User Features
During my internship I had a task: binary-classification prediction on a large sample set joined with many kinds of features, some of which were text. Even simple processing of the text features yielded some gains, so I tried using gensim's LDA to extract each sample's topic distribution over its text feature and turn that into new features. The concrete implementation follows.
For how to use the LDA package in gensim, see:
For the theory behind LDA, see:
Problem background
During feature mining, one class of reasonably well-performing features was the sample's APP-related information, which is entirely text. It comes in three fields: apkname (a unique identifier), appname, and apptags (comma-separated). The idea is to clean and merge each sample's APP information into a single text, treat that text as the document representing the sample, build a topic model over these documents, and finally use LDA to obtain each sample's probability distribution over the top topics, which becomes a set of new features for that sample.
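As a toy illustration of the end goal (all names and documents below are made up, not from the original data), the pipeline maps each user's merged APP text to a probability distribution over topics, which then becomes that user's new feature vector:

from gensim import corpora, models

# toy data: one merged, tokenized APP text per user
user_docs = [['fgo', '网游', '回合', '二次元'],
             ['wechat', '社交', '聊天'],
             ['fgo', '二次元', '聊天']]
dictionary = corpora.Dictionary(user_docs)
bow = [dictionary.doc2bow(d) for d in user_docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

for i, doc in enumerate(bow):
    # each user's topic probabilities become that user's new features
    print(i, lda.get_document_topics(doc, minimum_probability=0.0))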
The features look like this (a made-up example):
apkname: com.bilibili.fgo.qihoo
appname: 命运-冠位指定-FGO正版IP
apptags: 网游, ACT, 回合, 二次元
Data size: about 3 million samples
Distinct APPs in the data: about 280,000
Preprocessing
The feature text comes in three kinds: the unique apkname identifier, appname, and apptags. To build the per-sample text, apkname is split on ".", while appname is segmented into words and cleaned of special symbols so that it can be merged into the tags. A Python sketch of this cleaning is shown below.
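A minimal sketch of the cleaning logic, assuming jieba for segmenting the Chinese appname (jieba and the helper name are illustrative choices, not from the original post):

import re
import jieba  # assumed segmenter for the Chinese appname; any tokenizer would do

def clean_app(apkname, appname, apptags):
    """Merge one APP's three fields into a single comma-separated token string."""
    apk_tokens = [t for t in apkname.split('.') if t]                     # split the package name on '.'
    name_tokens = [re.sub(r'[^\w]+', '', t) for t in jieba.cut(appname)]  # segment, then strip special symbols
    tag_tokens = [t.strip() for t in apptags.split(',') if t.strip()]
    return ','.join([t for t in apk_tokens + name_tokens if t] + tag_tokens)

print(clean_app('com.bilibili.fgo.qihoo', '命运-冠位指定-FGO正版IP', '网游,ACT,回合,二次元'))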
Although this cleaning runs without trouble on a small test dataset, on the real data the dataset was large enough to exhaust the dev machine's memory, so the preprocessing was done during ETL in HiveSQL instead. The following HiveSQL idioms proved very useful:

-- Concatenate the cleaned fields into the APP text
concat_ws(',', collect_set(apkname_cleaned), collect_set(app_splited_cleaned),
          collect_set(tag_cleaned)) as app_text

-- After GROUP BY, taking any one value is enough; typical when grouping by userid
-- (the column is constant across all of a user's rows)
collect_set(is_apply_suc)[0]

-- Character replacement with regular expressions
regexp_replace(regexp_replace(regexp_replace(apkname,
    'dkplugin|dkmodel|app|App|xposed|The|pro|Pro|apps|com.', ''),
    '^.|$.', ''), '.|.{2,}', ',') as apkname_cleaned

After preprocessing, a sample looks like this (when a sample contains several APPs, the tokenized apkname, appname, and apptags of each APP are concatenated together into one APP text):

com,bilibili,fgo,qihoo,命运,冠位,指定,FGO,正版IP,网游,ACT,回合,二次元

The data-retrieval step is omitted here; each user's APP text is used as one input document for the topic model.
Training the LDA model
import pandas as pd
import json
import warnings
import gensim
import string
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
import os, sys
import gc
import logging
from logzero import logger

logging.basicConfig(format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.INFO)
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity

app = pd.read_csv('../input/user_apklabels.csv', nrows=None, usecols=['apklabels'])
print(app.shape[0] - app.isnull().sum())   # count of non-null rows
app = app[~app['apklabels'].isnull()]      # drop samples without APP text
#app['is_num_apkname'] = [s.isalnum() for s in app['apklabels'].astype(str).values]
#app = app[app['is_num_apkname'] == False]
app.shape[0]

# strip the list brackets and whitespace, keeping the comma-separated tokens
app['apklabels'] = [s.replace("[", "").replace("]", "").replace(" ", "") for s in app['apklabels']]
app.head(20)
doc_clean = [c.split(',') for c in app['apklabels']]
del app
gc.collect()

logger.info('building dictionary')
dictionary = corpora.Dictionary(doc_clean)
# filter_extremes and compactify modify the dictionary in place and return None,
# so their results must not be assigned back to `dictionary`
dictionary.filter_extremes(no_below=5, no_above=0.3, keep_n=100000)
dictionary.compactify()
dictionary.save("lda_dictionary.dic")

logger.info('building doc-term matrix')
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
del doc_clean
gc.collect()
corpora.MmCorpus.serialize('lda_', doc_term_matrix)  # store to disk, for later use
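MmCorpus.serialize writes the bag-of-words corpus to disk in Matrix Market format (together with an index file for random access), and MmCorpus then streams documents from disk instead of holding everything in memory, which is what makes the split-into-separate-scripts workflow viable. Reloading looks like:

from gensim import corpora
corpus = corpora.MmCorpus('lda_')  # documents are streamed from disk, not fully loaded
print(corpus)                      # reports the number of documents and features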
Training itself runs in a second script, which reloads the dictionary and the serialized corpus:
import gensim
from gensim import corpora
import nltk
from nltk.corpus import stopwords                 # stop words
from nltk.stem.wordnet import WordNetLemmatizer   # NLTK's WordNet lemmatizer, to map word variants to one form
import string
from gensim.utils import simple_preprocess
import operator
from gensim import corpora, models
import os, sys
import gc
import logging
from logzero import logger

logging.basicConfig(format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.INFO)

# from this step on, the tags are loaded in as the text
logger.info('loading dictionary')
dictionary = corpora.Dictionary.load("lda_dictionary.dic")
logger.info('loading doc-term matrix')
doc_term_matrix = gensim.corpora.MmCorpus('lda_')

# weight the corpus with tf-idf:
logger.info('building tf-idf corpus')
tfidf = gensim.models.TfidfModel(doc_term_matrix)
corpus_tfidf = tfidf[doc_term_matrix]
del doc_term_matrix
gc.collect()

logger.info('start training')
ldamodel = models.ldamulticore.LdaMulticore(corpus_tfidf, num_topics=100,
                                            id2word=dictionary, passes=5,
                                            iterations=300, eval_every=50,
                                            workers=3)  # the workers value was lost in this copy; 3 is a placeholder
ldamodel.save('del')
logger.info('end training')
topics = ldamodel.print_topics(num_topics=100, num_words=10)
logger.info('topics: %s', topics)
logger.info('end')
Training a single pass took close to an hour, and even with the multicore LDA the overall runtime is long. It is best to run the training offline as a .py script under nohup python, writing output to a log at the same time so the training process can be monitored. Because memory consumption is so high, the model is saved as soon as it has been generated here, and result output moves to yet another script.

Evaluating the LDA model
See:
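The first script imports CoherenceModel but never calls it; a minimal sketch of how it could score the saved model (u_mass coherence is chosen here because it needs only the corpus and dictionary, not the raw tokenized texts; this usage is not from the original post):

from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel

dictionary = corpora.Dictionary.load('lda_dictionary.dic')
corpus = corpora.MmCorpus('lda_')
ldamodel = models.LdaModel.load('del')

cm = CoherenceModel(model=ldamodel, corpus=corpus, dictionary=dictionary,
                    coherence='u_mass')
print('u_mass coherence:', cm.get_coherence())  # less negative generally means more coherent topics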
Outputting the results as a sparse matrix
The topic distributions LDA produces are kept in corpus format in the model. Calling get_document_topics() returns the distribution in that same corpus format, automatically filtering out topics whose probability is below the default threshold and returning the remaining topic ids together with the document's probability on each:

ldamodel.get_document_topics(corpus[0])
>>> [(1, 0.13500942), (3, 0.18280579), (4, 0.1801268), (7, 0.50190312)]

To use the topic distributions as feature input for lightGBM, the corpus format has to be converted into a sparse matrix.
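corpus2csc lays the matrix out as topics × documents, which is why the script below transposes it before building the DataFrame; a toy illustration (data made up):

from gensim import matutils

toy = [[(0, 0.4), (2, 0.6)],   # document 0: topics 0 and 2
       [(1, 1.0)]]             # document 1: topic 1 only
m = matutils.corpus2csc(toy, num_terms=3)  # shape: (3 topics, 2 documents)
print(m.T.toarray())           # transposed: rows are documents, columns are topics
# [[0.4 0.  0.6]
#  [0.  1.  0. ]]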
In the end the gain from this feature was modest, so this mostly served as practice with LDA and with large datasets. Going forward, embeddings or probabilistic graphical models may be worth trying to get more out of this feature data.
Building the topic features
import numpy as np
import pandas as pd
import gensim
from gensim import corpora, models
import os, sys
import gc
from scipy import sparse
import logging
from logzero import logger

logging.basicConfig(format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.INFO)

lda_model_try03 = models.ldamodel.LdaModel.load('../del')
doc_term_matrix = gensim.corpora.MmCorpus('../lda_')
tfidf = gensim.models.TfidfModel(doc_term_matrix)
corpus = tfidf[doc_term_matrix]
del doc_term_matrix
gc.collect()

# corpus-format topic distributions; topics below 0.3 probability are dropped
get_document_topics = lda_model_try03.get_document_topics(corpus, minimum_probability=0.3)
del corpus
gc.collect()

# convert to a sparse topics x documents matrix, then transpose so rows are users
# (passing num_terms=100, the topic count, would guarantee the full width even if
# the highest-numbered topics never appear)
all_topics_csr = gensim.matutils.corpus2csc(get_document_topics, printprogress=20000)
all_topics_numpy = all_topics_csr.T.toarray()
user_topic = pd.DataFrame(all_topics_numpy)

# attach the userid column so the topic features can be joined back
userid = pd.read_csv('../user_labels.csv', nrows=None, usecols=['userid'])
user_topic['userid'] = userid
del userid
gc.collect()
user_topic.to_csv('user_topic_app.csv')
#get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
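The saved CSV can then be joined back onto the main feature table before training the downstream model; a sketch (train.csv and its layout are assumed, not from the original post):

import pandas as pd

train = pd.read_csv('train.csv')  # assumed main feature table containing a userid column
topics = pd.read_csv('user_topic_app.csv', index_col=0)

# left join: samples without APP text simply get empty topic features
train = train.merge(topics, on='userid', how='left')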