Python Chinese Word-Frequency Analysis: Character Appearance Counts in 红楼梦 (Dream of the Red Chamber)
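This post walks through a word-frequency count in Python.
It relies on the jieba module, a widely used Python library for Chinese word segmentation.
First, we read the text of the novel and segment it into words with jieba's lcut function.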
import jieba

# Read the full text of 红楼梦
with open('红楼梦.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
# Segment the text into a list of words with jieba
words = jieba.lcut(txt)
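Next, we count how often each character's name appears.
To decide which words are names, we create a separate text file that serves as a dictionary of character names.
Some characters are referred to in more than one way, so we first map each alternative form to a single canonical name.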
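The name file is read with `split('、')`, so it is expected to hold the names on one line, separated by the Chinese enumeration comma. As a minimal sketch under that assumption (the sample names and the helper `parse_names` are illustrative and not taken from the author's file), the parsing step could be made tolerant of stray whitespace and empty entries like this:

```python
def parse_names(raw):
    """Split a '、'-separated string into a set of non-empty names."""
    # Hypothetical helper: strips whitespace and newlines around each
    # name and drops empty entries left by a trailing separator.
    return {name.strip() for name in raw.split('、') if name.strip()}


# Illustrative sample of what the name file's content might look like.
sample = '贾母、贾宝玉、林黛玉、\n'
names = parse_names(sample)
print(sorted(names))
```

Returning a set rather than a list also makes the later `word not in names` check a hash lookup instead of a linear scan.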
# Initialize the counts dictionary that stores how often each name appears
counts = {}
# Read the list of character names (separated by '、')
with open('人名.txt', 'r', encoding='utf-8') as f:
    names = f.read().strip().split('、')
# Filter the segmented words: skip what we do not need and keep only valid names
for word in words:
    # Single-character tokens are never counted as names
    if len(word) == 1:
        continue
    # Map alternative forms and grouped names to one canonical name
    elif word in ('贾母', '老太太'):
        word = '贾母'
    elif word in ('贾珍', '尤氏'):
        word = '贾珍'
    elif word in ('贾蓉', '秦可卿'):
        word = '贾蓉'
    elif word in ('贾赦', '邢夫人'):
        word = '贾赦'
    elif word in ('贾政', '王夫人'):
        word = '贾政'
    elif word in ('袭人', '蕊珠'):
        word = '袭人'
    elif word in ('贾琏', '王熙凤'):
        word = '贾琏'
    elif word in ('紫鹃', '鹦哥'):
        word = '紫鹃'
    elif word in ('翠缕', '缕儿'):
        word = '翠缕'
    elif word in ('香菱', '甄英莲'):
        word = '香菱'
    elif word in ('豆官', '豆童'):
        word = '豆官'
    elif word in ('薛蝌', '邢岫烟'):
        word = '薛蝌'
    elif word in ('薛蟠', '夏金桂'):
        word = '薛蟠'
    elif word in ('贾宝玉', '宝玉'):
        word = '贾宝玉'
    elif word in ('林黛玉', '林姑娘', '黛玉'):
        word = '林黛玉'
    # Skip anything that is not in the name list
    if word not in names:
        continue
    # Increment the count, starting from 0 for a name seen the first time
    counts[word] = counts.get(word, 0) + 1
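The long elif chain can also be expressed as a lookup table. The following is a sketch of that alternative, not the author's code: it swaps the chain for a dictionary and uses `collections.Counter` for the tally. The `ALIASES` table, the function `count_names`, and the hand-made token list are assumptions for illustration, and only a few of the aliases are included.

```python
from collections import Counter

# Alias -> canonical name. These entries mirror a few branches of the
# elif chain; the table would need extending to cover the rest.
ALIASES = {
    '老太太': '贾母',
    '宝玉': '贾宝玉',
    '黛玉': '林黛玉',
    '林姑娘': '林黛玉',
}


def count_names(words, names):
    """Count how often each known name occurs in a list of tokens."""
    counts = Counter()
    for word in words:
        if len(word) == 1:              # skip single-character tokens
            continue
        word = ALIASES.get(word, word)  # map an alias to its canonical name
        if word in names:
            counts[word] += 1
    return counts


# Hand-made token list standing in for the output of jieba.lcut.
tokens = ['宝玉', '笑', '道', '林姑娘', '黛玉', '老太太', '贾宝玉', '丫头']
result = count_names(tokens, {'贾母', '贾宝玉', '林黛玉'})
print(result)
```

Adding a new alias then means adding one dictionary entry rather than another branch.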
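Finally, we sort the results.
# Sort the names by count, in descending order
items = list(counts.items())
# Sort key: the count, which is the second element of each (name, count) pair
items.sort(key=lambda x: x[1], reverse=True)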
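When two names have the same count, sorting on the count alone leaves them in whatever order they were first inserted. As a small sketch of one way to make the order deterministic (the counts here are made up for illustration, not real results from the novel), the sort key can include the name as a tie-breaker:

```python
from collections import Counter

# Illustrative counts, not real results from the novel.
counts = Counter({'贾宝玉': 5, '林黛玉': 5, '贾母': 2})

# Sort by count (descending), then by name so ties have a fixed order.
items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

top_name, top_count = items[0]
print(top_name, top_count)
```

The tie-breaker compares names by Unicode code point, which is arbitrary for Chinese text but reproducible from run to run.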
The complete code:
import jieba

# Read the full text of 红楼梦
with open('红楼梦.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
# Segment the text into a list of words with jieba
words = jieba.lcut(txt)
# Initialize the counts dictionary that stores how often each name appears
counts = {}
# Read the list of character names (separated by '、')
with open('人名.txt', 'r', encoding='utf-8') as f:
    names = f.read().strip().split('、')
# Filter the segmented words: skip what we do not need and keep only valid names
for word in words:
    # Single-character tokens are never counted as names
    if len(word) == 1:
        continue
    # Map alternative forms and grouped names to one canonical name
    elif word in ('贾母', '老太太'):
        word = '贾母'
    elif word in ('贾珍', '尤氏'):
        word = '贾珍'
    elif word in ('贾蓉', '秦可卿'):
        word = '贾蓉'
    elif word in ('贾赦', '邢夫人'):
        word = '贾赦'
    elif word in ('贾政', '王夫人'):
        word = '贾政'
    elif word in ('袭人', '蕊珠'):
        word = '袭人'
    elif word in ('贾琏', '王熙凤'):
        word = '贾琏'
    elif word in ('紫鹃', '鹦哥'):
        word = '紫鹃'
    elif word in ('翠缕', '缕儿'):
        word = '翠缕'
    elif word in ('香菱', '甄英莲'):
        word = '香菱'
    elif word in ('豆官', '豆童'):
        word = '豆官'
    elif word in ('薛蝌', '邢岫烟'):
        word = '薛蝌'
    elif word in ('薛蟠', '夏金桂'):
        word = '薛蟠'
    elif word in ('贾宝玉', '宝玉'):
        word = '贾宝玉'
    elif word in ('林黛玉', '林姑娘', '黛玉'):
        word = '林黛玉'
    # Skip anything that is not in the name list
    if word not in names:
        continue
    # Increment the count, starting from 0 for a name seen the first time
    counts[word] = counts.get(word, 0) + 1
# Sort the names by count, in descending order
items = list(counts.items())
# Sort key: the count, which is the second element of each (name, count) pair
items.sort(key=lambda x: x[1], reverse=True)
# print(items)
print('Most frequent:', items[0][0], 'appeared', items[0][1], 'times')
print('Least frequent:', items[-1][0], 'appeared', items[-1][1], 'times')
for item in items:
    print(item[0], 'appeared', item[1], 'times')
The output looks like this:
[image.png: screenshot of the program output]