Reading a Text File and Counting Word Frequencies with Python
My 360 Browser crashed while I was writing this post, but the content came back. Thanks to the auto-save feature of cnblogs (博客园)!
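I have been learning Python recently and wrote a small program that reads a text document from a given path, counts how many times each word occurs, and prints the result.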
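import os

# Creates the folder and the file
def createFile(fileName, content, filePath=r'd:/PythonExercise/'):
    # Create the folder; exist_ok=True keeps a second run from failing when the folder already exists
    os.makedirs(filePath, exist_ok=True)
    fullPath = filePath + fileName
    # The with statement closes the file when the block ends
    with open(fullPath, 'w') as f:
        f.write(content)

# Write the sentence below to the specified file (pass the file name as the first argument)
createFile('', "Life is short,so let's just enjoying Python!")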
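Here is a minimal sketch of the same file-creation step that can be run repeatedly without errors. The name `create_file`, the temporary folder, and the file name `demo.txt` are assumptions made for this illustration; they are not part of the program above.

```python
import os
import tempfile


def create_file(file_name, content, folder):
    """Write content to folder/file_name, creating the folder if needed."""
    # exist_ok=True avoids the FileExistsError that os.mkdir raises on a second run
    os.makedirs(folder, exist_ok=True)
    full_path = os.path.join(folder, file_name)
    # The with-block closes the file even if write() raises
    with open(full_path, 'w', encoding='utf-8') as f:
        f.write(content)
    return full_path


# Hypothetical demo: a temporary folder and a made-up file name
demo_folder = os.path.join(tempfile.mkdtemp(), 'PythonExercise')
sentence = "Life is short,so let's just enjoying Python!"
demo_path = create_file('demo.txt', sentence, demo_folder)
# Calling it again is safe because the folder is allowed to exist already
demo_path = create_file('demo.txt', sentence, demo_folder)
print(demo_path)
```

Using `os.path.join` instead of string concatenation also means the folder path does not have to end with a slash.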
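# Reads a file and counts word frequencies
def getWordsFrequency(fullFilePath=r'd:/'):
    f = open(fullFilePath, 'r')
    # Read the content and split it on whitespace (the default when split() gets no argument); this works for English
    tmp = f.readline().split()
    # For Chinese, use the line below instead: Chinese characters are not separated by spaces,
    # so the line is read as one str, and list() turns it into a list of single characters
    # tmp = list(f.readline())

    f.close()
    print(tmp)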
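    # Punctuation set
    punctuation = '''~!@#$%^&*()_+-[]{};:,./?"'''
    # Splitting on whitespace alone leaves words glued to punctuation, such as "if," or "not!".
    # Split every element on each punctuation mark and collect the pieces in a new list;
    # removing from tmp while looping over it would skip elements.
    splitWords = []
    for i in tmp:
        pieces = [i]
        for p in punctuation:
            pieces = [part for piece in pieces for part in piece.split(p)]
        splitWords.extend(pieces)
    tmp = splitWords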
    # Drop the empty strings '' and convert every word to lower case.
    # Removing elements from tmp inside a "for j in tmp" loop skips the element after each
    # removal and can leave a '' behind, so build a new list in a single pass instead.
    tmp = [j.lower() for j in tmp if j != '']
    # print(tmp.count(''))
    # print('tmp after lower case', tmp)
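    # Deduplicate the processed word list and convert it to a tuple for later use
    keys = tuple(set(tmp))
    print(keys)
    # Create a list with the same length as keys (the tuple of distinct words), initialised to 0, ready for counting
    freq = [0] * len(keys)
    # print(freq)
    # Take each word from keys, count its occurrences in tmp, and store the count in freq.
    # freq has the same length as keys, so their indexes correspond one-to-one,
    # which makes it easy to build the dictionary later
    for index, words in enumerate(keys):
        freq[index] = tmp.count(words)
    # print(freq)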
    # Create a new dictionary
    freqDict = {}
    # Import all of keys at once as the dictionary's keys
    freqDict = dict.fromkeys(keys)
    # Printing freqDict at this point shows that every value is None
    # print(freqDict)
    # Assign the values of freq, which correspond one-to-one with keys, to the matching keys in freqDict
    for index, words in enumerate(keys):
        # print(freqDict[words])
        freqDict[words] = freq[index]
    print(freqDict)
    return freqDict
# Running this function prints the word frequencies as a dictionary
getWordsFrequency()
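For comparison, here is a compact sketch of the same counting step built on the standard library's `collections.Counter` and a regular expression. It is an alternative technique, not the program above: the function name `count_words` and the word pattern are my own assumptions, and the pattern only recognises ASCII letters and the straight apostrophe, so digits and curly apostrophes (as in 'there’s') are treated as separators.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lower-cased word to its number of occurrences."""
    # Runs of letters, optionally joined by straight apostrophes, so "aren't" stays one word
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


sample = "Life is short,so let's just enjoying Python!"
freq = count_words(sample)
print(dict(freq))
# most_common(n) lists the n most frequent words with their counts
print(freq.most_common(3))
```

`Counter` is a `dict` subclass, so the result can be used wherever the dictionary returned by `getWordsFrequency` is used.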
# The following statements randomly pick 10 of the words read above and print them
wordSet = list(getWordsFrequency().keys())
# print(wordSet)
import random as r
# Draw 10 distinct elements; sample() never picks the same element twice
randomWords = r.sample(wordSet, 10)
# The next three lines also draw 10 words, but the result may contain duplicates
# randomWords = []
# for i in range(10):
#     randomWords.append(r.choice(wordSet))
print(randomWords)
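The difference between the two sampling approaches can be seen in a small sketch. The word list and the seed below are made up for the demonstration. Note that `random.sample` raises `ValueError` when asked for more items than the list contains, so drawing 10 words only works for texts with at least 10 distinct words.

```python
import random

# Hypothetical word list with only 8 distinct words
words = ['life', 'is', 'short', 'so', "let's", 'just', 'enjoying', 'python']

random.seed(0)  # fixed seed only so that repeated runs of this demo match

# Sampling without replacement: every position is picked at most once
distinct = random.sample(words, 5)
# Sampling with replacement: the same word may come up several times
maybe_repeated = [random.choice(words) for _ in range(5)]

print(distinct)
print(maybe_repeated)

# sample() cannot draw more items than the list holds
try:
    random.sample(words, len(words) + 1)
except ValueError as err:
    print('sample failed:', err)
```

A simple guard is to request `min(10, len(wordSet))` items instead of a fixed 10.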
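Program output
(1) I picked a paragraph from a random news item on the international edition of Bing and saved it to the text file; the output is as follows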
{'orbit': 1, 'hanging': 2, 'another': 1, 'pretty': 2, 'planets': 2, 'planet': 2, 'of': 2, 'system': 2, 'a': 4, 'rings': 1,
'two': 1, 'there’s': 1, 'life': 1, 'claim': 1, 'features': 1, 'moons': 3, 'both': 1, 'conditions': 1, 'means': 1, 'survey': 1, 'moon': 3, 'chance': 1, 'possible': 1, ['to', 'system', 'reports', 'it', 'would', 'no', 'are', 'pretty', 'all', 'make']
(2) I picked a news item, took its second paragraph, and saved it to the text file; the output is as follows
{'闻': 1, '开': 1, '家': 2, '表': 1, '管': 1, '生': 4, '务': 1, '会': 1, '疫': 1, '系': 1, '物': 3, '非': 1, '了': 1, '院': 1, '以': 1, '法': 1, '日': 1, '施': 2, '格': 1, '工': 1, '最': 1, '控': 2, '取': 1, ',': 3, '措': 1, '布': 1, '严': 2, '新': 1, '保': 1, '厉': 1, '作': 1, '。': 2, '召': 1, '来': 1, '打': 1, '野<class 'list'>
['严', '联', '的', '2', '动', ',', '和', '院', '物', '施']
(3) I saved the third text to the text file; the output is as follows
{"aren't": 1, 'implicit': 1, 'right': 1, 'practicality': 1, 'nested': 1, 'although': 3, 'beautiful': 1, 'break': 1, 'errors': 1, 'of': 3, 'refuse': 1, 'a': 2, 'dense': 1, 'more': 1, 'easy': 1, "you're": 1, 'sparse': 1, 'peters': 1, 'do': 2, 'may': 2, 'explicit': 1, 'implementation': 2, 'ofte <class 'list'>
['of', 'honking', 'preferably', 'by', "you're", 'complicated', 'sparse', 'and', 'pass', 'enough']
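The comment in the program about a leftover '' that could not be removed in the first pass has a simple explanation: removing items from a list while a `for` loop is walking over that same list makes the loop skip the element right after each removal. The sketch below reproduces the effect on a made-up list and shows one way to avoid it.

```python
words = ['life', '', '', 'is', '', 'short']

# Problematic pattern: removing from the list that the for-loop is walking over
buggy = list(words)
for w in buggy:
    if w == '':
        buggy.remove(w)
print(buggy)    # one '' survives: ['life', 'is', '', 'short']

# Safer pattern: build a new list instead of mutating the one being iterated
cleaned = [w.lower() for w in words if w != '']
print(cleaned)  # ['life', 'is', 'short']
```

The same pitfall applies to the punctuation-splitting loop, where a word that directly follows a removed element is never checked.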
In case the links stop working, the three texts used in (1), (2), and (3) are reproduced below.
1.
Several planets in our solar system are famous for distinctive features. Saturn has its wondrous rings and Jupiter has its famous red spot, while Uranus has its many moons and planets like Mercury have no moons at all. Earth’s main claim to fame is it being the only planet in the solar system with the perfect conditions for human life and a single moon, both of which make it a pretty special place for use to call home. But, there’s a chance that Earth could have another moon hanging out in its orbit for now. CNET reports that Catalina Sky Survey astronomers in Tuscon, AZ has its eyes on an asteroid hanging out in Earth’s gravity. It’s possible that this body of rock could be a mini-moon orbiting our planet, which means Earth would have two moons. That’s pretty cool, right?
2.
国务院联防联控机制27日召开新闻发布会,介绍坚决取缔和严厉打击非法野生动物市场和贸易工作情况。国家林草局野生动植物保护司副司长王维胜表示,疫情发生以来,国家林草局实施了最为严格的野生动物管控系列措施。
3.
The Zen of Python, by Tim Peters Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those!
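For the Chinese text (text 2 above), the program counts single characters rather than words, because `list()` splits a string into its characters. A short sketch of that per-character count using `collections.Counter`, which accepts a string directly, is shown below; the sample is only the opening clause of text 2, and the variable names are my own.

```python
from collections import Counter

# Opening clause of text 2; a Counter built from a str counts each character
text = "国务院联防联控机制27日召开新闻发布会"
char_freq = Counter(text)

print(char_freq['联'])           # 联 occurs twice in this clause
print(char_freq.most_common(3))
```

Counting real Chinese words rather than characters would require a word-segmentation step first, which is outside what this program does.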