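Getting started with data analysis in Python: a detailed tutorial to help beginners get up to speed quickly
This is a typical data-analysis case study. I went through the whole process from scratch myself and value the experience, so I am recording it in detail to help those who come after me get started quickly. I hope you read to the end!
Requirement: parse an OBO file and output it in JSON dictionary format.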
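The data format is as follows:
We treat one [Term] or [Typedef] block as one stanza, and one line as one record, that is, one key-value pair. Everything below follows this convention.
Now let's analyze the data:
The dataset contains two kinds of stanza, [Term] and [Typedef]. Each stanza has several key-value pairs and a unique identifier, id. Every record ends with "\n", and a stanza may contain several key-value pairs that share the same key, for example is_a.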
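To make the layout concrete, here is a small made-up sample in the style described above, together with one way to split a line into key and value. The IDs and text are invented for illustration and are not taken from go.obo, and `split_record` is a hypothetical helper name.

```python
# Made-up sample in OBO style; the IDs and text are illustrative only.
SAMPLE = """[Term]
id: EX:0000001
name: example term
is_a: EX:0000002 ! first parent
is_a: EX:0000003 ! second parent

[Typedef]
id: part_of
name: part of
"""


def split_record(line):
    """Split 'key: value' at the FIRST colon and trim whitespace."""
    key, _, value = line.partition(":")
    return key.strip(), value.strip()


records = [split_record(l) for l in SAMPLE.splitlines() if ":" in l]
print(records[0])   # ('id', 'EX:0000001')
print(records[2])   # ('is_a', 'EX:0000002 ! first parent')
```

Splitting only at the first colon matters because the values themselves contain colons.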
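Algorithm design:
1. The dataset contains two kinds of stanza, [Term] and [Typedef], so we split the data into two datasets and process them separately.
2. Loop over the dataset, extract the key of each key-value pair, count each key, and deduplicate the keys.
(My first idea was to compare the number of id keys with the counts of the other keys, in order to find which keys repeat within a stanza and so conclude that is_a repeats. That was overthinking it, haha.)
3. The records in every stanza follow a fixed order, with id always first. Stored in a dictionary, the order came out scrambled, which was unpleasant to look at, so we found a way to store the records in a list instead, which preserves the original data as closely as possible.
4. Handle key-value pairs that share a key. A dictionary cannot hold several values under one key, so we store those values in a list, which amounts to a small list nested inside the big list.
5. Traverse the dataset:
(1) Store the key and value of each key-value pair as it is read.
(2) Use "[" as the end marker and store each value under its corresponding key; together these make up one stanza.
(3) Gather the collected values together, then re-initialize the declared dictionaries and lists, ready for the next iteration.
(4) At this point only one stanza has been processed; we still need an overall list that holds all the stanzas.
(The algorithm here is still imperfect; I hope readers of this post will offer their suggestions.)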
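Step 2 asks which keys can repeat inside a single stanza. A minimal sketch of one way to find them, assuming a stanza starts at a line beginning with "[" (the function name `repeated_keys` and the sample lines are made up for illustration):

```python
def repeated_keys(lines):
    """Return the set of keys that occur more than once inside any one stanza."""
    repeated = set()
    seen = None                      # None until the first stanza header is met
    for line in lines:
        line = line.strip()
        if line.startswith("["):     # a new stanza begins: forget the old keys
            seen = set()
        elif seen is not None and ":" in line:
            key = line.split(":")[0]
            if key in seen:
                repeated.add(key)
            seen.add(key)
    return repeated


sample = [
    "[Term]", "id: EX:0000001", "is_a: EX:0000002", "is_a: EX:0000003",
    "[Term]", "id: EX:0000004", 'synonym: "a" EXACT []', 'synonym: "b" EXACT []',
]
print(sorted(repeated_keys(sample)))   # ['is_a', 'synonym']
```

Keys reported here need a list; every other key can be stored as a plain string.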
Code design and pitfalls encountered:
1. Print all the keys
Code:
'''
Print all the keys
'''
with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    for line in f.readlines():  # loop over every line of the data
        list = []  # empty list
        lable = line.split(":")[0]  # read the key name
        print(lable)
        list.append(lable)  # use append() to add an element to list
        # print(list)
    # print(lable)
    # lst2 = list(set(lst1))
    # print(lst2)
    print(list)
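2. The previous step had a problem: I had not distinguished between local and global variables. How I tracked it down: the printed list contained only the last value, which suggested the value was being overwritten. The cause was that list = [] sat inside the loop and was therefore re-created on every iteration, so I moved list outside the loop (promoted it to a "global" variable, so to speak).
Code:
with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    # dict = {}
    list = []  # empty list, now created once, before the loop
    for line in f.readlines():  # loop over every line of the data
        total = []
        lable = line.split(":")[0]  # read the key name; strictly, the keys should also be deduplicated afterwards
        # print(lable)
        # list.append(lable)  # use append() to add an element to list
        # print(list)  # done that way, list only ever held one value
        list.append(lable)
        # print(lable)
    # lst2 = list(set(lst1))
    # print(lst2)
    # print(list)
    dict = {}
    for key in list:
        dict[key] = dict.get(key, 0) + 1  # count how often each key occurs
    print(dict)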
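The same tally can also be written with `collections.Counter` from the standard library. This is only an alternative sketch: `io.StringIO` stands in for the opened go.obo file, and its content is made up.

```python
import io
from collections import Counter

# Stand-in for open('go.obo', ...); the content is made up for illustration.
f = io.StringIO("id: EX:0000001\nis_a: EX:0000002\nis_a: EX:0000003\n")

# Count the text before the first colon on every line that has one.
counts = Counter(line.split(":")[0] for line in f if ":" in line)
print(counts)         # Counter({'is_a': 2, 'id': 1})
print(list(counts))   # ['id', 'is_a'], the distinct keys in order of first appearance
```

The Counter's keys are already deduplicated, so the separate set() step is not needed.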
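3. We write the tallies to a txt file. Here a problem appeared: the output contained only the keys and none of the values. That was a bit of a joke; let's keep going.
Code:
'''
Write dict out to a txt file
'''
with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    # dict = {}
    list = []  # empty list
    for line in f.readlines():  # loop over every line of the data
        total = []
        lable = line.split(":")[0]  # read the key name; strictly, the keys should also be deduplicated afterwards
        # print(lable)
        # list.append(lable)  # use append() to add an element to list
        # print(list)  # done that way, list only ever held one value
        list.append(lable)
        # print(lable)
        print(">>>>>>>>>###")
    # lst2 = list(set(lst1))
    # print(lst2)
    # print(list)
    dict = {}
    for key in list:
        dict[key] = dict.get(key, 0) + 1
    print(dict)
    fileObject = open('', 'w')  # the output file name is empty in the original
    for ip in dict:  # note: this loop visits the keys only
        fileObject.write(ip)
        fileObject.write('\n')
    fileObject.close()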
4. Since I mostly work with JSON files, mainly for mongo, I tried converting the result to JSON, and the problem went away. That felt rather magical, but I was not sure where the problem had actually been.
Code:
import json
with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    # dict = {}
    list = []  # empty list
    for line in f.readlines():  # loop over every line of the data
        total = []
        lable = line.split(":")[0]  # read the key name; strictly, the keys should also be deduplicated afterwards
        # print(lable)
        # list.append(lable)  # use append() to add an element to list
        # print(list)  # done that way, list only ever held one value
        list.append(lable)
        # print(lable)
        print(">>>>>>>>>###")
    # lst2 = list(set(lst1))
    # print(lst2)
    # print(list)
    dict = {}
    for key in list:
        dict[key] = dict.get(key, 0) + 1
    print(dict)
    # fileObject = open('', 'w')
    # for ip in dict:
    #     fileObject.write(ip)
    #     fileObject.write('\n')
    #
    # fileObject.close()
    jsObj = json.dumps(dict)  # serialize keys and values together
    fileObject = open('jsonFile.json', 'w')
    fileObject.write(jsObj)
    fileObject.close()
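A likely explanation for the missing values in step 3: iterating over a dictionary yields its keys only, so a loop of the form `for ip in dict` writes keys and nothing else, whereas `json.dumps` serializes the whole mapping. The sketch below shows the difference; `io.StringIO` stands in for a text file, and the tallies are made up.

```python
import io
import json

counts = {"id": 3, "is_a": 5}   # made-up tallies for illustration

out = io.StringIO()             # stand-in for a text file opened with 'w'
for key in counts:              # iterating a dict yields keys only
    out.write(key + "\n")
keys_only = out.getvalue()

out = io.StringIO()
for key, value in counts.items():   # items() yields (key, value) pairs
    out.write(f"{key}\t{value}\n")
with_values = out.getvalue()

print(repr(keys_only))      # 'id\nis_a\n'
print(repr(with_values))    # 'id\t3\nis_a\t5\n'
print(json.dumps(counts))   # {"id": 3, "is_a": 5}
```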
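5. Next I ran a simple test first: take a subset of the data containing three stanzas, and extract just two fields from each stanza.
Code:
with open('nitian', 'r', encoding="utf-8") as f:  # open the file
    # dic = {}  # new dictionary
    total = []  # list of all stanzas
    newdic = []  # list for the current stanza
    # first initialization happens here
    # every field needs two variables
    id = {}
    id_number = ""  # a field that occurs once is stored as a string
    is_a = {}
    is_a_list = []  # a field that can occur several times is stored as a list
    for line in f.readlines():  # loop over every line of the data
        lable = line.split(":")[0]  # read the key name; strictly, the keys should also be deduplicated afterwards
        # print(lable)
        # start testing the text before the colon
        if lable == "id":
            id_number = line[3:]  # skip "id" (two letters) plus one colon
        elif lable == "is_a":
            is_a_list.append(line[5:].split('\n'))
        elif line[0] == "[":
            # store the data in newdic
            id["id"] = id_number
            newdic.append(id)
            is_a["is_a"] = is_a_list
            newdic.append(is_a)
            # store newdic in the overall list
            total.append(newdic)
            # re-initialize everything for the next stanza
            id = {}
            id_number = ""
            is_a = {}
            is_a_list = []
            # re-initialize the small newdic
            newdic = []
    total.append(newdic)
    print(total)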
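6. At this point a lot of problems had shown up, which meant the algorithm design itself was flawed.
The output began with a run of empty entries, {id: ""} {name: ""} {}, {} ..., that is, one extra round of initialization. Going back over the algorithm revealed the cause: we use "[" to detect the end of a stanza, so the very first "[" in the file flushes an empty stanza.
Fixes: (1) treat "[" as the start of a stanza instead; or
(2) modify the data: remove the [Term] at the beginning and append it to the end of the dataset.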
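Fix (1) can be sketched roughly as follows, under some assumptions: it keeps only the two fields of the step 5 test, it stores each stanza as one dictionary rather than a list of one-key dictionaries (dictionaries keep insertion order from Python 3.7 on, so id still comes first), and the function name and sample lines are made up.

```python
def parse_id_and_is_a(lines):
    """Collect id and is_a per stanza, treating '[' as the START of a stanza."""
    total = []
    current = None                      # no stanza open yet, so nothing empty is emitted
    for line in lines:
        line = line.strip()
        if line.startswith("["):
            if current is not None:     # close the previous stanza
                total.append(current)
            current = {"id": "", "is_a": []}
        elif current is None:
            continue                    # skip header lines before the first stanza
        elif line.startswith("id:"):
            current["id"] = line[3:].strip()
        elif line.startswith("is_a:"):
            current["is_a"].append(line[5:].strip())
    if current is not None:             # flush the last stanza at end of input
        total.append(current)
    return total


sample = [
    "format-version: 1.2", "",
    "[Term]", "id: EX:0000001", "is_a: EX:0000002 ! a", "is_a: EX:0000003 ! b", "",
    "[Term]", "id: EX:0000004",
]
result = parse_id_and_is_a(sample)
print(result)
```

Because a stanza is only emitted once one has been opened, the empty leading entries disappear, and the final flush means the data does not have to be edited to end with an extra header.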
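7. Meaningless whitespace (" ") kept showing up at the end of the values. It turned out we had not cleaned up the trailing characters after each key-value pair, so we introduced strip(). However, strip() only works on strings; to apply it to a list, you first have to take the elements out of the list and then operate on each of them.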
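A short illustration of both points, using made-up records: `split('\n')` as used in step 5 returns a list, which is where the nested lists and empty strings come from, and a list comprehension applies strip() to each element.

```python
line = "is_a: EX:0000002 ! first parent\n"   # made-up record for illustration

nested = line[5:].split("\n")    # split() returns a list, not a string
print(nested)    # [' EX:0000002 ! first parent', '']

values = [" EX:0000002 ! first parent\n", " EX:0000003 ! second parent \n"]
cleaned = [v.strip() for v in values]   # strip each string inside the list
print(cleaned)   # ['EX:0000002 ! first parent', 'EX:0000003 ! second parent']
```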
8. The key def clashes with a Python keyword. Our solution is simple and crude: write the variable name in uppercase (DEF).
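The clash only affects variable names. As a dictionary key, "def" is an ordinary string, so records stored in a dictionary need no renaming. A small sketch with a made-up value:

```python
import keyword

print(keyword.iskeyword("def"))   # True: 'def' cannot be used as a variable name

record = {}
record["def"] = '"An example definition." []'   # fine: a dict key is just a string
print(record["def"])
```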
9. The complete code is as follows:
Code:
import json


class GeneOntology(object):
    def __init__(self, path):
        self.path = path

    # Use a dictionary to remove extra values and simplify the procedure
    # def rebuild_list(self,record_name):
    #     records = {id,is_a}
    #
    # list = rebuile_list('HEADER'')
    # (record_name)

    # Define a function to read and store data
    def read_storage_data(self):
        id = {}  # Use a dictionary to store each keyword
        id_number = ""  # Store the value of each row as a string
        is_obsolete = {}
        is_obsolete_number = ""
        is_class_level = {}
        is_class_level_number = ""
        transitive_over = {}
        transitive_over_number = ""
        # The name "def" conflicts with a Python keyword, so it is renamed here.
        DEF = {}
        DEF_number = ""
        property_value = {}
        property_value_number = ""
        namespace = {}
        namespace_number = ""
        comment = {}
        comment_number = ""
        intersection_of = {}
        intersection_of_number = ""
        xref = {}
        xref_number = ""
        name = {}
        name_number = ""
        disjoint_from = {}
        disjoint_from_number = ""
        replaced_by = {}
        replaced_by_number = ""
        relationship = {}
        relationship_number = ""
        alt_id = {}
        alt_id_number = ""
        holds_over_chain = {}
        holds_over_chain_number = ""
        subset = {}
        subset_number = ""
        expand_assertion_to = {}
        expand_assertion_to_number = ""
        is_transitive = {}
        is_transitive_number = ""
        is_metadata_tag = {}
        is_metadata_tag_number = ""
        inverse_of = {}
        inverse_of_number = ""
        created_by = {}
        created_by_number = ""
        creation_date = {}
        creation_date_number = ""
        consider = {}
        consider_number = ""
        is_a = {}
        is_a_list = []  # A field name may have multiple values, so it is stored in a list.
        synonym = {}
        synonym_list = []
        newdic = []
        f = open(self.path, 'r', encoding="utf-8")
        for line in f.readlines():
            lable = line.split(":")[0]  # Read the field name: everything from position 0 up to the first ":"
            # View the field name that was read
            # print(lable)
            # Start to judge
            if lable == "id":  # Judge the label for storage
                id_number = line[3:].strip()  # Skip the label and colon (3 characters); strip() removes surrounding whitespace.
            elif lable == "is_obsolete":
                is_obsolete_number = line[12:].strip()
            elif lable == "is_class_level":
                is_class_level_number = line[15:].strip()
            elif lable == "transitive_over":
                transitive_over_number = line[16:]
            elif lable == "def":
                DEF_number = line[5:].strip()
            elif lable == "property_value":
                property_value_number = line[15:].strip()
            elif lable == "namespace":
                namespace_number = line[10:].strip()
            elif lable == "comment":
                comment_number = line[8:].strip()
            elif lable == "intersection_of":
                intersection_of_number = line[16:].strip()
            elif lable == "xref":
                xref_number = line[5:].strip()
            elif lable == "name":
                name_number = line[5:].strip()
            elif lable == "disjoint_from":
                disjoint_from_number = line[14:].strip()
            elif lable == "replaced_by":
                # Inferred from the pattern above; the original listing is cut off at this branch.
                replaced_by_number = line[12:].strip()
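The branches in the listing differ only in the key name and the slice offset, so the chain could be collapsed by splitting each line at its first colon. The following is a sketch of that alternative, not the author's code: `parse_obo_lines` and `MULTI_VALUED` are hypothetical names, the extra "stanza" key is an addition for illustration, and the sample lines are made up.

```python
import json

MULTI_VALUED = {"is_a", "synonym"}   # the two list-valued fields in the listing above


def parse_obo_lines(lines):
    """Return a list of stanzas; each stanza is a dict whose keys keep file order."""
    stanzas, current = [], None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):             # '[Term]' or '[Typedef]' opens a stanza
            current = {"stanza": line.strip("[]")}
            stanzas.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")   # split at the first colon only
            value = value.strip()
            if key in MULTI_VALUED:
                current.setdefault(key, []).append(value)
            else:
                current[key] = value
    return stanzas


sample = [
    "[Term]", "id: EX:0000001", "name: example term",
    'def: "An example definition." []',
    "is_a: EX:0000002 ! first parent", "is_a: EX:0000003 ! second parent",
    "", "[Typedef]", "id: part_of", "is_transitive: true",
]
parsed = parse_obo_lines(sample)
print(json.dumps(parsed, indent=2))
```

With real data the lines would come from the opened file, and `json.dump(parsed, fileObject)` would write the result straight to jsonFile.json.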