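Introduction to the Amazon Review Dataset
The Amazon Review Dataset records user reviews of products sold on the Amazon website and is a classic dataset for recommender systems. It has been updated several times, and in chronological order there are three releases: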
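2013 version
2014 version
(If the 2014 link redirects straight to the 2018 version, use an alternative address to reach it.)
2018 version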
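The dataset is split by product category into sub-datasets such as Books, Electronics, Movies and TV, and CDs and Vinyl. Each sub-dataset contains two kinds of information.
Taking the 2014 version as the example: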
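1. Product metadata
asin: product ID
title: product name
price: price
imUrl: URL of the product image
related: related products (also bought, also viewed, bought together)
salesRank: sales rank information
brand: brand
categories: list of categories the product belongs to
Official example: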
{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
  "related":
  {
    "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
    "also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
    "bought_together": ["B002BZX8Z6"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
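To make the nested fields concrete, the short sketch below parses a trimmed copy of the official record above and reads a few values out of it. The variable names are illustrative only, and ast.literal_eval is used on the assumption that each line of the metadata file can be read as a Python dict literal, which is also what the conversion code in this article assumes.

```python
import ast

# A trimmed copy of the official example record, written as one line of text.
line = (
    '{"asin": "0000031852", "price": 3.17, '
    '"related": {"also_bought": ["B00JHONN1S", "B002BZX8Z6"], '
    '"bought_together": ["B002BZX8Z6"]}, '
    '"salesRank": {"Toys & Games": 211836}, '
    '"categories": [["Sports & Outdoors", "Other Sports", "Dance"]]}'
)

record = ast.literal_eval(line)  # parses literals only, never executes code

# 'related' maps a relation type to a list of product IDs (asin values).
also_bought = record.get("related", {}).get("also_bought", [])

# 'salesRank' maps a category name to the product's rank in that category.
rank_category, rank = next(iter(record["salesRank"].items()))

# 'categories' is a list of paths; each path runs from broad to specific.
category_path = " > ".join(record["categories"][0])

print(also_bought)
print(rank_category, rank)
print(category_path)
```

Fields such as related and salesRank are missing from some records, so reading them with .get and a default is the safer habit.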
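2. User rating records
reviewerID: user ID
asin: product ID
reviewerName: user name
helpful: helpfulness rating of the review, e.g. 2/3 (stored as [2, 3])
reviewText: text of the review
overall: rating
summary: summary of the review
unixReviewTime: review time as a Unix timestamp
reviewTime: review time as a raw date string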
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
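The helpful and time fields are the two that usually need a little decoding. As a minimal sketch using the values from the record above, the snippet below turns helpful into a ratio and converts unixReviewTime into a calendar date. It assumes the timestamp should be interpreted in UTC, and the variable names are illustrative.

```python
from datetime import datetime, timezone

# The relevant fields from the example review record.
review = {
    "helpful": [2, 3],
    "overall": 5.0,
    "unixReviewTime": 1252800000,
    "reviewTime": "09 13, 2009",
}

# 'helpful' is [helpful votes, total votes].
helpful_votes, total_votes = review["helpful"]
# Guard against reviews that received no votes at all.
helpful_ratio = helpful_votes / total_votes if total_votes else None

# 'unixReviewTime' is seconds since the Unix epoch (UTC assumed here).
review_date = datetime.fromtimestamp(
    review["unixReviewTime"], tz=timezone.utc
).date()

# 'reviewTime' is the same date as a "month day, year" string.
parsed_date = datetime.strptime(review["reviewTime"], "%m %d, %Y").date()

print(helpful_ratio)
print(review_date.isoformat())
print(parsed_date == review_date)
```

Since both time fields carry the same date, keeping only one of them in the converted CSV loses nothing.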
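Reading the Amazon dataset:
The downloaded data comes as JSON files with one record per line, which are awkward to work with directly, so this section shows how to convert them to CSV. The 2014 Amazon Electronics dataset is used as the example.
Reading the product metadata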
import ast
import pandas as pd

file_path = 'meta_Electronics.json'
useless_col = ['imUrl', 'salesRank', 'related', 'title', 'description']  # fields to drop

rows = []
with open(file_path, 'r') as fin:
    for line in fin:
        # Each line holds one record; ast.literal_eval parses it
        # without executing arbitrary code, unlike eval.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # ignore fields that are absent
        rows.append(d)

df = pd.DataFrame(rows)
df.to_csv('meta_Electronics.csv', index=False)
Reading the user rating records
import ast
import pandas as pd

file_path = 'Electronics_10.json'
useless_col = ['reviewerName', 'reviewText', 'unixReviewTime', 'summary']  # fields to drop

rows = []
with open(file_path, 'r') as fin:
    for line in fin:
        # Parse one review record per line without executing code.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # ignore fields that are absent
        rows.append(d)

df = pd.DataFrame(rows)
df.to_csv('Electronics_10.csv', index=False)
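The two conversions above differ only in the input file and the list of dropped fields, so they can be folded into one helper. The sketch below is one possible way to do that. The function name json_lines_to_csv and its parameters are hypothetical, and it writes the CSV with the standard-library csv module rather than pandas so that it runs without extra dependencies. The demo at the end uses two tiny made-up records, not real dataset lines.

```python
import ast
import csv
import os
import tempfile


def json_lines_to_csv(src_path, dst_path, drop_columns=()):
    """Convert a file with one record per line into a CSV file.

    Returns the number of records written.
    """
    rows, fieldnames = [], []
    with open(src_path, "r") as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = ast.literal_eval(line)  # literals only, no code execution
            for col in drop_columns:
                record.pop(col, None)  # ignore fields that are absent
            for key in record:
                if key not in fieldnames:
                    fieldnames.append(key)  # keep first-seen column order
            rows.append(record)
    with open(dst_path, "w", newline="") as fout:
        # restval fills in columns that a given record does not have.
        writer = csv.DictWriter(fout, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo on two made-up records written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.json")
    dst = os.path.join(tmp, "sample.csv")
    with open(src, "w") as f:
        f.write('{"asin": "0000031852", "price": 3.17, "imUrl": "x"}\n')
        f.write('{"asin": "0000013714", "brand": "Coxlures"}\n')
    n_rows = json_lines_to_csv(src, dst, drop_columns=["imUrl"])
    with open(dst, newline="") as f:
        csv_rows = list(csv.reader(f))

print(n_rows)
print(csv_rows)
```

With such a helper, the metadata conversion would become a single call with 'meta_Electronics.json' and its dropped fields, and the review conversion a second call with 'Electronics_10.json'. Note that this version holds all rows in memory before writing, just as the pandas version does.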