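pandas: reading Parquet files with pyarrow, saving and reading Feather files; reque...
Note: when reading files with pandas, pass an absolute path. A relative path may not resolve correctly, and the read will fail.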
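One way to follow the absolute-path advice is to normalize the path before handing it to pandas. The sketch below uses only the standard library; the helper name `to_absolute` and the sample path are hypothetical and do not come from the original post.

```python
from pathlib import Path


def to_absolute(path_str):
    """Return an absolute, normalized version of path_str."""
    # expanduser() handles "~"; resolve() anchors a relative path
    # at the current working directory and removes ".." segments.
    return Path(path_str).expanduser().resolve()


# Hypothetical relative path, used only for illustration
abs_path = to_absolute("data/part-0.parquet")
print(abs_path)
```

The resulting `abs_path` can then be passed to `pd.read_parquet` or `pq.read_table`, so the read no longer depends on the directory the script was started from.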
Installing the fastparquet library requires python-snappy, which kept failing to install, so pyarrow is used instead.
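import glob
import pandas as pd
import pyarrow.parquet as pq

# Collect every file in the data directory
aaaaa = glob.glob(r'C:\Users\lo理\oss数据\*')
kkk = []
for i in aaaaa:
    # Strip the drive letter before reading
    print(i.replace("C:", ""))
    pf = pq.read_table(i.replace("C:", ""))
    df1 = pf.to_pandas()
    kkk.append(df1)
# Concatenate all the DataFrames in one call instead of a hard-coded range(1, 12) loop
m = pd.concat(kkk)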
Once pyarrow is installed, pandas can also read Parquet files directly:
df = pd.read_parquet(p, engine="pyarrow")
k12 = pd.read_parquet(r"part-***nappy.parquet")
Note on the engine pandas uses to read Parquet:
pyarrow is recommended, because with fastparquet I have seen list-valued columns fail to load and show up as None.
Saving and reading Feather files
Roughly a tenfold speedup (on large files).
Downloading a file with requests
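import requests

# The URL is reproduced as it appears in the source; requests needs a full URL including the scheme (http:// or https://).
r = requests.get("i0.hdslb/bfs/album/1eab364136f7dc024eac1d663bb843c43c996798.jpg", stream=True)
with open(r"D:\用户点击日志\img2.jpg", "wb") as f:
    # Write the response body to disk in 512-byte chunks
    for chunk in r.iter_content(chunk_size=512):
        if chunk:
            f.write(chunk)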
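Downloading with wget
shlex.split strips single and double quotes and keeps each quoted string as one argument. (?P<filename>...) defines a named group, here called filename, that matches the pattern that follows it (.+).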
import os
import re
import shlex
import subprocess

def wget_fetch(download_url, file_path):
    """Download data by calling wget."""
    # The named group "filename" captures everything between /parquet/ and the last "?"
    file_name = re.search(r"/parquet/(?P<filename>.+)\?", download_url).group("filename")
    save_path = os.path.join(file_path, file_name)
    print(save_path)
    cmd = f'wget --tries=3 --timeout=60 --output-document="{save_path}" "{download_url}"'
    cmd_list2 = shlex.split(cmd)
    # From the Python library manual:
    # Run the command described by args. Wait for command to complete, then return a CompletedProcess instance.
    cp = subprocess.run(cmd_list2)
    if cp.returncode != 0:
        print(f'Download fail; url:{download_url}')
        return None
    # In practice only one file is downloaded
    print(f"Download success; file: {save_path}")
    return save_path
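The two building blocks of the wget helper, the named regex group and `shlex.split`, can be checked on their own without running wget. The sketch below uses a hypothetical URL and a hypothetical save directory containing a space, to show that a quoted path stays a single argument.

```python
import re
import shlex

# Hypothetical URL, used only for illustration
download_url = "https://example.com/parquet/part-0001.snappy.parquet?token=abc"

# (?P<filename>.+) captures the text between "/parquet/" and the "?"
match = re.search(r"/parquet/(?P<filename>.+)\?", download_url)
file_name = match.group("filename")

# Hypothetical save directory with a space in its name
save_path = "/tmp/click logs/" + file_name
cmd = f'wget --tries=3 --timeout=60 --output-document="{save_path}" "{download_url}"'

# shlex.split removes the quotes but keeps each quoted string as one token
args = shlex.split(cmd)
for arg in args:
    print(arg)
```

Passing the resulting list to `subprocess.run` avoids going through a shell, so spaces in the save path need no further escaping.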