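Automatically Translating an English Paper PDF with Python (Part 39)
Techniques involved:
1. Reading PDF text with Python
2. pandas: reading CSV, merging multiple datasets, writing to Excel
3. English word tokenization with Python regular expressions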
1. Read the PDF text content
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfplumber
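import pdfplumber

def read_pdf(pdf_fpath):
    """Return the text of every page in the PDF, joined by spaces."""
    page_conts = []
    with pdfplumber.open(pdf_fpath) as pdf:
        for page in pdf.pages:
            # extract_text() can come back empty for image-only pages
            page_conts.append(page.extract_text() or "")
    return " ".join(page_conts)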
pdf_fpath = "D:/tmp/Wide & Deep Learning for Recommender Systems.pdf"
pdf_cont = read_pdf(pdf_fpath)
print(pdf_cont[:2000])
2. Read the English-Chinese dictionary file
import pandas as pd
# Note: replace the stardict.csv path with the location of your own copy
df_dict = pd.read_csv("D:/tmp/ECDICT-master/stardict.csv")
df_dict.sample(10).head()
# Drop every column except word and translation
df_dict = df_dict[["word", "translation"]]
df_dict.head()
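Since only two columns are kept, they can also be selected while loading, which avoids reading the unused columns into memory. A minimal sketch, assuming pandas is installed; the inline CSV is a made-up stand-in for stardict.csv, and its `phonetic` column is only there to illustrate a column that gets skipped:

```python
import io

import pandas as pd

# Made-up stand-in for stardict.csv (illustrative data only).
csv_text = "word,phonetic,translation\ndeep,di:p,深的\nwide,waid,宽的\n"

# usecols restricts parsing to the listed columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=["word", "translation"])
print(df.columns.tolist())
```

With the real file, the same `usecols` argument would be passed alongside the file path instead of the `StringIO` object.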
3. English tokenization and data cleaning
# Tokenize: split on spaces, newlines, and common punctuation
import re
word_list = re.split(r"""[ ,.\(\)/\n|\-:=\$\["']""", pdf_cont)
word_list[:10]
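To see what this character-class split produces, here is a minimal sketch on a made-up sentence. The `tokenize` helper and the sample string are illustrative and not part of the original notebook. Wherever two delimiters sit next to each other, `re.split` leaves an empty string behind, which is why a cleaning pass is needed afterwards.

```python
import re

# Same delimiter set as the split above, written as a raw string.
DELIMS = r"""[ ,.\(\)/\n|\-:=\$\["']"""

def tokenize(text):
    """Split text on the delimiter characters (hypothetical helper)."""
    return re.split(DELIMS, text)

sample = "Wide & Deep (joint training), cross-product."
tokens = tokenize(sample)
print(tokens)

# Empty strings appear between adjacent delimiters such as ")," and ", ".
non_empty = [t for t in tokens if t]
print(non_empty)
```

Note that characters outside the delimiter set, such as `&`, survive as tokens of their own.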
# Data cleaning
word_list_clean = []
for word in word_list:
    word = str(word).lower().strip()
    # Skip empty strings, pure numbers, and single-character tokens
    if not word or word.isnumeric() or len(word) <= 1:
        continue
    word_list_clean.append(word)
word_list_clean[:20]
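The cleaning loop above does not remove stop words. If common function words clutter the result, one option is an extra membership check against a stop-word set. The sketch below wraps the same rules in a hypothetical `clean_words` helper; the small `STOP_WORDS` set is an illustrative assumption, not a list taken from the original.

```python
# Tiny illustrative stop-word set (an assumption); a real list would be much longer.
STOP_WORDS = {"the", "of", "and", "to", "in", "is", "for"}

def clean_words(words, stop_words=STOP_WORDS):
    """Lowercase tokens and drop empties, numbers, 1-char tokens, and stop words."""
    cleaned = []
    for word in words:
        word = str(word).lower().strip()
        if not word or word.isnumeric() or len(word) <= 1:
            continue
        if word in stop_words:
            continue
        cleaned.append(word)
    return cleaned

result = clean_words(["The", "", "2016", "a", "Model", "of", "Memorization"])
print(result)
```

Using a set rather than a list keeps each membership check constant-time, which matters once the stop-word list grows.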
4. Build a DataFrame from the tokenization result
df_words = pd.DataFrame({
"word": word_list_clean
})
df_words.head()
# Count word frequencies
df_words = (
df_words
.groupby("word")["word"]
.agg(count="size")
.reset_index()
.sort_values(by="count", ascending=False)
)
df_words.head(10)
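As a cross-check on the groupby result, the same counts can be computed with the standard library's `collections.Counter`. This is an alternative sketch, not the method used above; the `count_words` helper and the sample list are illustrative.

```python
from collections import Counter

def count_words(words):
    """Return (word, count) pairs sorted by descending count."""
    return Counter(words).most_common()

sample_words = ["deep", "wide", "deep", "model", "deep", "wide"]
top = count_words(sample_words)
print(top[:2])
```

Within pandas, `pd.Series(word_list_clean).value_counts()` is another way to get the same counts, already sorted in descending order.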
5. Merge with the word dictionary
df_merge = pd.merge(
left = df_dict,
right = df_words,
left_on = "word",
right_on = "word"
)
df_merge.sample(10)
df_merge.shape
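Because `pd.merge` defaults to an inner join, words that are missing from the dictionary silently drop out of `df_merge`. A minimal sketch of how to list them, assuming pandas is installed: it uses two small made-up frames in place of `df_dict` and `df_words`, and a left join with `indicator=True`.

```python
import pandas as pd

# Toy stand-ins for df_dict and df_words (illustrative data only).
toy_dict = pd.DataFrame({
    "word": ["deep", "wide"],
    "translation": ["深的", "宽的"],
})
toy_words = pd.DataFrame({
    "word": ["deep", "wide", "tensorflow"],
    "count": [3, 2, 1],
})

# Keep every word from the paper; the _merge column marks rows
# that found no matching dictionary entry.
merged = pd.merge(toy_words, toy_dict, on="word", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "word"].tolist()
print(missing)
```

Words that show up in `missing` are often proper nouns, inflected forms, or fragments left over from PDF line breaks, so the list is a quick way to judge how well the tokenization worked.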
6. Save to Excel
df_merge.to_excel("./39. pdf_chinese_english.xlsx", index=False)