[Python Crawler] Using Selenium to Crawl PubMed Biomedical Abstracts
Implementation Code
# coding=utf-8
"""
Created on 2015-12-05 Ontology Spider
@author Eastmount CSDN
URL:
    ddir/cate/736.htm
    dlive/pubmed/
    dlive/literature/1502224
"""

import time
import re
import os
import shutil
import sys
import codecs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains

# Open Firefox for the search pages and PhantomJS for the detail pages
driver = webdriver.Firefox()
driver2 = webdriver.PhantomJS(executable_path="G:\phantomjs-1.9.")
wait = ui.WebDriverWait(driver, 10)

'''
Load Ontology
Visit each biomedical entry page and download its abstract
    dlive/literature/literature_view.php?pmid=26637181
    dlive/literature/1526876
'''
def getAbstract(num, title, url):
    try:
        fileName = "E:\\PubMedSpider\\" + str(num) + ".txt"
        #result = open(fileName,"w")
        #Error: 'ascii' codec can't encode character u'\u223c'
        result = codecs.open(fileName, 'w', 'utf-8')
        result.write("[Title]\r\n")
        result.write(title + "\r\n\r\n")
        result.write("[Abstract]\r\n")
        driver2.get(url)
        elem = driver2.find_element_by_xpath("//div[@class='txt']/p")
        result.write(elem.text + "\r\n")
    except Exception, e:
        print 'Error:', e
    finally:
        result.close()
        print 'END\n'

'''
Loop over the URLs of the search result pages
Pattern: dlive/pubmed/pubmed_search.do?q=protein&page=1
'''
def getURL():
    page = 1    # total number of result pages to visit
    count = 1   # counter over all retrieved entries
    while page <= 20:
        url_page = "dlive/pubmed/pubmed_search.do?q=protein&page=" + str(page)
        print url_page
        driver.get(url_page)
        elem_url = driver.find_elements_by_xpath("//div[@id='div_data']/div/div/h3/a")
        for url in elem_url:
            num = "%05d" % count
            title = url.text
            url_content = url.get_attribute("href")
            print num
            print title
            print url_content
            # fetch the abstract with the helper function defined above
            getAbstract(num, title, url_content)
            count = count + 1
        else:
            print "Over Page " + str(page) + "\n\n"
            page = page + 1
    else:
        print "Over getUrl()\n"
        time.sleep(5)

'''
Main entry point
'''
if __name__ == '__main__':
    path = "F:\\MedSpider\\"
    if os.path.isfile(path):        # delete an existing file with that name
        os.remove(path)
    elif os.path.isdir(path):       # delete an existing directory
        shutil.rmtree(path, True)
    os.makedirs(path)               # create the output directory
    getURL()
    print "Download has finished."
Analyzing the HTML
1. On each results page, grab the 20 Protein-related links and their titles. The core of getURL() that extracts them is:
    elem_url = driver.find_elements_by_xpath("//div[@id='div_data']/div/div/h3/a")   # Selenium XPath locator
    url_content = url.get_attribute("href")
    getAbstract(num, title, url_content)
2. Then visit each article's detail page and grab the abstract (see the combined sketch after this list).
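Putting the two steps together: below is a minimal sketch of one results-page pass, reusing the driver (search pages) and driver2 (detail pages) browsers opened in the code above and the same XPath expressions. It keeps the old Selenium 2/3 API used throughout this post, and the search URL is left truncated exactly as it appears in the original, so the real base URL still has to be filled in.

# Sketch of one search-results page -> abstracts pass (old Selenium 2/3 API;
# 'driver' and 'driver2' are the browsers opened earlier in this post).
driver.get("dlive/pubmed/pubmed_search.do?q=protein&page=1")   # base URL truncated in the post
links = driver.find_elements_by_xpath("//div[@id='div_data']/div/div/h3/a")
for link in links:
    title = link.text                           # article title in the result list
    detail_url = link.get_attribute("href")     # link to the literature detail page
    driver2.get(detail_url)
    abstract = driver2.find_element_by_xpath("//div[@class='txt']/p").text
    print title
    print abstract[:80]                         # first 80 characters of the abstract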
Errors you may run into include:
1. Error: 'ascii' codec can't encode character u'\u223c'
This is a file-writing encoding error; changing open(fileName,"w") to codecs.open(fileName,'w','utf-8') usually fixes it.
2. The second error, shown below, is probably caused by the page failing to load or the connection being closed; a retry sketch follows this list.
WebDriverException: Message: Error Message => 'URL ' didn't load. Error: 'TypeError: 'null' is not an object
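The first error is exactly what the codecs.open() call in the code above works around. For the second, waiting and retrying the page load usually helps. Below is a minimal, hedged sketch against the same old Selenium API used above; the retry count, sleep time, and wait condition are my own choices, not from the original post.

# Minimal retry wrapper for flaky page loads; retry count, sleep and wait
# condition are assumptions, not taken from the original post.
import time
import codecs
import selenium.webdriver.support.ui as ui

def safe_get(browser, url, retries=3):
    for attempt in range(retries):
        try:
            browser.get(url)
            # wait up to 10 seconds until the abstract paragraph is present
            ui.WebDriverWait(browser, 10).until(
                lambda b: b.find_element_by_xpath("//div[@class='txt']/p"))
            return True
        except Exception, e:
            print 'Attempt', attempt + 1, 'failed:', e
            time.sleep(3)
    return False

# UTF-8-safe writing, the same idea as codecs.open(fileName, 'w', 'utf-8') above
out = codecs.open("test.txt", "w", "utf-8")
out.write(u"protein \u223c abstract\r\n")
out.close()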
Results
The run produces roughly 400 .txt files in total, each containing a title and an abstract. This small dataset can be used for simple biomedical tasks such as ontology learning, named entity recognition, and ontology alignment and construction.
PS: I hope this article is of some help to you. The content is very simple, but it may still be useful for beginners or readers just getting started with crawlers. It is mostly a personal online note, a quick record of a piece of code; I won't write more of these simple Selenium page-scraping articles in the future, focusing instead on smarter dynamic operations and on Scrapy and distributed Python crawlers. Please bear with any errors or shortcomings. Yesterday was my birthday; best wishes to myself, and to my dream of becoming a teacher.
(By: Eastmount, 2015-12-06, 3:30 a.m.)