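Crawling dynamic web pages: an example (WebCollector + selenium + phantomjs)
Goal: crawl dynamic web pages
Note: "dynamic web page" here covers the following cases: 1) pages that require user interaction, such as the usual login step; 2) pages whose content is generated by JS / AJAX, e.g. the HTML contains <div id="test"></div> and JS turns it into <div id="test"><span>aaa</span></div>.
The crawler used here is WebCollector 2, which is convenient to work with, but the key to handling dynamic pages is another API: selenium 2 (which integrates htmlunit and phantomjs).
1) Crawling pages that require login, e.g. Sina Weibo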
import java.util.Set;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * Crawling after login
 * Refer: /topics/33
 * github/CrawlScript/WebCollector/blob/master/README.zh-cn.md
 * Lib required: webcollector-2.07-bin, selenium-java-2.44.0 & its lib
 */
public class WebCollector1 extends DeepCrawler {

    public WebCollector1(String crawlPath) {
        super(crawlPath);
        /* Get the Sina Weibo cookie. The account and password are sent in plain text, so use a throwaway account. */
        try {
            String cookie = WeiboCN.getSinaCookie("yourAccount", "yourPwd");
            HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
            myRequester.setCookie(cookie);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
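    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract the weibo posts */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* To crawl comments as well, extract the comment-page URLs here and return them */
        return null;
    }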
    public static void main(String[] args) {
        WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of one user's weibo */
        for (int i = 0; i < 5; i++) {
            crawler.addSeed("weibo/zhouhongyi?vt=4&page=" + i);
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static class WeiboCN {
        /**
         * Gets the Sina Weibo cookie. This method only works for the weibo site
         * whose login page is opened below; it does not work for the other weibo site.
         * That site transmits data in plain text, so use a throwaway account.
         * @param username Sina Weibo user name
         * @param password Sina Weibo password
         * @return the cookies as a "name=value;" string
         * @throws Exception if the login fails
         */
        public static String getSinaCookie(String username, String password) throws Exception {
            StringBuilder sb = new StringBuilder();
            HtmlUnitDriver driver = new HtmlUnitDriver();
            driver.setJavascriptEnabled(true);
            driver.get("login.weibo/login/");
            // Fill in the login form and submit it
            WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
            mobile.sendKeys(username);
            WebElement pass = driver.findElementByCssSelector("input[name^=password]");
            pass.sendKeys(password);
            WebElement rem = driver.findElementByCssSelector("input[name=remember]");
            rem.click();
            WebElement submit = driver.findElementByCssSelector("input[name=submit]");
            submit.click();
            // Collect the cookies set by the login response
            Set<Cookie> cookieSet = driver.manage().getCookies();
            driver.close();
            for (Cookie cookie : cookieSet) {
                sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
            }
            String result = sb.toString();
            // The gsid_CTandWM cookie is only present after a successful login
            if (result.contains("gsid_CTandWM")) {
                return result;
            } else {
                throw new Exception("weibo login failed");
            }
        }
    }
}
* The custom path /home/hu/data/weibo (in WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");) is where the crawl data is saved, in the embedded database Berkeley DB.
* Overall, this code comes from the WebCollector author's sample.
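The cookie handling at the end of getSinaCookie can be tried out without selenium. The following is a minimal sketch, assuming the cookies have already been copied into a plain Map; the class CookieHeader, its method names, and the "other" cookie are hypothetical and are not part of WebCollector or selenium.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CookieHeader {
    /** Joins cookie name/value pairs into the "name=value;name=value;" form used above. */
    static String build(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    /** Returns the header string, or throws if the cookie that marks a successful login is missing. */
    static String requireLoggedIn(Map<String, String> cookies, String marker) {
        if (!cookies.containsKey(marker)) {
            throw new IllegalStateException("login failed: missing cookie " + marker);
        }
        return build(cookies);
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("gsid_CTandWM", "abc"); // placeholder value
        cookies.put("other", "xyz");        // hypothetical second cookie
        // prints gsid_CTandWM=abc;other=xyz;
        System.out.println(requireLoggedIn(cookies, "gsid_CTandWM"));
    }
}
```

Checking the cookie name as a map key is slightly stricter than searching the joined string, because a match inside some other cookie's value no longer counts as a successful login.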
2) Crawling HTML elements generated dynamically by JS
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

/*
 * Crawling JS-generated content
 * Refer: blog.csdn/smilings/article/details/7395509
 */
public class WebCollector3 extends DeepCrawler {

    public WebCollector3(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* HtmlUnitDriver can extract data generated by JS */
        // HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
        // String content = PageUtils.getPhantomJSDriver(page);
        WebDriver driver = PageUtils.getWebDriver(page);
        // List<WebElement> divInfos = driver.findElementsByCssSelector("#feed_content");
        List<WebElement> divInfos = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement divInfo : divInfos) {
            System.out.println("Text: " + divInfo.getText());
        }
        driver.quit(); // shut the browser down so driver processes do not pile up
        return null;
    }
    public static void main(String[] args) {
        WebCollector3 crawler = new WebCollector3("/home/hu/data/wb");
        for (int page = 1; page <= 5; page++) {
            // crawler.addSeed("www.sogou/web?query=" + URLEncoder.encode("编程") + "&page=" + page);
            crawler.addSeed("cq.qq/baoliao/detail.htm?294064");
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
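The commented-out sogou seed in main builds one search URL per page from a Chinese keyword. The following is a minimal sketch of that idea using only the JDK; the class SeedUrls, the method pagedSeeds, and the base URL http://example.com/web are hypothetical stand-ins, and the keyword is encoded as UTF-8 explicitly instead of relying on the platform default.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class SeedUrls {
    /** Builds one search URL per page, percent-encoding the keyword as UTF-8. */
    static List<String> pagedSeeds(String base, String keyword, int pages) {
        String q;
        try {
            q = URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
        List<String> seeds = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            seeds.add(base + "?query=" + q + "&page=" + page);
        }
        return seeds;
    }

    public static void main(String[] args) {
        // The keyword from the commented-out seed, written with Unicode escapes
        for (String seed : pagedSeeds("http://example.com/web", "\u7f16\u7a0b", 3)) {
            System.out.println(seed);
        }
    }
}
```

Each returned string could then be passed to crawler.addSeed in place of the fixed seed.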
PageUtils.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import cn.edu.hfut.dmic.webcollector.model.Page;

public class PageUtils {

    /* Opens the page in HtmlUnit with JavaScript enabled */
    public static HtmlUnitDriver getDriver(Page page) {
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    /* Same as above, but emulating the given browser version */
    public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
        HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    /* Opens the page through the WebDriver interface, here backed by PhantomJS */
    public static WebDriver getWebDriver(Page page) {
        // WebDriver driver = new HtmlUnitDriver(true);
        // System.setProperty("webdriver.chrome.driver", "D:\\Installs\\Develop\\crawling\\");
        // WebDriver driver = new ChromeDriver();
        System.setProperty("phantomjs.binary.path", "D:\\Installs\\Develop\\crawling\\phantomjs-2.0.0-windows\\bin\\");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());
        // JavascriptExecutor js = (JavascriptExecutor) driver;
        // js.executeScript("function(){}");
        return driver;
    }
    /* Calls PhantomJS directly (no selenium) and returns what the script prints */
    public static String getPhantomJSDriver(Page page) {
        Runtime rt = Runtime.getRuntime();
        Process process = null;
        try {
            // Run PhantomJS with parser.js and pass the page URL as the script argument
            process = rt.exec("D:\\Installs\\Develop\\crawling\\phantomjs-2.0.0-windows\\bin\\ " +
                    "D:\\workspace\\crawlTest1\\src\\crawlTest1\\parser.js " +
                    page.getUrl());
            InputStream in = process.getInputStream();
            InputStreamReader reader = new InputStreamReader(
                    in, "UTF-8");
            BufferedReader br = new BufferedReader(reader);
            StringBuffer sbf = new StringBuffer();
            String tmp = "";
            while ((tmp = br.readLine()) != null) {
                sbf.append(tmp);
            }
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
2.1) getDriver, which returns an HtmlUnitDriver directly, is the older approach and is already outdated; the code now uses getWebDriver, which returns a WebDriver.
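Driver type          Pros                                    Cons                                               Typical use
Real browser driver  Faithfully simulates user behaviour     Low efficiency and stability                       Compatibility testing
HtmlUnit             Fast                                    JS engine is not one used by mainstream browsers   Testing pages that contain little JS
PhantomJS            Medium speed, behaviour close to real   Cannot emulate different or specific browsers      Non-GUI functional testing
* Real browser drivers include Firefox, Chrome and IE.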
2.3) When using PhantomJSDriver I hit this error: ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies. The cause turned out to be a bug in selenium 2.44; it was only resolved after pulling phantomjsdriver-1.2.1.jar through maven.
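When a ClassNotFoundException like this appears, it can help to confirm whether the class is visible on the classpath before and after swapping jars. The following is a small diagnostic sketch using only the JDK; the class ClasspathCheck and the method isOnClasspath are hypothetical names introduced for illustration.

```java
class ClasspathCheck {
    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.openqa.selenium.browserlaunchers.Proxies";
        System.out.println(name + " present: " + isOnClasspath(name));
    }
}
```

Running it with the same classpath as the crawler shows whether the jars currently in use actually provide the class named in the error.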
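2.4) I also tried calling PhantomJS natively (that is, without selenium, invoking PhantomJS directly; see the getPhantomJSDriver method above). The native route has to run a JS script; the parser.js code used here is:
var system = require('system');
var address = system.args[1]; // the second command-line argument (the URL), used below
//console.log('Loading a web page');
var page = require('webpage').create();
var url = address;
//console.log(url);
page.open(url, function (status) {
    // Page is loaded!
    if (status !== 'success') {
        console.log('Unable to post!');
    } else {
        // Printing here streams the result to Java, which reads it through the process InputStream
        console.log(page.content);
    }
    phantom.exit(); // exit so the Java side sees the end of the stream
});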
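The Java side of this native call reads whatever the script prints. Note that the loop in getPhantomJSDriver appends each line without a separator, so the lines of the returned HTML run together. Below is a hedged sketch of an alternative reader that keeps line breaks and starts the process with ProcessBuilder; the class ProcessOutput and its method names are hypothetical, and the demo in main feeds a fake stream instead of launching PhantomJS.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ProcessOutput {
    /** Reads the whole stream as UTF-8 text, ending every line with '\n'. */
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Runs a command and returns its output; stderr is merged into stdout. */
    static String run(String... command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        String out = readAll(p.getInputStream());
        p.waitFor();
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the stdout of the PhantomJS process
        InputStream fake = new ByteArrayInputStream(
                "<html>\n<body>aaa</body>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(fake));
    }
}
```

With a real installation the call would look like ProcessOutput.run(phantomjsPath, parserJsPath, url), where all three arguments are placeholders for local values. Passing them as separate strings avoids the quoting problems of a single space-joined command line.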
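3) Afterword
3.1) Of the options I tried, HtmlUnitDriver plus PhantomJSDriver was the most reliable way to crawl dynamic pages at the time of writing.
3.2) Along the way I needed many packages and executables, and many of the downloads were blocked by the firewall. Anyone who needs them can ask me for a copy.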