Web Crawlers: HTML Retrieval and Parsing (Java)

A classmate recently asked me to write a page-analysis program to extract certain content from web pages. After two days of work I ended up with a program that pulls table content out of a page. It is fairly simple, but I think it can still help anyone who wants to write a web crawler, since page analysis and content extraction are indispensable steps in any crawler.
A complete web crawler consists of the following steps:
1. Fetch the html content for a given url.
2. Analyze the html content and extract its links.
3. Keep iterating the first two steps until told to stop.
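The three steps above can be sketched as one small loop. The sketch below is my own illustration, not part of the article's code: link extraction is done with a naive regex, and the download step is abstracted behind a fetch function so the loop itself is easy to test (in a real crawler that function would be the HttpURLConnection code shown later in this article).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlSketch {
    // Step 2: pull href targets out of raw html with a simple regex.
    static final Pattern HREF = Pattern.compile("href=[\"']([^\"'#]+)[\"']");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Steps 1-3 as a bounded breadth-first traversal; `fetch` stands in
    // for the html download and may return null for unreachable urls.
    static Set<String> crawl(String seed, Function<String, String> fetch, int limit) {
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < limit) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;        // already crawled
            String html = fetch.apply(url);         // step 1: fetch html
            if (html == null) continue;
            for (String link : extractLinks(html)) {
                frontier.add(link);                 // steps 2+3: extract and iterate
            }
        }
        return visited;
    }
}
```

The limit parameter is the "until told to stop" condition; real crawlers would add politeness delays, url normalization, and robots.txt handling on top of this skeleton.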
It is easy to see that the truly critical parts are the last two steps: analyzing the html content, and designing the iteration, i.e. the actual crawling strategy. Here we focus on the first part, html analysis, and introduce a fairly well-known jar, htmlparser, showing how to use it to analyze html and extract the content you want. Note that this article is not a detailed reference for htmlparser; it only shows enough to get your html analysis done. For deeper usage you will have to study and dig on your own.
Before we start, here are htmlparser's wiki and download addresses:
wiki: htmlparser.sourceforge/
download: sourceforge/projects/htmlparser/files/Integration-Builds/2.0-20060923/
After downloading you will find the distribution contains several jars: filterbuilder.jar, htmllexer.jar, htmlparser.jar, sitecapturer.jar and thumbelina.jar. As I understand it, htmllexer.jar handles the lexical side of html: htmlparser treats every html element (html, p, div, table, ...) as a tag, and the lexer can be seen as the tokenizer for all the tags on a page. htmlparser.jar then does the actual parsing, letting you apply filter conditions to extract the content you want.
With that brief description out of the way, here is what my two days of work produced.
First we fetch the html content for the seed url. The URL and URLConnection classes in the JDK's java.net package are enough for this; see the following code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HtmlRetrieve {
    /**
     * @param html_url the url of the html page.
     * @return null if the url cannot be read, otherwise the content of the page.
     */
    public String GetContentOfHtml(String html_url) {
        try {
            URL url = new URL(html_url);
            HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
            if (urlConn != null) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(urlConn.getInputStream(), HtmlEncoding.gbk_encoding));
                StringBuffer strBuffer = new StringBuffer();
                String line;
                while ((line = reader.readLine()) != null) {
                    strBuffer.append(line);
                }
                urlConn.disconnect();
                return strBuffer.toString();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * This function helps save html content into a file.
     * @param filePath the path of the file.
     * @param html_content the content of html to be saved.
     */
    public void SaveToFile(String filePath, String html_content) {
        try {
            BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(new File(filePath)));
            bufferedWriter.write(html_content);
            bufferedWriter.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        HtmlRetrieve htR = new HtmlRetrieve();
        try {
            String content = htR.GetContentOfHtml(
                    "www.szse/szseWeb/FrontController.szse?ACTIONID=7&CATALOGID=1265_xyjy&txtKsrq=2000-11-08&txtZzrq");
            System.out.println(content);
            //htR.HtmlParse(htR.GetContentOfHtml("istock.jrj/list,600071.html"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Two functions: one fetches content from the seed url, the other saves that content to a file. I find it straightforward; if anything is unclear, consult the Java API documentation. The code below extracts table content from html. Extracting a table really just means filtering the html content: htmlparser supports setting filter conditions and also combining filters, which you can explore in the various filter classes of htmlparser's filters package. The Parser class is htmlparser's central class. There are several ways to construct a parser: you can pass it a url, or pass the html content itself to the constructor. In the reference code below I pass the html content; remember to set the encoding used for parsing. You can also pass the parser only a fragment of html and parse just that fragment.
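Since you must tell the parser which encoding to use, it helps to know what the page declares. The small helper below is my own addition (it is not part of htmlparser): it sniffs the charset from the page's meta tag so you can pick the right encoding before constructing the parser, falling back to a default when nothing is declared.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Matches both <meta charset="gbk"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the html, or the fallback if none is declared. */
    static String sniff(String htmlPrefix, String fallback) {
        Matcher m = META_CHARSET.matcher(htmlPrefix);
        return m.find() ? m.group(1).toUpperCase() : fallback;
    }
}
```

In practice only the first few kilobytes of the page need to be scanned, since meta tags live in the head; the HTTP Content-Type header, when present, should take precedence over the meta tag.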
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * @author kalvin
 * This class helps you get content out of html.
 */
public class HtmlParse {

    public HtmlParse() {
    }

    /**
     * Retrieves table content; you can add filter conditions.
     * The htmlparser jar provides many filters to help you filter the content of html.
     * @param html_content the html to parse.
     * @param encoding the encoding used for parsing.
     * @param className the node class to match, e.g. TableTag.class.
     * @param filter_str only tags whose text starts with this string are accepted.
     * @return the matched nodes as html, or null on failure.
     */
    public String ParseHtmlTableFromHtml(String html_content, String encoding,
            Class className, final String filter_str) {
        Parser parser = Parser.createParser(html_content, encoding);
        if (parser != null) {
            NodeClassFilter nodeClassFilter = new NodeClassFilter(className) {
                private static final long serialVersionUID = 1L;
                public boolean accept(Node node) {
                    return node.getText().startsWith(filter_str);
                }
            };
            StringBuffer strBuffer = new StringBuffer();
            try {
                NodeList nodeList = parser.extractAllNodesThatMatch(nodeClassFilter);
                if (nodeList != null) {
                    int size = nodeList.size();
                    for (int i = 0; i < size; i++) {
                        Node node = nodeList.elementAt(i);
                        if (node != null) {
                            strBuffer.append(node.toHtml());
                        }
                    }
                }
            } catch (ParserException e) {
                e.printStackTrace();
            }
            return strBuffer.toString();
        }
        return null;
    }
    public static void main(String args[]) {
        HtmlParse htmlParse = new HtmlParse();
        HtmlRetrieve htmlRetrieve = new HtmlRetrieve();
        String html_content = htmlRetrieve.GetContentOfHtml("istock.jrj/list,600071.html");
        String filter_str = "table class=\"table\" id=\"topiclisttitle\"";
        String table_content = htmlParse.ParseHtmlTableFromHtml(html_content,
                HtmlEncoding.gbk_encoding, TableTag.class, filter_str);
        if (table_content != null) {
            System.out.println(table_content);
        }
        /* An earlier experiment, kept for reference: building the parser from a url
           and trying out different filters and filter combinations.
        Parser parse;
        try {
            parse = new Parser("istock.jrj/list,600071.html");
            parse.setEncoding("GBK");
            NodeFilter nodeFilter = new NodeClassFilter(TableTag.class) {
                public boolean accept(Node node) {
                    return node.getText().startsWith("table class=\"table\" id=\"topiclisttitle\"");
                }
            };
            TagNameFilter tagNameFilter = new TagNameFilter("tr");
            TagNameFilter tdTagNameFilter = new TagNameFilter("td");
            StringFilter trStringFilter = new StringFilter("cls-data-tr");
            HasAttributeFilter attributeFilter = new HasAttributeFilter("class", "cls-data-tr");
            AndFilter andFilter = new AndFilter(attributeFilter, trStringFilter);
            NodeList nodeList = parse.extractAllNodesThatMatch(nodeFilter);
            int size = nodeList.size();
            System.out.println(size);
            Node node = nodeList.elementAt(0);
            System.out.println(node.toHtml());
            for (int i = 0; i < size; i++) {
                node = nodeList.elementAt(i);
                NodeList tdNodeList = node.getChildren();
                tdNodeList = tdNodeList.extractAllNodesThatMatch(tdTagNameFilter);
                int tdNodeSize = tdNodeList.size();
                for (int j = 0; j < tdNodeSize; j++) {
                    node = tdNodeList.elementAt(j);
                    System.out.println(node.toPlainTextString());
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        */
    }
}
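If you only need the plain text inside the table cells and do not want to pull in the htmlparser jar at all, the same filtering idea can be sketched with regular expressions. This is my own quick-and-dirty illustration, not the article's method: it is fragile on nested tables, but fine for flat rows like the cls-data-tr rows filtered above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableCells {
    // Matches each <td ...>...</td> pair; DOTALL lets cells span line breaks.
    private static final Pattern TD = Pattern.compile("<td[^>]*>(.*?)</td>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Strips any markup left inside a cell, e.g. <b> or <a> tags.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the plain text of every td cell, in document order. */
    static List<String> cells(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TD.matcher(html);
        while (m.find()) {
            out.add(TAG.matcher(m.group(1)).replaceAll("").trim());
        }
        return out;
    }
}
```

A proper parser such as htmlparser remains the safer choice for real pages, since regexes cannot handle nesting or malformed markup reliably.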
The code above uses a page-encoding constant, defined in the following class:
/**
 * @author kalvin
 * This class defines some common encodings.
 */
public class HtmlEncoding {
    public static final String gbk_encoding = "GBK";
    public static final String utf8_encoding = "utf-8";
    public static final String utf16_encoding = "utf-16";
}
Finally, a test class that parses the table content we retrieved.
import org.dom4j.Document;
import org.dom4j.io.SAXReader;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * @author kalvin
 * Html documents are constructed from many tags.
 * eg:
 * <html><head><title></title></head><body><table></table></body></html>
 * html, head, title, body, table: all of these are tags.
 */
public class ParseTable {