The Design and Realization of an Open-Source Search Engine Based on Nutch
Guojun Yu 1, Xiaoyao Xie *2, Zhijie Liu 3
Key Laboratory of Information and Computing Science of Guizhou Province
Guizhou Normal University Network Center
Guiyang, China
xyx@gznu.edu (corresponding author: Xiaoyao Xie)
Abstract—Search engines are becoming increasingly necessary and popular for surfing the Internet. However, how search engines such as Google or Baidu work is unknown to many people. This paper, through research into the open-source search engine Nutch, introduces how a common search engine works. Using Nutch, a search engine for Guizhou Normal University's website is designed, and finally, through the improvement of Nutch's sorting algorithm and experiments, it can be seen that Nutch is very suitable for site-specific ("home") search.

Keywords—search engine; Nutch; Lucene; Java open source

I. INTRODUCTION
Nutch is an open-source search engine based on Lucene Java, an open-source information retrieval library supported by the Apache Software Foundation for the search and index components; it provides a crawler program, an index engine, and a query engine [1]. Nutch consists of the following three parts:
(1) Page collection (fetching). The page-collection program, by scheduled or incremental collection, chooses the URLs whose pages are to be visited; the crawler then fetches those pages to the local disk.
(2) Index creation. The indexing program converts the pages or other files into text documents, divides them into segments, filters out useless information, and then creates and assembles indexes, which are composed of smaller indexes based on keywords or inverted files.
(3) Searching. The searcher accepts a user's query words, divides them through segmentation and filtering into groups of keywords, and matches the corresponding pages in the index repository. It then orders the matches by score and returns the results to the user.
The overall framework of Nutch is shown in Figure 1.

Figure 1
II. BACKGROUND
Because there are so many sites under Guizhou Normal University's website, not only the pages but also other resources such as DOC and PDF files need to be indexed. Accordingly, a text-analyzer module is added to the design based on Nutch's framework, so the whole design is composed of the crawler module, the text-analyzer module, the index module, and the search module, as shown in Figure 2.

Figure 2
III. THE PROCESS OF THE WORKFLOW
A. An Analysis of Nutch's Crawler
A Web crawler is a kind of robot or software agent. In general, it starts with a list of URLs to visit, called the seeds. When visiting these URLs, the crawler identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier [2]. URLs from the frontier are recursively visited according to a set of policies. See Figure 3, referenced from [2], and the sketch that follows it.
Figure 3
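To make the frontier idea concrete, the following is a minimal sketch in plain Java, not Nutch's actual code; the seed URL and the fetchAndExtractLinks helper are hypothetical placeholders:

import java.util.*;

public class FrontierSketch {
    public static void main(String[] args) {
        // The frontier starts as the seed list (hypothetical URL).
        Queue<String> frontier = new LinkedList<String>();
        frontier.add("http://www.example.edu/");
        Set<String> visited = new HashSet<String>();
        int maxDepth = 3; // corresponds to the Depth parameter discussed below
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Queue<String> next = new LinkedList<String>();
            while (!frontier.isEmpty()) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;        // skip already-seen URLs
                next.addAll(fetchAndExtractLinks(url)); // out-links grow the frontier
            }
            frontier = next;
        }
    }

    // Placeholder for the real fetch-and-parse step.
    static List<String> fetchAndExtractLinks(String url) {
        return Collections.emptyList();
    }
}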
There are four factors affecting the crawler's ability, referenced from [3]:

Depth: the depth of the download, i.e., how many link levels are followed from the seeds.
topN: the maximum number of page hyperlinks fetched at each level.
Threads: the number of threads the download program uses.
Delay: the delay between visits to the same host.

The work process of Nutch's crawler includes four steps, as follows (a sketch of invoking the crawler with these parameters follows the list):
1. Create the initial collection of URLs.
2. Begin fetching based on the pre-defined Depth, topN, Threads, and Delay.
3. Create the new URL waiting list and start the new round of fetching, as in Figure 4, referenced from [8].
4. Merge the resources downloaded to the local disk.
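As an illustration, the following sketch assumes the Nutch 0.9-era command-line Crawl tool (org.apache.nutch.crawl.Crawl) and shows how the four factors above map onto an invocation; the paths and values are illustrative:

import org.apache.nutch.crawl.Crawl;

public class CrawlDemo {
    public static void main(String[] args) throws Exception {
        // Equivalent to: bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -threads 4
        Crawl.main(new String[] {
            "urls",             // directory containing the seed URL files
            "-dir", "crawl",    // output directory for crawldb, linkdb, segments
            "-depth", "3",      // Depth: how many link levels to follow
            "-topN", "50",      // topN: at most this many pages per level
            "-threads", "4"     // Threads: concurrent fetcher threads
        });
        // The per-host Delay is configured separately, e.g. via the
        // fetcher.server.delay property in conf/nutch-site.xml.
    }
}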
B. Page Noise Elimination
After fetching, the pages contain many tags and other advertisement information. It is necessary to eliminate this noise and obtain the effective document. Here the program must complete two missions; see Figure 5, referenced from [9], and the sketch after it.
1. Analyze the inner HTML pages' basic information and distinguish the structure of the pages.
2. At the same time, eliminate the noise of the page and avoid duplicate results.
Figure 5
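As a rough illustration of the tag-stripping part of this step, here is a minimal regex-based sketch; a production system, including Nutch's own HTML parser, would use a proper DOM parse instead:

public class NoiseFilter {
    /** Strips scripts, styles and tags to recover the effective text. */
    static String extractText(String html) {
        return html
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // drop script/style blocks
            .replaceAll("(?s)<[^>]+>", " ")                         // drop remaining tags
            .replaceAll("\\s+", " ")                                // collapse whitespace
            .trim();
    }
}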
Under the Nutch workspace directory, there are several folders, listed as follows (a sketch of inspecting them follows the list):

Crawldb directory: this folder stores the URLs downloaded and the time when they were downloaded.
Linkdb directory: this folder stores the relationships between the URLs, which are formed from the parsed results after the download.
Segments: this folder stores the pages and resources that the crawler has fetched. The number of subdirectories is related to the depth of the crawler's fetch; for easier management, the folders are named by their creation time.
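For reference, these folders can be inspected with Nutch's reader tools; the sketch below assumes the CrawlDbReader class that backs the bin/nutch readdb command in Nutch 0.9-era releases:

import org.apache.nutch.crawl.CrawlDbReader;

public class InspectCrawlDb {
    public static void main(String[] args) throws Exception {
        // Equivalent to: bin/nutch readdb crawl/crawldb -stats
        CrawlDbReader.main(new String[] { "crawl/crawldb", "-stats" });
    }
}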
C. Creating the Index
At the heart of all search engines is the concept of indexing, which means processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching.
Nutch's documents are analyzed and processed by Lucene. Lucene is a high-performance, scalable Information Retrieval (IR) library [4]. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java. Figure 6, referenced from [6], displays the framework of Lucene, and there are three steps to complete the work, referenced from [5]-[6].
Figure 6
The first step: Document Converting

Lucene does not care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information. Figure 7, referenced from [6], tells more.

Figure 7
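For example, once the text has been extracted, it is wrapped in a Lucene Document with Fields; the following is a minimal sketch using the Lucene 2.x API of the period (the field names are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocumentDemo {
    static Document toDocument(String url, String title, String text) {
        Document doc = new Document();
        // Stored, untokenized field: retrieved verbatim with results.
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Stored and tokenized: both searchable and displayable.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        // Tokenized but not stored: the searchable full text.
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}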
The second step: Analysis

Once you have prepared the data for indexing and have created Lucene Documents populated with Fields, you can call IndexWriter's addDocument(Document) method and hand your data off to Lucene to index. When you do that, Lucene first analyzes the data to make it more suitable for indexing. To do so, it splits the textual data into chunks, or tokens, and performs a number of optional operations on them. For instance, the tokens could be lowercased before indexing to make searches case-insensitive. Typically it is also desirable to remove all frequent but meaningless tokens from the input, such as stop words (a, an, the, in, on, and so on) in English text.

An important point about analyzers is that they are used internally for fields flagged to be tokenized. Documents such as HTML, Microsoft Word, and XML files contain metadata such as the author, the title, the last-modified date, and potentially much more. When you are indexing rich documents, this metadata should be separated out and indexed as separate fields.
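The effect of analysis can be observed by running an analyzer directly; the following minimal sketch assumes the Lucene 2.x TokenStream API of the period:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        TokenStream stream = new StandardAnalyzer()
                .tokenStream("contents", new StringReader("The Quick Brown Fox"));
        Token token;
        while ((token = stream.next()) != null) {
            // "The" is removed as a stop word; the rest are lowercased:
            System.out.println(token.termText()); // quick, brown, fox
        }
    }
}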
The third step: Storing the Index

An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents, in this case allowing full-text search. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems.

With the inverted index created, a query can be resolved by jumping to the word id (via random access) in the inverted index. Random access is generally regarded as being faster than sequential access.
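As a toy illustration of this data structure (not Lucene's actual on-disk format), the following sketch maps each term to a postings list of document ids:

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] docs = { "nutch is a search engine", "lucene is a library" };
        Map<String, List<Integer>> index = new HashMap<String, List<Integer>>();
        for (int id = 0; id < docs.length; id++) {
            for (String term : docs[id].split("\\s+")) {
                List<Integer> postings = index.get(term);
                if (postings == null) {
                    postings = new ArrayList<Integer>();
                    index.put(term, postings);
                }
                postings.add(id); // posting: term -> document id
            }
        }
        // A query for "is" jumps straight to its postings list:
        System.out.println(index.get("is")); // prints [0, 1]
    }
}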
The main classes which achieve these three steps are IndexWriter, Directory, Analyzer, Document, and Field; a sketch combining them follows.
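The following minimal sketch, again against the Lucene 2.x API of the period, ties these classes together (the index path and field values are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexingDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/tmp/myindex");
        // true = create a new index; the analyzer is applied to tokenized fields.
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("title", "Home", Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("contents", "welcome to the site",
                Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.optimize(); // merge index segments for faster searching
        writer.close();
    }
}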
D. The Handling of Chinese Word Segmentation

A major hurdle (unrelated to Lucene) remains when we are dealing with various languages: handling text encoding. The StandardAnalyzer is still the best built-in general-purpose analyzer, even accounting for CJK characters. However, the sandbox CJKAnalyzer seems better suited for Chinese word analysis [6]. When we are indexing documents in multiple languages into a single index, using a per-document analyzer is more appropriate, as sketched below.
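For instance, assuming the contrib CJKAnalyzer is on the classpath, the analyzer can be overridden per document with IndexWriter's two-argument addDocument; a minimal sketch:

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class PerDocumentAnalyzerDemo {
    // The writer is assumed to have been opened with StandardAnalyzer
    // as its default analyzer for non-CJK documents.
    static void addChineseDocument(IndexWriter writer, Document doc)
            throws Exception {
        // Override the default analyzer for this document only.
        writer.addDocument(doc, new CJKAnalyzer());
    }
}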
Finally, under the Nutch workspace directory, there are folders which store the index, listed as follows:

Indexes: stores the individual index directories.
Index: stores the final directory, in Lucene's format, which is merged from the individual indexes.
E. The Design and Realization of the Searching Module
Searching is the process of looking up words in an index to find documents in which they appear. The quality of a search is typically described using precision and recall metrics [7]. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out irrelevant documents. However, we must consider a number of other factors when thinking about searching: support for single- and multi-term queries, phrase queries, wildcards, result ranking, and sorting is important, as is a friendly syntax for entering those queries.

Figure 7 shows the process of searching.

Pretreatment means carrying out text processing; segmentation through the QueryParser class and combining terms in accordance with the Lucene format are two examples.
The main classes which achieve these functions are IndexSearcher, Term, Query, TermQuery, and Hits; a sketch combining them follows.
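The following minimal sketch, assuming the Lucene 2.x API of the period and illustrative index path and field names, combines these classes:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/tmp/myindex");
        // Pretreatment: the QueryParser analyzes and segments the input.
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse("search engine");
        Hits hits = searcher.search(query); // results ranked by score
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(hits.score(i) + "\t" + doc.get("title"));
        }
        searcher.close();
    }
}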
F. Sorting Search Results

Common sorting models for search include the Boolean logic model, the fuzzy logic model, the vector space model, and the probabilistic retrieval model. In this application we mainly use the vector space model, which calculates the weighting parameters through the TF-IDF method.
In this process, by calculating the relevance between the key words and each document, we obtain a relevance value for each document. We then sort these values, putting the documents that better meet the need (those with higher values) before the user. But this model has some limits. First, the Web has massive data; pages include a lot of insignificant and repeated messages which obscure the information users really want, and the model cannot deal with these messages well. Second, the model does not take links into account. In fact, another goal of a search engine is to find the pages users often visit; through the links pointing at a page, the search engine can decide the importance of that page, as PageRank does.
Lucene's sorting model is an improvement upon the vector space model, as follows. The Lucene sorting algorithm [6]:
score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )

where:
score_d: the score of document d.
sum_t: summation over all query terms t.
tf_q: the square root of t's frequency in the query.
tf_d: the square root of t's frequency in d.
idf_t: log(numDocs / (docFreq_t + 1)) + 1.0.
numDocs: the number of documents in the index.
docFreq_t: the number of documents in the index that contain t.
norm_q: sqrt(sum_t((tf_q * idf_t)^2)).
norm_d_t: the square root of the number of tokens in d in the same field as t.
But it also has some shortcomings. For instance, the precision of the query is not very good, and it does not reflect the importance of the web page.
In this situation, we have made some improvements to Lucene's sorting algorithm, shown as follows. The improved algorithm:
Score_d = k1 * OldScore + k2 * PrScore + k3 * ReScore + k4 * HomePageScore

where:
Score_d: document d's final score.
OldScore: d's score as calculated by Lucene's sorting algorithm.
PrScore: d's PageRank score, with PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where d here is the damping factor.
ReScore: d's score when the document has been hit by a repeated query, with ReScore = rescore + (hitNum - 1) * increment.
HomePageScore: the homepage's bonus score.
k1, k2, k3, k4: weight coefficients.
The PageRank score, the repeated-query score, and the HomePageScore together improve the precision of the searching process, as sketched below.
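The combination can be sketched as follows; the helper and the coefficient values are illustrative only, since the paper does not fix k1-k4:

public class ImprovedScorer {
    // Weight coefficients k1..k4; the values here are illustrative only.
    static final double K1 = 0.5, K2 = 0.2, K3 = 0.2, K4 = 0.1;

    static double score(double oldScore,       // Lucene's original score
                        double prScore,        // the document's PageRank score
                        int hitNum,            // times the document was hit before
                        double rescore,        // base repeated-query score
                        double increment,      // per-hit increment
                        double homePageScore) { // homepage bonus, 0 if not a homepage
        double reScore = rescore + (hitNum - 1) * increment;
        return K1 * oldScore + K2 * prScore + K3 * reScore + K4 * homePageScore;
    }
}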
IV. EXPERIMENTS
A. Crawl Testing

First, the test starts with crawling Guizhou Normal University's site (u.edu). Figure 8 shows the process.
Figure 8
B. Search Testing

We deploy the project into Tomcat, start Tomcat, and visit localhost:8080/Mynutch/search.jsp; Figure 9 then appears.
Figure 9
Then we can see the results in Figure 10.
Figure 10
V. CONCLUSION
Nutch is an open-source search engine framework based on Lucene Java. On the basis of the web crawler and NDFS file system that Nutch provides, a proficient search engine system can be developed to search, find, filter, and segment information, and then provide users with a searching service. One of its biggest advantages is that it is open source, so the search engine's algorithms and data framework can be further developed. Furthermore, by using Nutch, enterprises can build a web search engine appropriate to their own characteristics and needs.
ACKNOWLEDGMENT

The authors thank Professor Xiaoyao Xie for his excellent suggestion to pursue this topic, and also thank Professor Zhijie Liu of Guizhou Normal University's Network Center for granting the authors permission to use WebCrawler to fetch and monitor Guizhou Normal University's website.
REFERENCES

[1] /wiki/Nutch, 2008.
[2] /wiki/Webcrawler, 2008.
[3] V. Shkapenyuk and T. Suel, "Design and Implementation of a High-Performance Distributed Web Crawler," Proc. of the 18th ICDE Conf., San Jose, California, USA, 2002, pp. 357-368.
[4] /wiki/Lucene, 2007.
[5] /, 2008.
[6] Erik Hatcher and Otis Gospodnetic, Lucene in Action, Manning Press, June 15, 2007.
Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, "Efficient Crawling Through URL Ordering."
[7] K. N. Risvik and R. Michelsen, "Search Engines and Web Dynamics," Computer Networks, 39(3):289-302, 2002.
[8] Lei Kai and Wang Dong-hai, "Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine," Computer Engineering, Vol. 34, No. 13, July 2008.
[9] R. Song, G. Xin, S. Shi, et al., "Exploring URL Hit Priors for Web Search," Proc. of ECIR 2006, Berlin: Springer, 2006, pp. 277-288.
