The Design and Realization of Open-Source Search
Engine Based on Nutch
Guojun Yu 1, Xiaoyao Xie *,2, Zhijie Liu 3
Key Laboratory of Information and Computing Science of Guizhou Province
Guizhou Normal University Network Center
Guiyang, China
xyx@gznu.edu (corresponding author: Xiaoyao Xie)
Abstract: Search engines are becoming more and more necessary and popular for surfing the Internet. However, how search engines like Google or Baidu work is unknown to many people. This paper, through a study of the open-source search engine Nutch, introduces how a common search engine works. Using Nutch, a search engine for Guizhou Normal University’s website is designed. Finally, through an improvement of Nutch’s sorting algorithm and through experiments, it is found that Nutch is very suitable for home-search.
Keywords: Search Engine; Nutch; Lucene; Java Open Source
I. INTRODUCTION
Nutch is an open-source search engine based on Lucene Java, an open-source information retrieval library supported by the Apache Software Foundation, which serves as its search and index component. Nutch provides a crawler program, an index engine and a query engine [1]. It consists of the following three parts:
(1) Page collection (fetch). The page collection program, using either scheduled or incremental collection, chooses the URLs whose pages the crawler will visit and then fetch to the local disk.
(2) Creating the index. The indexing program converts the pages or other files into text documents, divides them into segments, filters out useless information, and then builds the index and its auxiliary indexes, which are composed of smaller indexes based on keywords or inverted documents.
(3) Searcher. The search program accepts the user’s query, segments and filters it, and divides it into groups of keywords, which are then matched against the corresponding pages in the index repository. It then sorts the matches and returns the results to the user.
The overall framework of Nutch is shown in Figure 1.
Figure 1
II. BACKGROUND
Because there are so many sites under Guizhou Normal University’s website, not only the pages but also other resources such as doc and pdf files need to be indexed. For this reason, a text analyzer module is added to the design based on Nutch’s framework. The whole design is composed of the crawler module, the text analyzer module, the index module and the search module, as shown in Figure 2.
Figure 2
III. THE PROCESS OF THE WORKFLOW
A. An Analysis of the Nutch Crawler
A Web crawler is a kind of robot or software agent. In general, it starts with a list of URLs to visit, called the seeds. When visiting these URLs, the crawler identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier [2]. URLs from the frontier are recursively visited according to a set of policies. See Figure 3, taken from [2].
Figure 3
There are four factors affecting the crawler’s ability, according to [3]:
Depth: the depth of the download, that is, the number of fetch rounds.
topN: the maximum number of page links selected for download in each round.
Threads: the number of threads the download program uses.
Delay: the delay between successive visits to the same host.
The work process of Nutch’s crawler includes four steps:
1. Create the initial collection of URLs.
2. Begin fetching based on the predefined Depth, topN, Threads and Delay.
3. Create the new URL waiting list and start a new round of fetching, as in Figure 4, taken from [8].
4. Merge the downloaded resources on the local disk.
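The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not Nutch’s own code: the in-memory TOY_WEB graph, the example.edu URLs and the crawl function are hypothetical, and Threads and Delay are left out so that the effect of Depth and topN stays visible.

```python
# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.edu/": ["http://example.edu/a", "http://example.edu/b"],
    "http://example.edu/a": ["http://example.edu/c", "http://example.edu/b"],
    "http://example.edu/b": ["http://example.edu/d"],
    "http://example.edu/c": [],
    "http://example.edu/d": ["http://example.edu/"],
}


def crawl(seeds, depth, top_n, fetch_links):
    """Run at most `depth` fetch rounds, taking at most `top_n` URLs per round."""
    fetched = []                 # pages downloaded so far, in fetch order
    seen = set(seeds)            # every URL ever placed on the waiting list
    frontier = list(seeds)       # step 1: the initial URL collection
    for _ in range(depth):       # step 2: one fetch round per depth level
        batch, frontier = frontier[:top_n], frontier[top_n:]
        if not batch:
            break
        next_round = []
        for url in batch:
            fetched.append(url)
            for link in fetch_links(url):
                if link not in seen:      # step 3: queue only unseen URLs
                    seen.add(link)
                    next_round.append(link)
        frontier = frontier + next_round
    return fetched               # step 4: the merged result of all rounds


pages = crawl(["http://example.edu/"], depth=3, top_n=2,
              fetch_links=lambda u: TOY_WEB.get(u, []))
```

Lowering depth or top_n shrinks the set of fetched pages, which is the trade-off the two parameters control.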
B. Page Noise Elimination
After fetching, the pages still include many tags and advertising content. It is necessary to eliminate this noise and extract the effective document. Here the program must complete two tasks. See Figure 6, taken from [9].
1. Analyze the basic information of the HTML pages and identify the structure of the pages.
2. At the same time, eliminate the noise of the page and avoid duplicate results.
Figure 5
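The two tasks can be illustrated with Python’s standard html.parser module. This is a simplified sketch under stated assumptions: the class and function names are hypothetical, only script and style content is treated as noise, and duplicates are detected by exact match of the cleaned text, which is much cruder than what a production system would use.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the title and visible text, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._stack = []   # open tags that change how text is handled

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED or tag == "title":
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())          # normalize whitespace
        if not text:
            return
        if self._stack and self._stack[-1] in self.SKIPPED:
            return                             # noise: drop it
        if self._stack and self._stack[-1] == "title":
            self.title = text                  # structure: keep the title apart
        else:
            self.chunks.append(text)


def clean_page(html):
    """Return (title, body_text) for one HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)


def drop_duplicates(pages):
    """Keep the first page for each distinct cleaned text."""
    seen, unique = set(), []
    for url, html in pages:
        title, text = clean_page(html)
        if text not in seen:
            seen.add(text)
            unique.append((url, title, text))
    return unique
```

Keeping the title separate from the body text anticipates the indexing stage, where such metadata is stored in its own field.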
Under the Nutch workspace directory, there are folders as follows:
Crawldb Directory: stores the downloaded URLs and the times when they were downloaded.
Linkdb Directory: stores the link relationships between URLs, obtained from the parsed results after the download.
Segments: stores the pages and resources that the crawler has fetched. The number of subdirectories is related to the depth of the crawler’s fetch. For easier management, the folders are named by their creation time.
C. Creating the Index
At the heart of all search engines is the concept of indexing,which means processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching.
Nutch’s documents are analyzed and processed by Lucene. Lucene is a high-performance, scalable Information Retrieval (IR) library [4]. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java. Figure 6, taken from [6], displays the framework of Lucene. Three steps complete the work, following [5]-[6].
Figure 6
The first step: Document Conversion
Lucene does not care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files, web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information. Figure 7, taken from [6], illustrates this.
Figure 7
The second step: Analysis
Once you have prepared the data for indexing and have created Lucene Documents populated with Fields, you can call IndexWriter’s addDocument(Document) method and hand your data off to Lucene to index. When you do that, Lucene first analyzes the data to make it more suitable for indexing. To do so, it splits the textual data into chunks, or tokens, and performs a number of optional operations on them. For instance, the tokens could be lowercased before indexing to make searches case-insensitive. Typically it is also desirable to remove all frequent but meaningless tokens from the input, such as stop words (a, an, the, in, on, and so on) in English text.
An important point about analyzers is that they are used internally for fields flagged to be tokenized. Documents such as HTML, Microsoft Word and XML files contain metadata such as the author, the title, the last modified date, and potentially much more. When you are indexing rich documents, this metadata should be separated and indexed as separate fields.
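The analysis step can be approximated outside Lucene with a short Python sketch. It is only an illustration of tokenizing, lowercasing and stop-word removal; the stop-word list, the field names and both functions are assumptions made for this example and do not reproduce any particular Lucene analyzer.

```python
import re

# A small illustrative stop-word list; real analyzers ship longer ones.
STOP_WORDS = {"a", "an", "the", "in", "on", "and", "of", "to", "is"}


def analyze(text):
    """Split text into tokens, lowercase them, and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


def analyze_document(fields, tokenized):
    """Analyze only the fields flagged as tokenized; keep the rest verbatim."""
    return {name: analyze(value) if name in tokenized else value
            for name, value in fields.items()}


doc = analyze_document(
    {"title": "The Nutch Crawler", "modified": "2008-07-01"},
    tokenized={"title"},
)
```

Here the title is tokenized for full-text search, while the modification date is kept as a single untouched value, mirroring the separation of metadata into its own fields.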
The third step: Storing the Index
An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents, in this case allowing full-text search. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems.
With the inverted index created, the query can now be resolved by jumping to the word id (via random access) in the inverted index. Random access is generally regarded as being faster than sequential access.
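A toy inverted index makes the mapping concrete. The sketch below assumes whitespace tokenization and a plain in-memory dictionary, which is far simpler than Lucene’s on-disk format; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def lookup(index, term):
    """Return the sorted ids of documents containing the term."""
    return sorted(index.get(term.lower(), {}))


docs = {
    1: "nutch is a search engine",
    2: "lucene is a search library",
    3: "nutch uses lucene",
}
index = build_inverted_index(docs)
```

A query for a term is answered with a single dictionary access, lookup(index, "nutch"), instead of a scan over every document.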
The main classes that carry out these three steps are IndexWriter, Directory, Analyzer, Document, and Field.
D. The Handling of Chinese Word Segmentation
A major hurdle (unrelated to Lucene) remains when dealing with various languages: handling text encoding. The StandardAnalyzer is still the best built-in general-purpose analyzer, even accounting for CJK characters. However, the Sandbox CJKAnalyzer seems better suited for Chinese word analysis [6]. When documents in multiple languages are indexed into a single index, using a per-Document analyzer is more appropriate.
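One common dictionary-free way to segment Chinese text, and the approach the CJKAnalyzer is generally described as taking, is to index overlapping two-character tokens (bigrams). The following Python sketch illustrates that idea only; the function names are hypothetical, non-CJK characters are simply treated as separators, and only the basic CJK Unified Ideographs block is recognized.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text):
    """Emit overlapping two-character tokens for each run of CJK characters."""
    tokens, run = [], ""
    for ch in text + " ":           # the trailing space flushes the last run
        if is_cjk(ch):
            run += ch
            continue
        if len(run) == 1:
            tokens.append(run)      # a lone character is kept as-is
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        run = ""
    return tokens


tokens = cjk_bigrams("贵州师范大学")
```

Because every adjacent pair is indexed, a query segmented the same way can match a word without a dictionary, at the cost of a larger index and some false matches.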
Finally, under the Nutch workspace directory, the following folders store the index:
Indexes: stores the individual index directories.
Index: stores the final index in Lucene’s format, merged from the individual indexes.
E. The Design and Realization of the Searching Module
Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics [7]. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents. However, we must consider a number of other factors when thinking about searching. Support for single and multi-term queries, phrase queries, wildcards, result ranking, and sorting is also important, as is a friendly syntax for entering those queries.
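Precision and recall can be computed directly from a result list and a set of documents judged relevant. The short Python function below is a sketch using the standard definitions; the function name and the sample document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved ids and relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)       # relevant documents actually found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three documents relevant, two in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
```

In this example half of the returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).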
Figure 7 shows the search process.
Preprocessing means processing the query text. Two examples are segmenting the query through the QueryParser class and assembling terms in accordance with the Lucene format.
The main classes that implement these functions are IndexSearcher, Term, Query, TermQuery, and Hits.
F. Sorting Search Results
Some common ranking models for search are the Boolean model, the fuzzy logic model, the vector model and the probabilistic model. In some applications we mainly use the vector model, which calculates the weights through the TF-IDF method.
In this process, by calculating the relevance between the keywords and each document, we obtain a relevance value for every document. We then sort these values and present the documents that best meet the need (those with higher values) to the user first. But this model has some limits. First, the Web holds massive amounts of data, and pages include a lot of insignificant and repetitive information that obscures what users really want. The model cannot deal with this information well. Second, the model does not take links into account. In fact, another goal of a search engine is to find the pages that users visit often. Through a page’s links, the search engine can judge the importance of another page, as PageRank does.
Lucene’s sorting model is an improvement on the vector model, as follows:
Lucene sorting algorithm [6]:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
score_d: the score of document d.
sum_t: summation over all terms t.
tf_q: the square root of the frequency of t in the query.
tf_d: the square root of the frequency of t in d.
idf_t: log(numDocs / (docFreq_t + 1)) + 1.0.
numDocs: the number of documents in the index.
docFreq_t: the number of documents in the index that contain t.
norm_q: sqrt(sum_t((tf_q * idf_t)^2)).
norm_d_t: the square root of the number of tokens in d in the same field as t.
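The formula above can be evaluated by hand for a single-field document. The Python sketch below follows the definitions term by term; it is an illustration of the quoted formula, not Lucene’s implementation, and the function name, the doc_freq dictionary and the sample values are assumptions.

```python
import math
from collections import Counter


def lucene_like_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Score one single-field document with the formula quoted above."""
    q_counts, d_counts = Counter(query_terms), Counter(doc_tokens)

    def idf(term):
        # idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
        return math.log(num_docs / (doc_freq.get(term, 0) + 1)) + 1.0

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum((math.sqrt(n) * idf(t)) ** 2
                           for t, n in q_counts.items()))
    # norm_d_t = sqrt(number of tokens in the document field)
    norm_d = math.sqrt(len(doc_tokens))
    if norm_q == 0 or norm_d == 0:
        return 0.0

    score = 0.0
    for term, n in q_counts.items():
        tf_q = math.sqrt(n)
        tf_d = math.sqrt(d_counts[term])   # 0 when the term is absent from d
        score += (tf_q * idf(term) / norm_q) * (tf_d * idf(term) / norm_d)
    return score


score = lucene_like_score(
    ["nutch"], ["nutch", "search", "engine", "nutch"],
    doc_freq={"nutch": 2}, num_docs=3,
)
```

With these sample values idf_t is 1.0, tf_d is sqrt(2) and norm_d_t is 2, so the score is sqrt(2)/2; a document that does not contain the query term scores 0.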
But it also has some shortcomings. For instance, the precision of the query is not very good, and it does not reflect the importance of the web page.
For this reason, we have made some improvements to Lucene’s sorting algorithm, shown as follows.
The improved algorithm:
Score_d = k1 * OldScore + k2 * PrScore + k3 * ReScore + k4 * homePageScore
Score_d: the score of record d.
OldScore: the score of d calculated by Lucene’s sorting algorithm.
PrScore: the PageRank score of d.
ReScore: the score of d when the document has been queried again, with ReScore = rescore + (hitNum - 1) * increment.
homePageScore: the homepage score of d.
k1, k2, k3, k4: weight coefficients.
The PageRank of a page A is PR(A) = (1 - d) + d * (PR(1)/C(1) + ... + PR(n)/C(n)), where d here is the damping factor, PR(i) is the PageRank of the i-th page linking to A, and C(i) is the number of outbound links on that page.
Together, the PageRank, repeat-query and homepage scores improve the precision of the searching process.
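The improved score can be sketched in Python as a weighted sum, with PageRank computed by simple iteration. The paper does not give the values of the weight coefficients, the damping factor or the increment, so the numbers used here (d = 0.85, 50 iterations, the default weights) are placeholders chosen for illustration, and all function names are hypothetical.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(i)/C(i)) over pages i linking to A.

    `links` maps every page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = sum(pr[src] / len(out)
                           for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


def re_score(base, hit_num, increment):
    """ReScore = rescore + (hitNum - 1) * increment."""
    return base + (hit_num - 1) * increment


def improved_score(old, pr, rescore, home, k=(0.6, 0.2, 0.1, 0.1)):
    """Score_d = k1*OldScore + k2*PrScore + k3*ReScore + k4*homePageScore."""
    k1, k2, k3, k4 = k
    return k1 * old + k2 * pr + k3 * rescore + k4 * home


# Hypothetical three-page link graph: A links to B and C, B to C, C to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this graph page C receives links from both A and B, so it ends up with a higher PageRank than B; feeding such values into improved_score lets link structure and query history adjust the text-based OldScore.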
IV. EXPERIMENTS
A. Crawl Testing
First, the test starts with crawling Guizhou Normal University’s u.edu. Figure 8 shows the process.
Figure 8
B. Search Testing
We deploy the project to Tomcat, start Tomcat, and visit localhost:8080/Mynutch/search.jsp; the page in Figure 9 then appears.
Figure 9
The results are then shown in Figure 10.
Figure 10
V. CONCLUSION
Nutch is an open-source search engine framework based on Lucene Java. On the basis of the web crawler and the NDFS file system that Nutch provides, a capable search engine system can be developed to search, discover, filter and segment information and then provide users with a search service. One of its biggest advantages is that it is open source, so search engine algorithms and data frameworks can be developed on top of it. Furthermore, by using Nutch, enterprises can build a web search engine suited to their own characteristics and needs.
ACKNOWLEDGMENT
The authors thank Professor Xiaoyao Xie for his excellent suggestion to pursue this topic, and also thank Professor Zhijie Liu from Guizhou Normal University’s Network Center for granting the authors permission to use a web crawler to fetch and monitor the Guizhou Normal University website.
REFERENCES
[1] /wiki/Nutch. 2008.
[2] /wiki/Webcrawler. 2008.
[3] Shkapenyuk V, Suel T. Design and Implementation of a High Performance Distributed Web Crawler // Proc. of the 18th ICDE Conf. San Jose, California, USA: [s.n.], 2002: 357-368.
[4] /wiki/Lucene. 2007.
[5] /. 2008.
[6] Erik Hatcher, Otis Gospodnetic. Lucene in Action. Manning Press, June 15, 2007.
Junghoo Cho, Hector Garcia-Molina and Lawrence Page. Efficient crawling through URL ordering.
[7] Risvik K N, Michelsen R. Search Engines and Web Dynamics. Computer Networks, 39(3): 289-302, 2002.
[8] Lei Kai, Wang Dong-hai. Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine. Computer Engineering, Vol. 34, No. 13, July 2008.
[9] Song R, Xin G, Shi S, et al. Exploring URL hit priors for web search // Proc. of ECIR’06. Berlin: Springer, 277-288, 2006.