外文译文正文: 探索搜索引擎爬虫 随着网络难以想象的急剧扩张,从Web中提取知识逐渐成为一种受欢迎的途径。这是由于网络的便利和丰富的信息。通常需要使用基于网络爬行的搜索引擎来到我们需要的网页。本文描述了搜索引擎的基本工作任务。概述了搜索引擎与网络爬虫之间的联系。 关键词:爬行,集中爬行,网络爬虫 1.导言 在网络上WWW是一种服务,驻留在链接到互联网的电脑上,并允许最终用户访问是用标准的接口软件的计算机中的存储数据。万维网是获取访问网络信息的宇宙,是人类知识的体现。 搜索引擎是一个计算机程序,它能够从网上搜索并扫描特定的关键字,尤其是商业服务,返回的它们发现的资料清单,抓取搜索引擎数据库的信息主要通过接收想要发表自己作品的作家的清单或者通过“网络爬虫”、“蜘蛛”或“机器人”漫游互联网捕捉他们访问过的页面的相关链接和信息。 网络爬虫是一个能够自动获取万维网的信息程序。网页检索是一个重要的研究课题。爬虫是软件组件,它访问网络中的树结构,按照一定的策略,搜索并收集当地库中检索对象。 本文的其余部分组织如下:第二节中,我们解释了Web爬虫背景细节。在第3节中,我们讨论爬虫的类型,在第4节中我们将介绍网络爬虫的工作原理。在第5节,我们搭建两个网络爬虫的先进技术。在第6节我们讨论如何挑选更有趣的问题。 2.调查网络爬虫 网络爬虫几乎同网络本身一样古老。第一个网络爬虫,马修格雷浏览者,写于1993年春天,大约正好与首次发布的OCSA Mosaic网络同时发布。在最初的两次万维网会议上发表了许多关于网络爬虫的文章。然而,在当时,网络i现在要小到三到四个数量级,所以这些系统没有处理好当今网络中一次爬网固有的缩放问题。 显然,所有常用的搜索引擎使用的爬网程序必须扩展到网络的实质性部分。但是,由于搜索引擎是一项竞争性质的业务,这些抓取的设计并没有公开描述。 有两个明显的例外:股沟履带式和网络档案履带式。不幸的是,说明这些文献中的爬虫程序是太简洁以至于能够进行重复。 原谷歌爬虫(在斯坦福大学开发的)组件包括五个功能不同的运行流程。服务器进程读取一个URL出来然后通过履带式转发到多个进程。每个履带进程运行在不同的机器,是单线程的,使用异步I/O采用并行的模式从最多300个网站来抓取数据。爬虫传输下载的页面到一个能进行网页压缩和存储的存储服务器进程。然后这些页面由一个索引进程进行解读,从HTML页面中提取链接并将他们保存到不同的磁盘文件中。一个URL解析器进程读取链接文件,并将相对的网址进行存储,并保存了完整的URL到磁盘文件然后就可以进行读取了。通常情况下,因为三到四个爬虫程序被使用,所有整个系统需要四到八个完整的系统。 在谷歌将网络爬虫转变为一个商业成果之后,在斯坦福大学仍然在进行这方面的研究。斯坦福Web Base项目组已实施一个高性能的分布式爬虫,具有每秒可以下载50到100个文件的能力。Cho等人又发展了文件更新频率的模型以报告爬行下载集合的增量。 互联网档案馆还利用多台计算机来检索网页。每个爬虫程序被分配到64个站点进行检索,并没有网站被分配到一个以上的爬虫。每个单线程爬虫程序读取到其指定网站网址列表的种子从磁盘到每个站点的队列,然后用异步I/O来从这些队列同时抓取网页。一旦一个页面下载完毕,爬虫提取包含在其中的链接。如果一个链接提到它被包含在页面中的网站,它被添加到适当的站点排队;否则被记录在磁盘。每隔一段时间,合并成一个批处理程序的具体地点的种子设置这些记录“跨网站”的网址,过滤掉进程中的重复项。Web Fountian爬虫程序分享了魔卡托结构的几个特点:它是分布式的,连续,有礼貌,可配置的。不幸的是,写这篇文章,WebFountain是在其发展的早期阶段,并尚未公布其性能数据。 3.搜索引擎基本类型 A.富马酸丙酚替诺福韦片不良反应是什么基于爬虫的搜索引擎 基于爬虫的搜索引擎自动创建自己的清单。计算机程序“蜘蛛”建立他们没有通过人的选择。他们不是通过学术分类进行组织,而是通过计算机算法把所有的网页排列出来。这种类型的搜索引擎往往是巨大的,常常能取得了大龄的信息,它允许复杂的搜索范围内搜索以前的搜索的结果,使你能够改进搜索结果。这种类型的搜素引擎包含了网页中所有的链接。所以人们可以通过匹配的单词到他们想要的网页。 B.人力页面目录 这是通过人类选择建造的,即他们依赖人类创建列表。他们以主题类别和科目类别做网页的分类。人力驱动的目录,永远不会包含他们网页所有链接的。他们是小于大多数搜索引擎的。 C.混合搜索引擎 一种混合搜索引擎以传统的文字为导向,如谷歌搜索引擎,如雅虎目录搜索为基础的搜索引擎,其中每个方案比较操作的元数据集不同,当其元数据的主要资料来自一个网络爬虫或分类分析所有互联网文字和用户的搜索查询。与此相反,混合搜索引擎可能有一个或多个元数据集,例如,包括来自客户端的网络元数据,将所得的情境模型中的客户端上下文元数据俩认识这两个机构。 4.爬虫的工作原理 网络爬虫是搜索引擎必不可少的组成部分:运行一个网络爬虫是一个极具挑战的任务。有技术和可靠性问题,更重要的是有社会问题。爬虫是最脆弱的应用程序,因为它涉及到交互的几百几千个Web服务器和各种域名服务器,这些都超出了系统的控制。网页检索速度不仅由一个人的自己互联网连接速度有关,同时也受到了要抓取的网站的速度。特别是如果一个是从多个服务器抓取的网站,总爬行时间可以大大减少,如果许多下载是并行完成。虽然有众多的网络爬虫应用程序,他们在核心内容上基本上是相同的。以下是应用程序网络爬虫的工作过程: 1)下载网页 2)通过下载的页面解析和检索所有的联系 3)对于每一个环节检索,重复这个过程。 网络爬虫可用于通过对完整的网站的局域网进行抓取。 可以指定一个启动程序爬虫跟随在HTML页中到所有链接。这通常导致更多的链接,这之后将再次跟随,等等。一个网站可以被视为一个树状结构看,根本是启动程序,在这根的HTML页的所有链接是根子链接。随后循环获得更多的链接。 一个网页服务器提供若干网址清单给爬虫。网络爬虫开始通过解析一个指定的网页,标注该网页指向其他网站页面的超文本链接。然后他们分析这些网页之间新的联系,等等循环。网络爬虫软件不实际移动到各地不同的互联网上的电脑,而是像电脑病毒一样通过智能代理进行。每个爬虫每次大概打开大约300个链接。这是索引网页必须的足够快的速度。一个爬虫互留在一个机器。爬虫只是简单的将HTTP请求的文件发送到互联网的其他机器,就像一个网上浏览器的链接,当用户点击。所有的爬虫事实上是自动化追寻链接的过程。网页检索可视为一个队列处理的项目。当检索器访问一个网页,它提取到其他网页的链接。因此,爬虫置身于这些网址的一个队列的末尾,并继续爬行到下一个页面,然后它从队列前面删除。 A.资源约束 爬行消耗资源:下载页面的带宽,支持私人数据结构存储的内存,来评价和选择网址的CPU,以及存储文本和链接以及其他持久性数据的磁盘存储。 B.机器人协议 机器人文件给出排除一部分的网站被抓取的指令。类似地,一个简单的文本文件可以提供有关的新鲜和出版对象的流行信息。对信息允许抓取工具优化其收集的数据刷新策略以及更换对象的政策。 C.元搜索引擎 一个元搜索引擎是一种没有它自己的网页数据库的搜索引擎。它发出的搜索支持其他搜索引擎所有的数据库,从所有的搜索引擎查询并为用户提供的结果。较少的元搜索可以让您深入到最大,最有用的搜索引擎数据库。他们往往返回最小或免费的搜索引擎和其他免费目录并且通常是小和高度商业化的结果。 5.爬行技术 A:主题爬行 一个通用的网络爬虫根据一个URL的特点设置来收集网页。凡为主题爬虫的设计有一个特定的主题的文件,从而减少了网络流量和下载量。主题爬虫的目标是有选择地寻相关的网页的主题进行预先定义的设置。指定的主题不使用关键字,但使用示范文件。 不是所有的收集和索引访问的Web文件能够回答所有可能的特殊查询,有一个主题爬虫爬行分析其抓起边界,到链接,很可能是最适合抓取相关,并避免不相关的区域的Web。 这导致在硬件和网络资源极大地节省,并有助于于保持在最新状态的数据。主题爬虫有三个主要组成部分一个分类器,这能够判断相关网页,决定抓取链接的拓展,过滤器决定过滤器抓取的网页,以确定优先访问中心次序的措施,以及均受量词和过滤器动态重新配置的优先的控制的爬虫。 最关键的评价是衡量主题爬行收获的比例,这是在抓取过程中有多少比例相关网页被采用和不相干的网页是有效地过滤掉,这收获率最高,否则主题爬虫会花很多时间在消除不相关的网页,而且使用一个普通的爬虫可能会更好。 B:分布式检索 检索网络是一个挑战,因为它的成长性和动态性。随着网络规模越来越大,已经称为必须并行处理检索程序,以完成在合理的时间内下载网页。一个单一的检索程序,即使在是用多线程在大型引擎需要获取大量数据的快速上也存在不足。当一个爬虫通过一个单一的物理链接被所有被提取的数据所使用,通过分配多种抓取活动的进程可以帮助建立一个可扩展的易于配置的系统,它具有容错性的系统。拆分负载降低硬件要求,并在同一时间增加整体下载速度和可靠性。每个任务都是在一个完全分布式的方式,也就是说,没有中央协调器的存在。 6、挑战更多“有趣”对象的问题 搜索引擎被认为是一个热门话题,因为它收集用户查询记录。检索程序优先抓取网站根据一些重要的度量,例如相似性(对有引导的查询),返回链接数网页排名或者其他组合/变化最精Najork等。表明,首先考虑广泛优先搜索收集高品质页面,并提出一种网页排名。然而,目前,搜索策略是无法准确选择“最佳”路径,因为他们的认识仅仅是局部的。由于在互联网上可得到的信息数量非常庞大目前不可能实现全面的索引。因此,必须采用剪裁策略。主题爬行和智能检索,是发现相关的特定主题或主题集网页技术。 结论 在本文中,我们得出这样的结论实现完整的网络爬行覆盖是不可能实现,因为受限于整个万维网的巨大规模和资源的可用性。通常是通过一种阈值的设置(网站访问人数,网站上树的水平,与主题等规定),以限制对选定的网站上进行抓取的过程。此信息是在搜索引擎可用于存储/刷新最相关和最新更新的网页,从而提高检索的内容质量,同时减少陈旧的内容和缺页。 |
外文译文原文: Discussion on Web Crawlers of Search Engine Abstract-With the precipitous expansion of the Web,extracting knowledge from the Web is becoming gradually important and popular.This is due to the Web’s convenience and richness of information.To find Web pages, one typically uses search engines that are based on the Web crawling framework.This paper describes the basic task performed search engine.Overview of how the Web crawlers are related with search engine. Keywords Distributed Crawling, Focused Crawling,Web Crawlers Ⅰ.INTRODUCTION WWW on the Web is a service that resides on computers that are connected to the Internet and allows end users to access data that is stored on the computers using standard interface software. The World Wide Web is the universe of network-accessible information,an embodiment of human knowledge. Search engine is a computer program that searches for particular keywords and returns a list of documents in which they were found,especially a commercial service that scans documents on the Internet. A search engine finds information for its database by accepting listings sent it by authors who want exposure,or by getting the information from their “Web crawlers,””spiders,” or “robots,”programs that roam the Internet storing links to and information about each page they visit. Web Crawler is a program, which fetches information from the World Wide Web in an automated manner.Web crawling is an important research issue. Crawlers are software components, which visit portions of Web trees, according to certain strategies,and collect retrieved objects in local repositories. x86安装linux The rest of the paper is organized as: in Section 2 we explain the background details of Web crawlers.In Section 3 we discuss on types of crawler, in Section 4 we will explain the working of Web crawler. In Section 5 we cover the two advanced techniques of Web crawlers. In the Section 6 we discuss the problem of selecting more interesting pages. Ⅱ.SURVEY OF WEB CRAWLERS Web crawlers are almost as old as the Web itself.The first crawler,Matthew Gray’s Wanderer, was written in the spring of 1993,roughly coinciding with the first release Mosaic.Several papers about Web crawling were presented at the first two World Wide Web conference.However,at the time, the Web was three to four orders of magnitude smaller than it is today,so those systems did not address the scaling problems inherent in a crawl of today’s Web. Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described.sql数据库功能选择 There are two notable exceptions:the Goole crawler and the Internet Archive crawler.Unfortunately,the descriptions of these crawlers in the literature are too terse to enable reproducibility. The original Google crawler (developed at Stanford) consisted of five functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes.Each crawler process ran on a different machine,was single-threaded,and used asynchronous I/O to fetch data from up to 300 Web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk.The page were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file. A URLs resolver process read the link file, relative the URLs contained there in, and saved the absolute URLs to the disk file that was read by the URL server. Typically,three to four crawler machines were used, so the entire system required between four and eight machines. Research on Web crawling continues at Stanford even after Google has been transformed into a commercial effort.The Stanford Web Base project has implemented a high performance distributed crawler,capable of downloading 50 to 100 documents per second.Cho and others have also developed models of documents update frequencies to inform the download schedule of incremental crawlers. The Internet Archive also used multiple machines to crawl the Web.Each crawler process was assigned up to 64 sites to crawl, and no site was assigned to more than one crawler.Each single-threaded crawler process read a list of seed URLs for its assigned sited from disk int per-site queues,and then used asynchronous I/O to fetch pages from these queues in parallel. Once a page was downloaded, the crawler extracted the links contained in it.If a link referred to the site of the page it was contained in, it was added to the appropriate site queue;otherwise it was logged to disk .Periodically, a batch process merged these logged “cross-sit” URLs into the site--specific seed sets, filtering out duplicates in the process. 爬虫软件 appThe Web Fountain crawler shares several of Mercator’s characteristics:it is distributed,continuous(the authors use the term”incremental”),polite, and configurable.Unfortunately,as of this writing,Web Fountain is in the early stages of its development, and data about its performance is not yet available. Ⅲ.BASIC TYPESS OF SEARCH ENGINE A.Crawler Based Search Engines Crawler based search engines create their listings automatically.Computer programs ‘spider’ build them not by human selection. They are not organized by subject categories; a computer algorithm ranks all pages. Such kinds of search engines are huge and often retrieve a lot of information -- for complex searches it allows to search within the results of a previous search and enables you to refine search results. These types of search engines contain full text of the Web pages they link to .So one cann find pages by matching words in the pages one wants; B. Human Powered Directories These are built by human They depend on humans to create listings. They are organized into subject categories and subjects do classification of pages.Human powered directories never contain full text of the Web page they link to .They are smaller than most search engines. C.Hybrid Search Engine A hybrid search engine differs from traditional text oriented search engine such as Google or a directory-based search engine such as Yahoo in which each program operates by comparing a set of meta data, the primary corpus being the meta data derived from a Web crawler or taxonomic analysis of all internet text,and a user search query.In contrast, hybrid search engine may use these two bodies of meta data in addition to one or more sets of meta data that can, for example, include situational meta data derived from the client’s network that would model the context awareness of the client. Ⅳ.WORKING OF A WEB CRAWLER 通配符 代表几个字符Web crawlers are an essential component to search engines;running a Web crawler is a challenging task.There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of Web servers and various name servers, which are all beyond the control of the system. Web crawling speed is governed not only by the speed of one’s own Internet connection ,but also by the speed of the sites that are to be crawled.Especially if one is a crawling site from multiple servers, the total crawling time can be significantly reduced,if many downloads are done in parallel. Despite the numerous applications for Web crawlers,at the core they are all fundamentally the same. Following is the process by which Web crawlers work: 1.Download the Web page. 2.Parse through the downloaded page and retrieve all the links. 3.For each link retrieved,repeat the process. The Web crawler can be used for crawling through a whole site on the Inter-/Intranet. You specify a start-URL and the Crawler follows all links found in that HTML page.This usually leads to more links,which will be followed again, and so on.A site can be seen as a tree-structure,the root is the start-URL;all links in that root-HTML-page are direct sons of the root. Subsequent links are then sons of the previous sons. A single URL Server serves lists of URLs to a number of crawlers.Web crawler starts by parsing a specified Web page,noting any hypertext links on that page that point to other Web pages.They then parse those pages for new links,and so on,recursively.Web Crawler software doesn’t actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once.This is necessary to retrieve Web page at a fast enough pace. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a Web browser does when the user clicks on links. All the crawler really does is to automate the process of following links. Web crawling can be regarded as processing items in a queue. When the crawler visits a Web page,it extracts links to other Web pages.So the crawler puts these URLs at the end of a queue, and continues crawling to a URL that it removes from the front of the queue. A. Resource Constraints Crawlers consume resources: network bandwidth to download pages,memory to maintain private data structures in support of their algorithms,CUP to evaluate and select URLs,and disk storage to store the text and links of fetched pages as well as other persistent data. B.Robot Protocol file gives directives for excluding a portion of a Web site to be crawled. Analogously,a simple text file can furnish information about the freshness and popularity fo published objects.This information permits a crawler to optimize its strategy for refreshing collected data as well as replacing object policy. C.Meta Search Engine A meta-search engine is the kind of search engine that does not have its own database of Web pages.It sends search terms to the databases maintained by other search engines and gives users the result that come from all the search engines queried. Fewer meta searchers allow you to delve into the largest, most useful search engine databases. They tend to return results from smaller add/or search engines and miscellaneous free directories, often small and highly commercial. Ⅴ.CRAWLING TECHNIQUES A.Focused Crawling A general purpose Web crawler gathers as many pages as it can from a particular set of URL’s.Where as a focused crawler is designed to only gather documents on a specific topic,thus reducing the amount of network traffic and downloads.The goal of the focused crawler is to selectively seek out pages that are relevant to a predefined set of topics.The topics re specified not using keywords,but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl,and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. The focused crawler has three main components;: a classifier which makes relevance judgments on pages,crawled to decide on link expansion,a distiller which determines a measure of centrality of crawled pages to determine visit priorities, and a crawler with dynamically reconfigurable priority controls which is governed by the classifier and distiller. The most crucial evaluation of focused crawling is to measure the harvest ratio, which is rate at which relevant pages are acquired and irrelevant pages are effectively filtered off from the crawl.This harvest ratio must be high ,otherwise the focused crawler would spend a lot of time merely eliminating irrelevant pages, and it may be better to use an ordinary crawler instead. B.Distributed Crawling Indexing the Web is a challenge due to its growing and dynamic nature.As the size of the Web sis growing it has become imperative to parallelize the crawling process in order to finish downloading the pages in a reasonable amount of time.A single crawling process even if multithreading is used will be insufficient for large - scale engines that need to fetch large amounts of data rapidly.When a single centralized crawler is used all the fetched data passes through a single physical link.Distributing the crawling activity via multiple processes can help build a scalable, easily configurable system,which is fault tolerant system.Splitting the load decreases hardware requirements and at the same time increases the overall download speed and reliability. Each task is performed in a fully distributed fashion,that is ,no central coordinator exits. Ⅵ.PROBLEM OF SELECTING MORE “INTERESTING” A search engine is aware of hot topics because it collects user queries.The crawling process prioritizes URLs according to an importance metric such as similarity(to a driving query),back-link count,Page Rank or their combinations/variations.Recently Najork et al. Showed that breadth-first search collects high-quality pages first and suggested a variant of Page Rank.However,at the moment,search strategies are unable to exactly select the “best服务介绍模板怎么写” paths because their knowledge is only partial.Due to the enormous amount of information available on the Internet a total-crawling is at the moment impossible,thus,prune strategies must be applied.Focused crawling and intelligent crawling,are techniques for discovering Web pages relevant to a specific topic or set of topics. CONCLUSION In this paper we conclude that complete web crawling coverage cannot be achieved, due to the vast size of the whole WWW and to resource availability.Usually a kind of threshold is set up(number of visited URLs, level in the website tree,compliance with a topic,etc.)to limit the crawling process over a selected website.This information is available in search engines to store/refresh most relevant and updated web pages,thus improving quality of retrieved contents while reducing stale content and missing pages. |
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。
发表评论