网络爬虫设计与实现毕业设计论文

来源：榕意旅游网

毕业设计（论文）说明书

题目：网络爬虫设计与实现

学院软件学院专业软件工程

毕业设计（论文）任务书

题目：网络爬虫设计与实现

独创声明

本人郑重声明：所呈交的毕业设计(论文)，是本人在指导老师的指导下，独立进行研究工作所取得的成果，成果不存在知识产权争议。尽我所知，除文中已经注明引用的内容外，本设计（论文）不含任何其他个人或集体已经发表或撰写过的作品成果。对本文的研究做出重要贡献的个人和集体均已在文中以明确方式标明。

本声明的法律后果由本人承担。

作者签名: 二〇一〇年九月二十日

毕业设计（论文）使用授权声明

本人完全了解滨州学院关于收集、保存、使用毕业设计（论文）的规定。

本人愿意按照学校要求提交学位论文的印刷本和电子版，同意学校保存学位论文的印刷本和电子版，或采用影印、数字化或其它复制手段保存设计（论文）；同意学校在不以营利为目的的前提下，建立目录检索与阅览服务系统，公布设计（论文）的部分或全部内容，允许他人依法合理使用。

（保密论文在解密后遵守此规定）

作者签名: 二〇一〇年九月二十日

一、原始依据（包括设计或论文的工作基础、研究条件、应用环

境、工作目的等。）

互联网是一个庞大的非结构化的数据库，将数据有效的检索并组织呈现出来有着巨大的应用前景。搜索引擎作为一个辅助人们检索信息的工具成为用户访问万维网的入口和指南。但是，这些通用性搜索引擎也存在着一定的局限性。不同领域、不同背景的用户往往具有不同的检索目的和需求，通用搜索引擎所返回的结果包含大量用户不关心的网页。所以需要一个能基于主题搜索的满足特定需求的网络爬虫。

为了解决上述问题，参照成功的网络爬虫模式，对网络爬虫进行研究，从而能够为网络爬虫实现更深入的主题相关性，提供满足特定搜索需求的网络爬虫。

二、参考文献

[1]Winter．中文搜索引擎技术解密：网络蜘蛛 [M]．北京：人民邮电出版社，2004年．

[2]Sergey等．The Anatomy of a Large-Scale Hypertextual Web Search Engine [M]．北京：清华大学出版社，1998年．

[3]Wisenut．WiseNut Search Engine white paper [M]．北京：中国电力出版社，2001年．

[4]Gary R.Wright W.Richard Stevens．TCP-IP协议详解卷3：TCP事务协议，HTTP，NNTP和UNIX域协议 [M]．北京：机械工业出版社，2002 年1月. [5]罗刚王振东．自己动手写网络爬虫[M]．北京：清华大学出版社，2010年10月.

[6]李晓明，闫宏飞，王继民．搜索引擎：原理、技术与系统——华夏英才基金学术文库[M]．北京：科学出版社，2005年04月.

三、设计（研究）内容和要求（包括设计或研究内容、主要指标与技术参数，并根据课题性质对学生提出具体要求。）

本课题的主要目的是设计面向主题的网络爬虫程序，同时需要满足的是具有一定的性能，要考虑到网络爬虫的各种需求。

网络爬虫应用宽度搜索技术。对url进行分析，去重。网络爬虫使用多线程技术，让爬虫具备更强大的抓取能力。网络爬虫要实现对特定主题的爬取。网络爬虫还要完成信息提取任务，对于抓取回来的网页提取出来:新闻、电子图书、行业信息等。对网络爬虫的连接网络设置连接及读取时间，避免无限制的等待。研究网络爬虫的原理并实现爬虫的相关功能。

最终实现的网络爬虫应该能根据设定的主题，从设定的url进行一定深度的搜索，并最终得到需要的数据。

指导教师（签字）

年月日

审题小组组长（签字）

年月日

天津大学本科生毕业设计（论文）开题报告

课题名称网络爬虫设计与实现学院名称软件学院学生姓名张凤龙专业名称软件工程指导教师陈锦言

（内容包括：课题的来源及意义，国内外发展状况，本课题的研究目标、研究内容、研究方法、研究手段和进度安排，实验方案的可行性分析和已具备的实验条件以及主要参考文献等。）一．课题的来源及意义互联网是一个庞大的非结构化的数据库，将数据有效的检索并组织呈现出来有着巨大的应用前景。搜索引擎作为一个辅助人们检索信息的工具成为用户访问万维网的入口和指南。但是，这些通用性搜索引擎也存在着一定的局限性。不同领域、不同背景的用户往往具有不同的检索目的和需求，通用搜索引擎所返回的结果包含大量用户不关心的网页。为了解决这个问题，一个灵活的爬虫有着无可替代的重要意义。二．国内外发展状况对于网络爬虫的研究从上世纪九十年代就开始了，目前爬虫技术已经趋见成熟，网络爬虫是搜索引擎的重要组成部分。网络上比较著名的开源爬虫包括Nutch，Larbin，Heritrix。网络爬虫最重要的是网页搜索策略（广度优先和最佳度优先）和网页分析策略（基于网络拓扑的分析算法和基于网页内容的网页分析算法）。三．研究目标本论文主要研究搜索引擎的搜索器（网络爬虫程序）的设计与实现,实现简单的可在后台自动运行的爬虫程序。 1.可以多线程进行抓取。 2.可以进行面向主题的抓取。

四．研究内容本课题研究的内容是如何使网络爬虫灵活高效。 1.如何具备更强的抓取能力。 2.如何分辨重复的网页内容。 3.如何确定主题相关性。 4.对于网络时延等的处理。五．研究方法网络爬虫应用宽度搜索技术。对url进行分析，去重。网络爬虫使用多线程技术，让爬虫具备更强大的抓取能力。网络爬虫还要完成信息提取任务，对于抓取回来的网页提取出来新闻等信息。对网络爬虫的连接网络设置连接及读取时间，避免无限制的等待。研究网络爬虫的原理并实现爬虫的相关功能。六．研究手段参考网上开源的网络爬虫和各种网络爬虫相关的书籍，在windows系统环境下开发。五．本课题进度安排： 2010.12.20—2011.03.10 查阅资料完成任务书，完成开题报告 2011.03.11—2011.03.12 开题报告会 2011.03.13—2011.04.24 查阅资料，进行论文基本章节的写作，完成初稿，并完成进行代码编写 2011.04.25—2011.04.30 毕业设计中期报告会 2011.05.01—2011.05.22 系统设计结束并再次检查系统的可靠性。 2011.05.23—2011.06.22 完成论文及答辩六．本课题可行性分析网络爬虫目前已经比较普遍，国内外有众多对网络爬虫的研究成果，大部分的技术难题已经有解决方案。所以本课题的可行性较高。

八．实验条件 Windows 操作系统；互联网九．主要参考文献 [1]Winter．中文搜索引擎技术解密：网络蜘蛛 [M]．北京：人民邮电出版社，2004年． [2]Sergey等．The Anatomy of a Large-Scale Hypertextual Web Search Engine [M]．北京：清华大学出版社，1998年． [3]Wisenut．WiseNut Search Engine white paper [M]．北京：中国电力出版社，2001年． [4]Gary R.Wright W.Richard Stevens．TCP-IP协议详解卷3：TCP事务协议，HTTP，NNTP和UNIX域协议 [M]．北京：机械工业出版社，2002 年1月. [5]罗刚王振东．自己动手写网络爬虫[M]．北京：清华大学出版社，2010年10月. [6]李晓明，闫宏飞，王继民．搜索引擎：原理、技术与系统——华夏英才基金学术文库[M]．北京：科学出版社，2005年04月. 选题是否合适：是□ 否□ 课题能否实现：能□ 不能□ 指导教师（签字）年月日选题是否合适：是□ 否□ 课题能否实现：能□ 不能□ 审题小组组长（签字）年月日

摘要

本课题的主要目的是设计面向主题的网络爬虫程序，同时需要满足的是具有一定的性能，考虑到网络爬虫的各种需求。

网络爬虫应用宽度搜索技术。对url进行分析，去重。网络爬虫使用多线程技术，让爬虫具备更强大的抓取能力。对网络爬虫的连接网络设置连接及读取时间，避免无限制的等待。为了适应不同需求，使网络爬虫可以根据预先设定的主题实现对特定主题的爬取。研究网络爬虫的原理并实现爬虫的相关功能。

关键词：网络爬虫；面向主题；多线程

ABSTRACT

The main purpose of this project is to design subject-oriented web crawler process which is also required to meet certain performance, taking into account the diverse needs of web crawlers. Web Crawler uses the technology. of Breadth-first search.Web crawler uses multi-threaded technology, so that spiders crawl can have more powerful capabilities.Set connection time and read time of the web connection of the Web crawler , to avoid unlimited waiting.In order to meet different needs, so that crawlers can achieve pre-set theme crawling a specific topic.Research the principle web crawler and and realize the related functions. Key words：Web crawler; subject-oriented; multi-threading

天津大学2007届本科生毕业设计（论文）

第一章概述 .................................... 1

1.1 课题背景 ....................................... 1

1.2 网络爬虫的历史和分类 .......................... 1 1.2.1 网络爬虫的历史 .............................. 1 1.2.2 网络爬虫的分类 .............................. 2 1.3 网络爬虫的发展趋势 ............................ 3

第二章相关技术背景 ............................ 5

2.1 网络爬虫的定义 ................................ 5 2.2 网页搜索策略介绍 .............................. 5 2.2.1 广度优先搜索策略 ............................ 5 2.2.2 最佳优先搜索策略 ............................ 6 2.3 判断相关度算法 ................................ 6

第三章网络爬虫模型的分析和概要设计 ............ 8

3.1 网络爬虫的模型分析 ............................ 8 3.2 网络爬虫的搜索策略 ............................ 8 3.3 网络爬虫的主题相关度判断 ...................... 9 3.4 网络爬虫的概要设计 ........................... 11

第四章网络爬虫模型的设计和实现 ............... 14

4.1 网络爬虫总体设计 ............................. 14 4.2 网络爬虫具体设计 ............................. 14

天津大学2007届本科生毕业设计（论文）

4.2.1 爬取网页 ................................... 14 4.2.2 分析网页 ................................... 15 4.2.3 判断相关度 ................................. 16 4.2.4 保存网页信息 ............................... 17 4.2.5 数据库设计和存储 ........................... 17 4.2.6 4.2.7 4.2.8 第五章第六章多线程的实现 ............................... 17 附加功能 ................................... 18 整体流程 ................................... 18

测试 ................................... 20 总结和展望 ............................. 24

天津大学2007届本科生毕业设计（论文）

第一章概述

1.1 课题背景

网络爬虫，是一种按照一定的规则，自动的抓取万维网信息的程序或者脚本。另外一些不常使用的名字还有蚂蚁，自动索引，模拟程序或者蠕虫。

网络检索功能起于互联网内容爆炸性发展所带来的对内容检索的需求。搜索引擎不断的发展,人们的需求也在不断的提高,网络信息搜索已经成为人们每天都要进行的内容.如何使搜索引擎能时刻满足人们的需求。最初的检索功能通过索引站的方式实现，而有了网络机器人，即网络爬虫这个技术之后，搜索引擎的时代便开始一发不可收拾了。

1.2 网络爬虫的历史和分类 1.2.1 网络爬虫的历史

在互联网发展初期，网站相对较少，信息查找比较容易。然而伴随互联网爆炸性的发展，普通网络用户想找到所需的资料简直如同大海捞针，这时为满足大众信息检索需求的专业搜索网站便应运而生了。

现代意义上的搜索引擎的祖先，是1990年由蒙特利尔大学学生Alan Emtage发明的Archie。虽然当时World Wide Web还未出现，但网络中文件传输还是相当频繁的，而且由于大量的文件散布在各个分散的FTP主机中，查询起来非常不便，因此Alan Archie工作原理与现在的搜索引擎已经很接近，它依靠脚本程序自动搜索网上的文件，然后对有关信息进行索引，供使用者以一定的表达式查询。由于 Archie深受用户欢迎，受其启发，美国内华达System Computing Services大学于1993年开发了另一个与之非常相似的搜索工具，不过此时的搜索工具除了索引文件外，已能检索网页。

当时，“机器人”一词在编程者中十分流行。电脑“机器人”（Computer Robot）是指某个能以人类无法达到的速度不间断地执行某项任务的软件程序。由于专门用于检索信息的“机器人”程序象蜘蛛一样在网络间爬来爬去，因此，搜索引擎的“机器人”程序就被称为“蜘蛛”程序。世界上第一个用于监测互联网发展规模的“机器人”程序是Matthew Gray开发的World wide Web Wanderer。刚

天津大学2007届本科生毕业设计（论文）

开始它只用来统计互联网上的服务器数量，后来则发展为能够检索网站域名。与Wanderer相对应，Martin Koster于1993年10月创建了ALIWEB，它是Archie的HTTP版本。ALIWEB不使用“机器人”程序，而是靠网站主动提交信息来建立自己的链接索引，类似于现在我们熟知的Yahoo。

随着互联网的迅速发展，使得检索所有新出现的网页变得越来越困难，因此，在Matthew Gray的Wanderer基础上，一些编程者将传统的“蜘蛛”程序工作原理作了些改进。其设想是，既然所有网页都可能有连向其他网站的链接，那么从跟踪一个网站的链接开始，就有可能检索整个互联网。到1993年底，一些基于此原理的搜索引擎开始纷纷涌现，其中以JumpStation、The World Wide Web Worm（Goto的前身，也就是今天Overture），和Repository-Based Software Engineering (RBSE) spider最负盛名。

然而JumpStation和WWW Worm只是以搜索工具在数据库中找到匹配信息的先后次序排列搜索结果，因此毫无信息关联度可言。而RBSE是第一个在搜索结果排列中引入关键字串匹配程度概念的引擎最早现代意义上的搜索引擎出现于1994年7月。当时Michael Mauldin将John Leavitt的蜘蛛程序接入到其索引程序中，创建了大家现在熟知的Lycos。同年4月，斯坦福（Stanford）大学的两名博士生，David Filo和美籍华人杨致远（Gerry Yang）共同创办了超级目录索引Yahoo，并成功地使搜索引擎的概念深入人心。从此搜索引擎进入了高速发展时期。目前，互联网上有名有姓的搜索引擎已达数百家，其检索的信息量也与从前不可同日而语。比如最近风头正劲的Google，其数据库中存放的网页已达30亿之巨。

随着互联网规模的急剧膨胀，一家搜索引擎光靠自己单打独斗已无法适应目前的市场状况，因此现在搜索引擎之间开始出现了分工协作，并有了专业的搜索引擎技术和搜索数据库服务提供商。象国外的Inktomi，它本身并不是直接面向用户的搜索引擎，但向包括Overture（原GoTo）、 LookSmart、MSN、HotBot等在内的其他搜索引擎提供全文网页搜索服务。国内的百度也属于这一类（注），搜狐和新浪用的就是它的技术。因此从这个意义上说，它们是搜索引擎的搜索引擎。

1.2.2 网络爬虫的分类

网络爬虫种类繁多，如果按照部署在哪里分，可以分成：

1，服务器侧：一般是一个多线程程序，同时下载多个目标HTML，可以用PHP，

天津大学2007届本科生毕业设计（论文）

Java, Python等做，一般综合搜索引擎的爬虫这样做。但是，如果对方讨厌爬虫，很可能封掉服务器的IP，服务器IP又不容易改，另外耗用的带宽也是较贵。

2，客户端：很适合部署定题爬虫，或者叫聚焦爬虫。做一个与Google，百度等竞争的综合搜索引擎成功的机会微乎其微，而垂直搜诉或者比价服务或者推荐引擎，机会要多得多，这类爬虫不是什么页面都取的，而是只取关心的页面，而且只取页面上关心的内容，例如提取黄页信息，商品价格信息，还有提取竞争对手广告信息的。这类爬虫可以部署很多，而且可以很有侵略性。可以低成本大量部署，由于客户端IP地址是动态的，所以很难被目标网站封锁。

1.3 网络爬虫的发展趋势

目前，大多数的搜索引擎都是基于关键词的搜索引擎。基于关键字匹配的搜索技术有较大的局限性：首先，它不能区分同形异义。其次，不能联想到关键字的同义词。

Web商业化至今，搜索引擎始终保持着网络上被使用最多的服务项目的地位，然而，随着网上内容的爆炸式增长和内容形式花样的不断翻新，搜索引擎越来越不能满足挑剔的网民们的各种信息需求。

搜索引擎的发展面临着两大难题：一是如何跟上Internet的发展速度，二是如何为用户提供更精确的查询结果。所以，传统的引擎不能适应信息技术的高速发展，新一代智能搜索引擎作为一种高效搜索引擎技术的在当今的网络信息时代日益引起业界人士的关注。搜索引擎己成为一个新的研究、开发领域。因为它要用到信息检索、人工智能、计算机网络、分布式处理、数据库、数据挖掘、数字图书馆、自然语言处理等多领域的理论和技术，所以具有综合性和挑战性。又由于搜索引擎有大量的用户，有很好的经济价值，所以引起了世界各国计算机科学界和信息产业界的高度关注，目前的研究、开发十分活跃，并出现了很多值得注意的动向。

目前传统搜索引擎下，百度、谷歌等大厂商垄断了网络索引市场，因为它们的存在，日益庞大的互联网内容才能突破网络黑暗状态，变成可知的一个世界。然而，传统搜索引擎并不能支持定制搜索和信息处理、挖掘，只能以WEB1.0的形式存在。

可以预见将来互联网信息抓取、挖掘和再处理，将成为人们越来越多的需求，而满足这种需求的，就是各种各样的爬虫与相关的信息处理工具。现在网络上流行的信息采集工具、网站聚合工具，都是未来新一代爬虫的先驱，甚至已经具备其特点。但是互联网本身，不管1.0还是2.0，还没有为爬虫时代的到来做好充分

天津大学2007届本科生毕业设计（论文）

准备。现在游行的SEO，就是强势搜索引擎条件下对网站结构产生的影响。爬虫时代到来之后，互联网上会出现专门的信息站点，就是提供给爬虫看的站点。传统的网络爬虫技术主要应用于抓取静态Web 网页，随着AJAX/Web2.0的流行，如何抓取AJAX 等动态页面成了搜索引擎急需解决的问题，因为AJAX颠覆了传统的纯HTTP 请求/响应协议机制，如果搜索引擎依旧采用“爬”的机制，是无法抓取到AJAX 页面的有效数据的。

AJAX 采用了JavaScript 驱动的异步请求/响应机制，以往的爬虫们缺乏JavaScript语义上的理解，基本上无法模拟触发JavaScript的异步调用并解析返回的异步回调逻辑和内容。

另外，在AJAX的应用中，JavaScript 会对DOM结构进行大量变动，甚至页面所有内容都通过JavaScript 直接从服务器端读取并动态绘制出来。这对习惯了DOM 结构相对不变的静态页面简直是无法理解的。由此可以看出，以往的爬虫是基于协议驱动的，而对于AJAX 这样的技术，所需要的爬虫引擎必须是基于事件驱动的。

天津大学2007届本科生毕业设计（论文）

第二章相关技术背景

2.1 网络爬虫的定义

定义1：网络爬虫是一个自动提取网页的程序，它为搜索引擎从Web上下载网页，是搜索引擎的重要组成部分。通用网络爬虫从一个或若干初始网页的URL开始，获得初始网页上的URL列表；在抓取网页的过程中，不断从当前页面上抽取新的URL放入待爬行队列，直到满足系统的停止条件。

定义2：主题网络爬虫就是根据一定的网页分析算法过滤与主题无关的链接，保留主题相关的链接并将其放入待抓取的URL队列中；然后根据一定的搜索策略从队列中选择下一步要抓取的网页URL，并重复上述过程，直到达到系统的某一条件时停止。所有被网络爬虫抓取的网页将会被系统存储，进行一定的分析、过滤，并建立索引，对于主题网络爬虫来说，这一过程所得到的分析结果还可能对后续的抓取过程进行反馈和指导。

定义3：如果网页p中包含超链接l，则p称为链接l的父网页。

定义4：如果超链接l指向网页t，则网页t称为子网页，又称为目标网页。主题网络爬虫的基本思路就是按照事先给出的主题，分超链接和已经下载的网页内容，预测下一个待抓取的URL及当前网页的主题相关度，保证尽可能多地爬行、下载与主相关的网页，尽可能少地下载无关网页。

2.2 网页搜索策略介绍

网页的抓取策略可以分为深度优先、广度优先和最佳优先三种。深度优先在很多情况下会导致爬虫的陷入(trapped)问题，目前常见的是广度优先和最佳优先方法。

2.2.1 广度优先搜索策略

广度优先搜索策略是指在抓取过程中，在完成当前层次的搜索后，才进行下一层次的搜索。该算法的设计和实现相对简单。在目前为覆盖尽可能多的网页，一般使用广度优先搜索方法。也有很多研究将广度优先搜索策略应用于聚焦爬虫中。其基本思想是认为与初始URL在一定链接距离内的网页具有主题相关性的概率很大。另外一种方法是将广度优先搜索与网页过滤技术结合使用，先用广度优先策略抓取网页，再将其中无关的网页过滤掉。这些方法的缺点在于，随着

天津大学2007届本科生毕业设计（论文）

抓取网页的增多，大量的无关网页将被下载并过滤，算法的效率将变低。

2.2.2 最佳优先搜索策略

最佳优先搜索策略按照一定的网页分析算法，预测候选URL与目标网页的相似度，或与主题的相关性，并选取评价最好的一个或几个URL进行抓取。它只访问经过网页分析算法预测为“有用”的网页。存在的一个问题是，在爬虫抓取路径上的很多相关网页可能被忽略，因为最佳优先策略是一种局部最优搜索算法。因此需要将最佳优先结合具体的应用进行改进，以跳出局部最优点。将在第4节中结合网页分析算法作具体的讨论。研究表明，这样的闭环调整可以将无关网页数量降低30%~90%。

2.3 判断相关度算法

主题爬虫的系统组成最初考虑是对页面的过滤，不像普通爬虫对所有页面的链接进行处理，先对页面与受限领域的主题相关度进行分析，只有当其主题相关度符合要求时才处理该页面中的链接，因为如果该页面和本领域比较相关，它所包含的链接和领域相关的几率也较大，这样提高了爬行精度，虽然会遗漏少数页面，但综合效果是令人满意的。因此，主题相关度的分析是主题爬虫设计的关键。（一）主题相关度计算模型

垂直搜索引擎与通用搜索引擎最大的区别在于垂直搜索引擎是面向某个领域的，因而垂直搜索引擎的网络蜘蛛只采集与主题相关的网页，与主题无关的网页将被丢弃，将此类网络蜘蛛称为主题蜘蛛[6-8]。主题蜘蛛将网页下载到本地后，需要使用基于内容的主题判别方法计算该网页的主题相关度值，主题相关度低于某一阈值的网页被丢弃。主题相关度的计算方法有布尔模型和向量空间模型两种模型算法[10]。

1.布尔模型。在主题判别时，布尔模型是很容易实现的。在布尔模型[9]中，一个文档通过一个关键词集合来表示。同时，某个主题也以关键词集合的形式来表示。在判断文档与某主题的相关度的过程中，相当于是计算两个关键词集合的交集。对基于布尔模型的主题判别模型来说，交集中含有的元素越多，则认为与主题的相关度就越高。

2.空间向量模型。向量空间模型[11]（Vector Space Model）由Salton等人于20世纪60年代末提出，是一种简便、高效的文本表示模型，其理论基础是代数学。与布尔模型不同，向量空间模型把用户的查询要求和数据库文档信息表示成由检索项构成的向量空间中的点（向量），而通过计算向量之间的距离来判定文档和查询之间的相似程度（例如，用它们之间夹角的余弦作为相似性度量）。

天津大学2007届本科生毕业设计（论文）

然后，根据相似程度排列查询结果。在向量空间模型中，文档被形式化为n维空间中的向量，把关键词的个数n作为空间向量的维数，每个关键词的权值作为每一维分量的大小，则主题用向量表示为： A=(a1,a2,…,an),i=1,2,…,n,ai=wi

对于页面进行分析，统计关键词出现的频率，并求出频率之比，以出现的频率最高的关键词作为基准，其频率用xi=1表示，通过频率比，求出其他关键词的频率，则该页面对应向量的每一维分量为xiwi。指定一个阈值r，当cos<α，β>=r时就可以认为该页面和主题是比较相关的，r的取值需要根据经验和实际要求确定，如果想获得较多的页面，可以把r设小一点，要获得较少的页面可以把r设的大一点。

（二）布尔模型与空间向量模型分析

布尔模型的主要缺陷在于每个关键词的权重都是一样的，它不支持设定关键词的相对重要性，但是其优点也较为明显，它易于实现，计算代价较小。向量空间模型最大优点在于它在知识表示方法上的巨大优势。在该模型中，文档的内容被形式化为多维空间中的一个点，以向量的形式给出。也正是因为把文档以向量的形式定义到实数域中，才使得模式识别和其他领域中各种成熟的算法和计算方法得以采用，极大地提高了自然语言文档的可计算性和可操作性。通过对空间向量模型和布尔模型的介绍，我们知道现在垂直搜索引擎大多采用空间向量模型计算主题相关性。这样极大的提高到主题爬虫的效率，也极大的提高了垂直搜索引擎的应用效率，给客户带来了高效的查询效果。与在进行页面的主题相关度分析后，当其主题相关度符合要求时将处理该页面中的所有链接，但其中的链接指向的页面也可能有许多偏离了主题，这一点在网页的标题上就可以看出，现在大多数网页的标题已经很明显的给出了文本的主要描述对象，所以传统的空间模型策略没有注意到网页标题这个重要的角色。针对此提出了一种基于网页标题的空间向量模型主题相关度计算方法。

天津大学2007届本科生毕业设计（论文）

第三章网络爬虫模型的分析和概要设计

3.1 网络爬虫的模型分析

首先建立URL任务列表，即开始要爬取的URL。由URL任务列表开始，根据预先设定的深度爬取网页，同时判断URL是否重复，按照一定算法和排序方式搜索页面，然后对页面按照一定算法进行分析，并提取相关URL，最后将所得URL返回任务列表。之后将任务列表中URL重新开始爬取，从而使网络爬虫进行循环运行。

3.2 网络爬虫的搜索策略

本文的搜索策略为广度优先搜索策略。如下图3-1所示。

图3-1 广度优先搜索策略示意图

1）定义一个状态结点

采用广度优先搜索算法解答问题时，需要构造一个表明状态特征和不同状态之间关系的数据结构，这种数据结构称为结点。不同的问题需要用不同的数据结构描述。

2）确定结点的扩展规则

根据问题所给定的条件，从一个结点出发，可以生成一个或多个新的结点，这个

天津大学2007届本科生毕业设计（论文）

过程通常称为扩展。结点之间的关系一般可以表示成一棵树，它被称为解答树。搜索算法的搜索过程实际上就是根据初始条件和扩展规则构造一棵解答树并寻找符合目标状态的结点的过程。

广度优先搜索算法中，解答树上结点的扩展是沿结点深度的“断层”进行，也就是说，结点的扩展是按它们接近起始结点的程度依次进行的。首先生成第一层结点，同时检查目标结点是否在所生成的结点中，如果不在，则将所有的第一层结点逐一扩展，得到第二层结点，并检查第二层结点是否包含目标结点，...对长度为n+1的任一结点进行扩展之前，必须先考虑长度为n的结点的每种可能的状态。因此，对于同一层结点来说，求解问题的价值是相同的，我们可以按任意顺序来扩展它们。这里采用的原则是先生成的结点先扩展。

结点的扩展规则也就是如何从现有的结点生成新结点。对不同的问题，结点的扩展规则也不相同，需要按照问题的要求确定。 3）搜索策略

为了便于进行搜索，要设置一个表存储所有的结点。因为在广度优先搜索算法中，要满足先生成的结点先扩展的原则，所以存储结点的表一般设计成队列的数据结构。

搜索的步骤一般是：

（1）从队列头取出一个结点，检查它按照扩展规则是否能够扩展，如果能则产生一个新结点。

（2）检查新生成的结点，看它是否已在队列中存在，如果新结点已经在队列中出现过，就放弃这个结点，然后回到第（1）步。否则，如果新结点未曾在队列中出现过，则将它加入到队列尾。

（3）检查新结点是否目标结点。如果新结点是目标结点，则搜索成功，程序结束；若新结点不是目标结点，则回到第（1）步，再从队列头取出结点进行扩展......。

最终可能产生两种结果：找到目标结点，或扩展完所有结点而没有找到目标结点。

3.3 网络爬虫的主题相关度判断

天津大学2007届本科生毕业设计（论文）

主题蜘蛛将网页下载到本地后，需要使用基于内容的主题判别方法计算该网页的主题相关度值，主题相关度低于某一阈值的网页被丢弃。

（一）什么是网页标题

通常浏览一个网页时，通过浏览器顶端的蓝色显示条出现的信息就是“网页标题”。

在网页HTML代码中，网页标题位于标签之间。

网页标题是对于一个网页的高度概括，一般来说，网站首页的标题就是网站的正式名称，而网站中文章内容页面的标题就是这文章的题目，栏目首页的标题通常是栏目名称。当然这种一般原则并不是固定不变的，在实际工作中可能会有一定的变化，但是无论如何变化，总体上仍然会遵照这种规律[12]。

例如，现在会看到很多网站的首页标题较长，除了网站名称之外，还有网站相关业务之类的关键词，这主要是为了在搜索引擎搜索结果中获得排名优势而考虑的，也属于正常的搜索引擎优化方法。因为一般的公司名称（或者品牌名称）中可能不包含核心业务的关键词，在搜索结果排名中将处于不利地位。（二）网页标题的重要性

以Google为例，Google会对其标题标签（meta title）中出现的关键字给予较高的权值。所以应当确保在网站的标题标签中包含了最重要的关键词，即应围绕最重要的关键词来决定网页标题的内容。不过网页的标题不可过长，一般最好在35到40个字符之间。在实际操作中，网页标题不宜过短或过长。太短无法完整的表达网页信息，太长不仅不利于用户识别，而且对搜索引擎来说也加大了识别核心关键词的难度；网页标题应概括网页的核心内容。搜索引擎在进行搜索的时候，搜索结果的内容一般是网页标题、网页摘要信息和链接，要引起用户的关注，高度总结了网页内容的标题至关重要。比如戴尔中国的网站www.dell.com.cn首页标题为“戴尔中国（Dell China）—计算机，笔记本电脑，台式机，打印机，工作站，服务器，存储器，电子产品及附件等”。戴尔的首页标题中不但涵盖了最重要的公司信息，而且还包括公司的主要产品，这就是核心关键词，当用“笔记本电脑”、“台式电脑”这些关键词在谷歌中进行搜索时，戴尔公司的网页都排在第一屏的前几条位置。

（二）但是与此同时需要注意的还有网页正文的重要性，因为网页的标题和关键字很可能与正文无关，虚假关键词是通过在META中设置与网站内容无关的关键词，如在Title中设置热门关键词，以达到误导用户进入网站的目的。同样的情况也包括链接关键词与实际内容不符的情况。

天津大学2007届本科生毕业设计（论文）

具体判断主题相关度的步骤

1.对标题及正文的特征项的选取是通过分词后与主题集合匹配，并通过词频计算来得到与主题向量维数相等的标题向量和正文向量。 2.通过向量空间模型计算出主题和标题的相关度B。 3.通过空间向量向量模型计算主题与正文的相关度C。 4.主题与整个网页的相关度：A=4×B+C。

5.通过详细计算，设定相关度阈值为2，网页与主题的相关度A>2，则认为该网页与主题相关的。

3.4 网络爬虫的概要设计

本网络爬虫的开发目的，通过网络爬虫技术一个自动提取网页的程序，实现搜索引擎从自己想要访问的网上下载网页，再根据已下载的网页上继续访问其它的网页，并将其下载直到满足用户的需求。

根据现实中不同用户的实际上的各种需求，本项目简单实现主题爬虫，本网络爬虫需要达到如下几个目标：

1.设计基于多线程的网络爬虫，客户端向服务器发送自己设定好请求。如图3-7所示。

天津大学2007届本科生毕业设计（论文）

线程1搜索元URL如www.baidu.com 互联网 URL配置文件 URL配置文件列表临界区线程2搜索元URL如www.baidu.com 线程N

图3-2 多线程网络爬虫概要设计图模型

2.通过 http将Web服务器上协议站点的网页代码提取出来。 3.根据一定的正则表达式提取出客户端所需要的信息。

4.广度优先搜索可从网页中某个链接出发，访问该链接网页上的所有链接，访问完成后，再通过递归算法实现下一层的访问。

本网络爬虫最终将设计成一个能够自动读写配置文件并且在后台自动执行的网络爬虫程序。网络爬虫工作流程图如图3-3所示。

天津大学2007届本科生毕业设计（论文）

满足条件停止根据宽度有限算法搜索目标URL网络蜘蛛循环爬行从配置文件中读取初始URL作为源URL开始获取网页以正则表达式过滤网页标签提取目标URL

图3-3 网络爬虫工作流程图

结束13

天津大学2007届本科生毕业设计（论文）

第四章网络爬虫模型的设计和实现

4.1 网络爬虫总体设计

根据本网络爬虫的概要设计本网络爬虫是一个自动提取网页的程序，根据设定的主题判断是否与主题相关，再根据已下载的网页上继续访问其它的网页，并将其下载直到满足用户的需求。 1.设计基于多线程的网络爬虫。

2.通过 http将待爬取URL列表对应的URL的网页代码提取出来。 3.提取出所需要的信息并且通过算法判断网页是否和设定的主题相关。 4.广度优先搜索，从网页中某个链接出发，访问该链接网页上的所有链接，访问完成后，再通过递归算法实现下一层的访问，重复以上步骤。

总的来说爬虫程序根据输入获得URL任务列表，即初始URL种子，把初始种子保存在临界区中，按照广度搜索运算法搜索抓取网页并提取URL返回到临届区中，通过判断主题相关度算法判断相关度，取出不相关网页，，从而使整个爬虫程序循环运行下去。

4.2 网络爬虫具体设计 4.2.1 爬取网页

主要用到的技术如下：

继承HTMLEditorKit类，改写其中的 HTMLEditorKit.Parser getParser()属性protect为public，用下列函数爬取网页： public class XXXXX extends HTMLEditorKit { public HTMLEditorKit.Parser getParser() {

return super.getParser(); } }

步骤如下：

1首先建立URL连接。

URLConnection url_C = url_test.openConnection(); 2设置连接超时时间和读取超时时间。 url_C.setConnectTimeout(10000);

天津大学2007届本科生毕业设计（论文）

url_C.setReadTimeout(10000);

3.用输入流，BufferedReader读取，并且将网页内容存储为字符串。

4.2.2 分析网页

继承ParserCallback获得网页内容

// 得到标题文本

protected String urlTitle = new String(); // 得到某一网页上的所有链接

protected Vector links = new Vector(); protected Vector linkname = new Vector(); // 得到网页上的正文文本

protected String paragraphText = new String(); protected String linkandparagraph = new String(); protected String encode = new String(); public Parser(String baseurl) { }

public String getEncode() { }

// 获得该网页标题

public String getURLtitle() { }

// 获得该网页的所有链接 public Vector getLinks() { }

// 获得所有该网页的链接名 public Vector getLinkName() // 获得网页正文

public String getParagraphText()

public void handleEndTag(HTML.Tag t, int pos) // 处理简单标签

return links; return urlTitle; return encode; base = baseurl;

天津大学2007届本科生毕业设计（论文）

public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int

pos) // 处理结束标签

之后通过调用HTMLEditorKit.Parser类，生成对象就可以直接得到分析后的网页文件。

public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int // 处理文本标签

public void handleText(char[] data, int pos) pos)

4.2.3 判断相关度

算法实现步骤和算法描述：

5.通过详细计算，设定相关度阈值为2，网页与主题的相关度A>2，则认为该网页与主题相关的。

输入：主题集合文本a.txt，网页url 输出：主题相关度

（1）Get topic（String path）//根据路径获取主题文本集合（2）Compulate topicweight（String topic）//求主题结合权重

（3）sortAndDelRepeat(int[]count)//删除重复元素并排序（4）delRepeat(String[] segment)//删除分词后的重复元素（5）delRepeat(Vector url)//删除得到的URL中的重复元素（6）getParser(String url)//获得Parser实例

（7）String titleStr =p.getURLtitle（）//获取网页标题（8）String bodyStr=p.getParagraphText（）//获取网页文本

（9）String titleStrSeg=segment.segment（titleStr）//网页标题分词（10）String bodyStrSeg=segment.segment（bodyStr）//网页文本分词（11）Compulate title.length，body.length//计算标题向量长度和网页

天津大学2007届本科生毕业设计（论文）

文本向量长度

（12）set topicweight1，titleweight1，bodyweight1；//设置权重（13）Last compulate Relative//计算主题相关性（14）Return relative；//返回结果

根据系统设置首先是下载所有网页，而后判定主题相关性，与主题相关则放置在相关URL库中，不相关的网页则丢弃。

4.2.4 保存网页信息

1.首先建立URL连接。

URLConnection url_C = url_test.openConnection(); 2.新建PagePro类。如下： private String Host; private int Port;

private String ContentType; private int ContentLength; private String Date; private String Url;

3.将数据存入新建的PagePro类中。 4.将数据保存到预先输入的地址。

4.2.5 数据库设计和存储

使用JDBC访问数据库，储存下载的网页URL和下载时间信息。

4.2.6 多线程的实现

设计为4个线程同时进行工作。

1. 从用户输入的起始URL开始，递归获得指定深度的URL。 2. 对每个URL进行分析，判断相关度。 3. 下载与主题相关的网页，并存储在数据库中。

第i个线程对所有URL列表中序列为第0+4i URL的进行同步操作，其中对储存所有URL的列表执行synchronized (all_URL)操作。

天津大学2007届本科生毕业设计（论文）

4.2.7 附加功能

为了检测网络环境，防止因为不良的网络环境影响网络爬虫的爬取效率和正确略，额外添加了实时的ping功能，调用windows的命令解释器的ping功能，测试用户输入网址与当前主机的连接状况，测试当前网络状况是否良好。

4.2.8 整体流程

爬虫代码文件构成如图4-1：

图4-1 代码结构构成截图

HtmlParser.java这个类是改写HTMLEditorKit.Parser getParser()方法为public

HTTP.java是根据输入URL获取网页文档

Parser.java是继承ParserCallback获得网页内容 Relative.java是判断主题与网页内容的相关性 Segment.java是对网页主题和正文进行分词

天津大学2007届本科生毕业设计（论文）

Download.java是下载网页所用，Pagepro.java是为Download.java生成存储对象。

JDBCTest.java对数据库进行操作 mainF.java整合了网络爬虫的功能 Ui.java是界面

Ping.java是调用Ping程序的类具体流程：

第一步：调用HtmlParser.java，Parser.java，获得起始URL的内容，并存储到String中。

第二步：调用Parser.java获得网页下面所有的URL，同时去除重复的部分。第三步：对以上两步进行递归循环，获得指定深度的所有URL列表。第四步：调用Relative.java，Segment.java得到每个URL对应的网页内容与给定主题的阈值，大于给定值则相关，小于给定值则不相关，丢弃该URL。

第五步：调用Download.java和JDBCTest.java将与主题相关的网页下载并存储入数据库。

天津大学2007届本科生毕业设计（论文）

第五章测试

设定只爬取前5个网页，程序运行后的界面如图5-1

图5-1 测试图1

预设目录为，D：test 按下START后，查看目录，可见如图5-2：

天津大学2007届本科生毕业设计（论文）

图5-2 测试图2

查看数据库可见，如图5-3：

图5-3 测试图3

测试Ping功能，分别对正确网址ping和不正确网址ping，如图5-4

天津大学2007届本科生毕业设计（论文）

图5-4 测试图4

图5-5 测试图5

天津大学2007届本科生毕业设计（论文）

图5-6 测试图6

天津大学2007届本科生毕业设计（论文）

第六章总结和展望

2011年3月，我开始了我的毕业论文工作，时至今日，论文基本完成。从最初的茫然，到慢慢的进入状态，再到对思路逐渐的清晰，整个写作过程难以用语言来表达。历经了几个月的奋战，紧张而又充实的毕业设计终于落下了帷幕。回想这段日子的经历和感受，我感慨万千，在这次毕业设计的过程中，我拥有了无数难忘的回忆和收获。

3月初，在与导师的交流讨论中我的题目定了下来，是面向主题的网络爬虫。当选题报告，开题报告定下来的时候，我当时便立刻着手资料的收集工作中，当时面对浩瀚的书海真是有些茫然，不知如何下手。我将这一困难告诉了导师，在导师细心的指导下，终于使我对自己现在的工作方向和方法有了掌握。在搜集资料的过程中，我认真准备了一个笔记本。我在学校图书馆，大工图书馆搜集资料，还在网上查找各类相关资料，将这些宝贵的资料全部记在笔记本上，尽量使我的资料完整、精确、数量多，这有利于论文的撰写。然后我将收集到的资料仔细整理分类，及时拿给导师进行沟通。

4月初，资料已经查找完毕了，我开始着手论文的写作。在写作过程中遇到困难我就及时和导师联系，并和同学互相交流，请教专业课老师。在大家的帮助下，困难一个一个解决掉，论文也慢慢成型。

4月底，平台设计已经完成。5月开始相关代码编写工作。为了完成满意的平台设计，我仔细温习了数据库原理相关知识。深入了解并掌握数据库基础知识，挖掘出数据库课程中的难点和重点，对于其中的难点，要充分考虑学生的学习能力，帮助学生以一种最容易接受的方式掌握知识。对于课程中的重点，要强调突出，有规律反复出现，帮助学生更高效消化知识。在设计平台中，要注意平台的可行性和有效性，选择既重要又适合以学习软件形式出现的知识点作为材料，参考优秀的国内外学习辅助平台，又考虑到数据库课程的特殊性。在设计初期，由于没有设计经验，觉得无从下手，空有很多设计思想，却不知道应该选哪个，经过导师的指导，我的设计渐渐有了头绪，通过查阅资料，逐渐确立系统方案。在整个过程中，我学到了新知识，增长了见识。在今后的日子里，我仍然要不断地充实自己，争取在所学领域有所作为。

脚踏实地，认真严谨，实事求是的学习态度，不怕困难、坚持不懈、吃苦耐劳的精神是我在这次设计中最大的收益。我想这是一次意志的磨练，是对我实际能力的一次提升，也会对我未来的学习和工作有很大的帮助。

在这次毕业设计中也使我们的同学关系更进一步了，同学之间互相帮助，有什么不懂的大家在一起商量，听听不同的看法对我们更好的理解知识，所以在这里非常感谢帮助我的同学。

天津大学2007届本科生毕业设计（论文）

在此更要感谢我的导师和专业老师，是你们的细心指导和关怀，使我能够顺利的完成毕业论文。在我的学业和论文的研究工作中无不倾注着老师们辛勤的汗水和心血。老师的严谨治学态度、渊博的知识、无私的奉献精神使我深受启迪。从尊敬的导师身上，我不仅学到了扎实、宽广的专业知识，也学到了做人的道理。在此我要向我的导师致以最衷心的感谢和深深的敬意。

天津大学2007届本科生毕业设计（论文）

参考文献

[1]Winter．中文搜索引擎技术解密：网络蜘蛛 [M]．北京：人民邮电出版社，2004年．

[2]Sergey等．The Anatomy of a Large-Scale Hypertextual Web Search Engine [M]．北京：清华大学出版社，1998年．

[3]Wisenut．WiseNut Search Engine white paper [M]．北京：中国电力出版社，2001年．

[6]李晓明，闫宏飞，王继民．搜索引擎：原理、技术与系统——华夏英才基金学术文库[M]．北京：科学出版社，2005年04月.

天津大学2007届本科生毕业设计（论文）

外文资料

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon. 1. INTRODUCTION

A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly

天津大学2007届本科生毕业设计（论文）

over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from nonweb sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is (a) Fetch a page

(b) Parse it to extract all linked URLs

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes. (See for instance [34]).

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors. 1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If “change” means “any change”, then about

天津大学2007届本科生毕业设计（论文）

40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and 679 complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.

A crucial way to speed up the membership test is to cache a (dynamic) subset of the “seen” URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research. 2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these

天津大学2007届本科生毕业设计（论文）

crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system. For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.

Cho and Garcia-Molina’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL’s host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers an URL for which it is not responsible, sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple

天津大学2007届本科生毕业设计（论文）

independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.

Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler components are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and caching popular URLs in memory is a win: Caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.

Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler[4], and Cho and Molina’s crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper. 3. CACHING

In most computer systems, memory is hierarchical, that is, there exist two or more

天津大学2007届本科生毕业设计（论文）

levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37]. In the web context, caching is often mentioned

in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.

Caching terminology is as follows: the cache is memory used to store equal sized atomic items. A cache has size k if it can store at most k items.1 At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.

Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:

_ Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to yahoo.com is a much more common occurrence than a link to the authors’ home pages.

_ Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter

天津大学2007届本科生毕业设计（论文）

terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page – we found that on average about 30% are duplicates, cf. Section 4.2, and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, notorious papers, etc.

Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways.

We now describe some standard caching algorithms, whose performance we evaluate in Section 5. 3.1 Infinite cache (INFINITE)

This is a theoretical algorithm that assumes that the size of the cache is larger than the number of distinct requests. 3.2 Clairvoyant caching (MIN)

More than 35 years ago, L´aszl´o Belady [2] showed that if the entire sequence of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance. 3.3 Least recently used (LRU)

The LRU algorithm evicts the item in the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefore the number of misses will be minimized in the spirit of Belady’s algorithm. Despite the admonition that “past performance is no guarantee of future results”, sadly verified by the current state of the stock markets, in practice, LRU is generally very effective. However, it requires maintaining a priority queue of requests. This

天津大学2007届本科生毕业设计（论文）

queue has a processing time cost and a memory cost. The latter is usually ignored in caching situations where the items are large. 3.4 CLOCK

CLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits M0;M1; : : : ;Mk corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arrives, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves

sequentially through the array, turning the mark bits off, until an unmarked location is found. The cache item corresponding to the unmarked location is evicted and replaced by X.

3.5 Random replacement (RANDOM)

Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced.

In most practical situations, random replacement performs worse than CLOCK but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Section 6. 3.6 Static caching (STATIC)

If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.

There are two issues with this approach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in most practical situations.

In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the current crawl. (There are also efficient techniques for discovering the

天津大学2007届本科生毕业设计（论文）

most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.) Of course, for simulation purposes we can do a first pass over our input to determine the k most popular URLs, and then preload the cache with these URLs, which is what we did in our experiments.

The second issue above is the very reason we decided to test STATIC: if STATIC performs well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests are highly correlated. 4. EXPERIMENTAL SETUP

We now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl. 4.1 Mercator crawler architecture

A Mercator crawling system consists of a number of crawling processes, usually running on separate machines. Each crawling process is responsible for a subset of all web servers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.

Each worker thread repeatedly performs the following operations: it obtains a URL from the URL Frontier, which is a diskbased data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream or RIS for short); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic properties. For example, it discards all URLs belonging to web servers that contacted us and asked not be crawled.

The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL’s host name. Since most links are relative, most of the URLs (81.5% in our experiment) will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling

天津大学2007届本科生毕业设计（论文）

processes. Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URL Frontier for future download. In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today’s web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be configured to maintain this set as a distributed in-memory hash table (where each crawling process maintains the subset of URLs assigned to it); however, this DUE implementation (which reduces URLs to 8-byte checksums, and uses the first 3 bytes of the checksum to index into the hash table) requires about 5.2 bytes per URL, meaning that it takes over 5 GB of RAM per crawling machine to maintain a set of 1 billion URLs per machine. These memory requirements are too steep in many settings, and in fact, they exceeded the hardware available to us for this experiment. Therefore, we used an alternative DUE

implementation that buffers incoming URLs in memory, but keeps the bulk of URLs (or rather, their 8-byte checksums) in sorted order on disk. Whenever the in-memory buffer fills up, it is merged into the disk file (which is a very expensive operation due to disk latency) and newly discovered URLs are passed on to the Frontier. Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it possible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead of sending them to the peer node, thereby reducing the amount of network traffic. This reduction is particularly important in a scenario where the individual crawling machines are not connected via a high-speed LAN (as they were in our experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers “close to it”.

Mercator performs an approximation of a breadth-first search traversal of the

天津大学2007届本科生毕业设计（论文）

web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator’s politeness policy, which limits the load placed by the crawler on any particular web server. Mercator’s politeness policy guarantees that no server ever receives multiple requests from Mercator in parallel; in addition, it guarantees that the next request to a server will only be issued after a multiple (typically 10_) of the time it took to answer the previous request has passed. Such a politeness policy is essential to any large-scale web crawler; otherwise the crawler’s operator becomes inundated with complaints. 4.2 Our web crawl

Our crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM, 144 GB of disk2, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet’s backbone.

The crawl ran from July 12 until September 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links. Earlier studies reported only an average of 8 links [9] or 17 links per page [33]. We offer three explanations as to why we found more links per page. First, we configured Mercator to not limit itself to URLs found in anchor tags, but rather to extract URLs from all tags that may contain them (e.g. image tags). This configuration increases both the mean and the median number of links per page. Second, we configured it to download pages up to 16 MB in size (a setting that is significantly higher than usual), making it possible to encounter pages with tens of thousands of links. Third, most studies report the number of unique links per page. The numbers above include duplicate copies of a link on a page. If we

天津大学2007届本科生毕业设计（论文）

only consider unique links3 per page, then the average number of links is 42.74 and the median is 17.

The links extracted from these HTML pages, plus about 38 million HTTP redirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the effectiveness of various caching algorithms, we instrumented Mercator’s Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs. After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique fingerprints, which we determined using an in-memory hash table on a

very-large-memory machine. This data reduction step left us with four condensed host splitter logs (one per crawling machine), ranging from 51 GB to 57 GB in size and containing between 6.4 and 7.1 billion URLs.

In order to explore the effectiveness of caching with respect to inter-process communication in a distributed crawler, we also extracted a sub-trace of the Host Splitter logs that contained only those URLs that were sent to peer crawlers. These logs contained 4.92 billion URLs, or about 19.5% of all URLs. We condensed the sub-trace logs in the same fashion. We then used the condensed logs for our simulations.

5. SIMULATION RESULTS

We studied the effects of caching with respect to two streams of URLs: 1. A trace of all URLs extracted from the pages assigned to a particular machine. We refer to this as the full trace.

2. A trace of all URLs extracted from the pages assigned to a particular machine that were sent to one of the other machines for processing. We refer to this trace as the cross subtrace, since it is a subset of the full trace.

The reason for exploring both these choices is that, depending on other

天津大学2007届本科生毕业设计（论文）

architectural decisions, it might make sense to cache only the URLs to be sent to other machines or to use a separate cache just for this purpose.

We fed each trace into implementations of each of the caching algorithms described above, configured with a wide range of cache sizes. We performed about 1,800 such experiments. We first describe the algorithm implementations, and then present our simulation results. 5.1 Algorithm implementations

The implementation of each algorithm is straightforward. We use a hash table to find each item in the cache. We also keep a separate data structure of the cache items, so that we can choose one for eviction. For RANDOM, this data structure is simply a list. For CLOCK, it is a list and a clock handle, and the items also contain “mark” bits. For LRU, it is a heap, organized by last access time. STATIC needs no extra data structure, since it never evicts items. MIN is more complicated since for each item in the cache, MIN needs to know when the next request for that item will be. We therefore describe MIN in more detail. Let A be the trace or sequence of requests, that is, At is the item requested at time t. We create a second sequence Nt containing the time when At next appears in A. If there is no further request for At after time t, we set Nt = 1. Formally,

To generate the sequence Nt, we read the trace A backwards, that is, from tmax down to 0, and use a hash table with key At and value t. For each item At, we probe the hash table. If it is not found, we set Nt = 1and store (At; t) in the table. If it is found, we retrieve (At; t0), set Nt = t0, and replace (At; t0) by (At; t) in the hash table. Given Nt, implementing MIN is easy: we read At and Nt in parallel, and hence for each item requested, we know when it will be requested next. We tag each item in the cache with the time when it will be requested next, and if necessary, evict the item with the highest value for its next request, using a heap to identify itquickly. 5.2 Results

We present the results for only one crawling host. The results for the other three hosts are quasi-identical. Figure 2 shows the miss rate over the entire trace (that is, the percentage of misses out of all requests to the cache) as a function of the size of the

天津大学2007届本科生毕业设计（论文）

cache. We look at cache sizes from k = 20 to k = 225. In Figure 3 we present the same data relative to the miss-rate of MIN, the optimum off-line algorithm. The same simulations for the cross-trace are depicted in Figures 4 and 5.

For both traces, LRU and CLOCK perform almost identically and only slightly worse than the ideal MIN, except in the critical region discussed below. RANDOM is only slightly inferior to CLOCK and LRU, while STATIC is generally much worse. Therefore, we conclude that there is considerable locality of reference in the trace, as explained in Section 3.6. For very large caches, STATIC appears to do better than MIN. However, this is just an artifact of our accounting scheme: we only charge for misses and STATIC is not charged for the initial loading of the cache. If STATIC were instead charged k misses for the initial loading of its cache, then its miss rate would be of course worse than MIN’s.

6. CONCLUSIONS AND FUTURE DIRECTIONS

After running about 1,800 simulations over a trace containing 26.86 billion URLs, our main conclusion is that URL caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this size is a critical point, that is, a substantially smaller cache is ineffectual while a substantially larger cache brings little additional benefit. For practical purposes our investigation is complete: In view of our discussion in Section 5.2, we recommend a cache size of between 100 to 500 entries per crawling thread. All caching strategies perform roughly the same; we recommend using either CLOCK or RANDOM, implemented using a scatter table with circular chains. Thus, for 500 crawling threads, this cache will be about 2MB – completely insignificant compared to other data structures needed in a crawler. If the intent is only to reduce cross machine traffic in a distributed crawler, then a slightly smaller cache could be used. In either case, the goal should be to have a miss rate lower than 20%.

However, there are some open questions, worthy of further research. The first open problem is to what extent the crawl order strategy (graph traversal method) affects the caching performance. Various strategies have been proposed [14], but there are indications [30] that after a short period from the beginning of the crawl the

天津大学2007届本科生毕业设计（论文）

general strategy does not matter much. Hence, we believe that caching performance will be very similar for any alternative crawling strategy. We can try to implement other strategies ourselves, but ideally we would use independent crawls.

Unfortunately, crawling on web scale is not a simple endeavor, and it is unlikely that we can obtain crawl logs from commercial search engines.

In view of the observed fact that the size of the cache needed to achieve top performance depends on the number of threads, the second question is whether having a per-thread cache makes sense. In general, but not always, a global cache performs better than a collection of separate caches, because common items need to be stored only once. However, this assertion needs to be verified in the URL caching context. The third open question concerns the explanation we propose in Section 5 regarding the scope of the links encountered on a given host. If our model is correct then it has certain implications regarding the appropriate model for the web graph, a topic of considerable interest among a wide variety of scientists: mathematicians, physicists, and computer scientists. We hope that our paper will stimulate research to estimate the cache performance under various models. Models where caching performs well due to correlation of links on a given host are probably closer to reality. We are making our URL traces available for this research by donating them to the Internet Archive.

天津大学2007届本科生毕业设计（论文）

中文译文

要在网络上爬行非常简单：基本的算法是：（a）取得一个网页（b）解析它

提取所有的链接URLs（c）对于所有没有见过的URLs重复执行（a）-（c）。但是，网络的大小（估计有超过40亿的网页）和他们变化的频率（估计每周有7%的变化）使这个计划由一个微不足道的设计习题变成一个非常严峻的算法和系统设计挑战。实际上，光是这两个要素就意味着如果要进行及时地，完全地爬行网络，步骤（a）必须每秒钟执行大约1000次，因此，成员检测（c）必须每秒钟执行超过10000次，并有非常大的数据储存到主内存中。这个要求有一个分布式构造，使得成员检测更加复杂。

一个非常重要的方法加速这个检测就是用cache（高速缓存），这个是把见过的URLs存入主内存中的一个（动态）子集中。这个论文最主要的成果就是仔细的研究了几种关于网络爬虫的URL缓存技术。我们考虑所有实际的算法：随机置换，静态cache，LRU，和CLOCK，和理论极限：透视cache和极大的cache。我们执行了大约1800次模拟，用不同的cache大小执行这些算法，用真实的log日志数据，获取自一个非常大的33天的网络爬行，大约执行了超过10亿次的http请求。

我们的主要的结论是 cache是非常高效的-在我们的机制里，一个有大约50000个入口的cache可以完成80%的速率。有趣的是，这cache的大小下降到一个临界点：一个足够的小一点的cache更有效当一个足够的大一点的cache只能带来很小的额外好处。我们推测这个临界点是固有的并且冒昧的解释一下这个现象。 1.介绍

皮尤基金会最新的研究指出：“搜索引擎已经成为互联网用户不可或缺的工具”，估计在2002年中期，初略有超过1半的美国人用网络搜索获取信息。因此，一个强大的搜索引擎技术有巨大的实际利益，在这个论文中，我们集中于一方面的搜索技术，也就是搜集网页的过程，最终组成一个搜索引擎的文集。搜索引擎搜集网页通过很多途径，他们中，直接提交URL，回馈内含物，然后从非web源文件中提取URL，但是大量的文集包含一个进程叫 crawling 或者 SPIDERing，他们递归的探索互联网。基本的算法是： Fetch a page

Parse it to extract all linked URLs

For all the URLs not seen before，repeat（a）-(c)

网络怕从一般开始于一些种子URLs。有些时候网络爬虫开始于一个正确连接的页面，或者一个目录就像：yahoo.com，但是因为这个原因相关的巨大的部分网络资源无法被访问到。（估计有超过20%）如果把网页看作图中的节点，把超链接看作定向的移动在这些节点之间，那么网络爬虫就变成了一个进程就像数学中的图的遍历一样。不同的遍历策略决定着先不访问哪个节点，下一个访问哪个节点。2种标准的策略是深度优先算法和广度优先算法-他们容易被实现所以在很多入门的算法课中都有教。但是，在网络上爬行并不是一个微不足道的设计习题，而是一个非常严峻的算法和系统设计挑战因为以下2点原因：

网络非常的庞大。现在，Google需要索引超过30亿的网页。很多研究都指出，

天津大学2007届本科生毕业设计（论文）

在历史上，网络每9-12个月都会增长一倍。

网络的页面改变很频繁。如果这个改变指的是任何改变，那么有40%的网页每周会改变。如果我们认为页面改变三分之一或者更多，那么有大约7%的页面每周会变。

这2个要素意味着，要获得及时的，完全的网页快照，一个搜索引擎必须访问1亿个网页每天。因此，步骤（a）必须执行大约每秒1000次，成员检测的步骤（c）必须每秒执行超过10000次，并有非常大的数据储存到主内存中。另外，网络爬虫一般使用一个分布式的构造来平行地爬行更多的网页，这使成员检测更为复杂：这是可能的成员问题只能回答了一个同行节点，而不是当地。一个非常重要的方法加速这个检测就是用cache（高速缓存），这个是把见过的URLs存入主内存中的一个（动态）子集中。这个论文最主要的成果就是仔细的研究了几种关于网络爬虫的URL缓存技术。我们考虑所有实际的算法：随机置换，静态cache，LRU，和CLOCK，和理论极限：透视cache和极大的cache。我们执行了大约1800次模拟，用不同的cache大小执行这些算法，用真实的log日志数据，获取自一个非常大的33天的网络爬行，大约执行了超过10亿次的http请求。

这个论文像这样组织的：第2部分讨论在文学著作中几种不同的爬行解决方案和什么样的cache最适合他们。第3部分介绍关于一些cache的技术和介绍了关于cache几种理论和实际算法。第4部分我们实现这些算法，在实验机制中。第5部分描述和讨论模拟的结果。第6部分是我们推荐的实际算法和数据结构关于URLcache。第7部分是结论和指导关于促进研究。 2.CRAWLING

网络爬虫的出现几乎和网络同期，而且有很多的文献描述了网络爬虫。在这个部分，我们呈现一个摘要关于这些爬虫程序，并讨论问什么大多数的网络爬虫会受益于URL cache。

网络爬虫用网络存档雇员多个爬行进程，每个一次性完成一个彻底的爬行对于64个hosts 。爬虫进程储存非本地的URLs到磁盘；在爬行的最后，一批工作将这些URLs加入到下个爬虫的每个host的种子sets中。最初的google 爬虫，实现不同的爬虫组件通过不同的进程。一个单独的URL服务器进行维护需要下载的URL的集合；爬虫程序获取的网页；索引进程提取关键字和超链接；URL解决进程将相对路径转换给绝对路径。这些不同的进程通过文件系统通信。

这个论文的中实验我们使用的meractor网络爬虫。Mercator使用了一个独立的集合，通信网络爬虫进程。每个爬虫进程都是一个有效的web服务器子集；URLs的分配基于URLs主机组件。没有责任通过TCP传送这个URL给网络爬虫，有责任把这些URLs绑在一起减少TCP开销。我们描述 mercator很多的细节在第4部分。

任何网络爬虫必须维护一个集合，装那些需要被下载的URLs。此外，不能重复地下载同一个URL，必须要个方法避免加入URLs到集合中超过一次。一般的，达到避免可以用维护一个发现URLs的集合。如果数据太多，可以存入磁盘，或者储存经常被访问的URLs。 3.CACHING

在大多数的计算机系统里面，内存是分等级的，意思是，存在2级或更多级的内存，表现出不同的空间和速度。举个例，在一个典型的工作站里，有一个非

天津大学2007届本科生毕业设计（论文）

常小但是非常快的内存，一个大，但是比较慢的RAM内存，一个非常大胆是很慢的disk内存。在一个网络环境中，也是分层的。Caching就是一种想法储存经常用到的项目从慢速内存到快速内存。 Caching术语就像下面：cache是内存用来储存同等大小的元素。一个cache有k的大小，那么可以储存k个项目.在每个时间段,cache接受到来自一个项目的请求.如果这个请求项目在这个cache中，这种情况将会引发一个碰撞并且不需要进一步的动作。另一方面，这种情况叫做丢失或者失败。如果cache没有k个项目，那个丢失的项目被加入cache。另一方面，算法必须选择驱逐一个项目来空出空间来存放那个丢失的项目，或者不加入那个丢失的项目。Caching算法的目标是最小化丢失的个数。

清楚的，cache越大，越容易避免丢失。因此，一个caching算法的性能要在看在一个给定大小的cache中的丢失率。一般的，caching成功有2个原因：

不一致的请求。一些请求比其他一些请求多。时间相关性或地方的职权范围。 3.1 无限cache(INFINITE)

这是一个理论的算法，假想这个cache的大小要大于明显的请求数。 3.2 透视cache（MIN）

超过35年以前，L?aszl?o Belady表示如果能提前知道完整的请求序列，就能剔除下一个请求最远的项目。这个理论的算法叫MIN，因为他达到了最小的数量关于丢失在任何序列中，而且能带来一个飞跃性的性能提升。 3.3 最近被用到（LRU）

LRU算法剔除最长时间没用被用到的项目。LRU的直觉是一个项目如果很久都没被用过，那么在将来它也会在很长时间里不被用到。

尽管有警告“过去的执行不能保证未来的结果”，实际上，LRU一般是非常有效的。但是，他需要维护一个关于请求的优先权队列。这个队列将会有一个时间浪费和空间浪费。 3.4 CLOCK

CLOCK是一个非常流行的接近于LRU，被发明与20世纪60年代末。一个排列标记着M0，M1，….Mk对应那些项目在一个大小为k的cache中。这个排列可以看作一个圈，第一个位置跟着最后一个位置。CLOCK控制指针对一个项目在cache中。当一个请求X到达，如果项目X在cache中，然后他的标志打开。否则，关闭标记，知道一个未标记的位置被剔除然后用X置换。 3.5 随机置换（RANDOM）

随机置换完全忽视过去。如果一个项目请求没在cache中，然后一个随机的项目将被从cache中剔除然后置换.

在大多数实际的情况下，随机替换比CLOCK要差，但并不是差很多。 3.6 静态caching（STATIC）

如果我们假设每个项目有一个确定的固定的可能性被请求，独立的先前的访问历史，然后在任何时间一个撞击在大小为k的cache里的概率最大，如果一个cache中包含那k个项目有非常大的概率被请求。

有2个问题关于这个步骤：第一，一般这个概率不能被提前知道；第二，独立的请求，虽然理论上有吸引力，是对立的地方参考目前在大多数实际情况。在我们的情况中，第一种情况可以被解决：我们可以猜想上次爬行发现的最

天津大学2007届本科生毕业设计（论文）

常用的k个URLs适合于这次的爬行的最常用的k个URLs。（也有有效的技术可以发现最常用的项目在一个流数据中。因此，一个在线的步骤可以运行的很好）当然，为了达到模拟的目的，我们可以首先忽略我们的输入，去确定那个k个最常用 URLs，然后预装这些URLs到我们做实验的cache中。第二个情况是我们决定测试STATIC的很大的原因：如果STATIC运行的很好，Sname结论是这里有很少的位置被涉及。如果STATIC运行的相对差，那么我们可以推断我们的数据显然是真实被提及的位置，连续的请求是密切相关的。 4 实验机制

4.1 Meractor 爬虫体系

一个Meractor爬虫体系有一组爬虫进程组成，一般在不同的机器上运行。每个爬虫进程都是总网络服务器的子集，然后由一些工作线程组成（一般有500个），他们负责下载和处理网页从这些服务器。

举一个例子，每个工作现场在一个系统里用4个爬行进程。

每个工作现场重复地完成以下的操作：它获得一个URL从URL边境里，一个磁盘数据结构维护被下载的URL集合；用HTTP协议下载对应的网页到一个缓冲区中；如果这个网页是HTML，提取所有的超链接。把提取出来的超链接流转换为完全链接然后运行通过URL过滤器，丢弃一些基于syntactic properties的链接。比如，它丢弃那些基于服务器联系我们，不能被爬行的链接。

URL流被送进主机Splitter里面，主机splitter用来分配URL给爬虫进程通过URL的主机名。直到大多数的超链接被关联，大部分的URL（在我们的实验中是81。5%）将被分配给本地爬虫进程；剩下的传说通过TCP给适当的爬虫进程。

本地URLs流和从爬虫中得到的URLs流都要送到复制URL消除器中（DUE）。DUE会除去那些被访问过的URLs。新的URLs会被送到URL边境中去以供以后下载。

为了避免重复URLs，DUE必须维护发现的URLs的集合。假设今天的网络包括几十亿有效的URLs，内存就需要维护这个集合是非常重要的。 Mercator可以被认为可以维护这个集合通过一个分布式的内存中的hash table（这个地方是每个爬虫进程维护URLs的子集时分配给它）；但是DUE执行（这个强制URLs成8-byte的checksums，而且用前3 —bytes来用作hash table的索引）需要大约5.2bytes 每个URl，意思就是它会用5GB的RAM每个爬虫机器来维护一个10亿个URLs的集合每台机器。这个内存需求非常不合理在很多的设置里，而且实际上，它对于我们超过了硬件的适用性在这个实验里。因此，我们用一个选择性的DUE来执行那个缓冲器引入URLs到内存中，但是保存大多的URLs（或者更好，他们的8-bytes checksum）到一个排序好的队列在磁盘中。

基于磁盘的DUE和主机Splitter都受益于URL caching。给基于磁盘的DUE加一个cache可以使它丢弃引入的URLs，发生碰撞在cache中，替代加入他们到内存缓存区中。而且有个结果是，内存缓存区要慢些，而且不频繁地和磁盘文件链接。将cache加入到一个主机Splitter中可以丢弃引入的重复的URLs代替将它们传入每个节点，这样可以减少总的网络通信。这个通信的减少非常重要在爬虫进程没有通过高速LAN连接的时候，但是被替代成球形分布式。在这样一个装置中，每个爬虫负责让web servers关掉它。 Mercator执行一个遍历通过广度优先算法在网络图上。每个爬虫进程中的线程同时执行。更重要的是，下载的行程被Mercator的 politeness policy调节，它限制每个爬虫的负载咋一

天津大学2007届本科生毕业设计（论文）

些特殊的网络服务器中。Mercator的politeness policy保证没有服务器不断同时收到多个请求；而且，它还保证下一个请求会在它的上一个请求几倍的时间内完成（通常是10倍）。这样一个 politeness policy基本在任何一个大量搜索的网络爬虫中，否则爬虫将会陷入繁重的处理中。 4.2 我们的网络爬虫

我们的网络爬虫由4个Compaq XP1000工作站组成，每个装备一个667MHz的Alpha processor，1.5GB的RAM，144GB的磁盘，和一个100Mbit/sec的以太网连接。每个机器定位于Palo Alto Internet Exchange,十分接近于Internet的backbone。

这个爬虫运行从7月12到2002年的9月3日，虽然它活跃地爬行只有33天。下载时由于不同的硬件和网络故障。爬行过程中，那4个机器完成了10.4亿的下载尝试，7.84亿成功下载。4.29亿的成功下载文档是HTML页面。这些页面包含了大约268.3亿个超链接，相当于每个页面有62.55个超链接；但是，中间的数值每个超链接只有24个，暗示平均的超链接数被一些包含很多链接的页面扩大了。早期的论文报道每个页面平均只有8个超链或者17个超链接。我们提供了3个解释关于为什么我们每个页面找到了更过的超链接。首先，我们认为Mercator并没有限制发现URLs在anchor tags，但是更好的是提取所有的tags在可能包含他们的地方。第二，我们认为他下载页面一直16MB的大小（一个设置显著地大于平常），让它可能遇到上万个的超链接页面.第三，大部分的论文报道那些每个页面中唯一的超链接。如果我们只考虑每个页面中唯一的超链接，那么平均值是47.74，而中间值为 17.

那些超链接从这些HTML中提取出来，加上大约3800万的HTTP跳转，在这个爬行中，流入到Host Splitter中。为了去测试不同caching算法的效率，我们通过Mercator的Host Splitter组件将所有的引入URLs打日志到磁盘中.四个爬虫中的Host Splitter接收并日志记录了总共268.6亿个URLs。

完成爬行后，我们浓缩了Host Splitter日志文件。我们把每个URL hash化为一个64—bit的识别码。我们确信没有故意的碰撞在排序最初的URL日志文档，而且计算了唯一的URLs个数。然后我们把这些唯一的URL 数和唯一的识别码数比较，我们决定用一个内存hash table在一个内存很大的机器里。数据减少的过程，大小距离51GB到57GB，而且包含64亿和71亿个URLs。为了发现caching的效率，在一个分布式的爬虫程序里的交互的进程通信，我们也获得了一个日志文件，记录那些URLs被传给每个爬虫。这个日志文件包含49.2亿个URLs，大约相当于全部URLs的19.5%。我们也浓缩了这个日志文件用同样的方式。然后我们会用这个浓缩的日志文件到我们的模拟中。

天津大学2007届本科生毕业设计（论文）

毕业设计（论文）原创性声明和使用授权说明

原创性声明

本人郑重承诺：所呈交的毕业设计（论文），是我个人在指导教师的指导下进行的研究工作及取得的成果。尽我所知，除文中特别加以标注和致谢的地方外，不包含其他人或组织已经发表或公布过的研究成果，也不包含我为获得及其它教育机构的学位或学历而使用过的材料。对本研究提供过帮助和做出过贡献的个人或集体，均已在文中作了明确的说明并表示了谢意。

作者签名：日期：指导教师签名：日期：

使用授权说明

本人完全了解大学关于收集、保存、使用毕业设计（论文）的规定，即：按照学校要求提交毕业设计（论文）的印刷本和电子版本；学校有权保存毕业设计（论文）的印刷本和电子版，并提供目录检索与阅览服务；学校可以采用影印、缩印、数字化或其它复制手段保存论文；在不以赢利为目的前提下，学校可以公布论文的部分或全部内容。

天津大学2007届本科生毕业设计（论文）

作者签名：日期：

天津大学2007届本科生毕业设计（论文）

学位论文原创性声明

本人郑重声明：所呈交的论文是本人在导师的指导下独立进行研究所取得的研究成果。除了文中特别加以标注引用的内容外，本论文不包含任何其他个人或集体已经发表或撰写的成果作品。对本文的研究做出重要贡献的个人和集体，均已在文中以明确方式标明。本人完全意识到本声明的法律后果由本人承担。

作者签名：

日期：年月日

学位论文版权使用授权书

本学位论文作者完全了解学校有关保留、使用学位论文的规定，同意学校保留并向国家有关部门或机构送交论文的复印件和电子版，允许论文被查阅和借阅。本人授权大学可以将本学位论文的全部或部分内容编入有关数据库进行检索，可以采用影印、缩印或扫描等复制手段保存和汇编本学位论文。

涉密论文按学校规定处理。

作者签名：

日期：年月日

导师签名：日期：年月日

天津大学2007届本科生毕业设计（论文）

独创声明

本声明的法律后果由本人承担。

作者签名: 二〇一〇年九月二十日

毕业设计（论文）使用授权声明

本人完全了解**学院关于收集、保存、使用毕业设计（论文）的规定。

（保密论文在解密后遵守此规定）

作者签名: 二〇一〇年九月二十日

天津大学2007届本科生毕业设计（论文）

基本要求：写毕业论文主要目的是培养学生综合运用所学知识和技能，理论联系实际，独立分析，解决实际问题的能力，使学生得到从事本专业工作和进行相关的基本训练。毕业论文应反映出作者能够准确地掌握所学的专业基础知识，基本学会综合运用所学知识进行科学研究的方法，对所研究的题目有一定的心得体会，论文题目的范围不宜过宽，一般选择本学科某一重要问题的一个侧面。

毕业论文的基本教学要求是：

1、培养学生综合运用、巩固与扩展所学的基础理论和专业知识，培养学生独立分析、解决实际问题能力、培养学生处理数据和信息的能力。2、培养学生正确的理论联系实际的工作作风，严肃认真的科学态度。3、培养学生进行社会调查研究；文献资料收集、阅读和整理、使用；提出论点、综合论证、总结写作等基本技能。

毕业论文是毕业生总结性的独立作业，是学生运用在校学习的基本知识和基础理论，去分析、解决一两个实际问题的实践锻炼过程，也是学生在校学习期间学习成果的综合性总结，是整个教学活动中不可缺少的重要环节。撰写毕业论文对于培养学生初步的科学研究能力，提高其综合运用所学知识分析问题、解决问题能力有着重要意义。毕业论文在进行编写的过程中，需要经过开题报告、论文编写、论文上交评定、论文答辩以及论文评分五个过程，其中开题报告是论文进行的最重要的一个过程，也是论文能否进行的一个重要指标。

撰写意义：1.撰写毕业论文是检验学生在校学习成果的重要措施，也是提高教学质量的重要环节。大学生在毕业前都必须完成毕业

天津大学2007届本科生毕业设计（论文）

论文的撰写任务。申请学位必须提交相应的学位论文，经答辩通过后，方可取得学位。可以这么说，毕业论文是结束大学学习生活走向社会的一个中介和桥梁。毕业论文是大学生才华的第一次显露，是向祖国和人民所交的一份有份量的答卷，是投身社会主义现代化建设事业的报到书。一篇毕业论文虽然不能全面地反映出一个人的才华，也不一定能对社会直接带来巨大的效益，对专业产生开拓性的影响。但是，实践证明，撰写毕业论文是提高教学质量的重要环节，是保证出好人才的重要措施。

2.通过撰写毕业论文，提高写作水平是干部队伍“四化”建设的需要。党中央要求，为了适应现代化建设的需要，领导班子成员应当逐步实现“革命化、年轻化、知识化、专业化”。这个“四化”的要求，也包含了对干部写作能力和写作水平的要求。

3.提高大学生的写作水平是社会主义物质文明和精神文明建设的需要。在新的历史时期，无论是提高全族的科学文化水平，掌握现代科技知识和科学管理方法，还是培养社会主义新人，都要求我们的干部具有较高的写作能力。在经济建设中，作为领导人员和机关的办事人员，要写指示、通知、总结、调查报告等应用文；要写说明书、广告、解说词等说明文；还要写科学论文、经济评论等议论文。在当今信息社会中，信息对于加快经济发展速度，取得良好的经济效益发挥着愈来愈大的作用。写作是以语言文字为信号，是传达信息的方式。信息的来源、信息的收集、信息的储存、整理、传播等等都离不开写作。

天津大学2007届本科生毕业设计（论文）

论文种类：毕业论文是学术论文的一种形式，为了进一步探讨和掌握毕业论文的写作规律和特点，需要对毕业论文进行分类。由于毕业论文本身的内容和性质不同，研究领域、对象、方法、表现方式不同，因此，毕业论文就有不同的分类方法。

按内容性质和研究方法的不同可以把毕业论文分为理论性论文、实验性论文、描述性论文和设计性论文。后三种论文主要是理工科大学生可以选择的论文形式，这里不作介绍。文科大学生一般写的是理论性论文。理论性论文具体又可分成两种：一种是以纯粹的抽象理论为研究对象，研究方法是严密的理论推导和数学运算，有的也涉及实验与观测，用以验证论点的正确性。另一种是以对客观事物和现象的调查、考察所得观测资料以及有关文献资料数据为研究对象，研究方法是对有关资料进行分析、综合、概括、抽象，通过归纳、演绎、类比，提出某种新的理论和新的见解。

按议论的性质不同可以把毕业论文分为立论文和驳论文。立论性的毕业论文是指从正面阐述论证自己的观点和主张。一篇论文侧重于以立论为主，就属于立论性论文。立论文要求论点鲜明，论据充分，论证严密，以理和事实服人。驳论性毕业论文是指通过反驳别人的论点来树立自己的论点和主张。如果毕业论文侧重于以驳论为主，批驳某些错误的观点、见解、理论，就属于驳论性毕业论文。驳论文除按立论文对论点、论据、论证的要求以外，还要求针锋相对，据理力争。按研究问题的大小不同可以把毕业论文分为宏观论文和微观论文。凡届国家全局性、带有普遍性并对局部工作有一定指导意义的论文，称

天津大学2007届本科生毕业设计（论文）

为宏观论文。它研究的面比较宽广，具有较大范围的影响。反之，研究局部性、具体问题的论文，是微观论文。它对具体工作有指导意义，影响的面窄一些。

另外还有一种综合型的分类方法，即把毕业论文分为专题型、论辩型、综述型和综合型四大类：

1．专题型论文。这是分析前人研究成果的基础上，以直接论述的形式发表见解，从正面提出某学科中某一学术问题的一种论文。如本书第十二章例文中的《浅析领导者突出工作重点的方法与艺术》一文，从正面论述了突出重点的工作方法的意义、方法和原则，它表明了作者对突出工作重点方法的肯定和理解。2．论辩型论文。这是针对他人在某学科中某一学术问题的见解，凭借充分的论据，着重揭露其不足或错误之处，通过论辩形式来发表见解的一种论文。3．综述型论文。这是在归纳、总结前人或今人对某学科中某一学术问题已有研究成果的基础上，加以介绍或评论，从而发表自己见解的一种论文。4．综合型论文。这是一种将综述型和论辩型两种形式有机结合起来写成的一种论文。如《关于中国民族关系史上的几个问题》一文既介绍了研究民族关系史的现状，又提出了几个值得研究的问题。因此，它是一篇综合型的论文。

写作步骤：毕业论文是高等教育自学考试本科专业应考者完成本科阶段学业的最后一个环节，它是应考者的总结性独立作业，目的在于总结学习专业的成果，培养综合运用所学知识解决实际问题的能力。从文体而言，它也是对某一专业领域的现实问题或理论问题

天津大学2007届本科生毕业设计（论文）

进行科学研究探索的具有一定意义的论说文。完成毕业论文的撰写可以分两个步骤，即选择课题和研究课题。

首先是选择课题。选题是论文撰写成败的关键。因为，选题是毕业论文撰写的第一步，它实际上就是确定“写什么”的问题，亦即确定科学研究的方向。如果“写什么”不明确，“怎么写”就无从谈起。

教育部自学考试办公室有关对毕业论文选题的途径和要求是“为鼓励理论与工作实践结合，应考者可结合本单位或本人从事的工作提出论文题目，报主考学校审查同意后确立。也可由主考学校公布论文题目，由应考者选择。毕业论文的总体要求应与普通全日制高等学校相一致，做到通过论文写作和答辩考核，检验应考者综合运用专业知识的能力”。但不管考生是自己任意选择课题，还是在主考院校公布的指定课题中选择课题，都要坚持选择有科学价值和现实意义的、切实可行的课题。选好课题是毕业论文成功的一半。

第一、要坚持选择有科学价值和现实意义的课题。科学研究的目的是为了更好地认识世界、改造世界，以推动社会的不断进步和发展。因此，毕业论文的选题，必须紧密结合社会主义物质文明和精神文明建设的需要，以促进科学事业发展和解决现实存在问题作为出发点和落脚点。选题要符合科学研究的正确方向，要具有新颖性，有创新、有理论价值和现实的指导意义或推动作用，一项毫无意义的研究，即使花很大的精力，表达再完善，也将没有丝毫价值。具体地说，考生可从以下三个方面来选题。首先，要从现实的弊端中选题，学习了专业知识，不能仅停留在书本上和理论上，还要下一番功夫，理论

天津大学2007届本科生毕业设计（论文）

联系实际，用已掌握的专业知识，去寻找和解决工作实践中急待解决的问题。其次，要从寻找科学研究的空白处和边缘领域中选题，科学研究。还有许多没有被开垦的处女地，还有许多缺陷和空白，这些都需要填补。应考者应有独特的眼光和超前的意识去思索，去发现，去研究。最后，要从寻找前人研究的不足处和错误处选题，在前人已提出来的研究课题中，许多虽已有初步的研究成果，但随着社会的不断发展，还有待于丰富、完整和发展，这种补充性或纠正性的研究课题，也是有科学价值和现实指导意义的。

第二、要根据自己的能力选择切实可行的课题。毕业论文的写作是一种创造性劳动，不但要有考生个人的见解和主张，同时还需要具备一定的客观条件。由于考生个人的主观、客观条件都是各不相同的，因此在选题时，还应结合自己的特长、兴趣及所具备的客观条件来选题。具体地说，考生可从以下三个方面来综合考虑。首先，要有充足的资料来源。“巧妇难为无米之炊”，在缺少资料的情况下，是很难写出高质量的论文的。选择一个具有丰富资料来源的课题，对课题深入研究与开展很有帮助。其次，要有浓厚的研究兴趣，选择自己感兴趣的课题，可以激发自己研究的热情，调动自己的主动性和积极性，能够以专心、细心、恒心和耐心的积极心态去完成。最后，要能结合发挥自己的业务专长，每个考生无论能力水平高低，工作岗位如何，都有自己的业务专长，选择那些能结合自己工作、发挥自己业务专长的课题，对顺利完成课题的研究大有益处。

天津大学2007届本科生毕业设计（论文）

致谢

这次论文的完成，不止是我自己的努力，同时也有老师的指导，同学的帮助，以及那些无私奉献的前辈，正所谓你知道的越多的时候你才发现你知道的越少，通过这次论文，我想我成长了很多，不只是磨练了我的知识厚度，也使我更加确定了我今后的目标：为今后的计算机事业奋斗。在此我要感谢我的指导老师——***老师，感谢您的指导，才让我有了今天这篇论文，您不仅是我的论文导师，也是我人生的导师，谢谢您！我还要感谢我的同学，四年的相处，虽然我未必记得住每分每秒，但是我记得每一个有你们的精彩瞬间，我相信通过大学的历练，我们都已经长大，变成一个有担当，有能力的新时代青年，感谢你们的陪伴，感谢有你们，这篇论文也有你们的功劳，我想毕业不是我们的相处的结束，它是我们更好相处的开头，祝福你们！我也要感谢父母，这是他们给我的，所有的一切；感谢母校，尽管您不以我为荣，但我一直会以我是一名农大人为荣。

通过这次毕业设计，我学习了很多新知识，也对很多以前的东西有了更深的记忆与理解。漫漫求学路，过程很快乐。我要感谢信息与管理科学学院的老师，我从他们那里学到了许多珍贵的知识和做人处事的道理，以及科学严谨的学术态度，令我受益良多。同时还要感谢学院给了我一个可以认真学习，天天向上的学习环境和机会。

即将结束*大学习生活，我感谢****大学提供了一次在**大接受教育的机会，感谢院校老师的无私教导。感谢各位老师审阅我的论文。

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文