Translation of My Thesis Abstract

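I haven't translated anything for quite a few days; after my thesis defense I let myself slack off. Here is my translation of the thesis abstract. I'm not sure how well it turned out...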
------------------------------------------------------------------------------------------------------------------------

随着互联网的迅猛发展,通用搜索引擎逐步显现其局限性,对此,定向抓取相关网页资源的聚焦爬虫应运而生。聚焦爬虫是一个自动下载网页的程序,它根据既定的抓取目标,有选择的访问万维网上的网页与相关的链接,获取所需要的信息。与通用爬虫不同,聚焦爬虫并不追求大的覆盖,而将目标定为抓取与某一特定主题内容相关的网页,为面向主题的用户查询准备数据资源。

Heritrix由于其灵活的模块式体系结构设计,为开发者扩展相关部件定制符合特定需求的聚焦爬虫提供了基础。

开发垂直搜索引擎的时候,为了方便全文检索工具对数据资料建立索引,需要进一步处理网络爬虫获取的数据,特别是网页数据。而HTMLParser提供了提取文本信息的API,使我们摆脱繁琐的正则匹配过程。

本文主要介绍如何基于开源爬虫Heritrix进行扩展定制面向竹藤领域的网络爬虫,利用HTMLParser包对爬取的结果进行再次解析处理,并采用LAMP+jQuery 技术开发一个简单的竹藤数据搜索引擎。


关键字    竹藤,网络爬虫, Heritrix, HTMLParser, LAMP


ABSTRACT

With the rapid development of the Internet, general-purpose search engines are gradually showing their limitations. In response, focused crawlers, which fetch only the web resources relevant to a given target, have emerged. A focused crawler is a program that downloads web pages automatically: guided by a predefined crawl target, it selectively visits pages and related links on the World Wide Web and gathers the information required. Unlike a general-purpose crawler, a focused crawler does not aim for broad coverage; instead, it targets pages related to a specific topic, preparing data resources for topic-oriented user queries.

Thanks to its flexible, modular architecture, Heritrix gives developers a foundation for extending its components and building a focused crawler tailored to specific needs.

When developing a vertical search engine, the data acquired by the web crawler, especially web pages, must be processed further so that a full-text search tool can index it conveniently. HTMLParser provides APIs for extracting text, which frees us from the tedious process of regular-expression matching.

This paper mainly describes how to extend the open-source crawler Heritrix into a web crawler tailored to the bamboo and rattan domain, how to use the HTMLParser package to re-parse the crawled results, and how to build a simple bamboo and rattan data search engine with LAMP and jQuery.


Keywords   bamboo and rattan, web crawler, Heritrix, HTMLParser, LAMP
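
For readers who want something concrete, the idea in the abstract (extract the text and links from a crawled page without regular expressions, then keep the page only if it is on topic) can be sketched roughly as below. This is only an illustration under my own assumptions, not the thesis code: the thesis uses Heritrix and the Java HTMLParser package, while this sketch uses Python's standard-library `html.parser` module (the shared class name `HTMLParser` is a coincidence). The names `PageExtractor`, `is_on_topic`, `process_page`, the keyword list and the `min_hits` threshold are all hypothetical.

```python
from html.parser import HTMLParser  # Python stdlib, NOT the Java HTMLParser package
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())

    @property
    def text(self):
        return " ".join(self.text_parts)


def is_on_topic(text, keywords, min_hits=2):
    """Hypothetical relevance test: count keyword occurrences in the text."""
    lowered = text.lower()
    hits = sum(lowered.count(k.lower()) for k in keywords)
    return hits >= min_hits


def process_page(url, html, keywords):
    """Return (text, links) for an on-topic page, or (None, []) otherwise."""
    extractor = PageExtractor(url)
    extractor.feed(html)
    extractor.close()
    if is_on_topic(extractor.text, keywords):
        return extractor.text, extractor.links
    return None, []
```

A focused crawler would queue the returned links only for on-topic pages and pass the returned text on to the indexer. A real system would need a much better relevance measure than plain keyword counting.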
