How to Write Your Own Web Crawler

Original

2020-02-24 12:48

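See the Wikipedia entry on web crawlers. A web crawler, also called a web spider or web robot, is a program that automatically fetches pages from the Web. The technique is commonly used to check whether all the links on your site are still valid. A more advanced use is to save the relevant data from each page, which is the starting point for a search engine.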

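Technically, fetching web pages is not particularly hard. The hard part is analyzing and organizing the pages, which takes a program with some intelligence and a great deal of mathematical computation. The overall flow is simple:

[Figure: crawler flow diagram]

Here we only cover how to write a program that fetches pages.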

First, let's see how to open a web page from the command line.

telnet somesite.com 80
GET /index.html HTTP/1.0
(press Enter twice)
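The same exchange can be sketched in Ruby with a plain TCP socket. This is a minimal sketch, not a full HTTP client: `build_request`, `fetch_raw` and `split_response` are names made up for this example, and the `Host` header is an addition that the telnet session does not send (servers that host several sites on one address usually need it).

```ruby
require "socket"

# Build the same request text that was typed into telnet.
# HTTP uses CRLF line endings; the empty line ends the headers,
# which is why Enter is pressed twice.
def build_request(path, host)
  "GET #{path} HTTP/1.0\r\nHost: #{host}\r\n\r\n"
end

# Open a TCP connection, send the request and read until the server
# closes the connection (which is how an HTTP/1.0 response normally ends).
def fetch_raw(host, path = "/index.html", port = 80)
  socket = TCPSocket.new(host, port)
  socket.write(build_request(path, host))
  socket.read
ensure
  socket&.close
end

# Split a raw response into its header block and its body.
def split_response(raw)
  headers, body = raw.split("\r\n\r\n", 2)
  [headers, body.to_s]
end
```

Calling `fetch_raw("somesite.com")` and passing the result to `split_response` would give the status line and headers in one string and the HTML in the other.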

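Using telnet makes the point that underneath this is just a socket speaking the HTTP protocol, for example using the GET method to retrieve a page. After that you need to parse the HTML, and perhaps even JavaScript, because more and more pages use Ajax and much of their content is loaded that way. Simply parsing the HTML file will therefore be far from enough in the future. Here, though, we only show a very simple fetch, simple enough to serve only as an example. The pseudocode looks like this: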

fetch the page
for each link in all links on the current page
{
        if (this link is one we want && this link has never been visited)
        {
                process this link
                mark this link as visited
        }
}
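The loop above can be sketched in runnable Ruby using only the standard library. This is an illustration under some assumptions: `extract_links` and `crawl` are hypothetical names, links are pulled out with a naive regular expression rather than a real HTML parser, and `fetch` is passed in as a callable so the loop can be tried on in-memory pages without touching the network.

```ruby
require "set"
require "uri"

# Naive link extraction: collect href="..." values with a regex and resolve
# them against the URL of the page they were found on. Fragments (#...) are
# dropped. A real crawler should use an HTML parser instead.
def extract_links(html, base_url)
  hrefs = html.scan(/href\s*=\s*["']([^"'#]+)/i).flatten
  links = hrefs.map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::InvalidURIError
      nil                         # ignore hrefs that are not valid URIs
    end
  end
  links.compact
end

# The loop from the pseudocode. `fetch` takes a URL and returns HTML or nil;
# `wanted` decides whether a link is worth following; `limit` caps the
# number of pages so the sketch always terminates.
def crawl(start_url, fetch:, wanted:, limit: 100)
  visited = Set.new
  queue = [start_url]
  until queue.empty? || visited.size >= limit
    url = queue.shift
    next if visited.include?(url)
    visited << url                # mark as visited before fetching
    html = fetch.call(url)
    next if html.nil?
    extract_links(html, url).each do |link|
      queue << link if wanted.call(link) && !visited.include?(link)
    end
  end
  visited.to_a
end
```

Passing `fetch: ->(url) { pages[url] }`, where `pages` is a hash from URL to HTML, lets you trace the loop by hand before pointing it at a real site.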

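# crawler.rb: a simple crawler built on the Mechanize gem
require "rubygems"
require "mechanize"

# Mechanize 1.0 and later name the class Mechanize (formerly WWW::Mechanize).
class Crawler < Mechanize

  attr_accessor :callback
  INDEX = 0
  DOWNLOAD = 1
  PASS = 2

  def initialize
    super
    init
    @first = true
    self.user_agent_alias = "Windows IE 6"
  end

  # Reset the list of visited links.
  def init
    @visited = []
  end

  # Mark a link as visited without fetching it.
  def remember(link)
    @visited << link
  end

  # Fetch a page and crawl every link on it that has not been visited yet.
  def perform_index(link)
    page = get(link)
    if page.is_a?(Mechanize::Page)
      links = page.links.map { |l| l.href } - @visited
      links.each do |alink|
        start(alink)
      end
    end
  end

  # Ask the callback what to do with a link, then index, download or skip it.
  def start(link)
    return if link.nil?
    if !@visited.include?(link)
      action = @callback.call(link)
      if @first
        @first = false
        action = INDEX   # the start page is always indexed
      end
      case action
      when INDEX
        perform_index(link)
      when DOWNLOAD
        file = get(link)
        file.save_as(File.basename(link)) if file
      when PASS
        puts "passing on #{link}"
      end
    end
  end

  # Fetch a URL and record it as visited; returns nil if the request fails.
  def get(site)
    begin
      puts "getting #{site}"
      @visited << site
      super(site)
    rescue
      puts "error getting #{site}"
      nil
    end
  end
end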
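One thing to keep in mind is that the class remembers each href exactly as it appears in the page, so the same page reached through a relative link, an absolute link, or a link with a `#fragment` would count as three different entries. A minimal sketch of normalizing links with Ruby's standard `uri` library follows; `normalize_link` is a hypothetical helper for illustration and is not part of Mechanize.

```ruby
require "uri"

# Hypothetical helper: turn an href found on `page_url` into an absolute
# http(s) URL without a fragment, so the same page is remembered only once.
# Returns nil for empty hrefs, invalid URIs and non-HTTP schemes (mailto: etc).
def normalize_link(page_url, href)
  return nil if href.nil? || href.strip.empty?
  uri = URI.join(page_url, href.strip)
  return nil unless uri.is_a?(URI::HTTP)   # URI::HTTPS is a subclass of URI::HTTP
  uri.fragment = nil                       # "#section" does not change the page fetched
  uri.to_s
rescue URI::InvalidURIError
  nil
end
```

A crawler could apply such a helper to every href before comparing it with the visited list.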

The Crawler class needs little explanation; try it out for yourself. Here is how to use it:

require_relative "crawler"

x = Crawler.new
callback = lambda do |link|
  if link =~ /\.(zip|rar|gz|pdf|doc)/
    x.remember(link)
    return Crawler::PASS
  elsif link =~ /\.(jpg|jpeg)/
    return Crawler::DOWNLOAD
  end
  return Crawler::INDEX
end

x.callback = callback
x.start("http://somesite.com")
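The callback decides what to do with a link by matching a file extension anywhere in the URL, so a link such as `http://somesite.com/my.docs/page` would be skipped because it contains `.doc`. One possible refinement, sketched here as an assumption rather than as part of the original code, is to look only at the extension of the URL path. `classify` is a hypothetical name, and it returns symbols instead of the `Crawler` constants so that it can be tried on its own.

```ruby
require "uri"

# Return :pass, :download or :index for a link, judging only by the file
# extension of the URL path. Query strings and letter case are ignored.
def classify(link)
  ext = File.extname(URI.parse(link).path.to_s).downcase
  case ext
  when ".zip", ".rar", ".gz", ".pdf", ".doc" then :pass
  when ".jpg", ".jpeg" then :download
  else :index
  end
rescue URI::InvalidURIError
  :pass   # skip links that cannot be parsed at all
end
```

Inside the lambda, the result could then be mapped to `Crawler::PASS`, `Crawler::DOWNLOAD` or `Crawler::INDEX`.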

Below are some open source projects related to web crawlers:

arachnode.net is a .NET crawler written in C# using SQL 2005 and Lucene and is released under the GNU General Public License.
DataparkSearch is a crawler and search engine released under the GNU General Public License.
GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
GRUB is an open source distributed search crawler that Wikia Search (http://wikiasearch.com) uses to crawl the web.
Heritrix is the Internet Archive’s archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
ht://Dig includes a Web crawler in its indexing engine.
HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl Web sites based on
Original article: http://coolshell.cn/articles/27.html#more-27
