[Web Crawler][Python] Web Crawler (2): A NetEase Weibo Crawler Development Example (Source Code Included)

Original

2020-07-03 10:57

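To learn urllib2, I first recommend a tutorial written by an author of "IronPython in Action". It has many concise examples and detailed explanations of how things work: http://www.voidspace.org.uk/python/articles/urllib2.shtml

The most basic crawler mainly relies on two functions: urllib2.urlopen() and re.compile().

I. A simple page-fetching example

Let's start with the simplest example. Using the Baidu Music page, we request the page and get its HTML back as a string: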

# -*- coding: utf8 -*-
import urllib2
response = urllib2.urlopen('http://music.baidu.com')
html = response.read()
print html

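This example is mainly about urllib2.urlopen(). It uses a request object to represent the HTTP request being sent (the URL scheme does not have to be http; it can also be ftp: or file:). HTTP follows a request/response model: the client sends a request and the server replies with a response.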
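To illustrate the point about schemes, here is a minimal sketch that reads a local file through a file: URL using the same urlopen() call. The temporary file and its contents are made up for the demonstration, and the try/except import is an assumption added so the snippet runs on both Python 2 (urllib2) and Python 3 (urllib.request).

```python
import os
import tempfile

try:
    from urllib2 import urlopen            # Python 2, as used in this article
    from urllib import pathname2url
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3 equivalent

# Create a small local file that stands in for a remote page (made-up content).
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as f:
    f.write(b'<html>hello</html>')

# Build a file: URL and read it back through urlopen(), just like an http URL.
file_url = 'file:' + pathname2url(path)
response = urlopen(file_url)
html = response.read()
response.close()
os.remove(path)

print(html)
```

Nothing here touches the network, so it is a convenient way to experiment with urlopen() and .read() before pointing the code at a real site.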

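Given the address you ask for, urllib2 creates a Request object, passes it to urlopen(), and returns the result as a response object whose content can be read with .read(). The program above can therefore also be written like this: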

# -*- coding: utf8 -*-
import urllib2
request = urllib2.Request('http://music.baidu.com')
response = urllib2.urlopen(request)
html = response.read()
print html
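One reason to build the Request object yourself is that it can be inspected or adjusted before it is sent. The sketch below only constructs a request and reads back its fields; it sends nothing. The User-Agent value is a made-up example, and the try/except import is an assumption so the snippet also runs on Python 3.

```python
try:
    from urllib2 import Request          # Python 2, as used in this article
except ImportError:
    from urllib.request import Request   # Python 3 equivalent

# Build the request without sending it.
request = Request('http://music.baidu.com')

# Attach a header; the value here is purely illustrative.
request.add_header('User-Agent', 'Mozilla/5.0 (example crawler)')

print(request.get_full_url())
# Header names are stored capitalized, so the lookup key is 'User-agent'.
print(request.get_header('User-agent'))
```

Passing this object to urlopen() would then send the request with the extra header attached.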

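II. NetEase Weibo crawler example

Continuing with the earlier Weibo crawler, this warm-up fetches every page under one Sina Weibo topic and saves each page as an HTML file in the current project directory (the NetEase version follows in section III). url=http://s.weibo.com/wb/蘋果手機&nodup=1&page=20

The source code is as follows: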

# -*- coding:utf-8 -*-
'''
#=====================================================
#     FileName: sina_html.py
#         Desc: download html pages from sina_weibo and save to local files 
#       Author: DianaCody
#      Version: 1.0
#        Since: 2014-09-27 15:20:21
#=====================================================
'''

import string, urllib2

# sina tweet's url = 'http://s.weibo.com/wb/topic&nodup=1&page=20'
def writeHtml(url, start_page, end_page):
    # download pages start_page..end_page (inclusive), one file per page
    for i in range(start_page, end_page+1):
        FileName = string.zfill(i, 3)    # zero-padded page number, e.g. 001
        HtmlPath = FileName + '.html'
        print 'Downloading No.' + str(i) + ' page and save as ' + HtmlPath + '...'
        html = urllib2.urlopen(url + str(i)).read()
        # write the raw bytes; 'with' closes the file even if writing fails
        with open(HtmlPath, 'wb') as f:
            f.write(html)

def crawler():
    url = 'http://s.weibo.com/wb/iPhone&nodup=1&page='
    s_page = 1
    e_page = 10
    print 'Now begin to download html pages...'
    writeHtml(url, s_page, e_page)

if __name__ == '__main__':
    crawler()
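The page URL and file name logic in writeHtml() can be checked without any network access. The sketch below pulls that logic into a helper; page_targets is a hypothetical name introduced here, not part of the original program, and it uses str(i).zfill(3) instead of string.zfill so that it runs on both Python 2 and Python 3.

```python
def page_targets(base_url, start_page, end_page):
    """Return (page_url, file_name) pairs for an inclusive page range."""
    targets = []
    for i in range(start_page, end_page + 1):
        # Zero-pad the page number to three digits: 1 -> '001'.
        file_name = str(i).zfill(3) + '.html'
        targets.append((base_url + str(i), file_name))
    return targets

targets = page_targets('http://s.weibo.com/wb/iPhone&nodup=1&page=', 1, 3)
for page_url, file_name in targets:
    print(page_url + ' -> ' + file_name)
```

Printing the pairs first makes it easy to confirm the range is inclusive and the file names sort correctly before any download starts.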

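When the program finishes, the HTML pages are stored in the current project directory. Refresh the Package Explorer on the left and the downloaded pages appear. Ten pages were fetched here; open any one of them to inspect its HTML source.

What remains is parsing the pages with regular expressions to extract the fields, mainly using Python's re module.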
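As a small sketch of that extraction step, assume each post appears in the page source as a "content": "..." field, which is the form the full program in the next section searches for. The sample string below is made up, and the capture group is an addition here so that only the text between the quotes is returned.

```python
import re

# Made-up fragment that imitates the "content" fields in a page source.
sample = u'"content": "first post", "id": 1, "content": "line one\nline two", "id": 2'

# Lazy match up to the closing '",'; re.DOTALL lets '.' span line breaks.
pattern = re.compile(r'"content":\s"(.*?)",', re.DOTALL)

contents = pattern.findall(sample)
for content in contents:
    print(content)
```

Without re.DOTALL the second post would be missed, because '.' does not match a newline by default.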

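III. NetEase Weibo crawler software (Python version)

The sections above only show the basic fetching process. This version adds regex parsing to extract the Weibo post text, handling of Chinese character encodings, and so on. The crawler is given below (it has also been converted into an executable exe program).

Complete source code: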

# -*- coding:utf-8 -*-
'''
#=====================================================
#     FileName: tweet163_crawler.py
#         Desc: download html pages from 163 tweet and save to local files 
#       Author: DianaCody
#      Version: 1.0
#        Since: 2014-09-27 15:20:21
#=====================================================
'''

import string
import urllib2
import re
import chardet

# sina tweet's url = 'http://s.weibo.com/wb/topic&nodup=1&page=20'
# 163 tweet's url = 'http://t.163.com/tag/topic&nodup=1&page=20'
def writeHtml(url, start_page, end_page):
    # download pages start_page..end_page (inclusive), one file per page
    for i in range(start_page, end_page+1):
        FileName = string.zfill(i, 3)
        HtmlPath = FileName + '.html'
        print 'Downloading No.' + str(i) + ' page and save as ' + HtmlPath + '...'
        html = urllib2.urlopen(url + str(i)).read()
        # write the raw bytes; 'with' closes the file even if writing fails
        with open(HtmlPath, 'wb') as f:
            f.write(html)

def crawler(key, s_page, e_page):
    url = 'http://t.163.com/tag/' + key + '&nodup=1&page='
    print 'Now begin to download html pages...'
    writeHtml(url, s_page, e_page)

def regex(start_page, end_page):
    # match each '"content": "...",' field; compile once, outside the loop
    pattern = re.compile(r'"content":\s".*?",', re.DOTALL)
    for i in range(start_page, end_page+1):
        # use the same zero-padded file names that writeHtml() produced
        HtmlPath = string.zfill(i, 3) + '.html'
        with open(HtmlPath, 'rb') as f:
            page = f.read()

        # set encode format: utf-8 if detected, otherwise treat as gb2312
        charset = chardet.detect(page)['encoding']
        if charset and charset.lower() == 'utf-8':
            unicodePage = page.decode('utf-8', 'ignore')
        else:
            unicodePage = page.decode('gb2312', 'ignore')

        contents = pattern.findall(unicodePage)
        for content in contents:
            print content

if __name__ == '__main__':
    key = raw_input('please input your search key: \n')
    begin_page = int(raw_input('input begin page:\n'))
    end_page = int(raw_input('input end page:\n'))
    crawler(key, begin_page, end_page)
    print 'Crawler finished... \n'
    print 'The contents are: '
    # parse exactly the pages that were just downloaded
    regex(begin_page, end_page)
    raw_input()    # keep the console window open until Enter is pressed

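The program accepts a custom search keyword and a page range, then extracts the Weibo posts found on the pages for that keyword.

Custom search keyword
Custom number of pages to crawl
No login required; crawls the current day's Weibo posts and stores them in local files
Parses the Weibo pages to extract the post text
Packaged as an exe, so it runs without a Python environment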
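The encoding step in the program can also be sketched without chardet. This is a different, simpler technique than the detection used above: try utf-8 first and fall back to gbk, which is assumed here to cover the two encodings these pages use. The helper name to_unicode and the sample text are made up for illustration.

```python
def to_unicode(raw, encodings=('utf-8', 'gbk')):
    """Decode bytes by trying each encoding in order (no detection library)."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and drop undecodable bytes.
    return raw.decode(encodings[0], 'ignore')

text = u'\u4e2d\u6587'                 # sample two-character Chinese string
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# Both byte strings should decode back to the same text.
print(to_unicode(utf8_bytes) == text)
print(to_unicode(gbk_bytes) == text)
```

Trying utf-8 first is the safer order, since gbk-encoded Chinese text is rarely valid utf-8, while the reverse mistake would produce garbled text silently.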

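1. Features

Crawls Weibo posts in real time. Data source: http://t.163.com/tag/searchword/

2. Demo

1. Enter a custom keyword and the number of pages to fetch

2. The crawl results show the text of each post

3. Download

The software is on GitHub: https://github.com/DianaCody/Spider_python/

Release directory: https://github.com/DianaCody/Spider_python/tree/master/Tweet163_Crawler/release

The exe can also be downloaded here: http://download.csdn.net/detail/dianacody/8001441

Original article; please credit the source when reposting: http://blog.csdn.net/dianacody/article/details/39741413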
