爬蟲爬取糗事百科

    在靜覓網站學了一下爬蟲,自己也按照作者的思路寫一下糗事百科的爬取。當中也學習了BeautifulSoap,也用它來練習替代正則表達式,發現還是原來的正則表達式簡潔,所以還是貼上用正則表達式寫的爬蟲吧。

# -*- coding:utf-8 -*-
import urllib2
import re


class QIUSHI(object):
    def __init__(self):
        self.page = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}
        self.stories = []
        self.enable = False

    def get_page(self, page):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(page)
            request = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(request)
            page_code = response.read().decode('utf-8')
            return page_code
        except urllib2.URLError, e:
            if hasattr(e, "code"):
                print e.code
            if hasattr(e, "reason"):
                print e.reason
            return None

    def get_stories(self, page):
        page_stories = []
        page_content = self.get_page(page)
        if not page_content:
            print u'頁面加載失敗...'
            return None
        pattern = re.compile(r'<div.*?author.*?<h2>(.*?)</h2>.*?<div.*?>' +
                             '(.*?)</div>.*?<i.*?>(\d*?)</i>.*?' +
                             '<i.*?>(\d*?)</i>', re.S)
        items = re.findall(pattern, page_content)
        for item in items:
            replaceBR = re.compile(r'<br/>')
            text = re.sub(replaceBR, '\n', item[1])
            page_stories.append([item[0], text, item[2], item[3]])
        return page_stories

    def load_page(self):
        if len(self.stories) < 2:
            page_stories = self.get_stories(self.page)
            if page_stories:
                self.stories.append(page_stories)
                self.page += 1

    def get_one_story(self, page_stories, nowpage):
        print u'輸入回車獲得段子,Q表示退出'
        for item in page_stories:
            recive = raw_input('')
            self.load_page()
            if recive == 'Q':
                self.enable = False
                return
            print '-' * 25
            print u'第%s頁  作者:%s%s\n好笑:%s  評論:%s\n' % \
                (nowpage, item[0], item[1], item[2], item[3])

    def start(self):
        self.enable = True
        self.load_page()
        nowpage = 0
        while self.enable:
            page_stories = self.stories.pop(0)
            nowpage += 1
            self.get_one_story(page_stories, nowpage)

spider = QIUSHI()
spider.start()


發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章