Author's résumé: http://resume.hackycoder.cn

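Crawling Articles Step by Step with a Python Spider

Background

I have recently been studying machine learning algorithms, which fall into regression, classification, clustering, and so on. While studying I was frustrated by having no data to practice on, so I decided to crawl news from the major Chinese websites, train a model on it, and then predict the category of future news articles. That is how my crawling journey began.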

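Site analysis

A summary of major Chinese news sites (to be continued):

Sohu News:

Current politics: http://m.sohu.com/cr/32/?page=2&_smuid=1qCppj0Q1MiPQWJ4Q8qOj1&v=2
Society: http://m.sohu.com/cr/53/?page=2&_smuid=1qCppj0Q1MiPQWJ4Q8qOj1&v=2
World: http://m.sohu.com/cr/57/?_smuid=1qCppj0Q1MiPQWJ4Q8qOj1&v=2

General URL pattern: http://m.sohu.com/cr/4/?page=4     The first 4 is the category ID and the second 4 is the page number.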
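The URL pattern above can be sketched as a small helper. This is a Python 3 sketch and not part of the original crawler: the name `sohu_list_url` is made up for illustration, and it assumes the `_smuid` and `v` query parameters seen in the sample URLs can be left out, as the general pattern suggests.

```python
from urllib.parse import urlencode

SOHU_BASE = "http://m.sohu.com/cr/"


def sohu_list_url(category, page):
    """Build a list-page URL from a category ID and a page number (hypothetical helper)."""
    return "{}{}/?{}".format(SOHU_BASE, category, urlencode({"page": page}))


if __name__ == "__main__":
    # Category 32 is listed above as current politics.
    print(sohu_list_url(32, 2))
```

Keeping the pattern in one function means the crawl loop only has to iterate over category IDs and page numbers.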

NetEase News

Recommended: http://3g.163.com/touch/article/list/BA8J7DG9wangning/20-20.html      (the part to change is 20-20)
News: http://3g.163.com/touch/article/list/BBM54PGAwangning/0-10.html
Entertainment: http://3g.163.com/touch/article/list/BA10TA81wangning/0-10.html
Sports: http://3g.163.com/touch/article/list/BA8E6OEOwangning/0-10.html
Finance: http://3g.163.com/touch/article/list/BA8EE5GMwangning/0-10.html
Fashion: http://3g.163.com/touch/article/list/BA8F6ICNwangning/0-10.html
Military: http://3g.163.com/touch/article/list/BAI67OGGwangning/0-10.html
Mobile phones: http://3g.163.com/touch/article/list/BAI6I0O5wangning/0-10.html
Technology: http://3g.163.com/touch/article/list/BA8D4A3Rwangning/0-10.html
Games: http://3g.163.com/touch/article/list/BAI6RHDKwangning/0-10.html
Digital: http://3g.163.com/touch/article/list/BAI6JOD9wangning/0-10.html
Education: http://3g.163.com/touch/article/list/BA8FF5PRwangning/0-10.html
Health: http://3g.163.com/touch/article/list/BDC4QSV3wangning/0-10.html
Autos: http://3g.163.com/touch/article/list/BA8DOPCSwangning/0-10.html
Home: http://3g.163.com/touch/article/list/BAI6P3NDwangning/0-10.html
Real estate: http://3g.163.com/touch/article/list/BAI6MTODwangning/0-10.html
Travel: http://3g.163.com/touch/article/list/BEO4GINLwangning/0-10.html
Parenting: http://3g.163.com/touch/article/list/BEO4PONRwangning/0-10.html

To be continued...
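The NetEase list URLs above differ only in the channel ID and the trailing number pair, so they can be generated the same way. A Python 3 sketch follows. The helper name and the dictionary keys are made up for illustration, and reading the pair as "start offset, number of items" is my assumption: the notes above only say that this segment is the part to change.

```python
NETEASE_LIST = "http://3g.163.com/touch/article/list/{channel}/{start}-{size}.html"

# Channel IDs copied from the list above (only a few shown).
NETEASE_CHANNELS = {
    "news": "BBM54PGAwangning",
    "sports": "BA8E6OEOwangning",
    "tech": "BA8D4A3Rwangning",
}


def netease_list_url(channel, start, size):
    """Build a list URL; start and size are ASSUMED to mean offset and item count."""
    return NETEASE_LIST.format(
        channel=NETEASE_CHANNELS[channel], start=start, size=size
    )


if __name__ == "__main__":
    print(netease_list_url("news", 0, 10))
```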

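Crawling process

Step 1: A simple crawl

This step mainly uses two packages, urllib2 and BeautifulSoup. Taking Sohu News as the example, I wrote a simple crawler that fetches article content. It has no optimization of any kind, so it can hang and run into similar problems.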

# -*- coding:utf-8 -*-
'''
Created on 2016-3-15

@author: AndyCoder
'''
import urllib2
from bs4 import BeautifulSoup
import socket
import httplib


class Spider(object):
    """Spider"""
    def __init__(self, url):
        self.url = url

    def getNextUrls(self):
        """Return absolute URLs for all site-relative links on the list page."""
        urls = []
        request = urllib2.Request(self.url)
        # Adjacent string literals keep stray indentation out of the header value.
        request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; '
            'WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36')
        try:
            html = urllib2.urlopen(request)
        except (socket.timeout, urllib2.URLError, httplib.BadStatusLine):
            # The page could not be fetched, so there are no links to return.
            return urls

        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            # Skip anchors without an href and keep only site-relative links.
            if href and href[0] == '/':
                print("http://m.sohu.com" + href)
                urls.append("http://m.sohu.com" + href)
        return urls

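def getNews(url):
    """Fetch one article page and return its text as UTF-8 bytes ('' on failure)."""
    print url
    xinwen = ''
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; '
        'WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36')
    try:
        html = urllib2.urlopen(request)
    except urllib2.HTTPError as e:
        # Report the status code and skip this article.
        print e.code
        return xinwen
    except (socket.timeout, urllib2.URLError, httplib.BadStatusLine):
        return xinwen

    soup = BeautifulSoup(html, 'html.parser')
    for news in soup.select('p.para'):
        # get_text() returns unicode; encode it so it can be written to a file.
        xinwen += news.get_text().encode('utf-8')
    return xinwen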


class News(object):
    """
    source: the site the article was crawled from
    title: the article title
    time: the publication time of the article
    content: the article body
    type: the article category
    """
    def __init__(self, source, title, time, content, type):
        self.source = source
        self.title = title
        self.time = time
        self.content = content
        self.type = type


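# Append every crawled article to one text file, separated by newlines.
# The with block closes the file even if the crawl is interrupted.
with open('C:/test.txt', 'a') as out:
    for i in range(38, 50):        # category IDs
        for j in range(1, 5):      # page numbers
            url = "http://m.sohu.com/cr/" + str(i) + "/?page=" + str(j)
            print url
            s = Spider(url)
            for newsUrl in s.getNextUrls():
                out.write(getNews(newsUrl))
                out.write("\n")
                print "---------------------------"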
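The listing above targets Python 2 (urllib2, print statements). For readers on Python 3, the link-filtering step of getNextUrls can be sketched with the standard library alone. This is an illustrative sketch under my own naming (`LinkCollector`, `extract_links`), not the author's code, and it uses the built-in `html.parser` in place of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect site-relative hrefs from <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Anchors without an href are skipped; only links starting with "/" are kept.
        if href and href.startswith("/"):
            self.urls.append(self.base_url + href)


def extract_links(html_text, base_url="http://m.sohu.com"):
    """Return the absolute URLs of all site-relative links in html_text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.urls


if __name__ == "__main__":
    sample = '<a href="/n/1/">one</a><a>no href</a><a href="/n/2/">two</a>'
    print(extract_links(sample))
```

Because the function takes the HTML as a string, the filtering logic can be checked without touching the network.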

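Step 2: Problems encountered

Running the code above brings up several problems that interrupt the crawler or make it slow. They are:

Proxy servers
HTTP status codes such as 404
Slow crawling speed

Step 3: Solutions

Proxy servers

You can find some proxy servers online and configure the crawler to go through one, which takes care of the IP problem. The code is as follows: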

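def setProxy(pro):
    # Register the proxy for http as well as https: the Sohu URLs above are
    # plain http, so an https-only entry would never be used for them.
    proxy_support = urllib2.ProxyHandler({'http': pro, 'https': pro})
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    # After install_opener, every urllib2.urlopen call goes through this opener.
    urllib2.install_opener(opener)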
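In Python 3 the same handlers live in `urllib.request`. The sketch below builds an opener for a given proxy and returns it without installing it globally, which makes it easy to switch proxies between requests. The function name `make_proxy_opener` and the proxy address are placeholders of mine, not values from the original post.

```python
import urllib.request


def make_proxy_opener(proxy):
    """Return an opener that sends http and https requests through proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    # 127.0.0.1:8080 is a placeholder address, not a real proxy.
    opener = make_proxy_opener("http://127.0.0.1:8080")
    # opener.open(url, timeout=10) would now be routed through that proxy.
    print(type(opener).__name__)
```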

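As for status codes: if a page cannot be found, simply discard it, because dropping a small number of pages does not affect the later work.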

def getHtml(url, pro):
    """Fetch url through proxy pro; return the response, or None to skip the page."""
    setProxy(pro)
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; '
        'WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36')

    try:
        html = urllib2.urlopen(request)
    except (socket.timeout, urllib2.URLError, httplib.BadStatusLine):
        # HTTP errors such as 404 arrive here too (HTTPError subclasses URLError).
        return None

    # Discard anything that is not a plain 200 response.
    if html.getcode() != 200:
        return None
    return html
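The "discard the page on any failure" rule can also be written for Python 3. In this sketch the function name `fetch_or_none` and the `open_fn` parameter are my own additions; `open_fn` exists only so the behaviour can be exercised with a stub in place of a real network call.

```python
import http.client
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, open_fn=urllib.request.urlopen, timeout=10):
    """Return the response body as bytes, or None if the page should be skipped."""
    try:
        response = open_fn(url, timeout=timeout)
    except (urllib.error.URLError, socket.timeout, http.client.BadStatusLine):
        # HTTPError (404, 500, ...) is a subclass of URLError, so it lands here too.
        return None
    if response.getcode() != 200:
        return None
    return response.read()
```

A caller then only needs `if body is None: continue` to move on to the next URL.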

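As for speed, the crawl can be run in multiple processes. Once the URLs have been analyzed, a Redis sorted set can serve as the queue, which solves both the duplicate-URL problem and the problem of coordinating multiple processes. (Not yet implemented.)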
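Since the Redis version is not implemented, here is only an in-memory Python 3 stand-in that mimics the two properties the idea relies on: members are unique, and they come out ordered by score. The class `UrlQueue` is hypothetical. With a real Redis server the corresponding commands would presumably be ZADD with the NX option and ZPOPMIN, and the state would be shared between worker processes, which this sketch does not do. It also keeps a separate set of seen URLs so that a URL already popped is not queued again, something a sorted set on its own would not guarantee.

```python
import heapq
import itertools


class UrlQueue:
    """In-memory stand-in for a sorted-set queue: unique members ordered by score."""

    def __init__(self):
        self._seen = set()                 # every URL ever added, so repeats are rejected
        self._heap = []                    # entries are (score, insertion order, url)
        self._counter = itertools.count()  # tie-breaker for equal scores

    def add(self, url, score=0):
        """Add url unless it was seen before; return True if it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (score, next(self._counter), url))
        return True

    def pop(self):
        """Return the pending URL with the lowest score, or None when empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    queue = UrlQueue()
    queue.add("http://m.sohu.com/cr/32/?page=1", score=1)
    queue.add("http://m.sohu.com/cr/32/?page=1", score=1)  # duplicate, ignored
    print(queue.pop())
```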

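Step 4: Running it

Last night I tried a run, crawling part of the Sohu News site: roughly 50*5*15=3750 pages, from which more than 2,000 news articles were parsed. On a connection of nearly 1 Mbps it took 1101 s, or about 18 minutes.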
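The figures above can be checked with a few lines of arithmetic. Only the numbers quoted in the text are used; the variable names are mine, and the throughput is a rough average derived from them and not something that was measured separately.

```python
pages = 50 * 5 * 15      # the page count quoted above
elapsed_s = 1101         # total run time in seconds

minutes = elapsed_s / 60                # a little over 18 minutes
pages_per_second = pages / elapsed_s    # rough average throughput

if __name__ == "__main__":
    print(pages, round(minutes, 2), round(pages_per_second, 1))
```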
