Writing a Web Crawler in Python, Part 1: Scraping All News from the Anhui University of Science and Technology News Site

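I have been learning to write crawlers in Python lately and wanted to put together a simple tutorial, so for this first part I practiced on my alma mater's official news site. Writing the crawler turned out to have more twists than I expected. (I later found that the news site limits access: after many requests my local IP was blocked. This article does not cover that.)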

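Preparation
- Target page: http://news.aust.edu.cn
- Development tool: PyCharm
- Python version: 3.6.4
- Required libraries: requests for HTTP requests; re and BeautifulSoup for parsing (the BeautifulSoup code below uses the lxml parser, so lxml must be installed as well)
- Installing and using these libraries is not covered here. Recommended reading:
- BS module: "Python爬蟲利器二之Beautiful Soup的用法" (Python crawler tools, part 2: using Beautiful Soup)
- re module: "Python爬蟲入門七之正則表達式" (Getting started with Python crawlers, part 7: regular expressions)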

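Overview of the crawl
Step 1: get the url of every index page
Step 2: parse each index page to get every article's title, detail-page url and related information
Step 3: fetch and parse the information on each detail page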

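Step 1: analyze the page
Open the target page in a browser and look at what needs to be scraped. The main targets are the sections boxed in the screenshot. This article covers the first one, "學校要聞" (School News); the second, "綜合要聞" (General News), works the same way and is not repeated here.

Click "學校要聞" to open its own page, http://news.aust.edu.cn/xxxw.htm, then press F12 to inspect it.

Inspection shows that the data sits directly in the html and is not loaded via ajax, which keeps things simple.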

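Step 2: fetch the index page
Open PyCharm, create a new project, and add a file named spider.py to it.

Define the first function, which fetches the html of the index page. Import the requests module and its exception class. The requests module is used for fetching here, not urllib.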

import requests
from requests.exceptions import RequestException

#Fetch the index page
def get_index_html():
    #Target page
    url = 'http://news.aust.edu.cn/xxxw.htm'
    try:
        #requests.get returns the response; .text holds the full html
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

#Entry point
def main():
    html = get_index_html()
    print(html)

if __name__ == '__main__':
    main()

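If running this code prints the page's html to the console, the first step works.
(If Chinese characters in the html come out garbled, set the encoding explicitly before reading response.text:)

response = requests.get(url)
response.encoding = 'utf8'
or decode the raw bytes yourself
html = response.content.decode('utf8')

If you do not know the page's encoding, you can use

print(response.apparent_encoding)

which prints the encoding that requests detects from the response body.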
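To see why the encoding matters, here is a small offline sketch (no network access; the sample string is made up for illustration). It decodes the same bytes two ways. requests falls back to ISO-8859-1 for text responses whose headers declare no charset, which is a common source of garbled Chinese text.

```python
# Bytes as a server would send them for a UTF-8 page
# (sample text chosen for illustration, not copied from the real page).
raw = '學校要聞'.encode('utf-8')

# Decoding with the wrong codec does not raise an error; it silently
# produces mojibake, one character per byte.
garbled = raw.decode('iso-8859-1')

# Decoding with the right codec recovers the original text.
correct = raw.decode('utf-8')

print(repr(garbled))
print(correct)
```

Setting response.encoding before reading response.text has the same effect as choosing the right codec in the last step.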

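Step 3: parse the index page
Import the re module and define a parsing function. Regular expressions are used for matching here (XPath, BeautifulSoup and other approaches work too; pick whichever you prefer).

#Import re for regular-expression matching
import re
#Parse the index page
def parse_index_html(content):
    #Build the pattern; re.S lets . match newlines as well
    pattern = re.compile('<font>(.*?)</font>.*?absmiddle.*?href="(.*?)" target="_blank">(.*?)</a>',re.S)
    #findall returns every match as a list of tuples
    results = re.findall(pattern,content)
    print(results)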

Running it prints a list of tuples, one per article: (date, relative url, title).
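The pattern can be tried offline on a small fragment. The markup below is a hypothetical imitation of two index entries, written only so that it matches the pattern; the real page's html may differ in its details.

```python
import re

# Hypothetical markup imitating two index-page entries (an assumption,
# not copied from the real site).
sample = '''
<li id="line_u8_0"><font>2018-02-05</font>
<img src="dot.gif" align="absmiddle">
<a href="info/1002/11111.htm" target="_blank">First headline</a></li>
<li id="line_u8_1"><font>2018-02-04</font>
<img src="dot.gif" align="absmiddle">
<a href="info/1002/22222.htm" target="_blank">Second headline</a></li>
'''

# Same pattern as in the article; re.S lets . span the line breaks
# between the <font>, <img> and <a> tags.
pattern = re.compile(
    r'<font>(.*?)</font>.*?absmiddle.*?href="(.*?)" target="_blank">(.*?)</a>',
    re.S,
)

results = pattern.findall(sample)
for date, href, title in results:
    print(date, href, title)
```

Each tuple holds the three capture groups in order, which is why the later code indexes them as result[0], result[1] and result[2].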

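The results can also be output as dictionaries. Add this inside parse_index_html, after results is built:

    rooturl = 'http://news.aust.edu.cn/'#root url to prepend to detail-page links (the html uses relative paths)
    for result in results:
        data = {
            'time':result[0],
            'url':rooturl+result[1],#join into the full detail-page url
            'title':result[2]#link text, i.e. the article title
        }
        print(data)

Each article is now printed as a dictionary.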

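We also need the url of every index page. Click "next page" and look for a pattern: does the url itself change, or only a parameter?

The pattern is clear: there are 283 pages in total, and each step to the next page lowers the number in the url by 1. So we also need to read the total page count from the page.

#Get the total page count (inside the parsing function, where content is available)
    page_pattern = re.compile('&nbsp;1/(.*?)&nbsp')
    pagenum = re.findall(page_pattern,content)
    pagenum1 = int(pagenum[0])#convert the page count from str to int for the loop later
    print(pagenum1)

It probably fits better inside the function below that builds all the urls.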
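As a sketch of the same idea in isolation: the helper name extract_total_pages and the pager fragment are assumptions made for illustration. The pattern is narrowed to digits so that a changed page layout gives None and not a ValueError from int().

```python
import re

def extract_total_pages(html):
    """Return N from a '1/N' pager in the html, or None if no pager is found."""
    match = re.search(r'&nbsp;1/(\d+)&nbsp', html)
    return int(match.group(1)) if match else None

# Hypothetical pager fragment; the real markup around it may differ.
sample_pager = '<td>&nbsp;1/283&nbsp;</td>'
total = extract_total_pages(sample_pager)
print(total)
```

Returning None lets the caller decide whether to stop or to fall back to crawling only the first page.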

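Step 4: build every url to crawl
Define a function for it. From here on get_index_html takes the page url as a parameter, as in the full listing at the end.

def parse_all_url(content):
    #Read the total page count from the index page
    page_pattern = re.compile('&nbsp;1/(.*?)&nbsp')
    pagenum = re.findall(page_pattern, content)
    pagenum1 = int(pagenum[0])
    #Common prefix of the index-page urls
    rooturl = 'http://news.aust.edu.cn/xxxw/'
    #Build the url of every index page in turn
    for page in range(0,pagenum1):
        if page ==0:
            wholeurl = 'http://news.aust.edu.cn/xxxw.htm'
        else :
            wholeurl = rooturl+str(pagenum1-page)+'.htm'
        # print(wholeurl)
        content = get_index_html(wholeurl)#fetch the html of this index page
        result = parse_index_html(content)#parse the entries on this index page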
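The url scheme can be checked without touching the network. This sketch separates url construction from fetching; build_index_urls is a hypothetical helper name, and 283 is the page count observed above.

```python
def build_index_urls(total_pages):
    """List every index-page url, newest page first.

    The first page is xxxw.htm; the following pages count down from
    total_pages - 1 to 1 under the xxxw/ directory.
    """
    first = 'http://news.aust.edu.cn/xxxw.htm'
    root = 'http://news.aust.edu.cn/xxxw/'
    urls = [first]
    for page in range(1, total_pages):
        urls.append(root + str(total_pages - page) + '.htm')
    return urls

urls = build_index_urls(283)
print(urls[0])
print(urls[1])
print(urls[-1])
```

Building the list first also makes it easy to crawl only a slice of it while testing, which helps given the site's access limits.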

Step 5: fetch and parse the detail pages
Write a function to fetch a detail page. It is almost identical to the index-page one.

#Fetch a detail page
def get_detail_html(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # print(response.apparent_encoding)
            response.encoding = 'utf8'
            return response.text
        return None
    except RequestException:
        return None

Next, write a function to parse a detail page. The article body contains many html tags, so regex matching gets complicated; BeautifulSoup is a better fit here. Import BeautifulSoup first.

from bs4 import BeautifulSoup
#Parse a detail page
def parse_detail_html(url):
    content = get_detail_html(url)
    if content:
        #Build the soup object with the lxml parser, which is faster
        soup = BeautifulSoup(content,'lxml')
        results = soup.find('div',class_='imggin').get_text().strip()
        print(results)

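Running this prints the article text.

The text contains many \n (newline) and \xa0 (non-breaking space) characters, which need to be removed:

results = soup.find('div',class_='imggin').get_text().strip().replace('\n','').replace('\xa0','')

Running it again shows the newlines and non-breaking spaces are gone.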
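If more kinds of whitespace turn up, chained replace calls get long. A possible alternative, sketched here with a made-up sample string, is a single regex substitution. The character set (including the ideographic space \u3000) is an assumption about what such pages may contain, not something observed on this site.

```python
import re

def clean_text(text):
    """Remove newlines, carriage returns, tabs, NBSP and ideographic spaces."""
    return re.sub(r'[\n\r\t\xa0\u3000]+', '', text).strip()

# Made-up sample imitating text returned by get_text().
sample = '\n\xa0\xa0本網訊\n\u3000\u3000第二段\r\n'
cleaned = clean_text(sample)
print(cleaned)
```

Ordinary spaces are left alone on purpose so that any English words in an article stay separated.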

This is unfinished and will be continued. The code so far is below (it can be run as is).

# -*- coding: utf-8 -*-
# @Time    : 2018/2/5 14:06
# @Author  : XueLei
# @Site    : 
# @File    : spider.py
# @Software: PyCharm
import re
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

#Fetch an index page; a timeout is needed here
def get_index_html(url):
    try:
        response = requests.get(url,timeout = 200)#timeout (in seconds) added here
        if response.status_code == 200:
            response.encoding = 'utf8'
            return response.text
        return None
    except RequestException as e:
        print(e)
        return None

#Parse an index page and print each article's date, url, title and body text
def parse_index_html(content):
    #Site root; the hrefs in the page are relative
    rooturl = 'http://news.aust.edu.cn/'
    #re.S lets . match newlines as well
    pattern = re.compile('<li id=.*?<font>(.*?)</font>.*?absmiddle.*?href="(.*?)" target="_blank">(.*?)</a>',re.S)
    results = re.findall(pattern,content)
    #Iterate over the matches and print each record
    for result in results:
        publish_time = result[0]
        #hrefs on pages under xxxw/ start with ../, so strip it before joining
        detail_url = rooturl+result[1].replace('../','')
        paracontent = parse_detail_html(detail_url)#fetch and parse the detail page
        title = result[2]
        data = {
            'time':publish_time,
            'url':detail_url,
            'title':title,
            'content':paracontent
        }
        print(data)

#Fetch a detail page
def get_detail_html(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # print(response.apparent_encoding)
            response.encoding = 'utf8'
            return response.text
        return None
    except RequestException as e:
        print(e)
        return None#return None, not the exception, so callers can test the result
#Parse a detail page with BeautifulSoup
def parse_detail_html(url):
    content = get_detail_html(url)
    if content:
        soup = BeautifulSoup(content,'lxml')
        node = soup.find('div',class_ = 'imggin')
        if node:#find returns None when the page has no such div
            results = node.get_text().strip().replace('\n','').replace('\xa0','')
            return results
    return None


#Remove images (placeholder, not implemented yet)
def remove_all_img():
    return None
#Remove tags (placeholder, not implemented yet)
def remove_all_label():
    return None
#Build the url of every index page and parse each one
def parse_all_url(content):
    #Read the total page count from the index page
    page_pattern = re.compile('&nbsp;1/(.*?)&nbsp')
    pagenum = re.findall(page_pattern, content)
    page1 = int(pagenum[0])
    #Common prefix of the index-page urls
    rooturl = 'http://news.aust.edu.cn/xxxw/'
    #Build each url in turn and call the index-page parser
    for page in range(0,page1):
        if page ==0:
            wholeurl = 'http://news.aust.edu.cn/xxxw.htm'
        else :
            wholeurl = rooturl+str(page1-page)+'.htm'
        print(wholeurl,'this is page '+str(page+1))
        content = get_index_html(wholeurl)#html of this index page, used to find urls and titles
        if content:#skip pages that failed to download
            parse_index_html(content)#parse this index page


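def main():
    url = 'http://news.aust.edu.cn/xxxw.htm'#first index page
    print(url)
    html = get_index_html(url)#fetched only to read the total page count
    # print(html)
    if html is None:#the request failed, so there is nothing to parse
        return
    try:
        parse_all_url(html)#read the page count and build every index-page url
    except RequestException as e:
        print(e)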


if __name__ == '__main__':
    main()
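The '../' handling in parse_index_html can also be left to the standard library. This sketch uses urllib.parse.urljoin, which resolves a relative href against the url of the page it appeared on. The article paths below are hypothetical examples, not real links.

```python
from urllib.parse import urljoin

# On a numbered index page the hrefs start with ../ (path is hypothetical).
numbered_page = 'http://news.aust.edu.cn/xxxw/282.htm'
from_numbered = urljoin(numbered_page, '../info/1002/11111.htm')

# On the first index page the hrefs are relative to the site root.
first_page = 'http://news.aust.edu.cn/xxxw.htm'
from_first = urljoin(first_page, 'info/1002/11111.htm')

print(from_numbered)
print(from_first)
```

Both calls resolve to the same absolute url, so the parser would only need to know which index page it is reading and would not need to special-case the '../' prefix.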
