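Please refer to the source code. The text below records my initial approach and has not been revised since; the source code has been updated.

Prerequisites: the requests library, installed with pip install requests.

The pymongo library, installed with pip install pymongo.

First, analyze the target URL: http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_1.shtml

The pattern in this URL is easy to spot: changing the number after index turns the page, so we can iterate over every page.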
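As a minimal sketch of the pagination idea, the helpers below build the list-page URLs by substituting the page number into the address. The names page_url and page_urls are assumptions for illustration and are not part of the original code.

```python
# URL template taken from the target address; only the number after "index_" changes.
LIST_URL = 'http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_{}.shtml'


def page_url(page):
    """Return the list-page URL for a 1-based page number (hypothetical helper)."""
    return LIST_URL.format(page)


def page_urls(first, last):
    """Return the URLs for pages first..last, inclusive (hypothetical helper)."""
    return [page_url(n) for n in range(first, last + 1)]
```

Calling page_urls(1, 3) gives the addresses of the first three list pages, which can then be downloaded one by one.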

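Next we inspect the page source and locate the part of the HTML that holds the links and news titles, along with the dates.

All the information we need sits under the li tag, so a regular expression can extract it (title, link, date):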

pattern = re.compile(r'<li><.*?href="(.*?)".*?_blank">(.*?)</a><span>(.*?)</span>', re.S)
datas = re.findall(pattern, html)
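To see what the pattern returns, here is a small sketch that runs it against a made-up HTML fragment. The sample markup and the helper name extract_items are assumptions; the real page source may be laid out differently.

```python
import re

# Same pattern as in the text: captures the link, the title and the date of each <li> entry.
PATTERN = re.compile(r'<li><.*?href="(.*?)".*?_blank">(.*?)</a><span>(.*?)</span>', re.S)

# Made-up HTML in the shape the pattern expects (assumption, not copied from the site).
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="http://example.com/doc-i0001.shtml" target="_blank">First title</a>'
    '<span>(01-01 08:00)</span></li>'
    '<li><a href="http://example.com/doc-i0002.shtml" target="_blank">Second title</a>'
    '<span>(01-01 09:30)</span></li>'
    '</ul>'
)


def extract_items(html):
    """Return a list of (link, title, date) tuples found in the page source."""
    return re.findall(PATTERN, html)


items = extract_items(SAMPLE_HTML)
```

Each element of items is a tuple in the order link, title, date, which matches the order of the capture groups.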

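Next come the reply and comment counts. When fetching them, you will find that the comments are sent to the browser as JavaScript, so the response has to be converted to JSON and loaded into a Python dictionary first.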

url = http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-{to be filled in}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20

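The part to fill in comes from the headline's URL: the code below takes the text between doc-i and .shtml.

With that, we can write the code: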

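def get_comm_par(href):
    try:
        # The news id is the text between "doc-i" and ".shtml" in the article URL.
        match = re.search(r'doc-i(.+)\.shtml', href)
        newsid = match.group(1)
        commenturl = 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'
        comment = requests.get(commenturl.format(newsid))
        # The body looks like "var data={...}". lstrip() removes a set of
        # characters, not a prefix, so cut the text at the first "{" instead.
        text = comment.text
        res = json.loads(text[text.index('{'):])
        return res['result']['count']['total'], res['result']['count']['show']
    except Exception:
        # Any failure (no id in the URL, network error, bad payload) yields None.
        return None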
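The parsing step can be tried offline, without a network request. The sketch below assumes the response body has the form "var data={...}" with the same keys that get_comm_par reads; the helper names and the sample payload are made up for illustration.

```python
import json

PREFIX = 'var data='


def parse_comment_js(text):
    """Turn a 'var data={...}' response body into a dict (hypothetical helper)."""
    text = text.strip()
    # Remove the prefix as a whole string; str.lstrip would treat it as a character set.
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return json.loads(text)


def comment_counts(text):
    """Return (total, show) from the payload, or None if it cannot be read."""
    try:
        count = parse_comment_js(text)['result']['count']
        return count['total'], count['show']
    except (ValueError, KeyError, TypeError):
        return None


# Made-up payload shaped like the fields used above (assumption).
SAMPLE = 'var data={"result": {"count": {"total": 12, "show": 5}}}'
```

Catching only ValueError, KeyError and TypeError, rather than everything, keeps unrelated bugs visible while still returning None for malformed or incomplete payloads.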

We now have everything we need: title, link, time, comment count, and participant count.
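Before storing anything, the pieces can be combined into one dictionary per news item. This is only a sketch: the function name build_record and the key names are assumptions, and the counts argument is the (total, show) tuple returned by get_comm_par, or None when the lookup failed.

```python
def build_record(href, title, date, counts):
    """Combine one regex match and its comment counts into a dict.

    The key names are illustrative assumptions, not the author's schema.
    """
    total, show = counts if counts is not None else (None, None)
    return {
        'title': title.strip(),
        'href': href,
        'date': date.strip(),
        'total': total,
        'show': show,
    }
```

A dictionary like this can be passed directly to a database insert or written as one CSV row.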

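Next we write this information to the MongoDB database:

This part can be written as a module and imported in Python, and it can take extra parameters, such as root_url, to crawl the site more deeply.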

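import pymongo

MONGO_URL = 'localhost'
MONGODB = 'xinlang'
MONGOTABE = 'xinweng'

# Connect once at import time; db is the handle used by write_to_mongodb.
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGODB]

def write_to_mongodb(res, url):
    # insert_one replaces Collection.insert, which was removed in PyMongo 4.
    if db[MONGOTABE].insert_one(res).acknowledged:
        print('href= {} save success'.format(url))
        return True
    print('save fail')
    return False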

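Once everything is written, we can run the main function in parallel. Note that this uses multiple processes rather than threads, through Python's built-in multiprocessing module, and it is simple to use: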

pool = multiprocessing.Pool()
pool.map(main, [x for x in range(1, 301)])

Finally, here is the full code.

I hope it is a useful reference:

import csv
import threading
import multiprocessing
import json
from urllib.parse import urlencode
from urllib.request import urlopen, Request
from urllib.error import URLError, HTTPError


def html_download(url):
    # The header value must not repeat the "User-Agent:" name.
    headers = {'User-Agent': "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"}
    request = Request(url, headers=headers)
    try:
        html = urlopen(request).read().decode()
    except HTTPError as e:
        print('[W] Server error during download: %s' % e.reason)
        return None
    except URLError as e:
        print("[E] Site unreachable: %s" % e.reason)
        return None
    return html


def api_info_manager(page):
    # http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&show_all=1&show_num=6000&tag=1&format=json
    comment_channel = ['gnxw', 'shxw', 'gjxw']
    for comment in comment_channel:
        data = {
            'channel': 'news',
            'cat_1': comment,
            'show_all': 1,
            'show_num': 6000,
            'tag': 1,
            'page': page,
            'format': 'json'
        }
        dataformat = 'http://api.roll.news.sina.com.cn/zt_list?' + urlencode(data)
        response = html_download(dataformat)
        if response is None:
            # Download failed; skip this category instead of crashing in json.loads.
            continue
        # json.loads no longer accepts an encoding argument (removed in Python 3.9).
        json_results = json.loads(response)['result']['data']
        for info_dict in json_results:
            yield info_dict


fileheader = ['id','column','title','url','keywords','comment_channel','img','level','createtime','old_level','media_type','media_name']


def write_csv_header(fileheader):
    with open("新浪新聞.csv", "a", newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fileheader)
        writer.writeheader()


def save_to_csv(result):
    with open("新浪新聞.csv", "a", newline='') as csvfile:
        print('    Writing to the csv file.....')
        writer = csv.DictWriter(csvfile, fieldnames=fileheader)
        writer.writerow(result)


def main(page):
    for res in api_info_manager(page):
        save_to_csv(res)


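if __name__ == '__main__':
    write_csv_header(fileheader)
    # Multiple processes: the pool runs main() for each page number.
    pool = multiprocessing.Pool()
    # A helper thread drives pool.map; the main thread waits for it to finish.
    thread = threading.Thread(target=pool.map, args=(main, range(1, 100)))
    thread.start()
    thread.join()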
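One thing to watch in the CSV part: because the file is opened in append mode, the header row is written again on every run, and csv.DictWriter raises ValueError by default if a row contains a key that is not in the field list. The sketch below shows one possible way around both issues. The function name append_rows is an assumption and it is not the approach used in the code above.

```python
import csv
import os


def append_rows(path, fieldnames, rows):
    """Append dict rows to a CSV file, writing the header only for a new or empty file."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as csvfile:
        # extrasaction='ignore' skips keys that are not listed in fieldnames.
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        if need_header:
            writer.writeheader()
        writer.writerows(rows)
```

Writing a batch of rows per call also means the file is opened once per batch instead of once per row. When several processes append to the same file at the same time, rows may still interleave, so collecting results in the parent process before writing is worth considering.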

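If you enjoy web scraping, feel free to add me on QQ (470581985) to chat. I will post more about my explorations in machine learning and big data later. I am still an intern and have not been learning Python for long, so if there are any mistakes, I welcome your feedback.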

