Module 3, Week 1, Assignment 3: Popular Articles

Original

BeefpasteC

2020-03-06 00:04

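1 Problem Description

Search CSDN for a technical keyword, such as java, and download the HTML source of the popular articles on the first few result pages to local disk. Name each file after the blog post's main title.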

2 Hints

See the last two sections of this week's recorded lectures.

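3 Grading Criteria

This problem is worth 40 points in total:
Work out the URL pattern and use XPath to extract the article links and blog titles: 20 points
Download the blog posts: 10 points
Code comments and coding style: 10 points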
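
The URL pattern mentioned in the first criterion can be read off the search URL used in section 5: the page number goes in p and the keyword in q. Below is a minimal sketch of building such URLs with the standard library. The helper name build_search_url is hypothetical, and the parameter list is copied from that URL rather than from any documented CSDN API, so the site may have changed it since.

```python
from urllib.parse import urlencode

# Base address and parameter names are copied from the search URL used in this post
SEARCH_URL = 'https://so.csdn.net/so/search/s.do'

def build_search_url(keyword, page):
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        'p': page,        # result page number
        'q': keyword,     # search keyword; urlencode percent-encodes it
        't': 'blog',      # restrict results to blog posts
        'domain': '', 'o': '', 's': '', 'u': '', 'l': '', 'f': '',
        'rbg': 0,
    }
    return SEARCH_URL + '?' + urlencode(params)

print(build_search_url('java', 1))
```

Encoding the keyword this way avoids formatting a bytes object into the URL, which would insert the literal text b'...' into the query string.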

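4 Key Points

4.1 Anti-Scraping Mechanisms

Request headers
- Imitate a browser when making requests
Proxy IP
- Send requests through a proxy so that the source IP address changes
Cookie
- Some sites only serve content after login, so obtain the login cookie and send it with each request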
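
The code in section 5 covers request headers and proxies but not cookies. As a rough sketch of the cookie approach with the standard library, the block below shows two options: an opener that remembers cookies across requests, and a request that carries a cookie string copied from a logged-in browser session. The helper names, the URL, and the cookie value are all placeholders, not anything CSDN-specific.

```python
import http.cookiejar
import urllib.request as ur

def build_cookie_opener():
    """Return an opener that remembers cookies, together with its cookie jar."""
    cookie_jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor stores cookies from responses and sends them on later requests
    opener = ur.build_opener(ur.HTTPCookieProcessor(cookie_jar))
    return opener, cookie_jar

def with_cookie_header(url, cookie_string, user_agent_string):
    """Build a request that carries a cookie copied from a browser session."""
    return ur.Request(
        url=url,
        headers={
            'User-Agent': user_agent_string,
            'Cookie': cookie_string,
        },
    )

opener, cookie_jar = build_cookie_opener()
request = with_cookie_header(
    'https://example.com/',       # placeholder URL
    'sessionid=PLACEHOLDER',      # placeholder cookie value
    'Mozilla/5.0',                # placeholder User-Agent
)
print(request.get_header('Cookie'))
```

Calling opener.open(request) would then send the request; nothing is sent over the network in the sketch itself.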

5 Implementation

With a proxy

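import os
import urllib.parse as up
import urllib.request as ur

import lxml.etree as le
import user_agent

keyword = input('Enter a keyword: ')
pn_start = int(input('Start page: '))
pn_end = int(input('End page: '))

# Make sure the output directory exists before saving pages
os.makedirs('blog', exist_ok=True)

def getRequest(url):
    # Build a request that carries a browser-like User-Agent header
    return ur.Request(
        url=url,
        headers={
            'User-Agent':user_agent.get_user_agent_pc(),
        }
    )

def getProxyOpener():
    # Fetch a proxy IP address from the proxy provider's API
    proxy_address = ur.urlopen('http://api.ip.data5u.com/dynamic/get.html?order=b302473c7d3fb594860238e20a651296&sep=3').read().decode('utf-8').strip()
    # Attach the proxy to requests. The target URLs are https, so both schemes
    # are mapped (this assumes the proxy supports HTTPS tunnelling).
    proxy_handler = ur.ProxyHandler(
        {
            'http':proxy_address,
            'https':proxy_address,
        }
    )
    # Return an opener that sends requests through the proxy
    return ur.build_opener(proxy_handler)


for pn in range(pn_start,pn_end+1):
    # Percent-encode the keyword; formatting keyword.encode('utf-8') would put b'...' in the URL
    request = getRequest(
        'https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&domain=&o=&s=&u=&l=&f=&rbg=0' % (pn,up.quote(keyword))
    )
    try:
        # Send the request through the proxy and read the search results page
        response = getProxyOpener().open(request).read()
        # Extract the article links with XPath
        href_s=le.HTML(response).xpath('//div[@class="limit_width"]/a/@href')
        for href in href_s:
            try:
                # Download the article page
                response_blog = getProxyOpener().open(
                    getRequest(href)
                ).read()
                # Extract the article title
                title = le.HTML(response_blog).xpath('//h1[@class="title-article"]/text()')[0].strip()
                # count=le.HTML(response_blog).xpath('//span[@class="mr16"]/text()')
                print(title)
                # Save the page under the article title
                with open('blog/%s.html' % title,'wb') as f:
                    f.write(response_blog)
            except Exception:
                print('Failed to scrape the article!')
    except Exception as e:
        print('Failed to fetch search page %s: %s' % (pn,e))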

Without a proxy

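import os
import urllib.parse as up
import urllib.request as ur

import lxml.etree as le
import user_agent

keyword = input('Enter a keyword: ')
pn_start = int(input('Start page: '))
pn_end = int(input('End page: '))

# Make sure the output directory exists before saving pages
os.makedirs('blog', exist_ok=True)

# Request helper
def getRequest(url):
    # Build a request that carries a browser-like User-Agent header
    return ur.Request(
        url=url,
        headers={
            'User-Agent':user_agent.get_user_agent_pc(),
        }
    )


for pn in range(pn_start,pn_end+1):
    # Build the request for one search results page.
    # Percent-encode the keyword; formatting keyword.encode('utf-8') would put b'...' in the URL
    request = getRequest(
        'https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&domain=&o=&s=&u=&l=&f=&rbg=0' % (pn,up.quote(keyword))
    )
    try:
        # Send the request and read the response body
        response = ur.urlopen(request).read()
        # Parse the HTML and use XPath to extract the article links
        href_s=le.HTML(response).xpath('//div[@class="limit_width"]/a/@href')
        # Iterate over the article links
        for href in href_s:
            try:
                # Open the article page via its link
                response_blog = ur.urlopen(
                    getRequest(href)
                ).read()
                # Extract the title
                title = le.HTML(response_blog).xpath('//h1[@class="title-article"]/text()')[0].strip()
                # count=le.HTML(response_blog).xpath('//span[@class="mr16"]/text()')
                print(title)
                # Save the page
                with open('blog/%s.html' % title,'wb') as f:
                    f.write(response_blog)
            except Exception:
                print('Failed to scrape the article!')
    except Exception as e:
        print('Failed to fetch search page %s: %s' % (pn,e))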
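
Both scripts use the blog title directly as the file name, so a title containing characters such as / or ? makes open() fail and the article is skipped. One possible way to handle this, sketched here with a hypothetical helper called safe_filename, is to replace such characters before saving:

```python
import re

def safe_filename(title, replacement='_'):
    """Turn a blog title into a name that is safe to use as a file name."""
    # Drop surrounding whitespace first
    cleaned = title.strip()
    # Replace characters that Windows or Unix-like systems reject in file names
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, cleaned)
    # Trim leading and trailing spaces and dots, which some file systems handle poorly
    cleaned = cleaned.strip(' .')
    # Fall back to a fixed name if nothing usable is left
    return cleaned or 'untitled'

print(safe_filename('  Java: intro?  '))
```

In the scripts above, the save step would then become open('blog/%s.html' % safe_filename(title), 'wb'). Note that two different titles can map to the same file name after replacement, which this sketch does not handle.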
