Web Crawling with the urllib Library and the requests Module

What is a web crawler

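A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm. For example, given a URL, a crawler can retrieve the images, links, videos, and files that the page references.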

What happens when you browse a web page

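Browser (request) -> enter a URL (http://www.baidu.com/index.html, file:///mnt, ftp://172.25.254.250/pub)
-> the scheme (http) and the domain name (www.baidu.com) are identified -> a DNS server resolves the domain name to an IP address
-> the server locates the requested page -> the page content is returned to the browser (response)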
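
As a small illustration of the first step, the sketch below uses urllib.parse.urlparse from the standard library to split the three example URLs into scheme, host, and path. It only parses strings and sends no request; the list and variable names are illustrative.

```python
from urllib.parse import urlparse

urls = [
    'http://www.baidu.com/index.html',
    'file:///mnt',
    'ftp://172.25.254.250/pub',
]

for url in urls:
    parts = urlparse(url)
    # file:// URLs have an empty host part
    print(parts.scheme, parts.netloc or '(no host)', parts.path)
```

The scheme decides which protocol is used, and only URLs with a host name need the DNS lookup described above.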

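Ways to fetch a web page

Basic approach

from urllib import request
from urllib.error import URLError
try:
    # timeout is in seconds; 1 second is tight and may fail on a slow network
    response = request.urlopen('http://www.baidu.com', timeout=1)
    content = response.read().decode('utf-8')
    print(content)
except URLError as e:
    # URLError covers timeouts as well as DNS and connection failures
    print("Request failed:", e.reason)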

Using a Request object (lets you add extra headers)

from urllib import request
from urllib.error import URLError
url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
# Adjacent string literals are joined into a single User-Agent value
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) '
                         'Gecko/20100101 Firefox/45.0'}
try:
    req = request.Request(url, headers=headers)
    content = request.urlopen(req).read().decode('utf-8')
    print(content)
except URLError as e:
    print(e.reason)
else:
    print('Success')

Adding a header after the Request object has been created:

from urllib import request
from urllib.error import URLError
url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) ' \
             'Gecko/20100101 Firefox/45.0'
try:
    req = request.Request(url)
    # add_header sets one header on an existing Request
    req.add_header('User-Agent', user_agent)
    content = request.urlopen(req).read().decode('utf-8')
    print(content)
except URLError as e:
    print(e.reason)
else:
    print('Success')

Dealing with anti-crawler measures

Add header information to imitate a browser

1.Android
Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
2.Firefox
Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
3.Google Chrome
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
4.iOS
Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
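
One way to use these strings, sketched under the assumption that rotating the User-Agent per request is acceptable for the target site: keep a few of them in a list and pick one at random when building the Request. USER_AGENTS and build_request are hypothetical names, and nothing is sent over the network here.

```python
import random
from urllib import request

# A few of the User-Agent strings listed above
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/27.0.1453.94 Safari/537.36',
    'Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 '
    '(KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3',
]

def build_request(url):
    """Return a Request carrying a randomly chosen User-Agent (hypothetical helper)."""
    return request.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})

req = build_request('http://www.baidu.com')
# urllib stores header names capitalized, hence 'User-agent'
print(req.get_header('User-agent'))
```

The resulting object can be passed to request.urlopen exactly like the Request objects in the earlier examples.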

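IP proxies

A crawler runs fast, so a single fixed IP ends up hitting the site at a very high rate.
If the site has anti-crawler measures in place, it will ban that IP.
How to deal with this?
- Add a delay between requests: time.sleep(random.randint(1,5))
- Use IP proxies, so that other IP addresses make the requests in place of yours.
Where to get proxy IPs?
http://www.xicidaili.com/
What are the steps?
1). Create a urllib.request.ProxyHandler(proxies=None) from a dict that maps each scheme to a proxy address.
2). Build an opener with build_opener; it plays the role of urlopen, but customized.
3). Install the opener.
4). Send the request through the chosen proxy IP.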

from urllib import request
url = 'https://httpbin.org/get'
proxy = {'https':'120.92.74.189:3128','http':'183.129.207.84:21231'}
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0'
# 1) Create a ProxyHandler from the scheme-to-proxy mapping
proxy_support = request.ProxyHandler(proxy)
# 2) Build a customized opener that routes requests through the proxy
opener = request.build_opener(proxy_support)
# Imitate a browser
opener.addheaders = [('User-Agent',user_agent)]
# 3) Install the opener so that request.urlopen uses it from now on
request.install_opener(opener)
# 4) Send the request through the chosen proxy
response = request.urlopen(url)
content = response.read().decode('utf-8')
print(content)
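
The two remedies above (a random delay and a proxy) can be combined. The following is a minimal sketch, not a tested crawler: PROXY_POOL, build_proxy_opener, and polite_sleep are hypothetical names, and the proxy addresses are simply the ones that appear in this article, which are very likely no longer reachable. Building the opener does not contact the proxy, so the sketch runs offline.

```python
import random
import time
from urllib import request

# Hypothetical pool; replace with proxies you have verified yourself
PROXY_POOL = [
    {'https': '120.92.74.189:3128', 'http': '183.129.207.84:21231'},
    {'https': '61.128.208.94:3128', 'http': '222.221.11.119:3128'},
]

def build_proxy_opener(proxy, user_agent):
    """Build an opener that sends requests through the given proxy mapping."""
    opener = request.build_opener(request.ProxyHandler(proxy))
    opener.addheaders = [('User-Agent', user_agent)]
    return opener

def polite_sleep(low=1, high=5):
    """Sleep for a random whole number of seconds in [low, high] and return it."""
    delay = random.randint(low, high)
    time.sleep(delay)
    return delay

# Pick a proxy at random for this run; no request is sent yet
opener = build_proxy_opener(random.choice(PROXY_POOL), 'Mozilla/5.0')
print(opener.addheaders)
```

In a crawl loop, one would call polite_sleep() before each opener.open(url), so that no single IP requests pages at a fixed, rapid pace.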

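Saving cookie information

Cookies: some sites need to identify the user, and certain pages can only be accessed after logging in.
To track the session, the site stores user-related information, such as the user name, on the client.
CookieJar is the base class. FileCookieJar derives from it, and MozillaCookieJar and LWPCookieJar are two subclasses of FileCookieJar.

CookieJar: an object that manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. All cookies are kept in memory, so they are lost once the CookieJar instance is garbage-collected.

FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar. It creates a FileCookieJar instance that retrieves cookie information and stores cookies in a file. filename is the name of the file that holds the cookies. When delayload is True, file access is deferred: the file is read or written only when needed.

MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar. It creates a FileCookieJar instance that is compatible with the Mozilla browser's cookies.txt format.

LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar. It creates a FileCookieJar instance that is compatible with the libwww-perl Set-Cookie3 file format.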
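
The class relationships described above can be checked directly from Python. This is only a quick sanity-check sketch using the standard library; it creates no files and makes no requests.

```python
from http import cookiejar

# FileCookieJar extends CookieJar; the two file formats extend FileCookieJar
print(issubclass(cookiejar.FileCookieJar, cookiejar.CookieJar))
print(issubclass(cookiejar.MozillaCookieJar, cookiejar.FileCookieJar))
print(issubclass(cookiejar.LWPCookieJar, cookiejar.FileCookieJar))

# A plain CookieJar starts empty and lives only in memory
jar = cookiejar.CookieJar()
print(len(jar))
```

All three checks are expected to print True, and the new jar holds zero cookies until a response sets some.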

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request
# Create a CookieJar (class chain: CookieJar -> FileCookieJar -> MozillaCookieJar)
cookie = cookiejar.CookieJar()
# Use HTTPCookieProcessor from urllib.request to create a cookie handler
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler
# (urlopen uses a default opener; here we build our own)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item)

How to save cookies to a file in a specific format?

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request
# Name of the file that will hold the cookies
cookieFilename = 'cookie.txt'
# Create a MozillaCookieJar, which stores cookies and can write them to a file
cookie = cookiejar.MozillaCookieJar(filename=cookieFilename)
# Use HTTPCookieProcessor from urllib.request to create a cookie handler
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler
# (urlopen uses a default opener; here we build our own)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# ignore_discard: save cookies even if they are marked to be discarded
# ignore_expires: save cookies even if they have already expired
cookie.save(ignore_discard=True,ignore_expires=True)

How to load cookies from a file and use them for a request?

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request
# Location of the cookie file
cookieFilename = 'cookie.txt'
# Create a MozillaCookieJar, used here to read the cookies back from the file
cookie = cookiejar.MozillaCookieJar()
# Load the cookies; the same flags as in save() keep session and expired cookies
cookie.load(filename=cookieFilename, ignore_discard=True, ignore_expires=True)
# Use HTTPCookieProcessor from urllib.request to create a cookie handler
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler
# (urlopen uses a default opener; here we build our own)
opener = request.build_opener(handler)
# Open the page
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

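Parsing and building URLs

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Purpose: splits a URL into 6 parts and returns them as a named tuple:
scheme, network location (host:port), path, params, query, and fragment.
urllib.parse.urlsplit
urlparse and urlsplit are almost identical. The only difference is that the result of urlparse has a params attribute and the result of urlsplit does not. For example, given url = 'http://www.baidu.com/s;hello?wd=python&username=abc#1',
urlparse can extract hello, whereas urlsplit cannot. In practice, params are rarely used in URLs.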

from urllib import parse
url = 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=' \
      '1&tn=baidu&wd=hello&rsv_pq=d0f841b10001fab6&rsv_t=' \
    '2d43603JgfgVkvPtTiNX%2FIYssE6lWfmSKxVCtgi0Ix5w1mnjks2eEMG%2F0Gw&rqlang=' \
      'cn&rsv_enter=1&rsv_sug3=6&rsv_sug1=4&rsv_sug7=101&rsv_sug2=0&inputT=838&rsv_sug4=1460'
parsed_tuple = parse.urlparse(url)
print(parsed_tuple)
print(parsed_tuple.scheme,parsed_tuple.netloc,sep='\n')
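
To make the urlparse/urlsplit difference concrete, here is a short sketch that runs both functions on the example URL with the ;hello params segment. The variable names are illustrative.

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.baidu.com/s;hello?wd=python&username=abc#1'

parsed = urlparse(url)   # 6 fields, including params
split = urlsplit(url)    # 5 fields, no params

print(parsed.path, parsed.params)   # params is separated from the path
print(split.path)                   # ';hello' stays inside the path
print(parsed.query == split.query, parsed.fragment == split.fragment)
```

With urlparse the path is '/s' and params is 'hello'; with urlsplit the path is '/s;hello'. Query and fragment are the same in both results.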

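Building a URL by encoding a dict:
When a browser sends a request and the URL contains Chinese or other special characters, the browser encodes them automatically. When you send the request from code, you have to do the encoding yourself, and urlencode is the function for that: it converts a dict into URL-encoded data.
parse_qs: decodes URL-encoded query parameters.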

from urllib.parse import urlencode
from urllib.parse import parse_qs
params = {
      'name':'爬蟲',
      'age':20
}
base_url = 'http://www.baidu.com?'
url = base_url+urlencode(params)
print(url)
print(urlencode(params))
print(parse_qs(urlencode(params)))
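
One detail of the round trip is easy to miss, so here is a small sketch of it using the same params dict: parse_qs does not give back the original dict, because every value comes back as a list of strings.

```python
from urllib.parse import urlencode, parse_qs

params = {'name': '爬蟲', 'age': 20}

query = urlencode(params)   # non-ASCII characters are percent-encoded as UTF-8
decoded = parse_qs(query)   # values come back as lists of strings

print(query)
print(decoded)
print(decoded['name'][0] == params['name'])
```

The integer 20 comes back as the string '20' inside a list, so convert it with int(decoded['age'][0]) if a number is needed.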

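Handling common urllib exceptions

from urllib import request, error
try:
    url = 'https://mp.csdn.net/cooffee/hello.html'
    response = request.urlopen(url, timeout=1)
    print(response.read().decode('utf-8'))
# HTTPError is a subclass of URLError, so it must be caught first
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    # Raised for timeouts as well as DNS and connection failures
    print(e.reason)
    print('Request failed (possibly a timeout)')
else:
    print("Success")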
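
Since URLError covers more than timeouts, it can help to look at e.reason before reporting the cause. The sketch below is one possible way to do that; describe_error is a hypothetical helper, and the exceptions are constructed by hand so that the example runs without a network connection.

```python
import io
import socket
from email.message import Message
from urllib import error

def describe_error(exc):
    """Return a short label for a urllib exception (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so test for it first
    if isinstance(exc, error.HTTPError):
        return 'HTTP error %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return 'timed out'
        return 'connection problem: %s' % exc.reason
    return 'unexpected: %r' % (exc,)

# Hand-built exceptions standing in for real failures
http_404 = error.HTTPError('http://example.com/x', 404, 'Not Found',
                           Message(), io.BytesIO(b''))
timeout = error.URLError(socket.timeout('timed out'))
dns = error.URLError('name resolution failed')

for exc in (http_404, timeout, dns):
    print(describe_error(exc))
```

In real code the helper would be called inside the except blocks, for example print(describe_error(e)).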

The requests module

Example:

import requests
url = 'http://www.baidu.com'
response = requests.get(url)
print(response)
print(response.status_code)
print(response.cookies)
print(response.text)

Common request types:

import requests
# POST: submit form data
response = requests.post('http://httpbin.org/post',data={
    'name':'cooffee','age':20})
print(response.text)
# DELETE
response = requests.delete('http://httpbin.org/delete',data={
    'name':'cooffee','age':20})
print(response.text)

GET request with query parameters:

import requests
data={
    'start':20,
    'limit':40,
    'sort':'new_score',
    'status':'P',
}
url = 'https://movie.douban.com/subject/4864908/comment?'
response = requests.get(url,params=data)
print(response.url)

Parsing a JSON response:

import requests
ip = input("Enter the IP address to look up: ")
url = "http://ip.taobao.com/service/getIpInfo.php?ip=%s" %(ip)
response = requests.get(url)
content = response.json()
print(content,type(content),sep='\n')

Fetching binary data

import requests
url='https://gss0.bdstatic.com' \
    '/-4o3dSag_xI4khGkpoWK1HF6hhy/baike' \
    '/w%3D268%3Bg%3D0/sign=4f7bf38ac3fc1e17fdbf8b3772ab913e' \
    '/d4628535e5dde7119c3d076aabefce1b9c1661ba.jpg'
response = requests.get(url)
with open('github.png','wb') as f:
    f.write(response.content)

Adding headers:

import requests
url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
headers = {
    'User-Agent': user_agent
}
response = requests.get(url,headers=headers)
print(response.status_code)

Working with the response:
response = requests.get(url, headers=headers)
print(response.text) # body decoded as text
print(response.content) # body as raw bytes
print(response.status_code) # HTTP status code
print(response.headers) # response headers
print(response.url) # final URL

Checking the status code:
response = requests.get(url, headers=headers)
if response.status_code != 200:
    exit()
print("Request succeeded")

Uploading a file:

import requests
# Open in binary mode; the with block closes the file after the upload
with open('github.png','rb') as f:
    response = requests.post('http://httpbin.org/post',files={'file':f})
print(response.text)

Reading cookie information:

import requests
response = requests.get('http://www.csdn.net')
print(response.cookies)
for key,value in response.cookies.items():
    print(key + "=" + value)

Reusing existing cookies across requests (session persistence):

import requests
s = requests.session()
response1 = s.get('http://httpbin.org/cookies/set/name/cooffee')
response2 = s.get('http://httpbin.org/cookies')
print(response2.text)

Proxy settings and timeout:

import requests
proxy = {
    'https':'61.128.208.94:3128',
    'http':'222.221.11.119:3128'
}
response = requests.get('http://httpbin.org/get', proxies=proxy,  timeout=10)
print(response.text)

Scraping a blog

import requests
import re
import pdfkit
from bs4 import BeautifulSoup
from itertools import chain
def get_blog_urlli():
    urlli = []
    for page in range(3):
        url = 'https://blog.csdn.net/weixin_42635252/article/list/'+str(page+1)
        responsea = requests.get(url)
        soup = BeautifulSoup(responsea.text,'html5lib')
        Btitle = soup.find_all(target="_blank")
        pattern = r'<a href="(https://[\w+\./]+)" target="_blank">[\s]+<span'
        urlmore=re.findall(pattern,str(Btitle))
        urlli.append(urlmore)
    return urlli
def get_blog_content(urlli):
    titlename=[]
    for url in chain(*urlli):
        response = requests.get(url)
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.text, 'html5lib')
        # Get the contents of the <head> tag
        head = soup.head
        # Get the blog post title
        title = soup.find_all(class_="title-article")[0].get_text()
        print(title)
        # Get the blog post body
        content = soup.find_all(class_="article_content")[0]
        # Write to a local file
        # other = 'http://passport.csdn.net/account/login?from='
        with open('%s.html'%(title), 'w') as f:
            f.write(str(head))
            f.write('<h1>%s</h1>\n\n' %(title))
            f.write(str(content))
        titlename.append(title)
    return titlename
def change_pdf(titlename):
    for title in titlename:
        try:
            pdfkit.from_file('%s.html' % (title), '/home/kiosk/Desktop/blog/%s.pdf' % (title))
        except OSError as e:
            print(e)
urlli=get_blog_urlli()
titlename=get_blog_content(urlli)
change_pdf(titlename)
