Building a Crawler to Scrape Baidu Images

https://apriljia.com/2018/09/19/%E5%88%B6%E4%BD%9C%E7%88%AC%E8%99%AB%E7%88%AC%E5%8F%96%E7%99%BE%E5%BA%A6%E5%9B%BE%E7%89%87/

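We often need to collect data, especially image data. Finding and downloading images one at a time is tedious, so the job is better left to a script. I wrote a simple Python 3 crawler for Baidu Images; the code is on GitHub: https://github.com/plutojia/crawler-for-baiduImage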

Let's first open Baidu Images and see what it looks like:
Search for "美女" (beautiful women):

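While browsing the results, you will notice that new images appear as you scroll down. They are not loaded up front; the page requests them as you scroll. This waterfall-style loading means Baidu Images uses Ajax, so let's look at what its Ajax requests actually contain.
On the results page for 美女, open Chrome's developer tools, select the Network tab, then the XHR filter inside it. Scroll the page down a few screens and the requests should appear, as shown: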

Double-click any of these requests to open it:

The request URL is https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%BE%8E%E5%A5%B3&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&word=%E7%BE%8E%E5%A5%B3&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&cg=girl&pn=120&rn=30&gsm=78&1537365676312=

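The leading https://image.baidu.com/search/acjson? is clear enough, but what is the long string after it? Don't worry, keep reading. Further down in the developer tools you will find this section: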

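The Query String Parameters shown here are simply our request parameters. Most of them are fixed; only a few need attention: queryWord and word are the search keyword, rn is the number of images per page (usually 30), and pn is the number of images already shown, so set it to 30*n. That completes the analysis of the request. Let's write some code to try it out.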
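
To make the parameter analysis concrete, here is a minimal sketch that uses only the standard library to split a shortened copy of the captured URL back into its parameters. The shortened URL and the variable names are illustrative assumptions; only word, pn, and rn come from the request analyzed above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Shortened copy of the captured request URL (only the parameters discussed here).
url = ('https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj'
       '&queryWord=%E7%BE%8E%E5%A5%B3&word=%E7%BE%8E%E5%A5%B3&pn=120&rn=30')

# keep_blank_values=True would also preserve empty parameters such as "is=".
params = parse_qs(urlsplit(url).query, keep_blank_values=True)

keyword = params['word'][0]        # percent-decoded search keyword
page_size = int(params['rn'][0])   # images per request
offset = int(params['pn'][0])      # images already shown
page_index = offset // page_size   # pn = rn * n, so n = pn // rn

print(keyword, page_size, offset, page_index)

# Going the other way: urlencode percent-encodes the keyword as UTF-8.
query = urlencode({'word': keyword, 'pn': page_size * page_index, 'rn': page_size})
print(query)
```

The second print shows the keyword encoded back to %E7%BE%8E%E5%A5%B3, which is why the keyword can be passed to urlencode as plain text.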

First, import some commonly used modules:

from urllib.parse import urlencode
import requests
import re
import os

def get_page(offset):
    params = {
        'tn': 'resultjson_com',
        'ipn': 'rj',
        'ct':'201326592',
        'is':'',
        'fp': 'result',
        'queryWord': '帥哥',
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid':'',
        'st': '-1',
        'z':'',
        'ic': '0',
        'word': '帥哥',
        's':'',
        'se':'',
        'tab':'',
        'width':'',
        'height':'',
        'face': '0',
        'istype': '2',
        'qc':'',
        'nc': '1',
        'fr':'',
        'expermode':'',
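        'cg': 'girl',        # category value copied from the captured request
        'pn': offset*30,     # number of images already shown (30 per page)
        'rn': '30',          # images per page
        'gsm': '1e',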
        '1537355234668':'',
    }
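    url = 'https://image.baidu.com/search/acjson?' + urlencode(params)
    try:
        # A timeout keeps the script from hanging on a stalled connection.
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.json()
    except requests.RequestException as e:
        print('Error', e.args)
    # Reached on a failed request or a non-200 status.
    return None

if __name__=='__main__':
    json = get_page(1)
    print(json)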

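Run the code above and it prints the response as JSON; if the output appears, the request is correct. Next we need to extract the image URLs from it, as follows:
In the developer tools, select Preview and expand data. You will see a list of entries numbered 0, 1, 2, 3, and so on; each one represents an image and its metadata. Open one of them, for example 0: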

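The objURL field here is the real address of the image, and it points to the uncompressed full-size picture. But why does the address look so strange? Because Baidu obfuscates it. Decoding it takes just one function: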

def baidtu_uncomplie(url):
    res = ''
    # Multi-character tokens that stand for URL punctuation.
    c = ['_z2C$q', '_z&e3B', 'AzdH3F']
    d= {'w':'a', 'k':'b', 'v':'c', '1':'d', 'j':'e', 'u':'f', '2':'g', 'i':'h', 't':'i', '3':'j', 'h':'k', 's':'l', '4':'m', 'g':'n', '5':'o', 'r':'p', 'q':'q', '6':'r', 'f':'s', 'p':'t', '7':'u', 'e':'v', 'o':'w', '8':'1', 'd':'2', 'n':'3', '9':'4', 'c':'5', 'm':'6', '0':'7', 'b':'8', 'l':'9', 'a':'0', '_z2C$q':':', '_z&e3B':'.', 'AzdH3F':'/'}
    # Missing or already-plain URLs are returned unchanged.
    if url is None or 'http' in url:
        return url
    else:
        j = url
        # Replace the tokens first; they contain characters that are also in the table.
        for m in c:
            j = j.replace(m, d[m])
        # Then substitute the remaining characters one by one.
        for char in j:
            if re.match(r'^[a-w0-9]+$', char):
                char = d[char]
            res = res + char
        return res
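
As a quick check of the mapping, here is a minimal sketch that applies the same substitution pairs with str.translate and decodes a hand-built sample. The helper name decode_obj_url and the sample string are hypothetical; the sample was constructed by hand from the table above and is not taken from a real Baidu response. Unlike baidtu_uncomplie, this sketch does not pass None or already-plain URLs through unchanged.

```python
# Multi-character tokens stand for URL punctuation.
TOKENS = {'_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}

# Single-character substitution, same pairs as the dictionary d above.
TABLE = str.maketrans('wkv1ju2it3hs4g5rq6fp7eo8dn9cm0bla',
                      'abcdefghijklmnopqrstuvw1234567890')

def decode_obj_url(encoded):
    """Hypothetical helper: same mapping as baidtu_uncomplie, via str.translate."""
    # Tokens are replaced first because they contain characters from the table.
    for token, plain in TOKENS.items():
        encoded = encoded.replace(token, plain)
    # Characters outside the table (such as 'x') are left unchanged.
    return encoded.translate(TABLE)

# Hand-built sample, not a real objURL.
sample = 'ippr_z2C$qAzdH3FAzdH3Fooo_z&e3Bjxw4rsj_z&e3Bv54AzdH3Fw_z&e3B3r2'
print(decode_obj_url(sample))  # http://www.example.com/a.jpg
```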

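Pass objURL in as the argument and the function returns the real image address. All that remains is a function that parses the response to collect each objURL, converts it to the real address, and downloads the image. Nothing difficult is left, so here is the complete code: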

from urllib.parse import urlencode
import requests
import re
import os
save_dir='baidutu/'

def baidtu_uncomplie(url):
    res = ''
    # Multi-character tokens that stand for URL punctuation.
    c = ['_z2C$q', '_z&e3B', 'AzdH3F']
    d= {'w':'a', 'k':'b', 'v':'c', '1':'d', 'j':'e', 'u':'f', '2':'g', 'i':'h', 't':'i', '3':'j', 'h':'k', 's':'l', '4':'m', 'g':'n', '5':'o', 'r':'p', 'q':'q', '6':'r', 'f':'s', 'p':'t', '7':'u', 'e':'v', 'o':'w', '8':'1', 'd':'2', 'n':'3', '9':'4', 'c':'5', 'm':'6', '0':'7', 'b':'8', 'l':'9', 'a':'0', '_z2C$q':':', '_z&e3B':'.', 'AzdH3F':'/'}
    # Missing or already-plain URLs are returned unchanged.
    if url is None or 'http' in url:
        return url
    else:
        j = url
        # Replace the tokens first; they contain characters that are also in the table.
        for m in c:
            j = j.replace(m, d[m])
        # Then substitute the remaining characters one by one.
        for char in j:
            if re.match(r'^[a-w0-9]+$', char):
                char = d[char]
            res = res + char
        return res

def get_page(offset):
    params = {
        'tn': 'resultjson_com',
        'ipn': 'rj',
        'ct':'201326592',
        'is':'',
        'fp': 'result',
        'queryWord': '帥哥',
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid':'',
        'st': '-1',
        'z':'',
        'ic': '0',
        'word': '帥哥',
        's':'',
        'se':'',
        'tab':'',
        'width':'',
        'height':'',
        'face': '0',
        'istype': '2',
        'qc':'',
        'nc': '1',
        'fr':'',
        'expermode':'',
        'pn': offset*30,
        'rn': '30',
        'gsm': '1e',
        '1537355234668':'',
    }
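    url = 'https://image.baidu.com/search/acjson?' + urlencode(params)
    try:
        # A timeout keeps the script from hanging on a stalled connection.
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.json()
    except requests.RequestException as e:
        print('Error', e.args)
    # Reached on a failed request or a non-200 status.
    return None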

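def get_images(json):
    # get_page returns None when the request fails, so check that first.
    if json and json.get('data'):
        for item in json.get('data'):
            # Fall back to a placeholder when the entry has no page title.
            title = item.get('fromPageTitle') or 'noTitle'
            # Entries without an objURL decode to None and are skipped.
            image = baidtu_uncomplie(item.get('objURL'))
            if image:
                yield {
                    'image': image,
                    'title': title
                }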

def save_image(item,count):
    try:
        # A timeout keeps one slow image host from stalling the whole run.
        response = requests.get(item.get('image'), timeout=10)
        if response.status_code == 200:
            # Files are named by their running number: 1.jpg, 2.jpg, ...
            file_path = save_dir+'{0}.{1}'.format(str(count), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.RequestException:
        print('Failed to Save Image')

def main(pageIndex,count):
    json = get_page(pageIndex)
    for image in get_images(json):
        save_image(image, count)
        count += 1
    return count

if __name__=='__main__':
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    count=1
    # Pages 1 to 19, 30 images per page.
    for i in range(1,20):
        count=main(i,count)
    # count starts at 1, so the number of images processed is count - 1.
    print('total:',count-1)

The downloaded images are named 1, 2, 3, 4, and so on, and the script prints the total number of images at the end.
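
Note that save_image gives every file a .jpg extension regardless of the actual format. As one possible refinement, here is a minimal sketch that takes the extension from the decoded URL instead. The helper names guess_extension and build_file_path, the list of accepted extensions, and the example.com URLs are all assumptions for illustration and are not part of the original script.

```python
import os
from urllib.parse import urlsplit

# Extensions accepted as-is; anything else falls back to the default.
KNOWN_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'}

def guess_extension(image_url, default='.jpg'):
    """Hypothetical helper: take the extension from the URL path, else fall back."""
    path = urlsplit(image_url).path          # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()  # '' when the path has no extension
    return ext if ext in KNOWN_EXTENSIONS else default

def build_file_path(save_dir, count, image_url):
    """Hypothetical helper: numbered file name with the guessed extension."""
    return os.path.join(save_dir, str(count) + guess_extension(image_url))

print(build_file_path('baidutu', 1, 'http://www.example.com/a.PNG?size=large'))
print(build_file_path('baidutu', 2, 'http://www.example.com/download'))
```

The URL extension is only a hint; a server can return a different format than the path suggests, so this keeps the file names tidier without guaranteeing the format.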
