Python Web Scraping: Grabbing Images from Zhihu

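Lately I've found Python web scraping pretty fun, so I looked up some tutorials online and taught myself for a few days. It really is interesting. I recommend a Python scraping course on the China University MOOC platform: the instructor explains things well and it is a good fit for beginners. Here is the link.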

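I remembered a really fun project I had once seen in a Zhihu column. Back then I didn't know how to write a scraper, so I just bookmarked the article; the code is here. Looking at it again now, it turns out to be fairly simple. The column article parses the HTML with lxml. I looked into it and XPath seems really handy (admittedly I don't know it well yet; I'll come back to it once I've found a tutorial), but since I've already learned BeautifulSoup, I'll do a simple implementation with BeautifulSoup.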
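For readers curious what the XPath route looks like, here is a minimal sketch. It is not the column article's code: it uses the standard library's xml.etree.ElementTree, which supports only a small XPath subset and needs well-formed markup, rather than lxml. The HTML fragment and its example.com URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like an answer body.
# The URLs are made up for this example.
FRAGMENT = """
<div class="RichContent-inner">
  <p>Some text</p>
  <noscript><img src="https://example.com/a.jpg" /></noscript>
  <noscript><img src="https://example.com/b.png" /></noscript>
</div>
"""

root = ET.fromstring(FRAGMENT)

# ".//noscript/img" selects every <img> that is a direct child of a
# <noscript> element anywhere below the root.
links = [img.get("src") for img in root.findall(".//noscript/img")]
print(links)
```

Real pages are rarely well-formed XML, so for an actual page a forgiving HTML parser such as lxml or BeautifulSoup is the more practical choice; with lxml the same selection would typically be written as an expression like `//noscript/img/@src`.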

Enough talk, here's the code. (It only downloads the images in the top-ranked answer under a single question.)

import requests
from bs4 import BeautifulSoup
import os
import time

cookie = ''  # paste your own cookie here; the column article explains how to use one
headers = {
    'User-Agent': 'Mozilla/5.0',  # pretend to be a browser when requesting the page
    'Cookie': cookie,
}


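def getHtmlText(url):
    """Fetch a page and return its HTML text; stop the script if the request fails."""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx responses
        response.encoding = 'utf-8'
        return response.text
    except requests.RequestException:
        # most likely the simulated cookie login did not work
        raise SystemExit('Simulated cookie login failed')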

def savePictures():
    html_text = getHtmlText('https://www.zhihu.com/question/40063489')
    # If you can't get the cookie login to work, save the page source by hand as zhihu.html and parse that instead:
    # soup = BeautifulSoup(open('zhihu.html', 'r', encoding='utf-8'), 'html.parser')
    soup = BeautifulSoup(html_text, 'html.parser')

    # To locate the question and the author, open the page source, find them, and check which tags they sit in
    question = soup.h1.text.strip()
    author = soup.find_all('a', class_='UserLink-link')[1].text
    # info holds the whole answer that author gave to question.
    # A question page shows the two top-ranked answers; pick the first one.
    info = soup.find_all('div', class_='RichContent-inner')[0]
    noscripts = info.find_all('noscript')  # the tags that hold all the image links
    # skip any <noscript> that does not wrap an <img>
    links = [tag.img['src'] for tag in noscripts if tag.img is not None]

    try:
        filename = question + ' - ' + author
        # print(filename)
        if not os.path.exists(filename):
            os.mkdir(filename)
        for i, link in enumerate(links):
            img_source = requests.get(link).content
            # name each file by its index and keep the extension from the link
            img_path = os.path.join(filename, str(i) + '.' + link.split('.')[-1])
            with open(img_path, 'wb') as f:
                f.write(img_source)
            print(link, 'saved')
    except (OSError, requests.RequestException) as e:
        # report what went wrong instead of a bare 'error'
        print('error:', e)

start = time.time()
savePictures()
end = time.time()

print('Total time:', end - start, 'seconds')

Here's a screenshot:
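One fragile spot in the script is that the folder name comes straight from the question title and the file extension is whatever follows the last dot in the link. A title containing `?` or `/`, or a link with a query string, can break either one. The sketch below shows one possible way to guard against that using only the standard library. The helper names `safe_dirname` and `image_extension`, the `.jpg` fallback, and the example URL are my own assumptions, not part of the original code.

```python
import os
import re
from urllib.parse import urlparse


def safe_dirname(name):
    """Replace characters that are commonly rejected in folder names."""
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()


def image_extension(url, default='.jpg'):
    """Take the extension from the URL path only, ignoring any query string."""
    ext = os.path.splitext(urlparse(url).path)[1]
    # fall back to an assumed default when the path has no extension
    return ext or default


print(safe_dirname('What is this? A/B test'))
print(image_extension('https://example.com/pic/abc.png?source=1'))
```

In the script these could be used as `filename = safe_dirname(question + ' - ' + author)` and `img_path = os.path.join(filename, str(i) + image_extension(link))`.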
