After fiddling with it for several days, I finally have a working single-process crawler for mzitu. It is only a few dozen lines of code, but since I had barely touched crawlers before, several of the pitfalls took real effort to climb out of; still, all the searching and experimenting along the way taught me things that turned out to be genuinely useful.
What I learned:
1. requests, urllib2, BeautifulSoup, and selenium+webdriver (not needed for mzitu, but I studied it anyway; see the sketch right after this list)
2. How the URL changes at each level, and how to analyze and extract it
3. Path and string handling
4. Anti-hotlinking via the 'Referer' header
The script can now crawl all the images behind the 24 links on the first page of the home page, so I am recording it here!
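As an aside, since selenium+webdriver shows up in the list above without appearing in the crawler: here is a minimal sketch of how the same front-page links could be grabbed through a real browser. This is my own illustration, not part of the script below, and the '#pins li > a' selector is an assumption about the page structure rather than something verified here.

# Minimal selenium sketch; assumes chromedriver is on the PATH and that the
# gallery links live under '#pins li > a' (an assumption, not verified).
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.mzitu.com')
for a in driver.find_elements_by_css_selector('#pins li > a'):
    print a.get_attribute('href')
driver.quit()

The actual crawler below sticks to urllib2 plus a regex for this step.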
#coding=utf-8
from bs4 import BeautifulSoup
import urllib2
import re
import time
import os
'''
Platform: Mac OS, Python 2.7
'''
url = 'http://www.mzitu.com'
localDir = os.path.expanduser('~/Desktop/mzitu')
# Headers that mimic a normal browser visit; bare requests get rejected.
header = {
    'Host': 'www.mzitu.com',
    'Accept-Language': 'en-us',
    'Connection': 'close',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15',
    'Cookie': 'Hm_lpvt_dbc355aef238b6c32b43eacbbf161c3c=1586961981; Hm_lvt_dbc355aef238b6c32b43eacbbf161c3c=1586939099'
}
def get_url_content(aURL):
    # Fetch a page with the browser-like headers above and return the raw HTML.
    request = urllib2.Request(aURL, None, header)
    html = urllib2.urlopen(request, timeout=10)
    data = html.read()
    return data
def get_all_link_from_main_url():
    # Pull the home page and regex out the 24 gallery links. A gallery <li>
    # contains two target="_blank" anchors (thumbnail + title), which is what
    # the doubled pattern relies on; the non-greedy group keeps the capture
    # from swallowing everything up to the second anchor.
    data = get_url_content(url)
    pan = r'<li><a href="(.*?)" target="_blank.* target="_blank.*'
    https = re.compile(pan).findall(data)
    return https
def get_max_page(pin):
    # The second-to-last <span> in div.pagenavi holds the set's highest page number.
    data = get_url_content(pin)
    bs = BeautifulSoup(data, 'lxml')
    mp = bs.find_all('div', class_='pagenavi')
    max_page = mp[0].find_all('span')[-2].string
    return max_page
def imageHref(pageURL):
    # Each page of a set shows one photo inside div.main-image; return its src.
    data = get_url_content(pageURL)
    bs = BeautifulSoup(data, 'lxml')
    mp = bs.find_all('div', class_='main-image')
    src = mp[0].img['src']
    return src
def downloadImage(pageURL, imgURL):
    # Save one image under ~/Desktop/mzitu/<set-id>/.
    subdir = pageURL.split('/')[-2]
    mkDir = localDir + '/' + subdir
    if not os.path.exists(mkDir):
        os.makedirs(mkDir)    # makedirs also creates ~/Desktop/mzitu on the first run
    filename = imgURL.split('/')[-1]
    localURL = mkDir + '/' + filename
    # The image server checks the Referer header (anti-hotlinking); without it
    # the request is rejected. This was one of the biggest pits.
    headersURL = {
        'Referer': 'https://www.mzitu.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    }
    rst = urllib2.Request(imgURL, None, headersURL)
    rsp = urllib2.urlopen(rst, timeout=10)
    with open(localURL, 'wb') as f:
        f.write(rsp.read())   # 'with' closes the file, no explicit close needed
def download_one_set(pOne):
    # Walk every page of one photo set and download each image.
    print pOne
    max_page = int(get_max_page(pOne))
    for m in range(max_page):
        each_url = pOne + '/' + str(m + 1)
        print each_url
        img_url = imageHref(each_url)
        print img_url
        downloadImage(each_url, img_url)
        print '\t--', img_url, ' -- ok'
        time.sleep(1)    # be gentle with the server
if __name__ == '__main__':
    p_list = get_all_link_from_main_url()
    for page in p_list:
        download_one_set(page)
    print 'download ok'
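For the record, requests from item 1 never made it into the final script, so here is a minimal requests-based sketch of the same anti-hotlink image download. The function name download_image_requests is mine, purely for illustration; the headers mirror the ones used in downloadImage above.

import requests

def download_image_requests(img_url, local_path):
    # Same Referer trick as downloadImage, just via requests instead of urllib2.
    headers = {
        'Referer': 'https://www.mzitu.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    }
    rsp = requests.get(img_url, headers=headers, timeout=10)
    rsp.raise_for_status()    # fail loudly on 403 instead of saving an error page
    with open(local_path, 'wb') as f:
        f.write(rsp.content)

One less Request/urlopen pair to juggle, which is why requests is usually the first thing people reach for.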