Exercise 0011: Downloading <杉本有美> Pictures with Python

Python Exercise No. 0011

https://github.com/Yixiaohan/show-me-the-code
Write an image scraper in Python that downloads the pictures of the Japanese girl from this thread :-)
http://tieba.baidu.com/p/2166231880

If the HTML looked like this:

<img...>...</img>
<img...>...</img>
<img...>...</img>

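then BeautifulSoup would handle it without trouble. But for pictures uploaded to Tieba, the HTML comes out like the sample below: the <img> and <br> tags are only closed in one pile at the very end. Running BeautifulSoup over that fails badly, with results worse than you would imagine!

So I went straight for a non-greedy regular expression to find all the image nodes first, and only then used BS to pull the link out of each one.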

<img bdwater="杉本有美吧,955,550" changedsize="true" class="BDE_Image" height="323" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=6b12a1088718367aad897fd51e738b68/1e29460fd9f9d72abb1a7c3cd52a2834349bbb7e.jpg" width="560"><br><img bdwater="杉本有美吧,1280,860" changedsize="true" class="BDE_Image" height="376" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=c27ae82432fa828bd1239debcd1f41cd/86674dafa40f4bfb85a9f275024f78f0f736187e.jpg" width="560"><br><img bdwater="杉本有美吧,960,700" changedsize="true" class="BDE_Image" height="408" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=cf8beb009213b07ebdbd50003cd69113/d56ca4de9c82d158f8c63590810a19d8bc3e422b.jpg" width="560"><br><img bdwater="杉本有美吧,1280,860" changedsize="true" class="BDE_Image" height="376" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=f76c7125359b033b2c88fcd225ce3620/908be71f3a292df5c3b8c034bd315c6034a87378.jpg" width="560"><br><img bdwater="杉本有美吧,1280,860" changedsize="true" class="BDE_Image" height="376" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=0c3f8f99d53f8794d3ff4826e21a0ead/4e8839738bd4b31c197bf89a86d6277f9e2ff835.jpg" width="560"><br><img bdwater="杉本有美吧,1280,860" changedsize="true" class="BDE_Image" height="376" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=d776057135a85edffa8cfe2b795409d8/5603c7160924ab18fc6c8d1634fae6cd7b890b79.jpg" width="560"><br><img bdwater="杉本有美吧,1280,860" changedsize="true" class="BDE_Image" height="376" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=ab95f855ac345982c58ae59a3cf4310b/b85ba63533fa828b9a56f1c2fc1f4134970a5a7a.jpg" width="560"><br><img bdwater="杉本有美吧,1280,860" changedsize="true" class="BDE_Image" height="376" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=fc6240c9bd3eb13544c7b7b3961ea8cb/d57664f082025aafd86712eafaedab64034f1a1a.jpg" width="560"><br><img bdwater="杉本有美吧,1280,860" changedsize="true" class="BDE_Image" height="376" pic_type="0" 
src="http://imgsrc.baidu.com/forum/w%3D580/sign=7bb08445574e9258a63486e6ac83d1d1/e2a86c899e510fb37db71bb6d833c895d0430ccd.jpg" width="560"><br><img bdwater="杉本有美吧,1280,860" changedsize="true" class="BDE_Image" height="376" pic_type="0" src="http://imgsrc.baidu.com/forum/w%3D580/sign=4583262f6609c93d07f20effaf3cf8bb/b32054a98226cffc9283d393b8014a90f703eacf.jpg" width="560"/></br></img></br></img></br></img></br></img></br></img></br></img></br></img></br></img></br></img>
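The two-step idea described above (regex first, then a per-tag parse) can be tried offline on a shortened sample. This is only a minimal sketch: the sample string and its example.com URLs are made up, `SrcCollector` is a hypothetical helper, and the standard library's `html.parser` (Python 3) stands in for BeautifulSoup so that nothing has to be installed.

```python
import re
from html.parser import HTMLParser  # Python 3 standard library


class SrcCollector(HTMLParser):
    """Hypothetical helper: records the src of every <img> tag it is fed."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Made-up stand-in for the Tieba markup: unclosed <img> tags separated by <br>.
sample = (
    '<img class="BDE_Image" height="323" src="http://example.com/a.jpg" width="560"><br>'
    '<img class="BDE_Image" height="376" src="http://example.com/b.jpg" width="560"><br>'
)

# Step 1: the non-greedy regex yields one string per <img ...> tag.
tags = re.findall(r'<img.+?class="BDE_Image".+?>', sample)

# Step 2: parse each tag on its own and collect its src attribute.
collector = SrcCollector()
for tag in tags:
    collector.feed(tag)

image_urls = collector.sources
print(image_urls)
```

Because each tag is parsed in isolation, the missing closing tags elsewhere on the page no longer matter.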

Talk is cheap, show you my code.

#! /usr/bin/env python
# -*- coding:utf-8 -*-

__author__ = 'Sophie2805'

import urllib2
from bs4 import BeautifulSoup
import re

if __name__ == '__main__':
    url ="http://tieba.baidu.com/p/2166231880"
    save_path = "/Users/Sophie/Downloads/shanben_pic/"

    headers = {
            'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
            'Referer':"http://tieba.baidu.com"
    }
    req = urllib2.Request(url = url ,headers = headers)
    html = urllib2.urlopen(req).read()

    # Non-greedy match to find every picture tag; BS alone did not work here because the HTML is malformed
    p = re.compile('<img.+?class="BDE_Image".+?>')
    list_of_pic = p.findall(html)

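    counter = 1
    for x in list_of_pic:
        # parse the single <img> tag and read its src attribute
        soup = BeautifulSoup(x, 'html.parser')
        pic_url = soup.img['src']
        # fetch the image through the Request so the headers are actually sent
        req = urllib2.Request(url=pic_url, headers=headers)
        pic = urllib2.urlopen(req).read()
        # file extension taken from the URL, e.g. ".jpg"
        postfix = pic_url[pic_url.rfind('.'):]
        # 'wb' because the image is binary data; save_path must already exist
        with open(save_path + str(counter) + postfix, 'wb') as f:
            f.write(pic)
        counter += 1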


Key takeaways

Converting between integers and strings in Python

int -> str: str(int_value)
str -> int: int(str_value)
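As a small illustration of why the script calls str(counter) when building each file name, here is a sketch with made-up values (the variable names are only for this example):

```python
counter = 7
postfix = ".jpg"

# int -> str: "+" cannot join a str and an int, so convert the number first.
file_name = str(counter) + postfix

# Without the conversion Python raises TypeError instead of guessing.
try:
    "pic" + counter
    mixed_ok = True
except TypeError:
    mixed_ok = False

# str -> int: turn digits back into a number that supports arithmetic.
next_counter = int("42") + 1

print(file_name)     # 7.jpg
print(mixed_ok)      # False
print(next_counter)  # 43
```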

Regex non-greedy mode versus greedy mode

http://deerchao.net/tutorials/regex/regex.htm
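The difference is easiest to see on a tiny string. The sketch below uses made-up markup. The last pattern is not from the original script; it is a suggested tag-bounded variant, shown because a non-greedy match still starts at the leftmost `<img` and can swallow an unrelated tag that comes before a BDE_Image one.

```python
import re

html = '<img src="a.jpg"><br><img src="b.jpg">'

# Greedy: ".+" runs to the LAST ">" and returns the whole string as one match.
greedy = re.findall(r'<img.+>', html)

# Non-greedy: ".+?" stops at the FIRST ">" and returns one match per tag.
non_greedy = re.findall(r'<img.+?>', html)

print(greedy)      # ['<img src="a.jpg"><br><img src="b.jpg">']
print(non_greedy)  # ['<img src="a.jpg">', '<img src="b.jpg">']

# Caveat: an earlier <img> without the class gets pulled into the match.
mixed = '<img class="icon" src="i.png"><img class="BDE_Image" src="a.jpg">'
lazy = re.findall(r'<img.+?class="BDE_Image".+?>', mixed)

# Suggested variant (an assumption, not the author's code): "[^>]" cannot cross a tag boundary.
bounded = re.findall(r'<img[^>]+?class="BDE_Image"[^>]*>', mixed)

print(lazy)     # the two tags glued together as one match
print(bounded)  # ['<img class="BDE_Image" src="a.jpg">']
```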
