Python Web Crawler (1) -- Scraping Images (2)

Original

outbook

2018-08-27 16:33

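The previous post used an HTML parser to parse the page source; this time we use regular expressions instead.

This post scrapes Google Images. The code is much the same as in the previous post: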

# getimage.py
import urllib.request
import re
from urllib.error import HTTPError, URLError


url = 'https://www.google.com.hk/search?safe=strict&hl=zh-CN&biw=1366&bih=638&s' \
      'ite=imghp&tbm=isch&sa=1&btnG=Google+%E6%90%9C%E7%B4%A2&q=%E8%87%AA%E7%84%B6'
# pretend to be a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
                         'Gecko/20100101 Firefox/23.0'}
url2 = urllib.request.Request(url, headers=headers)

# get the page source from the URL
fb = urllib.request.urlopen(url2)
souCode = fb.read().decode('utf-8')

# extract the .jpg URLs from souCode
# (raw string, escaped dot, non-greedy so each match ends at its own ".jpg")
downLists = re.findall(r'http\S+?\.jpg', souCode)

# download every URL in downLists into the target directory
i = 0
for lists in downLists:
    print(lists)
    try:
        # the matched URLs end in .jpg, so save with a .jpg extension
        urllib.request.urlretrieve(lists, 'C:\\image\\nature\\nature%s.jpg' % i)
    except (HTTPError, URLError, UnicodeEncodeError):
        # skip images that cannot be fetched
        continue
    i += 1
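
The regular-expression step can be tried offline before pointing it at a live page. The sketch below is only an illustration: the `sample` string is a made-up fragment (real Google result pages look different), and it compares a greedy pattern with an unescaped dot against a non-greedy pattern with an escaped dot.

```python
import re

# Hypothetical fragment with two image URLs and no whitespace between them.
sample = '{"ou":"http://example.com/a.jpg","ru":"http://example.com/b.jpg"}'

# Greedy \S+ runs to the last ".jpg", so both URLs are glued into one match.
greedy = re.findall(r'http\S+.jpg', sample)

# Non-greedy \S+? with an escaped dot stops at the first ".jpg" of each URL.
lazy = re.findall(r'http\S+?\.jpg', sample)

print(len(greedy))  # 1
print(lazy)         # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

When the source packs several URLs together without spaces, the non-greedy form is the one that yields one match per image.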

Note: exception handling is required, because some images cannot be opened and some pages contain mistakes made by their authors.
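
One way to see which downloads were skipped is to record the failures instead of silently continuing. This is a minimal sketch, not the post's code: `download_all`, `dest_pattern`, `fetch` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` stands in for a real download so the example runs without a network.

```python
import urllib.request
from urllib.error import URLError


def download_all(urls, dest_pattern, fetch=urllib.request.urlretrieve):
    """Try every URL and return (saved paths, failed (url, error) pairs).

    fetch only needs the (url, filename) call shape of
    urllib.request.urlretrieve; it is a parameter purely for illustration.
    """
    saved, failed = [], []
    for url in urls:
        dest = dest_pattern % len(saved)
        try:
            fetch(url, dest)
        except (URLError, UnicodeEncodeError) as exc:
            # HTTPError is a subclass of URLError, so it is caught here too.
            failed.append((url, exc))
            continue
        saved.append(dest)
    return saved, failed


def fake_fetch(url, filename):
    # Hypothetical stand-in: pretend URLs containing "bad" are unreachable.
    if 'bad' in url:
        raise URLError('unreachable')


saved, failed = download_all(
    ['http://example.com/ok.jpg', 'http://example.com/bad.jpg'],
    'nature%s.jpg', fetch=fake_fetch)
print(saved)                        # ['nature0.jpg']
print([url for url, _ in failed])   # ['http://example.com/bad.jpg']
```

With the list from the script above, the call would presumably look like `download_all(downLists, 'C:\\image\\nature\\nature%s.jpg')`, leaving `fetch` at its default.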
