Python Web Crawler (1) -- Downloading Images

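A web crawler fetches the things you want from specific web pages. More precisely, it filters what you want out of the page's source code.

This article uses a fairly simple approach to download some images from a web page.

The main modules used are urllib.request and html.parser. Yes, as you can see, no regular expressions are needed.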

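The steps are simple:

1. Get the page source

2. Extract the information you need from the source (here, the download links of the images)

3. Open each image link and download it to a directory.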

The page to crawl is: http://www.easyicon.net/iconsearch/book/ (for downloading icons)

The code is as follows:

1. Get the page source

# getimage.py
import urllib.request
from html.parser import HTMLParser

url = 'http://www.easyicon.net/iconsearch/book/'
# pretend to be a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
                         'Gecko/20100101 Firefox/23.0'}
url2 = urllib.request.Request(url, headers=headers)

# get the source code from url
fb = urllib.request.urlopen(url2)
souCode = fb.read().decode('utf-8')

Note: if you do not wrap the request with headers, you will very likely get a 403 error, because many sites have anti-crawler measures.
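To make the 403 case visible instead of letting the script crash, the request can be wrapped in a small helper. This is only a sketch: the names build_request and fetch_source, the timeout value, and the fallback to utf-8 are assumptions for illustration and are not part of the original script.

```python
import urllib.error
import urllib.request

# Example browser-like header; any realistic User-Agent string would do.
DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
                  'Gecko/20100101 Firefox/23.0'
}


def build_request(url, headers=None):
    """Wrap url in a Request that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_source(url, timeout=10):
    """Return the decoded page source, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            # Use the charset announced by the server, else assume utf-8.
            charset = resp.headers.get_content_charset() or 'utf-8'
            return resp.read().decode(charset, errors='replace')
    except urllib.error.HTTPError as err:
        # HTTPError must be caught first: it is a subclass of URLError.
        print('HTTP error', err.code, 'for', url)
    except urllib.error.URLError as err:
        print('Could not reach', url, ':', err.reason)
    return None
```

Calling fetch_source('http://www.easyicon.net/iconsearch/book/') would then return either the page text or None, with the status code printed if the server refuses the request.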

2. Extract information from the source

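Because this is a filtering process, it is best to first open the page source and see exactly what the thing you want looks like (in Firefox or Google Chrome, right-click the page to view the source). To download an image, for example, you find its download link and pull it out. In this case, what I am looking for is the address held in an a tag whose second attribute has the value 'PNG 格式圖標下載' (PNG icon download).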

A simple HTML parser is used here instead of regular expressions (see the html.parser documentation for usage details):

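# get what you want from souCode
downLists = []


class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a' and len(attrs) == 2:
            # the second attribute marks the PNG download link,
            # the first attribute holds its address
            if attrs[1][1] == 'PNG 格式圖標下載':
                downLists.append(attrs[0][1])

parser = MyHTMLParser()
parser.feed(souCode)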
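The parser above relies on the position and number of attributes, so it would stop matching if the site reordered them or added one. A variant that looks attributes up by name might look like the sketch below. It assumes the link address is in href and the marker text is in title, which the original does not state, and the sample markup is invented for the demonstration.

```python
from html.parser import HTMLParser


class LinkByTitleParser(HTMLParser):
    """Collect href values of <a> tags whose title equals wanted_title."""

    def __init__(self, wanted_title):
        super().__init__()
        self.wanted_title = wanted_title
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # Turn the (name, value) pairs into a dict so order does not matter.
        attr_map = dict(attrs)
        if attr_map.get('title') == self.wanted_title and attr_map.get('href'):
            self.links.append(attr_map['href'])


# Made-up markup, only to show the matching behaviour.
sample = (
    '<a title="PNG 格式圖標下載" class="x" href="http://example.com/a.png">PNG</a>'
    '<a href="http://example.com/b.ico" title="ICO">ICO</a>'
)
parser = LinkByTitleParser('PNG 格式圖標下載')
parser.feed(sample)
print(parser.links)  # prints ['http://example.com/a.png']
```

Keeping the results on the parser instance rather than in a module-level list also makes it easier to parse several pages independently.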

3. Download the links and save them to a directory:

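# download from downLists to your directory
# (the directory C:\image must already exist)
for i, link in enumerate(downLists):
    print(link)
    # raw string so the backslashes in the Windows path are kept literally
    urllib.request.urlretrieve(link, r'C:\image\down%s.png' % i)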
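The download loop can also be written so that it creates the target directory and skips links that fail instead of aborting. The following is a minimal sketch under assumptions: download_all, its parameters, and the naming scheme prefix + index + extension are illustrative names, not something the original defines.

```python
import os
import urllib.request


def download_all(links, target_dir, prefix='down', ext='.png'):
    """Save each link as <prefix><index><ext> in target_dir; return saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for index, link in enumerate(links):
        path = os.path.join(target_dir, '%s%d%s' % (prefix, index, ext))
        try:
            urllib.request.urlretrieve(link, path)
        except OSError as err:
            # URLError and HTTPError are subclasses of OSError.
            print('skipped', link, ':', err)
            continue
        saved.append(path)
    return saved
```

With the list built in step 2, a call such as download_all(downLists, r'C:\image') would return the paths of the files that were actually written. Note that urlretrieve sends urllib's default User-Agent, so a site that rejected the page request without headers may reject these downloads as well.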

That's it. Put the code snippets together and run them, and the images will show up in your directory.

Following the same pattern, you can download other things too.
