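A Summary of Basic Web Scraping in Python

1. How scraping works

Much like a browser client, a crawler sends a request to a website's server; the request is usually addressed by a URL, that is, a web address. The server then responds with an HTML page, though other data types can come back as well. Together these make up the page content. Our job is to parse this content, pick out the parts we want, and write them to local storage in the required format.

2. Basic crawler workflow

1. Getting the page response

There are two common ways to do this: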

def get_html(url):
    html = requests.get(url)    # requires: import requests
    return html.text

or

def get_html(url):
    html = urllib.request.urlopen(url)    # requires: import urllib.request
    return html.read()

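The first approach, get, returns a Response object holding everything the server sent back, including the response headers and the status code. Printing the Response object directly shows only <Response [200]>. There are two attributes for extracting the body, content and text: content returns bytes, while text returns a Unicode string (for a basic crawler like this either one works; I'm still figuring out the encoding details -_-). Here we simply return .text.
For the second approach, I'll quote a description found online:

urlopen opens a URL. The url argument can be either a URL string or a Request object, and the return value is an http.client.HTTPResponse object. That object provides, among others, the methods read(), readinto(), getheader(), getheaders() and fileno(), and the attributes msg, version, status, reason, debuglevel and closed. In general, a call to read() needs to be followed by decode(). One big advantage here is that the returned page content has not actually been decoded yet, so after read() returns the content you can pick the decoding by passing the encoding to decode().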
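The read()-then-decode() step can be sketched as follows. This is a minimal illustration rather than part of the original program: fetch_decoded is a hypothetical helper name, and the data: URL is only a stand-in for a real page so that the sketch runs without network access (for a data: URL, urlopen returns a simpler response object than HTTPResponse, but read() still returns bytes).

```python
import base64
import urllib.request


def fetch_decoded(url, encoding="utf-8"):
    """Return (raw_bytes, decoded_text) for the given URL."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()          # bytes, not decoded yet
    return raw, raw.decode(encoding)   # decode with the encoding we choose


# Stand-in for a real page (an assumption for this demo): a data: URL
# carrying GBK-encoded HTML, so nothing is fetched over the network.
page = "<p>站酷</p>"
payload = base64.b64encode(page.encode("gbk")).decode("ascii")
demo_url = "data:text/html;base64," + payload

raw, text = fetch_decoded(demo_url, encoding="gbk")
print(type(raw).__name__, text)
```

Decoding the same bytes with the wrong encoding (for example utf-8 here) would raise UnicodeDecodeError or produce garbled text, which is why being able to choose the encoding matters.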

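2. Parsing the page content

Regular expressions are a good option, but I'm not very good with them. Fortunately a powerful third-party library, BeautifulSoup, helps a great deal.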

soup = BeautifulSoup(html, 'html.parser')    # requires: from bs4 import BeautifulSoup
urls = soup.find_all('div', attrs={'class': 'bets-name'})
print(urls[0])

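BeautifulSoup offers many methods. First create a soup instance using the built-in html.parser parser (lxml and others also work). Then pass arguments that match the target tag in order to locate it. Pay attention to what find_all returns: a list-like collection of the matching tags, not a single tag.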
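As a small sketch of what find_all gives back, the snippet below parses an inline HTML string instead of a downloaded page. The markup and the link targets are made up for illustration; only the 'bets-name' class comes from the snippet above.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page
html = """
<div class="bets-name"><a href="/a">First</a></div>
<div class="bets-name"><a href="/b">Second</a></div>
<div class="other">Ignored</div>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("div", attrs={"class": "bets-name"})

# find_all returns a list-like ResultSet of Tag objects, so it can be
# indexed, measured with len(), and iterated over
names = [div.get_text(strip=True) for div in items]
links = [div.find("a")["href"] for div in items]
print(len(items), names, links)
```

Because the result is a collection, indexing it with [0] on a page that has no matching tag raises IndexError, so checking that it is non-empty first is a reasonable habit.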

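3. Downloading the data locally

Text can be written to a file directly. For images, request the image link again and write the response's content (the raw bytes) to a file.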
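The difference between the two kinds of write can be sketched like this. The helper names save_text and save_bytes are hypothetical, the byte string is a placeholder rather than a real image, and a temporary directory is used so the sketch does not depend on any particular path.

```python
import os
import tempfile


def save_text(path, text):
    # Text mode: the string is encoded to bytes on the way out
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def save_bytes(path, data):
    # Binary mode: bytes such as response.content are written unchanged
    with open(path, "wb") as f:
        f.write(data)


out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, "page.txt")
image_path = os.path.join(out_dir, "1.jpg")

save_text(text_path, "站酷")
save_bytes(image_path, b"\xff\xd8\xff\xe0 placeholder, not a real image")
print(os.listdir(out_dir))
```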

3. Scraping images from ZCOOL (站酷)

PyCharm is used as the development tool here.

# coding: utf-8
# date: 2018/04/04
# target: pictures on ZCOOL (站酷)

from bs4 import BeautifulSoup
import requests

def get_html(url):
    html = requests.get(url)
    return html.text

def Download(html, filepath):
    soup = BeautifulSoup(html, 'html.parser')
    # Each thumbnail sits in a div whose class attribute is exactly this string
    urls = soup.find_all('div', class_="imgItem maskWraper")
    count = 1

    try:
        for url in urls:
            img = url.find('img')
            print(img)
            # The real image address is stored in the data-original attribute
            img_url = img['data-original']
            req = requests.get(img_url)
            # Write the image in binary mode
            with open(filepath + '/' + str(count) + '.jpg', 'wb') as f:
                f.write(req.content)
            count += 1
            if count == 11:      # stop after ten images
                break
    except Exception as e:
        print(e)

def main():
    url = "http://www.hellorf.com/image/search/%E5%9F%8E%E5%B8%82/?utm_source=zcool_popular"  # target URL
    filepath = "D://桌面/Python/study_one/Spider_practice/Spider_File/icon"                    # directory where the images are saved
    html = get_html(url)
    Download(html,filepath)

if __name__ == "__main__":
    main()
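One possible refinement, sketched here under assumptions, is to separate the URL extraction from the downloading so that a div without an img tag, or an img without data-original, is skipped instead of ending the whole loop through the except branch. The function name extract_image_urls, the sample markup, and the example.com addresses are all hypothetical; only the class string and the data-original attribute come from the program above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url, limit=10):
    """Collect up to `limit` absolute image URLs from the page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for div in soup.find_all("div", class_="imgItem maskWraper"):
        if len(found) >= limit:
            break
        img = div.find("img")
        # Skip blocks with no image or no lazy-load attribute
        if img is None or not img.get("data-original"):
            continue
        # Resolve relative or protocol-relative links against the page URL
        found.append(urljoin(base_url, img["data-original"]))
    return found


# Made-up markup for illustration only
sample = """
<div class="imgItem maskWraper"><img data-original="//img.example.com/1.jpg"></div>
<div class="imgItem maskWraper"><span>no image here</span></div>
<div class="imgItem maskWraper"><img data-original="/pics/2.jpg"></div>
"""
image_urls = extract_image_urls(sample, "http://www.example.com/search")
print(image_urls)
```

Each URL in the returned list could then be passed to requests.get and its content written in binary mode, as in Download above.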
