A simple web crawler in Python 3

Original

Skyones

2018-10-27 03:35

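The goal is to scrape some images from Baidu Tieba.

Opening the target page

This step mainly relies on urllib from Python's standard library:

request.urlopen() opens the target page
read() reads the page content (returned as bytes)

The initial code is therefore: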

# coding=utf-8

from urllib import request

def getHtml(url):
    # Open the target page and return its raw content as bytes
    page = request.urlopen(url)
    html = page.read()
    return html

html = getHtml("https://tieba.baidu.com/p/5882095555")
print(html)
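
As an aside, read() returns bytes rather than text, and some sites may respond differently to requests that lack a browser-like User-Agent. The following is a minimal sketch, not part of the original code, of a variant that sends a header, sets a timeout, and decodes the result. The names build_request, get_html_text, and DEFAULT_HEADERS, the header value, and the 10-second timeout are all assumptions made for illustration.

```python
from urllib import request

# Hypothetical header value; any browser-like string could be used instead.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (simple-crawler-example)"}

def build_request(url, headers=None):
    """Wrap the URL in a Request object that carries custom headers."""
    return request.Request(url, headers=headers or DEFAULT_HEADERS)

def get_html_text(url, timeout=10, encoding="utf-8"):
    """Fetch the page and return it as str instead of bytes."""
    req = build_request(url)
    # The with-block closes the connection once the body has been read.
    with request.urlopen(req, timeout=timeout) as page:
        # errors="replace" avoids an exception on bytes that are not valid UTF-8.
        return page.read().decode(encoding, errors="replace")
```

Calling get_html_text("https://tieba.baidu.com/p/5882095555") would then return the page as a string, provided the site is reachable and serves UTF-8.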

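Extracting the images from the page

This step needs Python's re library for regular expression handling, and the regular expression for the images has to be worked out from the markup of the page being crawled. The modified code: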

# coding=utf-8

from urllib import request
import re

def getHtml(url):
    page = request.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # Capture the src of every .jpg image tagged with class "BDE_Image"
    reg = r'img class="BDE_Image" src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    # read() returned bytes; decode to str before matching
    html = html.decode('utf-8')
    imglist = re.findall(imgre, html)
    return imglist

html = getHtml("https://tieba.baidu.com/p/5882095555")
print(getImg(html))

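In this code:

reg is the regular expression
compile() builds a regular expression object from it
decode('utf-8') converts the bytes returned by read() into str so that the str pattern can be applied
findall() returns every image URL in the page that matches the regular expression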
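
To see what the pattern captures without touching the network, here is a small sketch that runs the same regular expression against a made-up HTML fragment. SAMPLE_HTML, its example.com URLs, and the names IMG_PATTERN and find_image_urls are invented for illustration; the real Tieba markup may differ.

```python
import re

# Made-up fragment imitating the markup that the pattern targets.
SAMPLE_HTML = (
    '<img class="BDE_Image" src="https://example.com/a.jpg" width="560">'
    '<img class="other" src="https://example.com/skip.jpg">'
    '<img class="BDE_Image" src="https://example.com/b.jpg" width="560">'
)

# Same pattern as in the article; the parentheses capture only the URL.
IMG_PATTERN = re.compile(r'img class="BDE_Image" src="(.+?\.jpg)"')

def find_image_urls(html_text):
    """Return the captured src values of all matching img tags."""
    return IMG_PATTERN.findall(html_text)

print(find_image_urls(SAMPLE_HTML))
# → ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

The lazy quantifier .+? stops at the first .jpg" it meets, so each match stays inside a single tag, and the tag with a different class is skipped.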

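Saving the fetched images locally

For this it is enough to call urlretrieve() from request and pass in the path where each file should be stored. The modified code: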

# coding=utf-8

from urllib import request
import re

def getHtml(url):
    page = request.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = r'img class="BDE_Image" src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    html = html.decode('utf-8')
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        # Raw string keeps the backslashes of the Windows path literal;
        # the E:\img folder must already exist
        request.urlretrieve(imgurl, r'E:\img\background%s.jpg' % x)
        x += 1
    return imglist

html = getHtml("https://tieba.baidu.com/p/5882095555")

print(getImg(html))

Finally, run the program and the downloaded images will appear in the specified location.
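
The hard-coded Windows path only works if that folder already exists. As a rough sketch of one alternative, assuming the helper names build_target_path and save_images and the "background" prefix (chosen here to mirror the file names above), the target folder can be created on demand and the file names built with os.path.join:

```python
import os
from urllib import request

def build_target_path(folder, index, prefix="background"):
    """Return folder/prefix<index>.jpg using the platform's path separator."""
    return os.path.join(folder, "%s%d.jpg" % (prefix, index))

def save_images(img_urls, folder):
    """Download every URL into folder and return the list of saved paths."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    saved = []
    for index, img_url in enumerate(img_urls):
        target = build_target_path(folder, index)
        request.urlretrieve(img_url, target)
        saved.append(target)
    return saved
```

With this in place, the loop inside getImg could be replaced by a call such as save_images(imglist, r'E:\img'), and enumerate() takes over the role of the manual counter x.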
