Scraping all images on the first page of the "picture section" of website 88titienmae88 with Python

Original

2020-06-15 13:00

#-*-coding:utf-8-*-
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup
import re
import os


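'''
Scrape all images on the first page of the "picture section" of http://jyghf.com/
        Don't agonize over the choice of website: a first crawler simply has to accomplish something!
'''

'''
Step 1, find the picture categories. From the HTML of http://jyghf.com/:
        Inside the div with id='top_box', the first div with class='menu' holds all "picture section" categories.
        Every category URL starts with "/p", e.g. /p01/index.html, full URL: http://jyghf.com/p01/index.html
'''

'''
Step 2, find the image folders. From the HTML of http://jyghf.com/p01/index.html:
        Inside the div with class="typelist", the folder links sit in <li> tags, and each one starts with "/htm/",
        e.g. "/htm/2017/12/13/p01/393067.html", full URL: "http://jyghf.com/htm/2017/12/13/p01/393067.html"
'''

'''
Step 3, get the image download URLs. From the HTML, every image URL is the "src" attribute of an <img> tag inside the div with id="view1".
'''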

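# Step 1: collect the links to the picture categories
def getPicTypeLink():
    html=urlopen("http://jyghf.com/")
    bshtml=BeautifulSoup(html,"html.parser")
    # Category links live in the first "menu" div inside "top_box" and start with /p
    picTypes=bshtml.find("div",{"id":"top_box"}).find("div",{"class":"menu"})\
        .find_all("a",href=re.compile("^(/p)"))
    return picTypes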

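# Step 2: collect the links to the image folders of one category
def getPicFileLink(typeLink):
    # typeLink already starts with "/", so the base URL must not end with one
    html=urlopen("http://jyghf.com{0}".format(typeLink))
    bshtml=BeautifulSoup(html,"html.parser")
    picfiles=bshtml.find("div",{"class":"typelist"}).find_all("a",href=re.compile("^(/htm/)"))
    # The pager link (div id="page", <a title="下一頁">) is not followed here:
    # only the first page is scraped, and looking it up fails on a page without one.
    return picfiles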



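# Step 3: collect the <img> tags whose src is the download URL
def getPicSrcLink(picfilelink):
    # picfilelink already starts with "/", so the base URL must not end with one
    html=urlopen("http://jyghf.com{0}".format(picfilelink))
    bshtml=BeautifulSoup(html,"html.parser")
    srcLinks=bshtml.find("div",{"id":"view1"}).find_all("img",src=re.compile("^(http://)"))
    return srcLinks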

# Helper: create (if needed) and return the directory for a category + image folder
def getDownloadPath(typename,filename,downLoadDirectory=r"E:\downloaded"):
    # Raw string above: "\d" is not a valid escape sequence in a normal string literal
    path=os.path.join(downLoadDirectory,typename,filename)
    os.makedirs(path,exist_ok=True)
    return path

######### Start ##########
picTypeLinks=getPicTypeLink()

for link in picTypeLinks:
    typeLink=link.attrs["href"]
    typename=link.get_text()
    print(typename)
    picFileLinks=getPicFileLink(typeLink)
    for picfile in picFileLinks:
        picFileLink=picfile.attrs["href"]
        filename=picfile.get_text()
        print(filename)
        picSrcLinks=getPicSrcLink(picFileLink)
        # Resolve the target directory once per folder, then number the images from 1
        downloadDir=getDownloadPath(typename,filename)
        for pid,picsrc in enumerate(picSrcLinks,start=1):
            downloadurl=picsrc.attrs["src"]
            print("Image {0}".format(pid))
            urlretrieve(downloadurl,os.path.join(downloadDir,"{0}.jpg".format(pid)))
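
The script joins site-relative hrefs (which start with "/") to the base URL and uses link text as directory names. Both steps are easy to get subtly wrong, so here is a minimal sketch of one way to handle them with the standard library only. `BASE_URL`, `absolute_url` and `safe_name` are hypothetical names introduced for illustration, not part of the original script, and the set of replaced characters assumes a Windows download directory like the default `E:\downloaded`.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://jyghf.com/"  # same site as in the script above


def absolute_url(href):
    """Resolve an href against BASE_URL without producing a double slash.

    Absolute URLs (for example image src values) are returned unchanged.
    """
    return urljoin(BASE_URL, href)


def safe_name(text, fallback="untitled"):
    """Turn link text into a name that is usable as a Windows directory name.

    Characters that Windows forbids in file names are replaced with "_",
    and leading/trailing spaces and dots are stripped.
    """
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", text).strip(" .")
    return cleaned or fallback


if __name__ == "__main__":
    print(absolute_url("/p01/index.html"))
    print(safe_name("2017/12/13: test?"))
```

With helpers like these, `urlopen(absolute_url(typeLink))` and `getDownloadPath(safe_name(typename), safe_name(filename))` would be drop-in uses, assuming the rest of the script stays as it is.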

Scraping all images in the "picture section" of website 88titienmae88 with Python: http://blog.csdn.net/qq_34908167/article/details/79041861
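
The script in this post stops at the first page of each category, while the linked post covers every page. As a rough illustration of the difference (not the code from that post), the sketch below follows "next page" links until none is left. `iter_page_urls`, `get_next_href` and `max_pages` are hypothetical names; `get_next_href` is assumed to be a callable that downloads a page and returns the href of its next-page link, or `None` on the last page.

```python
from urllib.parse import urljoin


def iter_page_urls(start_url, get_next_href, max_pages=100):
    """Yield start_url and every page reachable through next-page links.

    get_next_href(url) must return the href of the next page or None.
    The seen set and max_pages guard against pagers that loop forever.
    """
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        href = get_next_href(url)
        # Resolve relative hrefs against the page they were found on
        url = urljoin(url, href) if href else None
```

For this site, `get_next_href` could parse the page with BeautifulSoup and look for the `<a title="下一頁">` link inside the div with id="page", returning `None` when the lookup finds nothing instead of reading `.attrs` on a missing tag.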
