Python Web Scraping in Practice (4): Scraping the Demon Slayer Manga with Python + Simple JS Analysis

Original

皖渝

2020-06-22 19:55

This scrape is for learning purposes only and has no commercial use.

Pork bones, bring them over for braising~ Today I'll share how to scrape images with Python, plus a bit of simple JS analysis.

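Target URL: manga page link
(This site has only been updated up to chapter 188, although more than 200 chapters have actually been released.)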

1. Getting the URLs of all chapters

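After opening the site and capturing the network traffic in Chrome, you can find the data for all chapters. It is fetched and parsed as JSON like this: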

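def get_html(url):
    # header is the request-header dict defined in the full code
    try:
        r=requests.get(url,headers=header,timeout=5)
        r.encoding='gbk'
        if r.status_code==200:
            return r.text
    except requests.RequestException:
        print('Network connection error')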

def get_total_chapter():
    data=json.loads(get_html(url))
    chapter_total=data['Comics'][2]['Chapters']
    for item in chapter_total:
        yield item.get('Url')
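
To make the shape of that response easier to picture, here is a minimal offline sketch. The payload is a mock: only the keys `Comics`, `Chapters` and `Url` come from the code above, while the placeholder URLs, the helper name `iter_chapter_urls` and its `source_index` parameter are assumptions made for illustration.

```python
import json

# Mock payload: only the keys 'Comics', 'Chapters' and 'Url' are taken from
# the article's code; the URLs are placeholders, not real chapter addresses.
sample_text = json.dumps({
    "Comics": [
        {"Chapters": []},
        {"Chapters": []},
        {"Chapters": [
            {"Url": "http://example.com/comiclist/1/1.htm"},
            {"Url": "http://example.com/comiclist/2/1.htm"},
        ]},
    ]
})

def iter_chapter_urls(text, source_index=2):
    """Yield each chapter URL, skipping entries that have no 'Url' value."""
    data = json.loads(text)
    for item in data["Comics"][source_index]["Chapters"]:
        chapter_url = item.get("Url")
        if chapter_url:
            yield chapter_url

chapter_urls = list(iter_chapter_urls(sample_text))
print(chapter_urls)
```

Index 2 mirrors `data['Comics'][2]` in the article; if the API returns its sources in a different order, that index would need to change.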

2. Parsing the image URL with simple JS decoding

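After clicking into chapter 1 and inspecting the page source, you can see that the image sits in an img inside an a tag, but the src we need is written out by JavaScript! In this case, parsing the page directly with the lxml library will not get you the image.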

Here we first analyze how the image link is put together, then extract it with a regular expression.

src='"+m201304d+"newkuku/2016/02/15/鬼滅之刃][第1話/JOJO_001513.jpg'

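Here, m201304d is the obfuscated part. This site is fairly simple: open the js4.js file and you will find that m201304d corresponds to http://v2.kukudm.com/. There are three other such codes besides this one, so we can put all four in a list, use an if check to see whether the src contains one of them, and then substitute it with replace.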
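
The substitution described above can be sketched as follows. This is a minimal example, not the site's actual logic: the four variable names come from the full code, mapping all of them to `http://v2.kukudm.com/` follows what the full code does (the article only confirms this value for m201304d), and the function name `resolve_img_src` and the shortened sample path are assumptions.

```python
BASE_URL = "http://v2.kukudm.com/"
# JS variable names taken from the article's full code
JS_HOST_VARS = ["m200911d", "m201001d", "m201304d", "k0910k"]

def resolve_img_src(raw_src):
    """Turn a captured JS expression such as '"+m201304d+"path.jpg' into a URL."""
    for name in JS_HOST_VARS:
        marker = '"+' + name + '+"'
        if marker in raw_src:
            raw_src = raw_src.replace(marker, BASE_URL)
    # Drop the quotes left around the value instead of calling eval()
    return raw_src.strip("'\" ")

# Shortened, hypothetical path in the same shape as the src shown above
sample = "'\"+m201304d+\"newkuku/2016/02/15/demo/JOJO_001513.jpg'"
img_url = resolve_img_src(sample)
print(img_url)
```

Stripping the quotes with `strip` avoids running `eval` on text that came from a web page.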

3. Pagination analysis

Analyzing the URL shows that chapter 1 has 54 pages in total, and you can turn pages by changing the /number.html at the end of the URL.
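
As a rough sketch of that page-turning rule, the helper below swaps the trailing page number in a chapter URL. The name `page_url` and the example.com address are hypothetical; because the full code matches `.htm`, the pattern accepts either `.htm` or `.html`.

```python
import re

def page_url(chapter_url, page):
    """Return chapter_url with its trailing '/<number>.htm(l)' set to the given page."""
    return re.sub(
        r'/\d+(\.html?)$',
        lambda m: '/{}{}'.format(page, m.group(1)),
        chapter_url,
    )

# Hypothetical chapter address used only to show the rewrite
first_page = "http://example.com/comiclist/1/2/1.htm"
second_page = page_url(first_page, 2)
print(second_page)
```

Anchoring the pattern at the end of the string means a number earlier in the path is never touched.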

Full code

All images are saved in the comic folder on the desktop.

import requests
import json
import os
import re

os.chdir('C:/Users/dell/Desktop/comic')  # the comic folder must already exist
url='https://api.soman.com/soman.ashx?action=getsomancomicdetail&comicname=%E9%AC%BC%E7%81%AD%E4%B9%8B%E5%88%83&source=kuku%E5%8A%A8%E6%BC%AB'
header={'user-agent':"Opera/9.80 (Windows NT 6.0; U; en) Presto/2.8.99 Version/11.10"}

def get_html(url):
    # Return the response text decoded as GBK, or None if the request fails
    try:
        r=requests.get(url,headers=header,timeout=5)
        r.encoding='gbk'
        if r.status_code==200:
            return r.text
        print('Unexpected status code: {}'.format(r.status_code))
    except requests.RequestException:
        print('Network connection error')

def get_total_chapter():
    # Yield the URL of each chapter from the JSON returned by the API
    data=json.loads(get_html(url))
    chapter_total=data['Comics'][2]['Chapters']
    for item in chapter_total:
        yield item.get('Url')

def save_items(url,count,page):
    # Name each file by its page number so that two images saved within
    # the same second cannot overwrite each other
    r=requests.get(url,headers=header,timeout=5)
    with open('./第{}話/{}.jpg'.format(count,page),'wb') as f:
        f.write(r.content)

def get_all_img():  # Download every page image of every chapter
    src_list=["m200911d","m201001d","m201304d","k0910k"]
    count=0
    for chapter in get_total_chapter():
        try:
            count+=1
            os.makedirs('./第{}話'.format(count),exist_ok=True)
            # The page text may use 頁 or 页, so the pattern accepts either
            pat=r'共(\d+)[頁页]'
            total_page=re.search(pat,get_html(chapter)).group(1)
            for page in range(1,int(total_page)+1):
                pat1='<IMG SRC=(.*?)></a>'
                src=re.search(pat1,get_html(chapter)).group(1)
                for item in src_list:
                    if item in src:  # only substitute a code that is actually present
                        src=src.replace("+"+item+"+",'http://v2.kukudm.com/').replace('"','')
                # Strip the surrounding single quotes instead of calling eval()
                save_items(src.strip("'"),count,page)
                print('Chapter {}, page {} downloaded'.format(count,page))
                # Build the URL of the next page from the current one
                now_page=re.search(r'.*/(.*)\.htm',chapter).group(1)
                chapter=chapter.replace(str(now_page)+'.htm',str(page+1)+'.htm')
        except Exception as e:
            print('Failed to scrape chapter {}: {}'.format(count,e))

if __name__=='__main__':
    get_all_img()

The final scraped manga looks like this (as a demonstration only, just the first 10 chapters were scraped):

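Scraping 10 chapters took about 25 minutes. At that rate, all 188 chapters would take more than 7 hours... Later on, multiprocessing could be used to speed up the scraping.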
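
As one possible way to get that speed-up, the sketch below runs chapters concurrently. Note the swap: the article suggests multiprocessing, but since downloading is mostly waiting on the network, this example uses a thread pool from the standard library instead. `download_chapter` is a hypothetical placeholder that downloads nothing, and the worker count of 4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def download_chapter(chapter_no):
    """Placeholder for the per-chapter work; a real version would fetch and save every page."""
    return 'chapter {} done'.format(chapter_no)

def download_all(chapter_numbers, workers=4):
    # map() runs the calls on worker threads and returns results in input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_chapter, chapter_numbers))

results = download_all(range(1, 11))
print(results[0])
```

Each chapter writes to its own folder, so the workers would not compete for the same files. It is still worth keeping the worker count small so the site is not flooded with requests.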
