Python Crawler: Dynamically Loaded JSON Data

This post is for technical discussion only; if it infringes on any rights, contact the author and it will be removed.

Most of the crawlers I had written before could pull their data straight out of the fetched page source, or else used Selenium to simulate a user, and Selenium is honestly slow. This time, while crawling the VenusEye threat intelligence center, I fetched the page source and found that all the data I needed was loaded dynamically by JavaScript. The fetched markup looked like this:

<dl @click="search('domain')" v-show="headerEmail">
    <dt>{{langMap['域名'][config.locale]}}:</dt>
    <dd>{{headerkeyword.replace(/^(http|https|ftp)\:\/\//,'')}}</dd>
</dl>
<dl @click="search('url')" v-show="headerEmail">
    <dt>URL:</dt>
    <dd>{{headerkeyword}}</dd>
</dl>
<dl @click="search('haxi')" v-show="headerHash">
    <dt>{{langMap['哈希'][config.locale]}}:</dt>
    <dd>{{headerkeyword}}</dd>
</dl>
<dl @click="search('ip')" v-show="headerIp">
    <dt>IP:</dt>
    <dd>{{headerkeyword}}</dd>
</dl>
<dl @click="search('email')">
    <dt>{{langMap['郵箱'][config.locale]}}:</dt>
    <dd>{{headerkeyword}}</dd>
</dl>

There are two ways to deal with this. The first is to crawl with Selenium, since Selenium is essentially what-you-see-is-what-you-get; but when the amount of data to crawl is large, Selenium is clearly unsuitable. The second, the one I used, is to analyze the page's network traffic.
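
For reference, a minimal sketch of the Selenium route (assuming selenium is installed along with Chrome and a matching chromedriver); every lookup spins up a full browser, which is exactly why it does not scale:

from selenium import webdriver

driver = webdriver.Chrome()   # launches a real browser and executes the page's JS
driver.get('https://www.venuseye.com.cn/')
html = driver.page_source     # rendered HTML, dynamic data already filled in
driver.quit()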

First, open the page you want to crawl, press F12, go to Network, click the XHR filter, then refresh with F5; the JS requests will appear.

Click one of the entries; here I chose the ip request. Looking at the JSON it returns, this is exactly the data we want.
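
The response is a JSON object shaped roughly as follows. The field names are the ones the crawler reads later on; the values are purely illustrative:

result = {
    'status_code': 200,
    'data': {
        'ip': '1.2.3.4',
        'cy': '…', 'provincial': '…', 'area': '…',  # location fields
        'ompany': '…', 'operator': '…',
        'asn': 0, 'aso': '…',                       # AS number and organization
        'active_time': 1561000000,                  # Unix timestamp
        'tags': ['…'], 'ports': [80],
        'threat_score': 0
    }
}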


What remains is to issue the same JSON request directly from Python. Open the entry's Headers tab: the Request URL under General is the URL to request, and the target field under Form Data is the payload uploaded with the POST request. General's Request Method shows this is a POST request; if it were a GET request, you would skip data and append the parameters to the URL instead.
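
In requests terms, the difference is only which keyword argument carries the payload; a minimal sketch:

import requests

url = 'https://www.venuseye.com.cn/ve/ip'
# POST: the payload travels in the request body via data=
resp = requests.post(url, data={'target': '1.2.3.4'})
# GET (for comparison): the payload goes into the URL as a query string via params=
resp = requests.get(url, params={'target': '1.2.3.4'})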


With the request URL and payload pinned down, the crawler can be written.

First comes the initialization: imports, headers, and the header row of the table the data will be saved into:

import requests, xlwt, time

# Initialization
def init():
    global url_1, url_2, headers, workbook, table, row_now
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
#         'X-Forwarded-For': '9.9.9.9',
        'Forwarded': '9.9.9.9'
    }
    url_1 = 'https://www.venuseye.com.cn/ve/ip'
    url_2 = 'https://www.venuseye.com.cn/ve/ip/ioc'
    workbook = xlwt.Workbook(encoding='utf-8')
    table = workbook.add_sheet("name", cell_overwrite_ok=True)
    value = [
        "ip", "location", "as", "update_time", "tags", "ports", "threat_score",
        "ioc_code", "ioc_update_time", "ioc_categories", "ioc_families", "ioc_organizations"
    ]
    for i in range(len(value)):
        table.write(0, i, value[i])  # header row
    row_now = 1  # next row to write
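
A side note: xlwt writes the legacy .xls format, which caps a worksheet at 65,536 rows; if the crawl outgrows that, split the data across sheets or switch to a library such as openpyxl.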

Next, define the function that fetches the data for one IP; this is the core of the code. The many field checks here keep the program tolerant of responses with missing data:

# Fetch the data for one IP and write it into the current row
def get_ip_data(ip_now, pro):
    global row_now
    data = {'target': ip_now}
    result_1 = requests.post(url_1, headers=headers, data=data, proxies=pro).json()
    if result_1['status_code'] != 200:
        return False
    d = result_1['data']
    if 'ip' not in d:
        return False
    table.write(row_now, 0, d['ip'])
    # Location: join only the non-empty fields so no stray commas appear
    parts = [d.get(k, '') for k in ('cy', 'provincial', 'area', 'ompany')]
    location = ','.join(p for p in parts if p)
    if d.get('operator'):
        location = location + '(' + d['operator'] + ')'
    table.write(row_now, 1, location)
    # AS number, with the AS organization in parentheses when present
    as_data = str(d.get('asn', ''))
    if d.get('aso'):
        as_data = as_data + '(' + d['aso'] + ')'
    table.write(row_now, 2, as_data)
    if 'active_time' in d:
        # active_time is a Unix timestamp
        timeArray = time.localtime(d['active_time'])
        table.write(row_now, 3, time.strftime("%Y-%m-%d", timeArray))
    # tags and ports are lists; str() guards against non-string items such as port numbers
    if 'tags' in d:
        table.write(row_now, 4, ';'.join(str(t) for t in d['tags']))
    if 'ports' in d:
        table.write(row_now, 5, ';'.join(str(p) for p in d['ports']))
    if 'threat_score' in d:
        table.write(row_now, 6, str(d['threat_score']))
    result_2 = requests.post(url_2, headers=headers, data=data, proxies=pro).json()
    # Guard against an empty ioc list before indexing [0]
    if result_2['status_code'] == 200 and result_2['data'].get('ioc'):
        ioc = result_2['data']['ioc'][0]
        if 'code' in ioc:
            table.write(row_now, 7, ioc['code'])
        if 'update_time' in ioc:
            timeArray = time.localtime(ioc['update_time'])
            table.write(row_now, 8, time.strftime("%Y-%m-%d", timeArray))
        if 'categories' in ioc:
            table.write(row_now, 9, ';'.join(str(x) for x in ioc['categories']))
        if 'families' in ioc:
            table.write(row_now, 10, ';'.join(str(x) for x in ioc['families']))
        if 'organizations' in ioc:
            table.write(row_now, 11, ';'.join(str(x) for x in ioc['organizations']))
    row_now = row_now + 1
    return True
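
Before wiring up the proxy pool, the function can be smoke-tested on a single address. Passing proxies=None makes requests connect directly; both the target IP and the output path below are just examples:

init()
if get_ip_data('8.8.8.8', None):  # proxies=None: direct connection, no proxy
    workbook.save('./venuseye_ip爬蟲/smoke_test.xls')  # example output path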


The target site has anti-crawling measures and will refuse service once a single IP has made too many requests, so I rotate through proxy IPs. Since the proxy IPs I use come from a paid service, its request URL is masked with *** here:

# Fetch a fresh proxy IP
def get_new_ip():
    pro = ''
    while True:
        try:
            ip_json = requests.post('***', headers=headers).json()  # call the proxy API for its JSON payload
            if ip_json['code'] == 0 and ip_json['success'] == 'true':
                pro = {
                    'http': ip_json['data'][0]['IP'],  # + ip_json['data'][0]['Port']
                    'https': ip_json['data'][0]['IP']  # + ip_json['data'][0]['Port']
                }
                # Sanity-check the proxy against httpbin before using it
                web_data = requests.get('http://httpbin.org/get', headers=headers, proxies=pro)
                print(web_data.text)
                break
            elif ip_json['code'] == 10000:
                print('10000')  # rate-limited by the proxy API; wait and retry
                time.sleep(5)
        except Exception:
            print('IP get, try again')
            time.sleep(5)
    return pro
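
The code above only tells us which fields it expects from the proxy API; a hypothetical response with that shape (placeholder values, since the real service is masked) would be:

ip_json = {
    'code': 0,         # 0 means success, 10000 means rate-limited
    'success': 'true',
    'data': [
        {'IP': '…', 'Port': '…'}  # placeholders; the real API is masked with ***
    ]
}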


Last is the main routine: it loops over the IPs, handles exceptions, and saves the data:

# Main routine
if __name__ == '__main__':
    init()
    pro = get_new_ip()
    for line in open("./venuseye_ip爬蟲/test.txt"):
        ip = line.strip()  # drop the trailing newline before sending it as the POST target
        if not ip:
            continue
        print(ip, end='\t')
        while True:
            try:
                if get_ip_data(ip, pro):
                    print('end', end='\t')
                    break
                else:
                    time.sleep(3)
                    print('request failed', end='\t')
                    pro = get_new_ip()
            except Exception:
                print('no response', end='\t')
                pro = get_new_ip()
    workbook.save('./venuseye_ip爬蟲/test.xls')
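
One caveat: workbook.save only runs once the loop has finished, so an unhandled crash mid-run would lose everything collected so far; calling workbook.save every few hundred IPs inside the loop is a cheap safeguard.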

