This post is for technical discussion only; if it infringes on any rights, contact the author and it will be removed.
Most of the crawlers I had written before could pull their data straight out of the fetched page source, or fell back on Selenium to simulate a real user — but Selenium is genuinely slow. This time, while scraping the VenusEye threat intelligence center, I found that the fetched source contained none of the data I needed: everything is loaded dynamically by JavaScript. The result looks like this:
<dl @click="search('domain')" v-show="headerEmail">
<dt>{{langMap['域名'][config.locale]}}:</dt>
<dd>{{headerkeyword.replace(/^(http|https|ftp)\:\/\//,'')}}</dd>
</dl>
<dl @click="search('url')" v-show="headerEmail">
<dt>URL:</dt>
<dd>{{headerkeyword}}</dd>
</dl>
<dl @click="search('haxi')" v-show="headerHash">
<dt>{{langMap['哈希'][config.locale]}}:</dt>
<dd>{{headerkeyword}}</dd>
</dl>
<dl @click="search('ip')" v-show="headerIp">
<dt>IP:</dt>
<dd>{{headerkeyword}}</dd>
</dl>
<dl @click="search('email')">
<dt>{{langMap['郵箱'][config.locale]}}:</dt>
<dd>{{headerkeyword}}</dd>
</dl>
There are two ways around this. The first is to scrape with Selenium, since Selenium gives you exactly what the browser renders ("what you see is what you get"). But Selenium is clearly the wrong choice once the volume of data gets large. The second approach, the one I used, is to analyze the page's network traffic.
First open the target page, press F12, go to the Network tab, click the XHR filter, then refresh with F5 to see the requests issued by the page's JavaScript.
Click one of the requests — here I pick the `ip` one — and inspect the JSON it returns: it is exactly the data we want.
The next step is to issue that JSON request directly from Python. Click the request's Headers tab: the Request URL under General is the URL to call, and the `target` field under Form Data is the payload to upload with the POST request. We know it is a POST because the Request Method under General says so; a GET request takes no `data` body — its parameters are appended to the URL instead.
With the request URL and the `data` payload identified, we can start writing the crawler.
First the initialization: imports, headers, and the header row of the spreadsheet the data will be saved into:
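Replicating the XHR in Python only takes `requests.post` with that URL and form data. A minimal sketch — the helper names and the example IP are mine, not from any VenusEye API documentation; only the URL and the `target` field come from what DevTools showed:

```python
# import requests  # uncomment for the live call at the bottom

URL = 'https://www.venuseye.com.cn/ve/ip'  # the Request URL seen under General in DevTools

def build_request(target):
    """Mirror the XHR: return the URL plus the form data seen under Form Data."""
    return URL, {'target': target}

def extract_data(resp_json):
    """Return the 'data' payload if the API reports success, else None."""
    if resp_json.get('status_code') == 200:
        return resp_json.get('data')
    return None

# Live call (requires network access):
# url, data = build_request('8.8.8.8')
# print(extract_data(requests.post(url, data=data, timeout=10).json()))
```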
import requests, xlwt, time

# Initialization: globals, request headers, and the spreadsheet header row
def init():
    global url_1, url_2, headers, workbook, table, row_now
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        # 'X-Forwarded-For': '9.9.9.9',
        'Forwarded': '9.9.9.9'
    }
    url_1 = 'https://www.venuseye.com.cn/ve/ip'
    url_2 = 'https://www.venuseye.com.cn/ve/ip/ioc'
    workbook = xlwt.Workbook(encoding='utf-8')
    table = workbook.add_sheet("name", cell_overwrite_ok=True)
    value = [
        "ip", "location", "as", "update_time", "tags", "ports", "threat_score",
        "ioc_code", "ioc_update_time", "ioc_categories", "ioc_families", "ioc_organizations"
    ]
    for i in range(len(value)):
        table.write(0, i, value[i])
    row_now = 1
Next comes the core of the script: a function that fetches the data for a given IP. I added a lot of existence checks here so the program degrades gracefully when fields are missing from a response:
# Fetch the data for one IP
def get_ip_data(ip_now, pro):
    data = {'target': ip_now}
    result_1 = requests.post(url_1, headers=headers, data=data, proxies=pro).json()
    global row_now
    if result_1['status_code'] != 200:
        return False
    if 'ip' in result_1['data']:
        table.write(row_now, 0, result_1['data']['ip'])
    else:
        return False
    location = ''
    if result_1['data']['cy'] != '':
        location = result_1['data']['cy']
    if result_1['data']['provincial'] != '':
        location = location + ',' + result_1['data']['provincial']
    if result_1['data']['area'] != '':
        location = location + ',' + result_1['data']['area']
    if result_1['data']['company'] != '':
        location = location + ',' + result_1['data']['company']
    if result_1['data']['operator'] != '':
        location = location + '(' + result_1['data']['operator'] + ')'
    table.write(row_now, 1, location)
    as_data = ''
    if 'asn' in result_1['data']:
        as_data = result_1['data']['asn']
    if 'aso' in result_1['data']:
        if result_1['data']['aso'] != '':
            as_data = str(as_data) + '(' + result_1['data']['aso'] + ')'
    table.write(row_now, 2, as_data)
    if 'active_time' in result_1['data']:
        # active_time is a Unix timestamp; convert to YYYY-MM-DD
        timeArray = time.localtime(result_1['data']['active_time'])
        table.write(row_now, 3, time.strftime("%Y-%m-%d", timeArray))
    tags = ''
    if 'tags' in result_1['data']:
        if len(result_1['data']['tags']) > 0:
            for now_data in result_1['data']['tags']:
                tags = tags + str(now_data) + ';'
        else:
            tags = str(result_1['data']['tags'])
    table.write(row_now, 4, tags)
    ports = ''
    if 'ports' in result_1['data']:
        if len(result_1['data']['ports']) > 0:
            for now_data in result_1['data']['ports']:
                ports = ports + str(now_data) + ';'  # ports may come back as integers
        else:
            ports = str(result_1['data']['ports'])
    table.write(row_now, 5, ports)
    if 'threat_score' in result_1['data']:
        table.write(row_now, 6, str(result_1['data']['threat_score']))
    else:
        return False
    result_2 = requests.post(url_2, headers=headers, data=data, proxies=pro).json()
    if result_2['status_code'] == 200:
        ioc = result_2['data']['ioc'][0]
        if 'code' in ioc:
            table.write(row_now, 7, ioc['code'])
        if 'update_time' in ioc:
            timeArray = time.localtime(ioc['update_time'])
            table.write(row_now, 8, time.strftime("%Y-%m-%d", timeArray))
        if 'categories' in ioc:
            categories = ''
            if len(ioc['categories']) > 0:
                for now_data in ioc['categories']:
                    categories = categories + str(now_data) + ';'
            else:
                categories = str(ioc['categories'])
            table.write(row_now, 9, categories)
        if 'families' in ioc:
            families = ''
            if len(ioc['families']) > 0:
                for now_data in ioc['families']:
                    families = families + str(now_data) + ';'
            else:
                families = str(ioc['families'])
            table.write(row_now, 10, families)
        if 'organizations' in ioc:
            organizations = ''
            if len(ioc['organizations']) > 0:
                for now_data in ioc['organizations']:
                    organizations = organizations + str(now_data) + ';'
            else:
                organizations = str(ioc['organizations'])
            table.write(row_now, 11, organizations)
    row_now = row_now + 1
    return True
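The function above repeats the same "join a list field into a `;`-separated string" pattern five times (tags, ports, categories, families, organizations). It could be folded into one small helper — a sketch of mine, not code from the original script (note it drops the trailing `;` that the inline loops produce):

```python
def join_field(value, sep=';'):
    """Flatten a list field from the JSON into 'a;b;c'; non-lists pass through as str."""
    if isinstance(value, (list, tuple)):
        return sep.join(str(v) for v in value)
    return str(value)
```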
The target site has anti-crawling measures: once a single IP has made too many requests, it gets refused service. I worked around this with rotating proxy IPs. Since the proxies come from a paid service, I have masked its API URL with ***:
# Fetch a fresh proxy IP
def get_new_ip():
    pro = ''
    while True:
        try:
            # Call the (paid, masked) proxy API and parse its JSON reply
            ip_json = requests.post('***', headers=headers).json()
            if ip_json['code'] == 0 and ip_json['success'] == 'true':
                pro = {
                    'http': ip_json['data'][0]['IP'],   # + ':' + ip_json['data'][0]['Port']
                    'https': ip_json['data'][0]['IP']   # + ':' + ip_json['data'][0]['Port']
                }
                # Sanity check: see which IP httpbin reports through the proxy
                web_data = requests.get('http://httpbin.org/get', headers=headers, proxies=pro)
                print(web_data.text)
                break
            elif ip_json['code'] == 10000:
                print('10000')
                time.sleep(5)
        except:
            print('IP get failed, trying again')
            time.sleep(5)
    return pro
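One caveat: `requests` expects each entry in the `proxies` mapping to be a full proxy URL — scheme, host, and port. The code above passes only the bare IP (the port concatenation is commented out), so the proxy may silently fail to apply. A sketch of the shape it should have (`make_proxies` is my name, not from the post):

```python
def make_proxies(ip, port):
    """Build the proxies mapping requests expects: scheme://host:port per protocol."""
    proxy = 'http://{}:{}'.format(ip, port)
    return {'http': proxy, 'https': proxy}
```

Used as, e.g., `requests.get('http://httpbin.org/get', proxies=make_proxies('1.2.3.4', 8080))`.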
Finally, the main routine: loop over the list of IPs, handle failures along the way, and save the results:
# Main routine
if __name__ == '__main__':
    init()
    pro = get_new_ip()
    for line in open("./venuseye_ip爬蟲/test.txt"):
        print(line.replace('\n', ''), end='\t')
        while True:
            try:
                # strip the trailing newline before sending the IP
                if get_ip_data(line.strip(), pro) == True:
                    print('end', end='\t')
                    break
                else:
                    time.sleep(3)
                    print('failed to fetch the page', end='\t')
                    pro = get_new_ip()
            except:
                print('no response', end='\t')
                pro = get_new_ip()
    workbook.save('./venuseye_ip爬蟲/test.xls')
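The retry loop in the main routine (attempt, and on failure rotate the proxy, sleep, and try again) can be factored into a standalone helper — my generalization, not part of the original script:

```python
import time

def retry(fn, on_fail, delay=3, attempts=5):
    """Call fn() until it returns True; run on_fail() and pause between attempts."""
    for _ in range(attempts):
        try:
            if fn():
                return True
        except Exception:
            pass
        on_fail()
        time.sleep(delay)
    return False
```

In this script, `fn` would wrap `get_ip_data(ip, pro)` and `on_fail` would refresh the proxy via `get_new_ip()`.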