A Basic Approach to Scraping the Android Vulnerability Bulletin (Android Security Bulletins) with the Python Crawler Framework Scrapy

Original post

2020-06-12 23:56

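I previously wrote a blog post about using Scrapy: https://blog.csdn.net/qysh123/article/details/79802250

This post involves a few more tricks than that one, so here is a short summary. For a project, I needed to collect the fix commit hash of every CVE listed at https://source.android.com/security/bulletin. The requirement is clear and simple, but it still took me some time. First, note that the link to each monthly bulletin has the form https://source.android.com/security/bulletin/2020-04-01, so the first step is to filter out all links of this form, using the same method as in the earlier post. The trailing year-month-day part has to be matched with a regular expression, so we first define a function that tests that part of the link: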

######################################################################
import re

def is_bulletin(test_string):
    # Relative bulletin links look like /security/bulletin/2020-04-01.
    # match() anchors at the start of the string and $ anchors at the end.
    pattern = re.compile(r'/security/bulletin/(\d{4}-\d{1,2}-\d{1,2})$')
    return bool(pattern.match(test_string))
######################################################################
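As a quick check of how the anchoring behaves, here is a minimal sketch. It repeats the function so that it runs on its own; `is_bulletin_anywhere` is a hypothetical variant (not in the original code) that uses `search()` and would also accept absolute URLs.

```python
import re

BULLETIN_RE = re.compile(r'/security/bulletin/(\d{4}-\d{1,2}-\d{1,2})$')

def is_bulletin(test_string):
    # match() only succeeds if the pattern matches from the first character
    return bool(BULLETIN_RE.match(test_string))

def is_bulletin_anywhere(test_string):
    # Hypothetical variant: search() lets the pattern start anywhere
    return bool(BULLETIN_RE.search(test_string))

relative = '/security/bulletin/2020-04-01'
absolute = 'https://source.android.com/security/bulletin/2020-04-01'

print(is_bulletin(relative))           # True
print(is_bulletin(relative + '/'))     # False: $ requires the date to end the string
print(is_bulletin(absolute))           # False: the string does not start with /security
print(is_bulletin_anywhere(absolute))  # True
```

Since the spider below prepends the host name itself, the hrefs it tests are relative, which is why `match()` is sufficient there.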

Then we use XPath to collect every link on the page and keep the ones that pass this test:

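        # Inside the spider's parse() method
        for each in response.xpath('//a/@href'):
            suburl=each.extract()
            if is_bulletin(suburl):
                # Crude throttle; it blocks the crawler while sleeping.
                # Scrapy's DOWNLOAD_DELAY setting is the usual alternative.
                time.sleep(0.1)
                yield scrapy.Request('https://source.android.com'+suburl, self.parse)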
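The filtering step can be tried without running a crawl. The sketch below uses only the standard library; the `sample_hrefs` list and the helper name `bulletin_urls` are made up for illustration, and `urljoin` is used here in place of the string concatenation above.

```python
import re
from urllib.parse import urljoin

BASE = "https://source.android.com/security/bulletin"  # index page from the post
BULLETIN_RE = re.compile(r'/security/bulletin/(\d{4}-\d{1,2}-\d{1,2})$')

def bulletin_urls(hrefs, base=BASE):
    """Return absolute URLs for hrefs that look like monthly bulletins, without repeats."""
    seen = set()
    urls = []
    for href in hrefs:
        if not BULLETIN_RE.match(href):
            continue  # not a monthly bulletin link
        absolute = urljoin(base, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls

# Hypothetical hrefs, standing in for what //a/@href might return
sample_hrefs = [
    "/security/bulletin/2020-04-01",
    "/security/bulletin/pixel/2020-04-01",
    "/security/bulletin/2020-04-01",  # repeated link
    "/security/overview",
]
print(bulletin_urls(sample_hrefs))
```

Only the first href survives: the second has an extra path segment before the date, the third is a repeat, and the fourth is not a bulletin at all.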

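These steps were also covered in the earlier post. Once we open a page such as https://source.android.com/security/bulletin/2020-04-01, we can see that the link attached to each CVE already contains the commit hash. The page HTML looks roughly like this: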

<tr>
<td>CVE-2020-0023</td>
<td><a href="https://android.googlesource.com/platform/packages/apps/Bluetooth/+/0d8307f408f166862fbd6efb593c4d65906a46ae">A-145130871</a></td>
<td>ID</td>
<td>Critical</td>
<td>10</td>
</tr>
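In the row above, the commit hash is the last path segment of the href, after `/+/`. A minimal sketch of pulling it out, assuming every link of interest ends in `/+/` followed by a 40-character hexadecimal hash (links with any other shape return `None`); the helper name `commit_hash` is hypothetical:

```python
import re

# Assumption: the link ends with /+/ followed by a full 40-character commit hash
COMMIT_RE = re.compile(r'/\+/([0-9a-f]{40})$')

def commit_hash(href):
    """Return the commit hash at the end of a googlesource link, or None."""
    # Some hrefs encode slashes as %2F; undo that before matching
    found = COMMIT_RE.search(href.replace('%2F', '/'))
    return found.group(1) if found else None

href = ("https://android.googlesource.com/platform/packages/apps/Bluetooth/+/"
        "0d8307f408f166862fbd6efb593c4d65906a46ae")
print(commit_hash(href))  # 0d8307f408f166862fbd6efb593c4d65906a46ae
```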

We can locate this href with an approach similar to the one in the earlier post:

for each in response.xpath('//tr/td/a[starts-with(@href,"https://android.googlesource.com/")]/@href'):

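This one is self-explanatory. As described here: https://blog.csdn.net/winterto1990/article/details/47903653, /@xxxx extracts the value of an attribute of the tag at the current path. But how do we get to the CVE ID in the td before it? That is the point I want to record today. According to the page linked above, the double dot .. selects the parent of the current node, but the author gives no example, so it took me a few tries. First, use: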

for each in response.xpath('//tr/td/a[starts-with(@href,"https://android.googlesource.com/")]'):

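to select the "<a>" element itself. From there we have to climb back up to the "<tr>" (the first .. reaches the enclosing "<td>", the second reaches the "<tr>") and then visit each td of that tr in turn: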

for each_content in each.xpath('../../td/text()'):

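This step needs another function to decide whether a cell holds a CVE ID. Note that the last part of a CVE ID may have 4 digits or 5 (the CVE ID format allows four or more):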

######################################################################
def is_CVE(test_string):
    # The last group may have 4 or 5 digits; {4,} accepts four or more.
    # match() anchors only at the start, so any trailing text is ignored.
    pattern = re.compile(r'CVE-(\d{4}-\d{4,})')
    return bool(pattern.match(test_string))
######################################################################
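The digit count matters more once the ID is extracted rather than just tested. A small sketch with the standard `re` module: `LOOSE` fixes the last group at exactly four digits, `STRICT` accepts four or more, and `cve_id` is a hypothetical helper. The five-digit ID is made up for illustration.

```python
import re

LOOSE = re.compile(r'CVE-(\d{4}-\d{4})')  # exactly four trailing digits
STRICT = re.compile(r'CVE-\d{4}-\d{4,}')  # four or more trailing digits

def cve_id(text):
    """Return the CVE ID if the whole (stripped) text is one, else None."""
    found = STRICT.fullmatch(text.strip())
    return found.group(0) if found else None

five = "CVE-2020-12345"  # made-up five-digit ID
print(LOOSE.match(five).group(0))  # CVE-2020-1234 (a true/false test passes, but the ID is cut short)
print(cve_id(five))                # CVE-2020-12345
print(cve_id(" CVE-2020-0023\n"))  # CVE-2020-0023
print(cve_id("CVE-2020-0023 and more text"))  # None
```

With `match()` and a fixed `\d{4}`, the yes/no answer is still correct for five-digit IDs because only a prefix has to match; `fullmatch()` is the stricter choice when the cell should contain nothing but the ID.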

This gives us each CVE together with its commit link. That is all for this short summary; here is the complete code:

import scrapy
import re
import time
######################################################################
def is_bulletin(test_string):
    # Relative links to monthly bulletins, e.g. /security/bulletin/2020-04-01
    pattern = re.compile(r'/security/bulletin/(\d{4}-\d{1,2}-\d{1,2})$')
    return bool(pattern.match(test_string))
######################################################################
######################################################################
def is_CVE(test_string):
    # The last group may have 4 or 5 digits; {4,} accepts four or more
    pattern = re.compile(r'CVE-(\d{4}-\d{4,})')
    return bool(pattern.match(test_string))
######################################################################
# Quick sanity check with a 5-digit ID; prints True
test_string="CVE-2222-22222"
print(is_CVE(test_string))

class VulSpider(scrapy.Spider):
    name="Vul"
    allowed_domains=["source.android.com"]
    start_urls = [
        'https://source.android.com/security/bulletin',
    ]
    
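    def parse(self,response):
        
        # Remember the most recent CVE ID seen on this page
        previous_cve=""
        
        # Every link to android.googlesource.com that sits inside a table cell
        for each in response.xpath('//tr/td/a[starts-with(@href,"https://android.googlesource.com/")]'):
            git_url=each.xpath('./@href')[0].extract().replace('%2F','/')
            other_content=[]  # text of the row's non-CVE cells
            
            found_cve=False
            
            # ../.. climbs from <a> to <td> to <tr>; then read the text of every td in that row
            for each_content in each.xpath('../../td/text()'):
                content=each_content.extract()
                if is_CVE(content):
                    found_cve=True
                    previous_cve=content
                    print(content)
                    print(git_url)
                else:
                    other_content.append(content)

        # Follow the links to the monthly bulletin pages and parse them the same way
        for each in response.xpath('//a/@href'):
            suburl=each.extract()
            if is_bulletin(suburl):
                time.sleep(0.1)  # crude throttle; blocks the crawler while sleeping
                yield scrapy.Request('https://source.android.com'+suburl, self.parse)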

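Because of some quirks of the pages, the code also contains a little extra handling, which I won't go into here.

While experimenting I also referred to the examples in the following blog posts: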

https://blog.csdn.net/winterto1990/article/details/47903653

https://blog.csdn.net/weixin_41558061/article/details/80077423

https://blog.csdn.net/qq_40134903/article/details/80728094

Many thanks to all of them!
