Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey

Machine-Learning and Data-Mining Techniques

Software Vulnerability Analysis Methods Based on Machine Learning and Data Mining
SEYED MOHAMMAD GHAFFARIAN and HAMID REZA SHAHRIARI,
Amirkabir University of Technology

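Abstract

Software security vulnerabilities are one of the critical issues in computer security, and many approaches have been proposed over the past decades to mitigate the damage they cause. Machine-learning and data-mining techniques are among these approaches. This article reviews work on software vulnerability analysis and discovery that uses machine-learning and data-mining techniques, discusses the advantages and shortcomings of each approach, and points out the challenges and some uncharted territories in the field.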

Keywords

Software vulnerability analysis, software vulnerability discovery, software security, machine-learning, data-mining, review, survey

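Introduction

Computer software is ubiquitous, and modern life depends to a large extent on a wide variety of it. Software takes different forms and runs on different platforms, from simple applications on handheld mobile devices to complex distributed enterprise systems. It is built on a wide variety of technologies and produced by different methodologies, each with its own strengths and limitations. Across this vast industry, and in the field of computer security, one important problem is software security vulnerabilities. To quote industry experts: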

“In the context of software security, vulnerabilities are specific flaws or oversights in a piece of software that allow attackers to do something malicious: expose or alter sensitive information, disrupt or destroy a system, or take control of a computer system or program.” Dowd et al. (2007)

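Software vulnerabilities vary in severity depending on factors such as the complexity of exploitation and the attack surface (Nayak et al. 2014). Over the past two decades, many incidents have occurred in which software vulnerabilities caused significant damage to companies and individuals. A prominent example is vulnerabilities in popular browser plugins, which threaten the security and privacy of millions of Internet users, for instance Adobe Flash Player (US-CERT 2015; Adobe Security Bulletin 2015) and Oracle Java (US-CERT 2013). In addition, vulnerabilities in foundational open-source software threaten the security of thousands of companies worldwide and their customers, for example Heartbleed (Codenomicon 2014), ShellShock (Symantec Security Response 2014), and Apache Commons (Breen 2015). These examples are only a small fraction of the large number of vulnerabilities reported every year.
To address this important problem, researchers in academia and the software industry have proposed many mitigation approaches. Shahriar and Zulkernine (2012) surveyed a wide range of approaches to security vulnerabilities published between 1994 and 2010, including testing, static analysis, and hybrid analysis, as well as secure programming, program transformation, and patching.
In the following years (from 2011 onward), however, the research community paid increasing attention to another class of approaches, which use techniques from data science and artificial intelligence (AI) to tackle software vulnerability analysis and discovery. Shahriar and Zulkernine (2012) did not cover this class of approaches.
This article surveys software vulnerability analysis and discovery approaches that use data-mining and machine-learning techniques. It first defines the problem of software vulnerability analysis and discovery and briefly introduces the traditional approaches in the field. It then briefly introduces machine-learning and data-mining techniques and the motivation for using them. Next, it reviews the works that apply these techniques to software vulnerability analysis and discovery, groups them into categories, and discusses their advantages and limitations. Finally, it discusses the challenges of the field and points out some uncharted territories to stimulate work in this emerging research area.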

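Background: Software Vulnerability Analysis and Discovery

Definition of Software Vulnerability

We first give earlier definitions of a software security vulnerability: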

“an instance of an error in the specification, development, or configuration of software such that its execution can violate the security policy.” (Krsul 1998)

“A software vulnerability is an instance of a mistake in the specification, development, or configuration of software such that its execution can violate the explicit or implicit security policy.” Ozment (2007)

“In the context of software security, vulnerabilities are specific flaws or oversights in a piece of software that allow attackers to do something malicious: expose or alter sensitive information, disrupt or destroy a system, or take control of a computer system or program.” Dowd et al. (2007)

IEEE Standard Glossary of Software Engineering Terminology (IEEE Standards 1990):

error: “the difference between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition”.
fault: “an incorrect step, process, or data definition in a computer program”. Faults are also known as flaws or bugs.
failure: “the inability of a system or component to perform its required functions within specified performance requirements”.
mistake: “a human action that produces an incorrect result”.

A summary and clarification of the relation of these terms is to “distinguish between the human action (a mistake), its manifestation (a hardware or software fault), the result of the fault (a failure), and the amount by which the result is incorrect (the error)” (IEEE Standards 1990).

The definition used in this article:

A software vulnerability is an instance of a flaw, caused by a mistake in the design, development, or configuration of software such that it can be exploited to violate some explicit or implicit security policy.

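The cause of a software vulnerability is a human mistake, and its manifestation is a flaw. Executing vulnerable software does not necessarily violate the security policy. A violation can occur only when some specific input (an exploit), or some random data that happens to satisfy certain conditions, reaches the flawed statement.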

“In general, software vulnerabilities can be thought of as a subset of the larger phenomenon of software bugs. Security vulnerabilities are bugs that pack an extra hidden surprise: A malicious user can leverage them to launch attacks against the software and supporting systems.” Dowd et al. (2007)

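Soundness, Completeness, and Undecidability

Program vulnerability analysis is the problem of deciding whether a given program contains known security vulnerabilities (with respect to a security policy). Based on the undecidability of Turing's halting problem and Rice's theorem, many program analysis problems can be shown to be undecidable in the general case (Landi 1992; Reps 2000). For practitioners, undecidability means that no solution to the problem is both sound and complete.
In mathematical logic, a proof system is sound if it approves no invalid argument, and complete if it can approve every valid argument. It follows that a sound and complete proof system approves all valid arguments and rejects all invalid ones (Xie et al. 2005). In the context of software security, a vulnerability analysis system is sound if it never approves a vulnerable program (no missed vulnerabilities), and complete if it can approve every secure program (no false vulnerabilities).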
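The two properties can be made concrete with a small check. The sketch below is only an illustration: the program names, the ground-truth labels, and the two trivial analyzers are hypothetical. It shows that soundness or completeness alone is easy to obtain (reject everything, or approve everything), and that the difficulty lies in achieving both at once.

```python
from typing import Dict


def is_sound(vulnerable: Dict[str, bool], approved: Dict[str, bool]) -> bool:
    """Sound: no vulnerable program is approved (no missed vulnerabilities)."""
    return not any(vulnerable[p] and approved[p] for p in vulnerable)


def is_complete(vulnerable: Dict[str, bool], approved: Dict[str, bool]) -> bool:
    """Complete: every secure program is approved (no false vulnerabilities)."""
    return all(approved[p] for p in vulnerable if not vulnerable[p])


# Hypothetical ground truth: True means the program really is vulnerable.
truth = {"prog_a": True, "prog_b": False, "prog_c": False}

# Two trivial analyzers, given as verdict tables (True means "approved").
reject_all = {p: False for p in truth}   # never approves anything
approve_all = {p: True for p in truth}   # approves everything

print(is_sound(truth, reject_all), is_complete(truth, reject_all))
print(is_sound(truth, approve_all), is_complete(truth, approve_all))
```

The reject-everything analyzer is sound but not complete, and the approve-everything analyzer is complete but not sound.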

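Beyond vulnerability analysis, a more practical system is a program vulnerability discovery (or vulnerability reporting) system. Whereas a vulnerability analysis system approves or rejects the security of a given program (a binary output), a vulnerability discovery system reports more detailed information, such as type and location, about each vulnerability found in the program. Likewise, no software vulnerability discovery system is both sound and complete.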

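Traditional Approaches

Static analysis: analyzes the properties of a given program from its source code, without executing it, using generalizing abstractions.
Dynamic analysis: executes a given program on specific inputs and monitors its runtime behavior.
Hybrid analysis: combines static and dynamic analysis.

Among the many vulnerability discovery techniques, the following three are the most widely used in the software industry.

Software penetration testing
Fuzz testing
Static data-flow analysis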
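As a rough illustration of fuzz testing, which is a form of dynamic analysis, the sketch below feeds random byte strings to a target and records the inputs that raise an exception. Both `parse_record` (a toy parser with a deliberately planted flaw) and `fuzz` are hypothetical names invented for this example. Real fuzzers are far more sophisticated about input generation and crash triage.

```python
import random
from typing import Callable, List, Tuple


def parse_record(data: bytes) -> int:
    """Toy target: the first byte is a length field, the rest is the payload."""
    if not data:
        return 0
    length = data[0]
    payload = data[1:]
    if length == 0:
        return 0
    # Planted flaw: the length field is trusted without a bounds check.
    return payload[length - 1]


def fuzz(target: Callable[[bytes], int], trials: int = 200,
         seed: int = 0) -> List[Tuple[bytes, str]]:
    """Run the target on random inputs and collect the ones that crash it."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    crashes = []
    for _ in range(trials):
        size = rng.randint(1, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except Exception as exc:  # any exception counts as a crash here
            crashes.append((data, type(exc).__name__))
    return crashes


crashes = fuzz(parse_record)
print(f"{len(crashes)} crashing inputs found")
```

Each recorded input can be replayed against `parse_record` to reproduce the failure, which is what makes fuzzing findings easy to confirm.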

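Machine-Learning and Data-Mining Techniques

Besides the traditional approaches above, a different class of approaches, which applies techniques from data science and artificial intelligence to software vulnerability analysis and discovery, has received considerable attention since 2011.
Machine-learning methods from the AI field have proven effective in a wide range of application scenarios.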

Hopes and Concerns

“In their early days, computer security and artificial intelligence didn’t seem to have much to say to each other . . . Security researchers aimed to fix the leaks in the plumbing of the computing infrastructure or design infrastructures they deemed leakproof . . . But the two fields have grown closer over the years, particularly where attacks have aimed to simulate legitimate behaviors . . . We might imagine systems that would have a degree of self-awareness about the data that they process. The notion of reflective systems (systems that can reference and modify their own behavior) has its origins in the AI community . . . Imagine a plumbing system that contained a system of smart pipes that could detect incipient leaks. A cyber-infrastructure that incorporated the analog of smart pipes would be of great interest.” Landwehr (2008)

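Categorization of Approaches

Vulnerability prediction based on software metrics. A large body of work uses well-known software metrics as the feature set, builds a prediction model with machine-learning methods, and then uses the model to assess the vulnerability status of software artifacts.
Anomaly detection approaches. These use unsupervised methods to automatically extract a model of normal behavior and detect vulnerabilities as deviations from it.
Vulnerable code pattern recognition. These use machine-learning methods to extract patterns of vulnerable code segments from many vulnerable code samples, and then use pattern matching to detect and locate vulnerabilities in software source code.
Miscellaneous approaches. Recent works that fall outside the categories above and use techniques from AI and data science for software vulnerability analysis and discovery.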

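Vulnerability Prediction Based on Software Metrics

Vulnerability prediction models use general software-engineering metrics together with data-mining, machine-learning, and statistical-analysis techniques to predict vulnerable software artifacts (source-code files, object-oriented classes, and so on). The main idea comes from software quality and reliability assurance in software engineering, where fault prediction models have already been applied in industry (Khoshgoftaar et al. 1997).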
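A metrics-based prediction model can be sketched in a few lines. The example below is a minimal illustration, not the method of any surveyed paper: it fits a logistic-regression classifier by stochastic gradient descent on two normalized metrics per source file (complexity and code churn). The six training rows are synthetic toy data invented for this example, and the choice of logistic regression is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that a file with metric vector x is vulnerable."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X: List[Sequence[float]], y: List[int],
                   lr: float = 0.1, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi   # gradient of the log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Synthetic toy data: (complexity, code churn), both scaled to [0, 1].
X = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
y = [0, 0, 0, 1, 1, 1]   # 1 = file had a reported vulnerability

w, b = train_logistic(X, y)
print(round(predict_proba(w, b, (0.9, 0.9)), 3))  # complex, frequently changed file
print(round(predict_proba(w, b, (0.1, 0.1)), 3))  # simple, stable file
```

In practice the labels would come from a source such as public advisories and the features from a metrics tool, and the model would be evaluated on files or projects it was not trained on.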

Table 1. Summary of Recent Works on Vulnerability Prediction Models Based on Software Metrics

Paper	Metrics	Granularity	Within/Cross-project	Vulnerability info
(Zimmermann et al. 2010)	Code-churn, complexity, coverage, dependency, organizational	Binary modules	Within-project	Public advisories
(Meneely and Williams 2010)	Developer-activity	Source file	Within-project	Public advisories
(Doyle and Walden 2011)	Code complexity, Security Resources Indicator	Source file	Within-project	Tool-based detection
(Shin and Williams 2013)	Complexity, code-churn, fault-history	Source file	Within-project	Public advisories
(Shin and Williams 2011)	Code complexity, dependency network complexity, execution complexity	Source file	Within-project	Public advisories
(Shin et al. 2011)	Complexity, code-churn, developer-activity	Source file	Within-project	Public advisories
(Moshtari et al. 2013)	Unit complexity, coupling	Source file	Both	Self-developed detection framework
(Meneely et al. 2013)	Code-churn, developer-activity	Code commits	Within-project	Public advisories
(Bosu et al. 2014)	Developer-activity	Code commits	Within-project	Public advisories
(Perl et al. 2015)	Code-churn, developer-activity, GitHub meta-data	Code commits	Cross-project	Public advisories
(Walden et al. 2014)	Code complexity	Source file	Both	Public advisories
(Morrison et al. 2015)	Code-churn, complexity, coverage, dependency, organizational	Binary modules, source file	Within-project	Public advisories
(Younis et al. 2016)	Code complexity, information flow, functions, invocations	Functions	Within-project	Public advisories

Anomaly Detection Approaches

Table 2. Summary of Reviewed Works on Anomaly Detection for Vulnerability Discovery

Paper	Type	Approach	Within/Cross-project	Security focused
(Engler et al. 2001)	API usage pattern	Template-based rule extraction	Within	Yes
(Livshits and Zimmermann 2005)	API usage pattern	Association rule mining	Within	No
(Li and Zhou 2005)	API usage pattern	Frequent closed itemset mining	Within	No
(Wasylkowski et al. 2007)	API usage pattern	Frequent closed itemset mining	Within	No
(Acharya et al. 2007)	API usage pattern	Frequent partial-order itemset mining	Cross	No
(Chang et al. 2008)	Missing checks	Maximal frequent sub-graph mining	Within	No
(Thummalapenta and Xie 2009)	API usage pattern + Missing checks	Imbalanced frequent itemset mining	Cross	No
(Gruska et al. 2010)	API usage pattern	Frequent closed itemset mining	Cross	No
(Yamaguchi et al. 2013)	Missing checks	k-Nearest neighbors + bag-of-words	Within	Yes
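Several of the works in Table 2 mine frequent API usage patterns and report deviations from them. The sketch below is loosely in that spirit and is not the algorithm of any specific paper. It mines pairwise rules of the form "functions that call a usually also call b" from a hypothetical code base, then flags the functions that break a rule. The function names, API names, and thresholds are all assumptions made for illustration.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple


def mine_rules(functions: Dict[str, Set[str]], min_support: int = 3,
               min_conf: float = 0.75) -> List[Tuple[str, str]]:
    """Return rules (a, b): functions that call a usually also call b."""
    apis = sorted(set().union(*functions.values()))
    rules = []
    for a, b in permutations(apis, 2):
        with_a = [f for f, calls in functions.items() if a in calls]
        both = [f for f in with_a if b in functions[f]]
        # Support: how many functions follow the rule.
        # Confidence: what fraction of the functions calling a also call b.
        if len(both) >= min_support and len(both) / len(with_a) >= min_conf:
            rules.append((a, b))
    return rules


def find_anomalies(functions: Dict[str, Set[str]],
                   rules: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Report (function, a, b) where the function calls a but never b."""
    return sorted(
        (f, a, b)
        for a, b in rules
        for f, calls in functions.items()
        if a in calls and b not in calls
    )


# Hypothetical code base: each function mapped to the set of APIs it calls.
functions = {
    "f1": {"lock", "unlock", "read"},
    "f2": {"lock", "unlock", "write"},
    "f3": {"lock", "unlock"},
    "f4": {"lock", "unlock", "read"},
    "f5": {"lock", "read"},   # takes the lock but never releases it
}

rules = mine_rules(functions)
print(find_anomalies(functions, rules))
```

Here `f5` is reported because it calls `lock` without `unlock`, while the other four functions that call `lock` do pair it with `unlock`. As the "Security focused" column suggests, such anomalies are candidate bugs in general, and only some of them are vulnerabilities.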

Vulnerable Code Pattern Recognition

Table 3. Summary of Reviewed Works on Vulnerable Code Pattern Recognition

Paper	Code Processing Approach	Learning Approach	Static/Hybrid	Source/Binary
(Yamaguchi et al. 2011, 2012)	Extracting AST with parser	Supervised (classification)	Static	Source
(Shar and Tan 2012, 2013)	Static data flow analysis	Supervised (classification)	Static	Source
(Shar et al. 2013, 2015)	Static program slicing and control flow analysis	Semi-supervised and supervised (classification)	Hybrid	Source
(Scandariato et al. 2014)	Bag-of-words extraction from program source text	Supervised (classification)	Static	Source
(Yamaguchi et al. 2014, 2015)	Extracting Code Property Graph	Unsupervised (clustering)	Static	Source
(Pang et al. 2015)	N-gram analysis on program source text	Supervised (classification)	Static	Source
(Grieco et al. 2015)	N-gram analysis on function call sequences	Supervised (classification)	Hybrid	Binary
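The simplest code-processing approach in Table 3 is to treat source text as a bag of words and train a supervised classifier on it. The sketch below illustrates that idea with a multinomial Naive Bayes classifier using add-one smoothing. The choice of classifier, the tokenizer, and the six labeled C snippets are assumptions made for this example, not a reproduction of any surveyed work.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Dict[str, Counter], Counter, Set[str]]


def tokenize(code: str) -> List[str]:
    """Bag-of-words tokens: identifiers and keywords only."""
    return re.findall(r"[A-Za-z_]\w*", code)


def train(samples: List[Tuple[str, str]]) -> Model:
    """Count tokens per label and labeled samples per label."""
    counts: Dict[str, Counter] = {}
    docs: Counter = Counter()
    for code, label in samples:
        counts.setdefault(label, Counter()).update(tokenize(code))
        docs[label] += 1
    vocab = set().union(*counts.values())
    return counts, docs, vocab


def classify(model: Model, code: str) -> str:
    """Pick the label with the highest Naive Bayes log-probability."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    scores = {}
    for label, tok_counts in counts.items():
        total = sum(tok_counts.values())
        score = math.log(docs[label] / total_docs)
        for tok in tokenize(code):
            # Add-one smoothing keeps unseen tokens from zeroing the score.
            score += math.log((tok_counts[tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy labeled snippets, invented for illustration only.
samples = [
    ("strcpy(buf, input);", "vulnerable"),
    ("gets(line);", "vulnerable"),
    ("sprintf(out, fmt, name);", "vulnerable"),
    ("strncpy(buf, input, sizeof(buf));", "safe"),
    ("fgets(line, sizeof(line), stdin);", "safe"),
    ("snprintf(out, sizeof(out), fmt, name);", "safe"),
]

model = train(samples)
print(classify(model, "strcpy(dst, src);"))
print(classify(model, "snprintf(dst, sizeof(dst), fmt);"))
```

A model this small mostly memorizes which function names appeared under which label. The works in Table 3 use richer representations such as ASTs, data-flow information, and code property graphs to capture structure that plain tokens miss.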

Miscellaneous Approaches

Table 4. Summary of Reviewed Miscellaneous Approaches

Paper	Approach Summary
(Sparks et al. 2007)	Used Genetic Algorithm (GA) for intelligently guiding the input selection process of black-box fuzz testing
(Wijayasekara et al. 2012, 2014)	Used text mining (bag-of-words) on bug reports in open bug databases for identifying hidden impact bugs (HIBs)
(Alvares et al. 2013)	Used a hybrid of static data-flow analysis and computational intelligence (GA and FSS) techniques for discovering exploitable memory corruption vulnerabilities
(Medeiros et al. 2014)	Used classification techniques on the output of static tainted data-flow analysis for web application vulnerability discovery to identify false-positive reports
(Sadeghi et al. 2014)	Used a probabilistic rule ranking approach based on the information contained in categorized software repositories to improve the efficiency and scalability of static vulnerability analysis tools
