15. Machine-Learning Supported Vulnerability Detection in Source Code

 
 
 
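(An An's understanding) The paper builds an interface to check which code representations pair best with which machine learning architectures and which combination is most accurate, i.e. a benchmark,
and then designs a customized feature model.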
 
 
 
The idea is very similar to the paper below, although that one builds on code similarity algorithms:
VulPecker: an automated vulnerability detection system based on code similarity analysis(Published in ACSAC '16 2016 Computer Science)

1 INTRODUCTION

According to the recent survey by Ghaffarian and Shahriari [5], machine-learning-based vulnerability discovery can be divided into three areas:

vulnerability detection based on software metrics
anomaly detection
vulnerable code pattern recognition

 

This paper focuses on the third area.

Our goal is, therefore, a supervised machine learning process that extracts patterns of vulnerable code snippets and re-identifies them in unseen source code.
 
 
Vulnerability knowledge can be divided into known patterns and unknown patterns.
The latter can only be discovered by an anomaly-based method.
We will focus on known patterns categorized by the most common weaknesses, or CWEs [12], a project for classifying frequent vulnerabilities.
 
 

2 RELATED WORK

 
Vulnerability analysis of source code can be divided into three types: lexical, syntactic, and semantic analysis [7].
Looking at source code analysis with machine learning at different stages, we can distinguish three waves [1]:
1) The first wave consists of basic tools with hand-crafted features.
2) The second wave follows a lexical analysis, which has already been studied extensively: it treats code as text and organizes code into classes using natural language processing techniques.
3) The ongoing development, which takes the semantics of programming languages into account, is referred to as the third wave of MLoC and is pointed out as future work.
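To make the second-wave idea of "treating code as text" concrete, here is a minimal sketch of a lexer that turns a C-like snippet into a token sequence. The regular expression and the name `tokenize_c` are illustrative assumptions and do not come from the paper; a real lexer would also handle comments, preprocessor directives, and character literals.

```python
import re

# Hypothetical, simplified token pattern for C-like code (an assumption for
# illustration; comments and preprocessor directives are not handled).
TOKEN_RE = re.compile(r'''
    "(?:\\.|[^"\\])*"                  # string literal
  | [A-Za-z_]\w*                       # identifier or keyword
  | \d+                                # integer literal
  | ->|\+\+|--|==|!=|<=|>=|&&|\|\|     # two-character operators
  | \S                                 # any other single non-space character
''', re.VERBOSE)

def tokenize_c(source):
    """Return the list of lexical tokens found in a C-like source string."""
    return TOKEN_RE.findall(source)

tokens = tokenize_c('char buf[8]; strcpy(buf, argv[1]);')
print(tokens)
```

The resulting token list is the kind of input that natural language processing techniques (bag of words, embeddings) can then consume.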
 

 

Allamanis et al. represent programs as graphs [2], where edges
correspond to syntactic and semantic relationships. They evaluate
their approach in open source C# projects to predict variable names
based on their usage and to predict the correct variable name at the
corresponding program location. They give selected examples for
the correct prediction of variable usage and also test their model on
unseen projects. This work is relevant to the third wave of MLoC
and there is still plenty of room for improvements in accuracy and
F1 score.
[2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning
to Represent Programs with Graphs. arXiv:1711.00740 [cs] (Nov. 2017).
http://arxiv.org/abs/1711.00740

 


In the work of Tufano et al. [17], different source code representations are used to detect code clones with a deep learning approach. Identifiers, abstract syntax trees (ASTs), control flow graphs (CFGs), and byte code serve as representations, each providing an orthogonal view of the code snippet. The authors demonstrate the effectiveness of each representation on its own and also build a combined model with ensemble learning.
They show that both single and combined representations achieve very high accuracy in this experiment. We also want to test single and combined representations of source code, but for another application: vulnerability detection.
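One simple way to combine per-representation models is soft voting, i.e. a weighted average of each model's predicted probability. The sketch below is only an illustration of that idea: the notes do not say which ensemble method Tufano et al. use, and the function name `soft_vote`, the scores, and the 0.5 threshold are all made up.

```python
def soft_vote(probabilities, weights=None):
    """Combine per-representation probabilities by weighted averaging."""
    names = list(probabilities)
    if weights is None:
        # Default: every representation gets the same say.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    return sum(weights[name] * probabilities[name] for name in names) / total

# Made-up scores for one code snippet from four hypothetical models,
# one per representation named in the text.
scores = {"identifiers": 0.62, "ast": 0.71, "cfg": 0.55, "bytecode": 0.40}
combined = soft_vote(scores)
label = "vulnerable" if combined >= 0.5 else "secure"
print(combined, label)
```

Passing a `weights` dictionary lets a stronger representation count more, which is one way to test single against combined representations.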
 
[17] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin
White, and Denys Poshyvanyk. 2018. Deep learning similarities from different
representations of source code. In Proceedings of the 15th International Conference
on Mining Software Repositories - MSR ’18. ACM Press, Gothenburg, Sweden,
542–553. https://doi.org/10.1145/3196398.3196431
 

The actual detection of vulnerabilities in source code is studied in [15], using an identifier representation in a deep learning setting. The authors created a data set from a variety of sources and labeled it based on the results of static analysis tools, considering the top five CWE categories. They empirically developed a best-practice pipeline: a random embedding of the source code tokens, feature learning through a one-dimensional CNN, and a random forest classifier that takes the learned features as input and decides whether the code is secure or vulnerable. This work can be assigned to the second wave of MLoC. An embedding that pays more attention to token semantics than a random embedding could achieve a higher classification accuracy. In addition, a tree representation of the code could improve this method further.
 
[15] Rebecca L. Russell, Louis Y. Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer,
Onur Ozdemir, Paul M. Ellingwood, and Marc W. McConley. 2018. Automated
Vulnerability Detection in Source Code Using Deep Representation Learning. In
17th IEEE International Conference on Machine Learning and Applications, ICMLA
2018, Orlando, FL, USA, December 17-20, 2018. 757–762.
https://doi.org/10.1109/ICMLA.2018.00120
 

A graph-based representation brings better performance for
vulnerability detection, as shown in the work of Kronjee et al. [9],
who use CFGs to detect SQL injections and cross-site scripting in PHP
applications. They show that the choice of machine learning
algorithm is not crucial, because different algorithms yield similar
AUC-PR (area under the precision-recall curve) values for the same
vulnerability. We also want to include CFGs as a code representation
in our work, but for C/C++ fragments.
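For readers unfamiliar with AUC-PR, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. The notes do not say how Kronjee et al. compute the metric; the function name and the toy labels and scores are assumptions for illustration, and tied scores are not treated specially.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    # Rank samples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / positives

labels = [1, 0, 1, 0, 0]          # 1 = vulnerable, 0 = secure (made-up data)
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(labels, scores)
print(ap)
```

Unlike accuracy, this metric stays informative when vulnerable samples are rare, which is the usual situation in vulnerability data sets.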
 
[9] Jorrit Kronjee, Arjen Hommersom, and Harald Vranken. 2018. Discovering
Software Vulnerabilities Using Data-flow Analysis and Machine Learning. In
Proceedings of the 13th International Conference on Availability, Reliability and
Security (ARES 2018), Hamburg, Germany. ACM, New York, NY, USA, 6:1–6:10.
https://doi.org/10.1145/3230833.3230856
 
 

A path-based approach called Code2Vec [3] provides a
framework that creates a fixed-length continuous vector from code
snippets of any size in order to detect code similarities. However, the results
show that the prediction accuracy depends less on the path than
on the variable names of the start and end nodes of the considered
path. Another limitation of this work is the non-universal, closed
label vocabulary, which is limited to the training data. The concept
is nevertheless promising. In our future work, we want to find out how well this
approach works for vulnerability discovery.
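The "fixed-length vector from snippets of any size" property comes from aggregating a variable number of path-context vectors into one vector. In code2vec this is done with a learned attention vector; the sketch below shows only that aggregation step, with made-up numbers standing in for the learned context vectors and attention weights. The function name is hypothetical.

```python
import math

def aggregate_contexts(context_vectors, attention):
    """Softmax-attention weighted sum of context vectors into one vector."""
    scores = [sum(c * a for c, a in zip(vec, attention)) for vec in context_vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(attention)
    return [sum(w * vec[i] for w, vec in zip(weights, context_vectors))
            for i in range(dim)]

attention = [1.0, 0.0, -1.0]  # made-up values; learned in a real model
short_snippet = [[0.2, 0.1, 0.0], [0.5, -0.3, 0.4]]
long_snippet = short_snippet + [[0.9, 0.9, 0.9], [-0.1, 0.0, 0.2], [0.3, 0.3, -0.6]]

# Two and five path contexts both collapse to a vector of length three.
print(aggregate_contexts(short_snippet, attention))
print(aggregate_contexts(long_snippet, attention))
```

Because the output size depends only on the embedding dimension, snippets with different numbers of paths become directly comparable.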
 
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. code2vec: Learning Distributed Representations of Code. arXiv:1803.09473 [cs, stat] (March 2018).
http://arxiv.org/abs/1803.09473 arXiv: 1803.09473.
 

Binary code representations are used for pattern analysis [6, 10,
19] by identifying potentially dangerous usage patterns of the C
standard library to predict the probability of a crash when executing
certain commands and to test them with various fuzzing tools.
However, due to high prediction errors, we move away from the
idea of analyzing byte code and look instead at code, which is
represented in a high-level format.
 
 

 

3 BENCHMARKING STATE-OF-THE-ART METHODS

 
 
 

4 FOCUS ON ACTIONABILITY

In general, static analysis tools often come with a high false positive
rate, whereas dynamic analysis tools often provide a high number
of false negatives [16].
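The two error types can be made precise with a confusion matrix. The sketch below computes the false positive rate and false negative rate; the counts are invented purely to illustrate the contrast described above and are not measurements from [16].

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from confusion counts."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of secure code flagged
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # share of vulnerable code missed
    return fpr, fnr

# Invented counts for two hypothetical tools on the same 150 functions.
static_tool = error_rates(tp=40, fp=30, tn=70, fn=10)
dynamic_tool = error_rates(tp=15, fp=2, tn=98, fn=35)
print(static_tool, dynamic_tool)
```

A high false positive rate costs reviewer time on harmless findings, while a high false negative rate leaves real vulnerabilities undetected, which is why actionability depends on both.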
Recent work [9, 15] classifies code snippets at the function level.
 
 
 

5 COMBINING REPRESENTATIONS TO AN IMPROVED MODEL

 
Combine different representations of the code, so that lexical and syntactic analysis can be used together.
Combine different encodings.
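The notes do not say how the representations are combined. One possibility, shown here only as an assumed sketch, is to concatenate a lexical feature vector (token counts) with a syntactic one. The nesting depth used below is a crude stand-in for real AST features, and all function names and the vocabulary are hypothetical.

```python
from collections import Counter

def lexical_features(tokens, vocabulary):
    """Count how often each vocabulary token occurs in the snippet."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocabulary]

def nesting_depth(tokens, opener, closer):
    """Maximum nesting depth of one bracket pair (crude syntactic stand-in)."""
    depth = best = 0
    for tok in tokens:
        if tok == opener:
            depth += 1
            best = max(best, depth)
        elif tok == closer:
            depth -= 1
    return best

def combined_features(tokens, vocabulary):
    """Concatenate the lexical and syntactic views into one feature vector."""
    syntactic = [nesting_depth(tokens, "{", "}"), nesting_depth(tokens, "(", ")")]
    return lexical_features(tokens, vocabulary) + syntactic

tokens = ["if", "(", "n", ")", "{", "strcpy", "(", "buf", ",", "src", ")", ";", "}"]
vector = combined_features(tokens, ["strcpy", "memcpy", "if"])
print(vector)
```

A single classifier trained on the concatenated vector sees both views at once, in contrast to training one model per representation and merging their predictions afterwards.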
 
 

6 EVALUATION PLAN

 
 