15. Machine-Learning Supported Vulnerability Detection in Source Code

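(An'an's understanding) The idea is to build an interface for checking which code representations fit best with which machine learning architectures and which combinations reach the highest accuracy, in other words a benchmark.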

A customized feature model is then designed.

The idea is very close to that of the paper below, except that the paper below builds on code similarity algorithms:

VulPecker: An Automated Vulnerability Detection System Based on Code Similarity Analysis (published in ACSAC '16, 2016, Computer Science)

1 INTRODUCTION

According to the recent survey by Ghaffarian and Shahriari [5], machine-learning-based vulnerability discovery can be divided into three areas:

vulnerability detection based on software metrics

anomaly detection

vulnerable code pattern recognition

This paper focuses on the third area.

Our goal is, therefore, a supervised machine learning process that extracts patterns of vulnerable code snippets and re-identifies them in unseen source code.

The vulnerability knowledge can be divided into known patterns and unknown patterns.

The latter can only be discovered by an anomaly-based method.

We will focus on known patterns, categorized according to the Common Weakness Enumeration (CWE) [12], a project for classifying frequent vulnerabilities.

2 RELATED WORK

Vulnerability analysis on source code can be divided into three types: lexical, syntactic and semantic analysis [7].

By looking at source code analysis with machine learning at different stages, we can distinguish three waves [1]:

1）The first wave consists of basic tools with hand-crafted features.

2）The second wave follows a lexical analysis, which has already been studied extensively, treating code as text and organizing code into classes using natural language processing techniques.

3）The ongoing development, which takes the semantics of programming languages into account, is referred to as the third wave of MLoC and is pointed out as future work.
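The lexical view of the second wave, where code is treated as text, can be illustrated with a small sketch. The regular expression, the function names and the C snippet are hypothetical and chosen only for illustration; a real lexer would also handle string literals, comments and preprocessor directives.

```python
import re
from collections import Counter

# Hypothetical, simplified lexer for C-like code: identifiers or keywords,
# integer literals, a few two-character operators, then any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|->|&&|\|\||[^\s\w]")

def tokenize(code):
    """Split source text into a flat token sequence, ignoring whitespace."""
    return TOKEN_RE.findall(code)

def bag_of_tokens(code):
    """Count token occurrences, as a text classifier would count words."""
    return Counter(tokenize(code))

snippet = "char buf[8]; strcpy(buf, input);"
tokens = tokenize(snippet)
bag = bag_of_tokens(snippet)
```

Such a token sequence or token count vector carries no syntactic structure, which is the gap that the tree and graph representations of the third wave try to close.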

Allamanis et al. represent programs as graphs [2], where edges correspond to syntactic and semantic relationships. They evaluate their approach in open source C# projects to predict variable names based on their usage and to predict the correct variable name at the corresponding program location. They give selected examples for the correct prediction of variable usage and also test their model on unseen projects. This work is relevant to the third wave of MLoC and there is still plenty of room for improvements in accuracy and F1 score.

[2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to Represent Programs with Graphs. arXiv:1711.00740 [cs] (Nov. 2017). http://arxiv.org/abs/1711.00740

In the work of Tufano et al. [17], different source code representations are used to detect code clones with a deep learning approach. Identifiers, abstract syntax trees (ASTs), control flow graphs (CFGs), and byte code are used as representations, each providing an orthogonal view of the code snippet. The authors demonstrate the effectiveness of each representation on its own and also create a combined model with ensemble learning.

The authors have shown that both single and combined representations work with very high accuracy in this experiment. We also want to test single and combined representations of source code, but for another application: vulnerability detection.

[17] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2018. Deep learning similarities from different representations of source code. In Proceedings of the 15th International Conference on Mining Software Repositories - MSR ’18. ACM Press, Gothenburg, Sweden, 542–553. https://doi.org/10.1145/3196398.3196431
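One simple way to combine several representations is to train one model per representation and average their output scores. The sketch below assumes this weighted-average scheme purely for illustration; the notes do not say which ensemble method [17] uses, and the function name, representation names and scores are hypothetical.

```python
from typing import Dict, Optional

def combine_scores(scores: Dict[str, float],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of the scores produced by per-representation models."""
    if weights is None:
        # Without explicit weights, every representation counts equally.
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

# Hypothetical outputs of three models, one per code representation.
scores = {"identifiers": 0.9, "ast": 0.6, "cfg": 0.3}
combined = combine_scores(scores)
weighted = combine_scores(scores, {"identifiers": 1.0, "ast": 1.0, "cfg": 2.0})
```

Weights could, for example, be tuned on a validation set so that the more reliable representation dominates the combined decision.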

The actual detection of vulnerabilities in source code is researched in [15] using an identifier representation in a deep learning environment. The authors created a data set from a variety of sources and labeled it based on the results of static analysis tools, considering the top five CWE categories. They empirically developed a best-practice pipeline consisting of a random embedding of the source code tokens, feature learning through a one-dimensional CNN, and a random forest classifier that takes the learned features as input and decides whether the code is secure or vulnerable. This work could be assigned to the second wave of MLoC. By creating an embedding that pays more attention to the token semantics than just a random embedding, this method could achieve a higher classification accuracy. In addition, with a tree representation of the code, this method could be further improved.

[15] Rebecca L. Russell, Louis Y. Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul M. Ellingwood, and Marc W. McConley. 2018. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. In 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, December 17-20, 2018. 757–762. https://doi.org/10.1109/ICMLA.2018.00120
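The feature-extraction part of the pipeline described for [15] (token embedding followed by a one-dimensional convolution) can be sketched in plain Python. This is a minimal sketch under strong assumptions: the embedding and the filters are random and untrained, the global max pooling step is an assumed way to obtain a fixed-length vector, all names and sizes are hypothetical, and the random forest that would consume the features is omitted.

```python
import random

def random_embedding(vocab, dim, seed=0):
    """Assign each token a fixed random vector (no training in this sketch)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for tok in sorted(vocab)}

def random_filters(count, width, dim, seed=1):
    """Hypothetical untrained filters, each spanning `width` consecutive tokens."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(width)]
            for _ in range(count)]

def conv1d_max_pool(vectors, filters):
    """Slide each filter along the token axis, apply ReLU, keep the maximum."""
    features = []
    for filt in filters:
        width = len(filt)
        best = 0.0  # ReLU clips negative activations to zero
        for start in range(len(vectors) - width + 1):
            window = vectors[start:start + width]
            activation = sum(w * x
                             for row, vec in zip(filt, window)
                             for w, x in zip(row, vec))
            best = max(best, activation)
        features.append(best)
    return features

def extract_features(tokens, embedding, filters):
    """Map a token sequence of any length to one value per filter."""
    return conv1d_max_pool([embedding[t] for t in tokens], filters)

short = ["strcpy", "(", "buf", ",", "src", ")", ";"]
long = ["if", "(", "n", "<", "8", ")"] + short
embedding = random_embedding(set(long), dim=4)
filters = random_filters(count=5, width=3, dim=4)
short_features = extract_features(short, embedding, filters)
long_features = extract_features(long, embedding, filters)
```

Both token sequences yield a vector with one entry per filter, which is what allows a downstream classifier to work on functions of different lengths.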

A graph-based representation brings better performance for vulnerability detection, as shown in the work of Kronjee et al. [9], who use CFGs to detect SQL injection and cross-site scripting in PHP applications. They have shown that the choice of machine learning algorithm is not crucial, because different algorithms yield approximately similar AUC-PR (area under the precision-recall curve) values for the same vulnerability. We also want to include CFGs as a code representation in our work, but for C/C++ fragments.

[9] Jorrit Kronjee, Arjen Hommersom, and Harald Vranken. 2018. Discovering Software Vulnerabilities Using Data-flow Analysis and Machine Learning. In Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018), Hamburg, Germany. ACM, New York, NY, USA, 6:1–6:10. https://doi.org/10.1145/3230833.3230856
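The AUC-PR metric mentioned above can be estimated from ranked classifier scores. The sketch below uses the step-wise sum known as average precision, which is one common estimator and not necessarily the one used in [9]; the labels and scores are made up for illustration.

```python
def average_precision(labels, scores):
    """Estimate the area under the precision-recall curve.

    labels: 1 for vulnerable, 0 for not vulnerable.
    scores: classifier confidence that the sample is vulnerable.
    Ties in the scores are kept in input order (a simplification).
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            area += hits / rank  # precision at the point where recall increases
    return area / positives

labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

Unlike accuracy, this value is not inflated by the large number of non-vulnerable samples, which is why it suits heavily imbalanced vulnerability data sets.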

A path-based approach called Code2Vec [3] is used to provide a framework that creates a fixed-length continuous vector from code snippets of any size to detect code similarities. However, the results show that the prediction accuracy depends less on the path than on the variable names of the start and end nodes in the considered path. Another limitation of this work is the non-universal, closed label vocabulary, which is limited to the training data. The work is nevertheless a promising concept. In our future work, we want to find out how well this approach works for vulnerability discovery.

[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. code2vec: Learning Distributed Representations of Code. arXiv:1803.09473 [cs, stat] (March 2018). http://arxiv.org/abs/1803.09473
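The notion of a path between a start node and an end node can be made concrete with a small sketch. It assumes Python's built-in `ast` module as a convenient stand-in parser and a hypothetical path notation (`^` for moving up, `_` for moving down); it covers only the extraction of (start, path, end) triples, not the learned embedding and aggregation that produce the fixed-length vector.

```python
import ast
from itertools import combinations

def path_contexts(code):
    """Return (start token, AST path, end token) triples for all leaf pairs."""
    leaves = []

    def visit(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, trail))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), trail))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, trail)

    visit(ast.parse(code), [])
    contexts = []
    for (start, trail_a), (end, trail_b) in combinations(leaves, 2):
        # Count shared ancestors; the last shared one is the turning point.
        k = 0
        while trail_a[k] is trail_b[k]:
            k += 1
        up = [type(n).__name__ for n in reversed(trail_a[k:])]
        top = type(trail_a[k - 1]).__name__
        down = [type(n).__name__ for n in trail_b[k:]]
        contexts.append((start, "^".join(up + [top]) + "_" + "_".join(down), end))
    return contexts

contexts = path_contexts("x = y + 1")
```

In each triple the two outer elements are the variable names or literals, so a model that relies mostly on them, as the results above suggest, makes little use of the structural part in the middle.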

Binary code representations are used for pattern analysis [6, 10, 19] by identifying potentially dangerous usage patterns of the C standard library to predict the probability of a crash when executing certain commands and to test them with various fuzzing tools. However, due to high prediction errors, we move away from the idea of analyzing byte code and look instead at code that is represented in a high-level format.

3 BENCHMARKING STATE-OF-THE-ART METHODS

4 FOCUS ON ACTIONABILITY

In general, static analysis tools often come with a high false positive rate, whereas dynamic analysis tools often provide a high number of false negatives [16].
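The two error types can be quantified from a tool's verdicts once ground-truth labels are available. The sketch below uses made-up labels and two hypothetical tools that exaggerate the tendency described above; it does not reproduce any measurement from [16].

```python
def error_rates(labels, predictions):
    """Compute false positive and false negative rates (1 = vulnerable)."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return {
        # Share of safe code that is wrongly flagged.
        "false_positive_rate": fp / negatives if negatives else 0.0,
        # Share of vulnerable code that is missed.
        "false_negative_rate": fn / positives if positives else 0.0,
    }

labels = [1, 1, 0, 0, 0, 0]
over_reporting_tool = [1, 1, 1, 1, 0, 0]   # flags too much
under_reporting_tool = [1, 0, 0, 0, 0, 0]  # misses real issues
static_like = error_rates(labels, over_reporting_tool)
dynamic_like = error_rates(labels, under_reporting_tool)
```

A high false positive rate costs reviewer time on safe code, while a high false negative rate leaves real vulnerabilities undetected, so both numbers matter for actionability.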

Recent work [9, 15] classifies a code snippet at the function level.

5 COMBINING REPRESENTATIONS TO AN IMPROVED MODEL

Ensemble the different representations of the code, so that lexical and syntactic analysis can be combined.

Ensemble different encodings.

6 EVALUATION PLAN
