Cross-Project Transfer Representation Learning for Vulnerable Function Discovery

Authors Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, Yang Xiang, O. De Vel, P. Montague
Journal/Conference Name I
Paper Category
Paper Abstract Machine learning is now widely used to detect security vulnerabilities in software, even before the software is released. But its potential is often severely compromised at the early stage of a software project, when we face a shortage of high-quality training data and have to rely on overly generic hand-crafted features. This paper addresses this cold-start problem of machine learning by learning rich features that generalize across similar projects. To reach an optimal balance between feature richness and generalizability, we devise a data-driven method that includes the following innovative ideas. First, the code semantics are revealed through serialized abstract syntax trees (ASTs), with tokens encoded by Continuous Bag-of-Words neural embeddings. Next, the serialized ASTs are fed to a sequential deep learning classifier (Bi-LSTM) to obtain a representation indicative of software vulnerability. Finally, the neural representation obtained from existing software projects is transferred to the new project to enable early vulnerability detection even with a small set of training labels. To validate this vulnerability detection approach, we manually labeled 457 vulnerable functions and collected 30 000+ nonvulnerable functions from six open-source projects. The empirical results confirmed that the trained model is capable of generating representations that are indicative of program vulnerability and is adaptable across multiple projects. Compared with traditional code metrics, our transfer-learned representations are more effective for predicting vulnerable functions, both within a project and across multiple projects.
Date of publication 2018
Code Programming Language Python
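
The first step described in the abstract, serializing a function's AST into a token sequence before it is embedded, can be illustrated with a minimal sketch. It uses Python's built-in ast module on Python source purely as a stand-in; the parser and target language used by the authors are not stated in this listing. The depth-first pre-order traversal, the helper names serialize_ast, build_vocab and encode, the <pad>/<unk> vocabulary entries and the max_len value are all assumptions made for illustration. The Continuous Bag-of-Words embedding and Bi-LSTM stages are not shown.

```python
import ast


def serialize_ast(source):
    """Return AST node-type names in depth-first pre-order (assumed ordering)."""
    tokens = []

    def visit(node):
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def build_vocab(sequences):
    """Map each distinct token to an integer id; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}  # hypothetical special tokens
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(seq, vocab, max_len):
    """Convert tokens to ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in seq][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Two toy functions standing in for functions drawn from a project.
functions = [
    "def f(a, b):\n    return a + b\n",
    "def g(buf, n):\n    return buf[n]\n",
]
sequences = [serialize_ast(src) for src in functions]
vocab = build_vocab(sequences)
encoded = [encode(seq, vocab, max_len=16) for seq in sequences]
```

Each row of encoded is a fixed-length list of integer ids, which is the shape an embedding layer followed by a sequential classifier would typically expect as input.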