Deep Learning for Cyber Vulnerability Discovery: NGTF Project Scoping Study
This report is the result of a scoping study undertaken as part of an Australian Defence Department Next Generation Technologies Fund (NGTF) project entitled Deep Learning for Cyber-Security (DLC). The report motivates the study of software vulnerability discovery and briefly reviews existing techniques for both source and binary code, with an emphasis on machine learning approaches. Given the spectacular successes in recent years of Deep Learning (DL) techniques in areas such as image recognition, the report proposes investigating the application of DL techniques to the software vulnerability discovery problem, with a focus on binary code analysis as most relevant to Defence. As part of this effort, consideration is given to the acquisition and generation of suitable training and testing datasets.
The rapid evolution of malicious cyber capabilities, and their growth in scale and complexity, present an ever-increasing challenge for cyber-security. Rather than reacting or responding to discovered attacks, a proactive approach concentrates on preventative measures through the discovery of software vulnerabilities, particularly in mission-critical applications and systems. Discovered vulnerabilities may then be mitigated before they can be actively exploited.
Some work in recent years has investigated the use of machine learning (ML) techniques to assist software vulnerability discovery. Motivated by the spectacular success of deep learning (DL) approaches in fields such as computer vision and natural language processing, we propose to study the application of DL techniques to software vulnerability discovery. DL approaches can learn feature representations and complex non-linear structures in datasets exhibiting hierarchies of patterns at fine to coarse scales, and there is every reason to expect that they may enjoy similar success in the software vulnerability discovery domain. The primary focus will be on the analysis of binary code vulnerabilities, as this is of most relevance to Defence, though source code approaches will also be considered.
This report is a scoping study undertaken in the context of an Australian Department of Defence (DoD) Next Generation Technologies Fund (NGTF) project entitled Deep Learning for Cyber-Security (DLC). The project will investigate concepts, techniques and technologies relating to the application of deep learning algorithms to the discovery of software vulnerabilities. An initial focus will be on generation of suitable datasets of known vulnerabilities for training and testing of the techniques under study.
The NGTF DLC project scoping study sought to:
- Ground the research programme design in the existing DL and cyber-security literature, so that future work builds on a firm foundation of existing knowledge
- Develop appropriate datasets of known vulnerable code for both training and testing of DL techniques
- Develop deep-learning-based vulnerability discovery techniques and technology solutions
- Detail a plan of work packages, to ensure appropriate long-term project outcomes
In particular, the DLC project seeks to deliver:
- Ground-truth datasets consisting of labelled source/binary codebases
- New DL methods for binary- and source-code-based vulnerability discovery
- Prototypes of DL-based software vulnerability discovery tools
The DLC research team consists of members from CEWD-DST Group, Data61, the University of Melbourne, Deakin University and Swinburne University.