Technical report | Bayesian Modelling of Network Traffic Metadata using Dirichlet Multinomial Mixtures
Statistical theory commends probabilistic modelling techniques for the discovery of latent structure in large datasets not amenable to analysis by inspection. Netflow metadata, for example, may contain latent structure representing different traffic behaviours. The utility of a class of Bayesian models known as Dirichlet multinomial mixtures in discovering such behaviours, and how they might be applied to network analysis problems such as source characterisation, event detection or filtering, is considered herein. Encouragingly, under the right conditions, these models are found to detect and quantify meaningful behavioural distinctions. For an analyst using metadata to mitigate privacy, volume or encryption constraints, but faced with the unpredictable behaviours of cyber adversaries with ever-evolving tools, techniques and procedures, unsupervised learning like Dirichlet mixture modelling could prove a valuable tool.
In the field of network traffic analysis, constraints including privacy, encryption and capacity give impetus to a transition from analysis based on deep packet inspection to analysis based on metadata.
Whereas packet inspection provides full visibility of (unencrypted) content and therefore low ambiguity in interpretation, metadata is content-opaque, and even dissimilar transactions might have near-identical metadata records. Communications once characterised by reference to known byte sequences, or signatures, must instead be assessed by trends, or behaviours, and the high ambiguity in metadata demands that such trends be inferred over multiple observations. An issue of scale arises, both from the volume of observations required and from the high dimensionality of the metadata, so that the data is no longer amenable to analysis by inspection.
Instead, statistical theory commends statistical learning for the discovery of latent structure in data. Bayesian probabilistic modelling techniques from this class, although well established in many other domains, are only recently emerging in network analysis. A subclass known as Dirichlet multinomial mixture (DMM) models appears particularly well matched to network problems, describing a structure in which multiple disparate sources of data are mixed together at measurement, much as the modern internet mixes many disparate protocols and services on a common transport infrastructure. Accordingly, this report seeks to assess the utility of DMM modelling with network metadata in roles such as source characterisation, detection of cyber security events, or related filtering. The significant output from the model is a description of each identifiable source, which provides two derivative results: a clustering of data by source, and a measure of the likelihood that data belongs to a source.
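The two derivative results can be illustrated with a minimal sketch. It assumes one common reading of a DMM, in which each source is described by a Dirichlet parameter vector over the categories of a single attribute and an observation is a vector of category counts. The function names, the two hypothetical sources, their weights and their parameters are all illustrative assumptions and are not taken from the report; in practice the source descriptions would be learned from data rather than fixed by hand.

```python
import math

def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial
    distribution with parameter vector alpha."""
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(c_j!)
    out = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Ratio of Dirichlet normalising constants
    out += math.lgamma(a) - math.lgamma(n + a)
    out += sum(math.lgamma(c + al) - math.lgamma(al)
               for c, al in zip(counts, alpha))
    return out

def score(counts, weights, alphas):
    """Return (responsibilities, log_likelihood) for one observation.

    responsibilities[k] is the posterior probability that the observation
    came from source k (the clustering); log_likelihood is the log of the
    mixture probability of the observation (the likelihood measure).
    """
    joint = [math.log(w) + dirichlet_multinomial_logpmf(counts, al)
             for w, al in zip(weights, alphas)]
    m = max(joint)  # log-sum-exp for numerical stability
    log_lik = m + math.log(sum(math.exp(j - m) for j in joint))
    resp = [math.exp(j - log_lik) for j in joint]
    return resp, log_lik

# Hypothetical model: two sources over three attribute categories.
weights = [0.7, 0.3]                         # assumed mixture weights
alphas = [[8.0, 1.0, 1.0], [1.0, 8.0, 1.0]]  # assumed source descriptions

# One observation: counts per category aggregated over several records.
resp, log_lik = score([9, 1, 0], weights, alphas)
cluster = max(range(len(resp)), key=resp.__getitem__)
```

Here `cluster` is the index of the most probable source for the observation, and `log_lik` indicates how well the observation is explained by the model as a whole; an unusually low value would flag an observation that fits none of the described sources well.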
From a broad range of potential research activities identified, this work concentrates on assessing DMM against filtered views of highly aggregated internet backbone traffic and with a variety of data attributes. The major outcomes are:
- DMM is a suitable model choice for network traffic metadata, i.e. the model building process was found to converge, producing a manageable number of distinct sources, each of which can typically be explained by behavioural trends.
- A broad range of attribute combinations can potentially be used to describe network data observations, and this choice can significantly alter the modelling outcomes. Attributes may be literal metadata fields or derivations thereof. Correlation between the chosen attributes should be minimised to avoid lack of resolution in the source descriptions.
- Model building was effective against the traffic in both highly aggregated and tightly filtered forms.
- Trends in data clusters per source can assist characterisation and detection.
- Trends in likelihood measurements can also be related to behaviours.
Statistical learning has relevance to the Australian Signals Directorate (ASD), which is the entity charged with provision of Australia's Signals Intelligence (SIGINT) capabilities. ASD analysts must use metadata to mitigate privacy, volume and encryption constraints. To deal with the information loss of metadata abstraction and the unpredictable behaviours of cyber adversaries with ever-evolving tools, techniques and procedures, unsupervised learning like DMM modelling could prove a valuable addition to future toolsets.