
Technical report | Numerical Algorithms for the Analysis of Expert Opinions Elicited in Text Format

Abstract

Latent Dirichlet Allocation (LDA) is a scheme which may be used to estimate topics and their probabilities within a corpus of text data. The fundamental assumptions in this scheme are that text is a realisation of a stochastic generative model and that this model is well described by a combination of multinomial and Dirichlet probability distributions. Various means can be used to solve the Bayesian estimation task arising in LDA. Our formulations of LDA are applied to subject matter expert (SME) text data elicited through carefully constructed decision support workshops. In the main, these workshops address substantial problems in Australian Defence Capability. The application of LDA here is motivated by a need to provide insights into the collected text, which is often voluminous and complex in form. Additional investigations described in this report concern identifying and quantifying differences between the texts that different stakeholder groups write on a common subject. Sentiment scores and key-phrase estimators are used to indicate stakeholder differences. Some examples are provided using unclassified data.
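The generative assumption stated above can be illustrated with a minimal sketch. It is not the report's formulation or its estimation algorithm; it only shows how text could be produced under a multinomial and Dirichlet model. The symbols alpha, beta, theta and phi follow common LDA notation and are assumptions here, as are the function names and the toy vocabulary.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha, beta, seed=0):
    """Generate documents from an LDA-style model (illustrative only)."""
    rng = random.Random(seed)
    # phi_k ~ Dirichlet(beta): one word distribution per topic.
    topics = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # theta_d ~ Dirichlet(alpha): topic proportions for this document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            # Choose a topic for this word position, then a word from that topic.
            z = rng.choices(range(n_topics), weights=theta)[0]
            doc.append(rng.choices(vocab, weights=topics[z])[0])
        corpus.append(doc)
    return topics, corpus

# Hypothetical vocabulary, chosen only for illustration.
vocab = ["radar", "vehicle", "logistics", "network"]
topics, corpus = generate_corpus(n_docs=3, doc_len=10, vocab=vocab,
                                 n_topics=2, alpha=0.5, beta=0.5)
for doc in corpus:
    print(" ".join(doc))
```

Estimation in LDA runs in the opposite direction: given only the documents, it infers plausible values for the topic word distributions and the per-document topic proportions.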

Executive Summary

This report describes the motivation, scope and outcomes of a recent Defence Science and Technology Organisation (DSTO) research collaboration with industry, intended to develop a specialised computer-based text analysis capability. In March 2011, a formal research agreement was struck between the Joint Operations Division (JOD) and National ICT Australia (NICTA). Fundamentally, this agreement was aimed at developing a specific text analysis capability, with particular emphasis placed upon examining collections of text-format expert opinions, each of which concerned a given defence capability issue. Here the term text analysis might include identifying a finite number of key topics in a text corpus and their relative weightings, or computing some quantitative measure of difference between stakeholder group opinions on a specific common issue.

The primary motivation for this work is derived from text-data volume and processing issues arising in the Joint Decision Support Centre (JDSC). The JDSC was established in March 2006 and is one component of a unique collaboration between DSTO and the Capability Development Group (CDG). Part of the JDSC's core program of work concerns providing decision support to current projects listed in the Defence Capability Plan (DCP). The common vehicle for this support is a facilitated defence capability workshop. Such workshops typically run for 2-4 days and may include up to 40 attendees, consisting of technical SMEs, Australian Defence Force (ADF) staff and representatives from various stakeholder groups. These workshops are carefully designed to address specific defence questions and to elicit, record and analyse expert opinions. It is important to understand that the JDSC's scope here best approximates what is known as the Expert Problem, as described in the taxonomy due to French [Fre85]. Briefly, the Expert Problem is defined as follows:

Definition 0.1 (French, 1985). A group of experts are asked for advice by a Decision Maker (DM) who faces a specific real decision problem. The DM is, or can be taken to be, outside the group. The DM takes responsibility and accountability for the consequences of the decision. The experts are free from such responsibility and accountability. In this context the emphasis is on the DM learning from the experts.

In our context, the relevance of French's definition lies primarily in its last sentence, which emphasises the DM learning from the experts.

Consequently, JDSC decision support workshops are orientated towards informing Defence Decision Makers through workshops and their outcomes. JDSC workshop data are generally of two classes: 1) numeric data, such as voting scores or quantitative preference rankings, and 2) text data collected through network-based text collection software. The text data collected at JDSC workshops is usually rich in content but large in volume. Ideally, this data should be analysed both in situ, that is, during a given workshop, and off-line after the workshop. The main tasks here are data reduction and visualisation, that is, to compute an accessible summary visualisation of the valuable information inherent in a corpus of text that is likely to inform Defence Decision Makers. It should also be noted that while the motivation for this project originated from the needs of JDSC workshops, the outcomes of this project are not limited to JDSC-related activities.

The main outcomes detailed in this report concern the development and capabilities of a set of text analysis algorithms intended to support and enhance the various tasks described above. Specific capabilities detailed here are:

  • Probabilistic Topic Analysis: Topic analysis concerns identifying a finite number of topics within a corpus of text and subsequently estimating a level of association of document elements (such as words or phrases) to each of these topics.
  • Differential Analysis: Differential analysis concerns identifying and quantifying the differences between subsets of text, where set membership is by affiliation to a specific stakeholder group. For example, what might be the differences between text data generated by Army SMEs and Air Force SMEs on a common defence capability issue? Further, how might such differences be computed and analysed?
  • Key-Phrase Analysis: Key-phrase analysis concerns identifying and ranking the top N phrases in a document, either for the complete document or for subsets of text attributed to various stakeholders.
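As a rough illustration of the differential and key-phrase tasks listed above, the following sketch compares two stakeholder texts and ranks their most frequent phrases. It does not reproduce the report's estimators: Jensen-Shannon divergence over smoothed word frequencies and raw bigram counts are simple stand-ins chosen here, and the function names and sample sentences are hypothetical.

```python
import math
from collections import Counter

def word_distribution(text, vocab):
    """Word frequencies over a shared vocabulary, with add-one smoothing."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] + 1 for w in vocab)
    return [(counts[w] + 1) / total for w in vocab]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, at most 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def top_phrases(text, n, size=2):
    """Return the n most frequent word sequences of length `size`."""
    tokens = text.lower().split()
    grams = zip(*(tokens[i:] for i in range(size)))
    counts = Counter(" ".join(g) for g in grams)
    return [phrase for phrase, _ in counts.most_common(n)]

# Hypothetical stakeholder texts, invented for illustration only.
army_text = "land vehicle mobility and land vehicle protection"
air_force_text = "air defence radar coverage and air defence posture"

vocab = sorted(set(army_text.split()) | set(air_force_text.split()))
p = word_distribution(army_text, vocab)
q = word_distribution(air_force_text, vocab)

print("Divergence between groups:", round(js_divergence(p, q), 3))
print("Army key phrases:", top_phrases(army_text, 2))
print("Air Force key phrases:", top_phrases(air_force_text, 2))
```

A single divergence score only indicates how far apart two groups are. The per-group phrase rankings help show where the difference lies.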

This report also contains technical detail on the mathematical foundations of the work and specific details on some algorithmic issues inherent in its complex estimation tasks. Finally, an example of the algorithms at work on an unclassified text data set is provided. This text data was collected at a special JDSC workshop that included only two groups, DSTO staff and NICTA staff. This unclassified data is included primarily to demonstrate graphic visualisations of the three aforementioned core tasks.
 

Key information

Author

W. P. Malcolm and Wray Buntine

Publication number

DSTO-TR-2797

Publication type

Technical report

Publish Date

April 2013

Classification

Unclassified - public release

Keywords

Natural Language Processing, Text Analysis, Sentiment Analysis, Key Phrase Estimation, Probabilistic Topic Estimation, Latent Dirichlet Allocation, Differential Analysis of Stake-Holder Text, Bayesian Estimation, Monte Carlo Methods