Onkur Sen: Using Machine Learning to Predict Terrorist Attacks

## Using Machine Learning to Predict Terrorist Attacks

[Onkur Sen](http://onkursen.com/)

Rice University

[github.com/onkursen/ct](https://github.com/onkursen/ct)

## Background

* [Global Terrorism Database](http://www.start.umd.edu/gtd/) from Univ. of Maryland
* Info on 104k terrorist attacks (1970-2011)
* Treat fields as **features**
* **Goal**: use [machine learning](http://scikit-learn.org) to predict the country in which terrorist attacks will occur given various incomplete sets of features

## Applications and Context

* Counter-terrorism: real-world context for machine learning
* Classification: general problem applicable in all fields
* **Where's the physics?**
    * Distinguishing classes of events (signal vs. background)
    * Simulating data

## Machine Learning Approach

### Features

* **Input**
    * Year
    * Month
    * Day
    * Attack Type
    * Target Type
* **Output**
    * Country
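As a minimal sketch of how the input and output fields could be turned into feature vectors and labels: the key names below (`iyear`, `imonth`, `iday`, `attacktype1`, `targtype1`, `country`) are assumed to match the database's column names, the record values are made up, and `to_features` and `to_dataset` are hypothetical helpers, not functions from the linked repository.

```python
# Hypothetical raw records; key names are assumed column names, values are made up.
records = [
    {"iyear": 1975, "imonth": 3, "iday": 14, "attacktype1": 3, "targtype1": 2, "country": 217},
    {"iyear": 1982, "imonth": 11, "iday": 5, "attacktype1": 2, "targtype1": 14, "country": 45},
]

INPUT_FIELDS = ["iyear", "imonth", "iday", "attacktype1", "targtype1"]
OUTPUT_FIELD = "country"


def to_features(record, fields=INPUT_FIELDS):
    """Return the input feature vector for one incident as a list of ints."""
    return [int(record[name]) for name in fields]


def to_dataset(records, fields=INPUT_FIELDS):
    """Split records into a feature matrix X and a label list y (country codes)."""
    X = [to_features(record, fields) for record in records]
    y = [int(record[OUTPUT_FIELD]) for record in records]
    return X, y


# Full feature set: date, attack type, and target type.
X, y = to_dataset(records)

# Incomplete feature set: date only (year, month, day).
X_date_only, _ = to_dataset(records, fields=INPUT_FIELDS[:3])
```

Passing a shorter `fields` list is one simple way to model the "incomplete sets of features" named in the goal.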

## Techniques Used

* Support Vector Machines (SVM)
    * Good in high-dimensional spaces (can scale easily)
* Gaussian Naive Bayes (GNB)
    * Assumes feature independence and a Gaussian distribution
* Multinomial Naive Bayes (MNB)
    * Same as GNB, except it assumes a multinomial distribution
* Stochastic Gradient Descent (SGD)
    * Very efficient and easy to tune

All are **supervised learning** approaches: train on data for which the result is known, then apply to new data.
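The four techniques can be compared side by side in scikit-learn. The sketch below is illustrative only: the data is random synthetic stand-in data rather than the real incidents, and the class choices (`SVC`, `GaussianNB`, `MultinomialNB`, `SGDClassifier`) with default parameters are assumptions, not necessarily what the linked `run.py` uses.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in columns: year, month, day, attack type code, target type code.
X = np.column_stack([
    rng.integers(1970, 1991, n),
    rng.integers(1, 13, n),
    rng.integers(1, 29, n),
    rng.integers(1, 10, n),
    rng.integers(1, 23, n),
])
y = rng.integers(0, 5, n)  # hypothetical country codes

# Supervised learning: fit on rows with known labels, then score on unseen rows.
X_train, y_train = X[0::2], y[0::2]
X_test, y_test = X[1::2], y[1::2]

classifiers = {
    "SVM": SVC(),
    "GNB": GaussianNB(),
    "MNB": MultinomialNB(),  # requires non-negative features, which holds here
    "SGD": SGDClassifier(random_state=0),
}

rates = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    rates[name] = clf.score(X_test, y_test)  # fraction of correct predictions
```

Because the labels here are random, the resulting rates carry no meaning; the point is only the shared `fit` / `score` interface across the four estimators.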

## Datasets

### Incidents between 1970 and 1990

* Train on first half, test on second half (date only)
* Alternate train and test: date only
* Alternate train and test: date and attack type
* Alternate train and test: date, attack type, and target type
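The two splitting schemes can be sketched as follows. This assumes the incidents are already sorted by date and that "alternate" means even-indexed rows go to training and odd-indexed rows to testing; `two_halves` and `alternating` are hypothetical helper names.

```python
def two_halves(rows):
    """Train on the first half of the rows, test on the second half."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def alternating(rows):
    """Send even-indexed rows to training and odd-indexed rows to testing."""
    return rows[0::2], rows[1::2]


rows = list(range(6))  # stand-in for incidents sorted by date

train_a, test_a = two_halves(rows)   # ([0, 1, 2], [3, 4, 5])
train_b, test_b = alternating(rows)  # ([0, 2, 4], [1, 3, 5])
```

With the two-halves split, every test incident is later than every training incident; with the alternating split, training and test incidents are interleaved across the whole period.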

## First step.

### [Make a map.](http://onkursen.github.io/ct/mapbox.html)

## Let's look at some code.

### [Pulling and preparing the data](https://github.com/onkursen/ct/blob/master/prepare.py)

### [Classifying and predicting](https://github.com/onkursen/ct/blob/master/run.py)

**Caution**: training and testing datasets of size ~20k each. Classification step takes some time (3-5 minutes)!

## Results: Correct Prediction Rates

| Dataset | SVM | GNB | MNB | SGD |
|---|---|---|---|---|
| Two halves | 4.86% | 6.20% | 8.37% | 5.27% |
| Alternating: date | 30.31% | 14.70% | 10.91% | 10.26% |
| Alternating: date, attack type | 32.47% | 17.20% | 10.73% | 10.32% |
| Alternating: date, attack type, target type | 35.75% | 17.91% | 11.01% | 10.28% |
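For reference, a correct prediction rate can be computed as the percentage of test incidents whose predicted country matches the actual one. That definition is an assumption about how the figures above were obtained; `correct_prediction_rate` is a hypothetical helper and the country lists are made up.

```python
def correct_prediction_rate(predicted, actual):
    """Return the percentage of positions where predicted equals actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one prediction")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)


predicted = ["Peru", "Spain", "Peru", "Italy"]
actual = ["Peru", "Spain", "Chile", "Spain"]

rate = correct_prediction_rate(predicted, actual)  # 2 of 4 correct -> 50.0
```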

## Future Work

* Adjust model parameters
    * Can prediction be made better by fine-tuning?
    * Are there features of the data that are not exploited?
* Examine prediction correctness over time
    * Is the algorithm better at certain times than others?
* Switch inputs/outputs
    * Given date and country, predict attack type?
* Clustering
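One possible way to examine prediction correctness over time is to group test predictions by year and compute a rate per group. This is only a sketch of the idea: `rate_by_year` is a hypothetical helper, and the years and country codes are made up.

```python
from collections import defaultdict


def rate_by_year(years, predicted, actual):
    """Return {year: percentage of correct predictions in that year}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, p, a in zip(years, predicted, actual):
        totals[year] += 1
        hits[year] += int(p == a)
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


years = [1980, 1980, 1981, 1981]
predicted = [1, 2, 3, 4]  # made-up predicted country codes
actual = [1, 5, 3, 4]     # made-up actual country codes

by_year = rate_by_year(years, predicted, actual)  # {1980: 50.0, 1981: 100.0}
```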