crimelinkage
packageThe crimelinkage
package1 is a set of tools to help crime analysts and researchers with tasks related to crime linkage. This package specifically includes methods for criminal case linkage, crime series identification and clustering, and suspect identification. More details about the methodology and its performance on actual crime data can be found in (M. D. Porter and Reich 2014; M. D. Porter 2014; Reich and Porter 2015; Bouhana, Johnson, and Porter 2015)
The object of the crimelinkage
package is to provide a statistical approach to criminal linkage analysis that discovers and groups crime events that share a common offender and prioritizes suspects for further investigation. Bayes factors are used to describe the strength of evidence that two crimes are linked. Using concepts from agglomerative hierarchical clustering, the Bayes factors for crime pairs can be combined to provide similarity measures for comparing two crime series. This facilitates crime series clustering, crime series identification, and suspect prioritization.
Make sure you have installed and loaded the crimelinkage
package.
#- If the crimelinkage package is not installed, type: install.packages{"crimelinkage"}
library(crimelinkage)
Crime linkage will require two primary data sets: one with crime incidents and the other that links offenders with the crimes they have committed. crimelinkage
has some fictitious data (named crimes
and offenders
) that can help illustrate how the various linkage methods work. The crimes
data is a data frame of 490 simulated crime events. You can get the first 6 records by typing:
data(crimes)
head(crimes)
#> crimeID X Y MO1 MO2 MO3 DT.FROM
#> 1 C:1 13661.1 -4659.3 6 a J 1993-01-06 23:55:00
#> 2 C:2 6758.8 -10092.4 25 a E 1993-01-06 00:00:00
#> 3 C:3 10077.6 -8810.9 25 a E 1993-01-08 07:00:00
#> 4 C:4 12080.3 -4011.1 25 a E 1993-01-10 20:27:00
#> 5 C:5 10862.5 -8091.0 19 b C 1993-01-11 07:45:00
#> 6 C:6 14032.9 -1945.4 7 <NA> <NA> 1993-01-04 15:30:00
#> DT.TO
#> 1 1993-01-07 04:39:00
#> 2 1993-01-06 00:00:00
#> 3 1993-01-08 17:30:00
#> 4 1993-01-10 20:27:00
#> 5 1993-01-11 07:45:00
#> 6 1993-01-04 15:30:00
Each crime has an ID number, spatial coordinates, 3 categorical MO features, and a temporal window of when the crime could have occurred. More details about these fields can be found by typing ?crimes
. An NA
indicates the record is missing a value.
The offenders
data links the solved crimes to the offender(s).
data(offenders)
head(offenders)
#> offenderID crimeID
#> 1 O:1 C:6
#> 2 O:2 C:2
#> 3 O:3 C:29
#> 4 O:4 C:19
#> 5 O:5 C:18
#> 6 O:6 C:18
The crimes series from an individual offender can be extracted using getCrimeSeries
. For example, offender O:40 has committed 4 known crimes.
cs = getCrimeSeries("O:40",offenders)
cs
#> $offenderID
#> [1] "O:40"
#>
#> $crimeID
#> [1] "C:26" "C:110" "C:78" "C:85"
If we want to see the actual incident data, use the function getCrimes()
getCrimes("O:40",crimes,offenders)
#> crimeID X Y MO1 MO2 MO3 DT.FROM
#> 26 C:26 9541.4 4124.5 23 h J 1993-01-29 07:30:00
#> 78 C:78 10609.4 5683.4 23 a D 1993-04-23 08:00:00
#> 85 C:85 11731.8 5429.7 23 a J 1993-04-30 12:30:00
#> 110 C:110 11669.4 5353.0 25 f M 1993-05-01 12:36:00
#> DT.TO
#> 26 1993-01-29 14:30:00
#> 78 1993-04-23 16:00:00
#> 85 1993-04-30 14:30:00
#> 110 1993-05-01 12:36:00
Co-offenders can be identified with the getCriminals()
function. For example, offender O:40 has 5 known co-offenders
getCriminals(cs$crimeID,offenders)
#> [1] "O:33" "O:40" "O:72" "O:73" "O:82"
We can also examine all offenders of a particular crime,
getCriminals("C:78",offenders)
#> [1] "O:40" "O:72" "O:73"
Pairwise case linkage is the processes of determining if a given pair of crimes share the same offender. This is a binary classification problem where each crime pair is considered independently. In order to carry out pairwise case linkage, crime pairs must be compared and evidence variables created.
The first step is to combine all the crime and offender data using the makeSeriesData()
function:
seriesData = makeSeriesData(crimedata=crimes,offenderTable=offenders)
Note that the data.frame passed into
crimedata=
must at minimum have columns (exactly) named:crimeID
,DT.FROM
, andDT.TO
(if the times of all crimes are known exactly, thenDT.TO
is not needed). ThecrimeID
column should be a character vector indicating the crime identification number. TheDT.FROM
andDT.TO
columns should be datetime objects of classPOSIXct
. See the functionsstrptime()
,as.POSIXct()
, and the packagelubridate
for details on converting date and time data into the proper format.
From the series data, we can get some useful information about the offenders and crime series. For example,
nCrimes = table(seriesData$offenderID) # length of each crime series
table(nCrimes) # distribution of crime series length
#> nCrimes
#> 1 2 3 4 5 6 7 14
#> 328 39 6 4 2 6 1 1
mean(nCrimes>1) # proportion of offenders with multiple crimes
#> [1] 0.1524548
provides some info on the length of crime series. From this, we see that 328 offenders were charged with only one crime, 39 were charged with two crimes, and the proportion of offenders that had multiple crimes in the data is 0.152.
The rate of co-offending can also be examined with the following commands:
nCO = table(seriesData$crimeID) # number of co-offenders per crime
table(nCO) # distribution of number of co-offenders
#> nCO
#> 1 2 3 4
#> 294 78 15 3
mean(nCO>1) # proportion of crimes with multiple co-offenders
#> [1] 0.2461538
We see that 294 crimes were committed by a single offender and 24.6% of crimes had multiple offenders.
Case linkage compares the variables from two crimes to assess their level of similarity and distinctiveness. To facilitate this comparison, the crime variables for a pair of crimes need to be transformed and converted into evidence variables that are more suitable for case linkage analysis (M. D. Porter 2014).
The first step is to get the crimes pairs we want to compare. The function makePairs()
is used to get the linked and unlinked pairs that will be used to build a case linkage model.
set.seed(1) # set random seed for replication
allPairs = makePairs(seriesData,thres=365,m=40)
The crime pairs are restricted to occur no more that 365 days apart. The threshold thres
is used to minimize the effects of offenders that may change their preferences (e.g., home location) over time. The crime pairs only include solved crimes.
The linked pairs (crimes committed by same offender) have type = 'linked'
. For these, the weight
column corresponds to \(1/N\), where \(N\) is the number of crime pairs in a series. These weights will be used when we construct the case linkage models. This attempts to put all crime series on an equal footing by downweighting the crime pairs from offenders with large crime series.
The unlinked pairs (type = 'unlinked'
) were generated by selecting m = 40
crimes from each crime group and comparing them to crimes in different crime groups. Crime groups are the connected components of the offender graph. This approach helps prevent (unknown) linked crimes from being included in the unlinked data. The weights for unlinked pairs are 1.
This generates 255 linked pairs and 10932 unlinked pairs, with weights.
The next step is to compare all crime pairs using the function compareCrimes()
. Crime \(V_i\) is compared to crime \(V_j\) (rows from crime incident data) and evidence variables are created that measure the similarity or dissimilarity of the two crimes across their attributes. The list varlist
specifies the column names of crimes
that provide the spatial, temporal, and categorical crime variables. More information about how the evidence variables are made can be found in the help for ?compareCrimes
.
varlist = list( spatial = c("X", "Y"),
temporal = c("DT.FROM","DT.TO"),
categorical = c("MO1", "MO2", "MO3")) # crime variables list
X = compareCrimes(allPairs,crimedata=crimes,varlist=varlist,binary=TRUE) # Evidence data
Y = ifelse(allPairs$type=='linked',1,0) # Linkage indicator. 1=linkage, 0=unlinked
This information is now in the correct format for building classification models for case linkage.
head(X)
#> i1 i2 weight type spatial temporal tod dow MO1
#> 1 C:103 C:97 0.1666667 linked 0.09031733 5.684315 6.000000 1.7500000 1
#> 2 C:103 C:98 0.1666667 linked 0.09462653 5.647644 6.000000 1.7500000 1
#> 3 C:103 C:99 0.1666667 linked 0.10100025 5.681197 6.000000 1.7500000 1
#> 4 C:106 C:123 0.3333333 linked 2.64874725 15.068056 1.717840 1.0680556 1
#> 5 C:106 C:124 0.3333333 linked 1.74753462 14.953472 4.077934 0.9534722 1
#> 6 C:107 C:111 0.1000000 linked 0.00000000 3.364583 7.362903 3.3065667 1
#> MO2 MO3
#> 1 1 1
#> 2 1 0
#> 3 1 0
#> 4 1 1
#> 5 1 0
#> 6 1 0
table(Y)
#> Y
#> 0 1
#> 10932 255
Now that we have the linkage data in a suitable format, we can build case linkage (i.e., classification) models and compare their performance. First, partition the linkage data into training (70%) and testing (30%) sets and create a data.frame D.train
containing the training data:
set.seed(3) # set random seed for replication
train = sample(c(TRUE,FALSE),nrow(X),replace=TRUE,prob=c(.7,.3)) # assign pairs to training set
test = !train
D.train = data.frame(X[train,],Y=Y[train]) # training data
Technically, any binary classification model can be used for case linkage (as long as it accepts weighted observations2). We will illustrate with three models: logistic regression (with stepwise selection), naive Bayes, and boosted trees.
Make the formula for the models
vars = c("spatial","temporal","tod","dow","MO1","MO2","MO3")
fmla.all = as.formula(paste("Y ~ ", paste(vars, collapse= "+")))
fmla.all
#> Y ~ spatial + temporal + tod + dow + MO1 + MO2 + MO3
Fit stepwise logistic regression using the glm()
and step()
functions:
fit.logistic = glm(fmla.all,data=D.train,family=binomial,weights=weight)
fit.logisticstep = step(fit.logistic,trace=0)
summary(fit.logisticstep)
#>
#> Call:
#> glm(formula = Y ~ spatial + temporal + tod + MO1, family = binomial,
#> data = D.train, weights = weight)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.7669 -0.0656 -0.0272 -0.0098 5.1818
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.874088 0.495542 -3.782 0.000156 ***
#> spatial -0.479699 0.088712 -5.407 6.40e-08 ***
#> temporal -0.018712 0.004199 -4.456 8.34e-06 ***
#> tod -0.125321 0.063840 -1.963 0.049641 *
#> MO11 1.426244 0.359841 3.964 7.38e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 437.39 on 7900 degrees of freedom
#> Residual deviance: 312.49 on 7896 degrees of freedom
#> AIC: 237.54
#>
#> Number of Fisher Scoring iterations: 10
The stepwise procedure chooses the spatial, temporal, and time-of-day distance and MO1 as the relevant predictor variables.
A naive Bayes model can be constructed with the naiveBays()
function. This function fits a non-parametric naive Bayes model which bins the continuous predictors and applies a shrinkage to control the complexity. The df=
argument is the degrees of freedom for each predictor, nbins=
is the number of bins to use for continuous predictors, and partition=
controls how the bins are constructed. The resulting component plots can be made with the plotBF()
function.
Here we fit a naive Bayes model with up to 10 degrees of freedom for each component using 15 bins which are designed to have approximately equal number of training points in each bin (quantile binning).
NB = naiveBayes(fmla.all,data=D.train,weights=weight,df=10,nbins=15,partition='quantile')
#- Component Plots
plot(NB,ylim=c(-2,2)) # ylim sets the limits of the y-axis
The naive Bayes model also gives the strongest influence to the spatial and temporal distances and MO1.
Fit boosted trees with the gbm()
function in the gbm
package. This package needs to be installed if not already done so.
library(gbm) # if not installed type: install.packages("gbm")
set.seed(4)
fit.gbm = gbm(fmla.all,distribution="bernoulli",data=D.train,weights=weight,
shrinkage=.01,n.trees=500,interaction.depth=3)
nt.opt = fit.gbm$n.trees # see gbm.perf() for better options
#- Relative influence and plot
print(relative.influence(fit.gbm,n.trees=nt.opt))
par(mfrow=c(2,4))
for(i in 1:7) plot(fit.gbm,i)
The boosted tree model uses an interaction depth of 3 (allowing up to 3-way interactions). It finds that the spatial and temporal distance variables have the strongest influence.
The models can be evaluated on the test data.
D.test = data.frame(Y=Y[test],X[test,])
X.test = D.test[,vars]
Y.test = D.test[,"Y"]
Now, we will predict the values of the Bayes factor (up to constant) from each method and store in an object called BF
. The predictions are evaluated with getROC()
:
#- Predict the Bayes factor for each model
BF = data.frame(
logisticstep = predict(fit.logisticstep,newdata=X.test,type='link'),
nb = predict(NB,newdata=X.test),
gbm = predict(fit.gbm,newdata=X.test,n.trees=nt.opt)
)
#- Evaluation via ROC like metrics
roc = apply(BF,2,function(x) getROC(x,Y.test))
For each model, the getROC()
function orders the crime pairs in the testing set according to their estimated Bayes Factor and the predictive performance is determined from the number of actual linkages found in the ordered list. The first plot shows the proportion of all linked crimes that are correctly identified (True Positive Rate) if a certain number of crime pairs are investigated.
with(roc[[1]], plot(Total,TPR,typ='l',xlim=c(0,1000),ylim=c(0,1),las=1,
xlab='Total Cases Examined',ylab='True Positive Rate'))
grid()
for(i in 2:length(roc)){
with(roc[[i]], lines(Total,TPR,col=i))
}
legend('topleft',names(roc),col=1:length(roc),lty=1, cex=.75,bty="n")
title("Proportion of overall linked crimes captured")
The next plot shows the proportion of the ordered crime pairs that are actual linkages (Precision) for a given number of cases examined.
with(roc[[1]], plot(Total,PPV,typ='l',xlim=c(0,1000),ylim=c(0,1),las=1,
xlab='Total Cases Examined',ylab='Precision'))
grid()
for(i in 2:length(roc)){
with(roc[[i]], lines(Total,PPV,col=i))
}
legend('topright',names(roc),col=1:length(roc),lty=1, cex=.75,bty="n")
title("Proportion of identified crimes that are actually linked")
The purpose of this analysis is to illustrate some capabilities of the crimelinkage
package for pairwise case linkage. More functionality will be shown in other vignettes.
Bouhana, Noemie, Shane Johnson, and Michael D. Porter. 2015. “Consistency and Specificity in Burglars Who Commit Prolific Residential Burglary: Testing the Core Assumptions Underpinning Behavioural Crime Linkage.” Legal and Criminological Psychology (Early View).
Porter, Michael D. 2014. “A Statistical Approach to Crime Linkage.” http://arxiv.org/abs/1410.2285.
Porter, Michael D, and Brian J Reich. 2014. Statistical Methods for Crime Series Linkage. National Institute of Justice, 810 Seventh Street N.W.,Washington, DC 20531: U.S. Department of Justice, Office of Justice Programs.
Reich, Brian J, and Michael D Porter. 2015. “Partially-Supervised Spatiotemporal Clustering for Crime Series Identification.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 178 (2): 465–80. http://onlinelibrary.wiley.com/doi/10.1111/rssa.12076/abstract.