Statistical linkage and clustering of criminal events can be used by crime analysts to create of lists of potential suspects for an unsolved crime, identify groups of crimes that may have been committed by the same individuals or group of individuals, for offender profiling, and for predicting future events. Pairwise case linkage attempts to establish if a of crimes share a common offender. In practice, there is more interest in crime series linkage, which attempts to identify the set of crimes committed by a common offender. For example, a crime analyst may need to identify the additional crimes that are part of a crime series (crime series identification) or discover all of the crime series in a criminal incident database (crime series clustering).
Crime series clustering is fundamentally a clustering problem where we seek to group the crimes into clusters that correspond to a serial offender (or group of offenders). Instead of simultaneously clustering all crimes in a criminal database, Crime series identification is focused on identifying the additional unsolved crimes that are part of an existing crime series. Crime series identification can be useful, for example, in interrogations (R. Adderley and Musgrove 2003; Richard Adderley 2004) by providing investigators a list of additional crimes (similar to the known crimes) that a suspect may be responsible for. A crime analyst could also use crime series identification when investigating a crime series or suspected crime series to generate a list of additional crimes to jointly investigate.
The crimelinkage
package provides several tools for crime series identification and clustering based on methods from hierarchical and model-based clustering.
The hierarchical-based approaches to crime series linkage are algorithmic in nature and involve creating measures of similarity between sets of crimes. The pairwise (crime-crime) similarity is a function of the Bayes factors obtained from case linkage. The group (series-series) similarity scores are based on a linkage method from hierarchical cluster analysis (e.g., single, complete, average linkage).
Because the hierarchical-based crime series clustering and crime series identification methods are based on the output from pairwise case linkage, they require having a model for estimating the log Bayes factor for a crime pair. Details for getting a Bayes factor model are given in the vignette Statistical Methods for Crime Series Linkage. The steps are repeated here for convenience. This analysis uses a naive Bayes to estimate the log Bayes factor.
#-- Load the package and get the example crime data
library(crimelinkage)
data(crimes) # some example crime incident data
data(offenders) # some example crime offender data
seriesData = makeSeriesData(crimedata=crimes,offenderTable=offenders)
#-- Make Crime Pairs for Case Linkage
set.seed(1) # set random seed for replication
allPairs = makePairs(seriesData,thres=365,m=40)
#-- Make Evidence Variables for Case Linkage
varlist = list( spatial = c("X", "Y"),
temporal = c("DT.FROM","DT.TO"),
categorical = c("MO1", "MO2", "MO3")) # crime variables list
X = compareCrimes(allPairs,crimedata=crimes,varlist=varlist,binary=TRUE) # Evidence data
Y = ifelse(allPairs$type=='linked',1,0) # Linkage indicator. 1=linkage, 0=unlinked
#-- Get Training Data
set.seed(3) # set random seed for replication
train = sample(c(TRUE,FALSE),nrow(X),replace=TRUE,prob=c(.7,.3)) # assign pairs to training set
test = !train
D.train = data.frame(X[train,],Y=Y[train]) # training data
#-- Fit naive Bayes model and make estimateBF() function
vars = c("spatial","temporal","tod","dow","MO1","MO2","MO3")
fmla.all = as.formula(paste("Y ~ ", paste(vars, collapse= "+")))
NB = naiveBayes(fmla.all,data=D.train,weights=weight,df=10,nbins=15,partition='quantile')
estimateBF <- function(X){ # estimateBF() returns the estimated log Bayes factor
predict(NB,newdata=X)
}
Hierarchical clustering is an algorithmic approach to crime series cluster analysis that sequentially forms a hierarchy of cluster solutions. The agglomerative approach starts with every observation (e.g., crime incident) in its own cluster. Then it sequentially merges the two closest clusters to form a new larger cluster. This process is repeated until all observations are in the same cluster or a stopping criterion is met.
This algorithm requires two similarity measures to be specified: the pairwise similarity between two observations and the similarity between two groups of observations. There are three primary approaches to measuring the similarity between groups of observations. , or nearest neighbor, uses the most similar pair between the two groups as the group similarity measure. In contrast, uses the least similar pair between two groups as the measure of group similarity. uses the average similarity between all pairs in the two groups.
The crimelinkage
package provides the function crimeClust_hier()
for agglomerative hierarchical crime clustering. It uses the log Bayes factor as the pairwise similarity measure. That is, the similarity between crimes \(i\) and \(j\) is \(S(i,j) = \log BF(i,j)\), where \(BF(i,j)\) is the estimated Bayes factor for linkage. Then one of: average, single, or complete linkage is used for the similarity between groups.
In this example, we will cluster all of the unsolved crimes from crimes
data and display dendrogram of results with plot_hcc()
function.
#-- Get unsolved crimes
unsolved = subset(crimes, !crimeID %in% seriesData$crimeID)
#-- Run agglomerative hierarchical crime clustering
tree = crimeClust_hier(unsolved,varlist,estimateBF,linkage='average', binary=TRUE)
#-- Plot results in dendrogram using plot_hcc()
plot_hcc(tree,yticks=seq(-2,6,by=2),type="triangle",hang=.05,main="Average Linkage")
The dendrogram (using average linkage) merges 5 crime pairs with a log Bayes factor of more than 4.
Examining the crimes C:431 and C:460 (which were the most similar), we can see they are close in space, around 2 weeks apart in time, and have matching values for MO1, MO2, and Mo3.
#-- Examine crimes C:431 and C:460
subset(crimes,crimeID %in% c('C:431','C:460'))
#> crimeID X Y MO1 MO2 MO3 DT.FROM
#> 431 C:431 7097.0 -10256.8 26 a D 1993-10-23 22:00:00
#> 460 C:460 7195.3 -10624.8 26 a D 1993-11-07 03:10:00
#> DT.TO
#> 431 1993-10-24 06:00:00
#> 460 1993-11-07 03:10:00
If there is a particular crime of interest, then the function clusterPath()
will find the sequence of merges and scores for the event.
#-- Find path info for crime C:429
cp = clusterPath('C:429',tree)
cp[cp$logBF>0,] # only return path for scores > 0
#> logBF crimes
#> 1 4.8226300 C:469
#> 2 2.1332281 C:413, C:436
#> 3 0.1540873 C:407, C:433
This shows that crime C:429
first get grouped with crime C:469
with a log Bayes factor of 4.82
. Then this pair gets grouped with C:413, C:436
with an average log Bayes factor of 2.13
.
This approach to crime series identification compares an unsolved crime to every crime in criminal incident database and calculates its similarity as the log Bayes factor (according to the model developed for case linkage). Then it aggregates the similarity scores over the crime groups using single, complete, or average linkage. Single linkage uses the largest score (most similar crime) from each group, complete linkage uses the smallest score (least similar crime) from each group, and average linkage uses the average score as the group score.
To give an example, extract the solved and unsolved crimes from the crimes
data.
solved = subset(crimes, crimeID %in% seriesData$crimeID)
unsolved = subset(crimes, !crimeID %in% seriesData$crimeID)
The function seriesID()
can be used to find the most similar crime series to the unsolved crime.
crime = unsolved[2,] # use the 2nd unsolved crime C:392
crime
#> crimeID X Y MO1 MO2 MO3 DT.FROM
#> 392 C:392 12793.2 -3386.5 25 a E 1993-06-19 07:00:00
#> DT.TO
#> 392 1993-06-19 07:00:00
results = seriesID(crime,solved,seriesData,varlist,estimateBF)
head(results$score)
#> group average single complete
#> 1 12 4.059641 4.059641 4.059641
#> 2 160 3.560011 3.560011 3.560011
#> 3 8 3.484413 3.484413 3.484413
#> 4 9 3.193912 3.913425 2.752908
#> 5 10 3.193912 3.913425 2.752908
#> 6 11 3.193912 3.913425 2.752908
This shows that the unsolved crime is most similar to the crime(s) in crime group 12
with an average linkage log Bayes factor of 4.06
. To get the crimes and offenders associated with these groups, just use the subset()
function with the groups
object:
subset(results$groups,group=='12') # most similar crime series
#> crimeID Index CS offenderID TIME group
#> 27 C:148 148 12 O:109 1993-06-25 01:20:00 12
subset(results$groups,group=='154') # 2nd most similar series
#> crimeID Index CS offenderID TIME group
#> 205 C:304 304 154 O:237 1993-10-25 07:00:00 154
subset(results$groups,group=='9') # a series with multiple crimes
#> crimeID Index CS offenderID TIME group
#> 9 C:144 144 9 O:106 1993-06-20 03:27:00 9
#> 10 C:163 163 9 O:106 1993-06-20 13:15:00 9
#> 11 C:145 145 9 O:106 1993-06-20 01:30:00 9
#> 12 C:164 164 9 O:106 1993-06-20 13:15:00 9
#> 13 C:165 165 9 O:106 1993-06-20 12:45:00 9
#> 14 C:166 166 9 O:106 1993-06-20 12:45:00 9
We can do this for another unsolved crime
crime4 = unsolved[4,] # use the 4th unsolved crime
results4 = seriesID(crime4,solved,seriesData,varlist,estimateBF)
head(results4$score)
#> group average single complete
#> 1 136 1.7873870 1.7873870 1.7873870
#> 2 37 0.7299544 0.7299544 0.7299544
#> 3 48 0.7299544 0.7299544 0.7299544
#> 4 316 0.7080559 0.7080559 0.7080559
#> 5 219 0.5838576 0.5838576 0.5838576
#> 6 220 0.5838576 0.5838576 0.5838576
Because the scores are so low (log Bayes factors around 1
), this unsolved crime is not very similar to any other solved crimes in the crime database. Perhaps this is the start of a new crime series?
It is also possible to compare a crime to all unsolved crimes to detect potential unsolved crime series.
#- using crime C:394 (the 4th unsolved crime)
pairs = data.frame(i1=unsolved$crimeID[4],i2=unique(unsolved$crimeID[-4]))
X = compareCrimes(pairs,unsolved,varlist,binary=TRUE) # Evidence data
score = data.frame(pairs,logBF=estimateBF(X))
head(score[order(-score$logBF),])
#> i1 i2 logBF
#> 6 C:394 C:397 0.9909933
#> 79 C:394 C:470 0.8584802
#> 60 C:394 C:451 -0.1306027
#> 5 C:394 C:396 -0.1697339
#> 20 C:394 C:411 -0.1767327
#> 28 C:394 C:419 -0.2319215
There are no unsolved crimes that are very similar to this one - probably not enough evidence to link this crime to any others.
This approach also gives similar results to what was obtained from the hierarchical clustering path approach:
C429 = which(unsolved$crimeID %in% 'C:429') # now use crime C:429
pairs = data.frame(i1=unsolved$crimeID[C429],i2=unique(unsolved$crimeID[-C429]))
X = compareCrimes(pairs,unsolved,varlist,binary=TRUE) # Evidence data
score = data.frame(pairs,logBF=estimateBF(X))
head(score[order(-score$logBF),])
#> i1 i2 logBF
#> 78 C:429 C:469 4.822630
#> 23 C:429 C:413 2.424596
#> 45 C:429 C:436 2.257786
#> 17 C:429 C:407 2.041945
#> 14 C:429 C:404 1.555891
#> 13 C:429 C:403 1.270481
#-- results from hierarchical clustering
cp = clusterPath('C:429',tree)
cp[cp$logBF>0,]
#> logBF crimes
#> 1 4.8226300 C:469
#> 2 2.1332281 C:413, C:436
#> 3 0.1540873 C:407, C:433
This section illustrates the partially-supervised Bayesian model-based clustering approach to crime series linkage of (Reich and Porter 2015). This approach is partially-supervised because the offender is known for a subset of the events, and utilizes spatiotemporal crime locations as well as crime features describing the offender's modus operandi.
The hierarchical model naturally handles complex features often seen in crime data, including missing data, interval censored event times, and a mix of discrete and continuous variables. It can also provide uncertainty assessments for all model parameters, including the relative influence of each feature (space, time, method of entry, etc.) in the model. In addition, the model produces posterior clustering probabilities which allow analysts to act on model output only as warranted.
The function crimeClust_bayes()
is used for the Bayesian model-based clustering approach. Because it uses both solved and unsolved crimes, the labels (crime group) for the solved crimes is also passed into the function. (Note: this function will take 20+ minutes to run.)
#-- Make the crime group labels for each crime (NA for unsolved crimes)
seriesData$CG = makeGroups(seriesData,method=2) # method=2 uses unique co-offenders
group_info = unique(seriesData[,c('crimeID','CG')]) # extract the group info
A = merge(crimes,group_info,by="crimeID",all.x=TRUE) # add group info to crimes
A = A[order(A$CG),] # order by crime group
#-- Run MCMC
fit = crimeClust_bayes(A$CG, spatial=A[,c('X','Y')],t1=A$DT.FROM,t2=A$DT.TO,
Xcat=A[,c("MO1","MO2","MO3")],maxcriminals=1000,
iters=3000,burn=1000,update=100,seed=5)
#-- Extract pairwise probabilities
pp = fit$p.equal # probability that crime i is linked to crime j
diag(pp) = NA
The matrix pp
contains the pairwise estimated probability that two crime are linked (share a common offender). We can use this information for crime series identification.
Using the image.plot()
from the fields
package, we can see how strongly the unsolved crimes are linked to the existing (solved) crime series.
library(fields) # if not installed, type: install.packages("fields")
#-- Get index of unsolved crimes
ind.unsolved = which(is.na(A$CG)) # index of unsolved crimes
n = nrow(A) # number of crimes
#-- Image plot of linkage probabilities
fields::image.plot(1:n,ind.unsolved,pp[1:n,ind.unsolved],
xlab="Crime",ylab="Unsolved Crime",
main="Probability crimes are linked")
We see that some unsolved crimes are linked to solved crimes with a posterior probability above 0.25
. These crimes may be worth further investigations.
Here we plot the maximum posterior probability that an unsolved crime is linked to another crime (solved or unsolved).
#-- Find strongest linkages
unsolved.probs = apply(pp[ind.unsolved,],1,max,na.rm=TRUE) # maximum probability
plot(ind.unsolved,unsolved.probs,xlab="unsolved crime",ylab='maximum probability of linkage')
abline(h=0.25)
ind = ind.unsolved[unsolved.probs > 0.25]
investigate = as.character(A$crimeID[ind]) # crimeIDs for crimes with strongest linkage
investigate
#> [1] "C:408" "C:417" "C:464"
This shows that C:408, C:417, C:464
are the crimes with the strongest linkages (posterior probabilities greater that 0.25). A particular crime can be investigated in more detail with the function bayesProb()
:
bp = bayesProb(pp[A$crimeID %in% "C:417"])
bp$crimeID = A$crimeID[bp$index]
bp$CG = A$CG[bp$index]
head(bp)
#> index prob crimeID CG
#> 1 81 0.34155556 C:15 46
#> 2 197 0.29111111 C:26 146
#> 3 459 0.22222222 C:459 NA
#> 4 446 0.10255556 C:446 NA
#> 5 2 0.04900000 C:10 2
#> 6 482 0.04888889 C:482 NA
For this example, our model provides a list of the most likely crimes associated with the unsolved crime C:417
. The first two crimes (C:15 and C:26) are solved crimes indicating that the offender(s) responsible for these crimes may also be responsible for C:417. The next two crimes, C:459 and C:446 do not have a group ID. This means that they are unsolved crimes. By providing the posterior probabilities, crime analysts may choose to investigate further only if the linkage is strong enough.
Adderley, R., and P. Musgrove. 2003. “Modus Operandi Modelling of Group Offending: A Data-Mining Case Study.” International Journal of Police Science & Management 5 (4). Vathek Publishing: 265–76.
Adderley, Richard. 2004. “The Use of Data Mining Techniques in Operational Crime Fighting.” In Intelligence and Security Informatics, edited by Hsinchun Chen, Reagan Moore, Daniel Zeng, and John Leavitt, 3073:418–25. Lecture Notes in Computer Science. Springer Berlin / Heidelberg.
Reich, Brian J, and Michael D Porter. 2015. “Partially-Supervised Spatiotemporal Clustering for Crime Series Identification.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 178 (2): 465–80. http://onlinelibrary.wiley.com/doi/10.1111/rssa.12076/abstract.