Implements cluster-polarization coefficient for measuring distributional polarization in single or multiple dimensions. Contains support for hierarchical clustering, k-means, partitioning around medoids, density-based spatial clustering with noise, and manually imposed cluster membership. Calculates CPC and adjusted CPC.
There are two ways to easily install this package in R. To install the stable version released to CRAN, install as normal:
install.packages("CPC")
library(CPC)
To install the most recent development version, first ensure you have
the latest version of devtools
installed:
install.packages("devtools")
library(devtools)
Then, run the following code:
devtools::install_github("imehlhaff/CPC")
To cite CPC
in publications and working papers, please
use:
Mehlhaff, Isaac D. “A Group-Based Approach to Measuring Polarization,” working paper (May 2022). https://imehlhaff.net/files/CPC_note.pdf.
For BibTeX users:
@unpublished{Mehlhaff2022,
title = {A {{Group-Based Approach}} to {{Measuring Polarization}}},
author = {Mehlhaff, Isaac D.},
date = {2022-05},
location = {{The University of North Carolina at Chapel Hill}},
url = {https://imehlhaff.net/files/CPC_note.pdf}
}
Most users of this package will be primarily interested in using it to easily calculate the CPC for a given distribution of numeric data. For example, let us take the bivariate, bimodal distribution generated by the following:
data <- matrix(c(rnorm(50, 0, 1), rnorm(50, 5, 1)), ncol = 2, byrow = TRUE)
Given our data generated above, the following call to
CPC()
uses k-means clustering to assign cluster membership
to each observation, calculates the CPC, and returns a numeric vector of
length 1:
CPC(data = data, k = 2, type = "kmeans")
Further arguments to fine-tune the specified clustering function,
such as which particular algorithm to use, can be passed directly to
CPC()
. For example, the following call specifies the
MacQueen algorithm instead of the default Hartigan-Wong algorithm:
CPC(data = data, k = 2, type = "kmeans", algorithm = "MacQueen")
In particular, if type = "kmeans"
, the algorithm is only
guaranteed to converge to local optima, so using a large number of
random starts is recommended. This can be specified with the
nstart
argument to kmeans()
, again passed
directly to CPC()
.
If we wanted to calculate the adjusted CPC instead, we need only add
the optional adjust
argument, which defaults to
FALSE
:
CPC(data = data, k = 2, type = "kmeans", adjust = TRUE)
For more advanced users who wish to evaluate the output of the
specified clustering function, the optional argument model
can be set to TRUE
:
CPC(data = data, k = 2, type = "kmeans", model = TRUE)
In this case, CPC()
returns a list containing the output
of the specified clustering function, all sums of squares calculations,
the CPC and the adjusted CPC. Because both the CPC and adjusted CPC are
automatically calculated and returned as list elements,
adjust
need not be specified when
model = TRUE
.
In the case that pre-specifying cluster membership is theoretically justified or empirically desirable, the CPC can be calculated with user-imposed clusters. Cluster membership should be identified for each observation included in the calculation, and these identifications should be included in their own column in the matrix or data frame. To update our example data from above:
clusters <- matrix(c(rep(1, 25), rep(2, 25)), ncol = 1)
data <- cbind(data, clusters)
CPC()
can now be called, with
type = "manual"
and calls to the optional arguments
cols
and clusters
denoting the columns
containing data and cluster memberships, respectively:
CPC(data = data, k = 2, type = "manual", cols = 1:2, clusters = 3)