Pre-Processing the Varieties of Democracy Dataset

library(knitr)
library(kableExtra)

Introduction

This guide is an entry in a series of proposed vignettes in which we walk through a deep cleaning or exploratory data analysis (EDA) of a widely employed environment-security dataset. For this entry, we will explore the Varieties of Democracy dataset (V-Dem; (Coppedge2020?)). V-Dem is a massive dataset that aims to provide quantitative assessments of nation-state democracy throughout history. V-Dem provides both multidimensional and disaggregated measures of democracy across five primary principles: electoral, liberal, participatory, deliberative, and egalitarian (Pemstein2018?). The V-Dem team is composed of dozens of scientists spread across the globe working with thousands of local experts to quantify local and regional aspects of democracy.

V-Dem is not alone in its efforts to quantify qualitative aspects of nation-state democracy, civil liberties, and elections. Similar datasets include Polity5 (Marshall2002?), Freedom House’s Freedom in the World, Countries at the Crossroads, and Freedom of the Press (FreedomHouse2014?), and the Institutions and Elections Project (Wig2015?). Although these datasets are similar in many ways, V-Dem stands out with the sheer number of metrics included. V-Dem features over 470 indicators, 82 indices, and 5 high-level metrics. That is an overwhelming amount of data on par with the World Development Indicators (TheWorldBank2017?). Let’s get started.

Acquiring the Data

The most recent V-Dem dataset is available from the V-Dem data homepage in preconfigured CSV, SPSS, and Stata formats. There is also a recommended R package, vdemdata, available on GitHub. Installing it from GitHub requires devtools. As a non-standard package (not on CRAN or Bioconductor), vdemdata can cause issues for certain workflows, but you can use the demcon::get_vdem() function to download the latest dataset directly from vdemdata’s GitHub repo without installing the non-standard package.

For this guide we’ll be using data.table, but all of these steps could be performed with dplyr and the greater tidyverse, or even base R if you’re a sadist. Lastly, to assist with country coding, we’ll be using the states package.

# If you would like to install the package from GitHub:
# devtools::install_github("vdeminstitute/vdemdata")

After the packages are installed, load data.table and states (and vdemdata, if you installed it).

# library(vdemdata)
library(data.table)
library(states)

We’ll get the dataset with demcon::get_vdem().

vdem <- demcon::get_vdem(write_out = FALSE)
data.table::setDT(vdem)

Determining Variables of Interest

For the purposes of this guide we’ll focus on 2 widely used high-level metrics from vdem: v2x_libdem and v2x_polyarchy. The codebook can be filtered to provide greater context.

metrics<-c('v2x_libdem','v2x_polyarchy')

The V-Dem codebook reveals that these are 2 high-level (vartype==D) democracy indices quantifying the extent of electoral (v2x_polyarchy) and liberal (v2x_libdem) democracy. Both metrics are continuous variables bounded between 0 and 1. In addition to our desired indices, we should also subset the raw data for identification variables such as country names, observation year, coding schemes that assist with harmonizing V-Dem data with other datasets, and indicators for country start and stop dates to manage secessions, civil wars, etc.

id.vars<-c('country_name', 'COWcode','histname' ,'codingstart_contemp', 'codingend_contemp','year')
vars<-c(id.vars, metrics)

Now we can subset the raw data and toss what we don’t need.

vdem<-vdem[, ..vars]
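
The ..vars subset stops with an error if any requested column is absent, which may happen if a variable is renamed or dropped in a later release. A minimal sketch of a check to run first, using a toy table; `toy`, `wanted`, `absent`, and `keep` are illustrative names, not part of V-Dem, and the values are made up.

```r
library(data.table)

# Toy stand-in for the raw V-Dem table (values are made up).
toy <- data.table(country_name = c("A", "B"),
                  COWcode      = c(1L, 2L),
                  year         = c(2000L, 2000L),
                  v2x_libdem   = c(0.4, 0.6))

wanted <- c("country_name", "COWcode", "year", "v2x_libdem", "v2x_polyarchy")

# Columns that were requested but are absent from the table.
absent <- setdiff(wanted, names(toy))
if (length(absent) > 0) {
  message("Not found, skipping: ", paste(absent, collapse = ", "))
}

# Keep only the requested columns that exist.
keep <- intersect(wanted, names(toy))
toy  <- toy[, ..keep]
```

Whether to skip an absent column or stop outright depends on how essential it is to the analysis; for the two indices used in this guide, stopping is probably the safer choice.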

Determining Years of Interest

We’ll perform a last bit of pruning for temporal considerations. V-Dem has a large historical record dating back to 1789. This is valuable data, but far more than many practitioners or analysts require. More commonly, analyses will start just before or after key events, e.g. WWII, the Cold War, and the War on Terror. Practically speaking, when preparing historical country-year data, we are most concerned with the headaches brought on by coding nation-state secessions, independence, unifications, etc.

With this in mind, important periods to consider/avoid are: Sudan 2011, Yugoslavia/Kosovo/Serbia/Montenegro 2003-2008, Eritrea 1993, Czech/Slovakia 1993, the complicated Yugoslavian dissolution, and Cold War fallout 1989-1991. Sudan is usually an easy check, but Yugoslavia/Kosovo/Serbia/Montenegro are almost always a real pain to manage across multiple datasets and they usually must be included in the analysis. For the purpose of this guide we will keep observations after 1950 and investigate any issues associated with Yugoslavia/Kosovo/Serbia/Montenegro.

vdem <- vdem[year>1950]
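
After pruning, a couple of cheap assertions can confirm that the year filter did what was intended and that both indices stay within the 0-1 bounds. This is a sketch on a toy table with made-up numbers; `toy`, `index_cols`, and `in_bounds` are illustrative names.

```r
library(data.table)

# Toy stand-in with one missing value (numbers are made up).
toy <- data.table(year          = c(1951L, 1980L, 2021L),
                  v2x_libdem    = c(0.10, NA, 0.80),
                  v2x_polyarchy = c(0.20, 0.50, 0.90))
index_cols <- c("v2x_libdem", "v2x_polyarchy")

# Every remaining year should fall after the cut-off.
stopifnot(toy[, min(year)] > 1950)

# Each index should lie in [0, 1] once missing values are set aside.
in_bounds <- toy[, sapply(.SD, function(x) all(x >= 0 & x <= 1, na.rm = TRUE)),
                 .SDcols = index_cols]
stopifnot(all(in_bounds))
```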

Country Code Checks

The most important issue to address with country-year datasets is accurate annual country codes. This includes nation-state secessions and independence (Sudan, Yugoslavia), independently listed territories (Hong Kong, Puerto Rico, Guam, French Guiana), and states with limited international recognition (Kosovo, West Bank/Palestine, Taiwan). These issues afflict international datasets in a wide variety of ways. Before you attempt to “fix” these issues, it’s important to consider how they will be addressed in all the datasets required for your analysis. Do not spend copious amounts of time coding changes to Kosovo and the West Bank if they’re completely ignored in your other datasets of concern.

V-Dem contains Correlates of War (CoW; COWcode) country codes. This is a popular coding scheme that makes country-coding an easier task. We’ll start by renaming the variable, because we will have to manipulate it a lot.

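# Rename by name rather than by position so the step survives column reordering.
data.table::setnames(vdem, "COWcode", "cow")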

The states package can serve as a reference to check Correlates of War and Gleditsch and Ward country codes. Both are embedded in the package and available with calls to states::cowstates or states::gwstates. Let’s start by checking if any CoW codes are missing.

unique(vdem[is.na(cow),country_name])
#> [1] "Palestine/West Bank" "Palestine/Gaza"      "Somaliland"         
#> [4] "Hong Kong"

It may seem like the easy way out, but these states are commonly ignored in popular environment-security datasets, and can usually be dropped from analysis. One dataset where they would be included is United Nations refugee and asylum seeker data, in which case, you would have to introduce ISO codes to harmonize them with other United Nations data. This could be done with minimal trouble using the countrycode package, but will likely lead to other issues.


library(countrycode)

vdem[, iso3:=countrycode::countrycode(cow, 
                                      origin = "cown",
                                      destination = "iso3c")]
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 260, 265, 315, 345, 347, 511, 678, 680, 817

Now we go down the rabbit hole; which codes were not matched unambiguously?

vdem[cow %in% c(260, 265, 315, 345, 347, 511, 678, 680, 817), unique(country_name)]
#> [1] "Yemen"                      "South Yemen"               
#> [3] "Republic of Vietnam"        "Kosovo"                    
#> [5] "Germany"                    "German Democratic Republic"
#> [7] "Czech Republic"             "Serbia"                    
#> [9] "Zanzibar"

These require hard-coded fixes to their ISO3 values. This is beyond the scope of this vignette, so we will drop the missing cow observations in V-Dem and move on, but I wanted to illustrate the beginning of a country code black hole.

vdem <- vdem[!is.na(cow)][, iso3:=NULL]
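
For readers who do need ISO3 codes, the hard-coded fixes could be kept in a small override table and applied with an update join. The sketch below uses a toy table, and the mappings in `overrides` are analyst choices shown for illustration only, not an official crosswalk; "XKX" is a user-assigned code that several organizations use for Kosovo.

```r
library(data.table)

# Toy rows for three of the codes flagged in the warning, plus one matched row.
toy <- data.table(cow  = c(260, 345, 347, 2),
                  iso3 = c(NA, NA, NA, "USA"))

# Analyst-chosen overrides (assumptions, not an official crosswalk).
overrides <- data.table(cow  = c(260, 345, 347),
                        iso3 = c("DEU", "SRB", "XKX"))

# Update join: overwrite iso3 only for rows whose code appears in `overrides`.
toy[overrides, on = "cow", iso3 := i.iso3]
```

Keeping the overrides in their own table makes the decisions easy to review and to reuse across datasets.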

Yugoslavia, Serbia, Montenegro, and Kosovo

Official CoW codes for Yugoslavia, Serbia, Montenegro, and Kosovo are 345, 345, 341, and 347, respectively. CoW retains the 345 numeric and YUG character designations for Serbia after the breakup of Yugoslavia. CoW assigns Montenegro 341 starting in 2006 and Kosovo 347 starting in 2008 (review these changes in states::cowstates).

Check how V-Dem assigns these changes.

dcast(vdem[cow %in% c(345, 341, 347), .(country_name, cow, year)],
      year~cow, value.var = "country_name")
#> Key: <year>
#>      year        341    345    347
#>     <num>     <char> <char> <char>
#>  1:  1951       <NA> Serbia   <NA>
#>  2:  1952       <NA> Serbia   <NA>
#>  3:  1953       <NA> Serbia   <NA>
#>  4:  1954       <NA> Serbia   <NA>
#>  5:  1955       <NA> Serbia   <NA>
#>  6:  1956       <NA> Serbia   <NA>
#>  7:  1957       <NA> Serbia   <NA>
#>  8:  1958       <NA> Serbia   <NA>
#>  9:  1959       <NA> Serbia   <NA>
#> 10:  1960       <NA> Serbia   <NA>
#> 11:  1961       <NA> Serbia   <NA>
#> 12:  1962       <NA> Serbia   <NA>
#> 13:  1963       <NA> Serbia   <NA>
#> 14:  1964       <NA> Serbia   <NA>
#> 15:  1965       <NA> Serbia   <NA>
#> 16:  1966       <NA> Serbia   <NA>
#> 17:  1967       <NA> Serbia   <NA>
#> 18:  1968       <NA> Serbia   <NA>
#> 19:  1969       <NA> Serbia   <NA>
#> 20:  1970       <NA> Serbia   <NA>
#> 21:  1971       <NA> Serbia   <NA>
#> 22:  1972       <NA> Serbia   <NA>
#> 23:  1973       <NA> Serbia   <NA>
#> 24:  1974       <NA> Serbia   <NA>
#> 25:  1975       <NA> Serbia   <NA>
#> 26:  1976       <NA> Serbia   <NA>
#> 27:  1977       <NA> Serbia   <NA>
#> 28:  1978       <NA> Serbia   <NA>
#> 29:  1979       <NA> Serbia   <NA>
#> 30:  1980       <NA> Serbia   <NA>
#> 31:  1981       <NA> Serbia   <NA>
#> 32:  1982       <NA> Serbia   <NA>
#> 33:  1983       <NA> Serbia   <NA>
#> 34:  1984       <NA> Serbia   <NA>
#> 35:  1985       <NA> Serbia   <NA>
#> 36:  1986       <NA> Serbia   <NA>
#> 37:  1987       <NA> Serbia   <NA>
#> 38:  1988       <NA> Serbia   <NA>
#> 39:  1989       <NA> Serbia   <NA>
#> 40:  1990       <NA> Serbia   <NA>
#> 41:  1991       <NA> Serbia   <NA>
#> 42:  1992       <NA> Serbia   <NA>
#> 43:  1993       <NA> Serbia   <NA>
#> 44:  1994       <NA> Serbia   <NA>
#> 45:  1995       <NA> Serbia   <NA>
#> 46:  1996       <NA> Serbia   <NA>
#> 47:  1997       <NA> Serbia   <NA>
#> 48:  1998 Montenegro Serbia   <NA>
#> 49:  1999 Montenegro Serbia Kosovo
#> 50:  2000 Montenegro Serbia Kosovo
#> 51:  2001 Montenegro Serbia Kosovo
#> 52:  2002 Montenegro Serbia Kosovo
#> 53:  2003 Montenegro Serbia Kosovo
#> 54:  2004 Montenegro Serbia Kosovo
#> 55:  2005 Montenegro Serbia Kosovo
#> 56:  2006 Montenegro Serbia Kosovo
#> 57:  2007 Montenegro Serbia Kosovo
#> 58:  2008 Montenegro Serbia Kosovo
#> 59:  2009 Montenegro Serbia Kosovo
#> 60:  2010 Montenegro Serbia Kosovo
#> 61:  2011 Montenegro Serbia Kosovo
#> 62:  2012 Montenegro Serbia Kosovo
#> 63:  2013 Montenegro Serbia Kosovo
#> 64:  2014 Montenegro Serbia Kosovo
#> 65:  2015 Montenegro Serbia Kosovo
#> 66:  2016 Montenegro Serbia Kosovo
#> 67:  2017 Montenegro Serbia Kosovo
#> 68:  2018 Montenegro Serbia Kosovo
#> 69:  2019 Montenegro Serbia Kosovo
#> 70:  2020 Montenegro Serbia Kosovo
#> 71:  2021 Montenegro Serbia Kosovo
#>      year        341    345    347

Thankfully the codes themselves are correct. However, V-Dem maintains independent listings for all three states even while they were unified under various arrangements between 1992 and 2006. The course of action here depends on your intended use and additional datasets. Taking the mean of Serbia and Montenegro (maybe even Kosovo) over this time period is one potential correction. For this guide, we will average Serbia and Montenegro. You may want to consider doing the same for Kosovo and Serbia or all 3 states.

for(i in 1995:2005) vdem[cow %in% c(341,345) & year==i, (metrics):=lapply(.SD, mean, na.rm = TRUE), .SDcols = metrics]
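
The loop above can also be written as a single grouped assignment, which is easier to verify on a small example. This sketch uses a toy table with made-up scores; `toy` is an illustrative name and only one index is shown.

```r
library(data.table)

# Toy scores for two merged units over two years, plus an unrelated country.
toy <- data.table(cow        = c(341, 345, 341, 345, 2),
                  year       = c(2000, 2000, 2001, 2001, 2000),
                  v2x_libdem = c(0.2, 0.4, 0.3, 0.5, 0.8))

# Within each year, replace both units' scores with their mean.
toy[cow %in% c(341, 345),
    v2x_libdem := mean(v2x_libdem, na.rm = TRUE),
    by = year]
```

Note that this overwrites each country's own score in place, so keep a copy of the raw values if you may need them later.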

The coverage and coding for Kosovo are correct; it can be left as is if your other data of interest recognize the state.

Other Considerations

Sudan (625) and South Sudan (626) split in 2011. Check them in V-Dem.

dcast(vdem[cow %in% c(625,626), .(country_name, cow, year)],year~cow, value.var = "country_name")
#> Key: <year>
#>      year    625         626
#>     <num> <char>      <char>
#>  1:  1951  Sudan        <NA>
#>  2:  1952  Sudan        <NA>
#>  3:  1953  Sudan        <NA>
#>  4:  1954  Sudan        <NA>
#>  5:  1955  Sudan        <NA>
#>  6:  1956  Sudan        <NA>
#>  7:  1957  Sudan        <NA>
#>  8:  1958  Sudan        <NA>
#>  9:  1959  Sudan        <NA>
#> 10:  1960  Sudan        <NA>
#> 11:  1961  Sudan        <NA>
#> 12:  1962  Sudan        <NA>
#> 13:  1963  Sudan        <NA>
#> 14:  1964  Sudan        <NA>
#> 15:  1965  Sudan        <NA>
#> 16:  1966  Sudan        <NA>
#> 17:  1967  Sudan        <NA>
#> 18:  1968  Sudan        <NA>
#> 19:  1969  Sudan        <NA>
#> 20:  1970  Sudan        <NA>
#> 21:  1971  Sudan        <NA>
#> 22:  1972  Sudan        <NA>
#> 23:  1973  Sudan        <NA>
#> 24:  1974  Sudan        <NA>
#> 25:  1975  Sudan        <NA>
#> 26:  1976  Sudan        <NA>
#> 27:  1977  Sudan        <NA>
#> 28:  1978  Sudan        <NA>
#> 29:  1979  Sudan        <NA>
#> 30:  1980  Sudan        <NA>
#> 31:  1981  Sudan        <NA>
#> 32:  1982  Sudan        <NA>
#> 33:  1983  Sudan        <NA>
#> 34:  1984  Sudan        <NA>
#> 35:  1985  Sudan        <NA>
#> 36:  1986  Sudan        <NA>
#> 37:  1987  Sudan        <NA>
#> 38:  1988  Sudan        <NA>
#> 39:  1989  Sudan        <NA>
#> 40:  1990  Sudan        <NA>
#> 41:  1991  Sudan        <NA>
#> 42:  1992  Sudan        <NA>
#> 43:  1993  Sudan        <NA>
#> 44:  1994  Sudan        <NA>
#> 45:  1995  Sudan        <NA>
#> 46:  1996  Sudan        <NA>
#> 47:  1997  Sudan        <NA>
#> 48:  1998  Sudan        <NA>
#> 49:  1999  Sudan        <NA>
#> 50:  2000  Sudan        <NA>
#> 51:  2001  Sudan        <NA>
#> 52:  2002  Sudan        <NA>
#> 53:  2003  Sudan        <NA>
#> 54:  2004  Sudan        <NA>
#> 55:  2005  Sudan        <NA>
#> 56:  2006  Sudan        <NA>
#> 57:  2007  Sudan        <NA>
#> 58:  2008  Sudan        <NA>
#> 59:  2009  Sudan        <NA>
#> 60:  2010  Sudan        <NA>
#> 61:  2011  Sudan South Sudan
#> 62:  2012  Sudan South Sudan
#> 63:  2013  Sudan South Sudan
#> 64:  2014  Sudan South Sudan
#> 65:  2015  Sudan South Sudan
#> 66:  2016  Sudan South Sudan
#> 67:  2017  Sudan South Sudan
#> 68:  2018  Sudan South Sudan
#> 69:  2019  Sudan South Sudan
#> 70:  2020  Sudan South Sudan
#> 71:  2021  Sudan South Sudan
#>      year    625         626

This is correct. Lastly, we should check V-Dem against our CoW reference (states::cowstates) to see if V-Dem is missing any countries.

cowstates<-data.table::setDT(states::cowstates)
missing_in_vdem<-cowstates[end >= sprintf("%s-01-01", 1995)][!cowcode %in% vdem$cow]
knitr::kable(missing_in_vdem)

| cowcode | cowc | country_name                   | start      | end        | microstate |
|--------:|:-----|:-------------------------------|:-----------|:-----------|:-----------|
|      31 | BHM  | Bahamas                        | 1973-07-10 | 9999-12-31 | FALSE      |
|      54 | DMA  | Dominica                       | 1978-11-03 | 9999-12-31 | TRUE       |
|      55 | GRN  | Grenada                        | 1974-02-07 | 9999-12-31 | TRUE       |
|      56 | SLU  | St. Lucia                      | 1979-02-22 | 9999-12-31 | TRUE       |
|      57 | SVG  | St. Vincent and the Grenadines | 1979-10-27 | 9999-12-31 | TRUE       |
|      58 | AAB  | Antigua & Barbuda              | 1981-11-01 | 9999-12-31 | TRUE       |
|      60 | SKN  | St. Kitts and Nevis            | 1983-09-19 | 9999-12-31 | TRUE       |
|      80 | BLZ  | Belize                         | 1981-09-21 | 9999-12-31 | FALSE      |
|     221 | MNC  | Monaco                         | 1993-05-28 | 9999-12-31 | TRUE       |
|     223 | LIE  | Liechtenstein                  | 1990-09-18 | 9999-12-31 | TRUE       |
|     232 | AND  | Andorra                        | 1993-07-28 | 9999-12-31 | TRUE       |
|     331 | SNM  | San Marino                     | 1992-03-02 | 9999-12-31 | TRUE       |
|     835 | BRU  | Brunei                         | 1984-01-01 | 9999-12-31 | FALSE      |
|     946 | KIR  | Kiribati                       | 1999-09-14 | 9999-12-31 | TRUE       |
|     947 | TUV  | Tuvalu                         | 2000-09-05 | 9999-12-31 | TRUE       |
|     955 | TON  | Tonga                          | 1999-09-14 | 9999-12-31 | TRUE       |
|     970 | NAU  | Nauru                          | 1999-09-14 | 9999-12-31 | TRUE       |
|     983 | MSI  | Marshall Islands               | 1991-09-17 | 9999-12-31 | TRUE       |
|     986 | PAL  | Palau                          | 1994-12-15 | 9999-12-31 | TRUE       |
|     987 | FSM  | Federated States of Micronesia | 1991-09-17 | 9999-12-31 | TRUE       |
|     990 | WSM  | Samoa                          | 1976-12-15 | 9999-12-31 | TRUE       |

There is nothing of consequence here; these are mostly microstates that are commonly omitted from environment-security analysis. For simplicity, the remaining microstates included in V-Dem may be dropped unless you are carrying out a specialized analysis.

microstates <- cowstates[microstate==TRUE,unique(cowcode)]
vdem<-vdem[!(cow %in% microstates)]

Finally, we’ll examine V-Dem for duplicate country names to ensure we don’t miss any peculiarities.

dupes<-unique(vdem[,.(country_name, cow)])
# check for duplicate names across codes
table(duplicated(dupes$country_name))
#> 
#> FALSE  TRUE 
#>   175     3

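The check flags three rows in which a country name repeats under a different CoW code. Inspect these name/code pairs before merging with other datasets to confirm that they reflect deliberate coding rather than errors.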
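
The table() call only counts repeated names; it does not say which ones they are. One way to list them is to count distinct codes per name and then pull the matching pairs. This is a sketch on a toy table; the names and codes are fictional, and `pairs` and `repeated` are illustrative names.

```r
library(data.table)

# Toy name/code pairs; "Alpha" appears under two codes (fictional values).
pairs <- data.table(country_name = c("Alpha", "Alpha", "Beta"),
                    cow          = c(901, 902, 903))

# Count distinct codes per name and keep names with more than one.
repeated <- pairs[, .(n_codes = uniqueN(cow)), by = country_name][n_codes > 1]

# Show every name/code pair behind the flagged names.
pairs[country_name %in% repeated$country_name]
```

Applied to the real data, `pairs` would be the `dupes` table built above.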

Missing Values

As previously covered, v2x_polyarchy and v2x_libdem are 2 high level (vartype==D) democracy indices quantifying the extent of electoral and liberal democracy in a given state. Both metrics are continuous variables bound by 0-1. We can quickly check their distributions to get a better sense of the data.

hist.dat<-melt(vdem, 
               id.vars = c("cow", "year"),
               measure.vars = c("v2x_libdem",
                                "v2x_polyarchy"),
               variable.name = "metric",
               value.name = "value")

ggplot2::ggplot(hist.dat, ggplot2::aes(x=value))+
  ggplot2::geom_histogram()+
  ggplot2::facet_wrap(~metric)+
  ggplot2::labs(title = "V-Dem Metric Distributions",
                x = "Value",
                y= "Count")+
  ggplot2::theme_minimal()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 106 rows containing non-finite values (stat_bin).

These look pretty good with (mostly) uniform coverage. The warnings have tipped us off to a few missing values; let’s investigate further.

vdem[is.na(v2x_libdem) | is.na(v2x_polyarchy),.(unique(country_name), n=.N, last_year=max(year)), by=cow]
#>      cow                       V1     n last_year
#>    <num>                   <char> <int>     <num>
#> 1:   482 Central African Republic     2      1965
#> 2:   860              Timor-Leste    48      1998
#> 3:   705               Kazakhstan     1      1990
#> 4:   565                  Namibia     1      1979
#> 5:   701             Turkmenistan     1      1990
#> 6:   692                  Bahrain    51      2001
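
The summary above counts rows where either index is missing. To see which index is responsible for the gaps, the missing values can be counted per column and per country. This is a sketch on a toy table with made-up values; `toy`, `index_cols`, and `na_counts` are illustrative names.

```r
library(data.table)

# Toy table with gaps in both indices (values are made up).
toy <- data.table(cow           = c(1, 1, 2, 2),
                  v2x_libdem    = c(NA, 0.3, NA, NA),
                  v2x_polyarchy = c(0.2, 0.4, NA, 0.5))
index_cols <- c("v2x_libdem", "v2x_polyarchy")

# Number of missing values per index, by country code.
na_counts <- toy[, lapply(.SD, function(x) sum(is.na(x))),
                 by = cow, .SDcols = index_cols]
```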

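Together these countries account for 104 country-years with at least one missing index, nearly all of them from Timor-Leste and Bahrain, so they should be investigated. First, Timor-Leste.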

vdem[cow==860, .(country_name, year, v2x_libdem, v2x_polyarchy)]
#>     country_name  year v2x_libdem v2x_polyarchy
#>           <char> <num>      <num>         <num>
#>  1:  Timor-Leste  1951         NA         0.016
#>  2:  Timor-Leste  1952         NA         0.016
#>  3:  Timor-Leste  1953         NA         0.016
#>  4:  Timor-Leste  1954         NA         0.016
#>  5:  Timor-Leste  1955         NA         0.016
#>  6:  Timor-Leste  1956         NA         0.016
#>  7:  Timor-Leste  1957         NA         0.016
#>  8:  Timor-Leste  1958         NA         0.016
#>  9:  Timor-Leste  1959         NA         0.016
#> 10:  Timor-Leste  1960         NA         0.017
#> 11:  Timor-Leste  1961         NA         0.017
#> 12:  Timor-Leste  1962         NA         0.017
#> 13:  Timor-Leste  1963         NA         0.017
#> 14:  Timor-Leste  1964         NA         0.017
#> 15:  Timor-Leste  1965         NA         0.017
#> 16:  Timor-Leste  1966         NA         0.017
#> 17:  Timor-Leste  1967         NA         0.017
#> 18:  Timor-Leste  1968         NA         0.017
#> 19:  Timor-Leste  1969         NA         0.017
#> 20:  Timor-Leste  1970         NA         0.017
#> 21:  Timor-Leste  1971         NA         0.017
#> 22:  Timor-Leste  1972         NA         0.017
#> 23:  Timor-Leste  1973         NA         0.047
#> 24:  Timor-Leste  1974         NA         0.049
#> 25:  Timor-Leste  1975         NA         0.069
#> 26:  Timor-Leste  1976         NA         0.021
#> 27:  Timor-Leste  1977         NA         0.076
#> 28:  Timor-Leste  1978         NA         0.076
#> 29:  Timor-Leste  1979         NA         0.076
#> 30:  Timor-Leste  1980         NA         0.076
#> 31:  Timor-Leste  1981         NA         0.076
#> 32:  Timor-Leste  1982         NA         0.076
#> 33:  Timor-Leste  1983         NA         0.076
#> 34:  Timor-Leste  1984         NA         0.076
#> 35:  Timor-Leste  1985         NA         0.076
#> 36:  Timor-Leste  1986         NA         0.076
#> 37:  Timor-Leste  1987         NA         0.076
#> 38:  Timor-Leste  1988         NA         0.076
#> 39:  Timor-Leste  1989         NA         0.076
#> 40:  Timor-Leste  1990         NA         0.078
#> 41:  Timor-Leste  1991         NA         0.078
#> 42:  Timor-Leste  1992         NA         0.078
#> 43:  Timor-Leste  1993         NA         0.078
#> 44:  Timor-Leste  1994         NA         0.078
#> 45:  Timor-Leste  1995         NA         0.078
#> 46:  Timor-Leste  1996         NA         0.078
#> 47:  Timor-Leste  1997         NA         0.078
#> 48:  Timor-Leste  1998         NA         0.090
#> 49:  Timor-Leste  1999      0.087         0.090
#> 50:  Timor-Leste  2000      0.186         0.225
#> 51:  Timor-Leste  2001      0.237         0.293
#> 52:  Timor-Leste  2002      0.384         0.503
#> 53:  Timor-Leste  2003      0.414         0.568
#> 54:  Timor-Leste  2004      0.414         0.568
#> 55:  Timor-Leste  2005      0.416         0.576
#> 56:  Timor-Leste  2006      0.414         0.575
#> 57:  Timor-Leste  2007      0.471         0.616
#> 58:  Timor-Leste  2008      0.477         0.627
#> 59:  Timor-Leste  2009      0.482         0.632
#> 60:  Timor-Leste  2010      0.488         0.633
#> 61:  Timor-Leste  2011      0.488         0.633
#> 62:  Timor-Leste  2012      0.498         0.644
#> 63:  Timor-Leste  2013      0.501         0.648
#> 64:  Timor-Leste  2014      0.474         0.632
#> 65:  Timor-Leste  2015      0.461         0.632
#> 66:  Timor-Leste  2016      0.454         0.615
#> 67:  Timor-Leste  2017      0.471         0.648
#> 68:  Timor-Leste  2018      0.492         0.674
#> 69:  Timor-Leste  2019      0.508         0.684
#> 70:  Timor-Leste  2020      0.480         0.664
#> 71:  Timor-Leste  2021      0.487         0.680
#>     country_name  year v2x_libdem v2x_polyarchy

Timor-Leste is missing v2x_libdem for every year through 1998. These years cover Portuguese colonial rule and the Indonesian occupation, prior to internationally recognized independence. They can be ignored or dropped unless you have a special use case.

Now Bahrain.

vdem[cow==692, .(country_name, year, v2x_libdem, v2x_polyarchy)]
#>     country_name  year v2x_libdem v2x_polyarchy
#>           <char> <num>      <num>         <num>
#>  1:      Bahrain  1951         NA         0.023
#>  2:      Bahrain  1952         NA         0.023
#>  3:      Bahrain  1953         NA         0.026
#>  4:      Bahrain  1954         NA         0.026
#>  5:      Bahrain  1955         NA         0.026
#>  6:      Bahrain  1956         NA         0.026
#>  7:      Bahrain  1957         NA         0.025
#>  8:      Bahrain  1958         NA         0.025
#>  9:      Bahrain  1959         NA         0.025
#> 10:      Bahrain  1960         NA         0.023
#> 11:      Bahrain  1961         NA         0.024
#> 12:      Bahrain  1962         NA         0.024
#> 13:      Bahrain  1963         NA         0.024
#> 14:      Bahrain  1964         NA         0.024
#> 15:      Bahrain  1965         NA         0.024
#> 16:      Bahrain  1966         NA         0.024
#> 17:      Bahrain  1967         NA         0.024
#> 18:      Bahrain  1968         NA         0.024
#> 19:      Bahrain  1969         NA         0.024
#> 20:      Bahrain  1970         NA         0.024
#> 21:      Bahrain  1971         NA         0.026
#> 22:      Bahrain  1972         NA         0.076
#> 23:      Bahrain  1973         NA         0.150
#> 24:      Bahrain  1974         NA         0.126
#> 25:      Bahrain  1975         NA         0.105
#> 26:      Bahrain  1976         NA         0.051
#> 27:      Bahrain  1977         NA         0.051
#> 28:      Bahrain  1978         NA         0.051
#> 29:      Bahrain  1979         NA         0.051
#> 30:      Bahrain  1980         NA         0.049
#> 31:      Bahrain  1981         NA         0.049
#> 32:      Bahrain  1982         NA         0.049
#> 33:      Bahrain  1983         NA         0.049
#> 34:      Bahrain  1984         NA         0.049
#> 35:      Bahrain  1985         NA         0.049
#> 36:      Bahrain  1986         NA         0.049
#> 37:      Bahrain  1987         NA         0.049
#> 38:      Bahrain  1988         NA         0.049
#> 39:      Bahrain  1989         NA         0.049
#> 40:      Bahrain  1990         NA         0.049
#> 41:      Bahrain  1991         NA         0.049
#> 42:      Bahrain  1992         NA         0.049
#> 43:      Bahrain  1993         NA         0.049
#> 44:      Bahrain  1994         NA         0.049
#> 45:      Bahrain  1995         NA         0.049
#> 46:      Bahrain  1996         NA         0.049
#> 47:      Bahrain  1997         NA         0.049
#> 48:      Bahrain  1998         NA         0.049
#> 49:      Bahrain  1999         NA         0.049
#> 50:      Bahrain  2000         NA         0.067
#> 51:      Bahrain  2001         NA         0.113
#> 52:      Bahrain  2002      0.071         0.147
#> 53:      Bahrain  2003      0.085         0.216
#> 54:      Bahrain  2004      0.081         0.216
#> 55:      Bahrain  2005      0.082         0.225
#> 56:      Bahrain  2006      0.089         0.230
#> 57:      Bahrain  2007      0.088         0.232
#> 58:      Bahrain  2008      0.088         0.232
#> 59:      Bahrain  2009      0.087         0.230
#> 60:      Bahrain  2010      0.084         0.227
#> 61:      Bahrain  2011      0.052         0.184
#> 62:      Bahrain  2012      0.045         0.166
#> 63:      Bahrain  2013      0.046         0.164
#> 64:      Bahrain  2014      0.047         0.162
#> 65:      Bahrain  2015      0.045         0.138
#> 66:      Bahrain  2016      0.045         0.131
#> 67:      Bahrain  2017      0.043         0.126
#> 68:      Bahrain  2018      0.042         0.121
#> 69:      Bahrain  2019      0.048         0.118
#> 70:      Bahrain  2020      0.048         0.118
#> 71:      Bahrain  2021      0.052         0.122
#>     country_name  year v2x_libdem v2x_polyarchy

Bahrain declared independence in 1971 and converted to a Constitutional Monarchy in 2001. v2x_libdem is missing for every year through 2001. The missing 2001 value in particular may pose an issue when trying to join additional datasets. A simple fix would be to replace the 2001 value with the 2002 value. A more complicated fix would be some type of lead-in imputation. Let’s examine the time series.

ggplot2::ggplot(vdem[cow==692], ggplot2::aes(x=year, y=v2x_libdem))+
  ggplot2::geom_point(size = 2)+
  ggplot2::labs(title="Bahrain Libdem Time Series",
                x = "Year",
                y = "Libdem")+
  ggplot2::theme_minimal()
#> Warning: Removed 51 rows containing missing values (geom_point).

There is a bit of a linear trend, but imputation would be more trouble than it’s worth. An adequate correction is to put in the 2002 value.

vdem[cow==692 & year==2001, v2x_libdem := vdem[cow==692 & year==2002, v2x_libdem]]
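
If the same one-year backfill is needed for several countries, it can be written once with a grouped lead instead of hard-coding each country and year. The sketch below fills a missing value only when the following year is observed, so longer runs of missing years stay missing. It uses a toy table with made-up values; `toy`, `index`, and `next_val` are illustrative names.

```r
library(data.table)

# Toy series: country 1 has two leading gaps, country 2 is complete.
toy <- data.table(cow   = rep(c(1L, 2L), each = 4),
                  year  = rep(2000:2003, times = 2),
                  index = c(NA, NA, 0.10, 0.12, 0.50, 0.52, 0.51, 0.55))
setorder(toy, cow, year)

# Value observed in the following year, within each country.
toy[, next_val := shift(index, n = 1L, type = "lead"), by = cow]

# Fill a gap only when the very next year has a value.
toy[is.na(index) & !is.na(next_val), index := next_val]
toy[, next_val := NULL]
```

Here country 1 receives the 2002 value for 2001, while 2000 remains missing because its following year was itself missing when the lead was computed.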

Final Cleanup

Before finishing, we will perform a few final processing steps. First, extract only the minimum number of variables.

vdem <- vdem[, .(cow, year, v2x_libdem, v2x_polyarchy)]

Next, set the year and CoW columns to integers.

cols<-c("cow", "year")
vdem[, (cols):=lapply(.SD, as.integer), .SDcols = cols]

Lastly, if you are working with other colleagues, strip the data.table class from the object.

data.table::setDF(vdem)
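
Before handing the data off, a few assertions can confirm that the table is in the expected shape: one row per country-year, integer join keys, and a plain data frame. This is a base R sketch on a toy data frame with made-up values; `final` is an illustrative name standing in for the cleaned vdem object.

```r
# Toy stand-in for the finished data frame (numbers are made up).
final <- data.frame(cow           = c(2L, 2L, 20L),
                    year          = c(2000L, 2001L, 2000L),
                    v2x_libdem    = c(0.80, 0.81, 0.78),
                    v2x_polyarchy = c(0.85, 0.86, 0.84))

# Each country-year should appear exactly once.
stopifnot(anyDuplicated(final[, c("cow", "year")]) == 0)

# Join keys should be integers, and the object a plain data frame.
stopifnot(is.integer(final$cow), is.integer(final$year))
stopifnot(identical(class(final), "data.frame"))
```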

And we’re finished. I hope you found this exercise informative, and please contact me with any questions, concerns, or tips.
