Managing country codes is one of the most difficult tasks in environment-security analysis, and each dataset treats the concept differently. Country code designations may be very strict and respect only internationally recognized sovereign nations, starting with their first official day of independence. Other datasets may include separate observations and codes for territories, disputed states, colonial nations, and other quasi-independent states. Coding systems like the Correlates of War (CorrelatesofWarProject2017?) or Gleditsch and Ward (Gleditsch1999?) fall into the former category, with strict, to-the-day international recognition, while the refugee and asylum datasets provided by the United Nations (UnitedNations2017?) fall into the latter. UN migration and refugee datasets include nearly every conceivable territorial designation or disaggregation (e.g. Puerto Rico, Taiwan, Faroe Islands, etc.).
When using an individual dataset, users need a general awareness of its conceptual approach to country code specification and of any potential inconsistencies. Harmonizing multiple datasets requires more intricate knowledge. Merging multiple datasets without first addressing disparate approaches to country coding will lead to inconsistencies, numerous NA values, and compounding downstream error in the analysis. It takes time and experience to anticipate country coding issues and to know what to look out for during project planning. The goal of this vignette is to outline common issues and considerations for harmonizing multiple country-year datasets. This walkthrough uses the Varieties of Democracy (V-Dem) (Coppedge2020?) and Correlates of War coding schemes to illustrate recurrent challenges. They are good datasets to work with while learning for several reasons:
The Varieties of Democracy (V-Dem) presents a collection of highly disaggregated indicators (400+) covering wide-ranging measures of democracy and institutional characteristics dating back to 1789. In contrast to several other political, social, and environmental datasets, V-Dem is extremely transparent about its methods and provides copious documentation. This includes a dedicated supplementary manual describing its country coding approach: V-Dem Country Coding Units v11.1 (Coppedge2021?). This tutorial will highlight the most relevant (in my opinion) portions of that approach and its country-year coding decisions.
V-Dem defines a country as
…a political unit enjoying at least some degree of functional and/or formal sovereignty.
Generally speaking, V-Dem will have a country-year observation for a country if:
In my limited experience with V-Dem, this conceptual approach is carried out very consistently, but additional pre-processing is required to merge successfully with stricter datasets. Moreover, the Country Coding Units Manual contains information that helps users construct historical time series for nations that have limited independent observations in the dataset. For example, Bangladesh is coded independently starting in 1971, but to construct a longer historical record, the Country Coding Units Manual states that the following combination of observations may be used:
The Country Coding Units Manual is a great reference if you need to combine historical democracy indicators for nations with intermittent internationally recognized sovereignty with other historically disaggregated environmental, migration, or conflict data.
The `histname` variable in the raw V-Dem dataset is also an excellent source of context for the true nature of regime changes in annual polity. While the `country_name` field is static and typically reflects the modern, common country name in the V-Dem dataset, `histname` reflects official or historic names and potential states of occupation. Take Somalia for example:
country_name | histname | COWcode | Start | End |
---|---|---|---|---|
Somalia | Somalia under formal Italian control over most of the territory [except British Somaliland] | 520 | 1900 | 1909 |
Somalia | Somalia under effective (more or less) Italian control | 520 | 1910 | 1940 |
Somalia | Somalia under British occupation | 520 | 1941 | 1949 |
Somalia | UN Trust Territory of Somalia under Italian administration | 520 | 1950 | 1959 |
Somalia | Somali Republic [unites Somaliland with the Trust Territory of Somalia] | 520 | 1960 | 1968 |
Somalia | Somali Democratic Republic | 520 | 1969 | 1990 |
Somalia | Somali Democratic Republic [civil war] | 520 | 1991 | 2003 |
Somalia | Transitional Federal Government [civil war] | 520 | 2004 | 2011 |
Somalia | Federal Republic of Somalia [civil war] | 520 | 2012 | 2021 |
This provides a clear understanding of Somalia’s historical sovereignty and polity over the past 120 years. When combined with the Country Coding Units Manual, it lets the user make better-informed choices about their research interests. It also serves as excellent reference material for historical nation-states, and it is much faster than being sucked into a Wikipedia black hole.
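A table like the one above can be derived from annual country-year rows by collapsing consecutive years that share the same `histname`. The following is a minimal sketch, not V-Dem tooling: the function name `collapse_spells` and the tuple layout are assumptions made for illustration, and the four sample rows simply restate years covered by the Somalia table.

```python
from itertools import groupby

def collapse_spells(rows):
    """Collapse (country_name, histname, year) rows into
    (country_name, histname, start, end) spells.

    Consecutive rows with the same histname form one spell. A histname
    that reappears later starts a new spell. Missing years inside a
    spell are not detected by this sketch.
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))  # by country, then year
    spells = []
    for (country, hist), group in groupby(ordered, key=lambda r: (r[0], r[1])):
        years = [r[2] for r in group]
        spells.append((country, hist, years[0], years[-1]))
    return spells

# Sample rows restating part of the Somalia table.
TRUST = "UN Trust Territory of Somalia under Italian administration"
REPUBLIC = "Somali Republic [unites Somaliland with the Trust Territory of Somalia]"
sample_rows = [
    ("Somalia", REPUBLIC, 1961),
    ("Somalia", TRUST, 1958),
    ("Somalia", TRUST, 1959),
    ("Somalia", REPUBLIC, 1960),
]
spells = collapse_spells(sample_rows)
for spell in spells:
    print(spell)
```

Sorting by year before grouping matters here, because `itertools.groupby` only merges adjacent items with equal keys.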
Now we will walk through some of the more important country coding considerations for applied cases using Varieties of Democracy and Correlates of War for 1900-2020.
Most of the Caribbean nations are coded by CoW starting with their official independence from their colonial parent nations, usually sometime between 1962 and 1975. V-Dem starts most Caribbean countries between 1789 and 1900.
The remaining nations of Central and South America are similarly coded by CoW and V-Dem because they mostly achieved their independence prior to 1900.
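Differences in start years like these show up as unmatched country-years when the two datasets are joined. One way to see them before merging is an anti-join on the (code, year) key. The sketch below is illustrative only: the helper name `unmatched_keys`, the dictionary field names, and the toy rows (including the code 51 and the years) are assumptions, not values read from either dataset.

```python
def unmatched_keys(left_rows, right_rows):
    """Return sorted (code, year) keys present in left_rows but absent
    from right_rows. These are the rows a left join would fill with NA."""
    right_keys = {(r["cow"], r["year"]) for r in right_rows}
    left_keys = {(r["cow"], r["year"]) for r in left_rows}
    return sorted(left_keys - right_keys)

# Toy rows: the longer record starts two years before the stricter one.
long_record = [{"cow": 51, "year": y} for y in range(1960, 1964)]
strict_record = [{"cow": 51, "year": y} for y in range(1962, 1964)]

missing = unmatched_keys(long_record, strict_record)
print(missing)
```

Running the check in both directions (swapping the arguments) shows which dataset is responsible for each unmatched key.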
The Balkans are consistently a source of frustration when attempting to merge disparate datasets. The affected states include Serbia/Yugoslavia, Bosnia and Herzegovina, Kosovo, Croatia, North Macedonia, Slovenia, and Montenegro. V-Dem disaggregates each component for as long as possible, while CoW uses the official, internationally recognized, aggregated states. Because of this, V-Dem will require some additional processing when merging with most datasets. At a minimum, this may include averaging across multiple states to calculate values for Yugoslavia or the State Union of Serbia and Montenegro. Given time and resources, users may also consider calculating weighted averages using population data, so as not to give Kosovo or Slovenia equal weighting with Serbia/Yugoslavia or Montenegro. The V-Dem Country Coding manual is very helpful in reconstructing the complicated historical record of these nation-states.
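The difference between a simple and a population-weighted average can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `weighted_mean` is hypothetical, and the indicator scores and populations are invented numbers chosen only to make the arithmetic easy to follow.

```python
def weighted_mean(values, weights):
    """Weighted mean that skips pairs where either element is None.
    Returns None if no usable weight remains."""
    pairs = [(v, w) for v, w in zip(values, weights)
             if v is not None and w is not None]
    total = sum(w for _, w in pairs)
    if total == 0:
        return None
    return sum(v * w for v, w in pairs) / total

# Invented example: two component states of one aggregate unit.
scores = [0.25, 0.75]        # hypothetical indicator values
populations = [3.0, 1.0]     # hypothetical populations, in millions

simple = sum(scores) / len(scores)
weighted = weighted_mean(scores, populations)
print(simple, weighted)
```

With these invented inputs the larger state pulls the aggregate toward its own score, which is the effect the population weighting is meant to capture.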
V-Dem codes Austria and Hungary separately throughout their entire record. Therefore there is no corresponding record for CoW numeric code 300 (Austria-Hungary 1816-1918). The user must create this designation.
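One possible way to create that designation is to combine the Austria and Hungary rows for each year up to 1918 into a new row carrying code 300. The sketch below assumes the V-Dem rows carry CoW codes 305 (Austria) and 310 (Hungary), uses a simple unweighted mean as the combination rule, and invents the indicator name `score` and its values. None of these choices comes from the V-Dem or CoW documentation.

```python
from collections import defaultdict

AUSTRIA, HUNGARY, AUSTRIA_HUNGARY = 305, 310, 300  # assumed CoW codes
LAST_YEAR = 1918  # final year of CoW code 300

def build_austria_hungary(rows, indicator):
    """Create code-300 rows by averaging Austria and Hungary per year.
    Years where either component is missing are skipped."""
    by_year = defaultdict(dict)
    for row in rows:
        if row["cow"] in (AUSTRIA, HUNGARY) and row["year"] <= LAST_YEAR:
            by_year[row["year"]][row["cow"]] = row[indicator]
    combined = []
    for year in sorted(by_year):
        parts = by_year[year]
        if AUSTRIA in parts and HUNGARY in parts:
            value = (parts[AUSTRIA] + parts[HUNGARY]) / 2
            combined.append({"cow": AUSTRIA_HUNGARY, "year": year,
                             indicator: value})
    return combined

# Invented rows for illustration only.
toy_rows = [
    {"cow": 305, "year": 1917, "score": 0.25},
    {"cow": 310, "year": 1917, "score": 0.75},
    {"cow": 305, "year": 1918, "score": 0.5},
    {"cow": 310, "year": 1918, "score": 1.0},
    {"cow": 305, "year": 1919, "score": 0.5},  # after 1918: ignored
]
austria_hungary = build_austria_hungary(toy_rows, "score")
print(austria_hungary)
```

A population-weighted mean could be substituted for the simple mean if population data are available for both components.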
Germany consists of 3 separate designations in both datasets.
Lastly, as a result of these various configurations, CoW does not track any form of Germany during 1946-1953, and V-Dem does not contain “Unified” or West German observations from 1945-1948.
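Gaps like these are easy to miss by eye, so it can help to list the missing year ranges for a unit before merging. The helper below, `missing_ranges`, is a hypothetical name, and the sample years are a toy series built to mirror the 1946-1953 CoW gap described above rather than values read from the dataset.

```python
def missing_ranges(years):
    """Return (first_missing, last_missing) tuples for every break
    in an otherwise annual series of years."""
    ordered = sorted(set(years))
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev > 1:
            gaps.append((prev + 1, nxt - 1))
    return gaps

# Toy series: observed through 1945, then again from 1954.
observed = list(range(1940, 1946)) + list(range(1954, 1957))
gaps = missing_ranges(observed)
print(gaps)
```

The function only reports breaks between observed years, so years before the first or after the last observation are not flagged.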
CoW includes designations for Czechoslovakia (315; 1918-1992), Czech Republic (316; 1993-), and Slovakia (317; 1993-), whereas V-Dem observes Czechoslovakia and the Czech Republic on a single code (157; 1918-) with Slovakia (201) diverging from 1939-1945 and again permanently in 1993.
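Because one V-Dem code maps to different CoW codes depending on the year, a crosswalk keyed on year ranges is one way to translate between them. The sketch below copies the codes and years from the paragraph above. The table layout and the function name `to_cow` are assumptions, and the 1939-1945 Slovakia observations are deliberately left unmapped because no CoW code is given for them here.

```python
# (vdem_id, first_year, last_year or None for open-ended, cow_code)
CROSSWALK = [
    (157, 1918, 1992, 315),  # Czechoslovakia
    (157, 1993, None, 316),  # Czech Republic
    (201, 1993, None, 317),  # Slovakia
]

def to_cow(vdem_id, year):
    """Return the CoW code for a V-Dem country-year, or None when the
    crosswalk has no matching range."""
    for vid, start, end, cow in CROSSWALK:
        if vid == vdem_id and year >= start and (end is None or year <= end):
            return cow
    return None

for key in [(157, 1950), (157, 2000), (201, 1940), (201, 1995)]:
    print(key, to_cow(*key))
```

Returning `None` for unmapped country-years keeps those rows visible, so the decision to drop or recode them stays explicit.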
Former members of the Soviet Union are handled similarly in both datasets. Because the USSR exercised centralized control over polity and institutions during this time, V-Dem does not code members independently (in contrast to colonial Africa or the Caribbean). The exceptions are nations with a historical presence prior to annexation or occupation (Lithuania, Latvia, Uzbekistan, etc.). In these cases, states are recorded independently from their pre-WWI or pre-WWII independence, and then again following the dissolution of the USSR: 1990 (V-Dem) or 1991 (CoW).
The remaining notable European countries are Poland, Ireland, and Luxembourg.
Along with the various configurations of Yugoslavia, Palestine is another consistent source of frustration when carrying out pre-processing. Many datasets and coding schemes do not recognize Palestine. In fact, CoW does not include a code for Palestine. This complicates interdisciplinary research, because including Palestine in your analysis can severely limit the number of additional datasets you can include without introducing more error through imputation or other fixes. V-Dem includes 3 separate designations for Palestine:
CoW includes separate codes for the Yemen Arab Republic/North Yemen (678; 1926-1990), unified modern Yemen (679; 1990-), and Yemen People’s Democratic Republic/South Yemen (680; 1967-1990).
Similar to the Czech Republic, V-Dem tracks historic, North, and modern Yemen on a single code (14; 1789-1850 and 1918-). They provide South Yemen a separate code (23; 1900-1990). V-Dem’s greater historic record for South Yemen includes only the city of Aden and its immediate surroundings from 1900-1963. There is no CoW equivalent for South Yemen/Aden during this time.
There are not many complicated distinctions throughout Africa. V-Dem codes most African nations independently starting in 1900 while indicating their colonial protectorate. Depending on the colonial parent nation, most African countries acquired internationally recognized independence between 1955 and 1975; this is when the CoW record begins. Some notable African countries:
V-Dem includes a greater historical record for most Asian countries because, similar to Africa and the Caribbean, there was less centralized control exercised over their polities and institutions during colonial rule. As a result, most Asian states are recorded in V-Dem beginning sometime between 1789 and 1900, and in CoW during one of the waves of colonial (1945-1950) or regional (1971-1975) independence. However, there are a few nation-states with more complicated records in V-Dem and CoW:
CoW tracks Korea on 3 separate numeric codes: Korea (Unified; 730), North Korea (731), and South Korea (732). Conversely, V-Dem assigns a single code for historic/unified Korea and South Korea (42) while placing North Korea on its own code (41). CoW does not list Korea while under Soviet, U.S., or Japanese occupation.
As with Korea, V-Dem places historic, unified Vietnam and post-WWII South Vietnam on a single code. North Vietnam is recognized in 1945 and then absorbs the former South Vietnam in 1976, carrying on to the present. CoW does not recognize colonial or occupied Vietnam(s).