Country Coding Considerations for Dataset Harmonization and Applied Uses

April 13, 2021

Introduction

Managing country codes is one of the most difficult tasks in environment-security analysis, and each dataset treats the concept differently. Some coding schemes are very strict and cover only internationally recognized sovereign nations, starting with their first official day of independence. Other datasets include separate observations and codes for territories, disputed states, colonial nations, and other quasi-independent states. Coding systems like the Correlates of War (CorrelatesofWarProject2017?) or Gleditsch and Ward (Gleditsch1999?) fall into the former group, with strict, to-the-day international recognition, while the refugee and asylum datasets provided by the United Nations (UnitedNations2017?) fall into the latter. UN migration and refugee datasets include nearly every conceivable territorial designation or disaggregation (e.g. Puerto Rico, Taiwan, Faroe Islands).

When using an individual dataset, users need a general awareness of its conceptual approach to country coding and of any potential inconsistencies. When attempting to harmonize multiple datasets, however, more intricate knowledge is required. Merging multiple datasets without first addressing disparate approaches to country coding will lead to inconsistencies and numerous NA values, and will compound downstream error in the analysis. It takes time and experience to anticipate country coding issues and to know what to look out for during project planning. The goal of this vignette is to outline common issues and considerations for harmonizing multiple country-year datasets. This walkthrough uses the Varieties of Democracy (V-Dem) (Coppedge2020?) and Correlates of War coding schemes to illustrate recurrent challenges. They are good datasets to work with while learning for several reasons:

V-Dem Country Codes

The Varieties of Democracy (V-Dem) presents a collection of highly disaggregated indicators (400+) covering wide-ranging measures of democracy and institutional characteristics dating back to 1789. In contrast to several other political, social, and environmental datasets, V-Dem is extremely transparent about its methods and provides copious documentation. This includes a dedicated supplementary manual describing its country coding approach: V-Dem Country Coding Units v11.1 (Coppedge2021?). This tutorial will highlight the most relevant (in my opinion) portions of that approach and its country-year coding decisions.

V-Dem defines a country as

…a political unit enjoying at least some degree of functional and/or formal sovereignty.

Generally speaking, V-Dem will have a country-year observation for a country if:

  1. The state made a formal declaration of independence, even if it is not yet fully recognized by the international community.
  2. In modern times this concerns states like Kosovo, Taiwan, and Western Sahel; in historic times it concerns colonial states in Asia and Africa.
  3. The state in question operates with some degree of autonomy that distinguishes it, at least in its polities and institutions, from the parent nation.

In my limited experience with V-Dem, this conceptual approach is carried out very consistently, but additional pre-processing is required to merge successfully with stricter datasets. Moreover, the Country Coding Units Manual contains information that helps users construct historical time series for nations that have limited independent observations in the dataset. For example, Bangladesh is coded independently starting in 1971, but to construct a longer historic record, the Country Coding Units Manual states that the following combination of observations may be used:

  1. India (Princely state of British India (1910 – 1947))
  2. Pakistan (1947 - 1971)
  3. Bangladesh (1971 - )
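The combination above can be sketched as a small stitching routine. This is a minimal illustration rather than V-Dem's own procedure: the `rows` tuples, the score values, and the `stitch` helper are all hypothetical, and the rule that the successor state wins in the shared boundary years (1947 and 1971) is an assumption you may want to change.

```python
# Illustrative rows only: (country_name, year, score). The scores are made up;
# a real V-Dem extract would supply them.
rows = [
    ("India", 1946, 0.10), ("India", 1947, 0.12),
    ("Pakistan", 1947, 0.15), ("Pakistan", 1970, 0.20), ("Pakistan", 1971, 0.18),
    ("Bangladesh", 1971, 0.22), ("Bangladesh", 1972, 0.25),
]

# Segments listed above, in chronological order. An end year of None means
# the segment is open ended.
SEGMENTS = [
    ("India", 1910, 1947),
    ("Pakistan", 1947, 1971),
    ("Bangladesh", 1971, None),
]


def stitch(rows, segments):
    """Return {year: (source_country, score)} for the combined record."""
    series = {}
    for country, start, end in segments:
        for name, year, score in rows:
            in_range = year >= start and (end is None or year <= end)
            if name == country and in_range:
                # Later segments overwrite earlier ones, so the successor
                # state wins in the shared boundary years (an assumption).
                series[year] = (country, score)
    return dict(sorted(series.items()))


bangladesh_long = stitch(rows, SEGMENTS)
for year, (source, score) in bangladesh_long.items():
    print(year, source, score)
```

Keeping the source country alongside each value makes it easy to audit which parent record supplied a given year.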

The Country Coding Units Manual is a great reference if you need to combine historical democracy indicators for nations with intermittent internationally recognized sovereignty with other historically disaggregated environmental, migration, or conflict data.

The histname variable in the raw V-Dem dataset is also an excellent source of context for the true nature of regime changes in annual polity. While the country_name field is static and typically reflects the modern, common, country name in the V-Dem dataset, the histname reflects official or historic names and potential states of occupation. Take Somalia for example:

V-Dem country name, V-Dem historic designation, Correlates of War country code, and start and end years for Somalia in the raw V-Dem Version 11.1 dataset.

country_name | histname | COWcode | Start | End
Somalia | Somalia under formal Italian control over most of the territory [except British Somaliland] | 520 | 1900 | 1909
Somalia | Somalia under effective (more or less) Italian control | 520 | 1910 | 1940
Somalia | Somalia under British occupation | 520 | 1941 | 1949
Somalia | UN Trust Territory of Somalia under Italian administration | 520 | 1950 | 1959
Somalia | Somali Republic [unites Somaliland with the Trust Territory of Somalia] | 520 | 1960 | 1968
Somalia | Somali Democratic Republic | 520 | 1969 | 1990
Somalia | Somali Democratic Republic [civil war] | 520 | 1991 | 2003
Somalia | Transitional Federal Government [civil war] | 520 | 2004 | 2011
Somalia | Federal Republic of Somalia [civil war] | 520 | 2012 | 2021

This provides a clear understanding of Somalia’s historical sovereignty and polity for the past 120 years. When combined with the Country Coding Units Manual, it lets the user make better-informed choices regarding their research interests. It also serves as excellent reference material for historical nation-states that is much faster than being sucked into a Wikipedia black hole.
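As a small illustration of how the start and end years can be used, the sketch below looks up the historic designation that applies to a given year. The spell values are copied from the Somalia table above; the `SOMALIA_SPELLS` list and the `histname_for` helper are hypothetical names introduced only for this example.

```python
# (histname, start, end) spells copied from the Somalia table above.
SOMALIA_SPELLS = [
    ("Somalia under formal Italian control over most of the territory "
     "[except British Somaliland]", 1900, 1909),
    ("Somalia under effective (more or less) Italian control", 1910, 1940),
    ("Somalia under British occupation", 1941, 1949),
    ("UN Trust Territory of Somalia under Italian administration", 1950, 1959),
    ("Somali Republic [unites Somaliland with the Trust Territory of Somalia]",
     1960, 1968),
    ("Somali Democratic Republic", 1969, 1990),
    ("Somali Democratic Republic [civil war]", 1991, 2003),
    ("Transitional Federal Government [civil war]", 2004, 2011),
    ("Federal Republic of Somalia [civil war]", 2012, 2021),
]


def histname_for(year, spells=SOMALIA_SPELLS):
    """Return the historic name covering `year`, or None if no spell does."""
    for name, start, end in spells:
        if start <= year <= end:
            return name
    return None


print(histname_for(1945))
print(histname_for(1975))
```

Attaching the designation to each country-year in this way makes it straightforward to filter out, or flag, years of occupation or trusteeship before merging with a stricter dataset.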

Now we will walk through some of the more important country coding considerations for applied cases using Varieties of Democracy and Correlates of War for 1900-2020.

The Americas

Canada

The Caribbean

Most Caribbean nations are coded by CoW starting with their official independence from their colonial parent nations, usually some time between 1962 and 1975. V-Dem starts most Caribbean countries between 1789 and 1900.
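A quick way to see the effect of these different start dates is to compare the country-year keys of the two sources before merging. The sketch below is only an illustration: the code 51 and the start years 1900 and 1962 are assumptions chosen for the example, not values read from either dataset.

```python
# Hypothetical (code, year) keys for a single Caribbean state.
vdem_keys = {(51, year) for year in range(1900, 2021)}
cow_keys = {(51, year) for year in range(1962, 2021)}

matched = vdem_keys & cow_keys            # rows that would merge cleanly
only_vdem = sorted(vdem_keys - cow_keys)  # rows that become NA on the CoW side
only_cow = sorted(cow_keys - vdem_keys)   # rows that become NA on the V-Dem side

print(f"matched: {len(matched)}")
print(f"V-Dem only: {len(only_vdem)}")
print(f"CoW only: {len(only_cow)}")
if only_vdem:
    print("unmatched V-Dem years run from",
          only_vdem[0][1], "to", only_vdem[-1][1])
```

Running this kind of check per country before a join shows exactly which years will be lost or padded with NA values, rather than discovering them downstream.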

Central and South America

The remaining nations of Central and South America are similarly coded by CoW and V-Dem because they mostly achieved their independence prior to 1900.

Europe

The Balkans

The Balkans are consistently a source of frustration when attempting to merge disparate datasets. The affected states include Serbia/Yugoslavia, Bosnia and Herzegovina, Kosovo, Croatia, North Macedonia, Slovenia, and Montenegro. V-Dem disaggregates each component for as long as possible, while CoW uses the official, internationally recognized, aggregated states. Because of this, V-Dem will require some additional processing before merging with most datasets. At a minimum this may include averaging across multiple states to calculate values for Yugoslavia or the State Union of Serbia and Montenegro. Given time and resources, users may also consider calculating weighted averages using population data so as not to give Kosovo or Slovenia equal weighting with Serbia/Yugoslavia or Montenegro. The V-Dem Country Coding manual is very helpful in reconstructing the complicated historic record of these nation-states.
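The difference between a simple and a population-weighted average can be sketched as follows. The scores and populations are invented for illustration, and `weighted_mean` is a hypothetical helper, not a function from V-Dem or CoW tooling.

```python
def weighted_mean(values, weights):
    """Weighted mean that skips pairs where either entry is missing (None)."""
    pairs = [(v, w) for v, w in zip(values, weights)
             if v is not None and w is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None
    return sum(v * w for v, w in pairs) / total_weight


# Hypothetical index scores and populations (millions) for a single year.
components = {
    "Serbia": (0.30, 7.0),
    "Montenegro": (0.40, 0.6),
    "Kosovo": (0.20, 1.8),
}

scores = [score for score, _ in components.values()]
populations = [pop for _, pop in components.values()]

simple = sum(scores) / len(scores)
weighted = weighted_mean(scores, populations)
print(f"simple mean: {simple:.3f}, population-weighted mean: {weighted:.3f}")
```

With weights, the aggregate is pulled toward the most populous component instead of treating every unit as equal.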

Visual timeline of significant secessions from the former Yugoslavian state as internationally recognized and coded by the Correlates of War and cshapes project. The top panel depicts Yugoslavia (highlighted teal) in its 1989 unified state. Highlighted countries in remaining panels illustrate countries that gained sovereignty since the previous date.

Austria and Hungary

V-Dem codes Austria and Hungary separately throughout their entire record. Therefore there is no corresponding record for CoW numeric code 300 (Austria-Hungary 1816-1918). The user must create this designation.
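One possible way to create that designation is sketched below: average the Austrian and Hungarian values for each year within the CoW span and assign the result to code 300. The rows and scores are made up, the `build_austria_hungary` helper is hypothetical, and an unweighted mean is only one of several defensible aggregation choices.

```python
from collections import defaultdict

# Illustrative rows: (country_name, year, score). Scores are invented.
vdem_rows = [
    ("Austria", 1900, 0.20), ("Hungary", 1900, 0.10),
    ("Austria", 1918, 0.30), ("Hungary", 1918, 0.20),
    ("Austria", 1919, 0.50), ("Hungary", 1919, 0.25),
]

# CoW code and span for Austria-Hungary, as given in the text above.
AH_CODE, AH_START, AH_END = 300, 1816, 1918


def build_austria_hungary(rows):
    """Return [(300, year, mean_score)] for years where both halves exist."""
    by_year = defaultdict(dict)
    for name, year, score in rows:
        if name in ("Austria", "Hungary") and AH_START <= year <= AH_END:
            by_year[year][name] = score
    combined = []
    for year in sorted(by_year):
        parts = by_year[year]
        if len(parts) == 2:  # require both Austria and Hungary
            mean = (parts["Austria"] + parts["Hungary"]) / 2
            combined.append((AH_CODE, year, mean))
    return combined


austria_hungary = build_austria_hungary(vdem_rows)
for row in austria_hungary:
    print(row)
```

Rows after 1918 are left untouched, so Austria and Hungary can still be matched to their own post-war CoW codes.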

Germany

Germany consists of 3 separate designations in both datasets.

Lastly, as a result of these various configurations, CoW does not track any form of Germany during 1946-1953, and V-Dem does not contain “Unified” or West German observations from 1945-1948.
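Gaps like these are easy to miss, so it can help to scan each country's coverage for missing runs of years. The sketch below assumes a hypothetical list of observed years that mirrors the 1946-1953 gap described above; `coverage_gaps` is an illustrative helper, not part of either dataset's tooling.

```python
def coverage_gaps(years, start, end):
    """Return (first, last) tuples for runs of missing years in [start, end]."""
    present = set(years)
    gaps = []
    run_start = None
    for year in range(start, end + 1):
        if year not in present:
            if run_start is None:
                run_start = year  # a new gap begins
        elif run_start is not None:
            gaps.append((run_start, year - 1))  # the gap ended last year
            run_start = None
    if run_start is not None:
        gaps.append((run_start, end))  # gap runs to the end of the window
    return gaps


# Hypothetical coverage: some German designation is observed through 1945
# and again from 1954 onward.
german_years = list(range(1900, 1946)) + list(range(1954, 2021))
print(coverage_gaps(german_years, 1900, 2020))
```

Applying the same check to the pooled years of all designations for a country shows where neither the unified nor the divided codes provide an observation.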

Czechoslovakia, Czech Republic, and Slovakia

CoW includes designations for Czechoslovakia (315; 1918-1992), Czech Republic (316; 1993-), and Slovakia (317; 1993-), whereas V-Dem observes Czechoslovakia and the Czech Republic on a single code (157; 1918-) with Slovakia (201) diverging from 1939-1945 and again permanently in 1993.
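The mapping described above can be written as a year-aware crosswalk. This is a sketch using only the codes and years stated in the paragraph; the `CROSSWALK` table layout and the `vdem_to_cow` helper are hypothetical.

```python
# (vdem_code, first_year, last_year, cow_code); last_year None = open ended.
CROSSWALK = [
    (157, 1918, 1992, 315),  # Czechoslovakia
    (157, 1993, None, 316),  # Czech Republic
    (201, 1993, None, 317),  # Slovakia
]


def vdem_to_cow(vdem_code, year):
    """Return the CoW code for a V-Dem code in a given year, else None."""
    for code, first, last, cow in CROSSWALK:
        if code == vdem_code and year >= first and (last is None or year <= last):
            return cow
    # No match, e.g. V-Dem Slovakia (201) in 1939-1945, for which the
    # paragraph above lists no CoW designation.
    return None


print(vdem_to_cow(157, 1950))
print(vdem_to_cow(157, 1993))
print(vdem_to_cow(201, 1940))
```

Returning None for unmatched country-years keeps them visible, so the decision to drop or impute them is made explicitly.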

Former Soviet Union

Former members of the Soviet Union are handled similarly in both datasets. Because the USSR exercised centralized control over polity and institutions during this time, V-Dem does not code members independently (in contrast to colonial Africa or the Caribbean). The exceptions are nations with a historical presence prior to annexation or occupation (Lithuania, Latvia, Uzbekistan, etc.). In these cases, states are recorded independently from their pre-WWI or pre-WWII independence, and then again following the dissolution of the USSR: 1990 (V-Dem) or 1991 (CoW).

Remaining Notables

The remaining notable European countries are Poland, Ireland, and Luxembourg.

Middle East and North Africa

Palestine

Along with the various configurations of Yugoslavia, Palestine is another consistent source of frustration when carrying out pre-processing. Many datasets and coding schemes do not recognize Palestine. In fact, CoW does not include a code for Palestine. This complicates interdisciplinary research, because including Palestine in your analysis can severely limit the number of additional datasets you can include without introducing more error through imputation or other fixes. V-Dem includes 3 separate designations for Palestine:

Yemen(s)

CoW includes separate codes for the Yemen Arab Republic/North Yemen (678; 1926-1990), unified modern Yemen (679; 1990-), and Yemen People’s Democratic Republic/South Yemen (680; 1967-1990).

Similar to the Czech Republic, V-Dem tracks historic, North, and modern Yemen on a single code (14; 1789-1850 and 1918-). They provide South Yemen a separate code (23; 1900-1990). V-Dem’s greater historic record for South Yemen includes only the city of Aden and its immediate surroundings from 1900-1963. There is no CoW equivalent for South Yemen/Aden during this time.
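Because two CoW codes (678 and 679) correspond to the single V-Dem code 14, and both spans include 1990, it is worth checking for years claimed by more than one code before building a crosswalk. The sketch below uses the spans quoted above; the `overlapping_years` helper and the 2020 horizon for the open-ended span are assumptions made for the example.

```python
# CoW spans that share V-Dem code 14, taken from the text above.
# None marks an open-ended span.
COW_SPANS_FOR_VDEM_14 = [(678, 1926, 1990), (679, 1990, None)]


def overlapping_years(spans, horizon=2020):
    """Return {year: [codes]} for years covered by more than one span."""
    claimed = {}
    for code, first, last in spans:
        last_year = horizon if last is None else last
        for year in range(first, last_year + 1):
            claimed.setdefault(year, []).append(code)
    return {year: codes for year, codes in claimed.items() if len(codes) > 1}


overlaps = overlapping_years(COW_SPANS_FOR_VDEM_14)
print(overlaps)
```

Any year reported here needs an explicit rule, such as assigning the transition year to the successor state, or a one-to-one merge will produce duplicate rows.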

Visual timeline of internationally recognized sovereign borders of Yemen, North Yemen, and South Yemen. Historic boundaries are determined by the cshapes package. Note that cshapes does not include land area that lacks international independence. Hence, British Aden (South Yemen) and Djibouti are absent from the 1946 (left) panel and appear only once they have gained their respective independence.

Non-MENA Africa

There are not many complicated distinctions throughout Africa. V-Dem codes most African nations independently starting in 1900 while indicating their colonial or protectorate status. Depending on the colonial parent nation, most African countries acquired internationally recognized independence between 1955 and 1975; this is when the CoW record begins. Some notable African countries:

Asia

V-Dem includes a longer historical record for most Asian countries because, similar to Africa and the Caribbean, there was less centralized control exercised over their polities and institutions during colonial rule. As a result, most Asian states are recorded in V-Dem beginning sometime between 1789 and 1900, and in CoW during one of the waves of colonial (1945-1950) or regional (1971-1975) independence. However, there are a few nation-states with more complicated records in V-Dem and CoW:

Korea(s)

CoW tracks Korea on 3 separate numeric codes: Korea (Unified; 730), North Korea (731), and South Korea (732). Conversely, V-Dem assigns a single code for historic/unified Korea and South Korea (42) while placing North Korea on its own code (41). CoW does not list Korea while under Soviet, USA, or Japanese occupation.

Vietnam

Similar to Korea(s), V-Dem places historic, unified Vietnam and post-WWII South Vietnam on a single code. North Vietnam is recognized in 1945 and then absorbs the former South Vietnam in 1976, carrying on to the present. CoW does not recognize colonial or occupied Vietnam(s).

References