Archaeological data as it can be found “in the wild” rarely conforms to the formats suitable for any kind of statistical analysis. This vignette is intended as a way of demonstrating a possibility of data cleaning on a data set as it can frequently be found in archaeological literature, or put together during archaeological research. The data is prepared for use with the datplot-package. An analysis based on this data can be found as a case study in the paper “datplot: A new R-Package for the Visualization of Date Ranges in Archaeology” (Weissova and Steinmann, n.d.) by the authors of this package.

The inscriptions of Bithynia data set, that is included in this package, was compiled for a publication analyzing the settlement patterns and road systems in ancient Bithynia (Weissova 2019). The vignette is thus meant as guidance for archaeologists looking to employ the datplot-package, and finding themselves in need of reformatting their data without much experience in programming languages. It points to certain problems arising with data sets, which are often incarnated in the form of spreadsheets. The Bithynia data set is a typical example of the structure of data as it is used by many (classical) archaeologists. The process of cleaning it highlights solutions for the issues encountered with such spreadsheets, and may also be adapted to other data as well as fields.

The “Inscriptions of Bithynia” data set

The manually curated data that B. Weissova prepared for her dissertation was saved as an Excel-spreadsheet. As this is very often the case, we continue directly from this original file, which is available in this repositories “inst/extdata/”-sub-directory. It can be loaded into R without further complications using the read_excel()-function from the package readxl. Alternatively, the file is suited for storage as a *.tsv-file, which is also provided in the same directory for better long-term support.

inscriptions <- readxl::read_excel(system.file("extdata",
                                               "Bithynia_Inscriptions.xlsx",
                                               package = "datplot",
                                               mustWork = TRUE))
summary(inscriptions)
#>      ikey             Location            Source          Chronological Frame
#>  Length:2878        Length:2878        Length:2878        Length:2878        
#>  Class :character   Class :character   Class :character   Class :character   
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
#>    Language        
#>  Length:2878       
#>  Class :character  
#>  Mode  :character

A large amount of the initial data was provided by the Searchable Greek Inscriptions Tool of the Packard Humanities Institute and the Epigraphische Datenbank Heidelberg. Those inscriptions (n = 2877) can be referenced in the databases with their ikey. The data set was supplemented with 323 inscriptions that were manually gathered from different sources. The relevant citations are to be found in the Source column. Additional information on the creation and curation of this data set can be found in the publication mentioned above (Weissova 2019).

The original file consists of five columns which each row representing a single inscription: ikey contains the reference to the Searchable Greek Inscriptions Tool of the Packard Humanities Institute, indicated via PH as a prefix, or to the Epigraphische Datenbank Heidelberg, indicated via HD as a prefix. Location refers to the find spot of the inscription. Source states the source of the data. Chronological Frame contains the dating in a verbose format, such as “Roman Imperial Period.” Language records the language in which the inscription was written, which can either be Latin, Greek, or both.

Data Preparation, Cleaning and Reformatting

The data set is not yet suited for analysis, as some variables, especially the chronological frame, have many inconsistencies. For further processing, we should also be sure to include an identifier-column. As 331 inscriptions do not have an ikey-Value, which might have otherwise been a good candidate for identification, we chose to create a new automatically generated ID, so that every inscription can be individually identifiable.

inscriptions$ID <- paste("I_", 1:nrow(inscriptions), sep = "")

Two of the variables of this data set are almost ready for further use, i.e. Location and Language. A look at their unique values reveals only small inconsistencies that can be easily fixed:

unique(inscriptions$Location)
#>  [1] "Apamea"                    "Apollonia ad Rhyndacum"   
#>  [3] "Caesarea Germanice"        "Chalcedon"                
#>  [5] "Claudiopolis"              "Creteia Flaviopolis"      
#>  [7] "Daskyleion"                "Heraclea Pontica"         
#>  [9] "Juliopolis"                "Kaisarea Hadrianopolis"   
#> [11] "Nicaea"                    "Nicomedia"                
#> [13] "Nicomedia (Dakibyza)"      "Prusias ad Hypium"        
#> [15] "Prusias ad Mare"           "Prusias ad Mare (Keramed)"
#> [17] "Prusias ad Olympum"        "Pylai"                    
#> [19] "Strobilos"                 "Tios"                     
#> [21] "unknown"                   "unknown (Bolu museum)"    
#> [23] "unknown (Lamounia ?)"
unique(inscriptions$Language)
#> [1] "Greek"    "Gr/Lat"   "Latin"    "Gr / Lat"

Using functions from the tidyverse package family, we can easily transform the columns. We rename the “Chronological Frame” to Dating, as shorter names without spaces are more convenient to work with, and add proper NA-values if there is no chronological assessment. With mutate() and replace() we also clear out the redundant variable values from Location and Language:

inscriptions <- inscriptions %>%
  rename(Dating = `Chronological Frame`) %>%
  mutate(Dating = na_if(Dating, "---"),
         Language = replace(Language, Language == "Gr/Lat", "Greek/Latin"),
         Language = replace(Language, Language == "Gr / Lat", "Greek/Latin"),
         Language = factor(Language, levels = c("Greek", "Latin",
                                                "Greek/Latin")),
         Location = replace(Location, str_detect(Location, "unknown"),
                            "unknown"),
         Location = replace(Location,
                            Location == "Prusias ad Mare (Keramed)",
                            "Prusias ad Mare"),
         Location = factor(Location))

This conversion leaves us with a more compact overview of the data sets contents:

summary(inscriptions)
#>      ikey                         Location      Source         
#>  Length:2878        Nicaea            :761   Length:2878       
#>  Class :character   Nicomedia         :501   Class :character  
#>  Mode  :character   Prusias ad Olympum:344   Mode  :character  
#>                     Claudiopolis      :252                     
#>                     Chalcedon         :200                     
#>                     Prusias ad Hypium :179                     
#>                     (Other)           :641                     
#>     Dating                 Language         ID           
#>  Length:2878        Greek      :2724   Length:2878       
#>  Class :character   Latin      : 125   Class :character  
#>  Mode  :character   Greek/Latin:  29   Mode  :character  
#>                                                          
#>                                                          
#>                                                          
#>

Cleaning up the Dating-variable

Some of the values in the Dating-variable contain question signs, indicating an uncertainty of the chronological frames assessment. To keep this information, we store it in a new variable uncertain_dating, which contains TRUE if there was a question mark in the original assessment and FALSE if the dating was certain, so that if one wishes the additional information could later be used to select or exclude uncertain values.

inscriptions$uncertain_dating <- FALSE
sel <- grep("\\?", inscriptions$Dating)
inscriptions$uncertain_dating[sel] <- TRUE
inscriptions$Dating <- gsub("\\?", "", inscriptions$Dating)

Creating a Concordance for Periods

The next step is sorting out values from the Dating variable that have to be manually entered, such as the “Roman Imperial Period.” We achieve this by excluding all values that contain a number, preparing a table in which we can manually enter the desired dating span and saving it as a .csv-file. After adding the values manually, we reload the file. The manually edited concordance can be found in the “inst/extdata/”-sub-directory of this repository in the file periods_edit.csv and corresponds with the chronological assessment of periods used in the original publication (Weissova 2019, 42).

sel <- grepl("[0-9]", inscriptions$Dating)
periods <- data.frame("Dating" = unique(inscriptions$Dating[which(sel == FALSE)]))
periods$DAT_min <- NA
periods$DAT_max <- NA
#write.csv(periods, file = "../data-raw/periods.csv", fileEncoding = "UTF-8")
# .... Manual editing of the resulting table, saving it as "periods_edit.csv".
join_dating <- read.csv(file = system.file('extdata', 'periods_edit.csv',
                                           package = 'datplot',
                                           mustWork = TRUE),
                        row.names = 1,
                        colClasses = c("character", "character",
                                       "integer", "integer"),
                        encoding = "UTF-8")

It is of course possible to add the corresponding values in R. Since the process could hardly be automated, this way seemed more efficient to us, though it is - sadly - less transparent in this vignette. The values can be examined by the reader in the csv-Table mentioned above, or by loading the table via system.file() as seen in the code chunk.

Reformatting of Partially Numerical Dating Values

There remains, however, a large amount of values that is not covered in this concordance. We can easily automate some of the conversions as a series of steps that we store in another data.frame called num_dating, encompassing all the unique values that contain some form of a numerical dating.

num_dating <- data.frame("Dating" = unique(inscriptions$Dating[which(sel == TRUE)]))
num_dating$DAT_min <- NA
num_dating$DAT_max <- NA

First, there is a number of inscriptions dated to a single year. We select these using a regular expression with grep()¹ and can simply delete the character-part of the values so that only the specified year remains and is stored in both of the DAT_-variables. We do this separately for AD and BC-values since BC needs to be stored as a negative number.

sel <- grep("^[0-9]{1,3} AD$", num_dating$Dating)
num_dating$DAT_min[sel] <- gsub(" AD", "", num_dating$Dating[sel]) 
num_dating$DAT_max[sel] <- gsub(" AD", "", num_dating$Dating[sel]) 
sel <- grep("^[0-9]{1,3} BC$", num_dating$Dating)
num_dating$DAT_min[sel] <- paste("-", gsub(" BC", "", num_dating$Dating[sel]), 
                                 sep = "")
num_dating$DAT_max[sel] <- paste("-", gsub(" BC", "", num_dating$Dating[sel]), 
                                 sep = "")

As a demonstration, this is the resulting table (num_dating) of up to this point converted values:

	Dating	DAT_min	DAT_max
136	196 AD	196	196
215	253 AD	253	253
258	114 AD	114	114
131	223 AD	223	223
98	400 BC	-400	-400
257	142 AD	142	142
214	260 AD	260	260
265	241 BC	-241	-241
12	585 AD	585	585
55	206 AD	206	206

Since we frequently check the values in num_dating to look for errors, we append the finished rows to join_dating, which we later use as a look-up-table for our data set, and remove the finished rows from num_dating.

join_dating <- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
num_dating <- num_dating[which(is.na(num_dating$DAT_min)), ]

We have to convert the Dating variable from factor to character now, so we can use the strsplit()-function on the values:

num_dating$Dating <- as.character(num_dating$Dating)

As some of the values are in the format year-year, e.g. “150-160 AD,” we can easily grab both numbers from the string. To achieve this, we select all the rows containing the relevant format, then loop over each of these rows and split the character string along either “-” or spaces, and later “/,” as we chose to treat values of the format “198/198 AD” as “197 - 198 AD.” Selecting the numerical values from the resulting list according to their position gives us the desired values for DAT_min and DAT_max. We need to do the same for the BC-values to make sure they return as negative numbers.

# Values like: 92-120 AD
sel <- grep("^[0-9]{1,3}–[0-9]{1,3} AD", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = "–| ")
  num_dating$DAT_min[r] <- split[[1]][1]
  num_dating$DAT_max[r] <- split[[1]][2]
}
# Values like: AD 92-120
sel <- grep("^AD [0-9]{1,3}–[0-9]{1,3}$", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = "–| ")
  num_dating$DAT_min[r] <- split[[1]][2]
  num_dating$DAT_max[r] <- split[[1]][3]
}
# Values like: AD 92 - 120
sel <- grep("^AD [0-9]{1,3} - [0-9]{1,3}", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = " - | ")
  num_dating$DAT_min[r] <- split[[1]][2]
  num_dating$DAT_max[r] <- split[[1]][3]
}
# Values like: 198/199 AD
sel <- grep("^[0-9]{1,3}/[0-9]{1,3} AD", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = "/| ")
  num_dating$DAT_min[r] <- split[[1]][1]
  num_dating$DAT_max[r] <- split[[1]][2]
}
# Values like: 525-75 BC
sel <- grep("^[0-9]{1,3}–[0-9]{1,3} BC", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = "–| ")
  num_dating$DAT_min[r] <- 0 - as.numeric(split[[1]][1])
  num_dating$DAT_max[r] <- 0 - as.numeric(split[[1]][2])
}

Another look at the data set can help us to check for possible errors.

	Dating	DAT_min	DAT_max
276	AD 84 - 96	84	96
135	199–210 AD	199	210
218	244–249 AD	244	249
205	98/99 AD	98	99
153	AD 150 - 200	150	200
246	222–235 AD	222	235
133	212–217 AD	212	217
165	363/364 AD	363	364
167	337–340 AD	337	340
37	AD 180–212	180	212

Putting aside the finished values again makes it easier to spot errors in the process:

join_dating <- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
num_dating <- num_dating[which(is.na(num_dating$DAT_min)), ]

Reformatting of Inscriptions Dated to Complete Centuries

Next, we separate values that identify complete centuries. As we want to express the dating in absolute numbers, we convert “1st c. AD” to a time span ranging from 0 to 99, and “1st c. BC” to -99 to 0 respectively. The regular expression selects all values, where a single number in the beginning of the string is followed by two letters (i.e. 2nd, 3rd, 1st) and “c. AD” resp. “c. BC.” We subtract one and multiply the number by 100 to get the respective boundaries, again taking care to treat AD and BC differently.

sel <- grep("^[0-9]{1}[a-z]{2} c\\. AD$", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = "[a-z]{2} c\\.")
  split <- as.numeric(split[[1]][1])
  num_dating$DAT_min[r] <- ((split - 1) * 100)
  num_dating$DAT_max[r] <- ((split - 1) * 100) + 99
}

sel <- grep("^[0-9]{1}[a-z]{2} c\\. BC$", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = "[a-z]{2} c\\.")
  split <- as.numeric(split[[1]][1])
  num_dating$DAT_min[r] <- 0-(split * 100) + 1
  num_dating$DAT_max[r] <- 0-((split - 1) * 100)
}

	Dating	DAT_min	DAT_max
42	6th c. AD	500	599
15	3rd c. BC	-299	-200
18	2nd c. AD	100	199
11	5th c. AD	400	499
16	3rd c. AD	200	299
47	4th c. BC	-399	-300
41	6th c. BC	-599	-500
25	1st c. BC	-99	0
116	4th c. AD	300	399
17	2nd c. BC	-199	-100

Again, putting aside the finished values makes it easier to spot errors in further processing:

join_dating <- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
num_dating <- num_dating[which(is.na(num_dating$DAT_min)), ]

Reformatting of Imprecise Dating Values

For dates that are around a certain value, i.e. of the format “ca. 190 AD,” we are not able to make a more informed decision than guess about what the researchers providing this assessment had in mind. This might also change from inscription to inscription. While a closer look at the individual inscriptions may yield a more sensible estimate, this seems not feasible as part of the data preparation process. It seems that in most cases, if the dating can be as precise as a span of around 10 years, researchers tend to emphasize this. Therefore, we decided to take a total span of 20 years, i.e. 10 years before and 10 years after the mentioned value, reflecting some uncertainty on the precision and duration of the estimate. “ca. 190 AD” thus becomes “180–200 AD,” with the same mechanism for BC in negative values.

sel <- grep("^ca\\. [0-9]{1,3} AD$", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = " ")
  split <- as.numeric(split[[1]][2])
  num_dating$DAT_min[r] <- split - 10
  num_dating$DAT_max[r] <- split + 10
}
sel <- grep("^ca\\. [0-9]{1,3} BC$", num_dating$Dating)
for (r in sel) {
  split <- strsplit(x = num_dating$Dating[r], split = " ")
  split <- 0-as.numeric(split[[1]][2])
  num_dating$DAT_min[r] <- split - 10
  num_dating$DAT_max[r] <- split + 10
}

	Dating	DAT_min	DAT_max
5	ca. 200 AD	190	210
6	ca. 100 AD	90	110
63	ca. 130 AD	120	140
202	ca. 214 AD	204	224
260	ca. 360 BC	-370	-350

Creating a Second Concordance for Verbose Chronological Assessments

Again, saving the finished values in join_dating leaves us with the list of not yet converted values as seen in num_datings Dating-variable.

join_dating <- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
num_dating <- num_dating[which(is.na(num_dating$DAT_min)), ]
unique(num_dating$Dating)[1:20]
#>  [1] "end of the 2nd c. AD"                
#>  [2] "end of 1st c. BC – beg. of 1st c. AD"
#>  [3] "end 2nd c. BC"                       
#>  [4] "early 1st c. AD"                     
#>  [5] "after 212 AD"                        
#>  [6] "AD 531"                              
#>  [7] "5th/6th century AD"                  
#>  [8] "3rd/4th century AD"                  
#>  [9] "2nd – 3rd c. AD"                     
#> [10] "2nd – 1st c. BC"                     
#> [11] "1st – beg. of the 2nd c. AD"         
#> [12] "1st – 2nd c. AD"                     
#> [13] "2nd century AD"                      
#> [14] "late 3rd – early 2nd c. BC"          
#> [15] "5th – 6th c. AD"                     
#> [16] "3rd quarter of the 6th c. BC"        
#> [17] "3rd – 2nd c. BC"                     
#> [18] "340/339 BC"                          
#> [19] "27 BC–14 AD"                         
#> [20] "1st century AD"

Due to the heterogeneous nature of these 99 values, we decided to convert them manually again. Our criteria for translating terms like “beginning of” are the following: “beginning of,” “end of,” “early,” “late” are – similarly to ca. – all translated to 20 years, as we assume that more information would have been given if the time span was greater than a quarter century. For the other values, we employed the same criteria as seen above in the automated process. We decided to measure the ‘beginning of a century’ at e.g. -199 for BC values and 100 for AD values, and accordingly identify the ‘end of a century’ with the values e.g. -100 (BC) / 199 (AD). Since the data is epigraphical, in the case of “before” or “after” (i.e. terminus ante/post quem) dates, we assume some connection to a datable event, and therefore add or subtract a span of 10 years, which is, however, still rather arbitrary.

As the last step, we switch the – in some cases automatically assigned – year 0, so that it will be treated as either 1 or -1:

join_dating$DAT_min[which(join_dating$DAT_min == 0)] <- 1
join_dating$DAT_max[which(join_dating$DAT_max == 0)] <- -1

We then have to reload the corrected data. The values that we put aside in the join_dating data.frame beforehand serve as the basis for our new data.frame that we use as a look-up table. As the variables were treated as character-strings, we need to convert them to a numeric format first. We append the newly loaded manually assigned values to this data.frame. Again, to keep this process as transparent as possible, we included the *.csv-Table in the “inst/extdata/” sub-directory as “num_dating_edit.csv.”

#write.csv(num_dating, file = "../data-raw/num_dating.csv", 
#          fileEncoding = "UTF-8")
num_dating <- read.csv(file = system.file('extdata', 'num_dating_edit.csv',
                                          package = 'datplot', mustWork = TRUE), 
                       encoding = "UTF-8",
                       row.names = 1,
                       colClasses = c("character", "character",
                                      "integer", "integer"))
join_dating <- join_dating %>%
  mutate(DAT_min = as.integer(DAT_min),
         DAT_max = as.integer(DAT_max)) %>%
  rbind(num_dating)

Joining the Reformatted Data with the Original Data Set

left_join() lets us add the DAT_-variables from the look-up table join_dating to our original data.frame.

inscriptions <- left_join(inscriptions, join_dating, by = "Dating")

The DAT_-variables – as we can now see – contain the desired information:

ID	Location	Dating	DAT_min	DAT_max
I_1350	Nicaea	2nd c. AD	100	199
I_2550	Prusias ad Olympum	2nd c. BC	-199	-100
I_388	Claudiopolis	117–138 AD	117	138
I_2713	Prusias ad Olympum	1st – 2nd c. AD	1	199
I_751	Kaisarea Hadrianopolis	6th c. AD	500	599
I_726	Kaisarea Hadrianopolis	mid. 3rd c. AD	240	260
I_1320	Nicaea	2nd c. AD	100	199
I_2616	Prusias ad Olympum	2nd c. AD	100	199
I_1702	Nicomedia	284–305 AD	284	305
I_1012	Nicaea	Roman Imp. period	-31	395
I_2597	Prusias ad Olympum	2nd c. AD	100	199
I_1418	Nicaea	202–205 AD	202	205
I_786	Kaisarea Hadrianopolis	292–305 AD	292	305
I_1117	Nicaea	Byzantine period	395	799
I_938	Nicaea	Roman Imp. period	-31	395

Fixing mistakes

We can now make a ‘test run’ using datplot with the stepsize-value set to 5 in order to save as much time as possible² and check for errors first. We have to select exactly 4 columns in the correct order of: Identifier, Grouping Variable, Minimum Dating, Maximum Dating. We transform the output from the pipe operator to a data.frame before handing it to datstepts(), as datplot needs this format.

inscr_steps <- inscriptions %>%
  select(ID, Location, DAT_min, DAT_max) %>%
  na.omit() %>%
  as.data.frame() %>%
  datplot::datsteps(stepsize = 5)
#> Warning in datplot::datsteps(., stepsize = 5): Warning: Dating seems to be in
#> wrong order at ID I_1162, I_2725 (Index: 637, 1458). Dates have been switched,
#> but be sure to check your original data for possible mistakes.
#> Warning in get.weights(DAT_mat[, "datmin"], DAT_mat[, "datmax"]): Warning:
#> DAT_min and DAT_max at Index: 40, 44, 45, 76, 77, 113, 126, 131, 132, 133, 147,
#> 154, 198, 201, 203, 204, 205, 218, 240, 242, 243, 292, 293, 323, 333, 336, 337,
#> 338, 339, 340, 341, 343, 346, 349, 350, 351, 354, 355, 356, 640, 881, 882, 886,
#> 892, 935, 940, 943, 944, 946, 947, 948, 955, 1024, 1025, 1026, 1031, 1034, 1035,
#> 1036, 1037, 1044, 1049, 1050, 1051, 1060, 1065, 1066, 1117, 1120, 1121, 1126,
#> 1128, 1133, 1154, 1175, 1183, 1255, 1260, 1456, 1480, 1487, 1494, 1498) have the
#> same value! Is this correct? Please check the table for possible errors.
#> Warning in create.sub.objects(DAT_mat, stepsize): stepsize is larger than the
#> range of the closest dated object at Index = 40, 44, 45, 60, 75, 76, 77, 113,
#> 126, 131, 132, 133, 141, 147, 154, 171, 196, 198, 199, 201, 203, 204, 205,
#> 214, 218, 240, 242, 243, 269, 292, 293, 323, 333, 334, 335, 336, 337, 338,
#> 339, 340, 341, 343, 346, 347, 348, 349, 350, 351, 354, 355, 356, 625, 627,
#> 628, 640, 646, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 879,
#> 881, 882, 883, 884, 885, 886, 888, 889, 890, 892, 893, 935, 936, 939, 940, 941,
#> 942, 943, 944, 945, 946, 947, 948, 949, 955, 956, 957, 987, 988, 1016, 1022,
#> 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1031, 1034, 1035, 1036, 1037, 1044,
#> 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1057, 1059, 1060, 1065, 1066,
#> 1106, 1112, 1114, 1115, 1116, 1117, 1120, 1121, 1126, 1128, 1133, 1154, 1172,
#> 1174, 1175, 1183, 1185, 1188, 1189, 1255, 1256, 1260, 1456, 1480, 1487, 1494,
#> 1498, 1499, 1500, 1502, 1514, 1515, 1516). For information see documentation of
#> get.step.sequence().

There are indeed problems in the data. The first of the three warnings issued by datsteps() informs us that we may have assigned some values wrong: “Warning: Dating seems to be in wrong order at ID I_1162, I_2725 (Index: 637, 1458). Dates have been switched, but be sure to check your original data for possible mistakes.” If dates are in the wrong order, datplot will automatically switch them before proceeding. This, however, might not always be the correct way of handling the situation, as other errors might have occurred as well. Therefore, we should check the data again using the Index-Values or IDs provided by datplots warning:

inscriptions %>%
  select(ID, Location, Dating, uncertain_dating, DAT_min, DAT_max) %>%
  na.omit() %>%
  slice(637, 1458) %>%
  kable()

ID	Location	Dating	uncertain_dating	DAT_min	DAT_max
I_1162	Nicaea	62/3 AD	FALSE	62	3
I_2725	Prusias ad Olympum	180–192 BC	FALSE	-180	-192

The problem here is twofold. In the first entry, our automated translation of the dating could not handle the format “62/3” and thus returned a wrong value for DAT_max. If we had a truly large data set, we should correct this in the original reformatting process. In this case, it is more efficient to fix this problem right now in the fastest possible way.

inscriptions[which(inscriptions$ID == "I_1162"),"DAT_max"] <- 63

In the other case, the Dating-column provided the BC-values in the wrong order. We can also fix this quickly right here, which is basically the same way datplot would handle both cases internally:

inscriptions[which(inscriptions$ID == "I_2725"),c("DAT_min", "DAT_max")] <-
  inscriptions[which(inscriptions$ID == "I_2725"),c("DAT_max", "DAT_min")]

Note that datsteps() changes the date and proceeds, but warns you of the possibility of errors. It is therefore recommended to check the data mentioned in that warning. As we ‘fixed’ our data set, we should save it again, so that the same error will not occur twice:

#write.table(inscriptions, file = "../data-raw/inscriptions.csv", 
#            fileEncoding = "UTF-8", sep = ";", row.names = FALSE)

Storing the Prepared Data Set

For later use and publication, we add explanations of the variables and metadata to the R-object and store them in the *.rda file. As a backup, we also save the finished table as a .csv-file, suitable for archiving. The .csv-Table, however, does not contain the additional information added as attributes.

All files are available in this repository and as part of this package for further use by other researchers. We kindly ask you to cite Weissova (2019), the Packard Humanities Institute and this repository at https://github.com/lsteinmann/datplot as sources.

The dataset can be loaded into R by simply calling

#library(datplot)
data("Inscr_Bithynia")

when datplot is loaded.

Selecting the Data for Further Analysis and using datplot

As our aim is to get an overview of the spatio-temporal distribution of the inscriptions in Bithynia, we can only analyze inscriptions with known Location and Dating. Thus, we have to remove all the rows from the data set that do not contain this information.

inscr_clean <- Inscr_Bithynia %>%
  filter(Dating != "NA",
         Location != "unknown") %>%
  droplevels()

This means that we removed a total of 1380 rows, which did not contain the information needed. The data set suitable for the analysis can be summarized as follows:

summary(inscr_clean)
#>       ID                ikey                             Location  
#>  Length:1498        Length:1498        Nicaea                :604  
#>  Class :character   Class :character   Prusias ad Olympum    :279  
#>  Mode  :character   Mode  :character   Nicomedia             :108  
#>                                        Kaisarea Hadrianopolis: 87  
#>                                        Apamea                : 77  
#>                                        Chalcedon             : 69  
#>                                        (Other)               :274  
#>     Source             Dating                 Language    uncertain_dating
#>  Length:1498        Length:1498        Greek      :1392   Mode :logical   
#>  Class :character   Class :character   Latin      :  82   FALSE:1327      
#>  Mode  :character   Mode  :character   Greek/Latin:  24   TRUE :171       
#>                                                                           
#>                                                                           
#>                                                                           
#>                                                                           
#>     DAT_min          DAT_max           URL           
#>  Min.   :-599.0   Min.   :-525.0   Length:1498       
#>  1st Qu.: -31.0   1st Qu.: 199.0   Class :character  
#>  Median : 100.0   Median : 266.5   Mode  :character  
#>  Mean   : 100.3   Mean   : 286.4                     
#>  3rd Qu.: 200.0   3rd Qu.: 395.0                     
#>  Max.   : 680.0   Max.   : 799.0                     
#>

datplot

We can now begin using the datplot-package…

library(datplot)

…and can actually try to use datsteps(), first with a stepsize of 25 years:

inscr_steps <- inscr_clean %>%
  select(ID, Location, DAT_min, DAT_max) %>%
  as.data.frame() %>%
  datplot::datsteps(stepsize = 25) %>%
  datplot::scaleweight(var = 2)
#> Warning in get.weights(DAT_mat[, "datmin"], DAT_mat[, "datmax"]): Warning:
#> DAT_min and DAT_max at Index: 40, 44, 45, 76, 77, 113, 126, 131, 132, 133, 147,
#> 154, 198, 201, 203, 204, 205, 218, 240, 242, 243, 292, 293, 323, 333, 336, 337,
#> 338, 339, 340, 341, 343, 346, 349, 350, 351, 354, 355, 356, 640, 881, 882, 886,
#> 892, 935, 940, 943, 944, 946, 947, 948, 955, 1024, 1025, 1026, 1031, 1034, 1035,
#> 1036, 1037, 1044, 1049, 1050, 1051, 1060, 1065, 1066, 1117, 1120, 1121, 1126,
#> 1128, 1133, 1154, 1175, 1183, 1255, 1260, 1456, 1480, 1487, 1494) have the same
#> value! Is this correct? Please check the table for possible errors.
#> Warning in create.sub.objects(DAT_mat, stepsize): stepsize is larger than the
#> range of the closest dated object at Index = 6, 12, 14, 18, 19, 20, 21, 22, 39,
#> 40, 41, 44, 45, 60, 74, 75, 76, 77, 79, 80, 83, 110, 113, 114, 125, 126, 131,
#> 132, 133, 141, 145, 147, 154, 161, 162, 163, 167, 168, 171, 173, 174, 175, 194,
#> 196, 198, 199, 201, 203, 204, 205, 206, 207, 208, 212, 214, 217, 218, 223, 231,
#> 232, 233, 234, 235, 236, 237, 238, 240, 242, 243, 244, 245, 246, 265, 269, 271,
#> 272, 273, 276, 292, 293, 323, 324, 331, 332, 333, 334, 335, 336, 337, 338, 339,
#> 340, 341, 342, 343, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356,
#> 579, 582, 583, 584, 585, 617, 618, 619, 620, 621, 622, 623, 625, 627, 628, 629,
#> 630, 631, 632, 633, 634, 635, 637, 640, 646, 723, 724, 725, 726, 727, 728, 729,
#> 730, 731, 732, 733, 734, 735, 736, 873, 874, 875, 876, 877, 878, 879, 880, 881,
#> 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 935, 936, 938, 939,
#> 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955,
#> 956, 957, 958, 959, 975, 976, 979, 980, 981, 982, 983, 984, 987, 988, 989, 1002,
#> 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028,
#> 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1044, 1045, 1046,
#> 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1057, 1058, 1059, 1060,
#> 1061, 1062, 1063, 1064, 1065, 1066, 1074, 1075, 1078, 1079, 1080, 1081, 1082,
#> 1083, 1084, 1085, 1086, 1087, 1106, 1107, 1112, 1113, 1114, 1115, 1116, 1117,
#> 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130,
#> 1131, 1132, 1133, 1147, 1148, 1149, 1153, 1154, 1172, 1174, 1175, 1176, 1183,
#> 1185, 1186, 1187, 1188, 1189, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1248,
#> 1250, 1251, 1252, 1253, 1255, 1256, 1260, 1382, 1456, 1457, 1458, 1459, 1460,
#> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1480, 1487, 1494, 1496, 1497,
#> 1498). For information see documentation of get.step.sequence().

datsteps() tells us that a number or objects are dated to one year, asking us if this is correct. We included this output to avoid errors occurring through faulty values, as objects dated to one year have a very high impact on the outcome. It might help to spot problems or discrepancies in datasets as well, as the warning about the order of dating above has already shown.

In a second step, we scale the weights according to our grouping variable (Location), so that the sum of weights in each group equals 1, which is important for displaying the weight correctly in a density plot.

Visualizing the Output of `datsteps()`

To get a general impression of the data set and the possibilities of visualization, we explore and recommend different plot methods below. The simplest and fastest way to plot the output with base R is the plot() function in combination with the density() function:

plot(density(inscr_steps$DAT_step))

Using `ggplot2`

A somewhat crowded overview containing the Locations of all Inscriptions can easily be achieved when using ggplot2 and its geom_density()-method, which is based on the same procedure as the density()-function used above. A result may look like this:

ggplot(data = inscr_steps, aes(x = DAT_step, fill = variable,
                               weight = weight)) +
  geom_density(alpha = 0.3)

Note that the output without the weights calculated by datsteps() is very different:

ggplot(data = inscr_steps, aes(x = DAT_step, fill = variable)) +
  geom_density(alpha = 0.3)

The weights calculated by datplot (internally using the get.weights()-function) are a way of assigning an importance to the objects in question. They relate to the range an object is dated to (more detailed information on the process is available in (Weissova and Steinmann, n.d.)). Therefore, any objects dated to large time spans as for example “Roman Imperial Period” ranging from 31 BC to 395 AD in our data set, are contributing to the curve significantly less than any object dated to one exact year. At times it can be very useful to look at both outputs and discuss them separately. We will look at a case study of a single city later on to clarify this.

Using `ggplot2` and `ggridges`

Since the graph is very crowded in this layout, we actually recommend using geom_density_ridges from the ggridges-package. Since the package has no built in support for assigning weight, we have to use stat = "density", which will reference the density()-function to get the appropriate calculations:

ggplot(data = inscr_steps,
       aes(x = DAT_step,
           y = fct_rev(as_factor(variable)),
           fill = variable,
           weight = weight)) +
  geom_density_ridges(aes(height = ..density..),
                      stat = "density", alpha = 0.9) +
  scale_fill_discrete(guide = FALSE)

Styling will help to make the plots more readable:

bluegreen <- colorRampPalette(c("#8dae25", "#17365c"))

ggplot(data = inscr_steps,
       aes(x = DAT_step,
           y = fct_rev(as_factor(variable)),
           fill = variable,
           weight = weight)) +
  geom_density_ridges(aes(height = ..density..), stat = "density", alpha = 0.9) +
  scale_x_continuous(breaks = seq(from = -800, to = 800, by = 100),
                     limits = c(-800,800), name = "") +
  geom_vline(xintercept = 0, alpha = 0.5, lwd = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(linetype = "dashed",
                                          color = "gray30"),
        panel.grid.minor.x = element_line(linetype = "dotted",
                                          color = "gray80")) +
  scale_fill_manual(guide=FALSE,
                    values = bluegreen(length(unique(inscr_steps$variable)))) +
  labs(title = "Epigraphic Evidence from Bithynia",
       subtitle = "Spatio-temporal distribution",
       y = "administrative centres",
       caption = attributes(inscriptions)$source)

Using `ggplot2` and `facet_wrap()`

Another option is to separate the variables with facet_wrap, which is not as condensed and does not support the option to scale the densities in the graph for maximum visibility, as geom_density_ridges() automatically does (see Documentation of ggridges). In our case, this leads to a worse visibility of smaller density curves:

ggplot(data = inscr_steps, aes(x = DAT_step, 
                               fill = variable, weight = weight)) +
  geom_density(alpha = 0.9) +
  scale_fill_discrete(guide = FALSE) +
  facet_wrap(variable ~ ., ncol = 1)

Styling also helps very much to improve the graphs readability:

ggplot(data = inscr_steps, aes(x = DAT_step,
                               fill = variable, weight = weight)) +
  geom_density(alpha = 0.9) +
  theme(panel.background = element_blank()) +
  scale_fill_manual(guide=FALSE,
                    values = bluegreen(length(unique(inscr_steps$variable)))) +
  scale_x_continuous(breaks = seq(from = -800, to = 800, by = 100),
                     limits = c(-800,800), name = "") +
  facet_wrap(variable ~ ., ncol = 1) +
  theme(strip.text.x = element_text(size=8),
        strip.background = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(linetype = "dashed",
                                          color = "gray30"),
        panel.grid.minor.x = element_line(linetype = "dotted",
                                          color = "gray80")) +
  labs(title = "Epigraphic Evidence from Bithynia",
       subtitle = "Spatio-temporal distribution",
       caption = attributes(inscriptions)$source)

For information and guidance on regular expressions see e.g. https://regex101.com/. However, R has a slightly different handling of escape characters. Information can be found in the R-Documentation under ?regex or here.↩︎
Using different stepsizes is a bit of an artifact left over from a less efficient version of datplot, where is processing would take more than a minute if stepsize was set to 1, so I would usually start out with 100 or 25 just to look for errors and problems and to not lose much time. This, however, is technically not necessary in any way. The function now works faster, though there is a lot of room for improvement, as it may still take several seconds or longer, but generally not several minutes. We determined that the feature might still be useful, and therefore it was not removed.↩︎

Data Preparation and Visualization

The “Inscriptions of Bithynia” data set

Data Preparation, Cleaning and Reformatting

Cleaning up the Dating-variable

Creating a Concordance for Periods

Reformatting of Partially Numerical Dating Values

Reformatting of Inscriptions Dated to Complete Centuries

Reformatting of Imprecise Dating Values

Creating a Second Concordance for Verbose Chronological Assessments

Joining the Reformatted Data with the Original Data Set

Fixing mistakes

Storing the Prepared Data Set

Selecting the Data for Further Analysis and using datplot

datplot

Visualizing the Output of `datsteps()`

Using `ggplot2`

Using `ggplot2` and `ggridges`

Using `ggplot2` and `facet_wrap()`

How Does Datplot and Kernel Density Estimation Perform Compared to Histograms?

References

Data Preparation and Visualization

The “Inscriptions of Bithynia” data set

Data Preparation, Cleaning and Reformatting

Cleaning up the Dating-variable

Creating a Concordance for Periods

Reformatting of Partially Numerical Dating Values

Reformatting of Inscriptions Dated to Complete Centuries

Reformatting of Imprecise Dating Values

Creating a Second Concordance for Verbose Chronological Assessments

Joining the Reformatted Data with the Original Data Set

Fixing mistakes

Storing the Prepared Data Set

Selecting the Data for Further Analysis and using datplot

datplot

Visualizing the Output of datsteps()

Using ggplot2

Using ggplot2 and ggridges

Using ggplot2 and facet_wrap()

How Does Datplot and Kernel Density Estimation Perform Compared to Histograms?

References

Visualizing the Output of `datsteps()`

Using `ggplot2`

Using `ggplot2` and `ggridges`

Using `ggplot2` and `facet_wrap()`