Archaeological data as it can be found “in the wild” rarely conforms to the formats suitable for any kind of statistical analysis. This vignette is intended as a way of demonstrating a possibility of data cleaning on a data set as it can frequently be found in archaeological literature, or put together during archaeological research. The data is prepared for use with the datplot
-package. An analysis based on this data can be found as a case study in the paper “datplot: A new R-Package for the Visualization of Date Ranges in Archaeology” (Weissova and Steinmann, n.d.) by the authors of this package.
The inscriptions of Bithynia data set, that is included in this package, was compiled for a publication analyzing the settlement patterns and road systems in ancient Bithynia (Weissova 2019). The vignette is thus meant as guidance for archaeologists looking to employ the datplot-package, and finding themselves in need of reformatting their data without much experience in programming languages. It points to certain problems arising with data sets, which are often incarnated in the form of spreadsheets. The Bithynia data set is a typical example of the structure of data as it is used by many (classical) archaeologists. The process of cleaning it highlights solutions for the issues encountered with such spreadsheets, and may also be adapted to other data as well as fields.
First, we attach the packages to be used in the vignette:
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.4
library(stringr)
library(forcats)
#> Warning: package 'forcats' was built under R version 4.0.4
library(ggplot2)
library(knitr)
library(datplot)
library(ggridges)
library(reshape2)
The manually curated data that B. Weissova prepared for her dissertation was saved as an Excel-spreadsheet. As this is very often the case, we continue directly from this original file, which is available in this repositories “inst/extdata/”-sub-directory. It can be loaded into R without further complications using the read_excel()
-function from the package readxl
. Alternatively, the file is suited for storage as a *.tsv-file, which is also provided in the same directory for better long-term support.
<- readxl::read_excel(system.file("extdata",
inscriptions "Bithynia_Inscriptions.xlsx",
package = "datplot",
mustWork = TRUE))
summary(inscriptions)
#> ikey Location Source Chronological Frame
#> Length:2878 Length:2878 Length:2878 Length:2878
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#> Language
#> Length:2878
#> Class :character
#> Mode :character
A large amount of the initial data was provided by the Searchable Greek Inscriptions Tool of the Packard Humanities Institute and the Epigraphische Datenbank Heidelberg. Those inscriptions (n = 2877) can be referenced in the databases with their ikey
. The data set was supplemented with 323 inscriptions that were manually gathered from different sources. The relevant citations are to be found in the Source
column. Additional information on the creation and curation of this data set can be found in the publication mentioned above (Weissova 2019).
The original file consists of five columns which each row representing a single inscription: ikey
contains the reference to the Searchable Greek Inscriptions Tool of the Packard Humanities Institute, indicated via PH
as a prefix, or to the Epigraphische Datenbank Heidelberg, indicated via HD
as a prefix. Location
refers to the find spot of the inscription. Source
states the source of the data. Chronological Frame
contains the dating in a verbose format, such as “Roman Imperial Period.” Language
records the language in which the inscription was written, which can either be Latin, Greek, or both.
The data set is not yet suited for analysis, as some variables, especially the chronological frame, have many inconsistencies. For further processing, we should also be sure to include an identifier-column. As 331 inscriptions do not have an ikey-Value, which might have otherwise been a good candidate for identification, we chose to create a new automatically generated ID, so that every inscription can be individually identifiable.
$ID <- paste("I_", 1:nrow(inscriptions), sep = "") inscriptions
Two of the variables of this data set are almost ready for further use, i.e. Location and Language. A look at their unique values reveals only small inconsistencies that can be easily fixed:
unique(inscriptions$Location)
#> [1] "Apamea" "Apollonia ad Rhyndacum"
#> [3] "Caesarea Germanice" "Chalcedon"
#> [5] "Claudiopolis" "Creteia Flaviopolis"
#> [7] "Daskyleion" "Heraclea Pontica"
#> [9] "Juliopolis" "Kaisarea Hadrianopolis"
#> [11] "Nicaea" "Nicomedia"
#> [13] "Nicomedia (Dakibyza)" "Prusias ad Hypium"
#> [15] "Prusias ad Mare" "Prusias ad Mare (Keramed)"
#> [17] "Prusias ad Olympum" "Pylai"
#> [19] "Strobilos" "Tios"
#> [21] "unknown" "unknown (Bolu museum)"
#> [23] "unknown (Lamounia ?)"
unique(inscriptions$Language)
#> [1] "Greek" "Gr/Lat" "Latin" "Gr / Lat"
Using functions from the tidyverse
package family, we can easily transform the columns. We rename the “Chronological Frame” to Dating
, as shorter names without spaces are more convenient to work with, and add proper NA-values if there is no chronological assessment. With mutate()
and replace()
we also clear out the redundant variable values from Location
and Language
:
<- inscriptions %>%
inscriptions rename(Dating = `Chronological Frame`) %>%
mutate(Dating = na_if(Dating, "---"),
Language = replace(Language, Language == "Gr/Lat", "Greek/Latin"),
Language = replace(Language, Language == "Gr / Lat", "Greek/Latin"),
Language = factor(Language, levels = c("Greek", "Latin",
"Greek/Latin")),
Location = replace(Location, str_detect(Location, "unknown"),
"unknown"),
Location = replace(Location,
== "Prusias ad Mare (Keramed)",
Location "Prusias ad Mare"),
Location = factor(Location))
This conversion leaves us with a more compact overview of the data sets contents:
summary(inscriptions)
#> ikey Location Source
#> Length:2878 Nicaea :761 Length:2878
#> Class :character Nicomedia :501 Class :character
#> Mode :character Prusias ad Olympum:344 Mode :character
#> Claudiopolis :252
#> Chalcedon :200
#> Prusias ad Hypium :179
#> (Other) :641
#> Dating Language ID
#> Length:2878 Greek :2724 Length:2878
#> Class :character Latin : 125 Class :character
#> Mode :character Greek/Latin: 29 Mode :character
#>
#>
#>
#>
Some of the values in the Dating
-variable contain question signs, indicating an uncertainty of the chronological frames assessment. To keep this information, we store it in a new variable uncertain_dating
, which contains TRUE
if there was a question mark in the original assessment and FALSE
if the dating was certain, so that if one wishes the additional information could later be used to select or exclude uncertain values.
$uncertain_dating <- FALSE
inscriptions<- grep("\\?", inscriptions$Dating)
sel $uncertain_dating[sel] <- TRUE
inscriptions$Dating <- gsub("\\?", "", inscriptions$Dating) inscriptions
The next step is sorting out values from the Dating
variable that have to be manually entered, such as the “Roman Imperial Period.” We achieve this by excluding all values that contain a number, preparing a table in which we can manually enter the desired dating span and saving it as a .csv-file. After adding the values manually, we reload the file. The manually edited concordance can be found in the “inst/extdata/”-sub-directory of this repository in the file periods_edit.csv
and corresponds with the chronological assessment of periods used in the original publication (Weissova 2019, 42).
<- grepl("[0-9]", inscriptions$Dating)
sel <- data.frame("Dating" = unique(inscriptions$Dating[which(sel == FALSE)]))
periods $DAT_min <- NA
periods$DAT_max <- NA
periods#write.csv(periods, file = "../data-raw/periods.csv", fileEncoding = "UTF-8")
# .... Manual editing of the resulting table, saving it as "periods_edit.csv".
<- read.csv(file = system.file('extdata', 'periods_edit.csv',
join_dating package = 'datplot',
mustWork = TRUE),
row.names = 1,
colClasses = c("character", "character",
"integer", "integer"),
encoding = "UTF-8")
It is of course possible to add the corresponding values in R. Since the process could hardly be automated, this way seemed more efficient to us, though it is - sadly - less transparent in this vignette. The values can be examined by the reader in the csv-Table mentioned above, or by loading the table via system.file()
as seen in the code chunk.
There remains, however, a large amount of values that is not covered in this concordance. We can easily automate some of the conversions as a series of steps that we store in another data.frame
called num_dating
, encompassing all the unique values that contain some form of a numerical dating.
<- data.frame("Dating" = unique(inscriptions$Dating[which(sel == TRUE)]))
num_dating $DAT_min <- NA
num_dating$DAT_max <- NA num_dating
First, there is a number of inscriptions dated to a single year. We select these using a regular expression with grep()
1 and can simply delete the character-part of the values so that only the specified year remains and is stored in both of the DAT_
-variables. We do this separately for AD and BC-values since BC needs to be stored as a negative number.
<- grep("^[0-9]{1,3} AD$", num_dating$Dating)
sel $DAT_min[sel] <- gsub(" AD", "", num_dating$Dating[sel])
num_dating$DAT_max[sel] <- gsub(" AD", "", num_dating$Dating[sel])
num_dating<- grep("^[0-9]{1,3} BC$", num_dating$Dating)
sel $DAT_min[sel] <- paste("-", gsub(" BC", "", num_dating$Dating[sel]),
num_datingsep = "")
$DAT_max[sel] <- paste("-", gsub(" BC", "", num_dating$Dating[sel]),
num_datingsep = "")
As a demonstration, this is the resulting table (num_dating
) of up to this point converted values:
Dating | DAT_min | DAT_max | |
---|---|---|---|
136 | 196 AD | 196 | 196 |
215 | 253 AD | 253 | 253 |
258 | 114 AD | 114 | 114 |
131 | 223 AD | 223 | 223 |
98 | 400 BC | -400 | -400 |
257 | 142 AD | 142 | 142 |
214 | 260 AD | 260 | 260 |
265 | 241 BC | -241 | -241 |
12 | 585 AD | 585 | 585 |
55 | 206 AD | 206 | 206 |
Since we frequently check the values in num_dating
to look for errors, we append the finished rows to join_dating
, which we later use as a look-up-table for our data set, and remove the finished rows from num_dating
.
<- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
join_dating <- num_dating[which(is.na(num_dating$DAT_min)), ] num_dating
We have to convert the Dating
variable from factor
to character
now, so we can use the strsplit()
-function on the values:
$Dating <- as.character(num_dating$Dating) num_dating
As some of the values are in the format year-year, e.g. “150-160 AD,” we can easily grab both numbers from the string. To achieve this, we select all the rows containing the relevant format, then loop over each of these rows and split the character string along either “-” or spaces, and later “/,” as we chose to treat values of the format “198/198 AD” as “197 - 198 AD.” Selecting the numerical values from the resulting list according to their position gives us the desired values for DAT_min
and DAT_max
. We need to do the same for the BC-values to make sure they return as negative numbers.
# Values like: 92-120 AD
<- grep("^[0-9]{1,3}–[0-9]{1,3} AD", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = "–| ")
split $DAT_min[r] <- split[[1]][1]
num_dating$DAT_max[r] <- split[[1]][2]
num_dating
}# Values like: AD 92-120
<- grep("^AD [0-9]{1,3}–[0-9]{1,3}$", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = "–| ")
split $DAT_min[r] <- split[[1]][2]
num_dating$DAT_max[r] <- split[[1]][3]
num_dating
}# Values like: AD 92 - 120
<- grep("^AD [0-9]{1,3} - [0-9]{1,3}", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = " - | ")
split $DAT_min[r] <- split[[1]][2]
num_dating$DAT_max[r] <- split[[1]][3]
num_dating
}# Values like: 198/199 AD
<- grep("^[0-9]{1,3}/[0-9]{1,3} AD", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = "/| ")
split $DAT_min[r] <- split[[1]][1]
num_dating$DAT_max[r] <- split[[1]][2]
num_dating
}# Values like: 525-75 BC
<- grep("^[0-9]{1,3}–[0-9]{1,3} BC", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = "–| ")
split $DAT_min[r] <- 0 - as.numeric(split[[1]][1])
num_dating$DAT_max[r] <- 0 - as.numeric(split[[1]][2])
num_dating }
Another look at the data set can help us to check for possible errors.
Dating | DAT_min | DAT_max | |
---|---|---|---|
276 | AD 84 - 96 | 84 | 96 |
135 | 199–210 AD | 199 | 210 |
218 | 244–249 AD | 244 | 249 |
205 | 98/99 AD | 98 | 99 |
153 | AD 150 - 200 | 150 | 200 |
246 | 222–235 AD | 222 | 235 |
133 | 212–217 AD | 212 | 217 |
165 | 363/364 AD | 363 | 364 |
167 | 337–340 AD | 337 | 340 |
37 | AD 180–212 | 180 | 212 |
Putting aside the finished values again makes it easier to spot errors in the process:
<- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
join_dating <- num_dating[which(is.na(num_dating$DAT_min)), ] num_dating
Next, we separate values that identify complete centuries. As we want to express the dating in absolute numbers, we convert “1st c. AD” to a time span ranging from 0 to 99, and “1st c. BC” to -99 to 0 respectively. The regular expression selects all values, where a single number in the beginning of the string is followed by two letters (i.e. 2nd, 3rd, 1st) and “c. AD” resp. “c. BC.” We subtract one and multiply the number by 100 to get the respective boundaries, again taking care to treat AD and BC differently.
<- grep("^[0-9]{1}[a-z]{2} c\\. AD$", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = "[a-z]{2} c\\.")
split <- as.numeric(split[[1]][1])
split $DAT_min[r] <- ((split - 1) * 100)
num_dating$DAT_max[r] <- ((split - 1) * 100) + 99
num_dating
}
<- grep("^[0-9]{1}[a-z]{2} c\\. BC$", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = "[a-z]{2} c\\.")
split <- as.numeric(split[[1]][1])
split $DAT_min[r] <- 0-(split * 100) + 1
num_dating$DAT_max[r] <- 0-((split - 1) * 100)
num_dating }
Dating | DAT_min | DAT_max | |
---|---|---|---|
42 | 6th c. AD | 500 | 599 |
15 | 3rd c. BC | -299 | -200 |
18 | 2nd c. AD | 100 | 199 |
11 | 5th c. AD | 400 | 499 |
16 | 3rd c. AD | 200 | 299 |
47 | 4th c. BC | -399 | -300 |
41 | 6th c. BC | -599 | -500 |
25 | 1st c. BC | -99 | 0 |
116 | 4th c. AD | 300 | 399 |
17 | 2nd c. BC | -199 | -100 |
Again, putting aside the finished values makes it easier to spot errors in further processing:
<- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
join_dating <- num_dating[which(is.na(num_dating$DAT_min)), ] num_dating
For dates that are around a certain value, i.e. of the format “ca. 190 AD,” we are not able to make a more informed decision than guess about what the researchers providing this assessment had in mind. This might also change from inscription to inscription. While a closer look at the individual inscriptions may yield a more sensible estimate, this seems not feasible as part of the data preparation process. It seems that in most cases, if the dating can be as precise as a span of around 10 years, researchers tend to emphasize this. Therefore, we decided to take a total span of 20 years, i.e. 10 years before and 10 years after the mentioned value, reflecting some uncertainty on the precision and duration of the estimate. “ca. 190 AD” thus becomes “180–200 AD,” with the same mechanism for BC in negative values.
<- grep("^ca\\. [0-9]{1,3} AD$", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = " ")
split <- as.numeric(split[[1]][2])
split $DAT_min[r] <- split - 10
num_dating$DAT_max[r] <- split + 10
num_dating
}<- grep("^ca\\. [0-9]{1,3} BC$", num_dating$Dating)
sel for (r in sel) {
<- strsplit(x = num_dating$Dating[r], split = " ")
split <- 0-as.numeric(split[[1]][2])
split $DAT_min[r] <- split - 10
num_dating$DAT_max[r] <- split + 10
num_dating }
Dating | DAT_min | DAT_max | |
---|---|---|---|
5 | ca. 200 AD | 190 | 210 |
6 | ca. 100 AD | 90 | 110 |
63 | ca. 130 AD | 120 | 140 |
202 | ca. 214 AD | 204 | 224 |
260 | ca. 360 BC | -370 | -350 |
Again, saving the finished values in join_dating
leaves us with the list of not yet converted values as seen in num_dating
s Dating
-variable.
<- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
join_dating <- num_dating[which(is.na(num_dating$DAT_min)), ]
num_dating unique(num_dating$Dating)[1:20]
#> [1] "end of the 2nd c. AD"
#> [2] "end of 1st c. BC – beg. of 1st c. AD"
#> [3] "end 2nd c. BC"
#> [4] "early 1st c. AD"
#> [5] "after 212 AD"
#> [6] "AD 531"
#> [7] "5th/6th century AD"
#> [8] "3rd/4th century AD"
#> [9] "2nd – 3rd c. AD"
#> [10] "2nd – 1st c. BC"
#> [11] "1st – beg. of the 2nd c. AD"
#> [12] "1st – 2nd c. AD"
#> [13] "2nd century AD"
#> [14] "late 3rd – early 2nd c. BC"
#> [15] "5th – 6th c. AD"
#> [16] "3rd quarter of the 6th c. BC"
#> [17] "3rd – 2nd c. BC"
#> [18] "340/339 BC"
#> [19] "27 BC–14 AD"
#> [20] "1st century AD"
Due to the heterogeneous nature of these 99 values, we decided to convert them manually again. Our criteria for translating terms like “beginning of” are the following: “beginning of,” “end of,” “early,” “late” are – similarly to ca. – all translated to 20 years, as we assume that more information would have been given if the time span was greater than a quarter century. For the other values, we employed the same criteria as seen above in the automated process. We decided to measure the ‘beginning of a century’ at e.g. -199 for BC values and 100 for AD values, and accordingly identify the ‘end of a century’ with the values e.g. -100 (BC) / 199 (AD). Since the data is epigraphical, in the case of “before” or “after” (i.e. terminus ante/post quem) dates, we assume some connection to a datable event, and therefore add or subtract a span of 10 years, which is, however, still rather arbitrary.
As the last step, we switch the – in some cases automatically assigned – year 0, so that it will be treated as either 1 or -1:
$DAT_min[which(join_dating$DAT_min == 0)] <- 1
join_dating$DAT_max[which(join_dating$DAT_max == 0)] <- -1 join_dating
We then have to reload the corrected data. The values that we put aside in the join_dating
data.frame
beforehand serve as the basis for our new data.frame
that we use as a look-up table. As the variables were treated as character-strings, we need to convert them to a numeric format first. We append the newly loaded manually assigned values to this data.frame. Again, to keep this process as transparent as possible, we included the *.csv-Table in the “inst/extdata/” sub-directory as “num_dating_edit.csv.”
#write.csv(num_dating, file = "../data-raw/num_dating.csv",
# fileEncoding = "UTF-8")
<- read.csv(file = system.file('extdata', 'num_dating_edit.csv',
num_dating package = 'datplot', mustWork = TRUE),
encoding = "UTF-8",
row.names = 1,
colClasses = c("character", "character",
"integer", "integer"))
<- join_dating %>%
join_dating mutate(DAT_min = as.integer(DAT_min),
DAT_max = as.integer(DAT_max)) %>%
rbind(num_dating)
left_join()
lets us add the DAT_
-variables from the look-up table join_dating
to our original data.frame.
<- left_join(inscriptions, join_dating, by = "Dating") inscriptions
The DAT_
-variables – as we can now see – contain the desired information:
ID | Location | Dating | DAT_min | DAT_max |
---|---|---|---|---|
I_1350 | Nicaea | 2nd c. AD | 100 | 199 |
I_2550 | Prusias ad Olympum | 2nd c. BC | -199 | -100 |
I_388 | Claudiopolis | 117–138 AD | 117 | 138 |
I_2713 | Prusias ad Olympum | 1st – 2nd c. AD | 1 | 199 |
I_751 | Kaisarea Hadrianopolis | 6th c. AD | 500 | 599 |
I_726 | Kaisarea Hadrianopolis | mid. 3rd c. AD | 240 | 260 |
I_1320 | Nicaea | 2nd c. AD | 100 | 199 |
I_2616 | Prusias ad Olympum | 2nd c. AD | 100 | 199 |
I_1702 | Nicomedia | 284–305 AD | 284 | 305 |
I_1012 | Nicaea | Roman Imp. period | -31 | 395 |
I_2597 | Prusias ad Olympum | 2nd c. AD | 100 | 199 |
I_1418 | Nicaea | 202–205 AD | 202 | 205 |
I_786 | Kaisarea Hadrianopolis | 292–305 AD | 292 | 305 |
I_1117 | Nicaea | Byzantine period | 395 | 799 |
I_938 | Nicaea | Roman Imp. period | -31 | 395 |
We can now make a ‘test run’ using datplot with the stepsize-value set to 5 in order to save as much time as possible2 and check for errors first. We have to select exactly 4 columns in the correct order of: Identifier, Grouping Variable, Minimum Dating, Maximum Dating. We transform the output from the pipe operator to a data.frame
before handing it to datstepts()
, as datplot needs this format.
<- inscriptions %>%
inscr_steps select(ID, Location, DAT_min, DAT_max) %>%
na.omit() %>%
as.data.frame() %>%
::datsteps(stepsize = 5)
datplot#> Warning in datplot::datsteps(., stepsize = 5): Warning: Dating seems to be in
#> wrong order at ID I_1162, I_2725 (Index: 637, 1458). Dates have been switched,
#> but be sure to check your original data for possible mistakes.
#> Warning in get.weights(DAT_mat[, "datmin"], DAT_mat[, "datmax"]): Warning:
#> DAT_min and DAT_max at Index: 40, 44, 45, 76, 77, 113, 126, 131, 132, 133, 147,
#> 154, 198, 201, 203, 204, 205, 218, 240, 242, 243, 292, 293, 323, 333, 336, 337,
#> 338, 339, 340, 341, 343, 346, 349, 350, 351, 354, 355, 356, 640, 881, 882, 886,
#> 892, 935, 940, 943, 944, 946, 947, 948, 955, 1024, 1025, 1026, 1031, 1034, 1035,
#> 1036, 1037, 1044, 1049, 1050, 1051, 1060, 1065, 1066, 1117, 1120, 1121, 1126,
#> 1128, 1133, 1154, 1175, 1183, 1255, 1260, 1456, 1480, 1487, 1494, 1498) have the
#> same value! Is this correct? Please check the table for possible errors.
#> Warning in create.sub.objects(DAT_mat, stepsize): stepsize is larger than the
#> range of the closest dated object at Index = 40, 44, 45, 60, 75, 76, 77, 113,
#> 126, 131, 132, 133, 141, 147, 154, 171, 196, 198, 199, 201, 203, 204, 205,
#> 214, 218, 240, 242, 243, 269, 292, 293, 323, 333, 334, 335, 336, 337, 338,
#> 339, 340, 341, 343, 346, 347, 348, 349, 350, 351, 354, 355, 356, 625, 627,
#> 628, 640, 646, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 879,
#> 881, 882, 883, 884, 885, 886, 888, 889, 890, 892, 893, 935, 936, 939, 940, 941,
#> 942, 943, 944, 945, 946, 947, 948, 949, 955, 956, 957, 987, 988, 1016, 1022,
#> 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1031, 1034, 1035, 1036, 1037, 1044,
#> 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1057, 1059, 1060, 1065, 1066,
#> 1106, 1112, 1114, 1115, 1116, 1117, 1120, 1121, 1126, 1128, 1133, 1154, 1172,
#> 1174, 1175, 1183, 1185, 1188, 1189, 1255, 1256, 1260, 1456, 1480, 1487, 1494,
#> 1498, 1499, 1500, 1502, 1514, 1515, 1516). For information see documentation of
#> get.step.sequence().
There are indeed problems in the data. The first of the three warnings issued by datsteps()
informs us that we may have assigned some values wrong: “Warning: Dating seems to be in wrong order at ID I_1162, I_2725 (Index: 637, 1458). Dates have been switched, but be sure to check your original data for possible mistakes.” If dates are in the wrong order, datplot will automatically switch them before proceeding. This, however, might not always be the correct way of handling the situation, as other errors might have occurred as well. Therefore, we should check the data again using the Index-Values or IDs provided by datplots warning:
%>%
inscriptions select(ID, Location, Dating, uncertain_dating, DAT_min, DAT_max) %>%
na.omit() %>%
slice(637, 1458) %>%
kable()
ID | Location | Dating | uncertain_dating | DAT_min | DAT_max |
---|---|---|---|---|---|
I_1162 | Nicaea | 62/3 AD | FALSE | 62 | 3 |
I_2725 | Prusias ad Olympum | 180–192 BC | FALSE | -180 | -192 |
The problem here is twofold. In the first entry, our automated translation of the dating could not handle the format “62/3” and thus returned a wrong value for DAT_max
. If we had a truly large data set, we should correct this in the original reformatting process. In this case, it is more efficient to fix this problem right now in the fastest possible way.
which(inscriptions$ID == "I_1162"),"DAT_max"] <- 63 inscriptions[
In the other case, the Dating-column provided the BC-values in the wrong order. We can also fix this quickly right here, which is basically the same way datplot would handle both cases internally:
which(inscriptions$ID == "I_2725"),c("DAT_min", "DAT_max")] <-
inscriptions[which(inscriptions$ID == "I_2725"),c("DAT_max", "DAT_min")] inscriptions[
Note that datsteps()
changes the date and proceeds, but warns you of the possibility of errors. It is therefore recommended to check the data mentioned in that warning. As we ‘fixed’ our data set, we should save it again, so that the same error will not occur twice:
#write.table(inscriptions, file = "../data-raw/inscriptions.csv",
# fileEncoding = "UTF-8", sep = ";", row.names = FALSE)
For later use and publication, we add explanations of the variables and metadata to the R-object and store them in the *.rda file. As a backup, we also save the finished table as a .csv-file, suitable for archiving. The .csv-Table, however, does not contain the additional information added as attributes.
All files are available in this repository and as part of this package for further use by other researchers. We kindly ask you to cite Weissova (2019), the Packard Humanities Institute and this repository at https://github.com/lsteinmann/datplot as sources.
The dataset can be loaded into R by simply calling
#library(datplot)
data("Inscr_Bithynia")
when datplot
is loaded.
As our aim is to get an overview of the spatio-temporal distribution of the inscriptions in Bithynia, we can only analyze inscriptions with known Location and Dating. Thus, we have to remove all the rows from the data set that do not contain this information.
<- Inscr_Bithynia %>%
inscr_clean filter(Dating != "NA",
!= "unknown") %>%
Location droplevels()
This means that we removed a total of 1380 rows, which did not contain the information needed. The data set suitable for the analysis can be summarized as follows:
summary(inscr_clean)
#> ID ikey Location
#> Length:1498 Length:1498 Nicaea :604
#> Class :character Class :character Prusias ad Olympum :279
#> Mode :character Mode :character Nicomedia :108
#> Kaisarea Hadrianopolis: 87
#> Apamea : 77
#> Chalcedon : 69
#> (Other) :274
#> Source Dating Language uncertain_dating
#> Length:1498 Length:1498 Greek :1392 Mode :logical
#> Class :character Class :character Latin : 82 FALSE:1327
#> Mode :character Mode :character Greek/Latin: 24 TRUE :171
#>
#>
#>
#>
#> DAT_min DAT_max URL
#> Min. :-599.0 Min. :-525.0 Length:1498
#> 1st Qu.: -31.0 1st Qu.: 199.0 Class :character
#> Median : 100.0 Median : 266.5 Mode :character
#> Mean : 100.3 Mean : 286.4
#> 3rd Qu.: 200.0 3rd Qu.: 395.0
#> Max. : 680.0 Max. : 799.0
#>
We can now begin using the datplot-package…
library(datplot)
…and can actually try to use datsteps()
, first with a stepsize
of 25 years:
<- inscr_clean %>%
inscr_steps select(ID, Location, DAT_min, DAT_max) %>%
as.data.frame() %>%
::datsteps(stepsize = 25) %>%
datplot::scaleweight(var = 2)
datplot#> Warning in get.weights(DAT_mat[, "datmin"], DAT_mat[, "datmax"]): Warning:
#> DAT_min and DAT_max at Index: 40, 44, 45, 76, 77, 113, 126, 131, 132, 133, 147,
#> 154, 198, 201, 203, 204, 205, 218, 240, 242, 243, 292, 293, 323, 333, 336, 337,
#> 338, 339, 340, 341, 343, 346, 349, 350, 351, 354, 355, 356, 640, 881, 882, 886,
#> 892, 935, 940, 943, 944, 946, 947, 948, 955, 1024, 1025, 1026, 1031, 1034, 1035,
#> 1036, 1037, 1044, 1049, 1050, 1051, 1060, 1065, 1066, 1117, 1120, 1121, 1126,
#> 1128, 1133, 1154, 1175, 1183, 1255, 1260, 1456, 1480, 1487, 1494) have the same
#> value! Is this correct? Please check the table for possible errors.
#> Warning in create.sub.objects(DAT_mat, stepsize): stepsize is larger than the
#> range of the closest dated object at Index = 6, 12, 14, 18, 19, 20, 21, 22, 39,
#> 40, 41, 44, 45, 60, 74, 75, 76, 77, 79, 80, 83, 110, 113, 114, 125, 126, 131,
#> 132, 133, 141, 145, 147, 154, 161, 162, 163, 167, 168, 171, 173, 174, 175, 194,
#> 196, 198, 199, 201, 203, 204, 205, 206, 207, 208, 212, 214, 217, 218, 223, 231,
#> 232, 233, 234, 235, 236, 237, 238, 240, 242, 243, 244, 245, 246, 265, 269, 271,
#> 272, 273, 276, 292, 293, 323, 324, 331, 332, 333, 334, 335, 336, 337, 338, 339,
#> 340, 341, 342, 343, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356,
#> 579, 582, 583, 584, 585, 617, 618, 619, 620, 621, 622, 623, 625, 627, 628, 629,
#> 630, 631, 632, 633, 634, 635, 637, 640, 646, 723, 724, 725, 726, 727, 728, 729,
#> 730, 731, 732, 733, 734, 735, 736, 873, 874, 875, 876, 877, 878, 879, 880, 881,
#> 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 935, 936, 938, 939,
#> 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955,
#> 956, 957, 958, 959, 975, 976, 979, 980, 981, 982, 983, 984, 987, 988, 989, 1002,
#> 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028,
#> 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1044, 1045, 1046,
#> 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1057, 1058, 1059, 1060,
#> 1061, 1062, 1063, 1064, 1065, 1066, 1074, 1075, 1078, 1079, 1080, 1081, 1082,
#> 1083, 1084, 1085, 1086, 1087, 1106, 1107, 1112, 1113, 1114, 1115, 1116, 1117,
#> 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130,
#> 1131, 1132, 1133, 1147, 1148, 1149, 1153, 1154, 1172, 1174, 1175, 1176, 1183,
#> 1185, 1186, 1187, 1188, 1189, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1248,
#> 1250, 1251, 1252, 1253, 1255, 1256, 1260, 1382, 1456, 1457, 1458, 1459, 1460,
#> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1480, 1487, 1494, 1496, 1497,
#> 1498). For information see documentation of get.step.sequence().
datsteps()
tells us that a number or objects are dated to one year, asking us if this is correct. We included this output to avoid errors occurring through faulty values, as objects dated to one year have a very high impact on the outcome. It might help to spot problems or discrepancies in datasets as well, as the warning about the order of dating above has already shown.
In a second step, we scale the weights according to our grouping variable (Location), so that the sum of weights in each group equals 1, which is important for displaying the weight correctly in a density plot.
datsteps()
To get a general impression of the data set and the possibilities of visualization, we explore and recommend different plot methods below. The simplest and fastest way to plot the output with base R is the plot()
function in combination with the density()
function:
plot(density(inscr_steps$DAT_step))
ggplot2
A somewhat crowded overview containing the Locations of all Inscriptions can easily be achieved when using ggplot2
and its geom_density()
-method, which is based on the same procedure as the density()
-function used above. A result may look like this:
ggplot(data = inscr_steps, aes(x = DAT_step, fill = variable,
weight = weight)) +
geom_density(alpha = 0.3)
Note that the output without the weights calculated by datsteps()
is very different:
ggplot(data = inscr_steps, aes(x = DAT_step, fill = variable)) +
geom_density(alpha = 0.3)
The weights calculated by datplot (internally using the get.weights()
-function) are a way of assigning an importance to the objects in question. They relate to the range an object is dated to (more detailed information on the process is available in (Weissova and Steinmann, n.d.)). Therefore, any objects dated to large time spans as for example “Roman Imperial Period” ranging from 31 BC to 395 AD in our data set, are contributing to the curve significantly less than any object dated to one exact year. At times it can be very useful to look at both outputs and discuss them separately. We will look at a case study of a single city later on to clarify this.
ggplot2
and ggridges
Since the graph is very crowded in this layout, we actually recommend using geom_density_ridges
from the ggridges
-package. Since the package has no built in support for assigning weight, we have to use stat = "density"
, which will reference the density()
-function to get the appropriate calculations:
ggplot(data = inscr_steps,
aes(x = DAT_step,
y = fct_rev(as_factor(variable)),
fill = variable,
weight = weight)) +
geom_density_ridges(aes(height = ..density..),
stat = "density", alpha = 0.9) +
scale_fill_discrete(guide = FALSE)
Styling will help to make the plots more readable:
<- colorRampPalette(c("#8dae25", "#17365c"))
bluegreen
ggplot(data = inscr_steps,
aes(x = DAT_step,
y = fct_rev(as_factor(variable)),
fill = variable,
weight = weight)) +
geom_density_ridges(aes(height = ..density..), stat = "density", alpha = 0.9) +
scale_x_continuous(breaks = seq(from = -800, to = 800, by = 100),
limits = c(-800,800), name = "") +
geom_vline(xintercept = 0, alpha = 0.5, lwd = 1) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
panel.background = element_blank(),
panel.grid.major.x = element_line(linetype = "dashed",
color = "gray30"),
panel.grid.minor.x = element_line(linetype = "dotted",
color = "gray80")) +
scale_fill_manual(guide=FALSE,
values = bluegreen(length(unique(inscr_steps$variable)))) +
labs(title = "Epigraphic Evidence from Bithynia",
subtitle = "Spatio-temporal distribution",
y = "administrative centres",
caption = attributes(inscriptions)$source)
ggplot2
and facet_wrap()
Another option is to separate the variables with facet_wrap
, which is not as condensed and does not support the option to scale the densities in the graph for maximum visibility, as geom_density_ridges()
automatically does (see Documentation of ggridges
). In our case, this leads to a worse visibility of smaller density curves:
ggplot(data = inscr_steps, aes(x = DAT_step,
fill = variable, weight = weight)) +
geom_density(alpha = 0.9) +
scale_fill_discrete(guide = FALSE) +
facet_wrap(variable ~ ., ncol = 1)
Styling also helps very much to improve the graphs readability:
ggplot(data = inscr_steps, aes(x = DAT_step,
fill = variable, weight = weight)) +
geom_density(alpha = 0.9) +
theme(panel.background = element_blank()) +
scale_fill_manual(guide=FALSE,
values = bluegreen(length(unique(inscr_steps$variable)))) +
scale_x_continuous(breaks = seq(from = -800, to = 800, by = 100),
limits = c(-800,800), name = "") +
facet_wrap(variable ~ ., ncol = 1) +
theme(strip.text.x = element_text(size=8),
strip.background = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
panel.background = element_blank(),
panel.grid.major.x = element_line(linetype = "dashed",
color = "gray30"),
panel.grid.minor.x = element_line(linetype = "dotted",
color = "gray80")) +
labs(title = "Epigraphic Evidence from Bithynia",
subtitle = "Spatio-temporal distribution",
caption = attributes(inscriptions)$source)
In order to compare this output to generally more common visualizations, we can prepare a histogram of the original data. Here we have two possible approaches. We first prepare a two-part histogram for the upper and lower boundaries of the respective dating for each object. We select only three locations to keep the visualization short:
%>%
inscr_clean select(ID, Location, DAT_min, DAT_max) %>%
filter(Location %in% c("Prusias ad Hypium", "Nicomedia", "Apamea")) %>%
::melt(id.vars = c("ID", "Location")) %>%
reshape2ggplot(aes(x = value, fill = variable)) +
geom_histogram(binwidth = 25, position = "dodge") +
facet_wrap(. ~ Location, ncol = 1) +
labs(title = "Distribution of Dated Inscriptions in Bithynia (selection)",
x = "Dating", y = "Number of Inscriptions") +
theme(legend.position = "top")
#> Warning: attributes are not identical across measure variables; they will be
#> dropped
Alternatively, we could display the mean of the Dating range:
%>%
inscr_clean transmute(ID, Location, Language, DAT_mean = ((DAT_min + DAT_max)/2)) %>%
filter(Location %in% c("Prusias ad Hypium", "Nicomedia", "Apamea")) %>%
::melt(id.vars = c("ID", "Location", "Language")) %>%
reshape2ggplot(aes(x = value, fill = Language)) +
geom_histogram(binwidth = 25) +
facet_wrap(. ~ Location, ncol = 1) +
labs(title = "Distribution of Dated Inscriptions in Bithynia (selection)",
x = "Dating", y = "Number of Inscriptions")
While this also gives us an impression of the distribution, it seems some information got lost on the way. Many inscriptions were dated to large time spans, and those have now been gathered into their mean value. The upside is that the histograms show the real count of inscriptions, which density graphs cannot.
We have shown in this vignette, how a data set as it is often encountered by archaeologists can be prepared and formatted in order to be processed by datplot. Furthermore, we have made some suggestions on visualization. This vignette hopes to help researchers that are not yet very familiar with the processes of data cleaning to solutions for their own projects.
For information and guidance on regular expressions see e.g. https://regex101.com/. However, R has a slightly different handling of escape characters. Information can be found in the R-Documentation under ?regex
or here.↩︎
Using different stepsizes is a bit of an artifact left over from a less efficient version of datplot, where is processing would take more than a minute if stepsize was set to 1, so I would usually start out with 100 or 25 just to look for errors and problems and to not lose much time. This, however, is technically not necessary in any way. The function now works faster, though there is a lot of room for improvement, as it may still take several seconds or longer, but generally not several minutes. We determined that the feature might still be useful, and therefore it was not removed.↩︎