One of the main benefits of Crunch is that it lets analysts and clients work with the same datasets. Instead of emailing datasets to clients, you can update the live dataset and ensure that they always see the most up-to-date information. The potential problem with this setup is that it can become difficult to make provisional changes to the dataset without publishing them to the client. Sometimes an analyst wants to investigate or experiment with a dataset without the risk of sending incorrect or confusing information to the end user. This is why we implemented a fork-edit-merge workflow for Crunch datasets.
“Fork” comes from version control systems and simply means to take a copy of something with the intention of making changes to the copy and then incorporating those changes back into the original. A helpful mnemonic is a path that forks off from the main road and then rejoins it later on. To see how this works, let’s first upload a new dataset to Crunch.
library(crunch)
ds <- newDataset(SO_survey, "stackoverflow_survey")
Imagine that this dataset is shared with several users, and you want to update it without affecting their usage. You might also want to consult with other analysts or decision makers to make sure that the data is accurate before sharing it with clients. To do this you call forkDataset() to create a copy of the dataset.
forked_ds <- forkDataset(ds)
You now have a copied dataset which is identical to the original, and are free to make changes without fear of disrupting the client’s experience. You can add or remove variables, delete records, or change the dataset’s organization. These changes will be isolated to your fork and won’t be visible to the end user until you decide to merge the fork back into the original dataset. This lets you edit the dataset with confidence because your work is isolated.
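For example, a few fork-only edits might look like the following sketch. This is illustrative rather than part of the walkthrough: it assumes crunch’s usual metadata setters name()<- and description()<- and uses the TabsSpaces variable from SO_survey.

# These edits touch only the fork; the original dataset is unchanged unless
# and until the fork is merged back.
name(forked_ds$TabsSpaces) <- "Tabs versus spaces"
description(forked_ds$TabsSpaces) <- "Indentation preference reported by the respondent"

# Variables can also be hidden or removed on the fork, e.g. with
# hideVariables(forked_ds, "SomeVariable") or deleteVariables(forked_ds, "SomeVariable"),
# where "SomeVariable" is a placeholder alias rather than a real SO_survey variable.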
In this case, let’s create a new categorical array variable.
forked_ds$ImportantHiringCA <- makeArray(
    forked_ds[, c("ImportantHiringTechExp", "ImportantHiringPMExp")],
    name = "importantCatArray"
)
Our forked dataset has now diverged from the original, which we can see by comparing their variable names.
all.equal(names(forked_ds), names(ds))
## [1] "Lengths (22, 23) differ (string compare on first 22)" "14 string mismatches"
You can work with the forked dataset as long as you like. If you want to see it in the web app or share it with other analysts, you can do so by calling webApp(forked_ds). You might create many forks and discard most of them without merging them into the original dataset.
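For instance, you might review the fork in the browser before deciding its fate. The sketch below uses the webApp() call mentioned above; delete(), crunch’s dataset-deletion helper, asks for confirmation and is shown commented out because we merge this fork back below.

# Open the fork in the Crunch web app to review it or share it with other
# analysts before deciding whether to merge it.
webApp(forked_ds)

# A fork you decide against can simply be discarded with delete(); commented
# out here because we merge this fork back into ds below.
# delete(forked_ds)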
If you do end up with changes to the forked dataset that you want to include in the original dataset, you can do so with the mergeFork() function. This function figures out what changes you made to the fork and then applies those changes to the original dataset.
ds <- mergeFork(ds, forked_ds)
After merging, the original dataset includes the categorical array variable which we created on the fork.
ds$ImportantHiringCA
## importantCatArray (categorical_array)
## Subvariables:
## $ImportantHiringTechExp | ImportantHiringTechExp
## $ImportantHiringPMExp | ImportantHiringPMExp
It’s possible to make changes to a fork which can’t be easily merged into the original dataset. For instance, if someone added another variable called ImportantHiringCA to the original dataset while we were working on this fork, the merge might fail because there’s no safe way to reconcile the two copies. This is called a “merge conflict”, and there are a couple of best practices you can follow to avoid this problem, such as keeping forks short-lived and coordinating with anyone else who edits the original dataset.
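One simple precaution is to check for name collisions before calling mergeFork(). The sketch below is purely illustrative and uses only base R on the datasets’ variable names, as in the all.equal() comparison above.

# A variable added on the fork conflicts if the same name has since been
# created on the original dataset; checking is a one-liner.
"ImportantHiringCA" %in% names(ds)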
Another good use of the fork-edit-merge workflow is when you want to append data to an existing dataset. When appending data you usually want to check that the append operation completed successfully before publishing the data to users. This might come up if you are adding a second wave of a survey, or including some additional data which came in after the dataset was initially sent to clients. The first step is to upload the second survey wave as its own dataset.
wave2 <- newDataset(SO_survey, "SO_survey_wave2")
We then fork the original dataset and append the new wave onto the forked dataset.
ds_fork <- forkDataset(ds)
ds_fork <- appendDataset(ds_fork, wave2)
ds_fork now has twice as many rows as ds, which we can verify with nrow():
nrow(ds)
## [1] 1634
nrow(ds_fork)
## [1] 3268
Once we’ve confirmed that the append completed successfully we can merge the forked dataset back into the original one.
ds <- mergeFork(ds, ds_fork)
ds now has the additional rows.
nrow(ds)
## [1] 3268
Merging two datasets together can often be a source of unexpected behavior, like misaligned or overwritten variables, so it’s a good candidate for this workflow. Let’s create a fake dataset with household size to merge onto the original one.
house_table <- data.frame(Respondent = unique(as.vector(ds$Respondent)))
house_table$HouseholdSize <- sample(
    1:5,
    nrow(house_table),
    TRUE
)
house_ds <- newDataset(house_table, "House Size")
There are a few reasons why we might not want to merge this new table directly onto our user-facing data. For instance, we might make a mistake in constructing the table, or have some category names which don’t quite match up. Merging the data onto a forked dataset again gives us the safety to make changes and verify accuracy without affecting client-facing data.
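Before joining, a quick key check can catch the most common problems. Here the keys match by construction, since house_table was built from ds$Respondent, so this is just a sketch of the sort of check you might run against a real table:

# Every respondent in ds should appear in house_table exactly once;
# as.vector() pulls the crunch variable's values into R, as above.
all(as.vector(ds$Respondent) %in% house_table$Respondent)
anyDuplicated(house_table$Respondent) == 0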
ds_fork <- forkDataset(ds)
ds_fork <- merge(ds_fork, house_ds, by = "Respondent")
Before merging the fork back into the original dataset, we can check that everything went well with the join.
crtabs(~ TabsSpaces + HouseholdSize, ds_fork)
## HouseholdSize
## TabsSpaces 1 2 3 4 5
## Both 128 120 106 132 100
## Spaces 264 260 248 220 270
## Tabs 310 256 294 284 252
And finally, once we’re comfortable that everything went as expected, we can send the data to the client by merging the fork back into the original dataset.
ds <- mergeFork(ds, ds_fork)
ds$HouseholdSize
## HouseholdSize (numeric)
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.948 4.000 5.000
Forking and merging datasets is a great way to make changes to the data. It allows you to verify your work and get approval before putting the data in front of clients, and gives you the freedom to make changes and mistakes without worrying about disrupting production data.