In medical research, it is common for researchers to gather data from multiple sources. However, due to privacy concerns (e.g. HIPAA) or the propitiatory nature of the data, one or more of the potential sources may not be in a position to share the data. Various methods have been proposed in the literature, and some R packages have been written (e.g distcomp
, ppmHR
), which allow analyses to be performed on distributed data in a secure setting. That is, the analyses are performed as if the data were aggregated, but in reality each data partner maintains control over their own data and only shares high level statistics in such a way that the original data cannot be deduced by the other data partners.
Each of these methods assumes that the data is horizontally partitioned. That is, each data partner has the same response variable and covariates for distinct cohorts. What makes this package novel is that we assume that the data is vertically partitioned. That is, each data partner holds a set of unique covariates for the same cohort of observations. (One way to think about this is with horizontally partitioned data, the observation matrix is partitioned by horizontal lines and in vertically partitioned data, the observation matrix is partitioned by vertical lines.)
In this package, we implement three different protocols for vertical distributed regression analysis (VDRA) in a secure setting. They are called \(2\)-party, \(2^T\)-party, and \(k^T\)-party. Each of these protocols can be used to perform linear, logistic, and Cox regression. The purpose of this vignette is to demonstrate how to use this package, so we will not go into details of the protocols and algorithms here. Specific information on the algorithms and a discussion of their security can be found in the following papers:
This package was designed to run in Base-R. However, there are select cases when using functions from the survival
package is possible. In these cases, if the survival
package is installed, those functions will be used, otherwise, we use our own functions.
When using this package on PopMedNet, the computations can take tens of minutes for linear regression on a data set with a few thousand observations, to hours for Cox regression on a data set with hundreds of thousands, or millions of observations. Even when running the simulation on your local computer using the PopMedNet simulator pmn()
, the computations can take several minutes. These times are due to the large about of data that needs to be transferred back and forth between the data partners in order to ensure that the computations are secure and that there is no data leakage. If one were able to aggregate the data from the different data sources without concern for privacy, then they would do so and use more efficient functions. However, there is no free lunch, and the price for security and privacy is time. So, when working through the examples in this vignette, grab a drink, put on some nice music, and find pleasure in the journey.
With \(2\)-party VDRA, there are only two data holders, DP0 and DP1 (short for “Data Partner 0” and “Data Partner 1”). This protocol allows DP0 and DP1 to communicate directly with each other, but they never share patient level data with each other nor information which would allow the other party to deduce patient level information. DP0 and DP1 share intermediate statistics with each other multiple times in an iterative manner until the final regression results are computed by DP0 and shared with DP1. A pictorial representation of the data flow between the parties is shown below.
The name DP0 indicates which party is acting as the analysis center. This is the only protocol where the analysis center also provides data. For the next two protocols, the analysis center does not provide any data.
With \(2^T\)-party VDRA, there are only two data holders, DP1 and DP2, and an analysis center, DP0. DP1 and DP2 cannot communicate directly with each other, but all communication must pass through DP0. DP0 is a trusted third party that helps facilitate communication and performs much of the computation that was performed by DP1 in the \(2\)-party protocol. Whenever possible, any intermediate statistics shared with DP0 from one data partner (never patient level data) are multiplied by a random orthonormal matrix before being sent to the other data partner. This adds an extra layer of security at the expense of sending more data. The final regression results are computed by DP0 and shared with DP1 and DP2. A pictorial representation of the data flow between the parties is shown below.
With \(k^T\)-party VDRA, there are two or more data holders, DP1, DP2, … DPk, and one analysis center, DP0. We have tested the package with up to 10 data holders, but there is no reason why there could not be more. With this protocol, all data partners are able to communicate with each other with the benefit that less data is transferred. As with \(2^T\)-party VDRA, DP0 facilitates the computations and computes the final regression results, which are then sent to all the data partners. The one possible concern with this method is that a data breach at both the analysis center and one of the data partners could expose another data partner’s data, even though neither the analysis center nor any data partner have enough information to reconstruct any part of any other data partners’ data on their own. A pictorial representation of the data flow between the parties is shown below.
PopMedNet (https://pmnplayground-1.popmednet.org/), short for Population Medicine Network, is a “scalable and extensible open-source informatics platform designed to facilitate the implementation and operation of distributed health data networks.” PopMedNet is maintained by Department of Population Medicine at the Harvard Medical School and the Harvard Pilgrim Health Care Institute. Through two generous National Institute of Health Grants, we were able to make modifications to PopMedNet which allowed us to transmit data between data partners in a secure and automatic fashion. These modification were made specifically with the intent of the creation of this package. While this package has been designed to work seamlessly with PopMedNet as a means of communication, it should be possible to use other file transfer software to perform the same task.
If a group of data partners wishes to use PopMedNet and this package for analysis of vertically distributed data, please refer to the vignette “How to Use the VDRA Package with PopMedNet” for further information. However, for the individuals which are interested in testing out the package to see how it works, a PopMedNet simulator, pmn()
, has been supplied as part of the package. This allows the individual to play the part of the all the data partners and the analysis center on a single computer in order to gain an understanding of how to use the package before implementation in a real world setting.
We demonstrate the usage of this package using pmn()
as the file transfer protocol for \(2\)-party, \(2^T\)-party, and \(k^T\)-party situations. Take careful note the use of the parameter popmednet = FALSE
in all function calls. When popmednet = TRUE
is used, an offset is added to make sure that when two or more data partners are running in parallel, there is a at least a 15 second window between when they signal PopMedNet that they are ready to upload files. There are technical reasons for this that are specific to PopMedNet, but are not applicable when other file transfer protocols are utilized.
The VDRA
package comes with a data set vdra_data
which contains simulated data from a BMI study by the Harvard School of Medicine. The data in vdra_data
are not fit for scientific research, but are rather provided as an example for use when learning to use this package. The data contains information from 5,740 subjects with four possible response variables and seven covariates.
The four response variables are:
Variable.s. | Intended.Use |
---|---|
Change_BMI | Linear Regression |
WtLost | Logistic Regression |
Time, Status | Cox Regression |
If each data partner is using their own data, it is assumed that each data partner has the same number of observations and that observations on corresponding rows are for the same patient. In other words, it is assumed that the data are already aligned according to some common key, and it will be treated as such. Additionally, it is expected that the data will have already been cleaned and only valid values are presented. Some checking of the data is performed by the package, including looking for missing values. During this step, if any problems are found a descriptive error message will be given and the program will terminate.
The directories in which each data partner reads and writes files is fairly rigid, but the home location where these directories are placed is up to the user. For this vignette, we set the base directory to be ~/vdra
, but in practice, you can name it whatever you want and place it wherever you desire. This directory is passed as a parameter to pmn()
and the data partners, and they assume that it exists. If it does not, they terminate execution. At this point, please create your work directory before proceeding.
Before each run, ~/vdra
and all of its subdirectories should not contain any file named files_done.ok
, job_done.ok
, or job_fail.ok
. These are special files that signal PopMedNed (and the simulator) that files have been transferred from another data partner, are ready to be transferred to another data partner, or that the computation has completed or failed. In the event of a normal program termination, these files are removed automatically, but in the event of a program crash or other unplanned events, one or more of these files may be left behind which could have unintended consequences the next time that a simulation is run. Because of this, we suggest that all files are deleted from ~/pmn
each time you run a simulation.
When is run the first time (where \(k\) is some positive integer signaling how many parties (data partners and the analysis center) are going to participate in the computation, it first checks that ~/vdra
exists and then it creates the subdirectories dp0
, dp1
, through dp(k)
in the directory ~/vdra
. Each subdirectory contains the further subdirectories inputfiles
, msoc
, msoc1
, \(\dots\), msoc(k)
. All files that Data Partner \(n\) wants transferred to another data partner are written to ~/vdra/dp(n)/inputfiles
. On the other hand, ~/vdra/dp(n)/msoc
contains the files sent to Data Partner \(n\) by the analysis center (Data Partner 0) and ~/vdra/dp(n)/msoc(m)
contains the files sent to Data Partner \(n\) by Data Partner \(m\).
As pmn()
(or PopMedNet if you are using that) and the data partners all run in parallel, we now outline the order of execution. For simplicity, assume that we have the analysis center (Data Partner 0) and data partners 1 and 2. When we first start the scripts, data partners 1 and 2 are active. They do some computations and then write files to their respective directory inputfiles
, along with a file file_list.csv
which indicates which files are to be sent to which data partners. A data partner can transfer a file to any other data partner within the limitations of the protocol being used, the only additional exception being that they cannot transfer a file to themselves. Finally, they write files_done.csv
which indicates to pmn()
(PopMedNet) that the files are ready to be transfered. Once all active data partners have written files_done.csv
, pmn()
(PopMedNet) transfers the files to the appropriate read directories of the appropriate data partners. Once the files are transferred, pmn()
(PopMedNet) writes files_done.csv
in each directory that received a file. Each data partner that received a file is now active. The active data partners delete the various files_done.csv
, read the input files, process, write files, and the process continues. The process ends when the analysis center determines that the computation is over. At this point, the analysis center writes the file job_done.ok
in the case of a successful computation, or job_fail.ok
in the case of an unsuccessful computation. This signals pmn()
(PopMedNet) to tell all the other data partners to shutdown and then pmn()
(PopMedNet) shuts down itself.
The next sections of the vignette demonstrate how to use this package to perform parallel computations.
For \(2\)-party Vertically Distributed Regression, we will run three sessions of R simultaneously. If you are using R-Studio, you can open up two more sessions of R-Studio by choosing the menu item Session
\(\rightarrow\)New Session
. The first session will run the PopMedNet Simulator, pmn()
, and the other two sessions will run the analysis center (Data Partner 0) and the Data Partner 1, respectively. Once you run one block of code in a session of R, immediately move to the next R session to execute the next block of code. The scripts interact in such a way that they all need to run in parallel for the the computation to proceed to completion.
In our code, we assume that we run pmn()
first, as this simulates PopMedNet being run first, which delivers the requests in a real-world setting. Before you run pmn()
, be sure that ~/vdra
is empty (or at least does not contain any files of the form *.ok
. When we run pmn()
, we tell it that there will be one data partner beyond the analysis center and that the working directory is ~/vdra
.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(2, "~/vdra")
At this point, pmn()
will have created directories ~/vdra/dp0
and ~/vdra/dp1
which are used in the communication process. For this protocol, it is assumed that Data Partner 0 has the response variable(s) and potentially some covariates, while it is assumed that Data Partner 1 has at least one covariate. For this vignette, we let Data Partner 0 have the covariates stored in columns 5 through 7 of vdra_data
and Data Partner 1 has the covariates stored in columns 8 through 11.
Once pmn()
is running, proceed to the section indicating the regression you wish to perform.
In order to perform linear regression in a \(2\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first.
library(vdra)
= AnalysisCenter.2Party(regression = "linear",
fit data = vdra_data[, c(1, 5:7)],
response = "Change_BMI",
monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner.2Party(regression = "linear",
fit data = vdra_data[, 8:11],
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
After a few minutes, you should see the following output for both parties.
## Party Estimate Std. Error t value Pr(>|t|)
## (Intercept) dp0 2.284e+01 1.620344 1.409e+01 < 2.22e-16 ***
## Exposure dp0 -5.779e+00 0.394282 -1.466e+01 < 2.22e-16 ***
## Age dp0 2.024e-01 0.020086 1.007e+01 < 2.22e-16 ***
## ComorbidScore dp0 2.626e-01 0.106151 2.474e+00 0.0133813 *
## NumRx dp1 -1.324e-01 0.096882 -1.367e+00 0.1717263
## BMI_pre dp1 -1.176e-02 0.021796 -5.394e-01 0.5896362
## Race:Race 1 dp1 2.108e+00 0.788397 2.673e+00 0.0075283 **
## Race:Race 2 dp1 1.675e+00 0.740472 2.262e+00 0.0237404 *
## Race:Race 3 dp1 3.641e+00 0.733589 4.964e+00 7.1126e-07 ***
## Race:Race 4 dp1 4.722e+00 0.738904 6.391e+00 1.7827e-10 ***
## Race:Race 5 dp1 4.033e-02 0.746852 5.400e-02 0.9569363
## Sex:M dp1 -1.244e+00 0.563097 -2.208e+00 0.0272501 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.92 on 5728 degrees of freedom
## Multiple R-squared: 0.06851 , Adjusted R-squared: 0.06672
## F-statistic: 38.3 on 11 and 5728 DF, p-value: < 2.22e-16
The output is similar if we had used lm()
:
= lm(Change_BMI ~ ., vdra_data[, c(1, 5:11)])
fit summary(fit)
##
## Call:
## lm(formula = Change_BMI ~ ., data = vdra_data[, c(1, 5:11)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.935 -10.324 -1.444 8.571 77.989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.83755 1.62034 14.094 < 2e-16 ***
## Exposure -5.77936 0.39428 -14.658 < 2e-16 ***
## Age 0.20235 0.02009 10.074 < 2e-16 ***
## ComorbidScore 0.26264 0.10615 2.474 0.01338 *
## NumRx -0.13242 0.09688 -1.367 0.17173
## BMI_pre -0.01176 0.02180 -0.539 0.58964
## RaceRace 1 2.10776 0.78840 2.673 0.00753 **
## RaceRace 2 1.67488 0.74047 2.262 0.02374 *
## RaceRace 3 3.64139 0.73359 4.964 7.11e-07 ***
## RaceRace 4 4.72205 0.73890 6.391 1.78e-10 ***
## RaceRace 5 0.04033 0.74685 0.054 0.95694
## SexM -1.24359 0.56310 -2.208 0.02725 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.92 on 5728 degrees of freedom
## Multiple R-squared: 0.06851, Adjusted R-squared: 0.06672
## F-statistic: 38.3 on 11 and 5728 DF, p-value: < 2.2e-16
As with the linear regression, we are already running the following block of code in an R session.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(2, "~/vdra")
In order to perform logistic regression in a \(2\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first. Notice that this time we are using the binary variable WtLost
as the response.
library(vdra)
= AnalysisCenter.2Party(regression = "logistic",
fit data = vdra_data[, c(2, 5:7)],
response = "WtLost",
monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner.2Party(regression = "logistic",
fit data = vdra_data[, 8:11],
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
After a few minutes, you should see the following output:
## Party Estimate Std. Error t value Pr(>|t|)
## (Intercept) dp0 1.633e+00 0.377311 4.327351 1.5091e-05 ***
## Exposure dp0 -9.409e-01 0.099919 -9.416743 < 2.22e-16 ***
## Age dp0 3.772e-02 0.004815 7.833597 4.7411e-15 ***
## ComorbidScore dp0 3.653e-02 0.025524 1.431209 0.15237037
## NumRx dp1 -4.366e-02 0.023002 -1.898102 0.05768263 .
## BMI_pre dp1 -3.459e-03 0.005094 -0.679062 0.49709830
## Race:Race 1 dp1 3.164e-01 0.190301 1.662715 0.09636954 .
## Race:Race 2 dp1 4.087e-01 0.176844 2.311323 0.02081501 *
## Race:Race 3 dp1 4.870e-01 0.175991 2.767129 0.00565524 **
## Race:Race 4 dp1 6.556e-01 0.188746 3.473682 0.00051337 ***
## Race:Race 5 dp1 -2.440e-01 0.153508 -1.589724 0.11189697
## Sex:M dp1 -3.154e-01 0.139085 -2.267681 0.02334863 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null Deviance: 3530 on 5739 degrees of freedom
## Residual deviance: 3311 on 5728 degrees of freedom
## AIC: 3335
## BIC: 3415
##
## Number of Newton-Raphson iterations: 7
The output is similar if we had used glm()
:
= glm(WtLost ~ ., vdra_data[c(2, 5:11)], family = binomial)
fit summary(fit)
##
## Call:
## glm(formula = WtLost ~ ., family = binomial, data = vdra_data[c(2,
## 5:11)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8468 0.2797 0.3737 0.4848 1.0130
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.632756 0.377298 4.327 1.51e-05 ***
## Exposure -0.940912 0.099913 -9.417 < 2e-16 ***
## Age 0.037721 0.004815 7.834 4.73e-15 ***
## ComorbidScore 0.036530 0.025523 1.431 0.152356
## NumRx -0.043661 0.023001 -1.898 0.057674 .
## BMI_pre -0.003459 0.005094 -0.679 0.497085
## RaceRace 1 0.316417 0.190294 1.663 0.096357 .
## RaceRace 2 0.408743 0.176838 2.311 0.020811 *
## RaceRace 3 0.486991 0.175986 2.767 0.005654 **
## RaceRace 4 0.655642 0.188734 3.474 0.000513 ***
## RaceRace 5 -0.244035 0.153505 -1.590 0.111890
## SexM -0.315401 0.139079 -2.268 0.023343 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3530.2 on 5739 degrees of freedom
## Residual deviance: 3310.6 on 5728 degrees of freedom
## AIC: 3334.6
##
## Number of Fisher Scoring iterations: 5
As with the linear regression, we are already running the following block of code in an R session.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(2, "~/vdra")
In order to perform Cox regression in a \(2\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first. Notice that this time we are using the two variables for the response: Time
which measures the time to event and the binary variable Status
which records if the event happened or was censored.
library(vdra)
= AnalysisCenter.2Party(regression = "cox",
fit data = vdra_data[, c(3:4, 5:7)],
response = c("Time", "Status"),
monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner.2Party(regression = "cox",
fit data = vdra_data[, 8:11],
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
After a few minutes, you should see the following output:
## n= 5740, number of events= 4127
##
## party coef exp(coef) se(coef) z Pr(>|z|)
## Exposure dp0 -0.046067 0.954978 0.031227 -1.475201 0.140158
## Age dp0 0.002521 1.002524 0.001567 1.608509 0.107724
## ComorbidScore dp0 -0.004967 0.995045 0.008322 -0.596861 0.550600
## NumRx dp1 -0.004932 0.995081 0.007601 -0.648776 0.516483
## BMI_pre dp1 0.058895 1.060664 0.001615 36.457746 < 2e-16 ***
## Race:Race 1 dp1 -0.054626 0.946839 0.062410 -0.875274 0.381425
## Race:Race 2 dp1 0.019202 1.019388 0.058428 0.328648 0.742422
## Race:Race 3 dp1 0.098094 1.103067 0.056990 1.721246 0.085206 .
## Race:Race 4 dp1 0.024423 1.024724 0.058176 0.419815 0.674621
## Race:Race 5 dp1 -0.120371 0.886592 0.060084 -2.003364 0.045138 *
## Sex:M dp1 0.011307 1.011371 0.044496 0.254116 0.799406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## party exp(coef) exp(-coef) lower .95 upper .95
## Exposure dp0 0.954978 1.047144 0.898282 1.015253
## Age dp0 1.002524 0.997482 0.999449 1.005608
## ComorbidScore dp0 0.995045 1.004980 0.978947 1.011408
## NumRx dp1 0.995081 1.004944 0.980365 1.010017
## BMI_pre dp1 1.060664 0.942806 1.057311 1.064028
## Race:Race 1 dp1 0.946839 1.056145 0.837824 1.070040
## Race:Race 2 dp1 1.019388 0.980981 0.909087 1.143071
## Race:Race 3 dp1 1.103067 0.906563 0.986488 1.233423
## Race:Race 4 dp1 1.024724 0.975873 0.914298 1.148487
## Race:Race 5 dp1 0.886592 1.127915 0.788098 0.997396
## Sex:M dp1 1.011371 0.988757 0.926906 1.103533
##
## Concordance= 0.6607 (se = 0.004373 )
## Likelihood ratio test= 1194 on 11 df, p= < 2.22e-16
## Wald test = 1354 on 11 df, p= < 2.22e-16
## Score test = 1378 on 11 df, p= < 2.22e-16
##
## Number of Newton-Raphson iterations: 5
The output is similar if we had used coxph()
in the survival
package:
library(survival)
## Warning: package 'survival' was built under R version 4.0.5
= coxph(Surv(Time, Status) ~ ., data = vdra_data[, 3:11])
fit summary(fit)
## Call:
## coxph(formula = Surv(Time, Status) ~ ., data = vdra_data[, 3:11])
##
## n= 5740, number of events= 4127
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Exposure -0.046067 0.954978 0.031227 -1.475 0.1402
## Age 0.002521 1.002524 0.001567 1.609 0.1077
## ComorbidScore -0.004967 0.995045 0.008322 -0.597 0.5506
## NumRx -0.004932 0.995081 0.007601 -0.649 0.5165
## BMI_pre 0.058895 1.060664 0.001615 36.458 <2e-16 ***
## RaceRace 1 -0.054626 0.946839 0.062410 -0.875 0.3814
## RaceRace 2 0.019202 1.019388 0.058428 0.329 0.7424
## RaceRace 3 0.098094 1.103067 0.056990 1.721 0.0852 .
## RaceRace 4 0.024423 1.024724 0.058176 0.420 0.6746
## RaceRace 5 -0.120371 0.886592 0.060084 -2.003 0.0451 *
## SexM 0.011307 1.011371 0.044496 0.254 0.7994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Exposure 0.9550 1.0471 0.8983 1.0153
## Age 1.0025 0.9975 0.9994 1.0056
## ComorbidScore 0.9950 1.0050 0.9789 1.0114
## NumRx 0.9951 1.0049 0.9804 1.0100
## BMI_pre 1.0607 0.9428 1.0573 1.0640
## RaceRace 1 0.9468 1.0561 0.8378 1.0700
## RaceRace 2 1.0194 0.9810 0.9091 1.1431
## RaceRace 3 1.1031 0.9066 0.9865 1.2334
## RaceRace 4 1.0247 0.9759 0.9143 1.1485
## RaceRace 5 0.8866 1.1279 0.7881 0.9974
## SexM 1.0114 0.9888 0.9269 1.1035
##
## Concordance= 0.661 (se = 0.004 )
## Likelihood ratio test= 1194 on 11 df, p=<2e-16
## Wald test = 1354 on 11 df, p=<2e-16
## Score (logrank) test = 1378 on 11 df, p=<2e-16
For \(2^T\)-party Vertically Distributed Regression, we will run four sessions of R simultaneously. If you are using R-Studio, you can open up three more sessions of R-Studio by choosing the menu item Session
\(\rightarrow\)New Session
. The first session will run the PopMedNet Simulator, pmn()
, and the other three sessions will run the analysis center (Data Partner 0), Data Partner 1, and Data Partner 2, respectively. Once you run one block of code in a session of R, immediately move to the next R session to execute the next block of code. The scripts interact in such a way that they all need to run in parallel for the the computation to proceed to completion.
In our code, we assume that we run pmn()
first, as this simulates PopMedNet being run first, which delivers the requests in a real-world setting. Before you run pmn()
, be sure that ~/vdra
is empty (or at least does not contain any files of the form *.ok
). When we run pmn()
, we tell it that there will be one data partner beyond the analysis center and that the working directory is ~/vdra
.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(3, "~/vdra")
At this point, pmn()
will have created directories ~/vdra/dp0
, ~/vdra/dp1
, and ~/vdra/dp2
which are used in the communication process. For this protocol, it is assumed that the analysis center has no data, Data Partner 1 has the response variable(s) and potentially some covariates, and that Data Partner 2 has at least one covariate. For this vignette, we let Data Partner 1 have the covariates stored in columns 5 through 7 of vdra_data
and Data Partner 2 has the covariates stored in columns 8 through 11.
Once pmn()
is running, proceed to the section indicating the regression you wish to perform.
In order to perform linear regression in a \(2^T\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first.
library(vdra)
= AnalysisCenter.3Party(regression = "linear",
fit monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner1.3Party(regression = "linear",
fit data = vdra_data[, c(1, 5:7)],
response = "Change_BMI",
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
Finally, run the following code in the fourth R session simultaneously with the other two R sessions. This code is for Data Partner 2.
library(vdra)
= DataPartner2.3Party(regression = "linear",
fit data = vdra_data[, 8:11],
monitorFolder = "~/vdra/dp2",
popmednet = FALSE)
summary(fit)
After a few minutes, you should see the same output as in the \(2\)-party scenario.
As with linear regression, we are already running the following block of code in an R session.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(3, "~/vdra")
In order to perform logistic regression in a \(2^T\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first. Notice that this time we are using the binary variable WtLost
as the response.
library(vdra)
= AnalysisCenter.3Party(regression = "logistic",
fit monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner1.3Party(regression = "logistic",
fit data = vdra_data[, c(2, 5:7)],
response = "WtLost",
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
Finally, run the following code in the fourth R session simultaneously with the other two R sessions. This code is for Data Partner 2.
library(vdra)
= DataPartner2.3Party(regression = "logistic",
fit data = vdra_data[, 8:11],
monitorFolder = "~/vdra/dp2",
popmednet = FALSE)
summary(fit)
After a few minutes, you should see the same output as in the \(2\)-party scenario
As with linear regression, we are already running the following block of code in an R session.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(3, "~/vdra")
In order to perform Cox regression in a \(2^T\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first. Notice that this time we are using the two variables for the response: Time
which measures the time to event and the binary variable Status
which records if the event happened or was censored.
library(vdra)
= AnalysisCenter.3Party(regression = "cox",
fit monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner1.3Party(regression = "cox",
fit data = vdra_data[, c(3:4, 5:7)],
response = c("Time", "Status"),
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
Finally, run the following code in the fourth R session simultaneously with the other two R sessions. This code is for Data Partner 2.
library(vdra)
= DataPartner2.3Party(regression = "cox",
fit data = vdra_data[, 8:11],
monitorFolder = "~/vdra/dp2",
popmednet = FALSE)
summary(fit)
After a few minutes, you should see the same output as in the \(2\)-party scenario
For \(k\)-party Vertically Distributed Regression, we are not limited to two data partners beyond the analysis center, but we can have any number. In fact, we have successfully run this program for \(k = 10\) data partners. However, for the sake of this vignette, we will restrict ourselves to just two and note that we would only have to change the value of numDataPartners
. For example, if we had \(k = 10\) data partners, we would need to set numDataPartners = 10
in each of the following code blocks. Since we have two data partners, we will run four sessions of R simultaneously. If you are using R-Studio, you can open up three more sessions of R-Studio by choosing the menu item Session
\(\rightarrow\)New Session
. The first session will run the PopMedNet Simulator, pmn()
, and the other three sessions will run the analysis Center (Data Partner 0), Data Partner 1, and Data Partner 2, respectively. Once you run one block of code in a session of R, immediately move to the next R session to execute the next block of code. The scripts interact in such a way that they all need to run in parallel for the the computation to proceed to completion.
In our code, we assume that we run pmn()
first, as this simulates PopMedNet being run first, which delivers the requests in a real-world setting. Before you run pmn()
, be sure that ~/vdra
is empty (or at least does not contain any files of the form *.ok
). When we run pmn()
, we tell it that there will be one data partner beyond the analysis center and that the working directory is ~/vdra
.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(3, "~/vdra")
At this point, pmn()
will have created directories ~/vdra/dp0
, ~/vdra/dp1
, and ~/vdra/dp2
which are used in the communication process. For this protocol, it is assumed that the analysis center has no data, Data Partner 1 has the response variable(s) and potentially some covariates, and that Data Partner 2 has at least one covariate. For this vignette, we let Data Partner 1 have the covariates stored in columns 5 through 7 of vdra_data
and Data Partner 2 has the covariates stored in columns 8 through 11.
Once pmn()
is running, proceed to the section indicating the regression you wish to perform.
In order to perform linear regression in a \(k^T\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first.
library(vdra)
= AnalysisCenter.KParty(regression = "linear",
fit numDataPartners = 2,
monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner.KParty(regression = "linear",
fit data = vdra_data[, c(1, 5:7)],
response = "Change_BMI",
numDataPartners = 2,
dataPartnerID = 1,
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
Finally, run the following code in the fourth R session simultaneously with the other two R sessions. This code is for Data Partner 2.
library(vdra)
= DataPartner.KParty(regression = "linear",
fit data = vdra_data[, 8:11],
numDataPartners = 2,
dataPartnerID = 2,
monitorFolder = "~/vdra/dp2",
popmednet = FALSE)
summary(fit)
As with linear regression, we are already running the following block of code in an R session.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(3, "~/vdra")
In order to perform logistic regression in a \(k^T\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first. Notice that this time we are using the binary variable WtLost
as the response.
library(vdra)
= AnalysisCenter.KParty(regression = "logistic",
fit numDataPartners = 2,
monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner.KParty(regression = "logistic",
fit data = vdra_data[, c(2, 5:7)],
response = "WtLost",
numDataPartners = 2,
dataPartnerID = 1,
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
Finally, run the following code in the fourth R session simultaneously with the other two R sessions. This code is for Data Partner 2.
library(vdra)
= DataPartner.KParty(regression = "logistic",
fit data = vdra_data[, 8:11],
numDataPartners = 2,
dataPartnerID = 2,
monitorFolder = "~/vdra/dp2",
popmednet = FALSE)
summary(fit)
As with linear regression, we are already running the following block of code in an R session.
library(vdra)
if (!dir.exists("~/vdra")) dir.create("~/vdra")
pmn(3, "~/vdra")
In order to perform Cox regression in a \(k^T\)-party setting run the following code in the second R session simultaneously with the first R session. This code is for the analysis center (Data Partner 0). In reality, it does not matter which data partner is run first. Notice that this time we are using the two variables for the response: Time
which measures the time to event and the binary variable Status
which records if the event happened or was censored.
library(vdra)
= AnalysisCenter.KParty(regression = "cox",
fit numDataPartners = 2,
monitorFolder = "~/vdra/dp0",
popmednet = FALSE)
summary(fit)
Now, run the following code in the third R session simultaneously with the other two R sessions. This code is for Data Partner 1.
library(vdra)
= DataPartner.KParty(regression = "cox",
fit data = vdra_data[, c(3:4, 5:7)],
response = c("Time", "Status"),
numDataPartners = 2,
dataPartnerID = 1,
monitorFolder = "~/vdra/dp1",
popmednet = FALSE)
summary(fit)
Finally, run the following code in the fourth R session simultaneously with the other two R sessions. This code is for Data Partner 2.
library(vdra)
= DataPartner.KParty(regression = "cox",
fit data = vdra_data[, 8:11],
numDataPartners = 2,
dataPartnerID = 2,
monitorFolder = "~/vdra/dp2",
popmednet = FALSE)
summary(fit)
In order to facilitate the the user to further analyze the data and the model, four utilities have been provided. We present these for each of the three types of regression that we perform.
For linear regression, we allow the creation of sub-models. As an example, we have stored the output of the distributed linear regression (either 2-party, \(2^T\)-party, or \(k^T\)-party) from the data partner which holds the response in vdra_fit_linear_A
(which comes with the package). We can check the fit of different sub-models as shown in the following script. All data partners can use this function. The first argument is a standard R formula using the variables that are found in the fit, and the second argument is the results of the regression.
library(vdra)
= differentModel(Age ~ ., vdra_fit_linear_A)
fit1 summary(fit1)
## Party Estimate Std. Error t value Pr(>|t|)
## (Intercept) dp0 3.707e+01 0.956664 38.746947 < 2e-16 ***
## Change_BMI dp0 8.604e-02 0.008540 10.074380 < 2e-16 ***
## Exposure dp0 2.481e-01 0.261858 0.947364 0.343493
## ComorbidScore dp0 -5.007e-02 0.069252 -0.723004 0.469707
## NumRx dp1 7.715e-02 0.063176 1.221213 0.222056
## BMI_pre dp1 -9.351e-03 0.014212 -0.657944 0.510600
## Race:Race 1 dp1 1.297e-01 0.514410 0.252062 0.801002
## Race:Race 2 dp1 9.192e-01 0.482905 1.903412 0.057037 .
## Race:Race 3 dp1 -1.663e-01 0.479376 -0.347000 0.728604
## Race:Race 4 dp1 8.517e-01 0.483403 1.761807 0.078155 .
## Race:Race 5 dp1 6.624e-01 0.486924 1.360370 0.173766
## Sex:M dp1 4.415e-01 0.367290 1.202165 0.229349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.732 on 5728 degrees of freedom
## Multiple R-squared: 0.01999 , Adjusted R-squared: 0.01811
## F-statistic: 10.62 on 11 and 5728 DF, p-value: < 2.22e-16
= differentModel(Change_BMI ~ Exposure + Age, vdra_fit_linear_A)
fit2 summary(fit2)
## Party Estimate Std. Error t value Pr(>|t|)
## (Intercept) dp0 23.473918 0.858507 27.3427 < 2.22e-16 ***
## Exposure dp0 -5.683961 0.397325 -14.3056 < 2.22e-16 ***
## Age dp0 0.204154 0.020230 10.0918 < 2.22e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.05 on 5737 degrees of freedom
## Multiple R-squared: 0.05128 , Adjusted R-squared: 0.05095
## F-statistic: 155 on 2 and 5737 DF, p-value: < 2.22e-16
= differentModel(Change_BMI ~ Exposure, vdra_fit_linear_A)
fit3 summary(fit3)
## Party Estimate Std. Error t value Pr(>|t|)
## (Intercept) dp0 31.65929 0.283832 111.5424 < 2.22e-16 ***
## Exposure dp0 -5.73301 0.400771 -14.3050 < 2.22e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.18 on 5738 degrees of freedom
## Multiple R-squared: 0.03443 , Adjusted R-squared: 0.03427
## F-statistic: 204.6 on 1 and 5738 DF, p-value: < 2.22e-16
= differentModel(Change_BMI ~ Exposure + Age + `Sex:M` + `Race:Race 1`, vdra_fit_linear_A)
fit4 summary(fit4)
## Party Estimate Std. Error t value Pr(>|t|)
## (Intercept) dp0 24.870147 0.927739 26.807267 < 2.22e-16 ***
## Exposure dp0 -5.704460 0.396184 -14.398517 < 2.22e-16 ***
## Age dp0 0.204923 0.020174 10.157533 < 2.22e-16 ***
## Sex:M dp1 -2.582645 0.489395 -5.277216 1.3598e-07 ***
## Race:Race 1 dp1 -0.467666 0.561828 -0.832401 0.40522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.01 on 5735 degrees of freedom
## Multiple R-squared: 0.05714 , Adjusted R-squared: 0.05649
## F-statistic: 86.89 on 4 and 5735 DF, p-value: < 2.22e-16
For logistic regression, we can perform both the Hosmer-Lemeshow goodness of fit test for logistic regression and plot the receiver operating curve (ROC). We run these tests on vdra_fit_logistic_A
, which is the results of the distributed logistic regression (either 2-party, \(2^T\)-party, or \(k^T\)-party) as seen by the data partner which holds the response. Only the data partner which holds the response can run these functions. In both cases, the second argument is optional and is the number or groups, or bins, on which to perform the analysis.
HoslemTest(vdra_fit_logistic_A)
## Hosmer and Lemeshow goodness of fit (GOF) test
## Chi-squared: 4.167579 with DF 8, p-value: 0.8416962
HoslemTest(vdra_fit_logistic_A, 50)
## Hosmer and Lemeshow goodness of fit (GOF) test
## Chi-squared: 41.12232 with DF 48, p-value: 0.7484554
RocTest(vdra_fit_logistic_A)
RocTest(vdra_fit_logistic_A, 50)
We can compute the survival curve for a distributed Cox regression (either 2-party, \(2^T\)-party, or \(k^T\)-party). This function is works similar to survfit()
in the survival
package. The results can be both printed and plotted. Only the data partner which holds the response can run this function. The plot command takes only takes the plotting parameters: xlim
, ylim
, xlab
, ylab
, and main
.
= survfitDistributed(vdra_fit_cox_A)
sf print(sf)
## n events
## 5740 4127
plot(sf)
# Calculate the results based on strat that the data partner with the response holds
= survfitDistributed(vdra_fit_cox_A, ~ Exposure, data = vdra_data[c(3:4, 5:7)])
sf print(sf)
## n events
## Exposure=0 2861 2076
## Exposure=1 2879 1968
# plot both curves in the same plot
plot(sf, xlim = c(0, 400), ylim = c(0, 1),
xlab = "Time to Event", ylab = "Survival Percentage",
main = "BMI Study")
# plot curves in two different plots.
plot(sf, merge = FALSE)