Using the CVEK R package

Wenying Deng

2020-11-27

Short Description

Using a library of base kernels, CVEK learns the generating function from data by directly minimizing the ensemble model’s error, and tests whether the data is generated by the RKHS under the null hypothesis. Part I presents a simple example to conduct Gaussian process regression and hypothesis testing using cvek function on simulated data. Part II shows a real-world application where we use CVEK to understand whether the per capita crime rate impacts the relationship between the local socioeconomic status and the housing price at Boston, MA, U.S.A. Finally, Part III provides code showing how to visualize the interaction effect from a cvek model.

Download the package from CRAN or GitHub and then install and load it.

library(CVEK)
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.6.2
library(ggrepel)

Tutorial using simulated dataset

Generate Data and Define Model

We generate a simulated dataset using the \("linear"\) kernel, and set the relative interaction strength to be \(0.2\). The outcome \(y_i\) is generated as, \[\begin{align*} y_i=h_1(\mathbf{x}_{i, 1})+h_2(\mathbf{x}_{i, 2})+0.2 * h_{12}(\mathbf{x}_{i, 1}, \mathbf{x}_{i, 2})+\epsilon_i, \end{align*}\] where \(h_1\), \(h_2\), \(h_{12}\) are sampled from RKHSs \(\textit{H}_1\), \(\textit{H}_2\), \(\textit{H}_{12}\), generated using the corresponding linear kernel. We standardize all sampled functions to have unit form, so that \(0.2\) represents the strength of interaction relative to the main effect.

The resulting data look as follows.

y z1 z2 z3 z4
1.206494 -0.3540220 -0.8477765 -1.9983111 1.3628378
1.595668 -1.3521966 0.9002141 1.0221309 -0.7187911
1.469933 0.5276426 -0.8567797 0.0372108 0.4386086
1.593581 -1.0576862 0.7018883 0.9085519 -0.8034505
1.363133 0.9926991 0.7144279 -0.9475966 -0.2037185

Now we can apply the cvek function to conduct Gaussian process regression. Below table is a detailed list of all the arguments of the function cvek.

Suppose we want our kernel library to contain three kernels: \("linear"\), \("polynomial"\) with \(p=2\), and \("rbf"\) with \(l=1\) (the effective parameter for \("polynomial"\) is \(p\) and the effective parameter for \("rbf"\) is \(l\), so we can set anything to \(l\) for \("polynomial"\) kernel and \(p\) for \("rbf"\) kernel). We then first apply define_library.

The null model is then \(y \sim z1 + z2 + k(z3, z4)\).

Estimation and Testing

With all these parameters specified, we can conduct Gaussian process regression.

We can see that the ensemble weight assigns \(0.99\) to the \("linear"\) kernel, which is the true kernel. This illustrates the accuracy and efficiency of the CVEK method.

We next specify the testing procedure. Note that we can use the same function cvek to perform hypothesis testing, as we did for estimation, but we need to provide \(formula\_test\), which is the user-supplied formula indicating the additional alternative effect (e.g., interactions) to test for. Specifically, we will first show how to conduct the classic score test by specifying \(test="asymp"\), followed by a bootstrap test where we specify \(test="boot"\), and the number of bootstrap samples \(B=200\).

Both tests come to the same conclusion. At the significance level \(0.05\), we reject the null hypothesis that there’s no interaction effect, which matches our data generation mechanism. Additionally, we can prediction new outcomes based on estimation result est_res.

y_pred y z1 z2 z3 z4
41 1.459658 1.455218 0.4551541 0.8838175 -0.1064603 -0.7294840
42 1.522598 1.592729 -0.3551146 -0.7374137 -1.0941214 -2.1262211
43 1.499522 1.519744 -2.9798328 -1.0640729 -0.2713319 -0.2268063
44 1.493907 1.517606 0.0145789 -0.2485406 -0.4154637 -0.9278958
45 1.486986 1.530861 -0.2603291 -1.0785118 -0.8478335 -0.8378207

Detecting Nonlinear Interaction in Boston Housing Price

In this part, we show an example of using cvek test to detect nonlinear interactions between socioeconomic factors that contribute to housing price in the city of Boston, Massachusetts, USA. We consider the Boston dataset (available in the MASS package), which is collected by the U.S Census Service about the median housing price (\(medv\)) in Boston, along with additional variables describing local socioeconomic information such as per capita crime rate, proportion of non-retail business, number of rooms per household, etc. Below table lists the \(14\) variables.

knitr::include_graphics("table2.pdf", auto_pdf = TRUE)

Here we use cvek to study whether the per capita crime rate (\(crim\)) impacts the relationship between the local socioeconomic status (\(lstat\)) and the housing price. The null model is, \[\begin{align*} medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat), \end{align*}\] where \(\mathbf{x}^\top=(1, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black)\), and \(k()\) is specified as a semi-parametric model with a model library that includes linear and rbf kernels with \(l=1\). This inclusion of nonlinearity (i.e., the rbf kernel) is important, since per classic results in the macroeconmics literature, the crime rates and socioeconomic status of local community are known to have nonlinear association with the local housing price harrison_hedonic_1978.

kern_par <- data.frame(method = c("linear", "rbf"), 
                       l = rep(1, 2), p = 1:2, stringsAsFactors = FALSE)
# define kernel library
kern_func_list <- define_library(kern_par)

To this end, the hypothesis regarding whether the crime rate (\(crim\)) impacts the association between local socioeconomic status (\(lstat\)) and the housing price (\(medv\)) is equivalent to testing whether there exists a nonlinear interaction between \(crim\) and \(lstat\) in predicting \(medv\), i.e., \[\begin{align*} \mathcal{H}_0: &\; medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat), \\ \mathcal{H}_a: &\; medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat) + k(crim):k(lstat). \end{align*}\]

To test this hypothesis using cvek, we specify the null model using formula, and specify the additional interaction term (\(k(crim):k(lstat)\)) in the alternative model using formula_test, as shown below:

formula <- medv ~ zn + indus + chas + nox + rm + age + dis + 
  rad + tax + ptratio + black + k(crim) + k(lstat)
formula_test <- medv ~ k(crim):k(lstat)
fit_bos <- cvek(formula, kern_func_list = kern_func_list, data = Boston, 
                formula_test = formula_test, 
                lambda = exp(seq(-3, 5)), test = "asymp")

Given the fitted object (fit_bos), the p-value of the cvek test can be extracted as below:

fit_bos$pvalue
#>              [,1]
#> [1,] 4.614106e-06

Since \(p<0.05\), we reject the null hypothesis that there’s no \(crim:lstat\) interaction, and conclude that the data does suggest an impact of the crime rate on the relationship between the local socioeconomic status and the housing price.

Visualizing Nonlinear Interaction in Boston Housing Price

A versatile and sometimes the most intepretable method for understanding interaction effects is via plotting. Next we show the Boston example of how to visualize the fitted interaction from a cvek model. We visualize the interaction effects by creating five datasets: Fix all confounding variables to their means, vary \(lstat\) in a reasonable range (i.e., from \(12.5\) to \(17.5\), since the original range of \(lstat\) in Boston dataset is \((1.73, 37.97)\)), and respectively set \(crim\) value to its \(5\%, 25\%, 50\%, 75\%\) and \(95\%\) quantiles.

# first fit the alternative model
formula_alt <- medv ~ zn + indus + chas + nox + rm + age + dis + 
  rad + tax + ptratio + black + k(crim):k(lstat)
fit_bos_alt <- cvek(formula = formula_alt, kern_func_list = kern_func_list, 
                    data = Boston, lambda = exp(seq(-3, 5)))

# mean-center all confounding variables not involved in the interaction 
# so that the predicted values are more easily interpreted
pred_name <- c("zn", "indus", "chas", "nox", "rm", "age", 
               "dis", "rad", "tax", "ptratio", "black")
covar_mean <- apply(Boston, 2, mean)
pred_cov <- covar_mean[pred_name]
pred_cov_df <- t(as.data.frame(pred_cov))
lstat_list <- seq(12.5, 17.5, length.out = 100)
crim_quantiles <- quantile(Boston$crim, probs = c(.05, .25, .5, .75, .95))

# crim is set to its 5% quantile
data_test1 <- data.frame(pred_cov_df, lstat = lstat_list, 
                             crim = crim_quantiles[1])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[1]): row names were found from a short variable and have been
#> discarded
data_test1_pred <- predict(fit_bos_alt, data_test1)

# crim is set to its 25% quantile
data_test2 <- data.frame(pred_cov_df, lstat = lstat_list, 
                             crim = crim_quantiles[2])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[2]): row names were found from a short variable and have been
#> discarded
data_test2_pred <- predict(fit_bos_alt, data_test2)

# crim is set to its 50% quantile
data_test3 <- data.frame(pred_cov_df, lstat = lstat_list, 
                             crim = crim_quantiles[3])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[3]): row names were found from a short variable and have been
#> discarded
data_test3_pred <- predict(fit_bos_alt, data_test3)

# crim is set to its 75% quantile
data_test4 <- data.frame(pred_cov_df, lstat = lstat_list, 
                             crim = crim_quantiles[4])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[4]): row names were found from a short variable and have been
#> discarded
data_test4_pred <- predict(fit_bos_alt, data_test4)

# crim is set to its 95% quantile
data_test5 <- data.frame(pred_cov_df, lstat = lstat_list, 
                             crim = crim_quantiles[5])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[5]): row names were found from a short variable and have been
#> discarded
data_test5_pred <- predict(fit_bos_alt, data_test5)

# combine five sets of prediction data together
medv <- rbind(data_test1_pred, data_test2_pred, data_test3_pred, 
              data_test4_pred, data_test5_pred)
data_pred <- data.frame(lstat = rep(lstat_list, 5), medv = medv, 
                        crim = rep(c("5% quantile", "25% quantile", 
                                     "50% quantile", "75% quantile", 
                                     "95% quantile"), each = 100))
data_pred$crim <- factor(data_pred$crim, 
                         levels = c("5% quantile", "25% quantile", 
                                    "50% quantile", "75% quantile", 
                                    "95% quantile"))

data_label <- data_pred[which(data_pred$lstat == 17.5), ]
data_label$value <- c("0.028%", "0.082%", "0.257%", "3.677%", "15.789%")
data_label$value <- factor(data_label$value, levels = 
                             c("0.028%", "0.082%", "0.257%", 
                               "3.677%", "15.789%"))

ggplot(data = data_pred, aes(x = lstat, y = medv, color = crim)) + 
  geom_point(size = 0.1) + 
  geom_text_repel(aes(label = value), data = data_label, 
                  color = "black", size = 3.6) + 
    scale_colour_manual(values = c("firebrick1", "chocolate2", 
                                   "darkolivegreen3", "skyblue2", 
                                   "purple2")) + 
  geom_line() + theme_set(theme_bw()) +
  theme(panel.grid = element_blank(),
        axis.title.x = element_text(size = 12), 
        axis.title.y = element_text(size = 12), 
        legend.title = element_text(size = 12, face = "bold"), 
        legend.text = element_text(size = 12)) + 
    labs(x = "percentage of lower status", 
         y = "median value of owner-occupied homes ($1000)", 
         col = "per capita crime rate")

The figure above shows the \(medv\) - \(lstat\) relationship under different levels of \(crim\). Numbers at the end of each curves indicate the actual values of \(crim\) rate (per capita crime rate by town) at the corresponding quantiles. From the figure we see that crime rate does impact the relationship between the local socioeconomic status v.s. housing price. Building on this code, user can continue to refine the visualization (e.g., by adding in confidence levels) and use it to improve the the model fit based on domain knowledge (e.g., by experimenting different kernels / hyper-parameters).

References

  1. Jeremiah Zhe Liu and Brent Coull. Robust Hypothesis Test for Nonlinear Effect with Gaussian Processes. October 2017.
  2. Xiang Zhan, Anna Plantinga, Ni Zhao, and Michael C. Wu. A fast small-sample kernel independence test for microbiome community-level association analysis. December 2017.
  3. Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by Exponential Weighting and Sharp Oracle Inequalities. In Learning Theory, Lecture Notes in Computer Science, pages 97– 111. Springer, Berlin, Heidelberg, June 2007.
  4. Arnab Maity and Xihong Lin. Powerful tests for detecting a gene effect in the presence of possible gene-gene interactions using garrote kernel machines. December 2011.
  5. The MIT Press. Gaussian Processes for Machine Learning, 2006.
  6. Xihong Lin. Variance component testing in generalised linear models with random effects. June 1997.
  7. Philip S. Boonstra, Bhramar Mukherjee, and Jeremy M. G. Taylor. A Small-Sample Choice of the Tuning Parameter in Ridge Regression. July 2015.
  8. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer- Verlag, New York, 2 edition, 2009.
  9. Hirotogu Akaike. Information Theory and an Extension of the Maximum Likelihood Principle. In Selected Papers of Hirotugu Akaike, Springer Series in Statistics, pages 199–213. Springer, New York, NY, 1998.
  10. Clifford M. Hurvich and Chih-Ling Tsai. Regression and time series model selection in small samples. June 1989.
  11. Hurvich Clifford M., Simonoff Jeffrey S., and Tsai Chih-Ling. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. January 2002.