hermiter
facilitates the estimation of the probability density function and cumulative distribution function in univariate and bivariate settings using Hermite series based estimators. In addition, hermiter
allows the estimation of the quantile function in the univariate case and nonparametric correlation coefficients in the bivariate case. The package is applicable to streaming, batch and grouped data. The core methods of the package are written in C++ for speed.
These estimators are particularly useful in the sequential setting (both stationary and non-stationary data streams). In addition, they are useful in efficient, one-pass batch estimation which is particularly relevant in the context of large data sets. Finally, the Hermite series based estimators are applicable in decentralized (distributed) settings in that estimators formed on subsets of the data can be consistently merged. The Hermite series based estimators have the distinct advantage of being able to estimate the full density function, distribution function and quantile function (univariate setting) along with the Spearman Rho and Kendall Tau correlation coefficients (bivariate setting) in an online manner. The theoretical and empirical properties of most of these estimators have been studied in-depth in the articles below. The investigations demonstrate that the Hermite series based estimators are particularly effective in distribution function, quantile function and Spearman correlation estimation.
In order to utilize the hermiter package, the package must be loaded using the following command:
library(hermiter)
Other packages that are used in this vignette are loaded as follows:
library(magrittr)
library(ggplot2)
library(dplyr)
library(data.table)
library(DT)
library(mvtnorm)
library(patchwork)
A hermite_estimator S3 object is constructed as below. The argument, N, adjusts the number of terms in the Hermite series based estimator and controls the trade-off between bias and variance. A lower N value implies a higher bias but lower variance and vice versa for higher values of N. The argument, standardize, controls whether or not to standardize observations before applying the estimator. Standardization usually yields better results and is recommended for most estimation settings.
A univariate estimator is constructed as follows (note that the default estimator type is univariate, so this argument does not need to be explicitly set):
<- hermite_estimator(N=10, standardize=TRUE,
hermite_est est_type = "univariate")
Similarly for constructing a bivariate estimator:
<- hermite_estimator(N=10, standardize=TRUE,
hermite_est est_type = "bivariate")
Once the hermite_estimator object has been constructed, it can be updated with a batch of observations as below.
For univariate observations:
<- rlogis(n=1000)
observations <- hermite_estimator(N=10, standardize=TRUE)
hermite_est <- update_batch(hermite_est,observations) hermite_est
For bivariate observations:
<- matrix(data = rnorm(2000),nrow = 1000, ncol=2)
observations <- hermite_estimator(N=10, standardize=TRUE,
hermite_est est_type = "bivariate")
<- update_batch(hermite_est,observations) hermite_est
Functional piped syntax can also be used as below provided the magrittr
package is installed.
<- rlogis(n=1000)
observations <- hermite_estimator(N=10, standardize=TRUE)
hermite_est <- hermite_est %>% update_batch(observations) hermite_est
<- matrix(data = rnorm(2000),nrow = 1000, ncol=2)
observations <- hermite_estimator(N=10, standardize=TRUE,
hermite_est est_type = "bivariate")
<- hermite_est %>% update_batch(observations) hermite_est
In the sequential setting, observations are revealed one at a time. A hermite_estimator object can be updated sequentially with a single new observation by utilizing the update_sequential method. Note that when updating the Hermite series based estimator sequentially, observations are also standardized sequentially if the standardize argument is set to true in the constructor.
For univariate observations:
<- rlogis(n=1000)
observations <- hermite_estimator(N=10, standardize=TRUE)
hermite_est for (idx in seq_along(observations)) {
<- update_sequential(hermite_est,observations[idx])
hermite_est }
For bivariate observations:
<- matrix(data = rnorm(2000),nrow = 1000, ncol=2)
observations <- hermite_estimator(N=10, standardize=TRUE,
hermite_est est_type = "bivariate")
for (idx in seq_len(nrow(observations))) {
<- update_sequential(hermite_est,observations[idx,])
hermite_est }
For univariate observations:
<- rlogis(n=1000)
observations <- hermite_estimator(N=10, standardize=TRUE)
hermite_est for (idx in seq_along(observations)) {
<- hermite_est %>% update_sequential(observations[idx])
hermite_est }
For bivariate observations:
<- matrix(data = rnorm(2000),nrow = 1000, ncol=2)
observations <- hermite_estimator(N=10, standardize=TRUE,
hermite_est est_type = "bivariate")
for (idx in seq_len(nrow(observations))) {
<- hermite_est %>% update_sequential(observations[idx,])
hermite_est }
Hermite series based estimators can be consistently combined/merged in both the univariate and bivariate settings. In particular, when standardize = FALSE, the results obtained from combining/merging distinct hermite_estimators updated on subsets of a data set are exactly equal to those obtained by constructing a single hermite_estimator and updating on the full data set (corresponding to the concatenation of the aforementioned subsets). This holds true for the pdf, cdf and quantile results in the univariate case and the pdf, cdf and nonparametric correlation results in the bivariate case. When standardize = TRUE, the equivalence is no longer exact, but is accurate enough to be practically useful. Combining/merging hermite_estimators is illustrated below.
For the univariate case:
<- rlogis(n=1000)
observations_1 <- rlogis(n=1000)
observations_2 <- hermite_estimator(N=10, standardize=TRUE) %>%
hermite_est_1 update_batch(observations_1)
<- hermite_estimator(N=10, standardize=TRUE) %>%
hermite_est_2 update_batch(observations_2)
<- merge_hermite(list(hermite_est_1,hermite_est_2)) hermite_est_merged
For the bivariate case:
<- matrix(data = rnorm(2000),nrow = 1000, ncol=2)
observations_1 <- matrix(data = rnorm(2000),nrow = 1000, ncol=2)
observations_2 <- hermite_estimator(N=10, standardize=TRUE,
hermite_est_1 est_type = "bivariate") %>%
update_batch(observations_1)
<- hermite_estimator(N=10, standardize=TRUE,
hermite_est_2 est_type = "bivariate") %>%
update_batch(observations_2)
<- merge_hermite(list(hermite_est_1,hermite_est_2)) hermite_est_merged
The ability to combine/merge estimators is particularly useful in applications involving grouped data (as we will illustrate below).
The central advantage of Hermite series based estimators is that they can be updated in a sequential/one-pass manner as above and subsequently probability densities and cumulative probabilities at arbitrary x values can be obtained, along with arbitrary quantiles. The hermite_estimator object only maintains a small and fixed number of coefficients and thus uses minimal memory. The syntax to calculate probability densities, cumulative probabilities and quantiles in the univariate setting is presented below.
<- rlogis(n=2000)
observations <- hermite_estimator(N=10, standardize=TRUE)
hermite_est <- update_batch(hermite_est, observations)
hermite_est
<- seq(-15,15,0.1)
x <- dens(hermite_est,x)
pdf_est <- cum_prob(hermite_est,x)
cdf_est
<- seq(0.05,1,0.05)
p <- quant(hermite_est,p) quantile_est
<- rlogis(n=2000)
observations <- hermite_estimator(N=10, standardize=TRUE)
hermite_est <- hermite_est %>% update_batch(observations)
hermite_est
<- seq(-15,15,0.1)
x <- hermite_est %>% dens(x)
pdf_est <- hermite_est %>% cum_prob(x)
cdf_est
<- seq(0.05,0.95,0.05)
p <- hermite_est %>% quant(p) quantile_est
<- dlogis(x)
actual_pdf <- plogis(x)
actual_cdf <- data.frame(x,pdf_est,cdf_est,actual_pdf,actual_cdf)
df_pdf_cdf
<- qlogis(p)
actual_quantiles <- data.frame(p,quantile_est,actual_quantiles) df_quant
ggplot(df_pdf_cdf,aes(x=x)) + geom_line(aes(y=pdf_est, colour="Estimated")) +
geom_line(aes(y=actual_pdf, colour="Actual")) +
scale_colour_manual("",
breaks = c("Estimated", "Actual"),
values = c("blue", "black")) + ylab("Probability Density")
ggplot(df_pdf_cdf,aes(x=x)) + geom_line(aes(y=cdf_est, colour="Estimated")) +
geom_line(aes(y=actual_cdf, colour="Actual")) +
scale_colour_manual("",
breaks = c("Estimated", "Actual"),
values = c("blue", "black")) +
ylab("Cumulative Probability")
ggplot(df_quant,aes(x=actual_quantiles)) + geom_point(aes(y=quantile_est),
color="blue") +
geom_abline(slope=1,intercept = 0) +xlab("Theoretical Quantiles") +
ylab("Estimated Quantiles")
The aforementioned suitability of Hermite series based estimators in sequential and one-pass batch estimation settings extends to the bivariate case. Probability densities and cumulative probabilities can be obtained at arbitrary points. The syntax to calculate probability densities and
cumulative probabilities along with the Spearman and Kendall correlation coefficients in the bivariate setting is presented below.
# Prepare bivariate normal data
<- 1
sig_x <- 1
sig_y <- 4000
num_obs <- 0.5
rho <- mvtnorm::rmvnorm(n=num_obs,mean=rep(0,2),
observations_mat sigma = matrix(c(sig_x^2,rho*sig_x*sig_y,rho*sig_x*sig_y,sig_y^2),
nrow=2,ncol=2, byrow = TRUE))
<- hermite_estimator(N = 30, standardize = TRUE,
hermite_est est_type = "bivariate")
<- update_batch(hermite_est,observations_mat)
hermite_est <- seq(-5,5,by=0.25)
vals <- as.matrix(expand.grid(X=vals, Y=vals))
x_grid <- dens(hermite_est,x_grid)
pdf_est <- cum_prob(hermite_est,x_grid)
cdf_est <- spearmans(hermite_est)
spear_est <- kendall(hermite_est) kendall_est
<- 1
sig_x <- 1
sig_y <- 4000
num_obs <- 0.5
rho <- mvtnorm::rmvnorm(n=num_obs,mean=rep(0,2),
observations_mat sigma = matrix(c(sig_x^2,rho*sig_x*sig_y,rho*sig_x*sig_y,sig_y^2),
nrow=2, ncol=2, byrow = TRUE))
<- hermite_estimator(N = 30, standardize = TRUE,
hermite_est est_type = "bivariate")
<- hermite_est %>% update_batch(observations_mat)
hermite_est
<- seq(-5,5,by=0.25)
vals <- as.matrix(expand.grid(X=vals, Y=vals))
x_grid <- hermite_est %>% dens(x_grid, clipped = TRUE)
pdf_est <- hermite_est %>% cum_prob(x_grid, clipped = TRUE)
cdf_est <- hermite_est %>% spearmans()
spear_est <- hermite_est %>% kendall() kendall_est
<-mvtnorm::dmvnorm(x_grid,mean=rep(0,2),
actual_pdf sigma = matrix(c(sig_x^2,rho*sig_x*sig_y,rho*sig_x*sig_y,sig_y^2),
nrow=2,ncol=2, byrow = TRUE))
<- rep(NA,nrow(x_grid))
actual_cdf for (row_idx in seq_len(nrow(x_grid))) {
<- mvtnorm::pmvnorm(lower = c(-Inf,-Inf),
actual_cdf[row_idx] upper=as.numeric(x_grid[row_idx,]),mean=rep(0,2),sigma = matrix(c(sig_x^2,
*sig_x*sig_y,rho*sig_x*sig_y,sig_y^2), nrow=2,ncol=2,byrow = TRUE))
rho
}<- cor(observations_mat,method = "spearman")[1,2]
actual_spearmans <- cor(observations_mat,method = "kendall")[1,2]
actual_kendall <- data.frame(x_grid,pdf_est,cdf_est,actual_pdf,actual_cdf) df_pdf_cdf
<- ggplot(df_pdf_cdf) + geom_tile(aes(X, Y, fill= actual_pdf)) +
p1 scale_fill_gradient2(low="blue", mid="cyan", high="purple",
midpoint=.1,
breaks=seq(0,.2,by=.05),
limits=c(0,.2))
<- ggplot(df_pdf_cdf) + geom_tile(aes(X, Y, fill= pdf_est)) +
p2 scale_fill_gradient2(low="blue", mid="cyan", high="purple",
midpoint=.1,
breaks=seq(0,.2,by=.05),
limits=c(0,.2))
+ ggtitle("Actual PDF")+ theme(legend.title = element_blank()) + p2 +
p1ggtitle("Estimated PDF") +theme(legend.title = element_blank()) +
plot_layout(guides = 'collect')
<- ggplot(df_pdf_cdf) + geom_tile(aes(X, Y, fill= actual_cdf)) +
p1 scale_fill_gradient2(low="blue", mid="cyan", high="purple",
midpoint=0.5,
breaks=seq(0,1,by=.2),
limits=c(0,1))
<- ggplot(df_pdf_cdf) + geom_tile(aes(X, Y, fill= cdf_est)) +
p2 scale_fill_gradient2(low="blue", mid="cyan", high="purple",
midpoint=0.5,
breaks=seq(0,1,by=.2), #breaks in the scale bar
limits=c(0,1))
+ ggtitle("Actual CDF") + theme(legend.title = element_blank()) + p2 +
p1ggtitle("Estimated CDF") + theme(legend.title = element_blank())+
plot_layout(guides = 'collect')
Spearman’s correlation coefficient results:
Spearman’s Correlation | |
---|---|
Actual | 0.453 |
Estimated | 0.449 |
Kendall correlation coefficient results:
Kendall Correlation | |
---|---|
Actual | 0.312 |
Estimated | 0.309 |
A useful application of the hermite_estimator class is to obtain pdf, cdf and quantile function estimates on groups of data as illustrated in the example below. The fact that the memory usage of hermite_estimator objects is small and fixed irrespective of group size and that batch estimation is fast, implies that it is practical to perform estimation on a very large number of groups.
We first present the example using the data.table
package.
# Prepare Test Data
<- data.frame()
test_data for (i in seq_len(5)) {
<- rexp(n=1000)
exponential_data <- rlogis(n=1000)
logistic_data <- rlnorm(n=1000)
logn_data <- rbind(test_data,data.frame(dist_name=rep("exp",
test_data length(exponential_data)),idx=i,observations=exponential_data))
<- rbind(test_data,data.frame(dist_name=rep("logis",
test_data length(logistic_data)),idx=i,observations=logistic_data))
<- rbind(test_data,data.frame(dist_name=rep("lnorm",
test_data length(logn_data)),idx=i,observations=logn_data))
}setDT(test_data)
# Group observations by distribution and idx and create Hermite estimators
<- test_data[,.(herm_est = list(hermite_estimator(N=10,
estimates standardize = TRUE) %>% update_batch(observations))),
=.(dist_name,idx)]
by
estimates#> dist_name idx herm_est
#> 1: exp 1 <hermite_estimator_univar[10]>
#> 2: logis 1 <hermite_estimator_univar[10]>
#> 3: lnorm 1 <hermite_estimator_univar[10]>
#> 4: exp 2 <hermite_estimator_univar[10]>
#> 5: logis 2 <hermite_estimator_univar[10]>
#> 6: lnorm 2 <hermite_estimator_univar[10]>
#> 7: exp 3 <hermite_estimator_univar[10]>
#> 8: logis 3 <hermite_estimator_univar[10]>
#> 9: lnorm 3 <hermite_estimator_univar[10]>
#> 10: exp 4 <hermite_estimator_univar[10]>
#> 11: logis 4 <hermite_estimator_univar[10]>
#> 12: lnorm 4 <hermite_estimator_univar[10]>
#> 13: exp 5 <hermite_estimator_univar[10]>
#> 14: logis 5 <hermite_estimator_univar[10]>
#> 15: lnorm 5 <hermite_estimator_univar[10]>
# Group observations by distribution and merge Hermite estimators
<- estimates[,.(herm_comb = list(merge_hermite(herm_est))),
merged_estimates =.(dist_name)]
by
merged_estimates#> dist_name herm_comb
#> 1: exp <hermite_estimator_univar[10]>
#> 2: logis <hermite_estimator_univar[10]>
#> 3: lnorm <hermite_estimator_univar[10]>
# Estimate probability densities, cumulative probabilities and quantiles
<- merged_estimates[,.(dens_est = list(dens(herm_comb[[1]],
dens_vals c(0.5,1,1.5,2)))),by=.(dist_name)]
<- merged_estimates[,.(cum_prob_est = list(cum_prob(herm_comb[[1]]
cum_prob_vals c(0.5,1,1.5,2)))),by=.(dist_name)]
,<- merged_estimates[,.(quantile_est = list(quant(herm_comb[[1]],
quantile_vals c(0.25,0.5,0.75)))),by=.(dist_name)]
The same illustrative example is presented using the dplyr
package instead of the data.table
package below.
# Prepare Test Data
<- data.frame()
test_data for (i in seq_len(5)) {
<- rexp(n=1000)
exponential_data <- rlogis(n=1000)
logistic_data <- rlnorm(n=1000)
logn_data <- rbind(test_data,data.frame(dist_name=rep("exp",
test_data length(exponential_data)),idx=i,observations=exponential_data))
<- rbind(test_data,data.frame(dist_name=rep("logis",
test_data length(logistic_data)),idx=i,observations=logistic_data))
<- rbind(test_data,data.frame(dist_name=rep("lnorm",
test_data length(logn_data)),idx=i,observations=logn_data))
}
# Group observations by distribution and idx and create Hermite estimators
<- test_data %>% group_by(dist_name,idx) %>% summarise(herm_est =
estimates list(hermite_estimator(N=10,standardize = TRUE) %>% update_batch(observations)))
#> `summarise()` has grouped output by 'dist_name'. You can override using the `.groups` argument.
estimates#> # A tibble: 15 x 3
#> # Groups: dist_name [3]
#> dist_name idx herm_est
#> <chr> <int> <list>
#> 1 exp 1 <hrmt_st_ [10]>
#> 2 exp 2 <hrmt_st_ [10]>
#> 3 exp 3 <hrmt_st_ [10]>
#> 4 exp 4 <hrmt_st_ [10]>
#> 5 exp 5 <hrmt_st_ [10]>
#> 6 lnorm 1 <hrmt_st_ [10]>
#> 7 lnorm 2 <hrmt_st_ [10]>
#> 8 lnorm 3 <hrmt_st_ [10]>
#> 9 lnorm 4 <hrmt_st_ [10]>
#> 10 lnorm 5 <hrmt_st_ [10]>
#> 11 logis 1 <hrmt_st_ [10]>
#> 12 logis 2 <hrmt_st_ [10]>
#> 13 logis 3 <hrmt_st_ [10]>
#> 14 logis 4 <hrmt_st_ [10]>
#> 15 logis 5 <hrmt_st_ [10]>
# Group observations by distribution and merge Hermite estimators
<- estimates %>% group_by(dist_name) %>% summarise(herm_comb
merged_estimates = list(merge_hermite(herm_est)))
merged_estimates#> # A tibble: 3 x 2
#> dist_name herm_comb
#> <chr> <list>
#> 1 exp <hrmt_st_ [10]>
#> 2 lnorm <hrmt_st_ [10]>
#> 3 logis <hrmt_st_ [10]>
# Estimate probability densities, cumulative probabilities and quantiles
<- merged_estimates %>%
dens_vals rowwise() %>% mutate(dens_est = list(dens(herm_comb,c(0.5,1,1.5,2))))
<- merged_estimates %>%
cum_prob_vals rowwise() %>% mutate(cum_prob_est = list(cum_prob(herm_comb,c(0.5,1,1.5,2))))
<- merged_estimates %>%
quantile_vals rowwise() %>% mutate(quantile_est = list(quant(herm_comb,c(0.25,0.5,0.75))))
# Compute Mean Absolute Error
<- dens_vals %>%
dens_vals rowwise() %>% mutate(dens_actual = list(do.call(paste0("d",dist_name),
list(c(0.5,1,1.5,2))))) %>% mutate(mean_abs_error_density =
mean(abs(dens_est-dens_actual)))
<- cum_prob_vals %>%
cum_prob_vals rowwise() %>% mutate(cum_prob_actual = list(do.call(paste0("p",dist_name),
list(c(0.5,1,1.5,2)))))%>% mutate(mean_abs_error_cum_prob =
mean(abs(cum_prob_est-cum_prob_actual)))
<- quantile_vals %>%
quantile_vals rowwise() %>% mutate(quantile_actual= list(do.call(paste0("q",dist_name),
list(c(0.25,0.5,0.75)))))%>% mutate(mean_abs_error_quantiles =
mean(abs(quantile_est-quantile_actual)))
<- data.frame(dist_name=dens_vals$dist_name,
mean_abs_error_summary mean_abs_error_density=dens_vals$mean_abs_error_density,
mean_abs_error_cum_prob=cum_prob_vals$mean_abs_error_cum_prob,
mean_abs_error_quantiles=quantile_vals$mean_abs_error_quantiles)
datatable(mean_abs_error_summary) %>% formatRound(columns =
c("mean_abs_error_density","mean_abs_error_cum_prob",
"mean_abs_error_quantiles"),digits = 3)
Another useful application of the hermite_estimator class is to obtain pdf, cdf and quantile function estimates on streaming data. The speed of estimation allows the pdf, cdf and quantile functions to be estimated in real time. We illustrate this below for cdf and quantile estimation with a sample Shiny application. We reiterate that the particular usefulness is that the full pdf, cdf and quantile functions are updated in real time. Thus, any arbitrary quantile can be evaluated at any point in time. We include a stub for reading streaming data that generates micro-batches of standard exponential i.i.d. random data. This stub can easily be swapped out for a method reading micro-batches from a Kafka topic or similar.
The Shiny sample code below can be pasted into a single app.R file and run directly.
# Not Run. Copy and paste into app.R and run.
library(shiny)
library(hermiter)
library(ggplot2)
library(magrittr)
<- fluidPage(
ui titlePanel("Streaming Statistics Analysis Example: Exponential
i.i.d. stream"),
sidebarLayout(
sidebarPanel(
sliderInput("percentile", "Percentile:",
min = 0.01, max = 0.99,
value = 0.5, step = 0.01)
),mainPanel(
plotOutput("plot"),
textOutput("quantile_text")
)
)
)
<- function(input, output) {
server <- reactiveValues(hermite_est =
values hermite_estimator(N = 10, standardize = TRUE))
<- seq(-15, 15, 0.1)
x # Note that the stub below could be replaced with code that reads streaming
# data from various sources, Kafka etc.
<- reactive({
read_stream_stub_micro_batch invalidateLater(1000)
<- rexp(10)
new_observation return(new_observation)
})<- reactive({
updated_cdf_calc <- read_stream_stub_micro_batch()
micro_batch for (idx in seq_along(micro_batch)) {
"hermite_est"]] <- isolate(values[["hermite_est"]]) %>%
values[[update_sequential(micro_batch[idx])
}<- isolate(values[["hermite_est"]]) %>%
cdf_est cum_prob(x, clipped = TRUE)
<- data.frame(x, cdf_est)
df_cdf return(df_cdf)
})<- reactive({
updated_quantile_calc "hermite_est"]] %>% quant(input$percentile)
values[[
})$plot <- renderPlot({
outputggplot(updated_cdf_calc(), aes(x = x)) + geom_line(aes(y = cdf_est)) +
ylab("Cumulative Probability")
}
)$quantile_text <- renderText({
outputreturn(paste(input$percentile * 100, "th Percentile:",
round(updated_quantile_calc(), 2)))
})
}shinyApp(ui = ui, server = server)
The hermite_estimator is also applicable to non-stationary data streams. A weighted form of the Hermite series based estimator can be applied to handle this case. The estimator will adapt to the new distribution and “forget” the old distribution as illustrated in the example below. In this univariate example, the distribution from which the observations are drawn switches from a Chi-square distribution to a logistic distribution and finally to a normal distribution. In order to use the exponentially weighted form of the hermite_estimator, the exp_weight_lambda argument must be set to a non-NA value. Typical values for this parameter are 0.01, 0.05 and 0.1. The lower the exponential weighting parameter, the slower the estimator adapts and vice versa for higher values of the parameter. However, variance increases with higher values of exp_weight_lambda, so there is a trade-off to bear in mind.
# Prepare Test Data
<-2000
num_obs <- rchisq(num_obs,5)
test <- c(test,rlogis(num_obs))
test <- c(test,rnorm(num_obs)) test
# Calculate theoretical pdf, cdf and quantile values for comparison
<- seq(-15,15,by=0.1)
x <- dchisq(x,5)
actual_pdf_lognorm <- dlogis(x)
actual_pdf_logis <- dnorm(x)
actual_pdf_norm <- pchisq(x,5)
actual_cdf_lognorm <- plogis(x)
actual_cdf_logis <- pnorm(x)
actual_cdf_norm <- seq(0.05,0.95,by=0.05)
p <- qchisq(p,5)
actual_quantiles_lognorm <- qlogis(p)
actual_quantiles_logis <- qnorm(p) actual_quantiles_norm
# Construct Hermite Estimator
<- hermite_estimator(N=20,standardize = TRUE,exp_weight_lambda = 0.005) h_est
# Loop through test data and update h_est to simulate observations arriving
# sequentially
<- 1
count <- data.frame()
res <- data.frame()
res_q for (idx in seq_along(test)) {
<- h_est %>% update_sequential(test[idx])
h_est if (idx %% 100 == 0){
if (floor(idx/num_obs)==0){
<- actual_cdf_lognorm
actual_cdf_vals <-actual_pdf_lognorm
actual_pdf_vals <- actual_quantiles_lognorm
actual_quantile_vals
}if (floor(idx/num_obs)==1){
<- actual_cdf_logis
actual_cdf_vals <-actual_pdf_logis
actual_pdf_vals <- actual_quantiles_logis
actual_quantile_vals
}if (floor(idx/num_obs)==2){
<- actual_cdf_norm
actual_cdf_vals <- actual_pdf_norm
actual_pdf_vals <- actual_quantiles_norm
actual_quantile_vals
}<- rep(count,length(x))
idx_vals <- h_est %>% cum_prob(x, clipped=TRUE)
cdf_est_vals <- h_est %>% dens(x, clipped=TRUE)
pdf_est_vals <- h_est %>% quant(p)
quantile_est_vals <- rbind(res,data.frame(idx_vals,x,cdf_est_vals,actual_cdf_vals,
res
pdf_est_vals,actual_pdf_vals))<- rbind(res_q,data.frame(idx_vals=rep(count,length(p)),p,
res_q
quantile_est_vals,actual_quantile_vals))<- count +1
count
}
}<- res %>% mutate(idx_vals=idx_vals*100)
res <- res_q %>% mutate(idx_vals=idx_vals*100) res_q
# Visualize Results for PDF (Not run, requires gganimate, gifski and transformr
# packages)
<- ggplot(res,aes(x=x)) + geom_line(aes(y=pdf_est_vals, colour="Estimated")) +
p geom_line(aes(y=actual_pdf_vals, colour="Actual")) +
scale_colour_manual("",
breaks = c("Estimated", "Actual"),
values = c("blue", "black")) +
ylab("Probability Density") +
transition_states(idx_vals,transition_length = 2,state_length = 1) +
ggtitle('Observation index {closest_state}')
anim_save("pdf.gif",p)
# Visualize Results for CDF (Not run, requires gganimate, gifski and transformr
# packages)
<- ggplot(res,aes(x=x)) + geom_line(aes(y=cdf_est_vals, colour="Estimated")) +
p geom_line(aes(y=actual_cdf_vals, colour="Actual")) +
scale_colour_manual("",
breaks = c("Estimated", "Actual"),
values = c("blue", "black")) +
ylab("Cumulative Probability") +
transition_states(idx_vals, transition_length = 2,state_length = 1) +
ggtitle('Observation index {closest_state}')
anim_save("cdf.gif", p)
# Visualize Results for Quantiles (Not run, requires gganimate, gifski and
# transformr packages)
<- ggplot(res_q,aes(x=actual_quantile_vals)) +
p geom_point(aes(y=quantile_est_vals), color="blue") +
geom_abline(slope=1,intercept = 0) +xlab("Theoretical Quantiles") +
ylab("Estimated Quantiles") +
transition_states(idx_vals,transition_length = 2, state_length = 1) +
ggtitle('Observation index {closest_state}')
anim_save("quant.gif",p)
We illustrate tracking a non-stationary bivariate data stream with another sample Shiny application. The bivariate Hermite estimator leverages an exponential weighting scheme as described in the univariate case and does not need to maintain a sliding window. We include a stub for reading streaming data that generates micro-batches of bivariate normal i.i.d. random data with a chosen Spearman’s correlation coefficient (as this is easily linked to the standard correlation matrix). This stub can again be readily swapped out for a method reading micro-batches from a Kafka topic or similar.
The Shiny sample code below can be pasted into a single app.R file and run directly.
# Not Run. Copy and paste into app.R and run.
library(shiny)
library(hermiter)
library(ggplot2)
library(magrittr)
<- fluidPage(
ui titlePanel("Bivariate Streaming Statistics Analysis Example"),
sidebarLayout(
sidebarPanel(
sliderInput("spearmans", "True Spearman's Correlation:",
min = -0.9, max = 0.9,
value = 0, step = 0.1)
),mainPanel(
plotOutput("plot"),
textOutput("spearman_text")
)
)
)
<- function(input, output) {
server <- reactiveValues(hermite_est =
values hermite_estimator(N = 10, standardize = TRUE,
exp_weight_lambda = 0.01,
est_type="bivariate"))
# Note that the stub below could be replaced with code that reads streaming
# data from various sources, Kafka etc.
<- reactive({
read_stream_stub_micro_batch invalidateLater(1000)
<- 1
sig_x <- 1
sig_y <- 100
num_obs <- 2 *sin(pi/6 * input$spearmans)
rho <- mvtnorm::rmvnorm(n=num_obs,mean=rep(0,2),
observations_mat sigma = matrix(c(sig_x^2,rho*sig_x*sig_y,rho*sig_x*sig_y,sig_y^2),
nrow=2,ncol=2, byrow = TRUE))
return(observations_mat)
})<- reactive({
updated_spear_calc <- read_stream_stub_micro_batch()
micro_batch for (idx in seq_len(nrow(micro_batch))) {
"hermite_est"]] <- isolate(values[["hermite_est"]]) %>%
values[[update_sequential(micro_batch[idx,])
}<- isolate(values[["hermite_est"]]) %>%
spear_est spearmans(clipped = TRUE)
return(spear_est)
})$plot <- renderPlot({
output<- seq(-5,5,by=0.25)
vals <- as.matrix(expand.grid(X=vals, Y=vals))
x_grid <- 2 *sin(pi/6 * input$spearmans)
rho <-mvtnorm::dmvnorm(x_grid,mean=rep(0,2),
actual_pdf sigma = matrix(c(sig_x^2,rho*sig_x*sig_y,rho*sig_x*sig_y,sig_y^2),
nrow=2,ncol=2, byrow = TRUE))
<- data.frame(x_grid,actual_pdf)
df_pdf <- ggplot(df_pdf) + geom_tile(aes(X, Y, fill= actual_pdf)) +
p1 scale_fill_gradient2(low="blue", mid="cyan", high="purple",
midpoint=.2,
breaks=seq(0,.4,by=.1),
limits=c(0,.4)) +ggtitle(paste("True Bivariate
Normal Density with matched Spearman's correlation")) +
theme(legend.title = element_blank())
p1
}
)$spearman_text <- renderText({
outputreturn(paste("Spearman's Correlation Estimate from Hermite Estimator:",
round(updated_spear_calc(), 1)))
})
}shinyApp(ui = ui, server = server)
Use the following code to automatically generate an appropriate BibTex entry.
citation("hermiter")