Cumulative median

In this vignette we compare two different implementations of the cumulative median. The cumstats package provides a naive method, which uses the standard median function in a for loop. Each call to the standard median function is log-linear, so the total expected complexity is log-quadratic. The binsegRcpp package provides a different implementation that uses a log-linear algorithm, previously described in the 2017 NeurIPS research paper Maximum Margin Interval Trees by Alexandre Drouin, Toby Hocking, Francois Laviolette.

atime.list <- atime::atime(
  N=2^seq(1, 20),
  setup={
    set.seed(1)
    data.vec <- rnorm(N)
  },
  "cumstats::cummedian"={
    cumstats::cummedian(data.vec)
  },
  "binsegRcpp::cum_median"={
    binsegRcpp::cum_median(data.vec)
  },
  times=5)
plot(atime.list)

plot of chunk unnamed-chunk-1


best.list <- atime::references_best(atime.list)
ref.dt <- best.list$ref[each.sign.rank==1]
library(data.table)
if(require(ggplot2)){
  hline.df <- with(atime.list, data.frame(seconds.limit, unit="seconds"))
  gg <- ggplot()+
    theme_bw()+
    facet_grid(unit ~ ., scales="free")+
    geom_line(aes(
      N, reference, group=paste(expr.name, fun.name)),
      color="grey",
      data=ref.dt)+
    geom_hline(aes(
      yintercept=seconds.limit),
      color="grey",
      data=hline.df)+
    geom_line(aes(
      N, empirical, color=expr.name),
      data=best.list$measurements)+
    geom_ribbon(aes(
      N, ymin=min, ymax=max, fill=expr.name),
      data=best.list$measurements[unit=="seconds"],
      alpha=0.5)+
    scale_x_log10()+
    scale_y_log10("median line, min/max band")+
    coord_cartesian(xlim=c(1,1e7))
  if(require("directlabels")){
    gg+
      theme(legend.position="none")+
      directlabels::geom_dl(aes(
        N, empirical, color=expr.name, label=expr.name),
        data=best.list$meas,
        method="last.polygons")+
      directlabels::geom_dl(aes(
        N, reference, label.group=paste(expr.name, fun.name), label=fun.name),
        data=ref.dt,
        color="grey",
        method="bottom.polygons")
  }else{
    gg
  }
}
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis

plot of chunk unnamed-chunk-1

The plots above show significant speed differences between the two implementations of cumulative median. It is clear from the different asymptotic slopes that the empirical timings are consistent with the expected complexity, O(N log N) for binsegRcpp, and O(N2 log N) for cumstats. The code below shows that the results of the two implementations are identical.

result.wide <- dcast(atime.list$meas, N ~ expr.name, value.var="result")
result.complete <- result.wide[sapply(`cumstats::cummedian`, length) == N]
result.complete[, identical(`binsegRcpp::cum_median`,`cumstats::cummedian`)]
#> [1] TRUE

Conclusion: binsegRcpp::cum_median should be the preferred method for computing the cumulative median because it is much faster.