TDAvec vignette

The TDAvec package provides implementations of several vector summaries of persistence diagrams studied in Topological Data Analysis (TDA). Each is obtained by discretizing the associated summary function computed from a persistence diagram. The summary functions included in this package are

For improved computational efficiency, all code behind the vector summaries is written in C++ using the Rcpp package. Whenever applicable, when compare our code with existing implementations in terms of accuracy and run-time cost. In this vignette, we illustrate the basic usage of the TDAvec package using simple examples.

Now, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise:

Next, we use the TDA package to compute the persistence diagram (PD) using the Vietoris-Rips filtration built on top of the point cloud \(X\).

D <- ripsDiag(X,maxdimension = 1,maxscale = 2)$diagram
sum(D[,1]==0) # number of connected components
#> [1] 100
sum(D[,1]==1) # number of loops
#> [1] 13
sum(D[,1]==2) # number of voids
#> [1] 0

In the ripsDiag() function, maxdimension is the maximum homological dimension of the topological features to be computed (connected components if maxdimension=0, connected components and loops if 1, connected components, loops and voids if 2, etc.) maxscale is the maximum value of the scale parameter of the filtration (which we set equal to 2 since the points are sampled from a circle with diameter 2).

The persistence diagram \(D\) has 100 connected components (the same as the point cloud size), 13 loops (one with long life-span, the rest are short-lived) and 0 voids along with their birth and death values. To plot the diagram, we can use the plot() function.

Let’s compute the persistence landscape (PL) and persistence silhouette (PS) summaries from the persistence diagram \(D\).

# sequence of scale values to vectorize the summary function
scaleSeq = seq(0,2,length.out=11) 
# compute the PL and PS summaries for homological dimension H_0
computePL(D,homDim = 0,scaleSeq,k=1)
#>  [1] 0.0 0.2 0.4 0.6 0.8 1.0 0.8 0.6 0.4 0.2 0.0
computePS(D,homDim = 0,scaleSeq,p=1)
#>  [1] 0.00000000 0.04682016 0.04923603 0.07385405 0.09847206 0.12309008
#>  [7] 0.09847206 0.07385405 0.04923603 0.02461802 0.00000000

Here, the vectorization is performed by evaluating the respective summary function at each element of scaleSeq (i.e. \(0.0,0.2,0.4,\ldots,2.0\)) and arranging the values into a vector.

The parameter k in computePL() is the order of the persistence landscape function. The p in computePS() is the power of the weights for the silhouette function.

To compute the summaries for homological dimension \(H_1\), set the homDim argument equal to 1:

# compute the PL and PS summaries for homological dimension H_1
computePL(D,homDim = 1,scaleSeq,k=1)
#>  [1] 0.000000000 0.014217630 0.000931983 0.191114536 0.391114536 0.269212150
#>  [7] 0.069212150 0.000000000 0.000000000 0.000000000 0.000000000
computePS(D,homDim = 1,scaleSeq,p=1)
#>  [1] 0.000000e+00 1.114105e-03 5.251128e-05 1.362787e-01 2.788933e-01
#>  [6] 1.919680e-01 4.935333e-02 0.000000e+00 0.000000e+00 0.000000e+00
#> [11] 0.000000e+00

The TDA package also provides implementations of the persistence landscapes and silhouettes. Below we compare the two implementations in terms of accuracy of results and run-time cost.

pl1 <- computePL(D,homDim = 0,k=1,scaleSeq)
pl2 <- as.vector(landscape(D,dimension = 0,KK = 1, tseq = scaleSeq))
all.equal(pl1,pl2) # -> TRUE (the results are the same)
#> [1] TRUE

compCost <- microbenchmark(
  computePL(D,homDim = 0,k=1,scaleSeq),
  landscape(D,dimension = 0,KK = 1, tseq = scaleSeq),
  times = 500
)
sm <- summary(compCost)
costRatioPL <- sm$mean[2]/sm$mean[1] # ratio of computational time means
print(costRatioPL)
#> [1] 236.5175

For homological dimension \(H_0\), it took TDA::landscape() about 237 times more time than TDAvec::computePL(). Similarly, we can compare TDA::silhouette() and TDAvec::computePL():

ps1 <- computePS(D,homDim = 0, p = 1,scaleSeq)
ps2 <- as.vector(silhouette(D,dimension = 0,p = 1, tseq = scaleSeq))
all.equal(ps1,ps2) # -> TRUE (the results are the same)
#> [1] TRUE

compCost <- microbenchmark(
  computePS(D,homDim = 0, p = 1,scaleSeq),
  silhouette(D,dimension = 0,p = 1, tseq = scaleSeq),
  times = 500
)
sm <- summary(compCost)
costRatioPS <- sm$mean[2]/sm$mean[1]
print(costRatioPS)
#> [1] 71.20121

To discretize the persistent entropy summary function, Euler characteristic curve and Betti curve, we employ a different vectorization scheme. Rather than evaluating a summary function at increasing scales and arranging the values into a vector, we compute the average values of the summary function between two consecutive scale points using integration. More specifically, if \(f\) is a (univariate) summary function and \(t_1,t_2,\ldots,t_n\) are increasing scale values, we discretize \(f\) as:

\[\begin{equation} \Big(\frac{1}{\Delta t_1}\int_{t_1}^{t_2}f(t)dt,\frac{1}{\Delta t_2}\int_{t_2}^{t_3}f(t)dt,\ldots,\frac{1}{\Delta t_{n-1}}\int_{t_{n-1}}^{t_n}f(t)dt\Big)\in\mathbb{R}^{n-1}, \end{equation}\]

where \(\Delta t_k=t_{k+1}-t_k\), \(k=1,\ldots,n-1\). For the above three summary functions, the computation of integrals can be done analytically and efficiently implemented. For the Betti curve, we use the identity weight function (i.e., \(w(b,d)\equiv 1\)). If the weight function \(w(b,d)=-(l/L)\log_2(l/L)\), where \(l=d-b\) (life-span) and \(L=\sum_{i=1}^n l_i\) (sum of all life-spans), the vector summary of the Betti curve will be the same as that of the persistent entropy summary function.

# Persistent Entropy Summary (PES) function
# compute PES for homological dimension H0
computePES(D,homDim = 0,scaleSeq) 
#>  [1] 4.8268301 1.0173158 0.3720045 0.3720045 0.3720045 0.3720045 0.3720045
#>  [8] 0.3720045 0.3720045 0.3720045
# compute PES for homological dimension H1
computePES(D,homDim = 1,scaleSeq)
#>  [1] 0.05677674 0.26458225 0.33845861 0.34789260 0.34789260 0.34789260
#>  [7] 0.12039198 0.00000000 0.00000000 0.00000000

# Euler Characteristic Curve (ECC) 
computeECC(D,maxhomDim = 1,scaleSeq) # maxhomDim = maximal homological dimension considered
#>  [1] 65.64393340  5.93893485 -0.02801848  0.00000000  0.00000000  0.00000000
#>  [7]  0.65393925  1.00000000  1.00000000  1.00000000

# Vector of Averaged Bettis (VAB) - a vectorization of Betti Curve
# compute VAB for homological dimension H0
computeVAB(D,homDim = 0,scaleSeq) 
#>  [1] 65.928137  7.313179  1.000000  1.000000  1.000000  1.000000  1.000000
#>  [8]  1.000000  1.000000  1.000000
# compute VAB for homological dimension H1
computeVAB(D,homDim = 1,scaleSeq)
#>  [1] 0.2842034 1.3742440 1.0280185 1.0000000 1.0000000 1.0000000 0.3460608
#>  [8] 0.0000000 0.0000000 0.0000000

To discretize the persistence surface and persistence block, we first need to switch from the birth-death to the birth-persistence coordinates.

The resulting vector summaries are called the persistence image (PI) and the vectorized of persistence block (VPB) respectively.

# Persistence Image (PI)
res <- 5 # resolution or grid size
# find min and max persistence values
minPH0 <- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3])
sigma <- 0.5*(maxPH0-minPH0)/res # default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram
# construct one-dimensional grid of scale values
ySeqH0 <- seq(minPH0,maxPH0,length.out=res+1)
# compute PI for homological dimension H_0
computePI(D,homDim=0,xSeq=NA,ySeqH0,res,sigma)
#> [1] 4.56670703 0.96562735 0.01135007 0.02272340 0.47724987

# Vectorized Persistence Block (VPB)
# construct one-dimensional grid of scale values
ySeqH0 <- unique(quantile(D[D[,1]==0,3],probs = seq(0,1,by=0.2))) 
tau <- 0.3 # parameter in [0,1] which controls the size of blocks around each point of the diagram 
# compute VPB for homological dimension H_0
computeVPB(D,homDim = 0,xSeq=NA,ySeqH0,tau) 
#> [1] 0.809001 3.644815 5.486612 6.382816 1.013953

Since the \(H_0\) features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization.

For homological dimension \(H_1\), the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid. For the VPB summary, since the blocks around each point of the persistence diagram have different sizes, we construct the grid with scale values spread out non-uniformly (i.e. the rectangular grid cells have different dimensions). In experiments, this construction of the grid tends to provide better performance over the grid with equal cell sizes.

# PI
res <- 5 # resolution or grid size
# find min & max birth and persistence values
minBH1 <- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2])
minPH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3])
xSeqH1 <- seq(minBH1,maxBH1,length.out=res+1)
ySeqH1 <- seq(minPH1,maxPH1,length.out=res+1)
sigma <- 0.5*(maxPH1-minPH1)/res
# compute PI for homological dimension H_1
computePI(D,homDim=1,xSeqH1,ySeqH1,res,sigma)
#>  [1] 4.371601e-02 4.955567e-02 4.512378e-02 3.299957e-02 1.864442e-02
#>  [6] 7.527654e-03 7.760847e-03 6.305986e-03 4.250972e-03 2.278547e-03
#> [11] 6.038006e-05 5.841014e-05 4.427399e-05 3.094726e-05 2.021064e-05
#> [16] 1.834675e-04 8.921009e-04 2.714215e-03 5.171823e-03 6.175378e-03
#> [21] 3.853840e-03 1.874023e-02 5.701774e-02 1.086451e-01 1.297270e-01

# VPB
xSeqH1 <- unique(quantile(D[D[,1]==1,2],probs = seq(0,1,by=0.2)))
ySeqH1 <- unique(quantile(D[D[,1]==1,3],probs = seq(0,1,by=0.2)))
tau <- 0.3
# compute VPB for homological dimension H_1
computeVPB(D,homDim = 1,xSeqH1,ySeqH1,tau) 
#>  [1] 0.0000000000 0.0004995598 0.0090337733 0.0305098818 0.0057973687
#>  [6] 0.0095194496 0.0261053157 0.0203673036 0.0042870130 0.0107446149
#> [11] 0.0425768331 0.1053468321 0.1458807241 0.0000000000 0.0000000000
#> [16] 0.0328035481 0.0276785592 0.1089088594 0.1318822254 0.0168304895
#> [21] 0.2939394561 0.3162986126 0.3344510989 0.3567486932 0.3636226941

As a note, the code for computePI() is adopted from the pers.image() function (available in the kernelTDA package) with minor modifications. For example, pers.image() uses a one-dimensional grid for homological dimension \(H_0\) regardless of the filtration type. In contast, computePI() uses a one-dimensional grid only if additionally the birth values are the same (which may not be true for some filtrations such as the lower-star filtration).

References

Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1), 77-102.
Chazal, F., Fasy, B. T., Lecci, F., Rinaldo, A., & Wasserman, L. (2014, June). Stochastic convergence of persistence landscapes and silhouettes. In Proceedings of the thirtieth annual symposium on Computational geometry (pp. 474-483).
Atienza, N., Gonzalez-Díaz, R., & Soriano-Trigueros, M. (2020). On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recognition, 107, 107509.
Richardson, E., & Werman, M. (2014). Efficient classification using the Euler characteristic. Pattern Recognition Letters, 49, 99-106.
Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. Frontiers in Artificial Intelligence, 108.
Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. Advances in Computational Mathematics, 48(1), 1-42.
Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., … & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18.
Chan, K. C., Islambekov, U., Luchinsky, A., & Sanders, R. (2022). A computationally efficient framework for vector representation of persistence diagrams. Journal of Machine Learning Research, 23, 1-33.