In the introduction we have seen that a dependency network can be built using get_dep(). While it is theoretically possible to call get_dep() iteratively to obtain all dependencies of all packages available on CRAN, it is not practical to do so. This package provides two functions, get_dep_all_packages() and get_graph_all_packages(), for obtaining the dependencies of all CRAN packages directly, as well as an example dataset.
The example dataset cran_dependencies contains all dependencies as of 2020-05-09.
data(cran_dependencies)
cran_dependencies
#> # A tibble: 211,381 × 4
#> from to type reverse
#> <chr> <chr> <chr> <lgl>
#> 1 A3 xtable depends FALSE
#> 2 A3 pbapply depends FALSE
#> 3 A3 randomForest suggests FALSE
#> 4 A3 e1071 suggests FALSE
#> 5 aaSEA DT imports FALSE
#> 6 aaSEA networkD3 imports FALSE
#> 7 aaSEA shiny imports FALSE
#> 8 aaSEA shinydashboard imports FALSE
#> 9 aaSEA magrittr imports FALSE
#> 10 aaSEA Bios2cor imports FALSE
#> # … with 211,371 more rows
dplyr::count(cran_dependencies, type, reverse)
#> # A tibble: 8 × 3
#> type reverse n
#> <chr> <lgl> <int>
#> 1 depends FALSE 11123
#> 2 depends TRUE 9672
#> 3 imports FALSE 57617
#> 4 imports TRUE 51913
#> 5 linking to FALSE 3433
#> 6 linking to TRUE 3721
#> 7 suggests FALSE 35018
#> 8 suggests TRUE 38884
This is essentially a snapshot of CRAN. We can obtain all the current dependencies using get_dep_all_packages(), which requires no arguments:
df0.cran <- get_dep_all_packages()
head(df0.cran)
#> from to type reverse
#> 2 aaSEA DT imports FALSE
#> 3 aaSEA networkD3 imports FALSE
#> 4 aaSEA shiny imports FALSE
#> 5 aaSEA shinydashboard imports FALSE
#> 6 aaSEA magrittr imports FALSE
#> 7 aaSEA Bios2cor imports FALSE
dplyr::count(df0.cran, type, reverse) # numbers are in general larger than above
#> type reverse n
#> 1 depends FALSE 10844
#> 2 depends TRUE 9463
#> 3 enhances FALSE 559
#> 4 enhances TRUE 575
#> 5 imports FALSE 82651
#> 6 imports TRUE 76061
#> 7 linking to FALSE 4903
#> 8 linking to TRUE 5276
#> 9 suggests FALSE 52121
#> 10 suggests TRUE 57996
We can build the dependency network using get_graph_all_packages(). Furthermore, we can verify that the forward and reverse dependency networks are (almost) the same by comparing their size (number of edges) and order (number of nodes).
g0.depends <- get_graph_all_packages(type = "depends")
g0.depends
#> IGRAPH a2a8e72 DN-- 4707 7809 --
#> + attr: name (v/c)
#> + edges from a2a8e72 (vertex names):
#> [1] A3 ->xtable A3 ->pbapply abc ->abc.data
#> [4] abc ->nnet abc ->quantreg abc ->MASS
#> [7] abc ->locfit ABCp2 ->MASS abctools ->abc
#> [10] abctools ->abind abctools ->plyr abctools ->Hmisc
#> [13] abd ->nlme abd ->lattice abd ->mosaic
#> [16] abodOutlier ->cluster abundant ->glasso Ac3net ->data.table
#> [19] acc ->mhsmm accelmissing->mice accelmissing->pscl
#> [22] accessrmd ->ggplot2 accrual ->tcltk2 accrualPlot ->lubridate
#> + ... omitted several edges
We could obtain essentially the same graph, but with the direction of the edges reversed, by specifying type = "reverse depends".
The dependency words accepted by the argument type are the same as in get_dep(). The two networks’ size and order should be very close, if not identical, to each other. Because of the dependency direction, their edge lists should be the same but with the columns from and to swapped.
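The relationship between a forward and a reverse edge list can be illustrated with a toy example. The sketch below uses a small, made-up edge list (the package names pkgA to pkgD are hypothetical) and plain igraph calls rather than the functions of this package. It only demonstrates that swapping from and to leaves the size and order unchanged.

```r
# Toy forward edge list; the package names are hypothetical
df_forward <- data.frame(
  from = c("pkgA", "pkgA", "pkgB", "pkgC"),
  to   = c("pkgB", "pkgC", "pkgD", "pkgD"),
  stringsAsFactors = FALSE
)

# Reverse edge list: the same edges with the two columns swapped
df_reverse <- data.frame(
  from = df_forward$to,
  to   = df_forward$from,
  stringsAsFactors = FALSE
)

g_forward <- igraph::graph_from_data_frame(df_forward, directed = TRUE)
g_reverse <- igraph::graph_from_data_frame(df_reverse, directed = TRUE)

# Size (number of edges) and order (number of nodes) agree
same_size  <- igraph::gsize(g_forward) == igraph::gsize(g_reverse)
same_order <- igraph::gorder(g_forward) == igraph::gorder(g_reverse)
c(size = same_size, order = same_order)
```

With real CRAN data the agreement is only approximate, since some packages appear on one side only.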
For verification, the exact same graphs can be obtained by filtering the data frame for the required dependency and applying df_to_graph():
g1.depends <- df0.cran %>%
dplyr::filter(type == "depends" & !reverse) %>%
df_to_graph(nodelist = dplyr::rename(df0.cran, name = from))
g1.depends # same as g0.depends
#> IGRAPH 482f949 DN-- 4707 7809 --
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 482f949 (vertex names):
#> [1] A3 ->xtable A3 ->pbapply abc ->abc.data
#> [4] abc ->nnet abc ->quantreg abc ->MASS
#> [7] abc ->locfit ABCp2 ->MASS abctools ->abc
#> [10] abctools ->abind abctools ->plyr abctools ->Hmisc
#> [13] abd ->nlme abd ->lattice abd ->mosaic
#> [16] abodOutlier ->cluster abundant ->glasso Ac3net ->data.table
#> [19] acc ->mhsmm accelmissing->mice accelmissing->pscl
#> [22] accessrmd ->ggplot2 accrual ->tcltk2 accrualPlot ->lubridate
#> + ... omitted several edges
If we extract the equivalent graph of reverse dependencies, we should obtain the same graph as before (had it been extracted above):
# Not run
g1.rev_depends <- df0.cran %>%
dplyr::filter(type == "depends" & reverse) %>%
df_to_graph(nodelist = dplyr::rename(df0.cran, name = from))
g1.rev_depends # should be same as g0.rev_depends
The networks obtained above should all be directed acyclic graphs.
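One way to check acyclicity is igraph::is_dag(). The following is a minimal sketch on two made-up graphs (hypothetical package names), not on the CRAN networks themselves; the same call could be applied to any of the igraph objects built earlier.

```r
# Hypothetical acyclic graph: pkgA -> pkgB -> pkgC and pkgA -> pkgC
g_acyclic <- igraph::graph_from_data_frame(
  data.frame(
    from = c("pkgA", "pkgB", "pkgA"),
    to   = c("pkgB", "pkgC", "pkgC"),
    stringsAsFactors = FALSE
  ),
  directed = TRUE
)
dag_ok <- igraph::is_dag(g_acyclic)

# Same graph with an extra edge pkgC -> pkgA, which closes a cycle
g_cyclic <- igraph::graph_from_data_frame(
  data.frame(
    from = c("pkgA", "pkgB", "pkgA", "pkgC"),
    to   = c("pkgB", "pkgC", "pkgC", "pkgA"),
    stringsAsFactors = FALSE
  ),
  directed = TRUE
)
dag_broken <- igraph::is_dag(g_cyclic)

c(acyclic = dag_ok, cyclic = dag_broken)
```

A FALSE result on a dependency graph would indicate a circular dependency among the packages.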
One may notice that there are external reverse dependencies which won’t appear in the forward dependencies if the scraping is limited to CRAN packages. We can find these external reverse dependencies by setting nodelist = NULL in df_to_graph():
df1.rev_depends <- df0.cran %>%
dplyr::filter(type == "depends" & reverse) %>%
df_to_graph(nodelist = NULL, gc = FALSE) %>%
igraph::as_data_frame() # to obtain the edge list
df1.depends <- df0.cran %>%
dplyr::filter(type == "depends" & !reverse) %>%
df_to_graph(nodelist = NULL, gc = FALSE) %>%
igraph::as_data_frame()
dfa.diff.depends <- dplyr::anti_join(
df1.rev_depends,
df1.depends,
c("from" = "to", "to" = "from")
)
head(dfa.diff.depends)
#> from to type reverse
#> 1 abind baySeq depends TRUE
#> 2 abind CNORdt depends TRUE
#> 3 abind FISHalyseR depends TRUE
#> 4 abind flowMap depends TRUE
#> 5 abind riboSeqR depends TRUE
#> 6 adabag m6Aboost depends TRUE
This means we are extracting the reverse dependencies whose forward equivalents are not listed. The column to shows the packages external to CRAN. On the other hand, if we apply dplyr::anti_join() with the order of the two edge lists switched,
dfb.diff.depends <- dplyr::anti_join(
df1.depends,
df1.rev_depends,
c("from" = "to", "to" = "from")
)
head(dfb.diff.depends)
#> from to type reverse
#> 1 abctools parallel depends FALSE
#> 2 abd grid depends FALSE
#> 3 AcceptanceSampling methods depends FALSE
#> 4 AcceptanceSampling stats depends FALSE
#> 5 acid stats depends FALSE
#> 6 acid graphics depends FALSE
the column to lists packages that are not (or no longer) on the page of available packages on CRAN. These are either defunct or core packages.
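The logic of the two anti-joins can be seen on a toy example. The edge lists below are made up (pkgA, pkgB, pkgC and extPkg are hypothetical names, while stats stands in for a core package); the sketch only illustrates how joining with the columns crossed isolates the edges present on one side only.

```r
# Forward edges: "from" depends on "to"
df_fwd <- data.frame(
  from = c("pkgA", "pkgA", "pkgB"),
  to   = c("pkgB", "stats", "pkgC"),
  stringsAsFactors = FALSE
)

# Reverse edges: "from" is depended upon by "to"
df_rev <- data.frame(
  from = c("pkgB", "pkgC", "pkgB"),
  to   = c("pkgA", "pkgB", "extPkg"),
  stringsAsFactors = FALSE
)

# Crossing the columns matches each reverse edge with its forward equivalent
by_crossed <- c("from" = "to", "to" = "from")

# Reverse edges without a forward equivalent: the dependent is external
only_rev <- dplyr::anti_join(df_rev, df_fwd, by = by_crossed)

# Forward edges without a reverse equivalent: the dependency is not listed
only_fwd <- dplyr::anti_join(df_fwd, df_rev, by = by_crossed)

only_rev
only_fwd
```

Here only_rev keeps the edge from pkgB to extPkg, and only_fwd keeps the edge from pkgA to stats.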
Using the data frame df0.cran, we can also obtain the degree for each package and each type:
df0.summary <- dplyr::count(df0.cran, from, type, reverse)
head(df0.summary)
#> from type reverse n
#> 1 A3 depends FALSE 2
#> 2 A3 suggests FALSE 2
#> 3 AATtools imports FALSE 4
#> 4 ABACUS imports FALSE 2
#> 5 ABACUS suggests FALSE 2
#> 6 ABC.RAP imports FALSE 3
We can look at the “winner” in each type of reverse dependency:
df0.summary %>%
dplyr::filter(reverse) %>%
dplyr::group_by(type) %>%
dplyr::top_n(1, n)
#> # A tibble: 6 × 4
#> # Groups: type [5]
#> from type reverse n
#> <chr> <chr> <lgl> <int>
#> 1 Rcpp linking to TRUE 2726
#> 2 doMC enhances TRUE 14
#> 3 ggplot2 depends TRUE 451
#> 4 ggplot2 imports TRUE 3082
#> 5 knitr suggests TRUE 8280
#> 6 shiny enhances TRUE 14
This is not surprising given the nature of each package. To take the summarisation one step further, we can obtain the frequencies of the degrees, and visualise the empirical degree distribution neatly on the log-log scale:
df1.summary <- df0.summary %>%
dplyr::count(type, reverse, n)
#> Storing counts in `nn`, as `n` already present in input
#> ℹ Use `name = "new_name"` to pick a new name.
gg0.summary <- df1.summary %>%
dplyr::mutate(reverse = ifelse(reverse, "reverse", "forward")) %>%
ggplot2::ggplot() +
ggplot2::geom_point(ggplot2::aes(n, nn)) +
ggplot2::facet_grid(type ~ reverse) +
ggplot2::scale_x_log10() +
ggplot2::scale_y_log10() +
ggplot2::labs(x = "Degree", y = "Number of packages") +
ggplot2::theme_bw(20)
gg0.summary
This shows that the reverse dependencies, in particular Reverse_depends and Reverse_imports, appear to follow a power law, a pattern that is empirically observed in various academic fields.
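A roughly straight line on the log-log scale is what suggests a power law: if the number of packages with degree n is proportional to n^(-alpha), taking logarithms gives a linear relationship with slope -alpha. The sketch below uses synthetic counts, not the CRAN data, and a least-squares fit, which is only a crude diagnostic and not a rigorous test of the power law.

```r
# Synthetic (not CRAN) degree frequencies: count = 1000 * degree^(-2)
degree <- 1:50
count <- 1000 * degree^(-2)

# On the log-log scale the relationship is linear with slope -alpha
fit <- lm(log10(count) ~ log10(degree))
alpha_hat <- -unname(coef(fit)[2])
alpha_hat # close to 2 here, by construction
```

With real data the counts are noisy, especially in the tail, so the fitted slope should be read with caution.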
We can now visualise (the giant component of) the CRAN network of Depends, using functions in the package visNetwork. To do this, we will need to convert the igraph object g0.depends to a node list and an edge list as data frames.
prefix <- "http://CRAN.R-project.org/package=" # canonical form
degrees <- igraph::degree(g0.depends)
df0.nodes <- data.frame(id = names(degrees), value = degrees) %>%
dplyr::mutate(title = paste0('<a href=\"', prefix, id, '\">', id, '</a>'))
df0.edges <- igraph::as_data_frame(g0.depends, what = "edges")
We could use igraph::membership() and igraph::cluster_*() for community detection and visualise the clusters using different colours. However, this would take too much computing time and is therefore not shown here.
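For a sense of what this would look like, here is a minimal sketch on a toy graph rather than on the CRAN network. The node names are hypothetical, and cluster_walktrap() is just one arbitrary choice among the cluster_*() functions. The resulting column group is the one visNetwork can use to colour nodes.

```r
# Toy undirected graph: two triangles joined by a single edge
df_edges <- data.frame(
  from = c("a1", "a2", "a3", "b1", "b2", "b3", "a1"),
  to   = c("a2", "a3", "a1", "b2", "b3", "b1", "b1"),
  stringsAsFactors = FALSE
)
g_toy <- igraph::graph_from_data_frame(df_edges, directed = FALSE)

# Community detection with the walktrap algorithm
comm <- igraph::cluster_walktrap(g_toy)
memb <- igraph::membership(comm) # named vector: node name -> community id

# Node list with one community label per node
df_nodes <- data.frame(
  id = names(memb),
  group = as.integer(memb),
  stringsAsFactors = FALSE
)
df_nodes
```

On a graph with thousands of nodes the clustering step is the expensive part, which is why it is skipped for the full network.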
By adding the column title in df0.nodes, we make each node clickable, linking to its CRAN page, in the interactive visualisation below:
set.seed(2345L)
vis0 <- visNetwork::visNetwork(df0.nodes, df0.edges, width = "100%", height = "720px") %>%
visNetwork::visOptions(highlightNearest = TRUE) %>%
visNetwork::visEdges(arrows = "to", color = list(opacity = 0.5)) %>%
visNetwork::visNodes(fixed = TRUE) %>%
visNetwork::visIgraphLayout(layout = "layout_with_drl")
vis0
Methods in social network analysis, such as stochastic block models, can be applied to study the properties of the dependency network. Ideally, by analysing the dependencies of all CRAN packages, we can obtain a bird’s-eye view of the ecosystem. The number of reverse dependencies is modelled in this other vignette.