This vignette serves as an introduction to the package multigraphr
. Parts of the theoretical background is provided but for more details, consult the following literature which the package is based on:
Shafie, T. (2015). A multigraph approach to social network analysis. Journal of Social Structure, 16. Link
Shafie, T. (2016). Analyzing local and global properties of multigraphs. The Journal of Mathematical Sociology, 40(4), 239-264. Link
Shafie, T. and Schoch, D., (to appear in 2021). Multiplexity analysis of networks using multigraph representations. Statistical Methods & Applications
Shafie, T. (Under review). Goodness of fit tests for random multigraph models.
Make sure the library is loaded
library('multigraphr')
Multigraphs are network representations in which multiple edges and edge loops (self edges) are permitted. These data structures can be either directly observed or aggregated by classifying or cross-classifying node attributes into meta nodes. For the latter case, within group edges correspond to self-edges. See example below where the original graph with 15 nodes and 12 edges (left) is aggregated based on node categories into a small multigraph with 4 nodes (right).
Edge aggregation can also be used to obtain multigraphs. Assume that we study a graph with three different types of relations over three periods of time:
If we aggregate over time periods, we obtain for each edge category a multigraph for the total time period of three days:
For more details on these kinds of aggregations, see Shafie (2015;2016).
Multigraphs are represented by their edge multiplicity sequence \[\mathbf{M} = (M_{ij} : (i,j) \in \cal{R} )\] where \(\cal{R}\) is the canonical site space for undirected edges \[R = \lbrace (i,j) : 1 \leq i \leq j \leq n \rbrace\] i.e. \[(1,1) < (1,2) <···< (1,n) < (2,2) < (2,3) <···< (n,n)\] where \(n\) is number of nodes.
The number of vertex pair sites is given by \(\displaystyle r = \binom{n+1}{2}\).
Edge multiplicities can also be represented as entries in a matrix \[\mathbf{M}= \begin{bmatrix} M_{11} & M_{12} & \dots & M_{1n} \\ 0 & M_{22} & \dots & M_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0& \ldots & M_{nn} \end{bmatrix} \qquad \mathbf{M} + \mathbf{M'}= \begin{bmatrix} 2M_{11} & M_{12} & \dots & M_{1n} \\ M_{12} & 2M_{22} & \dots & M_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ M_{1n} & M_{2n} & \ldots & 2M_{nn} \end{bmatrix}\] where the right hand matrix is the equivalence of an adjacency matrix of a multigraph (Shafie, 2016).
Two probability models for generating undirected random multigraphs are implemented in the package together with several statistics under these two models. Moreover, functions for goodness of fit tests are available for the presented models.
Note that some of the functions are only practical for small scale multigraphs (with number of nodes less than 10 and number of edges less than 20).
The first model is obtained by random stub matching (RSM) given observed degree sequence of a multigraphs, so that edges are assigned to sites given fixed degree sequence \(\mathbf{d}=(d_1, \ldots, d_n)\). The edge assignment probability sequence \(\mathbf{Q}\) is defined as a function of these degrees and the edge assignment probabilities are given by \[\begin{equation} Q_{ij}= \left\{ \begin{array}{ll} \displaystyle \binom{d_i}{2}\bigg/ \binom{2m}{2} & \mbox{for $i=j$}\\ \displaystyle d_id_j\bigg/ \binom{2m}{2} & \mbox{for $i<j$} \ , \end{array} \right . \end{equation}\]
The probability of a multigraph under this model is given by \[\begin{equation}P(\mathbf{M}=\mathbf{m})=\frac{2^{m_2} \binom {m}{\mathbf{m}}}{\binom{2m}{\mathbf{d}}}=\frac{2^{m_2} m! \prod_{i=1}^n d_i!}{(2m)! \prod_{i\leq j} m_{ij}!} \ ,\end{equation}\] where \(m_2=\sum \sum_{i<j}m_{ij}\) (Shafie, 2016).
Consider a small graph on 3 nodes and the following adjacency matrix:
<- matrix(c(1, 1, 0,
A 1, 2, 2,
0, 2, 0),
nrow = 3, ncol = 3)
A
## [,1] [,2] [,3]
## [1,] 1 1 0
## [2,] 1 2 2
## [3,] 0 2 0
The degree sequence of the multigraph has double counted diagonals (see the edge multiplciity matrix defined above) and is given by
<- get_degree_seq(adj = A, type = 'graph')
D D
## [1] 3 7 2
so that number of edges in the multigraph is half the sum of the degree sequence which is equal to 6.
The RSM model given this degree sequence shows that the sample space consists of 7 possible multigraphs, as represented by their multiplicity \[\mathbf{M}= (M_{11}, M_{12}, M_{1,3}, M_{22}, M_{23}, M_{33})\] which is stored in data frame m.seq
(each row correspond to the edge multiplicity sequence of a unique multigraph):
<- rsm_model(deg.seq = D)
rsm_1 $m.seq rsm_1
## M11 M12 M13 M22 M23 M33
## 1 1 1 0 3 0 1
## 2 1 1 0 2 2 0
## 3 1 0 1 3 1 0
## 4 0 3 0 2 0 1
## 5 0 3 0 1 2 0
## 6 0 2 1 2 1 0
## 7 0 1 2 3 0 0
The probabilities associated with each multigraph/edge multiplicity sequence, together with statistics ‘number of loops’, ‘number of multiple edges’ and ‘simple graphs or not’, are stored in prob.dists
:
$prob.dists rsm_1
## prob.rsm loops multiedges simple
## 1 0.03030303 5 1 0
## 2 0.18181818 3 3 0
## 3 0.06060606 4 2 0
## 4 0.06060606 3 3 0
## 5 0.24242424 1 5 0
## 6 0.36363636 2 4 0
## 7 0.06060606 3 3 0
More details on these statistics for analyzing structural properties of a multigrpahs is given below.
The second is obtained by independent edge assignments (IEA) according to a common probability distribution. The \(m\) edges of the multigraph are independently assigned to the sites \((i,j)\in \mathcal{R}\) and the edge multiplicity sequence \(\mathbf{M}\) follows a multinomial distribution with parameters \(m\) and \(\mathbf{Q}=(Q_{ij}: (i, j) \in \mathcal{R})\), where \(\mathbf{Q}\) is the edge probability sequence with edge assignment probabilities \(Q_{ij}\) for each site \((i,j)\in \mathcal{R}\) (Shafie, 2015). The probability of a multigraph under this model is given by \[\begin{equation} P(\mathbf{M}=\mathbf{m})= \binom{m}{\mathbf{m}} \mathbf{Q}^{\mathbf{m}} = \frac{m!}{\prod_{i\leq j} m_{ij} !} \prod_{i\leq j}Q_{ij}^{m_{ij}}. \end{equation}\] Moments of certain statistics to analyse multigraph structures are easier derived under this model, facilitating the structural analysis of the multigraphs since the full probability distribution of multigraphs is not needed. Thus, it is of interest to approximate the RSM model by the IEA model. There are two ways in which this can be done:
1. Independent edge assignment of stubs (IEAS)
In order to get an RSM approximation using the IEA model we can simply ignore the dependency between the edge assignments in the RSM model. The distribution of \(\mathbf{M}\) is approximated with the edge probability sequence defined as a function of the fixed degrees \(\mathbf{Q}(\mathbf{d})\). This approximation can be viewed as repeated assignments with replacements of stubs, whereas RSM is repeated assignments without replacement of stubs.
2. Independent stub assignment (ISA)
A Bayesian version of the RSM model is obtained by assigning a prior to the parameter \(\mathbf{d}\), i.e., assuming that the stubs are independently attached to the \(n\) nodes according to a probability distribution \(\mathbf{p}=(p_1, p_2, \ldots, p_n)\) where \(p_i>0\) and \(\sum_{i=1}^np_i = 1\). Thus, is the outcome of a random degree sequence that is multinomial distributed with parameters \(2m\) and \(\mathbf{p}=\mathbf{d}\). Then it follows that the multiplicity sequence \(\mathbf{M}\) has an IEA distribution with edge probability sequence \(\mathbf{Q}(\mathbf{p})\). For the RSM approximation, \(\mathbf{p}=\mathbf{d}/2m\).
The relations between the models are summarized in the below figure:
The function iea_model
has both versions of the IEA model implemented and can be specified to approximate the RSM model. Consider using the IEA model to approximate the RSM model so that edge assignment probabilities are functions of observed degree sequence. Note that the sample space for multigraphs is much bigger than for the RSM model so the multiplicity sequences are not printed (they can be found using the function get_edgemultip_seq
for very small multigraphs and their probabilities can be found using the multinomial distribution). The following shows the number of multigraphs under either of the two IEA models:
<- iea_model(adj = A , type = 'graph', model = 'IEAS', K = 0, apx = TRUE)
ieas_1 <- iea_model(adj = A , type = 'graph', model = 'ISA', K = 0, apx = TRUE)
isa_1 $nr.multigraphs isa_1
## [1] 462
$nr.multigraphs ieas_1
## [1] 462
The logical parameter apx
determines whether the IEA model is being used as an approximation to the RSM model and the parameter model
specifies which IEA model is used for the approximation.
The IEA models can also be used independent of the RSM model. For example, the IEAS model can be used where edge assignment probabilities are estimated using the observed edge multiplicities (maximum likelihood estimates):
<- iea_model(adj = A , type = 'graph', model = 'IEAS',
ieas_2 K = 0, apx = FALSE)
The ISA model can also be used independent of the RSM model. Then, a sequence with the stub assignment probabilities (for example based on prior belief) should be given as argument:
<- iea_model(adj = A , type = 'graph', model = 'ISA',
isa_2 K = 0, apx = FALSE, p.seq = c(1/3, 1/3, 1/3))
Several statistics for analyzing the structural properties of multigraphs under the different models are implemented in the package. These include number of loops denoted \(M_1\) and number of non-loops denoted \(M_2\) (indicator of e.g. homophily and heterophily). Other statistics which are only implemented for the IEA model are those part of a so called complexity sequence with the distribution of edge multiplicities. This sequence is given by \(\mathbf{R}=(R_0, R_1, R_2, \ldots, R_m)\) where \[\begin{equation} R_k = \sum \sum_{i<j}I(M_{ij} = k) \ \textrm{ for } \ k = 0,1, \ldots, m \ , \end{equation}\] and \(I\) is an indicator variable. Thus, \(R_0\) denotes the number of vertex pair sites with no edge occupancy, \(R_1\) single edge occupancy, \(R_2\) double edge occupancy, and so forth. These statistics are useful as an indicator of e.g. multiplexity/interlocking.
Approximate 95% interval estimates for these statistics are given by \(\hat{E} \pm 2\sqrt{\hat{V}}\).
Under the RSM model, the first two moments and interval estimates of the statistics \(M_1\) and \(M_2\) are given by
$M rsm_1
## M1 M2
## Expected 2.273 3.727
## Variance 0.986 0.986
## Upper 95% 4.259 5.713
## Lower 95% 0.287 1.741
which are calculated using the numerically found probability distributions under RSM (no analytical solutions exist for these moments).
Under the IEA models (IEAS or ISA), moments of these statistics, together with the complexity statistic \(R_k\) representing the sequence of frequencies of edge sites with multiplicities \(k = 0,1, \ldots, m\) are found using derived formulas. Thus, there is no limit on multigraph size. When the IEAS model is used to approximate the RSM model (see above), these statistics are:
$M ieas_1
## M1 M2
## Observed 3.000 3.000
## Expected 2.273 3.727
## Variance 1.412 1.412
## Upper 95% 4.649 6.104
## Lower 95% -0.104 1.351
$R ieas_1
## R0 R1 R2
## Observed 2.000 2.000 2.000
## Expected 2.674 1.588 1.030
## Variance 0.575 1.129 0.760
## Upper 95% 4.191 3.713 2.773
## Lower 95% 1.156 -0.537 -0.713
When the ISA model is used to approximate the RSM model (see above):
$M isa_1
## M1 M2
## Observed 3.000 3.000
## Expected 2.583 3.417
## Variance 1.471 1.471
## Upper 95% 5.009 5.842
## Lower 95% 0.158 0.991
$R isa_1
## R0 R1 R2
## Observed 2.000 2.000 2.000
## Expected 2.599 1.703 1.018
## Variance 0.622 1.223 0.748
## Upper 95% 4.176 3.915 2.748
## Lower 95% 1.021 -0.509 -0.711
The interval estimates can then be visualized to detect discrepancies between observed and expected values thus indicating social mechanisms at play, and to detect interval overlap and potential interdependence between different types of edges (for examples, see Shafie 2015,2016; Shafie & Schoch 2021).
Goodness of fits tests of multigraph models using Pearson (\(S\)) and information divergence (\(A\)) test statistics under the random stub matching (RSM) and by independent edge assignments (IEA) model, where the latter is either independent edge assignments of stubs (IEAS) or independent stub assignment (ISA). The tests are performed using goodness-of-fit measures between the edge multiplicity sequence of a specified model or an observed multigraph, and the expected multiplicity sequence according to a simple or composite hypothesis.
The following test statistics are used when edge multiplicities according to \(\textrm{IEA}(\mathbf{Q})\) and correct model \(\mathbf{Q_0}=\mathbf{Q}\) is tested:
the Pearson statistic \[S_0=\sum\sum_{i \leq j}\frac{(M_{ij}-mQ_{0ij})^2}{mQ_{0ij}}=\sum\sum_{i \leq j}\frac{M_{ij}^2}{mQ_{0ij}}-m \stackrel{asymp}{\sim} \chi^2(r-1)\]
the divergence statistic \[D_0=\sum\sum_{i\leq j}\frac{M_{ij}}{m} \log \frac{M_{ij}}{mQ_{0ij}} \quad \textrm{and} \quad A_0=\frac{2m}{\log \textrm{e}}D_0 \stackrel{asymp}{\sim} \chi^2(r-1)\]
The composite multigraph hypotheses are ISA for unknown \(\mathbf{p}\) and IEAS for unknown \(\mathbf{d}\) where parameters have to be estimated from data \(\mathbf{M}\).
When the correct model is tested, the statistics are given by
the Pearson statistic \[\hat{S}=\sum\sum_{i \leq j}\frac{(M_{ij}-m\hat{Q}_{ij})^2}{m\hat{Q}_{ij}}=\sum\sum_{i \leq j}\frac{M_{ij}^2}{m\hat{Q}_{ij}}-m \stackrel{asymp}{\sim} \chi^2(r-n)\]
the divergence statistic \[\hat{D}=\sum\sum_{i\leq j}\frac{M_{ij}}{m} \log \frac{M_{ij}}{m\hat{Q}_{ij}} \qquad \textrm{and} \quad \hat{A}=\frac{2m}{\log \textrm{e}}\hat{D}\stackrel{asymp}{\sim} \chi^2(r-n)\]
Probability distributions of test statistics, summary of tests, moments of tests statistics, adjusted test statistics, critical values, significance level according to asymptotic distribution, and power of tests can be examined using gof_sim
given a specified model from which we simulate observed values from, and a null or non-null hypothesis from which we calculate expected values from. This in order to investigate the behavior of the null and non-null distributions of the test statistics and their fit to to asymptotic \(\chi^2\) distributions, thus also checking how reliable the tests are for small sized multigraphs.
Simulated goodness of fit tests for multigraphs with n=4 nodes and m=10 edges (be patient, these take a while to run).
(1) Testing a simple IEAS hypothesis with degree sequence (6,6,6,2) against a RSM model with degrees (8,8,2,2):
<- gof_sim(m = 10, model = 'IEAS', deg.mod = c(8,8,2,2),
gof1 hyp = 'IEAS', deg.hyp = c(6,6,6,2))
(2) Testing a correctly specified simple IEAS hypothesis with degree sequence (14,2,2,2):
<- gof_sim(m = 10, model = 'IEAS', deg.mod = c(14,2,2,2),
gof2 hyp = 'IEAS', deg.hyp = c(14,2,2,2))
The non-null (gof1
) and null (gof2
) distributions of the test statistics together with their asymptotic chi2-distribution can be visualized using e.g. ggplot2
:
(3) Testing a composite IEAS hypothesis against a RSM model with degree sequence (14,2,2,2):
<- gof_sim(m = 10, model = 'RSM', deg.mod = c(14,2,2,2),
gof3 hyp = 'IEAS', deg.hyp = 0)
(4) Testing a composite ISA hypothesis against a ISA model with degree sequence (14,2,2,2):
<- gof_sim(m = 10, model = 'ISA', deg.mod = c(14,2,2,2),
gof4 hyp = 'ISA', deg.hyp = 0)
The non-null (gof3
) and null (gof4
) distributions of the test statistics can then be visualized as shown above to check their fit to the asymptotic \(\chi^2\)-distribution.
Use function gof_test
to test whether the observed data follows IEA approximations of the RSM model. The null hypotheses can be simple or composite, although the latter is not recommended for small multigraphs as it is difficult to detect a false composite hypothesis under an RSM model and under IEA models (this can be checked and verified using gof_sim
to simulate these cases).
The multiplicity sequence represented as an upper triangular matrix for this mutigrpah is given by
<- t(matrix(c (0, 0, 1, 0, 0, 0, 0, 0,
flor_m 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 2, 0, 0, 1, 5,
0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 1, 2,
0, 0, 0, 0, 0, 0, 2, 1,
0, 0, 0, 0, 0, 0, 0, 2,
0, 0, 0, 0, 0, 0, 0, 1), nrow= 8, ncol=8))
The equivalence of adjacency matrix for the multigraph is given by
<- flor_m+t(flor_m)
flor_adj flor_adj
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 0 1 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0
## [3,] 1 0 0 2 0 0 1 5
## [4,] 0 0 2 0 0 0 1 1
## [5,] 0 0 0 0 0 0 1 2
## [6,] 0 0 0 0 0 0 2 1
## [7,] 0 0 1 1 1 2 0 2
## [8,] 0 0 5 1 2 1 2 2
with the diagonal representing the loops double counted (Shafie, 2016). The function get_degree_seq
can now be used to find the degree sequence for this multigraph:
<- get_degree_seq(adj = flor_adj, type = 'multigraph')
flor_d flor_d
## [1] 1 0 9 4 3 3 7 13
Now we test whether the observed network fits the IEAS or the ISA model. The \(p\)-values for testing whether there is a significant difference between observed and expected edge multiplicity values according to the two approximate IEA models are given in the output tables below. Note that the asymptotic \(\chi^2\)-distribution has \(r-1 = \binom{n+1}{2} - 1 =35\) degrees of freedom.
<- gof_test(flor_adj, 'multigraph', 'IEAS', flor_d, 35)
flor_ieas_test flor_ieas_test
## Stat dof Stat(obs) p-value
## 1 S 35 15.762 0.998
## 2 A 35 18.905 0.988
<- gof_test(flor_adj, 'multigraph', 'ISA', flor_d, 35)
flor_isa_test flor_isa_test
## Stat dof Stat(obs) p-value
## 1 S 35 16.572 0.997
## 2 A 35 19.648 0.983
The results show that we have strong evidence for the null such that we fail to reject it. Thus, there is not a significant difference between the observed and the expected edge multiplicity sequence according on the two IEA models. Statistics derived under these models presented above can thus be used to analyze the structure of these multigraphs.
The function nsumk
is useful for finding all possible degree sequences for a network with \(n\) nodes and \(k/2\) number of edges, or for finding all possible edge multiplicity sequence \(n\) that sum up to \(k\) number of edges.
All edge multiplicity sequences/multigraph with 2 nodes and 4 edges
<- (2*3)/2 # vertex pair sites (or length of edge multiplicity sequences)
r <- nsumk(r,4) # number of rows give number of possible multigraphs
mg mg
## [,1] [,2] [,3]
## [1,] 0 0 4
## [2,] 0 1 3
## [3,] 0 2 2
## [4,] 0 3 1
## [5,] 0 4 0
## [6,] 1 0 3
## [7,] 1 1 2
## [8,] 1 2 1
## [9,] 1 3 0
## [10,] 2 0 2
## [11,] 2 1 1
## [12,] 2 2 0
## [13,] 3 0 1
## [14,] 3 1 0
## [15,] 4 0 0
The function get_edge_assignment_probs
can be used to calculates the edge assignment probabilities \(\mathbf{Q}\) given specified degree sequence under the two ways in which the RSM model can be approximated by the IEA model.
Under the IEAS model with 10 possible vertex pair sites (4 nodes), the edge assignment probabilities are given by
<- get_edge_assignment_probs(m = 8, deg.seq = c(4,4,4,4), model = 'IEAS')
Q Q
## [1] 0.0500000 0.1333333 0.1333333 0.1333333 0.0500000 0.1333333 0.1333333
## [8] 0.0500000 0.1333333 0.0500000
Given a degree sequence, the function get_edge_multip_seq
can be used to find all unique edge multiplicity sequences as rows in a data frame. Each row in the data frame represents a unique multigraph given the degree sequence. This is useful if you need to find the full outcome space of multgraphs under the IEA model (but only practical and readable for small multigraphs)
To find all multigraphs with degree sequence \((4,2,2)\)
<- get_edge_multip_seq(deg.seq = c(4,2,2))
mg mg
## M11 M12 M13 M22 M23 M33
## 1 2 0 0 1 0 1
## 2 2 0 0 0 2 0
## 3 1 2 0 0 0 1
## 4 1 1 1 0 1 0
## 5 1 0 2 1 0 0
## 6 0 2 2 0 0 0