Suppose that there are a set of variables \(X\) that are related to each other as seen in the model below:
Suppose that there is a profile of scores in \(X\) such that:
\[X=\{X_1,X_2, X_3, X_4\} = \{2,3,1,2\}. \]
As seen in Figure @ref(fig:example-profile), this profile of scores is summarized by a composite score of 2.30.
How can we calculate the Mahalanobis distance for profiles that all have the same elevation (i.e., composite score)? For your reference, we will do everything “by hand” using matrix algebra.
In r, we can make a named vector of scores like so:
<- c(X_1 = 2,
X X_2 = 3,
X_3 = 1,
X_4 = 2)
X#> X_1 X_2 X_3 X_4
#> 2 3 1 2
We will need to store variable names:
<- names(X)
v_observed <- "X"
v_latent <- "X_Composite"
v_composite <- c(v_observed, v_composite) v_names
We can create a matrix of factor loadings:
<- c(0.95, 0.90, 0.85, 0.60) %>%
lambda matrix() %>%
`rownames<-`(v_observed) %>%
`colnames<-`(v_latent)
lambda#> X
#> X_1 0.95
#> X_2 0.90
#> X_3 0.85
#> X_4 0.60
Now we calculate the model-implied correlations among the observed variables:
# Observed Correlations
<- lambda %*% t(lambda) %>%
R_X `diag<-`(1)
R_X#> X_1 X_2 X_3 X_4
#> X_1 1.0000 0.855 0.8075 0.57
#> X_2 0.8550 1.000 0.7650 0.54
#> X_3 0.8075 0.765 1.0000 0.51
#> X_4 0.5700 0.540 0.5100 1.00
Presented formally, the model-implied correlations are:
\[R_{X} \approx \begin{bmatrix} 1 & .85 & .81 & .57\\ .85 & 1 & .76 & .54\\ .81 & .76 & 1 & .51\\ .57 & .54 & .51 & 1 \end{bmatrix}\]
We need to use this matrix to create a new 5 × 5 correlation matrix that includes the correlations among the four variables and also each variable’s correlation with the general composite score (i.e., the standardized sum of four variables). Fortunately, such a matrix can be calculated with only a few steps.
We will need a “weight” matrix that will select each variable individually and also the sum of the four variables.
\[w=\begin{bmatrix} 1 & 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0 & 1\\ 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 1 & 1 \end{bmatrix}\]
Notice that the first column of this matrix has a 1 in first position and zeroes elsewhere. It selects the first variable, X1. The second column selects X2, and so on to the fourth column. The last column is all ones, which will select all four variables and add them up.
We can construct this matrix with the diag
function, which creates an identity matrix. This matrix is appended to a column of ones:
<- cbind(diag(4),
w rep(1,4)) %>%
`rownames<-`(v_observed) %>%
`colnames<-`(v_names)
w#> X_1 X_2 X_3 X_4 X_Composite
#> X_1 1 0 0 0 1
#> X_2 0 1 0 0 1
#> X_3 0 0 1 0 1
#> X_4 0 0 0 1 1
Now we can use the weight matrix w to calculate the covariance matrix:
\[\Sigma = w'R_{X}w\]
\[\Sigma \approx \begin{bmatrix} 1 & .85 & .81 & .57 & 3.23\\ .85 & 1 & .76 & .54 & 3.16\\ .81 & .76 & 1 & .51 & 3.08\\ .57 & .54 & .51 & 1 & 2.62\\ 3.23 & 3.16 & 3.08 & 2.62 & 12.09 \end{bmatrix}\]
<- (t(w) %*% R_X %*% w)
Sigma
Sigma#> X_1 X_2 X_3 X_4 X_Composite
#> X_1 1.0000 0.855 0.8075 0.57 3.2325
#> X_2 0.8550 1.000 0.7650 0.54 3.1600
#> X_3 0.8075 0.765 1.0000 0.51 3.0825
#> X_4 0.5700 0.540 0.5100 1.00 2.6200
#> X_Composite 3.2325 3.160 3.0825 2.62 12.0950
Now we need to convert the covariance matrix to a correlation matrix. With matrix equations, we would need to create a matrix of with a vector of variances on the diagonal:
\[D = \text{diag}(\Sigma)\] Then we would take the square root, invert this matrix, and then pre-multiply it and post-multiply it by the covariance matrix.
\[R_{All} = D^{-0.5}\Sigma D^{-0.5}\]
\[R_{All} \approx \begin{bmatrix} 1 & .86 & .81 & .57 & .93\\ .86 & 1 & .76 & .54 & .91\\ .81 & .76 & 1 & .51 & .89\\ .57 & .54 & .51 & 1 & .75\\ .93 & .91 & .89 & .75 & 1 \end{bmatrix}\]
If we really want to use pure matrix algebra functions, we could do this:
<- Sigma %>%
D_root_inverted diag() %>%
sqrt() %>%
diag() %>%
solve() %>%
`rownames<-`(v_names) %>%
`colnames<-`(v_names)
<- D_root_inverted %*% Sigma %*% D_root_inverted
R_all
R_all#> X_1 X_2 X_3 X_4 X_Composite
#> X_1 1.0000000 0.8550000 0.8075000 0.5700000 0.9294705
#> X_2 0.8550000 1.0000000 0.7650000 0.5400000 0.9086239
#> X_3 0.8075000 0.7650000 1.0000000 0.5100000 0.8863396
#> X_4 0.5700000 0.5400000 0.5100000 1.0000000 0.7533527
#> X_Composite 0.9294705 0.9086239 0.8863396 0.7533527 1.0000000
However, I think it best to sidestep all this complication of converting covariances to correlations with the cov2cor
function:
# Convert covariance matrix to correlations
<- cov2cor(Sigma)
R_all
R_all#> X_1 X_2 X_3 X_4 X_Composite
#> X_1 1.0000000 0.8550000 0.8075000 0.5700000 0.9294705
#> X_2 0.8550000 1.0000000 0.7650000 0.5400000 0.9086239
#> X_3 0.8075000 0.7650000 1.0000000 0.5100000 0.8863396
#> X_4 0.5700000 0.5400000 0.5100000 1.0000000 0.7533527
#> X_Composite 0.9294705 0.9086239 0.8863396 0.7533527 1.0000000
To calculate the standardized composite score \(z_C\), add each variable’s deviation from its own mean and divide by the square root of the sum of the observed score covariance matrix.
\[z_C=\frac{1'(X-\mu_X)}{\sqrt{1'\Sigma_X1}}\]
Where
\(z_C\) is a standardized composite score.
\(X\) is a vector of observed scores.
\(\mu_X\) is the vector of means for the \(X\) variables.
\(\Sigma_X\) is the covariance matrix of the \(X\) variables.
\(1\) is a vector of ones compatible with \(\Sigma_X\).
The composite score is:
# Population means of observed variables
<- rep_along(X, 0)
mu_X
# Population standard deviations of observed variables
<- rep_along(X, 1)
sd_X
# Covariance Matrix
<- diag(sd_X) %*% R_X %*% diag(sd_X)
Sigma_X
# Vector of ones
<- rep_along(X, 1)
ones
# Standardized composite score
<- c(ones %*% (X - mu_X) / sqrt(ones %*% (Sigma_X) %*% ones)) z_C
Given a particular composite score, we need to calculate a predicted score. That is, if the composite score is 1.5 standard deviations above the mean, what are the expected subtest scores?
\[\hat{X}=\sigma_Xz_Cr_{XX_C}+\mu_X\]
Where
\(\hat{X}\) is the vector of expected subtest scores
\(\sigma_X\) is the vector of standard deviations for \(X\)
\(z_C\) is the composite score
\(r_{XX_C}\) is a vector of correlations of each variable in \(X\) with the composite score \(z_C\)
\(\mu_X\) is the vector of means for \(X\)
Thus,
# Predicted value of X, given composite score
<- sd_X * z_C * R_all[v_observed, v_composite] + mu_X X_hat
\[d_{M_C}=\sqrt{\left(X-\hat{X}\right)'\Sigma_{X}^{-1}\left(X-\hat{X}\right)}\]
Where
\(d_{M_C}\) is the Conditional Mahalanobis Distance
\(X\) is a vector of subtest scores
\(\hat{X}\) is the vector of expected subtest scores
\(\Sigma_{X}\) is the covariance matrix of the subtest scores
<- c(sqrt(t(X - X_hat) %*% solve(Sigma_X) %*% (X - X_hat)))
d_mc
d_mc#> [1] 2.9246
Suppose there are k outcome scores, and j composite scores used to calculate the expected scores \(\hat{X}\). If multivariate normality of the subtest scores can be assumed, then the Conditional Mahalanobis Distance squared has a χ2 distribution with k − j degrees of freedom.
\[d_{M_C}^{2} \sim\chi^{2}(k-j)\]
# Number of observed variables
<- length(v_observed)
k
# Number of composite variables
<- length(v_composite)
j
# Cumulative distribution function
<- pchisq(d_mc ^ 2, df = k - j)
p
p#> [1] 0.9641406
If we can assume that the observed variables in X are multivariate normal, a profile of X = {2,3,1,2} is more unusual than 96% of profiles that also have a composite score of zC = 2.3.