Calculations performed by the unusualprofile package

Suppose that there are a set of variables \(X\) that are related to each other as seen in the model below:

As seen in Figure @ref(fig:example-profile), this profile of scores is summarized by a composite score of 2.30.

How can we calculate the Mahalanobis distance for profiles that all have the same elevation (i.e., composite score)? For your reference, we will do everything “by hand” using matrix algebra.

# Observed Correlations
R_X <- lambda %*% t(lambda) %>% 
  `diag<-`(1)

R_X
#>        X_1   X_2    X_3  X_4
#> X_1 1.0000 0.855 0.8075 0.57
#> X_2 0.8550 1.000 0.7650 0.54
#> X_3 0.8075 0.765 1.0000 0.51
#> X_4 0.5700 0.540 0.5100 1.00

\[R_{X} \approx \begin{bmatrix} 1 & .85 & .81 & .57\\ .85 & 1 & .76 & .54\\ .81 & .76 & 1 & .51\\ .57 & .54 & .51 & 1 \end{bmatrix}\]

We need to use this matrix to create a new 5 × 5 correlation matrix that includes the correlations among the four variables and also each variable’s correlation with the general composite score (i.e., the standardized sum of four variables). Fortunately, such a matrix can be calculated with only a few steps.

We will need a “weight” matrix that will select each variable individually and also the sum of the four variables.

\[w=\begin{bmatrix} 1 & 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0 & 1\\ 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 1 & 1 \end{bmatrix}\]

Notice that the first column of this matrix has a 1 in first position and zeroes elsewhere. It selects the first variable, X₁. The second column selects X₂, and so on to the fourth column. The last column is all ones, which will select all four variables and add them up.

We can construct this matrix with the diag function, which creates an identity matrix. This matrix is appended to a column of ones:

w <-  cbind(diag(4),
            rep(1,4)) %>% 
  `rownames<-`(v_observed) %>% 
  `colnames<-`(v_names)
w
#>     X_1 X_2 X_3 X_4 X_Composite
#> X_1   1   0   0   0           1
#> X_2   0   1   0   0           1
#> X_3   0   0   1   0           1
#> X_4   0   0   0   1           1

\[\Sigma \approx \begin{bmatrix} 1 & .85 & .81 & .57 & 3.23\\ .85 & 1 & .76 & .54 & 3.16\\ .81 & .76 & 1 & .51 & 3.08\\ .57 & .54 & .51 & 1 & 2.62\\ 3.23 & 3.16 & 3.08 & 2.62 & 12.09 \end{bmatrix}\]

Sigma <- (t(w) %*% R_X %*% w) 
Sigma
#>                X_1   X_2    X_3  X_4 X_Composite
#> X_1         1.0000 0.855 0.8075 0.57      3.2325
#> X_2         0.8550 1.000 0.7650 0.54      3.1600
#> X_3         0.8075 0.765 1.0000 0.51      3.0825
#> X_4         0.5700 0.540 0.5100 1.00      2.6200
#> X_Composite 3.2325 3.160 3.0825 2.62     12.0950

Now we need to convert the covariance matrix to a correlation matrix. With matrix equations, we would need to create a matrix of with a vector of variances on the diagonal:

\[D = \text{diag}(\Sigma)\] Then we would take the square root, invert this matrix, and then pre-multiply it and post-multiply it by the covariance matrix.

\[R_{All} \approx \begin{bmatrix} 1 & .86 & .81 & .57 & .93\\ .86 & 1 & .76 & .54 & .91\\ .81 & .76 & 1 & .51 & .89\\ .57 & .54 & .51 & 1 & .75\\ .93 & .91 & .89 & .75 & 1 \end{bmatrix}\]

D_root_inverted <- Sigma %>% 
  diag() %>% 
  sqrt() %>% 
  diag() %>% 
  solve() %>% 
  `rownames<-`(v_names) %>%
  `colnames<-`(v_names)

R_all <- D_root_inverted %*% Sigma %*% D_root_inverted 

R_all
#>                   X_1       X_2       X_3       X_4 X_Composite
#> X_1         1.0000000 0.8550000 0.8075000 0.5700000   0.9294705
#> X_2         0.8550000 1.0000000 0.7650000 0.5400000   0.9086239
#> X_3         0.8075000 0.7650000 1.0000000 0.5100000   0.8863396
#> X_4         0.5700000 0.5400000 0.5100000 1.0000000   0.7533527
#> X_Composite 0.9294705 0.9086239 0.8863396 0.7533527   1.0000000

However, I think it best to sidestep all this complication of converting covariances to correlations with the cov2cor function:

# Convert covariance matrix to correlations 
R_all <- cov2cor(Sigma)

R_all
#>                   X_1       X_2       X_3       X_4 X_Composite
#> X_1         1.0000000 0.8550000 0.8075000 0.5700000   0.9294705
#> X_2         0.8550000 1.0000000 0.7650000 0.5400000   0.9086239
#> X_3         0.8075000 0.7650000 1.0000000 0.5100000   0.8863396
#> X_4         0.5700000 0.5400000 0.5100000 1.0000000   0.7533527
#> X_Composite 0.9294705 0.9086239 0.8863396 0.7533527   1.0000000

Calculate composite scores

To calculate the standardized composite score \(z_C\), add each variable’s deviation from its own mean and divide by the square root of the sum of the observed score covariance matrix.

\[z_C=\frac{1'(X-\mu_X)}{\sqrt{1'\Sigma_X1}}\]

Where

\(z_C\) is a standardized composite score.
\(X\) is a vector of observed scores.
\(\mu_X\) is the vector of means for the \(X\) variables.
\(\Sigma_X\) is the covariance matrix of the \(X\) variables.
\(1\) is a vector of ones compatible with \(\Sigma_X\).

The composite score is:

# Population means of observed variables
mu_X <- rep_along(X, 0)

# Population standard deviations of observed variables
sd_X <- rep_along(X, 1)

# Covariance Matrix
Sigma_X <- diag(sd_X) %*% R_X %*% diag(sd_X)

# Vector of ones
ones <- rep_along(X, 1)

# Standardized composite score
z_C <- c(ones %*% (X - mu_X) / sqrt(ones %*% (Sigma_X) %*% ones))

Estimate expected test scores conditioned on a composite score

Given a particular composite score, we need to calculate a predicted score. That is, if the composite score is 1.5 standard deviations above the mean, what are the expected subtest scores?

\[\hat{X}=\sigma_Xz_Cr_{XX_C}+\mu_X\]

Where

\(\hat{X}\) is the vector of expected subtest scores
\(\sigma_X\) is the vector of standard deviations for \(X\)
\(z_C\) is the composite score
\(r_{XX_C}\) is a vector of correlations of each variable in \(X\) with the composite score \(z_C\)
\(\mu_X\) is the vector of means for \(X\)

Thus,

# Predicted value of X, given composite score
X_hat <- sd_X * z_C * R_all[v_observed, v_composite] + mu_X

Calculating the Conditional Mahalanobis Distance

\[d_{M_C}=\sqrt{\left(X-\hat{X}\right)'\Sigma_{X}^{-1}\left(X-\hat{X}\right)}\]

Where

\(d_{M_C}\) is the Conditional Mahalanobis Distance
\(X\) is a vector of subtest scores
\(\hat{X}\) is the vector of expected subtest scores
\(\Sigma_{X}\) is the covariance matrix of the subtest scores

d_mc <- c(sqrt(t(X - X_hat) %*% solve(Sigma_X) %*% (X - X_hat)))
d_mc
#> [1] 2.9246

Suppose there are k outcome scores, and j composite scores used to calculate the expected scores \(\hat{X}\). If multivariate normality of the subtest scores can be assumed, then the Conditional Mahalanobis Distance squared has a χ² distribution with k − j degrees of freedom.

\[d_{M_C}^{2} \sim\chi^{2}(k-j)\]

# Number of observed variables
k <- length(v_observed)

# Number of composite variables
j <- length(v_composite)

# Cumulative distribution function
p <- pchisq(d_mc ^ 2, df = k - j)
p
#> [1] 0.9641406

If we can assume that the observed variables in X are multivariate normal, a profile of X = {2,3,1,2} is more unusual than 96% of profiles that also have a composite score of z_C = 2.3.

Calculations performed by the unusualprofile package

W. Joel Schneider & Feng Ji

2021-05-12

Calculate composite scores

Estimate expected test scores conditioned on a composite score

Calculating the Conditional Mahalanobis Distance