Prediction of primary cancer site based on microRNA measurements, see Modeling tissue contamination to improve molecular identification of the primary tumor site of metastases for more details.
Load data containing N samples and p features (covariates):
For the purpose of this tutorial we will load a data set consisting of microRNA normalized expression measurements of primary cancer samples.
## let.7a let.7c let.7d let.7e let.7f
## P-544-ME -1.1052510 -0.9213983 -0.7200146 -0.9448098 -0.591417505
## P-554-NW -1.0956835 -1.0879221 -0.6100223 -0.9538088 -0.554779014
## P-559-OI -1.1271169 -1.0914447 -0.6889379 -1.0823322 -0.736167409
## P-564-MO -1.2465982 -1.2719367 -0.7614792 -1.2006796 -0.784319518
## P-579-MY -0.6194332 -0.4971233 -0.5169694 -0.9004003 0.009509523
## [1] 165 371
## classes
## Breast CCA Cirrhosis CRC EG HCC Liver
## 17 20 17 20 18 17 20
## Pancreas Squamous
## 20 16
Hence, p = 384, N = 165 and the number of classes K = 9, this implies that the multinomial classification model has 9*(384+1) = 3465 parameters.
Let us take out a small test set:
Choose lambda
(fraction of lambda.max) and alpha
, with alpha = 1
for lasso, alpha = 0
for group lasso and alpha
in the range (0,1) for sparse group lasso.
Use msgl::cv
to estimate the error for each lambda in a sequence decreasing from the data derived lambda.max to lambda
* lambda.max. Lambda.max is the lambda at which the first penalized parameter becomes non-zero. A smaller lambda
will take longer to fit and include more features. The following code will run a 10 fold cross validation for each lambda value in the lambda sequence using 2 parallel units (using the foreach and doParallel packages.
cl <- makeCluster(2)
registerDoParallel(cl)
fit.cv <- msgl::cv(x, classes, fold = 10, alpha = 0.5, lambda = 0.1, use_parallel = TRUE)
## Running msgl 10 fold cross validation (dense design matrix)
##
## Samples: Features: Classes: Groups: Parameters:
## 155 372 9 372 3.348k
We have now cross validated the models corresponding to the lambda values, one model for each lambda value. We can summarize the validation as follows.
##
## Call:
## msgl::cv(x = x, classes = classes, alpha = 0.5, lambda = 0.1,
## fold = 10, use_parallel = TRUE)
##
## Models:
##
## Index: Lambda: Features: Parameters: Error:
## 1 1.00 1.3 10.5 0.95
## 20 0.64 12.7 74.2 0.54
## 40 0.40 31.9 168.6 0.28
## 60 0.25 43.7 228.8 0.14
## 80 0.16 54.1 282 0.14
## 100 0.10 65.8 340.7 0.12
##
## Best model:
##
## Index: Lambda: Features: Parameters: Error:
## 85 0.14 57.2 298.1 0.12
Hence, the best model is obtained using lambda index 85 and it has a cross validation error of 0.12. The expected number of selected features is 57.2 and the expected number of parameters is 298.1.
Use msgl to fit a final model.
##
## Running msgl (dense design matrix)
##
## Samples: Features: Classes: Groups: Parameters:
## 155 372 9 372 3.348k
##
## Call:
## msgl::fit(x = x, classes = classes, alpha = 0.5, lambda = 0.1)
##
## Models:
##
## Index: Lambda: Features: Parameters:
## 1 1.00 2 13
## 20 0.64 11 65
## 40 0.40 32 171
## 60 0.25 43 230
## 80 0.16 48 250
## 100 0.10 67 345
As we saw in the previous step the model with index 85 had the best cross validation error, we may take a look at the included features using the command:
## [1] "Intercept" "let.7c" "miR.10a" "miR.17" "miR.21"
## [6] "miR.27a" "miR.34a" "miR.92a" "miR.99b" "miR.122"
## [11] "miR.129.3p" "miR.130b" "miR.133a" "miR.133b" "miR.135b"
## [16] "miR.138" "miR.139.5p" "miR.143" "miR.147b" "miR.148a"
## [21] "miR.181a" "miR.182" "miR.187" "miR.191" "miR.192"
## [26] "miR.196b" "miR.199a.3p" "miR.203" "miR.205" "miR.210"
## [31] "miR.214" "miR.216a" "miR.221" "miR.223" "miR.224"
## [36] "miR.302b" "miR.338.3p" "miR.484" "miR.505" "miR.518f"
## [41] "miR.526b" "miR.532.3p" "miR.548d.3p" "miR.570" "miR.615.5p"
## [46] "miR.625" "miR.628.5p" "miR.654.3p" "miR.885.5p" "miR.891a"
## [51] "miR.492" "miR.506"
Hence 52 features are included in the model, this is close to the expected number based on the cross validation estimate.
The sparsity structure of the parameters belonging to these 52 features may be viewed using
## 9 x 52 sparse Matrix of class "lgCMatrix"
## [[ suppressing 52 column names 'Intercept', 'let.7c', 'miR.10a' ... ]]
##
## Breast | | | . . . . | . | . . | | . . . . | | | | | | | | | | | | | |
## CCA | . | | | | | . | | | . . . . | . . . | | . | . . | . | | . | |
## Cirrhosis | . . . | | . . . | | | . . | | | . . . . . . | . | | | . . | .
## CRC | | | | . | | | | | . | . . | . | . | | . | | . | | . | . | . .
## EG | . . | . . . . . | | . | | | | . | | . . | . | | | . | . | | .
## HCC | . | | . | | | . | | | . . | | . . . | . . | | . . | | . . | |
## Liver | | | . | | | | | | . . . . | | | | | | | . | . | | . . | . | .
## Pancreas | | | | | | | | . | | | | . | | . . | . | . | . . . | . . | | |
## Squamous | . . | . . . | . | . | | | . | | . . | | . | | | . | | | | | .
##
## Breast | | | . . . | . | . | | | | . | . . . .
## CCA | | . | | | | | . | | . | | | | . | | |
## Cirrhosis | | | . | . . | | | | | | | | | | | | |
## CRC | . | . . . . . . . | | | . . . | | . .
## EG | . | | | . | | | | | | | | . | | | . |
## HCC | | | . | | . . | | . . . | . . | . . .
## Liver | . | . | | | . | | | | . . . | | . | .
## Pancreas . . . | . | | . . | | | . . . | | | | |
## Squamous | | | . | . . . . . . . . . . . . . . .
We may also take a look at the estimate parameters (or coefficients)
## 9 x 5 sparse Matrix of class "dgCMatrix"
## Intercept let.7c miR.10a miR.17 miR.21
## Breast 1.7255872 -0.1812463 0.41795742 . .
## CCA -5.6236803 . -0.07369201 0.447246595 -0.22946034
## Cirrhosis 2.5144127 . . . 0.18408397
## CRC -4.2895440 1.4856713 -1.33617049 -1.649955452 .
## EG 3.1423352 . . 0.244433117 .
## HCC 0.6830275 . 1.32180928 -0.337687072 .
## Liver 1.9262963 -0.6855175 0.02548705 . 0.28433634
## Pancreas 2.4641389 -0.5121838 -0.96329373 1.598100763 -0.03628732
## Squamous 3.2744378 . . -0.004954595 .
If we count the total number of non-zero parameters in the model we get in this case 271, which is close to the expected based on the cross validation estimate.
Load test data containing M samples and p features.
Use the final model to predict the classes of the M samples in x.test
.
## P-544-ME P-554-NW P-559-OI P-564-MO P-579-MY P-590-OU
## "Squamous" "Liver" "EG" "Liver" "CRC" "CRC"
## P-598-PO Q-199-AB Q-250-GS Q-278-DK
## "Liver" "EG" "CCA" "EG"
## P-544-ME P-554-NW P-559-OI P-564-MO P-579-MY P-590-OU P-598-PO
## Breast Cirrhosis EG Liver CRC CRC Liver
## Q-199-AB Q-250-GS Q-278-DK
## EG CCA CRC
## Levels: Breast CCA Cirrhosis CRC EG HCC Liver Pancreas Squamous
We may also get the estimated probabilities for each of the classes
## P-544-ME P-554-NW P-559-OI P-564-MO P-579-MY
## Breast 0.34350838 0.002896062 0.006283647 0.007344608 0.130829998
## CCA 0.14175034 0.021929502 0.053164119 0.028909312 0.044609532
## Cirrhosis 0.01929388 0.367678315 0.021345132 0.227996422 0.004549511
## CRC 0.04752251 0.002737490 0.076241611 0.007626142 0.746150956
## EG 0.02964941 0.044726547 0.690395770 0.038356633 0.013109232
## HCC 0.03291542 0.038302244 0.023283936 0.059218935 0.009526537
## Liver 0.01341429 0.510364537 0.038669788 0.621419176 0.002838196
## Pancreas 0.02778587 0.008606528 0.085337244 0.006981781 0.039483351
## Squamous 0.34415990 0.002758776 0.005278753 0.002146993 0.008902686
## P-590-OU P-598-PO Q-199-AB Q-250-GS Q-278-DK
## Breast 0.015923408 0.004721051 0.04437011 0.02125380 0.020382077
## CCA 0.021518463 0.017500274 0.14475866 0.63259237 0.192345235
## Cirrhosis 0.011217520 0.134403760 0.01253528 0.09764710 0.013698789
## CRC 0.757700320 0.006326813 0.13344124 0.03696188 0.058717941
## EG 0.102184534 0.049052480 0.50839356 0.03431131 0.579533688
## HCC 0.044667900 0.094568905 0.05630261 0.04823211 0.009212544
## Liver 0.007456864 0.679018183 0.01386866 0.01361658 0.003883603
## Pancreas 0.030865452 0.012075144 0.06686613 0.08730910 0.078742834
## Squamous 0.008465539 0.002333390 0.01946375 0.02807575 0.043483287