Classcodes

library(coder)

Motivating example

Let’s consider some example data (ex_people and ex_icd10) from vignette("ex_data").

Let’s categorize those patients by their Charlson comorbidity:

categorize(ex_people, codedata = ex_icd10, cc = charlson, id = "name", code = "icd10")
#> Classification based on: icd10
#> # A tibble: 100 × 25
#>    name       surgery    myoca…¹ conge…² perip…³ cereb…⁴ demen…⁵ chron…⁶ rheum…⁷
#>    <chr>      <date>     <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>  
#>  1 Chen, Tre… 2022-08-12 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#>  2 Graves, A… 2022-05-04 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#>  3 Trujillo,… 2022-04-21 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#>  4 Simpson, … 2022-07-24 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#>  5 Chin, Nel… 2022-07-07 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#>  6 Le, Chris… 2022-02-08 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#>  7 Kang, Xuan 2022-05-13 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#>  8 Shuemaker… 2022-02-09 FALSE   FALSE   FALSE   FALSE   FALSE   TRUE    FALSE  
#>  9 Boucher, … 2022-07-18 FALSE   FALSE   TRUE    FALSE   FALSE   FALSE   FALSE  
#> 10 Le, Sorai… 2022-06-22 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#> # … with 90 more rows, 16 more variables: peptic.ulcer.disease <lgl>,
#> #   mild.liver.disease <lgl>, diabetes.without.complication <lgl>,
#> #   hemiplegia.or.paraplegia <lgl>, renal.disease <lgl>,
#> #   diabetes.complication <lgl>, malignancy <lgl>,
#> #   moderate.or.severe.liver.disease <lgl>, metastatic.solid.tumor <lgl>,
#> #   AIDS.HIV <lgl>, charlson <dbl>, deyo_ramano <dbl>, dhoore <dbl>,
#> #   ghali <dbl>, quan_original <dbl>, quan_updated <dbl>, and abbreviated …

Here, charlson (as supplied by the cc argument) is a “classcodes” object containing a classification scheme. This is the specification of how to match ex_icd10$icd10 to each condition recognized by the Charlson comorbidity classification. It is based on regular expressions (see ?regex).
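To make the role of the regular expressions concrete, here is a minimal base R sketch of the matching idea. It is not the package's implementation: the two patterns are simplified, hypothetical stand-ins (the real charlson expressions are longer), and `toy_scheme` and `codes` are made-up names and data.

```r
# Toy classification scheme: one regular expression per group.
# NOTE: simplified, hypothetical stand-ins, not the expressions
# stored in the charlson object.
toy_scheme <- c(
  "myocardial infarction" = "^I21",
  "dementia"              = "^F0[0-3]"
)

# Hypothetical ICD-10 codes recorded for one patient
codes <- c("I210", "K358")

# A group is TRUE if at least one code matches its regular expression
hits <- vapply(toy_scheme, function(rx) any(grepl(rx, codes)), logical(1))
hits
```

Each group yields one logical value per patient, which is the shape of the logical columns returned by categorize() above.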

Default classcodes

There are 7 default “classcodes” objects in the package (classcodes column below). Each of them might have several versions of regular expressions (column regex) and weighted indices (column indices):

all_classcodes()
#> # A tibble: 7 × 3
#>   classcodes    regex                                                    indices
#>   <chr>         <chr>                                                    <chr>  
#> 1 charlson      icd10, icd9cm_deyo, icd9cm_enhanced, icd10_rcs, icd8_br… "charl…
#> 2 cps           icd10                                                    "only_…
#> 3 elixhauser    icd10, icd10_short, icd9cm, icd9cm_ahrqweb, icd9cm_enha… "sum_a…
#> 4 hip_ae        icd10, kva, icd10_fracture                               ""     
#> 5 hip_ae_hailer icd10, kva                                               ""     
#> 6 knee_ae       icd10, kva                                               ""     
#> 7 rxriskv       atc_pratt, atc_caughey, atc_garland                      "pratt…

classcodes object

Each of those classcodes objects is documented (see for example ?charlson). Those objects are basically tibbles (data frames) with some additional attributes:

charlson
#> 
#> Classcodes object
#>  
#> Regular expressions:
#>    icd10, icd9cm_deyo, icd9cm_enhanced, icd10_rcs, icd8_brusselaers, icd9_brusselaers 
#> Indices:
#>    charlson, deyo_ramano, dhoore, ghali, quan_original, quan_updated   
#> 
#> # A tibble: 17 × 14
#>    group   descr…¹ icd10 icd9c…² icd9c…³ icd10…⁴ icd8_…⁵ icd9_…⁶ charl…⁷ deyo_…⁸
#>    <chr>   <chr>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>
#>  1 myocar… Acute … I2([… 41[02]  41[02]  "I2([1… 41[0-2] 41[02]        1       1
#>  2 conges… Heart … I(09… 428     39891|… "I(1[1… 4270|4… 4(02|2…       1       1
#>  3 periph… Periph… I7([… 44(39|… 0930|4… "(I7([… 44[0-5] 44[0-7…       1       1
#>  4 cerebr… Cerebr… G4[5… 43[0-8] 36234|… "G4[56… 43[0-8] 362C|4…       1       1
#>  5 dement… Senile… F0([… 290     29(0|4… "A810|… 290[01] 29[04]        1       1
#>  6 chroni… Chroni… (I27… 490|50… 4(16[8… "(I2[6… 49[0-3… 416|49…       1       1
#>  7 rheuma… System… M(0[… 7(1(0[… 4465|7… "M(0[5… 7(1[0-… 71[0-4…       1       1
#>  8 peptic… Gastri… K2[5… 53[1-4] 53[1-4]  <NA>   <NA>    <NA>          1       1
#>  9 mild l… Alcoho… B18|… 571[24… 070([2…  <NA>   <NA>    <NA>          1       1
#> 10 diabet… Diabet… E1[0… 250[0-… 250[0-…  <NA>   <NA>    <NA>          1       1
#> 11 hemipl… Parapl… G(04… 34(41|… 3(341|… "G(114… 344     34[2-4]       2       1
#> 12 renal … Chroni… I1(2… 58([25… 40(3([… "I1[23… 40[34]… 40[34]…       2       1
#> 13 diabet… Diabet… E1[0… 250[4-… 250[4-… "E1[0-… 250     250           2       1
#> 14 malign… Malign… C([0… (1([4-… 1([4-6… "C([01… 1([4-6… 1([4-6…       2       1
#> 15 modera… Hepati… I(8(… 456[01… 456[0-… "B18|I… 070|45… 070|45…       3       1
#> 16 metast… Second… C(7[… 19([6-… 19[6-9] "C(7[7… 19[6-9] 19[6-9]       6       1
#> 17 AIDS/H… HIV in… B2[0… 04[2-4] 04[2-4] "B2[0-… <NA>    279K          6       1
#> # … with 4 more variables: dhoore <dbl>, ghali <dbl>, quan_original <dbl>,
#> #   quan_updated <dbl>, and abbreviated variable names ¹​description,
#> #   ²​icd9cm_deyo, ³​icd9cm_enhanced, ⁴​icd10_rcs, ⁵​icd8_brusselaers,
#> #   ⁶​icd9_brusselaers, ⁷​charlson, ⁸​deyo_ramano

Columns have pre-specified names and/or content:

In the example above, we did not specify which version of the regular expressions to use. We see from the printed output above (or from attr(charlson, "regexprs")) that the first regular expression is “icd10”. This will be used by default. We have ICD-10 codes recorded in our code data set (ex_icd10$icd10). We might therefore use either “icd10” or the alternative “icd10_rcs”. Other versions might be relevant if the medical data are coded with other classifications (such as earlier versions of ICD). We will show below how to alter this setting in practice.

Hierarchy

Some classcodes objects have an additional class attribute “hierarchy”, controlling hierarchical groups where only one of possibly several groups should be used in weighted index sums. The classcodes object for the Elixhauser comorbidity classification has this property:

print(elixhauser, n = 0) # preview 0 rows but present the attributes
#> 
#> Classcodes object
#>  
#> Regular expressions:
#>    icd10, icd10_short, icd9cm, icd9cm_ahrqweb, icd9cm_enhanced 
#> Indices:
#>    sum_all, sum_all_ahrq, walraven, sid29, sid30, ahrq_mort, ahrq_readm 
#> Hierarchy:
#>    c("metastatic cancer", "solid tumor"),
#>    c("diabetes uncomplicated", "diabetes complicated")

This means that patients who have both metastatic cancer and solid tumors should be recognized as such if classified. If such patients are assigned an aggregated index score, however, only the largest score is used (in this case for metastatic cancer, which is superior to a solid tumor). The same is true for patients diagnosed with both uncomplicated and complicated diabetes.

Consider a patient Alice with some diagnoses:

pat <- tibble::tibble(id = "Alice")
diags <- c("C01", "C801", "E1010", "E1021")
decoder::decode(diags, decoder::icd10cm)
#> [1] "Malignant neoplasm of base of tongue"                   
#> [2] "Malignant (primary) neoplasm, unspecified"              
#> [3] "Type 1 diabetes mellitus with ketoacidosis without coma"
#> [4] "Type 1 diabetes mellitus with diabetic nephropathy"

According to Elixhauser, poor Alice has both a solid tumor and a metastatic cancer, as well as diabetes both with and without complications. The (unweighted) index “sum_all”, however, will not equal 4 but 2, since metastatic cancer and diabetes with complications subsume solid tumors and diabetes without complications.

icd10 <- tibble::tibble(id = "Alice", icd10 = diags)
x <- categorize(pat, codedata = icd10, cc = elixhauser, 
                id = "id", code = "icd10", index = "sum_all", check.names = FALSE)
#> Classification based on: icd10
t(x)
#>                                [,1]   
#> id                             "Alice"
#> congestive heart failure       "FALSE"
#> cardiac arrhythmias            "FALSE"
#> valvular disease               "FALSE"
#> pulmonary circulation disorder "FALSE"
#> peripheral vascular disorder   "FALSE"
#> hypertension uncomplicated     "FALSE"
#> hypertension complicated       "FALSE"
#> paralysis                      "FALSE"
#> other neurological disorders   "FALSE"
#> chronic pulmonary disease      "FALSE"
#> diabetes uncomplicated         "TRUE" 
#> diabetes complicated           "TRUE" 
#> hypothyroidism                 "FALSE"
#> renal failure                  "FALSE"
#> liver disease                  "FALSE"
#> peptic ulcer disease           "FALSE"
#> AIDS/HIV                       "FALSE"
#> lymphoma                       "FALSE"
#> metastatic cancer              "TRUE" 
#> solid tumor                    "TRUE" 
#> rheumatoid arthritis           "FALSE"
#> coagulopathy                   "FALSE"
#> obesity                        "FALSE"
#> weight loss                    "FALSE"
#> fluid electrolyte disorders    "FALSE"
#> blood loss anemia              "FALSE"
#> deficiency anemia              "FALSE"
#> alcohol abuse                  "FALSE"
#> drug abuse                     "FALSE"
#> psychoses                      "FALSE"
#> depression                     "FALSE"
#> sum_all                        "2"
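The hierarchy rule can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's code: `present`, `weights`, `score` and `sum_all_sketch` are hypothetical names, and only Alice's four TRUE groups are included.

```r
# Groups identified for Alice (the TRUE rows in the output above)
present <- c(
  "diabetes uncomplicated" = TRUE,
  "diabetes complicated"   = TRUE,
  "metastatic cancer"      = TRUE,
  "solid tumor"            = TRUE
)

# Unit weights, as in an unweighted index
weights <- setNames(rep(1, length(present)), names(present))

# Hierarchy as printed for elixhauser
hierarchy <- list(
  c("metastatic cancer", "solid tumor"),
  c("diabetes uncomplicated", "diabetes complicated")
)

# Score of each group for this patient (0 if the group is absent)
score <- weights * present

# Within each hierarchy, only the largest score counts
hier_part <- sum(vapply(hierarchy, function(h) max(score[h]), numeric(1)))

# Groups outside any hierarchy contribute their full score
free_part <- sum(score[setdiff(names(score), unlist(hierarchy))])

sum_all_sketch <- hier_part + free_part
sum_all_sketch
```

With unit weights, each hierarchy contributes 1 rather than 2, which gives the total of 2 reported for sum_all.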

Conditions

Consider Alice once more. Suppose she had a total hip arthroplasty (THA) and had some surgical procedure codes recorded at hospital visits either before, during or after her index surgery. Those codes are recorded according to the NOMESCO classification of surgical procedures (known as KVA codes in Swedish). Here, “post_op” indicates whether the code was recorded after surgery or not. This information is not always accessible from pure date stamps (if it is, the approach illustrated in vignette("coder") could be used instead).


nomesco <- 
  tibble::tibble(
    id      = "Alice",
    kva     = c("AA01", "NFC01"),
    post_op = c(TRUE, FALSE)
  )

Thus, the “post_op” column is a Boolean/logical vector with a name recognized from the “condition” column in hip_ae, a classcodes object used to identify adverse events after THA (the use of set_classcodes() is further explained below and is used here since hip_ae includes codes for both ICD and NOMESCO/KVA).

set_classcodes(hip_ae, regex = "kva")
#> 
#> Classcodes object
#>  
#> Regular expressions:
#>    kva 
#> Indices:
#>       
#> 
#> # A tibble: 1 × 3
#>   group kva                                                              condi…¹
#>   <chr> <chr>                                                            <chr>  
#> 1 KVA   ^(NF([CF-HJ-MS-TW]|A(02|1[12]|2[0-2])|Q09|U[013489]9)|QD(A10|B(… post_op
#> # … with abbreviated variable name ¹​condition

A code from nomesco$kva will only be recognized as an adverse event if 1) the code is matched by the relevant regular expression, and 2) the extra condition (from nomesco$post_op) is TRUE.

We need to specify that codes are based on regular expressions matching NOMESCO codes. We do this with the regex argument, which is passed to set_classcodes() through the cc_args argument.

In the data set (nomesco), “AA01” was recorded after surgery but does not indicate a potential adverse event. “NFC01” is a potential adverse event but was already recorded before surgery. Therefore, no adverse event will be recognized in this case.

categorize(pat, codedata = nomesco, cc = hip_ae, id = "id", code = "kva",
           cc_args = list(regex = "kva"))
#> index calculated as number of relevant categories
#> # A tibble: 1 × 3
#>   id    KVA   index
#>   <chr> <lgl> <dbl>
#> 1 Alice FALSE     0
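The two-step rule (a regular expression match combined with a TRUE condition) can be illustrated with a base R sketch. The pattern `"^NFC"` is a hypothetical, heavily simplified stand-in for the much longer KVA expression in hip_ae, and `nomesco_sketch`, `matched` and `qualifies` are made-up names.

```r
# Code data as in the example above
nomesco_sketch <- data.frame(
  id      = "Alice",
  kva     = c("AA01", "NFC01"),
  post_op = c(TRUE, FALSE)
)

# Hypothetical, simplified stand-in for the hip_ae KVA pattern
rx <- "^NFC"

# 1) Does the code match the regular expression?
matched <- grepl(rx, nomesco_sketch$kva)

# 2) Is the extra condition also TRUE for that code?
qualifies <- matched & nomesco_sketch$post_op

# An adverse event requires at least one code passing both checks
any(qualifies)
```

“AA01” fails the first check and “NFC01” fails the second, so no row passes both.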

Use classcodes objects

Most functions do not use a classcodes object as is, but a modified version passed through set_classcodes(). This function can be called directly but is more often invoked implicitly, through the cc_args argument of other functions (as in the example above).

Explicit use of set_classcodes()

We might use set_classcodes() to prepare a classification scheme according to the Charlson comorbidity index based on ICD-8 (Brusselaers and Lagergren 2017). Assume that such codes might be found in character strings with leading prefixes or in the middle of a more verbose description. This is controlled by setting the argument start = FALSE, meaning that the identified ICD-8 codes do not need to appear at the beginning of the character string. We might assume, however, that there is no more information after the code (as specified by stop = TRUE). We can also use some more specific and unique group names as specified by tech_names.

charlson_icd8 <- 
  set_classcodes(
    "charlson",
    regex      = "icd8_brusselaers", # Version based on ICD-8
    start      = FALSE, # Codes do not have to occur at the beginning of a string
    stop       = TRUE,  # Strings must end with the specified codes
    tech_names = TRUE   # Use long but unique and descriptive variable names
  )

The resulting object has only one version of regular expressions (icd8_brusselaers as specified). Each regular expression is suffixed with $ (due to stop = TRUE). Group names might seem cumbersome but they help to distinguish column names added by categorize() if this function is run repeatedly with different classcodes (for example, if we calculate both Charlson and Elixhauser indices for the same patients). The original charlson object had 17 rows, but charlson_icd8 has only 13, since not all groups are used in this version.

charlson_icd8
#> 
#> Classcodes object
#>  
#> Regular expressions:
#>    icd8_brusselaers 
#> Indices:
#>    charlson, deyo_ramano, dhoore, ghali, quan_original, quan_updated   
#> 
#> # A tibble: 13 × 9
#>    group            descr…¹ icd8_…² charl…³ deyo_…⁴ dhoore ghali quan_…⁵ quan_…⁶
#>    <chr>            <chr>   <chr>     <dbl>   <dbl>  <dbl> <dbl>   <dbl>   <dbl>
#>  1 charlson_icd8_b… Acute … (41[0-…       1       1      1     1       1       0
#>  2 charlson_icd8_b… Heart … (4270|…       1       1      1     4       1       2
#>  3 charlson_icd8_b… Periph… (44[0-…       1       1      1     2       1       0
#>  4 charlson_icd8_b… Cerebr… (43[0-…       1       1      1     1       1       0
#>  5 charlson_icd8_b… Senile… (290[0…       1       1      1     0       1       2
#>  6 charlson_icd8_b… Chroni… (49[0-…       1       1      1     0       1       1
#>  7 charlson_icd8_b… System… (7(1[0…       1       1      1     0       1       1
#>  8 charlson_icd8_b… Parapl… (344)$        2       1      1     0       2       2
#>  9 charlson_icd8_b… Chroni… (40[34…       2       1      1     3       2       1
#> 10 charlson_icd8_b… Diabet… (250)$        2       1      1     0       2       1
#> 11 charlson_icd8_b… Malign… (1([4-…       2       1      1     0       2       2
#> 12 charlson_icd8_b… Hepati… (070|4…       3       1      1     0       3       4
#> 13 charlson_icd8_b… Second… (19[6-…       6       1      1     0       6       6
#> # … with abbreviated variable names ¹​description, ²​icd8_brusselaers, ³​charlson,
#> #   ⁴​deyo_ramano, ⁵​quan_original, ⁶​quan_updated
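The effect of start and stop on matching can be illustrated with plain regular expressions in base R. The pattern `41[0-2]` is the icd8_brusselaers expression shown for myocardial infarction, and the `(...)$` form mirrors the printed output above. The `^(...)` form for start = TRUE is an assumption about how the anchor is added, and the input strings are hypothetical.

```r
# ICD-8 pattern for myocardial infarction (icd8_brusselaers)
rx <- "41[0-2]"

# Hypothetical input strings
x <- c("410", "ICD8: 410", "410 acute infarction")

# stop = TRUE only: the code must end the string
stop_only <- grepl(paste0("(", rx, ")$"), x)

# start = TRUE only (assumed form): the code must begin the string
start_only <- grepl(paste0("^(", rx, ")"), x)

# Both anchors: the string must consist of the code alone
both <- grepl(paste0("^(", rx, ")$"), x)

data.frame(x, stop_only, start_only, both)
```

Only the stop-anchored pattern accepts "ICD8: 410", which is the situation the settings start = FALSE and stop = TRUE are meant for.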

Note that all index columns remain in the tibble. It is thus possible to combine any categorization with any index, although some combinations might be preferred (such as the regular expressions icd9cm_deyo combined with the index deyo_ramano).
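As a small illustration of combining one categorization with different indices, the sketch below sums the weights of the groups a patient belongs to. The weights are copied from the charlson_icd8 output above for three groups (names shortened). The patient, and the names `weights`, `present` and `idx`, are hypothetical, and hierarchies are ignored for simplicity.

```r
# Weights copied from the charlson_icd8 output above (three groups only)
weights <- data.frame(
  group        = c("myocardial_infarction",
                   "congestive_heart_failure",
                   "metastatic_solid_tumor"),
  charlson     = c(1, 1, 6),
  quan_updated = c(0, 2, 6)
)

# Hypothetical patient: heart failure and a metastatic solid tumor
present <- c(FALSE, TRUE, TRUE)

# The same categorization gives different scores under different indices
idx <- c(
  charlson     = sum(weights$charlson[present]),
  quan_updated = sum(weights$quan_updated[present])
)
idx
```

The categorization is computed once, and each index is just a different weight column applied to it.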

We can now use charlson_icd8 for classification:

classify(410, charlson_icd8)
#> Classification based on: icd8_brusselaers
#> 
#> The printed data is of class: classified, matrix.
#> It has 1 row(s).
#> It is here previewed as a tibble
#> Use `print(x, n = NULL)` to print as is (or use `n` to specify the number of rows to preview)!
#> 
#> # A tibble: 1 × 13
#>   charlson_icd…¹ charl…² charl…³ charl…⁴ charl…⁵ charl…⁶ charl…⁷ charl…⁸ charl…⁹
#>   <lgl>          <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>  
#> 1 TRUE           FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#> # … with 4 more variables:
#> #   charlson_icd8_brusselaers_diabetes_complication <lgl>,
#> #   charlson_icd8_brusselaers_malignancy <lgl>,
#> #   charlson_icd8_brusselaers_moderate_or_severe_liver_disease <lgl>,
#> #   charlson_icd8_brusselaers_metastatic_solid_tumor <lgl>, and abbreviated
#> #   variable names ¹​charlson_icd8_brusselaers_myocardial_infarction,
#> #   ²​charlson_icd8_brusselaers_congestive_heart_failure, …

The ICD-8 code 410 is recognized as (only) myocardial infarction.

Implicit use of set_classcodes()

Instead of pre-specifying charlson_icd8, a similar result is achieved by passing the same settings through cc_args:

classify(
  410,
  "charlson",
  cc_args = list(
    regex      = "icd8_brusselaers", 
    start      = FALSE, 
    stop       = TRUE,
    tech_names = TRUE
  )
)
#> 
#> The printed data is of class: classified, matrix.
#> It has 1 row(s).
#> It is here previewed as a tibble
#> Use `print(x, n = NULL)` to print as is (or use `n` to specify the number of rows to preview)!
#> 
#> # A tibble: 1 × 13
#>   charlson_icd…¹ charl…² charl…³ charl…⁴ charl…⁵ charl…⁶ charl…⁷ charl…⁸ charl…⁹
#>   <lgl>          <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>  
#> 1 TRUE           FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
#> # … with 4 more variables:
#> #   charlson_icd8_brusselaers_diabetes_complication <lgl>,
#> #   charlson_icd8_brusselaers_malignancy <lgl>,
#> #   charlson_icd8_brusselaers_moderate_or_severe_liver_disease <lgl>,
#> #   charlson_icd8_brusselaers_metastatic_solid_tumor <lgl>, and abbreviated
#> #   variable names ¹​charlson_icd8_brusselaers_myocardial_infarction,
#> #   ²​charlson_icd8_brusselaers_congestive_heart_failure, …

Bibliography

Brusselaers, Nele, and Jesper Lagergren. 2017. “The Charlson Comorbidity Index in Registry-Based Research.” Methods of Information in Medicine 56 (05): 401–6. https://doi.org/10.3414/ME17-01-0051.