Recommended variable argument syntax

This is the second vignette; we assume you have already read the “three argument syntax” vignette, which covers the most basic namedCapture functions, str_match_named and str_match_all_named. Here we introduce the syntax used in the namedCapture::*_variable functions, which is motivated by the desire to avoid repetitive boilerplate code.

Extract the first match from each subject

In the previous vignette we used the following code to extract the first match from each subject,

subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA, # neither will this.
  "chr1:110-111 chr2:220-222") # two possible matches.
## Single line pattern, not so easy to read.
single.line.pattern <-
  "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
## Same pattern defined over multiple lines, easier to read.
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")
identical(single.line.pattern, chr.pos.pattern)
#> [1] TRUE
namedCapture::str_match_named(subject.vec, chr.pos.pattern)
#>      chrom   chromStart    chromEnd     
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM"  "111,000"     ""           
#> [3,] NA      NA            NA           
#> [4,] NA      NA            NA           
#> [5,] "chr1"  "110"         "111"

Note that the pattern above is defined using the paste0 boilerplate, which is used to break the pattern over several lines for clarity. Using the variable argument syntax, we can omit paste0, and simply supply the pattern strings to str_match_variable directly,

namedCapture::str_match_variable(
  subject.vec, 
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")
#>      chrom   chromStart    chromEnd     
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM"  "111,000"     ""           
#> [3,] NA      NA            NA           
#> [4,] NA      NA            NA           
#> [5,] "chr1"  "110"         "111"

We can further simplify by removing the named capture groups from the strings, and adding names to the corresponding arguments. For name1="pattern1", namedCapture internally generates/uses the regex (?P<name1>pattern1).

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+",
  "(?:",
    "-",
    chromEnd="[0-9,]+",
  ")?")
#>      chrom   chromStart    chromEnd     
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM"  "111,000"     ""           
#> [3,] NA      NA            NA           
#> [4,] NA      NA            NA           
#> [5,] "chr1"  "110"         "111"
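
The translation from a named argument to a named capture group can be sketched in base R. This is only an illustration, not the package's implementation: wrap_named is a hypothetical helper introduced here, and the match is extracted with base regexpr(perl=TRUE) rather than with namedCapture.

```r
# Hypothetical helper (not part of namedCapture): wrap a pattern
# in a named capture group, as described for name1="pattern1".
wrap_named <- function(name, pattern) {
  paste0("(?P<", name, ">", pattern, ")")
}
pattern <- paste0(
  wrap_named("chrom", "chr.*?"),
  ":",
  wrap_named("chromStart", "[0-9,]+"))
subject <- "chr10:213,054,000-213,055,000"
# Base R reports capture group positions as attributes of the match.
m <- regexpr(pattern, subject, perl=TRUE)
first <- as.integer(attr(m, "capture.start"))
last <- first + as.integer(attr(m, "capture.length")) - 1L
captured <- substring(subject, first, last)
names(captured) <- attr(m, "capture.names")
print(pattern)
print(captured)
```

The named-argument syntax removes the need to write this bookkeeping by hand.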

We can add type conversion functions on the same line as the definition of the named group:

keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
(match.df <- namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+", keep.digits,
  "(?:",
    "-",
    chromEnd="[0-9,]+", keep.digits,
  ")?"))
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 2  chrM     111000        NA
#> 3  <NA>         NA        NA
#> 4  <NA>         NA        NA
#> 5  chr1        110       111
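
To see why the optional chromEnd group becomes NA rather than an empty string, it may help to apply the conversion function by itself. The sketch below uses only base R and the keep.digits definition from the text; the input strings are copied from the character matrix output shown earlier.

```r
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
# Commas are removed before conversion; an empty string (an optional
# group that did not match) converts to NA.
converted <- keep.digits(c("213,054,000", "111,000", ""))
print(converted)
```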

Note the repetition in the chromStart and chromEnd lines of the str_match_variable call: the same pattern and type conversion function are used for each group. This repetition can be avoided by creating and using a sub-pattern list variable,

pos.pattern <- list("[0-9,]+", keep.digits)
namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  "(?:",
    "-",
    chromEnd=pos.pattern,
  ")?")
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 2  chrM     111000        NA
#> 3  <NA>         NA        NA
#> 4  <NA>         NA        NA
#> 5  chr1        110       111

Finally, the non-capturing group can be replaced by an un-named list:

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?")
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 2  chrM     111000        NA
#> 3  <NA>         NA        NA
#> 4  <NA>         NA        NA
#> 5  chr1        110       111

In summary, the str_match_variable function takes a variable number of arguments, and allows for a shorter, less repetitive, and thus more user-friendly syntax:

The first argument is the subject character vector.
The other arguments specify the pattern, via character strings, functions, and/or lists.
If a pattern (character/list) is named, we use the argument name in R for the capture group name in the regex.
Each function is used to convert the text extracted by the previous named pattern argument. Type conversion can only be used with named R arguments, not with named groups written explicitly in regex strings.
Lists may be used to avoid repetition in the definition of the pattern and type conversion functions.
Each list generates a group in the regex (named list => named capture group, un-named list => non-capturing group).
All patterns are pasted together in the order that they appear in the argument list.
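
The rules above can be sketched as a small recursive function in base R. This is a hedged illustration of the rules as stated, not the package's actual code: build_regex is a hypothetical name, it ignores conversion functions apart from skipping them, and it performs none of the argument checking a real implementation would need.

```r
# Hypothetical sketch, NOT namedCapture's implementation.
build_regex <- function(args) {
  arg.names <- names(args)
  if (is.null(arg.names)) arg.names <- rep("", length(args))
  pieces <- character(0)
  for (i in seq_along(args)) {
    value <- args[[i]]
    # Conversion functions contribute nothing to the regex string.
    if (is.function(value)) next
    inner <- if (is.list(value)) build_regex(value) else value
    piece <- if (arg.names[[i]] != "") {
      paste0("(?P<", arg.names[[i]], ">", inner, ")") # named => capture group
    } else if (is.list(value)) {
      paste0("(?:", inner, ")") # un-named list => non-capturing group
    } else {
      inner # un-named string => used as is
    }
    pieces <- c(pieces, piece)
  }
  paste(pieces, collapse="")
}
pos.pattern <- list("[0-9,]+", as.integer)
generated <- build_regex(list(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list("-", chromEnd=pos.pattern),
  "?"))
print(generated)
```

For this input the sketch produces the same string as single.line.pattern defined at the start of the vignette.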

View generated regex

To see the regular expression pattern string generated by the namedCapture::*_variable functions, call variable_args_list with the variable number of arguments that specify the pattern:

(L <- namedCapture::variable_args_list(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?"))
#> $fun.list
#> $fun.list$chromStart
#> function(x)as.integer(gsub("[^0-9]", "", x))
#> <bytecode: 0x55a8df4095e0>
#> 
#> $fun.list$chromEnd
#> function(x)as.integer(gsub("[^0-9]", "", x))
#> <bytecode: 0x55a8df4095e0>
#> 
#> 
#> $pattern
#> [1] "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
identical(L$pattern, single.line.pattern)
#> [1] TRUE

The generated regex is the pattern element of the resulting list above (which is internally passed to namedCapture::*_named). Note how the generated regex is identical to the regex we defined above using a character string literal; the advantage of namedCapture::*_variable functions is that the regex is much easier to read/understand/edit.

Error if any subjects do not match

Sometimes you want to stop with an error (instead of reporting a row of NA) when a subject does not match. In that case, use nomatch.error=TRUE:

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?",
  nomatch.error=TRUE)
#> [1] "this will not match" NA
#> Error in namedCapture::str_match_variable(subject.vec, chrom = "chr.*?", : subjects printed above did not match regex below
#> (?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?
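
For comparison, a similar check can be written by hand in base R. The sketch below is an assumption about how one might do it, not a description of what nomatch.error does internally; it treats NA subjects as non-matching, which agrees with the subjects listed in the error output above.

```r
subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA,
  "chr1:110-111 chr2:220-222")
pattern <-
  "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
# regexpr returns -1 for no match and NA for a missing subject.
pos <- as.integer(regexpr(pattern, subject.vec, perl=TRUE))
no.match <- is.na(pos) | pos == -1L
unmatched <- subject.vec[no.match]
print(unmatched)
if (any(no.match)) {
  message(sum(no.match), " subjects did not match")
}
```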

Extract all matches from a multi-line text file subject

The variable argument syntax can also be used with str_match_all_variable, which is for the common case of extracting each match from a multi-line text file. In this section we demonstrate how to use str_match_all_variable to extract data.frames from a loosely structured text file.

trackDb.txt.gz <- system.file(
  "extdata", "trackDb.txt.gz", package="namedCapture")
trackDb.vec <- readLines(trackDb.txt.gz)

Some representative lines from that file are shown below.

cat(trackDb.vec[78:107], sep="\n")
#> track peaks_summary
#> type bigBed 5
#> shortLabel _model_peaks_summary
#> longLabel Regions with a peak in at least one sample
#> visibility pack
#> itemRgb off
#> spectrum on
#> bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed
#> 
#> 
#>  track bcell_McGill0091
#>  parent bcell
#>  container multiWig
#>  type bigWig
#>  shortLabel bcell_McGill0091
#>  longLabel bcell | McGill0091
#>  graphType points
#>  aggregate transparentOverlay
#>  showSubtrackColorOnUi on
#>  maxHeightPixels 25:12:8
#>  visibility full
#>  autoScale on
#> 
#>   track bcell_McGill0091Coverage
#>   bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig
#>   shortLabel bcell_McGill0091Coverage
#>   longLabel bcell | McGill0091 | Coverage
#>   parent bcell_McGill0091
#>   type bigWig
#>   color 141,211,199

Each block of text begins with “track” and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:

fields.df <- namedCapture::str_match_all_variable(
  trackDb.vec,
  "track ",
  name="\\S+",
  fields="(?:\n[^\n]+)*",
  "\n")

Note that this function assumes that its first argument is a character vector with one element for each line in a file. Therefore the result contains no information about which subject element each match comes from (to get that, use str_match_all_named). The code above creates a character matrix (its printed output shows quoted strings) with one row for each track block, with rownames given by the track name (because of the capture group named name), and one fields column which is a string with the rest of the data in that block.

head(fields.df)
#>                        fields                                                                                                      
#> bcell                  "\nsuperTrack on show\nshortLabel bcell\nlongLabel bcell ChIP-seq samples"                                  
#> kidneyCancer           "\nsuperTrack on show\nshortLabel kidneyCancer\nlongLabel kidneyCancer ChIP-seq samples"                    
#> kidney                 "\nsuperTrack on show\nshortLabel kidney\nlongLabel kidney ChIP-seq samples"                                
#> leukemiaCD19CD10BCells "\nsuperTrack on show\nshortLabel leukemiaCD19CD10BCells\nlongLabel leukemiaCD19CD10BCells ChIP-seq samples"
#> monocyte               "\nsuperTrack on show\nshortLabel monocyte\nlongLabel monocyte ChIP-seq samples"                            
#> skeletalMuscleCtrl     "\nsuperTrack on show\nshortLabel skeletalMuscleCtrl\nlongLabel skeletalMuscleCtrl ChIP-seq samples"
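
The block-matching idea can be tried on a small input in base R. The sketch below assumes that the lines are joined with newline characters before matching (which the \n in the pattern suggests); lines.vec is made-up data in the same format as the file, and un-named groups with back-references stand in for the named groups.

```r
# Made-up lines in the same format as trackDb.txt.
lines.vec <- c(
  "track bcell",
  "superTrack on show",
  "shortLabel bcell",
  "",
  "track kidney",
  "shortLabel kidney",
  "")
one.subject <- paste(lines.vec, collapse="\n")
# Group 1 is the track name, group 2 is the rest of the block.
block.pattern <- "track (\\S+)((?:\n[^\n]+)*)\n"
block.vec <- regmatches(
  one.subject, gregexpr(block.pattern, one.subject, perl=TRUE))[[1]]
fields <- sub(block.pattern, "\\2", block.vec, perl=TRUE)
names(fields) <- sub(block.pattern, "\\1", block.vec, perl=TRUE)
print(fields)
```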

Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:

fields.list <- namedCapture::str_match_all_named(
  fields.df[, "fields"], paste0(
    "\\s+",
    "(?P<name>.*?)",
    " ",
    "(?P<value>[^\n]+)"))

Note that we used str_match_all_named, which outputs a list, in order to keep track of which match came from which subject. The result is a list of character matrices, one per track.

fields.list[12:14]
#> $peaks_summary
#>            value                                                                  
#> type       "bigBed 5"                                                             
#> shortLabel "_model_peaks_summary"                                                 
#> longLabel  "Regions with a peak in at least one sample"                           
#> visibility "pack"                                                                 
#> itemRgb    "off"                                                                  
#> spectrum   "on"                                                                   
#> bigDataUrl "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed"
#> 
#> $bcell_McGill0091
#>                       value               
#> parent                "bcell"             
#> container             "multiWig"          
#> type                  "bigWig"            
#> shortLabel            "bcell_McGill0091"  
#> longLabel             "bcell | McGill0091"
#> graphType             "points"            
#> aggregate             "transparentOverlay"
#> showSubtrackColorOnUi "on"                
#> maxHeightPixels       "25:12:8"           
#> visibility            "full"              
#> autoScale             "on"                
#> 
#> $bcell_McGill0091Coverage
#>            value                                                                                      
#> bigDataUrl "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig"
#> shortLabel "bcell_McGill0091Coverage"                                                                 
#> longLabel  "bcell | McGill0091 | Coverage"                                                            
#> parent     "bcell_McGill0091"                                                                         
#> type       "bigWig"                                                                                   
#> color      "141,211,199"

There is a list element for each block, named by track. Each list element is a character matrix with one row per field defined in that block (rownames are field names). The names/rownames make it easy to write R code that selects individual elements by name, e.g.

fields.list$bcell_McGill0091Coverage["bigDataUrl",]
#> [1] "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig"
fields.list$monocyte_McGill0001Peaks["color",]
#> [1] "0,0,0"
has.bigDataUrl <- sapply(fields.list, function(m)"bigDataUrl" %in% rownames(m))
bigDataUrl.list <- fields.list[has.bigDataUrl]
length(bigDataUrl.list)
#> [1] 78
length(fields.list)
#> [1] 123

So there are 78 tracks which define the bigDataUrl field, out of 123 total tracks.
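
The field name/value regex used earlier in this section can also be checked on a single block with base R. This sketch is only an illustration: fields.string is copied from the bcell row of the head(fields.df) output, and the capture positions are read from the attributes that gregexpr(perl=TRUE) attaches to its result.

```r
fields.string <-
  "\nsuperTrack on show\nshortLabel bcell\nlongLabel bcell ChIP-seq samples"
field.pattern <- "\\s+(?P<name>.*?) (?P<value>[^\n]+)"
g <- gregexpr(field.pattern, fields.string, perl=TRUE)[[1]]
# One row per match, one column per capture group (name, value).
first <- attr(g, "capture.start")
last <- first + attr(g, "capture.length") - 1L
field.values <- substring(fields.string, first[, 2], last[, 2])
names(field.values) <- substring(fields.string, first[, 1], last[, 1])
print(field.values[["longLabel"]])
```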

In the example above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the bigDataUrl field for each track, and split sample names into separate columns (using a single regex for the track). The example also demonstrates how to use nested named capture groups (via named lists which contain named regex strings).

name.pattern <- list(
  cellType=".*?",
  "_",
  sampleName=list(
    "McGill",
    sampleID="[0-9]+", as.integer),
  dataType="Coverage|Peaks",
  "|",
  "[^\n]+")
match.df <- namedCapture::str_match_all_variable(
  trackDb.vec,
  "track ",
  name=name.pattern,
  "(?:\n[^\n]+)*",
  "\\s+bigDataUrl ",
  bigDataUrl="[^\n]+")
head(match.df)
#>                          cellType sampleName sampleID dataType
#> all_labels                                         NA         
#> problems                                           NA         
#> jointProblems                                      NA         
#> peaks_summary                                      NA         
#> bcell_McGill0091Coverage    bcell McGill0091       91 Coverage
#> bcell_McGill0091Peaks       bcell McGill0091       91    Peaks
#>                                                                                                            bigDataUrl
#> all_labels                                         http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/all_labels.bigBed
#> problems                                             http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/problems.bigBed
#> jointProblems                                   http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/jointProblems.bigBed
#> peaks_summary                                   http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed
#> bcell_McGill0091Coverage    http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig
#> bcell_McGill0091Peaks    http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/joint_peaks.bigWig
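
To see how the nested groups and the alternation in name.pattern behave, the regex can be written out by hand and tried on two track names in base R. This is a sketch under assumptions: nested.pattern is a hand-written equivalent of what name=name.pattern should generate according to the rules in the first section, and first_captures is a hypothetical helper, not a namedCapture function.

```r
# Hypothetical helper for this sketch: text of each capture group
# in the first match, named by group.
first_captures <- function(pattern, subject) {
  m <- regexpr(pattern, subject, perl=TRUE)
  first <- as.integer(attr(m, "capture.start"))
  last <- first + as.integer(attr(m, "capture.length")) - 1L
  out <- substring(subject, first, last)
  names(out) <- attr(m, "capture.names")
  out
}
nested.pattern <- paste0(
  "(?P<name>",
  "(?P<cellType>.*?)",
  "_",
  "(?P<sampleName>McGill(?P<sampleID>[0-9]+))",
  "(?P<dataType>Coverage|Peaks)",
  "|", # otherwise, fall back to the whole line
  "[^\n]+",
  ")")
sample.track <- first_captures(nested.pattern, "bcell_McGill0091Coverage")
other.track <- first_captures(nested.pattern, "peaks_summary")
print(sample.track)
print(other.track)
```

Tracks that do not follow the cellType_sampleName naming scheme match only the fallback alternative, so their inner groups are empty, which is consistent with the blank cellType and NA sampleID values in the first rows of match.df.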

Exercise for the reader: modify the above regex in order to capture three additional columns (red, green, blue) from the color field.

Extract several columns of a data frame

We also provide namedCapture::df_match_variable which extracts text from several columns of a data.frame, using a different named capture regular expression for each column.

It requires a data.frame as the first argument.
It takes a variable number of other arguments, all of which must be named. For each other argument we call str_match_variable on one column of the input data.frame.
Each argument name specifies a column of the data.frame which will be used as the subject in str_match_variable.
Each argument value specifies a pattern to be used with str_match_variable, in list/character/function format as explained in the previous section.
The return value is a data.frame with the same number of rows as the input, but with an additional column for each named capture group. New columns are named using the convention subjectColumnName.groupName.
This is a “tidy” function that can be used in a pipe. This function can greatly simplify the code required to create numeric data columns from character data columns. For example consider the following data which was output from the sacct program.

(sacct.df <- data.frame(
  Elapsed = c(
    "07:04:42", "07:04:42", "07:04:49",
    "00:00:00", "00:00:00"),
  JobID=c(
    "13937810_25",
    "13937810_25.batch",
    "13937810_25.extern",
    "14022192_[1-3]",
    "14022204_[4]"),
  stringsAsFactors=FALSE))
#>    Elapsed              JobID
#> 1 07:04:42        13937810_25
#> 2 07:04:42  13937810_25.batch
#> 3 07:04:49 13937810_25.extern
#> 4 00:00:00     14022192_[1-3]
#> 5 00:00:00       14022204_[4]

Say we want to filter by the total Elapsed time (which is reported as hours:minutes:seconds), and base job id (which is the number before the underscore in the JobID column). We could start by converting those character columns to integers via:

## Define some sub-patterns separately for clarity.
range.pattern <- list(
  "[[]",
  task1="[0-9]+", as.integer,
  "(?:-",#begin optional end of range.
  taskN="[0-9]+", as.integer,
  ")?", #end is optional.
  "[]]")
task.pattern <- list(
  "(?:",#begin alternate
  task="[0-9]+", as.integer,
  "|",#either one task(above) or range(below)
  range.pattern,
  ")")#end alternate
(task.df <- namedCapture::df_match_variable(
  sacct.df,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task.pattern,
    "(?:[.]",
    type=".*",
    ")?"),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer)))
#>    Elapsed              JobID JobID.job JobID.task JobID.task1 JobID.taskN
#> 1 07:04:42        13937810_25  13937810         25          NA          NA
#> 2 07:04:42  13937810_25.batch  13937810         25          NA          NA
#> 3 07:04:49 13937810_25.extern  13937810         25          NA          NA
#> 4 00:00:00     14022192_[1-3]  14022192         NA           1           3
#> 5 00:00:00       14022204_[4]  14022204         NA           4          NA
#>   JobID.type Elapsed.hours Elapsed.minutes Elapsed.seconds
#> 1                        7               4              42
#> 2      batch             7               4              42
#> 3     extern             7               4              49
#> 4                        0               0               0
#> 5                        0               0               0

The result is another data frame with an additional column for each named capture group. Note that this also works with data.table:

library(data.table)
sacct.dt <- data.table(sacct.df)
(task.dt <- namedCapture::df_match_variable(
  sacct.dt,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task.pattern,
    "(?:[.]",
    type=".*",
    ")?"),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer)))
#>     Elapsed              JobID JobID.job JobID.task JobID.task1 JobID.taskN
#> 1: 07:04:42        13937810_25  13937810         25          NA          NA
#> 2: 07:04:42  13937810_25.batch  13937810         25          NA          NA
#> 3: 07:04:49 13937810_25.extern  13937810         25          NA          NA
#> 4: 00:00:00     14022192_[1-3]  14022192         NA           1           3
#> 5: 00:00:00       14022204_[4]  14022204         NA           4          NA
#>    JobID.type Elapsed.hours Elapsed.minutes Elapsed.seconds
#> 1:                        7               4              42
#> 2:      batch             7               4              42
#> 3:     extern             7               4              49
#> 4:                        0               0               0
#> 5:                        0               0               0
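
As a point of comparison, the subjectColumnName.groupName convention can be imitated for the Elapsed column in base R. The sketch below is an illustration only and does not use df_match_variable: elapsed.df is a reduced copy of the example data, and total.seconds is a hypothetical extra column showing how the converted integers could be used for filtering by total elapsed time.

```r
elapsed.df <- data.frame(
  Elapsed=c("07:04:42", "07:04:49", "00:00:00"),
  stringsAsFactors=FALSE)
elapsed.pattern <-
  "(?P<hours>[0-9]+):(?P<minutes>[0-9]+):(?P<seconds>[0-9]+)"
m <- regexpr(elapsed.pattern, elapsed.df$Elapsed, perl=TRUE)
# One row per subject, one column per capture group.
first <- attr(m, "capture.start")
last <- first + attr(m, "capture.length") - 1L
group.names <- attr(m, "capture.names")
for (j in seq_along(group.names)) {
  text <- substring(elapsed.df$Elapsed, first[, j], last[, j])
  # New columns follow the subjectColumnName.groupName convention.
  elapsed.df[[paste0("Elapsed.", group.names[[j]])]] <- as.integer(text)
}
elapsed.df$total.seconds <- with(
  elapsed.df,
  Elapsed.hours * 3600L + Elapsed.minutes * 60L + Elapsed.seconds)
print(elapsed.df)
```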