This is the second vignette; we assume you have already read the “three argument syntax” vignette, which covers the most basic namedCapture functions, str_match_named and str_match_all_named. Here we introduce the syntax used in the namedCapture::*_variable functions, which is motivated by the desire to avoid repetitive/boilerplate code.
In the previous vignette we used the following code to extract the first match from each subject:
subject.vec <- c(
"chr10:213,054,000-213,055,000",
"chrM:111,000",
"this will not match",
NA, # neither will this.
"chr1:110-111 chr2:220-222") # two possible matches.
## Single line pattern, not so easy to read.
single.line.pattern <-
"(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
## Same pattern defined over multiple lines, easier to read.
chr.pos.pattern <- paste0(
"(?P<chrom>chr.*?)",
":",
"(?P<chromStart>[0-9,]+)",
"(?:",
"-",
"(?P<chromEnd>[0-9,]+)",
")?")
identical(single.line.pattern, chr.pos.pattern)
#> [1] TRUE
namedCapture::str_match_named(subject.vec, chr.pos.pattern)
#> chrom chromStart chromEnd
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM" "111,000" ""
#> [3,] NA NA NA
#> [4,] NA NA NA
#> [5,] "chr1" "110" "111"
Note that the pattern above is defined using the paste0 boilerplate, which is used to break the pattern over several lines for clarity. Using the variable argument syntax, we can omit paste0 and simply supply the pattern strings to str_match_variable directly:
namedCapture::str_match_variable(
subject.vec,
"(?P<chrom>chr.*?)",
":",
"(?P<chromStart>[0-9,]+)",
"(?:",
"-",
"(?P<chromEnd>[0-9,]+)",
")?")
#> chrom chromStart chromEnd
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM" "111,000" ""
#> [3,] NA NA NA
#> [4,] NA NA NA
#> [5,] "chr1" "110" "111"
We can further simplify by removing the named capture groups from the strings, and adding names to the corresponding arguments. For an argument name1="pattern1", namedCapture internally generates and uses the regex (?P<name1>pattern1).
namedCapture::str_match_variable(
subject.vec,
chrom="chr.*?",
":",
chromStart="[0-9,]+",
"(?:",
"-",
chromEnd="[0-9,]+",
")?")
#> chrom chromStart chromEnd
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM" "111,000" ""
#> [3,] NA NA NA
#> [4,] NA NA NA
#> [5,] "chr1" "110" "111"
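The naming rule can be tried without the package. The following is a minimal base R sketch, not the namedCapture implementation; named_group is a hypothetical helper introduced only for this illustration. It wraps each pattern in a (?P<name>...) group and checks that the pasted result equals the single-line pattern defined earlier.

```r
## Hypothetical helper (not a namedCapture function):
## wrap a pattern string in a named capture group.
named_group <- function(name, pattern) {
  paste0("(?P<", name, ">", pattern, ")")
}
## Paste the pieces together in the order they are given.
generated <- paste0(
  named_group("chrom", "chr.*?"),
  ":",
  named_group("chromStart", "[0-9,]+"),
  "(?:",
  "-",
  named_group("chromEnd", "[0-9,]+"),
  ")?")
## Same literal as single.line.pattern defined earlier in this vignette.
single.line.pattern <-
  "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
print(identical(generated, single.line.pattern))
```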
We can add type conversion functions on the same line as the definition of the named group:
keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
(match.df <- namedCapture::str_match_variable(
subject.vec,
chrom="chr.*?",
":",
chromStart="[0-9,]+", keep.digits,
"(?:",
"-",
chromEnd="[0-9,]+", keep.digits,
")?"))
#> chrom chromStart chromEnd
#> 1 chr10 213054000 213055000
#> 2 chrM 111000 NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 chr1 110 111
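To see what the conversion function does on its own, the sketch below applies keep.digits (as defined above) directly to a few strings, using base R only. The input vector is made up for this illustration. An empty string converts to NA, which is consistent with the NA shown in the chromEnd column for "chrM:111,000".

```r
## Same definition as in the vignette: drop non-digits, then convert.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
## Illustrative inputs: two numbers with commas, an empty capture, a missing value.
captured <- c("213,054,000", "111,000", "", NA)
converted <- keep.digits(captured)
print(converted)
```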
Note the repetition in the chromStart and chromEnd lines: the same pattern and type conversion function are used for each group. This repetition can be avoided by creating and using a sub-pattern list variable:
pos.pattern <- list("[0-9,]+", keep.digits)
namedCapture::str_match_variable(
subject.vec,
chrom="chr.*?",
":",
chromStart=pos.pattern,
"(?:",
"-",
chromEnd=pos.pattern,
")?")
#> chrom chromStart chromEnd
#> 1 chr10 213054000 213055000
#> 2 chrM 111000 NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 chr1 110 111
Finally, the non-capturing group can be replaced by an un-named list:
namedCapture::str_match_variable(
subject.vec,
chrom="chr.*?",
":",
chromStart=pos.pattern,
list(
"-",
chromEnd=pos.pattern
), "?")
#> chrom chromStart chromEnd
#> 1 chr10 213054000 213055000
#> 2 chrM 111000 NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 chr1 110 111
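The way lists, names, and functions combine into one regex can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package source: sketch_pattern is a hypothetical function written for this example, and it only builds the pattern string (it ignores conversion functions instead of recording them).

```r
## Hypothetical sketch, NOT the namedCapture implementation:
## recursively convert a list of pattern pieces into one regex string.
sketch_pattern <- function(pieces) {
  piece.names <- names(pieces)
  if (is.null(piece.names)) piece.names <- rep("", length(pieces))
  out <- ""
  for (i in seq_along(pieces)) {
    piece <- pieces[[i]]
    ## Conversion functions contribute nothing to the regex itself.
    if (is.function(piece)) next
    ## Lists are handled recursively; strings are used as-is.
    regex <- if (is.list(piece)) sketch_pattern(piece) else piece
    wrapped <- if (piece.names[[i]] != "") {
      paste0("(?P<", piece.names[[i]], ">", regex, ")") # named group.
    } else if (is.list(piece)) {
      paste0("(?:", regex, ")") # un-named list: non-capturing group.
    } else {
      regex
    }
    out <- paste0(out, wrapped)
  }
  out
}
pos.pattern <- list("[0-9,]+", as.integer)
generated <- sketch_pattern(list(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?"))
print(generated)
```

Under these assumptions the printed string is the same as single.line.pattern defined at the start of this vignette.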
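In summary, the str_match_variable function takes a variable number of arguments, and allows for a shorter, less repetitive, and thus more user-friendly syntax. The first argument is the subject character vector; the remaining arguments (character strings, functions, and lists) are combined in order to form the regex. A named argument becomes a named capture group, a function converts the text captured by the named group that precedes it, and an un-named list becomes a non-capturing group.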
To see the regular expression pattern string generated by the namedCapture::*_variable functions, call variable_args_list with the variable number of arguments that specify the pattern:
(L <- namedCapture::variable_args_list(
chrom="chr.*?",
":",
chromStart=pos.pattern,
list(
"-",
chromEnd=pos.pattern
), "?"))
#> $fun.list
#> $fun.list$chromStart
#> function(x)as.integer(gsub("[^0-9]", "", x))
#> <bytecode: 0x55a8df4095e0>
#>
#> $fun.list$chromEnd
#> function(x)as.integer(gsub("[^0-9]", "", x))
#> <bytecode: 0x55a8df4095e0>
#>
#>
#> $pattern
#> [1] "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
identical(L$pattern, single.line.pattern)
#> [1] TRUE
The generated regex is the pattern element of the resulting list above (which is internally passed to namedCapture::*_named). Note how the generated regex is identical to the regex we defined above using a character string literal; the advantage of the namedCapture::*_variable functions is that the regex is much easier to read, understand, and edit.
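Because the generated pattern is an ordinary PCRE string, it can also be passed to base R functions. The sketch below is an illustration only; it uses regexpr with perl=TRUE and the capture.start, capture.length, and capture.names attributes to pull the named groups out of a single subject.

```r
pattern <- "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
subject <- "chr10:213,054,000-213,055,000"
m <- regexpr(pattern, subject, perl=TRUE)
## One column per capture group; positions are 1-based character indices.
first <- attr(m, "capture.start")
last <- first + attr(m, "capture.length") - 1L
groups <- substring(subject, as.vector(first), as.vector(last))
names(groups) <- attr(m, "capture.names")
print(groups)
```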
Sometimes you want to stop with an error (instead of reporting a row of NA) when a subject does not match. In that case, use nomatch.error=TRUE:
namedCapture::str_match_variable(
subject.vec,
chrom="chr.*?",
":",
chromStart=pos.pattern,
list(
"-",
chromEnd=pos.pattern
), "?",
nomatch.error=TRUE)
#> [1] "this will not match" NA
#> Error in namedCapture::str_match_variable(subject.vec, chrom = "chr.*?", : subjects printed above did not match regex below
#> (?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?
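The same kind of check can be written by hand in base R. The sketch below is a hedged illustration, not the package code: stop_if_no_match is a hypothetical helper, and its error message is made up for this example. It relies on grepl returning FALSE for NA subjects.

```r
## Hypothetical helper, not part of namedCapture:
## stop with an error if any subject fails to match the pattern.
stop_if_no_match <- function(subject.vec, pattern) {
  matched <- grepl(pattern, subject.vec, perl=TRUE) # FALSE for NA subjects.
  if (any(!matched)) {
    stop(sum(!matched), " subject(s) did not match regex ", pattern)
  }
  invisible(matched)
}
## Two of these three subjects do not match, so an error is raised and caught.
result <- tryCatch(
  stop_if_no_match(
    c("chr1:110-111", "this will not match", NA),
    "chr.*?:[0-9,]+"),
  error=function(e) conditionMessage(e))
print(result)
```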
The variable argument syntax can also be used with str_match_all_variable, which is for the common case of extracting each match from a multi-line text file. In this section we demonstrate how to use str_match_all_variable to extract data.frames from a loosely structured text file.
trackDb.txt.gz <- system.file(
"extdata", "trackDb.txt.gz", package="namedCapture")
trackDb.vec <- readLines(trackDb.txt.gz)
Some representative lines from that file are shown below.
cat(trackDb.vec[78:107], sep="\n")
#> track peaks_summary
#> type bigBed 5
#> shortLabel _model_peaks_summary
#> longLabel Regions with a peak in at least one sample
#> visibility pack
#> itemRgb off
#> spectrum on
#> bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed
#>
#>
#> track bcell_McGill0091
#> parent bcell
#> container multiWig
#> type bigWig
#> shortLabel bcell_McGill0091
#> longLabel bcell | McGill0091
#> graphType points
#> aggregate transparentOverlay
#> showSubtrackColorOnUi on
#> maxHeightPixels 25:12:8
#> visibility full
#> autoScale on
#>
#> track bcell_McGill0091Coverage
#> bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig
#> shortLabel bcell_McGill0091Coverage
#> longLabel bcell | McGill0091 | Coverage
#> parent bcell_McGill0091
#> type bigWig
#> color 141,211,199
Each block of text begins with “track” and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:
fields.df <- namedCapture::str_match_all_variable(
trackDb.vec,
"track ",
name="\\S+",
fields="(?:\n[^\n]+)*",
"\n")
Note that this function assumes that its first argument is a character vector with one element for each line in a file. Therefore the result contains no information about which subject element each match comes from (to get that, use str_match_all_named).
The code above creates a data frame with one row for each track block,
with rownames given by the track line (because of the capture group
named name), and one fields column which is a string with the rest of
the data in that block.
head(fields.df)
#> fields
#> bcell "\nsuperTrack on show\nshortLabel bcell\nlongLabel bcell ChIP-seq samples"
#> kidneyCancer "\nsuperTrack on show\nshortLabel kidneyCancer\nlongLabel kidneyCancer ChIP-seq samples"
#> kidney "\nsuperTrack on show\nshortLabel kidney\nlongLabel kidney ChIP-seq samples"
#> leukemiaCD19CD10BCells "\nsuperTrack on show\nshortLabel leukemiaCD19CD10BCells\nlongLabel leukemiaCD19CD10BCells ChIP-seq samples"
#> monocyte "\nsuperTrack on show\nshortLabel monocyte\nlongLabel monocyte ChIP-seq samples"
#> skeletalMuscleCtrl "\nsuperTrack on show\nshortLabel skeletalMuscleCtrl\nlongLabel skeletalMuscleCtrl ChIP-seq samples"
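The block-matching idea can be tried on a small input with base R. The sketch below is illustrative only: the toy lines are made up for this example, and joining the lines with newlines before matching is an assumption about how a line-per-element vector can be treated as one text, not a description of the package internals.

```r
## Made-up input in the same spirit as trackDb.txt: two small blocks.
toy.lines <- c(
  "track a",
  "type bigWig",
  "color 1,2,3",
  "",
  "track b",
  "type bigBed",
  "")
## Assumption: treat the lines as one text, separated by newlines.
toy.text <- paste(toy.lines, collapse="\n")
block.pattern <- "track (?P<name>\\S+)(?P<fields>(?:\n[^\n]+)*)\n"
m <- gregexpr(block.pattern, toy.text, perl=TRUE)[[1]]
## One row per match, one column per named group.
first <- attr(m, "capture.start")
last <- first + attr(m, "capture.length") - 1L
fields <- substring(toy.text, first[, "fields"], last[, "fields"])
names(fields) <- substring(toy.text, first[, "name"], last[, "name"])
print(fields)
```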
Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:
fields.list <- namedCapture::str_match_all_named(
fields.df[, "fields"], paste0(
"\\s+",
"(?P<name>.*?)",
" ",
"(?P<value>[^\n]+)"))
Note that we used str_match_all_named, which outputs a list, in order to keep info about which match came from which subject. The result is a list of data frames.
fields.list[12:14]
#> $peaks_summary
#> value
#> type "bigBed 5"
#> shortLabel "_model_peaks_summary"
#> longLabel "Regions with a peak in at least one sample"
#> visibility "pack"
#> itemRgb "off"
#> spectrum "on"
#> bigDataUrl "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed"
#>
#> $bcell_McGill0091
#> value
#> parent "bcell"
#> container "multiWig"
#> type "bigWig"
#> shortLabel "bcell_McGill0091"
#> longLabel "bcell | McGill0091"
#> graphType "points"
#> aggregate "transparentOverlay"
#> showSubtrackColorOnUi "on"
#> maxHeightPixels "25:12:8"
#> visibility "full"
#> autoScale "on"
#>
#> $bcell_McGill0091Coverage
#> value
#> bigDataUrl "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig"
#> shortLabel "bcell_McGill0091Coverage"
#> longLabel "bcell | McGill0091 | Coverage"
#> parent "bcell_McGill0091"
#> type "bigWig"
#> color "141,211,199"
There is a list element for each block, named by track. Each list element is a data frame with one row per field defined in that block (rownames are field names). The names/rownames make it easy to write R code that selects individual elements by name, e.g.
fields.list$bcell_McGill0091Coverage["bigDataUrl",]
#> [1] "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig"
fields.list$monocyte_McGill0001Peaks["color",]
#> [1] "0,0,0"
has.bigDataUrl <- sapply(fields.list, function(m)"bigDataUrl" %in% rownames(m))
bigDataUrl.list <- fields.list[has.bigDataUrl]
length(bigDataUrl.list)
#> [1] 78
length(fields.list)
#> [1] 123
So there are 78 tracks which define the bigDataUrl field, out of 123 total tracks.
In the example above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the bigDataUrl field for each track, and split sample names into separate columns (using a single regex for the track). It also demonstrates how to use nested named capture groups (via named lists which contain named regex strings).
name.pattern <- list(
cellType=".*?",
"_",
sampleName=list(
"McGill",
sampleID="[0-9]+", as.integer),
dataType="Coverage|Peaks",
"|",
"[^\n]+")
match.df <- namedCapture::str_match_all_variable(
trackDb.vec,
"track ",
name=name.pattern,
"(?:\n[^\n]+)*",
"\\s+bigDataUrl ",
bigDataUrl="[^\n]+")
head(match.df)
#> cellType sampleName sampleID dataType
#> all_labels NA
#> problems NA
#> jointProblems NA
#> peaks_summary NA
#> bcell_McGill0091Coverage bcell McGill0091 91 Coverage
#> bcell_McGill0091Peaks bcell McGill0091 91 Peaks
#> bigDataUrl
#> all_labels http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/all_labels.bigBed
#> problems http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/problems.bigBed
#> jointProblems http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/jointProblems.bigBed
#> peaks_summary http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed
#> bcell_McGill0091Coverage http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig
#> bcell_McGill0091Peaks http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/joint_peaks.bigWig
Exercise for the reader: modify the above regex in order to capture three additional columns (red, green, blue) from the color field.
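We also provide namedCapture::df_match_variable, which extracts text from several columns of a data.frame, using a different named capture regular expression for each column.
The first argument is the data.frame, and the other arguments are named. For each of those arguments, str_match_variable is called on one column of the input data.frame. Each argument name specifies the column used as the subject in str_match_variable. Each argument value specifies the pattern passed to str_match_variable, in list/character/function format as explained in the previous section. Each new column in the result is named using the convention subjectColumnName.groupName.
(sacct.df <- data.frame(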
Elapsed = c(
"07:04:42", "07:04:42", "07:04:49",
"00:00:00", "00:00:00"),
JobID=c(
"13937810_25",
"13937810_25.batch",
"13937810_25.extern",
"14022192_[1-3]",
"14022204_[4]"),
stringsAsFactors=FALSE))
#> Elapsed JobID
#> 1 07:04:42 13937810_25
#> 2 07:04:42 13937810_25.batch
#> 3 07:04:49 13937810_25.extern
#> 4 00:00:00 14022192_[1-3]
#> 5 00:00:00 14022204_[4]
Say we want to filter by the total Elapsed time (which is reported as hours:minutes:seconds), and base job id (which is the number before the underscore in the JobID column). We could start by converting those character columns to integers via:
## Define some sub-patterns separately for clarity.
range.pattern <- list(
"[[]",
task1="[0-9]+", as.integer,
"(?:-",#begin optional end of range.
taskN="[0-9]+", as.integer,
")?", #end is optional.
"[]]")
task.pattern <- list(
"(?:",#begin alternate
task="[0-9]+", as.integer,
"|",#either one task(above) or range(below)
range.pattern,
")")#end alternate
(task.df <- namedCapture::df_match_variable(
sacct.df,
JobID=list(
job="[0-9]+", as.integer,
"_",
task.pattern,
"(?:[.]",
type=".*",
")?"),
Elapsed=list(
hours="[0-9]+", as.integer,
":",
minutes="[0-9]+", as.integer,
":",
seconds="[0-9]+", as.integer)))
#> Elapsed JobID JobID.job JobID.task JobID.task1 JobID.taskN
#> 1 07:04:42 13937810_25 13937810 25 NA NA
#> 2 07:04:42 13937810_25.batch 13937810 25 NA NA
#> 3 07:04:49 13937810_25.extern 13937810 25 NA NA
#> 4 00:00:00 14022192_[1-3] 14022192 NA 1 3
#> 5 00:00:00 14022204_[4] 14022204 NA 4 NA
#> JobID.type Elapsed.hours Elapsed.minutes Elapsed.seconds
#> 1 7 4 42
#> 2 batch 7 4 42
#> 3 extern 7 4 49
#> 4 0 0 0
#> 5 0 0 0
The result is another data frame with an additional column for each named capture group. Note that this also works with data.table:
library(data.table)
sacct.dt <- data.table(sacct.df)
(task.dt <- namedCapture::df_match_variable(
sacct.dt,
JobID=list(
job="[0-9]+", as.integer,
"_",
task.pattern,
"(?:[.]",
type=".*",
")?"),
Elapsed=list(
hours="[0-9]+", as.integer,
":",
minutes="[0-9]+", as.integer,
":",
seconds="[0-9]+", as.integer)))
#> Elapsed JobID JobID.job JobID.task JobID.task1 JobID.taskN
#> 1: 07:04:42 13937810_25 13937810 25 NA NA
#> 2: 07:04:42 13937810_25.batch 13937810 25 NA NA
#> 3: 07:04:49 13937810_25.extern 13937810 25 NA NA
#> 4: 00:00:00 14022192_[1-3] 14022192 NA 1 3
#> 5: 00:00:00 14022204_[4] 14022204 NA 4 NA
#> JobID.type Elapsed.hours Elapsed.minutes Elapsed.seconds
#> 1: 7 4 42
#> 2: batch 7 4 42
#> 3: extern 7 4 49
#> 4: 0 0 0
#> 5: 0 0 0
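The subjectColumnName.groupName naming convention can be illustrated with a short base R sketch. This is not how df_match_variable is implemented; it is a hand-written loop over one column, with a made-up two-row data frame (toy.df) and a pattern string written out in full, shown only to make the convention concrete.

```r
## Made-up data frame with one character column to parse.
toy.df <- data.frame(
  Elapsed=c("07:04:42", "00:00:00"),
  stringsAsFactors=FALSE)
elapsed.pattern <-
  "(?P<hours>[0-9]+):(?P<minutes>[0-9]+):(?P<seconds>[0-9]+)"
m <- regexpr(elapsed.pattern, toy.df$Elapsed, perl=TRUE)
## One row per subject, one column per named group.
first <- attr(m, "capture.start")
last <- first + attr(m, "capture.length") - 1L
for (group in attr(m, "capture.names")) {
  captured <- substring(toy.df$Elapsed, first[, group], last[, group])
  ## New column name: subject column name, a dot, then the group name.
  toy.df[[paste0("Elapsed.", group)]] <- as.integer(captured)
}
print(toy.df)
```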