The goal of srt is to read SubRip text files as tabular data for easy analysis and manipulation.
You can install the development version of srt from GitHub with:
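The .srt standard is used to identify the subtitle components for the columns of a data frame: the sequential subtitle number, the time the subtitle should appear, followed by --> and the time it should disappear, and the subtitle text.
#> 1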
#> 00:01:25,210 --> 00:01:28,004
#> I owe everything to George Bailey.
#>
#> 2
#> 00:01:28,422 --> 00:01:30,298
#> Help him, dear Father.
#>
#> 3
#> 00:01:30,674 --> 00:01:33,718
#> Joseph, Jesus and Mary,
These subtitle files are parsed as data frames with separate columns.
(wonderful_life <- read_srt(path = srt, collapse = " "))
#> # A tibble: 2,268 x 4
#> n start end subtitle
#> <int> <dbl> <dbl> <chr>
#> 1 1 85.2 88.0 I owe everything to George Bailey.
#> 2 2 88.4 90.3 Help him, dear Father.
#> 3 3 90.7 93.7 Joseph, Jesus and Mary,
#> 4 4 93.8 96.4 help my friend Mr. Bailey.
#> 5 5 96.9 99.5 Help my son George tonight.
#> 6 6 100. 102. He never thinks about himself, God.
#> 7 7 102. 104. That's why he's in trouble.
#> 8 8 104. 105. George is a good guy.
#> 9 9 106. 108. Give him a break, God.
#> 10 10 108. 110. I love him, dear Lord.
#> # … with 2,258 more rows
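The start and end columns hold the time stamps as plain seconds. As a rough illustration of that conversion, here is a minimal base R sketch; `stamp_to_seconds()` is a hypothetical helper written for this example and is not claimed to be how srt parses time stamps internally.

```r
# Hypothetical helper (not part of srt): convert "HH:MM:SS,mmm" to seconds.
stamp_to_seconds <- function(x) {
  # SubRip uses a comma as the decimal mark; swap it for a period,
  # then split each stamp into hours, minutes, and seconds.
  parts <- strsplit(sub(",", ".", x, fixed = TRUE), ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

secs <- stamp_to_seconds(c("00:01:25,210", "00:01:28,004"))
secs
```

For the first subtitle this gives roughly 85.21 and 88.004 seconds, which the tibble above prints rounded as 85.2 and 88.0.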
Having the subtitles in a data frame makes it easy to perform various text analyses on them.
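library(dplyr)     # %>%, count(), anti_join()
library(tidytext)  # unnest_tokens(), stop_words
wonderful_life %>%
  unnest_tokens(word, subtitle) %>%   # one row per word
  count(word, sort = TRUE) %>%        # word frequencies, most common first
  anti_join(stop_words, by = "word")  # drop common stop words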
#> # A tibble: 1,651 x 2
#> word n
#> <chr> <int>
#> 1 george 216
#> 2 mary 85
#> 3 bailey 74
#> 4 hey 56
#> 5 harry 53
#> 6 yeah 50
#> 7 gonna 45
#> 8 potter 45
#> 9 home 34
#> 10 money 34
#> # … with 1,641 more rows
Or uniformly manipulate the numeric time stamps:
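Because the time stamps are ordinary numeric columns, a uniform shift is just arithmetic. The sketch below uses plain dplyr on a small toy tibble with the same column names; it is one possible approach rather than the package's own method, and the 9.99 second offset is an assumption inferred from the difference between the original and rewritten time stamps shown in this document.

```r
library(dplyr)

# Toy data with the same columns as the parsed subtitles
subs <- tibble(
  n = 1:3,
  start = c(85.210, 88.422, 90.674),
  end = c(88.004, 90.298, 93.718),
  subtitle = c(
    "I owe everything to George Bailey.",
    "Help him, dear Father.",
    "Joseph, Jesus and Mary,"
  )
)

shift <- 9.99  # seconds; assumed offset for illustration

# Add the same offset to both time columns, leaving the text untouched
shifted <- subs %>%
  mutate(across(c(start, end), ~ .x + shift))

shifted
```

Shifting both columns by the same amount keeps every subtitle's duration unchanged.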
The subtitle data frames can easily be rewritten as valid SubRip files.
#> 1
#> 00:01:35,200 --> 00:01:37,994
#> I owe everything to George Bailey.
#>
#> 2
#> 00:01:38,412 --> 00:01:40,288
#> Help him, dear Father.
#>
#> 3
#> 00:01:40,664 --> 00:01:43,708
#> Joseph, Jesus and Mary,
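To show what producing such a file involves, here is a minimal base R sketch that formats seconds back into SubRip time stamps and writes the blocks to a temporary file. `seconds_to_stamp()` and `to_srt_blocks()` are hypothetical helpers for this example, not the package's own writer, and the sketch assumes each subtitle is a single line of text.

```r
# Hypothetical helper (not part of srt): seconds to "HH:MM:SS,mmm".
seconds_to_stamp <- function(x) {
  ms <- round(x * 1000)  # work in whole milliseconds to avoid rounding drift
  sprintf(
    "%02d:%02d:%02d,%03d",
    as.integer(ms %/% 3600000),
    as.integer((ms %/% 60000) %% 60),
    as.integer((ms %/% 1000) %% 60),
    as.integer(ms %% 1000)
  )
}

# Hypothetical helper: one string per subtitle block, ending in a blank line.
to_srt_blocks <- function(df) {
  paste(
    df$n,
    paste(seconds_to_stamp(df$start), "-->", seconds_to_stamp(df$end)),
    df$subtitle,
    "",
    sep = "\n"
  )
}

subs <- data.frame(
  n = 1:2,
  start = c(95.200, 98.412),
  end = c(97.994, 100.288),
  subtitle = c("I owe everything to George Bailey.", "Help him, dear Father.")
)

tmp <- tempfile(fileext = ".srt")
writeLines(to_srt_blocks(subs), tmp)
out <- readLines(tmp)
out
```

Converting to integer milliseconds before splitting into fields keeps values such as 97.994 from being truncated to 993 milliseconds by floating point error.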