NCAA Scraping

Bill Petti

2016-11-22

The latest release of the baseballr package includes a function for acquiring player statistics from the NCAA’s website for baseball teams across the three major divisions (I, II, III).

The function, ncaa_scrape, requires the user to pass values for three parameters:

school_id: numerical code used by the NCAA for each school
year: a four-digit year
type: whether to pull data for batters or pitchers
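A small guard around these three arguments can catch mistakes before any request goes out. The sketch below is only illustrative: `check_ncaa_args()` is a hypothetical helper, not part of baseballr, and its year check only tests for four digits, not for the seasons the NCAA site actually covers.

```r
# Hypothetical helper (not part of baseballr): checks the three inputs
# described above and returns them as a list, or stops with an error.
check_ncaa_args <- function(school_id, year, type) {
  if (!is.numeric(school_id) || length(school_id) != 1) {
    stop("school_id must be a single number, e.g. 736")
  }
  if (!is.numeric(year) || length(year) != 1 ||
      !grepl("^[0-9]{4}$", as.character(year))) {
    stop("year must be a four-digit year, e.g. 2021")
  }
  # match.arg() errors on anything other than (a prefix of) the two choices
  type <- match.arg(type, c("batting", "pitching"))
  list(school_id = school_id, year = year, type = type)
}

args <- check_ncaa_args(736, 2021, "batting")
```

The checked values can then be handed on positionally, as in `ncaa_scrape(args$school_id, args$year, args$type)`.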

If you want to pull batting statistics for Vanderbilt for the 2021 season, you would use the following:

library(baseballr)
library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ncaa_scrape(736, 2021, "batting") %>%
  select(year:OBPct)
#> ── NCAA Baseball Team Stats data from stats.ncaa.org ──────── baseballr 1.3.0 ──
#> ℹ Data updated: 2022-09-09 03:29:54 EDT
#> # A tibble: 41 × 12
#>     year school     confer…¹ divis…² Jersey Player Yr    Pos      GP    GS    BA
#>    <int> <chr>      <chr>      <dbl> <chr>  <chr>  <chr> <chr> <dbl> <dbl> <dbl>
#>  1  2021 Vanderbilt SEC            1 51     Bradf… Fr    OF       67    67 0.336
#>  2  2021 Vanderbilt SEC            1 25     Nolan… So    INF      66    66 0.26 
#>  3  2021 Vanderbilt SEC            1 99     Gonza… So    INF      61    58 0.28 
#>  4  2021 Vanderbilt SEC            1 9      Young… So    INF      61    61 0.252
#>  5  2021 Vanderbilt SEC            1 12     Keega… Jr    UT       60    60 0.345
#>  6  2021 Vanderbilt SEC            1 8      Thoma… Jr    OF       59    57 0.305
#>  7  2021 Vanderbilt SEC            1 5      Rodri… So    C        58    52 0.249
#>  8  2021 Vanderbilt SEC            1 16     Bulge… Fr    UT       50    41 0.274
#>  9  2021 Vanderbilt SEC            1 6      Kolwy… Jr    INF      43    39 0.29 
#> 10  2021 Vanderbilt SEC            1 19     LaNev… So    OF       37    19 0.286
#> # … with 31 more rows, 1 more variable: OBPct <dbl>, and abbreviated variable
#> #   names ¹​conference, ²​division
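Because the result is an ordinary tibble, the usual dplyr verbs apply to it. As a sketch, the snippet below ranks hitters by batting average among players with a minimum number of games played. `top_hitters()` and its cutoffs are illustrative choices, and the small tibble is a hand-typed stand-in for the scraped output (Jersey, GP and BA values copied from rows shown above) so that the example runs offline.

```r
library(dplyr)

# Illustrative helper: keep players with at least `min_gp` games played,
# sort by batting average (highest first), and return the first `n` rows.
top_hitters <- function(batting, min_gp = 30, n = 3) {
  batting %>%
    filter(GP >= min_gp) %>%
    arrange(desc(BA)) %>%
    head(n)
}

# Stand-in for ncaa_scrape(736, 2021, "batting"); values from the output above.
batting <- tibble(
  Jersey = c("51", "25", "12", "19"),
  GP     = c(67, 66, 60, 37),
  BA     = c(0.336, 0.26, 0.345, 0.286)
)

top_hitters(batting, min_gp = 50, n = 2)
```

With the real data, the same call would be `ncaa_scrape(736, 2021, "batting") %>% top_hitters()`.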

The same can be done for pitching, just by changing the type parameter:

ncaa_scrape(736, 2021, "pitching") %>%
  select(year:ERA)
#> ── NCAA Baseball Team Stats data from stats.ncaa.org ──────── baseballr 1.3.0 ──
#> ℹ Data updated: 2022-09-09 03:29:55 EDT
#> # A tibble: 41 × 12
#>     year school     confer…¹ divis…² Jersey Player Yr    Pos      GP   App    GS
#>    <int> <chr>      <chr>      <dbl> <chr>  <chr>  <chr> <chr> <dbl> <dbl> <dbl>
#>  1  2021 Vanderbilt SEC            1 51     Bradf… Fr    OF       67    67    NA
#>  2  2021 Vanderbilt SEC            1 25     Nolan… So    INF      66    66    NA
#>  3  2021 Vanderbilt SEC            1 99     Gonza… So    INF      61    61    NA
#>  4  2021 Vanderbilt SEC            1 9      Young… So    INF      61    61    NA
#>  5  2021 Vanderbilt SEC            1 12     Keega… Jr    UT       60    60    NA
#>  6  2021 Vanderbilt SEC            1 8      Thoma… Jr    OF       59    59    NA
#>  7  2021 Vanderbilt SEC            1 5      Rodri… So    C        58    58    NA
#>  8  2021 Vanderbilt SEC            1 16     Bulge… Fr    UT       50    50    NA
#>  9  2021 Vanderbilt SEC            1 6      Kolwy… Jr    INF      43    43    NA
#> 10  2021 Vanderbilt SEC            1 19     LaNev… So    OF       37    37    NA
#> # … with 31 more rows, 1 more variable: ERA <dbl>, and abbreviated variable
#> #   names ¹​conference, ²​division
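Since each call covers one school, one season and one type, pulling several seasons means calling the function repeatedly and stacking the results. The following is a minimal sketch under a few assumptions: `scrape_seasons()`, its `pause` argument and `fake_scraper()` are hypothetical names introduced here, and it assumes the columns returned for different seasons are compatible enough for `dplyr::bind_rows()` to combine them.

```r
library(dplyr)

# Illustrative helper: call `scraper` once per season and stack the results.
# `scraper` defaults to baseballr::ncaa_scrape; any function that takes
# (school_id, year, type) can be swapped in, e.g. for offline testing.
# `pause` is the number of seconds to wait before each request.
scrape_seasons <- function(school_id, years, type,
                           scraper = baseballr::ncaa_scrape, pause = 1) {
  results <- lapply(years, function(yr) {
    Sys.sleep(pause)
    scraper(school_id, yr, type)
  })
  bind_rows(results)
}

# Offline stand-in that returns one row per call instead of hitting the site.
fake_scraper <- function(school_id, year, type) {
  tibble(year = year, school_id = school_id, type = type)
}

combined <- scrape_seasons(736, 2019:2021, "pitching",
                           scraper = fake_scraper, pause = 0)
```

Dropping the `scraper` and `pause` arguments, as in `scrape_seasons(736, 2019:2021, "pitching")`, would send one real request per season.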

The function depends on the user knowing the school_id that the NCAA website uses. For that reason, I’ve included an ncaa_school_id_lu function so that users can find the school_id they need.

Just pass a string to the function and it will return possible matches based on the school’s name:

ncaa_school_id_lu("Vand")
#> # A tibble: 10 × 6
#>    school     conference school_id  year division conference_id
#>    <chr>      <chr>          <dbl> <dbl>    <dbl>         <dbl>
#>  1 Vanderbilt SEC              736  2013        1           911
#>  2 Vanderbilt SEC              736  2014        1           911
#>  3 Vanderbilt SEC              736  2015        1           911
#>  4 Vanderbilt SEC              736  2016        1           911
#>  5 Vanderbilt SEC              736  2017        1           911
#>  6 Vanderbilt SEC              736  2018        1           911
#>  7 Vanderbilt SEC              736  2019        1           911
#>  8 Vanderbilt SEC              736  2020        1           911
#>  9 Vanderbilt SEC              736  2021        1           911
#> 10 Vanderbilt SEC              736  2022        1           911
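The lookup returns one row per school and season, so the school_id can be pulled out programmatically and passed straight on to ncaa_scrape. The sketch below shows one way to do that: `pick_school_id()` is a hypothetical helper, and the small tibble stands in for the lookup result (values taken from the output above) so that the example runs without a network connection.

```r
library(dplyr)

# Illustrative helper: return the single school_id that matches an exact
# school name and season, and stop if there are zero or several matches.
pick_school_id <- function(lookup, school_name, season) {
  ids <- lookup %>%
    filter(school == school_name, year == season) %>%
    pull(school_id) %>%
    unique()
  if (length(ids) != 1) {
    stop("expected exactly one school_id, found ", length(ids))
  }
  ids
}

# Stand-in for ncaa_school_id_lu("Vand"), using rows from the output above.
lookup <- tibble(
  school    = "Vanderbilt",
  school_id = 736,
  year      = c(2020, 2021, 2022)
)

id <- pick_school_id(lookup, "Vanderbilt", 2021)
```

With baseballr loaded, the stand-in can be replaced by `ncaa_school_id_lu("Vand")`, and the resulting `id` used as `ncaa_scrape(id, 2021, "batting")`.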