excelstrippr excelstrippr sticker

Travis build status CRAN status

The goal of excelstrippr is to provide an easy way to remove the extraneous metadata, headers, summaries, etc., and extract the useful tabular data from Excel-based reports.

The Problem

Excel reports often contain titles, data summaries, line counts, empty columns, etc., which make importing them into R a troublesome process. As there is no universal Excel report format, it’s difficult to write a reusable script to munge and import the data from these reports.

An example Excel report might look like this:
Example Excel Report

Unhiding hidden elements and adjusting empty columns to make them more clear shows that this report isn’t in any kind of standard tabular format:
Example Excel Report - Unhidden

When this data is imported into R, the results are not useful without serious wrangling:

dat <- readxl::read_excel("./man/example/example-report.xlsx")
#> New names:
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * `` -> ...6
#> * ...
head(dat, 10)
#> # A tibble: 10 x 28
#>    `Transaction De~ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11
#>    <lgl>            <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr>
#>  1 NA               <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA>  <NA>  <NA> 
#>  2 NA               <NA>  4120~ <NA>  <NA>  <NA>  <NA>  NA    <NA>  <NA>  <NA> 
#>  3 NA               <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA>  <NA>  <NA> 
#>  4 NA               Regi~ <NA>  LOB   CC    Loc ~ Fac ~ NA    JON/~ Doc # Type 
#>  5 NA               Alas~ <NA>  ATO   40022 APX   EQUIP NA    <NA>  2019~ Add ~
#>  6 NA               Alas~ <NA>  ATO   40046 APX   EQUIP NA    <NA>  2019~ Add ~
#>  7 NA               Alas~ <NA>  ATO   40085 APX   EQUIP NA    <NA>  2019~ Add ~
#>  8 NA               Alas~ <NA>  ATO   40050 APX   ARTCC NA    <NA>  2019~ Add ~
#>  9 NA               Alas~ <NA>  ATO   40009 APX   ARTCC NA    <NA>  2019~ Add ~
#> 10 NA               Alas~ <NA>  ATO   40046 APX   ARTCC NA    <NA>  2019~ Add ~
#> # ... with 17 more variables: ...12 <chr>, ...13 <chr>, ...14 <chr>,
#> #   ...15 <chr>, ...16 <chr>, ...17 <lgl>, ...18 <chr>, ...19 <chr>,
#> #   ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, ...24 <chr>,
#> #   ...25 <chr>, ...26 <chr>, ...27 <chr>, ...28 <chr>

The Solution

excelstrippr will look for the start of a tabular data set in an Excel file, ignoring titles, summaries, and other meta-data in the report.

library(excelstrippr)
dat_stripped <- strip_metadata("./man/example/example-report.xlsx")
head(dat_stripped, 10)
#> # A tibble: 10 x 24
#>    Region LOB   CC    `Loc ID` `Fac Type` `JON/DPN` `Doc #` Type  Status
#>    <chr>  <chr> <chr> <chr>    <chr>      <chr>     <chr>   <chr> <chr> 
#>  1 Alask~ ATO   40022 APX      EQUIP      <NA>      2019/5~ Add ~ Appro~
#>  2 Alask~ ATO   40046 APX      EQUIP      <NA>      2019/5~ Add ~ Appro~
#>  3 Alask~ ATO   40085 APX      EQUIP      <NA>      2019/5~ Add ~ Appro~
#>  4 Alask~ ATO   40050 APX      ARTCC      <NA>      2019/5~ Add ~ Appro~
#>  5 Alask~ ATO   40009 APX      ARTCC      <NA>      2019/5~ Add ~ Appro~
#>  6 Alask~ ATO   40046 APX      ARTCC      <NA>      2019/5~ Add ~ Appro~
#>  7 Alask~ ATO   40072 APX      ARTCC      <NA>      2019/5~ Add ~ Appro~
#>  8 Alask~ ATO   40065 APX      ARTCC      <NA>      2019/5~ Add ~ Appro~
#>  9 Alask~ ATO   40047 IDK      ARTCC      <NA>      2019/5~ Add ~ Appro~
#> 10 Alask~ ATO   40095 IDK      ARTCC      <NA>      2019/5~ Add ~ Appro~
#> # ... with 15 more variables: `Asset Type` <chr>, `Initiated By` <chr>,
#> #   `Initiate Date` <chr>, Custodian <chr>, Barcode <chr>, `NSN/LSN` <chr>,
#> #   Description <chr>, `Serial #` <chr>, `Delphi Asset #` <chr>, Cost <chr>,
#> #   Qty <chr>, `Rejected By` <chr>, `Rej Reason` <chr>, `Rej Date` <chr>, `Test
#> #   Equip` <chr>

The extracted data can then be saved back into another file for future analysis or can be immediately used in analysis pipeline as any other data frame object.

Installation

You can install the released version of excelstrippr from CRAN with:

install.packages("excelstrippr")

The most recent version of excelstrippr can be installed from GitHub with {devtools}:

install.packages("devtools")
devtools::install_github("burch-cm/excelstrippr")