`convert` your data types

Best practice in data analysis is to fix data types directly after importing data into R. This helps in many ways. In particular, if every column is converted to its appropriate data type, you won’t be surprised by data type errors the next time you run the script.
convert(.x, ...)

where `.x` is a data frame and `...` is a placeholder for data type specific conversion functions.

`convert` must be used in conjunction with data type conversion functions:
- `chr` converts to character.
- `num` converts to numeric.
- `int` converts to integer.
- `lgl` converts to logical.
- `fct` converts to factor.
- `dte` converts to date.
- `dtm` converts to date time.

Imagine you have a data frame where you want to change columns:

- `a` and `b` to numerical
- `c` to date
- `d` and `e` to character

Then you can write:
df %>% convert(num(a, b), dte(c), chr(d, e))
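To try the call above, you need a data frame `df` with columns `a` to `e`. The sketch below builds a small made-up one. It assumes that `convert` and the conversion functions come from the hablar package (this section does not name the package) and that dplyr is installed. The column values are purely illustrative.

```r
library(hablar)  # assumed source of convert(), num(), dte() and chr()
library(dplyr)   # provides tibble() and %>%

# Hypothetical input: every column starts with the "wrong" type
df <- tibble(
  a = c("1", "2"),                    # numbers stored as text
  b = c("3.5", "4.5"),                # numbers stored as text
  c = c("2019-01-01", "2019-02-01"),  # dates stored as text
  d = c(10, 20),                      # numeric that should be text
  e = factor(c("x", "y"))             # factor that should be text
)

converted <- df %>% convert(num(a, b), dte(c), chr(d, e))

# Inspect the resulting class of each column
sapply(converted, class)
```

Each conversion function takes any number of column names, so one `convert` call covers all five columns.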
The easiest way to see how simple `convert` is to use is through examples. Have a look at the gapminder dataset from the gapminder package:
#> # A tibble: 1,704 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> # … with 1,700 more rows
We might want to change the country column to character instead of factor. To do this we use `chr` together with the column name inside `convert`:
#> # A tibble: 1,704 x 6
#> country continent year lifeExp pop gdpPercap
#> <chr> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> # … with 1,700 more rows
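The output above is consistent with a call like the following. This is a sketch that assumes `convert` and `chr` come from the hablar package (not named in this section) and that the gapminder package is installed. The result name `gapminder_chr` is only for illustration.

```r
library(hablar)     # assumed source of convert() and chr()
library(dplyr)      # provides %>%
library(gapminder)  # provides the gapminder data frame

# Convert only the country column; all other columns keep their types
gapminder_chr <- gapminder %>%
  convert(chr(country))

class(gapminder_chr$country)
```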
This converts the country column to the data type character. We do not have to repeat the whole procedure for every additional column we want to convert. Let’s say that we also want to convert continent to character, lifeExp to integer, pop to double and gdpPercap to numeric. This is done in a single call:
#> # A tibble: 1,704 x 6
#> country continent year lifeExp pop gdpPercap
#> <chr> <chr> <int> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28 8425333 779.
#> 2 Afghanistan Asia 1957 30 9240934 821.
#> 3 Afghanistan Asia 1962 31 10267083 853.
#> 4 Afghanistan Asia 1967 34 11537966 836.
#> # … with 1,700 more rows
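A call along these lines would produce the column types shown above. It is a sketch that assumes the hablar package provides `convert`, `chr`, `int` and `num`, and that the gapminder package is installed. Both pop and gdpPercap go through `num` here, because double and numeric are the same storage type in R and both print as `<dbl>`.

```r
library(hablar)     # assumed source of convert() and the conversion functions
library(dplyr)      # provides %>%
library(gapminder)  # provides the gapminder data frame

# One convert() call handles all five columns
gapminder_converted <- gapminder %>%
  convert(chr(country, continent),
          int(lifeExp),
          num(pop, gdpPercap))

sapply(gapminder_converted, class)
```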
Why `convert`?

You can change a lot of data types with little code. Consider using `mutate` from dplyr to do the same operation:
gapminder %>%
  mutate(country = as.character(country),
         continent = as.character(continent),
         lifeExp = as.integer(lifeExp),
         pop = as.double(pop),
         gdpPercap = as.numeric(gdpPercap))
#> # A tibble: 1,704 x 6
#> country continent year lifeExp pop gdpPercap
#> <chr> <chr> <int> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28 8425333 779.
#> 2 Afghanistan Asia 1957 30 9240934 821.
#> 3 Afghanistan Asia 1962 31 10267083 853.
#> 4 Afghanistan Asia 1967 34 11537966 836.
#> # … with 1,700 more rows
This gives the same result. However, you need to write each column name twice and repeat the data type conversion function for every column. Imagine the code needed to convert 20 columns.

dplyr has another way of applying the same function to multiple columns that can help here: `mutate_at`. The same example would then look like:
gapminder %>%
  mutate_at(vars(country, continent), funs(as.character)) %>%
  mutate_at(vars(lifeExp), funs(as.integer)) %>%
  mutate_at(vars(pop), funs(as.double)) %>%
  mutate_at(vars(gdpPercap), funs(as.numeric))
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> This warning is displayed once per session.
#> # A tibble: 1,704 x 6
#> country continent year lifeExp pop gdpPercap
#> <chr> <chr> <int> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28 8425333 779.
#> 2 Afghanistan Asia 1957 30 9240934 821.
#> 3 Afghanistan Asia 1962 31 10267083 853.
#> 4 Afghanistan Asia 1967 34 11537966 836.
#> # … with 1,700 more rows
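The warning in the output above comes from `funs()`, which is deprecated as of dplyr 0.8.0. In dplyr 1.0.0 and later, `mutate_at` is itself superseded by `across()`. A sketch of the same conversion in that style, assuming dplyr 1.0.0 or later and the gapminder package are installed:

```r
library(dplyr)      # across() requires dplyr >= 1.0.0
library(gapminder)  # provides the gapminder data frame

# Group columns by target type and apply one function per group
gapminder_across <- gapminder %>%
  mutate(
    across(c(country, continent), as.character),
    across(lifeExp, as.integer),
    across(c(pop, gdpPercap), as.numeric)
  )

sapply(gapminder_across, class)
```

This avoids the deprecation warning, but it still takes one `across()` call per target type.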
`mutate_at` scales more easily to converting the data types of a large number of variables. However, `convert` does the same job with much less code. In fact, `convert` uses `mutate_at` internally. The difference is syntax and code readability. Compare again with `convert`:
#> # A tibble: 1,704 x 6
#> country continent year lifeExp pop gdpPercap
#> <chr> <chr> <int> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28 8425333 779.
#> 2 Afghanistan Asia 1957 30 9240934 821.
#> 3 Afghanistan Asia 1962 31 10267083 853.
#> 4 Afghanistan Asia 1967 34 11537966 836.
#> # … with 1,700 more rows
The conversion functions used inside `convert` also support additional arguments, passed through `.args`. For example, if you want to convert a number to a date and include an `origin` argument, you can write:
tibble(dates = c(12818, 13891),
       sunny = c("yes", "no")) %>%
  convert(dte(dates, .args = list(origin = "1900-01-01")))
#> # A tibble: 2 x 2
#> dates sunny
#> <date> <chr>
#> 1 1935-02-05 yes
#> 2 1938-01-13 no
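To see what the `origin` argument does, the same dates can be computed in base R. This sketch assumes that `dte` hands the values in `.args` on to a date conversion that behaves like `as.Date`, which the output above is consistent with:

```r
# Numbers are interpreted as days counted from the origin date
dates <- as.Date(c(12818, 13891), origin = "1900-01-01")

format(dates)
# "1935-02-05" "1938-01-13"
```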
`convert` is built upon dplyr and shares some of dplyr's amazing features. For example, tidyselect works with `convert`, which helps you select multiple columns at the same time. As a simple example, if you want to change all columns whose names include the letter “e” to factors, you can write:
#> # A tibble: 1,704 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <fct> <fct> <int> <fct>
#> 1 Afghanistan Asia 1952 28.801 8425333 779.4453145
#> 2 Afghanistan Asia 1957 30.332 9240934 820.8530296
#> 3 Afghanistan Asia 1962 31.997 10267083 853.10071
#> 4 Afghanistan Asia 1967 34.02 11537966 836.1971382
#> # … with 1,700 more rows
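The output above is consistent with a call using the tidyselect helper `contains()`. This is a sketch that assumes the hablar package provides `convert` and `fct`, and that dplyr (which re-exports `contains()`) and gapminder are installed:

```r
library(hablar)     # assumed source of convert() and fct()
library(dplyr)      # provides %>% and the tidyselect helper contains()
library(gapminder)  # provides the gapminder data frame

# Select every column whose name contains "e" and convert it to a factor
gapminder_fct <- gapminder %>%
  convert(fct(contains("e")))

sapply(gapminder_fct, class)
```

Note that the name country contains no “e”. That column shows up as a factor only because it already was one, while pop is left as an integer.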