After you have acquired the data, you will usually need to clean and transform it before analysis. The dlookr package makes these steps fast and easy.
This document introduces the data transformation methods provided by the dlookr package. You will learn how to transform tbl_df objects (which inherit from data.frame) as well as plain data.frame objects with the functions provided by dlookr.
dlookr works in synergy with dplyr. Particularly for data transformation and data wrangling, it increases the efficiency of the tidyverse package group.
To illustrate basic data transformation with the dlookr package, I use the Carseats dataset. Carseats in the ISLR package is a simulated dataset of child car seat sales at 400 stores. It is a data.frame created for the purpose of predicting sales volume.
library(ISLR)
str(Carseats)
'data.frame': 400 obs. of 11 variables:
$ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
$ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
$ Income : num 73 48 35 100 64 113 105 81 110 113 ...
$ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
$ Population : num 276 260 269 466 340 501 45 425 108 131 ...
$ Price : num 120 83 80 97 128 72 108 120 124 124 ...
$ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
$ Age : num 42 65 59 55 38 78 71 67 76 76 ...
$ Education : num 17 10 12 14 13 16 15 10 10 17 ...
$ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
$ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
The contents of the individual variables are as follows (refer to the ISLR::Carseats man page).
When performing data analysis, you will often encounter data that contains missing values. However, Carseats is complete data with no missing values. Therefore, missing values are generated artificially as follows, and the result is stored in a data.frame object named carseats.
carseats <- ISLR::Carseats

suppressWarnings(RNGversion("3.5.0"))
set.seed(123)
carseats[sample(seq(NROW(carseats)), 20), "Income"] <- NA

suppressWarnings(RNGversion("3.5.0"))
set.seed(456)
carseats[sample(seq(NROW(carseats)), 10), "Urban"] <- NA
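To confirm that the missing values were injected as intended, a quick base-R check (not part of the original workflow) counts them:

# count the injected missing values
sum(is.na(carseats$Income))   # 20
sum(is.na(carseats$Urban))    # 10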
dlookr imputes missing values and outliers and resolves skewed data. It also provides the ability to bin continuous variables into categorical variables.
Here is a list of the data transformation functions provided by dlookr:

- find_na() finds variables that contain missing values, and imputate_na() imputes the missing values (see the short sketch after this list).
- find_outliers() finds variables that contain outliers, and imputate_outlier() imputes the outliers.
- summary.imputation() and plot.imputation() provide information about and visualization of imputed variables.
- find_skewness() finds variables with skewed data, and transform() resolves the skewness.
- transform() also performs standardization of numeric variables.
- summary.transform() and plot.transform() provide information about and visualization of transformed variables.
- binning() and binning_by() convert numerical data into categorical data.
- print.bins() and summary.bins() show and summarize the binning results.
- plot.bins() and plot.optimal_bins() visualize the binning results.
- transformation_report() performs the data transformation and reports the result.
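find_na() and find_outliers() are not demonstrated elsewhere in this document, so here is a minimal sketch of how they locate the affected variables of carseats. The index argument is assumed to behave as in find_skewness() shown later (index = FALSE returns variable names instead of positions):

library(dlookr)

# variables of carseats that contain missing values (positions, then names)
find_na(carseats)
find_na(carseats, index = FALSE)

# variables of carseats that contain outliers (names)
find_outliers(carseats, index = FALSE)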
imputate_na()
imputate_na() imputes the missing values contained in a variable. Both numeric and categorical variables can be imputed, and the following methods are provided: "mean", "median", "mode", "knn", "rpart", and "mice" for numeric variables, and "mode", "rpart", and "mice" for categorical variables.
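Before the richer examples below, a minimal sketch of the simplest case: imputing Income with the "mean" method. This sketch assumes that, as with the other methods, the result can be passed to summary(), and that the US argument has no effect on a univariate method such as "mean":

# impute missing Income values with the "mean" method (simple, model-free)
income_mean <- imputate_na(carseats, Income, US, method = "mean")
summary(income_mean)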
In the following example, imputate_na() imputes the missing values of Income, a numeric variable in carseats, using the "rpart" method. summary() summarizes the imputation result, and plot() visualizes it.
if (requireNamespace("rpart", quietly = TRUE)) {
  income <- imputate_na(carseats, Income, US, method = "rpart")

  # result of imputation
  income

  # summary of imputation
  summary(income)

  # viz of imputation
  plot(income)
} else {
  cat("If you want to use this feature, you need to install the rpart package.\n")
}

* Impute missing values based on Recursive Partitioning and Regression Trees
- method : rpart
* Information of Imputation (before vs after)
                    Original     Imputation
described_variables "value"      "value"
n                   "380"        "400"
na                  "20"         " 0"
mean                "68.86053"   "69.05073"
sd                  "28.09161"   "27.57382"
se_mean             "1.441069"   "1.378691"
IQR                 "48.25"      "46.00"
skewness            "0.04490600" "0.02935732"
kurtosis            "-1.089201"  "-1.035086"
p00                 "21"         "21"
p01                 "21.79"      "21.99"
p05                 "26"         "26"
p10                 "30.0"       "30.9"
p20                 "39"         "40"
p25                 "42.75"      "44.00"
p30                 "48.00000"   "51.58333"
p40                 "62"         "63"
p50                 "69"         "69"
p60                 "78.0"       "77.4"
p70                 "86.3"       "84.3"
p75                 "91"         "90"
p80                 "96.2"       "96.0"
p90                 "108.1"      "106.1"
p95                 "115.05"     "115.00"
p99                 "119.21"     "119.01"
p100                "120"        "120"
The following example imputes the categorical variable Urban using the "mice" method.
library(mice)

Attaching package: 'mice'

The following object is masked from 'package:stats':

    filter

The following objects are masked from 'package:base':

    cbind, rbind
urban <- imputate_na(carseats, Urban, US, method = "mice")

 iter imp variable
 1 1 Income Urban
1 2 Income Urban
1 3 Income Urban
1 4 Income Urban
1 5 Income Urban
2 1 Income Urban
2 2 Income Urban
2 3 Income Urban
2 4 Income Urban
2 5 Income Urban
3 1 Income Urban
3 2 Income Urban
3 3 Income Urban
3 4 Income Urban
3 5 Income Urban
4 1 Income Urban
4 2 Income Urban
4 3 Income Urban
4 4 Income Urban
4 5 Income Urban
5 1 Income Urban
5 2 Income Urban
5 3 Income Urban
5 4 Income Urban
5 5 Income Urban
# result of imputation
urban
[1] Yes Yes Yes Yes Yes No Yes Yes No No No Yes Yes Yes Yes No Yes Yes
[19] No Yes Yes No Yes Yes Yes No No Yes Yes Yes Yes Yes Yes Yes Yes No
[37] No Yes Yes No No Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes
[55] No Yes Yes Yes Yes Yes Yes No Yes Yes No No Yes Yes Yes Yes Yes No
[73] Yes No No No Yes No Yes Yes Yes Yes Yes No No No Yes No Yes No
[91] No Yes Yes No Yes Yes No Yes No No No Yes No Yes Yes Yes No Yes
[109] Yes No Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes No Yes No
[127] Yes Yes Yes No Yes Yes Yes Yes Yes No No Yes Yes No Yes Yes Yes Yes
[145] No Yes Yes No No Yes No No No No No Yes Yes No No No No No
[163] Yes No No Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes No Yes No Yes
[181] Yes Yes Yes Yes No Yes No Yes Yes No No Yes No Yes Yes Yes Yes Yes
[199] Yes Yes No Yes No Yes Yes Yes Yes No Yes No No Yes Yes Yes Yes Yes
[217] Yes No Yes Yes Yes Yes Yes Yes No Yes Yes Yes No No No No Yes No
[235] No Yes Yes Yes Yes Yes Yes Yes No Yes Yes No Yes Yes Yes Yes Yes Yes
[253] Yes No Yes Yes Yes Yes No No Yes Yes Yes Yes Yes Yes No No Yes Yes
[271] Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes No Yes No No Yes No Yes
[289] No Yes No No Yes Yes Yes No Yes Yes Yes No Yes Yes Yes Yes Yes Yes
[307] Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes No No No Yes Yes Yes Yes
[325] Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes No Yes Yes No
[343] No Yes No Yes No No Yes No No No Yes No Yes Yes Yes Yes Yes Yes
[361] No No Yes Yes Yes No No Yes No Yes Yes Yes No Yes Yes Yes Yes No
[379] Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes No Yes Yes
[397] No Yes Yes Yes
[attr(,"var_type")
1] categorical
[attr(,"method")
1] mice
[attr(,"na_pos")
1] 33 36 84 94 113 132 151 292 313 339
[attr(,"seed")
1] 24283
[attr(,"type")
1] missing values
[attr(,"message")
1] complete imputation
[attr(,"success")
1] TRUE
[: No Yes
Levels
# summary of imputation
summary(urban)
* Impute missing values based on Multivariate Imputation by Chained Equations
- method : mice
- random seed : 24283
* Information of Imputation (before vs after)
     original imputation original_percent imputation_percent
No        115        122            28.75               30.5
Yes       275        278            68.75               69.5
<NA>       10          0             2.50                0.0
# viz of imputation
plot(urban)
The following example imputes the missing values of the Income variable and then calculates the arithmetic mean for each level of US. Here dplyr is used, and the logic is easy to follow thanks to the pipe operator.
# The mean before and after the imputation of the Income variable
carseats %>%
  mutate(Income_imp = imputate_na(carseats, Income, US, method = "knn")) %>%
  group_by(US) %>%
  summarise(orig = mean(Income, na.rm = TRUE),
            imputation = mean(Income_imp))
# A tibble: 2 × 3
  US     orig imputation
  <fct> <dbl>      <dbl>
1 No     65.8       66.1
2 Yes    70.4       70.5
imputate_outlier()
imputate_outlier()
imputes the outliers contained in a variable. Only numeric variables are supported, and the following methods are provided: "mean", "median", "mode", and "capping".
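To make "capping" concrete: in the Price example below, the upper outliers are replaced by the 95th percentile (155.05) and the lower outliers by the 5th percentile (77). The base-R sketch below only illustrates this idea, using the usual boxplot rule to detect outliers; it is not dlookr's implementation, and cap_outliers is a hypothetical helper name:

# illustrative capping of a numeric vector with base R (boxplot rule + percentile caps)
cap_outliers <- function(x) {
  qnt  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  caps <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
  iqr  <- qnt[2] - qnt[1]
  x[which(x < qnt[1] - 1.5 * iqr)] <- caps[1]
  x[which(x > qnt[2] + 1.5 * iqr)] <- caps[2]
  x
}

summary(cap_outliers(carseats$Price))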
imputate_outlier() imputes the outliers of the numeric variable Price using the "capping" method, as follows. summary() summarizes the outlier imputation, and plot() visualizes it.
price <- imputate_outlier(carseats, Price, method = "capping")

# result of imputation
price
[1] 120.00 83.00 80.00 97.00 128.00 72.00 108.00 120.00 124.00 124.00
[11] 100.00 94.00 136.00 86.00 118.00 144.00 110.00 131.00 68.00 121.00
[21] 131.00 109.00 138.00 109.00 113.00 82.00 131.00 107.00 97.00 102.00
[31] 89.00 131.00 137.00 128.00 128.00 96.00 100.00 110.00 102.00 138.00
[41] 126.00 124.00 77.00 134.00 95.00 135.00 70.00 108.00 98.00 149.00
[51] 108.00 108.00 129.00 119.00 144.00 154.00 84.00 117.00 103.00 114.00
[61] 123.00 107.00 133.00 101.00 104.00 128.00 91.00 115.00 134.00 99.00
[71] 99.00 150.00 116.00 104.00 136.00 92.00 70.00 89.00 145.00 90.00
[81] 79.00 128.00 139.00 94.00 121.00 112.00 134.00 126.00 111.00 119.00
[91] 103.00 107.00 125.00 104.00 84.00 148.00 132.00 129.00 127.00 107.00
[101] 106.00 118.00 97.00 96.00 138.00 97.00 139.00 108.00 103.00 90.00
[111] 116.00 151.00 125.00 127.00 106.00 129.00 128.00 119.00 99.00 128.00
[121] 131.00 87.00 108.00 155.00 120.00 77.00 133.00 116.00 126.00 147.00
[131] 77.00 94.00 136.00 97.00 131.00 120.00 120.00 118.00 109.00 94.00
[141] 129.00 131.00 104.00 159.00 123.00 117.00 131.00 119.00 97.00 87.00
[151] 114.00 103.00 128.00 150.00 110.00 69.00 157.00 90.00 112.00 70.00
[161] 111.00 160.00 149.00 106.00 141.00 155.05 137.00 93.00 117.00 77.00
[171] 118.00 55.00 110.00 128.00 155.05 122.00 154.00 94.00 81.00 116.00
[181] 149.00 91.00 140.00 102.00 97.00 107.00 86.00 96.00 90.00 104.00
[191] 101.00 173.00 93.00 96.00 128.00 112.00 133.00 138.00 128.00 126.00
[201] 146.00 134.00 130.00 157.00 124.00 132.00 160.00 97.00 64.00 90.00
[211] 123.00 120.00 105.00 139.00 107.00 144.00 144.00 111.00 120.00 116.00
[221] 124.00 107.00 145.00 125.00 141.00 82.00 122.00 101.00 163.00 72.00
[231] 114.00 122.00 105.00 120.00 129.00 132.00 108.00 135.00 133.00 118.00
[241] 121.00 94.00 135.00 110.00 100.00 88.00 90.00 151.00 101.00 117.00
[251] 156.00 132.00 117.00 122.00 129.00 81.00 144.00 112.00 81.00 100.00
[261] 101.00 118.00 132.00 115.00 159.00 129.00 112.00 112.00 105.00 166.00
[271] 89.00 110.00 63.00 86.00 119.00 132.00 130.00 125.00 151.00 158.00
[281] 145.00 105.00 154.00 117.00 96.00 131.00 113.00 72.00 97.00 156.00
[291] 103.00 89.00 74.00 89.00 99.00 137.00 123.00 104.00 130.00 96.00
[301] 99.00 87.00 110.00 99.00 134.00 132.00 133.00 120.00 126.00 80.00
[311] 166.00 132.00 135.00 54.00 129.00 171.00 72.00 136.00 130.00 129.00
[321] 152.00 98.00 139.00 103.00 150.00 104.00 122.00 104.00 111.00 89.00
[331] 112.00 134.00 104.00 147.00 83.00 110.00 143.00 102.00 101.00 126.00
[341] 91.00 93.00 118.00 121.00 126.00 149.00 125.00 112.00 107.00 96.00
[351] 91.00 105.00 122.00 92.00 145.00 146.00 164.00 72.00 118.00 130.00
[361] 114.00 104.00 110.00 108.00 131.00 162.00 134.00 77.00 79.00 122.00
[371] 119.00 126.00 98.00 116.00 118.00 124.00 92.00 125.00 119.00 107.00
[381] 89.00 151.00 121.00 68.00 112.00 132.00 160.00 115.00 78.00 107.00
[391] 111.00 124.00 130.00 120.00 139.00 128.00 120.00 159.00 95.00 120.00
[attr(,"method")
1] "capping"
[attr(,"var_type")
1] "numerical"
[attr(,"outlier_pos")
1] 43 126 166 175 368
[attr(,"outliers")
1] 24 49 191 185 53
[attr(,"type")
1] "outliers"
[attr(,"message")
1] "complete imputation"
[attr(,"success")
1] TRUE
[attr(,"class")
1] "imputation" "numeric"
[
# summary of imputation
summary(price)
Impute outliers with capping
* Information of Imputation (before vs after)
                    Original     Imputation
described_variables "value"      "value"
n                   "400"        "400"
na                  "0"          "0"
mean                "115.7950"   "115.8928"
sd                  "23.67666"   "22.61092"
se_mean             "1.183833"   "1.130546"
IQR                 "31"         "31"
skewness            "-0.1252862" "-0.0461621"
kurtosis            " 0.4518850" "-0.3030578"
p00                 "24"         "54"
p01                 "54.99"      "67.96"
p05                 "77"         "77"
p10                 "87"         "87"
p20                 "96.8"       "96.8"
p25                 "100"        "100"
p30                 "104"        "104"
p40                 "110"        "110"
p50                 "117"        "117"
p60                 "122"        "122"
p70                 "128.3"      "128.3"
p75                 "131"        "131"
p80                 "134"        "134"
p90                 "146"        "146"
p95                 "155.0500"   "155.0025"
p99                 "166.05"     "164.02"
p100                "191"        "173"
# viz of imputation
plot(price)
The following example imputes the outliers of the Price variable and then calculates the arithmetic mean for each level of US. Here dplyr is used, and the logic is easy to follow thanks to the pipe operator.
# The mean before and after the imputation of the Price variable
carseats %>%
  mutate(Price_imp = imputate_outlier(carseats, Price, method = "capping")) %>%
  group_by(US) %>%
  summarise(orig = mean(Price, na.rm = TRUE),
            imputation = mean(Price_imp, na.rm = TRUE))
# A tibble: 2 × 3
  US     orig imputation
  <fct> <dbl>      <dbl>
1 No     114.       114.
2 Yes    117.       117.
transform()
transform()
performs data transformation. Only numeric variables are supported. The available methods include "zscore" and "minmax" for standardization and "log" and "log+1" (among others) for resolving skewness.
Standardization with transform()
Use the methods "zscore" and "minmax" to perform standardization.
carseats %>%
  mutate(Income_minmax = transform(carseats$Income, method = "minmax"),
         Sales_minmax = transform(carseats$Sales, method = "minmax")) %>%
  select(Income_minmax, Sales_minmax) %>%
  boxplot()
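For reference, a base-R sketch of what the "zscore" and "minmax" methods compute, assuming the usual definitions (z-score: subtract the mean and divide by the standard deviation; min-max: rescale to the [0, 1] interval):

x <- carseats$Sales

# z-score standardization: result has mean 0 and standard deviation 1
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# min-max scaling: result lies in [0, 1]
mm <- (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

round(c(mean_z = mean(z), sd_z = sd(z), min_mm = min(mm), max_mm = max(mm)), 3)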
Resolving Skewness data with transform()
find_skewness()
searches for variables with skewed data. It finds the variables that meet the skewness search conditions and can also calculate the skewness values.
# find index of skewed variables
find_skewness(carseats)
[1] 4
# find names of skewed variables
find_skewness(carseats, index = FALSE)
1] "Advertising"
[
# compute the skewness
find_skewness(carseats, value = TRUE)
      Sales   CompPrice      Income Advertising  Population       Price
      0.185      -0.043       0.045       0.637      -0.051      -0.125
        Age   Education
     -0.077       0.044
# compute the skewness & filtering with threshold
find_skewness(carseats, value = TRUE, thres = 0.1)
      Sales Advertising       Price
      0.185       0.637      -0.125
The skewness of Advertising is 0.637, which means the distribution is skewed to the right (it has a long right tail). To bring it closer to a normal distribution, transform it with the "log" method of transform(), as follows. summary() summarizes the transformation, and plot() visualizes it.
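For reference, skewness is the standardized third moment of the data; here is a base-R sketch using one common estimator (the exact value can differ slightly from find_skewness() depending on the estimator used):

x <- carseats$Advertising
m <- mean(x)

# moment-based skewness: third central moment divided by the cubed standard deviation
skew <- mean((x - m)^3) / (mean((x - m)^2))^1.5
round(skew, 3)   # close to the 0.637 reported above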
Advertising_log <- transform(carseats$Advertising, method = "log")

# result of transformation
head(Advertising_log)
[1] 2.397895 2.772589 2.302585 1.386294 1.098612 2.564949

# summary of transformation
summary(Advertising_log)
* Resolving Skewness with log
* Information of Transformation (before vs after)
           Original Transformation
n       400.0000000    400.0000000
na        0.0000000      0.0000000
mean      6.6350000           -Inf
sd        6.6503642            NaN
se_mean   0.3325182            NaN
IQR      12.0000000            Inf
skewness  0.6395858            NaN
kurtosis -0.5451178            NaN
p00       0.0000000           -Inf
p01       0.0000000           -Inf
p05       0.0000000           -Inf
p10       0.0000000           -Inf
p20       0.0000000           -Inf
p25       0.0000000           -Inf
p30       0.0000000           -Inf
p40       2.0000000      0.6931472
p50       5.0000000      1.6094379
p60       8.4000000      2.1265548
p70      11.0000000      2.3978953
p75      12.0000000      2.4849066
p80      13.0000000      2.5649494
p90      16.0000000      2.7725887
p95      19.0000000      2.9444390
p99      23.0100000      3.1359198
p100     29.0000000      3.3672958

# viz of transformation
plot(Advertising_log)
It seems that the raw data contains zeros, since -Inf appears among the log-transformed values. So this time, transform the variable with the "log+1" method instead.
Advertising_log <- transform(carseats$Advertising, method = "log+1")

# result of transformation
head(Advertising_log)
[1] 2.484907 2.833213 2.397895 1.609438 1.386294 2.639057

# summary of transformation
summary(Advertising_log)
* Resolving Skewness with log+1
* Information of Transformation (before vs after)
           Original Transformation
n       400.0000000   400.00000000
na        0.0000000     0.00000000
mean      6.6350000     1.46247709
sd        6.6503642     1.19436323
se_mean   0.3325182     0.05971816
IQR      12.0000000     2.56494936
skewness  0.6395858    -0.19852549
kurtosis -0.5451178    -1.66342876
p00       0.0000000     0.00000000
p01       0.0000000     0.00000000
p05       0.0000000     0.00000000
p10       0.0000000     0.00000000
p20       0.0000000     0.00000000
p25       0.0000000     0.00000000
p30       0.0000000     0.00000000
p40       2.0000000     1.09861229
p50       5.0000000     1.79175947
p60       8.4000000     2.23936878
p70      11.0000000     2.48490665
p75      12.0000000     2.56494936
p80      13.0000000     2.63905733
p90      16.0000000     2.83321334
p95      19.0000000     2.99573227
p99      23.0100000     3.17846205
p100     29.0000000     3.40119738

# viz of transformation
# plot(Advertising_log)
binning()
binning()
transforms a numeric variable into a categorical variable by binning it. The following types of binning are supported: "quantile", "equal", "pretty", "kmeans", and "bclust".
Here are some examples of binning Income using binning():
# Binning the Income variable. default type argument is "quantile"
bin <- binning(carseats$Income)

# Print bins class object
bin
binned type: quantile
number of bins: 10
x
        [21,30]         (30,39]         (39,48]         (48,62]         (62,69]
             40              37              38              40              42
        (69,78]   (78,86.56667] (86.56667,96.6] (96.6,108.6333]  (108.6333,120]
             33              36              38              38              38
           <NA>
             20
# Summarize bins class object
summary(bin)
       levels freq   rate
1     [21,30]   40 0.1000
2 (30,39] 37 0.0925
3 (39,48] 38 0.0950
4 (48,62] 40 0.1000
5 (62,69] 42 0.1050
6 (69,78] 33 0.0825
7 (78,86.56667] 36 0.0900
8 (86.56667,96.6] 38 0.0950
9 (96.6,108.6333] 38 0.0950
10 (108.6333,120] 38 0.0950
11 <NA> 20 0.0500
# Plot bins class object
plot(bin)
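For intuition, the default "quantile" type corresponds roughly to cutting at equal-frequency break points. The base-R sketch below approximates it; dlookr's binning() handles ties, boundary values, and labels internally, so the bins may not match exactly:

# approximate quantile (equal-frequency) binning of Income with base R
breaks <- quantile(carseats$Income, probs = seq(0, 1, length.out = 11), na.rm = TRUE)
income_bin <- cut(carseats$Income, breaks = unique(breaks), include.lowest = TRUE)
table(income_bin, useNA = "ifany")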
# Using labels argument
bin <- binning(carseats$Income, nbins = 4,
               labels = c("LQ1", "UQ1", "LQ3", "UQ3"))
bin
binned type: quantile
number of bins: 4
x
 LQ1  UQ1  LQ3  UQ3 <NA>
  95  102   89   94   20
# Using another type argument
binning(carseats$Income, nbins = 5, type = "equal")
binned type: equal
number of bins: 5
x
  [21,40.8] (40.8,60.6] (60.6,80.4] (80.4,100.2] (100.2,120]        <NA>
         81          65          94           80          60          20

binning(carseats$Income, nbins = 5, type = "pretty")
binned type: pretty
number of bins: 5
x
  [20,40]   (40,60]   (60,80]  (80,100] (100,120]      <NA>
       81        65        94        80        60        20
if (requireNamespace("classInt", quietly = TRUE)) {
  binning(carseats$Income, nbins = 5, type = "kmeans")
  binning(carseats$Income, nbins = 5, type = "bclust")
} else {
  cat("If you want to use this feature, you need to install the classInt package.\n")
}
binned type: bclust
number of bins: 5
x
   [21,49] (49,66.5] (66.5,78.5] (78.5,95.5] (95.5,120]      <NA>
       115        59          56          71         79        20
# Extract the binned results
extract(bin)
[1] LQ3 UQ1 LQ1 UQ3 UQ1 UQ3 UQ3 LQ3 UQ3 UQ3 LQ3 UQ3 LQ1 LQ1 UQ3
[16] UQ3 <NA> <NA> UQ3 LQ3 LQ3 LQ1 UQ1 LQ1 UQ3 LQ1 UQ3 UQ3 LQ3 UQ3
[31] UQ3 UQ1 LQ1 LQ1 UQ1 LQ3 LQ3 LQ1 LQ3 <NA> UQ3 UQ1 UQ1 LQ1 LQ3
[46] UQ1 LQ3 UQ3 UQ1 UQ3 LQ1 LQ3 LQ1 UQ1 UQ3 LQ3 LQ3 LQ3 UQ3 LQ3
[61] UQ3 LQ1 UQ1 LQ3 UQ1 LQ1 UQ3 UQ1 UQ1 UQ1 LQ3 UQ1 UQ1 LQ3 UQ1
[76] UQ3 LQ3 LQ3 UQ1 UQ1 UQ3 LQ3 LQ3 LQ1 LQ1 UQ3 LQ3 UQ1 LQ1 UQ1
[91] LQ1 UQ1 UQ3 LQ1 <NA> LQ1 LQ1 LQ3 LQ3 UQ1 UQ1 UQ3 LQ1 LQ3 UQ3
[106] UQ3 LQ1 UQ3 LQ3 UQ1 UQ1 UQ3 UQ3 LQ1 LQ3 <NA> LQ3 UQ1 LQ3 UQ3
[121] UQ3 LQ3 UQ3 UQ3 UQ3 <NA> UQ1 UQ1 UQ3 UQ3 LQ3 UQ1 LQ3 UQ3 LQ1
[136] UQ3 LQ3 LQ1 UQ3 UQ1 UQ1 LQ1 LQ3 LQ3 UQ1 UQ1 LQ3 UQ1 UQ3 UQ3
[151] LQ3 UQ1 LQ3 LQ1 UQ1 LQ3 LQ1 UQ1 LQ3 UQ1 LQ1 LQ1 <NA> UQ1 UQ1
[166] UQ1 UQ1 LQ3 LQ3 LQ1 LQ1 UQ3 UQ3 LQ3 LQ1 LQ3 <NA> LQ3 <NA> LQ1
[181] UQ3 LQ3 UQ1 LQ3 LQ1 UQ3 UQ1 LQ1 LQ1 UQ3 LQ1 LQ1 LQ1 LQ3 UQ3
[196] UQ3 LQ1 UQ1 LQ3 LQ3 UQ3 LQ3 LQ3 LQ3 LQ3 LQ1 UQ1 UQ3 <NA> LQ1
[211] LQ1 UQ3 UQ1 LQ3 UQ3 LQ3 <NA> UQ1 UQ1 LQ3 UQ3 <NA> UQ3 UQ1 LQ3
[226] LQ1 LQ1 UQ1 LQ3 UQ3 UQ1 UQ1 LQ3 LQ3 UQ1 LQ1 LQ1 LQ1 LQ1 UQ3
[241] LQ3 UQ1 UQ1 LQ1 LQ1 UQ1 UQ1 UQ3 UQ1 UQ1 UQ3 UQ3 UQ3 LQ1 UQ3
[256] LQ3 LQ1 UQ1 LQ1 LQ1 UQ3 LQ1 <NA> LQ1 LQ1 LQ1 UQ3 LQ3 UQ1 UQ1
[271] LQ1 UQ1 LQ1 UQ3 UQ3 UQ3 UQ1 UQ1 UQ3 UQ1 LQ3 UQ1 UQ3 UQ3 UQ1
[286] LQ1 UQ3 UQ1 LQ1 LQ3 UQ3 LQ3 UQ1 LQ3 LQ3 LQ1 UQ1 LQ3 UQ1 LQ1
[301] LQ3 UQ3 LQ3 UQ1 UQ3 LQ1 LQ1 UQ3 LQ3 UQ3 UQ1 UQ1 UQ3 LQ3 <NA>
[316] LQ1 LQ1 LQ1 LQ3 UQ1 LQ3 LQ1 UQ1 UQ3 UQ1 UQ1 LQ1 LQ1 UQ1 UQ1
[331] UQ1 UQ1 LQ1 UQ1 UQ3 LQ3 LQ1 LQ1 LQ1 UQ1 LQ1 UQ3 UQ3 LQ1 LQ3
[346] UQ1 <NA> LQ1 UQ3 LQ1 <NA> UQ3 UQ3 UQ1 LQ1 UQ3 UQ3 LQ3 UQ3 UQ1
[361] LQ3 LQ1 UQ1 <NA> LQ1 LQ1 UQ1 UQ3 LQ1 UQ3 LQ1 LQ3 <NA> <NA> UQ1
[376] UQ1 UQ1 UQ1 LQ3 UQ3 UQ1 UQ1 LQ1 UQ3 LQ1 LQ3 UQ3 LQ3 LQ3 LQ1
[391] LQ3 UQ1 LQ1 UQ1 UQ1 UQ3 <NA> LQ1 LQ3 LQ1
Levels: LQ1 < UQ1 < LQ3 < UQ3
# -------------------------
# Using pipes & dplyr
# -------------------------
library(dplyr)
carseats %>%
  mutate(Income_bin = binning(carseats$Income) %>%
           extract()) %>%
  group_by(ShelveLoc, Income_bin) %>%
  summarise(freq = n()) %>%
  arrange(desc(freq)) %>%
  head(10)
`summarise()` has grouped output by 'ShelveLoc'. You can override using the
`.groups` argument.
# A tibble: 10 × 3
# Groups: ShelveLoc [1]
   ShelveLoc Income_bin  freq
   <fct>     <ord>      <int>
1 Medium [21,30] 25
2 Medium (62,69] 24
3 Medium (48,62] 23
4 Medium (39,48] 21
# … with 6 more rows
binning_by()
binning_by()
transforms a numeric variable into a categorical variable by optimal binning. This method is often used when developing scorecard models.
The following binning_by() example optimally bins Advertising with respect to the binary target variable US.
library(dplyr)
# optimal binning using character
bin <- binning_by(carseats, "US", "Advertising")
Warning in binning_by(carseats, "US", "Advertising"): The factor y has been
changed to a numeric vector consisting of 0 and 1. 'Yes' changed to 1
(positive) and 'No' changed to 0 (negative).

# optimal binning using name
bin <- binning_by(carseats, US, Advertising)
Warning in binning_by(carseats, US, Advertising): The factor y has been changed
to a numeric vector consisting of 0 and 1. 'Yes' changed to 1 (positive) and
'No' changed to 0 (negative).

bin
binned type: optimal
number of bins: 3
x
[-1,0]  (0,6] (6,29]
   144     69    187
# summary optimal_bins class
summary(bin)
── Binning Table ──────────────────────── Several Metrics ──
     Bin CntRec CntPos CntNeg RatePos RateNeg    Odds      WoE      IV     JSD
1 [-1,0]    144     19    125 0.07364 0.88028  0.1520 -2.48101 2.00128 0.20093
2  (0,6]     69     54     15 0.20930 0.10563  3.6000  0.68380 0.07089 0.00869
3 (6,29]    187    185      2 0.71705 0.01408 92.5000  3.93008 2.76272 0.21861
4  Total    400    258    142 1.00000 1.00000  1.8169       NA 4.83489 0.42823
      AUC
1 0.03241
2 0.01883
3 0.00903
4 0.06028

── General Metrics ─────────────────────────────────────────
• Gini index                       : -0.87944
• IV (Jeffrey)                     : 4.83489
• JS (Jensen-Shannon) Divergence   : 0.42823
• Kolmogorov-Smirnov Statistics    : 0.80664
• HHI (Herfindahl-Hirschman Index) : 0.37791
• HHI (normalized)                 : 0.06687
• Cramer's V                       : 0.81863
── Significance Tests ──────────────────── Chisquare Test ──
Bin A Bin B statistics p_value
1 [-1,0] (0,6] 87.67064 7.731349e-21
2 (0,6] (6,29] 34.73349 3.780706e-09
# performance table
attr(bin, "performance")
Bin CntRec CntPos CntNeg CntCumPos CntCumNeg RatePos RateNeg RateCumPos
1 [-1,0] 144 19 125 19 125 0.07364 0.88028 0.07364
2 (0,6] 69 54 15 73 140 0.20930 0.10563 0.28295
3 (6,29] 187 185 2 258 142 0.71705 0.01408 1.00000
4 Total 400 258 142 NA NA 1.00000 1.00000 NA
RateCumNeg Odds LnOdds WoE IV JSD AUC
1 0.88028 0.1520 -1.88387 -2.48101 2.00128 0.20093 0.03241
2 0.98592 3.6000 1.28093 0.68380 0.07089 0.00869 0.01883
3 1.00000 92.5000 4.52721 3.93008 2.76272 0.21861 0.00903
4 NA 1.8169 0.59713 NA 4.83489 0.42823 0.06028
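The WoE and IV columns above follow the standard definitions WoE = ln(RatePos / RateNeg) and IV = (RatePos - RateNeg) * WoE, which can be checked by hand for the first bin:

# reproduce WoE and IV for the bin [-1,0] from the rates in the table above
rate_pos <- 0.07364
rate_neg <- 0.88028
woe <- log(rate_pos / rate_neg)      # approximately -2.481
iv  <- (rate_pos - rate_neg) * woe   # approximately 2.001
round(c(WoE = woe, IV = iv), 5)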
# visualize optimal_bins class
plot(bin)
# extract binned results
extract(bin) %>%
head(20)
 [1] (6,29] (6,29] (6,29] (0,6]  (0,6]  (6,29] [-1,0] (6,29] [-1,0] [-1,0]
[11] (6,29] (0,6] (0,6] (6,29] (6,29] (0,6] [-1,0] (6,29] [-1,0] (6,29]
Levels: [-1,0] < (0,6] < (6,29]
dlookr provides two automated data transformation reports:
transformation_web_report()
transformation_web_report()
creates a dynamic report for objects that inherit from data.frame (tbl_df, tbl, etc.) or for data.frame objects.
The contents of the report are as follows:
transformation_web_report() generates various reports with the following arguments.
The following script creates a data transformation report for the tbl_df
class object, heartfailure
.
heartfailure %>%
  transformation_web_report(target = "death_event", subtitle = "heartfailure",
                            output_dir = "./", output_file = "transformation.html",
                            theme = "blue")
transformation_paged_report()
transformation_paged_report()
creates a static report for objects that inherit from data.frame (tbl_df, tbl, etc.) or for data.frame objects.
The contents of the report are as follows:
transformation_paged_report() generates various reports with the following arguments.
The following script creates a data transformation report for the data.frame
class object, heartfailure
.
heartfailure %>%
  transformation_paged_report(target = "death_event", subtitle = "heartfailure",
                              output_dir = "./", output_file = "transformation.pdf",
                              theme = "blue")