Automated feature engineering is a cornerstone of the package. Below are some of the techniques we use in multivariate machine learning models, and the outside packages that make them possible.
The tk_augment_timeseries_signature function from the timetk package extracts a wide range of date features from the time stamp. The function does not take the date type into account, so features that do not apply to a given date type are removed afterward. For example, features related to week and day are automatically removed for a monthly forecast.
Fourier series are also added using the tk_augment_fourier function from timetk.
library(dplyr)
library(timetk)

# Add date features, then Fourier terms for each series (grouped by id)
m4_monthly %>%
  timetk::tk_augment_timeseries_signature(date) %>%
  dplyr::group_by(id) %>%
  timetk::tk_augment_fourier(date, .periods = c(3, 6, 12), .K = 1) %>%
  dplyr::ungroup()
#> # A tibble: 1,574 x 37
#> id date value index.num diff year year.iso half quarter month
#> <fct> <date> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
#> 1 M1 1976-06-01 8000 202435200 NA 1976 1976 1 2 6
#> 2 M1 1976-07-01 8350 205027200 2592000 1976 1976 2 3 7
#> 3 M1 1976-08-01 8570 207705600 2678400 1976 1976 2 3 8
#> 4 M1 1976-09-01 7700 210384000 2678400 1976 1976 2 3 9
#> 5 M1 1976-10-01 7080 212976000 2592000 1976 1976 2 4 10
#> 6 M1 1976-11-01 6520 215654400 2678400 1976 1976 2 4 11
#> 7 M1 1976-12-01 6070 218246400 2592000 1976 1976 2 4 12
#> 8 M1 1977-01-01 6650 220924800 2678400 1977 1976 1 1 1
#> 9 M1 1977-02-01 6830 223603200 2678400 1977 1977 1 1 2
#> 10 M1 1977-03-01 5710 226022400 2419200 1977 1977 1 1 3
#> # ... with 1,564 more rows, and 27 more variables: month.xts <int>,
#> # month.lbl <ord>, day <int>, hour <int>, minute <int>, second <int>,
#> # hour12 <int>, am.pm <int>, wday <int>, wday.xts <int>, wday.lbl <ord>,
#> # mday <int>, qday <int>, yday <int>, mweek <int>, week <int>,
#> # week.iso <int>, week2 <int>, week3 <int>, week4 <int>, mday7 <int>,
#> # date_sin3_K1 <dbl>, date_cos3_K1 <dbl>, date_sin6_K1 <dbl>,
#> # date_cos6_K1 <dbl>, date_sin12_K1 <dbl>, date_cos12_K1 <dbl>
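The removal of date features that do not apply to monthly data, described above, might look roughly like the following. This is a minimal base R sketch, not the finnts implementation: the toy data frame, its values, and the `sub_monthly_pattern` regular expression are assumptions made for illustration, with column names borrowed from the output above.

```r
# Toy monthly data with a handful of signature-style columns.
# Column names come from the output above; the values are illustrative.
monthly <- data.frame(
  date    = as.Date(c("1976-06-01", "1976-07-01", "1976-08-01")),
  value   = c(8000, 8350, 8570),
  year    = c(1976L, 1976L, 1976L),
  quarter = c(2L, 3L, 3L),
  month   = c(6L, 7L, 8L),
  day     = c(1L, 1L, 1L),
  hour    = c(0L, 0L, 0L),
  wday    = c(3L, 5L, 1L),
  week    = c(22L, 27L, 31L),
  mday7   = c(1L, 1L, 1L)
)

# Assumption: for monthly data, any feature finer than a month is dropped.
# This pattern is hypothetical and only covers the signature column names.
sub_monthly_pattern <-
  "^(day|hour|minute|second|am\\.pm|wday|mday|qday|yday|mweek|week)"

keep <- !grepl(sub_monthly_pattern, names(monthly))
monthly_features <- monthly[, keep, drop = FALSE]

names(monthly_features)
```

Only the date, value, year, quarter, and month columns remain after the filter.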
Missing data is filled in using the pad_by_time function from the timetk package. First, each time series is grouped and padded between its existing start and end dates, with missing values padded as NA. Then the same process is run again, this time padding from the “hist_start_date” input of the “forecast_time_series” finnts function, with missing values filled in with zero. This ensures that missing periods before a time series starts are all zero, while missing periods within the existing time series are marked as NA so they can be imputed with new values in the next step.
After missing data is padded, the ts_impute_vec function from the timetk package is called to impute any NA values. This only happens if the “clean_missing_values” input of the “forecast_time_series” finnts function is set to TRUE; otherwise NA values are replaced with zero.
Outliers are handled using the ts_clean_vec function from the timetk package. Outlier replacement runs after the missing data process, and only if the “clean_outliers” input of the “forecast_time_series” finnts function is set to TRUE.
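To make the two padding passes concrete, here is a small base R sketch of the logic. It is a stand-in for what pad_by_time does, not a call to timetk itself, and the `pad_series` helper and the toy `sales` data are hypothetical.

```r
# Hypothetical helper: pad a monthly series onto a full date grid.
# Rows that did not exist before padding receive `pad_value`.
pad_series <- function(df, start_date, end_date, pad_value = NA_real_) {
  grid <- data.frame(date = seq(start_date, end_date, by = "month"))
  out <- merge(grid, df, by = "date", all.x = TRUE)
  new_rows <- !(out$date %in% df$date)
  out$value[new_rows] <- pad_value
  out
}

# Series starts in March and is missing May.
sales <- data.frame(
  date  = as.Date(c("2020-03-01", "2020-04-01", "2020-06-01")),
  value = c(10, 20, 40)
)

# Pass 1: pad between the existing start and end dates with NA.
step1 <- pad_series(sales, min(sales$date), max(sales$date))

# Pass 2: pad back to hist_start_date with zero.
hist_start_date <- as.Date("2020-01-01")
step2 <- pad_series(step1, hist_start_date, max(step1$date), pad_value = 0)

step2$value
```

After both passes, January and February hold zero, while May is still NA and is left for the imputation step.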
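The branch on “clean_missing_values” can be sketched as follows. This is an illustrative base R version that uses plain linear interpolation as a simplified stand-in for ts_impute_vec; the `fill_missing` helper is hypothetical and not part of finnts or timetk.

```r
# Hypothetical helper mirroring the clean_missing_values switch.
fill_missing <- function(x, clean_missing_values = TRUE) {
  if (!clean_missing_values) {
    # No imputation requested: NA values simply become zero.
    x[is.na(x)] <- 0
    return(x)
  }
  # Linear interpolation over the observed points; rule = 2 carries the
  # nearest observed value outward if the NA sits at either end.
  idx <- seq_along(x)
  approx(idx[!is.na(x)], x[!is.na(x)], xout = idx, rule = 2)$y
}

padded <- c(0, 0, 10, 20, NA, 40)

imputed <- fill_missing(padded, clean_missing_values = TRUE)
zeroed  <- fill_missing(padded, clean_missing_values = FALSE)
```

With imputation on, the gap between 20 and 40 is filled with 30; with it off, the gap becomes 0.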
Important Note: Missing values and outliers are replaced for the target variable and any numeric external regressors.
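As a rough illustration of cleaning the target and every numeric external regressor while leaving other columns alone, consider the base R sketch below. The interquartile-range rule here is a deliberately simple stand-in and should not be read as the method ts_clean_vec uses; the `replace_outliers` helper, the threshold `k`, and the `Promo_Spend` and `Region` columns are all hypothetical.

```r
# Hypothetical helper: flag values far outside the interquartile range and
# replace them by linear interpolation from their non-outlier neighbors.
replace_outliers <- function(x, k = 3) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  is_out <- x < q[1] - k * iqr | x > q[2] + k * iqr
  idx <- seq_along(x)
  x[is_out] <- approx(idx[!is_out], x[!is_out], xout = idx[is_out], rule = 2)$y
  x
}

df <- data.frame(
  Date        = seq(as.Date("2020-01-01"), by = "month", length.out = 8),
  Target      = c(10, 12, 11, 500, 13, 12, 14, 13),
  Promo_Spend = c(1, 2, 1, 2, 90, 2, 1, 2),  # numeric external regressor
  Region      = "North"                      # non-numeric, left untouched
)

# Apply the cleaning to numeric columns only.
numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
df[numeric_cols] <- lapply(df[numeric_cols], replace_outliers)
```

The spike of 500 in the target and 90 in the regressor are both replaced, and the Date and Region columns pass through unchanged.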
Lags of the target variable and external regressors are created using the tk_augment_lags function from timetk.
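A lag feature simply shifts a column down by a number of periods. The base R sketch below shows the idea for a target and one external regressor; it does not call tk_augment_lags, and the `lag_vec` helper, the `Promo_Spend` column, and the chosen lag periods are assumptions for illustration.

```r
# Hypothetical helper: shift a vector down by n periods, padding with NA.
lag_vec <- function(x, n) {
  if (n >= length(x)) return(rep(NA_real_, length(x)))
  c(rep(NA_real_, n), x[seq_len(length(x) - n)])
}

df <- data.frame(
  Date        = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  Target      = c(10, 20, 30, 40, 50, 60),
  Promo_Spend = c(1, 2, 3, 4, 5, 6)  # hypothetical external regressor
)

# Create lag 1 and lag 3 versions of each column.
for (col in c("Target", "Promo_Spend")) {
  for (n in c(1, 3)) {
    df[[paste0(col, "_lag", n)]] <- lag_vec(df[[col]], n)
  }
}

names(df)
```

Each new column starts with as many NA values as its lag period, which is why a fill strategy is needed later on.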
Rolling window calculations of the target variable are created using the tk_augment_slidify function from timetk. The below calculations are created over various window values.
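For intuition, a trailing rolling sum over two window sizes can be written in base R as below. This is only a sketch of the concept, not a call to tk_augment_slidify; the `roll_sum` helper and the output column names are hypothetical, and the window sizes of 3 and 6 are taken from the column names that appear in the “R1” example output.

```r
# Hypothetical helper: trailing (right-aligned) rolling sum.
# Positions without a full window of history stay NA.
roll_sum <- function(x, window) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= window) out[i] <- sum(x[(i - window + 1):i])
  }
  out
}

target <- seq(10, 100, by = 10)

rolled <- data.frame(
  Target           = target,
  Target_roll3_sum = roll_sum(target, 3),
  Target_roll6_sum = roll_sum(target, 6)
)
```

The first complete 3-period window ends at row 3 and the first complete 6-period window ends at row 6; earlier rows are NA.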
Polynomial transformations are created for the target variable, and lags are then created on top of them. The below transformations are created.
In addition to the standard approaches above, finnts also offers two different recipes for preparing the features used in a multivariate machine learning model.
In the first recipe, referred to as “R1” in default finnts models, all of the engineered target and external regressor features are used, but the periods they cover cannot be less than the forecast horizon. For example, for a monthly data set with a forecast horizon of 3, finnts will take engineered features like lags and rolling window features but only use those that are for periods equal to or greater than 3. Recursive forecasting is not supported in default finnts multivariate machine learning models, since feeding forecast outputs back in as features to create another forecast adds complex layers of uncertainty that can easily spiral out of control and produce poor forecasts. NA values created by generating lag features are filled “up”. This results in the first few periods of a time series having some data leakage, but the effect should be small if the time series is long enough.
#> # A tibble: 10 x 9
#> Combo Date Target Target_lag3 Target_lag6 Target_lag3_roll3_sum
#> <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 Country_1 2020-01-01 10 10 10 30
#> 2 Country_1 2020-02-01 20 10 10 30
#> 3 Country_1 2020-03-01 30 10 10 30
#> 4 Country_1 2020-04-01 40 10 10 30
#> 5 Country_1 2020-05-01 50 20 10 40
#> 6 Country_1 2020-06-01 60 30 10 60
#> 7 Country_1 2020-07-01 70 40 10 90
#> 8 Country_1 2020-08-01 80 50 20 120
#> 9 Country_1 2020-09-01 90 60 30 150
#> 10 Country_1 2020-10-01 100 70 40 180
#> # ... with 3 more variables: Target_lag6_roll3_sum <dbl>,
#> # Target_lag3_roll6_sum <dbl>, Target_lag6_roll6_sum <dbl>
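The first few columns of the output above can be reproduced with a short base R sketch. The helpers, the list of candidate lags, and the order of operations (fill the lag “up” first, then roll over the filled lag) are assumptions inferred from the values in the table, not taken from the finnts source.

```r
# Hypothetical helpers for the sketch.
lag_vec <- function(x, n) c(rep(NA_real_, n), x[seq_len(length(x) - n)])
fill_up <- function(x) {
  # Replace each NA with the next observed value below it.
  for (i in rev(seq_along(x)[-length(x)])) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}
roll_sum <- function(x, window) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= window) out[i] <- sum(x[(i - window + 1):i])
  }
  out
}

target  <- seq(10, 100, by = 10)
horizon <- 3

# Assumed candidate lags; only those at or beyond the horizon are kept.
candidate_lags <- c(1, 2, 3, 6)
usable_lags <- candidate_lags[candidate_lags >= horizon]

r1 <- data.frame(Target = target)
for (l in usable_lags) {
  lag_name <- paste0("Target_lag", l)
  r1[[lag_name]] <- fill_up(lag_vec(target, l))
  r1[[paste0(lag_name, "_roll3_sum")]] <- fill_up(roll_sum(r1[[lag_name]], 3))
}
```

Lags 1 and 2 are discarded because they would require values that are not yet known when forecasting 3 periods ahead.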
The second recipe is referred to as “R2” in default finnts models. It takes a very different approach than the “R1” recipe. For a 3 month forecast horizon on a monthly dataset, target and rolling window features are created that depend on the horizon period. They are also constrained to be equal to or less than the forecast horizon. In the below example, “Origin” and “Horizon” features are created for each time period. This results in duplicating rows in the original data set to create new features that are now specific to each horizon period. This helps the default finnts models find new unique relationships to model, when compared to the more formal approach in “R1”. NA values created by generating lag features are filled “up”, but we left that out of the below example to make it easier to see how the horizon specific features are created.
#> # A tibble: 18 x 8
#> Combo Date Origin Horizon Target Target_Lag1 Target_Lag2 Target_Lag3
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Country~ 2020-01-01 0 1 10 NA NA NA
#> 2 Country~ 2020-01-01 -1 2 10 NA NA NA
#> 3 Country~ 2020-01-01 -2 3 10 NA NA NA
#> 4 Country~ 2020-02-01 1 1 20 10 NA NA
#> 5 Country~ 2020-02-01 0 2 20 NA NA NA
#> 6 Country~ 2020-02-01 -1 3 20 NA NA NA
#> 7 Country~ 2020-03-01 2 1 30 20 10 NA
#> 8 Country~ 2020-03-01 1 2 30 10 NA NA
#> 9 Country~ 2020-03-01 0 3 30 NA NA NA
#> 10 Country~ 2020-04-01 3 1 40 30 20 10
#> 11 Country~ 2020-04-01 2 2 40 20 10 NA
#> 12 Country~ 2020-04-01 1 3 40 10 NA NA
#> 13 Country~ 2020-05-01 4 1 50 40 30 20
#> 14 Country~ 2020-05-01 3 2 50 30 20 10
#> 15 Country~ 2020-05-01 2 3 50 20 10 NA
#> 16 Country~ 2020-06-01 5 1 60 50 40 30
#> 17 Country~ 2020-06-01 4 2 60 40 30 20
#> 18 Country~ 2020-06-01 3 3 60 30 20 10
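The pattern in the example above can be reconstructed with a small base R sketch. Reading the table, “Origin” equals the row's period index minus its “Horizon”, and each lag column is taken relative to that origin. That reading, along with the `value_at` helper, is an inference from the output rather than a description of the finnts internals.

```r
target  <- seq(10, 60, by = 10)
dates   <- seq(as.Date("2020-01-01"), by = "month", length.out = 6)
horizon <- 3

# Hypothetical helper: target value at period i, or NA before the series starts.
value_at <- function(i) if (i >= 1) target[i] else NA_real_

rows <- list()
for (t in seq_along(target)) {
  for (h in seq_len(horizon)) {
    origin <- t - h  # last period whose value is known h steps before t
    rows[[length(rows) + 1]] <- data.frame(
      Date        = dates[t],
      Origin      = origin,
      Horizon     = h,
      Target      = target[t],
      Target_Lag1 = value_at(origin),
      Target_Lag2 = value_at(origin - 1),
      Target_Lag3 = value_at(origin - 2)
    )
  }
}
r2 <- do.call(rbind, rows)
```

Each date appears once per horizon period, so 6 months with a horizon of 3 yields 18 rows, and the lag values shift back by one period as the horizon grows.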
In addition to everything called out above, some models have their own specific transformations that need to be applied before training. For example, the “glmnet” model needs categorical variables transformed into continuous variables and the data centered and scaled before training. Each default model in finnts has its own preprocessing steps that ensure the data fed into the model has the best chance of producing a high quality forecast. The recipes package is used to easily apply the various preprocessing transformations needed before training a model.
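The kind of preprocessing described for “glmnet” can be illustrated in base R as follows. This is a sketch of the transformations themselves, not the finnts recipe: the recipes package provides steps such as step_dummy and step_normalize for this purpose, but whether finnts uses exactly those steps is an assumption, and the toy `train` data is hypothetical.

```r
train <- data.frame(
  Region      = factor(c("North", "South", "North", "West")),
  Target_lag3 = c(10, 20, 30, 40),
  Promo_Spend = c(1, 3, 2, 6)
)

# Categorical to continuous: one 0/1 indicator column per factor level.
dummies <- model.matrix(~ Region - 1, data = train)

# Center each numeric column to mean 0 and scale it to standard deviation 1.
scaled <- scale(train[, c("Target_lag3", "Promo_Spend")])

prepped <- cbind(dummies, scaled)
colnames(prepped)
```

Putting predictors on a common scale matters for penalized regression models like glmnet, since the penalty treats all coefficients alike.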