Time Series Forecasting Lab (Part 2) - Feature Engineering with Recipes
This is the second of a series of 6 articles about time series forecasting with panel data and ensemble stacking with R. Click here for Part 1.
Through these articles I will be putting into practice what I have learned from the Business Science University training course [1], "DS4B 203-R: High-Performance Time Series Forecasting", delivered by Matt Dancho. If you are looking to gain a high level of expertise in time series with R, I strongly recommend this course.
The objective of this article is to explain how to perform time series preprocessing and feature engineering with timetk, which extends recipes for time series.
I will explain what can and cannot be done with recipes for time series panel data.
Introduction
Since the ultimate goal of this series of articles is to perform time series forecasting with machine learning, maximizing model performance will require a lot of experimentation with different feature-engineered datasets.
Feature engineering is the most critical part of time series analysis and with recipes you can "use dplyr-like pipeable sequences of feature engineering steps to get your data ready for modeling".
Prerequisites
I assume you are already familiar with the following topics, packages and terms:
time series data format: timestamp, measure, key(s), features
time series lags, rolling lags, Fourier terms
dplyr or tidyverse R packages
normalization and standardization
dummy variables (one-hot encoding)
dataset split into training and test for modeling
Packages
The following packages must be loaded:
# PART 1 - FEATURE ENGINEERING
library(tidyverse) # loading dplyr, tibble, ggplot2, .. dependencies
library(timetk) # using timetk plotting, diagnostics and augment operations
library(tsibble) # for month to Date conversion
library(tsibbledata) # for aus_retail dataset
library(fastDummies) # for creating dummy variables from categorical variables
library(skimr) # for quick statistics
# PART 2 - FEATURE ENGINEERING WITH RECIPES
library(tidymodels)
End-to-end FE process with recipe
Below is a summary of the end-to-end FE process that will be implemented in this article. Notice that the recipe creation will be the last step of the process:
1. Perform measure transformations (here, on Turnover).
2. Add a future frame with the prediction length.
3. Add Fourier, lag, and rolling lag features. Add a rowid column.
4. Perform a first split into prepared and future datasets.
5. Apply the training and test split on the prepared dataset.
6. Create the recipe based upon the training dataset of the split.
7. Add steps to the recipe, such as adding calendar-based features and other transformations and cleaning.
Recipe limitation for panel data
Generally, we could perform steps 1, 2, and 3 (and the rest of the preprocessing) within a recipe by simply adding the corresponding step to it.
The problem is that we are working with panel data, so it is not possible to include steps 1, 2, and 3 as steps within a recipe. We must first perform these "steps" per group (by Industry), and then bind all groups into a single tibble again to be able to do steps 4, 5, 6 and 7.
Note: the "NA" values in the feature-engineered dataset correspond to the (unknown) Turnover predictions within the future frame.
Our data
Let us recall our time series dataset from Part 1: the Australian retail trade turnover dataset tsibble::aus_retail, filtered by the "Australian Capital Territory" State value.
> aus_retail_tbl <- aus_retail %>%
tk_tbl()
> monthly_retail_tbl <- aus_retail_tbl %>%
filter(State == "Australian Capital Territory") %>%
mutate(Month = as.Date(Month)) %>%
mutate(Industry = as_factor(Industry)) %>%
select(Month, Industry, Turnover)
> monthly_retail_tbl %>% glimpse()
Rows: 8,820
Columns: 3
$ Month <date> 1982-04-01, 1982-05-01, 1982-06-01, 1982-07-01, 1982-08-01, 1982-09-01, 1982-10-01, 1982-11-01, 1982-12-01, 1983-01-01, 1983-02-01, 1983-03-01, 1983-04-01, 1983-05-01,~
$ Industry <fct> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, restaurants and catering ser~
$ Turnover <dbl> 4.4, 3.4, 3.6, 4.0, 3.6, 4.2, 4.8, 5.4, 6.9, 3.8, 4.2, 4.0, 4.4, 4.3, 4.3, 4.6, 4.9, 4.5, 5.5, 6.7, 9.6, 3.8, 3.8, 4.4, 3.8, 3.9, 4.1, 4.3, 4.8, 5.0, 5.3, 5.6, 6.8, 4.6~
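Optionally, before engineering any features, you can eyeball the 20 series with timetk's plot_time_series(); a minimal sketch (the .facet_ncol, .smooth and .interactive settings are just my choices):
monthly_retail_tbl %>%
  group_by(Industry) %>%
  plot_time_series(.date_var = Month,
                   .value = Turnover,
                   .facet_ncol = 4,      # one panel per industry
                   .smooth = FALSE,
                   .interactive = FALSE)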
The code snippet below performs steps (not recipe steps!) 1, 2, and 3 as explained above. You can always refer to Part 1 for explanations of the function arguments.
We will generate a list (groups) of 20 tibbles, each corresponding to a specific industry. In addition to steps 1, 2 and 3, I have also arranged the timestamp (Month) in ascending order within each Industry group, which the lags, rolling lags and Fourier terms require.
We add a future frame, spanning 2019-01-01 to 2019-12-01 (12-month length), which will hold NA values for the measure (Turnover) but to which the Fourier, lag and rolling lag features will also be added.
Finally, we bind all groups together to produce the first feature-engineered tibble, groups_fe_tbl.
Industries <- unique(aus_retail_tbl$Industry)  # the 20 Industry values, also used later for plotting

groups <- lapply(X = 1:length(Industries), FUN = function(x){
  monthly_retail_tbl %>%
    filter(Industry == Industries[x]) %>%
    arrange(Month) %>%
    mutate(Turnover = log1p(x = Turnover)) %>%
    mutate(Turnover = standardize_vec(Turnover)) %>%
    future_frame(Month, .length_out = "12 months", .bind_data = TRUE) %>%
    mutate(Industry = Industries[x]) %>%
    tk_augment_fourier(.date_var = Month, .periods = 12, .K = 1) %>%
    tk_augment_lags(.value = Turnover, .lags = c(12, 13)) %>%
    tk_augment_slidify(.value   = contains("lag"),
                       .f       = ~ mean(.x, na.rm = TRUE),
                       .period  = c(3, 6, 9, 12),
                       .partial = TRUE,
                       .align   = "center")
})
groups_fe_tbl <- bind_rows(groups) %>%
rowid_to_column(var = "rowid")
Below I display the last 13 months of groups_fe_tbl; you should see the last historical row (2018-12-01) and the future frame values for the last industry (obviously, we have the same situation for the other industries within the same tibble):
> groups_fe_tbl %>% tail(n=13) %>% glimpse()
Rows: 13
Columns: 16
$ rowid <int> 9048, 9049, 9050, 9051, 9052, 9053, 9054, 9055, 9056, 9057, 9058, 9059, 9060
$ Month <date> 2018-12-01, 2019-01-01, 2019-02-01, 2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01, 20~
$ Industry <chr> "Takeaway food services", "Takeaway food services", "Takeaway food services", "Takeaway food serv~
$ Turnover <dbl> 2.078015, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ Month_sin12_K1 <dbl> 0.16810089, 0.63846454, 0.93775213, 0.99301874, 0.80100109, 0.40982032, -0.10116832, -0.57126822, ~
$ Month_cos12_K1 <dbl> 0.98576980, 0.76965124, 0.34730525, -0.11795672, -0.59866288, -0.91216627, -0.99486932, -0.8207634~
$ Turnover_lag12 <dbl> 1.668977, 1.397316, 1.542003, 1.894938, 1.894938, 1.894938, 1.878972, 1.886969, 1.957669, 1.926527~
$ Turnover_lag13 <dbl> 1.755391, 1.668977, 1.397316, 1.542003, 1.894938, 1.894938, 1.894938, 1.878972, 1.886969, 1.957669~
$ Turnover_lag12_roll_3 <dbl> 1.607228, 1.536098, 1.611419, 1.777293, 1.894938, 1.889616, 1.886960, 1.907870, 1.923722, 1.972616~
$ Turnover_lag13_roll_3 <dbl> 1.723756, 1.607228, 1.536098, 1.611419, 1.777293, 1.894938, 1.889616, 1.886960, 1.907870, 1.923722~
$ Turnover_lag12_roll_6 <dbl> 1.667587, 1.692260, 1.715518, 1.750517, 1.832126, 1.901404, 1.906669, 1.929788, 1.941529, 1.974703~
$ Turnover_lag13_roll_6 <dbl> 1.668911, 1.667587, 1.692260, 1.715518, 1.750517, 1.832126, 1.901404, 1.906669, 1.929788, 1.941529~
$ Turnover_lag12_roll_9 <dbl> 1.750363, 1.744253, 1.741597, 1.757160, 1.779636, 1.808252, 1.878956, 1.925999, 1.946341, 1.952766~
$ Turnover_lag13_roll_9 <dbl> 1.748589, 1.750363, 1.744253, 1.741597, 1.757160, 1.779636, 1.808252, 1.878956, 1.925999, 1.929881~
$ Turnover_lag12_roll_12 <dbl> 1.783846, 1.784512, 1.785157, 1.787128, 1.811024, 1.828524, 1.862610, 1.904910, 1.941200, 1.946341~
$ Turnover_lag13_roll_12 <dbl> 1.766346, 1.783846, 1.784512, 1.785157, 1.787128, 1.811024, 1.828524, 1.843028, 1.887599, 1.925999~
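Before moving on, a quick sanity check (a verification step I'm adding here, not part of the FE process): each industry should now contribute 441 historical months (1982-04-01 through 2018-12-01) plus the 12 future months, i.e., 453 rows each, 9,060 in total.
# expect n = 453 for each of the 20 industries (441 historical + 12 future)
groups_fe_tbl %>%
  count(Industry)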
Save standardization parameters
Since we are working with panel data (there are 20 time series, one for each Industry), there will be 20 pairs of standardization parameters (mean, standard deviation) generated by standardize_vec(Turnover).
The values of these parameters must be saved. Ajay Mehta proposed an elegant solution with group_map(); I'm including his code snippet, based on his comment at the end of this article. The mean and standard deviation values will be saved in the std_mean and std_sd vectors respectively.
tmp <- monthly_retail_tbl %>%
group_by(Industry) %>%
arrange(Month) %>%
mutate(Turnover = log1p(x = Turnover)) %>%
group_map(~ c(mean = mean(.x$Turnover, na.rm = TRUE),
sd = sd(.x$Turnover, na.rm = TRUE))) %>%
bind_rows()
std_mean <- tmp$mean
std_sd <- tmp$sd
rm('tmp')
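These standardization parameters will be needed in Part 6 to bring the forecasts back to the original Turnover scale. Below is a minimal sketch of the inversion for a single industry; preds_std and the index i are hypothetical, and I'm assuming timetk's standardize_inv_vec() to undo standardize_vec() and base R's expm1() to undo log1p():
i <- 1                  # hypothetical industry index
preds <- preds_std %>%  # hypothetical standardized predictions for industry i
  standardize_inv_vec(mean = std_mean[i], sd = std_sd[i]) %>%
  expm1()               # invert the log1p transformation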
Prepared and future datasets
We perform the split of groups_fe_tbl into the prepared and future datasets.
The prepared dataset holds the historical values of the measure (here, Turnover) and must not contain NA values. Since Turnover does have missing values (the future frame rows), we apply filter(!is.na(Turnover)) to remove them. We also drop all rows with NA values elsewhere, e.g., those introduced by the lag and rolling lag features.
Finally, we obtain an 8,560-row tibble (data_prepared_tbl) from the 9,060-row groups_fe_tbl. You can run groups_fe_tbl %>% group_by(Industry) %>% skim() to check the total number of missing values (NA and NaN) that were removed.
The prepared dataset starts from 1983-05-01.
> data_prepared_tbl <- groups_fe_tbl %>%
filter(!is.na(Turnover)) %>%
drop_na()
> data_prepared_tbl %>% glimpse()
Rows: 8,560
Columns: 16
$ rowid <int> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38~
$ Month <date> 1983-05-01, 1983-06-01, 1983-07-01, 1983-08-01, 1983-09-01, 1983-10-01, 1983-11-01, 1983-12-01, 1~
$ Industry <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, re~
$ Turnover <dbl> -2.0426107, -2.0426107, -1.9499231, -1.8620736, -1.9802554, -1.6990366, -1.4138382, -0.8757670, -2~
$ Month_sin12_K1 <dbl> 0.51455540, 0.87434662, 1.00000000, 0.86602540, 0.50000000, 0.01688948, -0.48530196, -0.84864426, ~
$ Month_cos12_K1 <dbl> 8.574571e-01, 4.853020e-01, 1.567970e-14, -5.000000e-01, -8.660254e-01, -9.998574e-01, -8.743466e-~
$ Turnover_lag12 <dbl> -2.355895, -2.281065, -2.140701, -2.281065, -2.074676, -1.890850, -1.725136, -1.370672, -2.209420,~
$ Turnover_lag13 <dbl> -2.011144, -2.355895, -2.281065, -2.140701, -2.281065, -2.074676, -1.890850, -1.725136, -1.370672,~
$ Turnover_lag12_roll_3 <dbl> -2.216035, -2.259220, -2.234277, -2.165481, -2.082197, -1.896888, -1.662219, -1.768409, -1.884923,~
$ Turnover_lag13_roll_3 <dbl> -2.183520, -2.216035, -2.259220, -2.234277, -2.165481, -2.082197, -1.896888, -1.662219, -1.768409,~
$ Turnover_lag12_roll_6 <dbl> -2.213974, -2.190758, -2.170709, -2.065582, -1.913850, -1.925303, -1.890905, -1.901909, -1.921958,~
$ Turnover_lag13_roll_6 <dbl> -2.197201, -2.213974, -2.190758, -2.170709, -2.065582, -1.913850, -1.925303, -1.890905, -1.901909,~
$ Turnover_lag12_roll_9 <dbl> -2.190758, -2.147914, -2.095067, -2.014578, -2.036609, -2.005362, -1.989766, -1.975371, -1.948876,~
$ Turnover_lag13_roll_9 <dbl> -2.213974, -2.190758, -2.147914, -2.095067, -2.014578, -2.036609, -2.005362, -1.989766, -1.975371,~
$ Turnover_lag12_roll_12 <dbl> -2.095067, -2.014578, -2.034063, -2.037755, -2.046334, -2.046334, -2.020226, -2.000355, -1.984457,~
$ Turnover_lag13_roll_12 <dbl> -2.147914, -2.095067, -2.014578, -2.034063, -2.037755, -2.046334, -2.046334, -2.020226, -2.000355,~
The future dataset holds the 12 future (and unknown) values of Turnover per industry, i.e., the rows with the NA values introduced by the future frame (240 rows in total). We will predict and fill in these values in Part 6 with the final stacking ensemble model.
> future_tbl <- groups_fe_tbl %>%
filter(is.na(Turnover))
> future_tbl %>% head(n=12) %>% glimpse()
Rows: 12
Columns: 16
$ rowid <int> 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453
$ Month <date> 2019-01-01, 2019-02-01, 2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01, 2019-08-01, 20~
$ Industry <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, r~
$ Turnover <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ Month_sin12_K1 <dbl> 0.63846454, 0.93775213, 0.99301874, 0.80100109, 0.40982032, -0.10116832, -0.57126822, -0.90511451,~
$ Month_cos12_K1 <dbl> 0.76965124, 0.34730525, -0.11795672, -0.59866288, -0.91216627, -0.99486932, -0.82076344, -0.425167~
$ Turnover_lag12 <dbl> 1.125397, 1.282320, 1.569292, 1.425855, 1.437951, 1.429897, 1.550608, 1.469789, 1.457920, 1.469789~
$ Turnover_lag13 <dbl> 1.384894, 1.125397, 1.282320, 1.569292, 1.425855, 1.437951, 1.429897, 1.550608, 1.469789, 1.457920~
$ Turnover_lag12_roll_3 <dbl> 1.264204, 1.325670, 1.425822, 1.477699, 1.431234, 1.472819, 1.483431, 1.492773, 1.465833, 1.495272~
$ Turnover_lag13_roll_3 <dbl> 1.316081, 1.264204, 1.325670, 1.425822, 1.477699, 1.431234, 1.472819, 1.483431, 1.492773, 1.465833~
$ Turnover_lag12_roll_6 <dbl> 1.370952, 1.370952, 1.378452, 1.449320, 1.480565, 1.462003, 1.469326, 1.489352, 1.490694, 1.478711~
$ Turnover_lag13_roll_6 <dbl> 1.378274, 1.370952, 1.370952, 1.378452, 1.449320, 1.480565, 1.462003, 1.469326, 1.489352, 1.501243~
$ Turnover_lag12_roll_9 <dbl> 1.411416, 1.395927, 1.404907, 1.408445, 1.416559, 1.454825, 1.485468, 1.470874, 1.476502, 1.482009~
$ Turnover_lag13_roll_9 <dbl> 1.420563, 1.411416, 1.395927, 1.404907, 1.408445, 1.416559, 1.454825, 1.485468, 1.474990, 1.482009~
$ Turnover_lag12_roll_12 <dbl> 1.433627, 1.429420, 1.420139, 1.420139, 1.430152, 1.434573, 1.462680, 1.480716, 1.470874, 1.476502~
$ Turnover_lag13_roll_12 <dbl> 1.429496, 1.433627, 1.429420, 1.420139, 1.420139, 1.430152, 1.434266, 1.465153, 1.485468, 1.474990~
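Again, a quick optional check that the future dataset covers all industries:
# expect 240 rows in total: 12 future months x 20 industries
future_tbl %>%
  count(Industry)   # n = 12 for each Industry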
Train and test datasets split
We perform the split with the timetk::time_series_split() function, which returns a split object (an R list) that can be further used for plotting both the training and the test datasets.
One important argument is assess: it determines the length of the test dataset. As per [2], "The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test dataset should ideally be at least as large as the maximum forecast horizon required." Since there are 428 observations (months) per industry, 20% of 428 is 86 months (rounded up).
cumulative = TRUE means that the training dataset starts from the beginning (first row) of the prepared dataset.
> splits <- data_prepared_tbl %>%
time_series_split(Month,
assess = "86 months",
cumulative = TRUE)
# split object
> splits
<Analysis/Assess/Total>
<6840/1720/8560>
Which means:
6840 rows for the training dataset (all 20 industries). Here, we only care about the rows since the number of columns will be updated by the recipe (explained soon).
1720 rows for the test dataset (86 months x 20 industries).
8560 rows for data_prepared_tbl (all 20 industries).
You can confirm the sizes of the training and test datasets with the following code snippet, using the tk_time_series_cv_plan() function, which creates a new tibble with two new columns, .id and .key:
> splits %>%
tk_time_series_cv_plan() %>% glimpse()
Rows: 8,560
Columns: 18
$ .id <chr> "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slic~
$ .key <fct> training, training, training, training, training, training, training, training, training, train~
$ rowid <int> 14, 467, 920, 1373, 1826, 2279, 2732, 3185, 3638, 4091, 4544, 4997, 5450, 5903, 6356, 6809, 726~
$ Month <date> 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01~
$ Industry <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and takeaway food services", "C~
$ Turnover <dbl> -2.042611, -1.866127, -1.383977, -1.533674, -1.772910, -1.575604, -1.941481, -1.635071, -1.9995~
$ Month_sin12_K1 <dbl> 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.51455~
$ Month_cos12_K1 <dbl> 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-0~
$ Turnover_lag12 <dbl> -2.355895, -2.397538, -1.674143, -1.891997, -1.973832, -1.550576, -2.113506, -2.109525, -2.2924~
$ Turnover_lag13 <dbl> -2.011144, -2.195619, -1.713772, -1.891997, -2.044287, -1.680229, -2.080545, -2.035540, -2.4340~
$ Turnover_lag12_roll_3 <dbl> -2.216035, -2.299396, -1.771137, -1.971674, -2.053104, -1.602136, -2.091532, -2.110334, -2.4204~
$ Turnover_lag13_roll_3 <dbl> -2.183520, -2.296579, -1.693957, -1.891997, -2.009059, -1.615402, -2.097026, -2.072532, -2.3632~
$ Turnover_lag12_roll_6 <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.4065~
$ Turnover_lag13_roll_6 <dbl> -2.197201, -2.278793, -1.787835, -1.988232, -2.146638, -1.577044, -2.076792, -2.110131, -2.4117~
$ Turnover_lag12_roll_9 <dbl> -2.190758, -2.244673, -1.826687, -2.015706, -2.222243, -1.607126, -2.067327, -2.110334, -2.4193~
$ Turnover_lag13_roll_9 <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.4065~
$ Turnover_lag12_roll_12 <dbl> -2.095067, -2.152525, -1.819038, -1.978000, -2.167591, -1.593199, -2.043333, -2.048362, -2.3049~
$ Turnover_lag13_roll_12 <dbl> -2.147914, -2.200931, -1.828293, -2.002079, -2.217778, -1.606260, -2.056832, -2.089405, -2.3594~
> splits %>%
tk_time_series_cv_plan() %>% select(.id) %>% table()
Slice1
8560
> splits %>%
tk_time_series_cv_plan() %>% select(.key) %>% table()
training testing
6840 1720
Let us plot the training and test datasets for one Industry value with the plot_time_series_cv_plan()
function:
> splits %>%
tk_time_series_cv_plan() %>%
filter(Industry == Industries[1]) %>%
plot_time_series_cv_plan(.date_var = Month,
.value = Turnover,
.title = paste0("Split for ", Industries[1]))
You can extract the training and test datasets by applying the training() and testing() functions, respectively, on the splits object.
> training(splits) %>%
filter(Industry == Industries[1]) %>%
glimpse()
Rows: 342
Columns: 16
$ rowid <int> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,~
$ Month <date> 1983-05-01, 1983-06-01, 1983-07-01, 1983-08-01, 1983-09-01, 1983-10-01, 1983-11-01, 1983-12-01~
$ Industry <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes,~
$ Turnover <dbl> -2.0426107, -2.0426107, -1.9499231, -1.8620736, -1.9802554, -1.6990366, -1.4138382, -0.8757670,~
$ Month_sin12_K1 <dbl> 0.51455540, 0.87434662, 1.00000000, 0.86602540, 0.50000000, 0.01688948, -0.48530196, -0.8486442~
$ Month_cos12_K1 <dbl> 8.574571e-01, 4.853020e-01, 1.567970e-14, -5.000000e-01, -8.660254e-01, -9.998574e-01, -8.74346~
$ Turnover_lag12 <dbl> -2.355895, -2.281065, -2.140701, -2.281065, -2.074676, -1.890850, -1.725136, -1.370672, -2.2094~
$ Turnover_lag13 <dbl> -2.011144, -2.355895, -2.281065, -2.140701, -2.281065, -2.074676, -1.890850, -1.725136, -1.3706~
$ Turnover_lag12_roll_3 <dbl> -2.216035, -2.259220, -2.234277, -2.165481, -2.082197, -1.896888, -1.662219, -1.768409, -1.8849~
$ Turnover_lag13_roll_3 <dbl> -2.183520, -2.216035, -2.259220, -2.234277, -2.165481, -2.082197, -1.896888, -1.662219, -1.7684~
$ Turnover_lag12_roll_6 <dbl> -2.213974, -2.190758, -2.170709, -2.065582, -1.913850, -1.925303, -1.890905, -1.901909, -1.9219~
$ Turnover_lag13_roll_6 <dbl> -2.197201, -2.213974, -2.190758, -2.170709, -2.065582, -1.913850, -1.925303, -1.890905, -1.9019~
$ Turnover_lag12_roll_9 <dbl> -2.190758, -2.147914, -2.095067, -2.014578, -2.036609, -2.005362, -1.989766, -1.975371, -1.9488~
$ Turnover_lag13_roll_9 <dbl> -2.213974, -2.190758, -2.147914, -2.095067, -2.014578, -2.036609, -2.005362, -1.989766, -1.9753~
$ Turnover_lag12_roll_12 <dbl> -2.095067, -2.014578, -2.034063, -2.037755, -2.046334, -2.046334, -2.020226, -2.000355, -1.9844~
$ Turnover_lag13_roll_12 <dbl> -2.147914, -2.095067, -2.014578, -2.034063, -2.037755, -2.046334, -2.046334, -2.020226, -2.0003~
It is important to understand that the actual training dataset used to fit the model will be the output of the recipe, as explained hereafter.
Create recipe
As stated in the tidymodels Get Started tutorial, before training the model we can use a recipe to create (new) predictors and conduct the preprocessing required by the model.
recipe(Turnover ~ ., data = training(splits))
"The recipe()
function as we used it here has two arguments:
A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (here,
Turnover
). On the right-hand side of the tilde are the predictors. Variables may be listed by name, or you can use the dot (.) to indicate all other variables as predictors.The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so data =
training(splits)
here. Naming a data set doesn’t actually change the data itself; it is only used to catalog the names of the variables and their types, like factors, integers, dates, etc."
For each preprocessing and feature engineering step explained in Part 1 there is a specific step_<>() function:
add calendar-based features: step_timeseries_signature()
remove columns: step_rm()
perform one-hot encoding on categorical variables: step_dummy()
normalize numeric variables: step_normalize()
recipe_spec <- recipe(Turnover ~ ., data = training(splits)) %>%
update_role(rowid, new_role = "indicator") %>%
step_other(Industry) %>%
step_timeseries_signature(Month) %>%
step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
step_normalize(Month_index.num, Month_year)
Notice the two very useful functions explained hereafter:
update_role(rowid, new_role = "indicator"): it lets recipes know that the rowid variable has a custom role called "indicator" (a role can have any character value). This tells the recipe to keep the rowid variable without using it as either an outcome or a predictor.
step_other(Industry): in case the training dataset does not include all values of Industry (a categorical variable), it pools the infrequently occurring values into an "other" category.
Below we display the recipe's spec, its summary, and finally the (real) training dataset:
> recipe_spec
Recipe
Inputs:
role #variables
indicator 1
outcome 1
predictor 14
Operations:
Collapsing factor levels for Industry
Timeseries signature features from Month
Delete terms matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")
Dummy variables from all_nominal()
Centering and scaling for Month_index.num, Month_year
# Recipe summary: 50 predictors (including all dummy variables) + 1 indicator + 1 outcome
> recipe_spec %>%
+ prep() %>%
+ summary()
# A tibble: 52 x 4
variable type role source
<chr> <chr> <chr> <chr>
1 rowid numeric indicator original
2 Month date predictor original
3 Month_sin12_K1 numeric predictor original
4 Month_cos12_K1 numeric predictor original
5 Turnover_lag12 numeric predictor original
6 Turnover_lag13 numeric predictor original
7 Turnover_lag12_roll_3 numeric predictor original
8 Turnover_lag13_roll_3 numeric predictor original
9 Turnover_lag12_roll_6 numeric predictor original
10 Turnover_lag13_roll_6 numeric predictor original
# ... with 42 more rows
# Extracted training dataset
> recipe_spec %>%
prep() %>%
juice() %>%
glimpse()
Rows: 6,840
Columns: 52
$ rowid <int> 14, 467, 920, 1373, 1826, 2279, 2732, 3185,~
$ Month <date> 1983-05-01, 1983-05-01, 1983-05-01, 1983-0~
$ Month_sin12_K1 <dbl> 0.5145554, 0.5145554, 0.5145554, 0.5145554,~
$ Month_cos12_K1 <dbl> 8.574571e-01, 8.574571e-01, 8.574571e-01, 8~
$ Turnover_lag12 <dbl> -2.355895, -2.397538, -1.674143, -1.891997,~
$ Turnover_lag13 <dbl> -2.011144, -2.195619, -1.713772, -1.891997,~
$ Turnover_lag12_roll_3 <dbl> -2.216035, -2.299396, -1.771137, -1.971674,~
$ Turnover_lag13_roll_3 <dbl> -2.183520, -2.296579, -1.693957, -1.891997,~
$ Turnover_lag12_roll_6 <dbl> -2.213974, -2.275130, -1.824439, -2.023204,~
$ Turnover_lag13_roll_6 <dbl> -2.197201, -2.278793, -1.787835, -1.988232,~
$ Turnover_lag12_roll_9 <dbl> -2.190758, -2.244673, -1.826687, -2.015706,~
$ Turnover_lag13_roll_9 <dbl> -2.213974, -2.275130, -1.824439, -2.023204,~
$ Turnover_lag12_roll_12 <dbl> -2.095067, -2.152525, -1.819038, -1.978000,~
$ Turnover_lag13_roll_12 <dbl> -2.147914, -2.200931, -1.828293, -2.002079,~
$ Turnover <dbl> -2.042611, -1.866127, -1.383977, -1.533674,~
$ Month_index.num <dbl> -1.727262, -1.727262, -1.727262, -1.727262,~
$ Month_year <dbl> -1.710287, -1.710287, -1.710287, -1.710287,~
$ Month_half <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ Month_quarter <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
$ Month_month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5~
$ Industry_Cafes..restaurants.and.catering.services <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Cafes..restaurants.and.takeaway.food.services <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Clothing.retailing <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Clothing..footwear.and.personal.accessory.retailing <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Department.stores <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Electrical.and.electronic.goods.retailing <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Food.retailing <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Footwear.and.other.personal.accessory.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Furniture..floor.coverings..houseware.and.textile.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ Industry_Hardware..building.and.garden.supplies.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0~
$ Industry_Household.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
$ Industry_Liquor.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
$ Industry_Newspaper.and.book.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
$ Industry_Other.recreational.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
$ Industry_Other.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
$ Industry_Other.retailing.n.e.c. <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Other.specialised.food.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Pharmaceutical..cosmetic.and.toiletry.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Supermarket.and.grocery.stores <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Takeaway.food.services <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_01 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_02 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_03 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_04 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_05 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ Month_month.lbl_06 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_07 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_08 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_09 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
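While juice() extracts the preprocessed training dataset, the same preprocessing is applied to the test dataset with recipes' bake() function; a quick optional check:
recipe_spec %>%
  prep() %>%
  bake(new_data = testing(splits)) %>%
  glimpse()   # expect the same 52 columns and 1,720 rows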
As for the standardization parameters, we also save the normalization parameters for the Month_index.num and Month_year columns transformed by step_normalize() (which, per the recipe output above, centers and scales them). What we save are the original min and max values of these columns prior to preprocessing.
To get these original min and max values, you can customize the skim() function as I did in Part 1 with my myskim() function and apply it as shown below:
recipe(Turnover ~ ., data = training(splits)) %>%
step_timeseries_signature(Month) %>%
prep() %>%
juice() %>%
myskim()
# Extract of the output for the sake of clarity:
... ...
-- Variable type: numeric ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 max min
... ...
15 Month_index.num 0 1 8.69e+8 259648888. 420595200 644198400 8.69e+8 1.09e+9 1317427200 1317427200 420595200
16 Month_year 0 1 2.00e+3 8.23 1983 1990 2.00e+3 2.00e+3 2011 2011 1983
... ...
# Month_index.num normalization Parameters
Month_index.num_limit_lower = 420595200
Month_index.num_limit_upper = 1317427200
# Month_year normalization Parameters
Month_year_limit_lower = 1983
Month_year_limit_upper = 2011
Save your work
We will keep all data, recipes, splits, parameters and any other useful stuff in a list and save it as an RDS file.
feature_engineering_artifacts_list <- list(
# Data
data = list(
data_prepared_tbl = data_prepared_tbl,
future_tbl = future_tbl,
industries = Industries
),
# Recipes
recipes = list(
recipe_spec = recipe_spec
),
# Splits
splits = splits,
# Inversion Parameters
standardize = list(
std_mean = std_mean,
std_sd = std_sd
),
normalize = list(
Month_index.num_limit_lower = Month_index.num_limit_lower,
Month_index.num_limit_upper = Month_index.num_limit_upper,
Month_year_limit_lower = Month_year_limit_lower,
Month_year_limit_upper = Month_year_limit_upper
)
)
feature_engineering_artifacts_list %>%
write_rds("feature_engineering_artifacts_list.rds")
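In Part 3 we will read this file back; a one-line sketch:
artifacts <- read_rds("feature_engineering_artifacts_list.rds")
artifacts$recipes$recipe_spec   # e.g., retrieve the recipe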
Conclusion
In this article I have explained how to perform time series feature engineering with recipes thanks to the timetk package.
The recipe performed preprocessing, transformations and feature engineering applicable to all 20 time series (panel data). For operations such as measure standardization and the addition of lags, rolling lags and Fourier terms, we needed to code the feature engineering by group (here, by Industry).
The recipe is an important step in the data science process with modeltime since it will be added to a workflow along with a model; this will be explained in my next article (Part 3).
Note: I haven't shown all the power of recipes; for example, model-specific recipes can be built from a "base" recipe, which can then be modified to adapt to each model's requirements. Remember, it is important to experiment with different feature-engineered datasets, and creating interdependent recipes can be very useful.
References
[1] Dancho, Matt, "DS4B 203-R: High-Performance Time Series Forecasting", Business Science University.
[2] Hyndman, R.J., and Athanasopoulos, G., "Forecasting: Principles and Practice", 3rd edition, OTexts.