library(data.table) # requires install.packages("data.table") first
<- data.table::fread("path/to/folder/PreFer_train_data.csv",
data keepLeadingZeros = TRUE, # if FALSE adds zeroes to some dates
data.table = FALSE) # returns a data.frame object rather than data.table
A guide on using different datasets [LISS]
Here is some useful information on how to use the different PreFer datasets on offer
Here we describe how you can (maybe) fruitfully combine the different datasets PreFer_train_data.csv
, PreFer_train_background_data.csv
, and PreFer_train_supplementary_data.csv
. (here you can find a guide on these different datasets)
Below it is assumed that you have these datasets in your working directory.
Reading-in different datasets
PreFer_train_data.csv
The most important dataset of the PreFer data challenge is the PreFer_train_data.csv
dataset. This contains data on all LISS respondents who were between the ages of 18 and 45 in 2020 (i.e., had birthyears between 1975 and 2002). Click here for an explanation on how to quickly read-in the data
PreFer_train_background_data.csv
The PreFer_background_data.csv
contains data on particularly variables on a monthly basis for each respondent.
<- data.table::fread("path/to/folder/PreFer_train_background_data.csv",
background keepLeadingZeros = TRUE,
data.table = FALSE)
PreFer_train_supplementary_data.csv
The PreFer_train_supplementary_data.csv
contains data on LISS respondents that did not fall in our target group, i.e., had birthyears lower than 1975 or higher than 2002.
<- data.table::fread("path/to/folder/PreFer_train_supplementary_data.csv",
supplement keepLeadingZeros = TRUE,
data.table = FALSE)
How to usefully combine datasets
Using information from background data
One could use the background data to enhance the train data (PreFer_train_data.csv
). For example, imagine that you would want to add the average income of respondents in the last three months prior to 2021. Here is what one could add to the submission.R
script:
library(dplyr) # for data wrangling
# creating average income for all respondents
<- background |>
income_3months # only select last 3 months
filter(wave >= 202010 & wave <= 202012) |>
# for each respondent
group_by(nomem_encr) |>
# calculate average income
summarise(mean_income = mean(netinc, na.rm = TRUE))
# add income to train data
<- left_join(data, income_3months, by = "nomem_encr")
data
<- data |>
data # impute mean income if missing
mutate(mean_income_imp = if_else(is.na(mean_income),
mean(mean_income, na.rm = TRUE),
mean_income))
You can now make use of the variable mean_income_imp
in your train_save_model
-file. To make this work don’t forget to add "dplyr"
to packages.R
! Here you can see it in action and that it passes the check.
Using information from the supplementary data
Using information from the supplementary data is slightly more complex. Remember that the supplementary data (PreFer_train_supplementary_data.csv
) consists of all people who are not in the target group (i.e., it consists of respondents with birthyears prior to 1975 and post 2002). In contrast to the background data, where a PreFer_train_background_data.csv
is available (to participants) and where aPreFer_holdout_background_data.csv
exists (but is not available to PreFer participants), there is no equivalent of the supplementary data for the holdout data. This means that we cannot use any code to process the supplementary data in the function clean_df
, because everything inserted into clean_df
will be applied to holdout data. Thus, everything you do with the supplementary data must be specified in the training.R file.
Imagine that you also want to include people who were born in 1974 in your train data. You could insert something like the below in your train_save_model
function:
library(dplyr) # for data wrangling
# selecting only respondents from 1974
<- supplement |>
respondents_1974 filter(birthyear_bg == 1974)
# add respondents from 1974
<- bind_rows(data, respondents_1974) data
And then use your usual method to train on the data. This will work because your method will make use of these new cases, but the structure of the dataset has not changed (no new variables were added). If you would create a novel variable on the basis of the supplementary data and use this variable in your model, it would not work on the holdout data, because that variable will not exist in the holdout data. Here you can see it in action and that it passes the check.