A guide on using different datasets [LISS]

dataset

guide

Here is some useful information on how to use the different PreFer datasets on offer

Authors

Gert Stulp

Lisa Sivak

Published

March 23, 2024

Here we describe how you can (maybe) fruitfully combine the different datasets PreFer_train_data.csv, PreFer_train_background_data.csv, and PreFer_train_supplementary_data.csv. (here you can find a guide on these different datasets)

Below it is assumed that you have these datasets in your working directory.

Reading-in different datasets

PreFer_train_data.csv

The most important dataset of the PreFer data challenge is the PreFer_train_data.csv dataset. This contains data on all LISS respondents who were between the ages of 18 and 45 in 2020 (i.e., had birthyears between 1975 and 2002). Click here for an explanation on how to quickly read-in the data

library(data.table) # requires install.packages("data.table") first
data <- data.table::fread("path/to/folder/PreFer_train_data.csv", 
                          keepLeadingZeros = TRUE, # if FALSE adds zeroes to some dates
                          data.table = FALSE) # returns a data.frame object rather than data.table

PreFer_train_background_data.csv

The PreFer_background_data.csv contains data on particularly variables on a monthly basis for each respondent.

background <- data.table::fread("path/to/folder/PreFer_train_background_data.csv", 
                                keepLeadingZeros = TRUE, 
                                data.table = FALSE)

PreFer_train_supplementary_data.csv

The PreFer_train_supplementary_data.csv contains data on LISS respondents that did not fall in our target group, i.e., had birthyears lower than 1975 or higher than 2002.

supplement <- data.table::fread("path/to/folder/PreFer_train_supplementary_data.csv", 
                                keepLeadingZeros = TRUE, 
                                data.table = FALSE)

How to usefully combine datasets

Using information from background data

One could use the background data to enhance the train data (PreFer_train_data.csv). For example, imagine that you would want to add the average income of respondents in the last three months prior to 2021. Here is what one could add to the submission.R script:

library(dplyr) # for data wrangling

# creating average income for all respondents
income_3months <- background |>
  # only select last 3 months
  filter(wave >= 202010 & wave <= 202012) |>
  # for each respondent
  group_by(nomem_encr) |>
  # calculate average income
  summarise(mean_income = mean(netinc, na.rm = TRUE))

# add income to train data
data <- left_join(data, income_3months, by = "nomem_encr")

data <- data |>
  # impute mean income if missing
  mutate(mean_income_imp = if_else(is.na(mean_income), 
                                   mean(mean_income, na.rm = TRUE),
                                   mean_income))

You can now make use of the variable mean_income_imp in your train_save_model-file. To make this work don’t forget to add "dplyr" to packages.R! Here you can see it in action and that it passes the check.

Using information from the supplementary data

Using information from the supplementary data is slightly more complex. Remember that the supplementary data (PreFer_train_supplementary_data.csv) consists of all people who are not in the target group (i.e., it consists of respondents with birthyears prior to 1975 and post 2002). In contrast to the background data, where a PreFer_train_background_data.csv is available (to participants) and where aPreFer_holdout_background_data.csv exists (but is not available to PreFer participants), there is no equivalent of the supplementary data for the holdout data. This means that we cannot use any code to process the supplementary data in the function clean_df, because everything inserted into clean_df will be applied to holdout data. Thus, everything you do with the supplementary data must be specified in the training.R file.

Imagine that you also want to include people who were born in 1974 in your train data. You could insert something like the below in your train_save_model function:

library(dplyr) # for data wrangling

# selecting only respondents from 1974
respondents_1974 <- supplement |>
  filter(birthyear_bg == 1974) 

# add respondents from 1974
data <- bind_rows(data, respondents_1974)

And then use your usual method to train on the data. This will work because your method will make use of these new cases, but the structure of the dataset has not changed (no new variables were added). If you would create a novel variable on the basis of the supplementary data and use this variable in your model, it would not work on the holdout data, because that variable will not exist in the holdout data. Here you can see it in action and that it passes the check.