Preparing your submission
You can use either R or Python to develop your method (i.e. train your model). Here you can read how to prepare a submission and submit your method.
The submission (i.e. the scripts for preprocessing the data and training the model, and the model itself) should be in either Python or R; otherwise it will not run on the holdout data.
Yes, in Phase 1 (and Track 3 of Phase 2) you can use the LISS data in such a shared cloud environment, but everyone who will have access to the data needs to have signed the data agreement, and the data must not be publicly accessible.
It is possible if this data is publicly available.
The data must be from 2020 or earlier, because the task is to predict fertility outcomes in 2021-2023 based on information from earlier years. If you use more recent data (from 2021 onwards), the result may not be evaluated in a standard way and may not appear on the leaderboard. Exceptions are possible depending on the data.
If you plan to use any external data, please contact us.
Yes, you can use the codebooks. You need to add the codebook (or both codebooks) to your repository (root folder), and then you can just read the file(s) inside the function (i.e. not from an argument) the way you would normally read a CSV file.
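For example, in Python this could look roughly as follows. This is a minimal sketch: the codebook file name, the function signature, and the codebook column names are illustrative, so adapt them to your repository and to the actual codebook structure.

```python
import pandas as pd

def predict_outcomes(df, background_df=None, model_path="model.joblib"):
    # Read the codebook from the repository root. "PreFer_codebook.csv" is an
    # example name: use the file name of the codebook you actually committed.
    codebook = pd.read_csv("PreFer_codebook.csv")

    # Illustrative use, assuming the codebook has "var_name" and "type_var"
    # columns: look up which variables are numeric.
    numeric_vars = codebook.loc[codebook["type_var"] == "numeric", "var_name"]

    # ... then preprocess df, load the model, and compute predictions ...
    predictions = ...  # placeholder for your existing prediction logic
    return predictions
```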
If you are using R, you additionally need to check the ‘r.Dockerfile’ file in your repository: check whether the line ‘COPY *.csv /app’ is there. If not, please add it after the other lines with ‘COPY’, as in the sketch below.
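For reference, the relevant part of ‘r.Dockerfile’ might then look roughly like this (the surrounding COPY lines are illustrative and may differ in your repository):

```dockerfile
# Existing COPY lines (illustrative):
COPY submission.R /app
COPY model.rds /app
# Added line, so that the codebook CSV file(s) are copied into the container:
COPY *.csv /app
```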
Yes, you can use other formats. In this case, you’ll need to 1) change the way you load the model (and any additional files needed to run it) in the submission.py script, in the ‘predict_outcomes’ function (currently a model in the .joblib format is loaded there), and 2) update the list of packages in environment.yml (don’t forget any packages that you use).
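For instance, if you saved your model with Python’s built-in pickle module instead of joblib, the loading step inside ‘predict_outcomes’ could be adapted roughly like this (a sketch; the function signature shown may differ from the one in your copy of submission.py):

```python
import pickle

def predict_outcomes(df, background_df=None, model_path="model.pkl"):
    # Load a model stored with pickle instead of joblib.
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    # ... preprocess df and return predictions as before ...
    predictions = ...  # placeholder for your existing prediction logic
    return predictions
```

Since pickle is part of the Python standard library, it needs no extra entry in environment.yml; a format tied to a specific library (for example, an XGBoost model stored via its own save_model method) does require adding that library to environment.yml.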
The Python version is 3.11.7 and the R version is 4.3.3.
It’s better to stick to the same version unless that’s the only way to use a particular function or package (for example, if the package is only supported in an older version of Python).
To use another version of Python, edit the file ‘python.Dockerfile’ in your repository: replace ‘anaconda3:2024.02-1’ in the first line with the tag of the Anaconda release that uses the Python version you need. First, check which Python versions are used in different Anaconda releases here. Then go here and find the tag for this version. For example, the tag for Anaconda 2020.02 is ‘anaconda3:2020.02’. Then replace the tag in ‘python.Dockerfile’, so that the first line looks like this: FROM continuumio/anaconda3:2020.02
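Concretely, the change to the first line of ‘python.Dockerfile’ would look like this (using Anaconda 2020.02 as the example release):

```dockerfile
# Before:
#   FROM continuumio/anaconda3:2024.02-1
# After:
FROM continuumio/anaconda3:2020.02
```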
The Docker tag should be similar to the one currently used, i.e. without suffixes such as ‘-alpine’.
Submitting your model
You can make as many submissions via the Next platform as you like, but only your most recent submission at the deadline will be assessed for the leaderboard.
Yes, you can make a submission via the Next platform in advance, and then update your submission if you find a better model.
How to update your submission: you will see an “Update” button on the Submit page after you submit something. To change your submission, replace the link to your commit and press this “Update” button.
It is not obligatory to make a submission every time before we publish an intermediate leaderboard. An intermediate leaderboard is just an opportunity for you to see how your model is performing on the holdout set.
Make sure that you submit at least once before the final submission deadline in Phase 1: June 3, 12:00pm (noon) CEST.
There are several reasons why we want participants’ code and models, rather than just their predictions. First, the goal of this data challenge is primarily scientific, and the organisers (and future researchers) should be able to reproduce all analyses. This is particularly important because a previous data challenge revealed problems with (computational) reproducibility. Second, we will run the code and models on different variants of the data for robustness checks and to delve into substantive problems (e.g., how well the models predict the outcome one year later).
We have chosen to provide prediction scores and rankings only at set times (see here), for the following reason: the sample size of the LISS panel for our target group is rather limited. We therefore chose to separate the data into a training set and a holdout set, and did not opt for an additional validation set because that would come at a sample-size cost for the training data. We will not provide continuous prediction scores based on the holdout data because this would lead to overfitting on the holdout set.
A commit doesn’t only include the changes to a certain file: it captures the state of your whole repository at that moment in time. After you have changed all the files that you needed to change, copy the link to the last commit and send it as explained here.
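If you are unsure which commit is the most recent one, you can check from the command line. The URL shape below is illustrative and assumes your repository is hosted on GitHub:

```bash
git log -1 --format=%H    # prints the full hash of your most recent commit
# The link to that commit then typically has the form:
# https://github.com/<username>/<repository>/commit/<full-hash>
```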
General
Before you make a new submission, you should once again fill in the form about team membership that we sent.
Everyone who has engaged in the data challenge and submitted predictions through its procedures will be invited to contribute to a “community paper” on the results of the data challenge. You can also consider writing up your (team’s) results and submitting them to the special issue; these submissions will be peer-reviewed. Instructions on how to do this will follow after the data challenge. If you want to publish your results separately from the special issue, you can do so, but only after the community paper and the special issue are fully published (as you will have agreed to by signing the data agreement).
You can start with [this review by Balbo et al.](https://link.springer.com/article/10.1007/s10680-012-9277-y){target="_blank"}. Keep in mind, though, that the most studied factors are not necessarily the best predictors.
You can contact Elizaveta Sivak and Gert Stulp at preferdatachallenge [at] gmail [dot] com for any questions and suggestions you might have.
Application
Yes. Participating in the challenge is a good opportunity to gain experience in machine learning.
Yes, we welcome participants from various backgrounds who may take different approaches to developing their predictive models. Knowledge of the factors found to be significantly associated with fertility is not required. However, if you want to get a general overview of the factors influencing fertility behaviour you can read this review (Balbo et al. 2013).
Yes, it is possible. As in any other case, all team members must submit applications to participate.
Yes, you can participate in the challenge and then describe your approach in your thesis. We can’t help supervise it, though: as organisers we will decide on the winners, so we cannot participate ourselves.
With the example scripts and codebooks that we will provide, you will be able to produce and submit a basic method in several hours. Trying different methods to improve the predictions can take several weeks. However, any contribution is valuable. Consider applying even if you can dedicate only a few days.
Ethics
Although our aim is to make the best possible predictions for all individuals, we are not interested in individual outcomes. Both LISS panel data and the datasets based on the Dutch registers are anonymized, and precautions are taken to prevent de-anonymization (e.g. full dates of birth or exact income are not available; in the LISS datasets, sensitive information is removed from open answers, etc.). De-anonymization of the Dutch registers data via linking with external datasets is also highly unlikely as linking with external data can only be done by Statistics Netherlands employees and is only possible after the external data is checked for potential risks of de-anonymization. To summarize, it is highly unlikely that during the challenge current (or predictions of the future) fertility outcomes of specific persons will be revealed.
If policies are developed based on the results (e.g., if particular groups of people will be identified who have fewer children than desired due to particular circumstances) these policies will be targeted at groups rather than individuals.
Developing models that predict individual-level fertility outcomes and then comparing and interpreting these models is a way of learning more about fertility behaviour. This approach has several advantages over the more traditional methods of studying fertility in the social sciences, which, for example, do not allow comparing the importance of different factors. In the preprint we describe these advantages in more detail.
We think that the risk of companies improving their targeting based on the models produced in this data challenge is exceedingly small: if the predictions are accurate, it will be because we have thousands of variables (including values and preferences) that companies do not usually have and that are hard to obtain without people’s awareness.