For a general audience and the media
General
A data challenge is a competition where research teams try to predict a specific characteristic as accurately as possible using provided datasets. The winning team is chosen based on how closely their predictions match actual values, which are known only to the challenge organizers.
These competitions not only help identify the most accurate predictive models but also often drive scientific progress by refining predictive methods and uncovering new insights.
In the PreFer data challenge, researchers predicted who would have their first or a subsequent child in the next three years (2021-2023) based on information from previous years.
Birth rates are a key driver of population growth and age structure, with profound socio-economic implications. Developing accurate predictive models and identifying the most predictive factors can enhance our ability to forecast future birth rates. Identifying groups at higher risk of having few or no children in the coming years is also vital. This can help policymakers support individuals in achieving their desired family sizes and address involuntary childlessness.
PreFer is the first data challenge in demography, and it uniquely combines large-scale administrative data from the Dutch population registers with rich survey data from the LISS panel. By integrating these two data sources with advanced infrastructure provided by ODISSEI and partner organisations (such as the ODISSEI Secure Supercomputer supported by SURF), PreFer measures the current upper limits of the predictability of fertility outcomes. By comparing the performance of different methods and different data sources, this setting also helps us understand how the predictability of fertility outcomes can be improved, and where the limits of theory, methods, data, and infrastructure lie.
PreFer ended in October 2024, and we are currently analysing the results. We plan to publish a paper with the main results, as well as a special issue in the Journal of Computational Social Science, in 2025. In the meantime, we will present results at conferences (check our session at the ODISSEI conference for computational social science) and post updates on this website.
Ethics
Although our aim is to make the best possible predictions for all individuals, we are not interested in individual outcomes. Both PreFer data sources - the LISS panel data and the datasets based on the Dutch registers - are anonymised, and precautions are taken to prevent de-anonymisation (e.g. full dates of birth and exact incomes are not available, and in the LISS datasets sensitive information is removed from open answers). De-anonymisation of the Dutch register data via linking with external datasets is also highly unlikely: such linking can only be done by Statistics Netherlands employees, and only after the external data have been checked for potential de-anonymisation risks. In short, it is highly unlikely that current fertility outcomes of specific persons (or predictions of their future outcomes) will be revealed during the challenge. If policies are developed based on the results (e.g. if particular groups are identified who have fewer children than desired due to particular circumstances), these policies will be targeted at groups rather than individuals.
Group-level predictions may overlook important variability within subgroups. By predicting fertility outcomes at the individual level, we can better analyze predictive performance across diverse subgroups, potentially enabling more targeted interventions. Identifying subgroups with lower predictability can highlight areas where our understanding of fertility behavior is limited or where important data is missing.
We think that the risk of companies improving their targeting based on the models produced in this data challenge is exceedingly small: if the predictions are accurate, it will be because we have thousands of variables (including values and preferences) that companies do not usually have and that are hard to obtain without people's awareness.
Access to Dutch registry data is highly restricted. It is available only to authorized universities, scientific organizations, public policy institutions, and research institutes in the Netherlands and certain other EU countries, and can only be used for scientific (non-commercial) purposes under stringent conditions. Data access is granted exclusively to researchers affiliated with these organizations, who must first pass an awareness test.
All data analysis is conducted in a secure environment without internet access, and it is not possible to download the data onto personal computers. Only aggregated statistics, such as accuracy measures, can be exported, and only after verification by Statistics Netherlands (CBS) employees. Individual-level data, even statistical estimates such as an individual's estimated probability of having a(nother) child, can never be exported from the environment.
For PreFer participants
Preparing your submission
You can use either R or Python to develop your method (i.e. train your model). Here you can read how to prepare a submission and submit your method.
The submission (i.e. the scripts for preprocessing the data and training the model, plus the model itself) must be in either Python or R; otherwise it will not run on the holdout data.
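For Python, a minimal submission could look like the sketch below. This is an illustration only: the exact function signature and required columns come from the official template, and the feature names here are placeholders.

```python
import pandas as pd
from joblib import load

def predict_outcomes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a 0/1 prediction for each individual in df."""
    # Load the model trained earlier, e.g. saved with
    # joblib.dump(model, "model.joblib")
    model = load("model.joblib")

    # Placeholder preprocessing: select the features the model was trained on
    features = df[["age", "partner"]]  # hypothetical feature names

    return pd.DataFrame({
        "nomem_encr": df["nomem_encr"],  # LISS person identifier
        "prediction": model.predict(features),
    })
```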
Yes, in Phase 1 (and track 3 of Phase 2) you can use the LISS data in such a shared cloud environment, but everyone who will have access to the data must have signed the data agreement, and the data cannot be made available to everyone.
It is possible if this data is publicly available.
The data must be from 2020 or earlier, because the task is to predict fertility outcomes in 2021-2023 based on information from earlier years. If you use more recent data (from 2021 onwards), the results may not be evaluated in the standard way and you may not appear on the leaderboard. Exceptions are possible depending on the data.
If you plan to use any external data, please contact us.
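As an illustration, restricting an approved external dataset to pre-2021 information could look like this in Python (the filename and the 'year' column are hypothetical):

```python
import pandas as pd

# Hypothetical external dataset with one row per unit per year
external = pd.read_csv("external_data.csv")

# Keep only information from 2020 or earlier, per the challenge rules
external = external[external["year"] <= 2020]
```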
Yes, you can use the codebooks. Add the codebook (or both codebooks) to the root folder of your repository; you can then read the file(s) inside the function (i.e. not via an argument), the way you would normally read a CSV file.
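In Python, for example, this could look as follows (the helper function and the codebook filename are illustrative):

```python
import pandas as pd

def preprocess(df):  # hypothetical helper called from your submission
    # The codebook is in the repository root, so it can be read here
    # directly rather than being passed in as a function argument
    codebook = pd.read_csv("PreFer_codebook.csv")  # illustrative filename
    # ... use the codebook, e.g. to select or recode variables ...
    return df
```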
If you are using R, you additionally need to check the 'r.Dockerfile' file in your repository: check whether the line 'COPY *.csv /app' is there. If not, add it after the other 'COPY' lines.
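After the change, the COPY section of 'r.Dockerfile' would look roughly like this (the other COPY lines are whatever your repository already contains; the first two shown here are illustrative):

```dockerfile
COPY *.R /app
COPY *.rds /app
COPY *.csv /app
```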
Yes, you can use other formats. In that case, you'll need to 1) change the way the model (and any additional files needed to run it) is loaded in the submission.py script (in the 'predict_outcomes' function), where a model in the .joblib format is currently loaded, and 2) update the list of packages in environment.yml, making sure to include every package you use.
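For example, if you saved a LightGBM model in its native text format instead of .joblib, the loading step might change along these lines (a sketch with illustrative filenames; lightgbm would then also need to be listed in environment.yml):

```python
import lightgbm as lgb

def predict_outcomes(df):
    # Load a LightGBM model saved earlier with model.save_model("model.txt"),
    # replacing the template's joblib-based loading
    model = lgb.Booster(model_file="model.txt")
    # ... prepare features and return predictions as before ...
    ...
```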
The Python version is 3.11.7 and the R version is 4.3.3.
It's better to stick to the same version unless that's the only way to use a particular function or package (for example, if the package is only supported in an older version of Python).
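If you are unsure which version you are developing against locally, you can check it from within Python (in R, `R.version.string` gives the equivalent):

```python
import platform

# Should ideally print 3.11.7, matching the challenge environment
print(platform.python_version())
```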
To use another version of Python, edit the 'python.Dockerfile' file in your repository: replace 'anaconda3:2024.02-1' in the first line with the tag of the Anaconda release that ships the Python version you need. First, check here which Python versions the different Anaconda releases use. Then go here and find the tag for that release. For example, the tag for Anaconda 2020.02 is 'anaconda3:2020.02', so the first line of 'python.Dockerfile' becomes: FROM continuumio/anaconda3:2020.02
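Put together, the change to the first line of 'python.Dockerfile' would look like this (using the example tag from above):

```dockerfile
# Before:
FROM continuumio/anaconda3:2024.02-1
# After (example for Anaconda 2020.02):
FROM continuumio/anaconda3:2020.02
```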
The Docker tag should be similar to the one currently used, so no suffixes such as '-alpine'.
Submitting your model
You can submit via the Next platform as many times as you like, but only your most recent submission at the leaderboard deadline will be assessed.
Yes, you can make a submission via the Next platform in advance, and then update your submission if you find a better model.
How to update your submission: after you have submitted something, you will see an “Update” button on the Submit page. To change your submission, replace the link to your commit and press this “Update” button.
It is not obligatory to make a submission every time before we publish an intermediate leaderboard. An intermediate leaderboard is just an opportunity for you to see how your model is performing on the holdout set.
Make sure that you submit at least once before the final submission deadline of Phase 1: June 3, 12:00 pm (noon) CEST.
There are several reasons why we want participants' code and models rather than just their predictions. First, the goal of this data challenge is primarily scientific, and the organisers (and future researchers) should be able to reproduce all analyses. This is particularly important because a previous data challenge revealed problems with (computational) reproducibility. Second, we will run the code and models on different variants of the data for robustness checks and to delve into substantive questions (e.g., how well the models predict the outcome one year later).
We have chosen to provide prediction scores and rankings only at set times (see here). The reason is that the sample size of the LISS panel for our target group is rather limited. We therefore chose to split the data into a training set and a holdout set, and did not opt for an additional validation set because that would have come at a sample-size cost for the training data. We do not provide continuous prediction scores based on the holdout data because this would lead to overfitting.
A commit doesn't only include the changes to a particular file; it captures the state of your whole repository at that moment in time. Once you have changed all the files you needed to change, copy the link to the last commit and send it as explained here.
General questions about participation
You should fill in again the team membership form that we sent before you make a new submission.
Everyone who has engaged in the data challenge and submitted predictions through its procedures will be invited to contribute to a “community paper” on the results of the data challenge. You can also consider writing up your (team's) results and submitting them to the special issue; these submissions will be peer-reviewed. Instructions on how to do this will follow after the data challenge. If you want to publish your results separately from the special issue, you may do so, but only after the community paper and the special issue are fully published (as you will have agreed to by signing the data agreement).
You can start with [this review by Balbo et al.](https://link.springer.com/article/10.1007/s10680-012-9277-y){target="_blank"}. Keep in mind, though, that the most studied factors are not necessarily the best predictors.
Yes. Participating in the challenge is a good opportunity to gain experience in machine learning.
Yes, we welcome participants from various backgrounds who may take different approaches to developing their predictive models. Knowledge of the factors found to be significantly associated with fertility is not required. However, if you want a general overview of the factors influencing fertility behaviour, you can read this review (Balbo et al. 2013).
Yes, it is possible. As in any other case, all team members must submit applications to participate.
Yes, you can participate in the challenge and then describe your approach in your thesis. We can't help supervise it, though: as organisers we will decide on the winners, so we cannot participate ourselves.
With the example scripts and codebooks that we will provide, you will be able to produce and submit a basic method in several hours. Trying different methods to improve the predictions can take several weeks. However, any contribution is valuable. Consider applying even if you can dedicate only a few days.
You can contact Elizaveta Sivak and Gert Stulp at preferdatachallenge [at] gmail [dot] com for any questions and suggestions you might have.