# predictions_df = data.frame with two columns; nomem_encr and predictions
# ground_truth_df = data.frame with two columns; nomem_encr and new_child

score <- function(predictions_df, ground_truth_df) {

  merged_df = merge(predictions_df, ground_truth_df, by = "nomem_encr",
                    all.y = TRUE) # right join

  # Calculate accuracy
  accuracy = sum(merged_df$prediction == merged_df$new_child, na.rm = T) / length(merged_df$new_child)

  # Calculate true positives and false positives
  true_positives = sum(merged_df$prediction == 1 & merged_df$new_child == 1, na.rm = T)
  false_positives = sum(merged_df$prediction == 1 & merged_df$new_child == 0, na.rm = T)
  # false_negatives = sum(merged_df$prediction == 0 & merged_df$new_child == 1, na.rm = T)

  # Calculate the actual number of positive instances (N of people who actually had a new child) for calculating recall
  n_all_positive_instances = sum(ground_truth_df$new_child == 1)

  # Calculate precision, recall, and F1 score
  if ((true_positives + false_positives) == 0) {
    precision = 0
  } else {
    precision = true_positives / (true_positives + false_positives)
  }

  # if ((true_positives + false_negatives) == 0) {
  #   recall = 0
  # } else {
  #   recall = true_positives / (true_positives + false_negatives)
  # }
  recall = true_positives / n_all_positive_instances

  if ((precision + recall) == 0) {
    f1_score = 0
  } else {
    f1_score = 2 * (precision * recall) / (precision + recall)
  }

  metrics_df <- data.frame(
    'accuracy' = accuracy,
    'precision' = precision,
    'recall' = recall,
    'f1_score' = f1_score
  )

  return(metrics_df)
}
Calculating predictive accuracy scores in R
Here we show how to evaluate your model in terms of predictive accuracy when you are working in R.
The predictive accuracy scores (F1, accuracy, precision, recall) for all submissions will be calculated via the script score.py. If you work in Python, you can also calculate these scores yourself by running python score.py predictions.csv outcome.csv in the command line, where “predictions.csv” refers to a csv-file that includes two columns (nomem_encr and prediction; the output of the predict_outcomes function) and “outcome.csv” (for example “PreFer_train_outcome.csv”) refers to a csv-file that contains the ground truth (also with two columns; nomem_encr and new_child).
If you work in R, you cannot do this directly (unless you save your predictions from submission.R in a csv file and then run the Python code). That is why we provide a script so you can calculate these scores yourself. The reason we do not include this R-script in the repository is that, when you submit your model/method, the Python script will be used to calculate the scores for your submission during the challenge. Although entirely unlikely, the Python and R scripts that produce the scores may, for whatever reason, lead to slightly different results. So use this R-script for convenience, but note the disclaimer above.
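If you would rather reuse the official Python scorer from within R, a minimal sketch of that workflow could look like this (it assumes that score.py and the outcome file are in your working directory and that Python is available on your PATH; the file names are only examples):

# Write the predictions (columns nomem_encr and prediction) to a csv file
write.csv(predictions_df, "predictions.csv", row.names = FALSE)

# Run the Python scoring script from R; requires Python on your PATH
system("python score.py predictions.csv PreFer_train_outcome.csv")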
Update from April 30:
Starting from the second intermediate leaderboard, we use an updated score.py script. When calculating recall, we now take into account not only the cases for which a predicted value was available (i.e., not missing), but all cases in the holdout set. Specifically, in the updated script, we divide the number of true positives by the total number of positive cases in the ground truth data (i.e., the number of people who actually had a new child), rather than by the sum of true positives and false negatives. This change only matters if there are missing values in the predictions. We made this change to avoid a situation where a model that makes very accurate predictions for only a small number of cases (where the remaining cases were not predicted because of missing values on predictor variables) gets the same recall as a model that makes similarly accurate predictions for all cases.
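To illustrate the difference, here is a minimal sketch with made-up vectors (not part of the scoring script) in which two of the five actual positive cases have missing predictions:

# Made-up example: 5 people actually had a child, but 2 of them have
# missing predictions (NA), e.g. because of missing predictor values
truth <- c(1, 1, 1, 1, 1, 0, 0, 0)
pred  <- c(1, 1, 1, NA, NA, 0, 0, 0)

true_positives  <- sum(pred == 1 & truth == 1, na.rm = TRUE) # 3
false_negatives <- sum(pred == 0 & truth == 1, na.rm = TRUE) # 0 (the NA cases are skipped)

# Old recall: missing predictions are simply ignored
true_positives / (true_positives + false_negatives) # 3 / 3 = 1.0

# Updated recall: divide by all positive cases in the ground truth
true_positives / sum(truth == 1) # 3 / 5 = 0.6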
score.R
You can simply source this function into your R environment by running source("https://preferdatachallenge.nl/data/score.R"). The full contents of this R-script are shown at the top of this page. Commented lines of code (preceded with #) were part of our original scoring function.
Creating fake data
Let’s create fake data to use for our script.
# the names of these variables for both datasets are set
truth_df <- data.frame(
  nomem_encr = 1:20, # 20 fake IDs
  new_child = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1) # 10 people had a child, 10 people did not
)

predictions_df <- data.frame(
  nomem_encr = 1:20, # same fake IDs
  prediction = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
                 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
)
Scores
The following scores are calculated: F1, accuracy, precision, and recall, using counts of true positives, false positives, and false negatives.
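These counts can be read directly off the fake data, for example with a cross-table. A minimal sketch (it relies on the two data frames being in the same nomem_encr order, which holds here; the scoring function itself aligns the data with merge() instead):

# Cross-tabulate predictions against the ground truth
table(prediction = predictions_df$prediction, truth = truth_df$new_child)

# Or count each cell directly
sum(predictions_df$prediction == 1 & truth_df$new_child == 1) # true positives: 7
sum(predictions_df$prediction == 1 & truth_df$new_child == 0) # false positives: 4
sum(predictions_df$prediction == 0 & truth_df$new_child == 1) # false negatives: 3
sum(predictions_df$prediction == 0 & truth_df$new_child == 0) # true negatives: 6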
Manually
Accuracy: the fraction of correct predictions out of all predictions made. The number of people who did not have a child and were correctly predicted not to have a child (6 in our example), plus the number of people who had a child and were correctly predicted to have a child (7 in our example), divided by the total number of people (20). Thus, the accuracy for our example is \(\frac{6+7}{20}=0.65\).
Precision: the fraction of correct predictions out of all cases predicted to be “of interest”. Among those who were predicted to have a child (11 in our case), the fraction who indeed had a child (7 out of those 11), thus 0.64. This is often written as \(\frac{\text{true positives}}{\text{true positives + false positives}} =\frac{7}{7+4}\).
Recall: the fraction of correct predictions out of all cases “of interest”. Among those who actually had a child (10 in our case), the fraction who were predicted to have a child (7 out of those 10), thus 0.7. This is often written as \(\frac{\text{true positives}}{\text{true positives + false negatives}} =\frac{7}{7+3}\). After our update, this is \(\frac{\text{true positives}}{\text{all positive instances}} =\frac{7}{10}\) (leading to the same score here, as there are no missing predictions in this case).
F1-score: the harmonic mean of precision and recall: \(2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}\). Here \(2\cdot\frac{0.64\cdot 0.7}{0.64+0.7}\approx 0.67\). The sketch below reproduces these manual calculations in R.
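A minimal sketch that redoes the calculations above, using the counts from the fake example (the names tp, fp, fn, and tn are just shorthand for this illustration):

# Counts from the fake example: 7 true positives, 4 false positives,
# 3 false negatives, 6 true negatives; 10 actual positives, 20 people in total
tp <- 7; fp <- 4; fn <- 3; tn <- 6

accuracy  <- (tp + tn) / 20                                  # (7 + 6) / 20 = 0.65
precision <- tp / (tp + fp)                                  # 7 / 11 ≈ 0.64
recall    <- tp / 10                                         # 7 / 10 = 0.7; equals tp / (tp + fn) here, since no predictions are missing
f1_score  <- 2 * (precision * recall) / (precision + recall) # ≈ 0.67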
Via the script
source("https://preferdatachallenge.nl/data/score.R")
score(predictions_df, truth_df)
accuracy precision recall f1_score
1 0.65 0.6363636 0.7 0.6666667
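To see the updated recall rule in action with this script, here is a minimal sketch that drops a few predictions before scoring (the object name incomplete_predictions_df is just for this illustration):

# Drop three correctly predicted positive cases (IDs 14-16) to mimic missing predictions;
# the right join inside score() turns them into NAs
incomplete_predictions_df <- predictions_df[!predictions_df$nomem_encr %in% 14:16, ]

score(incomplete_predictions_df, truth_df)
# accuracy and precision drop to 0.5, recall to 0.4 (4 / 10), and the F1 score to about 0.44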