Reading in data [LISS]

Here is some useful information on reading in (loading) the PreFer data.

Authors

Gert Stulp

Lisa Sivak

Published

March 22, 2024

Here we describe ways of reading in the data in both Python and R.

Note that different packages may lead to partly different datasets (for example, with columns of different types), because of the way different packages treat things like missing values, empty strings, and dates.
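As a small illustration of this point, the sketch below reads the same tiny CSV twice with pandas: once with the default handling of empty fields, and once with `keep_default_na=False`. The column names and values are made up for the example and are not taken from the PreFer data.

```python
import io
import pandas as pd  # requires installing pandas first

# A tiny made-up CSV (not PreFer data): one empty field in each column
csv_text = "age,city\n25,Utrecht\n,Groningen\n31,\n"

# Default: empty fields become missing values (NaN)
default = pd.read_csv(io.StringIO(csv_text))

# Alternative: empty fields are kept as empty strings
as_text = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default.dtypes)  # age is numeric (float64), because NaN is a floating point number
print(as_text.dtypes)  # age is read as text, because "" cannot be stored as a number
```

The same file thus yields a numeric `age` column in one case and a text column in the other, which is the kind of difference that can also arise between packages.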

Reading in the data

The most important dataset of the PreFer data challenge is PreFer_train_data.csv (see the overview of the available datasets for more information). It contains data on all LISS respondents who were between the ages of 18 and 45 in 2020 (i.e., had birth years between 1975 and 2002). Below we show how to read in this data with R or Python.

R

There are several ways to read data into R, but some are more reliable and quicker than others (TL;DR: use data.table::fread).

read.csv

read.csv works and requires no additional packages, but it is very slow.

data <- read.csv("path/to/folder/PreFer_train_data.csv") # this works but is very slow

read_csv

readr::read_csv from the package readr works in principle, but with default settings it gets many of the column types wrong (because, by default, it bases column types only on the first 1000 values of each variable). Do not run data <- readr::read_csv("PreFer_train_data.csv"); instead, use the following code, which explicitly tells read_csv to use the entire column (i.e., all cases) when guessing the column type:

library(readr) # requires install.packages("readr") first
data <- readr::read_csv("path/to/folder/PreFer_train_data.csv", guess_max = 6418) # this works but is slow
# 6418 is the number of rows in the data

fread

data.table::fread from the package data.table works like a charm and is very fast. Two additional arguments are useful to override its default behaviour.

library(data.table) # requires install.packages("data.table") first
data <- data.table::fread("path/to/folder/PreFer_train_data.csv",
                          keepLeadingZeros = TRUE, # keeps values with leading zeros (e.g., in some dates) as text; if FALSE they are read as numbers and the zeros are dropped
                          data.table = FALSE) # returns a data.frame object rather than data.table

Python

read_csv from pandas

read_csv from pandas works, but is slow. Specifying low_memory=False is needed. With low_memory=False, each column is read in full before its data type is determined. With low_memory=True (the default), pandas reads the file in chunks of rows and then appends the chunks together. This lowers memory use while parsing, but can give a column an incorrect (mixed) type, for example when a column contains many missing values (which pandas stores as floating point NaN) while all other values are integers.

```{python}
import pandas as pd # requires installing pandas first
train = pd.read_csv("path to the data which is NOT in your local repository\\PreFer_train_data.csv", low_memory=False)  # this works but is slow
```
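After loading, it can be useful to check whether any column still holds a mix of types. The sketch below is one possible check; `mixed_type_columns` is a hypothetical helper written for this guide (not a pandas function), and the small data frame is made up for illustration.

```python
import pandas as pd  # requires installing pandas first


def mixed_type_columns(df):
    """Return the names of columns that hold more than one Python type.

    Missing values are ignored. This is a hypothetical helper written for
    this guide, not a pandas function.
    """
    mixed = []
    for name in df.columns:
        # Count the distinct Python types among the non-missing values
        kinds = df[name].dropna().map(type).nunique()
        if kinds > 1:
            mixed.append(name)
    return mixed


# Made-up example: column "b" holds both a number and a piece of text
example = pd.DataFrame({"a": [1, 2, 3], "b": [1, "2", None], "c": ["x", "y", None]})
print(mixed_type_columns(example))
```

The same function can be called as `mixed_type_columns(train)` on the loaded data to list columns that may need a closer look.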

read_csv from polars

read_csv from the polars package takes less time. Polars is a pandas alternative designed to process data faster. If you want to work with a pandas dataframe, use to_pandas() to convert the result. For that, the pyarrow package also needs to be installed.

infer_schema_length=6418 is needed to increase the number of lines used for determining column types; 6418 is the number of rows in the data.

```{python}
import polars as pl     # requires installing polars first
import pyarrow          # requires installing pyarrow first
train = pl.read_csv("path to the data which is NOT in your local repository\\PreFer_train_data.csv", infer_schema_length=6418).to_pandas()  # read with polars, then convert to a pandas dataframe
```
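To see why the number of rows used for determining column types matters (infer_schema_length in polars, guess_max in readr), consider the toy sketch below. `guess_type` is a deliberately simplified, hypothetical stand-in for the type inference that CSV readers perform; it is not how polars or readr are actually implemented, and the column is made up.

```python
def guess_type(values, max_rows):
    """Guess "int" or "str" from the non-empty values among the first max_rows rows.

    A simplified, hypothetical illustration; real readers distinguish
    many more types (floats, dates, booleans, ...).
    """
    seen = [v for v in values[:max_rows] if v != ""]
    return "int" if all(v.lstrip("-").isdigit() for v in seen) else "str"


# Made-up column: 1000 numeric values followed by one text value
column = [str(i) for i in range(1000)] + ["unknown"]

print(guess_type(column, max_rows=1000))         # int
print(guess_type(column, max_rows=len(column)))  # str
```

When only the first 1000 rows are inspected, the text value further down is never seen and the guess is wrong; inspecting all rows avoids this, at the cost of some speed.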