Reading in data [LISS]
This page describes how to read in the PreFer data in both Python and R.
Note that different packages may produce partly different datasets (for example, with columns of different types), because packages differ in how they treat missing values, empty strings, and dates.
Reading in the data
The most important dataset of the PreFer data challenge is PreFer_train_data.csv (click here for information on the datasets that are available). It contains data on all LISS respondents who were between the ages of 18 and 45 in 2020 (i.e., born between 1975 and 2002). Below we show how to read in this data with R or Python.
R
There are several ways to read data into R, but some are more reliable and faster than others (TL;DR: use data.table::fread).
read.csv
read.csv works and requires no additional packages, but is very slow.
data <- read.csv("path/to/folder/PreFer_train_data.csv", row.names = FALSE) # this works but is very slow
read_csv
readr::read_csv from the package readr works in principle, but gets many of the column types wrong with default settings (because, by default, it bases column types only on the first 1000 values of each variable). Do not run data <- readr::read_csv("PreFer_train_data.csv"); instead, use the following code, which explicitly tells read_csv to use the entire column (i.e., all cases) when guessing the column type:
library(readr) # requires install.packages("readr") first
# 6418 is the number of rows in the data
data <- readr::read_csv("path/to/folder/PreFer_train_data.csv", guess_max = 6418) # this works but is slow
fread
data.table::fread from the package data.table works like a charm and is very fast. Some additional arguments are useful to override default behaviour that would otherwise alter the data.
library(data.table) # requires install.packages("data.table") first
data <- data.table::fread("path/to/folder/PreFer_train_data.csv",
                          keepLeadingZeros = TRUE, # if FALSE, leading zeros are stripped from some values (e.g., dates)
                          data.table = FALSE) # returns a data.frame object rather than a data.table
Python
read_csv from pandas
read_csv from pandas works, but is slow. Specifying low_memory=False is needed. With low_memory=False, whole columns are read in first, and the data type of each column is determined afterwards. With low_memory=True (the default), pandas reads the data in chunks of rows and then appends them together. This lowers memory use while parsing, but can leave a column with an incorrect (mixed) type, for example when a column has many missing values (which pandas represents as floating-point NaN) while all other values are integers.
```{python}
import pandas as pd # requires installing pandas first
train = pd.read_csv("path to the data which is NOT in your local repository\\PreFer_train_data.csv", low_memory=False) # this works but is slow
```
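The type issue described above can be checked after loading. The following is a minimal sketch on a tiny made-up CSV, not on the PreFer data: the column names `id` and `n_children` and the helper `mixed_type_columns` are hypothetical and only serve the illustration. It shows that an integer column with a missing value is read as `float64`, that requesting the nullable `Int64` dtype is one way to keep such a column integer, and how one might look for columns holding more than one Python type.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real file; column names are hypothetical.
csv_text = "id,n_children\n1,2\n2,\n3,0\n"

# Default parsing: the empty cell becomes NaN, which forces the column to float64.
df = pd.read_csv(io.StringIO(csv_text), low_memory=False)
print(df.dtypes)

# Requesting the nullable integer dtype keeps integers and marks the gap as <NA>.
df_int = pd.read_csv(io.StringIO(csv_text), dtype={"n_children": "Int64"}, low_memory=False)
print(df_int.dtypes)


def mixed_type_columns(frame):
    """Return the columns whose non-missing values have more than one Python type."""
    return [c for c in frame.columns if frame[c].dropna().map(type).nunique() > 1]


# A deliberately mixed column (int, str, float) to show what the helper flags.
mixed = pd.DataFrame({"clean": [1, 2, 3], "messy": pd.Series([1, "a", 2.5], dtype=object)})
print(mixed_type_columns(mixed))
```

Running a check like `mixed_type_columns(train)` after loading the real data is one possible way to confirm that no column ended up with mixed types.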
read_csv from polars
read_csv from the polars package takes less time. Polars is a pandas alternative designed to process data faster. If you want to work with a pandas dataframe, use to_pandas() to convert; for that, the pyarrow package also needs to be installed.
infer_schema_length=6418 is needed to increase the number of lines used for determining column types; 6418 is the number of rows in the data.
```{python}
import polars as pl # requires installing polars first
import pyarrow # requires installing pyarrow first (needed by to_pandas())
train = pl.read_csv("path to the data which is NOT in your local repository\\PreFer_train_data.csv", infer_schema_length=6418).to_pandas()
```
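To see why the number of rows used for type inference matters, here is a toy sketch in plain Python. It is not how polars (or readr) infers types internally; the function `infer_type` and the sample column are made up for illustration. It only shows the general idea: a guess based on the first few rows can differ from a guess based on the whole column.

```python
def infer_type(values):
    """Guess a column type ('int', 'float', or 'str') from string values.

    Empty strings are treated as missing and ignored.
    """
    present = [v for v in values if v != ""]
    if not present:
        return "unknown"  # nothing to base a guess on
    for name, caster in (("int", int), ("float", float)):
        try:
            for v in present:
                caster(v)  # raises ValueError if v does not fit this type
            return name
        except ValueError:
            continue
    return "str"


# Made-up column: numeric codes in the first rows, a text value further down.
column = ["1", "2", "", "3", "4", "other"]

guess_from_prefix = infer_type(column[:4])  # looks only at the first 4 rows
guess_from_all = infer_type(column)         # looks at every row

print(guess_from_prefix)
print(guess_from_all)
```

Here the prefix-based guess settles on an integer type, while the full-column guess falls back to text because of the last value. This is the situation that setting infer_schema_length to the number of rows is meant to avoid.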