Tutorial on reading large datasets


Large datasets

As Data Scientists and Kagglers, we crave more data.

How many times have you complained about not having good enough data for a particular problem? Plenty, I’m sure.
How many times have you complained about having too much data to work with? Maybe not as many, but it’s still a better problem to have.

The most common workaround when there is too much data is to use only a part/sample of it that fits in the available RAM. But that wastes the unused data and can lose information. There are often ways to overcome this challenge without subsampling. No single solution caters to all requirements, so different approaches work in different scenarios.

This notebook aims to describe and summarize some of these techniques. The Riiid! Answer Correctness Prediction dataset is quite a nice sample to experiment on, since a plain vanilla pd.read_csv will result in an out-of-memory error on Kaggle Notebooks. It has over 100 million rows and 10 columns.

Different packages have their own ways of reading data. The methods explored in this notebook (the default, pandas, first and the rest alphabetically):

  • Pandas
  • Dask
  • Datatable
  • Rapids

Apart from reading the data from the raw csv files, it is also common to convert the dataset into another format that uses less disk space and/or can be read faster on subsequent reads. The file types explored in this notebook (the default, csv, first and the rest alphabetically):

  • csv
  • feather
  • hdf5
  • jay
  • parquet
  • pickle

Note that just reading the data is not the end of the story. The final decision on which method to use should also consider the downstream tasks and processes that will need to run on the data. But that is outside the scope of this notebook.

You will also find that different methods work best for different datasets or different environments, so there is no clear winner as such.

Feel free to share other approaches that can be added to this list.

Methods

Before exploring the various methods, let’s first confirm that reading the dataset with the default pandas settings fails.

In [1]:

import pandas as pd
import dask.dataframe as dd

# confirming the default pandas doesn't work (running the below code should result in a memory error)
# data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv")

Method: Pandas


Pandas is probably the most popular method of reading datasets and is also the default on Kaggle. It has a lot of options, flexibility and functions for reading and processing data.

One of the challenges of using pandas for reading large datasets is its conservative nature while inferring the data types of the columns, which often results in unnecessarily large memory usage for the pandas dataframe. You can pre-define optimal data types for the columns (based on prior knowledge or sample inspection) and provide them explicitly while reading the dataset.

This is the method used in the official starter notebook of the Riiid! competition as well.

Documentation: https://pandas.pydata.org/docs/

In [2]:

%%time

dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 8min 22s, sys: 11.6 s, total: 8min 34s
Wall time: 9min 25s

In [3]:

data.head()

Out[3]:

row_id timestamp user_id content_id content_type_id task_container_id user_answer answered_correctly prior_question_elapsed_time prior_question_had_explanation
0 0 0 115 5692 False 1 3 1 NaN
1 1 56943 115 5716 False 2 2 1 37000.0 False
2 2 118363 115 128 False 0 0 1 55000.0 False
3 3 131167 115 7860 False 3 0 1 19000.0 False
4 4 137965 115 7922 False 4 1 1 11000.0 False
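A quick way to see why the explicit dtypes matter is to compare the memory footprint of a sample read with and without them. This is a minimal sketch that is not part of the original notebook; the 1-million-row sample size is arbitrary and the code reuses the dtypes dictionary defined above.

# Hypothetical sketch: compare memory usage of inferred vs. explicit dtypes on a sample
sample_inferred = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", nrows=1_000_000)
sample_typed = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", nrows=1_000_000, dtype=dtypes)

print("Inferred dtypes:", round(sample_inferred.memory_usage(deep=True).sum() / 1024 ** 2), "MB")
print("Explicit dtypes:", round(sample_typed.memory_usage(deep=True).sum() / 1024 ** 2), "MB")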

Method: Dask


Dask provides a framework to scale pandas workflows natively using a parallel processing architecture. For those of you who have used Spark, you will find an uncanny similarity between the two.

Documentation: https://docs.dask.org/en/latest/

In [4]:

%%time

dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

data = dd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes).compute()

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 9min 35s, sys: 22.7 s, total: 9min 58s
Wall time: 7min 43s

In [5]:

data.head()

Out[5]:

row_id timestamp user_id content_id content_type_id task_container_id user_answer answered_correctly prior_question_elapsed_time prior_question_had_explanation
0 0 0 115 5692 False 1 3 1 NaN
1 1 56943 115 5716 False 2 2 1 37000.0 False
2 2 118363 115 128 False 0 0 1 55000.0 False
3 3 131167 115 7860 False 3 0 1 19000.0 False
4 4 137965 115 7922 False 4 1 1 11000.0 False
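Note that calling .compute() materializes the full dataframe in memory, which is why the timing above is in the same ballpark as pandas. Dask’s lazy evaluation pays off when you push filters or aggregations into the task graph before computing. The aggregation below is an illustrative sketch, not part of the original notebook, and it reuses the dtypes dictionary defined above.

# Hypothetical sketch: keep the computation lazy and only materialize a small result
ddf = dd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)

# only the final scalar is brought into memory, not the 100M-row dataframe
n_users = ddf["user_id"].nunique().compute()
print("Unique users:", n_users)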

Method: Datatable


Datatable (heavily inspired by R’s data.table) can read large datasets fairly quickly and is often faster than pandas. It is specifically meant for processing tabular datasets, with an emphasis on speed and support for large data.

Documentation: https://datatable.readthedocs.io/en/latest/index.html

In [1]:

# datatable installation with internet
# !pip install datatable==0.11.0 > /dev/null

# datatable installation without internet
!pip install ../input/python-datatable/datatable-0.11.1-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null

import datatable as dt

In [7]:

%%time

data = dt.fread("../input/riiid-test-answer-prediction/train.csv")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 41.4 s, sys: 14.3 s, total: 55.7 s
Wall time: 1min 14s

In [8]:

data.head()

Out[8]:

row_id timestamp user_id content_id content_type_id task_container_id user_answer answered_correctly prior_question_elapsed_time prior_question_had_explanation
0 0 0 115 5692 0 1 3 1 NA NA
1 1 56943 115 5716 0 2 2 1 37000 0
2 2 118363 115 128 0 0 0 1 55000 0
3 3 131167 115 7860 0 3 0 1 19000 0
4 4 137965 115 7922 0 4 1 1 11000 0
5 5 157063 115 156 0 5 2 1 5000 0
6 6 176092 115 51 0 6 0 1 17000 0
7 7 194190 115 50 0 7 3 1 17000 0
8 8 212463 115 7896 0 8 2 1 16000 0
9 9 230983 115 7863 0 9 0 1 16000 0

10 rows × 10 columns
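Note that dt.fread returns a datatable Frame, not a pandas dataframe. If downstream code expects pandas, the Frame can be converted, at the cost of building a pandas copy in memory (the dtype considerations from the pandas section then apply to that copy).

# convert the datatable Frame to a pandas dataframe for downstream processing
df = data.to_pandas()
print(type(df))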

Method: Rapids


Rapids is a great option for scaling data processing on GPUs. With a lot of machine learning modelling moving to GPUs, Rapids enables building end-to-end data science solutions on one or more GPUs.

Documentation: https://docs.rapids.ai/

In [1]:

# rapids installation (make sure to turn on GPU)
import sys
!cp ../input/rapids/rapids.0.16.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path

import cudf

In [2]:

%%time

data = cudf.read_csv("../input/riiid-test-answer-prediction/train.csv")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 3.56 s, sys: 3.28 s, total: 6.85 s
Wall time: 1min 4s

In [3]:

data.head()

Out[3]:

row_id timestamp user_id content_id content_type_id task_container_id user_answer answered_correctly prior_question_elapsed_time prior_question_had_explanation
0 0 0 115 5692 0 1 3 1
1 1 56943 115 5716 0 2 2 1 37000.0 False
2 2 118363 115 128 0 0 0 1 55000.0 False
3 3 131167 115 7860 0 3 0 1 19000.0 False
4 4 137965 115 7922 0 4 1 1 11000.0 False
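The dataframe returned by cudf.read_csv lives in GPU memory. If a later step needs a CPU dataframe, it can be moved to host memory, keeping in mind that this temporarily requires both the GPU and CPU copies.

# move the GPU dataframe to host (CPU) memory as a pandas dataframe
df = data.to_pandas()
print(type(df))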

File Formats

It is common to convert a dataset into a format that is easier or faster to read, or smaller to store. There are various formats in which datasets can be stored, though not all of them are readable across different packages. Let’s look at how this dataset can be converted into the different formats.

Most of them are available in pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In [ ]:

# reading data from csv using datatable and converting to pandas
# data = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()

# writing dataset as csv
# data.to_csv("riiid_train.csv", index=False)

# writing dataset as hdf5
# data.to_hdf("riiid_train.h5", "riiid_train")

# writing dataset as feather
# data.to_feather("riiid_train.feather")

# writing dataset as parquet
# data.to_parquet("riiid_train.parquet")

# writing dataset as pickle
# data.to_pickle("riiid_train.pkl.gzip")

# writing dataset as jay
# dt.Frame(data).to_jay("riiid_train.jay")

All the formats of the dataset (except csv since it’s already provided in the competition) are available in this dataset: https://www.kaggle.com/rohanrao/riiid-train-data-multiple-formats

Format: csv

Most Kaggle datasets are available in csv, which is pretty much the standard format in which datasets are shared. Almost all methods can read data from csv.

Read more: https://en.wikipedia.org/wiki/Comma-separated_values

In [2]:

%%time

dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 8min 15s, sys: 11.7 s, total: 8min 27s
Wall time: 9min 24s
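If only a subset of columns is needed downstream, read_csv's usecols argument avoids parsing and storing the rest. The column selection below is just an example (not from the original notebook) and reuses the dtypes dictionary defined above.

# Hypothetical sketch: parse only the columns required downstream
cols = ["user_id", "content_id", "answered_correctly"]

data_small = pd.read_csv(
    "../input/riiid-test-answer-prediction/train.csv",
    usecols=cols,
    dtype={col: dtypes[col] for col in cols}
)

print("Train size:", data_small.shape)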

Format: feather

Feather is a binary format commonly used to store pandas dataframes. It significantly improves the reading speed of datasets.

Read more: https://arrow.apache.org/docs/python/feather.html

In [2]:

%%time

data = pd.read_feather("../input/riiid-train-data-multiple-formats/riiid_train.feather")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 3 s, sys: 17.5 s, total: 20.5 s
Wall time: 19.6 s

Format: hdf5

HDF5 is a high-performance data management suite to store, manage and process large and complex data.

Read more: https://www.hdfgroup.org/solutions/hdf5

In [2]:

%%time

data = pd.read_hdf("../input/riiid-train-data-multiple-formats/riiid_train.h5", "riiid_train")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 10.5 s, sys: 10.5 s, total: 20.9 s
Wall time: 53.8 s

Format: jay

Datatable uses the .jay (binary) format, which makes reading datasets blazing fast. The example below reads the entire dataset in well under a second!

Read more: https://datatable.readthedocs.io/en/latest/api/frame/to_jay.html

In [2]:

%%time

data = dt.fread("../input/riiid-train-data-multiple-formats/riiid_train.jay")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 4.66 ms, sys: 2.77 ms, total: 7.43 ms
Wall time: 24.5 ms
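The read is this fast because datatable memory-maps the .jay file instead of parsing and copying it, so data is pulled from disk only when it is actually accessed. As a small illustrative sketch (not part of the original notebook), selecting a few columns before converting to pandas avoids materializing all ten columns in RAM.

# Hypothetical sketch: convert only the needed columns of the memory-mapped Frame to pandas
df_small = data[:, ["user_id", "content_id", "answered_correctly"]].to_pandas()
print("Subset size:", df_small.shape)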

Format: parquet

Parquet has been a popular primary file format for tabular datasets in the Hadoop ecosystem and is now extensively used with Spark. It has become more accessible and efficient over the years and is also supported by pandas.

Read more: https://parquet.apache.org/documentation/latest/

In [2]:

%%time

data = pd.read_parquet("../input/riiid-train-data-multiple-formats/riiid_train.parquet")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 43.4 s, sys: 17.7 s, total: 1min 1s
Wall time: 47.2 s

Format: pickle

Python objects can be stored as pickle files, and pandas has built-in functions to read and write dataframes as pickle objects.

Read more: https://docs.python.org/3/library/pickle.html

In [2]:

%%time

data = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 5.75 s, sys: 8.7 s, total: 14.5 s
Wall time: 42.3 s
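Pickle files can also be compressed to trade read/write speed for disk space. As an illustrative sketch (the filename and compression choice here are arbitrary, not from the original dataset), pandas can compress on write and decompress on read via the compression argument.

# Hypothetical sketch: explicit gzip compression for a pickled dataframe
data.to_pickle("riiid_train_compressed.pkl.gz", compression="gzip")
df = pd.read_pickle("riiid_train_compressed.pkl.gz", compression="gzip")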

There is no winner or loser


Each method has its own set of pros and cons. Some examples are:

  • Pandas requires a lot more RAM to handle large datasets.
  • Dask can be slow at times especially with transformations that cannot be parallelized.
  • Datatable doesn’t have a very exhaustive set of data processing functions.
  • Rapids is not useful if you don’t have a GPU.

So it’s a good idea to explore the various options and finally choose whichever best fits the requirements. I strongly believe in not marrying a technology, and in continuously adapting to newer ideas, better approaches and ultimately the best possible solutions for building data science pipelines.

Even in my personal experience, I’ve found different approaches working well on different datasets. So don’t shy away from experimentation.

Data Science is blooming under the blessings of open source packages and communities


Author: Terence Cai
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Terence Cai as the source when reposting.