We have two fairly hefty CSV files on hand, 554 MB and 923 MB respectively. Loading either file into memory in full would take a significant amount of time.

We can pass nrows=5 to load only the first 5 rows of a file, just to get an idea of what the data looks like.

Loading the First 5 Rows

# Load Python library
import pandas as pd
# Load first 5 rows 
pd.read_csv("/home/pi/Downloads/works-20210226.csv", nrows=5)
   creation date  language  restricted  complete  word_count                                               tags  Unnamed: 6
0     2021-02-26        en       False      True         388  10+414093+1001939+4577144+1499536+110+4682892+...         NaN
1     2021-02-26        en       False      True        1638  10+20350917+34816907+23666027+23269305+2326930...         NaN
2     2021-02-26        en       False      True        1502  10+10613413+9780526+3763877+3741104+7657229+30...         NaN
3     2021-02-26        en       False      True         100  10+15322+54862755+20595867+32994286+663+471751...         NaN
4     2021-02-26        en       False      True         994  11+721553+54604+1439500+3938423+53483274+54862...         NaN
pd.read_csv("/home/pi/Downloads/tags-20210226.csv", nrows=5)
   id   type                                name  canonical  cached_count  merger_id
0   1  Media                            TV Shows       True           910        NaN
1   2  Media                              Movies       True          1164        NaN
2   3  Media                  Books & Literature       True           134        NaN
3   4  Media  Cartoons & Comics & Graphic Novels       True           166        NaN
4   5  Media                       Anime & Manga       True           501        NaN
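A small preview is also a cheap way to check which column types pandas infers and how much memory each column takes before committing to a full load. The sketch below is only an illustration: it reads from a tiny in-memory CSV with hypothetical rows modelled on the works preview, not from the real file. The trailing comma on each line is an assumption about why an empty "Unnamed: 6" column shows up.

```python
import io
import pandas as pd

# Hypothetical stand-in for the works file: a few made-up rows with the same
# column names. The trailing comma on each line is an assumed cause of the
# empty "Unnamed: 6" column.
sample_csv = io.StringIO(
    "creation date,language,restricted,complete,word_count,tags,\n"
    "2021-02-26,en,False,True,388,10+414093+1001939,\n"
    "2021-02-26,en,False,True,1638,10+20350917+34816907,\n"
    "2021-02-26,en,False,True,1502,10+10613413+9780526,\n"
)

# Read only the first 2 rows, as nrows=5 does on the real file
preview = pd.read_csv(sample_csv, nrows=2)

# Inferred type of each column
print(preview.dtypes)

# Bytes used per column; deep=True also measures the string contents
print(preview.memory_usage(deep=True))
```

Types inferred from a handful of rows are only a hint. A column that looks like integers in the preview can still contain missing values further down the file.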

Loading the Entire File

There are additional steps we can take to save memory and potentially speed up loading. Reading the file in Jupyter Notebook takes about 54 seconds on my machine, so be prepared for a wait.
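One step of this kind, which this walkthrough does not use, is to tell read_csv which columns to keep and which types to expect. The sketch below is an illustration under assumptions: it reads a small in-memory CSV of hypothetical rows instead of the real file, and both the column selection and the category type for language are choices made for the example.

```python
import io
import pandas as pd

# Hypothetical rows with the same column names as the works file
csv_text = (
    "creation date,language,restricted,complete,word_count,tags,\n"
    "2021-02-26,en,False,True,388,10+414093+1001939,\n"
    "2021-02-26,en,False,True,1638,10+20350917+34816907,\n"
    "2021-02-26,fr,False,True,1502,10+10613413+9780526,\n"
)

works_small = pd.read_csv(
    io.StringIO(csv_text),
    # Skip columns we do not need, including the empty "Unnamed: 6"
    usecols=["creation date", "language", "word_count", "tags"],
    # Store the repetitive language codes as a categorical column
    dtype={"language": "category"},
)

print(works_small.dtypes)
```

Dropping unused columns reduces the work done while parsing, and a categorical column stores each distinct value once, which usually helps when a column has few unique values.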

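We’ll pass chunksize=10000 so that read_csv returns an iterator yielding 10,000 rows at a time instead of parsing the whole file in one go, then use pd.concat() to combine the chunks into a single DataFrame. Note that the combined DataFrame still has to fit in memory.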

# The file is large, so read it in chunks of 10,000 rows to save memory
chunker = pd.read_csv("/home/pi/Downloads/works-20210226.csv", chunksize=10000)
# Combine chunks into a dataframe
works = pd.concat(chunker, ignore_index=True)
# First 5 rows
works.head()
   creation date  language  restricted  complete  word_count                                               tags  Unnamed: 6
0     2021-02-26        en       False      True       388.0  10+414093+1001939+4577144+1499536+110+4682892+...         NaN
1     2021-02-26        en       False      True      1638.0  10+20350917+34816907+23666027+23269305+2326930...         NaN
2     2021-02-26        en       False      True      1502.0  10+10613413+9780526+3763877+3741104+7657229+30...         NaN
3     2021-02-26        en       False      True       100.0  10+15322+54862755+20595867+32994286+663+471751...         NaN
4     2021-02-26        en       False      True       994.0  11+721553+54604+1439500+3938423+53483274+54862...         NaN
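When only a summary is needed, the chunks do not have to be concatenated at all: each chunk can be reduced as it arrives, so that only one chunk is held in memory at a time. The sketch below is a variation on the approach above, not the method used in this walkthrough. It reads a small in-memory CSV of hypothetical rows (the "fr" rows are invented for the example) and uses a chunk size of 2 purely so that several chunks are produced.

```python
import io
import pandas as pd

# Hypothetical rows with the same column names as the works file
csv_text = (
    "creation date,language,restricted,complete,word_count,tags,\n"
    "2021-02-26,en,False,True,388,10+414093,\n"
    "2021-02-26,en,False,True,1638,10+20350917,\n"
    "2021-02-26,fr,False,True,1502,10+10613413,\n"
    "2021-02-26,en,False,True,100,10+15322,\n"
    "2021-02-26,fr,False,True,994,11+721553,\n"
)

# Running total of word_count per language, updated chunk by chunk
totals = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    per_language = chunk.groupby("language")["word_count"].sum()
    for language, words in per_language.items():
        totals[language] = totals.get(language, 0) + int(words)

print(totals)
```

On the real file the same loop would take the file path and a larger chunksize such as 10000. The trade-off is that the full table is never available for ad hoc exploration afterwards.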