We have two fairly large CSV files on hand, 554 MB and 923 MB respectively. Loading either file in full takes a significant amount of time.

We can pass nrows=5 to load only the first 5 rows of a file, just to get an idea of what the data looks like.

Loading First 5 Rows

# Load Python library
import pandas as pd
# Load only the first 5 rows of the works file
pd.read_csv("/home/pi/Downloads/works-20210226.csv", nrows=5)

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6
0	2021-02-26	en	False	True	388	10+414093+1001939+4577144+1499536+110+4682892+...	NaN
1	2021-02-26	en	False	True	1638	10+20350917+34816907+23666027+23269305+2326930...	NaN
2	2021-02-26	en	False	True	1502	10+10613413+9780526+3763877+3741104+7657229+30...	NaN
3	2021-02-26	en	False	True	100	10+15322+54862755+20595867+32994286+663+471751...	NaN
4	2021-02-26	en	False	True	994	11+721553+54604+1439500+3938423+53483274+54862...	NaN

pd.read_csv("/home/pi/Downloads/tags-20210226.csv", nrows=5)

	id	type	name	canonical	cached_count	merger_id
0	1	Media	TV Shows	True	910	NaN
1	2	Media	Movies	True	1164	NaN
2	3	Media	Books & Literature	True	134	NaN
3	4	Media	Cartoons & Comics & Graphic Novels	True	166	NaN
4	5	Media	Anime & Manga	True	501	NaN
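
Beyond eyeballing the rows, the same small sample can show which column types pandas infers and roughly how much memory each column uses, which helps when deciding how to load the full file. A minimal sketch, assuming an in-memory stand-in for the works file (the rows and shortened tag strings below are made up for illustration; point read_csv at the real path instead):

```python
import io
import pandas as pd

# Hypothetical stand-in for the first rows of the works file.
# Replace io.StringIO(...) with the real CSV path when running on actual data.
sample_csv = io.StringIO(
    "creation date,language,restricted,complete,word_count,tags\n"
    "2021-02-26,en,False,True,388,10+414093+1001939\n"
    "2021-02-26,en,False,True,1638,10+20350917+34816907\n"
    "2021-02-26,en,False,True,1502,10+10613413+9780526\n"
)

# Read at most 5 rows, exactly as in the examples above
sample = pd.read_csv(sample_csv, nrows=5)

# Column types pandas inferred from the sample
print(sample.dtypes)

# Approximate memory per column in bytes, counting string contents too
print(sample.memory_usage(deep=True))
```

Types inferred from a handful of rows are only a hint: a column that looks like integers in the sample may still contain missing or malformed values further down the file.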

Loading Entire File

We can take a few additional steps to save memory and potentially speed up loading. Reading the works file in Jupyter Notebook takes about 54 seconds on my machine, so be prepared for a wait.

We’ll use chunksize=10000 to save memory by reading the file 10,000 rows at a time, then use pd.concat() to combine the chunks into a single DataFrame.

# The file is too large to parse comfortably in one pass,
# so save memory by reading it in chunks of 10,000 rows

chunker = pd.read_csv("/home/pi/Downloads/works-20210226.csv", chunksize=10000)

# Combine chunks into a dataframe
works = pd.concat(chunker, ignore_index=True)

# First 5 rows
works.head()

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6
0	2021-02-26	en	False	True	388.0	10+414093+1001939+4577144+1499536+110+4682892+...	NaN
1	2021-02-26	en	False	True	1638.0	10+20350917+34816907+23666027+23269305+2326930...	NaN
2	2021-02-26	en	False	True	1502.0	10+10613413+9780526+3763877+3741104+7657229+30...	NaN
3	2021-02-26	en	False	True	100.0	10+15322+54862755+20595867+32994286+663+471751...	NaN
4	2021-02-26	en	False	True	994.0	11+721553+54604+1439500+3938423+53483274+54862...	NaN
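
Note that word_count now displays as a float (388.0), which usually means pandas found missing values somewhere in that column. Also, once the chunks are concatenated, the full DataFrame still sits in memory. If only part of the data is needed, one option is to trim or aggregate each chunk before combining. A minimal sketch under those assumptions, using a made-up four-row stand-in for the works file and a chunk size of 2; the usecols and nullable Int64 dtype choices are illustrative, not something the load above used:

```python
import io
import pandas as pd

# Hypothetical miniature of the works file, with one missing word_count.
# Replace io.StringIO(csv_text) with the real CSV path on actual data.
csv_text = (
    "creation date,language,restricted,complete,word_count,tags\n"
    "2021-02-26,en,False,True,388,10+414093\n"
    "2021-02-26,en,False,True,,10+20350917\n"
    "2021-02-26,fr,False,True,1502,10+10613413\n"
    "2021-02-26,en,True,False,100,10+15322\n"
)

chunker = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["language", "word_count"],  # skip columns we do not need
    dtype={"word_count": "Int64"},       # nullable integer: missing values stay <NA>, no float cast
    chunksize=2,                         # tiny on purpose; the text above uses 10000
)

# Reduce each chunk to per-language word totals so only small results are kept
partials = [chunk.groupby("language")["word_count"].sum() for chunk in chunker]

# Combine the per-chunk totals into one total per language
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

Because each chunk is reduced before the next one is read, memory use depends on the chunk size and the size of the aggregated result rather than on the size of the whole file.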