Loading CSV Files in Python
We have two fairly large CSV files on hand, 554 MB and 923 MB respectively. Loading either file in full takes a significant amount of time.
We can pass nrows=5 to load only the first 5 rows of a file and get an idea of what the data looks like.
Loading First 5 Rows
# Load Python library
import pandas as pd
# Load first 5 rows
pd.read_csv("/home/pi/Downloads/works-20210226.csv", nrows=5)
|   | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | 2021-02-26 | en | False | True | 388 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN |
| 1 | 2021-02-26 | en | False | True | 1638 | 10+20350917+34816907+23666027+23269305+2326930... | NaN |
| 2 | 2021-02-26 | en | False | True | 1502 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN |
| 3 | 2021-02-26 | en | False | True | 100 | 10+15322+54862755+20595867+32994286+663+471751... | NaN |
| 4 | 2021-02-26 | en | False | True | 994 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN |
pd.read_csv("/home/pi/Downloads/tags-20210226.csv", nrows=5)
|   | id | type | name | canonical | cached_count | merger_id |
|---|---|---|---|---|---|---|
| 0 | 1 | Media | TV Shows | True | 910 | NaN |
| 1 | 2 | Media | Movies | True | 1164 | NaN |
| 2 | 3 | Media | Books & Literature | True | 134 | NaN |
| 3 | 4 | Media | Cartoons & Comics & Graphic Novels | True | 166 | NaN |
| 4 | 5 | Media | Anime & Manga | True | 501 | NaN |
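The same nrows preview can also report the column types pandas infers, which is useful to know before reading a whole file. Here is a minimal sketch; the helper name `peek_csv` is hypothetical, and the tiny in-memory CSV is made-up data standing in for the real files:

```python
import io

import pandas as pd


def peek_csv(source, n=5):
    """Read only the first n rows and return them with the inferred dtypes."""
    sample = pd.read_csv(source, nrows=n)
    return sample, sample.dtypes


# Made-up stand-in data (not the real dump), used only to show the call.
demo = io.StringIO(
    "id,type,name,canonical\n"
    "1,Media,TV Shows,True\n"
    "2,Media,Movies,True\n"
    "3,Media,Books & Literature,True\n"
)

sample, dtypes = peek_csv(demo, n=2)
print(sample)
print(dtypes)
```

With a real file, the path string would take the place of the `demo` buffer. Note that dtypes inferred from a handful of rows may differ from those inferred for the full file.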
Loading Entire File
A few extra steps can save memory and potentially speed up loading. Reading the file in Jupyter Notebook takes about 54 seconds on my machine, so expect a wait.
We'll pass chunksize=10000 to read the file in chunks of 10,000 rows at a time, then use pd.concat() to combine the chunks into a single DataFrame. Chunking limits how much is parsed at once, but the combined DataFrame still holds the whole file in memory.
# The file is large, so read it in chunks of 10,000 rows
# to limit memory use while parsing
chunker = pd.read_csv("/home/pi/Downloads/works-20210226.csv", chunksize=10000)
# Combine chunks into a dataframe
works = pd.concat(chunker, ignore_index=True)
# First 5 rows
works.iloc[:5,:]
|   | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | 2021-02-26 | en | False | True | 388.0 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN |
| 1 | 2021-02-26 | en | False | True | 1638.0 | 10+20350917+34816907+23666027+23269305+2326930... | NaN |
| 2 | 2021-02-26 | en | False | True | 1502.0 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN |
| 3 | 2021-02-26 | en | False | True | 100.0 | 10+15322+54862755+20595867+32994286+663+471751... | NaN |
| 4 | 2021-02-26 | en | False | True | 994.0 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN |
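When only a summary is needed rather than the full table, each chunk can be reduced as it arrives instead of being concatenated. Here is a minimal sketch; the helper name `count_languages` is hypothetical, the in-memory CSV is made-up data, and only the `language` column name is taken from the preview above:

```python
import io
from collections import Counter

import pandas as pd


def count_languages(source, chunksize=10000):
    """Tally rows per language while holding only one chunk at a time."""
    totals = Counter()
    # usecols restricts parsing to the single column the summary needs.
    reader = pd.read_csv(source, usecols=["language"], chunksize=chunksize)
    for chunk in reader:
        # Reduce the chunk to a small dict of counts, then discard it.
        totals.update(chunk["language"].value_counts().to_dict())
    return pd.Series(totals, dtype="int64").sort_values(ascending=False)


# Made-up stand-in data (not the real dump), used only to show the call.
demo = io.StringIO(
    "creation date,language,word_count\n"
    "2021-02-26,en,388\n"
    "2021-02-26,en,1638\n"
    "2021-02-26,fr,1502\n"
    "2021-02-25,en,100\n"
    "2021-02-25,es,994\n"
)

totals = count_languages(demo, chunksize=2)
print(totals)
```

Because only one column is parsed and each chunk is dropped after it is counted, peak memory should stay close to the size of a single chunk. The trade-off is that the full DataFrame is never available for ad hoc exploration.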