Columns, N/A Values, and Simple Data Cleaning
In this section, we focus on navigating the data set and cleaning missing values.
Loading File
# Load Python library
import pandas as pd
# Read the CSV file in chunks to limit memory usage, then combine the chunks
path = "/home/pi/Downloads/works-20210226.csv"
chunker = pd.read_csv(path, chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
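Reading the file with chunksize also lets us filter each chunk before combining, which keeps only the rows we need in memory. A minimal sketch, using a small in-memory CSV in place of the real file (the data here is hypothetical):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for the real file (hypothetical data)
csv_data = io.StringIO(
    "creation date,language,word_count\n"
    "2021-02-26,en,388\n"
    "2021-02-26,fr,1638\n"
    "2021-02-26,en,1502\n"
)

# Filter each chunk as it is read, so only matching rows are kept in memory
chunker = pd.read_csv(csv_data, chunksize=2)
english = pd.concat(
    (chunk[chunk["language"] == "en"] for chunk in chunker),
    ignore_index=True,
)
print(len(english))  # 2
```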
Dataframe Shape
We can use shape to get the number of rows and columns in the data set. Specifically, shape[0] gives the number of rows, and shape[1] gives the number of columns.
# The number of rows and columns in the dataframe
works.shape
(7269693, 7)
# Number of rows
works.shape[0]
7269693
# Number of cols
works.shape[1]
7
As we can see from above, there are 7269693 rows and 7 columns in the dataframe.
Column Names and Selecting Columns
To preview the data set and its columns, we can print out the first few rows. However, sometimes there are so many columns in a data set that it is difficult to display them all on the screen. Instead, we can print the column names separately.
# View col names
works.columns
Index(['creation date', 'language', 'restricted', 'complete', 'word_count',
'tags', 'Unnamed: 6'],
dtype='object')
Often, we only need certain columns to work with. There are several ways to select columns; let’s take the language column as an example.
# Select a single column
works.language
0 en
1 en
2 en
3 en
4 en
..
7269688 en
7269689 en
7269690 en
7269691 en
7269692 en
Name: language, Length: 7269693, dtype: object
works['language']
0 en
1 en
2 en
3 en
4 en
..
7269688 en
7269689 en
7269690 en
7269691 en
7269692 en
Name: language, Length: 7269693, dtype: object
works.loc[:,'language']
0 en
1 en
2 en
3 en
4 en
..
7269688 en
7269689 en
7269690 en
7269691 en
7269692 en
Name: language, Length: 7269693, dtype: object
As we can see from above, the three methods yield exactly the same result; which one to use is a matter of preference. Note that the attribute style (works.language) only works when the column name is a valid Python identifier — a column like creation date, which contains a space, must be selected with brackets or loc.
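We can confirm that the three styles return identical results with Series.equals. A minimal sketch, using a small hypothetical stand-in for works:

```python
import pandas as pd

# A tiny frame standing in for works (hypothetical data)
works = pd.DataFrame({
    "language": ["en", "en", "fr"],
    "word_count": [388, 1638, 1502],
})

# All three selection styles return the same Series
a = works.language
b = works["language"]
c = works.loc[:, "language"]
print(a.equals(b) and b.equals(c))  # True
```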
Let’s select multiple columns at the same time.
# Select multiple columns
works[['creation date','language']]
creation date | language | |
---|---|---|
0 | 2021-02-26 | en |
1 | 2021-02-26 | en |
2 | 2021-02-26 | en |
3 | 2021-02-26 | en |
4 | 2021-02-26 | en |
... | ... | ... |
7269688 | 2008-09-13 | en |
7269689 | 2008-09-13 | en |
7269690 | 2008-09-13 | en |
7269691 | 2008-09-13 | en |
7269692 | 2008-09-13 | en |
7269693 rows × 2 columns
works.loc[:,['creation date', 'language']]
creation date | language | |
---|---|---|
0 | 2021-02-26 | en |
1 | 2021-02-26 | en |
2 | 2021-02-26 | en |
3 | 2021-02-26 | en |
4 | 2021-02-26 | en |
... | ... | ... |
7269688 | 2008-09-13 | en |
7269689 | 2008-09-13 | en |
7269690 | 2008-09-13 | en |
7269691 | 2008-09-13 | en |
7269692 | 2008-09-13 | en |
7269693 rows × 2 columns
With the two columns on hand, we can proceed with more in-depth analysis. For example, we could:
- Find the total number of languages on AO3
- Find the top 5 most popular languages on AO3
- Visualize language trends over time
- Analyze users’ posting habits
and so on.
But before we can do any analysis, we need to clean the data set.
N/A values
In the real world, a data set may contain missing values (shown as NaN). To prepare the data for further analysis, we need to detect the null values, inspect the affected rows, and eventually drop them.
For more details behind the scenes, check out Handling Missing Data.
# Detect null values in data
works.isna().any()
creation date False
language True
restricted False
complete False
word_count True
tags True
Unnamed: 6 True
dtype: bool
# Alternatively using isnull(), same results
works.isnull().any()
creation date False
language True
restricted False
complete False
word_count True
tags True
Unnamed: 6 True
dtype: bool
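While isna().any() only tells us whether each column has null values, isna().sum() tells us how many it has. A sketch with hypothetical data standing in for works:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for works, with a few missing values
works = pd.DataFrame({
    "language": ["en", None, "en", "en"],
    "word_count": [388.0, 1638.0, np.nan, 100.0],
})

# isna().sum() counts the null values in each column
counts = works.isna().sum()
print(counts["language"], counts["word_count"])  # 1 1
```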
Here it shows that the language, word_count, tags, and Unnamed: 6 columns all contain null values. Let’s check out what they look like before taking any action on them.
# Select language column, display only rows containing null values
works['language'][works['language'].isnull()]
73119 NaN
95222 NaN
184633 NaN
211955 NaN
266702 NaN
...
6197226 NaN
6210472 NaN
6216530 NaN
6266535 NaN
6272792 NaN
Name: language, Length: 90, dtype: object
# Same method, word_count column
works['word_count'][works['word_count'].isnull()]
1404 NaN
3846 NaN
3976 NaN
5458 NaN
6170 NaN
..
6531822 NaN
6559452 NaN
6755516 NaN
6847505 NaN
6897542 NaN
Name: word_count, Length: 2268, dtype: float64
If we decide to drop all rows that have a null value in a given column, we can use dropna with the subset parameter:
# Drop null values in language column
works.dropna(subset = ['language'])
creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | |
---|---|---|---|---|---|---|---|
0 | 2021-02-26 | en | False | True | 388.0 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN |
1 | 2021-02-26 | en | False | True | 1638.0 | 10+20350917+34816907+23666027+23269305+2326930... | NaN |
2 | 2021-02-26 | en | False | True | 1502.0 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN |
3 | 2021-02-26 | en | False | True | 100.0 | 10+15322+54862755+20595867+32994286+663+471751... | NaN |
4 | 2021-02-26 | en | False | True | 994.0 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
7269688 | 2008-09-13 | en | True | True | 705.0 | 78+77+84+101+104+105+106+23+13+16+70+933 | NaN |
7269689 | 2008-09-13 | en | False | True | 1392.0 | 78+77+84+107+23+10+16+70+933+616 | NaN |
7269690 | 2008-09-13 | en | False | True | 1755.0 | 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 | NaN |
7269691 | 2008-09-13 | en | False | True | 1338.0 | 112+113+13+114+16+115+101+117+118+119+120+116+... | NaN |
7269692 | 2008-09-13 | en | False | True | 1836.0 | 123+124+125+127+128+13+129+14+130+131+132+133+... | NaN |
7269603 rows × 7 columns
Compared to the 7269693 rows in the original data set, we dropped the 90 rows that contain missing values in the language column, which is exactly what we intended to achieve.
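One caveat worth noting: dropna returns a new DataFrame rather than modifying the original in place, so the result must be assigned back to be kept. A sketch with hypothetical data standing in for works:

```python
import pandas as pd

# Hypothetical stand-in for works, with a few missing values
works = pd.DataFrame({
    "language": ["en", None, "en"],
    "word_count": [388.0, 1638.0, None],
})

# dropna returns a new DataFrame; assign it back to keep the result
works = works.dropna(subset=["language"])
print(works.shape)  # (2, 2)

# subset also accepts several columns at once
cleaned = works.dropna(subset=["language", "word_count"])
print(cleaned.shape)  # (1, 2)
```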