Columns, N/A Values, and Simple Data Cleaning
In this section, we focus on navigating the data set and cleaning missing values.
Loading File
# Load Python library
import pandas as pd
# Read the CSV file in chunks to limit memory usage, then combine the chunks
path = "/home/pi/Downloads/works-20210226.csv"
chunker = pd.read_csv(path, chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
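Reading the file with chunksize also lets us filter each chunk before combining, which keeps only the rows we need in memory. A minimal sketch, using a small in-memory CSV in place of the real file (the data here is hypothetical):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for the real file (hypothetical data)
csv_data = io.StringIO(
    "creation date,language,word_count\n"
    "2021-02-26,en,388\n"
    "2021-02-26,fr,1638\n"
    "2021-02-26,en,1502\n"
)

# Filter each chunk as it is read, so only matching rows are kept in memory
chunker = pd.read_csv(csv_data, chunksize=2)
english = pd.concat(
    (chunk[chunk["language"] == "en"] for chunk in chunker),
    ignore_index=True,
)
print(len(english))  # 2
```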
Dataframe Shape
We can use shape to get the number of rows and columns in the data set. Specifically, shape[0] gives the number of rows, and shape[1] gives the number of columns.
# The number of rows and columns in the dataframe
works.shape
(7269693, 7)
# Number of rows
works.shape[0]
7269693
# Number of cols
works.shape[1]
7
As we can see from above, there are 7269693 rows and 7 columns in the dataframe.
Column Names and Selecting Columns
To preview the data set and its columns, we can print out the first few rows. However, sometimes there are so many columns in a data set that it is difficult to display them all on the screen. Instead, we can print the column names separately.
# View col names
works.columns
Index(['creation date', 'language', 'restricted', 'complete', 'word_count',
'tags', 'Unnamed: 6'],
dtype='object')
Often, we only need certain columns to work with. There are several ways to select columns; let’s take the language column as an example.
# Select a single column
works.language
0 en
1 en
2 en
3 en
4 en
..
7269688 en
7269689 en
7269690 en
7269691 en
7269692 en
Name: language, Length: 7269693, dtype: object
works['language']
0 en
1 en
2 en
3 en
4 en
..
7269688 en
7269689 en
7269690 en
7269691 en
7269692 en
Name: language, Length: 7269693, dtype: object
works.loc[:,'language']
0 en
1 en
2 en
3 en
4 en
..
7269688 en
7269689 en
7269690 en
7269691 en
7269692 en
Name: language, Length: 7269693, dtype: object
As we can see from above, the three methods yield exactly the same result; which one to use is a matter of preference. Note that the attribute style (works.language) only works when the column name is a valid Python identifier — a column like creation date, which contains a space, must be selected with brackets or loc.
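We can confirm that the three styles return identical results with Series.equals. A minimal sketch, using a small hypothetical stand-in for works:

```python
import pandas as pd

# A tiny frame standing in for works (hypothetical data)
works = pd.DataFrame({
    "language": ["en", "en", "fr"],
    "word_count": [388, 1638, 1502],
})

# All three selection styles return the same Series
a = works.language
b = works["language"]
c = works.loc[:, "language"]
print(a.equals(b) and b.equals(c))  # True
```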
Let’s select multiple columns at the same time.
# Select multiple columns
works[['creation date','language']]
creation date | language | |
---|---|---|
0 | 2021-02-26 | en |
1 | 2021-02-26 | en |
2 | 2021-02-26 | en |
3 | 2021-02-26 | en |
4 | 2021-02-26 | en |
... | ... | ... |
7269688 | 2008-09-13 | en |
7269689 | 2008-09-13 | en |
7269690 | 2008-09-13 | en |
7269691 | 2008-09-13 | en |
7269692 | 2008-09-13 | en |
7269693 rows × 2 columns
works.loc[:,['creation date', 'language']]
creation date | language | |
---|---|---|
0 | 2021-02-26 | en |
1 | 2021-02-26 | en |
2 | 2021-02-26 | en |
3 | 2021-02-26 | en |
4 | 2021-02-26 | en |
... | ... | ... |
7269688 | 2008-09-13 | en |
7269689 | 2008-09-13 | en |
7269690 | 2008-09-13 | en |
7269691 | 2008-09-13 | en |
7269692 | 2008-09-13 | en |
7269693 rows × 2 columns
With the two columns on hand, we can proceed with more in-depth analysis. For example, we could:
- Find the total number of languages on AO3
- Find the top 5 most popular languages on AO3
- Visualize language trends over time
- Analyze users’ posting habits
and so on.
But before we can do any analysis, we need to clean the data set.
N/A values
In the real world, a data set may contain missing values (shown as NaN). To prepare the data for further analysis, we need to detect the null values, inspect the affected rows, and eventually drop them.
For more details behind the scenes, check out Handling Missing Data.
# Detect null values in data
works.isna().any()
creation date False
language True
restricted False
complete False
word_count True
tags True
Unnamed: 6 True
dtype: bool
# Alternatively using isnull(), same results
works.isnull().any()
creation date False
language True
restricted False
complete False
word_count True
tags True
Unnamed: 6 True
dtype: bool
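While isna().any() only tells us whether each column has null values, isna().sum() tells us how many it has. A sketch with hypothetical data standing in for works:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for works, with a few missing values
works = pd.DataFrame({
    "language": ["en", None, "en", "en"],
    "word_count": [388.0, 1638.0, np.nan, 100.0],
})

# isna().sum() counts the null values in each column
counts = works.isna().sum()
print(counts["language"], counts["word_count"])  # 1 1
```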
Here it shows that the language, word_count, tags, and Unnamed: 6 columns all contain null values. Let’s check out what they look like before taking any action on them.
# Select language column, display only rows containing null values
works['language'][works['language'].isnull()]
73119 NaN
95222 NaN
184633 NaN
211955 NaN
266702 NaN
...
6197226 NaN
6210472 NaN
6216530 NaN
6266535 NaN
6272792 NaN
Name: language, Length: 90, dtype: object
# Same method, word_count column
works['word_count'][works['word_count'].isnull()]
1404 NaN
3846 NaN
3976 NaN
5458 NaN
6170 NaN
..
6531822 NaN
6559452 NaN
6755516 NaN
6847505 NaN
6897542 NaN
Name: word_count, Length: 2268, dtype: float64
If we decide to drop all rows that have a null value in a given column, we can use dropna with the subset parameter:
# Drop null values in language column
works.dropna(subset = ['language'])
creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | |
---|---|---|---|---|---|---|---|
0 | 2021-02-26 | en | False | True | 388.0 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN |
1 | 2021-02-26 | en | False | True | 1638.0 | 10+20350917+34816907+23666027+23269305+2326930... | NaN |
2 | 2021-02-26 | en | False | True | 1502.0 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN |
3 | 2021-02-26 | en | False | True | 100.0 | 10+15322+54862755+20595867+32994286+663+471751... | NaN |
4 | 2021-02-26 | en | False | True | 994.0 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
7269688 | 2008-09-13 | en | True | True | 705.0 | 78+77+84+101+104+105+106+23+13+16+70+933 | NaN |
7269689 | 2008-09-13 | en | False | True | 1392.0 | 78+77+84+107+23+10+16+70+933+616 | NaN |
7269690 | 2008-09-13 | en | False | True | 1755.0 | 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 | NaN |
7269691 | 2008-09-13 | en | False | True | 1338.0 | 112+113+13+114+16+115+101+117+118+119+120+116+... | NaN |
7269692 | 2008-09-13 | en | False | True | 1836.0 | 123+124+125+127+128+13+129+14+130+131+132+133+... | NaN |
7269603 rows × 7 columns
Compared to the 7269693 rows in the original data set, we dropped the 90 rows that contain missing values in the language column, which is exactly what we intended to achieve.
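One caveat worth noting: dropna returns a new DataFrame rather than modifying the original in place, so the result must be assigned back to be kept. A sketch with hypothetical data standing in for works:

```python
import pandas as pd

# Hypothetical stand-in for works, with a few missing values
works = pd.DataFrame({
    "language": ["en", None, "en"],
    "word_count": [388.0, 1638.0, None],
})

# dropna returns a new DataFrame; assign it back to keep the result
works = works.dropna(subset=["language"])
print(works.shape)  # (2, 2)

# subset also accepts several columns at once
cleaned = works.dropna(subset=["language", "word_count"])
print(cleaned.shape)  # (1, 2)
```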