In this section, we focus on navigating the data set and cleaning missing values.

Loading File

# Load Python library
import pandas as pd

# Load file
path = "/home/pi/Downloads/works-20210226.csv"
chunker = pd.read_csv(path, chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
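Reading the file in chunks keeps memory usage bounded, which matters for a file with over seven million rows. A minimal sketch of the same pattern (the in-memory CSV and column names here are made-up stand-ins for the real file) shows how rows can even be filtered chunk by chunk, so that rows we don't need are never all held in memory at once:

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical data)
csv_data = io.StringIO("language,word_count\nen,100\nfr,200\nen,300\n")

# Process the file chunk by chunk, keeping only the rows we need,
# so memory stays bounded even for very large files
chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunks.append(chunk[chunk["language"] == "en"])
filtered = pd.concat(chunks, ignore_index=True)
print(filtered.shape)  # (2, 2)
```

If no filtering is needed, concatenating the chunks directly, as above, reassembles the full dataframe.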

Dataframe Shape

We can use shape to get the number of rows and columns in the data set. Specifically, shape[0] displays the number of rows, and shape[1] displays the number of columns.

# The number of rows and columns in the dataframe
works.shape
(7269693, 7)
# Number of rows
works.shape[0]
7269693
# Number of cols
works.shape[1]
7

As we can see from above, there are 7269693 rows and 7 columns in the dataframe.

Column Names and Selecting Columns

To preview the data set and its columns, we can print out the first few rows. However, sometimes a data set has so many columns that it is difficult to display them all on the screen. Instead, we can print the column names separately.

# View col names
works.columns
Index(['creation date', 'language', 'restricted', 'complete', 'word_count',
       'tags', 'Unnamed: 6'],
      dtype='object')

Often, we only need certain columns to work with. There are several ways to select a column; let's take the language column as an example.

# Select a single column 
works.language
0          en
1          en
2          en
3          en
4          en
           ..
7269688    en
7269689    en
7269690    en
7269691    en
7269692    en
Name: language, Length: 7269693, dtype: object
works['language']
0          en
1          en
2          en
3          en
4          en
           ..
7269688    en
7269689    en
7269690    en
7269691    en
7269692    en
Name: language, Length: 7269693, dtype: object
works.loc[:,'language']
0          en
1          en
2          en
3          en
4          en
           ..
7269688    en
7269689    en
7269690    en
7269691    en
7269692    en
Name: language, Length: 7269693, dtype: object

As we can see from above, the three methods yield exactly the same result; which one to use is largely a matter of preference.
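One caveat worth knowing: attribute access (works.language) only works when the column name is a valid Python identifier. A column like creation date, whose name contains a space, must be selected with brackets or .loc. A quick sketch on toy data:

```python
import pandas as pd

# Toy dataframe mimicking the column names in this data set
df = pd.DataFrame({"creation date": ["2021-02-26"], "language": ["en"]})

lang = df.language             # attribute access works: no space in the name
created = df["creation date"]  # bracket notation required: name has a space
# df.creation date             # would be a SyntaxError
```

For this reason, bracket notation is the safer habit when column names are not under your control.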

Let’s select multiple columns at the same time.

# Select multiple columns
works[['creation date','language']]
creation date language
0 2021-02-26 en
1 2021-02-26 en
2 2021-02-26 en
3 2021-02-26 en
4 2021-02-26 en
... ... ...
7269688 2008-09-13 en
7269689 2008-09-13 en
7269690 2008-09-13 en
7269691 2008-09-13 en
7269692 2008-09-13 en

7269693 rows × 2 columns

works.loc[:,['creation date', 'language']]
creation date language
0 2021-02-26 en
1 2021-02-26 en
2 2021-02-26 en
3 2021-02-26 en
4 2021-02-26 en
... ... ...
7269688 2008-09-13 en
7269689 2008-09-13 en
7269690 2008-09-13 en
7269691 2008-09-13 en
7269692 2008-09-13 en

7269693 rows × 2 columns

With the two columns on hand, we can proceed with more in-depth analysis. For example, we could:

  • Find the total number of languages on AO3
  • Find the top 5 most popular languages on AO3
  • Visualize language trends
  • Analyze users’ posting habits

etc.
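As a hint of how the first two items might be done, nunique() counts distinct values and value_counts() tallies them in descending order. A minimal sketch on made-up data (the real analysis would run on works['language']):

```python
import pandas as pd

# Hypothetical stand-in for the language column
lang = pd.Series(["en", "en", "zh", "fr", "en", "zh"], name="language")

# Total number of distinct languages
n_languages = lang.nunique()

# Languages ranked by frequency; .head(5) would give the top 5
top = lang.value_counts()

print(n_languages, top.index[0])  # 3 en
```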

But before we can do any analysis, we need to clean the data set.

N/A Values

In the real world, a data set may contain missing values (showing up as NaN). To prepare the data for further analysis, we need to detect the null values, single out the affected rows, and eventually drop them.

For more details behind the scenes, check out Handling Missing Data.

# Detect null values in data
works.isna().any()
creation date    False
language          True
restricted       False
complete         False
word_count        True
tags              True
Unnamed: 6        True
dtype: bool
# isnull() is an alias of isna() and gives the same result
works.isnull().any()
creation date    False
language          True
restricted       False
complete         False
word_count        True
tags              True
Unnamed: 6        True
dtype: bool
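Beyond knowing which columns contain nulls, isna().sum() counts how many there are per column, which helps gauge how much data would be lost by dropping rows. A sketch on toy data (the values here are invented):

```python
import numpy as np
import pandas as pd

# Toy dataframe with deliberately missing values
df = pd.DataFrame({
    "language": ["en", None, "fr"],
    "word_count": [100.0, np.nan, np.nan],
})

# Per-column count of null values
null_counts = df.isna().sum()
print(null_counts["language"], null_counts["word_count"])  # 1 2
```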

Here it shows that the language, word_count, tags, and Unnamed: 6 columns all have null values. Let’s check what they look like before taking any action on them.

# Select language column, display only rows containing null values
works['language'][works['language'].isnull()]
73119      NaN
95222      NaN
184633     NaN
211955     NaN
266702     NaN
          ... 
6197226    NaN
6210472    NaN
6216530    NaN
6266535    NaN
6272792    NaN
Name: language, Length: 90, dtype: object
# Same method, word_count column
works['word_count'][works['word_count'].isnull()]
1404      NaN
3846      NaN
3976      NaN
5458      NaN
6170      NaN
           ..
6531822   NaN
6559452   NaN
6755516   NaN
6847505   NaN
6897542   NaN
Name: word_count, Length: 2268, dtype: float64

If we decide to drop every row with a null value in the language column, we could:

# Drop null values in language column
works.dropna(subset=['language'])
creation date language restricted complete word_count tags Unnamed: 6
0 2021-02-26 en False True 388.0 10+414093+1001939+4577144+1499536+110+4682892+... NaN
1 2021-02-26 en False True 1638.0 10+20350917+34816907+23666027+23269305+2326930... NaN
2 2021-02-26 en False True 1502.0 10+10613413+9780526+3763877+3741104+7657229+30... NaN
3 2021-02-26 en False True 100.0 10+15322+54862755+20595867+32994286+663+471751... NaN
4 2021-02-26 en False True 994.0 11+721553+54604+1439500+3938423+53483274+54862... NaN
... ... ... ... ... ... ... ...
7269688 2008-09-13 en True True 705.0 78+77+84+101+104+105+106+23+13+16+70+933 NaN
7269689 2008-09-13 en False True 1392.0 78+77+84+107+23+10+16+70+933+616 NaN
7269690 2008-09-13 en False True 1755.0 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 NaN
7269691 2008-09-13 en False True 1338.0 112+113+13+114+16+115+101+117+118+119+120+116+... NaN
7269692 2008-09-13 en False True 1836.0 123+124+125+127+128+13+129+14+130+131+132+133+... NaN

7269603 rows × 7 columns

Compared to the 7269693 rows in the original data set, we have dropped the 90 rows containing missing values in the language column, which is exactly what we intended.
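Note that dropna() returns a new dataframe rather than modifying works in place, so the result must be assigned to a variable to be kept. A sketch on toy data (invented values) verifying this, and confirming that only the subset column is considered:

```python
import pandas as pd

# Toy dataframe with one null per column
df = pd.DataFrame({
    "language": ["en", None, "fr"],
    "word_count": [1.0, 2.0, None],
})

# dropna returns a copy; df itself is unchanged until we reassign
cleaned = df.dropna(subset=["language"])

print(df.shape[0], cleaned.shape[0])  # 3 2
# The null word_count row survives: only 'language' was in subset
print(cleaned["word_count"].isna().sum())  # 1
```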