In part II, we continue to prepare the data for visualization.

Loading Files

In part I, we created a csv file named rating.csv. We’ll load that file here along with works-20210226.csv from the AO3 official data dump.

# Load python libraries
import pandas as pd
# Load works file
works = pd.read_csv("/home/pi/Downloads/works-20210226.csv")
# Load rating.csv from part I
rating = pd.read_csv("rating.csv")
# Preview file
rating
id type name canonical cached_count merger_id
0 9 Rating Not Rated True 825385 NaN
1 10 Rating General Audiences True 2115153 NaN
2 11 Rating Teen And Up Audiences True 2272688 NaN
3 12 Rating Mature True 1151260 NaN
4 13 Rating Explicit True 1238331 NaN
5 12766726 Rating Teen & Up Audiences False 333 NaN
# preview file
works
creation date language restricted complete word_count tags Unnamed: 6
0 2021-02-26 en False True 388.0 10+414093+1001939+4577144+1499536+110+4682892+... NaN
1 2021-02-26 en False True 1638.0 10+20350917+34816907+23666027+23269305+2326930... NaN
2 2021-02-26 en False True 1502.0 10+10613413+9780526+3763877+3741104+7657229+30... NaN
3 2021-02-26 en False True 100.0 10+15322+54862755+20595867+32994286+663+471751... NaN
4 2021-02-26 en False True 994.0 11+721553+54604+1439500+3938423+53483274+54862... NaN
... ... ... ... ... ... ... ...
7269688 2008-09-13 en True True 705.0 78+77+84+101+104+105+106+23+13+16+70+933 NaN
7269689 2008-09-13 en False True 1392.0 78+77+84+107+23+10+16+70+933+616 NaN
7269690 2008-09-13 en False True 1755.0 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 NaN
7269691 2008-09-13 en False True 1338.0 112+113+13+114+16+115+101+117+118+119+120+116+... NaN
7269692 2008-09-13 en False True 1836.0 123+124+125+127+128+13+129+14+130+131+132+133+... NaN

7269693 rows × 7 columns
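Since the works file holds over seven million rows, loading all seven columns can strain a small machine. A memory-saving sketch using read_csv's usecols parameter, shown here on a small in-memory stand-in for the real file (the sample row is illustrative, not taken from the dump):

```python
import io
import pandas as pd

# A tiny stand-in for works-20210226.csv, with the same header
csv_text = (
    "creation date,language,restricted,complete,word_count,tags\n"
    "2021-02-26,en,False,True,388.0,10+414093\n"
)
# Load only the two columns used later in this post
works_small = pd.read_csv(io.StringIO(csv_text), usecols=["creation date", "tags"])
```

Against the real file, the same usecols argument would skip parsing and storing the five unused columns entirely.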

Data Cleaning

In part I, we went through detailed steps of how to clean and prepare the works DataFrame for visualization. Here we’ll skip the explanation.

# Drop rows with NA in the tags column
works = works.dropna(subset=['tags'])
# Keep only the columns we need
works_subset = works[['creation date','tags']].copy(deep=True)
works_subset
creation date tags
0 2021-02-26 10+414093+1001939+4577144+1499536+110+4682892+...
1 2021-02-26 10+20350917+34816907+23666027+23269305+2326930...
2 2021-02-26 10+10613413+9780526+3763877+3741104+7657229+30...
3 2021-02-26 10+15322+54862755+20595867+32994286+663+471751...
4 2021-02-26 11+721553+54604+1439500+3938423+53483274+54862...
... ... ...
7269688 2008-09-13 78+77+84+101+104+105+106+23+13+16+70+933
7269689 2008-09-13 78+77+84+107+23+10+16+70+933+616
7269690 2008-09-13 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858
7269691 2008-09-13 112+113+13+114+16+115+101+117+118+119+120+116+...
7269692 2008-09-13 123+124+125+127+128+13+129+14+130+131+132+133+...

7269690 rows × 2 columns

Clearing Memory

The full works DataFrame is no longer needed, so we delete it and call gc.collect() to release the unreferenced memory.

import gc
del works
gc.collect()
88
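To check how much memory a DataFrame actually occupies before deciding what to delete, pandas offers memory_usage(); a minimal sketch on a toy frame:

```python
import pandas as pd

# deep=True counts the bytes of the string contents themselves,
# not just the object pointers, so it reflects real memory use
df = pd.DataFrame({"tags": ["10+414093", "11+721553"]})
total_bytes = df.memory_usage(deep=True).sum()
```

On the full works DataFrame this sum runs to gigabytes, which is why dropping it before the groupby step matters on a machine like a Raspberry Pi.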

Rating Column

# Rating tag IDs (9–13) are the smallest IDs on each work, so the
# minimum tag ID in the string is usually the work's rating
def find_rating(x):
    return min([int(n) for n in x.split('+')])
# Create a new column named 'rating'
# Use apply() to apply the function to each tag string
works_subset['rating'] = works_subset['tags'].apply(find_rating)
# Drop works whose minimum tag ID is not a rating, then drop the tags column
works_subset = works_subset[works_subset['rating'].isin(rating['id'])].drop('tags', axis=1)
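To see what find_rating does, here it is on tag strings shaped like those in the previews above (the sample strings are illustrative, not taken from the dump):

```python
# Rating tag IDs are the smallest numbers in each tag string,
# so taking the minimum recovers the rating
def find_rating(x):
    return min([int(n) for n in x.split('+')])

find_rating("10+414093+1001939")  # General Audiences (10)
find_rating("78+77+84+23+13+16")  # Explicit (13)
```

Works whose smallest tag ID happens not to be a rating are caught by the isin(rating['id']) filter above, which is why the row count drops slightly afterwards.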

Preparation for Visualization

# Clear Memory
gc.collect()
88
# Preview file
works_subset
creation date rating
0 2021-02-26 10
1 2021-02-26 10
2 2021-02-26 10
3 2021-02-26 10
4 2021-02-26 11
... ... ...
7269688 2008-09-13 13
7269689 2008-09-13 10
7269690 2008-09-13 9
7269691 2008-09-13 13
7269692 2008-09-13 13

7269202 rows × 2 columns

# Make sure the date column is in datetime format
works_subset['creation date'] = pd.to_datetime(works_subset['creation date'])
# Group the data by month (freq="1M") and count each rating
# pd.Grouper(key, freq) is used instead of pd.Series.dt.month
# because dt.month would merge the same month across different years
# Group the rating by counting each rating

subset_group = works_subset.groupby([pd.Grouper(key='creation date',freq='1M'),'rating']).size()
# Reset index
subset_group = subset_group.reset_index()
# Rename columns
subset_group.rename(columns={0:'count'}, inplace=True)
subset_group
creation date rating count
0 2008-09-30 9 76
1 2008-09-30 10 232
2 2008-09-30 11 213
3 2008-09-30 12 174
4 2008-09-30 13 233
... ... ... ...
745 2021-02-28 9 15863
746 2021-02-28 10 42624
747 2021-02-28 11 46716
748 2021-02-28 12 23610
749 2021-02-28 13 25034

750 rows × 3 columns
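A small demonstration of why pd.Grouper is used here rather than pd.Series.dt.month, on made-up dates: dt.month collapses the same month across years into one bucket, while Grouper keeps each year-month separate.

```python
import pandas as pd

df = pd.DataFrame({
    "creation date": pd.to_datetime(["2008-09-13", "2009-09-05", "2009-09-20"]),
    "rating": [9, 9, 10],
})

# dt.month: both Septembers fall into a single bucket labelled 9
by_month_num = df.groupby(df["creation date"].dt.month).size()

# pd.Grouper: 2008-09-30 and 2009-09-30 stay distinct
by_period = df.groupby(pd.Grouper(key="creation date", freq="1M")).size()
```

Grouper also emits the empty months in between with a count of zero, which keeps the monthly time axis continuous for plotting.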

# Reshape the data into a pivot table: one row per month, one column per rating
subset_pivot = subset_group.pivot(index='creation date', columns='rating', values='count')
subset_pivot
rating 9 10 11 12 13
creation date
2008-09-30 76 232 213 174 233
2008-10-31 38 111 93 43 196
2008-11-30 11 97 97 56 76
2008-12-31 2 93 47 41 56
2009-01-31 18 175 104 78 133
... ... ... ... ... ...
2020-10-31 14188 42416 47706 22015 28723
2020-11-30 13397 38003 42168 19005 21743
2020-12-31 15763 50443 51435 22664 26656
2021-01-31 16875 45592 51099 23830 27084
2021-02-28 15863 42624 46716 23610 25034

150 rows × 5 columns

Exporting to a CSV File

To access the DataFrame quickly in part III, we export it to a local CSV file.

subset_pivot.to_csv("rating_pivot.csv")
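As a sanity check for part III, the pivot can be round-tripped through CSV. A sketch using an in-memory buffer in place of the real file, with parse_dates so the index comes back as datetimes (note that the rating column labels return as strings after the round trip):

```python
import io
import pandas as pd

# A one-row stand-in for subset_pivot
pivot = pd.DataFrame({9: [76], 10: [232]},
                     index=pd.to_datetime(["2008-09-30"]))
pivot.index.name = "creation date"

# Write to a buffer and read it back the way part III would
buf = io.StringIO()
pivot.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col="creation date", parse_dates=True)
```

Without parse_dates the index would load as plain strings, and any time-series plotting in part III would need an extra pd.to_datetime() step.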