In part II, we continue to prepare the data for visualization.

Loading Files

In part I, we created a csv file named rating.csv. We’ll load that file here along with works-20210226.csv from the AO3 official data dump.

# Load python libraries
import pandas as pd
# Load works file
works = pd.read_csv("/home/pi/Downloads/works-20210226.csv")
# Load rating.csv from part I
rating = pd.read_csv("rating.csv")
# Preview file
rating
id type name canonical cached_count merger_id
0 9 Rating Not Rated True 825385 NaN
1 10 Rating General Audiences True 2115153 NaN
2 11 Rating Teen And Up Audiences True 2272688 NaN
3 12 Rating Mature True 1151260 NaN
4 13 Rating Explicit True 1238331 NaN
5 12766726 Rating Teen & Up Audiences False 333 NaN
# preview file
works
creation date language restricted complete word_count tags Unnamed: 6
0 2021-02-26 en False True 388.0 10+414093+1001939+4577144+1499536+110+4682892+... NaN
1 2021-02-26 en False True 1638.0 10+20350917+34816907+23666027+23269305+2326930... NaN
2 2021-02-26 en False True 1502.0 10+10613413+9780526+3763877+3741104+7657229+30... NaN
3 2021-02-26 en False True 100.0 10+15322+54862755+20595867+32994286+663+471751... NaN
4 2021-02-26 en False True 994.0 11+721553+54604+1439500+3938423+53483274+54862... NaN
... ... ... ... ... ... ... ...
7269688 2008-09-13 en True True 705.0 78+77+84+101+104+105+106+23+13+16+70+933 NaN
7269689 2008-09-13 en False True 1392.0 78+77+84+107+23+10+16+70+933+616 NaN
7269690 2008-09-13 en False True 1755.0 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 NaN
7269691 2008-09-13 en False True 1338.0 112+113+13+114+16+115+101+117+118+119+120+116+... NaN
7269692 2008-09-13 en False True 1836.0 123+124+125+127+128+13+129+14+130+131+132+133+... NaN

7269693 rows × 7 columns
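Since the works file holds over seven million rows, loading all seven columns can strain a small machine. A memory-saving sketch using read_csv's usecols parameter, shown here on a small in-memory stand-in for the real file (the sample row is illustrative, not taken from the dump):

```python
import io
import pandas as pd

# A tiny stand-in for works-20210226.csv, with the same header
csv_text = (
    "creation date,language,restricted,complete,word_count,tags\n"
    "2021-02-26,en,False,True,388.0,10+414093\n"
)
# Load only the two columns used later in this post
works_small = pd.read_csv(io.StringIO(csv_text), usecols=["creation date", "tags"])
```

Against the real file, the same usecols argument would skip parsing and storing the five unused columns entirely.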

Data Cleaning

In part I, we went through detailed steps of how to clean and prepare the works DataFrame for visualization. Here we’ll skip the explanation.

# Drop rows with NA in the tags column
works = works.dropna(subset=['tags'])
# Keep only the columns we need
works_subset = works[['creation date','tags']].copy(deep=True)
works_subset
creation date tags
0 2021-02-26 10+414093+1001939+4577144+1499536+110+4682892+...
1 2021-02-26 10+20350917+34816907+23666027+23269305+2326930...
2 2021-02-26 10+10613413+9780526+3763877+3741104+7657229+30...
3 2021-02-26 10+15322+54862755+20595867+32994286+663+471751...
4 2021-02-26 11+721553+54604+1439500+3938423+53483274+54862...
... ... ...
7269688 2008-09-13 78+77+84+101+104+105+106+23+13+16+70+933
7269689 2008-09-13 78+77+84+107+23+10+16+70+933+616
7269690 2008-09-13 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858
7269691 2008-09-13 112+113+13+114+16+115+101+117+118+119+120+116+...
7269692 2008-09-13 123+124+125+127+128+13+129+14+130+131+132+133+...

7269690 rows × 2 columns

Clearing Memory

The full works DataFrame is no longer needed, so we delete it and call gc.collect() to release the unreferenced memory.

import gc
del works
gc.collect()
88
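To check how much memory a DataFrame actually occupies before deciding what to delete, pandas offers memory_usage(); a minimal sketch on a toy frame:

```python
import pandas as pd

# deep=True counts the bytes of the string contents themselves,
# not just the object pointers, so it reflects real memory use
df = pd.DataFrame({"tags": ["10+414093", "11+721553"]})
total_bytes = df.memory_usage(deep=True).sum()
```

On the full works DataFrame this sum runs to gigabytes, which is why dropping it before the groupby step matters on a machine like a Raspberry Pi.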

Rating Column

# Rating tag IDs (9–13) are the smallest IDs on each work, so the
# minimum tag ID in the string is usually the work's rating
def find_rating(x):
    return min([int(n) for n in x.split('+')])
# Create a new column named 'rating'
# Use apply() to apply the function to each tag string
works_subset['rating'] = works_subset['tags'].apply(find_rating)
# Drop works whose minimum tag ID is not a rating, then drop the tags column
works_subset = works_subset[works_subset['rating'].isin(rating['id'])].drop('tags', axis=1)
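To see what find_rating does, here it is on tag strings shaped like those in the previews above (the sample strings are illustrative, not taken from the dump):

```python
# Rating tag IDs are the smallest numbers in each tag string,
# so taking the minimum recovers the rating
def find_rating(x):
    return min([int(n) for n in x.split('+')])

find_rating("10+414093+1001939")  # General Audiences (10)
find_rating("78+77+84+23+13+16")  # Explicit (13)
```

Works whose smallest tag ID happens not to be a rating are caught by the isin(rating['id']) filter above, which is why the row count drops slightly afterwards.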

Preparation for Visualization

# Clear Memory
gc.collect()
88
# Preview file
works_subset
creation date rating
0 2021-02-26 10
1 2021-02-26 10
2 2021-02-26 10
3 2021-02-26 10
4 2021-02-26 11
... ... ...
7269688 2008-09-13 13
7269689 2008-09-13 10
7269690 2008-09-13 9
7269691 2008-09-13 13
7269692 2008-09-13 13

7269202 rows × 2 columns

# Make sure the date column is in datetime format
works_subset['creation date'] = pd.to_datetime(works_subset['creation date'])
# Group the data by month (freq="1M") and count each rating
# pd.Grouper(key, freq) is used instead of pd.Series.dt.month
# because dt.month would merge the same month across different years
# Group the rating by counting each rating

subset_group = works_subset.groupby([pd.Grouper(key='creation date',freq='1M'),'rating']).size()
# Reset index
subset_group = subset_group.reset_index()
# Rename columns
subset_group.rename(columns={0:'count'}, inplace=True)
subset_group
creation date rating count
0 2008-09-30 9 76
1 2008-09-30 10 232
2 2008-09-30 11 213
3 2008-09-30 12 174
4 2008-09-30 13 233
... ... ... ...
745 2021-02-28 9 15863
746 2021-02-28 10 42624
747 2021-02-28 11 46716
748 2021-02-28 12 23610
749 2021-02-28 13 25034

750 rows × 3 columns
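A small demonstration of why pd.Grouper is used here rather than pd.Series.dt.month, on made-up dates: dt.month collapses the same month across years into one bucket, while Grouper keeps each year-month separate.

```python
import pandas as pd

df = pd.DataFrame({
    "creation date": pd.to_datetime(["2008-09-13", "2009-09-05", "2009-09-20"]),
    "rating": [9, 9, 10],
})

# dt.month: both Septembers fall into a single bucket labelled 9
by_month_num = df.groupby(df["creation date"].dt.month).size()

# pd.Grouper: 2008-09-30 and 2009-09-30 stay distinct
by_period = df.groupby(pd.Grouper(key="creation date", freq="1M")).size()
```

Grouper also emits the empty months in between with a count of zero, which keeps the monthly time axis continuous for plotting.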

# Reshape the data into a pivot table: one row per month, one column per rating
subset_pivot = subset_group.pivot(index='creation date', columns='rating', values='count')
subset_pivot
rating 9 10 11 12 13
creation date
2008-09-30 76 232 213 174 233
2008-10-31 38 111 93 43 196
2008-11-30 11 97 97 56 76
2008-12-31 2 93 47 41 56
2009-01-31 18 175 104 78 133
... ... ... ... ... ...
2020-10-31 14188 42416 47706 22015 28723
2020-11-30 13397 38003 42168 19005 21743
2020-12-31 15763 50443 51435 22664 26656
2021-01-31 16875 45592 51099 23830 27084
2021-02-28 15863 42624 46716 23610 25034

150 rows × 5 columns

Exporting to a CSV File

To access the DataFrame quickly in part III, we export it to a local CSV file.

subset_pivot.to_csv("rating_pivot.csv")
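As a sanity check for part III, the pivot can be round-tripped through CSV. A sketch using an in-memory buffer in place of the real file, with parse_dates so the index comes back as datetimes (note that the rating column labels return as strings after the round trip):

```python
import io
import pandas as pd

# A one-row stand-in for subset_pivot
pivot = pd.DataFrame({9: [76], 10: [232]},
                     index=pd.to_datetime(["2008-09-30"]))
pivot.index.name = "creation date"

# Write to a buffer and read it back the way part III would
buf = io.StringIO()
pivot.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col="creation date", parse_dates=True)
```

Without parse_dates the index would load as plain strings, and any time-series plotting in part III would need an extra pd.to_datetime() step.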