In Part III, we use the bar_chart_race package to create an animated chart for our data set.

Loading File

In Parts I and II, we prepared the DataFrames and saved them to local CSV files. We’ll load the two files here.

# Load python libraries
import pandas as pd
# Load rating.csv from part I
rating = pd.read_csv("rating.csv")
# preview file
rating
         id    type                   name  canonical  cached_count  merger_id
0         9  Rating              Not Rated       True        825385        NaN
1        10  Rating      General Audiences       True       2115153        NaN
2        11  Rating  Teen And Up Audiences       True       2272688        NaN
3        12  Rating                 Mature       True       1151260        NaN
4        13  Rating               Explicit       True       1238331        NaN
5  12766726  Rating    Teen & Up Audiences      False           333        NaN
# Load rating_pivot.csv from part II
df = pd.read_csv("rating_pivot.csv")
# preview file
df
     creation date      9     10     11     12     13
0       2008-09-30     76    232    213    174    233
1       2008-10-31     38    111     93     43    196
2       2008-11-30     11     97     97     56     76
3       2008-12-31      2     93     47     41     56
4       2009-01-31     18    175    104     78    133
...            ...    ...    ...    ...    ...    ...
145     2020-10-31  14188  42416  47706  22015  28723
146     2020-11-30  13397  38003  42168  19005  21743
147     2020-12-31  15763  50443  51435  22664  26656
148     2021-01-31  16875  45592  51099  23830  27084
149     2021-02-28  15863  42624  46716  23610  25034

150 rows × 6 columns

Data Cleaning

There is still some data cleaning to do, namely:

  • Making the “creation date” column the index;

  • Removing the day from the “creation date” strings, which identify a month but also include the last day of that month;

  • Changing the column names from tag ids to tag names.

# Set index
df.set_index("creation date", inplace=True)
df
                   9     10     11     12     13
creation date
2008-09-30        76    232    213    174    233
2008-10-31        38    111     93     43    196
2008-11-30        11     97     97     56     76
2008-12-31         2     93     47     41     56
2009-01-31        18    175    104     78    133
...              ...    ...    ...    ...    ...
2020-10-31     14188  42416  47706  22015  28723
2020-11-30     13397  38003  42168  19005  21743
2020-12-31     15763  50443  51435  22664  26656
2021-01-31     16875  45592  51099  23830  27084
2021-02-28     15863  42624  46716  23610  25034

150 rows × 5 columns

# Remove day from date string
# Use .str to access the string on each row
df.index = df.index.str[:-3]
df
                   9     10     11     12     13
creation date
2008-09           76    232    213    174    233
2008-10           38    111     93     43    196
2008-11           11     97     97     56     76
2008-12            2     93     47     41     56
2009-01           18    175    104     78    133
...              ...    ...    ...    ...    ...
2020-10        14188  42416  47706  22015  28723
2020-11        13397  38003  42168  19005  21743
2020-12        15763  50443  51435  22664  26656
2021-01        16875  45592  51099  23830  27084
2021-02        15863  42624  46716  23610  25034

150 rows × 5 columns
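
Slicing off the last three characters works because every date string has the same "YYYY-MM-DD" shape. As an alternative, here is a minimal sketch that parses the strings as dates and formats them as year-month instead. It assumes every index value is a valid date; the small frame named sample is made up for this illustration and copies the first rows shown above.

```python
import pandas as pd

# Illustration only: first three rows and two columns of the pivot table
sample = pd.DataFrame(
    {"9": [76, 38, 11], "10": [232, 111, 97]},
    index=pd.Index(["2008-09-30", "2008-10-31", "2008-11-30"], name="creation date"),
)

# Parse the strings as dates, then format each one as "YYYY-MM"
sample.index = pd.to_datetime(sample.index).strftime("%Y-%m")
print(sample.index.tolist())
```

Parsing first means a malformed date raises an error instead of being silently truncated, which may or may not matter for a data set this clean.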

# Change tag id to tag name
# Keep only the first five names: the sixth tag (id 12766726) is a
# non-canonical duplicate of "Teen And Up Audiences"
df.columns = rating.name[:5]
df
name           Not Rated  General Audiences  Teen And Up Audiences  Mature  Explicit
creation date
2008-09               76                232                    213     174       233
2008-10               38                111                     93      43       196
2008-11               11                 97                     97      56        76
2008-12                2                 93                     47      41        56
2009-01               18                175                    104      78       133
...                  ...                ...                    ...     ...       ...
2020-10            14188              42416                  47706   22015     28723
2020-11            13397              38003                  42168   19005     21743
2020-12            15763              50443                  51435   22664     26656
2021-01            16875              45592                  51099   23830     27084
2021-02            15863              42624                  46716   23610     25034

150 rows × 5 columns
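
Assigning the names by position relies on the columns of the pivot table being in the same order as the rows of rating.csv. A slightly more defensive sketch looks each column up by tag id instead. The two small frames below are cut-down copies of the previews above, and the names rating_sample, pivot_sample and id_to_name are made up for this illustration. It assumes the pivot columns were read from CSV as strings such as "9".

```python
import pandas as pd

# Illustration only: id and name columns from the rating.csv preview
rating_sample = pd.DataFrame({
    "id": [9, 10, 11, 12, 13, 12766726],
    "name": ["Not Rated", "General Audiences", "Teen And Up Audiences",
             "Mature", "Explicit", "Teen & Up Audiences"],
})

# Illustration only: one row of the pivot table, with string column labels
pivot_sample = pd.DataFrame(
    {"9": [76], "10": [232], "11": [213], "12": [174], "13": [233]},
    index=pd.Index(["2008-09"], name="creation date"),
)

# Build {"9": "Not Rated", ...}; ids are cast to str to match the column labels
id_to_name = dict(zip(rating_sample["id"].astype(str), rating_sample["name"]))

# rename() ignores mapping keys that are not columns, so the duplicate tag is skipped
renamed = pivot_sample.rename(columns=id_to_name)
print(list(renamed.columns))
```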

Monthly Posts vs. Cumulative Sum

We have two options here. The DataFrame contains the number of works posted per month under each rating category, and we could animate that as is. Alternatively, we can calculate the cumulative sum of posts. Its final row should be very close to, but not exactly, the total number of works on AO3 at the time of the data dump: remember that in previous posts, while cleaning the data set, we decided to drop some works because of N/A values or duplicates.

# Cumulative sum
df_cumsum = df.cumsum()
df_cumsum
name           Not Rated  General Audiences  Teen And Up Audiences   Mature  Explicit
creation date
2008-09               76                232                    213      174       233
2008-10              114                343                    306      217       429
2008-11              125                440                    403      273       505
2008-12              127                533                    450      314       561
2009-01              145                708                    554      392       694
...                  ...                ...                    ...      ...       ...
2020-10           651539            1895727                2006772  1014066   1081494
2020-11           664936            1933730                2048940  1033071   1103237
2020-12           680699            1984173                2100375  1055735   1129893
2021-01           697574            2029765                2151474  1079565   1156977
2021-02           713437            2072389                2198190  1103175   1182011

150 rows × 5 columns
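
A quick way to convince ourselves that the cumulative sum is doing what we expect: its last row should equal the plain column totals of the monthly table. The sketch below checks this on the first three months of two columns from the table above; monthly and running are names made up for this illustration.

```python
import pandas as pd

# Illustration only: first three months of two rating categories
monthly = pd.DataFrame(
    {"Not Rated": [76, 38, 11], "General Audiences": [232, 111, 97]},
    index=["2008-09", "2008-10", "2008-11"],
)

# Running total down each column
running = monthly.cumsum()

# The final cumulative row must match the column totals
assert (running.iloc[-1] == monthly.sum()).all()
print(running)
```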

# Export to local csv file
df_cumsum.to_csv("rating-cumsum.csv")

Animated Bar Chart with Bar-Chart-Race

We use the bar_chart_race package to automate the animation process. You can of course build the whole chart from scratch instead. The package's author has also published more tutorials on it.

# Load the package
import bar_chart_race as bcr

# Load gc to release memory manually and help keep Jupyter Notebook from crashing
import gc
# Clear memory
gc.collect()
31

Customization: show total work count

# Function to show total work count
# From bar-chart-race tutorial
def summary(values, ranks):
    total_works = values.sum()
    s = f'Total Works - {total_works:,.0f}'
    return {'x': .99, 'y': .05, 's': s, 'ha': 'right', 'size': 8}
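
To preview the label before rendering anything, the function can be called by hand. According to the package documentation, bar_chart_race passes it two pandas Series holding the current period's values and ranks. The sketch below repeats the function so that it runs on its own and feeds it the 2021-02 row of the cumulative table; the names values, ranks and text are made up for this illustration.

```python
import pandas as pd

# Same function as above, repeated so this snippet runs on its own
def summary(values, ranks):
    total_works = values.sum()
    s = f'Total Works - {total_works:,.0f}'
    return {'x': .99, 'y': .05, 's': s, 'ha': 'right', 'size': 8}

# Illustration only: the 2021-02 row of the cumulative table
values = pd.Series(
    [713437, 2072389, 2198190, 1103175, 1182011],
    index=["Not Rated", "General Audiences", "Teen And Up Audiences",
           "Mature", "Explicit"],
)
ranks = values.rank()  # not used by summary, but part of its signature

text = summary(values, ranks)
print(text['s'])
```

The x and y values are axes fractions, so .99 and .05 place the text near the bottom-right corner of the chart.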

Animation

# filename=None in order to display in Jupyter Notebook cell
# period_summary_func=summary to show total work count
bcr.bar_chart_race(
    df=df_cumsum,
    filename=None,
    period_summary_func=summary,
    title='AO3 Works Rating Breakdown \n 2008-2021',
)
/home/pi/.local/lib/python3.7/site-packages/bar_chart_race/_make_chart.py:286: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_yticklabels(self.df_values.columns)
/home/pi/.local/lib/python3.7/site-packages/bar_chart_race/_make_chart.py:287: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels([max_val] * len(ax.get_xticks()))
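
The two UserWarnings above are raised inside the package's own plotting code. If the chart renders as expected and the warnings are just noise, one option is to hide this specific message while the chart is being built. The helper below is a sketch using only the standard warnings module; call_quietly is a name made up for this illustration and is not part of bar_chart_race.

```python
import warnings

def call_quietly(func, *args, **kwargs):
    """Run func while hiding only the FixedFormatter UserWarning."""
    with warnings.catch_warnings():
        # The filter applies only inside this block; other warnings still show
        warnings.filterwarnings(
            "ignore",
            message="FixedFormatter should only be used together with FixedLocator",
            category=UserWarning,
        )
        return func(*args, **kwargs)
```

It would be used by passing the chart function and its arguments through the helper, for example call_quietly(bcr.bar_chart_race, df=df_cumsum, filename=None, period_summary_func=summary).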