Rating Tags in Works Part III
In part III, we use the bar_chart_race package to create an animated chart for our data set.
Loading File
In part I and part II, we prepared the DataFrames and saved them to local CSV files. We’ll load the two files here.
# Load python libraries
import pandas as pd
# Load rating.csv from part I
rating = pd.read_csv("rating.csv")
# preview file
rating
| id | type | name | canonical | cached_count | merger_id |
---|---|---|---|---|---|---|
0 | 9 | Rating | Not Rated | True | 825385 | NaN |
1 | 10 | Rating | General Audiences | True | 2115153 | NaN |
2 | 11 | Rating | Teen And Up Audiences | True | 2272688 | NaN |
3 | 12 | Rating | Mature | True | 1151260 | NaN |
4 | 13 | Rating | Explicit | True | 1238331 | NaN |
5 | 12766726 | Rating | Teen & Up Audiences | False | 333 | NaN |
# Load rating_pivot.csv from part II
df = pd.read_csv("rating_pivot.csv")
# preview file
df
| creation date | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|
0 | 2008-09-30 | 76 | 232 | 213 | 174 | 233 |
1 | 2008-10-31 | 38 | 111 | 93 | 43 | 196 |
2 | 2008-11-30 | 11 | 97 | 97 | 56 | 76 |
3 | 2008-12-31 | 2 | 93 | 47 | 41 | 56 |
4 | 2009-01-31 | 18 | 175 | 104 | 78 | 133 |
... | ... | ... | ... | ... | ... | ... |
145 | 2020-10-31 | 14188 | 42416 | 47706 | 22015 | 28723 |
146 | 2020-11-30 | 13397 | 38003 | 42168 | 19005 | 21743 |
147 | 2020-12-31 | 15763 | 50443 | 51435 | 22664 | 26656 |
148 | 2021-01-31 | 16875 | 45592 | 51099 | 23830 | 27084 |
149 | 2021-02-28 | 15863 | 42624 | 46716 | 23610 | 25034 |
150 rows × 6 columns
Data Cleaning
There is still some data cleaning to do, namely:
- Making the “creation date” column the index;
- Shortening the “creation date” strings: each value identifies the month the works were posted in, but the string also carries the last day of that month, which should be removed;
- Changing the column names from tag ids to tag names.
# Set index
df.set_index("creation date", inplace=True)
df
creation date | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|
2008-09-30 | 76 | 232 | 213 | 174 | 233 |
2008-10-31 | 38 | 111 | 93 | 43 | 196 |
2008-11-30 | 11 | 97 | 97 | 56 | 76 |
2008-12-31 | 2 | 93 | 47 | 41 | 56 |
2009-01-31 | 18 | 175 | 104 | 78 | 133 |
... | ... | ... | ... | ... | ... |
2020-10-31 | 14188 | 42416 | 47706 | 22015 | 28723 |
2020-11-30 | 13397 | 38003 | 42168 | 19005 | 21743 |
2020-12-31 | 15763 | 50443 | 51435 | 22664 | 26656 |
2021-01-31 | 16875 | 45592 | 51099 | 23830 | 27084 |
2021-02-28 | 15863 | 42624 | 46716 | 23610 | 25034 |
150 rows × 5 columns
# Remove day from date string
# Use .str to access the string on each row
df.index = df.index.str[:-3]
df
creation date | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|
2008-09 | 76 | 232 | 213 | 174 | 233 |
2008-10 | 38 | 111 | 93 | 43 | 196 |
2008-11 | 11 | 97 | 97 | 56 | 76 |
2008-12 | 2 | 93 | 47 | 41 | 56 |
2009-01 | 18 | 175 | 104 | 78 | 133 |
... | ... | ... | ... | ... | ... |
2020-10 | 14188 | 42416 | 47706 | 22015 | 28723 |
2020-11 | 13397 | 38003 | 42168 | 19005 | 21743 |
2020-12 | 15763 | 50443 | 51435 | 22664 | 26656 |
2021-01 | 16875 | 45592 | 51099 | 23830 | 27084 |
2021-02 | 15863 | 42624 | 46716 | 23610 | 25034 |
150 rows × 5 columns
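As a cross-check on the slicing step, the same result can be reached by parsing the strings as dates and formatting them as year-month. This is a minimal sketch on a three-row stand-in for the pivot table (the `sample` frame is hypothetical, with values copied from the preview above), not a replacement for the approach used in this post.

```python
import pandas as pd

# Small stand-in for the pivot table (hypothetical; values copied from the preview)
sample = pd.DataFrame(
    {"9": [76, 38, 11]},
    index=pd.Index(["2008-09-30", "2008-10-31", "2008-11-30"], name="creation date"),
)

# Option 1: string slicing, as in the post (drops the trailing "-DD")
by_slice = sample.index.str[:-3]

# Option 2: parse the strings as dates, then format them as "YYYY-MM"
by_parse = pd.to_datetime(sample.index).strftime("%Y-%m")

print(list(by_slice))
print(list(by_parse))
```

Slicing is shorter but assumes every string ends in a two-digit day; parsing would raise an error on a malformed date instead of silently truncating it.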
# Change tag id to tag name
# We keep only the first five names: the sixth row (tag id 12766726, "Teen & Up Audiences") is a non-canonical duplicate
df.columns = rating.name[:5]
df
creation date | Not Rated | General Audiences | Teen And Up Audiences | Mature | Explicit |
---|---|---|---|---|---|
2008-09 | 76 | 232 | 213 | 174 | 233 |
2008-10 | 38 | 111 | 93 | 43 | 196 |
2008-11 | 11 | 97 | 97 | 56 | 76 |
2008-12 | 2 | 93 | 47 | 41 | 56 |
2009-01 | 18 | 175 | 104 | 78 | 133 |
... | ... | ... | ... | ... | ... |
2020-10 | 14188 | 42416 | 47706 | 22015 | 28723 |
2020-11 | 13397 | 38003 | 42168 | 19005 | 21743 |
2020-12 | 15763 | 50443 | 51435 | 22664 | 26656 |
2021-01 | 16875 | 45592 | 51099 | 23830 | 27084 |
2021-02 | 15863 | 42624 | 46716 | 23610 | 25034 |
150 rows × 5 columns
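Assigning `rating.name[:5]` works because the five columns happen to be in the same order as the first five rows of the rating table. A sketch of an order-independent alternative is to build an id-to-name mapping and pass it to `rename`. The two small frames below are stand-ins built from the previews above, and `id_to_name` is a name introduced here for illustration.

```python
import pandas as pd

# Stand-ins for the two loaded files (values copied from the previews above)
rating = pd.DataFrame({
    "id": [9, 10, 11, 12, 13, 12766726],
    "name": ["Not Rated", "General Audiences", "Teen And Up Audiences",
             "Mature", "Explicit", "Teen & Up Audiences"],
})
df = pd.DataFrame(
    {"9": [76], "10": [232], "11": [213], "12": [174], "13": [233]},
    index=pd.Index(["2008-09"], name="creation date"),
)

# read_csv produces string column labels, so the integer ids are cast to str
id_to_name = dict(zip(rating["id"].astype(str), rating["name"]))

# rename only touches labels found in the mapping; the unused id 12766726 is ignored
renamed = df.rename(columns=id_to_name)
print(list(renamed.columns))
```

With a mapping, a reordered or partially missing set of columns would still get the right names.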
Monthly Posts vs. Cumulative Sum
We have two options here. The DataFrame contains the number of works posted per month under each rating category, and we can chart it as is. Alternatively, we can calculate the cumulative sum of posts and chart the running totals. The last row of the cumulative sum should be very close to the total number of works on AO3 at the time of the data dump. It will not match exactly because, in previous posts, we dropped some works from the data set while cleaning it, due to N/A values or duplicates.
# Cumulative sum
df_cumsum = df.cumsum()
df_cumsum
creation date | Not Rated | General Audiences | Teen And Up Audiences | Mature | Explicit |
---|---|---|---|---|---|
2008-09 | 76 | 232 | 213 | 174 | 233 |
2008-10 | 114 | 343 | 306 | 217 | 429 |
2008-11 | 125 | 440 | 403 | 273 | 505 |
2008-12 | 127 | 533 | 450 | 314 | 561 |
2009-01 | 145 | 708 | 554 | 392 | 694 |
... | ... | ... | ... | ... | ... |
2020-10 | 651539 | 1895727 | 2006772 | 1014066 | 1081494 |
2020-11 | 664936 | 1933730 | 2048940 | 1033071 | 1103237 |
2020-12 | 680699 | 1984173 | 2100375 | 1055735 | 1129893 |
2021-01 | 697574 | 2029765 | 2151474 | 1079565 | 1156977 |
2021-02 | 713437 | 2072389 | 2198190 | 1103175 | 1182011 |
150 rows × 5 columns
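To make the relationship between the two options concrete, here is a small sketch on the first three months of two categories (values copied from the tables above; `monthly`, `running` and `recovered` are names introduced here). It checks that the last row of the cumulative sum equals the column totals, and that the monthly counts can be recovered from the running totals with `diff`.

```python
import pandas as pd

# First three months of two rating categories, copied from the monthly table
monthly = pd.DataFrame(
    {"Not Rated": [76, 38, 11], "Explicit": [233, 196, 76]},
    index=pd.Index(["2008-09", "2008-10", "2008-11"], name="creation date"),
)

# Running totals: each row is the sum of all rows up to and including it
running = monthly.cumsum()
print(running)

# The last row of the running totals equals the per-column totals
totals_match = bool((running.iloc[-1] == monthly.sum()).all())

# diff() undoes cumsum(); the first row has no predecessor, so it is
# filled from the first row of the running totals
recovered = running.diff().fillna(running.iloc[0]).astype("int64")
```

Since either table can be derived from the other, saving only one of them to disk would also be enough.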
# Export to local csv file
df_cumsum.to_csv("rating-cumsum.csv")
Animated Bar Chart with Bar-Chart-Race
We use the bar_chart_race package to automate the animation process. You can of course create the whole chart from scratch like this. More tutorials about the package from the author can be found here.
# Load the package
import bar_chart_race as bcr
# Load gc to manually release memory in case Jupyter Notebook crashes
import gc
# Clear memory
gc.collect()
31
Customization: show total work count
# Function to show total work count
# From bar-chart-race tutorial
def summary(values, ranks):
    total_works = values.sum()
    s = f'Total Works - {total_works:,.0f}'
    return {'x': .99, 'y': .05, 's': s, 'ha': 'right', 'size': 8}
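The function can be tried outside the animation to see what it returns. The sketch below repeats the definition so that it runs on its own, and calls it with the last row of the cumulative table. It assumes the package passes the current period's values and ranks as pandas Series; the `ranks` built here is only a placeholder, since the function does not use it.

```python
import pandas as pd

def summary(values, ranks):
    # Sum the works across all rating categories for the current period
    total_works = values.sum()
    # Thousands separators, no decimals
    s = f'Total Works - {total_works:,.0f}'
    # Text position and styling for the label
    return {'x': .99, 'y': .05, 's': s, 'ha': 'right', 'size': 8}

# Last row (2021-02) of the cumulative table above
values = pd.Series({
    "Not Rated": 713437,
    "General Audiences": 2072389,
    "Teen And Up Audiences": 2198190,
    "Mature": 1103175,
    "Explicit": 1182011,
})
ranks = values.rank()  # placeholder; unused by summary

label = summary(values, ranks)
print(label['s'])  # Total Works - 7,269,202
```

The `x` and `y` values place the text near the bottom-right corner of the chart, and `ha='right'` right-aligns it at that position.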
Animation
# filename=None in order to display in Jupyter Notebook cell
# period_summary_func=summary to show total work count
bcr.bar_chart_race(df=df_cumsum, filename=None, period_summary_func=summary, title='AO3 Works Rating Breakdown \n 2008-2021')
/home/pi/.local/lib/python3.7/site-packages/bar_chart_race/_make_chart.py:286: UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_yticklabels(self.df_values.columns)
/home/pi/.local/lib/python3.7/site-packages/bar_chart_race/_make_chart.py:287: UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticklabels([max_val] * len(ax.get_xticks()))