AO3 Trivia: Popular Tags Part II
In this section, we use the Python circlify library and matplotlib to visualize the popular tags on AO3, following the Basic Circle Packing Chart example on Python Graph Gallery.
Loading File
# Load python libraries
import pandas as pd
Here we use the CSV file “fandom500.csv”, which contains the 500 most popular fandoms by cached_count, extracted from the original file tags-20210226.csv. The previous post covers how to find popular tags.
# Load data
# Only load first 20 rows from the file
fandom = pd.read_csv("fandom500.csv", nrows=20)
# preview
fandom[:5]
| | Unnamed: 0 | id | type | name | canonical | cached_count | merger_id |
|---|---|---|---|---|---|---|---|
| 0 | 94292 | 136512 | Fandom | Harry Potter - J. K. Rowling | True | 361919 | NaN |
| 1 | 25 | 27 | Fandom | Supernatural | True | 310300 | NaN |
| 2 | 230408 | 414093 | Fandom | Marvel Cinematic Universe | True | 240536 | NaN |
| 3 | 1553725 | 3828398 | Fandom | 僕のヒーローアカデミア \| Boku no Hero Academia \| My Hero ... | True | 204096 | NaN |
| 4 | 680695 | 1002903 | Fandom | 방탄소년단 \| Bangtan Boys \| BTS | True | 203097 | NaN |
Circlify
We use Circlify to generate a basic circle packing chart representing the 20 most popular fandoms on AO3.
%matplotlib inline
# Load matplotlib
import matplotlib.pyplot as plt
# Load the circlify library
import circlify
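# The input is a list of positive values sorted from largest to smallest.
# Note that the returned circles come back ordered from smallest to largest
# (see the preview that follows).
# Compute circle positions:
circles = circlify.circlify(
    fandom.cached_count.tolist(),
    show_enclosure=False,
    target_enclosure=circlify.Circle(x=0, y=0, r=1)
)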
# Preview
circles[:10]
[Circle(x=-0.5059826765465478, y=0.7164728244834262, r=0.12287299825597259, level=1, ex={'datum': 60008}),
Circle(x=-0.784426155399832, y=0.3403961376432391, r=0.12318588426100985, level=1, ex={'datum': 60314}),
Circle(x=0.30366446944493447, y=0.7688619710123179, r=0.12919805590982159, level=1, ex={'datum': 66345}),
Circle(x=-0.1494359955824714, y=0.7915035957280849, r=0.13002789025498526, level=1, ex={'datum': 67200}),
Circle(x=-0.5658025925053127, y=0.46985740969013184, r=0.13089379460507566, level=1, ex={'datum': 68098}),
Circle(x=-0.7481171059523007, y=-0.06757414988202783, r=0.15188595016235845, level=1, ex={'datum': 91692}),
Circle(x=-0.664278156659797, y=-0.3929803440624641, r=0.1551327343608837, level=1, ex={'datum': 95654}),
Circle(x=-0.28884396308873017, y=0.5424213773613182, r=0.15541305444415118, level=1, ex={'datum': 96000}),
Circle(x=-0.5500991559526954, y=0.17676556661001908, r=0.16261843069339346, level=1, ex={'datum': 105108}),
Circle(x=-0.07815880962512435, y=-0.729119157631173, r=0.16327618873069183, level=1, ex={'datum': 105960})]
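The radii in this preview are consistent with circlify sizing each circle so that its area, rather than its radius, is proportional to the input value. The following minimal check uses only two (radius, datum) pairs copied from the preview; the variable names are illustrative.

```python
import math

# (radius, datum) pairs copied from the circles preview
r_small, d_small = 0.12287299825597259, 60008
r_large, d_large = 0.16327618873069183, 105960

# If area (pi * r**2) is proportional to the datum,
# the squared-radius ratio should equal the datum ratio.
area_ratio = (r_large / r_small) ** 2
datum_ratio = d_large / d_small

print(round(area_ratio, 4), round(datum_ratio, 4))
print(math.isclose(area_ratio, datum_ratio, rel_tol=1e-6))
```

This matters when reading the chart: a fandom with twice the works gets a circle with twice the area, which is only about 1.41 times the radius.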
# Visualization with circlify.bubbles()
circlify.bubbles(circles=circles)
The built-in circlify.bubbles() function draws a circle packing chart from the cached_count values, but it is difficult to add annotations or other customizations to it. Instead, we follow the Basic Circle Packing Chart example on Python Graph Gallery and plot the chart with matplotlib.
Matplotlib Circles
# Create just a figure and only one subplot
# Must use plt.subplots instead of subplot
# In Jupyter notebook, include all code in one cell
fig, ax = plt.subplots(figsize=(10,10))
# Find axes boundaries
# ax.add_patch() does not update the axes limits automatically,
# so set them by hand to avoid drawing circles outside the figure
lim = max([max(abs(circle.x)+circle.r, abs(circle.y)+circle.r) for circle in circles])
plt.xlim(-lim,lim)
plt.ylim(-lim,lim)
# List of labels to add on the circles
labels = fandom.name
# Remove axis
# Add title
plt.axis('off')
plt.title('AO3 Popular Fandoms')
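# Draw the circles
# circlify returns the circles from smallest to largest (see the preview),
# while the labels run from largest to smallest, so reverse the labels
for circle, label in zip(circles, labels[::-1]):
    x, y, r = circle
    ax.add_patch(plt.Circle((x, y), r, alpha=0.2, linewidth=2))
    plt.annotate(label, (x, y), va='center', ha='center')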
The labels are now on the circles. However, the full names are too long and clutter the chart, so we are going to shorten them with regular expressions.
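It is worth confirming that each label lands on the circle built from its own count: the circles preview earlier is ordered from smallest to largest datum, whereas the file lists fandoms from largest to smallest. A minimal sketch of pairing labels by value instead of by position follows. The namedtuple is a hypothetical stand-in for circlify's Circle, the positions and radii are made up, and the lookup assumes the counts are unique.

```python
from collections import namedtuple

# Hypothetical stand-in for circlify's Circle (the real one also carries `level`)
Circle = namedtuple("Circle", ["x", "y", "r", "ex"])

# Names and counts as they appear in the file (largest first)
names = ["Harry Potter - J. K. Rowling", "Supernatural", "Marvel Cinematic Universe"]
counts = [361919, 310300, 240536]

# Made-up positions and radii, ordered smallest first like the preview
demo_circles = [
    Circle(0.5, 0.5, 0.40, {"datum": 240536}),
    Circle(-0.5, 0.5, 0.45, {"datum": 310300}),
    Circle(0.0, -0.4, 0.50, {"datum": 361919}),
]

# Look up each circle's label by its datum (assumes counts are unique)
label_by_count = dict(zip(counts, names))
paired = [(label_by_count[c.ex["datum"]], c.r) for c in demo_circles]
print(paired)
```

Pairing through `ex["datum"]` does not depend on the order in which the circles are returned, at the cost of breaking down if two fandoms share the same count.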
Regular Expression
Here’s the plan: we want to get rid of some additional information in the labels by keeping only the text before the parentheses, before the hyphen, or after the last | sign. A great way to test your regular expressions is regex101.
- Text string before the parentheses:
We use the lookahead assertion (?=\() to match a text string that is followed by a “(”, without making the “(” part of the match. Note that “\” is the escape character.
- Text string before a hyphen:
Same as above, we use the lookahead (?=\-) to match the string before the hyphen.
- Text string after |:
We first reverse the text string so that we can match the text before the first |, then we reverse the result back. This keeps the text after the last | in the original string.
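The three patterns can be tried one at a time before they are combined into a function. In this sketch, the second and third sample labels come from the data preview earlier, while "Sherlock (TV)" is an illustrative label chosen only to exercise the parentheses case.

```python
import re

# Text before the parentheses: the lookahead (?=\() stops the match
# right before "(" without consuming it.
before_paren = re.match(r".*(?=\()", "Sherlock (TV)").group()
print(repr(before_paren))   # 'Sherlock ' (trailing space, hence .strip() later)

# Text before the hyphen
before_hyphen = re.match(r".*(?=\-)", "Harry Potter - J. K. Rowling").group().strip()
print(repr(before_hyphen))  # 'Harry Potter'

# Text after the last "|": reverse the string, lazily match up to the
# first "|", then reverse the match back
s = "방탄소년단 | Bangtan Boys | BTS"
after_pipe = re.match(r"^(.*?)(?=\|)", s[::-1]).group().strip()[::-1]
print(repr(after_pipe))     # 'BTS'
```

Because `.*` is greedy, the first two patterns cut at the last "(" or "-" in the string, so a name that contains a hyphen of its own would be cut there as well.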
# Load library
import re
# Create a function to shorten a label
def short(s):
    if "(" in s:
        return re.match(r".*(?=\()", s).group().strip()
    elif "-" in s:
        return re.match(r".*(?=\-)", s).group().strip()
    elif "|" in s:
        return re.match(r"^(.*?)(?=\|)", s[::-1]).group().strip()[::-1]
    else:
        return s
# Original labels
labels[:5]
0 Harry Potter - J. K. Rowling
1 Supernatural
2 Marvel Cinematic Universe
3 僕のヒーローアカデミア | Boku no Hero Academia | My Hero ...
4 방탄소년단 | Bangtan Boys | BTS
Name: name, dtype: object
# Use .apply() to apply the function to the original labels
labels2 = labels.apply(short)
# Short labels
labels2[:5]
0 Harry Potter
1 Supernatural
2 Marvel Cinematic Universe
3 My Hero Academia
4 BTS
Name: name, dtype: object
# Repeat the above plotting process
# This time with short labels
# Create just a figure and only one subplot
# Must use plt.subplots instead of subplot
# In Jupyter notebook, include all code in one cell
fig, ax = plt.subplots(figsize=(10,10))
# Find axes boundaries
# ax.add_patch() does not update the axes limits automatically,
# so set them by hand to avoid drawing circles outside the figure
lim = max([max(abs(circle.x)+circle.r, abs(circle.y)+circle.r) for circle in circles])
plt.xlim(-lim,lim)
plt.ylim(-lim,lim)
# List of labels to add on the circles
labels = fandom.name
# Remove axis
# Add title
plt.axis('off')
plt.title('AO3 Popular Fandoms')
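# Draw the circles with the short labels
# circlify returns the circles from smallest to largest (see the preview),
# while the labels run from largest to smallest, so reverse the labels
for circle, label in zip(circles, labels2[::-1]):
    x, y, r = circle
    ax.add_patch(plt.Circle((x, y), r, alpha=0.2, linewidth=2))
    plt.annotate(label, (x, y), va='center', ha='center')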
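Both plotting cells compute the axis limit `lim` the same way, and that step can be checked in isolation. The sketch below uses a small namedtuple as a hypothetical stand-in for circlify's circles (only the x, y, and r fields are assumed) with made-up values, and wraps the computation in an illustrative helper named `axis_limit`.

```python
from collections import namedtuple

# Hypothetical stand-in for circlify's Circle; only x, y, r are assumed here
Circle = namedtuple("Circle", ["x", "y", "r"])

def axis_limit(circles):
    """Half-width of the smallest origin-centered square containing every circle."""
    return max(
        max(abs(c.x) + c.r, abs(c.y) + c.r)
        for c in circles
    )

# Made-up circles for illustration only
demo = [Circle(0.0, 0.0, 0.5), Circle(-0.5, 0.25, 0.25), Circle(0.125, 0.5, 0.125)]
lim = axis_limit(demo)
print(lim)  # 0.75
```

Using the same value for both axes keeps the plot square, so the circles are not stretched into ellipses when the figure itself is square.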