In Part II, we look into the fandoms that have works with most words.

Loading File

# Load python libraries
import pandas as pd
import gc
# Load data
chunker = pd.read_csv("/home/pi/Downloads/tags-20210226.csv", chunksize=10000)
tags = pd.concat(chunker, ignore_index=True)
# preview
tags
id type name canonical cached_count merger_id
0 1 Media TV Shows True 910 NaN
1 2 Media Movies True 1164 NaN
2 3 Media Books & Literature True 134 NaN
3 4 Media Cartoons & Comics & Graphic Novels True 166 NaN
4 5 Media Anime & Manga True 501 NaN
... ... ... ... ... ... ...
14467133 55395603 Freeform Redacted False 0 NaN
14467134 55395606 Freeform Redacted False 0 NaN
14467135 55395609 Freeform Redacted False 0 NaN
14467136 55395612 Freeform Redacted False 0 NaN
14467137 55395615 Freeform Redacted False 0 NaN

14467138 rows × 6 columns

Note from the above preview, some tags have names as “Redacted”. This is due to the fact that these tags have the cached_count less than 5.

In Part I, we’ve extracted 10 works with most words on AO3, and saved it to a local file named “top-words-all-time.csv”; we also filtered the data by year, found the work with most words for that year, and saved the DataFrame as ““top-words-by-year.csv”.

# Load data
works_all = pd.read_csv("trivia/top-words-all-time.csv")
works_all
creation date language restricted complete word_count tags Unnamed: 6
0 2016-08-28 en False False 5078036.0 22+541478+15918+126089+63182+12+741433+230931+... NaN
1 2014-06-22 en False False 4796066.0 23+14+15322+109011+108231+108232+186363+600534... NaN
2 2018-07-14 en False False 4332910.0 1026+109503+12695+16+24754629+116+37259+11+796... NaN
3 2013-10-06 en False False 3817471.0 11+21+16+1133664+48012+48013+648995+16999+1090... NaN
4 2019-12-18 en False False 3456587.0 12+3693074+14030081+10482076+8658412+8658409+1... NaN
5 2018-12-18 en False False 3312781.0 12+254648+13714235+19334348+557795+1275+282154... NaN
6 2019-10-29 vi True False 3163926.0 5450+14+9 NaN
7 2014-12-06 en False False 3085821.0 13+116+22+23+14+1001939+245368+586439+261582+7... NaN
8 2013-02-16 en False True 2853949.0 13+14+951+40167+6563+6560+6559+950+1056+109629... NaN
9 2020-10-17 en False False 2598127.0 11+13999+2927+6276+2246+17+61 NaN
# Load data
works_year = pd.read_csv("trivia/top-words-by-year.csv")
works_year
creation date language restricted complete word_count tags Unnamed: 6
0 2008-11-06 en False False 128163.0 22+183+2390+1048+966+16+1000+968+184+2395+2379... NaN
1 2009-11-14 en False True 756596.0 23+19+13+114941+63594+125727+134988 NaN
2 2010-04-14 en False False 1005091.0 2246+14+78550+8096+95285+8133+95354+8094+8130+... NaN
3 2011-06-06 zh True True 1490481.0 13+23+14+1039+20020+24+22 NaN
4 2012-04-16 en False False 1310636.0 13+23+14+136512+972932+4937593+70650+1833+2417... NaN
5 2013-10-06 en False False 3817471.0 11+21+16+1133664+48012+48013+648995+16999+1090... NaN
6 2014-06-22 en False False 4796066.0 23+14+15322+109011+108231+108232+186363+600534... NaN
7 2015-09-20 en False False 1779264.0 21+16+10767+248734+8005+28451+1000+192+12 NaN
8 2016-08-28 en False False 5078036.0 22+541478+15918+126089+63182+12+741433+230931+... NaN
9 2017-09-23 en False False 2588814.0 299359+299357+299358+2927+1371926+21+14+12 NaN
10 2018-07-14 en False False 4332910.0 1026+109503+12695+16+24754629+116+37259+11+796... NaN
11 2019-12-18 en False False 3456587.0 12+3693074+14030081+10482076+8658412+8658409+1... NaN
12 2020-10-17 en False False 2598127.0 11+13999+2927+6276+2246+17+61 NaN
13 2021-01-02 en False False 2261752.0 839+30349+9830863+9830815+18525+262083+5195777... NaN

From Tag ID To Tag Name

On one hand, we have tag ids associated with works in one DataFrame; on the other hand, we have tag ids with names in another DataFrame.

Let’s split the tag id string into individual tag ids, find the respective name in tags DataFrame, and append the name back to the works DataFrame under a new column.

Because a work has multiple tags, some works have as many as 500+ tags (from previous post, Works With Most Tags), we’ll only extract the fandom tag from the tags DataFrame.

We can check the types of tags that are in the data set.

# Types of tags
tags.type.unique()
array(['Media', 'Rating', 'ArchiveWarning', 'Category', 'Character',
       'Fandom', 'Relationship', 'Freeform', 'UnsortedTag'], dtype=object)

In order to user .loc[] method to quickly select specific rows based on index labels, we first set tag id as index in the tags DataFrame.

Set Index

# Set id column as index
tags.set_index("id", inplace=True)
tags
type name canonical cached_count merger_id
id
1 Media TV Shows True 910 NaN
2 Media Movies True 1164 NaN
3 Media Books & Literature True 134 NaN
4 Media Cartoons & Comics & Graphic Novels True 166 NaN
5 Media Anime & Manga True 501 NaN
... ... ... ... ... ...
55395603 Freeform Redacted False 0 NaN
55395606 Freeform Redacted False 0 NaN
55395609 Freeform Redacted False 0 NaN
55395612 Freeform Redacted False 0 NaN
55395615 Freeform Redacted False 0 NaN

14467138 rows × 5 columns

Add_fandom function

We use a function to quickly split tags string into separate ids, use id to locate the tag name, and only select id names that are fandoms.

# x is id string
# df is tags DataFrame
# The function returns all tag names that are Fandom type
# In case of multiple fandoms, tolist() is used 

def add_fandom(x,df):
    subset = df.loc[[int(n) for n in x.split("+")]]
    return subset[subset.type == "Fandom"].name.tolist()

Apply to DataFrame

# Add fandom to a new column
works_all["fandom"] = works_all["tags"].apply(lambda x: add_fandom(x,tags))
works_all
creation date language restricted complete word_count tags Unnamed: 6 fandom
0 2016-08-28 en False False 5078036.0 22+541478+15918+126089+63182+12+741433+230931+... NaN [The Hobbit - All Media Types, The Lord of the...
1 2014-06-22 en False False 4796066.0 23+14+15322+109011+108231+108232+186363+600534... NaN [Formula 1 RPF]
2 2018-07-14 en False False 4332910.0 1026+109503+12695+16+24754629+116+37259+11+796... NaN [Terminator: The Sarah Connor Chronicles, Term...
3 2013-10-06 en False False 3817471.0 11+21+16+1133664+48012+48013+648995+16999+1090... NaN [Fiction Wrestling - Fandom]
4 2019-12-18 en False False 3456587.0 12+3693074+14030081+10482076+8658412+8658409+1... NaN [モブサイコ100 | Mob Psycho 100]
5 2018-12-18 en False False 3312781.0 12+254648+13714235+19334348+557795+1275+282154... NaN [Minecraft (Video Game)]
6 2019-10-29 vi True False 3163926.0 5450+14+9 NaN [No Fandom]
7 2014-12-06 en False False 3085821.0 13+116+22+23+14+1001939+245368+586439+261582+7... NaN [The Avengers (Marvel Movies), Thor (Movies), ...
8 2013-02-16 en False True 2853949.0 13+14+951+40167+6563+6560+6559+950+1056+109629... NaN [NCIS]
9 2020-10-17 en False False 2598127.0 11+13999+2927+6276+2246+17+61 NaN [Naruto]

To summarize, the work with most words on AO3 is from The Hobbit fandom, followed by a work from Formula 1 RPF fandom. Interestingly, 9 works out of 10 are still works in progress. Note that the date in the data set is “creation date”, meaning which year the work was created. They may well be updated regularly till this day. Thus, it is unsurprising that the works we see above are all created several years ago. It takes time to write fanfictions!

# Same with works_year DataFrame
works_year["fandom"] = works_year["tags"].apply(lambda x: add_fandom(x,tags))
works_year
creation date language restricted complete word_count tags Unnamed: 6 fandom
0 2008-11-06 en False False 128163.0 22+183+2390+1048+966+16+1000+968+184+2395+2379... NaN [Harry Potter - Rowling]
1 2009-11-14 en False True 756596.0 23+19+13+114941+63594+125727+134988 NaN [Lord of the Rings - J. R. R. Tolkien]
2 2010-04-14 en False False 1005091.0 2246+14+78550+8096+95285+8133+95354+8094+8130+... NaN [Weiß Kreuz]
3 2011-06-06 zh True True 1490481.0 13+23+14+1039+20020+24+22 NaN [Queer as Folk (US)]
4 2012-04-16 en False False 1310636.0 13+23+14+136512+972932+4937593+70650+1833+2417... NaN [Harry Potter - J. K. Rowling]
5 2013-10-06 en False False 3817471.0 11+21+16+1133664+48012+48013+648995+16999+1090... NaN [Fiction Wrestling - Fandom]
6 2014-06-22 en False False 4796066.0 23+14+15322+109011+108231+108232+186363+600534... NaN [Formula 1 RPF]
7 2015-09-20 en False False 1779264.0 21+16+10767+248734+8005+28451+1000+192+12 NaN [One Piece]
8 2016-08-28 en False False 5078036.0 22+541478+15918+126089+63182+12+741433+230931+... NaN [The Hobbit - All Media Types, The Lord of the...
9 2017-09-23 en False False 2588814.0 299359+299357+299358+2927+1371926+21+14+12 NaN [Wiedźmin | The Witcher (Video Game), Wiedźmin...
10 2018-07-14 en False False 4332910.0 1026+109503+12695+16+24754629+116+37259+11+796... NaN [Terminator: The Sarah Connor Chronicles, Term...
11 2019-12-18 en False False 3456587.0 12+3693074+14030081+10482076+8658412+8658409+1... NaN [モブサイコ100 | Mob Psycho 100]
12 2020-10-17 en False False 2598127.0 11+13999+2927+6276+2246+17+61 NaN [Naruto]
13 2021-01-02 en False False 2261752.0 839+30349+9830863+9830815+18525+262083+5195777... NaN [X-Men (Movieverse), X-Men (Comicverse), X-Men...

Here, we have the works with most words created each year. Some works are also in the previous DataFram.