In Part II, we look into the fandoms that have works with most words.

Loading File

# Load python libraries
import pandas as pd
import gc

# Load data
chunker = pd.read_csv("/home/pi/Downloads/tags-20210226.csv", chunksize=10000)
tags = pd.concat(chunker, ignore_index=True)

# preview
tags

	id	type	name	canonical	cached_count	merger_id
0	1	Media	TV Shows	True	910	NaN
1	2	Media	Movies	True	1164	NaN
2	3	Media	Books & Literature	True	134	NaN
3	4	Media	Cartoons & Comics & Graphic Novels	True	166	NaN
4	5	Media	Anime & Manga	True	501	NaN
...	...	...	...	...	...	...
14467133	55395603	Freeform	Redacted	False	0	NaN
14467134	55395606	Freeform	Redacted	False	0	NaN
14467135	55395609	Freeform	Redacted	False	0	NaN
14467136	55395612	Freeform	Redacted	False	0	NaN
14467137	55395615	Freeform	Redacted	False	0	NaN

14467138 rows × 6 columns

Note from the above preview, some tags have names as “Redacted”. This is due to the fact that these tags have the cached_count less than 5.

In Part I, we’ve extracted 10 works with most words on AO3, and saved it to a local file named “top-words-all-time.csv”; we also filtered the data by year, found the work with most words for that year, and saved the DataFrame as ““top-words-by-year.csv”.

# Load data
works_all = pd.read_csv("trivia/top-words-all-time.csv")

works_all

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6
0	2016-08-28	en	False	False	5078036.0	22+541478+15918+126089+63182+12+741433+230931+...	NaN
1	2014-06-22	en	False	False	4796066.0	23+14+15322+109011+108231+108232+186363+600534...	NaN
2	2018-07-14	en	False	False	4332910.0	1026+109503+12695+16+24754629+116+37259+11+796...	NaN
3	2013-10-06	en	False	False	3817471.0	11+21+16+1133664+48012+48013+648995+16999+1090...	NaN
4	2019-12-18	en	False	False	3456587.0	12+3693074+14030081+10482076+8658412+8658409+1...	NaN
5	2018-12-18	en	False	False	3312781.0	12+254648+13714235+19334348+557795+1275+282154...	NaN
6	2019-10-29	vi	True	False	3163926.0	5450+14+9	NaN
7	2014-12-06	en	False	False	3085821.0	13+116+22+23+14+1001939+245368+586439+261582+7...	NaN
8	2013-02-16	en	False	True	2853949.0	13+14+951+40167+6563+6560+6559+950+1056+109629...	NaN
9	2020-10-17	en	False	False	2598127.0	11+13999+2927+6276+2246+17+61	NaN

# Load data
works_year = pd.read_csv("trivia/top-words-by-year.csv")

works_year

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6
0	2008-11-06	en	False	False	128163.0	22+183+2390+1048+966+16+1000+968+184+2395+2379...	NaN
1	2009-11-14	en	False	True	756596.0	23+19+13+114941+63594+125727+134988	NaN
2	2010-04-14	en	False	False	1005091.0	2246+14+78550+8096+95285+8133+95354+8094+8130+...	NaN
3	2011-06-06	zh	True	True	1490481.0	13+23+14+1039+20020+24+22	NaN
4	2012-04-16	en	False	False	1310636.0	13+23+14+136512+972932+4937593+70650+1833+2417...	NaN
5	2013-10-06	en	False	False	3817471.0	11+21+16+1133664+48012+48013+648995+16999+1090...	NaN
6	2014-06-22	en	False	False	4796066.0	23+14+15322+109011+108231+108232+186363+600534...	NaN
7	2015-09-20	en	False	False	1779264.0	21+16+10767+248734+8005+28451+1000+192+12	NaN
8	2016-08-28	en	False	False	5078036.0	22+541478+15918+126089+63182+12+741433+230931+...	NaN
9	2017-09-23	en	False	False	2588814.0	299359+299357+299358+2927+1371926+21+14+12	NaN
10	2018-07-14	en	False	False	4332910.0	1026+109503+12695+16+24754629+116+37259+11+796...	NaN
11	2019-12-18	en	False	False	3456587.0	12+3693074+14030081+10482076+8658412+8658409+1...	NaN
12	2020-10-17	en	False	False	2598127.0	11+13999+2927+6276+2246+17+61	NaN
13	2021-01-02	en	False	False	2261752.0	839+30349+9830863+9830815+18525+262083+5195777...	NaN

From Tag ID To Tag Name

On one hand, we have tag ids associated with works in one DataFrame; on the other hand, we have tag ids with names in another DataFrame.

Let’s split the tag id string into individual tag ids, find the respective name in tags DataFrame, and append the name back to the works DataFrame under a new column.

Because a work has multiple tags, some works have as many as 500+ tags (from previous post, Works With Most Tags), we’ll only extract the fandom tag from the tags DataFrame.

We can check the types of tags that are in the data set.

# Types of tags
tags.type.unique()

array(['Media', 'Rating', 'ArchiveWarning', 'Category', 'Character',
       'Fandom', 'Relationship', 'Freeform', 'UnsortedTag'], dtype=object)

In order to user .loc[] method to quickly select specific rows based on index labels, we first set tag id as index in the tags DataFrame.

Set Index

# Set id column as index
tags.set_index("id", inplace=True)
tags

	type	name	canonical	cached_count	merger_id
id
1	Media	TV Shows	True	910	NaN
2	Media	Movies	True	1164	NaN
3	Media	Books & Literature	True	134	NaN
4	Media	Cartoons & Comics & Graphic Novels	True	166	NaN
5	Media	Anime & Manga	True	501	NaN
...	...	...	...	...	...
55395603	Freeform	Redacted	False	0	NaN
55395606	Freeform	Redacted	False	0	NaN
55395609	Freeform	Redacted	False	0	NaN
55395612	Freeform	Redacted	False	0	NaN
55395615	Freeform	Redacted	False	0	NaN

14467138 rows × 5 columns

Add_fandom function

We use a function to quickly split tags string into separate ids, use id to locate the tag name, and only select id names that are fandoms.

# x is id string
# df is tags DataFrame
# The function returns all tag names that are Fandom type
# In case of multiple fandoms, tolist() is used 

def add_fandom(x,df):
    subset = df.loc[[int(n) for n in x.split("+")]]
    return subset[subset.type == "Fandom"].name.tolist()

Apply to DataFrame

# Add fandom to a new column
works_all["fandom"] = works_all["tags"].apply(lambda x: add_fandom(x,tags))

works_all

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6	fandom
0	2016-08-28	en	False	False	5078036.0	22+541478+15918+126089+63182+12+741433+230931+...	NaN	[The Hobbit - All Media Types, The Lord of the...
1	2014-06-22	en	False	False	4796066.0	23+14+15322+109011+108231+108232+186363+600534...	NaN	[Formula 1 RPF]
2	2018-07-14	en	False	False	4332910.0	1026+109503+12695+16+24754629+116+37259+11+796...	NaN	[Terminator: The Sarah Connor Chronicles, Term...
3	2013-10-06	en	False	False	3817471.0	11+21+16+1133664+48012+48013+648995+16999+1090...	NaN	[Fiction Wrestling - Fandom]
4	2019-12-18	en	False	False	3456587.0	12+3693074+14030081+10482076+8658412+8658409+1...	NaN	[モブサイコ100 \| Mob Psycho 100]
5	2018-12-18	en	False	False	3312781.0	12+254648+13714235+19334348+557795+1275+282154...	NaN	[Minecraft (Video Game)]
6	2019-10-29	vi	True	False	3163926.0	5450+14+9	NaN	[No Fandom]
7	2014-12-06	en	False	False	3085821.0	13+116+22+23+14+1001939+245368+586439+261582+7...	NaN	[The Avengers (Marvel Movies), Thor (Movies), ...
8	2013-02-16	en	False	True	2853949.0	13+14+951+40167+6563+6560+6559+950+1056+109629...	NaN	[NCIS]
9	2020-10-17	en	False	False	2598127.0	11+13999+2927+6276+2246+17+61	NaN	[Naruto]

To summarize, the work with most words on AO3 is from The Hobbit fandom, followed by a work from Formula 1 RPF fandom. Interestingly, 9 works out of 10 are still works in progress. Note that the date in the data set is “creation date”, meaning which year the work was created. They may well be updated regularly till this day. Thus, it is unsurprising that the works we see above are all created several years ago. It takes time to write fanfictions!

# Same with works_year DataFrame
works_year["fandom"] = works_year["tags"].apply(lambda x: add_fandom(x,tags))

works_year

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6	fandom
0	2008-11-06	en	False	False	128163.0	22+183+2390+1048+966+16+1000+968+184+2395+2379...	NaN	[Harry Potter - Rowling]
1	2009-11-14	en	False	True	756596.0	23+19+13+114941+63594+125727+134988	NaN	[Lord of the Rings - J. R. R. Tolkien]
2	2010-04-14	en	False	False	1005091.0	2246+14+78550+8096+95285+8133+95354+8094+8130+...	NaN	[Weiß Kreuz]
3	2011-06-06	zh	True	True	1490481.0	13+23+14+1039+20020+24+22	NaN	[Queer as Folk (US)]
4	2012-04-16	en	False	False	1310636.0	13+23+14+136512+972932+4937593+70650+1833+2417...	NaN	[Harry Potter - J. K. Rowling]
5	2013-10-06	en	False	False	3817471.0	11+21+16+1133664+48012+48013+648995+16999+1090...	NaN	[Fiction Wrestling - Fandom]
6	2014-06-22	en	False	False	4796066.0	23+14+15322+109011+108231+108232+186363+600534...	NaN	[Formula 1 RPF]
7	2015-09-20	en	False	False	1779264.0	21+16+10767+248734+8005+28451+1000+192+12	NaN	[One Piece]
8	2016-08-28	en	False	False	5078036.0	22+541478+15918+126089+63182+12+741433+230931+...	NaN	[The Hobbit - All Media Types, The Lord of the...
9	2017-09-23	en	False	False	2588814.0	299359+299357+299358+2927+1371926+21+14+12	NaN	[Wiedźmin \| The Witcher (Video Game), Wiedźmin...
10	2018-07-14	en	False	False	4332910.0	1026+109503+12695+16+24754629+116+37259+11+796...	NaN	[Terminator: The Sarah Connor Chronicles, Term...
11	2019-12-18	en	False	False	3456587.0	12+3693074+14030081+10482076+8658412+8658409+1...	NaN	[モブサイコ100 \| Mob Psycho 100]
12	2020-10-17	en	False	False	2598127.0	11+13999+2927+6276+2246+17+61	NaN	[Naruto]
13	2021-01-02	en	False	False	2261752.0	839+30349+9830863+9830815+18525+262083+5195777...	NaN	[X-Men (Movieverse), X-Men (Comicverse), X-Men...

Here, we have the works with most words created each year. Some works are also in the previous DataFram.