AO3 Trivia: Works With Most Words Part II
In Part II, we look into the fandoms that have works with most words.
Loading File
# Load python libraries
import pandas as pd
import gc
# Load data
chunker = pd.read_csv("/home/pi/Downloads/tags-20210226.csv", chunksize=10000)
tags = pd.concat(chunker, ignore_index=True)
# preview
tags
| id | type | name | canonical | cached_count | merger_id | |
|---|---|---|---|---|---|---|
| 0 | 1 | Media | TV Shows | True | 910 | NaN |
| 1 | 2 | Media | Movies | True | 1164 | NaN |
| 2 | 3 | Media | Books & Literature | True | 134 | NaN |
| 3 | 4 | Media | Cartoons & Comics & Graphic Novels | True | 166 | NaN |
| 4 | 5 | Media | Anime & Manga | True | 501 | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 14467133 | 55395603 | Freeform | Redacted | False | 0 | NaN |
| 14467134 | 55395606 | Freeform | Redacted | False | 0 | NaN |
| 14467135 | 55395609 | Freeform | Redacted | False | 0 | NaN |
| 14467136 | 55395612 | Freeform | Redacted | False | 0 | NaN |
| 14467137 | 55395615 | Freeform | Redacted | False | 0 | NaN |
14467138 rows × 6 columns
Note from the above preview, some tags have names as “Redacted”. This is due to the fact that these tags have the cached_count less than 5.
In Part I, we’ve extracted 10 works with most words on AO3, and saved it to a local file named “top-words-all-time.csv”; we also filtered the data by year, found the work with most words for that year, and saved the DataFrame as ““top-words-by-year.csv”.
# Load data
works_all = pd.read_csv("trivia/top-words-all-time.csv")
works_all
| creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 2016-08-28 | en | False | False | 5078036.0 | 22+541478+15918+126089+63182+12+741433+230931+... | NaN |
| 1 | 2014-06-22 | en | False | False | 4796066.0 | 23+14+15322+109011+108231+108232+186363+600534... | NaN |
| 2 | 2018-07-14 | en | False | False | 4332910.0 | 1026+109503+12695+16+24754629+116+37259+11+796... | NaN |
| 3 | 2013-10-06 | en | False | False | 3817471.0 | 11+21+16+1133664+48012+48013+648995+16999+1090... | NaN |
| 4 | 2019-12-18 | en | False | False | 3456587.0 | 12+3693074+14030081+10482076+8658412+8658409+1... | NaN |
| 5 | 2018-12-18 | en | False | False | 3312781.0 | 12+254648+13714235+19334348+557795+1275+282154... | NaN |
| 6 | 2019-10-29 | vi | True | False | 3163926.0 | 5450+14+9 | NaN |
| 7 | 2014-12-06 | en | False | False | 3085821.0 | 13+116+22+23+14+1001939+245368+586439+261582+7... | NaN |
| 8 | 2013-02-16 | en | False | True | 2853949.0 | 13+14+951+40167+6563+6560+6559+950+1056+109629... | NaN |
| 9 | 2020-10-17 | en | False | False | 2598127.0 | 11+13999+2927+6276+2246+17+61 | NaN |
# Load data
works_year = pd.read_csv("trivia/top-words-by-year.csv")
works_year
| creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | |
|---|---|---|---|---|---|---|---|
| 0 | 2008-11-06 | en | False | False | 128163.0 | 22+183+2390+1048+966+16+1000+968+184+2395+2379... | NaN |
| 1 | 2009-11-14 | en | False | True | 756596.0 | 23+19+13+114941+63594+125727+134988 | NaN |
| 2 | 2010-04-14 | en | False | False | 1005091.0 | 2246+14+78550+8096+95285+8133+95354+8094+8130+... | NaN |
| 3 | 2011-06-06 | zh | True | True | 1490481.0 | 13+23+14+1039+20020+24+22 | NaN |
| 4 | 2012-04-16 | en | False | False | 1310636.0 | 13+23+14+136512+972932+4937593+70650+1833+2417... | NaN |
| 5 | 2013-10-06 | en | False | False | 3817471.0 | 11+21+16+1133664+48012+48013+648995+16999+1090... | NaN |
| 6 | 2014-06-22 | en | False | False | 4796066.0 | 23+14+15322+109011+108231+108232+186363+600534... | NaN |
| 7 | 2015-09-20 | en | False | False | 1779264.0 | 21+16+10767+248734+8005+28451+1000+192+12 | NaN |
| 8 | 2016-08-28 | en | False | False | 5078036.0 | 22+541478+15918+126089+63182+12+741433+230931+... | NaN |
| 9 | 2017-09-23 | en | False | False | 2588814.0 | 299359+299357+299358+2927+1371926+21+14+12 | NaN |
| 10 | 2018-07-14 | en | False | False | 4332910.0 | 1026+109503+12695+16+24754629+116+37259+11+796... | NaN |
| 11 | 2019-12-18 | en | False | False | 3456587.0 | 12+3693074+14030081+10482076+8658412+8658409+1... | NaN |
| 12 | 2020-10-17 | en | False | False | 2598127.0 | 11+13999+2927+6276+2246+17+61 | NaN |
| 13 | 2021-01-02 | en | False | False | 2261752.0 | 839+30349+9830863+9830815+18525+262083+5195777... | NaN |
From Tag ID To Tag Name
On one hand, we have tag ids associated with works in one DataFrame; on the other hand, we have tag ids with names in another DataFrame.
Let’s split the tag id string into individual tag ids, find the respective name in tags DataFrame, and append the name back to the works DataFrame under a new column.
Because a work has multiple tags, some works have as many as 500+ tags (from previous post, Works With Most Tags), we’ll only extract the fandom tag from the tags DataFrame.
We can check the types of tags that are in the data set.
# Types of tags
tags.type.unique()
array(['Media', 'Rating', 'ArchiveWarning', 'Category', 'Character',
'Fandom', 'Relationship', 'Freeform', 'UnsortedTag'], dtype=object)
In order to user .loc[] method to quickly select specific rows based on index labels, we first set tag id as index in the tags DataFrame.
Set Index
# Set id column as index
tags.set_index("id", inplace=True)
tags
| type | name | canonical | cached_count | merger_id | |
|---|---|---|---|---|---|
| id | |||||
| 1 | Media | TV Shows | True | 910 | NaN |
| 2 | Media | Movies | True | 1164 | NaN |
| 3 | Media | Books & Literature | True | 134 | NaN |
| 4 | Media | Cartoons & Comics & Graphic Novels | True | 166 | NaN |
| 5 | Media | Anime & Manga | True | 501 | NaN |
| ... | ... | ... | ... | ... | ... |
| 55395603 | Freeform | Redacted | False | 0 | NaN |
| 55395606 | Freeform | Redacted | False | 0 | NaN |
| 55395609 | Freeform | Redacted | False | 0 | NaN |
| 55395612 | Freeform | Redacted | False | 0 | NaN |
| 55395615 | Freeform | Redacted | False | 0 | NaN |
14467138 rows × 5 columns
Add_fandom function
We use a function to quickly split tags string into separate ids, use id to locate the tag name, and only select id names that are fandoms.
# x is id string
# df is tags DataFrame
# The function returns all tag names that are Fandom type
# In case of multiple fandoms, tolist() is used
def add_fandom(x,df):
subset = df.loc[[int(n) for n in x.split("+")]]
return subset[subset.type == "Fandom"].name.tolist()
Apply to DataFrame
# Add fandom to a new column
works_all["fandom"] = works_all["tags"].apply(lambda x: add_fandom(x,tags))
works_all
| creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | fandom | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2016-08-28 | en | False | False | 5078036.0 | 22+541478+15918+126089+63182+12+741433+230931+... | NaN | [The Hobbit - All Media Types, The Lord of the... |
| 1 | 2014-06-22 | en | False | False | 4796066.0 | 23+14+15322+109011+108231+108232+186363+600534... | NaN | [Formula 1 RPF] |
| 2 | 2018-07-14 | en | False | False | 4332910.0 | 1026+109503+12695+16+24754629+116+37259+11+796... | NaN | [Terminator: The Sarah Connor Chronicles, Term... |
| 3 | 2013-10-06 | en | False | False | 3817471.0 | 11+21+16+1133664+48012+48013+648995+16999+1090... | NaN | [Fiction Wrestling - Fandom] |
| 4 | 2019-12-18 | en | False | False | 3456587.0 | 12+3693074+14030081+10482076+8658412+8658409+1... | NaN | [モブサイコ100 | Mob Psycho 100] |
| 5 | 2018-12-18 | en | False | False | 3312781.0 | 12+254648+13714235+19334348+557795+1275+282154... | NaN | [Minecraft (Video Game)] |
| 6 | 2019-10-29 | vi | True | False | 3163926.0 | 5450+14+9 | NaN | [No Fandom] |
| 7 | 2014-12-06 | en | False | False | 3085821.0 | 13+116+22+23+14+1001939+245368+586439+261582+7... | NaN | [The Avengers (Marvel Movies), Thor (Movies), ... |
| 8 | 2013-02-16 | en | False | True | 2853949.0 | 13+14+951+40167+6563+6560+6559+950+1056+109629... | NaN | [NCIS] |
| 9 | 2020-10-17 | en | False | False | 2598127.0 | 11+13999+2927+6276+2246+17+61 | NaN | [Naruto] |
To summarize, the work with most words on AO3 is from The Hobbit fandom, followed by a work from Formula 1 RPF fandom. Interestingly, 9 works out of 10 are still works in progress. Note that the date in the data set is “creation date”, meaning which year the work was created. They may well be updated regularly till this day. Thus, it is unsurprising that the works we see above are all created several years ago. It takes time to write fanfictions!
# Same with works_year DataFrame
works_year["fandom"] = works_year["tags"].apply(lambda x: add_fandom(x,tags))
works_year
| creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | fandom | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2008-11-06 | en | False | False | 128163.0 | 22+183+2390+1048+966+16+1000+968+184+2395+2379... | NaN | [Harry Potter - Rowling] |
| 1 | 2009-11-14 | en | False | True | 756596.0 | 23+19+13+114941+63594+125727+134988 | NaN | [Lord of the Rings - J. R. R. Tolkien] |
| 2 | 2010-04-14 | en | False | False | 1005091.0 | 2246+14+78550+8096+95285+8133+95354+8094+8130+... | NaN | [Weiß Kreuz] |
| 3 | 2011-06-06 | zh | True | True | 1490481.0 | 13+23+14+1039+20020+24+22 | NaN | [Queer as Folk (US)] |
| 4 | 2012-04-16 | en | False | False | 1310636.0 | 13+23+14+136512+972932+4937593+70650+1833+2417... | NaN | [Harry Potter - J. K. Rowling] |
| 5 | 2013-10-06 | en | False | False | 3817471.0 | 11+21+16+1133664+48012+48013+648995+16999+1090... | NaN | [Fiction Wrestling - Fandom] |
| 6 | 2014-06-22 | en | False | False | 4796066.0 | 23+14+15322+109011+108231+108232+186363+600534... | NaN | [Formula 1 RPF] |
| 7 | 2015-09-20 | en | False | False | 1779264.0 | 21+16+10767+248734+8005+28451+1000+192+12 | NaN | [One Piece] |
| 8 | 2016-08-28 | en | False | False | 5078036.0 | 22+541478+15918+126089+63182+12+741433+230931+... | NaN | [The Hobbit - All Media Types, The Lord of the... |
| 9 | 2017-09-23 | en | False | False | 2588814.0 | 299359+299357+299358+2927+1371926+21+14+12 | NaN | [Wiedźmin | The Witcher (Video Game), Wiedźmin... |
| 10 | 2018-07-14 | en | False | False | 4332910.0 | 1026+109503+12695+16+24754629+116+37259+11+796... | NaN | [Terminator: The Sarah Connor Chronicles, Term... |
| 11 | 2019-12-18 | en | False | False | 3456587.0 | 12+3693074+14030081+10482076+8658412+8658409+1... | NaN | [モブサイコ100 | Mob Psycho 100] |
| 12 | 2020-10-17 | en | False | False | 2598127.0 | 11+13999+2927+6276+2246+17+61 | NaN | [Naruto] |
| 13 | 2021-01-02 | en | False | False | 2261752.0 | 839+30349+9830863+9830815+18525+262083+5195777... | NaN | [X-Men (Movieverse), X-Men (Comicverse), X-Men... |
Here, we have the works with most words created each year. Some works are also in the previous DataFram.