AO3 Trivia: Works With Most Words Part I
In this series of posts, we look at some fun facts on AO3, such as:
- works with most tags;
- works with most words;
- top fandoms;
etc.
Loading File
# Load python libraries
import pandas as pd
import gc
# Load data
chunker = pd.read_csv("/home/pi/Downloads/works-20210226.csv", chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
# Preview
works
| | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 |
---|---|---|---|---|---|---|---|
0 | 2021-02-26 | en | False | True | 388.0 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN |
1 | 2021-02-26 | en | False | True | 1638.0 | 10+20350917+34816907+23666027+23269305+2326930... | NaN |
2 | 2021-02-26 | en | False | True | 1502.0 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN |
3 | 2021-02-26 | en | False | True | 100.0 | 10+15322+54862755+20595867+32994286+663+471751... | NaN |
4 | 2021-02-26 | en | False | True | 994.0 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
7269688 | 2008-09-13 | en | True | True | 705.0 | 78+77+84+101+104+105+106+23+13+16+70+933 | NaN |
7269689 | 2008-09-13 | en | False | True | 1392.0 | 78+77+84+107+23+10+16+70+933+616 | NaN |
7269690 | 2008-09-13 | en | False | True | 1755.0 | 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 | NaN |
7269691 | 2008-09-13 | en | False | True | 1338.0 | 112+113+13+114+16+115+101+117+118+119+120+116+... | NaN |
7269692 | 2008-09-13 | en | False | True | 1836.0 | 123+124+125+127+128+13+129+14+130+131+132+133+... | NaN |
7269693 rows × 7 columns
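The chunked read above can be tried out on a small scale before pointing it at the full dump. The following is a minimal sketch on a made-up, in-memory CSV, not the real file. The trailing comma on each line is an assumption, used here only to reproduce an empty "Unnamed: 6" column like the one in the preview; the names csv_text, toy and cleaned are hypothetical.

```python
import io
import pandas as pd

# Tiny stand-in for the works CSV (a few toy rows, not the real data).
# The trailing comma on every line is an assumption: it makes pandas
# create an extra, empty column named "Unnamed: 6".
csv_text = (
    "creation date,language,restricted,complete,word_count,tags,\n"
    "2021-02-26,en,False,True,388.0,10+414093+1001939,\n"
    "2021-02-26,en,False,True,1638.0,10+20350917,\n"
    "2008-09-13,en,True,True,705.0,78+77+84,\n"
)

# Read two rows at a time, then stitch the chunks back together
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=2)
toy = pd.concat(chunker, ignore_index=True)

# Drop any column that contains nothing but NaN
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Dropping all-NaN columns is optional; it only tidies the preview and leaves the word counts and tags untouched.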
Finding The Works With Most Words
We use the .nlargest() method on the word_count column to find the 10 works with the most words on AO3.
# use .nlargest() method
works.loc[works['word_count'].nlargest(10).index]
| | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 |
---|---|---|---|---|---|---|---|
4978554 | 2016-08-28 | en | False | False | 5078036.0 | 22+541478+15918+126089+63182+12+741433+230931+... | NaN |
6212214 | 2014-06-22 | en | False | False | 4796066.0 | 23+14+15322+109011+108231+108232+186363+600534... | NaN |
3502920 | 2018-07-14 | en | False | False | 4332910.0 | 1026+109503+12695+16+24754629+116+37259+11+796... | NaN |
6500931 | 2013-10-06 | en | False | False | 3817471.0 | 11+21+16+1133664+48012+48013+648995+16999+1090... | NaN |
1996502 | 2019-12-18 | en | False | False | 3456587.0 | 12+3693074+14030081+10482076+8658412+8658409+1... | NaN |
3102917 | 2018-12-18 | en | False | False | 3312781.0 | 12+254648+13714235+19334348+557795+1275+282154... | NaN |
2146591 | 2019-10-29 | vi | True | False | 3163926.0 | 5450+14+9 | NaN |
6010465 | 2014-12-06 | en | False | False | 3085821.0 | 13+116+22+23+14+1001939+245368+586439+261582+7... | NaN |
6723756 | 2013-02-16 | en | False | True | 2853949.0 | 13+14+951+40167+6563+6560+6559+950+1056+109629... | NaN |
691137 | 2020-10-17 | en | False | False | 2598127.0 | 11+13999+2927+6276+2246+17+61 | NaN |
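To see what the .nlargest() lookup does, here is a minimal sketch on a toy DataFrame (made-up index labels, with word counts borrowed from the tables in this post). It also shows DataFrame.nlargest(), which should select the same rows in one step; whether that form is preferable for the full dataset is not something tested here.

```python
import pandas as pd

# Toy data with one missing word count
toy = pd.DataFrame(
    {"word_count": [388.0, 5078036.0, None, 1638.0, 4796066.0]},
    index=[10, 11, 12, 13, 14],
)

# Approach used in the post: rank the column, then look the rows up by index
top_via_series = toy.loc[toy["word_count"].nlargest(2).index]

# One-step form on the DataFrame itself
top_via_frame = toy.nlargest(2, "word_count")

print(top_via_series)
print(top_via_frame)
```

Both calls skip the row with the missing word count and return the rows in descending order of word_count.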
# save the subset in local csv file
works.loc[works['word_count'].nlargest(10).index].to_csv("trivia/top-words-all-time.csv",index=False)
As shown above, the work with the most words on AO3 has a word count of 5,078,036. In the following post, we're going to use the tag ids to find more information about these works.
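As a small preparation for working with the tag ids, the sketch below splits a tags field into a list of integers. It assumes the field is a '+'-separated string of numeric ids, as the previews suggest; the helper split_tag_ids and the tag_ids column are hypothetical names, and treating a missing field as an empty list is a choice made here for illustration.

```python
import pandas as pd

# Toy tags column in the same '+'-separated shape as the preview
toy = pd.DataFrame({"tags": ["5450+14+9", "11+13999+2927", None]})

def split_tag_ids(field):
    """Turn '5450+14+9' into [5450, 14, 9]; a missing value becomes []."""
    if pd.isna(field):
        return []
    return [int(tag_id) for tag_id in field.split("+")]

# Hypothetical new column holding the parsed ids
toy["tag_ids"] = toy["tags"].apply(split_tag_ids)
print(toy)
```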
Works With Most Words By Year
Let’s do something a little more complex than above. Let’s find the work with most words in each year.
The steps are as follows:
- splitting the data by year using .groupby();
- selecting the word_count column;
- using the .idxmax() method to find the index of the maximum value in that column for each year;
- using .loc[] with those indices to select the rows.
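The steps above can be sketched on a toy DataFrame first. This is a minimal illustration with made-up rows (dates and word counts borrowed from the tables in this post); the names toy, years, idx and top_by_year are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "creation date": pd.to_datetime(
            ["2019-01-05", "2019-12-18", "2020-03-01", "2020-10-17"]
        ),
        "word_count": [1200.0, 3456587.0, 2598127.0, 900.0],
    }
)

# Step 1: split by year
years = toy["creation date"].dt.year

# Steps 2 and 3: select word_count, then find the row label
# of the maximum within each year
idx = toy.groupby(years)["word_count"].idxmax()

# Step 4: use those row labels to pull out the full rows
top_by_year = toy.loc[idx]
print(top_by_year)
```

Here idx is a Series keyed by year whose values are row labels, which is why it can be passed straight to .loc[].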
# clear memory
gc.collect()
289
# make sure the date column is in datetime format
works['creation date'] = pd.to_datetime(works['creation date'])
# find works with most words by year
works.loc[works.groupby(by=[works['creation date'].dt.year]).word_count.idxmax()]
| | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 |
---|---|---|---|---|---|---|---|
7268202 | 2008-11-06 | en | False | False | 128163.0 | 22+183+2390+1048+966+16+1000+968+184+2395+2379... | NaN |
7262756 | 2009-11-14 | en | False | True | 756596.0 | 23+19+13+114941+63594+125727+134988 | NaN |
7200298 | 2010-04-14 | en | False | False | 1005091.0 | 2246+14+78550+8096+95285+8133+95354+8094+8130+... | NaN |
7094887 | 2011-06-06 | zh | True | True | 1490481.0 | 13+23+14+1039+20020+24+22 | NaN |
6955954 | 2012-04-16 | en | False | False | 1310636.0 | 13+23+14+136512+972932+4937593+70650+1833+2417... | NaN |
6500931 | 2013-10-06 | en | False | False | 3817471.0 | 11+21+16+1133664+48012+48013+648995+16999+1090... | NaN |
6212214 | 2014-06-22 | en | False | False | 4796066.0 | 23+14+15322+109011+108231+108232+186363+600534... | NaN |
5580243 | 2015-09-20 | en | False | False | 1779264.0 | 21+16+10767+248734+8005+28451+1000+192+12 | NaN |
4978554 | 2016-08-28 | en | False | False | 5078036.0 | 22+541478+15918+126089+63182+12+741433+230931+... | NaN |
4168858 | 2017-09-23 | en | False | False | 2588814.0 | 299359+299357+299358+2927+1371926+21+14+12 | NaN |
3502920 | 2018-07-14 | en | False | False | 4332910.0 | 1026+109503+12695+16+24754629+116+37259+11+796... | NaN |
1996502 | 2019-12-18 | en | False | False | 3456587.0 | 12+3693074+14030081+10482076+8658412+8658409+1... | NaN |
691137 | 2020-10-17 | en | False | False | 2598127.0 | 11+13999+2927+6276+2246+17+61 | NaN |
306970 | 2021-01-02 | en | False | False | 2261752.0 | 839+30349+9830863+9830815+18525+262083+5195777... | NaN |
# save the subset in local csv file
works.loc[works.groupby(by=[works['creation date'].dt.year]).word_count.idxmax()].to_csv("trivia/top-words-by-year.csv",index=False)