In this series of posts, we look at some fun facts about AO3, such as:

  • works with the most tags;

  • works with the most words;

  • top fandoms;

etc.

Loading File

# Load python libraries
import pandas as pd
import gc
# Load data

chunker = pd.read_csv("/home/pi/Downloads/works-20210226.csv", chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
# Preview
works
         creation date  language  restricted  complete  word_count  tags                                                Unnamed: 6
0        2021-02-26     en        False       True      388.0       10+414093+1001939+4577144+1499536+110+4682892+...   NaN
1        2021-02-26     en        False       True      1638.0      10+20350917+34816907+23666027+23269305+2326930...   NaN
2        2021-02-26     en        False       True      1502.0      10+10613413+9780526+3763877+3741104+7657229+30...   NaN
3        2021-02-26     en        False       True      100.0       10+15322+54862755+20595867+32994286+663+471751...   NaN
4        2021-02-26     en        False       True      994.0       11+721553+54604+1439500+3938423+53483274+54862...   NaN
...      ...            ...       ...         ...       ...         ...                                                 ...
7269688  2008-09-13     en        True        True      705.0       78+77+84+101+104+105+106+23+13+16+70+933            NaN
7269689  2008-09-13     en        False       True      1392.0      78+77+84+107+23+10+16+70+933+616                    NaN
7269690  2008-09-13     en        False       True      1755.0      77+78+69+108+109+62+110+23+9+111+16+70+10128+4858   NaN
7269691  2008-09-13     en        False       True      1338.0      112+113+13+114+16+115+101+117+118+119+120+116+...   NaN
7269692  2008-09-13     en        False       True      1836.0      123+124+125+127+128+13+129+14+130+131+132+133+...   NaN

7269693 rows × 7 columns
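To make the chunked loading easier to follow, here is a minimal sketch on a tiny in-memory CSV. The rows are a made-up miniature of the real file (assumed for illustration only), and the name `sample` is hypothetical. With `chunksize` set, `pd.read_csv` returns an iterator of DataFrames instead of one large frame, and `pd.concat` stitches the pieces back together.

```python
import io
import pandas as pd

# Hypothetical miniature of the works CSV (made-up rows, for illustration only)
csv_text = """creation date,language,restricted,complete,word_count,tags
2021-02-26,en,False,True,388.0,10+414093
2021-02-26,en,False,True,1638.0,10+20350917
2008-09-13,en,True,True,705.0,78+77+84
"""

# With chunksize=2, read_csv yields DataFrames of at most 2 rows each
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
sample = pd.concat(chunker, ignore_index=True)
print(sample.shape)
```

Reading in chunks keeps each parsing step small, which can help on a low-memory machine, although the concatenated result still has to fit in memory.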

Finding The Works With Most Words

We use the .nlargest() method on the word_count column to find the 10 works with the most words on AO3.

# use .nlargest() method
works.loc[works['word_count'].nlargest(10).index]
         creation date  language  restricted  complete  word_count  tags                                                Unnamed: 6
4978554  2016-08-28     en        False       False     5078036.0   22+541478+15918+126089+63182+12+741433+230931+...   NaN
6212214  2014-06-22     en        False       False     4796066.0   23+14+15322+109011+108231+108232+186363+600534...   NaN
3502920  2018-07-14     en        False       False     4332910.0   1026+109503+12695+16+24754629+116+37259+11+796...   NaN
6500931  2013-10-06     en        False       False     3817471.0   11+21+16+1133664+48012+48013+648995+16999+1090...   NaN
1996502  2019-12-18     en        False       False     3456587.0   12+3693074+14030081+10482076+8658412+8658409+1...   NaN
3102917  2018-12-18     en        False       False     3312781.0   12+254648+13714235+19334348+557795+1275+282154...   NaN
2146591  2019-10-29     vi        True        False     3163926.0   5450+14+9                                           NaN
6010465  2014-12-06     en        False       False     3085821.0   13+116+22+23+14+1001939+245368+586439+261582+7...   NaN
6723756  2013-02-16     en        False       True      2853949.0   13+14+951+40167+6563+6560+6559+950+1056+109629...   NaN
691137   2020-10-17     en        False       False     2598127.0   11+13999+2927+6276+2246+17+61                       NaN
# save the subset in local csv file 
works.loc[works['word_count'].nlargest(10).index].to_csv("trivia/top-words-all-time.csv",index=False)
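As a small illustration of the pattern used here, the sketch below runs the same selection on a toy DataFrame. The `title` column and all values are made up for the example. It also shows `DataFrame.nlargest`, which selects the same rows in a single call.

```python
import pandas as pd

# Toy data: the "title" column and all values are hypothetical
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "word_count": [100.0, 5000.0, 40.0, 250.0],
})

# Series.nlargest returns the n largest values, keeping their original index labels
top2_idx = toy["word_count"].nlargest(2).index

# .loc with those labels pulls the full rows, largest first
top2 = toy.loc[top2_idx]

# DataFrame.nlargest picks the same rows without the intermediate index
top2_alt = toy.nlargest(2, "word_count")
print(top2)
```

Storing the result in a variable, as with `top2` here, also avoids computing the selection twice when it is both displayed and saved.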

We can see from the output above that the work with the most words on AO3 has a word count of 5,078,036. In the following post, we’re going to use the tag ids to find more information about these works.

Works With Most Words By Year

Let’s do something a little more complex than above. Let’s find the work with the most words in each year.

The steps are as follows:

  1. splitting the data by year using .groupby();

  2. selecting the word_count column;

  3. using .idxmax() method to find the index of the maximum value in said column;

  4. using .loc[] with the above index to select the rows.
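The steps above can be sketched one at a time on a toy DataFrame. The dates and word counts below are made up for illustration, and the intermediate variable names are hypothetical.

```python
import pandas as pd

# Toy data with made-up dates and word counts
toy = pd.DataFrame({
    "creation date": pd.to_datetime(
        ["2019-01-05", "2019-07-20", "2020-03-11", "2020-10-17"]
    ),
    "word_count": [1200.0, 900.0, 300.0, 4500.0],
})

# Step 1: split the rows by the year of the creation date
grouped = toy.groupby(toy["creation date"].dt.year)

# Step 2: select the word_count column within each group
counts = grouped["word_count"]

# Step 3: for each year, get the index label of the row with the largest word count
idx = counts.idxmax()

# Step 4: use those labels with .loc to select the full rows
longest = toy.loc[idx]
print(longest)
```

Note that `.loc` is an indexer used with square brackets rather than a method called with parentheses.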

# clear memory
gc.collect()
289
# make sure the date column is in datetime format
works['creation date'] = pd.to_datetime(works['creation date'])
# find works with most words by year

works.loc[works.groupby(by=[works['creation date'].dt.year]).word_count.idxmax()]
         creation date  language  restricted  complete  word_count  tags                                                Unnamed: 6
7268202  2008-11-06     en        False       False     128163.0    22+183+2390+1048+966+16+1000+968+184+2395+2379...   NaN
7262756  2009-11-14     en        False       True      756596.0    23+19+13+114941+63594+125727+134988                 NaN
7200298  2010-04-14     en        False       False     1005091.0   2246+14+78550+8096+95285+8133+95354+8094+8130+...   NaN
7094887  2011-06-06     zh        True        True      1490481.0   13+23+14+1039+20020+24+22                           NaN
6955954  2012-04-16     en        False       False     1310636.0   13+23+14+136512+972932+4937593+70650+1833+2417...   NaN
6500931  2013-10-06     en        False       False     3817471.0   11+21+16+1133664+48012+48013+648995+16999+1090...   NaN
6212214  2014-06-22     en        False       False     4796066.0   23+14+15322+109011+108231+108232+186363+600534...   NaN
5580243  2015-09-20     en        False       False     1779264.0   21+16+10767+248734+8005+28451+1000+192+12           NaN
4978554  2016-08-28     en        False       False     5078036.0   22+541478+15918+126089+63182+12+741433+230931+...   NaN
4168858  2017-09-23     en        False       False     2588814.0   299359+299357+299358+2927+1371926+21+14+12          NaN
3502920  2018-07-14     en        False       False     4332910.0   1026+109503+12695+16+24754629+116+37259+11+796...   NaN
1996502  2019-12-18     en        False       False     3456587.0   12+3693074+14030081+10482076+8658412+8658409+1...   NaN
691137   2020-10-17     en        False       False     2598127.0   11+13999+2927+6276+2246+17+61                       NaN
306970   2021-01-02     en        False       False     2261752.0   839+30349+9830863+9830815+18525+262083+5195777...   NaN
# save the subset in local csv file
works.loc[works.groupby(by=[works['creation date'].dt.year]).word_count.idxmax()].to_csv("trivia/top-words-by-year.csv",index=False)