In this series of posts, we look at some fun facts about Archive of Our Own (AO3), such as:

  • works with the most tags;

  • works with the most words;

  • top fandoms;

and so on.

Loading File

# Load Python libraries
import pandas as pd
import gc

# Load the data in chunks of 10,000 rows, then combine the chunks into one DataFrame
chunker = pd.read_csv("/home/pi/Downloads/works-20210226.csv", chunksize=10000)
works = pd.concat(chunker, ignore_index=True)

# Preview
works
         creation date  language  restricted  complete  word_count  tags                                               Unnamed: 6
0        2021-02-26     en        False       True      388.0       10+414093+1001939+4577144+1499536+110+4682892+...  NaN
1        2021-02-26     en        False       True      1638.0      10+20350917+34816907+23666027+23269305+2326930...  NaN
2        2021-02-26     en        False       True      1502.0      10+10613413+9780526+3763877+3741104+7657229+30...  NaN
3        2021-02-26     en        False       True      100.0       10+15322+54862755+20595867+32994286+663+471751...  NaN
4        2021-02-26     en        False       True      994.0       11+721553+54604+1439500+3938423+53483274+54862...  NaN
...      ...            ...       ...         ...       ...         ...                                                ...
7269688  2008-09-13     en        True        True      705.0       78+77+84+101+104+105+106+23+13+16+70+933           NaN
7269689  2008-09-13     en        False       True      1392.0      78+77+84+107+23+10+16+70+933+616                   NaN
7269690  2008-09-13     en        False       True      1755.0      77+78+69+108+109+62+110+23+9+111+16+70+10128+4858  NaN
7269691  2008-09-13     en        False       True      1338.0      112+113+13+114+16+115+101+117+118+119+120+116+...  NaN
7269692  2008-09-13     en        False       True      1836.0      123+124+125+127+128+13+129+14+130+131+132+133+...  NaN

7269693 rows × 7 columns

Finding The Number Of Tags

Each value in the tags column is a string of tag IDs separated by plus signs. To count the tags, we split the string on the plus sign and take the length of the resulting list.

# Drop rows with missing tags
works = works.dropna(subset=["tags"])

# Add a tag length column: split each tag string on "+" and count the pieces
works["tag_len"] = works["tags"].apply(lambda x: len(x.split("+")))
works
         creation date  language  restricted  complete  word_count  tags                                               Unnamed: 6  tag_len
0        2021-02-26     en        False       True      388.0       10+414093+1001939+4577144+1499536+110+4682892+...  NaN         9
1        2021-02-26     en        False       True      1638.0      10+20350917+34816907+23666027+23269305+2326930...  NaN         17
2        2021-02-26     en        False       True      1502.0      10+10613413+9780526+3763877+3741104+7657229+30...  NaN         21
3        2021-02-26     en        False       True      100.0       10+15322+54862755+20595867+32994286+663+471751...  NaN         14
4        2021-02-26     en        False       True      994.0       11+721553+54604+1439500+3938423+53483274+54862...  NaN         17
...      ...            ...       ...         ...       ...         ...                                                ...         ...
7269688  2008-09-13     en        True        True      705.0       78+77+84+101+104+105+106+23+13+16+70+933           NaN         12
7269689  2008-09-13     en        False       True      1392.0      78+77+84+107+23+10+16+70+933+616                   NaN         10
7269690  2008-09-13     en        False       True      1755.0      77+78+69+108+109+62+110+23+9+111+16+70+10128+4858  NaN         14
7269691  2008-09-13     en        False       True      1338.0      112+113+13+114+16+115+101+117+118+119+120+116+...  NaN         14
7269692  2008-09-13     en        False       True      1836.0      123+124+125+127+128+13+129+14+130+131+132+133+...  NaN         21

7269690 rows × 8 columns
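As a quick sanity check on the split-and-count step, here is a minimal sketch on a toy frame. The sample rows are made up for illustration and are not taken from the dump. The sketch also tries a vectorized alternative, str.count, which counts the plus signs and adds one; under the assumption that every non-null value is a plus-separated list of IDs, both approaches should agree.

```python
import pandas as pd

# Hypothetical sample rows in the same "id+id+id" format as the tags column
toy = pd.DataFrame({"tags": ["10+414093+1001939", "78", None, "77+78"]})

# Drop rows with no tags, as in the post; the index keeps its original labels
toy = toy.dropna(subset=["tags"]).copy()

# Approach from the post: split on "+" and count the pieces
toy["tag_len"] = toy["tags"].apply(lambda x: len(x.split("+")))

# Vectorized alternative: number of "+" separators plus one
toy["tag_len_vec"] = toy["tags"].str.count(r"\+") + 1

print(toy)
```

Note that the index is not renumbered after dropna, which is why the last index label in the full table is still 7269692 even though only 7269690 rows remain.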

nlargest() Method

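pandas.Series.nlargest returns the n largest values of a Series, sorted in descending order, with their index labels preserved. (pandas.DataFrame.nlargest does the same for whole rows, ordered by the given columns.)

We call it on the tag_len column to find the top 10 works with the most tags.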

# Find the index of the top 10 works
works.tag_len.nlargest(10).index
Int64Index([3892613, 3587796, 1883307, 4207234,  902616, 4203009, 1671832,
            1223902, 2128078, 3039343],
           dtype='int64')
# Select the rows of the dataframe with the above index.
# Note that .loc selects rows by index label,
# not by integer position along the index.

works.loc[works.tag_len.nlargest(10).index]
         creation date  language  restricted  complete  word_count  tags                                               Unnamed: 6  tag_len
3892613  2018-01-27     en        False       False     9584.0      12+58950+10861138+8191561+21029181+21029184+49...  NaN         678
3587796  2018-06-10     en        True        False     9086.0      12+1909262+23913063+3247121+15425106+5149081+9...  NaN         629
1883307  2020-01-17     en        False       False     1825.0      6541412+6874891+6679306+11004304+10959982+6799...  NaN         599
4207234  2017-09-04     en        False       False     77028.0     13+576450+4182+245530+143056+1487281+1050015+1...  NaN         548
902616   2020-09-01     en        False       False     5100.0      12+1002903+6841411+2467008+1775227+5556795+116...  NaN         518
4203009  2017-09-06     en        False       True      124766.0    2692+18335742+18335451+18335454+18335457+18335...  NaN         517
1671832  2020-03-14     en        False       False     64835.0     13+448284+9607162+38719528+38712538+292740+746...  NaN         497
1223902  2020-06-27     en        False       False     31814.0     2692+14+12+2927+479641+494522+844577+475944+47...  NaN         462
2128078  2019-11-04     en        False       False     21716.0     11+758208+1786102+770175+23+14+3393344+4696920...  NaN         441
3039343  2019-01-05     en        False       True      66413.0     10+26034884+565209+27204215+27175544+60+2472+1...  NaN         435

The subset above shows the 10 works with the most tags. The top work has 678 tags in total.
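The same selection can probably be done in one step with DataFrame.nlargest, which takes the column name and returns whole rows. The sketch below uses a hypothetical toy frame rather than the real data; its index has a gap to mimic the rows removed by dropna. That gap is also why the two-step version needs the label-based .loc rather than the position-based .iloc.

```python
import pandas as pd

# Hypothetical data; index labels 0, 1, 3, 4 mimic a gap left by dropna
toy = pd.DataFrame({"tag_len": [9, 17, 21, 14]}, index=[0, 1, 3, 4])

# Two-step version from the post: Series.nlargest, then a label lookup
top_idx = toy["tag_len"].nlargest(2).index
two_step = toy.loc[top_idx]

# One-step version: DataFrame.nlargest returns the full rows directly
one_step = toy.nlargest(2, "tag_len")

# Once labels and positions diverge, the two lookups return different rows
by_position = toy.iloc[[3]]  # fourth row, whose label is 4
by_label = toy.loc[[3]]      # row whose label is 3

print(one_step)
```

On the full dataset the one-step form would be works.nlargest(10, "tag_len"), assuming the tag_len column has been added as above.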

In the following posts, we’re going to replace tag IDs with tag names in order to find out more about these works.