AO3 Trivia: Works With Most Tags
In this series of posts, we look at some fun facts on AO3, such as:
- works with most tags;
- works with most words;
- top fandoms;
- etc.
Loading File
# Load python libraries
import pandas as pd
import gc
# Load data
chunker = pd.read_csv("/home/pi/Downloads/works-20210226.csv", chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
# Preview
works
index | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 |
---|---|---|---|---|---|---|---|
0 | 2021-02-26 | en | False | True | 388.0 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN |
1 | 2021-02-26 | en | False | True | 1638.0 | 10+20350917+34816907+23666027+23269305+2326930... | NaN |
2 | 2021-02-26 | en | False | True | 1502.0 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN |
3 | 2021-02-26 | en | False | True | 100.0 | 10+15322+54862755+20595867+32994286+663+471751... | NaN |
4 | 2021-02-26 | en | False | True | 994.0 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
7269688 | 2008-09-13 | en | True | True | 705.0 | 78+77+84+101+104+105+106+23+13+16+70+933 | NaN |
7269689 | 2008-09-13 | en | False | True | 1392.0 | 78+77+84+107+23+10+16+70+933+616 | NaN |
7269690 | 2008-09-13 | en | False | True | 1755.0 | 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 | NaN |
7269691 | 2008-09-13 | en | False | True | 1338.0 | 112+113+13+114+16+115+101+117+118+119+120+116+... | NaN |
7269692 | 2008-09-13 | en | False | True | 1836.0 | 123+124+125+127+128+13+129+14+130+131+132+133+... | NaN |
7269693 rows × 7 columns
Finding The Number Of Tags
Each value in the tags column is a string of tag ids separated by plus signs. To count the tags of a work, we split the string on the plus sign and take the length of the resulting list.
# Drop rows with missing tags (.copy() avoids a possible SettingWithCopyWarning below)
works = works.dropna(subset=["tags"]).copy()
# Add a tag length column
works["tag_len"] = works["tags"].apply(lambda x: len(x.split("+")))
works
index | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | tag_len |
---|---|---|---|---|---|---|---|---|
0 | 2021-02-26 | en | False | True | 388.0 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN | 9 |
1 | 2021-02-26 | en | False | True | 1638.0 | 10+20350917+34816907+23666027+23269305+2326930... | NaN | 17 |
2 | 2021-02-26 | en | False | True | 1502.0 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN | 21 |
3 | 2021-02-26 | en | False | True | 100.0 | 10+15322+54862755+20595867+32994286+663+471751... | NaN | 14 |
4 | 2021-02-26 | en | False | True | 994.0 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN | 17 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
7269688 | 2008-09-13 | en | True | True | 705.0 | 78+77+84+101+104+105+106+23+13+16+70+933 | NaN | 12 |
7269689 | 2008-09-13 | en | False | True | 1392.0 | 78+77+84+107+23+10+16+70+933+616 | NaN | 10 |
7269690 | 2008-09-13 | en | False | True | 1755.0 | 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 | NaN | 14 |
7269691 | 2008-09-13 | en | False | True | 1338.0 | 112+113+13+114+16+115+101+117+118+119+120+116+... | NaN | 14 |
7269692 | 2008-09-13 | en | False | True | 1836.0 | 123+124+125+127+128+13+129+14+130+131+132+133+... | NaN | 21 |
7269690 rows × 8 columns
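To make the counting step concrete, here is a minimal sketch on a toy dataframe. The tag strings are made up for illustration, and the column name `tag_len_vec` is hypothetical. It shows the split-and-count approach next to a vectorized alternative built on pandas string methods.

```python
import pandas as pd

# Toy data in the same "id+id+id" format as the tags column (values are made up).
toy = pd.DataFrame({"tags": ["10+414093+1001939", "78+77", None, "933"]})

# Drop rows without tags; .copy() avoids a possible SettingWithCopyWarning.
toy = toy.dropna(subset=["tags"]).copy()

# Approach used in this post: split each string and count the pieces.
toy["tag_len"] = toy["tags"].apply(lambda s: len(s.split("+")))

# Vectorized alternative: split with the .str accessor, then take list lengths.
toy["tag_len_vec"] = toy["tags"].str.split("+").str.len()

print(toy)
```

Both columns should hold the same counts. On a dataset of several million rows, the two variants may differ in speed, so it is worth timing them before committing to one.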
nlargest() Method
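pandas.DataFrame.nlargest returns the first n rows with the largest values in the given columns, in descending order. Here we call the Series version, pandas.Series.nlargest, on the tag_len column. It returns the n largest values together with their index labels.
We use the method to find the top 10 works with the most tags.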
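As a small illustration of what nlargest returns, here is a sketch on toy data. The tag counts, word counts, and index labels are invented for the example. It shows the Series form used in this post and the DataFrame form, which returns the full rows in one step.

```python
import pandas as pd

# Hypothetical tag counts for five works; the index labels are arbitrary.
toy = pd.DataFrame(
    {"word_count": [388, 1638, 1502, 100, 994], "tag_len": [9, 17, 21, 14, 12]},
    index=[100, 101, 102, 103, 104],
)

# Series.nlargest: the 3 largest values in descending order,
# each keeping its original index label.
top3 = toy["tag_len"].nlargest(3)
print(top3)

# DataFrame.nlargest: the 3 rows with the largest tag_len, in one call.
top3_rows = toy.nlargest(3, "tag_len")
print(top3_rows)
```

The Series form is handy when only the index labels are needed. The DataFrame form saves the extra lookup when the whole rows are wanted.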
# Find the index of the top 10 works
works.tag_len.nlargest(10).index
Int64Index([3892613, 3587796, 1883307, 4207234, 902616, 4203009, 1671832,
1223902, 2128078, 3039343],
dtype='int64')
# Select the matching rows of the dataframe
# Note: .loc selects rows by index label,
# not by integer position along the index.
works.loc[works.tag_len.nlargest(10).index]
index | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | tag_len |
---|---|---|---|---|---|---|---|---|
3892613 | 2018-01-27 | en | False | False | 9584.0 | 12+58950+10861138+8191561+21029181+21029184+49... | NaN | 678 |
3587796 | 2018-06-10 | en | True | False | 9086.0 | 12+1909262+23913063+3247121+15425106+5149081+9... | NaN | 629 |
1883307 | 2020-01-17 | en | False | False | 1825.0 | 6541412+6874891+6679306+11004304+10959982+6799... | NaN | 599 |
4207234 | 2017-09-04 | en | False | False | 77028.0 | 13+576450+4182+245530+143056+1487281+1050015+1... | NaN | 548 |
902616 | 2020-09-01 | en | False | False | 5100.0 | 12+1002903+6841411+2467008+1775227+5556795+116... | NaN | 518 |
4203009 | 2017-09-06 | en | False | True | 124766.0 | 2692+18335742+18335451+18335454+18335457+18335... | NaN | 517 |
1671832 | 2020-03-14 | en | False | False | 64835.0 | 13+448284+9607162+38719528+38712538+292740+746... | NaN | 497 |
1223902 | 2020-06-27 | en | False | False | 31814.0 | 2692+14+12+2927+479641+494522+844577+475944+47... | NaN | 462 |
2128078 | 2019-11-04 | en | False | False | 21716.0 | 11+758208+1786102+770175+23+14+3393344+4696920... | NaN | 441 |
3039343 | 2019-01-05 | en | False | True | 66413.0 | 10+26034884+565209+27204215+27175544+60+2472+1... | NaN | 435 |
The subset above shows the 10 works with the most tags. The top work has a total of 678 tags.
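The distinction between labels and positions matters here because rows were dropped with dropna, so index labels no longer match row positions. The following sketch, on made-up data, shows how .loc and .iloc can return different rows for the same numbers.

```python
import pandas as pd

# Toy data with one missing value (tag strings are made up).
toy = pd.DataFrame({"tags": ["1+2", None, "3+4+5", "6"]})
toy = toy.dropna(subset=["tags"]).copy()  # remaining index labels: 0, 2, 3
toy["tag_len"] = toy["tags"].str.split("+").str.len()

# Index label of the work with the most tags.
top = toy["tag_len"].nlargest(1).index

# .loc looks the value up as a label: returns the row labelled 2.
by_label = toy.loc[top]

# .iloc treats the same number as a position: returns the third row (label 3).
by_position = toy.iloc[top.tolist()]

print(by_label)
print(by_position)
```

Only the .loc result is the row with the most tags. The .iloc result is a different row, which is why label-based selection is used with the index returned by nlargest.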
In the following posts, we’re going to replace tag ids with tag names in order to find more information about the works.