In part I, we focus on how to find the rating tags in Tags file and how to add a new rating column in Works file.

Loading File

# Load python libraries
import pandas as pd
# Load works file
works= pd.read_csv("/home/pi/Downloads/works-20210226.csv")
# Load entire tags file
tags = pd.read_csv("/home/pi/Downloads/tags-20210226.csv")
# preview file
tags
id type name canonical cached_count merger_id
0 1 Media TV Shows True 910 NaN
1 2 Media Movies True 1164 NaN
2 3 Media Books & Literature True 134 NaN
3 4 Media Cartoons & Comics & Graphic Novels True 166 NaN
4 5 Media Anime & Manga True 501 NaN
... ... ... ... ... ... ...
14467133 55395603 Freeform Redacted False 0 NaN
14467134 55395606 Freeform Redacted False 0 NaN
14467135 55395609 Freeform Redacted False 0 NaN
14467136 55395612 Freeform Redacted False 0 NaN
14467137 55395615 Freeform Redacted False 0 NaN

14467138 rows × 6 columns

# preview file
works
creation date language restricted complete word_count tags Unnamed: 6
0 2021-02-26 en False True 388.0 10+414093+1001939+4577144+1499536+110+4682892+... NaN
1 2021-02-26 en False True 1638.0 10+20350917+34816907+23666027+23269305+2326930... NaN
2 2021-02-26 en False True 1502.0 10+10613413+9780526+3763877+3741104+7657229+30... NaN
3 2021-02-26 en False True 100.0 10+15322+54862755+20595867+32994286+663+471751... NaN
4 2021-02-26 en False True 994.0 11+721553+54604+1439500+3938423+53483274+54862... NaN
... ... ... ... ... ... ... ...
7269688 2008-09-13 en True True 705.0 78+77+84+101+104+105+106+23+13+16+70+933 NaN
7269689 2008-09-13 en False True 1392.0 78+77+84+107+23+10+16+70+933+616 NaN
7269690 2008-09-13 en False True 1755.0 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 NaN
7269691 2008-09-13 en False True 1338.0 112+113+13+114+16+115+101+117+118+119+120+116+... NaN
7269692 2008-09-13 en False True 1836.0 123+124+125+127+128+13+129+14+130+131+132+133+... NaN

7269693 rows × 7 columns

From the preview, we see that:

  • The tags column in works contains tag ids for each work, separated by plus sign
  • Tags file has information about tag ids, types, names, etc

From previous post, we’ve found what tag type looks like:

  • Media
  • Rating
  • ArchiveWarning
  • Category
  • Character
  • Fandom
  • Relationship
  • Freeform
  • UnsortedTag

In this post, we want to find more information about Rating tags.

Rating Tags

# Find rating tags in tags file
rating = tags[tags['type'] == 'Rating']
rating
id type name canonical cached_count merger_id
8 9 Rating Not Rated True 825385 NaN
9 10 Rating General Audiences True 2115153 NaN
10 11 Rating Teen And Up Audiences True 2272688 NaN
11 12 Rating Mature True 1151260 NaN
12 13 Rating Explicit True 1238331 NaN
3686791 12766726 Rating Teen & Up Audiences False 333 NaN
# Save rating tags in a csv file
rating.to_csv('rating.csv', index=False)

There are 5 types of ratings on AO3. The last tag “Teen & Up Audiences” is a duplicate of “Teen And Up Audiences”. Because it has a low cached_count compared to others, we discard it in our analysis.

To simplify the data cleaning process in Rating Tags in Works Part II, we export the rating DataFrame to a local csv file.

Tags Column in Works

The tags column in works is a long string containing tag ids separated by plus sign. From observation, we find that the first id in the string is most likely a rating id.

To extract the rating id from the tags column in works, we’re going to:

  • Create a new column named “rating” in works
  • Extract the first id (which is also the minimum number) from the tags column, add the id to rating column
# Check the type of first row in tags column 
# The first row in tags column is a string

print(works['tags'].iloc[0])
type(works['tags'].iloc[0])
10+414093+1001939+4577144+1499536+110+4682892+21+16





str
# Check if every row in tags column is string
# Result shows there're NA values in tags column

works[works['tags'].apply(lambda x: isinstance(x,str)) == False]
creation date language restricted complete word_count tags Unnamed: 6
867445 2020-09-09 en False True 63.0 NaN NaN
2066414 2019-11-25 zh False True 6593.0 NaN NaN
4581595 2017-03-13 en False True 1012.0 NaN NaN
# The NA values do not interfere with our analysis
# Drop NA value in tags column

works = works.dropna(subset = ['tags'])

To extract the smallest number from a string, we first split the string into a list using .split() method; then we iterate the list in order to change the object type from string to interger; lastly, we’re able to select the minimum number from the list with min() function.

To apply the above steps on a Series (the tags column in works), we use pandas.Series.apply function.

# Function to find the mimnimum value in the string, and return that value
# First we split the string into a list by the plus sign
# Then we iterate the list, change the object type from string to integer
# Finally we find the minimum value of the list

def find_rating(x):
    return min([int(n) for n in x.split('+')])
# Create a new column named 'rating'
# Use apply() to apply a function to each row
# The function returns minimum value in tags column
# The minimun value should be rating id, from observation

works['rating'] = works['tags'].apply(lambda x: find_rating(x))
works
creation date language restricted complete word_count tags Unnamed: 6 rating
0 2021-02-26 en False True 388.0 10+414093+1001939+4577144+1499536+110+4682892+... NaN 10
1 2021-02-26 en False True 1638.0 10+20350917+34816907+23666027+23269305+2326930... NaN 10
2 2021-02-26 en False True 1502.0 10+10613413+9780526+3763877+3741104+7657229+30... NaN 10
3 2021-02-26 en False True 100.0 10+15322+54862755+20595867+32994286+663+471751... NaN 10
4 2021-02-26 en False True 994.0 11+721553+54604+1439500+3938423+53483274+54862... NaN 11
... ... ... ... ... ... ... ... ...
7269688 2008-09-13 en True True 705.0 78+77+84+101+104+105+106+23+13+16+70+933 NaN 13
7269689 2008-09-13 en False True 1392.0 78+77+84+107+23+10+16+70+933+616 NaN 10
7269690 2008-09-13 en False True 1755.0 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 NaN 9
7269691 2008-09-13 en False True 1338.0 112+113+13+114+16+115+101+117+118+119+120+116+... NaN 13
7269692 2008-09-13 en False True 1836.0 123+124+125+127+128+13+129+14+130+131+132+133+... NaN 13

7269690 rows × 8 columns

We assumed that the first id in the tags string is a rating id, however, extra steps should be taken to check if our assumption is correct.

From tags file, we extracted all correct rating ids. If any row in rating column in works falls outside, we’ll know there’re outliers.

# Check if rating column is indeed rating id
works['rating'].isin(rating['id']).all()
False
# Find the row in rating column that is not rating id
# ~ is negative operator
works[~works['rating'].isin(rating['id'])]
creation date language restricted complete word_count tags Unnamed: 6 rating
896014 2020-09-02 en True True 11852.0 641+101375+14+640+18417+176+60+20225004+76 NaN 14
896017 2020-09-02 en True True 3092.0 641+101375+14+640+18417+60+76 NaN 14
896018 2020-09-02 en True True 4674.0 641+101375+14+640+18417+60+20225004+76 NaN 14
896019 2020-09-02 en True True 1835.0 641+101375+14+640+18417+176+60+20225004+76 NaN 14
896025 2020-09-02 en True True 3969.0 641+101375+14+640+18417+61+60+76 NaN 14
... ... ... ... ... ... ... ... ...
5806946 2015-04-22 en False True 4111.0 116+16+1635478+251330+4240694+4205153+270282+1... NaN 16
6197248 2014-07-04 en False True 875.0 23+14+1285452+1433626+2332773+2332776+663+2332779 NaN 14
6299245 2014-04-14 en False True 2395.0 23+16+586439+55409+7267+17067+30326+501842+237... NaN 16
6404542 2014-01-10 en False True 1350.0 23+14+721553+976131+897092+989133 NaN 14
6456251 2013-11-23 en False True 1271.0 21+14+872785+993549+1009491+984790+984786+1212... NaN 14

488 rows × 8 columns

There are 488 works with no rating. This is actually a know issue that the volunteers are actively working on behind-the-curtain. Thus, we simply drop these works from our data set for now.

# Drop works with no rating
works = works[works['rating'].isin(rating['id'])]

Now we have a file that has all the rating information extracted to a single column. We’ll create graphs based on the information in the next post.