In part I, we focus on how to find the rating tags in Tags file and how to add a new rating column in Works file.

Loading File

# Load python libraries
import pandas as pd

# Load works file
works= pd.read_csv("/home/pi/Downloads/works-20210226.csv")

# Load entire tags file
tags = pd.read_csv("/home/pi/Downloads/tags-20210226.csv")

# preview file
tags

	id	type	name	canonical	cached_count	merger_id
0	1	Media	TV Shows	True	910	NaN
1	2	Media	Movies	True	1164	NaN
2	3	Media	Books & Literature	True	134	NaN
3	4	Media	Cartoons & Comics & Graphic Novels	True	166	NaN
4	5	Media	Anime & Manga	True	501	NaN
...	...	...	...	...	...	...
14467133	55395603	Freeform	Redacted	False	0	NaN
14467134	55395606	Freeform	Redacted	False	0	NaN
14467135	55395609	Freeform	Redacted	False	0	NaN
14467136	55395612	Freeform	Redacted	False	0	NaN
14467137	55395615	Freeform	Redacted	False	0	NaN

14467138 rows × 6 columns

# preview file
works

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6
0	2021-02-26	en	False	True	388.0	10+414093+1001939+4577144+1499536+110+4682892+...	NaN
1	2021-02-26	en	False	True	1638.0	10+20350917+34816907+23666027+23269305+2326930...	NaN
2	2021-02-26	en	False	True	1502.0	10+10613413+9780526+3763877+3741104+7657229+30...	NaN
3	2021-02-26	en	False	True	100.0	10+15322+54862755+20595867+32994286+663+471751...	NaN
4	2021-02-26	en	False	True	994.0	11+721553+54604+1439500+3938423+53483274+54862...	NaN
...	...	...	...	...	...	...	...
7269688	2008-09-13	en	True	True	705.0	78+77+84+101+104+105+106+23+13+16+70+933	NaN
7269689	2008-09-13	en	False	True	1392.0	78+77+84+107+23+10+16+70+933+616	NaN
7269690	2008-09-13	en	False	True	1755.0	77+78+69+108+109+62+110+23+9+111+16+70+10128+4858	NaN
7269691	2008-09-13	en	False	True	1338.0	112+113+13+114+16+115+101+117+118+119+120+116+...	NaN
7269692	2008-09-13	en	False	True	1836.0	123+124+125+127+128+13+129+14+130+131+132+133+...	NaN

7269693 rows × 7 columns

From the preview, we see that:

The tags column in works contains tag ids for each work, separated by plus sign
Tags file has information about tag ids, types, names, etc

From previous post, we’ve found what tag type looks like:

Media
Rating
ArchiveWarning
Category
Character
Fandom
Relationship
Freeform
UnsortedTag

In this post, we want to find more information about Rating tags.

Rating Tags

# Find rating tags in tags file
rating = tags[tags['type'] == 'Rating']
rating

	id	type	name	canonical	cached_count	merger_id
8	9	Rating	Not Rated	True	825385	NaN
9	10	Rating	General Audiences	True	2115153	NaN
10	11	Rating	Teen And Up Audiences	True	2272688	NaN
11	12	Rating	Mature	True	1151260	NaN
12	13	Rating	Explicit	True	1238331	NaN
3686791	12766726	Rating	Teen & Up Audiences	False	333	NaN

# Save rating tags in a csv file
rating.to_csv('rating.csv', index=False)

There are 5 types of ratings on AO3. The last tag “Teen & Up Audiences” is a duplicate of “Teen And Up Audiences”. Because it has a low cached_count compared to others, we discard it in our analysis.

To simplify the data cleaning process in Rating Tags in Works Part II, we export the rating DataFrame to a local csv file.

Tags Column in Works

The tags column in works is a long string containing tag ids separated by plus sign. From observation, we find that the first id in the string is most likely a rating id.

To extract the rating id from the tags column in works, we’re going to:

Create a new column named “rating” in works
Extract the first id (which is also the minimum number) from the tags column, add the id to rating column

# Check the type of first row in tags column 
# The first row in tags column is a string

print(works['tags'].iloc[0])
type(works['tags'].iloc[0])

10+414093+1001939+4577144+1499536+110+4682892+21+16

str

# Check if every row in tags column is string
# Result shows there're NA values in tags column

works[works['tags'].apply(lambda x: isinstance(x,str)) == False]

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6
867445	2020-09-09	en	False	True	63.0	NaN	NaN
2066414	2019-11-25	zh	False	True	6593.0	NaN	NaN
4581595	2017-03-13	en	False	True	1012.0	NaN	NaN

# The NA values do not interfere with our analysis
# Drop NA value in tags column

works = works.dropna(subset = ['tags'])

To extract the smallest number from a string, we first split the string into a list using .split() method; then we iterate the list in order to change the object type from string to interger; lastly, we’re able to select the minimum number from the list with min() function.

To apply the above steps on a Series (the tags column in works), we use pandas.Series.apply function.

# Function to find the mimnimum value in the string, and return that value
# First we split the string into a list by the plus sign
# Then we iterate the list, change the object type from string to integer
# Finally we find the minimum value of the list

def find_rating(x):
    return min([int(n) for n in x.split('+')])

# Create a new column named 'rating'
# Use apply() to apply a function to each row
# The function returns minimum value in tags column
# The minimun value should be rating id, from observation

works['rating'] = works['tags'].apply(lambda x: find_rating(x))
works

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6	rating
0	2021-02-26	en	False	True	388.0	10+414093+1001939+4577144+1499536+110+4682892+...	NaN	10
1	2021-02-26	en	False	True	1638.0	10+20350917+34816907+23666027+23269305+2326930...	NaN	10
2	2021-02-26	en	False	True	1502.0	10+10613413+9780526+3763877+3741104+7657229+30...	NaN	10
3	2021-02-26	en	False	True	100.0	10+15322+54862755+20595867+32994286+663+471751...	NaN	10
4	2021-02-26	en	False	True	994.0	11+721553+54604+1439500+3938423+53483274+54862...	NaN	11
...	...	...	...	...	...	...	...	...
7269688	2008-09-13	en	True	True	705.0	78+77+84+101+104+105+106+23+13+16+70+933	NaN	13
7269689	2008-09-13	en	False	True	1392.0	78+77+84+107+23+10+16+70+933+616	NaN	10
7269690	2008-09-13	en	False	True	1755.0	77+78+69+108+109+62+110+23+9+111+16+70+10128+4858	NaN	9
7269691	2008-09-13	en	False	True	1338.0	112+113+13+114+16+115+101+117+118+119+120+116+...	NaN	13
7269692	2008-09-13	en	False	True	1836.0	123+124+125+127+128+13+129+14+130+131+132+133+...	NaN	13

7269690 rows × 8 columns

We assumed that the first id in the tags string is a rating id, however, extra steps should be taken to check if our assumption is correct.

From tags file, we extracted all correct rating ids. If any row in rating column in works falls outside, we’ll know there’re outliers.

# Check if rating column is indeed rating id
works['rating'].isin(rating['id']).all()

False

# Find the row in rating column that is not rating id
# ~ is negative operator
works[~works['rating'].isin(rating['id'])]

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6	rating
896014	2020-09-02	en	True	True	11852.0	641+101375+14+640+18417+176+60+20225004+76	NaN	14
896017	2020-09-02	en	True	True	3092.0	641+101375+14+640+18417+60+76	NaN	14
896018	2020-09-02	en	True	True	4674.0	641+101375+14+640+18417+60+20225004+76	NaN	14
896019	2020-09-02	en	True	True	1835.0	641+101375+14+640+18417+176+60+20225004+76	NaN	14
896025	2020-09-02	en	True	True	3969.0	641+101375+14+640+18417+61+60+76	NaN	14
...	...	...	...	...	...	...	...	...
5806946	2015-04-22	en	False	True	4111.0	116+16+1635478+251330+4240694+4205153+270282+1...	NaN	16
6197248	2014-07-04	en	False	True	875.0	23+14+1285452+1433626+2332773+2332776+663+2332779	NaN	14
6299245	2014-04-14	en	False	True	2395.0	23+16+586439+55409+7267+17067+30326+501842+237...	NaN	16
6404542	2014-01-10	en	False	True	1350.0	23+14+721553+976131+897092+989133	NaN	14
6456251	2013-11-23	en	False	True	1271.0	21+14+872785+993549+1009491+984790+984786+1212...	NaN	14

488 rows × 8 columns

There are 488 works with no rating. This is actually a know issue that the volunteers are actively working on behind-the-curtain. Thus, we simply drop these works from our data set for now.

# Drop works with no rating
works = works[works['rating'].isin(rating['id'])]

Now we have a file that has all the rating information extracted to a single column. We’ll create graphs based on the information in the next post.