Rating Tags in Works Part I
In part I, we focus on how to find the rating tags in Tags file and how to add a new rating column in Works file.
Loading File
# Load python libraries
import pandas as pd
# Load works file
works= pd.read_csv("/home/pi/Downloads/works-20210226.csv")
# Load entire tags file
tags = pd.read_csv("/home/pi/Downloads/tags-20210226.csv")
# preview file
tags
id | type | name | canonical | cached_count | merger_id | |
---|---|---|---|---|---|---|
0 | 1 | Media | TV Shows | True | 910 | NaN |
1 | 2 | Media | Movies | True | 1164 | NaN |
2 | 3 | Media | Books & Literature | True | 134 | NaN |
3 | 4 | Media | Cartoons & Comics & Graphic Novels | True | 166 | NaN |
4 | 5 | Media | Anime & Manga | True | 501 | NaN |
... | ... | ... | ... | ... | ... | ... |
14467133 | 55395603 | Freeform | Redacted | False | 0 | NaN |
14467134 | 55395606 | Freeform | Redacted | False | 0 | NaN |
14467135 | 55395609 | Freeform | Redacted | False | 0 | NaN |
14467136 | 55395612 | Freeform | Redacted | False | 0 | NaN |
14467137 | 55395615 | Freeform | Redacted | False | 0 | NaN |
14467138 rows × 6 columns
# preview file
works
creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | |
---|---|---|---|---|---|---|---|
0 | 2021-02-26 | en | False | True | 388.0 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN |
1 | 2021-02-26 | en | False | True | 1638.0 | 10+20350917+34816907+23666027+23269305+2326930... | NaN |
2 | 2021-02-26 | en | False | True | 1502.0 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN |
3 | 2021-02-26 | en | False | True | 100.0 | 10+15322+54862755+20595867+32994286+663+471751... | NaN |
4 | 2021-02-26 | en | False | True | 994.0 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
7269688 | 2008-09-13 | en | True | True | 705.0 | 78+77+84+101+104+105+106+23+13+16+70+933 | NaN |
7269689 | 2008-09-13 | en | False | True | 1392.0 | 78+77+84+107+23+10+16+70+933+616 | NaN |
7269690 | 2008-09-13 | en | False | True | 1755.0 | 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 | NaN |
7269691 | 2008-09-13 | en | False | True | 1338.0 | 112+113+13+114+16+115+101+117+118+119+120+116+... | NaN |
7269692 | 2008-09-13 | en | False | True | 1836.0 | 123+124+125+127+128+13+129+14+130+131+132+133+... | NaN |
7269693 rows × 7 columns
From the preview, we see that:
- The tags column in works contains tag ids for each work, separated by plus sign
- Tags file has information about tag ids, types, names, etc
From previous post, we’ve found what tag type looks like:
- Media
- Rating
- ArchiveWarning
- Category
- Character
- Fandom
- Relationship
- Freeform
- UnsortedTag
In this post, we want to find more information about Rating tags.
Rating Tags
# Find rating tags in tags file
rating = tags[tags['type'] == 'Rating']
rating
id | type | name | canonical | cached_count | merger_id | |
---|---|---|---|---|---|---|
8 | 9 | Rating | Not Rated | True | 825385 | NaN |
9 | 10 | Rating | General Audiences | True | 2115153 | NaN |
10 | 11 | Rating | Teen And Up Audiences | True | 2272688 | NaN |
11 | 12 | Rating | Mature | True | 1151260 | NaN |
12 | 13 | Rating | Explicit | True | 1238331 | NaN |
3686791 | 12766726 | Rating | Teen & Up Audiences | False | 333 | NaN |
# Save rating tags in a csv file
rating.to_csv('rating.csv', index=False)
There are 5 types of ratings on AO3. The last tag “Teen & Up Audiences” is a duplicate of “Teen And Up Audiences”. Because it has a low cached_count compared to others, we discard it in our analysis.
To simplify the data cleaning process in Rating Tags in Works Part II, we export the rating DataFrame to a local csv file.
Tags Column in Works
The tags column in works is a long string containing tag ids separated by plus sign. From observation, we find that the first id in the string is most likely a rating id.
To extract the rating id from the tags column in works, we’re going to:
- Create a new column named “rating” in works
- Extract the first id (which is also the minimum number) from the tags column, add the id to rating column
# Check the type of first row in tags column
# The first row in tags column is a string
print(works['tags'].iloc[0])
type(works['tags'].iloc[0])
10+414093+1001939+4577144+1499536+110+4682892+21+16
str
# Check if every row in tags column is string
# Result shows there're NA values in tags column
works[works['tags'].apply(lambda x: isinstance(x,str)) == False]
creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | |
---|---|---|---|---|---|---|---|
867445 | 2020-09-09 | en | False | True | 63.0 | NaN | NaN |
2066414 | 2019-11-25 | zh | False | True | 6593.0 | NaN | NaN |
4581595 | 2017-03-13 | en | False | True | 1012.0 | NaN | NaN |
# The NA values do not interfere with our analysis
# Drop NA value in tags column
works = works.dropna(subset = ['tags'])
To extract the smallest number from a string, we first split the string into a list using .split() method; then we iterate the list in order to change the object type from string to interger; lastly, we’re able to select the minimum number from the list with min() function.
To apply the above steps on a Series (the tags column in works), we use pandas.Series.apply function.
# Function to find the mimnimum value in the string, and return that value
# First we split the string into a list by the plus sign
# Then we iterate the list, change the object type from string to integer
# Finally we find the minimum value of the list
def find_rating(x):
return min([int(n) for n in x.split('+')])
# Create a new column named 'rating'
# Use apply() to apply a function to each row
# The function returns minimum value in tags column
# The minimun value should be rating id, from observation
works['rating'] = works['tags'].apply(lambda x: find_rating(x))
works
creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | rating | |
---|---|---|---|---|---|---|---|---|
0 | 2021-02-26 | en | False | True | 388.0 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN | 10 |
1 | 2021-02-26 | en | False | True | 1638.0 | 10+20350917+34816907+23666027+23269305+2326930... | NaN | 10 |
2 | 2021-02-26 | en | False | True | 1502.0 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN | 10 |
3 | 2021-02-26 | en | False | True | 100.0 | 10+15322+54862755+20595867+32994286+663+471751... | NaN | 10 |
4 | 2021-02-26 | en | False | True | 994.0 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN | 11 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
7269688 | 2008-09-13 | en | True | True | 705.0 | 78+77+84+101+104+105+106+23+13+16+70+933 | NaN | 13 |
7269689 | 2008-09-13 | en | False | True | 1392.0 | 78+77+84+107+23+10+16+70+933+616 | NaN | 10 |
7269690 | 2008-09-13 | en | False | True | 1755.0 | 77+78+69+108+109+62+110+23+9+111+16+70+10128+4858 | NaN | 9 |
7269691 | 2008-09-13 | en | False | True | 1338.0 | 112+113+13+114+16+115+101+117+118+119+120+116+... | NaN | 13 |
7269692 | 2008-09-13 | en | False | True | 1836.0 | 123+124+125+127+128+13+129+14+130+131+132+133+... | NaN | 13 |
7269690 rows × 8 columns
We assumed that the first id in the tags string is a rating id, however, extra steps should be taken to check if our assumption is correct.
From tags file, we extracted all correct rating ids. If any row in rating column in works falls outside, we’ll know there’re outliers.
# Check if rating column is indeed rating id
works['rating'].isin(rating['id']).all()
False
# Find the row in rating column that is not rating id
# ~ is negative operator
works[~works['rating'].isin(rating['id'])]
creation date | language | restricted | complete | word_count | tags | Unnamed: 6 | rating | |
---|---|---|---|---|---|---|---|---|
896014 | 2020-09-02 | en | True | True | 11852.0 | 641+101375+14+640+18417+176+60+20225004+76 | NaN | 14 |
896017 | 2020-09-02 | en | True | True | 3092.0 | 641+101375+14+640+18417+60+76 | NaN | 14 |
896018 | 2020-09-02 | en | True | True | 4674.0 | 641+101375+14+640+18417+60+20225004+76 | NaN | 14 |
896019 | 2020-09-02 | en | True | True | 1835.0 | 641+101375+14+640+18417+176+60+20225004+76 | NaN | 14 |
896025 | 2020-09-02 | en | True | True | 3969.0 | 641+101375+14+640+18417+61+60+76 | NaN | 14 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
5806946 | 2015-04-22 | en | False | True | 4111.0 | 116+16+1635478+251330+4240694+4205153+270282+1... | NaN | 16 |
6197248 | 2014-07-04 | en | False | True | 875.0 | 23+14+1285452+1433626+2332773+2332776+663+2332779 | NaN | 14 |
6299245 | 2014-04-14 | en | False | True | 2395.0 | 23+16+586439+55409+7267+17067+30326+501842+237... | NaN | 16 |
6404542 | 2014-01-10 | en | False | True | 1350.0 | 23+14+721553+976131+897092+989133 | NaN | 14 |
6456251 | 2013-11-23 | en | False | True | 1271.0 | 21+14+872785+993549+1009491+984790+984786+1212... | NaN | 14 |
488 rows × 8 columns
There are 488 works with no rating. This is actually a know issue that the volunteers are actively working on behind-the-curtain. Thus, we simply drop these works from our data set for now.
# Drop works with no rating
works = works[works['rating'].isin(rating['id'])]
Now we have a file that has all the rating information extracted to a single column. We’ll create graphs based on the information in the next post.