Unique Values, Groupby, and 5 Most Popular Languages on AO3
Continue with data cleaning, and a little bit of data analysis. Find the 5 most popular languages on AO3.
- Loading File
- Selecting Columns
- Unique Values and Language Options
- Value_counts for Series
- Groupby for DataFrame
- Top 5 Most Popular Languages on AO3
- Simple Graph with Seaborn Library
Loading File
# Load Python library
import pandas as pd
# Load file
path="/home/pi/Downloads/works-20210226.csv"
chunker = pd.read_csv(path, chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
Selecting Columns
We need the language column to find out top 5 most popular languages on AO3. Let’s create a new Series called “language”.
# Select language col
language = works.language
# Drop NA values
language = language.dropna()
# Data set preview
language
0 en
1 en
2 en
3 en
4 en
..
7269688 en
7269689 en
7269690 en
7269691 en
7269692 en
Name: language, Length: 7269603, dtype: object
There are 7269603 works in this data set. Although the works displayed in the preview are all in English, there are other languages as well which we’re going to find using unique()
.
Unique Values and Language Options
We should keep in mind that there are more language options available on AO3. What we have here are languages with at least one work created in that language. Some languages do not currently have a work thus are not included in this data set.
# Find all language options
language.unique()
array(['en', 'zh', 'de', 'fr', 'es', 'ptBR', 'id', 'ru', 'yue', 'bos',
'pl', 'ptPT', 'fil', 'vi', 'it', 'ms', 'ja', 'nl', 'hu', 'hak',
'be', 'ro', 'cs', 'et', 'ko', 'th', 'wuu', 'fi', 'sv', 'el', 'afr',
'sq', 'qtp', 'fa', 'hr', 'bg', 'ca', 'uk', 'sco', 'lv', 'ga', 'tr',
'ar', 'hy', 'fur', 'lt', 'eu', 'so', 'mk', 'si', 'he', 'da', 'sk',
'bn', 'arc', 'ia', 'slv', 'chn', 'br', 'no', 'hi', 'eo', 'mnc',
'cy', 'hau', 'gl', 'ta', 'qkz', 'bod', 'mr', 'tlh', 'la', 'zu',
'tqx', 'qya', 'gem', 'sjn', 'fcs', 'kat', 'sw', 'sr', 'is', 'pa',
'gd', 'jv'], dtype=object)
# Find the total number of languages
len(language.unique())
85
To summarize, we have 85 languages present in the data set. In comparison, you can browse the entire language options on AO3 website.
Value_counts for Series
Value_counts()
is limited to a Series and returns the frequencies of values. We can use it fo find the number of works in each language. For more complex DataFrame structures, Groupby()
is used. More on that later.
# Find number of works in each language
language.value_counts()
en 6587693
zh 335179
ru 136724
es 70645
fr 32145
...
hau 1
jv 1
zu 1
mnc 1
fur 1
Name: language, Length: 85, dtype: int64
# Combine the results into a new DataFrame
# Disable key as index with reset_index()
# Rename columns
top_list = language.value_counts().reset_index()
top_list.columns = ['language', 'work_count']
top_list
language | work_count | |
---|---|---|
0 | en | 6587693 |
1 | zh | 335179 |
2 | ru | 136724 |
3 | es | 70645 |
4 | fr | 32145 |
... | ... | ... |
80 | hau | 1 |
81 | jv | 1 |
82 | zu | 1 |
83 | mnc | 1 |
84 | fur | 1 |
85 rows × 2 columns
Groupby for DataFrame
Groupby
is used to split the data set into groups, compute a summary statistic for each group, and combine the results into a new data structure.
Here, we are going to split the data set into different languages, compute how many works (instances) under each language, and sort the data set so we can find top 5 most popular languages.
# group the language column by counting each language
works.groupby(['language']).size()
language
afr 38
ar 49
arc 6
be 52
bg 67
...
vi 2178
wuu 46
yue 369
zh 335179
zu 1
Length: 85, dtype: int64
# Combine the results into a new DataFrame
# Disable key as index with reset_index()
# Rename columns
# Sort in descending order, modify the existing df with inplace=True
# Update index using ignore_index=True
top_list2 = works.groupby(['language']).size().reset_index()
top_list2.columns = ['language', 'work_count']
top_list2.sort_values(by=['work_count'], ascending=False, inplace=True, ignore_index=True)
top_list2
language | work_count | |
---|---|---|
0 | en | 6587693 |
1 | zh | 335179 |
2 | ru | 136724 |
3 | es | 70645 |
4 | fr | 32145 |
... | ... | ... |
80 | fcs | 1 |
81 | fur | 1 |
82 | kat | 1 |
83 | pa | 1 |
84 | zu | 1 |
85 rows × 2 columns
We have achieved the same results as using value_counts()
. We’ll use more groupby()
function when we add a creaton_date column and analyze the language trend. More on that later. Let’s find the top 5 most popular languages on AO3.
Top 5 Most Popular Languages on AO3
As shown in previous steps, we have prepared a clean, organized DataFrame called top_list for data analysis and visualization. Let’s extract the top 5 rows into a new DataFrame called top5.
# Top 5 rows
top5 = top_list[:5].copy()
top5
language | work_count | |
---|---|---|
0 | en | 6587693 |
1 | zh | 335179 |
2 | ru | 136724 |
3 | es | 70645 |
4 | fr | 32145 |
In top5, we have the all-time most popular languages on AO3 (at the time of this writing) and the number of works in each language. Let’s create a simple visualization to display the data.
Simple Graph with Seaborn Library
There are several ways to plot graphs in Python, such as Matplotlib, Pandas Plot, and Seaborn, the latter two are based on matplotlib. Depending on the complexity of the graph, you can choose to use either one of the libraries.
# Import libraries
# Top line is Jupyter Notebook specific
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Plot using Seaborn library
# Prevent scientific notation with ticklabel_format()
ax = sns.barplot(x="language", y="work_count", data=top5)
ax.ticklabel_format(style='plain', axis='y')