png

In this section, we’ll work with the date and language columns in the data set, and briefly touch base with the datetime object in Pandas.

Loading File

# Load Python library
import pandas as pd

# Load file
path="/home/pi/Downloads/works-20210226.csv"
chunker = pd.read_csv(path, chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
# Select two columns that we're interested in
lang_date = works[['creation date','language']].copy()

# Drop NA values
lang_date = lang_date.dropna()

# Preview of the DataFrame
lang_date
creation date language
0 2021-02-26 en
1 2021-02-26 en
2 2021-02-26 en
3 2021-02-26 en
4 2021-02-26 en
... ... ...
7269688 2008-09-13 en
7269689 2008-09-13 en
7269690 2008-09-13 en
7269691 2008-09-13 en
7269692 2008-09-13 en

7269603 rows × 2 columns

Datetime in Pandas

Now we want to do some data manipulation. Fistly, we want to find the number of works posted in each language every month from the year of 2008 to 2021. To do this, we are going to group the DataFrame by month and by language.

# Make sure the date column is in datetime format
lang_date['creation date'] = pd.to_datetime(lang_date['creation date']) 
# Group the date column by "month" (freq="1M")
# Pd.Grouper(key, freq) is used instead of pd.Series.dt.year
# because it does not aggregate month over multiple years
# Group the language column by counting each language

lang_count = lang_date.groupby([pd.Grouper(key='creation date',freq='1M'),'language']).size().reset_index()
lang_count.columns = ['date','language','count']
lang_count
date language count
0 2008-09-30 en 928
1 2008-10-31 de 1
2 2008-10-31 en 480
3 2008-11-30 en 337
4 2008-12-31 en 239
... ... ... ...
3423 2021-02-28 uk 28
3424 2021-02-28 vi 131
3425 2021-02-28 wuu 16
3426 2021-02-28 yue 41
3427 2021-02-28 zh 7261

3428 rows × 3 columns

Reshaping Dataframe with Pivot_table()

Continue with data manipulation. Here we’re only interested in the trend of 5 most popular languages on AO3. From previous post, the 5 languages are ‘en’,’zh’,’ru’,’es’, and ‘fr’.

To make our selection process easier and to pave the road for visualization, we use pivot_table() in Pandas to reshape the DataFrame, using date as the index, and the languages as columns.

Let me show you how the new DataFrame looks like.

# Use pivot_table() for easy visualization
# 5 most popular languages as subset
# The list of language is obtained from previous analysis

subset = lang_count.pivot_table(index='date',columns='language',values='count')[['en','zh','ru','es','fr']]
subset
language en zh ru es fr
date
2008-09-30 928.0 NaN NaN NaN NaN
2008-10-31 480.0 NaN NaN NaN NaN
2008-11-30 337.0 NaN NaN NaN NaN
2008-12-31 239.0 NaN NaN NaN NaN
2009-01-31 499.0 NaN NaN NaN NaN
... ... ... ... ... ...
2020-10-31 141120.0 6359.0 2092.0 2116.0 673.0
2020-11-30 122796.0 5201.0 1591.0 1791.0 770.0
2020-12-31 154417.0 5724.0 1576.0 1860.0 739.0
2021-01-31 147813.0 6567.0 3995.0 2519.0 655.0
2021-02-28 137125.0 7261.0 4265.0 1947.0 803.0

150 rows × 5 columns

Plotting The Trend in Language

Having prepared our data, we can now move on to visualization.

# Import libraries
# Top line is Jupyter Notebook specific

%matplotlib inline

import matplotlib.pyplot as plt
# Plotting the graph with Pandas
# The .plot() method is a simple wrapper around plt.plot()
# Use figsize to control the size of the graph

ax = subset.plot(subplots=True,figsize=(12,10),grid=False,title="Language Trend on AO3 \n 2008-2021")

png

Note that the graph displays monthly postings in each language from 2008 to 2021. You can see from the graph when users started to post in particular language, and when the interest in that language peaked throughout the years.