English Works Trend and Seasonality
In this section, we analyze the trend and seasonality of users’ posting habits.
The number of works posted to the Archive is time-series data: a set of observations collected over successive time periods. We won’t go into the details of time-series models in this section. Instead, we’ll briefly showcase two components of time-series analysis: autocorrelation, and decomposing the trend and seasonality of the data set.
Loading File
# Load Python library
import pandas as pd
# Load file
path="/home/pi/Downloads/works-20210226.csv"
chunker = pd.read_csv(path, chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
Data Cleaning
Since the majority of works posted to AO3 are in English, we select the English works in the data set for analysis. The data set contains the creation date of each work, and we’re going to count the number of works posted per month from 2008 to 2021. All of the Python functions used here were discussed in previous blog posts.
# Select two columns that we're interested in
# Select English works for analysis
eng = works[['creation date','language']][works['language'] == 'en']
# Drop NA values
eng = eng.dropna()
# Preview of the DataFrame
eng
| | creation date | language |
|---|---|---|
| 0 | 2021-02-26 | en |
| 1 | 2021-02-26 | en |
| 2 | 2021-02-26 | en |
| 3 | 2021-02-26 | en |
| 4 | 2021-02-26 | en |
| ... | ... | ... |
| 7269688 | 2008-09-13 | en |
| 7269689 | 2008-09-13 | en |
| 7269690 | 2008-09-13 | en |
| 7269691 | 2008-09-13 | en |
| 7269692 | 2008-09-13 | en |
6587693 rows × 2 columns
# Make sure date column is in datetime format
eng['creation date'] = pd.to_datetime(eng['creation date'])
# Count works per calendar month
# pd.Grouper bins the datetime column into monthly periods across the years
eng = eng.groupby([pd.Grouper(key='creation date', freq='1M')]).count()
# Preview of the DataFrame
eng
| creation date | language |
|---|---|
| 2008-09-30 | 928 |
| 2008-10-31 | 480 |
| 2008-11-30 | 337 |
| 2008-12-31 | 239 |
| 2009-01-31 | 499 |
| ... | ... |
| 2020-10-31 | 141120 |
| 2020-11-30 | 122796 |
| 2020-12-31 | 154417 |
| 2021-01-31 | 147813 |
| 2021-02-28 | 137125 |
150 rows × 1 columns
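For reference, the same monthly counts can also be produced with resample, which works directly on a DatetimeIndex. This is a minimal sketch of that equivalent approach (the variable name eng_alt is ours, not part of the original notebook):

# Equivalent monthly aggregation using resample
# resample requires the dates to be the DataFrame index
eng_alt = works[['creation date','language']][works['language'] == 'en'].dropna()
eng_alt['creation date'] = pd.to_datetime(eng_alt['creation date'])
eng_alt = eng_alt.set_index('creation date').resample('1M').count()
# eng_alt should match the pd.Grouper result above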
Plotting The Data
Line Plot
# Import libraries
# The %matplotlib inline magic is Jupyter Notebook specific
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Line plot using seaborn library
# Original data
ax = sns.lineplot(data=eng)
# Aesthetics
ax.set_title('Works Posted Per Month \n 2008-2021')
ax.legend(title='Language', labels=['English'])
plt.ylabel('works')
[Line plot: works posted per month, 2008–2021]
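The month-to-month counts are noisy, so the long-run trend can be easier to see with a smoothed overlay. The sketch below adds a centered 12-month rolling mean on top of the line plot; it is an optional illustration, not part of the original analysis:

# Overlay a centered 12-month rolling mean on the monthly counts
ax = sns.lineplot(data=eng, legend=False)
eng['language'].rolling(window=12, center=True).mean().plot(ax=ax, color='red', label='12-month rolling mean')
ax.set_title('Works Posted Per Month \n 2008-2021')
plt.ylabel('works')
plt.legend()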
Seasonal Plot
# Add a year column
eng_annual = eng.copy().reset_index()
eng_annual['year'] = eng_annual['creation date'].dt.year
# Rename columns
eng_annual.columns = ['creation date', 'works', 'year']
# Preview data
eng_annual
| | creation date | works | year |
|---|---|---|---|
| 0 | 2008-09-30 | 928 | 2008 |
| 1 | 2008-10-31 | 480 | 2008 |
| 2 | 2008-11-30 | 337 | 2008 |
| 3 | 2008-12-31 | 239 | 2008 |
| 4 | 2009-01-31 | 499 | 2009 |
| ... | ... | ... | ... |
| 145 | 2020-10-31 | 141120 | 2020 |
| 146 | 2020-11-30 | 122796 | 2020 |
| 147 | 2020-12-31 | 154417 | 2020 |
| 148 | 2021-01-31 | 147813 | 2021 |
| 149 | 2021-02-28 | 137125 | 2021 |
150 rows × 3 columns
# Visualization
ax = sns.lineplot(data=eng_annual, x=eng_annual['creation date'].dt.month, y='works', hue='year', ci=None)
# Aesthetics
plt.xlabel('Month')
plt.ylabel('Works')
plt.title('Works Posted Per Month \n 2008-2021')
[Seasonal plot: works posted per month, one line per year, 2008–2021]
Boxplot Distribution
# Monthly Posting
ax = sns.boxplot(data=eng_annual, x=eng_annual['creation date'].dt.month, y='works')
# Aesthetics
plt.xlabel('Month')
plt.ylabel('Works')
plt.title('Works Posted Per Month \n 2008-2021')
[Boxplot: distribution of monthly posting counts by calendar month, 2008–2021]
# Annual Posting
ax = sns.boxplot(data=eng_annual, x=eng_annual['creation date'].dt.year, y='works')
# Aesthetics
plt.xlabel('Year')
plt.ylabel('Works')
plt.title('Works Posted Per Year \n 2008-2021')
[Boxplot: distribution of monthly posting counts by year, 2008–2021]
Autocorrelation
Autocorrelation measures the correlation between a time series and a lagged copy of itself, that is, between observations in one period and those in earlier periods. When the data are both trended and seasonal, we will see a pattern in the ACF (AutoCorrelation Function) plot.
# Import Statsmodels library
import statsmodels.api as sm
# ACF plot
ax = sm.graphics.tsa.plot_acf(eng, lags=37)
#Aesthetics
plt.xlabel('Lag')
plt.ylabel('Correlation')
[ACF plot of monthly English works, lags 0–37]
The ACF decreases slowly as the lag increases, indicating that the data is trended.
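To connect the plot back to the definition, each bar in the ACF is the correlation between the series and a lagged copy of itself. A minimal sketch with pandas (the lag values are chosen for illustration; pandas computes a plain Pearson correlation, so the numbers will be close to, but not identical to, the statsmodels ACF):

# Correlation of the monthly counts with lagged copies of themselves
series = eng['language']
for lag in [1, 6, 12, 24]:
    print(f'lag {lag}: {series.autocorr(lag=lag):.3f}')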
Decomposing A Time Series Data
Time-series data comprise three components: trend, seasonality, and residual (noise). An additive decomposition assumes that data = trend + seasonality + residual, while a multiplicative decomposition assumes that data = trend × seasonality × residual.
# Import Statsmodels library
from statsmodels.tsa.seasonal import seasonal_decompose
# Adjust figure size
plt.rcParams['figure.figsize'] = [7, 5]
# Additive Decomposition
result_add = seasonal_decompose(eng, model='additive')
ax=result_add.plot()
# Multiplicative Decomposition
result_mul = seasonal_decompose(eng, model='multiplicative')
ax = result_mul.plot()
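As a sanity check on the formulas above, the components returned by seasonal_decompose can be recombined to recover the observed series (apart from the NaN values at both ends, where the centered moving-average trend is undefined). A minimal sketch:

# Recombine the components: the differences should be ~0 wherever the trend is defined
add_recon = result_add.trend + result_add.seasonal + result_add.resid
mul_recon = result_mul.trend * result_mul.seasonal * result_mul.resid
print((add_recon - result_add.observed).abs().max())
print((mul_recon - result_mul.observed).abs().max())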