Background

Using data from hundreds of thousands of Wikia communities, we present a data set that aggregates contributions into weekly bins, starting from each wiki's first edit. The temporal nature of the data means they can be used to understand how these wiki communities evolve over time.

Fandom, originally known as Wikia, is a wiki hosting service founded in 2004. Today it hosts hundreds of thousands of individual wikis, including some popular ones like Wookieepedia (160,000 pages) and Logopedia (114,000 pages). Under the hood, it runs MediaWiki, the same software that powers Wikipedia. As such, it is possible to request and download a data file representing the entire history of an individual wiki.

About the data

We present a data source that represents these communities over time, drawn from two different slices of time. The first is a complete history of Wikia as of 2010, while the second is a large though incomplete set of Wikia wikis through 2020. When a data dump is available, it is linked on the wiki's Special:Statistics page. Wikia initially created these dumps automatically, but after 2010 they had to be requested manually. Furthermore, it became impossible to request data dumps for deleted wikis after this point, meaning that the post-2010 data do not capture a complete picture of all wiki communities.

For both the 2010 and 2020 collections, each wiki is processed through Wikiq, a tool that converts MediaWiki XML dumps into tabular datasets. Wikiq produces one TSV (tab-separated values) file per wiki, with each row in the file corresponding to a single edit. A Python script then takes these files and aggregates them; we chose a bin width of one week. In total, the entire Wikiq job for the 2020 data took ten days to run in parallel on a 28-core high-performance computing machine.
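
The aggregation script itself is not reproduced here, but the following Python sketch illustrates the kind of weekly binning involved. The per-edit column names (date_time, namespace, tokens_added, tokens_removed) are assumptions about the Wikiq output and may differ from the real files.

import pandas as pd

def aggregate_weekly(path, db_name):
    """Illustrative weekly aggregation of one per-edit Wikiq TSV (not the authors' script)."""
    edits = pd.read_csv(path, sep="\t", parse_dates=["date_time"])
    # Bin each edit by the number of whole weeks since the wiki's first edit.
    first = edits["date_time"].min().normalize()
    week = ((edits["date_time"] - first).dt.days // 7).rename("week")
    out = (edits.groupby([week, "namespace"])
                .agg(count=("date_time", "size"),
                     tokens_added=("tokens_added", "sum"),
                     tokens_removed=("tokens_removed", "sum"))
                .reset_index())
    # Label each bin with the date its week starts on, plus the wiki identifier.
    out["date_time"] = first + pd.to_timedelta(out["week"] * 7, unit="D")
    out["db_name"] = db_name
    return out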

A second Python script runs on the original XML data. It parses namespace identifiers and descriptions using regular expressions and also extracts each wiki's URI. Note that the identifier used throughout the dataset for each wiki is its database name, not its URI; the extracted URI can be used to find the correct web address for the wiki as of the date of the dump.
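
As a rough illustration (not the authors' script), the header of a standard MediaWiki dump contains a <siteinfo> block with a <base> URL and <namespace> tags, which regular expressions along these lines can extract:

import re

def parse_siteinfo(xml_path):
    """Extract the base URL and namespace names from a MediaWiki dump header."""
    with open(xml_path, encoding="utf-8") as f:
        # The <siteinfo> block sits at the top of the dump, so a small prefix suffices.
        header = f.read(100_000)
    base = re.search(r"<base>(.*?)</base>", header)
    # Self-closing tags (the main namespace) yield an empty name.
    namespaces = {
        int(key): name
        for key, name in re.findall(
            r'<namespace key="(-?\d+)"[^>]*?(?:/>|>(.*?)</namespace>)', header)
    }
    return (base.group(1) if base else None), namespaces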

Tables

Main Data

The data are shaped as narrow data using an entity-attribute-value model. Since many wikis do not see any edits during many of the weeks covered, this representation avoids the need to store a significant number of zeros for each entity-attribute-value combination. A small illustrative sketch of this layout follows the tables below.

Table 1: Aggregate counts and statistics for each community, binned by week.

Name       Description
date       Last modified date for the wiki.
db_name    Database name, used as the wiki's identifier.
namespace  Number corresponding to the wiki entry type.
variable   Name of the variable recorded for this namespace (see the Variables table below).
Variables
Name                  Description
count                 Count of total edits during the window.
controversial_revert  Count of edits that revert a previous edit.
tokens_removed        Sum of tokens removed from all pages during the window.
tokens_added          Sum of tokens added to all pages during the window.
new_pages             Count of pages created during the window.
unique_editors        Count of distinct editors making edits during the window.
new_editors           Count of editors who had not previously made an edit on this wiki.
anon_editors          Count of anonymous editors (IP addresses) making edits.
anon_edits            Count of edits made by anonymous users.
token_revs            Sum of revisions using PWR (persistent word revisions).
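
To make the narrow layout concrete, the toy example below shows how such a table can be pivoted back to one column per variable when that is more convenient. The rows are entirely made up, and the name of the value column is an assumption.

import pandas as pd

# Hypothetical rows in the narrow entity-attribute-value layout.
narrow = pd.DataFrame({
    "db_name":   ["examplewiki", "examplewiki", "examplewiki"],
    "date":      ["2006-09-17", "2006-09-17", "2006-09-24"],
    "namespace": [0, 0, 0],
    "variable":  ["count", "tokens_added", "count"],
    "value":     [3, 120, 1],
})

# Pivot to a wide table: one column per variable, missing combinations filled with zero.
wide = (narrow.pivot_table(index=["db_name", "date", "namespace"],
                           columns="variable", values="value", fill_value=0)
              .reset_index())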

Namespaces

namespaces: Metadata table describing what the namespaces mean (Table 2).

name value
rows 1908516
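
A hedged sketch of how the namespaces table might be used: load it and look up the namespace labels for a single wiki. The data/namespaces.tsv path and the db_name, namespace, and namespace_name column names are assumptions, not the table's documented schema.

import pandas as pd

# Hypothetical path and column names; check the real file before relying on them.
namespaces = pd.read_csv("data/namespaces.tsv", sep="\t")

# Show the namespace numbers and their labels for one wiki.
print(namespaces[namespaces["db_name"] == "runescape"][["namespace", "namespace_name"]])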

Status

status: Not every wiki could be processed successfully. This table records whether each wiki was processed and also links database names to URLs.
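
A minimal sketch of joining the status table to the weekly data, assuming a data/status.tsv path and columns named db_name, url, and completed (all assumptions; the exact schema is not listed here):

import pandas as pd

status = pd.read_csv("data/status.tsv", sep="\t")   # hypothetical path and columns
weekly = pd.read_csv("data/var_2010_all.tsv", sep="\t")

# Keep only wikis that were processed successfully and attach their URLs.
ok = status[status["completed"]]
with_urls = weekly.merge(ok[["db_name", "url"]], on="db_name", how="inner")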

Examples

Loading data

Plot the number of edits across all Wikia wikis by week.

R

library(tidyverse)  # provides read_tsv, the dplyr verbs, and ggplot2

all_df <- read_tsv("data/var_2010_all.tsv")

# Sum edit counts across all wikis within each weekly bin and plot an area chart.
all_df %>%
  group_by(date_time) %>%
  summarize(value = sum(count)) %>%
  ggplot(aes(x = date_time, y = value)) +
  geom_area() +
  ggtitle("Edit counts by week")

Python

import pandas as pd

# The file is tab-separated, so pass sep="\t" (pandas defaults to commas).
df = pd.read_csv("data/var_2010_all.tsv", sep="\t")
df
##            date_time           db_name  revert  anon  count  tokens_added  ...
## 0         2006-09-17  heroeswithpowers       1     1      1            23  ...
## 1         2006-09-24  heroeswithpowers       0     0      0             0  ...
## ...                                                                        ...
## 10315242  2009-10-11   esdisneychanney       6     6      6           798  ...
##
## [10315243 rows x 11 columns]
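
The snippet above only loads the data; the following sketch draws the same weekly plot with pandas and matplotlib (matplotlib is an assumption here, as it is not used elsewhere in this document).

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/var_2010_all.tsv", sep="\t", parse_dates=["date_time"])

# Sum edit counts across all wikis within each weekly bin and draw an area chart.
weekly = df.groupby("date_time")["count"].sum()
weekly.plot.area(title="Edit counts by week")
plt.show()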

Julia

using DataFrames, CSV, Plots, StatsPlots

# Read the tab-separated file and use the GR plotting backend.
df = CSV.read("data/var_2010_all.tsv", DataFrame; delim='\t');
gr();

# Sum edit counts across all wikis within each weekly bin and plot.
combine(groupby(df, :date_time), :count => sum) |>
  x -> plot(x.date_time, x.count_sum, fmt = :png)

Find the most active wikis

R

all_df %>% group_by(db_name) %>%
  summarize(value = sum(count)) %>%
  arrange(desc(value)) %>%
  head(10) %>% knitr::kable()
db_name             value
runescape         2327287
wowwiki           1986083
wswiki            1272972
enmarveldatabase  1060364
enmemoryalpha     1044217
ffxi               803714
lostpedia          770255
finalfantasy       756141
kosova             683297
nlrunescape        510543

Julia

groupby(df, :db_name) |>
  x -> combine(x, :count => sum) |>
  x -> sort(x, :count_sum, rev = true) |>
  x -> x[1:10, :]
## 10×2 DataFrame
##  Row │ db_name           count_sum
##      │ String            Int64
## ─────┼─────────────────────────────
##    1 │ runescape           2327287
##    2 │ wowwiki             1986083
##    3 │ wswiki              1272972
##    4 │ enmarveldatabase    1060364
##    5 │ enmemoryalpha       1044217
##    6 │ ffxi                 803714
##    7 │ lostpedia            770255
##    8 │ finalfantasy         756141
##    9 │ kosova               683297
##   10 │ nlrunescape          510543
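
Python

A sketch of the same query in pandas, assuming the file is loaded with sep="\t" as in the loading example above.

import pandas as pd

df = pd.read_csv("data/var_2010_all.tsv", sep="\t")

# Total edits per wiki, sorted in descending order; keep the ten most active.
top10 = (df.groupby("db_name")["count"]
           .sum()
           .sort_values(ascending=False)
           .head(10))
print(top10)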