Background
Using data from hundreds of thousands of Wikia communities, we
present a data set that aggregates contributions. The data are grouped
into weekly bins starting from each wiki's first edit. The temporal
nature of the data means they can be used to understand how these wiki
communities evolve over time.
Fandom, originally known as Wikia, is a wiki hosting service founded
in 2004. Today it hosts hundreds of thousands of individual wikis,
including some popular ones like Wookieepedia (160,000 pages) and
Logopedia (114,000 pages). Under the hood, it runs MediaWiki, the same
software that powers Wikipedia. As such, it is possible to request and
download a data file representing the entire history of an individual wiki.
About the data
We present a data source representing these communities over time,
drawn from two snapshots. The first is a complete history of Wikia as
of 2010, while the second is a large though incomplete set of Wikia
wikis through 2020. When a data dump is available, it is linked from
the wiki's Special:Statistics page. Wikia initially created these dumps
automatically, but after 2010 they had to be requested manually.
Furthermore, it became impossible to request data dumps for deleted
wikis after that point, so the Wikia data past 2010 do not capture a
complete picture of all wiki communities.
For both the 2010 and 2020 collections, each wiki is processed
through Wikiq, a tool that converts MediaWiki XML dumps into tabular
datasets. Wikiq produces one TSV (tab-separated values) file per wiki,
with each row in the file corresponding to a single edit. A Python
script then takes these files and aggregates them; we chose a window of
one week. In total, the entire Wikiq job for the 2020 data took ten
days to run in parallel on a 28-core high-performance computing
machine.
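As a rough illustration of that aggregation step, the Python sketch below bins per-edit rows from a single Wikiq TSV into weeks counted from the wiki's first edit. The file name and the column names (date_time, editor, namespace, revid) are assumptions about the Wikiq output rather than a documented interface, and the actual script computes many more statistics.

import pandas as pd

# Sketch of the weekly aggregation, not the actual script: column names
# are assumed, and only two of the released variables are computed here.
edits = pd.read_csv("wikiq_output/heroeswithpowers.tsv", sep="\t",
                    parse_dates=["date_time"])

# Weeks are counted from the wiki's first edit.
first_edit = edits["date_time"].min()
edits["week"] = (edits["date_time"] - first_edit).dt.days // 7

weekly = (
    edits.groupby(["week", "namespace"])
         .agg(count=("revid", "size"),
              unique_editors=("editor", "nunique"))
         .reset_index()
)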
A second Python script runs on the original XML data. It parses
namespace identifiers and descriptions using regular expressions and
also extracts the wiki's URI. Note that the identifier used throughout
the dataset for each wiki is its database name, not its URI; the
extracted URI can thus be used to find the correct web address for the
wiki as of the date of the dump.
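The Python sketch below shows one way such an extraction could look, assuming the standard siteinfo header of a MediaWiki XML export (with a base element and namespace key="..." elements); it is not the actual script, and the file name is a placeholder.

import re

# Read only the top of the dump; <siteinfo> appears before the revisions.
with open("dumps/heroeswithpowers.xml", encoding="utf-8") as f:
    header = f.read(50_000)

# <base> holds the URL of the wiki's main page.
base_url = re.search(r"<base>(.*?)</base>", header).group(1)

# Map namespace ids to labels; the main namespace (key 0) is a
# self-closing element with no label.
namespaces = {
    int(key): name
    for key, name in re.findall(
        r'<namespace key="(-?\d+)"[^>]*?(?:/>|>([^<]*)</namespace>)', header
    )
}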
Tables
Main Data
The data are shaped as narrow data using an entity-attribute-value
model. Since many wikis see no edits during many of the weeks covered,
this representation avoids storing a large number of zero-valued
entity-attribute-value rows (a toy illustration follows the variable
list below).
Table consisting of aggregate counts and statistics for each
community, binned by week:

date | Last modified date for the wiki.
db_name | Name used as an identifier.
namespace | Number corresponding to wiki entry type.
variable | Value for the namespace.
Variables
count | Count of total edits during the window.
controversial_revert | Count of edits that revert a previous edit.
tokens_removed | Sum of tokens removed from all pages during the window.
tokens_added | Sum of tokens added to all pages during the window.
new_pages | Count of pages created during the window.
unique_editors | Count of unique editors during the window.
new_editors | Count of editors who had not previously made an edit on this wiki.
anon_editors | Count of anonymous editors (IP addresses) making edits.
anon_edits | Count of edits made by anonymous users.
token_revs | Sum of revisions using PWR.
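To make the narrow shape concrete, the Python sketch below pivots a few entity-attribute-value rows into one column per variable, filling unobserved cells with zero. The wiki name and values are made up for illustration, and the released files may name their columns differently.

import pandas as pd

# Toy narrow (entity-attribute-value) rows: only observed combinations
# of wiki, week, namespace, and variable are stored.
narrow = pd.DataFrame({
    "db_name":   ["examplewiki"] * 3,
    "date":      ["2006-09-17", "2006-09-17", "2006-10-15"],
    "namespace": [0, 0, 0],
    "variable":  ["count", "tokens_added", "count"],
    "value":     [1, 23, 6],
})

# Pivot to a wide table with one column per variable; weeks where a
# variable was never recorded come out as zero.
wide = (
    narrow.pivot_table(index=["db_name", "date", "namespace"],
                       columns="variable", values="value", fill_value=0)
          .reset_index()
)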
Namespaces
namespaces: Metadata table describing what the
namespaces mean (Table 2).
Status
status: Not every wiki could be processed
successfully. This table records whether each wiki was processed and
also links database names to URLs.
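A minimal Python sketch of using the status table to recover a wiki's URL from its database name; the file name data/status.tsv and the url column are assumptions about the release layout, not taken from the data dictionary.

import pandas as pd

# Look up the URL recorded for a given database name.
status = pd.read_csv("data/status.tsv", sep="\t")
url_by_db = status.set_index("db_name")["url"]

print(url_by_db.get("runescape"))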
Examples
Loading data
Plot the number of edits across all Wikia wikis by week.
R
library(tidyverse)

all_df <- read_tsv("data/var_2010_all.tsv")

# Sum edit counts per week across all wikis and plot the total over time.
all_df %>%
  group_by(date_time) %>%
  summarize(value = sum(count)) %>%
  ggplot(aes(x = date_time, y = value)) +
  geom_area() +
  ggtitle("Edit counts by week")
Python
import pandas as pd
# The file is tab separated, so sep="\t" is required.
pd.read_csv("data/var_2010_all.tsv", sep="\t")
##            date_time           db_name  revert  anon  count  tokens_added  ...
## 0         2006-09-17  heroeswithpowers       1     1      1            23  ...
## 1         2006-09-24  heroeswithpowers       0     0      0             0  ...
## 2         2006-10-01  heroeswithpowers       0     0      0             0  ...
## 3         2006-10-08  heroeswithpowers       0     0      0             0  ...
## 4         2006-10-15  heroeswithpowers       0     0      0             0  ...
## ...              ...               ...     ...   ...    ...           ...  ...
## 10315238  2009-09-13   esdisneychanney       0     0      0             0  ...
## 10315239  2009-09-20   esdisneychanney       0     0      0             0  ...
## 10315240  2009-09-27   esdisneychanney       0     0      0             0  ...
## 10315241  2009-10-04   esdisneychanney       0     0      0             0  ...
## 10315242  2009-10-11   esdisneychanney       6     6      6           798  ...
##
## [10315243 rows x 11 columns]
Julia
using DataFrames, CSV, Plots, StatsPlots

df = CSV.read("data/var_2010_all.tsv", DataFrame);
gr();

# Sum edit counts per week, sort by week, and plot the total over time.
combine(groupby(df, :date_time), :count => sum) |>
    x -> sort(x, :date_time) |>
    x -> plot(x.date_time, x.count_sum, fmt = :png)
Find the most active wikis
R
all_df %>% group_by(db_name) %>%
summarize(value = sum(count)) %>%
arrange(desc(value)) %>%
head(10) %>% knitr::kable()
db_name | value
runescape | 2327287
wowwiki | 1986083
wswiki | 1272972
enmarveldatabase | 1060364
enmemoryalpha | 1044217
ffxi | 803714
lostpedia | 770255
finalfantasy | 756141
kosova | 683297
nlrunescape | 510543
Julia
# Total edits per wiki, sorted in descending order; keep the top ten.
groupby(df, :db_name) |>
    x -> combine(x, :count => sum) |>
    x -> sort(x, :count_sum, rev = true) |>
    x -> x[1:10, :]
## 10×2 DataFrame
## Row │ db_name count_sum
## │ String Int64
## ─────┼─────────────────────────────
## 1 │ runescape 2327287
## 2 │ wowwiki 1986083
## 3 │ wswiki 1272972
## 4 │ enmarveldatabase 1060364
## 5 │ enmemoryalpha 1044217
## 6 │ ffxi 803714
## 7 │ lostpedia 770255
## 8 │ finalfantasy 756141
## 9 │ kosova 683297
## 10 │ nlrunescape 510543