Background
Using data from hundreds of thousands of Wikia communities, we
present a data set that aggregates contributions. The data are grouped
into weekly bins starting from each wiki's first edit. The temporal
nature of the data means they can be used to understand how these wiki
communities evolve over time.
Fandom, originally known as Wikia, is a wiki hosting service founded
in 2004. Today it hosts hundreds of thousands of individual wikis,
including some popular ones like Wookieepedia (160,000 pages) and
Logopedia (114,000 pages). Under the hood, it runs MediaWiki, the same
software that powers Wikipedia. As such, it is possible to request and
download a data file representing the entire history of an individual wiki.
About the data
We present a data source representing these communities over time,
drawn from two snapshots. The first is a complete history of Wikia as
of 2010, while the second is a large though incomplete set of Wikia
wikis through 2020. When a data dump is available, it is linked from
the wiki's Special:Statistics page. Wikia initially created these dumps
automatically, but after 2010 they had to be requested manually.
Furthermore, it became impossible to request data dumps for deleted
wikis after that point, so the Wikia data past 2010 do not capture a
complete picture of all wiki communities.
For both the 2010 and 2020 collections, each wiki is processed
through Wikiq, a tool that converts MediaWiki XML dumps into tabular
datasets. Wikiq produces one TSV (tab-separated values) file per wiki,
with each row in the file corresponding to a single edit. A Python
script then takes these files and aggregates them; we chose a window of
one week. In total, the entire Wikiq job for the 2020 data took ten
days to run in parallel on a 28-core high-performance computing
machine.
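As a rough illustration of that aggregation step, the Python sketch below bins per-edit rows from a single Wikiq TSV into weeks counted from the wiki's first edit. The file name and the column names (date_time, editor, namespace, revid) are assumptions about the Wikiq output rather than a documented interface, and the actual script computes many more statistics.

import pandas as pd

# Sketch of the weekly aggregation, not the actual script: column names
# are assumed, and only two of the released variables are computed here.
edits = pd.read_csv("wikiq_output/heroeswithpowers.tsv", sep="\t",
                    parse_dates=["date_time"])

# Weeks are counted from the wiki's first edit.
first_edit = edits["date_time"].min()
edits["week"] = (edits["date_time"] - first_edit).dt.days // 7

weekly = (
    edits.groupby(["week", "namespace"])
         .agg(count=("revid", "size"),
              unique_editors=("editor", "nunique"))
         .reset_index()
)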
A second Python script runs on the original XML data. It parses
namespace identifiers and descriptions using regular expressions and
also extracts the wiki's URI. Note that the identifier used throughout
the dataset for each wiki is its database name, not its URI; the
extracted URI can thus be used to find the correct web address for the
wiki as of the date of the dump.
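The Python sketch below shows one way such an extraction could look, assuming the standard siteinfo header of a MediaWiki XML export (with a base element and namespace key="..." elements); it is not the actual script, and the file name is a placeholder.

import re

# Read only the top of the dump; <siteinfo> appears before the revisions.
with open("dumps/heroeswithpowers.xml", encoding="utf-8") as f:
    header = f.read(50_000)

# <base> holds the URL of the wiki's main page.
base_url = re.search(r"<base>(.*?)</base>", header).group(1)

# Map namespace ids to labels; the main namespace (key 0) is a
# self-closing element with no label.
namespaces = {
    int(key): name
    for key, name in re.findall(
        r'<namespace key="(-?\d+)"[^>]*?(?:/>|>([^<]*)</namespace>)', header
    )
}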
Tables
Main Data
The data are shaped as narrow data using an entity-attribute-value
model. Since many wikis see no edits during many of the weeks covered,
this representation avoids storing a large number of zero-valued
entity-attribute-value rows (a toy illustration follows the variable
list below).
Table consisting of aggregate counts and statistics for each
community, binned by week:

date | Last modified date for the wiki.
db_name | Name used as an identifier.
namespace | Number corresponding to wiki entry type.
variable | Value for the namespace.
Variables
count | Count of total edits during the window.
controversial_revert | Count of edits that revert a previous edit.
tokens_removed | Sum of tokens removed from all pages during the window.
tokens_added | Sum of tokens added to all pages during the window.
new_pages | Count of pages created during the window.
unique_editors | Count of unique editors during the window.
new_editors | Count of editors who had not previously made an edit on this wiki.
anon_editors | Count of anonymous editors (IP addresses) making edits.
anon_edits | Count of edits made by anonymous users.
token_revs | Sum of revisions using PWR.
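To make the narrow shape concrete, the Python sketch below pivots a few entity-attribute-value rows into one column per variable, filling unobserved cells with zero. The wiki name and values are made up for illustration, and the released files may name their columns differently.

import pandas as pd

# Toy narrow (entity-attribute-value) rows: only observed combinations
# of wiki, week, namespace, and variable are stored.
narrow = pd.DataFrame({
    "db_name":   ["examplewiki"] * 3,
    "date":      ["2006-09-17", "2006-09-17", "2006-10-15"],
    "namespace": [0, 0, 0],
    "variable":  ["count", "tokens_added", "count"],
    "value":     [1, 23, 6],
})

# Pivot to a wide table with one column per variable; weeks where a
# variable was never recorded come out as zero.
wide = (
    narrow.pivot_table(index=["db_name", "date", "namespace"],
                       columns="variable", values="value", fill_value=0)
          .reset_index()
)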
Namespaces
namespaces: Metadata table describing what the
namespaces mean (Table 2).
Status
status: Not every wiki could be processed
successfully. This table records whether each wiki was processed and
also links database names to URLs.
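A minimal Python sketch of using the status table to recover a wiki's URL from its database name; the file name data/status.tsv and the url column are assumptions about the release layout, not taken from the data dictionary.

import pandas as pd

# Look up the URL recorded for a given database name.
status = pd.read_csv("data/status.tsv", sep="\t")
url_by_db = status.set_index("db_name")["url"]

print(url_by_db.get("runescape"))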
Examples
Loading data
Plot the number of edits across all Wikia wikis by week.
R
library(tidyverse)

all_df <- read_tsv("data/var_2010_all.tsv")

# Sum edit counts per week across all wikis and plot the total over time.
all_df %>%
  group_by(date_time) %>%
  summarize(value = sum(count)) %>%
  ggplot(aes(x = date_time, y = value)) +
  geom_area() +
  ggtitle("Edit counts by week")
Python
import pandas as pd
# The file is tab separated, so sep="\t" is required.
pd.read_csv("data/var_2010_all.tsv", sep="\t")
##            date_time           db_name  revert  anon  count  tokens_added  ...
## 0         2006-09-17  heroeswithpowers       1     1      1            23  ...
## 1         2006-09-24  heroeswithpowers       0     0      0             0  ...
## 2         2006-10-01  heroeswithpowers       0     0      0             0  ...
## 3         2006-10-08  heroeswithpowers       0     0      0             0  ...
## 4         2006-10-15  heroeswithpowers       0     0      0             0  ...
## ...              ...               ...     ...   ...    ...           ...  ...
## 10315238  2009-09-13   esdisneychanney       0     0      0             0  ...
## 10315239  2009-09-20   esdisneychanney       0     0      0             0  ...
## 10315240  2009-09-27   esdisneychanney       0     0      0             0  ...
## 10315241  2009-10-04   esdisneychanney       0     0      0             0  ...
## 10315242  2009-10-11   esdisneychanney       6     6      6           798  ...
##
## [10315243 rows x 11 columns]
Julia
using DataFrames, CSV, Plots, StatsPlots

df = CSV.read("data/var_2010_all.tsv", DataFrame);
gr();

# Sum edit counts per week, sort by week, and plot the total over time.
combine(groupby(df, :date_time), :count => sum) |>
    x -> sort(x, :date_time) |>
    x -> plot(x.date_time, x.count_sum, fmt = :png)
Find the most active wikis
R
all_df %>% group_by(db_name) %>%
summarize(value = sum(count)) %>%
arrange(desc(value)) %>%
head(10) %>% knitr::kable()
db_name | value
runescape | 2327287
wowwiki | 1986083
wswiki | 1272972
enmarveldatabase | 1060364
enmemoryalpha | 1044217
ffxi | 803714
lostpedia | 770255
finalfantasy | 756141
kosova | 683297
nlrunescape | 510543
Julia
# Total edits per wiki, sorted in descending order; keep the top ten.
groupby(df, :db_name) |>
    x -> combine(x, :count => sum) |>
    x -> sort(x, :count_sum, rev = true) |>
    x -> x[1:10, :]
## 10×2 DataFrame
## Row │ db_name count_sum
## │ String Int64
## ─────┼─────────────────────────────
## 1 │ runescape 2327287
## 2 │ wowwiki 1986083
## 3 │ wswiki 1272972
## 4 │ enmarveldatabase 1060364
## 5 │ enmemoryalpha 1044217
## 6 │ ffxi 803714
## 7 │ lostpedia 770255
## 8 │ finalfantasy 756141
## 9 │ kosova 683297
## 10 │ nlrunescape 510543