Complete Wikipedia edit history (up to January 2008)

Dataset information

The dataset contains the complete edit history (all revisions, all pages) of the English Wikipedia from its inception until January 2008.

There are two parts to the dataset:

Complete Wikipedia edit history

File                                        Description
enwiki-20080103-pages-meta-history.xml.7z   Complete Wikipedia edit history (18GB!)

Note that the file decompresses to more than 3 terabytes of text. Use 7zip to decompress the data on the fly rather than extracting it to disk.

See All revisions of Wikipedia and Latest complete dump for more information about different dumps of the Wikipedia dataset.

Parsed Wikipedia edit history

The data set contains processed metadata for all revisions of all articles extracted from the full Wikipedia XML dump as of 2008-01-03.

For each specified namespace, there is a bzipped file with the pre-processed data and a file with all redirects. The output is in a tagged multi-line format: 14 lines per revision, with space-delimited fields on each line. Each revision record contains the lines shown in the following example:

REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328
CATEGORY American_mathematicians
IMAGE
MAIN Boston_University MIT Harvard_University Cornell_University
TALK
USER
USER_TALK
OTHER De:Steven_Strogatz Es:Steven_Strogatz
EXTERNAL http://www.edge.org/3rd_culture/bios/strogatz.html
TEMPLATE Cite_book Cite_book Cite_journal
COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]]
MINOR 1
TEXTDATA 229
[empty line]
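
As a rough sketch of working with this format, the header line of each record can be pulled out with awk. The field order assumed here (article id, revision id, title, timestamp, editor name, editor id) is read off the example record rather than taken from a formal specification, and the function name revision_summary is purely illustrative.

```shell
# Print "title timestamp editor" for every revision record read from stdin.
# ASSUMPTION (inferred from the example record): a REVISION line holds, in
# order, article id, revision id, title, timestamp, editor name, editor id.
revision_summary() {
    awk '$1 == "REVISION" { print $4, $5, $6 }'
}

# Usage on the real file (streams the data; nothing is written to disk):
# bzcat enwiki-20080103.main.bz2 | revision_summary | head
```

Matching on the first field rather than on the string "REVISION" anywhere in the line avoids picking up COMMENT lines that happen to mention the word.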

Anonymous editors are listed by their IP address, prefixed with "ip:", e.g. ip:69.17.21.242.
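
Given that prefix, a quick tally of anonymous versus registered edits might look like the following sketch. It assumes the editor name is the sixth space-delimited field of each REVISION line, as in the example record; count_anonymous is a hypothetical helper name.

```shell
# Count revisions by anonymous (ip:...) and registered editors on stdin.
# ASSUMPTION: the editor name is field 6 of each REVISION line.
count_anonymous() {
    awk '$1 == "REVISION" {
             # index() returns 1 only when the name starts with "ip:"
             if (index($6, "ip:") == 1) anon++; else reg++
         }
         END { printf "anonymous=%d registered=%d\n", anon, reg }'
}

# Usage:
# bzcat enwiki-20080103.talk.bz2 | count_anonymous
```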

The list of admins with simplified dates of adminship (disregarding demotions and reappointments of the same user) can be found at http://en.wikipedia.org/wiki/User:NoSeptember/List_of_Administrators and http://en.wikipedia.org/wiki/Wikipedia:Former_administrators

Bots can often (but neither necessarily nor exclusively) be identified by the string "bot" in the username. You can create a list of bots by using the bot status page at http://en.wikipedia.org/wiki/Wikipedia:Bots/Status
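
The username heuristic can be sketched as a case-insensitive substring match on the editor field. This is only the rough filter described above, not a reliable bot classifier; it assumes the editor name is the sixth field of each REVISION line, and likely_bot_editors is a hypothetical name.

```shell
# List distinct editor names containing "bot" (any letter case).
# ASSUMPTION: the editor name is field 6 of each REVISION line.
likely_bot_editors() {
    awk '$1 == "REVISION" && tolower($6) ~ /bot/ { print $6 }' | sort -u
}

# Usage:
# bzcat enwiki-20080103.main.bz2 | likely_bot_editors > bot_candidates.txt
```

A human editor with a name such as Abbott would also match, so the output is best cross-checked against the bot status page.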

Wikipedia editors sometimes change their user names, which may lead to misattribution of edits (name changes do not appear to be applied retroactively to previously generated content). This issue may be especially important for prolific contributors. To handle name changes properly, use the logs at http://en.wikipedia.org/wiki/Special:Log/renameuser and/or http://en.wikipedia.org/wiki/Wikipedia:Changing_username
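
One way to apply such corrections is to rewrite the editor field from a rename table before any further analysis. The sketch below assumes you have already distilled the logs into a plain text file with one "old_name new_name" pair per line; that file format, the file name renames.txt, and the function apply_renames are all hypothetical, and chained renames (A to B, then B to C) are not resolved.

```shell
# Rewrite the editor field of REVISION lines using a rename table.
# $1: path to a HYPOTHETICAL map file with lines of the form "old_name new_name".
# ASSUMPTION: the editor name is field 6 of each REVISION line.
apply_renames() {
    awk -v map="$1" '
        BEGIN {
            # Load the rename table into memory before reading any data.
            while ((getline line < map) > 0) {
                split(line, pair, " ")
                new_name[pair[1]] = pair[2]
            }
            close(map)
        }
        $1 == "REVISION" && ($6 in new_name) { $6 = new_name[$6] }
        { print }
    '
}

# Usage:
# bzcat enwiki-20080103.main.bz2 | apply_renames renames.txt | bzip2 > renamed.main.bz2
```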

The data and the description were prepared by Gueorgi Kossinets.

Source (citation)

Files

File                                 Description
enwiki-20080103.main.bz2             Revisions in the main namespace (the Wikipedia articles) (8GB!)
enwiki-20080103.talk.bz2             Talk namespace: edits of discussion pages attached to each Wikipedia article (<1GB)
enwiki-20080103.user.bz2             Revisions of user personal pages (<1GB)
enwiki-20080103.user_talk.bz2        Revisions of user talk pages (<1GB)
enwiki-20080103.wikipedia.bz2        Wikipedia Wiki namespace (administrative procedures and pages) (3GB)
enwiki-20080103.wikipedia_talk.bz2   Wikipedia Wiki namespace talk pages (<1GB)

To examine a part of the data file, use bzcat and pipe its output to a combination of head, tail, grep, awk, sed, and so on. For example, the command

$ bzcat enwiki-20080103.talk.bz2 | head -n 1414 | tail -n 14

will print lines 1401 through 1414 from the Talk namespace data file.
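
Because every revision occupies 14 lines, lines 1401 through 1414 are the 101st record (14 * 100 + 1 = 1401). Relying on that fixed record length, a small helper can fetch the n-th record directly. The name nth_record is hypothetical, and sed stops reading once the last requested line has been printed.

```shell
# Print the n-th 14-line revision record from stdin (n starts at 1).
# ASSUMPTION: every record is exactly 14 lines, as stated for this format.
nth_record() {
    n=$1
    first=$(( (n - 1) * 14 + 1 ))
    last=$(( n * 14 ))
    # Print only the requested range, then quit without reading further.
    sed -n "${first},${last}p;${last}q"
}

# Usage (same output as the head/tail pipeline shown above):
# bzcat enwiki-20080103.talk.bz2 | nth_record 101
```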

Similarly, the command

$ 7z x -so enwiki-20080103-pages-meta-history.xml.7z | head -n 1414 | tail -n 14

will print lines 1401 through 1414 from the pages-meta-history file.