James Linden

Wikipedia.org XML Dump Importer for MongoDB

Overview

Wikipedia.org XML Dump Importer for MongoDB is a script that imports the standard Wikipedia XML dump into a simple MongoDB data structure, useful as a local cache for searching and manipulating Wikipedia articles. The data structure is designed for ease of use and is not MediaWiki-compatible.

Dataset Source

URL: http://dumps.wikimedia.org/

Updates: monthly

Environment

  • GNU/Linux
  • PHP 5.4+ (with mbstring, simplexml, mongodb extensions)
  • MongoDB 2.2+

Notes

  • This script is designed to run from the command line, not in a web browser.
  • This script reads the compressed file directly; there is no need to decompress it first.
  • The enwiki download is approximately 9.5 GB compressed, and the datastore requires another 45 GB of storage, for a total of approximately 55 GB.
  • The import process took approximately 4 hours on a well-configured quad-core machine with 4 GB of memory.
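The note about reading the compressed file directly can be illustrated with a short sketch. It is not taken from the importer itself: it assumes PHP's bz2, xmlreader and simplexml extensions are available, and iterate_pages() is a hypothetical helper name chosen for this example.

```php
<?php
// Sketch: walk the <page> elements of a MediaWiki XML dump one at a time.
// iterate_pages() is a hypothetical helper, not part of the importer script.
function iterate_pages($path, $callback)
{
    // For .bz2 files, PHP's built-in compress.bzip2:// stream wrapper
    // decompresses on the fly, so nothing is unpacked to disk.
    $uri = (substr($path, -4) === '.bz2') ? 'compress.bzip2://' . $path : $path;

    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $path");
    }

    // Advance to the first <page> element (or to the end of the file).
    while ($reader->read() && $reader->name !== 'page') {
    }

    $count = 0;
    while ($reader->name === 'page') {
        // Hand each page to the callback as a SimpleXMLElement.
        $page = simplexml_load_string($reader->readOuterXml());
        $callback($page);
        $count++;
        // Jump to the next <page> element, skipping the current subtree.
        $reader->next('page');
    }

    $reader->close();
    return $count; // number of pages visited
}
```

Called as iterate_pages('enwiki-20130708-pages-articles.xml.bz2', function ($page) { echo $page->title, "\n"; }); this should print one title per page while holding only a single page in memory at a time.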

Howto

Download the appropriate pages-articles XML dump, for example enwiki-20130708-pages-articles.xml.bz2.

Download wikipedia.org-xmldump-mongodb.php and edit the configuration section at the beginning of the file.

$dsname = 'mongodb://localhost/wp20130708';
$file = 'enwiki-20130708-pages-articles.xml.bz2';
$log = './';
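Before starting a multi-hour import, it may be worth sanity-checking these three settings. The following is a minimal sketch, not part of the importer: check_config() is a hypothetical helper, and it assumes that $log names a directory (as the './' value suggests) and that the database name is the path component of the connection string.

```php
<?php
// Hypothetical pre-flight check for the three settings above;
// check_config() is not part of the importer script.
function check_config($dsname, $file, $log)
{
    $errors = [];

    // The connection string should look like mongodb://host/database.
    $parts = parse_url($dsname);
    if ($parts === false || !isset($parts['scheme']) || $parts['scheme'] !== 'mongodb') {
        $errors[] = 'dsname must start with mongodb://';
    } elseif (!isset($parts['path']) || trim($parts['path'], '/') === '') {
        $errors[] = 'dsname must end with a database name';
    }

    // The dump file must exist and be readable.
    if (!is_readable($file)) {
        $errors[] = "dump file is not readable: $file";
    }

    // Assumption: $log is a directory that log files are written into.
    if (!is_dir($log) || !is_writable($log)) {
        $errors[] = "log directory is not writable: $log";
    }

    return $errors; // an empty array means no problems were found
}
```

An empty result from check_config($dsname, $file, $log) only shows that the values are plausible; it does not confirm that MongoDB is reachable.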

Run the script and watch it for a minute to make sure it starts correctly, then go eat/sleep/etc. for a few hours.

License

This project is licensed under the 2-clause BSD license.

Founder / Head Geek
Digital Dock, LLC
aka kodekrash & N6NRD
Alexandria, LA USA