Download all Wikipedia images with WikiX

There are scores of interesting projects to build with the data Wikipedia makes available.

I recently needed to download all the images on Wikipedia, and an excellent project -- wikix -- was brought to my attention. It is the "best-practice" way of downloading Wikipedia image data.

It is an application written in C that parses the Wikipedia XML dump, extracts all the image links, and then creates a set of bash shell scripts that use common Unix utilities such as curl to actually fetch the images, while respecting Wikipedia's guidelines on bots.
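To give a rough sense of what those generated scripts do (this is only an illustration, not wikix's actual output), a fetch loop conceptually looks something like this:

# Illustrative sketch only -- not the real script wikix writes out.
# Read image URLs one per line and fetch each with curl, pausing
# between requests so the download stays polite to the servers.
while read url; do
    curl -s -O "$url"
    sleep 1
done < image_urls.txt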

Compiling WikiX

Compiling wikix is pretty straightforward. It is important to note that the package is double-compressed (a gzipped tarball that was then bzip2-compressed), so you'll run something similar to the following commands:

wget ftp://www.wikigadugi.org/wiki/MediaWiki/wikix.tar.gz.bz2

bzip2 -d wikix.tar.gz.bz2

tar xzvf wikix.tar.gz

cd wikix

chmod 775 * (Note: the downloaded files have very restrictive permissions by default; this opens them up a bit)

make all && make install

If you get the following error: cc1: error: unrecognized command line option "-Wno-pointer-sign" -- it probably means you are compiling with gcc 3 rather than the newer gcc 4. There are two options:

1) If you have both installed, you can point the CC variable at the newer compiler (see the example after the Makefile listing below)

2) You can comment out the active CFLAGS and CFLAGS_LIB lines and uncomment the already commented-out versions at the top of the file, so the Makefile goes from looking like this:

# CFLAGS = -g
# CFLAGS_LIB = -g -c
CFLAGS = -Wno-pointer-sign -g
CFLAGS_LIB = -Wno-pointer-sign -g -c

To looking like this:

CFLAGS = -g
CFLAGS_LIB = -g -c
# CFLAGS = -Wno-pointer-sign -g
# CFLAGS_LIB = -Wno-pointer-sign -g -c
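If you go with the first option and the newer compiler is installed under a separate name (the binary name below is just an assumption; check what your distribution actually calls it), you can override CC on the make command line instead of editing anything:

# Assumes the gcc 4 binary is installed as gcc4 -- adjust to your system
make CC=gcc4 all && make CC=gcc4 install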

Downloading Wikipedia Images

To get started, first download the XML database dump. At the time of writing, I issued the following command:

wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Then unpack the bzipped xml with:

bzip2 -d enwiki-latest-pages-articles.xml.bz2

Then run wikix, specifying the -p flag if you want the scripts to be parallelized (I did), or omit it if you don't:

wikix -p < enwiki-latest-pages-articles.xml &

This took about 12 minutes on the machine I am working on. However, that machine has dual dual-core Opterons (model 254 @ 2.8 GHz), 16 GB of RAM, and 4+ terabytes of storage, so your mileage may vary.

If you need to put the images in a different directory than your current working directory, simply edit image_sh and change the "OUTPUT" variable to the path where you want your images.
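The relevant line in image_sh would end up looking something like the following (the exact layout of the generated script may differ, and the path here is made up):

# Point OUTPUT at the directory that should receive the images
OUTPUT=/data/wikipedia/images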

Then, to actually start sucking down images over the internet, run:

./image_sh -- this should be in the directory you were in when you ran wikix.

An added bonus is that wikix creates a file called "image.log", which contains one image name per line, an ideal format for a quick script that inserts all those image names into a database such as MySQL.
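As a rough sketch of that idea (the database, table, and user names here are made up, and this assumes a MySQL server reachable with the mysql client and LOCAL INFILE enabled), you could bulk-load the log like so:

# Create a table for the image names, then load image.log in one pass
mysql -u youruser -p yourdb -e "CREATE TABLE IF NOT EXISTS wiki_images (name VARCHAR(255))"
mysql -u youruser -p --local-infile=1 yourdb -e "LOAD DATA LOCAL INFILE 'image.log' INTO TABLE wiki_images (name)"

LOAD DATA INFILE is much faster than issuing one INSERT per line, which matters when the log runs to millions of image names.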

As of October 2007, the full set of images comes to roughly 406 gigabytes. So make sure you have lots of disk space!
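A quick way to confirm the target filesystem actually has that much room before kicking things off (substitute whatever path you set as OUTPUT):

df -h /data/wikipedia/images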

A good place to start when looking for the dumps would be the official Wikipedia Database Dump page.

Resources

http://en.wikipedia.org/wiki/Wikipedia:Database_download

http://meta.wikimedia.org/wiki/Wikix