Download all wikipedia images with WikiX

There are scores of interesting projects to do with the data made available on Wikipedia.

I recently needed to download all the images on Wikipedia, and an excellent project, wikix, was brought to my attention. It is the “best-practice” way of downloading Wikipedia image data.

It is an application written in C that parses the Wikipedia XML, extracts all the image links, and then creates a set of bash shell scripts that use common Unix utilities such as curl to actually fetch the images, while respecting Wikipedia’s guidelines on bots.

Compiling WikiX

Compiling wikix is pretty straightforward. Note that the package is a gzipped tarball that has then been bzipped, so you unpack it in two steps with something similar to the following commands:

wget ftp://www.wikigadugi.org/wiki/MediaWiki/wikix.tar.gz.bz2

bzip2 -d wikix.tar.gz.bz2

tar xzvf wikix.tar.gz

cd wikix

chmod 775 * (Note: the downloaded files have very restrictive permissions by default; this opens them up a bit.)

make all && make install

If you get the error cc1: error: unrecognized command line option “-Wno-pointer-sign”, it probably means you are compiling with gcc3 rather than the newer gcc4. There are two options:

1) If you have both installed, you can point the CC environment variable at gcc4.

2) You can comment out the CFLAGS and CFLAGS_LIB lines and uncomment the already commented-out versions at the top of the Makefile, so that it goes from looking like this:

#CFLAGS = -g
#CFLAGS_LIB = -g -c
CFLAGS = -Wno-pointer-sign -g
CFLAGS_LIB = -Wno-pointer-sign -g -c

To looking like this:

CFLAGS = -g
CFLAGS_LIB = -g -c
#CFLAGS = -Wno-pointer-sign -g
#CFLAGS_LIB = -Wno-pointer-sign -g -c
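Before editing the Makefile, it can help to check which compiler you actually have. The following is only a minimal sketch and not part of wikix: the helper name needs_plain_cflags is made up for illustration, and it assumes that gcc -dumpversion prints a dotted version such as 3.4.6 or 4.1.2.

```shell
#!/bin/sh
# Hypothetical helper (not part of wikix): succeeds when the given GCC
# version is older than 4, i.e. when -Wno-pointer-sign is expected to be
# rejected and the plain "-g" CFLAGS lines are the ones to use.
needs_plain_cflags() {
    major=${1%%.*}          # "3.4.6" -> "3"
    [ "$major" -lt 4 ]
}

# Example use: inspect the default gcc, if there is one on this machine.
if command -v gcc >/dev/null 2>&1; then
    version=$(gcc -dumpversion)
    if needs_plain_cflags "$version"; then
        echo "gcc $version: use the plain -g CFLAGS lines"
    else
        echo "gcc $version: the -Wno-pointer-sign CFLAGS lines should work"
    fi
fi
```

If both compilers are installed, the first option would typically be something like make CC=gcc-4.1 all, where gcc-4.1 is a hypothetical binary name and the Makefile is assumed to honor the CC variable.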

Downloading Wikipedia Images

To get started, first download the xml database dump. At the time of writing, I issued the following command:

wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Then unpack the bzipped xml with:

bzip2 -d enwiki-latest-pages-articles.xml.bz2

Then run wikix, specifying the -p flag if you want the scripts to be parallelized (I did), or omitting it if you don’t:

wikix -p < enwiki-latest-pages-articles.xml &

This took about 12 minutes on the machine I am working on. However, that machine has two dual-core Opterons (model 254 @ 2.8 GHz), 16 GB of RAM, and 4+ terabytes of storage, so your mileage may vary.

If you need to put the images in a directory other than your current working directory, simply edit image_sh and change the “OUTPUT” variable to the path where you want your images.
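If you would rather script that edit than open an editor, something along these lines might do. It is a sketch under an assumption I have not verified against every wikix version: that image_sh sets the variable on a line beginning with OUTPUT=. The helper name set_output_dir and the example path are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the OUTPUT assignment in a generated script.
# Assumes the variable is set on a line that starts with "OUTPUT=".
set_output_dir() {
    script=$1
    dir=$2
    tmp=$(mktemp) || return 1
    # The directory is passed through the environment so that awk does not
    # interpret backslashes in it.
    DIR="$dir" awk '
        /^OUTPUT=/ { print "OUTPUT=\"" ENVIRON["DIR"] "\""; next }
        { print }
    ' "$script" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Overwrite in place so the script keeps its executable bit.
    cat "$tmp" > "$script"
    rm -f "$tmp"
}

# Usage (the path is only an example):
#   set_output_dir image_sh /data/wikipedia-images
```

Check the result with grep OUTPUT image_sh before starting the download.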

Then, to actually start sucking down images over the internet, run:

./image_sh

This script should be in the directory you were in when you ran wikix.
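On systems where /bin/sh is not bash (Ubuntu points it at dash), image_sh can fail with “Syntax error: Bad fd number”, as a reader reports in the comments. A quick check like the one below shows what /bin/sh resolves to. It is a sketch: the helper name sh_target is made up, and running the script as bash ./image_sh is only a suggestion that may not be enough if the scripts it launches also rely on /bin/sh.

```shell
#!/bin/sh
# Hypothetical helper: print the real path behind /bin/sh.
sh_target() {
    # readlink -f follows the symlink chain; fall back to the plain path
    # on systems where that option is unavailable.
    readlink -f /bin/sh 2>/dev/null || echo /bin/sh
}

case "$(sh_target)" in
    */dash) echo "/bin/sh is dash; try: bash ./image_sh" ;;
    *)      echo "/bin/sh is $(sh_target)" ;;
esac
```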

An added bonus is that wikix creates a file called “image.log” which lists each image found, one per line. That is an ideal format for a quick script that inserts all those image names into a database such as MySQL.
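As a rough illustration of such a script, the sketch below turns one-name-per-line input into INSERT statements. The table images, its column name, and the database name in the usage line are all assumptions for the example, as is the helper name image_log_to_sql. It also assumes each line of image.log holds nothing but the image name.

```shell
#!/bin/sh
# Hypothetical helper: read image names (one per line) on stdin and print
# MySQL INSERT statements for an assumed table images(name).
image_log_to_sql() {
    # Skip blank lines, then double backslashes and single quotes so each
    # name is safe inside a '...' string literal, then wrap it in a statement.
    sed -e '/^$/d' \
        -e 's/\\/\\\\/g' \
        -e "s/'/''/g" \
        -e "s/.*/INSERT INTO images (name) VALUES ('&');/"
}

# Usage (the database name is only an example):
#   image_log_to_sql < image.log | mysql wikipedia
```

For a log this large, a bulk loader would be faster than row-by-row inserts, but this is enough to get started.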

As of October 2007, all the images together come to approximately 406 gigabytes, so make sure you have lots of disk space!
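A quick way to check the space before kicking off the download is to ask df. This is a minimal sketch: the helper name has_free_gb is made up, and it counts a gigabyte as 1024 × 1024 kilobyte blocks.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the filesystem holding DIR has at
# least NEED_GB gigabytes available.
has_free_gb() {
    dir=$1
    need_gb=$2
    # df -P prints POSIX-format output and -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

if has_free_gb . 406; then
    echo "enough room here for roughly 406 GB of images"
else
    echo "less than 406 GB free here"
fi
```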

A good place to start when looking for the dumps would be the official Wikipedia Database Dump page.

Resources
http://en.wikipedia.org/wiki/Wikipedia:Database_download
http://meta.wikimedia.org/wiki/Wikix

13 Responses to “Download all wikipedia images with WikiX”

  1. Hey Yousef,

    First, I’m very sorry for commenting in your About Me section. I was just reading over it today, and saw that I completely missed the little note where you said to comment on the actual page. Believe it or not, I never scrolled down this far because I could never get wikix to work! I know it’s a lame excuse, but you know what they say about the truth being stranger than fiction sometimes…

    Anyway, you have a very good tutorial on wikix. I’m currently doing comp sci research and it’s very important that I can download images from Wikipedia. For that, however, I need wikix to work (unless you know of another way besides writing something similar). I’m stuck at a point and stressed out about it (I’m still new to Linux…I’ve had it for about 2-3 weeks). Can you please help me?

    So far, I’ve done everything up to downloading the xml file. When I try to run “wikix -p …” it says “bash: command ‘wikix’ not found”. Even if I just type in “wikix” it says that. Do you know what’s wrong? Of course, “./image_sh” won’t work either (I presume I have to do the first one). Is the issue with wikix permissions or something else? I’d greatly appreciate your help.

  2. Billy,
    No worries.

    From what you are describing it sounds like your bash shell is not finding the wikix binary on your $PATH environment variable. Assuming this is indeed the problem there are several ways to fix it:

    1) If you are in the directory where wikix is compiled, you need “./” in front of wikix, like this: ./wikix

    2) If you tweaked the path and installed it somewhere like /usr/local/bin/wikix, you need to add that directory to your PATH variable, with something like this: export PATH=$PATH:/usr/local/bin

    If you don’t know where wikix is on the filesystem, you can use find like so: find / -name "wikix" -type f -print

    When I wrote that I assumed working knowledge of UNIX systems. Your comment might inspire me to revise my language and make sure it is easy for non-UNIX geeks to grok.

    If it is not the problem post another comment. In fact post a comment either way letting me know about your success.

    If you are setting up a Wikipedia mirror the next problem you will run into is that you’ll need to generate thumbnails for the actual article page. Apache-batik is your friend here. I can help you out with that too once you get to that point.

    Where are you studying, by the way?

    Thanks,
    Yousef

  3. Thanks for answering Yousef. Your article is actually pretty easy to follow; if you had that one “./” in the wikix command part I wouldn’t have had a problem at all.

    Except something is way wrong with the parallelization command. I did it, and it took roughly half a second. I know that can’t be right because I’m running a 1.86 GHz Pentium M with 2 GB RAM and a 250 GB external hard drive (I really don’t need all of the images right now, just 500-1000 to start), which is wayyy slower than what you’re on. Afterward, I tried ./image_sh, and that fails. Here’s what happens:

    ———————————————-
    billy@billy-laptop:/media/FreeAgent Drive/My Storage Data/wikix/wikix$ ./wikix -p < enwiki-latest-pages-articles.xml &
    [5] 8041
    billy@billy-laptop:/media/FreeAgent Drive/My Storage Data/wikix/wikix$ ls
    enwiki-latest-pages-articles.xml image04 image11 libcutf8.a utf8.c
    enwiki-latest-pages-articles.xml.bz2 image05 image12 Makefile utf8.h
    fragment.log image06 image13 platform.h utf8.o
    image00 image07 image14 reject.log wikix
    image01 image08 image15 temp wikix.c
    image02 image09 image.log thumb
    image03 image10 image_sh tmp
    billy@billy-laptop:/media/FreeAgent Drive/My Storage Data/wikix/wikix$ ./image_sh
    ./image_sh: 15: Syntax error: Bad fd number
    ./image_sh: 16: Syntax error: Bad fd number
    ./image_sh: 17: Syntax error: Bad fd number
    ./image_sh: 19: Syntax error: Bad fd number
    ./image_sh: 20: Syntax error: Bad fd number
    ./image_sh: 21: Syntax error: Bad fd number
    ./image_sh: 22: Syntax error: Bad fd number
    ./image_sh: 23: Syntax error: Bad fd number
    ./image_sh: 24: Syntax error: Bad fd number
    ./image_sh: 25: Syntax error: Bad fd number
    ./image_sh: 26: Syntax error: Bad fd number
    ./image_sh: 27: Syntax error: Bad fd number
    ./image_sh: 28: Syntax error: Bad fd number
    ./image_sh: 29: Syntax error: Bad fd number
    ./image_sh: 30: Syntax error: Bad fd number
    —————-

    Know what’s wrong from here? By the way, I am currently studying at the University of Mary Washington (Fredericksburg, VA) as a Mathematics/Computer Science double major.

  4. Billy,
    Let me guess, you are running Ubuntu on your laptop?

    If you look at the /bin/sh symlink you’ll see it is currently pointing to the /bin/dash shell, e.g.: ls -la /bin/sh

    Try this:
    sudo unlink /bin/sh
    sudo ln -s /bin/bash /bin/sh

    Then try re-running image_sh

    If that doesn’t work post here again.

    Tell me a bit more about your project; I’m interested in it. If I had to guess, you were probably going to look at the link graph of Wikipedia articles?

    Also note: the thing that took 12 minutes was just parsing the XML into the scripts you actually run. Downloading the images takes quite a while, even on a massive pipe. I don’t remember exactly, but I think it took something like 24 hours with 100 Mb of bandwidth.

    Thanks,
    Yousef

  5. Nice guess on the Ubuntu, you must really be the UNIX-geek you said you were! haha. So far that’s worked, and I’m currently downloading lots of images! In fact, I’m pretty sure in 5 minutes I have enough…how do I make it stop? Pretty soon I’m going to take the “Turn off my laptop” approach (for lack of a more elegant method).

    My project isn’t going to save the world or anything…I’m scanning Wikipedia for evidence of steganography (data-hiding) in their images. I hypothesized how easy it would be for “bad” users to indirectly communicate (i.e. criminals, terrorists, etc.) through images that no one would notice on such a public site. From that, I’m hoping to derive a general methodology for steganography scanning (whether it’s feasible, worth the time, methods, etc.).

  6. Billy

    1) I believe image_sh forks off a number of child processes; to kill them all you could probably use this command: pkill -9 image_sh

    2) Interesting. Some thoughts: many of the images are SVG rather than raster formats (PNG, JPG), so any stegotext would really stand out in the SVG XML. I would limit your search to raster image formats.

    Some questions:
    How would you handle encrypted stegotext? How would you handle encrypted stegotext split across multiple images? Or would mere confirmation be enough for your experiment?

    Good luck with your project. Keep me posted, sounds interesting.

    Thanks,
    Yousef

  7. Thanks again for the help.

    I’m not actually writing a steganalysis program myself (there are enough out there to where I don’t want to reinvent the wheel), so I will let the program take care of the encrypted stego-text. My project is more a synthesis of already written things than original coding. For this experiment, “suspected/confirmed” would be enough to generate some data, but that’s a good question on the split across multiple images. I’m not exactly sure what to do there, so for now I’ll let the analysis program do its thing while I read up on some literature.

    Question: I haven’t really analyzed it yet, but do you know how Wikix goes about downloading the images? It doesn’t seem to be in straight alphabetical order (the folder structure seems a bit funny), but maybe I’m wrong about that since I cut it off at about 10 minutes.

    I will keep you posted if anything interesting happens. If you have any ideas/suggestions, feel free to contact me at my e-mail: billy.ella [at] gmail.com.

  8. Hey Yousef,

    Sorry to bother you again, but I was wondering if you can help me (I got myself in another pickle). I was trying to do a retest of my original downloading, so I deleted everything that had downloaded in the wikix directory (except for the stuff that was there before). Unfortunately, I deleted the really handy image.log file, and when I tried to restart the download, it didn’t come back. I deleted the contents of the old one from my trash and put it back in the folder, but it wouldn’t update it (it downloads stuff, but doesn’t touch that file).

    Also, if I try to just install another copy of Wikix in another folder, it doesn’t work. I get this error:
    ———————————————-
    billy@billy-laptop:/media/FreeAgent Drive/My Storage Data/wikix/wikix2$ make all && make install
    make: Nothing to be done for `all’.
    install -m 755 wikix /usr/bin
    install: cannot create regular file `/usr/bin/wikix’: Permission denied
    make: *** [install] Error 1
    ———————————————-

    Do you have any idea what’s wrong there? I can’t figure out why I wouldn’t be able to reinstall it. Thanks,

    -Billy

  9. Billy,
    The error you are getting seems to be because you are not running “make install” with root privileges, and thus cannot write to /usr/bin.

    You shouldn’t need to be root to build, so you can split it up into two steps:

    1) make all
    2) sudo make install

    The password for sudo will be the same as your “billy” account on your laptop.

    Hope this helps.
    -Yousef

  10. Thanks. I know about sudo (there’s an excellent xkcd comic using it), but when I tried I put it on the “make all” part, so it didn’t work. It works now. By the way, one last question…is there some way I can approximate the size of Wikipedia’s database? I imagine it’s a bit bigger than your Fall 2007 date, but I’m not sure how to get it without downloading everything (which I don’t have the space to do). Thanks,

    -Billy

  11. By the way, I’m not sure I told you about this, but I’m presenting my first batch of results at NCUR (National Conference on Undergraduate Research) on Friday morning. As such, I hope it’s ok that I include you in the Acknowledgments section because of your great help.

  12. Billy,

    When you say size of database I’m assuming you are referring to the images (not full text which is around 13G last time I checked… but don’t quote me on that — large margin of error )

    Images… I don’t even have a complete set and I’m at 340G … I would guess I have something like 75% coverage (including all the different sizes for thumbnails…etc)

    Also, very cool on presenting your first batch of findings, send me the link as soon as you can, and if you like I would love to get a proof read or help out (anything to get a sneak preview) — I would be totally pumped to be included in the acknowledgments section.

    Thanks,
    Yousef

  13. Hey Yousef,

    Sorry I couldn’t get back to you earlier. I’m presenting this tomorrow morning. Here’s the link to a pdf of my poster:

    http://www.sendspace.com/file/kf8vjd

    Remember, it’s not saving the world or anything, but it’s been fun undergraduate research. Thanks again for your help.

    -Billy
