Offline copies of Wikipedia
I have been involved for a number of years with Hilton Theunissen and the Shuttleworth Foundation and their efforts to bring computers to township schools. Part of the software suite on those computers was an offline copy of Wikipedia.
Initially, in 2003, I took the whole of the then-existing English Wikipedia and installed a copy of the MediaWiki software alongside it, with MySQL as the database and Apache as the web server. The whole thing was around 18 gigabytes - quite a handful.
It worked well, but I had various complaints about the unsuitability of the material - it was a single snapshot, and it had not been (and could not be) proofread, so it contained vandalism and some quite explicit articles about sex. Oops. But it had search, and it had a vast amount of useful information on all manner of subjects.
Very soon the English Wikipedia bloomed to hundreds of gigabytes, making it completely unmanageable in terms of size. I couldn't download it, and I couldn't proofread it. What to do?
Wikipedia Version 1.0
Wikipedia has its own community of people, and among them I connected with Andrew Cates, of the SOS Children website, who came up with a selection of articles (1000 or so) as an HTML dump that he and some others painstakingly proofread for suitability as a children's educational resource. (He has a larger article collection now.) Jonathan Carter helped package this for the tuXlab installs for the Shuttleworth Foundation.
There is a lot of work to do preparing such a collection.
- Which articles should be selected?
- All the article text in the HTML dump must be stripped of links to articles not in the dump.
- It must be proofread.
- Associated pictures must be incorporated.
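The link-stripping step can be sketched in a few lines of Python. This is only an illustration, not the script that was actually used: it assumes each article is a plain HTML file, that links are simple `<a href="Page.htm">` anchors, and that the set of page names in the dump is known. The function name `strip_dead_links` is made up for this example.

```python
import re

# Matches a simple anchor of the form <a ... href="target" ...>label</a>.
# Real-world HTML can be messier; this pattern is an assumption of the sketch.
ANCHOR = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>',
                    re.IGNORECASE | re.DOTALL)

def strip_dead_links(html, included_pages):
    """Replace links to pages outside the dump with their plain label text."""
    def fix(match):
        target, label = match.group(1), match.group(2)
        # Ignore any #fragment when deciding which page the link points at.
        page = target.split("#", 1)[0]
        if page == "" or page in included_pages:
            return match.group(0)  # in-page or in-dump link: keep as-is
        return label               # keep the visible text, drop the anchor
    return ANCHOR.sub(fix, html)

html = '<p><a href="Poland.htm">Poland</a> has the <a href="Vistula.htm">Vistula</a>.</p>'
print(strip_dead_links(html, {"Poland.htm"}))
```

The visible text survives, so the article still reads normally; only the dead link disappears.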
Other people became involved, in particular Martin Walker from the State University of New York at Potsdam. Systems were put in place to help with article selection. A project was started - the Wikipedia Version 1.0 Editorial Team. Articles were assessed - both for quality (from Featured Article down to Stub) and for importance (from Top to None). These assessments are placed on the article Talk page, and a robot goes through them all on a regular basis and collects the results on project pages, conveniently doing all the heavy lifting to present sortable tables of the state of all the articles.
Thus you can find Top Importance articles of poor quality, and can highlight that page for improvement. You can cherry-pick Featured Articles to add to the collection. These tools made the article selection process far more manageable.
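As a rough sketch of how such a selection might look once the assessments are in machine-readable form: the record layout, the intermediate grade names and the sample entries below are all assumptions for illustration, not the robot's actual output.

```python
from dataclasses import dataclass

# Scales ordered from lowest to highest. Only the end points (Stub, Featured
# Article, None, Top) come from the text; the grades in between are assumed.
QUALITY = ["Stub", "Start", "C", "B", "GA", "A", "FA"]
IMPORTANCE = ["None", "Low", "Mid", "High", "Top"]

@dataclass
class Assessment:
    title: str
    quality: str
    importance: str

def needs_improvement(items, max_quality="Start"):
    """Top-importance articles whose quality is at or below max_quality."""
    limit = QUALITY.index(max_quality)
    return [a.title for a in items
            if a.importance == "Top" and QUALITY.index(a.quality) <= limit]

def cherry_pick(items, quality="FA"):
    """Articles at exactly the given quality grade, e.g. Featured Articles."""
    return [a.title for a in items if a.quality == quality]

# Made-up sample records, purely to show the calls.
sample = [
    Assessment("Article A", "Stub", "Top"),
    Assessment("Article B", "FA", "Low"),
    Assessment("Article C", "B", "Top"),
]
print(needs_improvement(sample))
print(cherry_pick(sample))
```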
Offline Wikipedia was identified as a priority by the Wikimedia Foundation.
I assisted in the post-processing by writing a script that searches all the chosen articles for 'bad words' - an indication that an article may have been vandalised. A cleanup crew then has to go through the flagged articles to check them and, where necessary, remove material.
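The idea behind such a scan is simple enough to sketch. The following is not the original script, just a minimal Python illustration; it assumes the articles are already available as plain text keyed by title, and the word list here is a harmless placeholder.

```python
import re

def find_bad_words(text, bad_words):
    """Return the sorted list of bad words that occur in text as whole words."""
    words = {w.lower() for w in re.findall(r"[A-Za-z']+", text)}
    return sorted(words & {b.lower() for b in bad_words})

def flag_articles(articles, bad_words):
    """Map each suspicious article title to the bad words found in it.

    articles: dict of title -> plain article text (an assumed input format).
    """
    flagged = {}
    for title, text in articles.items():
        hits = find_bad_words(text, bad_words)
        if hits:
            flagged[title] = hits
    return flagged

articles = {
    "Rivers": "The Vistula is a river in Poland.",
    "Oceans": "The ocean is STUPID and full of poop.",
}
print(flag_articles(articles, ["stupid", "poop"]))
```

Matching whole words rather than substrings avoids flagging innocent words that merely contain a bad word. Either way, the output is only a list of candidates; a human still has to read each flagged article.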
Now to package all this conveniently. A French company called Linterweb came up with Okawix - all these articles and pictures packaged in a file, with a cross-platform reader to navigate the collection. Why do we need a reader? To implement search.
For many of the places we put an offline wikipedia down, it became the 'Internet' for the children in the classroom. They had no net connection, but the principles of self-paced learning, hyperlinks for tangential information, and other net paradigms made it the 'killer app' of their little school. For Internet, you need search. For a wikipedia collection of a few thousand articles, you need search. Search needs a computer - you cannot put search on a CD or DVD or USB stick.
A related problem is how to organise all these thousands of articles. They are mushed together in a big web of information, but where is the structure? Wikipedia proper has categories - useful for grouping similar articles, but the arbitrariness of the invented categories means that it has been a problem incorporating them into the static dump. Martin battled for days to make a river in Poland appear 'automatically' and conveniently under a Poland hierarchy.
In computer jargon, this is called metadata. It is structure beyond the mere linking of articles. There is other metadata - like the Importance assessment scale. We need to extract all that metadata and place it alongside the article tree so it can be used for indexes.
On the 24th of November we had an IRC meeting - an online chat between all interested parties spread around the world. Much of this was discussed, and one thing became clear - the Wikimedia Foundation needs to concentrate more on the process of generating a release, rather than on an end product like Okawix. That means tools to work with the metadata, and tools to package the pictures and article references in such a way that they can be optional. Perhaps targeted article collections, like Mathematics, Chemistry, Africa, Oceans. Let other organisations do the work of packaging and marketing.
To allow computers to do the work - we need good metadata. Assess articles. Rugby articles are not Top importance, except in the context of sport.
I think the article collection should consist of a number of different pieces, to be incorporated as necessary.
- The text of the selected articles
- Pictures for those articles
- Metadata to support this collection
- A text search index, like one created by namazu, for those tools that can use it
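To make the last item concrete, here is a toy inverted index in Python. It is not namazu's index format or algorithm, only a minimal sketch of what a prebuilt text search index does: map each word to the articles that contain it, so that a reader can answer a query without scanning every article. The sample articles are stand-ins.

```python
import re
from collections import defaultdict

def build_index(articles):
    """Build a word -> set of titles mapping from a dict of title -> text."""
    index = defaultdict(set)
    for title, text in articles.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(title)
    return index

def search(index, query):
    """Return the sorted titles of articles containing every word in query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

articles = {
    "Vistula": "The Vistula is a river in Poland",
    "Oder": "The Oder is a river in central Europe",
}
index = build_index(articles)
print(search(index, "river Poland"))
```

Because the index is built once at packaging time, it can ship alongside the article text as an optional piece.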
Though a lot of effort on offline Wikipedia collections is targeted at schools and the Third World, there are other target markets. One we have not really addressed yet is the cellphone as a Wikipedia platform. A cellphone implies connectivity, but these days it is becoming a universal platform - a camera, a music player, a gaming box, GPS. Personally, I would like a text-only Wikipedia collection of lots of articles, but only the lede paragraph - the first section of a Wikipedia page that introduces the subject. It is a song by the Black Eyed Peas. It is a river in Poland. That way, I can carry all of this on my phone without paying airtime.
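Pulling out just the lede is easy if you start from the wiki markup rather than the HTML. In wikitext, section headings are lines wrapped in equals signs (for example `== History ==`), so the lede is everything before the first such line. A minimal sketch, assuming the raw wikitext of each article is available; templates, infoboxes and link markup would still need cleaning up afterwards.

```python
import re

# A wikitext section heading: a line that starts and ends with two or more "=".
HEADING = re.compile(r"^==+[^=].*?==+\s*$", re.MULTILINE)

def lede(wikitext):
    """Return the text that precedes the first section heading."""
    match = HEADING.search(wikitext)
    intro = wikitext[:match.start()] if match else wikitext
    return intro.strip()

page = "The Vistula is a river in Poland.\n\n== Course ==\nIt flows north.\n"
print(lede(page))
```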
Cellphones have huge penetration in the Third World. I tell tourists I take to the townships that South Africans spend their money on cellphones and hair. Maybe we should concentrate there as much as on schools?
Okawix offline reader standardises on OpenZIM data format.