Wiki2html
Dumping is REALLY time consuming! Depending on the Wikipedia you want to prepare, this can take DAYS to WEEKS!
All benchmark results presented here were measured on an Intel Core 2 Quad Q6600 overclocked to 3 GHz.
Synopsis
You will import the Wikipedia database snapshot into a local, correctly configured and patched MediaWiki installation, then dump everything onto your hard drive as a PostgreSQL data dump containing optimized and stripped-down HTML.
install prerequisites
sudo apt-get install apache2 php5 php5-mysql mysql-server php5-xsl php5-tidy php5-cli subversion gij bzip2
or
yum install httpd php php-mysql mysql-server mysql-client php-xml php-tidy php-cli subversion java-1.5.0-gcj bzip2
apache2 is optional and only needed if you want to install via the web interface or want to check whether your data import looks correct.
get a local mediawiki running
check out the latest mediawiki into whatever folder your webserver of choice publishes and install everything you need to set mediawiki up on localhost.
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3 /var/www
delete the extensions dir and import the official extensions:
rm -rf /var/www/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions /var/www/extensions
optional: configure /etc/apache2/sites-enabled/000-default so that the mediawiki websetup loads when you access localhost. go to http://localhost and finish the mediawiki install via the web installer. To be able to copy-paste the rest of this walkthrough use the root account for mysql. You only have to fill in the values marked red. When everything works proceed to the next step.
an easier alternative is to set up mediawiki manually:
echo "CREATE DATABASE wikidb DEFAULT CHARACTER SET binary;" | mysql -u root
then import the table structure
mysql -u root wikidb < wikidb.sql
and put LocalSettings.${LANG}.php in place
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE TEMPORARY TABLES ON `wikidb`.* TO 'wikiuser'@'%';
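the grant assumes the 'wikiuser' account already exists; if it does not, a minimal sketch for creating it (the password is a placeholder you should replace):
echo "CREATE USER 'wikiuser'@'%' IDENTIFIED BY 'CHANGE_ME';" | mysql -u root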
configure/modify mediawiki
append to your LocalSettings.php
$wgLanguageCode = "${LANG}";
ini_set( 'memory_limit', 80 * 1024 * 1024 );
require_once( $IP.'/extensions/ParserFunctions/ParserFunctions.php' );
require_once( $IP.'/extensions/Poem/Poem.php' );
require_once( $IP.'/extensions/wikihiero/wikihiero.php' );
require_once( $IP.'/extensions/Cite/Cite.php' );
$wgUseTidy = true;
$wgExtraNamespaces[100] = "Portal"; # also to be changed according to your language
$wgSitename = "Wikipedia";
Edit AdminSettings.php and set mysql user and password so that you can run the maintenance scripts:
cp AdminSettings.sample AdminSettings.php
vim AdminSettings.php
Patch the DumpHTML extension to produce correct output with MediawikiPatch:
patch -p0 < mediawikipatch.diff
You may also enable embedded LaTeX formulas as base64 png images. Just follow these instructions: EnablingLatex
import wikipedia to your mediawiki install
get the template for huge databases
gunzip -c /usr/share/doc/mysql-server-5.0/examples/my-huge.cnf.gz > /etc/mysql/my.cnf
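on red hat style systems the example configs are usually installed uncompressed, for example:
cp /usr/share/mysql/my-huge.cnf /etc/my.cnf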
additionally set the following in /etc/mysql/my.cnf
[...]
[mysqld]
[...]
max_allowed_packet=16M
[...]
#log-bin=mysql-bin
and restart mysql-server
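on debian/ubuntu that is typically
sudo /etc/init.d/mysql restart
(on red hat style systems the service is usually called mysqld: service mysqld restart)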
check out the available dumps for your language at http://download.wikimedia.org/${WLANG}wiki/ ($WLANG being de, en, fr and so on). set the appropriate language and the desired timestamp as variables.
export WLANG=<insert your language code here>
export WDATE=<insert the desired timestamp YYYYMMDD>
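for example, for the dutch wikipedia the exports could look like this (the timestamp is only an illustration, pick one that actually exists on the download server):
export WLANG=nl
export WDATE=20080103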
clean existing tables:
echo "DELETE FROM page;DELETE FROM revision;DELETE FROM text;" | mysql -u root wikidb
add interwiki links
wget -O - http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-interwiki.sql.gz | gzip -d | sed -ne '/^INSERT INTO/p' > ${WLANG}wiki-${WDATE}-interwiki.sql
mysql -u root wikidb < ${WLANG}wiki-${WDATE}-interwiki.sql
download and import database dump
wget http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-pages-articles.xml.bz2
bunzip2 ${WLANG}wiki-${WDATE}-pages-articles.xml.bz2
wget http://download.wikimedia.org/tools/mwdumper.jar
java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 ${WLANG}wiki-${WDATE}-pages-articles.xml | mysql -u root wikidb
wiki | import time |
enwiki | 52h |
dewiki | 10h |
frwiki | 7h |
nlwiki | 3h |
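once the import has finished, a rough plausibility check is to count the imported rows (the exact numbers depend on the dump):
echo "SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision; SELECT COUNT(*) FROM text;" | mysql -u root wikidb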
add category links
wget -O - http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-categorylinks.sql.gz | gzip -d | sed -ne '/^INSERT INTO/p' > ${WLANG}wiki-${WDATE}-categorylinks.sql
mysql -u root wikidb < ${WLANG}wiki-${WDATE}-categorylinks.sql
wiki | import time |
enwiki | 32h |
dewiki | 1.5h |
frwiki | 2.5h |
jawiki | 1h |
nlwiki | 0.25h |
if you installed and configured apache you can now access http://localhost and check if everything is set up as desired.
dump it all
get the highest page id so you can estimate how best to split the work over your cores
echo "SELECT MAX(page_id) FROM page" | mysql -u root wikidb -sN
with a multicore setup you can dump with multiple processes, each covering a range between a start and an end id (see the sketch after the dump command below). be aware that the first articles take longer than the later ones because they are generally bigger.
The following splits of the id range over four processes were found to be useful:
wiki | process 1 | process 2 | process 3 | process 4 |
enwiki | 1/32 | 4/32 | 11/32 | 16/32 |
dewiki | 2/16 | 4/16 | 5/16 | 5/16 |
how long it takes very much depends on your hardware. for example my Core 2 Quad Q6600@3GHz is overall four times faster at dumping mokopedia than my old Athlon 64 X2 3600+ when using all four cores as opposed to two.
Dumping is also largely independent of hard disk speed - even when dumping with a quadcore the bottleneck is still the processor, so there is practically no speed loss when processes run in parallel on every core.
php extensions/DumpHTML/dumpHTML.php -d /folder/to/dump -s <startid> -e <endid> --interlang
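a minimal sketch of how the split could be scripted, assuming four processes and the dewiki fractions from the table above (the fractions and the dump folder are placeholders to adapt, and the script is meant to be run from the mediawiki root):
#!/bin/bash
# split the page id range over several parallel dumpHTML processes
FRACTIONS="2 4 5 5"       # numerators of the split, the denominator is their sum (here 16)
DUMPDIR=/folder/to/dump   # same placeholder folder as in the command above

MAXID=$(echo "SELECT MAX(page_id) FROM page" | mysql -u root wikidb -sN)
TOTAL=0
for F in $FRACTIONS; do TOTAL=$((TOTAL + F)); done

START=1
ACC=0
for F in $FRACTIONS; do
    ACC=$((ACC + F))
    END=$((MAXID * ACC / TOTAL))
    # one dump process per id range, each logging to its own file
    php extensions/DumpHTML/dumpHTML.php -d $DUMPDIR -s $START -e $END --interlang > dump_$START.log 2>&1 &
    START=$((END + 1))
done
wait   # block until all dump processes have finished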
wiki | dump time |
enwiki | 193h |
dewiki | 14h |
frwiki | 16h |
jawiki | 8h |
nlwiki | 6h |
create categories
php extensions/DumpHTML/dumpHTML.php -d /folder/to/dump --categories --interlang
wiki | dump time |
enwiki | 28h |
dewiki | 3h |
frwiki | 6h |
jawiki | 3h |
nlwiki | 1h |
Appendix
for debian you might want to remove the database check on every boot - this can take ages with german or english wikipedia. just comment out check_for_crashed_tables; in /etc/mysql/debian-start
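one way to do that, assuming the call appears literally as check_for_crashed_tables; in that script:
sed -i 's/^\([[:space:]]*\)check_for_crashed_tables;/\1#check_for_crashed_tables;/' /etc/mysql/debian-start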