Why you should consider basing your code on WikiTrust
There are two main difficulties in writing a Wikipedia analysis from scratch:
The English Wikipedia, which is the Wikipedia of most interest to
researchers, is huge. Tools that work on the English Wikipedia must
be able to cope with this huge amount of information efficiently,
parsing and processing it in a robust way. This is not trivial! The
Wikipedia is a sort of Library of Babel (as in the famous story by
Borges), where everything that can be written has been written, at
least once. Here are some examples:
- You think of putting consecutive word triples in a hash table, just
like that? Think again! Somewhere lies buried a revision that contains
the word 'devil' (or some such; my memory is now fuzzy) millions of
times. All these entries end up in the same hash bucket, of course,
and this ruins the processing time.
- You think of parsing the text a little bit? Good luck! Wiki markup
language is not like a programming language, where there is a right
and a wrong. In wiki markup language, if the wiki engine renders it
ok, then it is ok. Any malformed markup is fine, as long as it looks
ok. The official description is not what is actually used! Wiki markup
is the wild west. One is truly thankful to the people in CS who
developed parsing and compiler theory, after looking at what happens
when these are ignored.
- Tracking text in the Wikipedia is far from obvious. You cannot simply
compare each revision with the preceding one, and so forth. Text can be
deleted in going from rev0 to rev1, stay deleted in going from
rev1 to rev2, and finally be re-introduced in rev3. You cannot
consider the text reintroduced in rev3 as new: if you do, it
looks like people who undo spam also introduce text. We measured this
on the Wikipedia, and found very many instances where big blocks of
text remain deleted for a few revisions before being reintroduced.
Also, text often moves around across revisions, so that blocks of text
change relative position: you cannot use text comparison algorithms
that cannot deal with block moves (such as unix diff and wdiff). And
since there is a lot of text in the Wikipedia, the text tracking
algorithms need to be fast.
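To illustrate the deletion-and-reinsertion problem, here is a small word-level sketch (this is not WikiTrust's actual algorithm, which works on blocks of text and handles moves): each word of a new revision is attributed to the revision that first introduced it, and deleted words are kept in a pool so that they retain their original attribution if they come back.

```ocaml
(* Sketch: attribute each word of a new revision to the revision that
   first introduced it.  Words may disappear for a few revisions and
   come back; the [dead] pool lets them keep their original origin. *)
module SMap = Map.Make (String)

type state = {
  live : int SMap.t;   (* word -> revision that introduced it *)
  dead : int SMap.t;   (* deleted words, with the same mapping *)
}

let empty_state = { live = SMap.empty; dead = SMap.empty }

let track rev_id words st =
  (* Attribute one word: still live, resurrected from the pool, or new. *)
  let attribute (live, dead) w =
    match SMap.find_opt w st.live with
    | Some r -> (SMap.add w r live, dead)                    (* still present *)
    | None ->
      (match SMap.find_opt w st.dead with
       | Some r -> (SMap.add w r live, SMap.remove w dead)   (* resurrected *)
       | None -> (SMap.add w rev_id live, dead))             (* genuinely new *)
  in
  let live', dead' = List.fold_left attribute (SMap.empty, st.dead) words in
  (* Words that were live but are absent from the new revision join the pool. *)
  let dead'' =
    SMap.fold
      (fun w r d -> if SMap.mem w live' then d else SMap.add w r d)
      st.live dead'
  in
  { live = live'; dead = dead'' }
```

A real tracker must work on word sequences and blocks rather than individual words, bound the size of the dead pool, and stay fast on the adversarial inputs described above.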
While developing WikiTrust, we spent some time getting the text
tracking and edit analysis algorithms right, and fast. The result is an
infrastructure that can track text and compare revisions across huge
wikipedias in a way that is fast, robust, parallelizable, and deals
with all kinds of text block moves, deletions, and reinsertions.
How to write your analysis on top of WikiTrust
To write a new analysis in WikiTrust, you need to:
- Define a subclass of the Page class, in a file called for instance example_analysis.ml. This class must have the following methods:
- add_revision: this method is passed a new revision, with the main elements (user id, username, text, timestamp, and more) already extracted from the XML. The method can do whatever it wants, but a typical implementation builds a revision object and then processes it. A revision object makes the pre-parsed text available, among other things, and offers methods for comparing and tracking the text.
- print_id_title: this trivial method prints the page id and page title of the page.
- eval: this method is called once no more revisions are present, and does any final processing necessary.
- Add a few lines of code to page_factory.ml to ensure that, when the command-line option -example_analysis is given, Page_factory creates pages of the new subclass you just defined in example_analysis.ml.
- Add a few lines to the Makefile to ensure your analysis is compiled and linked with the rest of WikiTrust.
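Putting the pieces together, such a subclass might look roughly as follows. This is only a sketch: the constructor arguments and method signatures are assumptions for illustration, and the real class must inherit from WikiTrust's Page class; read the actual interface off the WikiTrust source.

```ocaml
(* example_analysis.ml -- sketch of a minimal analysis page.
   All signatures here are assumptions, not WikiTrust's actual
   interface; check the Page class in the source for the real one. *)

class example_page (id : int) (title : string) (out_file : out_channel) =
  object
    val mutable n_revs = 0   (* revisions seen so far *)

    (* Called once per revision, in chronological order. *)
    method add_revision (rev_id : int) (timestamp : float)
        (user_id : int) (user_name : string) (text : string) : unit =
      n_revs <- n_revs + 1
      (* typically: build a Revision object from [text] and process it *)

    (* Prints the page id and title. *)
    method print_id_title : unit =
      Printf.fprintf out_file "Page %d: %s\n" id title

    (* Called after the last revision has been added. *)
    method eval : unit =
      Printf.fprintf out_file "Page %d: %d revisions\n" id n_revs
  end
```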
Compile your code, then run it:
- ./evalwiki -d <destination_dir> -example_analysis <source_file.xml.gz>
You will find the results in the destination directory selected above.
Example: measuring the word-days contribution of users
As an example, suppose you want to compute the word-days contribution
of all users. Word-days is a contribution measure that captures how much text a user added, and for how long that text remained part of the most recent revision. For instance, if a user inserts 10 words in revision R1, 6 of those words are then deleted in revision R2, and the last 4 words are deleted in revision R3, then the word-days of the user are:
6 * [t(R2) - t(R1)] + 4 * [t(R3) - t(R1)]
where the times t are measured in days.
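To make the arithmetic concrete, here is a small self-contained sketch of the word-days formula. The timestamps are made up for illustration; in a real analysis they come from the revision metadata.

```ocaml
(* Sketch of the word-days arithmetic.  Each entry is
   (words, t_insert, t_delete), with times measured in days. *)
let word_days contributions =
  List.fold_left
    (fun acc (words, t_ins, t_del) ->
       acc +. (float_of_int words *. (t_del -. t_ins)))
    0.0 contributions

(* The example above: 10 words inserted at t(R1) = 0.0 days;
   6 deleted at t(R2) = 2.0, the last 4 at t(R3) = 5.0
   (timestamps invented for illustration). *)
let () =
  let d = word_days [ (6, 0.0, 2.0); (4, 0.0, 5.0) ] in
  Printf.printf "%.1f word-days\n" d   (* 6*2 + 4*5 = 32.0 *)
```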
Implementing this analysis in WikiTrust is simple: WikiTrust already provides all the ingredients needed, including XML text parsing, metadata parsing, text comparison, and text author tracking. Indeed, the code to compute this analysis was introduced in commit aa370290
to our repository. If you look at that commit, you will see exactly which changes were needed to create this new analysis. You can run the new analysis with the command:
./evalwiki -d <destination_dir> -eval_contrib <source_file.xml.gz>
Some explanation of the code is as follows.
WikiTrust parses pages from the compressed dump file. For every page, it creates an object of a class Page, which has two main methods:
- add_revision is used to add a new revision;
- eval is called when all the page revisions have been added.
The object of class Page typically creates an object of class Revision
for each revision it receives. Creating a revision object also causes all the text of the revision to be parsed, so that you can access the list of words, and the list of syntactic units, very easily. You can also see the list of useful functions
for more information.
The class Page is given an open file where it can write whatever output it wishes. The file has the same name as the wiki dump file passed as input, except that the extension is .out.
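For instance, the body of add_revision typically looks like the following fragment. The names Revision.revision and get_words are placeholders, not WikiTrust's actual API; consult the revision module in the WikiTrust source for the real names and signatures.

```ocaml
(* Fragment of a typical [add_revision]; [Revision.revision] and
   [get_words] are placeholder names for illustration only. *)
method add_revision rev_id timestamp user_id user_name text =
  (* creating the revision object parses [text] ... *)
  let r = new Revision.revision rev_id timestamp user_id user_name text in
  (* ... so the pre-parsed word list is immediately available *)
  let words = r#get_words in
  Printf.fprintf out_file "rev %d has %d words\n" rev_id (Array.length words)
```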