Frequently asked questions (FAQ)

On user reputation

Why do you use a reputation system to compute how much text has been revised?  Why don't you just use a mix of text age / number of revisions?

We use a reputation system for two reasons:
  • When new text is introduced, we wish to mark text by known, reliable contributors with higher trust than text by unknown, anonymous, or novice contributors.  The user reputation system helps us assign trust to new text.
  • We want text to gain reputation when revised by known, reliable contributors.  If anonymous or novice users could raise text trust quickly, they could also cover up any sign of vandalism.

You give reputation to authors on the basis of how long their contributions last.  Won't authors who contribute to controversial topics, where reversions are more common, fail to gain the reputation they deserve?

We took great care when devising the specific algorithms that assign reputation to ensure that authors also gain reputation when their contributions are preserved only in part.  Also, users typically contribute to a range of pages; very controversial pages form a minority of pages, and even in these pages, outright revert battles between established users are rare.  For these reasons, we believe that even users who contribute to controversial pages gain the reputation they deserve.

If I get reverted by a vandal, will my reputation suffer?

Not much at all.  When a user A reverts a user B, the reputation of B suffers in proportion to the reputation of A. Vandals usually have no reputation (if they are anonymous), or very low reputation, so the reputation of B would suffer only minimally.
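
As a rough illustration of this proportionality, here is a minimal Python sketch.  The function name, the linear form of the penalty, and the penalty_rate constant are assumptions made for the example; they are not taken from the WikiTrust code.

```python
def reputation_after_revert(victim_rep: float, reverter_rep: float,
                            penalty_rate: float = 0.1) -> float:
    """Reputation of user B after being reverted by user A.

    Assumptions (hypothetical, for illustration only): the penalty is
    linear in the reverter's reputation, and reputation never drops
    below zero.
    """
    penalty = penalty_rate * reverter_rep
    return max(0.0, victim_rep - penalty)


# An anonymous vandal has reputation 0, so the revert costs B nothing.
print(reputation_after_revert(8.0, 0.0))
# A revert by a high-reputation user costs B noticeably more.
print(reputation_after_revert(8.0, 9.0))
```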

If I contribute a paragraph, and someone later improves the wording, what happens to my reputation?

Your reputation would still increase.  The text analysis in WikiTrust is able to distinguish between contributions that are undone, and contributions that are reworded, adapted, reformatted, or improved.

I think an automated system that computes user reputations is evil!

We don't think so, but we understand that this is a controversial topic.  For this reason, in the spirit of friendliness and collaboration of wikis, WikiTrust does not explicitly display author reputation.  Author reputation is only used internally, in the computation of text trust.

Will there be some human control on reputations?

Author reputation values are stored in a database table, and so they can be inspected and modified, if desired.  However, our point of view is that the author reputation we compute is best viewed as an internal, purely mathematical quantity that helps us achieve a better text coloring.

Can users collude and raise their reputation without doing any good work?

Our algorithms make this very difficult; for the details, see our talks and papers.

On text trust

I don't believe your tool computes the real trust of text!  So what is it that it really computes?

WikiTrust computes a quantity (which it calls "trust") which measures to what extent the text has been revised, and left unchanged, by high-reputation authors.

So why do you still call it trust?

We like short, concise terms for our mathematical notions.  This is common in science.  Author "reputation" is a similar term: it has nothing to do with the reputation of a person in real life; it is a mathematical concept.  We are aware that many people are sensitive to the word "trust", and at some future point we may start calling it "text reputation" rather than "text trust", but again, we consider these names as labels for mathematical notions.

How would this be useful to me?

When you look at a page, the text coloring tells you which pieces have changed recently, and have not been subsequently revised.  By clicking on those and other portions of text, you can figure out who inserted the text, and in which context.

Can a vandal insert false information, then revise it until the text background becomes white?

No.  A user can only increase the trust of text up to their current reputation value.  So, a novice or anonymous user can increase the trust of text only very little.  A reputable user can increase it more, but only revision by multiple, distinct, high-reputation users can lead to fully trusted text.
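
The cap can be pictured with a small Python sketch.  The update rule below (move the trust part of the way towards the reviser's reputation) and the gain constant are assumptions chosen for illustration; the source only states that trust cannot rise above the reviser's reputation.

```python
def revised_trust(word_trust: float, reviser_rep: float,
                  gain: float = 0.5) -> float:
    """Trust of a word after a user revises the page and keeps the word.

    Assumptions (hypothetical): trust and reputation share one scale,
    and each revision closes a fixed fraction `gain` of the gap between
    the word's trust and the reviser's reputation.
    """
    if reviser_rep <= word_trust:
        return word_trust  # a low-reputation reviser cannot raise trust
    return word_trust + gain * (reviser_rep - word_trust)


# A low-reputation user revising the same text over and over
# never pushes its trust above their own reputation (1.0 here).
trust = 0.0
for _ in range(20):
    trust = revised_trust(trust, reviser_rep=1.0)
print(trust <= 1.0)
```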

Can an author increase the trust of the same portion of text multiple times?

No.  Once a user has caused the trust of a word to increase, multiple distinct users need to increase the trust of the word for said user to be able to increase it again.  And as only users of higher reputation than the word can increase the trust of the word, this is not easy to achieve.  Internally, this safeguard is implemented by associating with each word the list of users who have recently increased the trust of the word.  This list is maintained using compression and hashing techniques, so that in practice, very little storage is required.

Why do contiguous words occasionally have very slightly different color for no apparent reason?

The hashing algorithm we use to compress the list of authors who increased the trust of a word (see above) has a probability of collisions of about one in a thousand.  The hash is computed both on the basis of the author, and the word.  So, in about one in a thousand words, we falsely believe that the author has already increased the word trust.  This causes occasional small differences in text color.  As text color is only a rough visual hint, we do not believe that these variations are problematic.
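
The safeguard and its collisions can be sketched as follows.  This is an illustrative Python model, not the WikiTrust implementation: the class and function names, the bucket count of 1024 (chosen to give roughly a one-in-a-thousand collision chance), the list length, and the trust update rule are all assumptions.

```python
import hashlib
from collections import deque

HASH_BUCKETS = 1024  # assumption: about 1-in-1000 chance that two pairs collide
MAX_RECENT = 4       # assumption: number of recent raisers remembered per word


def raiser_hash(author: str, word: str) -> int:
    """Small hash computed from both the author and the word."""
    digest = hashlib.sha256(f"{author}\x00{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % HASH_BUCKETS


class Word:
    def __init__(self, text: str, trust: float = 0.0) -> None:
        self.text = text
        self.trust = trust
        # Only small hashes are stored, so the per-word overhead stays tiny.
        self.recent: deque = deque(maxlen=MAX_RECENT)

    def try_raise(self, author: str, author_rep: float, gain: float = 0.5) -> bool:
        """Raise the trust unless this author (or a colliding hash) did so recently."""
        h = raiser_hash(author, self.text)
        if h in self.recent or author_rep <= self.trust:
            return False
        self.trust += gain * (author_rep - self.trust)
        self.recent.append(h)  # the oldest entry drops out when the list is full
        return True


w = Word("wiki")
print(w.try_raise("alice", 8.0))  # first attempt succeeds
print(w.try_raise("alice", 8.0))  # immediate second attempt is refused
```

In this model, a collision makes a different author look like one who has already raised the word, so that word misses one small trust increase; this is the kind of slight color difference described above.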

What? Hash collisions?  Randomization?  How can trust depend on randomization, of all things?

Life is random.  A little bit of randomness never hurt anybody.  Look up Heisenberg's principle.  Relax :-)

On text author and origin

Why don't you just use one of the fast diff systems, such as unix diff, or the one implemented in git?

These standard text diff systems lack several features we thought were important:
  • Many of these systems do not deal with block moves: if two blocks of text swap positions, one of them is considered deleted in one place, and re-inserted in the other.  We wanted to track text across block moves, to be able to attribute it to its original author.
  • These systems do not track text that is deleted, and then reinserted in a later revision.  Thus, when a vandal deletes a paragraph, which is later re-inserted, they attribute the paragraph to the user who re-inserted it.  We wanted to attribute it to its original author.
  • These systems are often very brittle with respect to very minor text changes, such as text punctuation, capitalization, and spacing.
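
To make the second and third points concrete, here is a deliberately simplified Python sketch of attribution that survives deletion and re-insertion, and that ignores punctuation, capitalization, and spacing.  It works on whole paragraphs and uses exact matching after normalization, which is an assumption made to keep the example short; it does not reproduce WikiTrust's actual text analysis, and it does not handle block moves within reworded text.

```python
import string

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(paragraph: str) -> str:
    """Ignore capitalization, punctuation, and spacing when comparing text."""
    return " ".join(paragraph.lower().translate(_PUNCTUATION).split())


def attribute(revisions):
    """Attribute each paragraph of the last revision to its first author.

    `revisions` is a list of (author, paragraphs) pairs in chronological
    order.  Paragraphs that were deleted are still remembered, so a later
    re-insertion is credited to whoever wrote the text originally.
    """
    first_author = {}  # every paragraph ever seen, live or deleted
    for author, paragraphs in revisions:
        for p in paragraphs:
            first_author.setdefault(normalize(p), author)
    _, last_paragraphs = revisions[-1]
    return [(p, first_author[normalize(p)]) for p in last_paragraphs]


history = [
    ("alice", ["Wikis are great."]),
    ("vandal", []),                               # the paragraph is deleted
    ("bob", ["Wikis are great!", "New text."]),   # and later restored
]
print(attribute(history))
```

Here the restored paragraph is credited to alice, not to bob, even though the punctuation changed on the way back in.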

What happens when pages are merged, or split?

When pages are merged or split, due to the way in which we track deleted text, our author and origin attribution work in a reasonable way, and our trust system copes as well.  In fact, when we analyze a revision, we consider not only the immediately preceding revision, but also the revision that is most similar to the current one among the last 10 or so past revisions.  Hence, when a merge or split occurs, it is likely that many pages are correctly analyzed.  We could devise better algorithms if the MediaWiki API offered a way to be notified of merge and split events.
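
The idea of picking the most similar recent revision can be sketched in Python as follows.  The similarity measure used here is the standard library's difflib.SequenceMatcher applied to word lists; this is a stand-in chosen for the example, not the comparison WikiTrust itself uses, and the function name and the window default are assumptions.

```python
from difflib import SequenceMatcher


def closest_past_revision(current: str, past: list, window: int = 10) -> int:
    """Index in `past` of the revision most similar to `current`.

    Only the last `window` revisions are considered; ties go to the
    more recent revision.
    """
    if not past:
        raise ValueError("no past revisions to compare against")
    cur_words = current.split()
    start = max(0, len(past) - window)
    best_index, best_score = start, -1.0
    for i in range(start, len(past)):
        score = SequenceMatcher(None, past[i].split(), cur_words).ratio()
        if score >= best_score:
            best_index, best_score = i, score
    return best_index


# The immediately preceding revision is vandalism, so the revision
# before it is the better reference for the analysis.
past = ["a b c d", "VANDALISM"]
print(closest_past_revision("a b c d e", past))  # 0
```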

On the system

How much computing power is required to run WikiTrust?

A single CPU core of a modern PC can process several revisions per second.  The exact number depends on the size of the revisions, but typical speeds vary between 2 and 40 revisions per second.  The code is fully parallel, so the more CPU cores, the faster the analysis.  Thus, keeping up with new revisions is normally not an issue.  To analyze an existing wiki, you have two alternatives:
  • If your wiki contains fewer than a million revisions, you just install WikiTrust and tell it to analyze the existing revisions.
  • If your wiki contains more than a million revisions, the best approach is to obtain a dump of the wiki, and use our batch code.  The analysis of the Italian, or French, Wikipedias from scratch takes about two days on an 8-core machine.  Again, the code is fully parallelizable (have a cluster?  Use it!).
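
For a back-of-the-envelope estimate of how long an initial analysis might take, the figures above can be plugged into a few lines of Python.  The function name is made up for this example, and the estimate assumes that throughput scales linearly with the number of cores, which is an idealization.

```python
def analysis_hours(revisions: int, revs_per_sec_per_core: float,
                   cores: int = 1) -> float:
    """Estimated hours to analyze `revisions`, assuming linear scaling with cores."""
    if revs_per_sec_per_core <= 0 or cores < 1:
        raise ValueError("throughput and core count must be positive")
    return revisions / (revs_per_sec_per_core * cores) / 3600.0


# One million revisions on a single core, at the slow and fast ends
# of the 2 to 40 revisions per second range quoted above.
for rate in (2, 40):
    print(f"{rate} rev/s: {analysis_hours(1_000_000, rate):.1f} hours")
```

With these numbers, a million revisions on one core take about 139 hours at 2 revisions per second and about 7 hours at 40 revisions per second.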

How much extra storage is required?

As an example, the Italian Wikipedia, as of October 2009, requires 110 GB of disk space for the storage of the revisions, and a few GB of space in the database for the storage of the metadata.

Where is the additional information used by WikiTrust stored?

WikiTrust stores various types of additional information: metadata on pages and revisions, the reputation of users, and the analyzed revisions, to name a few.  All the metadata is stored in database tables, in the same database as the MediaWiki tables.  The tables used by WikiTrust have the prefix "wikitrust_".  The revision text is stored in compressed blobs: each blob consists of a number of consecutive revisions of a page, so that it compresses well.  For small wikis, the blobs can be stored in the database.  For large wikis, we recommend you store the blobs in the filesystem (this can be done using a simple configuration option).
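
The reason consecutive revisions compress well together can be seen in a small Python experiment.  The blob layout below (a JSON list compressed with zlib) and the function names are assumptions for the demonstration only; the actual WikiTrust blob format is not described here.

```python
import json
import zlib


def pack_blob(revisions: list) -> bytes:
    """Compress several consecutive revisions of one page into a single blob."""
    return zlib.compress(json.dumps(revisions).encode("utf-8"))


def unpack_blob(blob: bytes) -> list:
    """Recover the list of revision texts from a blob."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Ten revisions that share almost all of their text, as is typical
# for consecutive edits of the same page.
base = "The quick brown fox jumps over the lazy dog. " * 20
revisions = [base + f"edit {i}" for i in range(10)]

together = len(pack_blob(revisions))
separately = sum(len(zlib.compress(r.encode("utf-8"))) for r in revisions)
print(together < separately)  # shared text is stored only once in the blob
```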

I am interested in using WikiTrust on a Wikipedia.  Can I do it?  Can you do it?  What does it take?

Using WikiTrust on a Wikipedia (for instance, the Wikipedia for a given language) is a little bit more involved than using it on your own wiki, since it is a special setup (people edit the Wikipedia, which resides on Wikimedia Foundation servers, and WikiTrust just serves the analyzed text).  However, if you would like to experiment with WikiTrust on some Wikipedia, let us know, and we may set it up for you.  In the long run, we hope the Wikimedia Foundation will make the process easier (see below).

Will WikiTrust be installed at the Wikimedia Foundation?

We hope so, but it has not been decided.  The goal would be to make it available as an optional extension, which users can activate in their profile.  Users who activate the option would see the extra "wikitrust" tab (or menu item, or whatever the Foundation will deem appropriate) and be able to access the information computed by WikiTrust.  Note that if that happens, the Foundation will dictate which wording, access method, etc., is used, not us.  This is obvious, but worth remembering.  We just produce open-source code, and we are willing to assist the Foundation in using it, but any decision is theirs to take.

Why is most of WikiTrust written in OCaml?

Because Luca thinks that OCaml is a great language.  Yes, really.
In more detail, WikiTrust started its existence in Python.  This made it very easy to get started, but soon two problems emerged:
  • Speed and memory management.  Python was not always able to deallocate memory that should have been deallocated, so that long-running analysis processes had a tendency to blow up.  Also, the speed left something to be desired: this is, after all, a performance-critical application.
  • Slow development cycle.  Python is very fast at the beginning, in a small project.  However, the trouble with Python is that you are never sure that your code works, unless you test it with full coverage.  It has no type checking, no compilation sanity check.  After you make a change, you must test it all: it is not sufficient to just test the new piece you made, because due to the lack of type checking, errors can sneak in from one part, and cause another part to blow up only much later.  This is not amusing, when you must run long-lasting batch jobs.  In Python, once you have a system of size n, a change of size epsilon causes you to do epsilon + n testing.  This leads to a development time that is quadratic in the size of the code base.
So, one winter, we took the Python code and completely rewrote it in OCaml, and we have been happy ever since (well, especially Luca).  Now, when we make a change, we just need to test the thing we changed: we know that it's extremely unlikely that we have broken something else, due to the strength of the type system.  Memory management is also superb.

But I don't know OCaml!  Could you not have written it in Perl?

Perl!  Aaack!! (says Luca; Bo and Ian managed to sneak in some Perl while Luca was distracted).  But jokes aside, we needed a language with:
  • Speed
  • Excellent memory management
  • A strong type system
  • Concision (we are busy types, we get impatient)
  • Libraries for MySQL, HTTP, XML, and all basic data structures.
We are sure that there are alternatives to OCaml, but Perl, PHP, Java, Python, C++, and Ruby are not among them.

How do I contribute to WikiTrust?

Finally someone asks this question!  Welcome!!
You can contribute in several ways:
  • If you have your own idea of how to improve it, let us know, and we can help you get started.
  • If you don't have ideas, but like to contribute, well, WE can give you ideas!
We appreciate contributions not only for the core OCaml code, but also for the PHP, JavaScript, web ideas, user interface, and much more.  In fact, since we are hard-core computer scientists, we suspect we stand to benefit especially from help with the more UI-oriented aspects of the system.  But all contributions are welcome!

On the demo

Why this demo?

We needed to test our code, and we wished to show that it can do something useful.

Why is the demo so slow?

The demo requires a lot of back-and-forth between your Firefox browser, the servers at UCSC, and the servers at the Wikimedia Foundation.  When you ask for a page, this is what happens:
  • First, the Firefox add-on asks the UCSC servers for the text information.
  • If the UCSC servers have this information, it is sent back to the add-on.  The add-on then processes it a bit, to turn the annotated markup language into proto-HTML, and then sends the proto-HTML to the Foundation servers.  Those servers turn the proto-HTML into full-fledged HTML, complete with pictures and all, and send it back to the add-on.
  • If the UCSC servers do not have this information, they request the missing revisions from the Foundation servers.  The revisions are then analyzed.  When you try again after a dozen seconds or so, the information should be available.
As you can imagine, this is not really the most streamlined way of serving the information!  We hope that one day, the WikiTrust information will be available directly from the Foundation servers (see above).
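
The round trips listed above can be summarized in a short Python sketch.  The three callables stand in for the UCSC lookup, the add-on's markup conversion, and the Foundation's rendering step; their names and signatures are hypothetical and do not correspond to any real API.

```python
def fetch_colored_page(page, ucsc_lookup, to_proto_html, foundation_render):
    """Return the rendered HTML for `page`, or None if analysis is still pending.

    All three callables are hypothetical placeholders:
      ucsc_lookup(page)        -> annotated markup, or None if not yet analyzed
      to_proto_html(markup)    -> proto-HTML produced by the add-on
      foundation_render(proto) -> full HTML from the Foundation servers
    """
    annotated = ucsc_lookup(page)           # round trip 1: UCSC servers
    if annotated is None:
        return None                         # UCSC fetches and analyzes; retry later
    proto_html = to_proto_html(annotated)   # local processing in the add-on
    return foundation_render(proto_html)    # round trip 2: Foundation servers


# Stub servers, only to show the control flow.
print(fetch_colored_page("Main_Page",
                         ucsc_lookup=lambda p: None,
                         to_proto_html=str,
                         foundation_render=str))
```

Every page view pays for two sequential network round trips, plus a retry when the revisions have not been analyzed yet.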

Please contribute to this FAQ by sending your questions to!