WikiTrust APIs

Text Origin and Trust APIs

The request:

returns a JSON object containing a string, consisting of the WikiTrust text with some additional markup. This is an example of the output:

    Since {{#t:10,84893431,Habitual gardner}}its inception in {{#t:10,86765634,Bassemkhalifa}}1928 the movement

The tag {{#t:10,84893431,Habitual gardner}} specifies that the words that follow, up to the next tag, have trust 10 and were written by "Habitual gardner" in revision 84893431.
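
As an illustration of how this markup can be consumed, here is a minimal Python sketch that splits the returned string into spans, one per tag. It assumes that the trust value and revision id are always non-negative integers and that author names never contain a closing brace; the names `TAG_RE`, `Span` and `parse_trust_markup` are ours and are not part of WikiTrust. Text that precedes the first tag carries no tag and is dropped.

```python
import re
from typing import List, NamedTuple

# Matches tags of the form {{#t:<trust>,<revision id>,<author>}}.
# Assumption: trust and revision id are plain integers.
TAG_RE = re.compile(r"\{\{#t:(\d+),(\d+),([^}]*)\}\}")


class Span(NamedTuple):
    trust: int
    revision_id: int
    author: str
    text: str


def parse_trust_markup(markup: str) -> List[Span]:
    """Split WikiTrust markup into spans, one per tag.

    Each span holds the tag's values and the text that follows the tag,
    up to the next tag or the end of the string.
    """
    matches = list(TAG_RE.finditer(markup))
    spans = []
    for i, m in enumerate(matches):
        # The span's text ends where the next tag begins.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markup)
        spans.append(
            Span(int(m.group(1)), int(m.group(2)), m.group(3), markup[m.end():end])
        )
    return spans


sample = (
    "Since {{#t:10,84893431,Habitual gardner}}its inception in "
    "{{#t:10,86765634,Bassemkhalifa}}1928 the movement"
)
spans = parse_trust_markup(sample)
for span in spans:
    print(span)
```

On the example above, the first span is attributed to "Habitual gardner" (revision 84893431) and the second to "Bassemkhalifa" (revision 86765634).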

Quality APIs

We have developed a web API based on WikiTrust, which returns the likelihood that a given revision of the English Wikipedia is vandalism. The API works only on revisions that WikiTrust has already analyzed. WikiTrust analyzes a revision either when someone requests to view its trust coloring, or in batches that run every few minutes, so there may be a small delay between the time a revision is added to Wikipedia and the time the API can provide results for it. If there is sufficient interest in the API, we might improve on this timeliness.

Using the API is very simple: if you wish to know the likelihood that the revision with id 1234 is vandalism, you just call:

You can also extract the raw signals we use for the vandalism classification, with the following call:

This vandalism API is easily able to filter out over 90% of the vandalism present in Wikipedia. For a detailed discussion of how the classification works, and of the classification accuracy, see: 

Note: please do not use this API at more than 1 QPS (queries per second), and even then, do so only intermittently. If you want to donate funds or equipment to UC Santa Cruz, so that we might provide a more robust service, do let us know!
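
One way for a client to stay within the 1 QPS limit is to throttle itself before every request. The following is a minimal Python sketch, not part of WikiTrust; the `Throttle` class and its parameters are our own names, and the clock and sleep functions are injectable only so that the behavior can be checked without actually waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Keeps successive calls at least min_interval seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,  # 1 second between calls = at most 1 QPS
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record the moment at which this call is allowed to proceed.
        self._last = now + delay
        return delay
```

A client would create one `Throttle()` and call its `wait()` method immediately before each API request. Since the note above also asks for intermittent use, long-running batch jobs should add pauses of their own on top of this.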

Selection APIs

Another web API we are exposing is our Selection API, which returns a list of the top-ranked revisions of an article, for when you need to pick a single revision to show to users. This API is useful for creating an offline version of Wikipedia (such as a book or DVD), where a recent revision is preferred, but you can choose only one revision and it must not contain vandalism.

Using the API is very simple: if you wish to know the best revisions to select for page 1234, you call:

This call will return a JSON array of objects describing the top three revisions for the page.  The information about each revision that is returned is:
  • Quality: similar to the vandalism API, but 1.0 means "good" and 0.0 means "likely to be vandalism". (This is the opposite of the scores in the vandalism API.) Note that we are using a different "model" here, so the scores won't exactly match those generated by the vandalism API.
  • Page_id
  • Revision_id
  • Risk: this is a measure of the fraction of the page that changed with respect to the previous revision. In practice, this measure does not seem to be very interesting. There are several pages where risk = 1.0 simply because the revision follows a vandalism that blanked the page, so the change is equal to the entire length of the revision. Our algorithms, which track text across many revisions, know that the text that has been added was there before and is trusted, so sometimes we select such revisions. This column is probably not very useful to look at.
  • Forced: this indicates how much our choice of revision was dictated by the fact that it was the only recent revision available. We added this to catch the possible case of a vandal who edits a page whose last revision is very old (e.g. a redirect page that has not changed since 2005), adding a revision that is low quality, but so much newer than the old one that it might look like a good choice to our selection algorithm.
  • Date: the date of the revision.
  • Days ago: how many days ago the revision was made.
  • Revisions ago.
  • Rank: For each page, we propose 3 revisions, ranked 1, 2, 3.  You should choose the revision with rank = 1 unless for some reason you don't like it; the revisions with rank 2 and 3 are fall-back choices.
  • Url
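
The rank-based choice described above can be sketched in a few lines of Python. This is only an illustration: the JSON key names `rank`, `quality` and `revision_id` are assumptions (the exact keys in the response are not listed here), the 0.5 quality threshold is an arbitrary example of a reason you might "not like" the top-ranked revision, and the sample data is made up rather than real API output.

```python
from typing import Any, Dict, List, Optional


def pick_revision(
    candidates: List[Dict[str, Any]],
    min_quality: float = 0.5,  # arbitrary example threshold
) -> Optional[Dict[str, Any]]:
    """Return the best-ranked candidate whose quality is at least min_quality.

    Candidates are tried in rank order (1, then 2, then 3), so ranks 2 and 3
    act as fall-back choices. Returns None if no candidate is acceptable.
    """
    for rev in sorted(candidates, key=lambda r: r["rank"]):
        if rev["quality"] >= min_quality:
            return rev
    return None


# Made-up sample data with assumed key names; not real API output.
sample = [
    {"rank": 2, "revision_id": 111, "quality": 0.9},
    {"rank": 1, "revision_id": 222, "quality": 0.3},
    {"rank": 3, "revision_id": 333, "quality": 0.8},
]
chosen = pick_revision(sample)
print(chosen)
```

In this made-up sample the rank 1 revision falls below the threshold, so the function falls back to the rank 2 revision.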