An overview of the algorithms

The algorithms have been described in various talks and papers, but here is a more intuitive and informal description.
Please also see the list of frequently asked questions (FAQ).

The goal of WikiTrust is to annotate the text of a wiki with three pieces of information:
  • The revision where each word originated (if you click on a word, you are taken to the diff corresponding to the edit where the word was first inserted).
  • The author of each word (visible, for now, only on the diff page that can be reached by clicking on the word).
  • The degree to which every word has been revised (as indicated by the shade of orange in which the word appears).

Text origin and author

WikiTrust computes the origin and author of each word by tracking text across the revisions of a page.  Whenever a new revision R is introduced, WikiTrust compares the text of the new revision with:
  • The text of the preceding revision
  • The text of the revision most similar to R among a certain number of the most recent revisions (at least 8 are considered; the detailed algorithm for choosing additional comparison points is involved, and aims at reducing the effects of vandalism).
  • Any text that used to be present in previous revisions, but that has since been deleted.
WikiTrust computes an optimal matching of the text in R against the text of these previous revisions, and against the deleted text.  Thus, WikiTrust tracks both the live text and the text that has been deleted, which has important implications (a minimal sketch of this tracking is given after the list below):
  • If user A inserts text in revision 1, a vandal B deletes it in revision 2, and user C re-inserts it in revision 3, then the text in revision 3 is still attributed to its original author A.
  • A user cannot "steal" text by simply removing it, and re-inserting it in a later revision: WikiTrust will track the deleted text, and attribute it to its original author.
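
As a concrete illustration, here is a minimal sketch of this kind of tracking in Python.  It matches single words verbatim and keeps one candidate word per spelling, whereas WikiTrust computes an optimal matching over whole blocks of live and dead text; all names below are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Word:
        text: str
        origin_rev: int   # revision where the word first appeared
        author: str       # author of that revision

    def attribute(new_text, live_words, dead_words, rev_id, author):
        # Words matched against live text (present in preceding revisions)
        # or dead text (deleted in earlier revisions) keep their original
        # origin and author; only unmatched words are credited to the
        # new revision.
        pool = {}
        for w in live_words + dead_words:
            pool.setdefault(w.text, w)
        attributed = []
        for token in new_text.split():
            # Surviving or re-inserted text keeps its attribution, so
            # deleting and re-adding text cannot "steal" authorship.
            attributed.append(pool.get(token) or Word(token, rev_id, author))
        return attributed

For example, if the words of revision 1 are deleted in revision 2 and re-inserted in revision 3, passing them in as dead_words makes the re-inserted text come back with its original author.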

Text trust

WikiTrust computes an index of text trust, which is a measure of how much the text has been revised by reliable users.  The goal of this text trust is to alert visitors to recent changes, and possibly vandalism, that have been introduced in a page.  This trust is encoded via background text coloring.  Text with a white background is trusted: visitors can assume that at least a few high-reputation authors have seen the page since the text was inserted, and have left the text unchanged.  Text with an orange background is untrusted (the darker the orange, the lower the trust): visitors can assume that it has been inserted, or modified, recently.
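
As an illustration of the coloring, the trust of a word (a number in [0, 10], as explained below) could be turned into a background shade along the following lines.  This is only a sketch: the RGB endpoints and the linear fade are assumptions, not WikiTrust's actual rendering code.

    def trust_to_background(trust, max_trust=10.0):
        # Map a trust value in [0, max_trust] to a background color:
        # full trust renders as white, zero trust as a saturated orange.
        t = max(0.0, min(trust, max_trust)) / max_trust
        r = 255                  # red channel stays at full intensity
        g = int(180 + 75 * t)    # green and blue fade in as trust grows,
        b = int(60 + 195 * t)    # moving the shade from orange to white
        return "#%02x%02x%02x" % (r, g, b)

For instance, trust_to_background(10.0) returns "#ffffff" (white), while trust_to_background(0.0) returns "#ffb43c", a clearly visible orange.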

It is to be noted that WikiTrust uses the term "trust" in a mathematical way: trust is a number meant to encode a measure.  This is similar to what is done in the literature about reputation systems: the "reputation" of a user is not really meant to capture the real-life reputation of a person, but is simply a mathematical quantity computed on the basis of the person's activity and other inputs.

Content-driven user reputation

WikiTrust aims at computing how much the text has been revised by reliable users.  Thus, at the basis of WikiTrust is a reputation system for users.  The reputation system is content-driven, as it measures the reputation of users on the basis of how their contributions fare, rather than on the basis of user-to-user feedback.  The main idea is as follows:
  • Authors whose contributions are preserved in full or in part gain reputation.  In particular, users gain reputation when their contributions are preserved unchanged, but also (to a slightly lesser degree) when they are adapted, reworded, or re-formatted.
  • Authors whose contributions are reverted, or almost entirely undone, lose reputation.
We took great care to ensure that authors gain reputation even from contributions that are later revised or reworded.  Further details on the computation of reputation can be found in our talks and papers.
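
The following sketch conveys the flavor of such a content-driven update.  The rate constant, the survival fraction, and the use of the judging editor's reputation are simplifying assumptions made for illustration; WikiTrust's actual update rule differs in its details.

    def update_reputation(rep, survived_fraction, judge_rep, rate=0.1):
        # rep:               current reputation of the judged author
        # survived_fraction: in [-1, 1]; +1 when the contribution is fully
        #                    preserved, negative when it is largely undone
        # judge_rep:         reputation of the later editor whose revision
        #                    implicitly judges the contribution, so that
        #                    low-reputation users cannot wreck reputations
        delta = rate * survived_fraction * judge_rep
        return max(0.0, rep + delta)

Note that the update is triggered by ordinary edits, not by explicit ratings: no user ever fills in a feedback form.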

To validate our notion of user reputation, we have measured how well the reputation of a user predicts how long the future contributions of that user will persist.  We have found a very strong correlation: users of high reputation are much more likely to make long-lasting contributions than users of low reputation.  For the details, see our talks and papers.

From user reputation to text trust

The text "trust" is computed as a measure of how much the text has been revised by high-reputation authors.  Thus, the text trust is computed using author reputations, which are rescaled to be in the interval [0, 10].   The goals of the algorithm are as follows:
  • Text should gain full trust only when revised by multiple high-reputation authors.   This prevents anyone from single-handedly inserting fully-trusted content: full trust requires consensus.
  • Every modification to the text (insertions, deletions, text moves) should leave some trace in the form of lower text trust where the modification occurred.  This ensures that every attempt to tamper with a page leaves some mark that can be noticed by subsequent visitors.
  • Contributions by higher-reputation authors should be trusted more than contributions by low-reputation authors.
To achieve these goals, our trust algorithm works as follows (see the talks and papers for the details; a simplified sketch in code follows the list):
  • Inserted text. When text is first inserted by a user U, it has a trust equal to about 30-40% of the reputation of U.  Thus, text created by novice users, or by anonymous users, will initially have trust 0; text created by top-reputation users will have trust around 3-4 (we sometimes modify these values slightly in the code, so the above values are only indicative).
  • Deleted text. When text is deleted, the "margins" of the cut where the text was deleted are marked with a trust equal to 30-40% of the reputation of the user performing the deletion.  That is, the margins of the cut have the same trust as text that is inserted anew.  The motivation is that cutting portions of text can drastically alter its meaning, so the cut points should receive the same (low) trust as newly inserted text.
  • Displaced text. When text is moved, the margins of the cuts delimiting the block moves are marked with the same low trust as new text.  Again, the idea is that since the meaning can be drastically altered by moving text around, the cut points need to be marked with low trust.
  • Unchanged text. When a user edits a page, the text that the user leaves unchanged gains a little trust, up to the reputation of the user.  For instance, a user of reputation 7 can cause text on the page with trust less than 7 to become a little more trusted.  The idea is that by leaving the text unchanged, the user has implicitly given it at least a weak approval.  The increment in trust is governed by three rules:
    • The higher the reputation of the user, the greater the increment.
    • The closer the text to the point where the edit occurred, the greater the increment (as we assume that users pay more attention to text closer to the point where the edit occurs).
    • A user cannot raise the trust of text twice in a row: once a user has raised the trust of a piece of text, several different users must edit the same page before the user can again raise its trust.
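
Putting these rules together, a simplified sketch of the trust update might look as follows.  The 35% insertion factor, the 1/(1 + distance) attenuation, and the reduction of the repeated-raise rule to "not twice in a row" are assumptions made for brevity; the actual formulas are given in the talks and papers.

    INSERT_FACTOR = 0.35   # new text gets ~30-40% of the author's reputation

    def trust_of_new_text(author_rep):
        # Trust of freshly inserted text; the margins of cuts left by
        # deletions and block moves are assigned the same value.
        return INSERT_FACTOR * author_rep

    def raise_trust(trust, editor_rep, distance, last_raiser, editor,
                    rate=0.1):
        # trust:       current trust of the word, in [0, 10]
        # editor_rep:  reputation of the editor, in [0, 10]
        # distance:    distance (in words) from the edit point; farther
        #              text gains less, having received less attention
        # last_raiser: user who last raised this word's trust; here the
        #              same user simply cannot raise it twice in a row
        if editor == last_raiser or editor_rep <= trust:
            return trust, last_raiser
        gain = rate * (editor_rep - trust) / (1.0 + distance)
        return trust + gain, editor

With this rule, trust approaches but never exceeds the reputation of the editors who leave the text in place, so full trust indeed requires the attention of several high-reputation authors.
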
To validate our notion of text trust, we have conducted experiments showing that high trust is correlated with high future longevity of the text: text that has high trust in a revision lasts longer.  While this correlation provides a quantitative validation of our algorithm, the goal of our notion of text trust is not limited to predicting the future longevity of text.  Its main goal is to alert visitors to recent, unrevised text changes.

