Computing wiki statistics

Computing wiki statistics with WikiTrust has been fully automated.  All you need to do is:
  1. Obtain a dump of a wiki from [http://download.wikimedia.org http://download.wikimedia.org]; you need a ''pages-meta-history.xml.7z'' or ''pages-meta-history.xml.bz2'' dump.
  2. Run the following single command (shown for a .7z dump; for other options, run ./batch_process.py --help):
cd WikiTrust/util
./batch_process.py --cmd_dir <path>/WikiTrust/analysis --dir <process_dir> --do_compute_stats <dump_file_name.xml.7z>
Where:
  • <path> is the path to the directory that contains ''WikiTrust'', so that ''<path>/WikiTrust/analysis'' is the analysis directory (of course, you need to have compiled the code!).
  • <process_dir> is the name of a directory where you want the processing to take place.
The result will be a directory tree containing files with a ''.stats'' extension.  These are text files containing two kinds of lines: EditLife lines and EditInc lines. These lines specify, respectively, the properties of each revision, and a feedback graph that records whether the author of a revision has preserved or undone the work done in a previous revision.

Note that the command above carries out a parallel computation, using all CPU cores available on a particular machine. If you wish to use only some of the CPU cores, use the ''--n_cores'' option.  Do ''./batch_process.py --help'' for a list of all available options.
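Once the computation has finished, the ''.stats'' files can be collected with a few lines of Python. The following is only a minimal sketch: the function names ''iter_stats_lines'' and ''count_line_types'' are made up for illustration, and the code assumes that the files are plain, uncompressed text, as described above.

```python
import os
from collections import Counter


def iter_stats_lines(root):
    """Yield every non-empty line of every *.stats file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".stats"):
                continue
            path = os.path.join(dirpath, name)
            # Assumption: plain text; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        yield line


def count_line_types(root):
    """Count lines by their first token (for example EditLife or EditInc)."""
    return Counter(line.split(None, 1)[0] for line in iter_stats_lines(root))
```

For instance, ''count_line_types("<process_dir>")'' gives a quick sanity check of how many lines of each type were produced.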

EditLife lines

As an example, here are two EditLife lines:

EditLife 1194133082 PageId: 13382652 rev0: 168966752 uid0: 511814 uname0: "Canuck85" NJudges: 3 Delta: 17.50 AvgSpecQ: 0.82857
EditLife 1190472646 PageId: 13382616 rev0: 159558136 uid0: 0 uname0: "121.120.43.189" NJudges: 1 Delta: 179.50 AvgSpecQ: -0.99721

The meaning of these fields is as follows (with reference to the first line; the second line has a similar format):
  • ''EditLife'' identifies the line type (in this case, an EditLife line)
  • ''1194133082'' is the timestamp when the revision was made, in Unix seconds (since 1970). 
  • ''PageId: 13382652'' is the page id.
  • ''rev0: 168966752'' is the revision id.
  • ''uid0: 511814'' is the user id of the author of the revision. If the user id is 0 (as for the second line), then the user was not logged into Wikipedia when the revision was made.
  • ''uname0: "Canuck85"'' is the username of the author of the revision. If the user was not logged in, as in the second line, then the username is the IP address from which the edit originated.
  • ''NJudges: 3'' indicates how many subsequent revisions were considered, to form an opinion of the quality of the contribution of the current revision. There is a fixed maximum (usually, 8; you can change it by changing the value of n_judges in online_types.ml), and users are not allowed to judge themselves.
  • ''Delta: 17.50'' is a measure of how large the change made by the user was. More precisely, ''Delta'' is the edit distance between the revision and the previous one; this edit distance is computed as follows:
    • Each word inserted or deleted contributes 1 to the distance.
    • Each replaced word contributes 0.5 to the distance.
    • Each word that is moved by a fraction x of the page length (0 < x < 1) contributes x to the distance.
    • In tracking text from one revision to the next, capitalization and some punctuation are ignored.
  • ''AvgSpecQ: 0.82857'' indicates the quality of the edit performed by the user. This is an average, computed with respect to ''NJudges'' subsequent revisions. A quality of -1 indicates that the edit was entirely reverted, and a quality of +1 indicates that the edit was entirely preserved. In the two example lines above, the first edit was substantially preserved, while the second was almost entirely reverted.
In our work on measuring author contribution, we advocate using the sum of ''Delta * AvgSpecQ'' over all revisions contributed by a user, as a measure of how much a user has contributed to a page or to a wiki.
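The sum of ''Delta * AvgSpecQ'' can be computed directly from the EditLife lines. The sketch below is illustrative only: the regular expression assumes the field order shown in the two example lines, the name ''contribution_by_user'' is hypothetical, and totals are keyed by username (rather than user id) so that anonymous authors, who all have user id 0, stay separated by IP address.

```python
import re
from collections import defaultdict

# Assumption: fields appear in the order shown in the EditLife examples.
EDITLIFE_RE = re.compile(
    r'^EditLife\s+(?P<time>\d+)\s+PageId:\s+(?P<page>\d+)\s+'
    r'rev0:\s+(?P<rev>\d+)\s+uid0:\s+(?P<uid>\d+)\s+'
    r'uname0:\s+"(?P<uname>.*)"\s+NJudges:\s+(?P<njudges>\d+)\s+'
    r'Delta:\s+(?P<delta>\S+)\s+AvgSpecQ:\s+(?P<q>\S+)\s*$'
)


def contribution_by_user(lines):
    """Sum Delta * AvgSpecQ over the EditLife lines of each username."""
    totals = defaultdict(float)
    for line in lines:
        m = EDITLIFE_RE.match(line)
        if m is None:
            continue  # not an EditLife line, or an unexpected layout
        totals[m["uname"]] += float(m["delta"]) * float(m["q"])
    return dict(totals)
```

To restrict the measure to a single page, one could additionally filter on the ''page'' group before accumulating.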

EditInc lines

As an example, here is an EditInc line (this is a single line; it is broken up to make reading easier):

EditInc 1198049984 PageId: 13382652 Delta: 3.50
rev0: 159538231 uid0: 751502 uname0: "Cope0023"
rev1: 161318875 uid1: 1123105 uname1: "Shotgun pete"
rev2: 178830218 uid2: 1344935 uname2: "After Midnight"
d01: 4.50 d02: 8.00 d12: 3.50 dp2: 7.00
n01: 3 n12: 5 t01: 725144 t12: 6864534

Before explaining the fields, it helps to understand the general idea behind these lines.
These lines compare three revisions: rev0, rev1, rev2. These revisions are not necessarily consecutive, but they are in increasing temporal order; the number of revisions between rev0 and rev2 is at most 8 (again, you can change this constant by editing online_types.ml).
  • The revision rev0 is taken to be a reference revision in the past.
  • The revision rev1 is the revision being judged.
  • The revision rev2 is a reference revision in the future of rev1.
The idea is that the quality of rev1 is being judged by rev2, with rev0 as reference point.  Essentially, it is as if the author of rev2 thought:

My revision is surely good, and rev0 is how the page used to look. Hence, the author of rev1 did a good job if they helped make the page more similar to what I consider good, that is, rev2.

Hence, the author of rev2 computes the following quantities to measure the quality of rev1 with respect to rev0:
  • The amount of work done by the author of rev1 is roughly d(rev0, rev1), or d01, where d(rev0, rev1) indicates the edit distance between rev0 and rev1.
  • The improvement is equal to how much closer the revision became to rev2, that is, d(rev0, rev2) - d(rev1, rev2), or d02 - d12.
  • Hence, the quality q of rev1 with respect to rev0, as seen from rev2, is q = (d02 - d12) / d01. One can verify that -1 <= q <= 1, since the edit distance d we use satisfies the triangle inequality (well, more or less: we use some tricks in the computation of d() that in rare cases break the triangle inequality, but it holds to a very good approximation).
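As a small worked example, the formula can be written as a Python function. This is a sketch under stated assumptions: the name ''edit_quality'' is hypothetical, clipping the result to [-1, 1] is one possible way to handle the rare cases where the triangle inequality fails, and returning ''None'' when d01 is zero is likewise a choice made here, not something prescribed by WikiTrust.

```python
def edit_quality(d01, d02, d12, clip=True):
    """Quality of rev1 with respect to rev0, as seen from rev2.

    q = (d02 - d12) / d01, where dXY is the edit distance between
    revX and revY.
    """
    if d01 <= 0:
        return None  # assumption: no work was done, so q is undefined
    q = (d02 - d12) / d01
    if clip:
        # Assumption: clamp values that fall outside [-1, 1].
        q = max(-1.0, min(1.0, q))
    return q


# Distances from the EditInc example line: d01 = 4.50, d02 = 8.00, d12 = 3.50
print(edit_quality(4.50, 8.00, 3.50))  # 1.0
```

With these distances, all the work done in rev1 brought the page closer to rev2, so the edit gets the maximum quality.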
The information in an EditInc line helps compute the above quality.  The fields have the following meaning.
  • ''EditInc'' is the type of line.
  • ''1198049984'' is the timestamp, in Unix seconds (since 1970).
  • ''PageId: 13382652'' is the page id.
  • ''Delta: 3.50'' is the size of the edit (see above for the explanation of how edit distance is computed).
  • ''rev0: 159538231'' is the revision id of a previous revision, taken as reference point.
  • ''uid0: 751502'' is the user id of the user who did rev0. Again, this user id is 0 if the author was not logged in at the time.
  • ''uname0: "Cope0023"'' is the username of the user who did rev0. Again, this username is the IP address from where the revision originated, if the user was not logged in at the time of the edit.
  • ''rev1: 161318875 uid1: 1123105 uname1: "Shotgun pete"'' is the corresponding data for rev1, the revision that is being judged.
  • ''rev2: 178830218 uid2: 1344935 uname2: "After Midnight"'' is the same information for rev2, the judging revision. Note that we enforce that uid1 is different from uid2: it would not make sense for users to judge themselves.
  • ''d01: 4.50 d02: 8.00 d12: 3.50 dp2: 7.00'' are the distances. Here, ''dp2'' is the distance from the revision preceding rev1, to rev2. This distance is needed by some of our internal algorithms, and may not be of general interest or use.
  • ''n01: 3 n12: 5'' are the numbers of revisions between rev0 and rev1, and between rev1 and rev2, respectively.
  • ''t01: 725144 t12: 6864534'' are the numbers of seconds between rev0 and rev1, and between rev1 and rev2, respectively.

How to compute a user-to-user feedback graph

A common desire is to compute a user-to-user feedback graph on Wikipedia, where the feedback is +1 if a user appreciates the contributions by another user, and -1 otherwise.  In fact, I will explain here how to do better, based on the statistics computed above.  I will explain how to compute a graph where:
  • Vertices are revisions. If you are interested in the user-to-user graph, you can obtain the edge from u1 to u2 by averaging all the edges of revisions done by u1 to revisions done by u2. The advantage of revision granularity is that you can compute feedback graphs that are restricted to certain pages, intervals of time, categories of contributions, etc.
  • Edges from revision rev2 to revision rev1 represent feedback. They have a quality, which ranges from -1 (when rev2 thinks what rev1 did should be reverted), to +1 (when rev2 kept all the improvements done in rev1). Edges also have a (non-negative) weight, which represents the amount of work done by rev1 to deserve the feedback.
These edges can be computed as follows:
  • The quality q of an edge from rev2 to rev1 can be computed by averaging, over all rev0, the ratio (d02 - d12) / d01 mentioned above.
  • The weight is simply Delta, the amount of work done in rev1.
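The procedure above can be sketched in Python as follows. This is an illustration, not WikiTrust code: the function names are hypothetical, the parser assumes that every field has the form ''key: value'' and that usernames contain no double-quote characters, and EditInc lines with d01 equal to zero are skipped because the ratio is undefined for them.

```python
import re
from collections import defaultdict

# Assumption: fields look like 'key: value', with usernames in double quotes.
FIELD_RE = re.compile(r'(\w+): (?:"([^"]*)"|(\S+))')


def parse_editinc(line):
    """Return the 'key: value' fields of an EditInc line as a dict of strings."""
    return {key: (bare or quoted) for key, quoted, bare in FIELD_RE.findall(line)}


def revision_edges(lines):
    """Map (rev2, rev1) to the edge from rev2 to rev1."""
    samples = defaultdict(list)
    info = {}
    for line in lines:
        if not line.startswith("EditInc "):
            continue
        f = parse_editinc(line)
        d01, d02, d12 = float(f["d01"]), float(f["d02"]), float(f["d12"])
        if d01 <= 0:
            continue  # assumption: skip lines where the ratio is undefined
        key = (f["rev2"], f["rev1"])
        samples[key].append((d02 - d12) / d01)
        info[key] = (f["uname2"], f["uname1"], float(f["Delta"]))
    return {
        key: {
            "judge": info[key][0],
            "judged": info[key][1],
            "quality": sum(qs) / len(qs),  # average over all rev0
            "weight": info[key][2],        # Delta, the work done in rev1
        }
        for key, qs in samples.items()
    }


def user_edges(rev_edges):
    """Average revision-level qualities into (judge, judged) user edges."""
    sums = defaultdict(lambda: [0.0, 0])
    for edge in rev_edges.values():
        acc = sums[(edge["judge"], edge["judged"])]
        acc[0] += edge["quality"]
        acc[1] += 1
    return {users: total / count for users, (total, count) in sums.items()}
```

Here the user-to-user edge is a plain average of the revision edges; a weighted average using the ''weight'' field would be a natural variation. Restricting the input lines by page id or timestamp yields feedback graphs for specific pages or time intervals.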