Computing wiki statistics with WikiTrust has been fully automated. All you need to do is:
- Obtain a dump of a wiki from [http://download.wikimedia.org http://download.wikimedia.org]; you need a ''pages-meta-history.xml.7z'' or ''pages-meta-history.xml.bz2'' dump.
- Run the following single command (shown for a .7z dump; for other options, run ''./batch_process.py --help''):
./batch_process.py --cmd_dir <path>/WikiTrust/analysis --dir <process_dir> --do_compute_stats <dump_file_name.xml.7z>
<path> is the path to the ''WikiTrust/analysis'' directory (of course, you need to have compiled the code!).
<process_dir> is the name of a directory where you want the processing to take place.
The result will be a directory tree containing files with a ''.stats'' extension. These are text files containing two kinds of lines: EditLife lines and EditInc lines. EditLife lines specify the properties of each revision. EditInc lines specify a feedback graph that records whether the author of a revision has preserved, or undone, the work done in a previous revision.
Note that the command above carries out a parallel computation, using
all CPU cores available on a particular machine. If you wish to use
only some of the CPU cores, use the ''--n_cores'' option. Do ''./batch_process.py --help'' for a list of all available options.
As an example, here are two EditLife lines (each is a single line in the file; they are broken up here to make reading easier):
EditLife 1194133082 PageId: 13382652 rev0: 168966752 uid0: 511814 uname0:
"Canuck85" NJudges: 3 Delta: 17.50 AvgSpecQ: 0.82857
EditLife 1190472646 PageId: 13382616 rev0: 159558136 uid0: 0 uname0:
"188.8.131.52" NJudges: 1 Delta: 179.50 AvgSpecQ: -0.99721
The meaning of these fields is as follows (with reference to the first line; the second line has a similar format):
- ''EditLife'' identifies the line type (in this case, an EditLife line).
- ''1194133082'' is the timestamp when the revision was made, in Unix seconds (since 1970).
- ''PageId: 13382652'' is the page id.
- ''rev0: 168966752'' is the revision id.
- ''uid0: 511814'' is the user id of the author of the revision. If the
user id is 0 (as for the second line), then the user was not logged
into Wikipedia when the revision was made.
- ''uname0: "Canuck85"'' is the username of the author of the revision.
If the user was not logged in, as in the second line, then the username
is the IP address from which the edit originated.
- ''NJudges: 3'' indicates how many subsequent revisions were
considered, to form an opinion of the quality of the contribution of
the current revision. There is a fixed maximum (usually, 8; you can
change it by changing the value of n_judges in online_types.ml), and
users are not allowed to judge themselves.
- ''Delta: 17.50'' is a measure of how big the change done by the user
was. In fact, ''Delta'' is the edit distance between the revision and
the previous one; this edit distance is computed as follows:
- Each word inserted or deleted contributes 1 to the distance.
- Each replaced word contributes 0.5 to the distance.
- Each word that is moved by a fraction x of the page length (0 < x < 1) contributes x to the distance.
- In tracking text from one revision to the next, capitalization and some punctuation are ignored.
- ''AvgSpecQ: 0.82857'' indicates the quality of the edit performed by
the user. This is an average, computed with respect to ''NJudges''
subsequent revisions. A quality of -1 indicates that the edit was
entirely reverted, and a quality of +1 indicates that the edit was
entirely preserved. In the two lines we provide as example, the first
edit was substantially preserved, while the second was almost entirely reverted.
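The field layout described above can be read with a small parser. The following is a minimal sketch, not part of WikiTrust: the function name ''parse_editlife'' and the dictionary keys are hypothetical, and it assumes each EditLife record sits on a single line with fields in exactly the order shown in the example.

```python
import re

# Pattern for one EditLife line, following the field order in the example above.
# The username is matched greedily so that names containing spaces or quotes
# are still captured up to the last quote before "NJudges:".
EDITLIFE_RE = re.compile(
    r'^EditLife\s+(?P<time>\d+)\s+PageId:\s+(?P<page_id>\d+)\s+'
    r'rev0:\s+(?P<rev0>\d+)\s+uid0:\s+(?P<uid0>\d+)\s+'
    r'uname0:\s+"(?P<uname0>.*)"\s+'
    r'NJudges:\s+(?P<n_judges>\d+)\s+Delta:\s+(?P<delta>\S+)\s+'
    r'AvgSpecQ:\s+(?P<avg_spec_q>\S+)\s*$'
)


def parse_editlife(line):
    """Return a dict of typed fields, or None if the line is not an EditLife line.

    Raises ValueError if a numeric field is malformed.
    """
    m = EDITLIFE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": int(m.group("time")),          # Unix seconds
        "page_id": int(m.group("page_id")),
        "rev0": int(m.group("rev0")),
        "uid0": int(m.group("uid0")),          # 0 means the author was not logged in
        "uname0": m.group("uname0"),           # username, or IP address if uid0 == 0
        "n_judges": int(m.group("n_judges")),
        "delta": float(m.group("delta")),
        "avg_spec_q": float(m.group("avg_spec_q")),
    }


example = ('EditLife 1194133082 PageId: 13382652 rev0: 168966752 uid0: 511814 '
           'uname0: "Canuck85" NJudges: 3 Delta: 17.50 AvgSpecQ: 0.82857')
record = parse_editlife(example)
```

Lines of other types (for instance EditInc lines) yield None, so the same loop can scan a whole ''.stats'' file and keep only the records it recognizes.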
In our work on measuring author contribution, we advocate using the sum
of ''Delta * AvgSpecQ'' over all revisions contributed by a user, as a
measure of how much a user has contributed to a page or to a wiki.
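As an illustration of this measure, here is a minimal sketch that sums ''Delta * AvgSpecQ'' per author. The function name and the dictionary keys (''uid0'', ''uname0'', ''delta'', ''avg_spec_q'') are hypothetical; the records are assumed to have been read from EditLife lines. Authors are keyed by both user id and username, since all anonymous authors share user id 0 and are told apart only by IP address.

```python
from collections import defaultdict


def total_contribution(records):
    """Sum Delta * AvgSpecQ per author over an iterable of EditLife records."""
    totals = defaultdict(float)
    for rec in records:
        author = (rec["uid0"], rec["uname0"])
        totals[author] += rec["delta"] * rec["avg_spec_q"]
    return dict(totals)


# The two example EditLife lines, as already-parsed records.
records = [
    {"uid0": 511814, "uname0": "Canuck85", "delta": 17.50, "avg_spec_q": 0.82857},
    {"uid0": 0, "uname0": "188.8.131.52", "delta": 179.50, "avg_spec_q": -0.99721},
]
totals = total_contribution(records)
```

To restrict the measure to a single page, filter the records by page id before summing.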
As an example, here is an EditInc line (this is a single line; it is broken up to make reading easier):
EditInc 1198049984 PageId: 13382652 Delta: 3.50
rev0: 159538231 uid0: 751502 uname0: "Cope0023"
rev1: 161318875 uid1: 1123105 uname1: "Shotgun pete"
rev2: 178830218 uid2: 1344935 uname2: "After Midnight"
d01: 4.50 d02: 8.00 d12: 3.50 dp2: 7.00
n01: 3 n12: 5 t01: 725144 t12: 6864534
Before explaining the fields, it helps to understand the general idea behind these lines.
These lines compare three revisions: rev0, rev1, rev2. These revisions
are not necessarily consecutive, but are in increasing temporal
ordering; the number of revisions comprised between rev0 and rev2 is at
most 8 (again, you can change this constant by editing n_judges in online_types.ml).
The revision rev0 is taken to be a reference revision in the past.
The revision rev1 is the revision being judged.
The revision rev2 is a reference revision in the future of rev1.
The idea is that the quality of rev1 is being judged by rev2, with rev0 as reference point. Essentially, it is as if the author of rev2 thought:
My revision is surely good, and rev0 is what the page used to look
like. Hence, the author of rev1 did a good job if they helped make the
page more similar to what I consider good, that is, rev2.
Hence, the author of rev2 computes the following quantity to measure the quality of rev1 with respect to rev0:
The amount of work done by the author of rev1 is roughly d(rev0,
rev1), or d01, where d(rev0, rev1) indicates the edit distance between
rev0 and rev1.
The improvement is equal to how much closer the revision became to
rev2, that is, d(rev0, rev2) - d(rev1, rev2), or d02 - d12.
Hence, the quality q of rev1 with respect to rev0, as seen from rev2,
is q = (d02 - d12) / d01. One can verify that -1 <= q <= 1, since
the edit distance d we use satisfies the triangle inequality (well,
more or less: we use some tricks in the computation of d() that
in rare cases break the triangle inequality, but it holds to a very
good approximation).
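The formula can be checked against the distances in the example EditInc line. This is a minimal sketch with a hypothetical function name. Two choices here are assumptions rather than WikiTrust behavior: the result is clamped to [-1, 1] to absorb the rare cases where the triangle inequality fails, and None is returned when d01 is zero, since the ratio is then undefined.

```python
def edit_quality(d01, d02, d12):
    """Quality of rev1 relative to rev0, as judged from rev2.

    d01, d02, d12 are the edit distances d(rev0, rev1), d(rev0, rev2)
    and d(rev1, rev2). Returns None when no work was done (d01 == 0).
    """
    if d01 <= 0:
        return None
    q = (d02 - d12) / d01
    # Clamping is an assumption of this sketch, not documented behavior.
    return max(-1.0, min(1.0, q))


# Distances from the example EditInc line: d01 = 4.50, d02 = 8.00, d12 = 3.50.
q = edit_quality(4.50, 8.00, 3.50)  # (8.00 - 3.50) / 4.50 = 1.0
```

In this example the work done in rev1 brought the page closer to rev2 by exactly the amount of the edit, so rev1 is judged as fully preserved.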
The information in an EditInc line is what is needed to compute the quality above. The fields have the following meaning:
- ''EditInc'' is the type of line.
- ''1198049984'' is the timestamp, in Unix seconds (since 1970).
- ''PageId: 13382652'' is the page id.
- ''Delta: 3.50'' is the size of the edit (see above for the explanation of how edit distance is computed).
- ''rev0: 159538231'' is the revision id of a previous revision, taken as reference point.
- ''uid0: 751502'' is the user id of the user who did rev0. Again, this
user id is 0 if the author was not logged in at the time.
- ''uname0: "Cope0023"'' is the username of the user who did rev0.
Again, this username is the IP address from which the revision
originated, if the user was not logged in at the time of the edit.
- ''rev1: 161318875 uid1: 1123105 uname1: "Shotgun pete"'' is the
corresponding data for rev1, the revision that is being judged.
- ''rev2: 178830218 uid2: 1344935 uname2: "After Midnight"'' is the
same information for rev2, the judging revision. Note that we enforce
that uid1 is different from uid2: it would not make sense for a user to
judge themselves.
- ''d01: 4.50 d02: 8.00 d12: 3.50 dp2: 7.00'' are the distances. Here,
''dp2'' is the distance from the revision preceding rev1 to rev2. This
distance is needed by some of our internal algorithms, and may not be
of general interest or use.
- ''n01: 3 n12: 5'' are the numbers of revisions between rev0 and rev1, and between rev1 and rev2, respectively.
- ''t01: 725144 t12: 6864534'' are the numbers of seconds between rev0 and rev1, and between rev1 and rev2, respectively.
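The fields listed above can be extracted with a regular expression. This is a minimal sketch, not part of WikiTrust: ''parse_editinc'' and the dictionary keys are hypothetical names, and it assumes each EditInc record is a single line with fields in exactly the order of the example.

```python
import re

# One EditInc line; usernames are matched greedily because they may contain spaces.
EDITINC_RE = re.compile(
    r'^EditInc\s+(?P<time>\d+)\s+PageId:\s+(?P<page_id>\d+)\s+Delta:\s+(?P<delta>\S+)\s+'
    r'rev0:\s+(?P<rev0>\d+)\s+uid0:\s+(?P<uid0>\d+)\s+uname0:\s+"(?P<uname0>.*)"\s+'
    r'rev1:\s+(?P<rev1>\d+)\s+uid1:\s+(?P<uid1>\d+)\s+uname1:\s+"(?P<uname1>.*)"\s+'
    r'rev2:\s+(?P<rev2>\d+)\s+uid2:\s+(?P<uid2>\d+)\s+uname2:\s+"(?P<uname2>.*)"\s+'
    r'd01:\s+(?P<d01>\S+)\s+d02:\s+(?P<d02>\S+)\s+d12:\s+(?P<d12>\S+)\s+dp2:\s+(?P<dp2>\S+)\s+'
    r'n01:\s+(?P<n01>\d+)\s+n12:\s+(?P<n12>\d+)\s+'
    r't01:\s+(?P<t01>\d+)\s+t12:\s+(?P<t12>\d+)\s*$'
)

INT_FIELDS = ("time", "page_id", "rev0", "uid0", "rev1", "uid1",
              "rev2", "uid2", "n01", "n12", "t01", "t12")
FLOAT_FIELDS = ("delta", "d01", "d02", "d12", "dp2")


def parse_editinc(line):
    """Return a dict of typed fields, or None if the line is not an EditInc line.

    Raises ValueError if a distance field is not a valid number.
    """
    m = EDITINC_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    for key in INT_FIELDS:
        rec[key] = int(rec[key])
    for key in FLOAT_FIELDS:
        rec[key] = float(rec[key])
    return rec


example = ('EditInc 1198049984 PageId: 13382652 Delta: 3.50 '
           'rev0: 159538231 uid0: 751502 uname0: "Cope0023" '
           'rev1: 161318875 uid1: 1123105 uname1: "Shotgun pete" '
           'rev2: 178830218 uid2: 1344935 uname2: "After Midnight" '
           'd01: 4.50 d02: 8.00 d12: 3.50 dp2: 7.00 '
           'n01: 3 n12: 5 t01: 725144 t12: 6864534')
record = parse_editinc(example)
```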
How to compute a user-to-user feedback graph
A common goal is to compute a user-to-user feedback graph on
Wikipedia, where the feedback is +1 if a user appreciates the
contributions by another user, and -1 otherwise. In fact, I will explain here how to do better, based on the statistics computed above. I will explain how to compute a graph where:
Vertices are revisions. If you are interested in the user-to-user
graph, you can obtain the edge from u1 to u2 by averaging all the edges
of revisions done by u1 to revisions done by u2. The advantage of
revision granularity is that you can compute feedback graphs that are
restricted to certain pages, intervals of time, categories of
Edges from revision rev2 to revision rev1 represent feedback. They
have a quality, which ranges from -1 (when rev2 thinks what rev1 did
should be reverted), to +1 (when rev2 kept all the improvements done in
rev1). Edges also have a (non-negative) weight, which represents the
amount of work done by rev1 to deserve the feedback.
These edges can be computed as follows:
The quality q of an edge from rev2 to rev1 can be computed by
averaging, over all rev0, the ratio (d02 - d12) / d01 mentioned above.
The weight is simply Delta, the amount of work done in rev1.
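The construction above can be sketched as follows. This is a minimal illustration under stated assumptions, not WikiTrust code: the function names and record keys (''rev1'', ''rev2'', ''uid1'', ''uid2'', ''d01'', ''d02'', ''d12'', ''delta'') are hypothetical, records are assumed to come from EditInc lines, records with d01 equal to zero are skipped because the ratio is undefined, and the user-level edge is a plain unweighted average of the revision-level qualities.

```python
from collections import defaultdict


def revision_feedback_edges(editinc_records):
    """Map (rev2, rev1) -> (quality, weight).

    Quality is the average over all rev0 of (d02 - d12) / d01;
    weight is the Delta reported for rev1.
    """
    ratios = defaultdict(list)
    weights = {}
    for rec in editinc_records:
        if rec["d01"] <= 0:
            continue  # no work done between rev0 and rev1: ratio undefined
        edge = (rec["rev2"], rec["rev1"])
        ratios[edge].append((rec["d02"] - rec["d12"]) / rec["d01"])
        weights[edge] = rec["delta"]
    return {edge: (sum(qs) / len(qs), weights[edge]) for edge, qs in ratios.items()}


def user_feedback_edges(editinc_records):
    """Map (uid2, uid1) -> average quality of the revision edges between them."""
    authors = {}
    for rec in editinc_records:
        authors[rec["rev1"]] = rec["uid1"]
        authors[rec["rev2"]] = rec["uid2"]
    per_user = defaultdict(list)
    for (rev2, rev1), (quality, _weight) in revision_feedback_edges(editinc_records).items():
        per_user[(authors[rev2], authors[rev1])].append(quality)
    return {pair: sum(qs) / len(qs) for pair, qs in per_user.items()}


# Two judgments of the same rev1 by the same rev2, with different rev0.
records = [
    {"rev1": 161318875, "uid1": 1123105, "rev2": 178830218, "uid2": 1344935,
     "d01": 4.5, "d02": 8.0, "d12": 3.5, "delta": 3.5},
    {"rev1": 161318875, "uid1": 1123105, "rev2": 178830218, "uid2": 1344935,
     "d01": 2.0, "d02": 3.0, "d12": 3.0, "delta": 3.5},
]
edges = revision_feedback_edges(records)
```

To restrict the graph to certain pages or time intervals, filter the records before passing them in. A weighted average using the edge weights would be a natural variation at the user level, but that is a design choice beyond what is described here.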