Task Author: Gordon Cormack (CAN)
One important observation is that excerpts from the same language version of Wikipedia will share some characteristics in a statistical sense. Because many random excerpts are offered, the variability between excerpts from the same language plays a negligible role. It has been confirmed that the provided test input resembles the official grader input closely and predictably in a statistical sense.
Note that because the language codes and symbol codes are randomly re-coded, there is no opportunity to hard-code any specific (personal) language knowledge into a solution.
Many approaches are possible. Rocchio's method, which was informally described in the task description, suffices to solve Subtask 1.
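As an illustration of the idea, here is a minimal sketch of a centroid-style classifier in the spirit of Rocchio's method. It is not the official solution: the class name CentroidClassifier, the guess/learn interface, and the choice of cosine similarity are assumptions made for this example. Each language accumulates the symbol counts of all excerpts seen so far, and a new excerpt is assigned to the language whose accumulated counts it resembles most.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class CentroidClassifier:
    """Hypothetical helper: one accumulated symbol-count vector per language."""

    def __init__(self):
        # Summed counts per language; cosine similarity ignores scale,
        # so the sum behaves like the mean (centroid) here.
        self.totals = defaultdict(Counter)

    def guess(self, excerpt):
        """Return the most similar known language, or None if none is known."""
        vec = Counter(excerpt)
        best_lang, best_score = None, -1.0
        for lang, centroid in self.totals.items():
            score = cosine(vec, centroid)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

    def learn(self, excerpt, lang):
        """Add the excerpt's symbol counts to the given language."""
        self.totals[lang].update(excerpt)


# Usage: guess first, then learn from the revealed correct language.
clf = CentroidClassifier()
clf.learn([1, 1, 2, 3], 0)
clf.learn([7, 8, 8, 9], 1)
print(clf.guess([1, 2, 2]))
```

Excerpts are treated as sequences of integer symbol codes, so the random re-coding of symbols does not matter: only the counts per code are used.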
For Subtask 2, one needs to do more than simply look at symbol frequencies. Collecting statistics on bigrams (pairs of neighboring symbols) and trigrams (three consecutive symbols) yields higher accuracy.
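The following sketch shows one way such statistics could be collected and used. It is an assumption-laden example rather than the reference solution: the names ngram_counts and NgramClassifier, the max_n parameter, and the cosine-similarity scoring are all introduced here for illustration. Single symbols, bigrams, and trigrams are counted together in one sparse vector per language.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(excerpt, max_n=3):
    """Count all n-grams of length 1..max_n, each stored as a tuple of symbols."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(excerpt) - n + 1):
            counts[tuple(excerpt[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class NgramClassifier:
    """Hypothetical helper: accumulated n-gram counts per language."""

    def __init__(self, max_n=3):
        self.max_n = max_n
        self.totals = defaultdict(Counter)

    def guess(self, excerpt):
        """Return the language with the most similar n-gram profile, or None."""
        vec = ngram_counts(excerpt, self.max_n)
        best_lang, best_score = None, -1.0
        for lang, profile in self.totals.items():
            score = cosine(vec, profile)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

    def learn(self, excerpt, lang):
        """Add the excerpt's n-gram counts to the given language."""
        self.totals[lang].update(ngram_counts(excerpt, self.max_n))


# Two toy "languages" with identical symbol frequencies but different ordering.
clf = NgramClassifier()
clf.learn([1, 2, 1, 2, 1, 2], 0)
clf.learn([1, 1, 1, 2, 2, 2], 1)
print(clf.guess([2, 1, 2, 1]), clf.guess([1, 1, 2, 2]))
```

The toy data is chosen so that symbol frequencies alone cannot separate the two languages; only the bigram and trigram counts differ, which is the situation the paragraph above describes.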