
O.V. Kukushkina, A.A. Polikarpov, D.V. Khmelev


USING LETTERS AND GRAMMATICAL STATISTICS FOR AUTHORSHIP ATTRIBUTION

1  Introduction

In this paper the problem of identifying the author of a text is stated as follows. Large fragments of prose by a number of authors are given. The texts are in Russian or in some other written phonological (non-hieroglyphic, non-character) language1. An anonymous text known to belong to one of these authors is disputed between all of them, and one has to predict the correct author. D.V. Khmelev [] proposed and statistically tested a technique that gives correct predictions with quite high probability. The technique is based on data about the usage of consecutive elements of the text (letters, morphemes, etc.).

To the best of our knowledge, the first attempt to find a technique for authorship attribution was made in []. Markov [] replied to [] almost immediately, which shows that the founder of the theory of Markov chains was quite interested in this field. Notice also that Markov described the first application of "events which are linked into a chain'' in [], where he studied the distribution of vowels and consonants among the first 20000 letters of "Evgenij Onegin''.

Modern methods of authorship attribution developed in Russia are reviewed in [, Chapter 1], and a good review of foreign work is given in []. Despite the huge variety of methods described, none of them has been applied to a large number of texts. Often these methods are not automatic and require human intervention, which makes processing a large number of large texts practically impossible. Hence the question of generalization arises: can they be used outside the situation they were developed for?

Until recently the only exception was [], where the chosen technique was tested on a number of texts. The authors examine the proportion of function words used by an author and find that this proportion is stable for each of a large number of Russian writers of the XVIII-XX centuries. This technique was applied in [] to the problem of detecting plagiarism.

A new method of authorship attribution for natural-language texts (in fact, independent of the language) was first proposed by Khmelev in [].

The technique is based on a Markov model for the sequence of letters (or of any other elements) of a text: the sequence of letters is considered as a Markov chain in which each letter (element) depends only on the preceding letter (element).

The matrices of transition frequencies of element pairs (letters, grammatical codes, etc.) are calculated over all texts of each author, so that we know (approximately) the probability of transition from one letter to another for each author. The author of an anonymous text is then determined by the maximum-likelihood principle: for each matrix we calculate the probability of the anonymous text, and the author whose matrix gives the maximal probability is taken to be the true author.
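To make the procedure concrete, here is a minimal sketch in Python (illustrative only, not the software used in this research), assuming each author's training texts are available as plain strings of encoded elements:

    from collections import Counter
    from math import log

    def transition_counts(text):
        """Frequencies of adjacent element pairs (here: characters of a string)."""
        return Counter(zip(text, text[1:]))

    def log_prob(text, counts):
        """Log-probability of `text` under the first-order Markov chain whose
        transition frequencies are given by `counts`; pairs unseen in the
        training counts are skipped."""
        totals = Counter()
        for (i, _j), c in counts.items():
            totals[i] += c
        return sum(c * log(counts[p] / totals[p[0]])
                   for p, c in transition_counts(text).items()
                   if counts.get(p, 0) > 0)

    def most_likely_author(anonymous, training_texts):
        """training_texts maps an author name to that author's texts joined
        into a single string; the predicted author maximizes log_prob."""
        matrices = {w: transition_counts(t) for w, t in training_texts.items()}
        return max(matrices, key=lambda w: log_prob(anonymous, matrices[w]))

The same counting scheme applies to any sequence of encoded elements (letters, grammatical codes, etc.), not only to characters.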

This method is amazingly precise, as was shown in [], where it was applied to a large number of various texts. The result is even more striking if we recall that only the frequencies of letter pairs are taken into account.

Markov models of first and higher orders were used in many works of the 1950s and 1960s to estimate the entropy of various kinds of texts. Many of those works are described in [, Ch. 4.3].

However, none of the works mentioned raised the question of applying Markov chains to the problem of authorship attribution. Moreover, the generally accepted viewpoint was that the characteristics of any fiction text measured through letter and letter-pair frequencies are close to the average characteristics of the language, and hence indistinguishable from a practical viewpoint (see [, footnote on p. 181] and [,,]). The maximum-likelihood principle had also never been applied in authorship attribution, for the following reasons. Firstly, processing large amounts of data was awkward before computers and large collections of electronic texts became available. Secondly, there was a kind of psychological barrier in the absence of an underlying theory, since a first-order Markov chain is only a first and very poor approximation of a natural-language text; this fact is pointed out in many papers on estimating text entropy via Markov chains of higher order [, p. 187]. Finally, it was believed in authorship attribution that stable quantitative characteristics exist for grammatical information, and that it is these characteristics that are useful for distinguishing writers. To the best of our knowledge, until recently the only significant result in this direction for the Russian language was obtained in [].

In this work we further develop the validation procedure for the method of [] and apply the method to different units of analysis:

(a) letter pairs in their natural sequence in the text, i.e., in words (as they appear in the text) and in the spaces between them;

(b) letter pairs in the sequences of letters of the vocabulary (dictionary) forms of words; for example, the previous item would be reduced to "letter pair in sequence of letter of vocabulary form of word''; for Russian texts this reduction is much more significant, since Russian words have many different forms;

(c) pairs of the most generalized ("incomplete'') grammatical classes of words, i.e., parts of speech, in their sequence in the sentences of the text. In Russian, 14 parts of speech are traditionally distinguished: nouns, verbs, adjectives, etc. We introduce 4 further categories: "end of sentence'', "abbreviation'', "unclear class'' and "dash sign''. The category "unclear class'' was introduced because grammatical classes were assigned automatically for more than 99% of the words, while some words (for example, misprinted ones) were not recognized automatically, so their grammatical class remained unclear;

(d) pairs of less generalized ("complete'') grammatical classes of words, for example, animate noun, inanimate noun, qualitative adjective, relative adjective, possessive adjective, etc.

A full cross-validation was carried out over 385 texts by 82 writers. The results are presented in Tables 1 and 2. The details of the authors and texts, including the sizes of the texts and the number of texts per author, are given in Table 3.

To gauge the precision of the method one can calculate the percentage of correctly classified texts. The best results were obtained in cases (a) and (b) (73% and 62% of correct classifications, respectively); 61% of the texts were correctly classified in case (c). In case (d) the results are much worse (4%).

Section 2 describes the principles and results of pre-processing. Section 3 describes the cross-validation procedure. A detailed description of the results is given in Section 4, and conclusions are presented in Section 5. The Appendix, written by D.V. Khmelev, presents another approach to authorship attribution, based on data compression algorithms.

2  Pre-processing

After pre-processing, the source corpus of texts was represented in each of the forms (a)-(d).

In case (a) all words with unclear grammatical class were omitted, i.e., words not recognized automatically were ignored (this is the main difference between the present research and []). These words are ignored so that the results of case (a) can be compared with those of case (b). The text is converted into a sequence of words and delimiting spaces without any formatting, and all punctuation is removed. Finally, words beginning with a capital letter are also omitted (in particular, all first words of sentences are removed); it was shown in [] that this last step significantly increases the precision of authorship attribution. A plausible explanation of this improvement is that it removes the names of characters, which are usually unrelated to the style of the author of a fiction text. The letter "ё'' of the Russian alphabet was merged with the letter "е'', so that we had 33 characters including the space. Each letter was encoded by its number: the letter "a'' corresponds to 1, the letter "ya'' to 32, and the space to 0. The total number of letters in all the texts is 96209964. The total number of different letter pairs found in the corpus is 1011 (out of 33×33=1089 possible). Clearly, 1011 is larger than the actual number of different letter pairs that occur in Russian texts; the excess comes from misprints in the electronic versions of the books and could lead to errors in the computations. To estimate this noise, 121 letter pairs were selected whose appearance in a Russian text is unlikely. In total they were used 38495 times, or 0,04% of the total size of the corpus, and hence they can be disregarded.
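A sketch of this pre-processing step in Python (the regular expression and the handling of capitalized words are illustrative assumptions, not the exact rules of the actual pipeline):

    import re

    # The 32 Russian letters with "ё" merged into "е"; codes 1..32, space is 0.
    ALPHABET = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
    CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

    def preprocess(text):
        """Case (a): drop words starting with a capital letter, remove
        punctuation, merge 'ё' into 'е', and return the sequence of letter
        codes, with 0 standing for the space between words."""
        words = re.findall(r"\w+", text.replace("ё", "е"))
        kept = [w for w in words if w[:1].islower()]
        codes = []
        for w in kept:
            codes.extend(CODE[ch] for ch in w if ch in CODE)
            codes.append(0)                      # delimiting space
        return codes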

Pre-processing of the source corpus in cases (b), (c) and (d) was carried out with the help of an automatic classifier developed by O.V. Kukushkina at the Laboratory for General and Computational Lexicology and Lexicography of the Philological Faculty of Lomonosov Moscow State University. The classifier is based on the grammatical dictionary of Zaliznyak [] and the Academic grammars [,].

Case (c) is based on the most general grammatical class of the word form, i.e., we use only the information about the part of speech (particles, prepositions, interjections, copulative and adversative conjunctions, other conjunctions, verbs, pronouns, adverbs, adjectives, nouns, numerals, predicate nouns, comparatives, modal adverbs).

Case (d) is based on the lexical and grammatical category within the given part of speech (animate noun, inanimate noun, etc.).

Let us present some data for cases (b)-(d).

In case (b) the total number of letters is 110704464. There are 1029 different letter pairs out of 33×33=1089 possible. As one can see, the figures are higher than in case (a); this effect is related to the conversion of oblique word forms to vocabulary forms.

In cases (c) and (d) the number of elements is 20262449. In case (c) the total number of different element pairs is 302 out of 18×18=324 possible. In case (d) the number of different element pairs is 8124 out of 112×112=12544 possible.

3  Technique and its cross-validation

The analysis of the texts in each case was carried out using software developed by Khmelev. For each case we present the results of cross-validation of the method of []. Cross-validation is carried out as follows.

Recall that the elements of the texts are encoded by numbers from 0 to 32 (in cases (a) and (b)), to 17 or to 111 (in cases (c) and (d), respectively). Code 0 always corresponds to the delimiter between large blocks: in cases (a) and (b) it is the space between words, and in cases (c) and (d) it is the delimiter between sentences ("end of sentence'').

Given $W$ writers, each having $N_w$ texts ($w=0,\dots,W-1$), we count $Q_{ij}^{wn}$, the number of transitions from element $i$ to element $j$ in text $n$ ($n=0,\dots,N_w-1$) of writer $w$. In order to find the predicted author of text $\hat n$ (whose true author is $\hat w$) using the information about the authorship of all the other texts, we set


\[
Q_{ij}^{k} = \sum_{n=0}^{N_k-1} Q_{ij}^{kn}, \qquad Q_i^{k} = \sum_j Q_{ij}^{k}
\]
for authors $k \ne \hat w$, while for author $\hat w$ we exclude text $\hat n$ from the training set:
\[
Q_{ij}^{\hat w} = \sum_{n \ne \hat n} Q_{ij}^{\hat w n}, \qquad Q_i^{\hat w} = \sum_j Q_{ij}^{\hat w}.
\]
We then have
\[
L_k(\hat w, \hat n) = - \sum_{i:\; Q_i^{k} > 0} \; \sum_{j:\; Q_{ij}^{k} > 0} Q_{ij}^{\hat w \hat n} \ln \frac{Q_{ij}^{k}}{Q_i^{k}}
\]
and
\[
L_{\hat w}(\hat w, \hat n) = - \sum_{i:\; Q_i^{\hat w} > 0} \; \sum_{j:\; Q_{ij}^{\hat w} > 0} Q_{ij}^{\hat w \hat n} \ln \frac{Q_{ij}^{\hat w}}{Q_i^{\hat w}}.
\]

Ignoring the degenerate cases $Q_{ij}^k=0$ and $Q_i^k=0$, one can see that each $L_k(\hat w,\hat n)$ is minus the logarithm of the probability of text $\hat n$ of writer $\hat w$ when that text is generated by the Markov chain with transition probabilities $P_{ij}^k = Q_{ij}^k / Q_i^k$. The hint to ignore the degenerate summands is given by the results on optimal maximum-likelihood estimation in [, p. 224].
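A sketch of this leave-one-out computation in Python (assuming the per-text transition counts are stored as dictionaries keyed by element pairs; the function names are ours, not those of the original software):

    from math import log

    def minus_log_likelihood(test_counts, train_counts):
        """L_k: minus the log-probability of the test text under the training
        transition frequencies; degenerate (zero-count) pairs are skipped."""
        row_totals = {}
        for (i, _j), c in train_counts.items():
            row_totals[i] = row_totals.get(i, 0) + c
        L = 0.0
        for (i, j), c in test_counts.items():
            q_ij = train_counts.get((i, j), 0)
            if q_ij > 0:
                L -= c * log(q_ij / row_totals[i])
        return L

    def scores_for_text(w_hat, n_hat, counts):
        """counts[w][n] is the pair-count dictionary of text n of author w.
        The text under test is excluded from its own author's training set."""
        test = counts[w_hat][n_hat]
        scores = {}
        for k, texts in counts.items():
            train = {}
            for n, text_counts in enumerate(texts):
                if k == w_hat and n == n_hat:
                    continue                     # leave the test text out
                for pair, c in text_counts.items():
                    train[pair] = train.get(pair, 0) + c
            scores[k] = minus_log_likelihood(test, train)
        return scores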

We also define the rank $R_k(\hat w,\hat n)$ as the rank of $L_k(\hat w,\hat n)$ in the set $\{L_k(\hat w,\hat n),\ k=0,\dots,W-1\}$, where the smallest value gets the smallest rank 0, so that $R_k(\hat w,\hat n) \in \{0,\dots,W-1\}$. If the text is assigned to the correct author, then $R_{\hat w}(\hat w,\hat n)=0$.

If the text is assigned to another author and the correct author is second among the pretenders, then $R_{\hat w}(\hat w,\hat n)=1$, and so on.

A result of cross-validation is a set of ranks

\[
\{ R_{\hat w}(\hat w, \hat n) \}, \qquad \hat w \in \{0,\dots,W-1\},\ \hat n \in \{0,\dots,N_{\hat w}-1\}.
\]

The precision of the attribution technique is measured by this set of numbers. The proportion of correct predictions is the proportion of zero ranks; small non-zero ranks mean that the true author was close to being predicted correctly. Another measure of the precision of the technique is the average rank


\[
M = \frac{1}{\sum_{\hat w=0}^{W-1} N_{\hat w}} \; \sum_{\hat w=0}^{W-1} \sum_{\hat n=0}^{N_{\hat w}-1} R_{\hat w}(\hat w, \hat n). \tag{1}
\]
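Continuing the previous sketch, the ranks and the average rank (1) could be computed as follows (scores_for_text is the function from the fragment above):

    def rank_of_true_author(w_hat, n_hat, counts):
        """R: position of the true author when candidates are sorted by L_k
        (0 means the text is attributed correctly)."""
        scores = scores_for_text(w_hat, n_hat, counts)
        ordered = sorted(scores, key=scores.get)      # smaller L_k is better
        return ordered.index(w_hat)

    def average_rank(counts):
        """Average rank M over all texts of all authors, formula (1)."""
        ranks = [rank_of_true_author(w, n, counts)
                 for w, texts in counts.items()
                 for n in range(len(texts))]
        return sum(ranks) / len(ranks)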

We shall also present cross-validation results for the analysis of individual letters (in cases (a) and (b)) and individual grammatical classes (in cases (c) and (d)). All the calculations are the same, but we additionally compute

\[
Q^{k} = \sum_i Q_i^{k} \qquad \text{and} \qquad Q^{\hat w} = \sum_i Q_i^{\hat w},
\]

and the following quantities $G_k(\hat w, \hat n)$ are used instead of $L_k(\hat w, \hat n)$:

\[
G_k(\hat w, \hat n) = - \sum_{i:\; Q_i^{k} > 0} \Bigl( \sum_j Q_{ij}^{\hat w \hat n} \Bigr) \ln \frac{Q_i^{k}}{Q^{k}}
\]
and
\[
G_{\hat w}(\hat w, \hat n) = - \sum_{i:\; Q_i^{\hat w} > 0} \Bigl( \sum_j Q_{ij}^{\hat w \hat n} \Bigr) \ln \frac{Q_i^{\hat w}}{Q^{\hat w}}.
\]

4  Description of the results

Table 1: Full cross-validation of the authorship attribution based on the sequence of letters

 

Case (a): words as they appear in the text; case (b): words in vocabulary form.

R | Case (a), letter pairs | Case (a), individual letters | Case (b), letter pairs | Case (b), individual letters
0 | 282/385 | 27/385 | 240/385 | 12/385
1 | 21/385 | 55/385 | 29/385 | 40/385
2 | 9/385 | 25/385 | 17/385 | 19/385
3 | 5/385 | 24/385 | 9/385 | 16/385
4 | 5/385 | 17/385 | 6/385 | 16/385
≥ 5 | 63/385 | 237/385 | 84/385 | 282/385
M | 3,38 | 12,69 | 4,77 | 17,88

Table 2: Full cross-validation of the authorship attribution based on the sequence of grammatical classes

 

Case (c): generalized grammatical classes; case (d): "complete'' grammatical classes.

R | Case (c), pairs | Case (c), individual | Case (d), pairs | Case (d), individual
0 | 235/385 | 128/385 | 15/385 | 6/385
1 | 31/385 | 43/385 | 21/385 | 12/385
2 | 16/385 | 29/385 | 9/385 | 6/385
3 | 8/385 | 15/385 | 12/385 | 6/385
4 | 11/385 | 17/385 | 19/385 | 8/385
≥ 5 | 84/385 | 153/385 | 309/385 | 347/385
M | 5,43 | 10,13 | 17,76 | 31,93

The results of the research are presented in Tables 1 and 2. The column R corresponds to ranks 0, ..., 4 and to ranks exceeding 4. The row with rank 0 contains the proportion of correctly attributed texts. The row with rank 1 contains the proportion of texts for which the correct author was second among the pretenders, and so on. Finally, the row ≥ 5 contains the proportion of texts for which the correct author was placed no better than sixth.

The row M contains the average rank defined by (1).

Notice first of all that the frequencies of individual letters and individual grammatical classes give a low but significantly non-random level of correct authorship attribution (except for individual "complete'' grammatical classes; the reasons for those poor results remain to be studied).

All calculations with pairs of elements (letters or grammatical classes) give better results than the corresponding calculations with individual letters or grammatical classes.

Let us now examine the difference between the analysis based on letter pairs in words as they appear in the text and that based on letter pairs in the vocabulary forms of words. Comparison shows that the success rate of authorship attribution is better for the natural sequences of letters in the text (73% against 62% of correct predictions, and an average rank of 3,38 against 4,77). Apparently the "equalization'' of the different forms of a word reduced to its vocabulary form eliminates some information useful for authorship attribution, which lowers the success rate in case (b) relative to case (a).

Comparison of the results based on letter pairs and on pairs of generalized ("incomplete'') grammatical classes (cases (a) and (c)) shows that the success rate is quite high in both cases: 73% and 61% of correctly classified texts, respectively. The average rank of the correct author worsens by about 2 from letter pairs to generalized grammatical classes. Perhaps the relatively lower efficiency of pairs of grammatical classes (compared with letter pairs for words as they appear in the text) on the same data set is related to the smaller sample size for grammatical classes (recall that the same corpus contains 96 million letter usages in case (a) and 20 million grammatical class usages in case (c)).

It is also noteworthy that individual grammatical classes in case (c) (see Table 2) are significantly more effective than individual letters: 33% of correct attributions in case (c) against 7% in case (a), and the average rank in case (c) is two and a half times smaller than in case (a). Perhaps this result is related to the more specific information (characterizing more precisely the stable structural characteristics of each author's texts) carried by individual grammatical codes in contrast to individual letters.

"Complete'' grammatical classes turned out to be the least effective for authorship attribution, both for pair frequencies and for individual frequencies. The reasons for this result remain to be studied.

Table 3 gives the details about the authors of the corpus: the number of texts per author, the spread of text sizes with the minimal, average (in brackets) and maximal size, and the results of authorship attribution in all four cases (a)-(d) with the minimal, average (in brackets) and maximal rank of the examined texts with respect to their true author.

The corpus also contains some translated texts (S. Lem, I. Khmelevskaja, B. Rajnov). It is interesting that the quality of attribution of these texts is no worse than for texts written by authors whose mother tongue is Russian (although the corpus contains Lem's texts rendered by different translators).

Table 3: The number and size of texts in the corpus, by author
No. | Author | Texts | Size: min (avg) max | Rank (a): min (avg) max | Rank (b) | Rank (c) | Rank (d)
0 | O. Avramenko | 7 | 223,7 (279,5) 395,1 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 10 (13,9) 22
1 | A. Bol'nykh | 7 | 0,8 (185,0) 298,8 | 0 (4,1) 24 | 0 (5,9) 37 | 1 (12,1) 79 | 4 (15,3) 75
2 | K. Bulychev | 16 | 3,3 (129,5) 458,9 | 0 (6,8) 59 | 0 (6,8) 53 | 0 (9,0) 65 | 3 (16,3) 69
3 | A. Volkov | 9 | 5,2 (186,8) 610,5 | 0 (20,4) 50 | 0 (26,3) 51 | 0 (12,3) 57 | 5 (23,1) 65
4 | G. Glazov | 6 | 184,5 (263,7) 326,1 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 9 (13,2) 19
5 | O. Grinevskij | 2 | 96,1 (127,4) 158,6 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,5) 1
6 | N.V. Gogol' | 4 | 97,7 (213,3) 334,0 | 0 (1,0) 4 | 0 (1,5) 6 | 0 (8,0) 32 | 14 (22,8) 42
7 | N. Gumilev | 2 | 70,1 (70,6) 71,0 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0
8 | F.M. Dostoevskij | 4 | 88,6 (175,5) 268,9 | 0 (0,0) 0 | 1 (2,0) 3 | 0 (0,3) 1 | 10 (13,3) 18
9 | M. and S. Djachenko | 6 | 23,3 (325,1) 553,2 | 0 (0,0) 0 | 0 (0,2) 1 | 0 (0,0) 0 | 7 (11,5) 17
10 | S. Esenin | 2 | 44,6 (131,5) 218,4 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,5) 1
11 | A. Etoev | 6 | 2,7 (57,9) 114,8 | 0 (1,0) 4 | 0 (4,3) 19 | 0 (2,2) 13 | 2 (3,0) 5
12 | I. Efremov | 2 | 256,5 (396,5) 536,5 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (2,0) 3
13 | A. Zhitinskij | 3 | 253,6 (793,2) 1207,6 | 0 (0,3) 1 | 0 (0,0) 0 | 0 (9,0) 26 | 18 (26,3) 42
14 | A. Kabakov | 5 | 69,0 (225,5) 418,4 | 0 (0,0) 0 | 0 (0,2) 1 | 0 (2,6) 11 | 15 (19,2) 26
15 | S. Kazmenko | 5 | 132,8 (400,5) 1148,3 | 0 (1,2) 5 | 0 (0,8) 4 | 0 (0,2) 1 | 16 (21,8) 28
16 | V. Kaplan | 7 | 19,3 (91,9) 305,2 | 0 (4,1) 25 | 0 (5,6) 24 | 0 (5,0) 23 | 9 (23,0) 40
17 | A. Kac | 2 | 81,7 (461,4) 841,0 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (2,5) 4
18 | V. Klimov | 4 | 58,5 (107,5) 179,9 | 0 (7,0) 15 | 0 (7,0) 20 | 0 (1,5) 6 | 3 (6,8) 9
19 | E. Kozlovskij | 2 | 848,6 (868,4) 888,2 | 0 (0,0) 0 | 0 (2,0) 4 | 15 (42,0) 69 | 40 (56,5) 73
20 | I. Krashevskij | 3 | 380,6 (555,2) 803,1 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 6 (8,3) 10
21 | I. Kublickaja | 2 | 170,2 (226,2) 282,3 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (2,0) 3
22 | L. Kudrjavcev | 4 | 108,3 (190,5) 348,2 | 0 (0,3) 1 | 0 (1,5) 5 | 0 (0,0) 0 | 8 (13,5) 24
23 | V. Kunin | 4 | 296,3 (407,9) 610,3 | 0 (0,0) 0 | 0 (2,5) 7 | 0 (3,5) 5 | 10 (15,5) 23
24 | A. Kurkov | 7 | 17,5 (121,0) 276,9 | 0 (3,1) 10 | 0 (11,6) 28 | 0 (1,9) 3 | 4 (11,9) 19
25 | A. Lazarevich | 4 | 11,3 (101,3) 274,7 | 0 (14,3) 47 | 5 (20,3) 54 | 2 (11,0) 18 | 4 (9,3) 15
26 | A. Lazarchuk | 6 | 141,4 (434,2) 786,9 | 0 (0,0) 0 | 0 (0,8) 2 | 0 (0,0) 0 | 19 (25,5) 34
27 | Ju. Latynina | 3 | 116,8 (970,7) 2511,8 | 0 (3,3) 10 | 0 (13,0) 36 | 0 (0,0) 0 | 4 (15,3) 23
28 | S. Lem | 8 | 11,6 (238,6) 535,2 | 0 (0,9) 5 | 0 (1,1) 8 | 0 (5,1) 27 | 11 (26,5) 44
29 | N. Leonov | 3 | 273,1 (282,7) 295,7 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 3 (4,0) 5
30 | S. Loginov | 14 | 1,3 (153,4) 916,2 | 0 (15,9) 36 | 4 (18,1) 37 | 0 (18,9) 49 | 14 (35,6) 59
31 | E. Lukin | 5 | 26,9 (144,6) 367,9 | 0 (3,2) 15 | 0 (0,0) 0 | 0 (4,4) 19 | 8 (16,0) 39
32 | L. and E. Lukiny | 4 | 105,2 (239,9) 564,7 | 0 (0,3) 1 | 2 (3,8) 6 | 0 (0,5) 2 | 4 (12,8) 20
33 | S. Luk'janenko | 15 | 6,0 (277,6) 542,9 | 0 (3,0) 22 | 0 (6,9) 76 | 0 (9,1) 58 | 9 (25,4) 73
34 | N. Markina | 2 | 93,6 (179,8) 266,0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (2,5) 4 | 0 (1,5) 3
35 | A. Melikhov | 2 | 457,6 (536,4) 615,2 | 0 (0,0) 0 | 0 (2,5) 5 | 0 (0,0) 0 | 17 (17,5) 18
36 | V. Mikhajlov | 2 | 84,2 (169,3) 254,5 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (2,5) 4
37 | A. Molchanov | 2 | 206,5 (302,4) 398,3 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,5) 1 | 4 (5,5) 7
38 | V. Nabokov | 6 | 102,0 (310,6) 599,8 | 0 (2,0) 11 | 0 (0,8) 3 | 0 (3,0) 15 | 5 (12,7) 18
39 | M. Naumova | 4 | 5,2 (161,1) 337,8 | 0 (7,8) 31 | 0 (11,8) 47 | 0 (17,8) 69 | 4 (10,0) 17
40 | Ju. Nesterenko | 2 | 71,1 (212,0) 352,8 | 0 (1,0) 2 | 1 (3,5) 6 | 0 (0,0) 0 | 1 (3,5) 6
41 | Ju. Nikitin | 3 | 656,9 (681,4) 702,2 | 0 (11,3) 34 | 0 (17,0) 51 | 0 (0,7) 1 | 5 (6,7) 8
42 | S. Pavlov | 2 | 375,6 (414,5) 453,4 | 0 (0,0) 0 | 0 (0,5) 1 | 0 (0,0) 0 | 6 (6,5) 7
43 | A.S. Pushkin | 2 | 57,1 (113,7) 170,3 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0
44 | B. Rajnov | 5 | 267,7 (363,6) 420,3 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,6) 3 | 12 (13,8) 15
45 | L. Reznik | 2 | 79,6 (97,8) 115,9 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (1,0) 1
46 | N. Rerikh | 4 | 84,5 (305,6) 608,7 | 0 (5,5) 22 | 0 (3,5) 14 | 0 (2,3) 9 | 0 (2,5) 9
47 | N. Romaneckij | 7 | 5,5 (203,2) 530,6 | 0 (5,4) 21 | 0 (7,7) 20 | 0 (1,0) 4 | 15 (27,7) 45
48 | A. Romashov | 2 | 87,7 (88,1) 88,4 | 0 (0,0) 0 | 0 (0,0) 0 | 2 (3,0) 4 | 0 (0,0) 0
49 | V. Rybakov | 7 | 9,7 (119,5) 366,1 | 0 (9,0) 24 | 0 (9,0) 21 | 0 (16,4) 36 | 15 (25,6) 41
50 | M.E. Saltykov-Schedrin | 2 | 101,6 (170,4) 239,1 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (2,5) 4
51 | R. Svetlov | 3 | 29,2 (241,0) 425,4 | 0 (0,0) 0 | 3 (13,3) 20 | 0 (0,7) 2 | 3 (8,7) 17
52 | A. Sviridov | 4 | 13,4 (224,0) 601,5 | 10 (27,5) 65 | 0 (11,8) 41 | 0 (11,0) 44 | 6 (26,5) 56
53 | V. Segal' | 3 | 60,5 (132,0) 259,7 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,3) 1 | 1 (4,7) 8
54 | K. Serafimov | 2 | 75,3 (130,8) 186,4 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (1,0) 1
55 | I. Sergievskaja | 2 | 50,7 (79,8) 108,9 | 0 (0,0) 0 | 0 (1,0) 2 | 0 (0,0) 0 | 0 (0,5) 1
56 | K. Sitnikov | 8 | 13,0 (66,1) 274,3 | 0 (0,0) 0 | 0 (2,3) 18 | 0 (0,5) 4 | 3 (8,3) 22
57 | S. Snegov | 3 | 385,8 (411,1) 438,4 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 5 (5,0) 5
58 | S.M. Solov'ev | 2 | 159,9 (1251,6) 2343,3 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (2,5) 5
59 | A. Stepanov | 6 | 83,7 (219,6) 390,3 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 4 (5,0) 8
60 | A. Stoljarov | 2 | 137,2 (241,9) 346,7 | 0 (5,0) 10 | 9 (11,0) 13 | 0 (7,5) 15 | 1 (2,5) 4
61 | A. and B. Strugackie | 30 | 37,1 (230,4) 579,5 | 0 (1,9) 24 | 0 (2,5) 23 | 0 (5,9) 54 | 15 (36,4) 79
62 | A. Strugackij | 2 | 51,9 (101,6) 151,3 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,5) 1 | 1 (1,0) 1
63 | B. Strugackij | 2 | 260,7 (279,6) 298,4 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 7 (8,0) 9
64 | E. Til'man | 3 | 307,8 (390,0) 464,7 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 11 (11,3) 12
65 | A. Tolstoj | 2 | 97,9 (113,8) 129,7 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0
66 | L.N. Tolstoj | 2 | 199,9 (712,5) 1225,1 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (1,5) 2
67 | D. Truskinovskaja | 9 | 82,6 (235,9) 478,6 | 0 (0,8) 3 | 0 (2,7) 12 | 0 (3,8) 28 | 13 (23,8) 53
68 | A. Tjurin | 19 | 1,3 (222,0) 832,7 | 0 (2,3) 20 | 0 (1,2) 13 | 0 (2,6) 25 | 24 (34,0) 55
69 | E. Fedorov | 2 | 221,3 (667,2) 1113,1 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (1,5) 2
70 | E. Khaeckaja | 3 | 204,1 (309,0) 414,3 | 1 (10,3) 22 | 12 (31,3) 42 | 54 (57,0) 62 | 28 (39,3) 54
71 | D. Kharms | 3 | 13,9 (104,1) 185,5 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (12,0) 29 | 5 (16,7) 30
72 | V. Khlumov | 4 | 183,3 (242,9) 395,5 | 0 (3,8) 15 | 0 (11,3) 38 | 6 (15,3) 38 | 26 (34,0) 46
73 | I. Khmelevskaja | 5 | 203,7 (345,7) 459,1 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,4) 2 | 9 (12,8) 18
74 | V. Chernjak | 3 | 201,6 (373,7) 501,0 | 0 (0,0) 0 | 0 (2,7) 8 | 0 (11,7) 35 | 2 (11,0) 25
75 | A.P. Chekhov | 3 | 247,9 (335,3) 414,5 | 0 (0,0) 0 | 0 (0,0) 0 | 4 (13,3) 20 | 18 (18,7) 20
76 | V. Shinkarev | 3 | 56,2 (78,9) 100,1 | 0 (11,7) 29 | 6 (13,3) 22 | 4 (23,7) 61 | 5 (5,7) 6
77 | V. Shukshin | 2 | 66,7 (187,7) 308,8 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 1 (2,5) 4
78 | S. Scheglov | 3 | 55,2 (103,0) 146,1 | 0 (4,0) 12 | 0 (13,7) 41 | 0 (3,0) 9 | 2 (2,3) 3
79 | A. Schegolev | 3 | 105,6 (318,0) 561,7 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 12 (18,7) 32
80 | V. Jugov | 6 | 66,7 (149,2) 304,3 | 0 (0,5) 2 | 0 (0,7) 2 | 0 (1,8) 10 | 8 (10,7) 18
81 | V. Jan | 2 | 507,3 (553,9) 600,4 | 0 (0,0) 0 | 0 (0,0) 0 | 0 (0,0) 0 | 2 (2,0) 2

5  Conclusion

The main result of the research carried out is that the use of grammatical information in authorship attribution is not only useful but efficient, and is even comparable with the use of information about letter-pair frequencies (the effectiveness of the latter was shown earlier in []).

At the same time it is still surprising that such a seemingly simple unit as a pair of consecutive letters of the text gives more precise results than such language categories as individual grammatical codes and their pairs. Perhaps letter pairs carry, in a transformed and incomplete way, information about the morphemic structure of words as they appear in the text (prefixes, roots, suffixes and inflexions). Thus a lot of information about inflection and word formation is contained in the statistics of letter-pair usage, which leads to the rather high efficiency of letter-pair statistics for authorship attribution.

In other words, letter-pair frequencies reflect the vocabulary used by the author and, by implication, information about the preferred grammatical structures. Although the differences in the usage of particular letter pairs are likely to be non-significant by themselves (since they converge to the average frequencies for the language, as Markov [] noticed long ago), the maximum-likelihood criterion accumulates the "total'' effect of these shifts in letter-pair usage and nevertheless provides high precision in assigning a text to its correct author, as was shown earlier in [] and confirmed in this research by full cross-validation.

However, further experiments with grammatical classes of words and a more precise grammatical analysis would probably lead to a higher success rate in assigning the correct author than was achieved in this research. The usefulness of grammatical information is perhaps reflected in the observation that information on individual generalized grammatical classes is significantly more effective than information on individual letters.

Since the results obtained with different units (letters and generalized grammatical classes) are compatible with each other, one can expect that future, more developed methods of authorship attribution will use different representations of the text, obtained with these units, for mutual cross-validation of the results.

Appendix. Application of Data Compression Algorithms to Authorship Attribution.

This Appendix shows how data compression algorithms can be used for authorship attribution and gives cross-validation results for this new method. We use the corpus of texts used in [] and in the main body of the article.

The corpus of texts in [] comes from 82 authors. One randomly chosen text of each author is held out to make up a test set; the other texts are used as the learning sample. All the control texts were then classified, and the correct author was assigned in 69 cases. To estimate how good this result is in terms of the probability of correct authorship attribution, consider the following hypothetical situation. Suppose we have a black box which, given two texts, outputs 1 if the texts are by the same author and 0 if they are definitely by different authors, and suppose that the level of both type I and type II errors in each trial is p, 0 < p < 1. One can use this black box to assign an anonymous text to its author as follows. Given 82 alternatives and one anonymous text, the black box should produce 81 zeros when the control text is compared with the learning samples of the wrong authors and a single 1 when it is compared with the correct author; all other outcomes are considered misclassifications. Assuming that the results of these pairwise comparisons are independent, the probability of correct classification is (1-p)^82. For the error level p=0,05 we have (1-0,05)^82 ≈ 0,015; p=0,01 corresponds to (1-0,01)^82 ≈ 0,439; and for p=0,005 we obtain (1-0,005)^82 ≈ 0,663. Notice that 69/82 ≈ 0,84, and to surpass the method of Khmelev (2000) in quality of recognition one would need, for example, p=0,001, which gives (1-0,001)^82 ≈ 0,921. Therefore 69 correct classifications in a choice among 82 writers should be considered an extremely good result. With certain reservations this argument also applies to the results in the main text of the article.
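These figures are easy to reproduce with a few lines of Python:

    # Probability that a pairwise "same author?" test with error level p
    # gives all 82 answers right for a single control text, compared with
    # the observed share of correct attributions.
    for p in (0.05, 0.01, 0.005, 0.001):
        print(p, round((1 - p) ** 82, 3))
    print("observed:", round(69 / 82, 2))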

Here by a text we understand a sequence of letters from some alphabet A. Denote by |B| the length of a text B. The concatenation of texts B and A is the sequence S of length |B|+|A| whose first |B| letters coincide with B and whose last |A| letters coincide with A; we write S=B.A.

Let us first give an "ideal'' definition of relative complexity, following the definition of Kolmogorov complexity (see [,]): the relative complexity K(A | B) of a text A with respect to a text B is the length of the shortest program in the binary alphabet which transforms the text B into the text A. Unfortunately, K(A | B) is not computable, and it is not clear how one could use it.

A first approximation to K(A | B) (which, as we shall see, is sufficient for authorship attribution) can be obtained from data compression algorithms. Define the relative complexity C(A | B) of a text A with respect to a text B as follows: compress the text B into a text B' and the text S=B.A into a text S', and put C(A | B)=|S'|-|B'|. This definition is ambiguous, since we have not fixed the data compression algorithm; all the algorithms used in this research are described below.

We apply C(A | B) to authorship attribution as follows. Given texts by n authors, form a test set by holding out one control text U1, ..., Un from each author. The remaining texts of each author are concatenated into the texts T1, ..., Tn.

The predicted author of a text Ui is determined as follows: find the rank Ri of the number C(Ui | Ti) in the set {C(Ui | T1), ..., C(Ui | Tn)}, where ranks take values from 0 to n-1. If the rank Ri equals 0, the control text Ui is assigned correctly.
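A sketch of this procedure in Python, with the standard zlib compressor standing in for the archivers listed below (because of its limited window zlib is only a rough approximation of the programs actually tested; the names are illustrative):

    import zlib

    def compressed_size(data: bytes) -> int:
        """Length of `data` after compression."""
        return len(zlib.compress(data, 9))

    def relative_complexity(a: bytes, b: bytes) -> int:
        """C(A | B) = |compressed(B.A)| - |compressed(B)|."""
        return compressed_size(b + a) - compressed_size(b)

    def predicted_author(u: bytes, training: list) -> int:
        """Index of the training sample T_i that minimizes C(U | T_i)."""
        costs = [relative_complexity(u, t) for t in training]
        return min(range(len(costs)), key=costs.__getitem__)

With a real archiver from Table 4, compressed_size would simply be the size of the archive that the program produces.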

As in [], one can introduce different measures of precision for this method of disputed authorship resolution. For example:

1. The simplest measure is the number of zero ranks Ri.

2. A more general measure is given by the average rank

\[
M = \frac{1}{n} \sum_{i=1}^{n} R_i.
\]

Cross-validation of the different data compression algorithms was carried out on the corpus of texts used in [] and in the main body of this paper. Recall that the corpus consists of 385 texts by 82 writers; the total size of the texts is about 128 million letters. Some pre-processing of the texts was carried out. Firstly, words hyphenated across lines were restored. Further, words beginning with a capital letter were omitted; the remaining words were kept in the order in which they appeared in the text and were delimited by the new-line character. A control text Ui was selected from the texts of each of the n=82 authors, and the other texts of each author were concatenated into the texts Ti, i=1, ..., 82. The size of each Ui was not less than 50-100 thousand letters.

Let us consider lossless data compression algorithms. The most popular ones recently have been Huffman coding, arithmetic coding, the Burrows-Wheeler technique [] and numerous variations of Lempel-Ziv coding []. Some algorithms are aimed specifically at text coding: PPM [] (which uses a Markov model of small order) and DMC [] (dynamic Markov coding). Each algorithm has a huge number of variations and parameters (for example, there exists so-called dynamic Huffman coding, the vocabulary size varies, etc.). Moreover, there are many "mixed'' algorithms, where, for example, the output of PPM compression is additionally compressed by Huffman coding.

All these algorithms are implemented in a huge variety of compression programs, whose number exceeds 150. Different programs implement different data compression algorithms, and further variety arises from the multiple versions of the programs. The programs selected for this research are shown in Table 4.

Table 4: Data compression programs
Program | Author | Algorithm used
1. 7zip version 2.11 | Igor Pavlov | arithmetic coding, LZ + arithmetic coding, PPM
2. arj version 2.60 | RAO Inc. | LZSS + Huffman
3. bsa version 1.9.3 | Sergey Babichev | LZ
4. bzip2 | Julian Seward | Burrows-Wheeler + Huffman
5. compress | Sun Inc. | LZW
6. dmc | Gordon V. Cormack | DMC
7. gzip | Jean-loup Gailly | Shannon-Fano, Huffman
8. ha version 0.999c | Harri Hirvola | sliding-window dictionary + arithmetic coding, finite context model + arithmetic coding
9. huff1 | William Demas (LDS 1.1) | static Huffman
10. lzari | Haruhiko Okumura (LDS 1.1) | LZSS + arithmetic coding
11. lzss | Haruhiko Okumura (LDS 1.1) | LZSS
12. ppm | William Teahan | PPM
13. ppmd5 version F | Dmitry Shkarin | PPM
14. rar version 2.00 | Eugene Roshal | LZ77 variant + Huffman
15. rar version 2.70 | Eugene Roshal | LZ77 variant + Huffman
16. rk version 1.03a | Malcolm Taylor | reduced-offset LZ, PPMZ

Most of these programs, together with their descriptions, can be obtained from the Archiver index2 maintained by Jeff Gilchrist3. The program compress was taken from the SunOS 5.6 operating system. The program dmc is available by ftp4; note that dmc has an option limiting the maximal amount of memory to use, which in this research was set to 100000000 bytes. LDS 1.1 is a lossless data compression sources kit, also available by ftp5. The program ppm is available from the personal page of its author6.

The results of applying these programs are given in Table 5. The last line of Table 5 contains the results of applying Markov chains [] to the same corpus of texts. The computations for the data shown in Table 5 were performed under different operating systems on different types of computers and took about three weeks of non-stop computing.

Table 5: The precision of authorship attribution through using data compression algorithms

 

Program | Number of control texts with rank 0, 1, 2, 3, 4, ≥ 5 | M
7zip399323266,43
arj465272205,16
bsa449311245,30
bzip2385513313,68
compress1211326324,37
dmc364344319,81
gzip504121244,55
ha478133205,60
huff110114425115,37
lzari1754264814,99
lzss1431136020,05
ppm22142134010,39
ppmd546662225,96
rar58111217,22
rarw7132151,44
rk52931174,20
Markov Chains (see [])69 3 2 1 7 2,35

It follows from the data of Table 5 that data compression algorithms assign the correct author to the control text quite often, and are therefore undoubtedly useful. Notice that the program rarw yields even better results than those obtained with Markov chains in []. Although such superiority could be due to statistical error, it is the best result achieved so far.

Perhaps the explanation of such splendid results is as follows: a data compression algorithm adapts well to the control text after processing the texts of the correct author, and the adaptation is not as good when the texts of a different author have been processed. The disadvantage of this method with respect to the Markov chain method [] is that data compression algorithms are less transparent, and in commercial software they are not even available for study. Nevertheless, many of the programs presented in Table 4 have open source code and well-described open algorithms. Further study of those programs could reveal the reasons for the efficiency of the complexity approach to authorship attribution.

Notice that the authorship attribution technique described here does not require any special programs when the correct author is to be chosen among a small number of pretenders. A great advantage of the method is its availability on almost every computer, since most of the compression programs mentioned here are widespread and some of them (like gzip or rar) are implemented for all types of computers and all operating systems.



REFERENCES

 

  1. Khmelev D.V. Using Markov chains for authorship attribution, Vestn. MGU, ser. 9, Filolog., 2000, no. 2, pp. 115-126.
  2. Morozov N.A. Linguistic spectrums, Izv. otd. russkogo jazyka i slovesnosti Imp. akad. nauk, 1915, vol. 20, no. 4.
  3. Markov A.A. On some application of statistical method, Izv.Imp.Akad.Nauk., Ser. 6, 1916, no. 4, pp. 239-242.
  4. Markov A.A. An example of statistical study on the text of "Eugene Onegin'' illustrating the linking of events into a chain, Izv. Imp. akad. nauk, Ser. 6, 1913, no. 3, pp. 153-162.
  5. Ot Nestora do Fonvizina. Novye metody opredelenija avtorstva (From Nestor to Fonvizin. New methods for authorship attribution), Moscow: Progress, 1994.
  6. Holmes D.I. The Evolution of Stylometry in Humanities Scholarship, Literary and Linguistic Computing, 1997, vol. 13, no. 3, pp. 111-117.
  7. Fomenko V.P., Fomenko T.G. Author's Quantitative Invariant for Russian Fiction Texts, Metody kolichestvennogo analiza tekstov narrativnykh istochnikov, Moscow: Inst. Istorii SSSR, 1983, pp. 86-109.
  8. Yaglom A.M. and Yaglom I.M. Verojatnost' i informacija. M.: Nauka, 1960. Translated under the title Probability and Information, Boston: Reidel, 1983.
  9. Dobrushin R.L. Mathematical methods in linguistics, Matematicheskoe prosveschenie, 1959. no.6.
  10. Shannon C.E. and Weaver W. The Mathematical Theory of Communication. Urbana: Univ. of Illinois Press, 1949
  11. Zaliznjak A.A. Grammaticheskij slovar' russkogo jazyka (Grammatical dictionary for Russian language), Moscow: Rus. jaz., 1977.
  12. Grammatika sovremennogo russkogo literaturnogo jazyka (The grammar of contemporary Russian language), Moscow: Nauka, 1970.
  13. Russkaja grammatika (Russian Grammar), 2 vols, Moscow: Nauka. 1980.
  14. Ivchenko G.I., Medvedev Yu.I. Matematicheskaja statistika (Mathematical Statistics) M.: Vyssh. shk., 1992.
  15. Li M., Vitányi P. An Introduction to Kolmogorov Complexity and Its Applications, New York: Springer, 1997.
  16. Kolmogorov A.N. Tri podkhoda k opredeleniju ponjatija "kolichestvo informacii'' (Three approaches to the quantitative definition of information), Probl. peredachi inform, 1965, vol. 1. no. 1, pp. 3-11.
  17. Burrows M. Wheeler D.J. A block-sorting lossless data compression algorithm. Digital SRC Research Report 124. 1994. ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
  18. Lempel A., Ziv J. On the Complexity of Finite Sequences, IEEE Trans. on Inform. Theory., 1976, vol. 22, no. 1, pp. 75-81.
  19. Cleary J.G., Witten I.H. Data Compression Using Adaptive Coding and Partial String Matching, IEEE Trans. on Commun., 1984, vol. 32, no. 4, pp. 396-402.
  20. Cormack G.V., Horspool R.N. Data Compression Using Dynamic Markov Modelling, Computer J., 1987, vol.30, no. 6, pp. 541-550.


Footnotes:

1 Written non-character (non-hieroglyphic) texts are meant, since a character-based written language limits the analysis of consecutive elements: the phonological information that distinguishes morphemes is expressed only very implicitly.

2 http://web.act.by.net/~act/act-index.html

3 Jeff Gilchrist, jeffg@cips.ca

4 http://plg.uwaterloo.ca/~ftp/dmc/

5 ftp://garbo.uwasa.fi/pc/programming/lds_11.zip

6 http://www.cs.waikato.ac.nz/~wjt/

