Relating Zipf’s law to textual information
Zipf’s law is the main regularity of quantitative linguistics. Despite many works devoted to the foundations of this law, it is still unclear whether it is only a statistical regularity, or whether it has deeper relations with information-carrying structures of the text. This question relates to that of distinguishing a meaningful text (written in an unknown system) from a meaningless set of symbols that mimics statistical features of a text. Here we contribute to resolving these questions by comparing features of the first half of a text (from the beginning to the middle) to its second half. This comparison can uncover hidden effects, because the halves have the same values of many parameters (style, genre, author’s vocabulary etc). In all studied texts we saw that for the first half Zipf’s law applies from smaller ranks than in the second half, i.e. the law applies better to the first half. Also, words that hold Zipf’s law in the first half are distributed more homogeneously over the text. These features allow one to distinguish a meaningful text from a random sequence of words. Our findings correlate with a number of textual characteristics that hold in most cases we studied: the first half is lexically richer, has longer and less repetitive words, more and shorter sentences, more punctuation signs and more paragraphs. These differences between the halves indicate a higher hierarchic level of text organization that has so far gone unnoticed in text linguistics. They relate the validity of Zipf’s law to textual information. A complete description of this effect requires new models, though one existing model can account for some of its aspects.
Quantitative and universally applicable relations are rare in social sciences. Each such relation acquires the status of a law and paves the way towards bringing in methods and ideas of natural sciences. This is why Zipf’s law (discovered independently by stenographer Estoup in 1912 estoup () and physicist Condon in 1928 condon (), and later popularized by linguist Zipf zipf (); joos (); wyllis (); baa ()) attracted so much interdisciplinary attention. The law applies both to text mixtures (corpora), and to separate texts written in many natural and artificial alphabetic languages greek (); indian (); moreno (), as well as in Chinese characters epjb (). It states that in a given text the ordered and normalized frequencies $f_1 \geq f_2 \geq \dots$ for the occurrence of the word with rank $r$ hold $f_r \propto r^{-\gamma}$ with $\gamma \approx 1$ wyllis (); baa ().
Rank-frequency relations imply coarse-graining, e.g. since they are invariant with respect to permutation of words. This is one reason why there are many approaches towards deriving this law, but none of them is conclusive about its origin. Existing approaches can be roughly divided into two groups: (i) optimization principles shreider_sharov (); sole (); prokopenko (); dickman (); mandelbrot (); dunaev (); mitra (); manin (); mandel (); arapov (); shrejder (); dover (); vakarin (); liu (); baek (); (ii) statistical approaches li (); simon (); zane (); kanter (); hill (); pre (); latham ().
Note that (i) includes Zipf’s program (which is so far not conclusive prokopenko (); dickman ()) that the language trades off between maximizing the information transfer and minimizing the speaking-hearing effort. The law can also be derived from various generalizations of the maximum entropy method mandel (); arapov (); shrejder (); dover (); vakarin (); liu (); baek (), though the choice of the entropy function to be maximized (and of the relevant constraints) is neither unique nor clear. The general problem of derivations from (i) is that verifying the law for a frequency dictionary (or for a large corpus) does not yet explain it for a concrete text. E.g. if the word frequencies obeying Zipf’s law are deduced from considerations related to the meaning of words manin (), then the applicability to a single text is unclear: since words normally have widely different frequencies in different texts, such a derivation would require a substantial reconsideration of the word’s meaning in each text (which is not the case in real texts).
The first model within (ii) was a random text, where words are generated through random combinations of letters, i.e. the most primitive stochastic process mandelbrot (); li (). Its drawbacks howes (); seb (); cancho () (e.g. many words having the same frequency) are avoided by more refined models simon (); zane (); kanter (); hill (), though they also do not explain the region of rare words (hapax legomena). For instance, the text growth model studied in simon () was later shown to fail in describing the hapax legomena minn (). Note that Zipf’s law cannot be explained via fixed probabilities of words (e.g. estimated from those of a frequency dictionary), since the same word can have widely different probabilities in different texts that obey Zipf’s law arapov (). This drawback is absent in a recent probability model that is based on latent variables and deduces Zipf’s law together with its applicability to a single text and its extension to rare words pre ().
Despite (or even due to) these efforts, there is a major open question orlov (); cancho (); pian (): is Zipf’s law only a statistical regularity, or does it also reflect information-carrying structures of a meaningful text? This question relates to one of the fundamental issues in linguistics: how a text written in an unknown system (e.g. the Voynich manuscript) can be efficiently distinguished from a meaningless collection of words baa ().
Here we contribute to resolving these questions by noting that natural texts evolve from beginning to end. This obviously important notion is absent from the rank-frequency relation, which is invariant with respect to any permutation of words. Thus we divide texts into halves, each one containing the same number of words. This implies a semantic difference: the first half can be understood without the second one, but normally the second half is not easy to understand without the first half. The first part of the text normally contains the exposition (which sometimes can be up to 20 % of the text), where the background information about events, settings, and characters is introduced to readers. The first part also plots the main conflict (open issue), whose denouement (solution) comes in the second half. [Footnote 2: Scientific texts contain closely related aspects: introduction, critique of existing approaches, statement of the problem, resolution of the problem, implications of the resolution etc. The discussion on differences between the halves applies here as well.]
Dividing the text into two halves neutralizes confounding variables that are involved in a complex text-producing process (style, genre, subject, the author’s motives and vocabulary etc), because they are the same in both halves. Hence by comparing the two halves with each other we hope to see regularities that are normally shielded by the above variables. In all texts we studied we noted the following regularities.
(1) For the first half Zipf’s law applies from smaller ranks than in the second half, i.e. the law applies better to the first half than to the second half. The smallest rank is the major limiting factor in the applicability of the law, as shown in section III.
(2) For the first half the words that hold the law are distributed more homogeneously (in the properly quantified sense) along the text; see section V.
These features are specific for meaningful texts and they can be employed for distinguishing meaningful texts from a random collection of words that happens to hold Zipf’s law, e.g. due to one of numerous stochastic mechanisms reviewed above mandelbrot (); li (); mandel (); arapov (); shrejder (); dover (); vakarin (); liu (); baek (); latham (); pre ().
We relate the above results on Zipf’s law to textual information. Deferring a more detailed discussion to section IV, we note that meaningful texts consist of several hierarchic levels: words, phrases, sentences (clauses), paragraphs hutchins (); valgina (); hasan (). We checked whether the first and second halves differ with respect to quantitative characteristics of these levels, and identified several such differences:
(3) The first half is lexically richer (contains more distinct words and more rare words), has longer and less repetitive words, shorter sentences, more punctuation signs and more paragraphs.
In contrast to (1) and (2), some features within (3) hold in most cases, but not strictly in all cases we studied. Despite such minor exceptions, the results suggest that the validity domain of Zipf’s law relates to features of a meaningful text.
This paper is organized as follows. The next section discusses our method of studying Zipf’s law and its validity range. In particular, we explain how the validity range of Zipf’s law behaves under mixing (joining together) two or more texts; see section II.3. Section III compares the halves of a text with respect to the validity range of Zipf’s law and the amount of rare words. These results are illustrated in Fig. 1. Section IV recalls several aspects of textual information known in linguistics, builds on them several straightforward quantitative characteristics, and checks these for the two halves. The results are summarized in Table 1. Section V studies the distribution of words along the text. Here we study this distribution for different halves and relate it to the validity range of Zipf’s law; see Fig. 2. Section VI applies the theory for Zipf’s law and rare words proposed in pre () to describe several aspects of our findings. Here we emphasize that this theory is incomplete. Better theories are yet to be found. We summarize our results in the last section.
II Phenomenology of Zipf’s law and its validity range
II.1 The method of searching for Zipf’s law
We explain how we recover Zipf’s law from the data; see Appendix II and epjb (); pre () for details. For a given text we extract the ordered frequencies of different words ($n$ and $N$ are respectively the number of different words and the overall number of words in the text):

$$\{f_r\}_{r=1}^{n}, \qquad f_1 \geq f_2 \geq \dots \geq f_n, \qquad \sum_{r=1}^{n} f_r = 1. \tag{1}$$

[Footnote 3: There is no universal definition of word muller (); e.g. there is a natural uncertainty on whether to count plurals and singulars as different words. Different definitions of word can produce numerically different results muller (). We mostly work with methods that assume singular and plural to be different words. But we also checked that our qualitative conclusions hold as well when singular and plural are taken to be the same word.]
The data is fit to a power law:

$$\hat f_r = c\, r^{-\gamma}, \qquad r_{\min} \leq r \leq r_{\max}. \tag{2}$$
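Fitting parameters $c$ and $\gamma$ were found by minimizing the sum of squared errors $SS_{\rm err}$; see Appendix II. The fitting quality is judged by the minimized value of $SS_{\rm err}$ and by the coefficient of determination $R^2$, which is the fraction of variation in the data explained by the fit; see Appendix II. Hence $SS_{\rm err} \to 0$ and $R^2 \to 1$ mean good fitting. We minimize $SS_{\rm err}$ over $c$ and $\gamma$ for $r_{\min} \leq r \leq r_{\max}$ and find the minimal $r_{\min}$ and the maximal $r_{\max}$ for which the fit satisfies the precision criteria (3) on $SS_{\rm err}$ and $R^2$.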
These values of $r_{\min}$ and $r_{\max}$ also determine the final fitted values $c^*$ and $\gamma^*$ of $c$ and $\gamma$, respectively; see Fig. 1 and Tables 2, 4. Thus $c^*$ and $\gamma^*$ are found simultaneously with the validity range of the law. For simplicity, we refer to $c^*$ and $\gamma^*$ as $c$ and $\gamma$, respectively. The fitting quality was confirmed via the Kolmogorov-Smirnov (KS) test; see Appendix III.
The above method is standard; it differs from others by the more rigorous criteria (3) and by explicitly accounting for the validity range of the power law (2). The general idea of fitting ranked frequencies (1) to the power law (2) can be rightly criticized on the grounds that the definition of rank is not independent of the frequency; hence a small value of $SS_{\rm err}$ in (3) does not guard against correlated errors of rank and frequency pian (). We stress that several aspects of the above fitting results will be recovered in section V, where the frequency will be given an alternative representation.
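The fitting step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tokenizer is a crude regular expression, the least-squares fit is done in log-log coordinates (the actual procedure is specified in Appendix II and may differ), and the function names `ranked_frequencies` and `fit_power_law` are hypothetical.

```python
import math
import re
from collections import Counter


def ranked_frequencies(text):
    """Return normalized word frequencies in decreasing order (rank 1 first)."""
    # Assumption: words are maximal runs of letters/apostrophes, case-folded.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def fit_power_law(freqs, r_min, r_max):
    """Fit ln f_r = ln c - gamma * ln r over ranks r_min..r_max (inclusive).

    Assumption: ordinary least squares in log-log coordinates.
    Returns (c, gamma, ss_err, r_squared), with errors measured in log space.
    Requires r_max > r_min.
    """
    xs = [math.log(r) for r in range(r_min, r_max + 1)]
    ys = [math.log(freqs[r - 1]) for r in range(r_min, r_max + 1)]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Sum of squared residuals and coefficient of determination.
    ss_err = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    r_squared = 1.0 - ss_err / ss_tot if ss_tot > 0 else 1.0
    return math.exp(intercept), -slope, ss_err, r_squared
```

To locate the validity range, one would scan candidate pairs of `r_min` and `r_max` and keep the widest range whose `ss_err` and `r_squared` pass the chosen thresholds; the thresholds themselves are those of criteria (3) and are not reproduced in this sketch.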
II.2 Single texts
II.2.1 Parameters of Zipf’s law (2): $c$ and $\gamma$
We studied 10 English texts written in different epochs and on different subjects; see Appendix I for their description. We were not able to increase the number of studied texts substantially (e.g. ten times), because for each single text we studied many different features; see Table 1. Studying them for 100 texts would be time-consuming.
For all studied texts we obtain [see Fig. 1]

$$\gamma \approx 1. \tag{4}$$
The exponent came itself, since we only imposed the power law in (2). As for the magnitude of in (2), we have two constraints. First note from Fig. 1 that apart of minor exclusions, (2) is an upper bound for the frequencies at all ranks. This holds more generally wyllis (); baa (); pre (), and implies a lower bound on : . Second, we note that for the obvious constraint for , the power-law (2) is very close to observed frequencies. Hence we have , which leads to an upper bound for . These bounds are consistent with actual values of , which for the studied texts hold ; see Tables 3 and 4.
II.2.2 Minimal rank
Fig. 1 shows that words with ranks $r < r_{\min}$ do not hold Zipf’s law (2). Among them, the most frequent 3-4 words relate to the author, since their frequencies coincide for both halves of the text; see Fig. 1. But other words in the range $r < r_{\min}$ differ between the halves; they do not hold Zipf’s law due to their irregular behavior. Table 3 in Appendix I lists the values of $r_{\min}$ for various texts.
The range $r < r_{\min}$ contains mainly function words. They serve for establishing grammatical constructions (e.g., the, and, a, such, this, that, where, were). [Footnote 4: Functional words do have meaning, but it is a general one, e.g. and refers to joining and unification, while but refers to exclusion.] The majority of words in the Zipfian range $r_{\min} \leq r \leq r_{\max}$ do have a narrow meaning (content words). A subset of content words has a meaning that is specific for the text, i.e. they are key-words of this text. The fact that key-words are located in the Zipfian range was employed for automatic indexing of texts ibm (). A few keywords also appear in the range $r < r_{\min}$, e.g. love and miss for the romance novella DL and god and man for the theological AR; see Table 2. Some keywords are also located in the range $r > r_{\max}$, e.g. eloi for the science fiction TM, but the majority of them are in the range $r_{\min} \leq r \leq r_{\max}$.
We stress that the applicability of Zipf’s law to a single text cannot be explained via text-independent probabilities of words, because even if the same word enters into different texts it typically has quite different frequencies there arapov (), e.g. among 83 common words in the Zipfian ranges of texts AR and DL [see Table 2], only 12 words have approximately equal ranks and frequencies.
II.2.3 Maximal rank
Now $r_{\max}$ is the maximal rank where (2) holds according to criteria (3); see Table 3 for examples. It appears that $r_{\max}$ is in a sense the largest possible rank, because for $r > r_{\max}$ no smooth rank-frequency relation (including (2)) is expected to work, due to words having the same frequency. Put differently, Zipf’s law cannot hold for $r > r_{\max}$, because there the rank-frequency relation consists of steps: many words have the same frequency baa (); see Fig. 1.
The empirical value of $r_{\max}$ appeared to be such that the number of words having frequency $f_{r_{\max}}$ is . The absolute majority of different words with ranks in $[r_{\min}, r_{\max}]$ have different frequencies; see Fig. 1. It is only in the vicinity of $r_{\max}$ that words having the same frequency start to appear. Now the quality of fitting (defined by (3)) does not depend on small changes of $r_{\max}$, but it is sensitive to small changes of $r_{\min}$. Hence for simplicity we fixed $r_{\max}$ such that the number of words with frequency $f_{r_{\max}}$ is .
We thus emphasize that the status of $r_{\max}$ is different from that of $r_{\min}$, though they both determine the applicability domain of the smooth rank-frequency relation (2).
II.2.4 Hapax legomena
In the rank-frequency relation, a sizable number of words appear only very few times (hapax legomena). These rare words amount to a finite fraction of $n$ (i.e. the number of different words). The existence and the (large) number of rare events is not peculiar to texts, since there are statistical distributions that can generate samples with a large number of rare events baa (); see section VI. One reason why many rare words should appear in a meaningful text is that a typical sentence contains functional words (which come from a small pool), but it also has to contain some rare words, which then necessarily have to come from a large pool latham (). [Footnote 5: E.g. this sentence contains rare words typical and pool that in the present text are met only 3 and 2 times, respectively. It also contains frequent words words, since, large.] Though rare words cannot be described by a smooth rank-frequency relation (including Zipf’s law), their distribution is closely related to the proper Zipf’s law pre (); see section VI.
II.3 Mixing of several texts
Mixing (joining together) two texts is a standard procedure in quantitative linguistics. Most of our knowledge on rank-frequency relations is verified on corpora, i.e. large mixtures of many texts; see willi () for a recent discussion. We shall now follow in detail how $r_{\min}$ and $r_{\max}$ behave under mixing.
where and are different texts, and means the text got by mixing them. The main reason for increasing is that the number of different words raises upon mixing two different texts, and then raises as well, because it is determined by a condition , as we saw above. Hence also increases sizably [see Table 3]:
Eqs. (5, 6) are expected if Zipf’s law is a statistical regularity. They ensure the applicability of Zipf’s law to corpora, where (2) applies to most ranks. [Footnote 6: In the context of text mixing, we note that Ref. willi () shows that mixing together large corpora brings in, for sufficiently large ranks, an additional scaling regime for the frequency versus rank, with an exponent sizably different from one. This second regime is naturally limited by the ranks where the rare words appear (hapax legomena). The Zipfian regime is thus confined to sufficiently small ranks. We note that this new scaling regime emerges only for mixing of large corpora from different authors and on different topics, whose length exceeds that of an average text. Hence the second critical regime is not relevant for single texts, which are the focus of our investigation. This situation is somewhat similar to Chinese texts epjb (), where sufficiently small texts hold Zipf’s law, but mixtures of already several texts do not hold this law for large ranks, even before the regime of rare words sets in epjb ().]
However, the behavior of is less expected [see Table 3]:
Hence for certain texts $r_{\min}$ can increase under mixing, i.e. limit the applicability of Zipf’s law for small ranks. Whether $r_{\min}$ will increase or decrease under mixing depends on the texts. At any rate, the change of $r_{\min}$ does not have any significant influence on the increase of $r_{\max}$ in (5).
Let us now return to the precision of Zipf’s law. We applied strict criteria (3) to obtaining its validity range. One can also look for weaker precision measures, e.g. that measures how the overall frequency of the Zipfian range is approximated by Zipf’s law; cf. (1, 2). Table 3 shows that is sufficiently small so that the applicability of the law is warranted. But can both increase and decrease upon mixing two texts; see Table 3.
Summarizing all the arguments, we can say that overall the validity range of Zipf’s law tends to increase under mixing. Hence the hope of Refs. orlov (); arapov () that Zipf’s law applies more precisely to a single text than to text mixtures does not hold. This conclusion is not an automatic consequence of Zipf’s law, and is specific to alphabetic writing systems; e.g. relatively short texts written in Chinese characters do hold Zipf’s law in the above sense, but their mixtures do not epjb (). But at least for alphabetic texts, the law also holds for text corpora, and hence reflects statistical regularities. The question is whether it reflects only statistical regularities. Mixing is not appropriate for answering this question. Below we shall do the opposite, i.e. divide a text into two halves.
| Characteristic | First half | Second half |
|---|---|---|
| Minimal rank $r_{\min}$ of Zipf’s law; see section II.1 | – | + |
| Maximal frequency of the law; see (9) | + | – |
| Spatial homogeneity of words that hold Zipf’s law | + | – |
| Number of different words | + | – |
| Number of rare words (absolute and relative) | + | – |
| Normalized prefactor of Zipf’s law; see (33) | – | + |
| Repetitiveness of words (Yule’s constant); see (21) | – | + |
| Number of punctuation signs | + | – |
| Number of letters | + | – |
| Average length of words | + | – |
| Number of sentences | + | – |
| Average length of sentence | – | + |
| Entropy and variance of sentence length distribution in words | – | + |
| Number of paragraphs | + | – |
| Size in bytes | + | – |
| Compressibility of the size | | |
| Number of functional words | | |
| Exponent of Zipf’s law | | |
| Maximal rank of Zipf’s law | | |
III Dividing the text into two halves
III.1 Validity range of Zipf’s law for each half
We divided the studied texts into two halves along the flow of the narrative, i.e. from the beginning to the end. Several aspects are left unchanged: the halves are still sufficiently large for statistics to apply, and they have the same overall number of words, the same author, genre etc. They are different semantically, since the first half can be understood without the second half, but the second half generally cannot be understood alone. Also, the structure of the narrative is different: the first half normally contains the exposition, where actors, situations and conflicts are set and defined, while the second half normally contains the denouement; cf. Footnote 2.
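Our first observation is that in all texts we studied the rank $r_{\min}$, where Zipf’s law starts, is smaller for the first half than for the second half [see Table 1 for a qualitative summary of our results, and Tables 2, 3 and 4 for numeric data]:

$$r_{\min\,1} < r_{\min\,2}, \tag{8}$$

where the subscripts 1 and 2 refer to the first and second half, respectively.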
We recall that the fitting quality of Zipf’s law depends strongly on $r_{\min}$. It does not depend much on $r_{\max}$, because the latter approximately coincides with a rank where any smooth rank-frequency relation ceases to hold due to many words with the same frequency. Eq. (8) is consistent with (7), if the two halves [ and in (7)] are regarded as different texts that produce the full text when taken together. There is another relation closely related to (8) [see Table 4]
which shows that in the first half Zipf’s law applies from larger frequency values than in the second half; cf. (2, 4). Inequality (9) holds for all studied texts, with one exception among the 10 texts. We tolerate such exceptions, taking into account various subjective factors that influence the text formation process.
Recall that the words with ranks $r < r_{\min}$ are of two types; cf. the second paragraph after (4). The few most frequent ones are author-specific, since they have the same frequency in both halves; see Fig. 1. The remaining words with ranks below $r_{\min}$ are not author-specific. Their number is smaller in the first half of the text compared to the second half. This is the origin of (8, 9); see Fig. 1.
We stress that (8, 9) do not hold for a random selection of half of the words. As expected, in that case the sign of the difference in $r_{\min}$ between the two halves changes erratically from one text to another, while the magnitude of this difference is smaller than for the natural division into halves. Also, none of the other differences between the two halves (discussed below) holds if the division is done randomly.
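The two ways of dividing a text that are contrasted above can be sketched as follows. This is a minimal illustration, assuming a crude regular-expression tokenizer and hypothetical function names; the random division keeps the original word order inside each half, which is one possible reading of "a random selection of half of the words".

```python
import random
import re


def tokenize(text):
    """Assumption: words are maximal runs of letters/apostrophes, case-folded."""
    return re.findall(r"[a-z']+", text.lower())


def natural_halves(words):
    """Split along the narrative: first half, then second half (equal sizes)."""
    half = len(words) // 2
    # An odd trailing word is dropped so that both halves have the same length.
    return words[:half], words[half:2 * half]


def random_halves(words, seed=0):
    """Assign word positions to two equal halves at random (control division)."""
    rng = random.Random(seed)
    indices = list(range(len(words)))
    rng.shuffle(indices)
    half = len(words) // 2
    first = sorted(indices[:half])
    second = sorted(indices[half:2 * half])
    return [words[i] for i in first], [words[i] for i in second]
```

Any statistic of interest can then be computed on both kinds of halves; a difference that survives only under `natural_halves` is tied to the order of the narrative rather than to word frequencies alone.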
Table 1 shows that the parameter $c$ of Zipf’s law (2) does not show any systematic behavior between the halves. [Footnote 7: E.g. for text OC [see Table 2], but for TS.] Likewise, the power-law exponent $\gamma$ in (2, 4) is generally close to $1$; it differs slightly between the halves, but without a systematic trend.
Since the range $r < r_{\min}$ contains many functional words, we checked whether the two halves differ from each other in the number of functional words. No systematic differences were found between the halves. We also divided the functional words into different categories (conjunctions, pronouns, determiners) and checked each category separately, with the same negative result.
III.2 Rare words
To describe the amount of rare words, we conventionally define a word as rare if it appears in the text at most 3 times. Denote by $n_{\rm rare}$ the number of such words in a given text. We selected the threshold 3 so as to get a robust comparison: since there is no universal definition of word, different methods of counting can lead to different results; cf. Footnote 3. [Footnote 8: We found that the number of words that appear strictly once does not show a regular behavior across the halves.]
For all studied texts we observed [see Table 2]:

$$n_{1\,{\rm rare}} > n_{2\,{\rm rare}}, \tag{10}$$

where $n_{1\,{\rm rare}}$ ($n_{2\,{\rm rare}}$) is the number of words that appear at most 3 times in the first (second) half of a given text. Eq. (10) suggests that the first half uses more rare words, but such a conclusion is incomplete, since the two halves have different numbers of distinct words. Denote them as $n_1$ and $n_2$ for the first and second half, respectively. We saw a more refined criterion that again holds for all studied texts [see Table 2]:

$$\frac{n_{1\,{\rm rare}}}{n_1} > \frac{n_{2\,{\rm rare}}}{n_2}. \tag{11}$$
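The rare-word counts compared above are straightforward to compute. The sketch below assumes the text has already been split into a list of word tokens; the function name `rare_word_stats` is hypothetical, and taking the ratio to the number of distinct words is one natural reading of the "relative" comparison.

```python
from collections import Counter


def rare_word_stats(words, threshold=3):
    """Count distinct words that occur at most `threshold` times.

    Returns (n_rare, n_distinct, n_rare / n_distinct).
    """
    counts = Counter(words)
    n_distinct = len(counts)
    n_rare = sum(1 for c in counts.values() if c <= threshold)
    fraction = n_rare / n_distinct if n_distinct else 0.0
    return n_rare, n_distinct, fraction
```

Applying the function to each half separately gives the absolute and relative numbers of rare words, which can then be compared between the halves.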
IV Textual information
IV.1 Quantifiers of textual information
We want to relate (8–11) with textual features developed qualitatively in linguistics; see hutchins () for a good review, valgina () for a textbook and hasan () for a monograph presentation. First of all, we recall that a text is a hierarchic construct, i.e. it consists of several autonomous levels: words, phrases, clauses, sentences, paragraphs etc. [Footnote 9: Refs. lrc_ebeling (); lrc_eckmann (); lrc_altmann () recently mentioned hierarchic features of text in the context of long-range correlations between words and letters found in real texts.] [Footnote 10: Autonomous means that features of a level are constrained by, but cannot be completely deduced from, those of lower levels.] [Footnote 11: Sometimes, they can coincide, i.e. a clause can coincide with a sentence, and there can be a one-paragraph sentence.]
The first level is that of words. Neglecting phenomena of synonymy and homonymy (which are rare in English, but not at all rare e.g. in Chinese epjb ()), we can say that every word has several closely related meanings (polysemy). Neglecting also the difference between polysemic meanings, the number of independent meanings in a text can be estimated via the number of different words. A further distinction between words of the text can be made via their average length (in letters): content words, which express specific meaning, are normally longer than functional words, which mostly serve for establishing grammatical connections zipf ().
The next level is that of clauses, which sometimes can (e.g. in simple texts) coincide with a sentence. A clause joins several phrases, where the (polysemic) meaning of separate words is clarified. Moreover, in clauses there are new means of expressing meaning. Within many clauses one can identify two types of phrases hutchins (); hasan (): themes present information that is already known from the preceding text; rhemes could not be inferred by the reader and thus amount to new information. [Footnote 12: As an example, take the preceding sentence: Moreover, in clauses there are new means of expressing meaning. Here "moreover" is a textual theme, since it relates to previous parts of the text; "in clauses" is a topical theme, and the rest of the sentence (that contains the verb form) is the rheme.] Clauses joined in one sentence normally have the same theme. [Footnote 13: Theme and rheme are different from, respectively, subject and predicate, which are grammatical constructs. But still in many English sentences, the subject and theme coincide hasan ().] Sometimes, clauses of the next sentence employ the rheme of the previous sentence as their theme. Hence the number of clauses will indicate the new information contained in the text hutchins (). We shall estimate the number of clauses in a text by calculating the number of punctuation signs in the text. [Footnote 14: This is not absolutely precise, because there are clauses that are connected into a sentence without any punctuation sign; e.g. they can be connected without a comma, though colons, hyphens and semi-colons normally indicate a clause connection. Commas can be employed without clause connections, e.g. when listing items or emphasizing a part of a sentence that is not a clause. But in other cases they do indicate a clause connection, e.g. when a comma is put before but, and, or when marking indirect speech. Also recalling that we compare two parts of the same text, we believe that in regular texts the overall number of punctuation signs does correlate with the number of clauses.]
A paragraph joins several sentences with closely related themes. [Footnote 15: Instead of the paragraph, linguists frequently look at segments or complex syntactic units valgina (). The main difference between such constructs and the paragraph is that the latter is more dependent on a specific writing style adopted by the text’s author. For our purposes this difference is not important, since we compare with each other different halves of the same text that are written by the same author, and (normally) within the same style.] Among these sentences there is one (or a few) that are autosemantic hutchins (); valgina (), i.e. they can be taken out of the text, and they still retain their full meaning. Frequently, the autosemantic sentence is the first one in the paragraph, as is the case with the present paragraph. The majority of sentences in a paragraph are semantically dependent, i.e. they revolve around the autosemantic sentences, detailing their meaning and/or providing further information on them hutchins (). For this higher level, autosemantic (semantically dependent) sentences are analogues of theme (rheme). Thus a paragraph can also serve as a (higher-level) unit of textual meaning. Hence the number of paragraphs of a text is a relevant descriptor of textual information. In this context note that there are automatic routines that allow one to fragment a given text into topically homogeneous parts marti ().
Results presented below show that there is another (in a sense highest) hierarchic level of the text organization that was so far not noted by linguists. It amounts to differences between the first and second halves of the text; in particular, the first half contains more themes and more autosemantic sentences than the second half. [Footnote 16: In this context we recall results from the language processing literature showing that inter-sentence correlations between words get stronger for sentences that are located deeper inside of a paragraph charniak (). Let be a random variable denoting the ’th word that appears in the sentence of the paragraph. For example, in the previous sentence, which is the second sentence of the present paragraph, “let” is a value assumed by . Values of are denoted as . We also define for words of the same sentence, and let denote all words that appear in sentences . Given a sufficiently long text, one can estimate joint probabilities and hence calculate the mutual-conditional information , where are conditional probabilities. Now determines correlations between and given (i.e. conditioned upon) . Ref. charniak () shows that has a general trend of increasing with for a fixed . However, note that some of the results in charniak () contradict the literature cancho_debo (), e.g. the information constancy statement that the conditional entropy is constant as a function of contradicts the verified Hilberg’s conjecture cancho_debo (); debo ().]
IV.2 Comparison between two halves of a text
We found that above characteristics can distinguish different halves of the same text, see Tables 1, 2 and 4. However, for some of these distinctions there are exceptions — in contrast to (8, 10, 11), where we did not see exceptions. They are rare in the sense that for 10 studied texts we got at most one exception.
The number of different words is larger in the first half ; see Table 2 171717One can anticipate here possible relations with Herdan-Heap’s law () that relates the number of different words in a text with the overall number of words . Finding a relation between Zipf’s law and Herdan-Heap’s law was attempted in kornai (). .
The total number of punctuation signs (where we included full points, colons and semi-colons, commas, question marks and exclamation points) is also larger in the first half: ; see Table 2.
The total number of letters employed is larger in the first half: ; see Table 2. Hence, the average length of words (in letters) is also larger in the first half.
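The counts compared above (different words, punctuation signs, letters) can be obtained with a few lines of code. The following is only a minimal sketch under stated assumptions: the tokenization rules, the function name `half_statistics` and the splitting of the text at the middle whitespace-separated token are illustrative choices, not the procedure used for Table 2.

```python
import re

# Punctuation signs counted in the text: full points, colons, semi-colons,
# commas, question marks and exclamation points.
PUNCTUATION = ".:;,?!"

def half_statistics(text):
    """Return basic counts for the first and the second half of a text."""
    tokens = text.split()
    middle = len(tokens) // 2  # assumption: split at the middle token
    stats = []
    for tokens_half in (tokens[:middle], tokens[middle:]):
        joined = " ".join(tokens_half)
        # Assumption: a word is a maximal run of letters, case is ignored.
        words = re.findall(r"[^\W\d_]+", joined.lower())
        stats.append({
            "words": len(words),
            "different_words": len(set(words)),
            "punctuation": sum(joined.count(mark) for mark in PUNCTUATION),
            "letters": sum(len(word) for word in words),
        })
    return stats

first, second = half_statistics("Hello, world! Hello again. Bye bye; bye.")
```

The average word length of a half is then `letters / words`.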
We calculated the full distribution of sentences over the length (measured in words): the fraction of sentences with word-length (). Three specific characteristics of this distribution are worth looking at: the average , dispersion and entropy :
Dispersion quantifies deviations from the average, while the entropy measures uncertainty: it is minimal for deterministic probabilities and maximal for homogeneous ones.
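As an illustration of these three characteristics, the sketch below computes them from the empirical distribution of sentence lengths. It rests on assumptions that are not taken from the source: sentences are split naively at full points, question marks and exclamation points, the dispersion is taken to be the variance, and the entropy uses the natural logarithm.

```python
import math
import re
from collections import Counter

def sentence_length_stats(text):
    """Return (average, dispersion, entropy) of sentence lengths in words."""
    # Naive sentence splitter (an assumption made for illustration only).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.split()]
    lengths = [len(s.split()) for s in sentences]
    total = len(lengths)
    # Fraction of sentences having each word-length.
    probs = {length: count / total for length, count in Counter(lengths).items()}
    mean = sum(length * p for length, p in probs.items())
    # Assumption: dispersion is the variance of the sentence length.
    dispersion = sum((length - mean) ** 2 * p for length, p in probs.items())
    entropy = -sum(p * math.log(p) for p in probs.values())
    return mean, dispersion, entropy

mean, dispersion, entropy = sentence_length_stats(
    "One two three. Four five six. Seven eight nine ten eleven."
)
```

If all sentences have the same length, the distribution is deterministic and the entropy vanishes, in line with the remark above.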
Now Table 4 shows that all these characteristics are smaller in the first half:
In addition, Table 4 shows that the number of sentences in the first half is larger. This result is consistent with both and , taking into account that both halves have the same number of words.
The number of paragraphs is larger in the first half (again with one exception): ; see Table 2.
Altogether these features make intuitive sense, since they show that the first half contains more themes and more autosemantic sequences than the second half.
Note that applying notions of information compression does not indicate a robust difference between the two halves. Initially, the first half is larger in bytes, which is natural given that its words are longer on average; see Table 2. We compressed each half via zip, Lempel-Ziv and several other standard routines. If the second half compressed more than the first half (in absolute or relative units), we would conclude that the second half has less information in Shannon’s sense. However, we did not observe any indication that the second half is compressed more than the first half.
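A compression check of this kind can be sketched as follows. The example uses the DEFLATE routine from Python's standard `zlib` module as one possible stand-in for the compressors mentioned above; the function name `compressed_sizes` and the way the text is split are assumptions made for illustration.

```python
import zlib

def compressed_sizes(text):
    """Return (raw bytes, compressed bytes, ratio) for each half of a text."""
    words = text.split()
    middle = len(words) // 2  # assumption: halves contain equal numbers of words
    halves = (" ".join(words[:middle]), " ".join(words[middle:]))
    result = []
    for half in halves:
        raw = half.encode("utf-8")
        packed = zlib.compress(raw, 9)  # level 9: strongest compression
        result.append((len(raw), len(packed), len(packed) / len(raw)))
    return result

sizes = compressed_sizes("alpha beta gamma delta " * 200)
```

Comparing the second and third entries of the two tuples corresponds to the comparison in absolute and relative units, respectively.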
Table 2 shows that the discussed features (e.g. and ) are normally close to each other for the halves of the same text, e.g. . Similar points were noted in baek () and stated as translation invariance of text features. However, we stress that even small differences between the halves can be systematic.
V Spatial distribution of words
V.1 Spatial frequency versus ordinary frequency
Let us now turn to features that reflect the distribution of words along the text. Studying this spatial distribution of words is traditional for quantitative linguistics zipf (); yngve (). More recently, Refs. ortuno (); pury (); carpena (); zano () investigated the spatial distribution of key-words versus functional words. The conclusion reached is that key-words are distributed less homogeneously ortuno (); pury (); carpena (); zano (). Here we employ the spatial distribution of words in the context of Zipf’s law.
Let denote all occurrences of a word along the text. Let denote the number of words (different from ) between and . Define the average period of this word via
The averaging is conceptually meaningful only for sufficiently frequent words, though formally (15) is always well-defined. The inverse of this average quantity is the space-frequency :
Now if a sufficiently frequent word is distributed homogeneously, then we expect , where is the ordinary frequency, and where we assume that and . Hence the difference can tell how the distribution of deviates from the homogeneous one.
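The comparison between the space-frequency and the ordinary frequency can be sketched in code. The exact normalization of (15, 16) is not reproduced here; as an assumption, the period is taken to be the mean number of intervening words plus one, so that a perfectly periodic word has equal space-frequency and ordinary frequency. The function name and the `min_count` parameter are hypothetical.

```python
from collections import defaultdict

def space_and_ordinary_frequencies(words, min_count=2):
    """Map each word to (space-frequency, ordinary frequency)."""
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)
    total = len(words)
    result = {}
    for word, where in positions.items():
        if len(where) < min_count:
            continue  # gaps are undefined for a word that occurs only once
        # Number of other words between consecutive occurrences of the word.
        gaps = [b - a - 1 for a, b in zip(where, where[1:])]
        mean_gap = sum(gaps) / len(gaps)
        # Assumption: period = mean gap + 1; the source may normalize differently.
        space_frequency = 1.0 / (mean_gap + 1.0)
        result[word] = (space_frequency, len(where) / total)
    return result

periodic = space_and_ordinary_frequencies(["a", "b", "c"] * 10)
clustered = space_and_ordinary_frequencies(["a"] * 4 + ["b"] * 16)
```

For the periodic sequence both frequencies of each word coincide, while for the clustered sequence the space-frequency of the word `a` exceeds its ordinary frequency, which signals an inhomogeneous distribution.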
Fig. 2 shows the ranked frequencies for the two halves of the text AR. For each word we show its (average) space-frequency from (16) together with the (ordinary) frequency (1). We stress that in Fig. 2 the rank is defined via the ordinary frequency ordering. Note that roughly follows the general pattern of Zipf’s law.
We see that for frequent words of each text (i.e. AR1 or AR2) the difference is small. This range of frequent words includes the initial part of the Zipfian range ; cf. Fig. 2 with Fig. 1. It also includes all words with ranks , i.e. those frequent words that do not hold Zipf’s law. Hence frequent words are distributed more homogeneously zano ().
But starts to grow already in the initial part of the Zipfian range . Fig. 2 shows that this growth is larger for the second half of AR. To quantify this point, we defined normalized frequencies for the words in the Zipfian range:
Once these frequencies are properly normalized within the Zipfian range, we can quantify the distance between them via one of standard definitions of probability distances. We choose the variational distance:
Now Table 2 shows that for all studied texts there is a difference between the halves:
i.e. in the first half the words of the Zipfian domain are distributed more homogeneously.
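The distance used in this comparison can be sketched as follows. The inputs are assumed to be lists ordered by rank (position 0 holds rank 1), and the rank bounds `r_min` and `r_max` of the Zipfian range are supplied by the caller. The factor 1/2 is the common convention for the variational distance and is an assumption here; the definition (18) may differ by a constant factor.

```python
def normalize(values):
    """Rescale non-negative values so that they sum to one."""
    total = sum(values)
    return [value / total for value in values]

def zipfian_range_distance(space_freqs, ordinary_freqs, r_min, r_max):
    """Variational distance restricted to ranks r_min..r_max (1-based, inclusive)."""
    p = normalize(space_freqs[r_min - 1:r_max])
    q = normalize(ordinary_freqs[r_min - 1:r_max])
    # Assumption: conventional factor 1/2, which keeps the distance in [0, 1].
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

same = zipfian_range_distance([4, 2, 1], [4, 2, 1], 1, 3)
opposite = zipfian_range_distance([9, 1, 0, 9], [9, 0, 1, 9], 2, 3)
```

Normalizing inside the chosen range, rather than over all ranks, mirrors the restriction to the Zipfian range described above.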
Note that the restriction to the Zipfian range in (17) is crucial for the validity of (19). For instance, if we extend the definition to ranks (i.e. everywhere) relation (19) does not hold anymore, i.e. the sign of changes erratically from one text to another. This is an important point, because so far the Zipfian range was defined via strict, but still conventional criteria (3). Eq. (19) shows that this definition does capture an important feature of real texts.
V.2 Yule’s constant
The above result (words in the Zipfian range are distributed more homogeneously in the first half than in the second half) needs corroboration. To this end, we looked at Yule’s constant baa (). Define to be the number of words that (in a fixed text) appear times. We get two obvious features:
where and are, respectively, the number of different words and the number of all words, and is the number of times the most frequent word appears in the text. Note that for a sufficiently small , is either zero or one. For instance, in the first half AR1 of AR, , , , , etc.
Take a word that appears times in a text with length . Now is the probability that a randomly taken word in the text will be . Likewise, is the probability that the second randomly taken word in the text will again be . Both probabilities refer to a word that appears times. The probability to take such a word among distinct words of the text is . [Footnote 18: Hence one can define the entropy that characterizes the inhomogeneity of distribution of distinct words. In contrast to Yule’s constant, the difference of this entropy calculated between two halves of the same text changes erratically from one text to another. This entropy was employed in cohen_fractal () for distinguishing between natural and artificial texts.] Thus the average is a measure of repetitiveness of words. Yule’s constant employs this quantity without the factor , since it is intended to depend only weakly on baa (). For us this feature is not important, since we compare the halves of a text. Following the tradition, we also omit the factor , but we stress that including it does not change our conclusions. Using (20) and , Yule’s constant reads baa ()
where is a conventional factor we applied to keep ; see Table 2.
We employ for comparing two halves of the same text with respect to the repetitiveness of words. Table 2 shows that apart from one exception we get . In this sense words in the second half repeat more frequently. This is consistent with previous findings, i.e. (the first half has more different words) and that the first half is more fragmented, in the sense of a larger number of paragraphs and a more inhomogeneous distribution of the sentence length. It is also consistent with the fact established in section V.1, viz. frequent words are distributed more homogeneously in the first half.
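For orientation, a widely used textbook form of Yule's constant can be computed as in the sketch below. This form, with the conventional scale factor 10^4, is an assumption: the exact expression (21) and the factor chosen in this work are not reproduced here.

```python
from collections import Counter

def yule_constant(words, scale=1e4):
    """Textbook Yule's K: scale * (sum_m m^2 V_m - N) / N^2."""
    counts = Counter(words)
    total = sum(counts.values())  # overall number of words N
    # spectrum[m] = number of distinct words that appear exactly m times.
    spectrum = Counter(counts.values())
    second_moment = sum(m * m * v for m, v in spectrum.items())
    return scale * (second_moment - total) / (total * total)

no_repeats = yule_constant(["a", "b", "c", "d"])
only_repeats = yule_constant(["a"] * 4)
```

The value is zero when no word repeats and grows with repetitiveness, so a larger constant for the second half would indicate that its words repeat more.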
VI Theoretical description
VI.1 Reminder of the model
Below we show that a statistical physics theory of Zipf’s law proposed in pre () can describe some of the above effects. We stress that this description is incomplete, e.g. the theory cannot explain why the applicability range of Zipf’s law is larger specifically in the first half. This is not surprising, since the theory is based on purely statistical mechanisms of text generation, i.e. it does not account for semantic issues. But the theory confirms (10, 11) and predicts new and useful relations that hold for the halves. The theory starts with the following assumptions.
Given different words , the joint probability for to occur times in a text is assumed to be multinomial (the bag-of-words model madsen ())
where is the length of the text (overall number of words), is the number of occurrences of , and is the probability of . According to (22) the text is regarded to be a sample of word realizations drawn independently with probabilities .
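The bag-of-words assumption can be made concrete with a small simulation. The sketch below draws words independently with fixed probabilities, which is what the multinomial form amounts to; the choice of probabilities proportional to the inverse rank, the vocabulary size and the text length are illustrative assumptions and are not taken from the model of pre ().

```python
import random
from collections import Counter

def sample_bag_of_words(probabilities, length, seed=0):
    """Draw `length` word indices independently with the given probabilities."""
    rng = random.Random(seed)
    vocabulary = list(range(len(probabilities)))
    return rng.choices(vocabulary, weights=probabilities, k=length)

# Hypothetical word probabilities proportional to 1/rank.
n_words = 1000
weights = [1.0 / rank for rank in range(1, n_words + 1)]
norm = sum(weights)
theta = [w / norm for w in weights]

text = sample_bag_of_words(theta, 5000)
counts = Counter(text)  # occurrence numbers of the sampled words
```

Because every position is sampled independently, the two halves of such a text are statistically equivalent, which is one way to see why a purely statistical model cannot single out the first half.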
Eq. (22) is incomplete, because it implies that each word has the same probability in different texts. In contrast, the same words do not occur with the same frequencies in different texts.
To improve this point we make a random vector with a text-dependent density hof (); lazar (). With this assumption the variation of the word frequencies from one text to another will be explained by the randomness of the word probabilities. Since was introduced to explain the relation of with , it is natural to assume that the triple form a Markov chain: the text influences the observed only via . Then the probability of in a given text reads
Physically, refers to annealed disorder.
The text-conditioned density is generated from a prior density via conditioning on the ordering of in :
If different words of are ordered as with respect to the decreasing frequency of their occurrence in (i.e. is more frequent than ), then if , and otherwise.
Next, we assume what physically amounts to an ideal gas: the probabilities are distributed identically and the dependence among them is due to only:
where is the delta function and the normalization ensuring is omitted. Eq. (26) is a postulate that Ref. pre () motivated by relating it to the mental lexicon (store of words) of the author. The factor in (26) is necessary, since is not normalizable for . Here will be related to the prefactor of Zipf’s law (2).
The meaning of (26, 27) is that the frequency and the period (i.e., word choosing and period choosing) have the same distribution; cf. section V.1, where we saw that the inverse period and the ranked frequency roughly follow the same relation. [Footnote 19: This feature holds for all densities with . We have chosen (26), where and , since it already provides a good fit to empiric data. This feature holds as well for , which is however not normalizable, and hence cannot be employed.]
Now note the following feature of real texts
where is the number of different words, while is the total number of words in the text. Eq. (28) is verified for all texts we studied.
where the effective probability is found from
If is sufficiently large (which is the case in the Zipfian domain), , the word with rank appears in the text many times and its frequency is close to its maximally probable value ; see (29). Hence the frequency can be obtained via the probability . Then (30) is Zipf’s law generalized by the factor at high ranks . This cut-off factor ensures faster [than ] decay of for large . In the Zipfian range and (30) reverts to Zipf’s law.
According to (29), the probability is small for and hence the occurrence number of the word with the rank is a small integer (e.g. 1 or 2) that cannot be approximated by a continuous function of . To describe this hapax legomena range, define as the rank, when jumps from integer to (hence the number of words that appear times is ). Since reproduces well the trend of even for , can be theoretically predicted from (30) by equating its left-hand-side to :
Eq. (31) is exact for , and agrees with for with a small relative error pre (). Hence a single formalism describes both Zipf’s law for short texts and the hapax legomena range. For describing the hapax legomena no new parameters are needed; it is based on the same parameters that appear in Zipf’s law.
VI.2 Applying the model to the halves
The number of words that appear times is expressed as in terms of the above parameter . Using (31) we provide a theoretical estimate for this quantity times
Note from (25) that the parameter characterizes the width of the prior word density, i.e. a smaller means that more low-probable words are involved. An important observation is that for all texts we analyzed this ratio is smaller in the first half [see Table 4]:
This result is consistent with the above model [see (26)], since it heuristically predicts that the first half involves more rare words.
Our aim was to relate Zipf’s law (1, 2, 4) to meaning-carrying features of the text. First we had to understand in which specific sense this law is a statistical regularity (roughly akin to the law of large numbers). To this end, we studied the validity range of the law upon taking together two different texts; see section II.3. The validity range does increase, but this increase happens mostly due to low-frequency words. Next, in section III each text was divided into two halves. This allows us to uncover hidden relations, since various confounding variables (genre, style, the author’s vocabulary etc) are the same in both halves. For the first half, Zipf’s law applies from smaller ranks, and its validity range covers more frequent words.
On the other hand, we uncovered several textual features that are different between the halves; see Table 1. In section IV we reviewed several basic notions of text linguistics, e.g. theme vs. rheme in clauses, and autosemantic sentences of paragraphs; see hutchins () for a more detailed review. We argued that quantitative differences between the halves (shown in Table 1) imply that the first half of the text has more thematic information than the second half. We suggest that Zipf’s law is related to the presence and amount of this information.
For describing the above relations of Zipf’s law we need new models, since the existing statistical and optimization models do not account for meaning-carrying features of texts; see section I for a review. However, the statistical model developed in pre () can be applied for confirming some of the observed empiric relations and for predicting new ones; see (33). This fact was demonstrated in section VI.
Relations discussed in the first group of Table 1 only require that the text consists of well-defined (but possibly unknown) words; no further structure of the text is needed. These are: the minimal rank (and the maximal frequency) of Zipf’s law, homogeneity of the spatial distribution for the Zipfian words, the number of different words, the number of rare words, the normalized prefactor of Zipf’s law. These features can be employed for finding out whether a sequence of word-like symbols (written in an unknown system) constitutes a text. Several other regularities are known in the literature that distinguish between a text and a random string of words. Ref. lande () found that texts are compressed better in their natural ordering of words than after sufficiently many random permutations of words. Ref. debo () argues that the scaling behavior of the mutual information between different long blocks of a text can indicate its difference from a random text.
The remaining relations we found (see the second part of Table 1) do require that finer-grained text structures are known and available, e.g. that words consist of letters (which is not true in non-alphabetic writing systems) or that the text is divided into sentences and paragraphs (not the case in cryptic texts) etc. These features are important, because they uncover a textual structure that goes well beyond paragraphs and chapters. Though they cannot be directly employed for the above task of text recognition, there are possible relations between some features from the first versus the second group. We submit them for future consideration.
– Recall the relation between rare words and frequent words that goes via sentences: each sentence normally contains both frequent words and rare words; cf. Footnote 5. Hence we expect a relation between larger number of sentences (in the first half) and the following two facts. First, Zipf’s law starts from smaller ranks and larger frequency; see (8, 9). Second, there are more rare words in the first half; cf. (10, 11).
– More punctuation signs in the first half obviously correlate with shorter sentences; see Table 1. We also expect that longer sentences in the second half correlate with a larger word repetitiveness, as quantified by Yule’s constant (21). Also, there can be a direct connection between the spatial homogeneity of Zipfian words seen in section V.1 and the number of sentences.
– We anticipate a relation between the smaller dispersion and entropy of the sentence distribution (on the one hand) and the larger number of paragraphs (on the other hand).
W. Deng was partially supported by the Program of Introducing Talents of Discipline to Universities under grant no. B08033, and National Natural Science Foundation of China (Grant No. 11505071). A.E. Allahverdyan was supported by SCS of Armenia, grant 18RF-015.
Appendix I: Numeric data
Total number of words is , the number of different words is , is the lower rank of the Zipfian domain, is the number of punctuation signs, is the number of letters, is the number of words that appear fewer than 4 times, is Yule’s constant, is the normalized distance, and is the number of paragraphs. The lower indices, e.g. in and , refer to the first and second halves, respectively. Underlined pairs are atypical with respect to the halves, e.g. everywhere besides TF we get .
|AR and DL||5542||27||518||0.171||1.059||0.00219|
|AR and TF||5586||36||643||0.158||1.054||0.00196|
|AR and TM||6451||34||586||0.155||1.047||0.00392|
|DL and TF||6288||31||482||0.178||1.075||0.00267|
|DL and TM||6668||39||473||0.156||1.051||0.00159|
|TF and TM||7071||38||550||0.180||1.082||0.00256|
Appendix II: Linear fitting
For each text we extract the ordered frequencies of different words [the number of different words is ; the overall number of words in a text is ]:
We should now see whether the data fits to a power law: . We represent the data as
and fit it to the linear form . Two unknowns and are obtained from minimizing the sum of squared errors:
It is known since Gauss that this minimization produces
where we defined
As a measure of fitting quality one can take:
This is however not the only relevant quality measure. Another (more global) aspect of this quality is the coefficient of correlation between and :
For the linear fitting (37) the squared correlation coefficient is equal to the coefficient of determination,
the amount of variation in the data explained by the fitting. Hence and mean good fitting.
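The fitting procedure of this appendix can be sketched as an ordinary least-squares fit in log-log coordinates. The names `gamma` (exponent) and `c` (prefactor), and the restriction of the fit to a rank window `r_min..r_max`, are notational assumptions of this sketch rather than a transcription of the equations above.

```python
import math

def fit_power_law(frequencies, r_min, r_max):
    """Fit ln f_r = ln c - gamma * ln r over ranks r_min..r_max (1-based, inclusive).

    Returns (gamma, c, ss_err, r_squared).
    """
    ranks = range(r_min, r_max + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(frequencies[r - 1]) for r in ranks]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # least-squares slope
    intercept = y_mean - slope * x_mean  # least-squares intercept
    # Minimal sum of squared errors and coefficient of determination.
    ss_err = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    r_squared = 1.0 - ss_err / ss_tot
    return -slope, math.exp(intercept), ss_err, r_squared

# Exact power law with exponent 1 and prefactor 0.1 as a sanity check.
exact = [0.1 / r for r in range(1, 101)]
gamma, c, ss_err, r_squared = fit_power_law(exact, 5, 50)
```

For data that follow a power law exactly, the fit recovers the exponent and prefactor, the squared error vanishes and the coefficient of determination equals one, up to rounding.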
Appendix III: Kolmogorov-Smirnov (KS) test
We wanted to have an alternative method for checking the quality of the above least-squares method. To this end we applied the Kolmogorov-Smirnov (KS) test to our data on the word frequencies. The empiric results on word frequencies in the Zipfian range are fit to a power law. With the null hypothesis that the empiric data follow the numerical fittings, we calculated the maximum differences (test statistics) and the corresponding p-values (between empiric data and numerical fitting) in the KS tests. Here are typical numbers for 3 texts: , ; , ; , . One sees that all the test statistics are quite small, while the p-values are much larger than 0.1. We conclude that from the viewpoint of the KS test the numerical fittings and theoretical results can be used to characterize the empiric data in the Zipfian range reasonably well.
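The test statistic of such a comparison can be sketched as the maximal difference between two cumulative distributions over ranks. How exactly the empiric and fitted data were arranged for the test is not stated above, so the construction below (both frequency lists normalized within the Zipfian range and accumulated over rank) is an assumption; the p-value is not computed in this sketch.

```python
from itertools import accumulate

def ks_statistic(empirical, fitted):
    """Maximal absolute difference between two cumulative distributions."""
    if len(empirical) != len(fitted):
        raise ValueError("both lists must cover the same ranks")
    e_total = sum(empirical)
    f_total = sum(fitted)
    # Cumulative distributions over rank, each normalized to end at 1.
    e_cdf = accumulate(value / e_total for value in empirical)
    f_cdf = accumulate(value / f_total for value in fitted)
    return max(abs(a - b) for a, b in zip(e_cdf, f_cdf))

identical = ks_statistic([3, 2, 1], [3, 2, 1])
disjoint = ks_statistic([1, 0], [0, 1])
```

A statistic close to zero means that the fitted curve tracks the empiric frequencies throughout the chosen range.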
- (1) J.-B. Estoup, Gammes sténographiques (Paris, Institut Sténographique, 1912).
- (2) E.U. Condon, Science, 67, 300 (1928).
- (3) G.K. Zipf, The psycho-biology of language (MIT Press, Cambridge, MA, 1935).
- (4) M. Joos, Language, 12, 196 (1936).
- (5) L.E. Wyllis, Library Trends, 30, 53 (1981).
- (6) H. Baayen, Word frequency distribution (Kluwer Academic Publishers, 2001).
- (7) B. Mandelbrot, Fractal geometry of nature (W. H. Freeman, New York, 1983).
- (8) N. Hatzigeorgiu, G. Mikros, and G. Carayannis, Journal of Quantitative Linguistics, 8, 175 (2001).
- (9) B.D. Jayaram and M.N. Vidya, Journal of Quantitative Linguistics, 15, 293 (2008).
- (10) I. Moreno-Sanchez, F. Font-Clos, and A. Corral, PLoS ONE 11 e0147073 (2016).
- (11) W. Deng, A. E. Allahverdyan, B. Li and Q. A. Wang, Eur. Phys. J. B, 87, 47 (2014).
- (12) Yu.A. Shrejder and A.A. Sharov, Systems and Models (Moscow, Radio i Svyaz, 1982) (In Russian).
- (13) R. Ferrer-i-Cancho and R. Solé, PNAS, 100, 788 (2003).
- (14) M. Prokopenko, N. Ay, O. Obst and D. Polani, JSTAT, P11025 (2010).
- (15) R. Dickman, N.R. Moloney and E.G. Altmann, JSTAT, P12022 (2012).
- (16) V. Dunaev, Aut. Doc. Math. Linguistics, 14 (1984).
- (17) B. Corominas-Murtra, J. Fortuny and R.V. Sole, Phys. Rev. E 83, 036115 (2011).
- (18) D. Manin, Cognitive Science, 32, 1075 (2008).
- (19) B. Mandelbrot, An information theory of the statistical structure of language, in Communication theory, ed. by W. Jackson (London, Butterworths, 1953).
- (20) M.V. Arapov and Yu.A. Shrejder, in Semiotics and Informatics, v. 10, p. 74 (Moscow, VINITI, 1978).
- (21) Yu.A. Shrejder, Prob. Inform. Trans. 3, 57 (1967).
- (22) Y. Dover, Physica A 334, 591 (2004).
- (23) E.V. Vakarin and J. P. Badiali, Phys. Rev. E 74, 036120 (2006).
- (24) C.-S. Liu, Fractals 16, 99 (2008).
- (25) S.K. Baek, S. Bernhardsson and P. Minnhagen, New Journal of Physics 13, 043004 (2011).
- (26) G.A. Miller, Am. J. Psyc. 70, 311 (1957). G.A. Miller and E.B.Newman, Am. J. Psyc. 71, 209 (1958). W.T. Li, IEEE Inform. Theory, 38, 1842 (1992).
- (27) H.A. Simon, Biometrika 42, 425 (1955).
- (28) D.H. Zanette and M.A. Montemurro, J. Quant. Ling. 12, 29 (2005).
- (29) I. Kanter and D.A. Kessler, Phys. Rev. Lett. 74, 4559 (1995).
- (30) B.M. Hill, J. Am. Stat. Ass. 69, 1017 (1974). H.S. Sichel, ibid., 70, 542 (1975). G. Troll and P. Beim Graben, Phys. Rev. E 57, 1347 (1998). A. Czirok, H.E. Stanley and T. Vicsek, ibid. 53, 6371 (1996). K.E. Kechedzhi, O.V. Usatenko and V.A. Yampol’skii, ibid. 72 (2005).
- (31) A. E. Allahverdyan, W. Deng, and Q. A. Wang, Phys. Rev. E 88, 062804 (2013).
- (32) L. Aitchison, N. Corradi, P. E. Latham, PLoS Comput. Biol. 12, e1005110 (2016).
- (33) D. Howes, Am. J. Psyc. 81, 269 (1968).
- (34) S. Bernhardsson, S.K. Baek and P. Minnhagen, J. Stat. Mech. P07013 (2011).
- (35) R. Ferrer-i-Cancho and B. Elvevåg, PLoS ONE, 5, 9411 (2010).
- (36) V. Yngve, IRE Transactions on Information Theory, 2, 106 (1956).
- (37) M. Ortuno, P. Carpena, P. Bernaola-Galvan, E. Munoz, and A.M. Somoza, Europhysics Letters, 57, 759 (2002).
- (38) J.P. Herrera and P.A. Pury, European Physical Journal B, 63, 135 (2008).
- (39) P. Carpena, P. Bernaola-Galvan, M. Hackenberg, A.V. Coronado and J.L. Oliver, Phys. Rev. E 79, 035102 (2009).
- (40) M.A. Montemurro and D. Zanette, Adv. Comp. Syst. 13, 135 (2010).
- (41) A. Cohen, R. N. Mantegna, and S. Havlin, Fractals 5, 95 (1997).
- (42) S. Bernhardsson, L. E. Correa da Rocha, P. Minnhagen, Physica A 389, 330 (2010).
- (43) J.K. Orlov, Naucno-techniceskaja informacija (Serija 2), 8, 11 (1970) (In Russian).
- (44) S.T. Piantadosi, Psych. Bull. Rev. 21, 1112 (2014).
- (45) J. R. Williams, J. P. Bagrow, C. M. Danforth, and P. S. Dodds, Phys. Rev. E 91, 052811 (2015).
- (46) A.D. Booth, Information and Control, 10, 386 (1967). B.M. Hill, J. Am. Stat. Ass. 65, 1220 (1970).
- (47) H.P. Luhn, IBM J. Res. Devel. 2, 159 (1958).
- (48) W. J. Hutchins, Journal of Informatics, 1, no.1, 17 (1977).
- (49) N.S. Valgina, Theory of the text. Textbook (Logos, Moscow, 2003) (Russian Edition).
- (50) M. A. K. Halliday and R. Hasan, Language, Context and Text; Aspects of Language in a Social-semiotic Perspective (Oxford, Oxford University Press, 2nd edition, 1989).
- (51) M.A. Hearst, Computational Linguistics, 23, 33 (1997).
- (52) T. Hofmann, Probabilistic Latent Semantic Analysis, in Uncertainty in Artificial Intelligence, 1999.
- (53) P.F. Lazarsfeld and N.W. Henry, Latent Structure Analysis (Boston, Houghton Mifflin, 1968).
- (54) W. Ebeling and A. Neiman, Physica A 215, 233 (1995).
- (55) E. Alvarez-Lacalle, B. Dorow, J.-P. Eckmann, and E. Moses, PNAS, 103 (2006).
- (56) E. G. Altmann, G. Cristadoro, and M. Degli Esposti, PNAS, 109, 11582 (2012).
- (57) R.E. Madsen, D. Kauchak and C. Elkan, Modeling word burstiness using the Dirichlet distribution, in Proc. Intl. Conf. Machine Learning, 2005.
- (58) Ch. Muller, Initiation a la statistique linguistique (Press. Univ. Paris, 1968).
- (59) D.V. Lande, A.A. Snarskii, On the role of autocorrelations in texts, arXiv:0710.0225.
- (60) L. Debowski, Chaos, 21, 037105 (2011).
- (61) A. Kornai, Glottometrics 4, 61 (2002).
- (62) D. Genzel and E. Charniak, Variation of Entropy and Parse Trees of Sentences as a Function of the Sentence Number, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 65-72.
- (63) R. Ferrer-i-Cancho, L. Debowski, and F.M. del Prado Martín, Journal of Statistical Mechanics: Theory and Experiment, L07001 (2013).