Multi-user guesswork and brute force security

Mark M. Christiansen and Ken R. Duffy
Hamilton Institute
National University of Ireland Maynooth
Email: {mark.christiansen, ken.duffy}
Flávio du Pin Calmon and Muriel Médard
Research Laboratory of Electronics
Massachusetts Institute of Technology
Email: {flavio, medard}

The Guesswork problem was originally motivated by a desire to quantify computational security for single user systems. Leveraging recent results from its analysis, we extend the remit and utility of the framework to the quantification of the computational security of multi-user systems. In particular, assume that $V$ users independently select strings stochastically from a finite, but potentially large, list. An inquisitor who does not know which strings have been selected wishes to identify $U$ of them. The inquisitor knows the selection probabilities of each user and is equipped with a method that enables the testing of each (user, string) pair, one at a time, for whether that string had been selected by that user.

Here we establish that, unless $U = V$, there is no general strategy that minimizes the distribution of the number of guesses, but in the asymptote as the strings become long we prove the following: by construction, there is an asymptotically optimal class of strategies; the number of guesses required in an asymptotically optimal strategy satisfies a Large Deviation Principle with a rate function, which is not necessarily convex, that can be determined from the rate functions of optimally guessing individual users’ strings; if all users’ selection statistics are identical, the exponential growth rate of the average guesswork as the string-length increases is determined by the specific Rényi entropy of the string-source with parameter $(V-U+1)/(V-U+2)$, generalizing the known $V = U = 1$ case; and that the Shannon entropy of the source is a lower bound on the average guesswork growth rate for all $U$ and $V$, thus providing a bound on computational security for multi-user systems. Examples are presented to illustrate these results and their ramifications for systems design.

I Introduction

F.d.P.C. sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, recommendations, and conclusions are those of the authors and are not necessarily endorsed by the United States Government. Specifically, this work was supported by Information Systems of ASD(R&E). M.M. and K.D. were supported in part by a Netapp faculty fellowship.

The security of systems is often predicated on a user or application selecting an object, a password or key, from a large list. If an inquisitor who wishes to identify the object in order to gain access to a system can only query each possibility, one at a time, then the number of guesses they must make in order to identify the selected object is likely to be large. If the object is selected uniformly at random using, for example, a cryptographically secure pseudo-random number generator, then the analysis of the distribution of the number of guesses that the inquisitor must make is trivial.

Since the earliest days of code-breaking, deviations from perfect uniformity have been exploited. For example, it has long been known that human-user selected passwords are highly non-uniformly selected, e.g. [1], and this forms the basis of dictionary attacks. In information theoretic security, uniformity of the string source is typically assumed on the basis that the source has been compressed. Recent work has cast some doubt on the appropriateness of that assumption by establishing that fewer queries are required to identify strings chosen from a typical set than one would expect by a naïve application of the asymptotic equipartition property. This arises by exploitation of the mild non-uniformity of the distribution of strings conditioned to be in the typical set [2].

If the string has not been selected perfectly uniformly, but with a distribution that is known to the inquisitor, then the quantification of security is relatively involved. Assume that a string, $W$, is selected stochastically from a finite list, $\mathbb{A} = \{0, \ldots, m-1\}$. An inquisitor who knows the selection probabilities, $P(W = w)$ for all $w \in \mathbb{A}$, is equipped with a method to test one string at a time and develops a strategy, $G: \mathbb{A} \mapsto \{1, \ldots, m\}$, that defines the order in which strings are guessed. As the string is stochastically selected, the number of queries, $G(W)$, that must be made before it is identified correctly is also a random variable, dubbed guesswork. Analysis of the distribution of guesswork serves as a natural measure of computational security in brute force determination.

In a brief paper in 1994, Massey [3] established that if the inquisitor orders his guesses from most likely to least likely, then the Shannon entropy of the random variable $W$ bears little relation to the expected guesswork $E(G(W)) = \sum_{w \in \mathbb{A}} G(w) P(W = w)$, the average number of guesses required to identify $W$. Arikan [4] established that if a string, $W_k$, is chosen from $\mathbb{A}^k$ with i.i.d. characters, again guessing strings from most likely to least likely, then the moments of the guesswork distribution grow exponentially in $k$ with a rate identified in terms of the Rényi entropy of the characters,

$$\lim_{k \to \infty} \frac{1}{k} \log E\left(G(W_k)^\alpha\right) = \alpha R\left(\frac{1}{1+\alpha}\right) \quad \text{for } \alpha > 0,$$

where $R((1+\alpha)^{-1})$ is the Rényi entropy of $W_1$ with parameter $(1+\alpha)^{-1}$. In particular, the growth rate of the average guesswork is the Rényi entropy with parameter $1/2$, a value that is lower bounded by Shannon entropy.
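As a numerical illustration of this asymptotic, the following sketch computes the exact average guesswork for strings of i.i.d. Bernoulli characters guessed from most likely to least likely, and compares its growth rate with the Rényi entropy with parameter 1/2. The function names and the restriction to a binary alphabet are assumptions made for illustration only; they do not come from the cited works.

```python
import math

def renyi_entropy(probs, beta):
    """Renyi entropy in nats of a distribution, for beta > 0 and beta != 1."""
    return math.log(sum(p ** beta for p in probs)) / (1.0 - beta)

def expected_guesswork_bernoulli(p, k):
    """Exact E[G(W_k)] for k i.i.d. Bernoulli(p) characters when strings
    are guessed from most likely to least likely."""
    q = 1.0 - p
    # All strings with j ones share the probability p^j q^(k-j), so only
    # k+1 classes need to be ordered rather than 2^k strings.
    classes = sorted(
        ((p ** j) * (q ** (k - j)), math.comb(k, j)) for j in range(k + 1)
    )
    classes.reverse()  # most likely class first
    guessed = 0        # number of strings ranked before the current class
    expectation = 0.0
    for prob, count in classes:
        # The class occupies ranks guessed+1, ..., guessed+count.
        rank_sum = count * guessed + count * (count + 1) // 2
        expectation += prob * rank_sum
        guessed += count
    return expectation

if __name__ == "__main__":
    p = 0.3
    limit = renyi_entropy([p, 1.0 - p], 0.5)
    for k in (10, 50, 200):
        rate = math.log(expected_guesswork_bernoulli(p, k)) / k
        print(k, rate, limit)
```

The printed rate approaches the limit from below as the string length increases, with a gap that shrinks roughly like $\log(k)/k$.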

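Arikan’s result was subsequently extended significantly beyond i.i.d. sources [5, 6, 7], establishing its robustness. In the generalized setting, specific Rényi entropy, the Rényi entropy per character, plays the rôle of Rényi entropy. In turn, these results have been leveraged to prove that the guesswork process $\{k^{-1} \log G(W_k)\}$ satisfies a Large Deviation Principle (LDP), e.g. [8, 9], in broad generality [10]. That is, there exists a lower semi-continuous function $I: [0, \log(m)] \mapsto [0, \infty]$ such that for all Borel sets $B$ contained in $[0, \log(m)]$

$$-\inf_{x \in B^\circ} I(x) \le \liminf_{k \to \infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B\right) \le \limsup_{k \to \infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B\right) \le -\inf_{x \in \bar{B}} I(x), \tag{1}$$

where $B^\circ$ denotes the interior of $B$ and $\bar{B}$ denotes its closure. Roughly speaking, this implies $dP(k^{-1} \log G(W_k) \approx x) \approx \exp(-k I(x))\,dx$ for large $k$. In [10] this LDP, in turn, was used to provide direct estimates on the guesswork probability mass function, $P(G(W_k) = n)$ for $n \in \{1, \ldots, m^k\}$. These deductions, along with others described in Section IV, have developed a quantitative framework for the process of brute force guessing a single string.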

In the present work we address a natural extension in this investigation of brute force searching: the quantification for multi-user systems. We are motivated by both classical systems, such as the brute force entry to a multi-user computer where the inquisitor need only compromise a single account, as well as modern distributed storage services where coded data is kept at distinct sites in a way where, owing to coding redundancy, several, but not all, servers need to be compromised to access the content [11, 12].

II Summary of contribution

Assume that $V$ users select strings independently from $\mathbb{A}^k$. An inquisitor knows the probabilities with which each user selects their string, is able to query the correctness of each (user, string) pair, and wishes to identify any subset of size $U$ of the strings. The first question that must be addressed is what is the optimal strategy, the ordering in which (user, string) pairs are guessed, for the inquisitor. For the single user system, since the earliest investigations [3, 4, 13, 14] it has been clear that the strategy of ordering guesses from the most to least likely string, breaking ties arbitrarily, is optimal in any reasonable sense. Here we shall give optimality a specific meaning: that the distribution of the number of guesses required to identify the unknown object is stochastically dominated by all other strategies. Amongst other results, for the multi-user guesswork problem we establish the following:

  • If $U < V$, the existence of optimal guessing strategies, those that are stochastically dominated by all other strategies, is no longer assured.

  • By construction, there exist asymptotically optimal strategies as the strings become long.

  • For asymptotically optimal strategies, we prove a large deviation principle for their guesswork. The resulting large deviations rate function is, in general, not convex and so this result could not have been established by determining how the moment generating function of the multi-user guesswork distribution scales in string-length.

  • The non-convexity of the rate function shows that, if users’ string statistics are distinct, there may be no fixed ordering of weakness amongst users. That is, depending on how many guesses are made before the users’ strings are identified, the collection of users whose strings have been identified are likely to be distinct.

  • If all strings are chosen with the same statistics, then the rate function is convex and the exponential growth rate of the average guesswork as string-length increases is the specific Rényi entropy of the string source with parameter $(V-U+1)/(V-U+2)$.

  • For homogeneous users, from an inquisitor’s point of view, there is a law of diminishing returns for the expected guesswork growth rate in excess number of users ($V-U$).

  • For homogeneous users, from a designer’s point of view, coming full circle to Massey’s original observation that Shannon entropy has little quantitative relationship to how hard it is to guess a single string, the specific Shannon entropy of the source is a lower bound on the average guesswork growth rate for all $V$ and $U$.

These results generalize both the original guesswork studies, where $U = V = 1$, as well as some of the results in [13, 15] where, as a wiretap model, the case $U = 2$ and $V = 2$ with one of the strings selected uniformly, is considered and scaling properties of the guesswork moments are established. Interestingly, we shall show that this setting is one where the LDP rate function is typically non-convex, so while results regarding the asymptotic behavior of the guesswork moments can be deduced from the LDP, the reverse is not true. To circumvent the lack of convexity, we prove the main result using the contraction principle, Theorem 4.2.1 [9], and the LDP established in [10], which itself relies on earlier results of work referenced above.

Fig. 1: Strings created from i.i.d. letters are selected from a binary alphabet with probability $p$ for one character. Given an inquisitor wishes to identify $U$ of $V$ strings, the left panel shows the average exponential guesswork growth rate as a function of $V-U$, the excess number of guessable strings; the right panel shows the theoretically predicted approximate average guesswork for $168$ bit strings, as used in triple DES, as a function of $V-U$, the excess number of guessable strings.

III The impact of the number of users on expected guesswork growth rate, an example

As an exemplar that illustrates the reduction in security that comes from having multiple users, in the left panel of Figure 1 the average guesswork growth rate for an asymptotically optimal strategy is plotted for the simplest case, a binary alphabet with i.i.d. Bernoulli string sources. In order to be satisfied, the inquisitor wishes to identify $U$ of the $V$ strings. The x-axis shows the excess number of guessable strings, $V-U$, and the y-axis is the growth rate of the expected guesswork in string length. If the source is perfectly uniform (i.e. characters are chosen with a Bernoulli $1/2$ process), then the average guesswork growth rate is maximal and unchanging in $V-U$. If the source is not perfectly uniform, then the growth rate decreases as the number of excess guessable strings increases, with a lower bound of the source’s Shannon entropy.

For a string of length $168$ bits, as used in the triple DES cipher, and a Bernoulli source, the right panel in Figure 1 displays the impact that the change in this exponent has, approximately, on the average number of guesses required to determine $U$ strings. More refined results for a broader class of processes can be found in later sections, including an estimate on the guesswork distribution.
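The qualitative behaviour described for Figure 1 can be explored with a short sketch. It assumes that, for homogeneous users, the average guesswork growth rate is the Rényi entropy with parameter $(V-U+1)/(V-U+2)$, which reduces to the known parameter $1/2$ for a single user, and it uses the leading-order estimate $2^{168 \times \text{rate}}$ for the average guesswork. The function names and the example values of $p$ are illustrative assumptions, not the values used to produce the figure.

```python
import math

def renyi_entropy_bits(p, beta):
    """Renyi entropy in bits of a Bernoulli(p) character; beta = 1 is Shannon."""
    q = 1.0 - p
    if abs(beta - 1.0) < 1e-12:
        return -sum(x * math.log2(x) for x in (p, q) if x > 0)
    return math.log2(p ** beta + q ** beta) / (1.0 - beta)

def growth_rate_bits(p, excess):
    """Assumed growth rate of the average guesswork, in bits per character,
    when the inquisitor has `excess` = V - U surplus users to choose from."""
    beta = (excess + 1) / (excess + 2)
    return renyi_entropy_bits(p, beta)

def approx_log2_guesswork(p, excess, bits=168):
    """log2 of the leading-order estimate of the average guesswork."""
    return bits * growth_rate_bits(p, excess)

if __name__ == "__main__":
    for p in (0.5, 0.4, 0.25):   # hypothetical source parameters
        shannon = renyi_entropy_bits(p, 1.0)
        rates = [round(growth_rate_bits(p, n), 4) for n in range(6)]
        print(p, rates, "Shannon bound:", round(shannon, 4))
```

For a uniform source the rate stays at one bit per character however many surplus users there are, while for a non-uniform source it falls towards the Shannon entropy as the excess grows.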

The rest of this paper is organized as follows. In Section IV, we begin with a brief overview of results on guesswork that we have not touched on so far. Questions of optimal strategy are considered in Section V. Asymptotically optimal strategies are established to exist in Section VI and results for these strategies appear in Section VII. In Section VIII we present examples where string sources have distinct statistics. In Section IX we return to the setting where string sources have identical statistics. Concluding remarks appear in Section X.

IV A brief overview of guesswork

Since Arikan’s introduction of the long string length asymptotic, several generalizations of its fundamental assumptions have been explored. Arikan and Boztas [16] investigate the setting where the truthfulness in response to a query is not certain. Arikan and Merhav [17] loosen the assumption that the inquisitor needs to determine the string exactly, assuming instead that they only need to identify it within a given distance. That the inquisitor knows the distribution of words exactly is relaxed by Sundaresan [18], [19] and by the authors of [20].

Motivated by a wiretap application, the problem of multiple users was first investigated by Merhav and Arikan [13] in the $V = 2$ and $U = 2$ setting, assuming one of the users selects their string uniformly on a reduced alphabet. In [21] Hayashi and Yamamoto extend the results in [13] to the case where there is an additional i.i.d. source correlated to the first, used for coding purposes, while Harountunian and Ghazaryan [22] extend the results in [13] to the setting of [17]. Harountunian and Margaryan [23] expand on [13] by adding noise to the original string, altering the distribution of letters. Hanawal and Sundaresan [15] extend the bounds in [13] to a pre-limit and to more general sources, showing that they are tight for Markovian and unifilar sources.

Sundaresan [24] uses length functions to identify the link between guesswork and compression. This result is extended by Hanawal and Sundaresan [25] to relate guesswork to the compression of a source over a countably infinite alphabet. In [2] the authors prove that, if the string is conditioned on being an element of a typical set, the expected guesswork grows more slowly than a simple uniform approximation would suggest. In [26] the authors consider the impact of guessing over a noisy erasure channel, showing that the mean noise on the channel is not the significant moment in determining the expected guesswork, but instead one determined by its Rényi entropy with parameter $1/2$. Finally, we mention that recent work by Bunte and Lapidoth [27] identifies a distinct operational meaning for Rényi entropy in defining a rate region for a scheduling problem.

V Optimal strategies

In order to introduce the key concepts used to determine the optimal multi-user guesswork strategy, we first reconsider the optimal guesswork strategy in the single user case, i.e. $U = V = 1$. Recall that $\mathbb{A} = \{0, \ldots, m-1\}$ is a finite set.

Definition 1

A single user strategy, $S: \mathbb{A} \mapsto \{1, \ldots, m\}$, is a one-to-one map that determines the order in which guesses are made. That is, for a given strategy $S$ and a given string $w \in \mathbb{A}$, $S(w)$ is the number of guesses made until $w$ is queried.

Let $W$ be a random variable taking values in $\mathbb{A}$. Assume that its probability mass function, $P(W = w)$ for all $w \in \mathbb{A}$, is known. Since the first results on the topic it has been clear that the best strategy, which we denote $G$, is to guess from most likely to least likely, breaking ties arbitrarily. In particular, $G$ is defined by $G(w) < G(w')$ if $P(W = w) > P(W = w')$. We begin by assigning optimality a precise meaning in terms of stochastic dominance [28, 29].

Definition 2

A strategy $S$ is optimal for $W$ if the random variable $S(W)$ is stochastically dominated by $S'(W)$, for all strategies $S'$. That is, if $P(S(W) \le n) \ge P(S'(W) \le n)$ for all strategies $S'$ and all $n \in \{1, \ldots, m\}$.

This definition captures the stochastic aspect of guessing by stating that an optimal strategy is one where the identification stopping time is probabilistically smallest. One consequence of this definition that explains its appropriateness is that for any non-decreasing function $\phi: \{1, \ldots, m\} \mapsto \mathbb{R}$, it is the case that $E(\phi(S(W))) \le E(\phi(S'(W)))$ for an optimal $S$ and any other $S'$ (e.g. Proposition 3.3.17, [29]). Thus $G(W)$ has the least moments over all guessing strategies. That guessing from most- to least-likely in the single user case is optimal is readily established.

Lemma 1

If $U = V = 1$, the optimal strategies are those that guess from most likely to least likely, breaking ties arbitrarily.

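  • Consider the strategy $G$ defined above and any other strategy $S$. By construction, for any $n \in \{1, \ldots, m\}$

    $$P(G(W) \le n) = \sum_{i=1}^{n} P(G(W) = i) = \max_{\{w_1, \ldots, w_n\} \subseteq \mathbb{A}} \sum_{i=1}^{n} P(W = w_i) \ge \sum_{i=1}^{n} P(S(W) = i) = P(S(W) \le n).$$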

In the multi-user case, where (user, string) pairs are queried, a strategy is defined by the following.

Definition 3

A multi-user strategy is a one-to-one map $S: \{1, \ldots, V\} \times \mathbb{A} \mapsto \{1, \ldots, mV\}$ that orders the guesses of (user, string) pairs.

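The expression for the number of guesses required to identify $U$ strings is a little involved as we must take into account that we stop making queries about a user once their string has been identified. For a given strategy $S$, let $N_S: \{1, \ldots, V\} \times \{1, \ldots, mV\} \mapsto \{0, \ldots, m\}$ be defined by

$$N_S(v, n) = \left|\{w \in \mathbb{A} : S(v, w) \le n\}\right|,$$

which computes the number of queries in the strategy up to $n$ that correspond to user $v$.

The number of queries that need to be made if $U$ strings are to be identified is

$$T(U, V, \vec{w}) = \text{U-min}\left(S(1, w^{(1)}), \ldots, S(V, w^{(V)})\right),$$

where $\vec{w} = (w^{(1)}, \ldots, w^{(V)})$ and $\text{U-min}(\vec{x})$ gives the $U$-th smallest component of $\vec{x}$. The number of guesses required to identify $U$ components of $\vec{w}$ is then

$$G_S(U, V, \vec{w}) = \sum_{v=1}^{V} N_S\left(v, \min\left(S(v, w^{(v)}), T(U, V, \vec{w})\right)\right). \tag{2}$$

This apparently unwieldy object counts the number of queries made to each user, curtailed either when their string is identified or when $U$ strings of other users are identified.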
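The counting rule just described can be made concrete with a small sketch that computes the multi-user guesswork in two ways: by direct simulation of the querying process, and by a formula that sums, over users, the number of that user's queries up to the earlier of their own identification and the identification of the required number of strings. The representation of a strategy as an ordered list of (user, string) pairs and all function names are assumptions made for illustration.

```python
def guesswork(strategy, strings, U):
    """Direct simulation: count queries until U users' strings are found.

    strategy: list of (user, string) pairs in guessing order, each pair once.
    strings:  dict mapping each user to the string that user selected.
    """
    identified = set()
    guesses = 0
    for user, candidate in strategy:
        if user in identified:
            continue  # no further queries are spent on an identified user
        guesses += 1
        if strings[user] == candidate:
            identified.add(user)
            if len(identified) == U:
                return guesses
    raise ValueError("strategy does not cover the selected strings")

def guesswork_formula(strategy, strings, U):
    """Same quantity via per-user query counts, in the spirit of equation (2)."""
    # position[(v, w)] is the place of the pair (v, w) in the ordering.
    position = {pair: i + 1 for i, pair in enumerate(strategy)}
    hits = sorted(position[(v, w)] for v, w in strings.items())
    stop = hits[U - 1]  # ordering index at which the U-th string is revealed
    total = 0
    for v, w in strings.items():
        cutoff = min(position[(v, w)], stop)
        # Queries addressed to user v up to and including the cutoff.
        total += sum(1 for (u, _), n in position.items() if u == v and n <= cutoff)
    return total

if __name__ == "__main__":
    strategy = [(1, "a"), (2, "a"), (1, "b"), (2, "b")]
    selected = {1: "b", 2: "a"}
    for U in (1, 2):
        print(U, guesswork(strategy, selected, U), guesswork_formula(strategy, selected, U))
```

The two functions agree because a user is queried exactly at those positions of the ordering that precede both the position of that user's own string and the stopping position.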

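If $U = V$, equation (2) simplifies significantly, as $\min(S(v, w^{(v)}), T(V, V, \vec{w})) = S(v, w^{(v)})$ for all $v \in \{1, \ldots, V\}$, becoming

$$G_S(V, V, \vec{w}) = \sum_{v=1}^{V} N_S\left(v, S(v, w^{(v)})\right), \tag{3}$$

the sum of the number of queries required to identify each individual word. In this case, we have the analogous result to Lemma 1, which is again readily established.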

Lemma 2

If $U = V$, the optimal strategies are those that employ individual optimal strategies, but with users selected in any order.

  • For any multi-user strategy $S$, equation (3) holds. Consider an element in the sum on the right hand side, $N_S(v, S(v, w^{(v)}))$. It can be recognized to be the number of queries made to user $v$ until their string is identified. By Lemma 1, for each user $v$, for any $S$ this stochastically dominates the equivalent single user optimal strategy. Thus the multi-user optimal strategies in this case are the sum of individual user optimal strategies, with users queried in any arbitrary order.

The formula (2) will be largely side-stepped when we consider asymptotically optimal strategies, but is needed to establish that there is, in general, no stochastically dominant strategy if $U < V$. With $\vec{W}$ being a random vector taking values in $\mathbb{A}^V$ with independent, not necessarily identically distributed, components, we are not guaranteed the existence of an $S$ such that $P(G_S(U, V, \vec{W}) \le n) \ge P(G_{S'}(U, V, \vec{W}) \le n)$ for all $n$ and all alternate strategies $S'$.

Lemma 3

If $U < V$, a stochastically dominant strategy does not necessarily exist.

  • A counter-example suffices and so let , , and . Let the distributions of and be

    User 1 User 2

    If a stochastically dominant strategy exists, its first guess must be user , string , i.e. , so that . Given this first guess, to maximize , the second guess must be user , string , , so that .

    An alternate strategy with and , however, gives and . While , and so there is no strategy stochastically dominated by all others in this case.
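The absence of a stochastically dominant strategy can also be checked by exhaustive search on a small instance. The sketch below uses a hypothetical pair of distributions over three strings (chosen for illustration and not the values of the counter-example above), enumerates every ordering of the six (user, string) pairs, computes the exact distribution function of the guesswork for each, and tests whether any single ordering attains the pointwise maximum. All names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations, product

def guesswork(strategy, strings, U):
    """Queries made until U users' strings are identified."""
    identified, guesses = set(), 0
    for user, candidate in strategy:
        if user in identified:
            continue
        guesses += 1
        if strings[user] == candidate:
            identified.add(user)
            if len(identified) == U:
                return guesses
    raise ValueError("strategy does not cover the selected strings")

def guesswork_cdf(strategy, dists, U):
    """Exact values of P(guesswork <= n) for n = 1, ..., len(strategy)."""
    users = sorted(dists)
    pmf = [Fraction(0)] * (len(strategy) + 1)
    for choice in product(*(dists[u].items() for u in users)):
        strings = {u: w for u, (w, _) in zip(users, choice)}
        prob = Fraction(1)
        for _, p in choice:
            prob *= p  # users select their strings independently
        pmf[guesswork(strategy, strings, U)] += prob
    cdf, running = [], Fraction(0)
    for mass in pmf[1:]:
        running += mass
        cdf.append(running)
    return tuple(cdf)

def has_dominant_strategy(dists, U):
    """True if one ordering maximizes P(guesswork <= n) for every n at once."""
    pairs = [(u, w) for u in sorted(dists) for w in dists[u]]
    cdfs = {guesswork_cdf(s, dists, U) for s in permutations(pairs)}
    best = tuple(max(column) for column in zip(*cdfs))
    return best in cdfs

# Hypothetical selection probabilities for two users over strings 0, 1, 2.
HYPOTHETICAL = {
    1: {0: Fraction(3, 5), 1: Fraction(1, 4), 2: Fraction(3, 20)},
    2: {0: Fraction(1, 2), 1: Fraction(2, 5), 2: Fraction(1, 10)},
}

if __name__ == "__main__":
    print("one of two strings:", has_dominant_strategy(HYPOTHETICAL, 1))
    print("both strings:", has_dominant_strategy(HYPOTHETICAL, 2))
```

For these hypothetical values, maximizing the chance of success at the first guess requires starting with user 1, while maximizing the chance of success within two guesses requires spending both guesses on user 2, so no ordering is best at every guess number when one of two strings is sought. When both strings are sought, the search finds a dominant ordering, in line with Lemma 2.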

Despite this lack of universal optimal strategy, we shall show that there is a sequence of random variables that are stochastically dominated by the guesswork of all strategies and, moreover, there exists a strategy with identical performance in Arikan’s long string length asymptotic.

Definition 4

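A strategy $\{S_k\}$ is asymptotically optimal if $\{k^{-1} \log G_{S_k}(U, V, \vec{W}_k)\}$ satisfies a LDP with the same rate function as a sequence $\{k^{-1} \log \Upsilon_k\}$ where $\Upsilon_k$ is stochastically dominated by $G_{S}(U, V, \vec{W}_k)$ for all strategies $S$.

Note that $\Upsilon_k$ need not correspond to the guesswork of a strategy.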

VI An asymptotically optimal strategy

Let $\{\vec{W}_k\}$ be a sequence of random strings, with $\vec{W}_k$ taking values in $\mathbb{A}^{kV}$, with independent components, $W_k^{(v)}$, corresponding to strings selected by users $1$ through $V$, although each user’s string may not be constructed from i.i.d. letters. For each individual user, $v \in \{1, \ldots, V\}$, let $G^{(v)}$ denote its single-user optimal guessing strategy; that is, guessing from most likely to least likely.

We shall show that the following random variable, constructed using the $G^{(v)}$, is stochastically dominated by the guesswork distribution of all strategies:

$$G_{\mathrm{opt}}(U, V, \vec{W}_k) = \text{U-min}\left(G^{(1)}(W_k^{(1)}), \ldots, G^{(V)}(W_k^{(V)})\right). \tag{4}$$

This can be thought of as allowing the inquisitor to query, for each $n$ in turn, the $n$-th most likely string for all users while only accounting for a single guess and so it does not correspond to an allowable strategy.

Lemma 4

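For any strategy $S$ and any $U \in \{1, \ldots, V\}$, $G_{\mathrm{opt}}(U, V, \vec{W}_k)$ is stochastically dominated by $G_S(U, V, \vec{W}_k)$. That is, for any $U \in \{1, \ldots, V\}$ and any $n \in \mathbb{N}$,

$$P\left(G_{\mathrm{opt}}(U, V, \vec{W}_k) \le n\right) \ge P\left(G_S(U, V, \vec{W}_k) \le n\right).$$

  • Using equation (2) and the positivity of its summands, for any strategy $S$

    $$G_S(U, V, \vec{w}) \ge \text{U-min}\left(N_S(1, S(1, w^{(1)})), \ldots, N_S(V, S(V, w^{(V)}))\right).$$

    As for each $v$, $G^{(v)}(W_k^{(v)})$ is stochastically dominated by all other strategies, the components are independent, and $\text{U-min}$ is non-decreasing in each component,

    $$P\left(\text{U-min}\left(N_S(1, S(1, W_k^{(1)})), \ldots, N_S(V, S(V, W_k^{(V)}))\right) \le n\right) \le P\left(\text{U-min}\left(G^{(1)}(W_k^{(1)}), \ldots, G^{(V)}(W_k^{(V)})\right) \le n\right).$$

    Using equation (4), this implies that

    $$P\left(G_S(U, V, \vec{W}_k) \le n\right) \le P\left(G_{\mathrm{opt}}(U, V, \vec{W}_k) \le n\right),$$

    as required.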

The strategy that we construct that will asymptotically meet the performance of the lower bound is to round-robin the single user optimal strategies. That is, to query the most likely string of one user followed by the most likely string of a second user and so forth, for each user in a round-robin fashion, before moving to the second most likely string of each user. An upper bound on this strategy’s performance is to consider only stopping at the end of a round of such queries, even if they reveal more than $U$ strings, which gives

$$V\, G_{\mathrm{opt}}(U, V, \vec{W}_k), \tag{5}$$

where $G_{\mathrm{opt}}(U, V, \vec{W}_k)$ is defined in (4).

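In large deviations parlance the stochastic processes $\{k^{-1} \log G_{\mathrm{opt}}(U, V, \vec{W}_k)\}$ and $\{k^{-1} \log (V\, G_{\mathrm{opt}}(U, V, \vec{W}_k))\}$ arising from equations (4) and (5) are exponentially equivalent, e.g. Section 4.2.2 [9], as $\lim_{k \to \infty} k^{-1} \log V = 0$. As a result, if one process satisfies the LDP with a rate function that has compact level sets, then the other does [9][Theorem 4.2.3]. Thus if $\{k^{-1} \log G_{\mathrm{opt}}(U, V, \vec{W}_k)\}$ can be shown to satisfy a LDP, then the round-robin strategy is proved to be asymptotically optimal.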
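The sandwich between the lower bound and the round-robin strategy can be checked with a brief sketch. It represents each user only by the rank of their selected string in that user's own most-likely-first ordering, an assumption that suffices because the round-robin strategy queries rank 1 for every unidentified user, then rank 2, and so on. The function names are illustrative.

```python
def lower_bound(ranks, U):
    """U-th smallest single-user rank: one 'guess' per round of queries."""
    return sorted(ranks)[U - 1]

def round_robin_guesswork(ranks, U):
    """Queries made by round-robin over the users' own optimal orderings.

    ranks[v] is the position of user v's string in that user's
    most-likely-first list; identified users are skipped in later rounds.
    """
    if not 1 <= U <= len(ranks) or min(ranks) < 1:
        raise ValueError("need 1 <= U <= number of users and ranks >= 1")
    done = [False] * len(ranks)
    guesses = identified = rnd = 0
    while True:
        rnd += 1
        for v, r in enumerate(ranks):
            if done[v]:
                continue
            guesses += 1
            if r == rnd:
                done[v] = True
                identified += 1
                if identified == U:
                    return guesses

if __name__ == "__main__":
    ranks = [5, 2, 9]  # hypothetical ranks of three users' strings
    for U in (1, 2, 3):
        lo = lower_bound(ranks, U)
        print(U, lo, round_robin_guesswork(ranks, U), len(ranks) * lo)
```

Because the round-robin count lies between the lower bound and the number of users times the lower bound, the two differ by at most a constant factor, which vanishes on the logarithmic scale per character as the strings become long.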

VII Asymptotic performance of optimal strategies

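We first recall what is known for the single-user setting. For each individual user $v \in \{1, \ldots, V\}$, the specific Rényi entropy of the sequence $\{W_k^{(v)}\}$, should it exist, is defined by

$$R_v(\beta) := \lim_{k \to \infty} \frac{1}{k} \frac{1}{1-\beta} \log \sum_{w_k \in \mathbb{A}^k} P\left(W_k^{(v)} = w_k\right)^\beta$$

for $\beta \in (0, 1) \cup (1, \infty)$, and for $\beta = 1$,

$$R_v(1) := \lim_{\beta \uparrow 1} R_v(\beta),$$

the specific Shannon entropy. Should $R_v(\beta)$ exist for $\beta \in (0, \infty)$, then the specific min-entropy is defined

$$R_v(\infty) := \lim_{\beta \to \infty} R_v(\beta),$$

where the limit necessarily exists. The existence of $R_v(\beta)$ for all $\beta > 0$ and its relationship to the scaled Cumulant Generating Function (sCGF)

$$\Lambda_G^{(v)}(\alpha) := \lim_{k \to \infty} \frac{1}{k} \log E\left(\exp\left(\alpha \log G^{(v)}(W_k^{(v)})\right)\right) = \begin{cases} \alpha R_v\left(\dfrac{1}{1+\alpha}\right) & \text{if } \alpha > -1, \\ -R_v(\infty) & \text{if } \alpha \le -1, \end{cases} \tag{6}$$

has been established for the single user case for a broad class of character sources that encompasses i.i.d., Markovian and general sofic shifts that admit an entropy condition [4, 5, 6, 7, 10]. If, in addition, $R_v(\beta)$ is differentiable with respect to $\beta$ and has a continuous derivative, it is established in [10] that the process $\{k^{-1} \log G^{(v)}(W_k^{(v)})\}$ satisfies a LDP, i.e. equation (1), with a convex rate function

$$\Lambda_G^{(v)*}(x) := \sup_{\alpha \in \mathbb{R}} \left(\alpha x - \Lambda_G^{(v)}(\alpha)\right). \tag{7}$$

In [10], this LDP is used to deduce an approximation to the guesswork distribution,

$$P\left(G^{(v)}(W_k^{(v)}) = n\right) \approx \frac{1}{n} \exp\left(-k \Lambda_G^{(v)*}\left(k^{-1} \log n\right)\right) \tag{8}$$

for large $k$ and $n \in \{1, \ldots, m^k\}$.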
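For a concrete feel for these single-user objects, the sketch below evaluates the sCGF and a numerical Legendre-Fenchel transform for i.i.d. Bernoulli characters. It assumes the sCGF takes the form $\alpha R(1/(1+\alpha))$ for $\alpha > -1$ and $-R(\infty)$ otherwise, as recalled from the single-user results cited above; the grid of $\alpha$ values, the function names and the example source are illustrative choices.

```python
import math

# Grid of alpha values for the numerical supremum; it contains 0 and -1 exactly.
ALPHAS = [i / 100.0 for i in range(-500, 5001)]

def scgf(p, alpha):
    """Assumed single-user sCGF (in nats) for i.i.d. Bernoulli(p) characters."""
    q = 1.0 - p
    if alpha <= -1.0:
        return math.log(max(p, q))  # minus the min-entropy
    s = 1.0 / (1.0 + alpha)
    a, b = s * math.log(p), s * math.log(q)
    top = max(a, b)
    # (1 + alpha) * log(p^s + q^s), computed stably for large s.
    return (1.0 + alpha) * (top + math.log(math.exp(a - top) + math.exp(b - top)))

def rate_function(p, x, alphas=ALPHAS):
    """Numerical Legendre-Fenchel transform of the sCGF on a finite grid."""
    return max(alpha * x - scgf(p, alpha) for alpha in alphas)

def approx_log_pmf(p, k, n):
    """log of the approximation (1/n) exp(-k * rate(log(n)/k)) to P(G = n)."""
    return -math.log(n) - k * rate_function(p, math.log(n) / k)

if __name__ == "__main__":
    p = 0.3
    shannon = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    for x in (0.0, 0.5 * shannon, shannon, math.log(2)):
        print(round(x, 4), round(rate_function(p, x), 6))
```

On the grid, the transform vanishes at the Shannon entropy and equals the min-entropy at zero, the latter reflecting the probability that the very first guess is correct.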

The following theorem establishes the fundamental analogues of these results for an asymptotically optimal strategy, where user strings may have distinct statistical properties.

Theorem 5

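Assume that the components of $\{\vec{W}_k\}$ are independent and that for each $v \in \{1, \ldots, V\}$ $R_v(\beta)$ exists for all $\beta > 0$, is differentiable and has a continuous derivative, and that equation (6) holds. Then the process $\{k^{-1} \log G_{\mathrm{opt}}(U, V, \vec{W}_k)\}$, and thus any asymptotically optimal strategy, satisfies a Large Deviation Principle. Defining

$$\delta^{(v)}(x) = \begin{cases} \Lambda_G^{(v)*}(x) & \text{if } x \le R_v(1), \\ 0 & \text{otherwise,} \end{cases} \qquad \gamma^{(v)}(x) = \begin{cases} \Lambda_G^{(v)*}(x) & \text{if } x \ge R_v(1), \\ 0 & \text{otherwise,} \end{cases}$$

the rate function is

$$I_{G_{\mathrm{opt}}}(U, V, x) = \min_{v_1, \ldots, v_V} \left( \Lambda_G^{(v_1)*}(x) + \sum_{i=2}^{U} \delta^{(v_i)}(x) + \sum_{i=U+1}^{V} \gamma^{(v_i)}(x) \right), \tag{9}$$

where the minimum is over permutations $(v_1, \ldots, v_V)$ of the users, which is lower semi-continuous and has compact level sets, but may not be convex. The sCGF capturing how the moments scale is

$$\Lambda_{G_{\mathrm{opt}}}(U, V, \alpha) := \lim_{k \to \infty} \frac{1}{k} \log E\left(\exp\left(\alpha \log G_{\mathrm{opt}}(U, V, \vec{W}_k)\right)\right) = \sup_{x \in [0, \log(m)]} \left(\alpha x - I_{G_{\mathrm{opt}}}(U, V, x)\right). \tag{10}$$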

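  • Under the assumptions of the theorem, for each $v \in \{1, \ldots, V\}$, $\{k^{-1} \log G^{(v)}(W_k^{(v)})\}$ satisfies the LDP with the rate function $\Lambda_G^{(v)*}$ given in equation (7). As users’ strings are selected independently, the sequence of vectors

    $$\left\{\left(\frac{1}{k} \log G^{(1)}(W_k^{(1)}), \ldots, \frac{1}{k} \log G^{(V)}(W_k^{(V)})\right)\right\}$$

    satisfies the LDP in $\mathbb{R}^V$ with rate function $I(y^{(1)}, \ldots, y^{(V)}) = \sum_{v=1}^{V} \Lambda_G^{(v)*}(y^{(v)})$, the sum of the rate functions given in equation (7).

    Within our setting, the contraction principle, e.g. Theorem 4.2.1 [9], states that if a sequence of random variables $\{X_n\}$ taking values in a compact subset of $\mathbb{R}^V$ satisfies a LDP with rate function $I: \mathbb{R}^V \mapsto [0, \infty]$ and $f: \mathbb{R}^V \mapsto \mathbb{R}$ is a continuous function, then the sequence $\{f(X_n)\}$ satisfies the LDP with rate function $\inf_{\vec{y}} \{I(\vec{y}) : f(\vec{y}) = x\}$.

    Assume, without loss of generality, that $\vec{x} \in \mathbb{R}^V$ is such that $x^{(1)} \le x^{(2)} \le \cdots \le x^{(V)}$, so that $\text{U-min}(\vec{x}) = x^{(U)}$, and let $\vec{y}_n \to \vec{x}$. Let $\epsilon > 0$. There exists $N_\epsilon$ such that $\max_v |x^{(v)} - y_n^{(v)}| < \epsilon$ for all $n > N_\epsilon$. Thus $y_n^{(v)} < x^{(U)} + \epsilon$ for all $v \le U$ and $y_n^{(v)} > x^{(U)} - \epsilon$ for all $v \ge U$ and all $n > N_\epsilon$, and so $|\text{U-min}(\vec{y}_n) - \text{U-min}(\vec{x})| < \epsilon$. Hence $\text{U-min}$ is a continuous function and that a LDP holds follows from an application of the contraction principle, giving the rate function

    $$I_{G_{\mathrm{opt}}}(U, V, x) = \inf\left\{ \sum_{v=1}^{V} \Lambda_G^{(v)*}(y^{(v)}) : \text{U-min}(\vec{y}) = x \right\}.$$

    This expression simplifies to that in equation (9) by elementary arguments. The sCGF result follows from an application of Varadhan’s Lemma, e.g. [9, Theorem 4.3.1].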

The expression for the rate function in equation (9) lends itself to a useful interpretation. In the long string-length asymptotic, the likelihood that an inquisitor has identified $U$ of the users’ strings after approximately $\exp(kx)$ queries is contributed to by three distinct groups of identifiable users. For given $x$, the argument in the first term, $v_1$, identifies the last of the $U$ users whose string is identified. The second summed term is contributed to by the collection of users, $v_2$ to $v_U$, whose strings have already been identified prior to $\exp(kx)$ queries, while the final summed term corresponds to those users, $v_{U+1}$ to $v_V$, whose strings have not been identified.

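The reason for using the notation $I_{G_{\mathrm{opt}}}(U, V, x)$ in lieu of $\Lambda_{G_{\mathrm{opt}}}^*(U, V, x)$ for the rate function in Theorem 5 is that $I_{G_{\mathrm{opt}}}(U, V, x)$ is not convex in general, which we shall demonstrate by example, and so is not always the Legendre-Fenchel transform of the sCGF $\Lambda_{G_{\mathrm{opt}}}(U, V, \alpha)$. Instead

$$\Lambda_{G_{\mathrm{opt}}}^*(U, V, x) = \sup_{\alpha \in \mathbb{R}} \left(\alpha x - \Lambda_{G_{\mathrm{opt}}}(U, V, \alpha)\right)$$

forms the convex hull of $I_{G_{\mathrm{opt}}}(U, V, x)$. In particular, this means that we could not have proved Theorem 5 by establishing properties of $\Lambda_{G_{\mathrm{opt}}}(U, V, \alpha)$ alone, which was the successful route taken for the $U = V = 1$ setting, and instead needed to rely on the LDP proved in [10]. Indeed, in the setting considered in [13, 15] with $V = 2$, $U = 2$, with one of the strings chosen uniformly, while the authors directly identify $\Lambda_{G_{\mathrm{opt}}}(2, 2, \alpha)$ for $\alpha > 0$, one cannot establish a full LDP from this approach as the resulting rate function is not convex.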

Convexity of the rate function defined in equation (9) is ensured, however, if all users select strings using the same stochastic properties, whereupon the results in Theorem 5 simplify greatly.

Corollary 1

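If, in addition to the assumptions of Theorem 5, $\Lambda_G^{(v)} = \Lambda_G$ for all $v \in \{1, \ldots, V\}$ with corresponding Rényi entropy $R$, then the rate function in equation (9) simplifies to the convex function

$$I_{G_{\mathrm{opt}}}(U, V, x) = \begin{cases} U \Lambda_G^*(x) & \text{if } x \le R(1), \\ (V-U+1) \Lambda_G^*(x) & \text{if } x \ge R(1), \end{cases} \tag{11}$$

where $R(1)$ is the specific Shannon entropy, and the sCGF in equation (10) simplifies to

$$\Lambda_{G_{\mathrm{opt}}}(U, V, \alpha) = \begin{cases} U \Lambda_G\left(\dfrac{\alpha}{U}\right) & \text{if } \alpha \le 0, \\ (V-U+1) \Lambda_G\left(\dfrac{\alpha}{V-U+1}\right) & \text{if } \alpha \ge 0. \end{cases} \tag{12}$$

In particular, with $\alpha = 1$ we have

$$\lim_{k \to \infty} \frac{1}{k} \log E\left(G_{\mathrm{opt}}(U, V, \vec{W}_k)\right) = (V-U+1) \Lambda_G\left(\frac{1}{V-U+1}\right) = R\left(\frac{V-U+1}{V-U+2}\right), \tag{13}$$

where $R((n+1)/(n+2)) - R((n+2)/(n+3))$ is a decreasing function of $n \in \mathbb{N}$.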

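  • The simplification in equation (11) follows readily from equation (9). To establish that $R((n+1)/(n+2)) - R((n+2)/(n+3))$ is a decreasing function of $n$, it suffices to establish that $R((x+1)/(x+2))$ is a convex, decreasing function for $x \ge 0$.

    That $R(\beta) \downarrow R(1)$ as $\beta \uparrow 1$ is a general property of specific Rényi entropy. For convexity, using equation (13) it suffices to show that $x \Lambda_G(1/x)$ is convex for $x > 0$. This can be seen by noting that for any $a \in (0, 1)$ and $x_1, x_2 > 0$,

    $$\left(a x_1 + (1-a) x_2\right) \Lambda_G\left(\frac{1}{a x_1 + (1-a) x_2}\right) = \left(a x_1 + (1-a) x_2\right) \Lambda_G\left(\frac{\eta}{x_1} + \frac{1-\eta}{x_2}\right) \le a x_1 \Lambda_G\left(\frac{1}{x_1}\right) + (1-a) x_2 \Lambda_G\left(\frac{1}{x_2}\right),$$

    where $\eta = a x_1 / (a x_1 + (1-a) x_2) \in [0, 1]$ and we have used the convexity of $\Lambda_G$.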

As the change in the growth rate, $R((n+1)/(n+2)) - R((n+2)/(n+3))$ with $n = V-U$, is decreasing there is a law of diminishing returns for the inquisitor where the greatest decrease in the average guesswork growth rate is through the provision of one additional user. From the system designer’s point of view, the specific Shannon entropy of the source is a universal lower bound on the exponential growth rate of the expected guesswork that, while we cannot take the limit to infinity, is tight for large $V-U$.

Regardless of whether the rate function is convex, Theorem 6, which follows, justifies the approximation

$$P\left(G_{\mathrm{opt}}(U, V, \vec{W}_k) = n\right) \approx \frac{1}{n} \exp\left(-k I_{G_{\mathrm{opt}}}\left(U, V, k^{-1} \log n\right)\right)$$

for large $k$ and $n \in \{1, \ldots, m^k\}$. It is analogous to that in equation (8), first developed in [10], but there are additional difficulties that must be overcome to establish it. In particular, if $U = V = 1$, the likelihood that the string is identified at each query is a decreasing function of guess number, but this is not true in the more general case.

As a simple example, consider $U = 2$, $V = 2$, strings of length $k$ and strings chosen uniformly. Here the probability of guessing both strings in one guess is $1/m^{2k}$, but at the second guess it is $3/m^{2k}$. Despite this lack of monotonicity, the approximation still holds in the following sense.

Theorem 6

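Under the assumptions of Theorem 5, for any $x \in [0, \log(m))$ we have

$$\lim_{\epsilon \downarrow 0} \liminf_{k \to \infty} \frac{1}{k} \log \inf_{n \in K_k(x, \epsilon)} P\left(G_{\mathrm{opt}}(U, V, \vec{W}_k) = n\right) = \lim_{\epsilon \downarrow 0} \limsup_{k \to \infty} \frac{1}{k} \log \sup_{n \in K_k(x, \epsilon)} P\left(G_{\mathrm{opt}}(U, V, \vec{W}_k) = n\right) = -I_{G_{\mathrm{opt}}}(U, V, x) - x,$$

where $K_k(x, \epsilon) = \{n : k^{-1} \log n \in (x - \epsilon, x + \epsilon)\}$ is the collection of guesses made in a log-neighborhood of $x$.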

  • The proof follows the ideas in [10] Corollary 4, but with the added difficulties resolved by isolating the last word that is likely to be guessed and leveraging the monotonicity in its individual likelihood of being identified.

    Noting the definition of in the statement of the theorem, consider for

    The first equality holds by definition of . The first inequality follows from the union bound over all possible permutations of . The second inequality utilizes if , while the third inequality uses the monotonic decreasing probabilities in guessing a single user’s string.

    Taking on both sides of the inequality, interchanging the order of the max and the supremum, using the continuity of for each , and the representation of the rate function in equation (9), gives the upper bound

    Considering the least likely guesswork in the ball leads to a matching lower bound. The other case, , follows similar logic, leading to the result.
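The lack of monotonicity noted before Theorem 6 is easy to verify exactly. The sketch below assumes that all users choose uniformly from the same number of strings and that every user's string must be found, so that the quantity of interest is the largest of the users' individual ranks; the function name and the example sizes are illustrative.

```python
from fractions import Fraction

def max_rank_pmf(num_strings, n, num_users=2):
    """P(largest of the users' ranks equals n) for uniform, independent users.

    Each rank is uniform on 1..num_strings, so the largest is at most n with
    probability (n / num_strings) ** num_users.
    """
    if not 1 <= n <= num_strings:
        return Fraction(0)
    return Fraction(n ** num_users - (n - 1) ** num_users, num_strings ** num_users)

if __name__ == "__main__":
    M = 8  # hypothetical number of equally likely strings per user
    print([max_rank_pmf(M, n) for n in range(1, M + 1)])
```

With two users the probability at guess number $n$ is proportional to $2n - 1$, so it increases with $n$ rather than decreasing as in the single-user case.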

We next provide some illustrative examples of what these results imply, returning to using in figures.

VIII Mismatched Statistics Example

The potential lack of convexity in the rate function of Theorem 5, equation (9), only arises if users’ string statistics are asymptotically distinct. The significance of this lack of convexity on the phenomenology of guesswork can be understood in terms of the asymptotically optimal round-robin strategy: if the rate function is not convex, there is no single set of users whose strings are most vulnerable. That is, if strings are recovered after a small number of guesses, they will be from one set of users, but after a number of guesses corresponding to a transition from the initial convexity they will be from another set of users. This is made explicit in the following corollary to Theorem 5.

Corollary 2

If $I_{G_{\mathrm{opt}}}(U, V, x)$ is not convex in $x$, then there is no single set of users whose strings will be identified in the long string length asymptotic.

  • We prove the result by establishing the contraposition: if a single set of users is always most vulnerable, then $I_{G_{\mathrm{opt}}}(U, V, x)$ is convex. Recall the expression for $I_{G_{\mathrm{opt}}}(U, V, x)$ given in equation (9)

    As explained after Theorem 5, for given $x$ the set of users $\{v_1, \ldots, v_U\}$ corresponds to those users whose strings, on the scale of large deviations, will be identified by the inquisitor after approximately $\exp(kx)$ queries. If this set is unchanging in $x$, i.e. the same set of users is identified irrespective of $x$, then both of the functions