
Analysis of the Expected Number of Bit Comparisons Required by Quickselect

James Allen Fill

Department of Applied Mathematics and Statistics

The Johns Hopkins University

jimfill@jhu.edu and http://www.ams.jhu.edu/~fill/

and

Take Nakama

Department of Applied Mathematics and Statistics

The Johns Hopkins University

nakama@jhu.edu and http://www.ams.jhu.edu/~nakama/

ABSTRACT

When algorithms for sorting and searching are applied to keys that are represented as bit strings, we can quantify the performance of the algorithms not only in terms of the number of key comparisons required by the algorithms but also in terms of the number of bit comparisons. Some of the standard sorting and searching algorithms have been analyzed with respect to key comparisons but not with respect to bit comparisons. In this paper, we investigate the expected number of bit comparisons required by Quickselect (also known as Find). We develop exact and asymptotic formulae for the expected number of bit comparisons required to find the smallest or largest key by Quickselect and show that the expectation is asymptotically linear with respect to the number of keys. Similar results are obtained for the average case. For finding keys of arbitrary rank, we derive an exact formula for the expected number of bit comparisons that (using rational arithmetic) requires only finite summation (rather than such operations as numerical integration) and use it to compute the expectation for each target rank.

AMS 2000 subject classifications. Primary 68W40; secondary 68P10, 60C05.

Key words and phrases. Quickselect, Find, searching algorithms, asymptotics, average-case analysis, key comparisons, bit comparisons.

Date. June 15, 2007.

## 1 Introduction and Summary

When an algorithm for sorting or searching is analyzed, the algorithm is usually regarded either as comparing keys pairwise irrespective of the keys’ internal structure or as operating on representations (such as bit strings) of keys. In the former case, analyses often quantify the performance of the algorithm in terms of the number of key comparisons required to accomplish the task; Quickselect (also known as Find) is an example of those algorithms that have been studied from this point of view. In the latter case, if keys are represented as bit strings, then analyses quantify the performance of the algorithm in terms of the number of bits compared until it completes its task. Digital search trees, for example, have been examined from this perspective.
In order to fully quantify the performance of a sorting or searching algorithm and enable comparison between key-based and digital algorithms, it is ideal to analyze the algorithm from both points of view. However, to date, only Quicksort has been analyzed with both approaches; see Fill and Janson [3]. Before their study, Quicksort had been extensively examined with regard to the number of key comparisons performed by the algorithm (e.g., Knuth [12], Régnier [17], Rösler [18], Knessl and Szpankowski [9], Fill and Janson [2], Neininger and Rüschendorf [16]), but it had not been examined with regard to the number of bit comparisons in sorting keys represented as bit strings. In their study, Fill and Janson assumed that keys are independently and uniformly distributed over (0,1) and that the keys are represented as bit strings. [They also conducted the analysis for a general absolutely continuous distribution over (0,1).] They showed that the expected number of bit comparisons required to sort $n$ keys is asymptotically equivalent to $n(\ln n)(\lg n)$, as compared to the lead-order term of the expected number of key comparisons, which is asymptotically $2n\ln n$. We use ln and lg to denote natural and binary logarithms, respectively, and use log when the base does not matter (for example, in remainder estimates).
In this paper, we investigate the expected number of bit comparisons required by Quickselect. Hoare [7] introduced this search algorithm, which is treated in most textbooks on algorithms and data structures. Quickselect selects the $m$-th smallest key (we call it the rank-$m$ key) from a set of $n$ distinct keys. (The keys are typically assumed to be distinct, but the algorithm still works, with a minor adjustment, even if they are not distinct.) The algorithm finds the target key in a recursive and random fashion. First, it selects a pivot uniformly at random from the $n$ keys. Let $k$ denote the rank of the pivot. If $k = m$, then the algorithm returns the pivot. If $k > m$, then the algorithm recursively operates on the set of keys smaller than the pivot and returns the rank-$m$ key. Similarly, if $k < m$, then the algorithm recursively operates on the set of keys larger than the pivot and returns the $(m-k)$-th smallest key from the subset. Although previous studies (e.g., Knuth [10], Mahmoud et al. [14], Grübel and Rösler [6], Lent and Mahmoud [13], Mahmoud and Smythe [15], Devroye [1], Hwang and Tsai [8]) examined Quickselect with regard to key comparisons, this study is the first to analyze the bit complexity of the algorithm.
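
The recursive description above can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions (the function name `quickselect`, the list-based partitioning, and the iterative loop are not taken from the paper); it partitions with two passes for readability rather than efficiency.

```python
import random

def quickselect(keys, m, rng=random):
    """Return the rank-m key (1-based) from a collection of distinct keys."""
    keys = list(keys)
    if not 1 <= m <= len(keys):
        raise ValueError("m must satisfy 1 <= m <= len(keys)")
    while True:
        pivot = rng.choice(keys)                  # pivot chosen uniformly at random
        smaller = [x for x in keys if x < pivot]  # keys below the pivot
        larger = [x for x in keys if x > pivot]   # keys above the pivot
        k = len(smaller) + 1                      # rank of the pivot in the current set
        if k == m:
            return pivot
        if k > m:
            keys = smaller                        # target keeps rank m in the smaller set
        else:
            keys, m = larger, m - k               # target has rank m - k in the larger set
```

For example, `quickselect([0.3, 0.1, 0.7], 2)` returns the middle key regardless of which pivots are drawn; only the amount of work is random.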
We suppose that the algorithm is applied to $n$ distinct keys that are represented as bit strings and that the algorithm operates on individual bits in order to find a target key. We also assume that the $n$ keys are uniformly and independently distributed in $(0,1)$. For instance, consider applying Quickselect to find the smallest key among three keys $k_1$, $k_2$, and $k_3$ whose binary representations are .01001100…, .00110101…, and .00101010…, respectively. If the algorithm selects $k_3$ as a pivot, then it compares each of $k_1$ and $k_2$ to $k_3$ in order to determine the rank of $k_3$. When $k_1$ and $k_3$ are compared, the algorithm requires 2 bit comparisons to determine that $k_3$ is smaller than $k_1$ because the two keys have the same first digit and differ at the second digit. Similarly, when $k_2$ and $k_3$ are compared, the algorithm requires 4 bit comparisons to determine that $k_3$ is smaller than $k_2$. After these comparisons, key $k_3$ has been identified as smallest. Hence the search for the smallest key requires a total of 6 bit comparisons (resulting from the two key comparisons).
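
The count in this example is easy to reproduce. The sketch below labels the three keys `k1`, `k2`, `k3` in the order listed (the labels and the helper name `bit_comparisons` are ours) and uses only the eight bits shown in the text.

```python
def bit_comparisons(a: str, b: str) -> int:
    """Bits inspected to tell two bit strings apart: the index of the first differing bit."""
    for index, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return index
    raise ValueError("keys agree on every available bit")

k1, k2, k3 = "01001100", "00110101", "00101010"

# The pivot k3 is compared once with each of the other two keys.
first = bit_comparisons(k1, k3)
second = bit_comparisons(k2, k3)
print(first, second, first + second)  # prints: 2 4 6
```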
We let $\mu(m,n)$ denote the expected number of bit comparisons required to find the rank-$m$ key in a file of $n$ keys by Quickselect. By symmetry, $\mu(m,n) = \mu(n+1-m,n)$. First, we develop exact and asymptotic formulae for $\mu(1,n)$, the expected number of bit comparisons required to find the smallest key by Quickselect, as summarized in the following theorem.

###### Theorem 1.1.

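The expected number $\mu(1,n)$ of bit comparisons required by Quickselect to find the smallest key in a file of $n$ keys that are independently and uniformly distributed in $(0,1)$ has the following exact and asymptotic expressions:

$$\begin{aligned}
\mu(1,n) &= 2n(H_n-1) + 2\sum_{j=2}^{n-1} B_j\,\frac{n-j+1-\binom{n}{j}}{j(j-1)(1-2^{-j})} \\
&= cn - \frac{1}{\ln 2}(\ln n)^2 - \left(\frac{2}{\ln 2}+1\right)\ln n + O(1),
\end{aligned}$$

where $H_n$ and $B_j$ denote harmonic and Bernoulli numbers, respectively, and, with $\chi_k := \frac{2\pi i k}{\ln 2}$ and $\gamma :=$ Euler's constant $\doteq 0.57722$, we define

$$c := \frac{28}{9} + \frac{17-6\gamma}{9\ln 2} - \frac{4}{\ln 2}\sum_{k\in\mathbb{Z}\setminus\{0\}} \frac{\zeta(1-\chi_k)\,\Gamma(1-\chi_k)}{\Gamma(4-\chi_k)\,(1-\chi_k)} \doteq 5.27938.$$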

The asymptotic formula shows that the expected number of bit comparisons is asymptotically linear in $n$ with the lead-order coefficient approximately equal to 5.27938. Hence the expected number of bit comparisons is asymptotically different from that of key comparisons required to find the smallest key only by a constant factor (the expectation for key comparisons is asymptotically $2n$). Complex-analytical methods are utilized to obtain the asymptotic formula. Details of the derivations of the formulae are described in Section 3.
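
The exact expression in Theorem 1.1 involves only a finite sum, so it can be evaluated with rational arithmetic. The following sketch reads the formula as $\mu(1,n) = 2n(H_n-1) + 2\sum_{j=2}^{n-1} B_j\,[n-j+1-\binom{n}{j}]/[j(j-1)(1-2^{-j})]$; the function names and the recurrence used to generate Bernoulli numbers are our own choices, not the paper's.

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(limit):
    """Return [B_0, ..., B_limit] as exact fractions (convention B_1 = -1/2)."""
    B = [Fraction(1)]
    for m in range(1, limit + 1):
        # Standard recurrence: sum_{k=0}^{m} C(m+1, k) B_k = 0.
        B.append(-sum(comb(m + 1, k) * B[k] for k in range(m)) / (m + 1))
    return B

def mu_smallest(n):
    """Exact expression of Theorem 1.1 for mu(1, n), valid for n >= 2."""
    B = bernoulli_numbers(n)
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
    tail = sum(
        B[j] * (n - j + 1 - comb(n, j)) / (j * (j - 1) * (1 - Fraction(1, 2**j)))
        for j in range(2, n)
    )
    return 2 * n * (harmonic - 1) + 2 * tail

for n in (2, 3, 10, 100):
    value = mu_smallest(n)
    print(n, float(value), float(value) / n)
```

For $n = 2$ the sum is empty and the value is 2, which matches the expected number of bits needed to separate two independent uniform keys, $\sum_{k\ge 1} k\,2^{-k} = 2$.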
We also derive exact and asymptotic expressions for the expected number of bit comparisons for the average case. We denote this expectation by $\mu(\bar{m},n)$. In the average case, the parameter $m$ in $\mu(m,n)$ is considered a discrete uniform random variable; hence $\mu(\bar{m},n) = \frac{1}{n}\sum_{m=1}^{n}\mu(m,n)$. The derived asymptotic formula shows that $\mu(\bar{m},n)$ is also asymptotically linear in $n$; see (4.48). More detailed results for $\mu(\bar{m},n)$ are described in Section 4.
Lastly, in Section 5, we derive an exact expression of $\mu(m,n)$ for each fixed $m$ that is suited for computations. Our preliminary exact formula for $\mu(m,n)$ [shown in (2.8)] entails infinite summation and integration. As a result, it is not a desirable form for numerically computing the expected number of bit comparisons. Hence we establish another exact formula that only requires finite summation and use it to compute $\mu(m,n)$ over a range of values of $m$ and $n$. The computation leads to the following conjectures: (i) for fixed $n$, $\mu(m,n)$ increases in $m$ for $m \le \frac{n+1}{2}$ and is symmetric about $\frac{n+1}{2}$; and (ii) for fixed $m$, $\mu(m,n)$ increases in $n$ (asymptotically linearly).

## 2 Preliminaries

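To investigate the bit complexity of Quickselect, we follow the general approach developed by Fill and Janson [3]. Let $U_1, \dots, U_n$ denote the $n$ keys uniformly and independently distributed on (0, 1), and let $U_{(i)}$ denote the rank-$i$ key. Then, for $1 \le i < j \le n$ (assume $n \ge 2$),

$$P\{U_{(i)} \text{ and } U_{(j)} \text{ are compared}\} =
\begin{cases}
\dfrac{2}{j-m+1} & \text{if } m \le i, \\[2ex]
\dfrac{2}{j-i+1} & \text{if } i < m < j, \\[2ex]
\dfrac{2}{m-i+1} & \text{if } j \le m.
\end{cases} \tag{2.1}$$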

To determine the first probability in (2.1), note that $U_{(m)}, \dots, U_{(j)}$ remain in the same subset until the first time that one of them is chosen as a pivot. Therefore, $U_{(i)}$ and $U_{(j)}$ are compared if and only if the first pivot chosen from $U_{(m)}, \dots, U_{(j)}$ is either $U_{(i)}$ or $U_{(j)}$. Analogous arguments establish the other two cases.
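
The three cases can be checked by brute force for small $n$. The sketch below computes the comparison probability exactly by conditioning on the rank of each pivot, and compares it with the three-case formula, read here as $2/(j-m+1)$ for $m \le i$, $2/(j-i+1)$ for $i < m < j$, and $2/(m-i+1)$ for $j \le m$. The function names and the recursion are ours and are intended only as a sanity check.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def p_compared(i, j, m, lo, hi):
    """P(rank-i and rank-j keys are compared) while searching ranks lo..hi for rank m."""
    if not (lo <= i < j <= hi):
        return Fraction(0)              # one of the two keys was already discarded
    total = Fraction(0)
    for k in range(lo, hi + 1):         # k = rank of the pivot, uniform on lo..hi
        if k == i or k == j:
            total += 1                  # the pivot is compared with every other key
        elif k > m:
            total += p_compared(i, j, m, lo, k - 1)
        elif k < m:
            total += p_compared(i, j, m, k + 1, hi)
        # k == m with k not in {i, j}: the search stops and i, j are never compared
    return total / (hi - lo + 1)

def formula_2_1(i, j, m):
    """The three cases of (2.1) for i < j."""
    if m <= i:
        return Fraction(2, j - m + 1)
    if j <= m:
        return Fraction(2, m - i + 1)
    return Fraction(2, j - i + 1)       # remaining case: i < m < j

print(p_compared(1, 3, 1, 1, 3), formula_2_1(1, 3, 1))  # both 2/3
```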
For $0 < s < t < 1$, it is well known that the joint density function of $U_{(i)}$ and $U_{(j)}$ is given by

$$f_{U_{(i)},U_{(j)}}(s,t) := \binom{n}{i-1,\,1,\,j-i-1,\,1,\,n-j}\, s^{i-1}(t-s)^{j-i-1}(1-t)^{n-j}. \tag{2.2}$$

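Clearly, the event that $U_{(i)}$ and $U_{(j)}$ are compared is independent of the random variables $U_{(i)}$ and $U_{(j)}$. Hence, defining

$$P_1(s,t,m,n) := \sum_{m \le i < j \le n} \frac{2}{j-m+1}\, f_{U_{(i)},U_{(j)}}(s,t), \tag{2.3}$$

$$P_2(s,t,m,n) := \sum_{1 \le i < m < j \le n} \frac{2}{j-i+1}\, f_{U_{(i)},U_{(j)}}(s,t), \tag{2.4}$$

$$P_3(s,t,m,n) := \sum_{1 \le i < j \le m} \frac{2}{m-i+1}\, f_{U_{(i)},U_{(j)}}(s,t), \tag{2.5}$$

$$P(s,t,m,n) := P_1(s,t,m,n) + P_2(s,t,m,n) + P_3(s,t,m,n) \tag{2.6}$$

[the sums in (2.3)–(2.5) are double sums over $i$ and $j$], and letting $\beta(s,t)$ denote the index of the first bit at which the keys $s$ and $t$ differ, we can write the expectation $\mu(m,n)$ of the number of bit comparisons required to find the rank-$m$ key in a file of $n$ keys as

$$\mu(m,n) = \int_0^1\!\int_s^1 \beta(s,t)\,P(s,t,m,n)\,dt\,ds \tag{2.7}$$

$$\phantom{\mu(m,n)} = \sum_{k=0}^{\infty}\sum_{l=1}^{2^k}\int_{(l-1)2^{-k}}^{(l-\frac12)2^{-k}}\int_{(l-\frac12)2^{-k}}^{l\,2^{-k}} (k+1)\,P(s,t,m,n)\,dt\,ds; \tag{2.8}$$

in this expression, note that $k$ represents the last bit at which $s$ and $t$ agree.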

## 3 Analysis of μ(1,n)

In Section 3.1, we derive the exact expression for $\mu(1,n)$ shown in Theorem 1.1. In Section 3.2, we prove the asymptotic result stated in Theorem 1.1.

### 3.1 Exact Computation of μ(1,n)

Since the contribution of $P_2(s,t,m,n)$ or $P_3(s,t,m,n)$ to $P(s,t,m,n)$ is zero for $m = 1$, we have $P(s,t,1,n) = P_1(s,t,1,n)$ [see (2.4) through (2.6)]. Let . Then

 P1(s,t,1,n) = zn∑1≤i

Making the change of variables and integrating, and recalling , we find, after some calculation,

$$P_1(s,t,1,n) = 2\sum_{j=2}^{n}(-1)^j\binom{n}{j}\,t^{j-2}. \tag{3.2}$$
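
Identity (3.2) can be checked exactly at rational points. The sketch below assumes that $P_1(s,t,1,n)$ is the double sum over $1 \le i < j \le n$ of the comparison probability $2/j$ [the case $m = 1$ of (2.1)] times the joint density (2.2); the helper names are ours.

```python
from fractions import Fraction
from math import comb, factorial

def joint_density(s, t, i, j, n):
    """Density (2.2) of the rank-i and rank-j keys at (s, t), for 0 < s < t < 1."""
    coeff = factorial(n) // (factorial(i - 1) * factorial(j - i - 1) * factorial(n - j))
    return coeff * s**(i - 1) * (t - s)**(j - i - 1) * (1 - t)**(n - j)

def p1_direct(s, t, n):
    """Double sum over 1 <= i < j <= n of (2/j) times the joint density."""
    return sum(Fraction(2, j) * joint_density(s, t, i, j, n)
               for j in range(2, n + 1) for i in range(1, j))

def p1_closed(t, n):
    """Right-hand side of (3.2); note that it does not depend on s."""
    return 2 * sum((-1)**j * comb(n, j) * t**(j - 2) for j in range(2, n + 1))

s, t = Fraction(1, 5), Fraction(2, 3)
print(all(p1_direct(s, t, n) == p1_closed(t, n) for n in range(2, 9)))
```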

From (2.8) and (3.2),

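$$\begin{aligned}
\mu(1,n) &= \sum_{k=0}^{\infty}(k+1)\sum_{l=1}^{2^k}\int_{(l-1)2^{-k}}^{(l-\frac12)2^{-k}}\int_{(l-\frac12)2^{-k}}^{l\,2^{-k}} P_1(s,t,1,n)\,dt\,ds \\
&= 2\sum_{k=0}^{\infty}(k+1)\sum_{l=1}^{2^k}\int_{(l-1)2^{-k}}^{(l-\frac12)2^{-k}}\int_{(l-\frac12)2^{-k}}^{l\,2^{-k}} \sum_{j=2}^{n}(-1)^j\binom{n}{j}\,t^{j-2}\,dt\,ds \\
&= 2\sum_{k=0}^{\infty}(k+1)\sum_{l=1}^{2^k}\sum_{j=2}^{n}(-1)^j\binom{n}{j}\int_{(l-\frac12)2^{-k}}^{l\,2^{-k}} t^{j-2}\left[\left(l-\tfrac12\right)2^{-k}-(l-1)2^{-k}\right]dt \\
&= \sum_{k=0}^{\infty}(k+1)\sum_{l=1}^{2^k}\sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1}\,2^{-k}\left\{\left(l\,2^{-k}\right)^{j-1}-\left[\left(l-\tfrac12\right)2^{-k}\right]^{j-1}\right\} \\
&= \sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1}\sum_{k=0}^{\infty}(k+1)\,2^{-kj}\sum_{l=1}^{2^k}\left[l^{\,j-1}-\left(l-\tfrac12\right)^{j-1}\right].
\end{aligned} \tag{3.3}$$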

To further transform (3.3), define

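$$a_{j,r} = \begin{cases}
\dfrac{B_r}{r}\dbinom{j-1}{r-1} & \text{if } r \ge 2, \\[2ex]
\dfrac12 & \text{if } r = 1, \\[2ex]
\dfrac1j & \text{if } r = 0,
\end{cases} \tag{3.4}$$

where $B_r$ denotes the $r$-th Bernoulli number. Let $S_{n,j} := \sum_{l=1}^{n} l^{\,j-1}$. Then $S_{n,j} = \sum_{r=0}^{j-1} a_{j,r}\,n^{j-r}$ (see Knuth [12]), and

$$\begin{aligned}
\sum_{l=1}^{2^k}\left[l^{\,j-1}-\left(l-\tfrac12\right)^{j-1}\right]
&= S_{2^k,j} - 2^{-(j-1)}\sum_{l=1}^{2^k}(2l-1)^{j-1} \\
&= S_{2^k,j} - 2^{-(j-1)}\left(S_{2^{k+1},j} - 2^{\,j-1}S_{2^k,j}\right) = 2S_{2^k,j} - 2^{-(j-1)}S_{2^{k+1},j} \\
&= 2\sum_{r=0}^{j-1}a_{j,r}\,2^{k(j-r)} - 2^{-(j-1)}\sum_{r=0}^{j-1}a_{j,r}\,2^{(k+1)(j-r)} = 2\sum_{r=1}^{j-1}a_{j,r}\,2^{k(j-r)}\left(1-2^{-r}\right).
\end{aligned} \tag{3.5}$$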
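
Both the power-sum formula and the final form of (3.5) can be confirmed with exact arithmetic for small parameters. In this sketch we take $S_{n,j}$ to mean $\sum_{l=1}^{n} l^{\,j-1}$ and the coefficients $a_{j,r}$ to be those of (3.4); the function names are ours.

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(limit):
    """Return [B_0, ..., B_limit] as exact fractions (convention B_1 = -1/2)."""
    B = [Fraction(1)]
    for m in range(1, limit + 1):
        B.append(-sum(comb(m + 1, k) * B[k] for k in range(m)) / (m + 1))
    return B

def a(j, r, B):
    """Coefficient a_{j,r} from (3.4)."""
    if r == 0:
        return Fraction(1, j)
    if r == 1:
        return Fraction(1, 2)
    return B[r] / r * comb(j - 1, r - 1)

def power_sum_closed(n, j, B):
    """sum_{r=0}^{j-1} a_{j,r} n^(j-r), to be compared with sum_{l=1}^{n} l^(j-1)."""
    return sum(a(j, r, B) * n**(j - r) for r in range(j))

def lhs_3_5(k, j):
    """sum_{l=1}^{2^k} [ l^(j-1) - (l - 1/2)^(j-1) ]."""
    half = Fraction(1, 2)
    return sum(Fraction(l)**(j - 1) - (l - half)**(j - 1) for l in range(1, 2**k + 1))

def rhs_3_5(k, j, B):
    """2 * sum_{r=1}^{j-1} a_{j,r} 2^(k(j-r)) (1 - 2^(-r))."""
    return 2 * sum(a(j, r, B) * 2**(k * (j - r)) * (1 - Fraction(1, 2**r))
                   for r in range(1, j))

B = bernoulli_numbers(8)
print(power_sum_closed(10, 3, B), sum(l**2 for l in range(1, 11)))  # both 385
```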

From (3.3) and (3.5),

$$\mu(1,n) = 2\sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1}\sum_{k=0}^{\infty}(k+1)\,2^{-kj}\sum_{r=1}^{j-1}a_{j,r}\,2^{k(j-r)}\left(1-2^{-r}\right).$$

Here

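$$\begin{aligned}
\sum_{k=0}^{\infty}(k+1)\,2^{-kj}\sum_{r=1}^{j-1}a_{j,r}\,2^{k(j-r)}\left(1-2^{-r}\right)
&= \sum_{k=0}^{\infty}(k+1)\sum_{r=1}^{j-1}a_{j,r}\,2^{-kr}\left(1-2^{-r}\right) \\
&= \sum_{r=1}^{j-1}a_{j,r}\left(1-2^{-r}\right)\sum_{k=0}^{\infty}(k+1)\,2^{-kr} = \sum_{r=1}^{j-1}a_{j,r}\left(1-2^{-r}\right)^{-1}.
\end{aligned}$$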

Hence

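$$\begin{aligned}
\mu(1,n) &= 2\sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1}\sum_{r=1}^{j-1}a_{j,r}\left(1-2^{-r}\right)^{-1} = 2\sum_{r=1}^{n-1}\left(1-2^{-r}\right)^{-1}\sum_{j=r+1}^{n}\frac{(-1)^j\binom{n}{j}}{j-1}\,a_{j,r} \\
&= 2\sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1} + 2\sum_{r=2}^{n-1}\left(1-2^{-r}\right)^{-1}\frac{B_r}{r}\sum_{j=r+1}^{n}\frac{(-1)^j\binom{n}{j}\binom{j-1}{r-1}}{j-1} \\
&= 2\sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1} + 2\sum_{r=2}^{n-1}\left(1-2^{-r}\right)^{-1}\frac{B_r}{r}\left[\sum_{j=r}^{n}\frac{(-1)^j\binom{n}{j}\binom{j-1}{r-1}}{j-1} - \frac{(-1)^r\binom{n}{r}}{r-1}\right].
\end{aligned}$$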

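To simplify $\sum_{j=r}^{n}(-1)^j\binom{n}{j}\binom{j-1}{r-1}\big/(j-1)$, note that

$$\begin{aligned}
\sum_{j=r}^{n}\binom{n}{j}\binom{j-1}{r-1}z^{j-2}
&= r\,\frac{n!}{(n-r)!\,r!}\sum_{j=r}^{n}\frac{(n-r)!}{j\,(n-j)!\,(j-r)!}\,z^{j-2} \\
&= r\binom{n}{r}z^{-2}\sum_{j=r}^{n}\binom{n-r}{j-r}\frac{z^{j}}{j} = r\binom{n}{r}z^{-2}\sum_{j=0}^{n-r}\binom{n-r}{j}\frac{z^{j+r}}{j+r} \\
&= r\binom{n}{r}z^{-2}\int_0^z \zeta^{r-1}\sum_{j=0}^{n-r}\binom{n-r}{j}\zeta^{j}\,d\zeta = r\binom{n}{r}z^{-2}\int_0^z \zeta^{r-1}(1+\zeta)^{n-r}\,d\zeta.
\end{aligned}$$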

Thus

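$$\begin{aligned}
\sum_{j=r}^{n}\frac{(-1)^j\binom{n}{j}\binom{j-1}{r-1}}{j-1}
&= \int_{-1}^{0}\left[\sum_{j=r}^{n}\binom{n}{j}\binom{j-1}{r-1}z^{j-2}\right]dz \\
&= -r\binom{n}{r}\int_{-1}^{0}z^{-2}\int_{z}^{0}\zeta^{r-1}(1+\zeta)^{n-r}\,d\zeta\,dz = -r\binom{n}{r}\int_{-1}^{0}\zeta^{r-1}(1+\zeta)^{n-r}\int_{-1}^{\zeta}z^{-2}\,dz\,d\zeta \\
&= r\binom{n}{r}\int_{-1}^{0}\zeta^{r-2}(1+\zeta)^{n-r+1}\,d\zeta = (-1)^r\,r\binom{n}{r}\int_{0}^{1}u^{r-2}(1-u)^{n-r+1}\,du \\
&= (-1)^r\,r\binom{n}{r}\frac{\Gamma(r-1)\,\Gamma(n-r+2)}{\Gamma(n+1)} = (-1)^r\,\frac{n-r+1}{r-1}.
\end{aligned} \tag{3.7}$$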
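
The closed form in (3.7) is a finite identity in $n$ and $r$ (with $r \ge 2$), so it can be spot-checked directly. The short sketch below does this with exact fractions; the function names are ours.

```python
from fractions import Fraction
from math import comb

def lhs_3_7(n, r):
    """sum_{j=r}^{n} (-1)^j C(n, j) C(j-1, r-1) / (j-1), for 2 <= r <= n."""
    return sum(Fraction((-1)**j * comb(n, j) * comb(j - 1, r - 1), j - 1)
               for j in range(r, n + 1))

def rhs_3_7(n, r):
    """(-1)^r (n - r + 1) / (r - 1)."""
    return Fraction((-1)**r * (n - r + 1), r - 1)

print(lhs_3_7(3, 2), rhs_3_7(3, 2))  # both 2
```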

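Plugging (3.7) into the last expression for $\mu(1,n)$ above and recalling $B_r = 0$ for odd $r \ge 3$, we finally obtain

$$\begin{aligned}
\mu(1,n) &= 2\sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1} + 2\sum_{r=2}^{n-1}\left(1-2^{-r}\right)^{-1}\frac{B_r}{r}\left[(-1)^r\,\frac{n-r+1}{r-1} - \frac{(-1)^r\binom{n}{r}}{r-1}\right] \\
&= 2\sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1} + 2\sum_{j=2}^{n-1}B_j\,\frac{n-j+1-\binom{n}{j}}{j(j-1)(1-2^{-j})} \\
&= 2n(H_n-1) + 2t_n,
\end{aligned} \tag{3.8}$$

where $H_n$ denotes the $n$-th harmonic number and

$$t_n := \sum_{j=2}^{n-1}\frac{B_j}{j(1-2^{-j})}\left[\frac{n-\binom{n}{j}}{j-1}-1\right]. \tag{3.9}$$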

The last equality in (3.8) follows from the easy identity

$$\sum_{k=1}^{n}\frac{(-1)^{k-1}\binom{n}{k}}{k} = H_n.$$
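
For completeness, here is one way to see how this identity turns the first sum in (3.8) into $2n(H_n-1)$. This is our own short derivation, not taken from the source. It uses $\binom{n}{j} = \frac{n}{j}\binom{n-1}{j-1}$, the substitution $k = j-1$, the identity above applied with $n-1$ in place of $n$, and $\sum_{j=0}^{n}(-1)^j\binom{n}{j} = 0$, so that $\sum_{j=2}^{n}(-1)^j\binom{n}{j} = n-1$:

$$\begin{aligned}
\sum_{j=2}^{n}\frac{(-1)^j\binom{n}{j}}{j-1}
&= n\sum_{k=1}^{n-1}(-1)^{k-1}\binom{n-1}{k}\left[\frac{1}{k}-\frac{1}{k+1}\right] \\
&= n\left[H_{n-1} - \frac{1}{n}\sum_{j=2}^{n}(-1)^{j}\binom{n}{j}\right] \\
&= n\left[H_{n-1} - \frac{n-1}{n}\right] = n(H_n - 1).
\end{aligned}$$

Multiplying by 2 gives the term $2n(H_n-1)$ in the last line of (3.8).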

### 3.2 Asymptotic Analysis of μ(1,n)

In order to obtain an asymptotic expression for $\mu(1,n)$, we analyze $t_n$ in (3.8)–(3.9). The following lemma provides an exact expression for $t_n$ that easily leads to an asymptotic expression for $\mu(1,n)$:

###### Lemma 3.1.

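For $n \ge 2$, let $u_n := t_{n+1} - t_n$ and $v_n := u_{n+1} - u_n$. Let $\gamma$ denote Euler's constant ($\doteq 0.57722$), and define $\chi_k := \frac{2\pi i k}{\ln 2}$. Then

1.  $$v_n = -\frac{1}{n+1} + \frac{\dfrac{H_{n+2}}{\ln 2} - \left(\dfrac{\gamma}{\ln 2} - \dfrac12\right)}{(n+1)(n+2)} - \Sigma_n,$$

    where

    $$\Sigma_n := \sum_{k\in\mathbb{Z}\setminus\{0\}} \frac{\zeta(1-\chi_k)\,\Gamma(n+1)\,\Gamma(1-\chi_k)}{(\ln 2)\,\Gamma(n+3-\chi_k)};$$
2.  $$u_n = -H_n + a - \frac{H_{n+1}}{(\ln 2)(n+1)} + \left(\frac{\gamma-1}{\ln 2} - \frac12\right)\frac{1}{n+1} + \tilde{\Sigma}_n,$$

    where

    $$a := \frac{14}{9} + \frac{17-6\gamma}{18\ln 2} - \frac{2}{\ln 2}\sum_{k\in\mathbb{Z}\setminus\{0\}} \frac{\zeta(1-\chi_k)\,\Gamma(1-\chi_k)}{\Gamma(4-\chi_k)\,(1-\chi_k)}, \qquad
    \tilde{\Sigma}_n := \sum_{k\in\mathbb{Z}\setminus\{0\}} \frac{\zeta(1-\chi_k)\,\Gamma(1-\chi_k)}{(\ln 2)(1-\chi_k)}\,\frac{\Gamma(n+1)}{\Gamma(n+2-\chi_k)};$$
3.  $$t_n = -(nH_n - n - 1) + a(n-2) - \frac{1}{2\ln 2}\left[H_n^2 + H_n^{(2)} - \frac72\right] + \left(\frac{\gamma-1}{\ln 2} - \frac12\right)\left(H_n - \frac32\right) + b - \tilde{\tilde{\Sigma}}_n,$$

    where

    $$b := \sum_{k\in\mathbb{Z}\setminus\{0\}} \frac{2\,\zeta(1-\chi_k)\,\Gamma(-\chi_k)}{(\ln 2)(1-\chi_k)\,\Gamma(3-\chi_k)}, \qquad
    \tilde{\tilde{\Sigma}}_n := \sum_{k\in\mathbb{Z}\setminus\{0\}} \frac{\zeta(1-\chi_k)\,\Gamma(-\chi_k)\,\Gamma(n+1)}{(\ln 2)(1-\chi_k)\,\Gamma(n+1-\chi_k)},$$

    and $H_n^{(2)}$ denotes the $n$-th Harmonic number of order 2, i.e., $H_n^{(2)} := \sum_{i=1}^{n}\frac{1}{i^2}$.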

In this lemma, $u_n$ and $v_n$ are derived in order to obtain the exact expression for $t_n$ in (iii). From (3.8), the exact expression for $t_n$ also provides an alternative exact expression for $\mu(1,n)$.
Before proving Lemma 3.1, we complete the proof of Theorem 1.1 using part (iii). We know

$$H_n = \ln n + \gamma + \frac{1}{2n} - \frac{1}{12n^2} + O(n^{-4}), \tag{3.10}$$

$$H_n^{(2)} = \frac{\pi^2}{6} - \frac{1}{n} + \frac{1}{2n^2} + O(n^{-3}). \tag{3.11}$$

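Combining (3.10)–(3.11) with (3.8) and Lemma 3.1(iii), we obtain an asymptotic expression for $\mu(1,n)$:

$$\mu(1,n) = 2an - \frac{1}{\ln 2}(\ln n)^2 - \left(\frac{2}{\ln 2}+1\right)\ln n + O(1). \tag{3.12}$$

The $O(1)$ term in (3.12) has fluctuations of small magnitude due to $\tilde{\tilde{\Sigma}}_n$, which is periodic in $\log n$ with amplitude smaller than 0.00110. The asymptotic slope in (3.12) is

$$c = 2a = \frac{28}{9} + \frac{17-6\gamma}{9\ln 2} - \frac{4}{\ln 2}\sum_{k\in\mathbb{Z}\setminus\{0\}} \frac{\zeta(1-\chi_k)\,\Gamma(1-\chi_k)}{\Gamma(4-\chi_k)\,(1-\chi_k)} \doteq 5.27938. \tag{3.13}$$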
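
As a rough numerical illustration of (3.12), one can subtract the three explicit terms from the exact value of $\mu(1,n)$ and observe that the remainder stays bounded. The sketch below assumes the exact formula of Theorem 1.1 and the rounded slope $c \doteq 5.27938$ from (3.13); the function names are ours, and because $c$ is rounded the comparison is only meaningful for moderate $n$.

```python
from fractions import Fraction
from math import comb, log

def bernoulli_numbers(limit):
    """Return [B_0, ..., B_limit] as exact fractions (convention B_1 = -1/2)."""
    B = [Fraction(1)]
    for m in range(1, limit + 1):
        B.append(-sum(comb(m + 1, k) * B[k] for k in range(m)) / (m + 1))
    return B

def mu_smallest(n):
    """Exact expression of Theorem 1.1 for mu(1, n), valid for n >= 2."""
    B = bernoulli_numbers(n)
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
    tail = sum(
        B[j] * (n - j + 1 - comb(n, j)) / (j * (j - 1) * (1 - Fraction(1, 2**j)))
        for j in range(2, n)
    )
    return 2 * n * (harmonic - 1) + 2 * tail

C_SLOPE = 5.27938  # rounded value of c quoted in (3.13)

def remainder(n):
    """mu(1, n) minus the three explicit terms of (3.12); expected to stay O(1)."""
    ln_n = log(n)
    main = C_SLOPE * n - ln_n**2 / log(2) - (2 / log(2) + 1) * ln_n
    return float(mu_smallest(n)) - main

for n in (25, 50, 100):
    print(n, remainder(n))
```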

Now we prove Lemma 3.1:

###### Proof.

(i) Since

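$$\begin{aligned}
u_n = t_{n+1} - t_n &= \sum_{j=2}^{n}\frac{B_j}{j(1-2^{-j})}\left[\frac{(n+1)-\binom{n+1}{j}}{j-1}-1\right] - \sum_{j=2}^{n-1}\frac{B_j}{j(1-2^{-j})}\left[\frac{n-\binom{n}{j}}{j-1}-1\right] \\
&= -\sum_{j=2}^{n}\frac{B_j}{j(j-1)(1-2^{-j})}\left[\binom{n}{j-1}-1\right],
\end{aligned}$$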

it follows that

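$$\begin{aligned}
v_n = u_{n+1} - u_n &= -\sum_{j=2}^{n+1}\frac{B_j}{j(j-1)(1-2^{-j})}\left[\binom{n+1}{j-1}-1\right] + \sum_{j=2}^{n+1}\frac{B_j}{j(j-1)(1-2^{-j})}\left[\binom{n}{j-1}-1\right] \\
&= -\sum_{k=0}^{n-1}\binom{n}{k}\frac{B_{k+2}}{(k+2)(k+1)\left[1-2^{-(k+2)}\right]} \\
&= \sum_{k=0}^{n-1}(-1)^k\binom{n}{k}\frac{\zeta(-1-k)}{(k+1)\left[1-2^{-(k+2)}\right]}
\end{aligned} \tag{3.14}$$

$$\phantom{v_n} = \frac{(-1)^n}{2\pi i}\int_{C}\frac{\zeta(-1-s)}{(s+1)\left[1-2^{-(s+2)}\right]}\,\frac{n!}{s(s-1)\cdots(s-n)}\,ds, \tag{3.15}$$

where $C$ is a positively oriented closed curve that encircles the integers $0, \dots, n-1$ and does not include or encircle any other singularity of the integrand, namely $s = n$, $s = -1$, and $s = -2+\chi_k$ ($k \in \mathbb{Z}$). Equality (3.14) follows from the fact that the Bernoulli numbers are extrapolated by the Riemann zeta function taken at nonnegative integers: $B_m = -m\,\zeta(1-m)$. [The coefficients $(-1)^k$ do not concern us since the Bernoulli numbers of odd index greater than 1 vanish.] Equality (3.15) follows from a direct application of residue calculus, taking into account contributions of the simple poles at the integers $0, \dots, n-1$.
Let $\phi(s)$ denote the integrand in (3.15):

$$\phi(s) = \frac{\zeta(-1-s)}{(s+1)\left[1-2^{-(s+2)}\right]}\,\frac{n!}{s(s-1)\cdots(s-n)}.$$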

We consider a positively oriented rectangular contour with horizontal sides and , where , , and vertical sides and , where . By elementary bounds on along and the fact that

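$$\int_{n-\theta-i\infty}^{n-\theta+i\infty}\phi(s)\,ds = 0 \tag{3.16}$$

(this is implicit on page 113 of Flajolet and Sedgewick [5] and explicitly proved in the Appendix), one can show that

$$\lim_{l\to\infty}\int_{C_l}\phi(s)\,ds = 0.$$

Accounting for residues due to the poles encircled by $C_l$, we obtain

$$\begin{aligned}
v_n &= (-1)^{n+1}\left\{\operatorname*{Res}_{s=-1}[\phi(s)] + \operatorname*{Res}_{s=-2}[\phi(s)] + \sum_{k\in\mathbb{Z}\setminus\{0\}}\operatorname*{Res}_{s=-2+\chi_k}[\phi(s)]\right\} \\
&= -\frac{1}{n+1} + \frac{\dfrac{H_{n+2}}{\ln 2} - \left(\dfrac{\gamma}{\ln 2} - \dfrac12\right)}{(n+1)(n+2)} - \Sigma_n,
\end{aligned} \tag{3.17}$$

where

$$\Sigma_n := \sum_{k\in\mathbb{Z}\setminus\{0\}} \frac{\zeta(1-\chi_k)\,\Gamma(n+1)\,\Gamma(1-\chi_k)}{(\ln 2)\,\Gamma(n+3-\chi_k)}. \tag{3.18}$$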
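
A quick numerical comparison supports the closed form in (3.17). The sketch below computes $v_n$ exactly from the definition of $t_n$ in (3.9) and compares it with the non-oscillating part of (3.17), read here as $-\frac{1}{n+1} + \left[\frac{H_{n+2}}{\ln 2} - \left(\frac{\gamma}{\ln 2} - \frac12\right)\right]\big/\big((n+1)(n+2)\big)$. The function names and the tolerance are our own choices; the small gap that remains is attributed to $\Sigma_n$, which the code does not evaluate.

```python
from fractions import Fraction
from math import comb, log

EULER_GAMMA = 0.5772156649015329

def bernoulli_numbers(limit):
    """Return [B_0, ..., B_limit] as exact fractions (convention B_1 = -1/2)."""
    B = [Fraction(1)]
    for m in range(1, limit + 1):
        B.append(-sum(comb(m + 1, k) * B[k] for k in range(m)) / (m + 1))
    return B

def t(n, B):
    """t_n as defined in (3.9)."""
    return sum(B[j] / (j * (1 - Fraction(1, 2**j))) * (Fraction(n - comb(n, j), j - 1) - 1)
               for j in range(2, n))

def v_exact(n, B):
    """v_n = u_{n+1} - u_n with u_n = t_{n+1} - t_n (second difference of t)."""
    return t(n + 2, B) - 2 * t(n + 1, B) + t(n, B)

def v_smooth(n):
    """Right-hand side of (3.17) with the oscillating sum Sigma_n omitted."""
    harmonic = sum(1 / i for i in range(1, n + 3))  # H_{n+2}
    numerator = harmonic / log(2) - (EULER_GAMMA / log(2) - 0.5)
    return -1 / (n + 1) + numerator / ((n + 1) * (n + 2))

B = bernoulli_numbers(25)
for n in (2, 3, 5, 10, 20):
    print(n, float(v_exact(n, B)), v_smooth(n))
```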

(ii) We have $u_2 = -\frac{1}{9}$. Hence, from (i),

 un = u2+n−1∑j=2vj=−19+n−1∑j=2