Analysis of the Expected Number of Bit Comparisons
Required by Quickselect
James Allen Fill¹
¹Research for both authors supported by NSF grant DMS–0406104, and by The Johns Hopkins University’s Acheson J. Duncan Fund for the Advancement of Research in Statistics.
Department of Applied Mathematics and Statistics
The Johns Hopkins University
jimfill@jhu.edu and http://www.ams.jhu.edu/~fill/
and
Take Nakama
Department of Applied Mathematics and Statistics
The Johns Hopkins University
nakama@jhu.edu and http://www.ams.jhu.edu/~nakama/
ABSTRACT
When algorithms for sorting and searching are applied to keys that are represented as bit strings, we can quantify the performance of the algorithms not only in terms of the number of key comparisons required by the algorithms but also in terms of the number of bit comparisons. Some of the standard sorting and searching algorithms have been analyzed with respect to key comparisons but not with respect to bit comparisons. In this paper, we investigate the expected number of bit comparisons required by Quickselect (also known as Find). We develop exact and asymptotic formulae for the expected number of bit comparisons required to find the smallest or largest key by Quickselect and show that the expectation is asymptotically linear with respect to the number of keys. Similar results are obtained for the average case. For finding keys of arbitrary rank, we derive an exact formula for the expected number of bit comparisons that (using rational arithmetic) requires only finite summation (rather than such operations as numerical integration) and use it to compute the expectation for each target rank.
AMS 2000 subject classifications. Primary 68W40; secondary 68P10, 60C05.
Key words and phrases. Quickselect, Find, searching algorithms, asymptotics, average-case analysis, key comparisons, bit comparisons.
Date. June 15, 2007.
1 Introduction and Summary
When an algorithm for sorting or searching is analyzed, the
algorithm is usually regarded either as comparing keys pairwise
irrespective of the keys’ internal structure or as operating on
representations (such as bit strings) of keys. In the former case,
analyses often quantify the performance of the algorithm in terms of
the number of key comparisons required to accomplish the task;
Quickselect (also known as Find) is an example of those algorithms
that have been studied from this point of view. In the latter case,
if keys are represented as bit strings, then analyses quantify the
performance of the algorithm in terms of the number of bits compared
until it completes its task. Digital search trees, for example,
have been examined from this
perspective.
In order to fully quantify the performance of a sorting
or searching algorithm and enable comparison between key-based and
digital algorithms, it is ideal to analyze the algorithm from both
points of view. However, to date, only Quicksort has been analyzed
with both approaches; see Fill and Janson [3]. Before
their study, Quicksort had been extensively examined with regard to
the number of key comparisons performed by the algorithm (e.g.,
Knuth [12], Régnier [17], Rösler
[18], Knessl and Szpankowski [9], Fill and Janson
[2], Neininger and Rüschendorf [16]), but it
had not been examined with regard to the number of bit comparisons
in sorting keys represented as bit strings. In their study, Fill and
Janson assumed that keys are independently and uniformly distributed
over (0,1) and that the keys are represented as bit strings. [They
also conducted the analysis for a general absolutely continuous
distribution over (0,1).] They showed that the expected number of
bit comparisons required to sort $n$ keys is asymptotically
equivalent to $n (\ln n)(\lg n)$, as compared to the lead-order term
of the expected number of key comparisons, which is
asymptotically $2 n \ln n$. We use ln and lg to denote natural and
binary logarithms, respectively, and use log when the base does not
matter (for example, in remainder estimates).
In this paper, we investigate the expected number of bit
comparisons required by Quickselect. Hoare [7] introduced
this search algorithm, which is treated in most textbooks on
algorithms and data structures. Quickselect selects the $m$-th
smallest key (we call it the rank-$m$ key) from a set of $n$
distinct keys. (The keys are typically assumed to be distinct, but
the algorithm still works, with a minor adjustment, even if they
are not distinct.) The algorithm finds the target key in a recursive
and random fashion. First, it selects a pivot uniformly at random
from the $n$ keys. Let $k$ denote the rank of the pivot. If $k = m$,
then the algorithm returns the pivot. If $k > m$, then the
algorithm recursively operates on the set of keys smaller than the
pivot and returns the rank-$m$ key. Similarly, if $k < m$, then the
algorithm recursively operates on the set of keys larger than the
pivot and returns the $(m-k)$-th smallest key from the subset.
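The selection procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name `quickselect` and the `rng` parameter are choices made here, the keys are assumed distinct, and the recursion is written as a loop.

```python
import random


def quickselect(keys, m, rng=random):
    """Return the rank-m (m-th smallest, 1-indexed) key among distinct keys."""
    keys = list(keys)
    if not 1 <= m <= len(keys):
        raise ValueError("m must satisfy 1 <= m <= len(keys)")
    while True:
        # Choose the pivot uniformly at random from the current set of keys.
        pivot = rng.choice(keys)
        smaller = [x for x in keys if x < pivot]
        larger = [x for x in keys if x > pivot]
        k = len(smaller) + 1  # rank of the pivot within the current set
        if k == m:
            return pivot
        if k > m:
            # The target lies among the smaller keys and keeps rank m.
            keys = smaller
        else:
            # The target lies among the larger keys, with rank m - k there.
            keys, m = larger, m - k


data = [0.42, 0.07, 0.93, 0.55, 0.18]
print(quickselect(data, 2))  # second-smallest key
```

The sketch compares whole keys; counting the bits examined in each comparison is a separate matter.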
Although previous studies (e.g., Knuth [10], Mahmoud
et al. [14], Grübel and Rösler
[6], Lent and Mahmoud [13], Mahmoud and Smythe
[15], Devroye [1], Hwang and Tsai [8])
examined Quickselect with regard to key comparisons, this study is
the first to analyze
the bit complexity of the algorithm.
We suppose that the algorithm is applied to $n$ distinct
keys that are represented as bit strings and that the algorithm
operates on individual bits in order to find a target key. We also
assume that the keys are uniformly and independently distributed
in $(0,1)$. For instance, consider applying Quickselect to find the
smallest key among three keys $k_1$, $k_2$, and $k_3$ whose binary
representations are .01001100…, .00110101…, and .00101010…,
respectively. If the algorithm selects $k_3$ as a pivot, then it
compares each of $k_1$ and $k_2$ to $k_3$ in order to determine the
rank of $k_3$. When $k_1$ and $k_3$ are compared, the algorithm
requires 2 bit comparisons to determine that $k_3$ is smaller than
$k_1$ because the two keys have the same first digit and differ at
the second digit. Similarly, when $k_2$ and $k_3$ are compared, the
algorithm requires 4 bit comparisons to determine that $k_3$ is
smaller than $k_2$. After these comparisons, key $k_3$ has been
identified as smallest. Hence the search for the smallest key
requires a total of 6 bit comparisons (resulting from the two key
comparisons).
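The count in this example can be reproduced with a short Python sketch. The helper `bit_comparisons` and the labels `k1`, `k2`, `k3` are names chosen here for illustration. Each key is given as the string of its first eight bits, and the number of bit comparisons for a pair is taken to be the (1-indexed) position of the first bit at which the two keys differ.

```python
def bit_comparisons(a: str, b: str) -> int:
    """Bits examined to order two keys: position of the first differing bit."""
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return position
    raise ValueError("the keys agree on every available bit")


# First eight bits of the three keys in the example above.
k1, k2, k3 = "01001100", "00110101", "00101010"

# The third key is the pivot, so it is compared with each of the other two.
total = bit_comparisons(k1, k3) + bit_comparisons(k2, k3)
print(bit_comparisons(k1, k3), bit_comparisons(k2, k3), total)  # 2 4 6
```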
We let $\mu(m,n)$ denote the expected number of bit
comparisons required to find the rank-$m$ key in a file of $n$ keys
by Quickselect. By symmetry, $\mu(m,n) = \mu(n+1-m,n)$. First, we
develop exact and asymptotic formulae for $\mu(1,n) = \mu(n,n)$, the
expected number of bit comparisons required to find the smallest key
by Quickselect, as summarized in the following theorem.
Theorem 1.1.
The expected number of bit comparisons required by Quickselect to find the smallest key in a file of $n$ keys that are independently and uniformly distributed in $(0,1)$ has the following exact and asymptotic expressions:
where and denote harmonic and Bernoulli numbers, respectively, and, with and , we define
The asymptotic formula shows that the expected number of bit
comparisons is asymptotically linear in $n$ with the lead-order
coefficient approximately equal to 5.27938. Hence the expected
number of bit comparisons is asymptotically different from that of
key comparisons required to find the smallest key only by a
constant factor (the expectation for key comparisons is
asymptotically $2n$). Complex-analytical methods are utilized to
obtain the asymptotic formula. Details of the derivations of the
formulae are
described in Section 3.
We also derive exact and asymptotic expressions for the
expected number of bit comparisons for the average case. We denote
this expectation by $\mu(\bar{m},n)$. In the average case, the
parameter $m$ in $\mu(m,n)$ is considered a discrete uniform random
variable; hence $\mu(\bar{m},n) = \frac{1}{n} \sum_{m=1}^{n} \mu(m,n)$.
The derived asymptotic formula shows that
$\mu(\bar{m},n)$ is also asymptotically linear in $n$; see
(4.48). More detailed results
for $\mu(\bar{m},n)$ are described in Section 4.
Lastly, in Section 5, we derive an exact
expression of $\mu(m,n)$ for each fixed $m$ that is suited for
computations. Our preliminary exact formula for $\mu(m,n)$ [shown
in (2.8)] entails infinite summation and integration. As a
result, it is not a desirable form for numerically computing the
expected number of bit comparisons. Hence we establish another exact
formula that only requires finite summation and use it to compute
$\mu(m,n)$ over a range of values of $m$ and $n$. The computation
leads to the following conjectures: (i) for fixed $n$, $\mu(m,n)$
increases in $m$ for $m \le (n+1)/2$ and is symmetric about
$(n+1)/2$; and (ii) for fixed $m$, $\mu(m,n)$ increases in $n$
(asymptotically linearly).
2 Preliminaries
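To investigate the bit complexity of Quickselect, we follow the general approach developed by Fill and Janson [3]. Let $U_1, \dots, U_n$ denote the $n$ keys uniformly and independently distributed on (0, 1), and let $U_{(i)}$ denote the rank-$i$ key. Then, for $1 \le i < j \le n$ (assume $n \ge 2$),
$$
P\{U_{(i)} \text{ and } U_{(j)} \text{ are compared}\} =
\begin{cases}
\dfrac{2}{j-m+1} & \text{if } m \le i, \\[2ex]
\dfrac{2}{j-i+1} & \text{if } i < m < j, \\[2ex]
\dfrac{2}{m-i+1} & \text{if } j \le m.
\end{cases}
\tag{2.1}
$$
To determine the first probability in (2.1), note that
$U_{(m)}, \dots, U_{(j)}$ remain in the same subset until the first
time that one of them is chosen as a pivot. Therefore, $U_{(i)}$ and
$U_{(j)}$ are compared if and only if the first pivot chosen from
$U_{(m)}, \dots, U_{(j)}$ is either $U_{(i)}$ or $U_{(j)}$. Analogous
arguments establish the other two cases.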
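The three cases can be checked numerically with exact rational arithmetic. The sketch below assumes that the probabilities in (2.1) take the standard form $2/(j-m+1)$, $2/(j-i+1)$, and $2/(m-i+1)$ for $m \le i$, $i < m < j$, and $j \le m$, respectively: two favourable pivots out of the keys that must stay together. The function names are chosen here for illustration. Summing the probabilities over all pairs gives the expected number of key comparisons.

```python
from fractions import Fraction


def compare_probability(i: int, j: int, m: int) -> Fraction:
    """Probability that the rank-i and rank-j keys are compared (i < j)
    when Quickselect searches for the rank-m key."""
    if not 1 <= i < j:
        raise ValueError("ranks must satisfy 1 <= i < j")
    if m <= i:
        return Fraction(2, j - m + 1)  # ranks m..j stay together
    if m < j:
        return Fraction(2, j - i + 1)  # i < m < j: ranks i..j stay together
    return Fraction(2, m - i + 1)      # j <= m: ranks i..m stay together


def expected_key_comparisons(m: int, n: int) -> Fraction:
    """Sum of the comparison probabilities over all pairs 1 <= i < j <= n."""
    return sum(
        (compare_probability(i, j, m)
         for i in range(1, n + 1) for j in range(i + 1, n + 1)),
        Fraction(0),
    )


print(expected_key_comparisons(1, 3))  # 7/3
```

For $m = 1$ the sum collapses to $2(n - H_n)$, where $H_n$ is the $n$-th harmonic number. This agrees with the lead-order behaviour of about $2n$ key comparisons for finding the smallest key.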
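For $1 \le i < j \le n$, it is well known that the joint
density function of $U_{(i)}$ and $U_{(j)}$ is given by
$$
f_{U_{(i)},U_{(j)}}(s,t) := \frac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}\; s^{i-1} (t-s)^{j-i-1} (1-t)^{n-j}, \qquad 0 < s < t < 1.
\tag{2.2}
$$
Clearly, the event that $U_{(i)}$ and $U_{(j)}$ are compared is independent of the random variables $U_{(i)}$ and $U_{(j)}$. Hence, defining
$$
\begin{aligned}
P_1(s,t,m,n) &:= \sum_{m \le i < j \le n} \frac{2}{j-m+1}\, f_{U_{(i)},U_{(j)}}(s,t), &\quad (2.3)\\
P_2(s,t,m,n) &:= \sum_{1 \le i < m < j \le n} \frac{2}{j-i+1}\, f_{U_{(i)},U_{(j)}}(s,t), &\quad (2.4)\\
P_3(s,t,m,n) &:= \sum_{1 \le i < j \le m} \frac{2}{m-i+1}\, f_{U_{(i)},U_{(j)}}(s,t), &\quad (2.5)\\
P(s,t,m,n) &:= P_1(s,t,m,n) + P_2(s,t,m,n) + P_3(s,t,m,n) &\quad (2.6)
\end{aligned}
$$
[the sums in (2.3)–(2.5) are double sums over $i$ and $j$], and letting $\beta(s,t)$ denote the index of the first bit at which the keys $s$ and $t$ differ, we can write the expectation $\mu(m,n)$ of the number of bit comparisons required to find the rank-$m$ key in a file of $n$ keys as
$$
\begin{aligned}
\mu(m,n) &= \int_{0}^{1} \int_{s}^{1} \beta(s,t)\, P(s,t,m,n)\, dt\, ds &\quad (2.7)\\
&= \sum_{k=0}^{\infty} \sum_{l=1}^{2^k} \int_{s=(l-1)2^{-k}}^{(l-\frac{1}{2})2^{-k}} \int_{t=(l-\frac{1}{2})2^{-k}}^{l\,2^{-k}} (k+1)\, P(s,t,m,n)\, dt\, ds; &\quad (2.8)
\end{aligned}
$$
in this expression, note that $k$ represents the last bit at which $s$ and $t$ agree.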
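The expectation described above can also be estimated by simulation, which gives a rough check on any exact formula. The following Monte Carlo sketch is an illustration only and not the method of this paper. It assumes that each key is truncated to 64 random bits and that ordering two keys costs as many bit comparisons as the position of their first differing bit. All function names and parameters (`bit_cost`, `quickselect_bit_cost`, `estimate_mu`, `trials`, `bits`, `seed`) are hypothetical.

```python
import random


def bit_cost(a: int, b: int, bits: int) -> int:
    """Position (from the most significant bit, 1-indexed) of the first bit
    at which two distinct `bits`-bit keys differ."""
    return bits - (a ^ b).bit_length() + 1


def quickselect_bit_cost(keys, m, bits, rng):
    """Total bit comparisons used by one run of Quickselect for rank m."""
    keys, total = list(keys), 0
    while True:
        pivot = rng.choice(keys)
        smaller, larger = [], []
        for x in keys:
            if x == pivot:
                continue
            total += bit_cost(x, pivot, bits)  # one key comparison
            (smaller if x < pivot else larger).append(x)
        k = len(smaller) + 1  # rank of the pivot
        if k == m:
            return total
        if k > m:
            keys = smaller
        else:
            keys, m = larger, m - k


def estimate_mu(m, n, trials=20000, bits=64, seed=0):
    """Monte Carlo estimate of the expected number of bit comparisons."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        keys = set()
        while len(keys) < n:  # draw n distinct keys
            keys.add(rng.getrandbits(bits))
        total += quickselect_bit_cost(sorted(keys), m, bits, rng)
    return total / trials


print(estimate_mu(1, 2))
```

For $n = 2$ exactly one key comparison is made, and two independent uniform keys first differ at bit $k+1$ with probability $2^{-(k+1)}$, so the estimate should be close to $\sum_{k \ge 0} (k+1)\,2^{-(k+1)} = 2$.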
3 Analysis of $\mu(1,n)$
In Section 3.1, we derive the exact expression for $\mu(1,n)$ shown in Theorem 1.1. In Section 3.2, we prove the asymptotic result stated in Theorem 1.1.
3.1 Exact Computation of $\mu(1,n)$
Since the contribution of or to is zero for , we have [see (2.4) through (2.6)]. Let . Then
(3.1) | |||||
Making the change of variables and integrating, and recalling , we find, after some calculation,
(3.2) |
(3.3) | |||||
To further transform (3.3), define
(3.4) |
where denotes the -th Bernoulli number. Let . Then (see Knuth [12]), and
(3.5) | |||||
Here
Hence
To simplify , note that
Thus
(3.7) | |||||
Plugging (3.7) into (LABEL:intermediateMu1_1) and recalling for , we finally obtain
(3.8) | |||||
where denotes the -th harmonic number and
(3.9) |
The last equality in (3.8) follows from the easy identity
3.2 Asymptotic Analysis of $\mu(1,n)$
In order to obtain an asymptotic expression for , we analyze in (3.8)–(3.9). The following lemma provides an exact expression for that easily leads to an asymptotic expression for :
Lemma 3.1.
For , let and . Let denote Euler’s constant , and define . Then
-
where
-
where
-
where
and $H_n^{(2)}$ denotes the $n$-th harmonic number of order 2, i.e., $H_n^{(2)} := \sum_{i=1}^{n} i^{-2}$.
In this lemma, and are derived in order to
obtain the exact expression for in (iii). From
(3.8), the exact expression for also provides an
alternative exact expression for .
Before proving Lemma 3.1, we complete the
proof of Theorem 1.1 using part (iii). We know
(3.10) | |||||
(3.11) |
Combining (3.10)–(3.11) with (3.8) and Lemma 3.1(iii), we obtain an asymptotic expression for :
(3.12) |
The term in (3.12) has fluctuations of small magnitude due to , which is periodic in with amplitude smaller than 0.00110. The asymptotic slope in (3.12) is
(3.13) |
Now we prove Lemma 3.1:
Proof.
(i) Since
it follows that
(3.14) | |||||
(3.15) |
where is a positively oriented closed curve that
encircles the integers 0,, and does not include or
encircle any of the following points: (where ), ; ; and .
Equality (3.14) follows from the fact that the
Bernoulli numbers are extrapolated by the Riemann zeta function
taken at nonnegative integers: . [The
coefficients do not concern us since the Bernoulli numbers
of odd index greater than 1 vanish.] Equality
(3.15) follows from a direct application
of residue calculus, taking into account contributions of the simple
poles at the integers 0,, .
Let denote the integrand in
(3.15):
We consider a positively oriented rectangular contour with horizontal sides and , where , , and vertical sides and , where . By elementary bounds on along and the fact that
(3.16) |
(this is implicit on page 113 of Flajolet and Sedgewick [5] and explicitly proved in the Appendix), one can show that
Accounting for residues due to the poles encircled by , we obtain
(3.17) | |||||
where
(3.18) |
∎
(ii) We have . Hence, from (i),
(3.19) | |||||