Caching in Multidimensional Databases
One application area of multidimensional databases is On-line Analytical Processing (OLAP). The applications in this area are designed to make the analysis of shared multidimensional information fast.
On the one hand, speed can be achieved by specially devised data structures and algorithms. On the other hand, the analytical process is cyclic: the user of the OLAP application runs his or her queries one after the other, and the output of the latest query may already be present (at least partly) in one of the previous results. Therefore caching also plays an important role in the operation of these systems.
However, caching itself may not be enough to ensure acceptable performance. Size does matter: the more memory is available, the more we gain by loading information into it and keeping it there.
Often the cache size is fixed. This also limits the performance of the multidimensional database, unless we compress the data in order to move a greater proportion of it into the memory. Caching combined with proper compression methods promises further performance improvements.
In this paper, we investigate how caching influences the speed of OLAP systems. Different physical representations (multidimensional and table) are evaluated. For the thorough comparison, models are proposed. We draw conclusions based on these models, and the conclusions are verified with empirical data. In particular, using benchmark databases, we show examples when one physical representation is more beneficial than the alternative one and vice versa.
Keywords: compression, caching, multidimensional database, On-line Analytical Processing, OLAP.
Why is it important to investigate the caching effects in multidimensional databases?
A number of papers compare the different physical representations of databases in order to find the one resulting in higher performance than the others. For examples, see [4, 11, 12, 13, 14, 21]. However, many of these papers either ignore the influence of caching or discuss this issue very briefly.
As will be shown later, the size of the buffer cache affects the results significantly. Hence a thorough analysis of buffering is necessary in order to better understand the real reason for the performance improvements.
The results of this paper can be summarized as follows:
Two models are proposed to analyse the caching effects of the alternative physical representations of relations.
With the help of the models, it is shown that the performance difference between the two representations can be several orders of magnitude depending on the size of the buffer cache.
It is also demonstrated that the generally better multidimensional physical representation may become worse, if the memory available for caching is large enough.
The models are verified by a number of experiments.
1.3 Related Work
The paper of Westmann et al. lists several related works in this field. It also discusses how compression can be integrated into a relational database system, but it does not concern itself with the multidimensional physical representation, which is the main focus of our paper. The authors demonstrate that compression indeed offers high performance gains. It can, however, also increase the running time of certain update operations. In this paper we analyse the retrieval (or point query) operation only, as many On-line Analytical Processing (OLAP) applications handle the data in a read-only or read-mostly way, and the database is updated outside working hours in batch. Despite this difference, we also encountered performance degradation due to compression when the entire physical representation was cached in the memory. In this case, for one of the benchmark databases (TPC-D), the multidimensional representation became slower than the table representation because of the CPU-intensive Huffman decoding.
In this paper, we use difference – Huffman coding to compress the multidimensional physical representation of the relations. This method is based on difference sequence compression, which was published in .
Chen et al. propose a Hierarchical Dictionary Encoding and discuss query optimization issues. Both of these topics are beyond the scope of our paper.
In the article of O’Connell et al., compression of the data itself is analysed in a database built on a triple store. We remove the empty cells from the multidimensional array, but do not compress the data themselves.
When we analyse algorithms that operate on data on the secondary storage, we usually investigate how many disk input/output (I/O) operations are performed. This is because we follow the dominance of the I/O cost rule . We followed a similar approach in Section 3 below.
The main focus of  is the CPU cache. In our paper, we deal with the buffer cache as opposed to the CPU cache.
Vitter et al.  describe an algorithm for prefetching based on compression techniques. Our paper supposes that the system does not read ahead.
Poess et al.  show how compression works in Oracle. They do not test the performance for different buffer cache sizes, which is an important issue in this paper.
In , Xi et al. predict the buffer hit rate using a Markov chain model for a given buffer pool size. In our article, instead of the buffer hit rate, we estimate the expected number of pages brought into the memory from the disk, because it is proportional to the retrieval time. Another difference is that we usually start with a cold (that is empty) cache and investigate its increase together with the decrease in retrieval time. In , the authors fix the size of the buffer pool and then predict the buffer hit rate with the Markov chain model.
The rest of the paper is organised as follows. Section 2 describes the different physical representations of relations including two compression techniques used for the multidimensional representation. Section 3 introduces a model based on the dominance of the I/O cost rule for the analysis of the caching effects. An alternative model is presented in Section 4. The theoretical results are then tested in experiments outlined in Section 5. Section 6 rounds off the discussion with some conclusions and suggestions for future study. Lastly, for the sake of completeness, a list of references ends the paper.
2 Physical Representations of Relations
Throughout this paper we use the expressions ‘multidimensional representation’ and ‘table representation,’ which are defined as follows.
Suppose we wish to represent relation physically. The multidimensional (physical) representation of is as follows:
A compressed array, which only stores the nonempty cells, one nonempty cell corresponding to one element of ;
The header, which is needed for the logical-to-physical position transformation;
One array per dimension in order to store the dimension values.
The table (physical) representation consists of the following:
A table, which stores every element of relation ;
A B-tree index to speed up the access to given rows of the table when the entire primary key is given.
In the experiments, to compress the multidimensional representation, difference – Huffman coding (DHC) was used, which is closely related to difference sequence compression (DSC). These two methods are explained in the remainder of this section.
Difference sequence compression. By transforming the multidimensional array into a one-dimensional array, we obtain a sequence of empty and nonempty cells:
In the above regular expression, is an empty cell and is a nonempty one. The difference sequence compression stores only the nonempty cells and their logical positions. (The logical position is the position of the cell in the multidimensional array before compression. The physical position is the position of the cell in the compressed array.) We denote the sequence of logical positions by . This sequence is strictly increasing:
In addition, the difference sequence contains smaller values than the original sequence. (See also Definition 2 below.)
The search algorithm describes how we can find an element (cell) in the compressed array. During the design of the data structures of DSC and the search algorithm, the following principles were used:
We compress the header in such a way that enables quick decompression.
It is not necessary to decompress the entire header.
Searching can be done during decompression, and the decompression stops immediately when the header element is found or when it is demonstrated that the header element cannot be found (that is, when the corresponding cell is empty).
Let us introduce the following notations.
is the number of elements in the sequence of logical positions ();
is the sequence of logical positions ();
The sequence () is defined as follows:
where , and is the size of a sequence element in bits.
The sequence will be defined recursively in the following way:
Here the sequence is called the overflow difference sequence. There is an obvious distinction between and , but the latter will also be called the difference sequence where this causes no confusion. The recursively defined sequence is called the jump sequence. The compression method which makes use of the and sequences will be called difference sequence compression (DSC). The and sequences together will be called the DSC header.
Notice here that and are basically the same sequence. The only difference is that some elements of the original difference sequence are replaced with zeros, if and only if they cannot be stored in bits. (The symbol denotes a natural number. The theoretically optimal value of can be determined, if the distribution of is known. In practice, for performance reasons, is either 8 or 16 or 32.)
The difference sequence will also be called the relative logical position sequence, and we shall call the jump sequence the absolute logical position sequence.
From the definitions of and , one can see clearly that, for every zero element of the sequence, there is exactly one corresponding element in the sequence. For example, let us assume that , and . Then the above mentioned correspondence is shown in the following table:
From the above definition, the recursive formula below follows for .
In other words, every element of the sequence can be calculated by adding zero or more consecutive elements of the sequence to the proper jump sequence element. For instance, in the above example and so on.
A detailed analysis of DSC and the search algorithm can be found in .
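The construction of the DSC header can be illustrated with a short Python sketch. It assumes that a difference is stored in the difference sequence whenever it fits in s bits, and that otherwise a zero is stored and the absolute logical position is appended to the jump sequence; the first position is always treated as a jump. The names `dsc_encode`, `dsc_find` and the parameter `s` are hypothetical, and the lookup is a plain linear scan for illustration, not the search algorithm analysed in the cited work.

```python
def dsc_encode(positions, s=8):
    """Build (difference sequence, jump sequence) from strictly
    increasing logical positions. `s` is the assumed element size in bits."""
    limit = (1 << s) - 1              # largest difference that fits in s bits
    diffs, jumps = [], []
    prev = None
    for pos in positions:
        delta = None if prev is None else pos - prev
        if delta is not None and delta <= limit:
            diffs.append(delta)       # relative logical position
        else:
            diffs.append(0)           # zero marks an entry in the jump sequence
            jumps.append(pos)         # absolute logical position
        prev = pos
    return diffs, jumps


def dsc_find(diffs, jumps, target):
    """Return the physical position of logical position `target`,
    or None if the corresponding cell is empty."""
    k = -1                            # index of the current jump element
    current = None
    for j, d in enumerate(diffs):
        if d == 0:
            k += 1
            current = jumps[k]        # restart from an absolute position
        else:
            current += d              # add the next relative position
        if current == target:
            return j
        if current > target:
            return None               # positions only grow: the cell is empty
    return None
```

Because the logical positions are strictly increasing, every genuine difference is at least one, so a stored zero can only mean "consult the jump sequence". The scan also stops as soon as the decoded position passes the target, which mirrors the principle that decompression need not cover the entire header.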
Difference – Huffman coding. The key idea in difference – Huffman coding is that we can compress the difference sequence further if we replace it with its corresponding Huffman code.
The compression method, which uses the jump sequence () and the Huffman code of the difference sequence (), will be labelled difference – Huffman coding (DHC). The sequence and the Huffman code of the sequence together will be called the DHC header.
The difference sequence usually contains a lot of zeros. Moreover, it contains many ones too if there are numerous consecutive elements in the sequence of logical positions. By definition, the elements of the difference sequence are smaller than those of the logical position sequence. The elements of will recur with greater or less frequency. Hence it seems reasonable to code the frequent elements with fewer bits, and the less frequent ones with more. To do this, the optimal prefix code can be determined by the well-known Huffman algorithm .
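As an illustration of why Huffman coding suits the difference sequence, the following Python sketch builds an optimal prefix code for a small, made-up difference sequence in which ones dominate. The function name `huffman_code` and the sample data are assumptions for this example only; the sketch does not reproduce the decoder used in the experiments.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return {symbol: bit string} for an optimal prefix code."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: one symbol
        return {next(iter(freq)): "0"}
    # Heap items: (weight, unique tie breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)      # two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical difference sequence: many ones, a few zeros and larger values.
sample = [0, 1, 1, 1, 1, 6, 1, 1, 0, 1, 2, 1]
code = huffman_code(sample)
encoded_bits = sum(len(code[d]) for d in sample)
```

In this sample the most frequent difference receives a one-bit code word, so the encoded sequence is much shorter than a fixed-width encoding of the same twelve elements.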
3 A Model Based on the Dominance of the I/O Cost Rule
During our analysis of caching effects, we followed two different approaches:
For the first model, we applied the dominance of the I/O cost rule to calculate the expected number of I/O operations.
In the second one, instead of counting the number of disk inputs/outputs, we introduced two different constants. The first constant denotes the time needed to retrieve one cell from the disk if the multidimensional representation is used. The second constant shows the time required to read one row from the disk if the table representation is used. The constants were determined experimentally. The tests showed that the second constant is larger than the first, that is, more disk I/O operations are needed to retrieve one row from the table representation than one cell from the multidimensional representation, which is to be expected when there is no caching. However, for the second model, it was not necessary to compute the exact number of I/O operations for the alternative physical representations, due to the experimental approach.
The first model is described in this section, and the second one in the next.
Throughout the paper, we suppose that the different database pages are accessed with the same probability. In other words, uniform distribution will be assumed.
It is not hard to see that this assumption corresponds to the worst case. If the distribution is not uniform, then certain partitions of the pages will be read/written with higher probability than the average. Therefore it is more likely to find pages from these partitions in the buffer cache than from other parts of the database. Hence the non-uniform distribution increases the buffer hit rate and thus the performance.
We are going to estimate the number of database pages (blocks) in the buffer cache. First it will be done for the multidimensional representation, then for the table representation.
Multidimensional physical representation. In this paper, we shall assume that prefetching is not performed by the system. Hence, for the multidimensional representation, at most one database page has to be copied from the disk into the memory when a cell is accessed. The number of pages copied is one if the needed page is not in the buffer cache, and zero otherwise.
The multidimensional representation requires that the header and the dimension values are preloaded into the memory. The total size of these will be denoted by . The compressed multidimensional array can be found on the disk. The pages of the latter are gradually copied into the memory as a result of caching. Thus the total memory occupancy of this representation can be computed by adding to the size of the buffer cache.
In this section, for the multidimensional representation, we shall use the following notation.
is the number of pages required to store the compressed array ();
is the expected value of the number of pages in the buffer cache after the database access ().
Suppose that is less than the size of the
The theorem will be proven by induction. For convenience, let us define as follows:
For , the theorem holds:
Now assume that the theorem has already been proven for :
Then for we obtain that
Because of the uniform distribution, the fraction of the pages already held in the buffer cache is the probability that the required database block can be found in the memory. No new page is copied from the disk into the buffer cache in this situation. In the opposite case, one new page is brought into the memory; this occurs with the complementary probability. In other words, the expected value of the increase is
From the induction hypothesis it follows that
It is easy to see that
The last formula can be written as
which proves the theorem. ∎
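The induction step can be checked numerically with a small Python sketch. It assumes a cold cache and uniformly distributed accesses, and it uses the hypothetical names `n_pages` for the number of pages of the compressed array and `b` for the expected number of cached pages. The closed form in `closed_form` is the expression that this recursion yields when the cache starts empty; it is offered here as a cross-check under these assumptions.

```python
def expected_cached_pages(n_pages, accesses):
    """Iterate the recursion: each access adds, in expectation,
    the probability of a miss, 1 - b / n_pages."""
    b = 0.0                                  # cold (empty) buffer cache
    history = [b]
    for _ in range(accesses):
        b += 1.0 - b / n_pages               # expected increase per access
        history.append(b)
    return history


def closed_form(n_pages, i):
    """Solution of the recursion for a cold cache after i accesses."""
    return n_pages * (1.0 - (1.0 - 1.0 / n_pages) ** i)


def expected_reads(b, n_pages):
    """Expected pages copied from disk at the next access."""
    return 1.0 - b / n_pages
```

For example, `expected_cached_pages(100, 50)` agrees with `closed_form(100, i)` for every `i` up to 50 apart from rounding error, and `expected_reads` decreases linearly as the number of cached pages grows.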
The time to retrieve one cell from the multidimensional representation is proportional to the number of pages brought into the memory. The latter is a linear function of the size of the buffer cache. This is rephrased in the following theorem.
Assume that the number of database pages in the buffer cache is . The memory available for caching is greater than . Let us suppose that a cell is accessed in the multidimensional representation. Then the expected number of pages copied from the disk into the memory is
Similarly to Equation (1), the expected number of pages necessary for this operation is
The above theorem holds even if is equal to the number of pages available for caching. However, in this case, the database management system (or the operating system) has to remove a page from the buffer cache, if a page fault happens. If the removed page is ‘dirty,’ then it has to be written back to the disk in order not to lose the modifications. That is why another disk I/O operation is needed. In this paper, we are going to ignore these situations, because most OLAP applications handle the data in a read only or read mostly way.
Figure 1 illustrates the behaviour of the multidimensional representation. The horizontal axis shows the number of pages in the buffer cache. The vertical one demonstrates the expected number of pages retrieved from the disk. The function is defined as follows:
Table physical representation. Now, let us turn to the other storage method, the table representation. Both the table and B-tree index are kept on the disk. The table itself could be handled similarly to the compressed array, but the B-tree index is structured differently. It consists of several levels. In our model, we are going to consider these levels separately. To simplify the notation, the table will also be considered as a separate level. The following definition introduces the necessary notations.
is the number of levels in the table representation.
On level 1, the root page of the B-tree can be found. Level is the last level of the B-tree, which contains the leaf nodes.
Level corresponds to the table.
is the number of pages on level (). Specifically, , as there is only one root page.
The total number of pages is
is the number of pages in the buffer cache from level after the database access ( and ).
The total number of pages in the buffer cache is
Suppose that is less than the size of the memory available for caching for every index. In addition, let us assume that the buffer cache is cold initially: . Then, for the table representation,
Similarly to the other representation, the necessary time to retrieve one row from the table representation is proportional to the number of pages brought into the memory. The next theorem investigates how the number of pages brought into the memory depends on the size of the buffer cache.
Assume that the number of database pages in the buffer cache is . The memory available for caching is greater than . Let us suppose that a row is accessed in the table representation. Then the expected number of pages read from the disk into the memory is
This will be shown by applying the result of Theorem 2 per level. For level , the number of pages copied into the memory is:
Hence, for all levels in total, it is:
are constants. Therefore Equation (5) is a linear function of . The same expression can be looked at as a function of , as well:
Just like before, we are going to assume that the buffer cache is cold initially: . If this is the case, then for every , because of Definition 5. Therefore,
In other words, one page per level has to be read into the memory at the first database access. If the memory available for caching is not smaller than , then for every and
Obviously, we obtain the same, if we use the alternative (recursive) formula:
Now, let us investigate the special case when the total number of pages equals the number of levels. In this case, there is only one page per level, which means that the number of pages in the buffer cache after the first database access also equals the total number of pages. To put it another way, the entire database is cached into the memory after the first database access, given that the available memory is greater than or equal to the size of the database. After this, there is no need to copy more pages into the memory:
To summarise this paragraph, below we show the values of and for every :
In the remainder of this section, we shall assume that .
For sufficiently large values, can be considered a linear function of . This is the main idea behind the theorem below.
Suppose that is less than the size of the memory available for caching for every index. In addition, let us assume that , and . Then, for the table representation,
First, we show that
where is a weighted average of constants . Then we demonstrate that tends to , if tends to infinity. From Equation (4), we know that
Using Definition 6, we obtain that
Theorem 3 implies the following equation:
Let us define as follows:
given that the denominator is not zero (). Observe that is a weighted average of constants . The weight of is for every . With the previous definition, we get that
If does not vanish (), then
Finally, we have to prove that , if . For every , the inequality holds. It is not difficult to see that
Figure 2 demonstrates the behaviour of the table representation. The horizontal axis is the number of pages in the buffer cache. The vertical one shows the expected number of pages retrieved from the disk. The estimate denoted by ‘Est.’ in the chart is the limit of the function:
We conclude this section by summarising the findings:
If we assume requests with uniform distribution, then the expected number of database pages brought into the memory at a database access is a linear function of the number of pages in the buffer cache.
Specifically, for the multidimensional representation, it equals
where is the number of pages in the buffer cache and is the size of the compressed multidimensional array in pages.
For the table representation, it is
where is the number of levels, is the number of pages in the buffer cache from level , and is the total number of pages on level .
The expression above is a linear function of , but for large values, it can be considered as a linear function of , as well, because
where and .
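The per-level view of the table representation can be sketched in Python as follows. The sketch assumes that every access touches exactly one uniformly chosen page on each level (the B-tree levels and the table), that the cache starts cold, and that nothing is evicted. The names `pages_per_level` and `cached_per_level`, as well as the level sizes in the example, are hypothetical.

```python
def table_expected_reads(cached_per_level, pages_per_level):
    """Expected pages read from disk at one access: the sum over the
    levels of the miss probability 1 - cached / total on that level."""
    return sum(1.0 - b / n
               for b, n in zip(cached_per_level, pages_per_level))


def simulate_levels(pages_per_level, accesses):
    """Track expected cache content per level over a run of accesses."""
    cached = [0.0] * len(pages_per_level)    # cold buffer cache
    reads = []
    for _ in range(accesses):
        reads.append(table_expected_reads(cached, pages_per_level))
        # Each level grows by its own expected miss count.
        cached = [b + (1.0 - b / n)
                  for b, n in zip(cached, pages_per_level)]
    return reads, cached


# Hypothetical sizes: root, two further index levels, then the table.
reads, cached = simulate_levels([1, 10, 1000, 50000], accesses=3)
```

With these made-up sizes, the first access reads one page per level, and the single root page stays cached from then on, so later accesses need fewer reads on the upper levels than on the table level.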
4 An Alternative Model
In this section we shall examine how caching affects the speed of retrieval in the different physical database representations. For the analysis, a model will be proposed. Then we will give necessary and sufficient conditions for the cases where the expected retrieval time is smaller in one representation than in the other.
The caching can speed up the operation of a database management system significantly if the same block is requested while it is still in the memory. In order to show how the caching modifies the results of this paper, let us introduce the following notations.
|the retrieval time, if the information is in the memory,|
|the retrieval time, if the disk also has to be accessed,|
|the probability of having everything needed in the memory,|
|how long it takes to retrieve the requested information.|
In our model we shall consider and constants. Obviously, is a random variable. Its expected value can be calculated as follows:
Notice that does not tell us how many blocks have to be read from the disk. This also means that the value of will be different for the table and the multidimensional representations. The reason for this is that, in general, at most one block has to be read with the multidimensional representation. Exactly one reading is necessary if nothing is cached, because only the compressed multidimensional array is kept on the disk. Everything else (the header, the dimension values, and so forth) is loaded into the memory in advance. With the table representation, more block readings may be needed because we also have to traverse through the B-tree first, and then we have to retrieve the necessary row from the table.
is also different for the two alternative physical representations. This is because two different algorithms are used to retrieve the same information from two different physical representations.
Hence, for the above argument, we are going to introduce four constants.
|the value of for the multidimensional representation,|
|the value of for the table representation,|
|the value of for the multidimensional representation,|
|the value of for the table representation.|
If we sample the cells/rows with uniform probability
By the ‘total size’ we mean that part of the physical representation which can be found on the disk at the beginning. In the multidimensional representation, it is the compressed multidimensional array, whereas in the table representation, we can put the entire size of the physical representation into the denominator of . The cached pages are those that had been originally on the disk, but were moved into the memory later. In other words, the size of the cached blocks (numerator) is always smaller than or equal to the total size (denominator).
The experiments show that the alternative physical representations differ from each other in size. That is why it seems reasonable to introduce four different probabilities in the following manner.
|the value of for the multidimensional representation,|
|the value of for the table representation,|
When does the inequality below hold? This is an important question:
Here and are random variables that are the retrieval times in the multidimensional and table representations, respectively.
In our model, (). Thus the question can be rephrased as follows:
The value of the , , and constants was measured by carrying out some experiments. (See the following section.) Two different results were obtained. For one benchmark database (TPC-D), the following was found:
Another database (APB-1) gave a slightly different result:
The and inequalities hold because disk operations are slower than memory operations by orders of magnitude. The third one () is because we have to retrieve more blocks from the table representation than from the multidimensional to obtain the same information.
Note here that is the convex linear combination of and ( and ). In other words, can take any value from the closed interval .
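The following provides a sufficient condition for the expected retrieval time to be smaller in the multidimensional representation than in the table representation: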
From this, we can obtain the inequality constraint:
The resulting threshold was found to be 63.2%, 66.5% and 66.3% (for TPC-D, TPC-H and APB-1, respectively) in the experiments. This means that, based on the experimental results, the expected value of the retrieval time was smaller in the multidimensional representation than in the table representation when less than 63.2% of the latter one was cached. This was true regardless of whether the multidimensional representation was cached or not.
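The reasoning behind such a threshold can be sketched in Python. The expected retrieval time is modelled as a mix of the in-memory time and the disk time, weighted by the probability that everything needed is cached. Since the multidimensional time never exceeds its disk constant, it suffices that the disk constant stays below the expected table time. All names (`m_md`, `d_md`, `m_tab`, `d_tab`) and the numeric values below are hypothetical and are not the measured constants of the paper.

```python
def expected_time(p, t_mem, t_disk):
    """Expected retrieval time when the data is cached with probability p."""
    return p * t_mem + (1.0 - p) * t_disk


def table_cache_threshold(d_md, m_tab, d_tab):
    """Largest cached share of the table representation below which
    d_md < expected_time(p_tab, m_tab, d_tab) still holds.
    Derived from d_md < p*m_tab + (1 - p)*d_tab, assuming d_tab > m_tab."""
    return (d_tab - d_md) / (d_tab - m_tab)


# Made-up constants (arbitrary time units), for illustration only.
m_md, d_md, m_tab, d_tab = 0.05, 4.0, 0.1, 10.0
threshold = table_cache_threshold(d_md, m_tab, d_tab)
```

Below the threshold, the multidimensional representation wins in this model for every caching level of its own, because its expected time always lies between `m_md` and `d_md`.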
Now we are going to distinguish two cases based on the value of and .
Case 1: . This was true for the TPC-D benchmark database. (Here the difference sequence consisted of 16-bit unsigned integers, which resulted in a slightly more complicated decoding, as the applied Huffman decoder returns 8 bits at a time. This may be the reason why became larger than .) In this case, we can give a sufficient condition for , as the equivalent transformations below show:
For we obtained a value of 99.9%. This means that the expected retrieval time was smaller in the table representation when more than 99.9% of it was cached. This was true even when the whole multidimensional representation was in the memory.
Case 2: . This inequality holds for the TPC-H and the APB-1 benchmark databases. Here we can give another sufficient condition for :
The left hand side of the last inequality was equal to 99.9% and 98.3% for the TPC-H and APB-1 benchmark databases, respectively. In other words, when more than 99.9% of the multidimensional representation was cached, it then resulted in a faster operation on average than the table representation regardless of the caching level of the latter.
Finally, let us give a necessary and sufficient condition for . First, let us consider the following equivalent transformations (making the natural assumption that ):
The last inequality was the following for the three tested databases, TPC-D, TPC-H and APB-1, respectively:
Suppose that . Then the expected retrieval time is smaller in the case of the multidimensional physical representation than in the table physical representation if and only if
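One equivalent way to write such a necessary and sufficient condition is to solve the comparison of the two expected times for the cached share of the multidimensional representation. The Python sketch below does this under the assumption that the disk time of the multidimensional representation exceeds its in-memory time; the form of the condition, the function names and the constants are assumptions made for illustration and may differ from the formulation of the theorem.

```python
def md_is_faster(p_md, p_tab, m_md, d_md, m_tab, d_tab):
    """Direct comparison of the two expected retrieval times."""
    t_md = p_md * m_md + (1.0 - p_md) * d_md
    t_tab = p_tab * m_tab + (1.0 - p_tab) * d_tab
    return t_md < t_tab


def min_md_cache_share(p_tab, m_md, d_md, m_tab, d_tab):
    """Cached share of the multidimensional representation that must be
    exceeded for it to be faster on average (requires d_md > m_md).
    A value below 0 means it is faster at any caching level;
    a value of 1 or more means it cannot be faster."""
    return (d_md - d_tab + p_tab * (d_tab - m_tab)) / (d_md - m_md)


# Made-up constants (arbitrary time units), for illustration only.
CONSTS = dict(m_md=0.05, d_md=4.0, m_tab=0.1, d_tab=10.0)
```

With these hypothetical constants, `md_is_faster(p_md, p_tab, **CONSTS)` and `p_md > min_md_cache_share(p_tab, **CONSTS)` give the same answer for any pair of cached shares that does not sit exactly on the boundary.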
Now, let us change our model slightly. In this modified version, we shall assume that the different probabilities are (piecewise) linear functions of the memory size available. This assumption is in accordance with Theorems 2 and 5. With the multidimensional representation, the formula below follows from the model for the expected retrieval time:
is the total size of the multidimensional representation part, which is loaded into the memory in advance (the header and the dimension values), is the size of the compressed multidimensional array and () is the size of the available memory.
In an analogous way, for the table representation, we obtain the formula:
is the total size of the table representation and () is the size of the memory available for caching.
It is not hard to see that the global maximum and minimum values and locations of the functions and are the following:
For values, let us define the speed-up factor in the following way:
The global maximum of the speed-up factor can be achieved, when the entire multidimensional representation is cached into the memory. This is specified in the following theorem.
Then the global maximum of the function can be found at .
The function is continuous, because and are continuous and . Hence, to prove the theorem, it is enough to show that this function is strictly monotone increasing on interval , strictly monotone decreasing on and constant on . On the first interval,
For convenience, let us introduce the following notation:
The first derivative of the function is
The first derivative is positive if and only if . Equation (12) can be written as
From the last two inequalities, we get that , which is equivalent to . Thus and is strictly monotone increasing on interval .
Now, suppose that . In this case
The first derivative is
because and . So is strictly monotone decreasing.
Finally, let us take the case, when . The speed-up factor
which is constant. ∎
The location of the global maximum is . The global maximum value is obviously
As described in detail in the next section, experiments were made to determine the value of the constants. For these data, see Table 6 there. The sizes were also measured and can be seen in Table 1 (in bytes), together with the global maximum locations and values per benchmark database. As can be seen from the latter table, the speed-up can be very large: two to three orders of magnitude. The maximum value for the TPC-D benchmark database was more than 400, while for the APB-1 benchmark database it was more than 1,500.
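The shape of the speed-up curve can be explored with a Python sketch of the piecewise linear model. It assumes that the speed-up factor is the expected table retrieval time divided by the expected multidimensional retrieval time at the same memory size, that the cached share grows linearly with the memory left after the preloaded part, and that the table representation is larger than the multidimensional one. The parameter names (`h` for the preloaded part, `c` for the compressed array, `s` for the table representation) and all numbers are hypothetical, not the measured values.

```python
def cached_fraction(memory, preloaded, on_disk):
    """Share of the on-disk part that fits into the remaining memory."""
    return min(max((memory - preloaded) / on_disk, 0.0), 1.0)


def expected_time(p, t_mem, t_disk):
    """Expected retrieval time when the data is cached with probability p."""
    return p * t_mem + (1.0 - p) * t_disk


def speed_up(memory, h, c, s, m_md, d_md, m_tab, d_tab):
    """Expected table time divided by expected multidimensional time."""
    t_md = expected_time(cached_fraction(memory, h, c), m_md, d_md)
    t_tab = expected_time(cached_fraction(memory, 0.0, s), m_tab, d_tab)
    return t_tab / t_md


# Made-up sizes and time constants, for illustration only.
params = dict(h=10, c=100, s=1000,
              m_md=1.0, d_md=100.0, m_tab=1.2, d_tab=300.0)
best_memory = max(range(10, 1201), key=lambda x: speed_up(x, **params))
```

With these made-up parameters the maximum over the sampled memory sizes falls at `h + c`, the point where the whole multidimensional representation fits into memory, after which the ratio declines and finally levels off once the table representation is fully cached as well.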
We can draw the conclusions of this section as follows:
If (nearly) the entire physical representation is cached into the memory, then the complexity of the algorithm will determine the speed of retrieval. A less CPU-intensive algorithm will result in a faster operation.
In the tested cases, the expected retrieval time was smaller with multidimensional physical representation when less than 63.2% of the table representation was cached. This was true regardless of the caching level of the multidimensional representation.
Depending on the size of the memory available for caching, the speed-up factor can be very large, up to two to three orders of magnitude. In other words, the caching effects of the alternative physical representations modify the results significantly. Hence these effects should always be taken into account when the retrieval times of the different physical representations are compared with each other.
We carried out experiments in order to measure the sizes of the different physical representations and the constants in the previous section. We also examined how the size of the cache influenced the speed of retrieval.
Table 2 shows the hardware and software used for testing. The speed of the processor, the memory and the hard disk all influence the experimental results quite significantly, just like the memory size. In the computer industry, all of these parameters have increased quickly over time, but the increase in hard disk speed has been somewhat slower. Hence, it is expected that the results presented will remain valid despite the continuing improvement in computer technology.
Table 2: Hardware and software used for testing
|Processor|Intel Pentium 4 with HT technology, 2.6 GHz, 800 MHz FSB, 512 KB cache|
|Memory|512 MB, DDR 400 MHz|
|Hard disk|Seagate Barracuda, 80 GB, 7200 RPM, 2 MB cache|
|Filesystem|ReiserFS format 3.6 with standard journal|
|Page size of B-tree|4 KB|
|Operating system|SuSE Linux 9.0 (i586)|
|Compiler|gcc (GCC) 3.3.1 (SuSE Linux)|
|Free|procps version 3.1.11|
In the experiments we made use of three benchmark databases: TPC-D , TPC-H  and APB-1 . One relation () was derived per benchmark database in exactly the same way as was described in . Then these relations were represented physically with a multidimensional representation and table representation.
Tables 3, 4 and 5 show that DHC results in a smaller multidimensional representation than difference sequence compression. (For TPC-H, the so-called Scale Factor was equal to 5. That is why the table representation of TPC-H is about five times greater than that of TPC-D.)
Table 3: TPC-D benchmark database
|Compression|Size in bytes|Percentage|
|Difference sequence compression|67,925,100|24.3%|
|Difference – Huffman coding|67,014,312|24.0%|
Table 4: TPC-H benchmark database
|Compression|Size in bytes|Percentage|
|Difference sequence compression|407,414,614|28.7%|
|Difference – Huffman coding|394,020,884|27.8%|
Table 5: APB-1 benchmark database
|Compression|Size in bytes|Percentage|
|Difference sequence compression|113,867,897|8.8%|
|Difference – Huffman coding|103,369,039|8.0%|
In the rest of this section, we shall deal only with DHC. Its performance will be compared to the performance of the uncompressed table representation.
In order to determine the constant values of the previous section, an experiment was performed. A random sample was taken with replacement from relation with uniform distribution. The sample size was 1000. Afterwards the sample elements were retrieved from the multidimensional representation and then from the table representation. The elapsed time was measured to calculate the average retrieval time per sample element. Then the same sample elements were retrieved again from the two physical representations. Before the first round, nothing was cached. So the results help us to determine the constants and . Before the second round, every element of the sample was cached in both physical representations. So the times measured in the second round correspond to the values of the constants and . The results of the experiment can be seen in Table 6.