# Failure Mitigation in Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement

###### Abstract

A new roll-forward technique is proposed that recovers from
any single fail-stop failure in integer data streams
undergoing linear, sesquilinear or bijective (LSB)
operations, such as scaling, additions/subtractions, inner or outer
vector products and permutations. In the proposed approach, the
input integer data streams are linearly superimposed to form
*numerically entangled* integer data streams that are stored
in-place of the original inputs. A series of LSB operations can then
be performed directly using these entangled data streams. The output
results can be extracted from the surviving entangled output streams
by additions and arithmetic shifts, thereby guaranteeing robustness
to a fail-stop failure in any single stream computation. Importantly,
unlike other methods, the number of operations required for the entanglement,
extraction and recovery of the results is linearly related to the
number of the inputs and does not depend on the complexity of the
performed LSB operations. We have validated our proposal on an Intel
processor (Haswell architecture with AVX2 support) via convolution
operations. Our analysis and experiments reveal that the proposed
approach incurs only a marginal reduction in processing
throughput in comparison to the failure-intolerant approach. This
overhead is 9 to 14 times smaller than that of the equivalent checksum-based
method. Thus, our proposal can be used in distributed systems,
unreliable processor hardware, or safety-critical applications, where
robustness against fail-stop failures becomes a necessity.

## I Introduction

The increase of integration density [1] and aggressive voltage/frequency scaling in processor and custom-hardware designs [2], along with the ever-increasing tendency to use commercial off-the-shelf processors to create vast computing clusters, have decreased the mean time to failure of modern computing systems. Therefore, it is now becoming imperative for distributed computing systems to provide for fail-stop failure mitigation [3], i.e., to recover from cases where one of their processor cores becomes unresponsive or does not return its results within a predetermined deadline. Applications that are particularly prone to fail-stop failures include distributed systems such as grid computing [4], sensor networks [5], webpage or multimedia retrieval, object or face recognition in images [6], financial computing [7], etc. The compute- and memory-intensive parts of these applications comprise linear, sesquilinear (also known as “one-and-half linear”) and bijective operations, collectively called LSB operations in this paper. These operations are typically performed using single- or double-precision floating-point inputs or, for systems requiring exact reproducibility and/or reduced hardware complexity, 32-bit or 64-bit integer or fixed-point inputs. Thus, ensuring robust recovery from fail-stop failures for applications comprising integer LSB operations is of paramount importance.

### I-A Summary of Prior Work

Existing techniques that can ensure recovery from fail-stop failures
comprise two categories: *(i)* roll-back via checkpointing
and recomputation [8, 9], i.e., methods
that periodically save the state of all running processes, such that
the execution can be rolled back to a “safe state” in case of
failures; *(ii)* roll-forward methods producing additional
“checksum” inputs/outputs [9, 10, 11]
such that the missing results from a core failure can be recovered
from the remaining cores without recomputation. Examples of roll-forward
methods include algorithm-based fault-tolerance (ABFT) and modular
redundancy (MR) methods [12, 9, 13, 10, 14, 15, 11, 16].
Although no recomputation is required in roll-forward methods (thereby
ensuring quick recovery from a failure occurrence), checksum-based
methods can incur significant computational and energy-consumption
overhead because of the additional checksum-generation and redundant
computations required [17].

### I-B Contribution

We propose a new roll-forward failure-mitigation method for linear,
sesquilinear (also known as one-and-half linear) or bijective operations
performed in integer data streams. Examples of such operations are
element-by-element additions and multiplications, inner and outer
vector products, sum-of-squares and permutation operations. They are
the building blocks of algorithms of foundational importance, such
as: matrix multiplication [18, 12],
convolution/cross-correlation [19], template
matching for search algorithms [20], covariance
calculations [6], integer-to-integer transforms [21]
and permutation-based encoding systems [22],
which form the core of the applications discussed earlier. Because
our method performs linear superpositions of input streams onto each
other, it “entangles” the input streams together; we therefore term it
*numerical entanglement*. Our approach guarantees recovery from
any single stream-processing failure without requiring recomputation.
Importantly, numerical entanglement does not generate additional “checksum”
or duplicate streams and does not depend on the specifics of the LSB
operation performed. It is therefore found to be extremely efficient
in comparison to checksum-based methods that incur overhead proportional
to the complexity of the operation performed.

### I-C Paper Organization

In Section II, we introduce checksum-based methods and MR for fail-stop failure recovery in numerical stream processing. In Section III we introduce the notion of numerical entanglement and demonstrate its inherent reliability for LSB processing of integer streams. Section IV presents the complexity of numerical entanglement within integer linear or sesquilinear operations. Section V presents experimental comparisons and Section VI presents some concluding remarks.

## II Checksum/MR-based Methods versus Numerical Entanglement

Consider a series of input streams of integers, each comprising
several samples^{1}^{1}1Notations: Boldface uppercase and lowercase letters indicate matrices
and vectors, respectively; the corresponding italicized lowercase
letters indicate their individual elements; a hat over a symbol
denotes the recovered value of an element after disentanglement;
all indices are integers. Operators: superscript T denotes
transposition; the floor operation gives the largest integer
that is smaller than or equal to its argument; the ceil operation gives
the smallest integer that is larger than or equal to its argument;
left and right arithmetic shifts of an
integer by a number of bits truncate at the most-significant
or least-significant bit, respectively;
mod is the modulo operation.:

(1)

These may be the elements of rows of a matrix of integers, or a set of input integer streams of data to be operated upon with an integer kernel. This operation is performed by:

(2)

with each output of (2) a vector of output results and
op any LSB operator such as
element-by-element addition/subtraction/multiplication, inner/outer
product, permutation^{2}^{2}2We remark that we consider LSB operations that are *not* data-dependent,
e.g., permutations according to fixed index sets as in the Burrows-Wheeler
transform [22]. (i.e., a bijective mapping from the sequential index set
to a fixed target index set)
and circular convolution or cross-correlation with the kernel.
Beyond the single LSB operator indicated in (2),
we can also assume *series* of such operators applied consecutively
in order to realize higher-level algorithmic processing, e.g., multiple
consecutive additions, subtractions and scaling operations with pre-established
kernels followed by circular convolutions and permutation operations.
Conversely, the input data streams can also be left in their native
state (i.e., simply stored in memory) if the applied operator is the
identity.
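As a toy illustration, the three classes of LSB operators can be sketched on a small integer stream; the kernel value and the permutation index set below are arbitrary example choices, not values from the text:

```python
# Illustrative LSB operators on one integer stream; the kernel value
# and the permutation index set are arbitrary example choices.
stream = [3, -1, 4, 1, -5]

# Linear: element-by-element scaling by an integer kernel value.
scaled = [7 * x for x in stream]

# Sesquilinear: sum of squares (inner product of the stream with itself).
energy = sum(x * x for x in stream)

# Bijective: permutation according to a fixed, data-independent index set.
perm = [2, 0, 4, 1, 3]
permuted = [stream[p] for p in perm]

assert scaled[0] == 21 and energy == 52 and permuted == [4, 3, -5, -1, 1]
```

All three act sample-wise or position-wise on the stream, which is what allows them to commute with the linear superpositions introduced later.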

### II-A Checksum-based Methods

In their original (or “pure”) form, the input data streams of
(1) are uncorrelated and one input or
output element cannot be used for the recovery of another without
inserting some form of coding or redundancy. This is conventionally
achieved via checksum-based methods [12, 9, 13, 10, 23, 14, 15].
Specifically, one *additional* input stream is created, which
comprises *checksums* of the original inputs:

(3)

by using, for example, the sum of groups of input samples [15, 14] at each position across the streams:

(4)

Then the processing is performed in all input streams and in the checksum input stream (each running on a different core) by:

(5)

If one of the cores fails, its output can be recovered by subtracting the surviving output streams from the checksum output stream.
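A minimal sketch of this checksum pipeline, assuming a linear LSB operation (direct time-domain convolution here) and hypothetical stream values: the extra stream carries elementwise sums of the inputs, and linearity lets the failed core's output be rebuilt from the surviving outputs without recomputation.

```python
def convolve(x, h):
    """Direct time-domain linear convolution: an example LSB operation."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

streams = [[1, -2, 3], [0, 4, -1], [2, 2, 2]]   # three integer input streams
checksum = [sum(col) for col in zip(*streams)]  # extra checksum stream, as in (3)-(4)
h = [1, 0, -3]                                  # integer kernel

outputs = [convolve(s, h) for s in streams]     # each stream on its own core
checksum_out = convolve(checksum, h)            # one additional core for the checksum

# Fail-stop failure on core 1: rebuild its output from the surviving cores.
lost = 1
recovered = [co - sum(o[i] for m, o in enumerate(outputs) if m != lost)
             for i, co in enumerate(checksum_out)]
assert recovered == outputs[lost]
```

The price is visible here: the checksum core performs the full convolution as well, so the overhead grows with the complexity of the LSB operation itself.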

### II-B Proposed Numerical Entanglement

Numerical entanglement mixes the inputs prior to processing using linear superposition, and ensures the results can be recovered via a mixture of shift-and-add operations. Specifically, considering a group of input streams (each comprising integer samples), each element of every entangled stream comprises the superposition of two input elements taken from two different input streams. The LSB operation op is carried out by independent cores utilizing the entangled input streams directly, thereby producing entangled output streams. These can be disentangled to recover the final results. Any single fail-stop failure in the processor cores can be recovered from the results of the remaining cores utilizing additions and arithmetic-shift operations.

The complexity of entanglement, disentanglement (extraction) and recovery does not depend on the complexity of the operator op, or on the length of the kernel (operand). The entangled inputs can be written in-place and no additional storage or additional operations are needed during the execution of the actual operation. The entire process is also suitable for stream processors, with entanglement applied as the data within each input stream is being read. Unlike checksum or MR methods, numerical entanglement does not use additional processor cores, and the only detriment is that the dynamic range of the entangled inputs is somewhat increased in comparison to the original inputs. However, as will be demonstrated in the next section, this increase depends on the number of jointly-entangled inputs, i.e., on the desired failure-recovery capability. Therefore, one can be traded for the other.

## III Numerical Entanglement for Fail-Stop Reliability in LSB Operations

We first illustrate our approach via its simplest instantiation, i.e., entanglement in groups of three inputs, and then present its general application and discuss its properties.

### III-A Numerical Entanglement in Groups of Three Inputs

#### III-A1 Entanglement

In the simplest form of entanglement, each triplet of input samples taken at the same position within the three integer streams produces the following entangled triplet via superposition operations:

(6)

where:

(7)

is the left or right arithmetic shift of an integer by l bits. If we assume that the utilized integer representation comprises w bits, the l-bit left-shift operations of (6) must be upper-bounded to avoid overflow. Therefore, if the dynamic range of the input streams is k bits:

(8)

in order to ensure no overflow happens from the arithmetic shifts
of (6). The values for l and k
are chosen such that k is maximum within the constraint of (8).
Via the application of LSB operations, each
entangled input stream is converted to the entangled
output stream^{3}^{3}3For particular cases of the LSB operator,
a stream must also be entangled with itself,
in order to retain the homomorphism of the performed operation. (which contains the output values):

(9)

A conceptual illustration of the entangled outputs after (6)
and (9) is given in Fig. 1.
Our description until this point indicates a key aspect: additional bits
of dynamic range are used *within* each entangled input/output
in order to achieve recovery from one fail-stop failure occurring
in the computation of any single stream.
As a practical instantiation of (6),
we can set l = 11 and k = 10 in a signed 32-bit integer
configuration.
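As a sketch of this instantiation, the following entangles three hypothetical streams under an assumed cyclic pairing, in which each entangled element adds one left-shifted input to the next stream's element; the exact pairing of (6) is fixed in Fig. 1, so treat this layout as illustrative. It also checks the homomorphism that lets LSB operations run on entangled data:

```python
l = 11  # left-shift of the paired input, as in the signed 32-bit setting

def entangle3(c1, c2, c3):
    """Assumed cyclic pairing: each entangled stream mixes two inputs."""
    e1 = [(a << l) + b for a, b in zip(c1, c2)]
    e2 = [(b << l) + c for b, c in zip(c2, c3)]
    e3 = [(c << l) + a for c, a in zip(c3, c1)]
    return e1, e2, e3

def scale(s, alpha):
    """An example linear LSB operation: integer scaling."""
    return [alpha * x for x in s]

c1, c2, c3 = [5, -4], [-3, 6], [7, 2]
e1, e2, e3 = entangle3(c1, c2, c3)

# Operating on the entangled streams equals entangling the per-stream
# results, so the cores never need to see the original inputs.
assert [scale(e, 9) for e in (e1, e2, e3)] == \
    list(entangle3(scale(c1, 9), scale(c2, 9), scale(c3, 9)))
```

The homomorphism holds for any linear operation with integer kernels, provided the scaled outputs stay within the dynamic-range bound of (11).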

We now describe the disentanglement and recovery process. The reader can also consult Fig. 1.

#### III-A2 Disentanglement and Recovery

We can disentangle the outputs as follows:

(10)

The first three parts of (10) assume a double-width (2w-bit) integer representation is used for the interim operations, as the temporary variable is stored in 2w-bit integer representation. However, all recovered outputs require only w bits.

Explanation of (10)—see also Fig.
1: The first part creates a composite
number comprising one output in its most-significant
bits and a second output in its least-significant bits. In the second part,
the latter output is extracted by: *(i)* discarding the most-significant
bits; *(ii)* arithmetically shifting the result down to the correct
range. The third part of (10) uses
this value to recover the second output and, in the fourth part
of (10), that output is used to
recover the remaining one.

*Remark 1 (operations within w bits):* To facilitate our exposition,
the first three parts of (10) are presented
under the assumption of a 2w-bit integer representation. However,
it is straightforward to implement them via w-bit integer operations
by separating the temporary variable into two parts of w bits and performing
the operations separately within these parts.

*Remark 2 (recovery without the use of the third entangled stream):*
Notice that (10) does not use the third entangled output stream.
This is a crucial element of our approach: since the recovered outputs
were derived without using it,
full recovery of all outputs takes place even with the loss of one
entangled stream. We are able to do this because, for every position,
the two remaining entangled outputs jointly contain
all three output values, which suffice to recreate
the third entangled output if the latter is not available due to a
fail-stop failure. This link is pictorially illustrated in Fig. 1.
Since the entanglement pattern is cyclically symmetric, it is straightforward
to demonstrate that recovery from the loss of any single one of the three
output streams is possible following the same approach.
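Under the same assumed cyclic pairing as before (one consistent instantiation; the letter's exact layout is in Fig. 1), the recovery chain of (10) can be sketched without ever touching the third entangled stream:

```python
l = 11  # shift of the signed 32-bit instantiation; the pairing is assumed

def recover_from_two(e1, e2):
    """Recover all three outputs r1, r2, r3 from entangled outputs 1 and 2."""
    # First part: a 2w-bit composite, (e1 << l) - e2 == (r1 << 2l) - r3.
    t = (e1 << l) - e2
    # Second part: discard the most-significant bits, then restore the sign
    # (equivalent to arithmetically shifting back into range).
    lo = t & ((1 << (2 * l)) - 1)
    if lo >= 1 << (2 * l - 1):
        lo -= 1 << (2 * l)
    r3 = -lo
    r2 = (e2 - r3) >> l     # third part: e2 == (r2 << l) + r3
    r1 = (e1 - r2) >> l     # fourth part: e1 == (r1 << l) + r2
    return r1, r2, r3

r1, r2, r3 = 45, -27, 63                 # outputs of some LSB operation
e1, e2 = (r1 << l) + r2, (r2 << l) + r3  # entangled outputs; e3 is lost
assert recover_from_two(e1, e2) == (45, -27, 63)
```

Only additions, subtractions, masks and shifts appear, matching the claim that recovery cost is independent of the LSB operation itself.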

*Remark 3 (dynamic range):* The most-significant bit within each recovered
output represents
its sign bit. Given that: *(i)* each entangled output comprises
the addition of two outputs (with one of them left-shifted by l
bits); *(ii)* the entangled outputs must not exceed w bits,
we deduce that the outputs of the LSB operations must not exceed the
range

(11)

Therefore, (11) comprises the range permissible
for the LSB operations of (9) with
the entangled representation of (6).
Thus, we conclude that, for integer outputs produced by the LSB operations
of (9) with range bounded by (11),
the extraction mechanism of (10) is
*necessary and sufficient* for the recovery of *any single
stream* at all stream positions.

### III-B Generalized Entanglement in Groups of M Inputs

We extend the proposed entanglement process to using M inputs and
providing M entangled descriptions, each comprising the linear
superposition of two inputs. This ensures that *any* single failure
will be recoverable within each group of M output samples.

The condition for ensuring that overflow is avoided is

(12)

and the dynamic range supported for all outputs is:

(13)

We now define the following operator that generalizes the proposed numerical entanglement process:

(14)

with the circulant matrix operator comprising cyclic permutations of the entanglement vector.

As before, in the generalized entanglement in groups of M streams,
the values for l and k are chosen such that k is maximum
within the constraint of (12).
Moreover, the exact same principle applies, i.e., pairs of inputs
are entangled together (with one of the two shifted by l bits)
to create each entangled input stream of data. Any LSB operation is
then performed directly on these input streams and *any* single
fail-stop failure will be recoverable within each group of M outputs.
For every input stream position, the
entanglement vector performing the linear superposition of pairs out
of the M inputs is now formed by:

(15)

After the application of (9), we can disentangle every output stream element as follows. We first identify the entangled output stream that is unavailable due to the single core failure. Then, we produce the 2w-bit temporary variable by:

(16)

Notice that (16) does not use the unavailable stream. We can then extract the values of two of the outputs directly from the temporary variable:

(18)

The other outputs can now be disentangled by:

(19)

Given that for every output position we are able to recover *all
results of all streams without using* the unavailable stream
in (16)–(19),
the proposed method is able to recover from a single fail-stop failure
in one of the entangled streams.
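A general-M sketch under the same assumed cyclic pairing (stream m paired with stream m+1 mod M; an illustrative layout, not necessarily the exact instantiation of (15)): the surviving M−1 streams telescope into a 2w-bit temporary, as in (16), from which the chain of outputs unwinds.

```python
def recover(e, f, l):
    """Rebuild all M outputs when entangled stream f is unavailable."""
    M = len(e)
    t = 0
    for j in range(1, M):                 # telescope streams f+1 ... f+M-1
        t = (t << l) + (-1) ** (j - 1) * e[(f + j) % M]
    # Now t == (r_{f+1} << (M-1)l) +/- r_f, a 2w-bit composite as in (16).
    lo = t & ((1 << ((M - 1) * l)) - 1)   # keep the least-significant part
    if lo >= 1 << ((M - 1) * l - 1):      # restore its sign
        lo -= 1 << ((M - 1) * l)
    r = [0] * M
    r[f] = lo if M % 2 == 0 else -lo      # sign of the r_f term alternates with M
    for j in range(M - 1, 0, -1):         # unwind the pairing chain
        m = (f + j) % M
        r[m] = (e[m] - r[(m + 1) % M]) >> l
    return r

# M = 4, l = 8: entangle four outputs, drop any one stream, recover all.
l, r_true = 8, [100, -50, 7, -3]
e = [(r_true[m] << l) + r_true[(m + 1) % 4] for m in range(4)]
assert all(recover(e, f, l) == r_true for f in range(4))
```

Note that the recovery cost per position is a fixed number of shift-add operations per stream, independent of the LSB operation that produced the outputs.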

*Remark 4 (dynamic range of generalized entanglement and equivalence
to checksum methods):* Examples of the maximum bitwidth achievable
for different values of M are given in Table I,
assuming a 32-bit representation. We also present the dynamic range
permitted by the equivalent checksum-based method [(3)–(9)]
in order to ensure that its checksum stream does not overflow under
a 32-bit representation. Evidently, for M ≤ 8, the proposed approach
incurs a loss of 1 to 9 bits of dynamic range against the checksum-based
method, while it allows for higher dynamic range than the checksum-based
method for M ≥ 11. At the same time, our proposal does not require
the overhead of applying the LSB operations to an additional stream,
as it “overlays” the information of each input onto another input
via the numerical entanglement of pairs of inputs. Beyond this important
difference, our approach offers the exact equivalent of the checksum
methods of (3)–(5)
for integer inputs. Therefore, equivalently to checksum methods,
beyond recovery from single fail-stop failures, our proposal can also
be used for the detection of silent data corruptions (SDCs) in any
input stream, as long as such SDCs do not occur in coinciding output
stream positions. We plan to explore this aspect in future work.

| M | l | k | Proposed | Checksum-based |
| --- | --- | --- | --- | --- |
| 3 | 11 | 10 | 21 | 30 |
| 4 | 8 | 8 | 24 | 30 |
| 5 | 7 | 4 | 25 | 29 |
| 8 | 4 | 4 | 28 | 29 |
| 11 | 3 | 2 | 29 | 28 |
| 16 | 2 | 2 | 30 | 28 |
| 32 | 1 | 1 | 31 | 27 |

Table I: Maximum bitwidth supported by: *(i)* the proposed approach under different numbers of entanglements M; *(ii)* the checksum-based method of (3)–(9). Any failure in 1 out of M streams is guaranteed to be recoverable under both frameworks.
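The rows of Table I can be reproduced by closed forms inferred from the tabulated values themselves (they are reverse-engineered assumptions, not formulas stated in the text): l = ⌈w/M⌉, k = w − (M−1)l, a proposed bitwidth of w − l, and a checksum-based bitwidth of w − ⌈log2 M⌉.

```python
from math import ceil, log2

def table_row(M, w=32):
    """Assumed closed forms behind Table I for a w-bit representation."""
    l = ceil(w / M)       # shift between paired inputs
    k = w - (M - 1) * l   # input bitwidth meeting the no-overflow condition
    # Returns (l, k, proposed bitwidth, checksum-based bitwidth).
    return l, k, w - l, w - ceil(log2(M))

rows = {3: (11, 10, 21, 30), 4: (8, 8, 24, 30), 5: (7, 4, 25, 29),
        8: (4, 4, 28, 29), 11: (3, 2, 29, 28), 16: (2, 2, 30, 28),
        32: (1, 1, 31, 27)}
assert all(table_row(M) == row for M, row in rows.items())
```

These forms make the crossover visible: the checksum column shrinks logarithmically with M, while the proposed column improves as the per-pair shift l shrinks.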

## IV Complexity in LSB Operations with Numerical Entanglement

Consider M input integer data streams, each comprising several samples, and consider that an LSB operation op with a given kernel is applied on each stream. The operations count (additions/multiplications) for stream-by-stream sum-of-products between a matrix comprising integer subblocks and a matrix kernel (see [24, 18, 25, 9] for example instantiations) grows with the cube of the subblock dimension. For sesquilinear operations like convolution and cross-correlation of the input integer data streams with a kernel [see Fig. 1(a)], depending on the utilized realization, the number of operations can range from quadratic in the block length for direct algorithms (e.g., time-domain convolution) down to quasilinear for fast algorithms (e.g., FFT-based convolution) [19]. For example, for convolution or cross-correlation under these settings and an overlap-save realization for consecutive block processing, the number of operations (additions/multiplications) per output sample is proportional to the kernel length for time-domain processing and to the logarithm of the block length for frequency-domain processing [19].

As described in Section III, numerical entanglement of the M input integer data streams requires only a small, constant number of operations per sample for the entanglement, extraction and recovery of the results. For example, ignoring all arithmetic-shifting operations (which take a negligible amount of time), the description of Section III shows that the entanglement, extraction and recovery steps amount to a few additions/subtractions per sample, independently of the kernel; the same holds for the special case of the GEMM operation on integer subblocks. For all settings of practical relevance and sesquilinear operations like matrix products, convolution and cross-correlation, the relative overhead of numerical entanglement, extraction and recovery in terms of arithmetic operations is therefore very small. Most importantly,

(20)

i.e., the relative overhead of the proposed approach approaches zero as the dimension of the LSB processing increases.
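This scaling can be illustrated with hypothetical per-sample cost constants (the exact operation counts of Section III are not reproduced here): entanglement adds a fixed number of operations per sample, while time-domain convolution costs on the order of 2K operations per output sample for a length-K kernel, so the ratio vanishes as K grows.

```python
ENTANGLE_OPS = 4  # assumed constant cost per sample for entangle/extract/recover

def relative_overhead(K):
    """Entanglement overhead relative to a 2*K-ops-per-sample convolution."""
    return ENTANGLE_OPS / (2 * K)

ratios = [relative_overhead(K) for K in (10, 100, 1000)]
assert ratios == [0.2, 0.02, 0.002]  # overhead vanishes as the kernel grows
```

A checksum method, by contrast, runs the full 2K-ops convolution on an extra stream, so its relative overhead stays constant regardless of K.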

On the other hand, the overhead of checksum-based methods in terms of operations count (additions/multiplications) scales with the complexity of the LSB operation itself in each case. As expected, the relative overhead of checksum methods converges to a non-vanishing constant as the dimension of the LSB processing operations increases, i.e.,

(21)

Therefore, the checksum-based method for fail-stop mitigation leads to substantial overhead when high reliability is pursued, i.e., when M is small. Finally, even in the low-reliability regime (i.e., when M is large), checksum-based methods will incur a non-negligible overhead in terms of arithmetic operations.

## V Experimental Validation

All our results were obtained using an Intel Core i7-4700MQ 2.40GHz processor (Haswell architecture with AVX2 support, Windows 8 64-bit system, Microsoft Visual Studio 2013 compiler). Entanglement, disentanglement and fail-stop recovery mechanisms were realized using the Intel AVX2 SIMD instruction set for faster processing. For all cases, we also present comparisons with checksum-based recovery, the checksum elements of which were also generated using AVX2 SIMD instructions.

We consider the case of convolution operations of integer streams.
We used Intel’s Integrated Performance Primitives (IPP) 7.0 [26]
convolution routine ippsConv_64f that can handle the dynamic
range required under convolutions with 32-bit integer inputs. We experimented
with a fixed input size and with several kernel
sizes.
Representative results are given in Fig. 2
under two settings for the number of input streams, M, and without
the occurrence of failures, i.e., when operating under normal conditions^{4}^{4}4Under the occurrence of one fail-stop failure, the performance of
the proposed approach remains the same, as the results are disentangled
as soon as any M−1 entangled output streams become available. On the other
hand, the performance of the checksum-based approach will decrease
slightly under a fail-stop failure, since results will need to be
recovered from the checksum stream. The results demonstrate that the proposed approach incurs substantially
smaller overhead for single fail-stop mitigation in comparison to
the checksum-based method. Specifically, the decrease in throughput
for the proposed approach in comparison to the failure-intolerant
case is marginal, while the checksum-based method incurs a throughput
loss that, as expected from the theoretical calculations of Section IV,
is an order of magnitude higher than the overhead of numerical
entanglement.

## VI Conclusions

We propose a new approach to fail-stop failure recovery in linear,
sesquilinear and bijective (LSB) processing of integer data streams
that is based on the novel concept of numerical entanglement. Under
M input streams, the proposed approach provides for:
*(i)* guaranteed recovery from any single fail-stop failure;
*(ii)* complexity overhead that depends only on M and not
on the complexity of the performed LSB operations, thus quickly becoming
negligible as the complexity of the LSB operations increases. These
two features demonstrate that the proposed solution forms a *third
family* of recovery from fail-stop failures (i.e., beyond the well-known
and widely-used checksum-based methods and modular redundancy) and
offers unique advantages. As such, it is envisaged that it will find
usage in a multitude of systems that require enhanced reliability
against core failures in hardware with very low implementation overhead.

## References

- [1] M. Nicolaidis, L. Anghel, N-E Zergainoh, Y. Zorian, T. Karnik, K. Bowman, J. Tschanz, S.-L. Lu, C. Tokunaga, A. Raychowdhury, et al., “Design for test and reliability in ultimate CMOS,” in IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012, pp. 677–682.
- [2] A. R. Alameldeen, I. Wagner, Z. Chishti, W. Wu, C. Wilkerson, and S.-L. Lu, “Energy-efficient cache design using variable-strength error-correcting codes,” in Proc. 38th IEEE Int. Symp. Computer Archit. (ISCA), 2011. IEEE, 2011, pp. 461–471.
- [3] S. Gotoda, M. Ito, and N. Shibata, “Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault,” in Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on. IEEE, 2012, pp. 260–267.
- [4] I. Foster, Y. Zhao, I. Raicu, and S. Lu, “Cloud computing and grid computing 360-degree compared,” in Grid Computing Environments Workshop, 2008. GCE’08. IEEE, 2008, pp. 1–10.
- [5] W. Kurschl and W. Beer, “Combining cloud computing and wireless sensor networks,” in Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services. ACM, 2009, pp. 512–518.
- [6] J. Yang, D. Zhang, A. F. Frangi, and J.-Y. Yang, “Two-dimensional PCA: a new approach to appearance-based face representation and recognition,” IEEE Trans. Patt. Anal. and Machine Intel., vol. 26, no. 1, pp. 131–137, 2004.
- [7] Y. Peng, B. Gong, H. Liu, and Y. Zhang, “Parallel computing for option pricing based on the backward stochastic differential equation,” in Springer High Perform. Comput. and Applic., pp. 325–330. 2010.
- [8] X. Ren, R. Eigenmann, and S. Bagchi, “Failure-aware checkpointing in fine-grained cycle sharing systems,” in Proceedings of the 16th international symposium on High performance distributed computing. ACM, 2007, pp. 33–42.
- [9] Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, “Fault tolerant high performance computing by a coding approach,” in Proc. 10th ACM SIGPLAN Symp. Princip. and Pract. Paral. Prog., 2005, pp. 213–223.
- [10] V. K. Stefanidis and K. G. Margaritis, “Algorithm based fault tolerance: Review and experimental study,” in International Conference of Numerical Analysis and Applied Mathematics. IEEE, 2004.
- [11] Z. Chen, “Optimal real number codes for fault tolerant matrix operations,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 2009, p. 29.
- [12] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Trans. Comput., vol. 100, no. 6, pp. 518–528, 1984.
- [13] F. T. Luk, “Algorithm-based fault tolerance for parallel matrix equation solvers,” SPIE, Real-Time Signal processing VIII, vol. 564, pp. 631–635, 1985.
- [14] J. Sloan, R. Kumar, and G. Bronevetsky, “Algorithmic approaches to low overhead fault detection for sparse linear algebra,” in Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on. IEEE, 2012, pp. 1–12.
- [15] J. Rexford and N. K. Jha, “Algorithm-based fault tolerance for floating-point operations in massively parallel systems,” in Proceedings, 1992 IEEE International Symposium on Circuits and Systems. IEEE, May 1992, vol. 2, pp. 649–652.
- [16] C. Engelmann, H. Ong, and S. L. Scott, “The case for modular redundancy in large-scale high performance computing systems,” in Proc. IASTED Int. Conf., 2009, vol. 641, p. 046.
- [17] L. Rizzo, “Effective erasure codes for reliable computer communication protocols,” ACM SIGCOMM computer communication review, vol. 27, no. 2, pp. 24–36, 1997.
- [18] K. Goto and R. A. Van De Geijn, “Anatomy of high-performance matrix multiplication,” ACM Trans. Math. Softw., vol. 34, no. 3, p. 12, 2008.
- [19] M. A. Anam and Y. Andreopoulos, “Throughput scaling of convolution for error-tolerant multimedia applications,” IEEE Trans. Multimedia, vol. 14, no. 3, pp. 797–804, 2012.
- [20] D. Anastasia and Y. Andreopoulos, “Software designs of image processing tasks with incremental refinement of computation,” IEEE Trans. Image Process., vol. 19, no. 8, pp. 2099–2114, 2010.
- [21] C. Lin, B. Zhang, and Y. F. Zheng, “Packed integer wavelet transform constructed by lifting scheme,” IEEE Trans. Circ. and Syst. for Video Technol., vol. 10, no. 8, pp. 1496–1501, 2000.
- [22] P. M. Fenwick, “The Burrows–Wheeler transform for block sorting text compression: principles and improvements,” The Comp. J., vol. 39, no. 9, pp. 731–740, 1996.
- [23] V. S. S. Nair and J. A. Abraham, “General linear codes for fault tolerant matrix operations on processor arrays,” in Int. Symp. Fault Tolerant Comput. IEEE, 1988, pp. 180–185.
- [24] D. Anastasia and Y. Andreopoulos, “Throughput-distortion computation of generic matrix multiplication: Toward a computation channel for digital signal processing systems,” IEEE Trans. Signal Process., vol. 60, no. 4, pp. 2024–2037, 2012.
- [25] D. G. Murray and S. Hand, “Spread-spectrum computation,” in Proc. USENIX 4th Conf. Hot Top. in Syst. Dependab., 2008, pp. 5–9.
- [26] S. Taylor, Intel Integrated Performance Primitives: How to Optimize Software Applications Using Intel IPP, 2003.