A Row-parallel 8×8 2-D DCT Architecture Using Algebraic Integer Based Exact Computation


Abstract

An algebraic integer (AI) based time-multiplexed row-parallel architecture and two final-reconstruction-step (FRS) algorithms are proposed for the implementation of the bivariate AI-encoded 2-D discrete cosine transform (DCT). The architecture directly realizes an error-free 2-D DCT without using FRSs between the row-column transforms, leading to an 8×8 2-D DCT which is entirely free of quantization errors in the AI basis. As a result, the user-selectable accuracy for each of the coefficients in the FRS allows each of the 64 coefficients to have its precision set independently of the others, avoiding the leakage of quantization noise between channels, as is the case for published DCT designs. The proposed FRS uses two approaches based on (i) optimized Dempster-Macleod multipliers and (ii) expansion-factor scaling. This architecture enables low-noise high-dynamic-range applications in digital video processing that require full control of the finite-precision computation of the 2-D DCT. The proposed architectures and FRS techniques are experimentally verified and validated using hardware implementations that are physically realized and verified on an FPGA chip. Six designs, for 4- and 8-bit input word sizes, using the two proposed FRS schemes, have been designed, simulated, physically implemented, and measured. The maximum clock rate and block rate achieved among the 8-bit input designs are 307.787 MHz and 38.47 MHz, respectively, implying a pixel rate of 8 × 307.787 MHz ≈ 2.462 GHz if eventually embedded in a real-time video-processing system. The equivalent frame rate is about 1187.35 Hz for an image size of 1920×1080. All implementations are functional on a Xilinx Virtex-6 XC6VLX240T FPGA device.

Keywords

DCT, Algebraic Integer Quantization, FPGA design


1 Introduction

High-quality digital video in multimedia devices and video-over-IP networks connected to the Internet is growing exponentially, and therefore the demand for applications capable of high dynamic range (HDR) video is accordingly increasing. Some HDR imaging applications include automatic surveillance [1, 2, 3, 4], geospatial remote sensing [5], traffic cameras [6], homeland security [4], satellite-based imaging [7, 8, 9], unmanned aerial vehicles [10, 11, 12], the automotive industry [13], and multimedia wireless sensor networks [14]. Such HDR video systems operating at high resolutions require associated hardware capable of significant throughput at allowable area-power complexity.

Efficient codec circuits capable of both high-speeds of operation and high numerical accuracy are needed for next-generation systems. Such systems may process massive amounts of video feeds, each at high resolution, with minimal noise and distortion while consuming as little energy as possible [15].

The two-dimensional (2-D) discrete cosine transform (DCT) operation is fundamental to almost all real-time video compression systems. The circuit realization of the DCT directly relates to the noise, distortion, circuit area, and power consumption of the related video devices [15]. Usually, the 2-D DCT is computed by successive calls of the one-dimensional (1-D) DCT applied to the columns of an 8×8 sub-image, then to the rows of the transposed resulting intermediate calculation [16]. The VLSI implementation of trigonometric transforms such as the DCT and DFT is indeed an active research area [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33].

An ideal 8-point 1-D DCT requires multiplications by numbers of the form cos(kπ/16), k = 1, 2, …, 7. These constants impose computational difficulties in terms of binary number representation since they are not rational. Usual DCT implementations adopt a compromise solution to this problem, employing truncation or rounding [34, 35] to approximate such quantities. Thus, instead of employing the exact value cos(kπ/16), a quantized value is considered. Clearly, this operation introduces errors.

One way of addressing this problem is to employ algebraic integer (AI) encoding [36, 37]. The AI-encoding philosophy consists of mapping possibly irrational numbers to arrays of integers, which can be arithmetically manipulated without errors. Also, depending on the numbers to be encoded, this mapping can be exact. For example, all 8-point DCT multipliers can be given an exact AI representation [38]. Eventually, after computation is performed, AI-based algorithms require a final reconstruction step (FRS) in order to map the resulting encoded integer arrays back into the usual fixed-point representation at a given precision [36].

Besides the numerical representation issues, error propagation also plays a role. In particular, when considering the fixed-point realization of the multiplication operation, quantization errors are prone to be amplified in the DCT computation [39, 40]. Quantization noise at a particular 2-D DCT coefficient can have significant correlation with noise in other coefficients depending on the statistics of the video signal of interest [39, 40, 31, 33]. Combating noise injection, noise coupling, and noise amplification is a concern in a practical DCT implementation [39, 40, 34, 35, 31, 33].

In [41, 42], AI-based procedures for the 2-D DCT are proposed. Their architecture was based on the low-complexity Arai algorithm [43], which formed the building block of each 1-D DCT using AI number representation. The Arai algorithm is popular for video and image processing applications because of its relatively low computational complexity. It is noted that the 8-point Arai algorithm needs only five multiplications to generate the eight output coefficients. Thus, we naturally choose this low-complexity algorithm as a foundation for proposing optimized architectures having lower complexity and lower noise. However, such a design required the algebraically encoded numbers to be reconstructed to their fixed-point format at the end of the column-wise DCT calculation by means of an intermediate reconstruction step. Then data are re-coded to enter the row-wise DCT calculation block [41, 42]. This approach is not ideal because it introduces both numerical representation errors and error propagation from the intermediate FRS to subsequent blocks.

We propose a digital hardware architecture for the 2-D DCT capable of (i) arbitrarily high numeric accuracy and (ii) high-throughput. To achieve these goals our design maintains the signal flow free of quantization errors in all its intermediate computational steps by means of a novel doubly AI encoding concept. No intermediate reconstruction step is introduced and the entire computation truly occurs over the AI structure. This prevents error propagation throughout intermediate computation, which would otherwise result in error correlation among the final DCT coefficients. Thus errors are totally confined to a single FRS that maps the resulting doubly AI encoded DCT coefficients into fixed-point representations [36]. This procedure allows the selection of individual levels of precision for each of the 64 DCT spectral components at the FRS. At the same time, such flexibility does not affect noise levels or speed of other sections of the 2-D DCT.

This work extends the 8-point 1-D AI-based DCT architecture [41, 42, 37] into a fully-parallel time-multiplexed 2-D architecture for 8×8 data blocks. The fundamental differences are (i) the absence of any intermediate reconstruction step; (ii) a new doubly AI encoding scheme; and (iii) the utilization of a single FRS. The proposed 2-D architecture has the following characteristics: (i) independently selectable precision levels for the 2-D DCT coefficients; (ii) total absence of multiplication operations; and (iii) absence of leakage of quantization noise between coefficient channels. The proposed architectures aim at performing the FRS operation directly in the bivariate encoded 2-D AI basis. We introduce designs based on (i) optimized Dempster-Macleod multipliers and on (ii) the expansion factor approach [44]. All hardware implementations are designed to be realized on field programmable gate arrays (FPGAs) from Xilinx [45].

This paper unfolds as follows. In Section 2, we review existing designs and the main theoretical points of number representation based on AI, keeping our focus on the core results needed for our design. Section 3 provides a description of the proposed circuitry and hardware architecture at block-level detail. In Section 4, strategies for obtaining the FRS block are proposed and described. Simulation results and actual test measurements are reported in Section 5. Concluding remarks are drawn in Section 6.

2 Review

The AI encoding was originally proposed for digital signal processing systems by Cozzens and Finkelstein [46]. Since then it has been adapted for the VLSI implementation of the 1-D DCT and other trigonometric transforms by Julien et al. in [47, 48, 49, 50, 51], leading to a 1-D bivariate encoded Arai DCT algorithm by Wahid and Dimitrov [41, 42, 52, 37]. Recently, subsequent contributions by Wahid et al. (using bivariate encoded 1-D Arai DCT blocks for the row and column transforms of the 2-D DCT) have led to practical area-efficient VLSI video processing circuits with low power consumption [53, 54, 55]. We now briefly summarize the state-of-the-art in both 1-D and 2-D DCT VLSI cores based on conventional fixed-point arithmetic as well as on AI encoding.

2.1 Summary and Comparison with Literature

Fixed-Point DCT VLSI Circuits

A unified distributed-arithmetic parallel architecture for the computation of the DCT and the DST was proposed in [24]. A direct-connected 3-D VLSI architecture for the 2-D prime-factor DCT that does not need a transpose memory (buffer) is available in [25]. A pioneering implementation at a clock of 100 MHz on 0.8 μm CMOS technology for the 2-D DCT, suitable for HDTV applications, is available in [17].

An efficient VLSI linear array for both the DCT and the IDCT using a subband decomposition algorithm, which results in reduced computational and hardware complexity, with an FPGA realization, is reported in [20]. Recently, VLSI linear-array 2-D architectures and FPGA realizations having low computational complexity for the forward DCT were reported in [21].

An efficient adder-based 2-D DCT core on 0.35 μm CMOS using cyclic convolution is described in [29]. A high-performance video transform engine employing a space-time scheduling scheme for computing the 2-D DCT in real time has been proposed and implemented in 0.18 μm CMOS [22]. A systolic-array algorithm using a memory-based design for both the DCT and the discrete sine transform, suitable for real-time VLSI realization, was proposed in [18]. An FPGA-based system-on-chip realization of the 2-D DCT that operates at 107 MHz with a latency of 80 cycles is available in [28]. A low-complexity IP core for the quantized DCT combined with MPEG-4 codecs and FPGA synthesis is available in [30]. A “new distributed-arithmetic (NEDA)” based low-power 2-D DCT is reported in [31]. A reconfigurable processor on TSMC 0.13 μm CMOS technology operating at 100 MHz is described in [32] for the calculation of the fast Fourier transform and the 2-D DCT. A high-speed 2-D transform architecture based on the NEDA technique, having a unique kernel for multi-standard video processing, is described in [33].

AI-based DCT VLSI Circuits

The following AI-based realizations of the 2-D DCT computation rely on the row- and column-wise application of 1-D DCT cores that employ AI quantization [47, 48, 49, 50, 51]. The architectures proposed by Wahid et al. rely on the low-complexity Arai algorithm and lead to low-power realizations [41, 42, 53, 54, 52]. However, these realizations are also based on the repeated application, along rows and columns, of a fundamental 1-D DCT building block having an FRS section at the output stage. Here, “2-D DCT” refers to the use of bivariate encoding in the AI basis and not to a true AI-based 2-D DCT operation.

An approximate 2-D DCT using AI quantization is reported in [56]. Both FPGA implementation and ASIC synthesis results on 90 nm CMOS are provided. Although [56] employs AI encoding, it is not an error-free architecture. The low complexity of this architecture makes it suitable for H.264 realizations.

2.2 Preliminaries for Algebraic Integer Encoding and Decoding

In order to prevent quantization noise, we adopt the AI representation. Such a representation is based on a mapping function that links input numbers to integer arrays.

This topic is a major and classic field in number theory. A famous exposition is due to Hardy and Wright [57, Chap. XI and XIV], which is widely regarded as a masterpiece on this subject for its clarity and depth. Pohst also gives a didactic explanation in [58], with emphasis on computational realization. In [59, p. 79], Pollard and Diamond devote an entire chapter to the connections between algebraic integers and integral bases. In the following, we furnish an overview focused on the practical aspects of AI, which may be useful for circuit designers.

Definition 1

A real or complex number is called an algebraic integer if it is a root of a monic polynomial with integer coefficients [57, 38].

The set of algebraic integers has useful mathematical properties. For instance, it forms a commutative ring: it is closed under addition and multiplication, both operations are commutative and associative, and multiplication distributes over addition.

A general AI encoding mapping has the following format:

x ↦ a,

where a is a multidimensional array of integers associated with a fixed multidimensional array ω of algebraic integers. It can be shown that there always exist integers such that any real number can be represented with arbitrary precision [46]. Also, there are real numbers that can be represented without error.

The decoding operation is furnished by

x = a ⊙ ω,

where the binary operation ⊙ is the generalized inner product, i.e., a component-wise product of the multidimensional arrays followed by a summation of all terms. The elements of ω constitute the AI basis. In hardware, decoding is often performed by an FRS block, where the AI basis is represented as precisely as required.

As an example, let the AI basis be ω = [1, √2]ᵀ, where √2 is the algebraic integer (a root of the monic polynomial x² − 2) and the superscript ᵀ denotes the transposition operation. Thus, a possible AI encoding mapping is x ↦ [a, b], where a and b are integers. Encoded numbers are then represented by a 2-point vector of integers. The decoding operation is simply given by the usual inner product: x = a + b√2. For example, the number 1 + √2 has the following encoding:

1 + √2 ↦ [1, 1],

which is an exact representation.
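These encoding and decoding operations can be sketched in software. The minimal model below assumes, for illustration, the univariate basis (1, √2) discussed above; arithmetic on encoded pairs is exact, and only the final decode touches floating point:

```python
import math

# Exact arithmetic over Z[sqrt(2)]: the number a + b*sqrt(2) is stored as the
# integer pair (a, b). Illustrative univariate basis; the proposed design
# uses a bivariate basis, but the mechanics are analogous.

def ai_add(x, y):
    # (a + b*sqrt(2)) + (c + d*sqrt(2)) = (a + c) + (b + d)*sqrt(2)
    return (x[0] + y[0], x[1] + y[1])

def ai_mul(x, y):
    # (a + b*sqrt(2)) * (c + d*sqrt(2)) = (ac + 2bd) + (ad + bc)*sqrt(2)
    return (x[0] * y[0] + 2 * x[1] * y[1], x[0] * y[1] + x[1] * y[0])

def ai_decode(x):
    # Reconstruction: inner product of the integer pair with the basis (1, sqrt(2)).
    # Only this step introduces finite-precision error.
    return x[0] + x[1] * math.sqrt(2)

# The irrational number 1 + sqrt(2) is encoded exactly as (1, 1); squaring it
# inside the AI domain stays exact: (1 + sqrt(2))**2 = 3 + 2*sqrt(2), i.e. (3, 2).
u = (1, 1)
v = ai_mul(u, u)
```

Note that no rounding occurs until `ai_decode` is called, which mirrors the role of the FRS in the proposed hardware.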

In principle, any number can be represented in an arbitrarily high precision [46, 60]. However, within a limited dynamic range for the employed integers, not all numbers can be exactly encoded. For instance, considering the real number , we have , where integers were limited to be 8-bit long. Although very close, the representation is not exact:

In a similar way, the multipliers required by the DCT can be encoded into 2-point integer vectors. Given that the DCT constants are algebraic integers [38], an exact AI representation can be derived [61]. Thus, the resulting integer sequences can be easily realized in VLSI hardware.

The multiplication between two numbers represented over an AI basis may be interpreted as a modular polynomial multiplication with respect to the monic polynomial that defines the AI basis. In the particular illustrative example above, consider the multiplication of a pair of numbers a + b√2 and c + d√2, where a, b, c, and d are integers. This operation is equivalent to the computation of the following expression:

(a + b√2)(c + d√2) = (ac + 2bd) + (ad + bc)√2,

i.e., the polynomial product (a + bx)(c + dx) reduced modulo x² − 2.

Thus, existing algorithms for fast polynomial multiplication may be of consideration [62, p. 311].
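As a hedged illustration of this equivalence (again assuming the example basis defined by the monic polynomial x² − 2), the sketch below multiplies encoded numbers as polynomials and then performs the modular reduction:

```python
def polymul_mod(p, q, root_square=2):
    """Multiply AI-encoded numbers as polynomials in x, then reduce using
    x**2 == root_square (i.e., modulo the monic polynomial x**2 - root_square).

    Coefficient lists are ordered lowest degree first, e.g. a + b*x is [a, b].
    """
    prod = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            prod[i + j] += pi * qj
    # Fold every term of degree >= 2 back down using x**2 = root_square.
    while len(prod) > 2:
        c = prod.pop()                     # coefficient of the highest power
        prod[len(prod) - 2] += root_square * c
    return prod

# (1 + x)(1 + x) mod (x**2 - 2) = 3 + 2x, matching (1 + sqrt(2))**2 = 3 + 2*sqrt(2).
```

For the short polynomials used here the schoolbook product suffices; the fast-multiplication algorithms cited below matter only for much longer coefficient arrays.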

In practical terms, a good AI representation possesses a basis such that: (i) the required constants can be represented without error; (ii) the integer elements provided by the representation are sufficiently small to allow a simple architecture design and fast signal processing; and (iii) the basis itself contains few elements to facilitate simple encoding-decoding operations.

Other AI procedures allow the constants to be approximated, yielding much better options for encoding, at the cost of introducing error within the transform (before the FRS) [38].

2.3 Bivariate AI Encoding

Depending on the DCT algorithm employed, only the cosines of a few arcs are in fact required. We adopted the Arai DCT algorithm [43]; the required elements for this particular 1-D DCT method are only cos(π/4), cos(π/8), and cos(3π/8) [41, 42, 37]:

These particular values can be conveniently encoded as follows. Considering z1 = 2cos(π/8) and z2 = 2cos(3π/8), for which z1·z2 = 2cos(π/4), we adopt the following 2-D array for AI encoding:

ω = [1, z1; z2, z1·z2].

This leads to 2-D encoded coefficients of the form a + b·z1 + c·z2 + d·z1·z2, with integers a, b, c, d (scaled by 4):

Such encoding is referred to as bivariate. For this specific AI basis, the required cosine values possess an error-free and sparse representation, as given in Table 1 [42, 41, 37]. We also note that this representation utilizes very small integers and is therefore suitable for fast arithmetic computation. Moreover, the employed integers are powers of two, which require no hardware components other than wired shifts, being cost-free.

Table 1: 2-D AI encoding of Arai DCT constants

Encoding an arbitrary real number can be a sophisticated operation requiring the usage of look-up tables and greedy algorithms [63]. Essentially, an exhaustive search is required to obtain the most accurate representation. However, integer numbers can be encoded effortlessly:

x ↦ [x, 0; 0, 0],   (1)

where x is an integer. In this case, the encoding step is unnecessary. Our proposed design takes advantage of this property.

For a given encoded number [a, b; c, d], the decoding operation is simply expressed by:

x = a + b·z1 + c·z2 + d·z1·z2.

In terms of circuitry design, this operation is usually performed by the FRS.

In order to reduce and simplify the employed notation, hereafter a superscript notation is used for identifying the bivariate AI-encoded coefficients. For a given real x, we have the following representation

x = x^(1) + x^(z1)·z1 + x^(z2)·z2 + x^(z1z2)·z1·z2,   (2)

where the superscripts (1), (z1), (z2), and (z1z2) indicate the encoded integers associated with the basis elements 1, z1, z2, and z1·z2, respectively. We denote this basis as {1, z1, z2, z1·z2}.

It is worth emphasizing that in the 2-D AI encoding the equivalence between algebraic integer multiplication and polynomial modular multiplication does not hold. Thus, a tailored computational technique to handle this operation must be developed.

3 2-D AI DCT Architecture

An 8×8 image block X has its 2-D DCT transform mathematically expressed by [16]:

F = C · X · C^T,   (3)

where C is the usual DCT matrix [44]. It is straightforward to notice that this operation corresponds to the column-wise application of the 1-D DCT to the input image X, followed by a transposition, and then the row-wise application of the 1-D DCT to the resulting matrix.
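The row-column decomposition can be verified numerically. The sketch below is a floating-point reference only (the proposed hardware performs the equivalent computation over AI arithmetic): it builds the orthogonal 8-point DCT matrix C and checks that applying 1-D DCTs column-wise, transposing, applying them again, and transposing back reproduces C·X·C^T:

```python
import numpy as np

N = 8
# Orthogonal 8-point DCT-II matrix.
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] /= np.sqrt(2.0)   # rescale the DC row so that C is orthogonal

rng = np.random.default_rng(1)
X = rng.integers(0, 256, size=(N, N)).astype(float)   # one 8x8 pixel block

F_direct = C @ X @ C.T          # the 2-D DCT in one step
F_rowcol = (C @ (C @ X).T).T    # column-wise DCT, transpose, column-wise DCT, transpose
```

Since (C·(C·X)^T)^T = C·X·C^T, the two results agree to machine precision.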

The 2-D DCT realizations in [64, 65, 42, 41] use the AI encoding scheme with decoding sections placed between the row- and column-wise 1-D DCT operations. This intermediate reconstruction step leads to the introduction of quantization noise and cross-coupling of correlated noise components. In contrast, we employ a bivariate AI encoding, maintaining the computation over AI arithmetic to completely avoid arithmetic errors within the algorithm [61].

Figure 1: 1-D AI Arai DCT block used in Fig. 3 [41].
Figure 2: 1-D AI transpose buffer used in Fig. 3.
Figure 3: The 2-D AI-DCT consists of an input section having a decimation structure, 1-D 8-point AI-DCT block for column-wise DCTs, a real-time AI-TB, four parallel 1-D 8-point AI-DCT blocks for row-wise DCTs, and an FRS  [61].

The proposed architecture consists of five sub-circuits [61]: (i) an input decimator circuit; (ii) an 8-point AI-encoded 1-D DCT block, shown in Fig. 1, which performs the column-wise computation based on the Arai algorithm [43] and furnishes the intermediate result in the AI domain; (iii) an AI-based transposition buffer, shown in Fig. 2, with a wired cross-connection block for obtaining the transposed intermediate result; (iv) four parallel instantiations of the same 8-point AI-based Arai DCT block in Fig. 1 for the row-wise computation of eight 1-D DCTs, which completes the 2-D transform in the AI domain; and (v) an FRS circuit for mapping the AI-encoded 2-D DCT coefficients to 2's complement format. The last transposition in (3) is obtained via wired cross-connections. The proposed architecture is shown in Fig. 3.

Our implementation covers items (ii)–(v) listed above. We now describe in detail each of the system blocks.

3.1 Bit Serial Data Input, SerDes, and Decimation

We assume that the input video data, in raster-scanned format, has already been split into 8×8 pixel blocks. We further assume that these blocks can be stacked to form an 8-column data structure whose number of rows depends on the frame size. This leads to so-called “blocked” video frames, composed of blocks of size 8×8 pixels. The blocking procedure leads to a raster-scanned sequence of pixel intensity (or color) values from an 8×8 blocked image. Notice that we use column-row order for the indexes, instead of row-column. Due to the 8×8 size of the 2-D DCT computation, we find it quite convenient to consider the time index modulo 64. Hereafter, we will refer to the time index as a modular quantity.

The video signal is serially streamed through the input port of the architecture at the master clock rate. A bit-serial port connected to a serializer/deserializer (SerDes) is required to be fed at the corresponding bit rate, without considering overheads. As an aside, we note that this input bit stream may typically be derived from optical fiber transmission or high-throughput Ethernet ports driven at 9.6 Gbps. Following the SerDes, a decimation block converts the input byte sequence into a row structure by means of delaying and downsampling by eight, as shown in Fig. 3.

Therefore, the raster-scanned input is decimated in time into eight parallel streams, each operating at one-eighth of the input rate, resulting in the eight columns of the input block. It is important to emphasize that such input data consist of integer values. Thus, they are AI coded without any computation, as shown in (1). The obtained column data is submitted to the column-wise application of the AI-based 1-D DCT.
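Behaviorally, this delay-and-downsample stage is a simple polyphase split. A software model (not a hardware description) can be sketched as:

```python
def decimate_by_8(stream):
    """Software model of the delay-and-downsample input stage.

    `stream` is the raster-scanned serial sequence of pixel values, one per
    master-clock cycle. Sample t is routed to channel t mod 8 (its column
    index inside the 8x8 block), so each of the eight output streams runs
    at one-eighth of the input rate.
    """
    channels = [[] for _ in range(8)]
    for t, sample in enumerate(stream):
        channels[t % 8].append(sample)
    return channels
```

For example, samples 0, 8, 16, … all land on channel 0, forming the first column of successive blocks.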

3.2 An 8-point AI-Encoded Arai DCT Core

The column-wise transform operation is performed according to the 8-point AI-based Arai DCT hardware cores designed in [41, 42] and shown in Fig. 1. Here, this scheme is employed with its original FRS removed. The proposed 2-D architecture employs integer arithmetic entirely defined over the AI basis {1, z1, z2, z1·z2}. This transformation step operates at the reduced clock rate.

Indeed, the resulting AI-encoded data components are split into four channels according to their basis representation [61]. Such outputs are time-multiplexed mixed-domain partially computed spectral components, indexed by their column index and by a modular time index containing the information of the row number.

In hardware, this means that the AI representation is contained in at most four parallel integer channels [61]. Some quantities are known beforehand to require fewer than four AI-encoded integers (cf. (2)). Thus, in some cases, fewer than four connections are required. These channels are routed to the proposed AI-based transpose buffer (AI-TB) shown in Fig. 2, as necessary pre-processing for the subsequent row-wise DCT calculation.

3.3 Real-time AI-based Transpose Buffer

Each partially computed transform component from the column-wise DCT block is represented in the AI basis {1, z1, z2, z1·z2}. Such encoded components are stored in the proposed AI-TB (shown in Fig. 2 for a single channel), which computes an 8×8 matrix transposition operation in real time every eight clock cycles.

The proposed AI-TB consists of a chain of clocked first-in-first-out (FIFO) buffers for each AI-based channel of each component of the column-wise transformation [61]. For each parallel integer channel, there are eight FIFO taps clocked at the reduced rate. Therefore, the set of FIFO buffers leads to a bank of parallel output ports from the FIFO buffer section.

Hard-wired cross-connections are used that physically realize the required matrix transposition for the next row-wise DCT section. These physical connections are encapsulated in the cross-connection block in Fig. 3 for brevity. The AI-TB is clocked at the master clock rate and yields a new 8×8 block of transposed data every 64 periods of the master clock. Subsequently, the transposed AI-encoded elements are submitted to four 1-D AI DCT cores operating in parallel.
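The behavior of one integer channel of the AI-TB can be modeled as follows; this sketch simply buffers eight 8-element columns and emits the transposed block, mirroring the FIFO-plus-cross-connection function with the cycle-level timing abstracted away:

```python
class TransposeBufferModel:
    """Behavioral model of one integer channel of the AI-TB.

    Eight 8-element columns are buffered (64 master-clock cycles in the
    hardware); once the block is complete, it is emitted transposed, i.e.,
    read out row-wise. FIFO timing is abstracted away.
    """

    def __init__(self):
        self.cols = []

    def push_column(self, col):
        assert len(col) == 8
        self.cols.append(list(col))
        if len(self.cols) < 8:
            return None                    # block not yet complete
        block, self.cols = self.cols, []
        # Output row r gathers element r of every stored column: the transpose.
        return [[block[c][r] for c in range(8)] for r in range(8)]
```

A new transposed block thus becomes available once per 64 input samples, matching the block rate stated above.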

3.4 Row-wise DCT Computation

After the cross-connection routing, the output taps from the transposition operation are connected to 32 parallel 8:1 multiplexers. Each multiplexer commutes continuously and routes each partially computed DCT component by cycling through its 3-bit control codes, such that the channel inputs of each of the four row-wise AI-based DCT cores are provided with a new set of valid input vectors at the required rate.

Figure 4: Row-wise DCT block that leads to the 2-D DCT of the 8×8 input blocks.

The cores are set in parallel and are able to compute an 8-point DCT every eight clock cycles of the master clock signal. This operation performs the required row-wise DCT computation in order to complete the 2-D DCT evaluation, resulting in a doubly encoded AI representation of the 64 coefficients. Fig. 4 shows the block described above.

3.5 Final Reconstruction Step

The output channels for the 64 2-D DCT coefficients are passed through the proposed FRS for decoding the AI-encoded numbers back into their fixed-point binary representation, in 2's complement format. Two different architectures are proposed for the FRS.

4 Final Reconstruction Step

The proposed FRS architectures differ from the one in [64] by having individualized circuits to compute each output value at possibly different precisions.

Indeed, no FRS circuits are employed in any intermediate 1-D DCT block. This prevents quantization noise cross-coupling between DCT channels. Any quantization noise is injected only at the final output. Therefore, the noise signals are uncorrelated, which further allows the noise for each output to be independently adjusted and made as low as required.

4.1 FRS based on Dempster-Macleod method

In this method the doubly encoded elements can be decoded according to:

(4)

which are then submitted to (2). The result is the corresponding row of the final 2-D DCT data.

Therefore, for each basis element, (4) unfolds into a particular mathematical expression, as shown below:

(5)
(6)
(7)
(8)

The summation of the above quantities returns the decoded coefficient (cf. (2)). The terms depending on z1 and z2 may not be rational numbers. Indeed, they are given by

(9)

One of the multipliers is a power of two and can be represented exactly. The remaining constants require a binary approximation.

The closest signed 12-bit approximations can be employed for the numbers listed above. This approach furnished the quantities below:
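The quantization rule used here (rounding to the closest signed 12-bit fixed-point value) can be sketched as follows; the sample constant cos(π/8) is purely illustrative and is not claimed to be one of the constants in (9):

```python
import math

def quantize_signed_12bit(c, frac_bits=11):
    """Closest signed 12-bit fixed-point approximation of a constant c:
    1 sign bit plus frac_bits fractional bits (valid for |c| < 1)."""
    q = round(c * (1 << frac_bits))
    q = max(-(1 << frac_bits), min((1 << frac_bits) - 1, q))   # 12-bit signed range
    return q / (1 << frac_bits)

c = math.cos(math.pi / 8)      # illustrative constant only
c_hat = quantize_signed_12bit(c)
error = abs(c - c_hat)         # at most half an LSB, i.e. 2 ** -12
```

The approximation error is bounded by half of the least significant bit, which is what makes the per-coefficient precision of the FRS directly tunable through the word length.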

Consequently, the 12-bit approximation expressions related to (5)–(8) are given by:

(10)
(11)
(12)
(13)
Figure 5: Final reconstruction step blocks with multi-level pipelining, for (10) and (11), respectively.
Figure 6: Final reconstruction step blocks with multi-level pipelining, for (12) and (13), respectively.
Input: x; Output: y = c·x, computed by a short shift-and-add sequence for each required constant multiplier c ∈ {669, 2217, 181, 3135, 473, 437, 2399, 8}.
Table 2: Fast algorithms for required integer multipliers

Finally, considering the above quantities and applying (2), the sought fixed-point representations are fully recovered. The hardware implementation of the multiplier circuits required by the above 12-bit approximations is accomplished using the method of Dempster and Macleod [66, 67]. This method is known to be optimal for constant integer multiplier circuits.

In this multiplierless method, the minimum number of 2-input adders is used for each constant integer multiplier. Wired shifts, which perform “costless” multiplications by powers of two, are used in each constant integer multiplier. Here, an enhancement to the Dempster-Macleod method is made for the constant integer multiplier circuits: the number of adder-bits is minimized, rather than the number of 2-input adders, yielding a smaller overall design.

Accordingly, the multiplications by non-powers of two shown in expressions (10)–(13) can be algorithmically implemented as described in Table 2. Figs. 5 and 6 depict the corresponding pipelined implementation. Here, the various stages of the pipelined FRS architectures are shown by having FIFO registers (consisting of parallel delay flip-flops (D-FFs)) vertically aligned in the figures. Vertically aligned D-FFs indicate the same computation point in a pipelined constant-coefficient multiplication within the FRS.
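For intuition, multiplication by one of the Table 2 constants can be realized with shifts and adds alone. The sketch below uses the plain binary expansion of 669 (five adders); the optimized Dempster-Macleod adder graphs referenced in Table 2 generally achieve the same product with fewer adders, so this illustrates the principle rather than the optimized circuit:

```python
def mul_669(x):
    # 669 = 512 + 128 + 16 + 8 + 4 + 1, so y = 669*x needs only
    # wired shifts (cost-free in hardware) and five additions.
    return (x << 9) + (x << 7) + (x << 4) + (x << 3) + (x << 2) + x
```

In hardware, each `<<` is just routing, so the cost of the constant multiplier is exactly its adder count, which is what the Dempster-Macleod optimization (and the adder-bit refinement above) minimizes.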

4.2 FRS based on expansion factor scaling

The set of exact values given in (9) suggests further relations among those quantities. Indeed, the following relations may be established:

These identities indicate that a new design can be devised. In fact, by substituting the above relations into (5)–(8), we obtain the following expressions:

Notice that the output value is the summation of the above quantities. Therefore, by grouping terms, we can express the output by the following summation: