Low-complexity 8-point DCT Approximation Based on Angle Similarity for Image and Video Coding

# Low-complexity 8-point DCT Approximation Based on Angle Similarity for Image and Video Coding

Raíza S. Oliveira R. S. Oliveira is with the Programa de Pós-Graduação em Engenharia Elétrica, Universidade Federal de Pernambuco (UFPE), Recife, Brazil; and with the Signal Processing Group, Departamento de Estatística, UFPE.    Renato J. Cintra Renato J. Cintra is with the Signal Processing Group, Departamento de Estatística, Universidade Federal de Pernambuco, Recife, Brazil; ECE, University of Calgary, Calgary, AB, Canada. E-mail: rjdsc@de.ufpe.br    Fábio M. Bayer F. M. Bayer is with the Departamento de Estatística, Universidade Federal de Santa Maria, Santa Maria, and LACESM, Brazil.    Thiago L. T. da Silveira T. L. T. Silveira is with the Programa de Pós-Graduação em Computação, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil.    Arjuna Madanayake Arjuna Madanayake is with the Department of Electrical and Computer Engineering, University of Akron, Akron, OH.    André Leite André Leite is with the Departamento de Estatística, Universidade Federal de Pernambuco, Recife, Brazil. E-mail: leite@de.ufpe.br
{onecolabstract}

The principal component analysis (PCA) is widely used for data decorrelation and dimensionality reduction. However, the use of PCA may be impractical in real-time applications, or in situations were energy and computing constraints are severe. In this context, the discrete cosine transform (DCT) becomes a low-cost alternative to data decorrelation. This paper presents a method to derive computationally efficient approximations to the DCT. The proposed method aims at the minimization of the angle between the rows of the exact DCT matrix and the rows of the approximated transformation matrix. The resulting transformations matrices are orthogonal and have extremely low arithmetic complexity. Considering popular performance measures, one of the proposed transformation matrices outperforms the best competitors in both matrix error and coding capabilities. Practical applications in image and video coding demonstrate the relevance of the proposed transformation. In fact, we show that the proposed approximate DCT can outperform the exact DCT for image encoding under certain compression ratios. The proposed transform and its direct competitors are also physically realized as digital prototype circuits using FPGA technology.

Keywords

DCT Approximation, Fast algorithms, Image/video encoding

## 1 Introduction

Data decorrelation is a central task in many statistical and signal processing problems [1, 2, 3]. Decorrelation can be accomplished by means of a linear transformation that converts correlated observations into linearly uncorrelated values. This operation is commonly performed by principal component analysis (PCA) [2]. PCA is widely used to reduce the dimensionality of data [2, 4], where the information contained in all the original variables is replaced by the data variability information of the initial few uncorrelated principal components. The quality of such approximation depends on the number of components used and the proportion of variance explained, or energy retained, by each of them.

In the field of analysis and processing of images and signals, PCA, also known as the discrete Karhunen–Loève transform (KLT) [3], is considered the optimal linear transformation for data decorrelation when the signal is a first order Markov process [5, 3]. The KLT has the following features [3]: (i) decorrelates the input data completely in the transform domain; (ii) minimizes the mean square error in data compression; and (iii) concentrates the energy (variance) in a few coefficients of the output vector. Because the KLT matrix depends on the variance and covariance matrix of the input data, deriving computationally efficient algorithms for real-time processing becomes a very hard task [6, 7, 8, 9, 10, 11, 12, 3, 13].

If the input data follows a stationary highly correlated first-order Markov process [12, 3, 14], then the KLT is very closely approximated by the discrete cosine transform (DCT) [12, 3]. Natural images fall into this particular statistical model category [15]. Thus DCT inherits the decorrelation and compaction properties of the KLT, with the advantage of having a closed-form expression independent of the input signal. Image and video communities widely adopt the DCT in their most successful compression standards, such as JPEG [16] and MPEG [17]. Often such standards include two-dimensional (2-D) versions of the DCT applied to small image blocks ranging from 44 to 3232 pixels.

The 88 block is employed in a large number of standards, for example: JPEG [16], MPEG [18], H.261 [19], H.263 [20], H.264/AVC [21], and HEVC [22]. The arithmetic cost of the 8-point DCT is 64 multiplications and 56 additions, when computed by definition. Fast algorithms can dramatically reduce the arithmetic cost to 11 multiplications and 29 additions, as in the Loeffler DCT algorithm [23].

However, the number of DCT calls in common image and video encoders is extraordinarily high. For instance, a single image frame of high-definition TV (HDTV) contains 32.400 88 image subblocks. Therefore, computational savings in the DCT step may effect significant performance gains, both in terms of speed and power consumption [24, 25]. Being quite a mature area of research [26], there is little room for improvement on the exact DCT computation. Thus, one approach to further minimize the computational cost of computing the DCT is the use of matrix approximations [27, 14]. Such approximations provide matrices with similar mathematical behavior to the exact DCT while presenting a dramatically low arithmetic cost.

The goals of this paper are as follows. First, we aim at establishing an optimization problem to facilitate the derivation of 8-point DCT approximations. To this end, we adopt a vector angle based objective function to minimize the angle between the rows of the approximate and the exact DCT matrices subject to orthogonality constraints. Second, the sought approximations are (i) submitted to a comprehensive assessment based on well-known figures of merit and (ii) compared to state-of-the-art DCT approximations. Third, fast algorithms are derived and realized in FPGA hardware with comparisons with competing methods. We also examine the performance of the obtained transformations in the context of image compression and video coding. We demonstrate that one of our DCT approximations can outperform the exact DCT in terms of effected quality after image compression.

This paper is organized as follows. In Section 2, the 8-point DCT and popular DCT approximations are discussed. Section 3 introduces an optimization problem to pave the way for the derivation of new DCT approximations. In Section 4 the proposed approximations are detailed and assessed according to well-known performance measures. In Section 5 a fast algorithm for the proposed approximation is presented. Moreover, a field-programmable gate array (FPGA) design is proposed and compared with competing methods. Section 6 furnishes computational evidence of the appropriateness of the introduced approximate DCT for image and video encoding. Section 7 concludes the paper.

## 2 DCT Approximations

Let and be 8-point column vectors related by the DCT. Therefore, they satisfy the following expression: , where

 C=⎡⎢ ⎢ ⎢ ⎢⎣γ3−γ3−γ3−γ3−γ3−γ3−γ3−γ3γ0−γ2−γ4−γ6−γ6−γ4−γ2−γ0γ1−γ5−γ5−γ1−γ1−γ5−γ5−γ1γ2−γ6−γ0−γ4−γ4−γ0−γ6−γ2γ3−γ3−γ3−γ3−γ3−γ3−γ3−γ3γ4−γ0−γ6−γ2−γ2−γ6−γ0−γ4γ5−γ1−γ1−γ5−γ5−γ1−γ1−γ5γ6−γ4−γ2−γ0−γ0−γ2−γ4−γ6⎤⎥ ⎥ ⎥ ⎥⎦,

and , for .

Common algorithms for efficient DCT computation include: (i) Yuan et al. [28], (ii) Arai et al. [29], (iii) Chen et al. [30], (iv) Feig–Winograd [31], (v) Lee [32], and (vi) Hou [33]. Table 1 lists the computational costs associated to such methods. The theoretical minimal multiplicative complexity is 11 multiplications [34, 23], which is attained by the Loeffler algorithm [23].

A DCT approximation is a matrix  capable of furnishing where according to some prescribed criterion, such as matrix proximity or coding performance [3]. In general terms, as shown in [45, 46, 47, 3, 48], is a real valued matrix which consists of two components: (i) a low-complexity matrix  and (ii) a diagonal matrix . Such matrices are given by:

 ˆC=S⋅T, (1)

where

 S=√(T⋅T⊤)−1. (2)

The operation is the matrix square root operation [49, 50].

Hereafter low-complexity matrices are referred to as , where the subscript indicates the considered method. Also DCT approximations are referred to as . If the subscript is absent, then we refer to a generic low-complexity matrix or DCT approximation.

A traditional DCT approximation is the signed DCT (SDCT) [51] which matrix is obtained according to: , where is the entry-wise signum function. Therefore, in this case, the entries of the associated low-complexity matrix are in . Thus matrix is multiplierless.

Notably, in the past few years, several approximations for the DCT have been proposed as, for example, the rounded DCT (RDCT, [27], the modified RDCT (MRDCT, [48], the series of approximations proposed by Bouguezel–Ahmad–Swamy (BAS) [52, 53, 54, 55, 56], the Lengwehasatit–Ortega approximation (LO, [57], the approximation proposed by Pati et al. [58], and the collection of approximations introduced in [14]. Most of these approximations are orthogonal with low computational complexity matrix entries. Essentially, they are matrices defined over the set , with the multiplication by powers of two implying simple bit-shifting operations.

Such approximations were demonstrated to be competitive substitutes for the DCT and its related integer transformations as shown in [27, 48, 52, 53, 54, 55, 56, 57, 14]. Table 2 illustrates some common integer transformations linked to the DCT approximations.

## 3 Greedy Approximations

### 3.1 Optimization Approach

Approximate DCT matrices are often obtained by fully considering the exact DCT matrix , including its symmetries [59], fast algorithms [33, 23], parametrizations [31], and numerical properties [28]. Usually the low-complexity component of a DCT approximation is found by solving the following optimization problem:

 T=argminT′approx(T′,C),

where is a particular approximation assessment function—such as proximity measures and coding performance metrics [3]—and subject to various constraints, such as orthogonality and low-complexity of the candidate matrices .

However, the DCT matrix can be understood as a stack of row vectors , , as follows:

 C=[c1c2c3c4c5c6c7c8]⊤. (3)

In the current work, to derive an approximation for , we propose to individually approximate each of its rows in the hope that the set of approximate rows generate a good approximate matrix. Such heuristic can be categorized as a greedy method [60]. Therefore, our goal is to derive a low-complexity integer matrix

 T=[t1t2t3t4t5t6t7t8]⊤ (4)

such that its rows , , satisfy:

 tk=argmint∈Derror(t,ck),k=1,2,…,8, (5)

subject to constraints such as (i) low-complexity of the candidate vector  and (ii) orthogonality of the resulting matrix . The objective function returns a given error measure and is a suitable search space.

### 3.2 Search Space

In order to obtain a low-complexity matrix , its entries must be computationally simple [3, 11]. We define the search space as the collection of 8-point vectors whose entries are in a set, say , of low-complexity elements. Therefore, we have the search space . Some choices for  include: and . Tables 3 and 4 display some elements of the search spaces and . These search spaces have 6,561 and 390,625 elements, respectively.

### 3.3 Objective Function

The problem posed in (5) requires the identification of an error function to quantify the “distance” between the candidate row vectors from  and the rows of the exact DCT. Related literature often consider error functions based on matrix norms [46], proximity to orthogonality [61], and coding performance [3].

In this work, we propose the utilization of a distance based on the angle between vectors as the objective function to be minimized. Let and be two vectors defined over the same Euclidean space. The angle between vectors is simply given by:

 angle(u,v)=arccos(⟨u,v⟩∥u∥⋅∥v∥),

where is the inner product and indicates the norm induced by the inner product [62].

### 3.4 Orthogonality and Row Order

In addition, we require that the ensemble of rows , , must form an orthogonal set. This is to ensure that an orthogonal approximation can be obtained. As shown in [45, 47], for this property to be satisfied, it suffices that:

 T⋅T⊤=[diagonal matrix].

Because we aim at approximating each of the exact DCT matrix rows individually, the row sequential order according to which the approximations are performed may matter. Notice that we approximate the rows of the DCT based on a set of low-complexity rows, the search space. For instance, let us assume that we approximate the rows in the following order: . Once we find a good approximate row for the first exact row, i.e., a row vector in the search space which has the smallest angle in relation to that exact row, the second row is approximated considering only the row vectors in the search space that are orthogonal to the approximation for the first row. After that, the third exact row is approximated considering only the row vectors in the search space that are orthogonal to the first and second rows already chosen. And so on. This procedure characterize the greedy nature of the proposed algorithm.

Consider now the approximation order , a permutation of . In this case, we start by approximating the fourth exact row considering the whole search space because we are starting from it. Hence, the obtained approximate row might be different from the one obtained by considering , since in that case the search space is restricted in a different manner.

As an example, consider the DCT matrix of length 8, introduced in Section 2 of the manuscript. If considering the low complexity set and the approximation order (1, 2, 3, 4, 5, 6, 7, 8) we obtain the following approximate matrix:

 ⎡⎢ ⎢ ⎢⎣1−1−1−1−1−1−1−11−1−1−0−0−1−1−11−0−0−1−1−0−0−11−0−1−1−1−1−0−11−1−1−1−1−1−1−11−1−0−1−1−0−1−10−1−1−0−0−1−1−00−1−1−1−1−1−1−0⎤⎥ ⎥ ⎥⎦.

In other hand, if we consider the reverse approximation order, (8, 7, 6, 5, 4, 3, 2, 1), we obtain the following matrix:

 ⎡⎢ ⎢ ⎢⎣1−1−1−1−1−1−1−11−1−1−0−0−1−1−11−1−1−1−1−1−1−11−0−1−1−1−1−0−11−1−1−1−1−1−1−11−1−0−1−1−0−1−11−1−1−1−1−1−1−10−1−1−1−1−1−1−0⎤⎥ ⎥ ⎥⎦.

Therefore, the row sequence considered matters for the resulting matrix. The sequence matters during the process of finding the approximate matrix.

Thus, the row vectors from the exact matrix must be approximated in all possibles sequences. For a systematic procedure, all the possible permutations of the sequence  must be considered. Let , , be the resultant sequence that determines the th permutation; e.g. . The particular elements of a sequence are denoted by , . Then, for the given example above, we have .

### 3.5 Proposed Optimization Problem

Considering the above discussion, we can re-cast (5) in more precise terms. For each permutation sequence , we have the following optimization problem:

 t℘m(k)=argmind∈Dangle(c℘m(k),d),k=1,2,…,8, (6)

subject to:

 ⟨t℘m(i),t℘m(j)⟩=0,i≠j,

and a fixed search space . For each , the solution of the above problem returns eight vectors, , that are used as the rows of the desired low-complexity matrix. Note that each sequence may result in a different solution. Effectively, there are problems to be solved. In principle, each permutation  can furnish a different matrix.

Because the search space is relatively small, we solved (6) by means of exhaustive search. Although simple, such approach ensures the attainment of a solution and avoids convergence issues [60]. Figure 1 shows the pseudo-code for the adopted procedure to solve (6). It is important to highlight that although the proposed formulation is applicable to arbitrary transform lengths, it may not be computationally feasible. For this reason, we restrict our analysis to the 8-point case. Section 6.2 discusses an alternative form of generating higher order DCT approximations.

## 4 Results and Evaluation

In this section, we apply the proposed method aiming at the derivation of new approximations for the 8-point DCT. Subsequently, we analyze and compare the obtained matrices with a representative set of DCT approximations described in the literature according to several well-known figures of merit [63].

### 4.1 New 8-point DCT Approximations

Considering the search spaces and (cf. Table 3 and Table 4, respectively), we apply the proposed algorithm to solve (6). Because the first and fifth rows of the exact DCT are trivially approximated by the row vectors and , respectively, we limited the search to the remaining six rows. As a consequence, the number of possible candidate matrices is reduced to . For , only two different matrices were obtained, which coincide with previously archived approximations, namely: (i) the RDCT [27] and (ii) the matrix introduced in [14]. These approximations are depicted in Table 2.

On the other hand, considering the search space , the following two new matrices were obtained:

 T1=⎡⎢ ⎢ ⎢⎣1−1−1−1−1−1−1−12−2−1−0−0−1−2−22−1−1−2−2−1−1−21−0−2−2−2−2−0−11−1−1−1−1−1−1−12−2−0−1−1−0−2−21−2−2−1−1−2−2−10−1−2−2−2−2−1−0⎤⎥ ⎥ ⎥⎦,
 T2=⎡⎢ ⎢ ⎢⎣1−1−1−1−1−1−1−12−1−2−0−0−2−1−22−1−1−2−2−1−1−22−0−2−1−1−2−0−21−1−1−1−1−1−1−11−2−0−2−2−0−2−11−2−2−1−1−2−2−10−2−1−2−2−1−2−0⎤⎥ ⎥ ⎥⎦.

According to (1) and (2), the above low-complexity matrices and can be modified to provide orthogonal transformations and . The selected orthogonalization procedure is based on the polar decomposition as described in [45, 47, 64]. Thus, the orthogonal DCT approximations associated to the matrices  and are given by

 ˆC1=S1⋅T1andˆC2=S2⋅T2,

where

 Si=√(Ti⋅Ti⊤)−1,i=1,2.

Thus, it follows that:

 S1 =S2=diag(1√8,1√18,1√20,1√18,1√8,1√18,1√20,1√18).

Other simulations were performed considering extended sets of elements. In particular, following sets were considered: , , , , and . Generally, the resulting matrices did not perform as well as the ones being proposed. Moreover, the associate computational cost was consistently higher.

The number of vectors in the search space can be calculated as (cf. Section 3.2). Therefore, including more elements to effects a noticeable increase in the size of the search space. As a consequence, the processing time to derive all the candidate matrices increases accordingly.

### 4.2 Approximation Measures

Approximations measurements are computed between an approximate matrix (not the low-complexity matrix ) relative to the exact DCT. To evaluate the performance of the proposed approximations, and , we selected traditional figures of merit: (i) total error energy ([46]; (ii) mean square error ([3, 65]; (iii) coding gain ([3, 66, 67]; and (iv) transform efficiency ([3]. The MSE and total error energy are suitable measures to quantify the difference between the exact DCT and its approximations [3]. The coding gain and transform efficiency are appropriate tools to quantify compression, redundancy removal, and data decorrelation capabilities [3]. Additionally, due the angular nature of the objective function required by the proposed optimization problem, we also considered descriptive circular statistics [68, 69]. Circular statistics allows the quantification of approximation error in terms of the angle difference between the row vectors of the approximated and exact matrix.

Hereafter we adopt the following quantities and notation: the interpixel correlation is  [3, 66, 35], is an approximation for the DCT, and , where is the covariance matrix of , whose elements are given by , . We detail each of these measures below.

#### 4.2.1 Total Energy Error

The total energy error is a similarity measure given by [46]:

 ϵ(ˆC)=π⋅∥C−ˆC∥2F,

where represents the Frobenius norm [70].

#### 4.2.2 Mean Square Error

The MSE of a given approximation is furnished by [3, 65]:

 MSE(ˆC)=18⋅tr((C−ˆC)⋅Rx⋅(C−ˆC)⊤).

where represents the trace operator [3]. The total energy error and the mean square error are appropriate measures for capturing the approximation error in a Euclidean distance sense.

#### 4.2.3 Coding Gain

The coding gain quantifies the energy compaction capability and is given by [3]:

 Cg(ˆC)=10⋅log10⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩18∑8i=1r2i,i(∏8i=1r2i,i⋅∥ˆci∥2)1/8⎫⎪ ⎪ ⎪⎬⎪ ⎪ ⎪⎭,

where is the th element of the diagonal of  [3] and is the th row of .

However, as pointed in [67], the previous definition is suitable for orthogonal transforms only. For non-orthogonal transforms, such as SDCT [51] and MRDCT [48], we adopt the unified coding gain [67]. For , let and be th row of and , respectively. Then, the unified coding gain is given by:

 C∗g(ˆC)=10⋅log10{8∏i=118√Ai⋅Bi},

where returns the sum of all elements of the input matrix, the operator represents the element-wise product, and .

#### 4.2.4 Transform Efficiency

The transform efficiency is an alternative measure to the coding gain, being expressed according to [3]:

 η(ˆC)=∑8i=1|ri,i|∑8i=1∑8j=1|ri,j|⋅100,

where is the th entry of ,  [3].

#### 4.2.5 Circular Statistics

Because the objective function in (6) is the operator angle, its associate values are distributed around the unit circle. This type of data is suitably analyzed by circular statistics tools [68, 69, 71]. Let be an arbitrary 8-point vector and . The angle function is given by [68]:

 θ=angle(a′,q),k=1,2,…,8,

where is the normalized vector of .

The mean angle (circular mean) is given by [71, 69]:

 ¯θ=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩arctan(S/C),if C>0 and S≥0,π/2,if C=0 and S>0,arctan(S/C)+π,if C<0,arctan(S/C)+2π,if C≥0 and S<0,undefined,if C=0 and S=0,

where , , and is a collection of angles. The circular variance is given by [68]:

 V=1−√C2+S28.

The minimal variance occurs when all observed angles are identical. In this case, we have . In other hand, the maximum variance occurs when the observations are uniformly distributed around the unit circle. Thus,  [69].

Considering the rows of the 8-point DCT matrix and of a given 8-point DCT approximate matrix, the angle function furnishes the following angles, respectively: and , (cf. (3) and (4)). In this particular case, the mean circular difference, which measures the mean difference between the pairs of angles is defined as follows:

 ¯D=182⋅8∑i=18∑j=1(π−|π−|θci−θtj||).

The expression above considers the difference between all the possible pairs of angles. However, we are interested in comparing the angle between the th row of the DCT and the corresponding row of the approximated matrix, i.e., the cases where . Thus we have the modified circular mean difference according to:

 ¯Dmod=18⋅8∑i=1(π−|π−|θci−θti||).

### 4.3 Results and Comparisons

Table 5 shows the obtained measurements for all approximations derived, according to (1), from the low-complexity matrices considered in this paper. Table 6 brings a summary of the descriptive circular statistics. We also included the exact DCT and the integer DCT (IDCT) [72] for comparison. The considered IDCT is the 8-point approximation adopted in the HEVC standard [72]. A more detailed analysis on the performance of the proposed approximation in comparison with the IDCT is provided in Section 6.2. The proposed DCT approximation outperforms all competing methods in terms of MSE, coding gain, and transform efficiency. It also performs as the second best for total error energy measurement. It is unusual for an approximation to simultaneously excel in measures based on Euclidean distance ( and MSE) as well as in coding-based measures. The approximation by Lengwehasatit–Ortega ([57] achieves second best results MSE, and . Because of its relatively inferior performance, we removed the new approximation from our subsequent analysis. Nevertheless, could outperform the approximations  [52],  [54],  [55],  [56],  [51],  [48], and  [14] in all measures considered,  [14] and  [14] in terms of total error energy and transform efficiency,  [27] in terms of total error energy, and  [53] in terms of total error energy, MSE and transform efficiency. Hereafter we focus our attention on the proposed approximation .

The proposed search algorithm is greedy, i.e., it makes local optimal choices hoping to find the global optimum solution [60]. Therefore, it is not guaranteed that the obtained solutions are globally optimal. This is exactly what happens here. As can be seen in Table 6, the proposed matrix  is not the transformation matrix that provides the lowest circular mean difference among the approximations on literature. Despite this fact, the proposed matrix has outstanding performance according to the considered measures.

Figure 2 shows the effect of the interpixel correlation  on the performance of the discussed approximate transforms as measured by the unified coding gain difference compared to the exact DCT [73]. The proposed method outperforms the competing methods as its coding gain difference is smaller for any choice of . For highly correlated data the coding degradation in dB is roughly reduced by half when the proposed approximation  is considered.

## 5 Fast Algorithm and Hardware Realization

### 5.1 Fast Algorithm

The direct implementation of requires 48 additions and 24 bit-shifting operations. However, such computational cost can be significantly reduced by means of sparse matrix factorization [11]. In fact, considering butterfly-based structures as commonly found in decimation-in-frequency algorithms, such as [33, 74, 5], we could derive the following factorization for :

 T1=D⋅A4⋅A3⋅A2⋅A1,

where:

 A1=⎡⎢ ⎢ ⎢⎣111111111−11−11−11−1⎤⎥ ⎥ ⎥⎦,A2=⎡⎢ ⎢ ⎢⎣11111−11−11111⎤⎥ ⎥ ⎥⎦,
 A3=⎡⎢ ⎢⎣111−1111111⎤⎥ ⎥⎦,A4=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1121112−1−112112−11−21−11−12⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,

and