# Information criteria and arithmetic codings: an illustration on raw images

###### Abstract

In this paper we give a short theoretical description of the general predictive adaptive arithmetic coding technique. We establish the links between this technique and the work of J. Rissanen in the 1980s, in particular the BIC information criterion used in parametric model selection problems. We also design lossless and lossy image coding techniques. The lossless technique mixes fixed-length coding with arithmetic coding and provides better compression results than either method alone. It also has an interesting application in statistics, since it gives a data-driven procedure for the non-parametric histogram selection problem. The lossy technique uses only predictive adaptive arithmetic codes and shows how a good choice of the order of prediction can lead to better compression. We illustrate both coding techniques on a raw grayscale image.

Guilhem Coq, Christian Olivier, Olivier Alata, Marc Arnaudon

Laboratoire de Mathématiques et Applications

Université de Poitiers

Téléport 2 - BP 30179 86962 Chasseneuil FRANCE

Phone: +(33) 5 49 49 68 97 Fax: +(33) 5 49 49 69 01

email: coq,arnaudon@math.univ-poitiers.fr

Laboratoire Signal Image et Communications, Université de Poitiers, Téléport 2 - BP 30179 86962 Chasseneuil FRANCE. Phone: +(33) 5 49 49 65 67 Fax: +(33) 5 49 49 65 70 email: olivier,alata@sic.sp2mi.univ-poitiers.fr

## 1 Introduction

Arithmetic Coding (AC) is an efficient binary coding technique. We use it here in one of its most general forms: the *predictive* and *adaptive* one. Even though those aspects of AC are known, it is quite hard to find literature dealing with both of them, or to determine which aspects are actually used in image coding norms such as JPEG and JPEG2000. We try here to address the first issue but could not collect useful information about the second. This paper does not seek compression efficiency; it aims to show how different AC processes may be used in both parametric (§3) and non-parametric (§4) model selection problems. This explains why we choose to work on raw images.

After a description of the AC algorithm in §2, we take a closer look at the resulting codelength. To this end, we use the work of J. Rissanen in [6, 7] and especially [8]. The main conclusion of §3 is that the codelength enters the family of information criteria, a widely used tool in the vast problem of model selection. We aim at showing that the *adaptive* aspect of the AC used here is an essential feature.

Next, we design in §4 a new lossless coding technique. It uses a mix of AC, which is compression-efficient, and fixed-length coding, which is not. It is shown in §4.2 that correctly mixing those two methods gives better compression efficiency than using AC alone. The most important parameter to adjust in order to get that "correct" mix is the order of prediction. Moreover, that method is shown in §4.3 to have a direct application in the histogram selection problem.

Finally we design in §5 a lossy coding technique which, once again, shows the importance of the order of prediction.

## 2 Generalities on arithmetic coding

### 2.1 Multiple Markov Chain

The notion of Multiple Markov Chain (MMC) leads to arithmetic coding. Let $E$ be a finite set with $m$ elements. An $E$-valued process $(X_t)_{t \ge 1}$ is an order $k$ MMC if $k$ is the smallest integer satisfying, for all $t > k$, the law equality $\mathcal{L}(X_t \mid X_{t-1}, \dots, X_1) = \mathcal{L}(X_t \mid X_{t-1}, \dots, X_{t-k})$. We will always work in the case where that law does not depend on $t$; the chain is then said to be homogeneous. An order 0 MMC is a sequence of independent random variables.

If $(X_t)$ is an order $k$ MMC, we will suppose that $X_1, \dots, X_k$ are independent and uniformly distributed on $E$. For a state $j \in E$ and a multiple state $i \in E^k$, we denote by $\theta_{ij}$ the probability of seeing $j$ after $i$. Consequently, choosing real numbers $\theta_{ij} \ge 0$ for $i \in E^k$ and $j \in E$, with $\sum_{j} \theta_{ij} = 1$ for each $i$, is enough to describe the evolution of the chain. Let $\theta$ denote such a parameter and $x = x_1 \dots x_n$ be a sequence of elements of $E$; the likelihood of $x$ relative to $\theta$ writes:

$$P_\theta(x) = m^{-k} \prod_{i \in E^k} \prod_{j \in E} \theta_{ij}^{\,n_{ij}} \qquad (1)$$

where $n_{ij}$ is the number of occurrences of $j$ after $i$ in $x$.
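As a concrete illustration, the log-likelihood of formula (1) can be computed directly from the transition counts. The following Python sketch is ours, not the authors' implementation; the function name and the dictionary layout of the parameter $\theta$ are illustrative choices, and a uniform law on the first $k$ symbols is assumed as above:

```python
from collections import Counter
from math import log2

def log2_likelihood(x, theta, k, m):
    """log2 of the likelihood (1): uniform start over the first k symbols,
    then the product of transition probabilities theta[(context, symbol)]."""
    # n_ij: number of occurrences of symbol j after context i in x
    n_ij = Counter((tuple(x[t - k:t]), x[t]) for t in range(k, len(x)))
    return -k * log2(m) + sum(c * log2(theta[i, j]) for (i, j), c in n_ij.items())
```

For instance, with $E = \{a, b\}$ and all transition probabilities equal to $1/2$, every chain of length $n$ has log-likelihood $-n$ bits, as expected for a fair coin.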

### 2.2 Predictive adaptive arithmetic coding: PAAC

We deal here with a general AC which is both $k$-predictive and adaptive; we shorten it to $k$-PAAC. *Predictive* means we code using orders $k$ that may be greater than 1, hence a prediction of the future state of the chain from the current state. *Adaptive* means we do not need any prior knowledge of the chain, except its order; we learn how to predict the future step by step. Both notions have been formally introduced and studied by Rissanen [6, 7, 8]. For a more concrete description of arithmetic coding, we refer to [11]; note that this paper does not mention the predictive aspect. Let us now give a theoretical description of the general $k$-PAAC algorithm.

Let $x = x_1 \dots x_n$ be a chain of elements of $E$ to be encoded and $I$ be the current interval, initially set to $[0, 1)$. For $s \le t$ we write $x_s^t = x_s \dots x_t$. The only prior we need is an order of coding $k$; the algorithm then works as follows.

Suppose that the first $t$ symbols have been dealt with; $t = 0$ means we have not started the coding yet. To deal with the $(t+1)$-th symbol, we update the transition probabilities as follows:

$$\theta_{ij}(t) = \frac{n_{ij}(t) + 1}{n_i(t) + m}$$

where $i \in E^k$, $j \in E$, and $n_{ij}(t)$, $n_i(t)$ denote the respective numbers of occurrences of $j$ after $i$, and of $i$, in the chain $x_1^t$; $n_i(t)$ must not count an occurrence of $i$ at the very end of that chain. If $t < k$, the multiple states vanish and we set $\theta_{ij}(t) = 1/m$. Those probabilities reflect what we know of the chain at the time of the coding process; they are the *adaptive* aspect. We then set the current multiple state $i = x_{t-k+1}^t$ and split the current interval $I$ into $m$ smaller intervals according to the probabilities $\theta_{ij}(t)$, $j \in E$. This way, we associate to each possible future state an interval whose length is proportional to the probability with which we expect it. The $(t+1)$-th symbol is dealt with by choosing as new $I$ the interval corresponding to $x_{t+1}$.

Once the last symbol has been dealt with, we are left with an interval $I$ of length $\lambda$. Let $\lceil \cdot \rceil$ denote the superior integer part; there exist two consecutive dyadic numbers of order $l = \lceil -\log_2 \lambda \rceil + 1$ in $I$. We take as the arithmetic code of $x$ the sequence of $l$ bits given by the fractional part of the biggest one. If encoder and decoder agree on the order of coding, that sequence of bits is decodable; we refer again to [11].
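The interval-narrowing loop described above can be sketched in Python with exact rational arithmetic. This is a direct transcription for clarity, not the register-based implementation of [11]; the Laplace-style update $(n_{ij}(t)+1)/(n_i(t)+m)$ is the one assumed throughout, and the function names are ours:

```python
from fractions import Fraction
from collections import defaultdict
from math import ceil, log2

def paac_interval(chain, alphabet, k):
    """Narrow [0, 1) over the chain, splitting at each step according to
    the adaptive estimates (n_ij + 1) / (n_i + m)."""
    m = len(alphabet)
    low, width = Fraction(0), Fraction(1)
    n_ij = defaultdict(int)   # occurrences of symbol j after context i
    n_i = defaultdict(int)    # occurrences of context i followed by a symbol
    for t, sym in enumerate(chain):
        if t < k:
            # multiple states vanish: uniform split over the alphabet
            num, tot = {a: 1 for a in alphabet}, m
        else:
            ctx = tuple(chain[t - k:t])
            num = {a: n_ij[ctx, a] + 1 for a in alphabet}
            tot = n_i[ctx] + m
            n_ij[ctx, sym] += 1
            n_i[ctx] += 1
        acc = 0
        for a in alphabet:     # choose the sub-interval of the observed symbol
            if a == sym:
                low += width * Fraction(acc, tot)
                width *= Fraction(num[a], tot)
                break
            acc += num[a]
    return low, width

def code_length(width):
    # l bits suffice once two consecutive dyadics of step 2**-l fit in the interval
    return ceil(-log2(width)) + 1
```

The number of emitted bits is $\lceil -\log_2 \lambda \rceil + 1$, where $\lambda$ is the width of the final interval, matching the description above.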

For illustration, we encode a short binary chain at order $k = 1$; in the splits, we allot the left interval to the first symbol. Table 1 shows the successive splits of the current interval at each step; the split computed after the last symbol is not used. The resulting code is 01001, whose predecessor is 01000: both $1/4 + 1/32 = 0.01001_2$ and $1/4 = 0.01000_2$ belong to the final interval.

This example shows the following general fact about $k$-PAAC: the more unexpected behaviours occur in the chain, the smaller the last interval $I$ is, and the longer the code. For instance, at one step we expected a symbol with probability 2/3 and observed the other one, which caused us to choose the smaller subinterval. Had the expected symbol occurred, the code would have been 0110, which is 1 bit shorter. This leads us to the notion of information criteria (IC).

## 3 Information Criteria

Let us show how the PAAC may be used to solve the following model selection problem: if $x$ is a realization of an MMC (§2.1) of unknown order, what is that order $k^*$? More precisely, we will see how the *adaptive* aspect of the PAAC is involved.

### 3.1 Coding approach of the model selection problem

As mentioned earlier, the $k$-PAAC length of $x$, say $\ell_k(x)$, is ruled by the unexpected events in $x$: the more unexpected events, the longer the code. Consequently, if $x$ is ruled by an MMC of unknown order $k^*$ and we $k$-PAAC it at an order $k \neq k^*$, many unexpected events may occur: either because $k < k^*$ and we do not look far enough into the past, or because $k > k^*$ and we take into account information relative to a past too far away, which actually has no influence on the future. Thus the minimization of $\ell_k(x)$ over $k$ is an appropriate tool for seeking $k^*$.

The work of Rissanen confirms that idea and establishes a link with Information Criteria (IC).

### 3.2 Rissanen’s result

In [8] it is shown that the $k$-PAAC codelength $\ell_k(x)$ asymptotically behaves as:

$$\mathrm{BIC}_k(x) = -\log_2 P_{\hat\theta_k}(x) + \frac{m^k(m-1)}{2} \log_2 n \qquad (2)$$

where $\hat\theta_k$ is the maximum likelihood (ML) estimator of order $k$ for $x$, *i.e.* the parameter that maximizes (1).

BIC stands for Bayesian Information Criterion and enters the formalism of IC first introduced by Akaike [1]; let us mention [10, 5, 3] in addition to [1, 8] as important steps in the theory of IC.

Here is the idea behind IC: the first term of criterion (2), referred to as the ML term, decreases as $k$ grows. This is mainly because the ML estimator fits the data more accurately if we let it look further into the past. This phenomenon is known as *overparametrization* and is the major problem to be solved in model selection; it appears on figure 1. On the other hand, the second term, the penalty, increases as $k$ grows, through $m^k(m-1)$, which is the number of free parameters in the MMC model of order $k$. Therefore, the minimization of an IC over $k$ realizes a balance between the data fitting, measured by the ML term, and the complexity of the model needed to obtain such a fitting, measured by the penalty.

The quantity $\mathrm{BIC}_k(x)$ is much faster to compute than the actual codelength $\ell_k(x)$; the encoder should therefore use BIC before encoding to find which order will achieve the minimum codelength.
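That order selection step can be sketched as follows, assuming the BIC form (2) with $m^k(m-1)$ free parameters. This is a minimal Python sketch with our own names; the constant term $k \log_2 m$ coming from the uniform start in (1) is dropped, as it does not affect the minimization:

```python
from collections import Counter
from math import log2

def bic(chain, m, k):
    """BIC_k = -log2 of the ML likelihood + (m**k * (m - 1) / 2) * log2(n)."""
    n = len(chain)
    trans = Counter((tuple(chain[t - k:t]), chain[t]) for t in range(k, n))
    ctx = Counter(tuple(chain[t - k:t]) for t in range(k, n))
    # ML transition estimates are the empirical frequencies c / ctx[i]
    ml = -sum(c * log2(c / ctx[i]) for (i, _), c in trans.items())
    return ml + m ** k * (m - 1) / 2 * log2(n)

def select_order(chain, m, k_max):
    """Return the order minimizing BIC, to be used before the k-PAAC."""
    return min(range(k_max + 1), key=lambda k: bic(chain, m, k))
```

On a chain alternating two symbols, the procedure selects order 1: the ML term vanishes at $k = 1$ while the penalty keeps growing with $k$.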

One can design a *non-adaptive* order $k$ predictive arithmetic coding process whose codelength would be exactly the ML term $-\log_2 P_{\hat\theta_k}(x)$. However, this process requires sending the parameter $\hat\theta_k$ for decodability and, especially, it no longer answers the problem of order selection, since ML suffers from the overparametrization issue. In terms of IC, the *adaptive* aspect of the process creates the penalty term, which avoids overparametrization; see again figure 1.

### 3.3 Comparison of actual codings with the BIC criterion

We generate a realization of an MMC of known order $k^*$. For successive orders $k$, we encode it with the $k$-PAAC process; we also compute the criterion $\mathrm{BIC}_k$ and the ML term alone. Results are presented on figure 1, divided by the length $n$ of the chain to express them as bit-rates.

As expected, the BIC and $k$-PAAC curves present a minimum at $k = k^*$, while the ML method overparametrizes by selecting a larger order.

Note that, when computing BIC, it is desirable to have enough observations compared to the number of free parameters; empirically,

$$n \ge 20\, m^k (m - 1) \qquad (3)$$

would be good. If $n$ is too small compared to the number $m^k(m-1)$ of transition probabilities to be estimated, those transitions do not occur often in the chain and their estimation is weak, resulting in the penalty dominating the ML term. An alternative would be to compute the number of transitions actually observed in the chain and plug it into (2) instead of $m^k(m-1)$.

## 4 Lossless coding of raw images

Let $E$ be the set of integers from $0$ to $N - 1$, with $N = 256$ for a greyscale image; let us choose such an image and denote by $n$ its number of pixels. Firstly, the image has to be turned into a vector $x = x_1 \dots x_n$. For order $k \ge 1$ codings, the way this linearization is done does matter, since one does not want to lose proximity information on the pixels. We have chosen the "zigzag" linearization used in the blocks of the DCT transform in the JPEG norm [12]. Other transformations have been tested and the results are quite similar. Let us now describe our lossless coding method.

### 4.1 Lossless coding method

It is a two-part coding technique. First, choose a partition $P$ of $E$, that is, a set of disjoint intervals $I_1, \dots, I_p$ whose union is $E$. Then, from $x$, form a new chain $y = y_1 \dots y_n$ as follows:

$$y_t = j \iff x_t \in I_j \qquad (4)$$

That is, each $y_t$ denotes the number of the interval of $P$ in which $x_t$ falls. The chain $y$ has values in $\{1, \dots, p\}$. For an order $k$, we denote by $\ell_k(y)$ its $k$-PAAC codelength. If $p = 1$, we set $\ell_k(y)$ to 0.

Secondly, we denote by $n_j$ the number of integers in $I_j$. Once $y$ is known, one needs, in order to recover $x$, to specify which one of those $n_j$ integers each $x_t$ actually is. This is done for each $t$ by a simple code with fixed length $\lceil \log_2 n_{y_t} \rceil$. Therefore, the number of bits required to recover $x$ from $y$ is $\sum_{t=1}^{n} \lceil \log_2 n_{y_t} \rceil$.

For decodability, one should also send the partition $P$ chosen to encode. We do not take this into account here, since the codelength required to this end is very small compared to the quantities we work with.

Let us denote by $\ell_k(x, P)$ the total lossless codelength of $x$ with the help of the partition $P$.
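For regular partitions of $E$ into intervals of length $h$ (the setting of §4.2 below), the total two-part codelength can be estimated as follows. This sketch uses the order 0 BIC estimate for the arithmetic-coding part; the function names and the restriction to widths dividing $N$ are our own simplifications, not the authors' implementation:

```python
from collections import Counter
from math import ceil, log2

def lossless_length(x, N, h):
    """Estimated two-part codelength (order 0) for the regular partition
    of {0, ..., N-1} into intervals of length h."""
    y = [v // h for v in x]        # eq. (4): interval index of each sample
    p = ceil(N / h)                # number of intervals in the partition
    n = len(y)
    if p > 1:
        counts = Counter(y)
        ml = -sum(c * log2(c / n) for c in counts.values())
        first = ml + (p - 1) / 2 * log2(n)   # BIC_0 estimate of the 0-PAAC part
    else:
        first = 0.0                # a single interval: nothing to encode
    # fixed-length refinement: ceil(log2 |I_j|) bits for each x_t in I_j
    second = sum(ceil(log2(min(h, N - j * h))) for j in y if min(h, N - j * h) > 1)
    return first + second

def best_width(x, N):
    """Pick the interval width minimizing the estimated total codelength."""
    widths = [h for h in range(1, N + 1) if N % h == 0]
    return min(widths, key=lambda h: lossless_length(x, N, h))
```

On data concentrated on 16 grey levels, the minimization picks $h = 16$: the first part then costs almost nothing and each pixel needs only 4 refinement bits instead of 8.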

### 4.2 Choice of partition and order of prediction

As $p$ grows, $\ell_k(y)$ also grows because $y$ has values in $\{1, \dots, p\}$. In opposition, the fixed-length part decreases since the intervals get smaller. Consequently, there should exist a partition which balances those two phenomena by minimizing the codelength $\ell_k(x, P)$. This argument takes place in the theory of Minimum Description Length (MDL) introduced by Rissanen, for which we refer to Grünwald et al. [4].

We estimate $\ell_k(y)$ by $\mathrm{BIC}_k(y)$, see §3. We then define the following criterion as an estimation of the lossless order $k$ codelength of $x$ with the partition $P$:

$$\mathrm{crit}_k(P) = \mathrm{BIC}_k(y) + \sum_{t=1}^{n} \lceil \log_2 n_{y_t} \rceil \qquad (5)$$

We restrict ourselves to regular partitions, *i.e.* partitions whose intervals all have the same length. We work with the greyscale Lena image.

Figure 2 presents, for $p$ ranging from 1 to 256, the estimated bit-rate $\mathrm{crit}_k(P)/n$. For $k = 1$, condition (3) is satisfied for $p$ up to 115, but we still give the curve beyond that value for completeness. The algorithm complexity increases considerably with the order, and computations at higher orders show no significant improvement; for the largest order tested, we went up to a number of intervals for which the ratio $n / (p^k(p-1))$ is only about 10.

Note that our coding technique with $p = 1$ is equivalent to the pgm format (see http://www.imagemagick.org/script/formats.php). In the other extreme case, $p = 256$, we get $y = x$ and a null fixed-length part; this means we directly encode the chain $x$ with the $k$-PAAC process. Considering this, figure 2 shows how a mix of those two methods leads to better bit-rates. The minimization of criterion (5) tells us which partition is to be chosen in order to get the correct mix.

More importantly, 1-PAAC clearly reaches better bit-rates than 0-PAAC: roughly 7 bpp with a huge partition for 0-PAAC, against 5.4 bpp at the best partition for 1-PAAC. Note that the order chosen for the coding process only affects the first term of criterion (5); hence we may also give the following interpretation of the curves in figure 2: no matter how we quantize them via a partition, the grey levels in our image should not be considered independent, but rather of order 1. Unsurprisingly, that dependence of a pixel's grey level on its neighbours may be shown this way on most common images whose content is comprehensible to the human brain.

### 4.3 Histogram selection statistical problem

It is interesting to note that criterion (5) may be directly extended to the statistical problem of histogram selection: if $f$ is an unknown density on an interval $[a, b]$ and $x_1, \dots, x_n$ is a sample from this density, which partition of $[a, b]$ should be chosen for building a histogram estimator of $f$?

For such a partition $P = \{I_1, \dots, I_p\}$, by independence of the $x_t$'s and formula (4), it is readily seen that the $y_t$'s are independent, so that the 0-PAAC of $y$ will be the best. Let us denote by $|I_j|$ the length of $I_j$ and suppose that each $I_j$ may contain a number of representable real numbers proportional to $|I_j|$. Then, up to terms which do not depend on $P$, and after a little calculation, the estimated lossless order 0 codelength of $x$ using $P$ is:

$$\mathrm{crit}(P) = -\sum_{j=1}^{p} N_j \log_2 \frac{N_j}{n\, |I_j|} + \frac{p - 1}{2} \log_2 n \qquad (6)$$

where $N_j$ denotes the number of sample points falling in $I_j$.

This criterion is really similar in shape to the one used by Birgé et al. in [2], except that it has a coding background which justifies its use. Moreover, it is not restricted to regular partitions of $[a, b]$: if the interval endpoints are taken on a grid of $d$ points, there could be of the order of $2^d$ partitions to be tested, which is huge. Rissanen et al. presented in [9] a dynamic programming method which drastically shrinks the number of computations required to find which one of those partitions achieves the minimum of (6). For illustration, we present in figure 3 the partition chosen on a sample from the Laplace distribution used to represent DCT coefficients in the JPEG norm.
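Restricted to regular partitions of $[a, b]$, the resulting selection procedure can be sketched as follows. This is our own minimal illustration of the penalized form (6), not the dynamic programming method of [9]; the helper names are hypothetical:

```python
from math import log2

def hist_criterion(sample, a, b, p):
    """Criterion (6) for the regular partition of [a, b] into p intervals."""
    n = len(sample)
    counts = [0] * p
    for v in sample:
        j = min(int((v - a) / (b - a) * p), p - 1)  # bin index, clamp right edge
        counts[j] += 1
    width = (b - a) / p                             # |I_j| for a regular partition
    ml = -sum(c * log2(c / (n * width)) for c in counts if c > 0)
    return ml + (p - 1) / 2 * log2(n)

def best_regular_partition(sample, a, b, p_max):
    """Number of bins minimizing the criterion over regular partitions."""
    return min(range(1, p_max + 1), key=lambda p: hist_criterion(sample, a, b, p))
```

On an evenly spread sample the procedure selects $p = 1$: no partition fits better than the uniform one, while every extra bin costs $\tfrac{1}{2}\log_2 n$ penalty bits.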

## 5 Lossy coding of raw images

We keep the same linearization as in §4 to turn an image into a vector $x$, and now describe our lossy coding method.

### 5.1 Lossy coding method

For a partition $P$ of $E$ into $p$ intervals, we define the $\{1, \dots, p\}$-valued chain $y$ as in (4). Next, we quantize the data of each interval at their barycenter. That is, for each $j$, we consider all the $x_t$'s falling into $I_j$, compute their barycenter, round it to the closest integer $g_j$, and finally set all those $x_t$'s to $g_j$. This gives a new image with only $p$ grey levels; this is where the loss occurs. Moreover, that quantization creates an injective map:

$$j \in \{1, \dots, p\} \longmapsto g_j \in E.$$

With the help of that map, the decoder is able to reconstruct the quantized image from the chain $y$ alone; therefore the map has to be sent as well. However, the coding of such a map is very short compared to the codelength of the chain $y$, so we drop it.

We are now left with encoding $y$ with the $k$-PAAC process; hence we estimate the lossy codelength of our image by the BIC criterion (2) applied to $y$.
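The barycenter quantization step can be sketched as follows for a regular partition of step $h$; this is a hypothetical helper of our own, not the authors' implementation:

```python
from statistics import mean

def barycenter_quantize(x, h):
    """Quantize each sample to the rounded barycenter of its interval in the
    regular partition of step h; returns the chain y, the map j -> g_j,
    and the quantized data."""
    y = [v // h for v in x]                 # interval index chain, as in (4)
    reps = {j: int(mean(v for v in x if v // h == j) + 0.5)  # rounded barycenter g_j
            for j in set(y)}
    return y, reps, [reps[j] for j in y]
```

The decoder only needs the chain `y` and the small table `reps`, which plays the role of the injective map above.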

### 5.2 Influence of the order on bit-rates

### 5.3 Comparison involving distortion

Each value of $p$ brings a certain quantization, and thus a certain distortion. We measure this distortion by the Peak Signal-to-Noise Ratio (PSNR) and plot it against the corresponding bit-rates of 0-PAAC and 1-PAAC in figure 5. For illustration, we present in figure 6 the two quantized Lena images obtained for two values of $p$, with their respective PSNR. We also give the bit-rates achieved by 0-PAAC and 1-PAAC on each of those images. For instance, this shows that at an imposed rate of about 1.4 bpp, 1-PAAC allows encoding Lena with a PSNR of 33.15 dB, while 0-PAAC only gives 22.11 dB.

| | First quantized image | Second quantized image |
|---|---|---|
| PSNR | 22.11 dB | 33.15 dB |
| 0-PAAC bit-rate | 1.36 bpp | 3.18 bpp |
| 1-PAAC bit-rate | 0.43 bpp | 1.39 bpp |
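The PSNR used for these comparisons is the standard one for 8-bit images; a one-function sketch:

```python
from math import log10

def psnr(orig, quant, peak=255):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(peak**2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, quant)) / len(orig)
    return float("inf") if mse == 0 else 10 * log10(peak ** 2 / mse)
```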

## 6 Perspectives

As mentioned in the introduction, we did not seek efficient compression results, intentionally working on raw images. It would therefore be interesting to insert the discussed binary coding methods after, for instance, the wavelet transform block of the JPEG2000 norm. In order to compress, one should first determine with the BIC criterion (2) the order of the sequence of wavelet coefficients, and then use criterion (5) to determine the partition which allows encoding those coefficients efficiently.

ACKNOWLEDGMENTS

The authors would like to thank the PIMHAI, INTERREG IIIB "Arc Atlantique" project for its support in the writing of this paper.

First published in *Proceedings of the 15th European Signal Processing Conference EUSIPCO 2007*, 2007, published by EURASIP.

## References

- [1] H. Akaike. A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.
- [2] Lucien Birgé and Yves Rozenholc. How many bins should be put in a regular histogram. ESAIM Probab. Stat., 10:24–45 (electronic), 2006.
- [3] Abdelaziz El Matouat and Marc Hallin. Order selection, stochastic complexity and Kullback-Leibler information. In Athens Conference on Applied Probability and Time Series Analysis, Vol. II (1995), volume 115 of Lecture Notes in Statist., pages 291–299. Springer, New York, 1996.
- [4] Peter D. Grunwald, In Jae Myung, and Mark A. Pitt. Advances in Minimum Description Length: Theory and Applications (Neural Information Processing). The MIT Press, 2005.
- [5] R. Nishii. Maximum likelihood principle and model selection when the true model is unspecified. J. Multivariate Anal., 27(2):392–403, 1988.
- [6] Jorma Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203, 1976.
- [7] Jorma Rissanen. Complexity of strings in the class of Markov sources. IEEE Trans. Inform. Theory, 32(4):526–532, 1986.
- [8] Jorma Rissanen. Stochastic complexity and modeling. Ann. Statist., 14(3):1080–1100, 1986.
- [9] Jorma Rissanen, Terry P. Speed, and Bin Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315–323, 1992.
- [10] Gideon Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.
- [11] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30(6):520–540, 1987.
- [12] Azza Ouled Zaid, Christian Olivier, Olivier Alata, and Francois Marmoiton. Transform image coding with global thresholding: application to baseline jpeg. Pattern Recognition Letters, 24(7):959–964, 2003.