Abstract

# PhD Dissertation: Generalized Independent Components Analysis Over Finite Alphabets

The Raymond and Beverly Sackler Faculty of Exact Sciences
School of Mathematical Sciences
Department of Statistics and Operations Research

PhD Thesis:
Generalized Independent Components Analysis
Over Finite Alphabets

by

Amichai Painsky
THESIS SUBMITTED TO THE SENATE OF TEL-AVIV UNIVERSITY
in partial fulfillment of the requirements for the degree of
“DOCTOR OF PHILOSOPHY”
Under the supervision of
Prof. Saharon Rosset and Prof. Meir Feder
September 01, 2016

## Abstract

Generalized Independent Components Analysis Over Finite Alphabets

[.5em] by Amichai Painsky

Independent component analysis (ICA) is a statistical method for transforming an observable multi-dimensional random vector into components that are as statistically independent as possible from each other. Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA in which both the observations and the independent components are over a finite alphabet.

In this thesis we consider a formulation of the finite-field case in which an observation vector is decomposed to its independent components (as much as possible) with no prior assumption on the way it was generated. This generalization is also known as Barlow’s minimal redundancy representation (Barlow et al., 1989) and is considered an open problem. We propose several theorems and show that this hard problem can be accurately solved with a branch and bound search tree algorithm, or tightly approximated with a series of linear problems (Painsky et al., 2016b). Moreover, we show that there exists a simple transformation (namely, order permutation) which provides a greedy yet very effective approximation of the optimal solution (Painsky et al., 2016c). We further show that while not every random vector can be efficiently decomposed into independent components, the vast majority of vectors do decompose very well (that is, within a small constant cost), as the dimension increases. In addition, we show that we may practically achieve this favorable constant cost with a complexity that is asymptotically linear in the alphabet size. Our contribution provides the first efficient set of solutions to Barlow’s problem with theoretical and computational guarantees.

The minimal redundancy representation (also known as factorial coding (Schmidhuber, 1992)) has many applications, mainly in the fields of neural networks and deep learning (Becker & Plumbley, 1996; Obradovic, 1996; Choi & Lee, 2000; Bartlett et al., n.d.; Martiriggiano et al., 2005; Bartlett, 2007; Schmidhuber et al., 2011; Schmidhuber, 2015). In our work we show that the generalized ICA also applies to multiple disciplines in source coding (Painsky et al., 2016c). A special attention is given to large alphabet source coding (Painsky et al., 2015, 2016c, 2016d). We propose a conceptual framework in which a large alphabet memoryless source is decomposed into multiple sources with with a much smaller alphabet size that are “as independent as possible”. This way we slightly increase the average code-word length as the decomposed sources are not perfectly independent, but at the same time significantly reduce the overhead redundancy resulted by the large alphabet of the observed source. Our suggested method is applicable for a variety of large alphabet source coding setups.

To my father, Moti Painsky, who encouraged me to earn my B.Sc. and get a job

## Acknowledgements

First and foremost I would like to express my gratitude and love to my wife Noga and my new born daughter Ofri, who simply make me happy every single day. Thank you for taking part in this journey with me.

I wish to express my utmost and deepest appreciation to my advisers, Prof. Saharon Rosset and Prof. Meir Feder, from whom I learned so much, in so many levels. Coming from different disciplines and backgrounds, Meir and Saharon inspired me to dream high, but at the same time stay accurate and rigorous. This collaboration with two extraordinary experts has led to a fascinating research with some meaningful contributions. On a personal level, it was a privilege to work with such exceptional individuals. Looking back five years ago, when I moved back to Israel to pursue my Ph.D. in Tel Aviv University, I could not dream it up any better.

In the course of my studies I had the opportunity to meet and learn from so many distinguished individuals. Prof. Felix Abramovich, to whom I owe most of my formal statistical education. Prof. David Burshtein, who gave me the opportunity to teach the undergraduate Digital to Signal Processing class for the past four years. Prof. Uri Erez, who assigned me as a chief of Teaching Assistants in his Random Signals and Noise class. Dr. Ofer Shayevitz, who (unintentionally) led me to pursue my Ph.D. in Israel, on top of my offers abroad. Dr. Ronny Luss, who introduced me to Saharon when I moved back to Israel and guided my first steps in Optimization. My fellow faculty members and graduate students from both the Statistics and Electrical Engineering departments, Dr. Ofir Harari, Dr. Shlomi Lifshits, Dr. David Golan, Aya Vituri, Shachar Kaufman, Keren Levinstein, Omer Weissbord, Prof. Rami Zamir, Dr. Yuval Kochman, Dr. Zachi Tamo, Dr. Yair Yona, Dr. Anatoly Khina, Dr. Or Ordentlich, Dr. Ronen Dar, Assaf Ben-Yishai, Eli Haim, Elad Domanovitz, Nir Elkayam, Uri Hadar, Naor Huri, Nir Hadas, Lital Yodla and Svetlana Reznikov.

Finally I would like to thank my parents, Moti and Alicia Painsky, for their endless love, support and caring. It has been a constant struggle for the past decade, explaining what it is that I do for living. Yet it seems like you are quite content with the results.

## Chapter 1 Introduction

Independent Component Analysis (ICA) addresses the recovery of unobserved statistically independent source signals from their observed mixtures, without full prior knowledge of the mixing function or the statistics of the source signals. The classical Independent Components Analysis framework usually assumes linear combinations of the independent sources over the field of real valued numbers (Hyvärinen et al., 2004). A special variant of the ICA problem is when the sources, the mixing model and the observed signals are over a finite field.

Several types of generative mixing models can be assumed when working over GF(P), such as modulu additive operations, OR operations (over the binary field) and others. Existing solutions to ICA mainly differ in their assumptions of the generative mixing model, the prior distribution of the mixing matrix (if such exists) and the noise model. The common assumption to these solutions is that there exist statistically independent source signals which are mixed according to some known generative model (linear, XOR, etc.).

In this work we drop this assumption and consider a generalized approach which is applied to a random vector and decomposes it into independent components (as much as possible) with no prior assumption on the way it was generated. This problem was first introduced by Barlow et al. (1989) and is considered a long–standing open problem.

In Chapter 2 we review previous work on ICA over finite alphabets. This includes two major lines of work. We first review the line of work initiated by Yeredor (2007). In this work, Yeredor focuses on linear transformations where the assumptions are that the unknown sources are statistically independent and are linearly mixed (over GF(P)). Under these constraints, he proved that the there exists a unique transformation matrix to recover the independent signals (up to permutation ambiguity). This work was later extended to larger alphabet sizes (Yeredor, 2011) and different generative modeling assumptions (Šingliar & Hauskrecht, 2006; Wood et al., 2012; Streich et al., 2009; Nguyen & Zheng, 2011). In a second line of work, Barlow et al. (1989) suggest to decompose the observed signals “as much as possible”, with no assumption on the generative model. Barlow et al. claim that such decomposition would capture and remove the redundancy of the data. However, they do not propose any direct method, and this hard problem is still considered open, despite later attempts (Atick & Redlich, 1990; Schmidhuber, 1992; Becker & Plumbley, 1996).

In Chapter 3 we present three different combinatorical approaches for independent decomposition of a given random vector, based on our published paper (Painsky et al., 2016b). In the first, we assume that the underlying components are completely independent. This leads to a simple yet highly sensitive algorithm which is not robust when dealing with real data. Our second approach drops the assumption of statistically independent components and strives to achieve “as much independence as possible” (as rigorously defined in Section 3.2) through a branch-and-bound algorithm. However, this approach is very difficult to analyze, both in terms of its accuracy and its computational burden. Then, we introduce a piece-wise linear approximation approach, which tightly bounds our objective from above. This method shows how to decompose any given random vector to its “as statistically independent as possible” components with a computational burden that is competitive with any known benchmarks.

In Chapter 4 we present an additional, yet simpler approach to the generalized ICA problem, namely, order permutation. Here, we suggest to represent the least probable realization of a given random vector with the number (Painsky et al., 2016c). Despite its simplicity, this method holds some favorable theoretical properties. We show that on the average (where the average is taken over all possible distribution functions of a given alphabet size), the order permutation is only a small constant away from full statistical independence, even as the dimension increases. In fact, this result provides a theoretical guarantee on the “best we can wish for”, when trying to decompose any random vector (on the average). In addition, we show that we may practically achieve the average accuracy of the order permutation with a complexity that is asymptotically linear in the alphabet size.

In Chapter 5 we focus on the binary case and compare our suggested approaches with linear binary ICA (BICA). Although several linear BICA methods were presented in the past years (Attux et al., 2011; Silva et al., 2014b, a), they all lack theoretical guarantees on how well they perform. Therefore, we begin this section by introducing a novel lower bound on the generalized BICA problem over linear transformations. In addition, we present a simple heuristic which empirically outperforms all currently known methods. Finally, we show that the simple order permutation (presented in the previous section) outperforms the linear lower bound quite substantially, as the alphabet size increases.

Chapter 6 discusses a different aspect of the generalized ICA problem, in which we limit ourselves to sequential processing (Painsky et al., 2013). In other words, we assume that the components of a given vector (or process) are presented to us one after the other, and our goal is to represent it as a process with statistically independent components (memoryless), in a no-regret manner. In this chapter we present a non-linear method to generate such memoryless process from any given process under varying objectives and constraints. We differentiate between lossless and lossy methods, closed form and algorithmic solutions and discuss the properties and uniqueness of our suggested methods. In addition, we show that this problem is closely related to the multi-marginal optimal transportation problem (Monge, 1781; Kantorovich, 1942; Pass, 2011).

In Chapter 7 we apply our methodology to multiple data compression problems. Here, we propose a conceptual framework in which a large alphabet memoryless source is decomposed into multiple “as independent as possible” sources with a much smaller alphabet size (Painsky et al., 2015, 2016c, 2016d). This way we slightly increase the average code-word length as the compressed symbols are no longer perfectly independent, but at the same time significantly reduce the redundancy resulted by the large alphabet of the observed source. Our proposed algorithm, based on our solutions to the Barlow’s problem, shows to efficiently find the ideal trade-off so that the overall compression size is minimal. We demonstrate our suggested approach in a variety of lossless and lossy source coding problems. This includes the classical lossless compression, universal compression and high-dimensional vector quantization. In each of these setups, our suggested approach outperforms most commonly used methods. Moreover, our proposed framework is significantly easier to implement in most of these cases.

This thesis provides a comprehensive overview of the following publications (Painsky et al., 2013, 2014, 2015, 2016a, 2016b, 2016d) and a currently under–review manuscript (Painsky et al., 2016c).

## Chapter 2 Overview of Related Work

In his work from , Barlow et al. (1989) presented a minimally redundant representation scheme for binary data. He claimed that a good representation should capture and remove the redundancy of the data. This leads to a factorial representation/ encoding in which the components are as mutually independent of each other as possible. Barlow suggested that such representation may be achieved through minimum entropy encoding: an invertible transformation (i.e., with no information loss) which minimizes the sum of marginal entropies (as later presented in (3.2)). Barlow’s representation is also known as Factorial representation or Factorial coding.

Factorial representations have several advantages. The probability of the occurrence of any realization can be simply computed as the product of the probabilities of the individual components that represent it (assuming such decomposition exists). In addition, any method of finding factorial codes automatically implements Occam’s razor which prefers simpler models over more complex ones, where simplicity is defined as the number of parameters necessary to represent the joint distribution of the data. In the context of supervised learning, independent features can also make later learning easier; if the input units to a supervised learning network are uncorrelated, then the Hessian of its error function is diagonal, allowing accelerated learning abilities (Becker & Le Cun, 1988). There exists a large body of work which demonstrates the use of factorial codes in learning problems. This mainly includes Neural Networks (Becker & Plumbley, 1996; Obradovic, 1996) with application to facial recognition (Choi & Lee, 2000; Bartlett et al., n.d.; Martiriggiano et al., 2005; Bartlett, 2007) and more recently, Deep Learning (Schmidhuber et al., 2011; Schmidhuber, 2015).

Unfortunately Barlow did not suggest any direct method for finding factorial codes. Later, Atick & Redlich (1990) proposed a cost function for Barlow’s principle for linear systems, which minimize the redundancy of the data subject to a minimal information loss constraint. This is closely related to Plumbey’s objective function (Plumbley, 1993), which minimizes the information loss subject to a fixed redundancy constraint. Schmidhuber (1992) then introduced several ways of approximating Barlow’s minimum redundancy principle in the non–linear case. This naturally implies much stronger results of statistical independence. However, Schmidhuber’s scheme is rather complex, and appears to be subject to local minima (Becker & Plumbley, 1996). To our best knowledge, the problem of finding minimal redundant codes, or factorial codes, is still considered an open problem. In this work we present what appears to be the first efficient set of methods for minimizing Barlow’s redundancy criterion, with theoretical and computational complexity guarantees.

In a second line of work, we may consider our contribution as a generalization of the BICA problem. In his pioneering BICA work, Yeredor (2007) assumed linear XOR mixtures and investigated the identifiability problem. A deflation algorithm is proposed for source separation based on entropy minimization. Yeredor assumes the number of independent sources is known and the mixing matrix is a -by- invertible matrix. Under these constraints, he proves that the XOR model is invertible and there exists a unique transformation matrix to recover the independent components up to permutation ambiguity. Yeredor (2011) then extended his work to cover the ICA problem over Galois fields of any prime number. His ideas were further analyzed and improved by Gutch et al. (2012).

Šingliar & Hauskrecht (2006) introduced a noise-OR model for dependency among observable random variables using (known) latent factors. A variational inference algorithm is developed. In the noise-OR model, the probabilistic dependency between observable vectors and latent vectors is modeled via the noise-OR conditional distribution. Wood et al. (2012) considered the case where the observations are generated from a noise-OR generative model. The prior of the mixing matrix is modeled as the Indian buffet process (Griffiths & Ghahramani, n.d.). Reversible jump Markov chain Monte Carlo and Gibbs sampler techniques are applied to determine the mixing matrix. Streich et al. (2009) studied the BICA problem where the observations are either drawn from a signal following OR mixtures or from a noise component. The key assumption made in that work is that the observations are conditionally independent given the model parameters (as opposed to the latent variables). This greatly reduces the computational complexity and makes the scheme amenable to a objective descent-based optimization solution. However, this assumption is in general invalid. Nguyen & Zheng (2011) considered OR mixtures and propose a deterministic iterative algorithm to determine the distribution of the latent random variables and the mixing matrix.

There also exists a large body of work on blind deconvolution with binary sources in the context of wireless communication (Diamantaras & Papadimitriou, 2006; Yuanqing et al., 2003) and some literature on Boolean/binary factor analysis (BFA) which is also related to this topic (Belohlavek & Vychodil, 2010).

## Chapter 3 Generalized Independent Component Analysis - Combinatorical Approach

The material in this Chapter is partly covered in (Painsky et al., 2016b).

### 3.1 Notation

Throughout the following chapters we use the following standard notation: underlines denote vector quantities, where their respective components are written without underlines but with index. For example, the components of the -dimensional vector are . Random variables are denoted with capital letters while their realizations are denoted with the respective lower-case letters. is the probability function of while is the entropy of . This means where the function denotes a logarithm of base and . Further, we denote the binary entropy of the Bernoulli parameter as .

### 3.2 Problem Formulation

Suppose we are given a random vector of dimension and alphabet size for each of its components. We are interested in an invertible, not necessarily linear, transformation such that is of the same dimension and alphabet size, . In addition we would like the components of to be as ”statistically independent as possible”.

The common ICA setup is not limited to invertible transformations (hence and may be of different dimensions). However, in our work we focus on this setup as we would like to be “lossless” in the sense that we do not lose any information. Further motivation to this setup is discussed in (Barlow et al., 1989; Schmidhuber, 1992) and throughout Chapter 7.

Notice that an invertible transformation of a vector , where the components are over a finite alphabet of size , is actually a one-to-one mapping (i.e., permutation) of its words. For example, if is over a binary alphabet and is of dimension , then there are possible permutations of its words.

To quantify the statistical independence among the components of the vector we use the well-known total correlation measure, which was first introduced by Watanabe (1960) as a multivariate generalization of the mutual information,

 C(Y––)=d∑j=1H(Yj)−H(Y––). (3.1)

This measure can also be viewed as the cost of coding the vector component-wise, as if its components were statistically independent, compared to its true entropy. Notice that the total correlation is non-negative and equals zero iff the components of are mutually independent. Therefore, “as statistically independent as possible” may be quantified by minimizing . The total correlation measure was considered as an objective for minimal redundency representation by Barlow et al. (1989). It is also not new to finite field ICA problems, as demonstrated by Attux et al. (2011). Moreover, we show that it is specifically adequate to our applications, as described in Chapter 7. Note that total correlation is also the Kullback-Leibler divergence between the joint probability and the product of its marginals (Comon, 1994).

Since we define to be an invertible transformation of we have and our minimization objective is

 d∑j=1H(Yj)→min. (3.2)

In the following sections we focus on the binary case. The probability function of the vector is therefore defined by over possible words and our objective function is simply

 d∑j=1hb(P(Yj=0))→min. (3.3)

We notice that is the sum of probabilities of all words whose bit equals . We further notice that the optimal transformation is not unique. For example, we can always invert the bit of all words, or shuffle the bits, to achieve the same minimum.

Any approach which exploits the full statistical description of the joint probability distribution of would require going over all possible words at least once. Therefore, a computational load of at least seems inevitable. Still, this is significantly smaller (and often realistically far more affordable) than , required by brute-force search over all possible permutations. Indeed, the complexity of the currently known binary ICA (and factorial codes) algorithms falls within this range. The AMERICA algorithm (Yeredor, 2011), which assumes a XOR mixture, has a complexity of . The MEXICO algorithm, which is an enhanced version of AMERICA, achieves a complexity of under some restrictive assumptions on the mixing matrix. In (Nguyen & Zheng, 2011) the assumption is that the data was generated over OR mixtures and the asymptotic complexity is . There also exist other heuristic methods which avoid an exhaustive search, such as (Attux et al., 2011) for BICA or (Schmidhuber, 1992) for factorial codes. These methods, however, do not guarantee convergence to the global optimal solution.

Looking at the BICA framework, we notice two fundamental a-priori assumptions:

1. The vector is a mixture of independent components and there exists an inverse transformation which decomposes these components.

2. The generative model (linear, XOR field, etc.) of the mixture function is known.

In this work we drop these assumptions and solve the ICA problem over finite alphabets with no prior assumption on the vector . As a first step towards this goal, let us drop Assumption and keep Assumption , stating that underlying independent components do exist. The following combinatorial algorithm proves to solve this problem, over the binary alphabet, in computations.

### 3.3 Generalized BICA with Underlying Independent Components

In this section we assume that underlying independent components exist. In other words, we assume there exists a permutation such that the vector is statistically independent . Denote the marginal probability of the bit equals as . Notice that by possibly inverting bits we may assume every is at most and by reordering we may have, without loss of generality, that . In addition, we assume a non-degenerate setup where . For simplicity of presentation, we first analyze the case where . This is easily generalized to the case where several may equal, as discussed later in this section.

Denote the probabilities of as , assumed to be ordered so that . We first notice that the probability of the all-zeros word, is the smallest possible probability since all parameters are not greater than 0.5. Therefore we have .

Since is the largest parameter of all , the second smallest probability is just . Therefore we can recover from , leading to . We can further identify the third smallest probability as . This leads to .

However, as we get to we notice we can no longer uniquely identify its components; it may either equal or . This ambiguity is easily resolved since we can compute the value of from the parameters we already found and compare it with . Specifically, If then we necessarily have from which we can recover . Otherwise from which we can again recover and proceed to the next parameter.

Let us generalize this approach. Denote as a set of probabilities of all words whose bits are all zero.

###### Theorem 1.

Let be an arbitrary index in . Assume we are given that , the smallest probability in a given set of probabilities, satisfies the following decomposition

 pi=πd⋅πd−1⋅…⋅πk+1⋅(1−πk)⋅πk−1⋅…⋅∙π1.

Further assume the values of are all given in a sorted manner. Then the complexity of finding the value of , and calculating and sorting the values of is .

###### Proof.

Since the values of and are given we can calculate the values which are still missing to know entirely by simply multiplying each element of by . Denote this set of values as . Since we assume the set is sorted then is also sorted and the size of each set is . Therefore, the complexity of sorting is the complexity of merging two sorted lists, which is .

In order to find the value of we need to go over all the values which are larger than and are not in . However, since both the list of all probabilities and the set are sorted we can perform a binary search to find the smallest entry for which the lists differ. The complexity of such search is which is smaller than . Therefore, the overall complexity is

Our algorithm is based on this theorem. We initialize the values and , and for each step we calculate and .

The complexity of our suggested algorithm is therefore . However, we notice that by means of the well-known quicksort algorithm (Hoare, 1961), the complexity of our preprocessing sorting phase is . Therefore, in order to find the optimal permutation we need for sorting the given probability list and for extracting the parameters of .

Let us now drop the assumption that the values of ’s are non-equal. That is, . It may be easily verified that both Theorem 1 and our suggested algorithm still hold, with the difference that instead of choosing the single smallest entry in which the probability lists differ, we may choose one of the (possibly many) smallest entries. This means that instead of recovering the unique value of at the iteration (as the values of ’s are assumed non-equal), we recover the smallest value in the list .

Notice that this algorithm is combinatorial in its essence and is not robust when dealing with real data. In other words, the performance of this algorithm strongly depends on the accuracy of and does not necessarily converge towards the optimal solution when applied on estimated probabilities.

### 3.4 Generalized BICA via Search Tree Based Algorithm

We now turn to the general form of our problem (3.1) with no further assumption on the vector .

We denote = {all words whose bit equals 0}. In other words, is the set of words that “contribute” to . We further denote the set of that each word is a member of as for all words. For example, the all zeros word is a member of all hence . We define the optimal permutation as the permutation of the words that achieves the minimum of such that for every .

Let us denote the binary representation of the word with . Looking at the words of the vector we say that a word is majorant to if . In other words, is majorant to iff for every bit in that equals zeros, the same bit equals zero in . In the same manner a word is minorant to if , that is iff for every bit in that equals zeros, the same bit equals zero in . For example, the all zeros word is majorant to all the words, while the all ones word is minorant to all the word as none of its bits equals zeros.

We say that is a largest minorant to if there is no other word that is minorant to and majorant to . We also say that there is a partial order between and if one is majorant or minorant to the other.

###### Theorem 2.

The optimal solution must satisfy for all .

###### Proof.

Assume there exists such that which achieves the lowest (optimal) . Since then, by definition, . This means there exists which satisfies . Let us now exchange (swap) the words and . Notice that this swapping only modifies but leaves all other ’s untouched. Therefore this swap leads to a lower as the sum in (3.3) remains untouched apart from its summand which is lower than before. This contradicts the optimality assumption ∎

We are now ready to present our algorithm. As a preceding step let us sort the probability vector (of X) such that . As described above, the all zeros word is majorant to all words and the all ones word is minorant to all words. Hence, the smallest probability and the largest probability are allocated to them respectively, as Theorem 2 suggests. We now look at all words that are largest minorants to the all zeros word.

Theorem 2 guarantees that must be allocated to one of them. We shall therefore examine all of them. This leads to a search tree structure in which every node corresponds to an examined allocation of . In other words, for every allocation of we shall further examine the allocation of to each of the largest minorants that are still not allocated. This process ends once all possible allocations are examined.

The following example (Figure 3.1) demonstrates our suggested algorithm with . The algorithm is initiated with the allocation of to the all zeros word. In order to illustrate the largest minorants to we use the chart of the partial order at the bottom left of Figure 3.1. As visualized in the chart, every set is encircled by a different shape (e.g. ellipses, rectangles) and the largest minorants to are , and . As we choose to investigate the allocation of to we notice that remaining largest minorants, of all the words that are still not allocated, are and . We then investigate the allocation of to , for example, and continue until all are allocated.

This search tree structure can be further improved by introducing a depth-first branch and bound enhancement. This means that before we examine a new branch in the tree we bound the minimal objective it can achieve (through allocation of the smallest unallocated probability to all of its unallocated words for example).

The asymptotic computational complexity of this branch and bound search tree is quite involved to analyze. However, there are several cases where a simple solution exists (for example, for it is easy to show that the solution is to allocate all four probabilities in ascending order).

### 3.5 Generalized BICA via Piecewise Linear Relaxation Algorithm

In this section we present a different approach which bounds the optimal solution from above as tightly as we want in operations, where defines how tight the bound is. Throughout this section we assume that is a fixed value, for complexity analysis purposes.

Let us first notice that the problem we are dealing with (3.3) is a concave minimization problem over a discrete permutation set which is hard. However, let us assume for the moment that instead of our “true” objective (3.3) we have a simpler linear objective function. That is,

 L(Y––)=d∑j=1ajπj+bj=m∑i=1ciP(Y––=y–(i))+d0 (3.4)

where the coefficients correspond to different slopes and intersects that are later defined. Notice that the last equality changes the summation from over components to a summation over all words (this change of summation is further discussed in Section 3.5.1).

In order to minimize this objective over the given probabilities we simply sort these probabilities in a descending order and allocate them such that the largest probability goes with the smallest coefficient and so on. Assuming both the coefficients and the probabilities are known and sorted in advance, the complexity of this procedure is linear in .

We now turn to the generalized binary ICA problem as defined in (3.3). Since our objective is concave we would first like to bound it from above with a piecewise linear function which contains pieces, as shown in Figure 3.2. In this work we do not discuss the construction of such upper-bounding piecewise linear function, nor tuning methods for the value of , and assume this function is given for any fixed . Notice that the problem of approximating concave curves with piecewise linear functions is very well studied (for example, by Gavrilović (1975)) and may easily be modified to the upper bound case. We show that solving the piecewise linear problem approximates the solution to (3.3) as closely as we want, in significantly lower complexity.

From this point on we shall drop the previous assumption that , for simplicity of presentation. First, we notice that all are equivalent (in the sense that we can always interchange them and achieve the same result). This means we can find the optimal solution to the piecewise linear problem by going over all possible combinations of âplacingâ the variables in the different regions of the piecewise linear function. For each of these combinations we need to solve a linear problem (such as in (3.4), where the minimization is with respect to allocation of the given probabilities ) with additional constraints on the ranges of each . For example, assume and the optimal solution is such that two (e.g. and ) are at the first region, , and is at the second region, . Then, we need to solve the following constrained linear problem,

 minimize a1⋅(π1+π2)+2b1+a2⋅π3+b2 (3.5) subject to π1,π2∈R1,π3∈R2

where the minimization is over the allocation of the given , which determine the corresponding ’s, as demonstrated in (3.4). This problem is again hard. However, if we attempt to solve it without the constraints we notice the following:

1. If the collection of which define the optimal solution to the unconstrained linear problem happens to meet the constraints then it is obviously the optimal solution with the constraints.

2. If the collection of of the optimal solution do not meet the constraints (say, ) then, due to the concavity of the entropy function, there exists a different combination with a different constrained linear problem (again, over the allocation of the given probabilities ),

 minimize a1π1+b1+a2(π2+π3)+2b2 subject to π1∈R1π2,π3∈R2

in which this set of necessarily achieves a lower minimum (since ).

Therefore, in order to find the optimal solution to the piecewise linear problem, all we need to do is to go over all possible combinations of placing the in the different regions, and for each combination solve an unconstrained linear problem (which is solved in a linear time in ). If the solution does not meet the constraint then it means that the assumption that the optimal reside within this combination’s regions is false. Otherwise, if the solution does meet the constraint, it is considered as a candidate for the global optimal solution.

The number of combinations is equivalent to the number of ways of placing identical balls in boxes, which is for a fixed ,

 (d+k−1n)=(d+k−1k−1)≤(d+k−1)k−1(k−1)!=O(dk). (3.6)

Assuming the coefficients are all known and sorted in advance, for any fixed the overall asympthotic complexity of our suggested algorithm, as , is simply .

#### 3.5.1 The Relaxed Generalized BICA as a single matrix-vector multiplication

It is important to notice that even though the asymptotic complexity of our approximation algorithm is , it takes a few seconds to run an entire experiment on a standard personal computer for as much as , for example. The reason is that the factor refers to the complexity of sorting a vector and multiplying two vectors, operations which are computationally efficient on most available software. Moreover, if we assume that the coefficients in (3.4) are already calculated, sorted and stored in advance, we can place them in a matrix form and multiply the matrix with the (sorted) vector . The minimum of this product is exactly the solution to the linear approximation problem. Therefore, the practical complexity of the approximation algorithm drops to a single multiplication of a () matrix with a () vector.

Let us extend the analysis of this matrix-vector multiplication approach. Each row of the matrix corresponds to a single coefficient vector to be sorted and multiplied with the sorted probability vector . Each of these coefficient vectors correspond to one possible way of placing components in different regions of the piecewise linear function. Specifically, in each row, each of the components is assumed to reside in one of the regions, hence it is assigned a slope as indicated in (3.4). For each row, our goal is to minimize . Since this minimization is solved over the vector we would like to change the summation accordingly. To do so, each entry of the coefficient vector (denoted as in (3.4)) is calculated by summing all the slopes that correspond to each . For example, let us assume where , with a corresponding slope and intercept , while the with and . We use the following mapping: . Therefore

 π1=P(Y1=0)=p1+p2+p3+p4π2=P(Y2=0)=p1+p2+p5+p6π3=P(Y3=0)=p1+p3+p5+p7. (3.7)

The corresponding coefficients are then the sum of rows of the following matrix

 A=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝a1a1a2a1a10a10a2a1000a1a20a1000a2000⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠. (3.8)

This leads to a minimization problem

 L(Y––)=d∑j=1ajπj+bj=a1(π1+π2)+a2π3+2b1+b2= (3.9) (2a1+a2)p1+2a1p2+(a1+a2)p3+a1p4+(a1+a2)p5+a1p6+a2p7+2b1+b2

where the coefficients of are simply the sum of the row in the matrix .

Now let us assume that is greater than (which is usually the case). It is easy to see that many of the coefficients are actually identical in this case. Precisely, let us denote by the number of assignments for the region, where . Then, the number of unique coefficients is simply

 k∏v=1(lv+1)−1

subject to . Since we are interested in the worst case (of all rows of the matrix ), we need to find the non-identical coefficients. This is obtained when is as ”uniform” as possible. Therefore we can bound the number of non-identical coefficients from above by temporarily dropping the assumption that ’s are integers and letting so that

 maxk∏v=1(lv+1)≤(dk+1)k=O(dk). (3.10)

This means that instead of sorting the coefficients for each row of the matrix , we only need to sort coefficients.

Now, let us further assume that the data is generated from some known parametric model. In this case, some probabilities may also be identical, so that the probability vector may also not require operations to be sorted. For example, if we assume a block independent structure, such that components (bits) of the data are generated from independent and identically distributed blocks of of size , then it can be shown that the probability vector contains at most

 (dr+2r−1dr)=O((dr)2r) (3.11)

non-identical elements . Another example is a first order stationary symmetric Markov model. In this case there only exists a quadratic number, , of non-identical probabilities in (see Appendix A).

This means that applying our relaxed generalized BICA on such datasets may only require operations for the matrix and a polynomial number of operations (in ) for the vector ; hence our algorithm is reduced to run in a polynomial time in .

Notice that this derivation only considers the number of non-identical elements to be sorted through a quicksort algorithm. However, we also require the degree of each element (the number of times it appears) to eventually multiply the matrix with the vector . This, however, may be analytically derived through the same combinatorical considerations described above.

#### 3.5.2 Relaxed Generalized BICA Illustration and Experiments

In order to validate our approximation algorithm we conduct several experiments. In the first experiment we illustrate the convergence of our suggested scheme as increases. We arbitrarily choose a probability distribution with statistically independent components and mix its components in a non-linear fashion. We apply the approximation algorithm on this probability distribution with different values of and compare the approximated minimum entropy we achieve (that is, the result of the upper-bound piecewise linear cost function) with the entropy of the vector. In addition, we apply the estimated parameters on the true objective (3.3), to obtain an even closer approximation. Figure 3.3 demonstrates the results we achieve, showing the convergence of the approximated entropy towards the real entropy as the number of linear pieces increases. As we repeat this experiment several times (that is, arbitrarily choose a probability distributions and examine our approach for every single value of ), we notice that the estimated parameters are equal to the independent parameters for as small as , on the average.

We further illustrate the use of the BICA tool by the following example on ASCII code. The ASCII code is a common standardized eight bit representation of western letters, numbers and symbols. We gather statistics on the frequency of each character, based on approximately 183 million words that appeared in the New York Times magazine (Jones & Mewhort, 2004). We then apply the BICA (with , which is empirically sufficient) on this estimated probability distribution, to find a new eight bit representation of characters, such that the bits are ”as statistically independent” as possible. We find that the entropy of the joint probability distribution is bits, the sum of marginal entropies using ASCII representation is bits and the sum of marginal entropies after applying BICA is just bits. This means that there exists a different eight bit representation of characters which allows nearly full statistical independence of bits. Moreover, through this representation one can encode each of the eight bit separately without losing more than bits, compared to encoding the eight bits altogether.

### 3.6 Generalized ICA Over Finite Alphabets

#### 3.6.1 Piecewise Linear Relaxation Algorithm - Exhaustive Search

Let us extend the notation of the previous sections, denoting the number of components as and the alphabet size as . We would like to minimize where is over an alphabet size . We first notice that we need parameters to describe the multinomial distribution of such that all of the parameters are not greater than . Therefore, we can bound from above the marginal entropy with a piecewise linear function in the range , for each of the parameters of . We refer to a ()-tuple of regions as cell. As in previous sections we consider , the number of linear pieces, to be fixed. Notice however, that as and increase, needs also to take greater values in order to maintain the same level of accuracy. As mentioned above, in this work we do not discuss methods to determine the value of for given and , and empirically evaluate it.

Let us denote the number of cells to be visited in our approximation algorithm (Section 3.5) as . Since each parameter is approximated by linear pieces and there are parameters, equals at most . In this case too, the parameters are exchangeable (in the sense that the entropy of a multinomial random variable with parameters is equal to the entropy of a multinomial random variable with parameters , for example). Therefore, we do not need to visit all cells, but only a unique subset which disregards permutation of parameters. In other words, the number of cells to be visited is bounded from above by the number of ways of choosing elements (the parameters) out of elements (the number of pieces in each parameter) with repetition and without order. Notice this upper-bound (as opposed to full equality) is a result of not every combination being a feasible solution, as the sum of parameters may exceed . Assuming is fixed and as this equals

 (q−1+k−1q−1)= (q−1+k−1k−1)≤(q−1+k−1)k−1(k−1)!=O(qk). (3.12)

Therefore, the number of cells we are to visit is simply . For sufficiently large it follows that . As in the binary case we would like to examine all combinations of entropy values in cells. The number of iterations to calculate all possibilities is equal to the number of ways of placing identical balls in boxes, which is

 (d+C−1d)=O(dC). (3.13)

In addition, in each iteration we need to solve a linear problem which takes a linear complexity in . Therefore, the overall complexity of our suggested algorithm is .

We notice however that for a simple case where only two components are mixed , we can calculate (3.13) explicitly

 (2+C−12)=C(C+1)2. (3.14)

Putting this together with (3.12), leads to an overall complexity which is polynomial in , for a fixed ,

 (qk(qk+1)2q2)=O(q2k+2). (3.15)

Either way, the computational complexity of our suggested algorithm may result in an excessive runtime, to a point of in-feasibility, in the case of too many components or an alphabet size which is too large. This necessitates a heuristic improvement to reduce the runtime of our approach.

#### 3.6.2 Piecewise Linear Relaxation Algorithm - Objective Descent Search

In Section 3.5 we present the basic step of our suggested piecewise linear relaxation to the generalized binary ICA problem. As stated there, for each combination of placing components in pieces (of the piecewise linear approximation function) we solve a linear problem (LP). Then, if the solution happens to meet the constraints (falls within the ranges we assume) we keep it. Otherwise, due to the concavity of the entropy function, there exists a different combination with a different constrained linear problem in which this solution that we found necessarily achieves a lower minimum, so we disregard it.

This leads to the following objective descent search method: instead of searching over all possible combinations we shall first guess an initial combination as a starting point (say, all components reside in a single cell). We then solve its unconstrained LP. If the solution meets the constraint we terminate. Otherwise we visit the cell that meets the constraints of the solution we found. We then solve the unconstrained LP of that cell and so on. We repeat this process for multiple random initialization.

This suggested algorithm is obviously heuristic, which does not guarantee to provide the global optimal solution. Its performance strongly depends on the number of random initializations and the concavity of the searched domain.

The following empirical evaluation demonstrates our suggested approach. In this experiment we randomly generate a probability distribution with independent and identically distributed components over an alphabet size . We then mix its components in a non-linear fashion. We apply the objective descent algorithm with a fixed number of initialization points () and compare the approximated minimum sum of the marginal entropies with the true entropy of the vector. Figure 3.4 demonstrates the results we achieve for different values of . We see that the objective descent algorithm approximates the correct components well for smaller values of but as increases the difference between the approximated minimum and the optimal minimum increases, as the problem becomes too involved.

### 3.7 Application to Blind Source Separation

Assume there exist independent (or practically ”almost independent”) sources where each source is over an alphabet size . These sources are mixed in an invertible, yet unknown manner. Our goal is to recover the sources from this mixture.

For example, consider a case with sources , where each source is over an alphabet size . The sources are linearly mixed (over a finite field) such that . However, due to a malfunction, the symbols of are randomly shuffled, before it is handed to the receiver. Notice this mixture (including the malfunction) is unknown to the receiver, who receives and strives to “blindly” recover . In this case any linearly based method such as (Yeredor, 2011) or (Attux et al., 2011) would fail to recover the sources as the mixture, along with the malfunction, is now a non-linear invertible transformation. Our method on the other hand, is designed especially for such cases, where no assumption is made on the mixture (other than being invertible).

To demonstrate this example we introduce two independent sources , over an alphabet size . We apply the linear mixture and shuffle the symbols of . We are then ready to apply (and compare) our suggested methods for finite alphabet sizes, which are the exhaustive search method (Section 3.6.1) and the objective descent method (Section 3.6.2). For the purpose of this experiment we assume both and are distributed according to a Zipf’s law distribution,

 P(k;s,q)=k−s∑qi=1i−s

with a parameter . The Zipf’s law distribution is a commonly used heavy-tailed distribution. This choice of distribution is further motivated in Chapter 7. We apply our suggested algorithms for different alphabet sizes, with a fixed , and with only random initializations for the objective descent method. Figure 3.5 presents the results we achieve.

We first notice that both methods are capable of finding a transformation for which the sum of marginal entropies is very close to the joint entropy. This means our suggested methods succeed in separating the non-linear mixture back to the statistically independent sources , as we expected. Looking at the chart on the right hand side of Figure 3.5, we notice that the difference between the two methods tends to increase as the alphabet size grows. This is not surprising since the search space grows while the number of random initializations remains fixed. However, the difference between the two methods is still practically negligible, as we can see from the chart on the left. This is especially important since the objective descent method takes significantly less time to apply as the alphabet size grows.

### 3.8 discussion

In this chapter we considered a generalized ICA over finite alphabets framework where we dropped the common assumptions on the underlying model. Specifically, we attempted to decompose a given multi-dimensional vector to its “as statistically independent as possible” components with no further assumptions, as introduced by Barlow et al. (1989).

We first focused on the binary case and proposed three algorithms to address this class of problems. In the first algorithm we assumed that there exists a set of independent components that were mixed to generate the observed vector. We showed that these independent components are recovered in a combinatorial manner in operations. The second algorithm drops this assumption and accurately solves the generalized BICA problem through a branch and bound search tree structure. Then, we proposed a third algorithm which bounds our objective from above as tightly as we want to find an approximated solution in with being the approximation accuracy parameter. We further showed that this algorithm can be formulated as a single matrix-vector multiplications and under some generative model assumption the complexity is dropped to be polynomial in . Following that we extended our methods to deal with a larger alphabet size. This case necessitates a heuristic approach to deal with the super exponentially increasing complexity. An objective descent search method is presented for that purpose. We concluded the chapter by presenting a simple Blind Source Separation application.

## Chapter 4 Generalized Independent Component Analysis - The Order Permutation

The material in this Chapter is partly covered in (Painsky et al., 2016c).

In the previous chapter we presented the generalized BICA problem. This minimization problem (3.3) is combinatorial in its essence and is consequently considered hard. Our suggested algorithms (described in detail in Sections 3.3, 3.4 and 3.5) strive to find its global minimum, but due to the nature of the problem, they result in quite involved methodologies. This demonstrates a major challenge in providing theoretical guarantees to the solutions they achieve. We therefore suggest a simplified greedy algorithm which is much easier to analyze, as it sequentially minimizes each term of the summation (3.3), , for . For the simplicity of presentation, we denote the alphabet size of a dimensional vector as .

With no loss of generality, let us start by minimizing , which corresponds to the marginal entropy of the most significant bit (msb). Since the binary entropy is monotonically increasing in the range , we would like to find a permutation of that minimizes a sum of half of its values. This means we should order the ’s so that half of the ’s with the smallest values are assigned to while the other half of ’s (with the largest values) are assigned to . For example, assuming and , a permutation which minimizes is

We now proceed to minimize the marginal entropy of the second most significant bit, . Again, we would like to assign the smallest possible values of ’s. However, since we already determined which ’s are assigned to the msb, all we can do is reorder the ’s without changing the msb. This means we again sort the ’s so that the smallest possible values are assigned to , without changing the msb. In our example, this leads to,

Continuing in the same manner, we would now like to reorder the ’s to minimize without changing the previous bits. This results in

Therefore, we show that a greedy solution to (3.3) which sequentially minimizes is attained by simply ordering the joint distribution in an ascending (or equivalently descending) order. In other words, the order permutation suggests to simply order the probability distribution in an ascending order, followed by a mapping of the symbol (in its binary representation) the smallest probability.

At this point it seems quite unclear how well the order permutation performs, compared both with the relaxed BICA we previously discussed, and the optimal permutation which minimizes (3.3). In the following sections we introduce some theoretical properties which demonstrate this method’s effectiveness.

### 4.1 Worst-case Independent Components Representation

We now introduce the theoretical properties of our suggested algorithms. Naturally, we would like to quantify how much we “lose” by representing a given random vector as if its components are statistically independent. We notice that our objective (3.3) depends on the distribution of a given random vector , and the applied invertible transformation . Therefore, we slightly change the notation of (3.3) and denote the cost function as .

Since our methods strongly depend on the given probability distribution , we focus on the worst-case and the average case of , with respect to . Let us denote the order permutation as and the permutation which is found by the piece-wise linear relaxation as . We further define as the permutation that results with a lower value of , between and . This means that

 gbst=argmin{glin,gord}C(p–,g).

In addition, we define as the optimal permutation that minimizes (3.3) over all possible permutations. Therefore, for any given , we have that