# Category coding with neural network application

###### Abstract

In many applications of neural networks, it is common to introduce huge numbers of categorical input features, as well as output labels. However, since the required network size grows rapidly with the dimensions of the input and output spaces, the cost in both computation and memory resources is huge. In this paper, we present a novel method called category coding (CC), whose design philosophy follows the principle of minimal collision to reduce the input and output dimensions effectively. In addition, we introduce three types of category coding based on different Euclidean domains. Experimental results show that all three proposed methods outperform existing state-of-the-art coding methods, such as the standard cut-off and error-correcting output coding (ECOC) methods.


Qizhi Zhang (qizhi.zqz@alibaba-inc.com), Kuang-chih Lee (kuang-chih.lee@alibaba-inc.com), Hongying Bao (hongying.bhy@alibaba-inc.com), Yuan You (youyuan.yy@alibaba-inc.com) and Dongbai Guo (dongbai.gdb@alibaba-inc.com)

Preprint. Work in progress.

## 1 Introduction

In machine learning, many features are categorical, such as color, country, user id, item id, etc. In multi-class classification problems, the labels are categorical too. No ordering relation exists among the different values of these categories. Usually such categorical variables are represented by one-hot feature vectors; for example, red is encoded as 100, yellow as 010 and blue as 001. But if the number of categories is very large, as for the user ids and item ids in e-commerce applications, the one-hot encoding scheme needs too many resources to compute classification results.

In past years, while SVMs were widely used, the ECOC (error-correcting output coding) method was proposed for handling huge numbers of output class labels. The idea of ECOC is to reduce a multi-class classification problem with a huge number of classes to several two-class classification problems using a binary error-correcting code. But no similar method exists for handling huge numbers of input categorical features, because the categories cannot be separated by a linear model unless one-hot encoding is used.

In recent years, deep neural networks have improved greatly in terms of performance and speed. The coding method can be applied to deep neural networks with some beneficial modifications.

In classification problems, because the number of labels of a single neural network need not be two, if we use a deep learning network as a base learner it is not necessary to limit the code to be binary. In fact, there is a trade-off between the number of classes of one base learner and the number of base learners used. According to information theory, if we use $q$-class classifiers as base classifiers to solve an $N$-class classification problem, we need at least $\lceil \log_q N \rceil$ base learners. For example, if we need to solve a classification problem with 1M classes and use binary classifiers as base learners, we need at least 20 base learners. For some classical applications, for example CNN image classification, we would need to build a CNN network for every binary classifier, which is a huge cost in computation and memory resources. But if we combine base learners with 1000 classes each, we need only 2 base learners. Since the number of parameters in a deep neural network is usually large, using a small number of base learners reduces the cost of computation and storage.
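The counting argument above can be checked with a short sketch (illustrative only; the helper name `min_base_learners` is ours, and integer arithmetic is used to avoid floating-point rounding):

```python
def min_base_learners(num_classes: int, classes_per_learner: int) -> int:
    """Information-theoretic lower bound ceil(log_q N): the smallest k
    such that q**k >= num_classes, computed with exact integers."""
    reachable, k = 1, 0
    while reachable < num_classes:
        reachable *= classes_per_learner
        k += 1
    return k

# 1M classes with binary base learners: at least 20 learners (2**20 >= 10**6).
print(min_base_learners(10**6, 2))     # -> 20
# 1M classes with 1000-class base learners: at least 2 (1000**2 == 10**6).
print(min_base_learners(10**6, 1000))  # -> 2
```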

On the other hand, because the neural network has the ability of non-linear representation, we can use an encoding for categorical features too. Can we use a classical error-correcting code for categorical features? In machine learning, sparsity is a basic rule to be satisfied, but classical error-correcting codes are not sparse. Hence we need to design a new sparse coding scheme for this application.

In this paper, we give new encoding methods that can be applied to both label encoding and feature encoding, and that give better performance than classical methods. In Section 2, we give the definition of category coding (CC) and propose three classes of CC with good properties, namely Polynomial CC, Remainder CC and Gauss CC. In Section 3 we discuss the application of CC to label encoding. In Section 4, we discuss the application of CC to feature encoding. Our main tools are finite field theory and number theory, for which one may refer to [ff] and [NT].

## 2 Category coding

For an $N$-class categorical feature or label, we define a category coding (CC) as a map

$$C = (C_1, \dots, C_k)\colon\ \mathbb{Z}/N \longrightarrow \prod_{i=1}^{k} \mathbb{Z}/n_i,$$

where each $C_i\colon \mathbb{Z}/N \to \mathbb{Z}/n_i$ is called a “site-position function” of the category coding, for $i = 1, \dots, k$.

Generally, $N$ is a huge number, and $n_1, \dots, n_k$ are numbers of middle size.

Through a CC, we can reduce an $N$-class classification problem to $k$ classification problems of middle size.

We can also use a $k$-hot $(\sum_{i=1}^{k} n_i)$-bit binary encoding instead of the one-hot encoding as the representation of the feature, i.e., use the composite of the CC map and the natural embedding

$$\prod_{i=1}^{k} \mathbb{Z}/n_i \hookrightarrow \{0,1\}^{\sum_{i=1}^{k} n_i}$$

(one-hot within each site) to get a $k$-hot encoding.
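A minimal sketch of this $k$-hot representation, using remainder-style site functions $x \bmod m_i$ for concreteness (the helper name `k_hot_encode` is ours; any CC works the same way):

```python
def k_hot_encode(x, moduli):
    """Encode category id x as a k-hot binary vector: one block of size m_i
    per site, one-hot within each block at position (x mod m_i)."""
    vec = []
    for m in moduli:
        block = [0] * m
        block[x % m] = 1          # site-position value of x at this site
        vec.extend(block)
    return vec

v = k_hot_encode(7, [3, 5])       # site values: 7 % 3 = 1, 7 % 5 = 2
assert v == [0, 1, 0, 0, 0, 1, 0, 0]
assert sum(v) == 2                # exactly k = 2 hot bits
```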

For a CC $C$, we call

$$\mathrm{col}(C) := \max_{x \neq x' \in \mathbb{Z}/N} \#\{\, i \mid C_i(x) = C_i(x') \,\}$$

the collision number of $C$, denoted $\mathrm{col}(C)$. We have the following theorem.

###### Theorem 2.1.

For a CC $C\colon \mathbb{Z}/N \to \prod_{i=1}^{k} \mathbb{Z}/n_i$, where $n_i \le n$ for all $i$, we have $\mathrm{col}(C) \ge \lceil \log_n N \rceil - 1$.

Proof. Let $t = \mathrm{col}(C)$. Suppose $t < \lceil \log_n N \rceil - 1$, i.e. $n^{t+1} < N$.

By the definition of $t$, for any $x \neq x'$ there are at most $t$ equal site-position values between $C(x)$ and $C(x')$, so $x$ and $x'$ differ at one of any $t+1$ chosen sites. Hence $(C_1, \dots, C_{t+1})$ is an injection, and hence $N \le n_1 \cdots n_{t+1} \le n^{t+1}$. This is a contradiction with $n^{t+1} < N$. ∎

If a CC $C$ satisfies $\mathrm{col}(C) = \lceil \log_n N \rceil - 1$, we say it has the minimal collision property. In both usages, label encoding and feature encoding, we wish the code to have the minimal collision property.

We give three classes of CC, i.e., Polynomial CC, Remainder CC and Gauss CC, which satisfy the minimal collision property.

### 2.1 Polynomial CC

For any prime number $p$, we can represent any non-negative integer less than $p^m$ uniquely in the form $a_0 + a_1 p + \dots + a_{m-1} p^{m-1}$ with $0 \le a_j < p$, which gives a bijection $\mathbb{Z}/p^m \to \mathbb{F}_p^m$, where $\mathbb{F}_p$ is the Galois field (finite field) of $p$ elements.

For a classification problem with $N$ classes, any small positive integer $m$ (for example, $m = 2, 3$) and a small real number $\epsilon > 0$, we take a prime number $p$ in $[N^{1/m}, (1+\epsilon)N^{1/m}]$ (according to the Prime Number Theorem [Riemann, Prime_Number_Theorem], there are about $\epsilon N^{1/m} / \ln(N^{1/m})$ such prime numbers), and get an injection $\mathbb{Z}/N \hookrightarrow \mathbb{F}_p^m$ by the $p$-adic representation.

###### Theorem 2.2.

For $k \ge m$ different elements $\alpha_1, \dots, \alpha_k$ in $\mathbb{F}_p$, the code defined by the composite of the $p$-adic representation map $\mathbb{Z}/N \hookrightarrow \mathbb{F}_p^m$ and the map

$$\mathbb{F}_p^m \to \mathbb{F}_p[X],\quad (a_0, \dots, a_{m-1}) \mapsto f(X) = a_0 + a_1 X + \dots + a_{m-1} X^{m-1}$$

and the map

$$f \mapsto (f(\alpha_1), \dots, f(\alpha_k)) \in \mathbb{F}_p^k$$

has the minimal collision property.

Proof. We need to prove that $\mathrm{col}(C) = m - 1$. Because Theorem 2.1 gives $\mathrm{col}(C) \ge \lceil \log_p N \rceil - 1 = m - 1$, we need only prove $\mathrm{col}(C) \le m - 1$, i.e. for any $x \neq x'$, $\#\{\, i \mid C_i(x) = C_i(x') \,\} \le m - 1$.

Because the $p$-adic representation map $\mathbb{Z}/N \hookrightarrow \mathbb{F}_p^m$ is an injection, and the map from coefficient vectors to polynomials is a bijection, we need only show that for any two different polynomials $f \neq g$ of degree less than $m$, $\#\{\, i \mid f(\alpha_i) = g(\alpha_i) \,\} \le m - 1$. Suppose there are $m$ indices $i$ such that $f(\alpha_i) = g(\alpha_i)$; then the nonzero polynomial $f - g$ of degree at most $m - 1$ has at least $m$ roots, a contradiction with the fact that a nonzero polynomial over a field has at most as many roots as its degree. ∎

Remark. The composite of the two maps in the above theorem is also known as the Reed–Solomon code [Reed_and_Solomon]. The Reed–Solomon code is a class of non-binary MDS (maximum distance separable) codes [Singleton]. The MDS property is an excellent property in error-correcting coding, but unfortunately no nontrivial binary MDS code has been found up to now; in fact, in some situations it is proved that no nontrivial binary MDS code exists ([Guerrini_and_Sala], and Proposition 9.2 on p. 212 in [Vermani]). This is another advantage of CC over ECOC in label encoding.
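The Polynomial CC can be sketched in a few lines and its collision bound checked by brute force on a toy instance (illustrative; `polynomial_cc` and the parameter choices are ours):

```python
def polynomial_cc(x, p, m, alphas):
    """Polynomial (Reed-Solomon style) CC: write x < p**m in base p, view
    the digits as coefficients of a polynomial f over GF(p), and emit
    f(alpha) mod p at each of the distinct evaluation points."""
    digits = [(x // p**j) % p for j in range(m)]        # p-adic representation
    return tuple(sum(a * alpha**j for j, a in enumerate(digits)) % p
                 for alpha in alphas)

# Toy check of minimal collision: any two distinct ids agree in at most
# m - 1 = 1 code positions.
p, m, alphas = 11, 2, [0, 1, 2, 3]
codes = [polynomial_cc(x, p, m, alphas) for x in range(p**m)]
worst = max(sum(a == b for a, b in zip(codes[s], codes[t]))
            for s in range(len(codes)) for t in range(s + 1, len(codes)))
assert worst == m - 1
```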

### 2.2 Remainder CC

For the original label set $\mathbb{Z}/N$, a small number $k$ like 2 or 3, and a small positive number $\epsilon$, select $k$ pairwise co-prime numbers $m_1, \dots, m_k$ in the domain $[N^{1/k}, (1+\epsilon)N^{1/k}]$. (According to the Prime Number Theorem [Riemann, Prime_Number_Theorem], there are about $\epsilon N^{1/k} / \ln(N^{1/k})$ primes, and hence pairwise co-prime numbers, in this domain.)

We define the remainder CC as

$$C\colon \mathbb{Z}/N \to \prod_{i=1}^{k} \mathbb{Z}/m_i,\quad x \mapsto (x \bmod m_1, \dots, x \bmod m_k),$$

where $(m_1, \dots, m_k)$ is called its moduli. Then we have the following proposition:

###### Theorem 2.3.

The remainder CC has the minimal collision property.

Proof. By Theorem 2.1, we need only show that for any two different $x, x' \in \mathbb{Z}/N$, there are at most $k - 1$ indices $i$ such that $x \equiv x' \pmod{m_i}$.

Suppose $x \equiv x' \pmod{m_i}$ for all $i = 1, \dots, k$. Because $m_1, \dots, m_k$ are pairwise co-prime, we have $m_1 \cdots m_k \mid (x - x')$. But we know $|x - x'| < N \le m_1 \cdots m_k$, hence $x = x'$, a contradiction. ∎
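The remainder CC is one line of code; the injectivity guaranteed by the proof above can be verified on a toy instance (illustrative; `remainder_cc` and the moduli are our choices):

```python
from math import gcd
from itertools import combinations

def remainder_cc(x, moduli):
    """Remainder CC: site i of the codeword of x is x mod m_i."""
    return tuple(x % m for m in moduli)

moduli = [31, 32, 33]             # pairwise co-prime; product 32736 >= N
assert all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

N = 30000
codes = [remainder_cc(x, moduli) for x in range(N)]
# No two distinct ids agree on all k sites, so the full code is injective.
assert len(set(codes)) == N
```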

### 2.3 Gauss CC

We propose a CC based on the ring of Gauss integers [Gauss NT], which we call the Gauss CC.

We write the ring of Gauss integers as $\mathbb{Z}[i]$. For a big integer $N$, let $r$ be the minimal positive real number such that the number of Gauss integers in the closed disc $\bar{D}(0, r)$ is not less than $N$, i.e. $\#(\mathbb{Z}[i] \cap \bar{D}(0, r)) \ge N$ and $\#(\mathbb{Z}[i] \cap \bar{D}(0, r - \delta)) < N$ for any small $\delta > 0$. In general, $\#(\mathbb{Z}[i] \cap \bar{D}(0, r))$ is about $\pi r^2$, hence we can take such an $r$ about $\sqrt{N/\pi}$.

We can embed the original IDs into the Gauss integers in this closed disc.

Let $k$ be a small positive integer, like 2, 3, and let $\epsilon$ be a small positive real number. Let $z_1, \dots, z_k$ be pairwise co-prime Gauss integers satisfying $|z_j| \in ((2r)^{1/k}, (1+\epsilon)(2r)^{1/k}]$, so that $|z_1 \cdots z_k| > 2r$. We define the category mapping

$$C\colon \mathbb{Z}/N \hookrightarrow \mathbb{Z}[i] \cap \bar{D}(0, r) \to \prod_{j=1}^{k} \mathbb{Z}[i]/(z_j),$$

where $(z_j)$ means the principal ideal of $\mathbb{Z}[i]$ generated by $z_j$. $(z_1, \dots, z_k)$ is called the moduli of this Gauss CC, and we have the following theorem.

###### Theorem 2.4.

The Gauss CC has the minimal collision property.

Proof. From the method of choosing the $z_j$, we know $\lceil \log_n N \rceil - 1 = k - 1$, where $n = \max_j \#(\mathbb{Z}[i]/(z_j))$. Hence we need only show that for any two different $x, x' \in \mathbb{Z}[i] \cap \bar{D}(0, r)$, there are at most $k - 1$ indices $j$ such that $x \equiv x' \pmod{z_j}$.

Suppose $x \equiv x' \pmod{z_j}$ for all $j = 1, \dots, k$, i.e. $z_j \mid (x - x')$ for all $j$.

Because $z_1, \dots, z_k$ are pairwise co-prime Gauss integers, the ideals $(z_1), \dots, (z_k)$ are pairwise co-prime ideals of $\mathbb{Z}[i]$, and we have $(z_1) \cap \dots \cap (z_k) = (z_1 \cdots z_k)$. Hence $z_1 \cdots z_k \mid (x - x')$, and hence $x - x' = 0$ or $|x - x'| \ge |z_1 \cdots z_k| > 2r$. But we know $x, x' \in \bar{D}(0, r)$, hence $|x - x'| \le 2r$. Hence $x = x'$, a contradiction. ∎
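A toy Gauss CC can be sketched with Python's built-in complex numbers, reducing modulo a Gaussian integer by nearest-integer quotient (illustrative; `gauss_mod` and the chosen moduli are ours, and the asserted injectivity is exactly the disc argument above):

```python
def gauss_mod(x, z):
    """Remainder of Gaussian integer x modulo z: x - z*q, where q is x/z
    rounded componentwise to the nearest Gaussian integer."""
    n = z.real**2 + z.imag**2                    # |z|^2, an integer
    q = x * z.conjugate()
    qr = complex(round(q.real / n), round(q.imag / n))
    return x - z * qr

# Two non-associate Gaussian primes of norm 13, hence co-prime in Z[i].
z1, z2 = complex(2, 3), complex(3, 2)
# Ids = Gauss integers in the closed disc of radius 6: any two differ by at
# most 12 < |z1*z2| = 13, so equal codewords force equal ids.
disc = [complex(a, b) for a in range(-6, 7) for b in range(-6, 7)
        if a * a + b * b <= 36]
codes = [(gauss_mod(x, z1), gauss_mod(x, z2)) for x in disc]
assert len(set(codes)) == len(disc)
```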

## 3 Application to label encoding

For an $N$-class classification problem, we use a CC

$$C = (C_1, \dots, C_k)\colon \mathbb{Z}/N \to \prod_{i=1}^{k} \mathbb{Z}/n_i$$

to reduce the $N$-class classification problem to $k$ classification problems of middle size. Suppose the training dataset is $\{(x_s, y_s)\}_s$, where $x_s$ is the feature and $y_s \in \mathbb{Z}/N$ is the label; then we train a base learner on the dataset $\{(x_s, C_i(y_s))\}_s$ for every $i = 1, \dots, k$. We call this the label encoding method.
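The construction of the $k$ derived training sets can be sketched as follows (a minimal sketch; `label_encode_datasets`, the toy data and the 3-site remainder CC are our illustrative choices, not the paper's experimental setup):

```python
def label_encode_datasets(X, y, site_fns):
    """Label encoding: from one N-class dataset (X, y), build one middle-size
    dataset per site-position function; base learner i is then trained on
    (X, [site_fns[i](label) for label in y])."""
    return [(X, [f(label) for label in y]) for f in site_fns]

# Hypothetical toy data with a 3-site remainder CC (pairwise co-prime moduli).
X = [[0.1], [0.2], [0.3]]
y = [12, 40, 33]
site_fns = [lambda t, m=m: t % m for m in (5, 7, 9)]

datasets = label_encode_datasets(X, y, site_fns)
assert datasets[0][1] == [2, 0, 3]   # labels mod 5
assert datasets[2][1] == [3, 4, 6]   # labels mod 9
```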

A CC good for label encoding should satisfy the following properties:

Classes highly separable. For two different labels $y \neq y'$, there should be as many site-position functions $C_i$ as possible such that $C_i(y) \neq C_i(y')$.

Base learners independence. When $y$ is selected uniformly at random from $\mathbb{Z}/N$, the mutual information of $C_i(y)$ and $C_j(y)$ is approximately 0 for $i \neq j$.

The property “classes highly separable” ensures that for any two different classes, as many base learners as possible are trained to separate them. The property “base learners independence” ensures that the information shared by any two different base learners is small.

Remark. These properties are the analogues, in the non-binary situation, of the “row separation” and “column separation” properties of ECOC (Dietterich_and_Bakiri).

The minimal collision property ensures that the CCs satisfy “classes highly separable”; we will show that they satisfy “base learners independence” as well.

### 3.1 Polynomial CC

We now prove that the Polynomial CC also satisfies the property “base learners independence”.

###### Theorem 3.1.

If $x$ is a random variable with uniform distribution on $\mathbb{Z}/p^m$, and $C_i(x)$ and $C_j(x)$ are the $i$-site and $j$-site values ($i \neq j$) of the codeword of $x$ under the Polynomial CC described above, then the mutual information of $C_i(x)$ and $C_j(x)$ approaches $0$ as $p$ grows.

Proof.

For any $x$ in $\mathbb{Z}/p^m$, the $i$-th site value is $f_x(\alpha_i) = \sum_{j=0}^{m-1} a_j \alpha_i^j$, where $a_0, \dots, a_{m-1}$ are the coefficients of the $p$-adic representation of $x$. We denote this map by $C_i$.

Let , consider the following commutative diagram: