Machine learning on images using a string-distance
We present a new method for image feature-extraction based on representing an image by a finite-dimensional vector of distances that measure how different the image is from a set of image prototypes. We use the recently introduced Universal Image Distance (UID) [1] to measure the similarity between an image and a prototype image. The advantage of the UID is that no domain knowledge and no image analysis are needed. Each image is represented by a finite-dimensional feature vector whose components are the UID values between the image and a finite set of image prototypes drawn from each of the feature categories. The method is automatic: once the user selects the prototype images, the feature vectors are computed without any further image analysis. The prototype images can be of a different size, in particular smaller than the images being classified. Based on a collection of such cases, any supervised or unsupervised learning algorithm can be trained to produce an image classifier or an image cluster analysis. In this paper we present the image feature-extraction method and use it in several supervised and unsupervised learning experiments on satellite image data.
Image classification research aims at finding representations of images that can be automatically used to categorize images into a finite set of classes. Typically, algorithms that classify images require some form of pre-processing of an image prior to classification. This process may involve extracting relevant features and segmenting images into sub-components based on some prior knowledge about their context [2, 3].
In [1] we introduced a new distance function, called the Universal Image Distance (UID), for measuring the distance between two images. The UID first transforms each of the two images into a string of characters from a finite alphabet and then uses the string distance of [4] to give the distance value between the images. According to [4], the distance between two strings X and Y is a normalized difference between the complexity of the concatenation XY of the strings and the minimal complexity of each of X and Y. By complexity of a string we mean the Lempel-Ziv complexity [5].
In the current paper we use the UID to create a finite-dimensional representation of an image. Each component of this vector acts as a feature that measures how different the image is from a particular image prototype. One of the advantages of the UID is that it can compare two images of different sizes, and thus the prototypes which represent the different feature categories may be relatively small. For instance, the prototypes of an urban category can be small images of various parts of cities.
In this paper we introduce a process that converts an image into a labeled case (feature vector). Doing this systematically for a set of images, each labeled by its class, yields a data set which can be used to train any supervised or unsupervised learning algorithm. After describing our method in detail we report on the accuracy results of several classification-learning algorithms on such data. As an example, we apply our method to satellite image classification and clustering.
We note that our process for converting an image into a finite-dimensional feature vector is very straightforward and does not involve any domain knowledge about the images. In contrast to other image classification algorithms that extract features based on sophisticated mathematical analysis, such as analyzing the texture or spatial properties of an image, doing edge-detection, or any of the many other methods employed in the immense research literature on image processing, our approach is very basic and universal. It is based on the complexity of the 'raw' string-representation of an image. Our method extracts features automatically just by computing distances from a set of prototypes. It is therefore scalable and can be implemented using parallel processing techniques, for instance on system-on-chip and FPGA hardware [6, 7, 8].
Our method extracts image features that are unbiased in the sense that they do not employ any heuristics in contrast to other common image-processing techniques. The features that we extract are based on information implicit in the image and obtained via a complexity-based UID distance which is an information-theoretic measure. In our method, the feature vector representation of an image is based on the distance of the image from some fixed set of representative class-prototypes that are initially and only once picked by a human user running the learning algorithm.
Let us now summarize the organization of the paper: in section 2 we review the definition of LZ-complexity and a few string distances. In section 3 we define the UID distance. In section 4 we describe the algorithm for selecting class prototypes. In section 5 we describe the algorithm that generates a feature-vector representation of an image. In section 6 we discuss the classification-learning method, in section 7 we report the experimental setup and classification accuracy results, and we then conclude.
2 LZ-complexity and string distances
The UID distance function [1] is based on the LZ-complexity of a string. The definition of this complexity follows [5]: let S, Q and R be strings of characters defined over a finite alphabet A. Denote by l(S) the length of S, and by S(i) the i-th element of S. We denote by S(i, j) the substring of S which consists of the characters of S between positions i and j. An extension R = SQ of S is reproducible from S (denoted S → R) if there exists an integer p ≤ l(S) such that R(l(S) + k) = R(p + k − 1) for k = 1, …, l(Q). For example, aacgt → aacgtcgtcg with p = 3 and aacgt → aacgtac with p = 2. R is obtained from S (the seed) by copying elements, starting from position p of R, until the end of R is reached.
A string S is producible from its prefix S(1, j) (denoted S(1, j) ⇒ S) if S(1, j) → S(1, l(S) − 1). For example, aacgt ⇒ aacgtac and aacgt ⇒ aacgtacc, both with pointer p = 2. A production thus allows an extra 'different' character at the end of the copying process, which is not permitted in a reproduction.
Any string S can be built using a production process where at its i-th step we have the production S(1, h_{i−1}) ⇒ S(1, h_i), where h_i is the position of the last character produced at the i-th step. (Note that h_m = l(S) for an m-step process.)
An m-step production process of S results in the parsing S(1, h_1) · S(h_1 + 1, h_2) · … · S(h_{m−1} + 1, h_m) of S, which is called a history of S and is denoted H(S); S(h_{i−1} + 1, h_i) is called the i-th component of H(S). For example, for S = aacgtacc we have H(S) = a · ac · g · t · acc as a history of S.
If S(1, h_i) is not reproducible from S(1, h_{i−1}), then the component S(h_{i−1} + 1, h_i) is called exhaustive, meaning that the copying process cannot be continued and the component must be halted with a single-character innovation. A history is called exhaustive if each of its components is exhaustive, and every string S has a unique exhaustive history, denoted H_E(S).
Let us denote by c_H(S) the number of components in a history H of S. Then the LZ complexity of S is c(S) = min { c_H(S) }, where the minimum is taken over all histories of S. It can be shown that c(S) = c_{H_E}(S), i.e., the minimum is attained by the number of components in the exhaustive history of S.
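To make the parsing concrete, the exhaustive history can be counted in one left-to-right pass. The following Python sketch is a variant of the classical Kaspar–Schuster procedure for the 1976 Lempel–Ziv complexity; the function name and variable names are ours, not part of the paper's algorithms.

```python
def lz_complexity(s):
    """Number of components in the exhaustive history of s (LZ76 complexity)."""
    n = len(s)
    if n == 0:
        return 0
    c = 1          # the first character is always its own component
    l = 1          # length of the prefix parsed so far
    i = 0          # candidate start position of the copy source
    k = 1          # current match length
    k_max = 1      # longest match found while closing the current component
    while l + k <= n:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1                      # copying continues
        else:
            k_max = max(k, k_max)
            i += 1
            if i == l:                  # no source reproduces further:
                c += 1                  # close the component (with innovation)
                l += k_max
                i, k, k_max = 0, 1, 1
            else:
                k = 1                   # try the next copy source
    if k > 1:                           # string ended mid-copy: final component
        c += 1
    return c
```

For S = aacgtacc the exhaustive history a · ac · g · t · acc has five components, so `lz_complexity("aacgtacc")` returns 5, and for a run such as aaaa (history a · aaa) it returns 2.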
A distance for strings based on the LZ-complexity was introduced in [4] and is defined as follows: given two strings X and Y, denote by XY their concatenation, then define

d(X, Y) := max { c(XY) − c(X), c(YX) − c(Y) }.

As in [4] we use the following normalized distance function:

d(X, Y) := max { c(XY) − c(X), c(YX) − c(Y) } / max { c(X), c(Y) }.   (1)

We note in passing that (1) resembles the normalized compression distance of [9], except that here we do not use a compressor but rather the LZ-complexity of the string. Note that d is not a metric, since it does not satisfy the triangle inequality, and a distance value of 0 implies that the two strings are close but not necessarily identical.
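Definition (1) transcribes directly into code. The sketch below assumes an LZ76 complexity routine as in the previous section (repeated here so the snippet is self-contained); the function names are ours.

```python
def lz_complexity(s):
    # LZ76 complexity: number of components in the exhaustive history of s.
    n = len(s)
    if n == 0:
        return 0
    c, l, i, k, k_max = 1, 1, 0, 1, 1
    while l + k <= n:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
        else:
            k_max = max(k, k_max)
            i += 1
            if i == l:
                c, l = c + 1, l + k_max
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c + (1 if k > 1 else 0)

def d(x, y):
    # Unnormalized LZ-complexity distance of Sayood and Otu.
    cx, cy = lz_complexity(x), lz_complexity(y)
    return max(lz_complexity(x + y) - cx, lz_complexity(y + x) - cy)

def d_norm(x, y):
    # Normalized distance (1): divide by the larger single-string complexity.
    return d(x, y) / max(lz_complexity(x), lz_complexity(y))
```

For example, d("aaaaaaa", "aaaaaaa") is 0, illustrating that a zero distance does not certify identity in general, while d_norm("aaaaaaa", "bbbbbbb") is 0.5, since concatenating two unrelated strings adds new components.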
This distance is universal in the sense that it is not based on some specific representation of a string (such as the alphabet of symbols), nor on heuristics that are common to other string distances, e.g., edit-distances [10]. Instead it relies only on the string's LZ-complexity, which is a purely information-theoretic quantity independent of the string's context or representation.
3 Universal Image Distance
Based on the distance d of (1) we now define a distance between images. The idea is to convert each of two images I1 and I2 into strings X1 and X2 of characters from a finite alphabet of symbols. Once in string format, we use d(X1, X2) as the distance between I1 and I2. The details of this process are described in Algorithm 1 below.
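The exact image-to-string conversion of Algorithm 1 is not reproduced in this excerpt. The following sketch shows one plausible conversion, in which each grayscale pixel is quantized to one of a few levels and the levels are mapped to letters in row-major order; the level count, alphabet and function name are our assumptions.

```python
def image_to_string(img, alphabet="abcdefgh"):
    """Convert a 2-D grayscale image (pixel values 0..255) to a string by
    quantizing each pixel to len(alphabet) levels, scanning row-major."""
    levels = len(alphabet)
    return "".join(
        alphabet[min(v * levels // 256, levels - 1)]
        for row in img for v in row
    )

# Two tiny "images" of different shapes: the string distance between
# image_to_string outputs is well defined even when the sizes differ,
# which is what lets prototypes be much smaller than full images.
x1 = image_to_string([[0, 255], [128, 64]])   # "ahec"
x2 = image_to_string([[0, 255, 128]])         # "ahe"
```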
In the next section we describe the process of selecting the image prototypes.
4 Prototype selection
In this section we describe the algorithm for selecting image prototypes from each of the feature categories. This process runs only once, before the stage of converting the images into finite-dimensional vectors; that is, it does not run once per image but once for all images. By a sub-image of an image we mean any rectangular image obtained by placing a window over the image such that the window is totally enclosed by the image.
From the theory of pattern recognition it is known that the dimensionality of a feature vector is usually taken to be small compared to the size of the training data. A larger number of prototypes gives a more accurate feature representation of the image, but increases the running time of Algorithm 3 (described below).
The convergence of Algorithm 2 is based on the user's ability to select good prototype images. Our experiments indicate that this is easily achieved, primarily because the UID permits prototypes which are considerably smaller, and hence simpler, than the full images; in our experiments we used a single small prototype size for all feature categories. This makes it easy for a user to quickly choose typical representative prototypes from every feature category. In this way it is easy to find informative prototypes, that is, prototypes that are distant when they belong to different feature categories and close when they belong to the same feature category. Thus Algorithm 2 typically converges rapidly.
As an example, Figure 1(a) displays prototypes selected by a user from a corpus of satellite images. The user labeled three prototypes as representative of each of the feature categories urban, sea, roads and arid. The user easily found these representative prototypes, since a small picture suffices to capture a typical instance of a feature. The dendrogram produced in step 4 of Algorithm 2 for this set of prototypes is displayed in Figure 1(b). Four clusters were found, one per feature category, which indicates that the prototypes selected in Algorithm 2 are good.
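The informativeness criterion above (small within-category distances, large between-category distances) can be checked mechanically once the pairwise UID matrix of the prototypes is computed. A minimal sketch follows; the function name and the simple mean-based test are our own illustration, not step 4 of Algorithm 2, which uses a dendrogram instead.

```python
def prototypes_informative(dist, labels):
    """dist: symmetric matrix of pairwise UID values between prototypes;
    labels: category label of each prototype. Returns True when the mean
    within-category distance is below the mean between-category distance."""
    within, between = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            (within if labels[i] == labels[j] else between).append(dist[i][j])
    return sum(within) / len(within) < sum(between) / len(between)
```

With two urban and two sea prototypes, for instance, the four between-category entries of the matrix should dominate the two within-category entries; if they do not, the user is asked to pick better prototypes.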
5 Image feature-representation
In the previous section we described Algorithm 2, by which the prototypes are manually selected. These prototypes are now used to create a feature-vector representation of an image, as described in Algorithm 3 below. (In [1] we used a similar algorithm, UIC, to soft-classify an image, whilst here we use it only to produce a feature-vector representation of an image, which later serves as a single labeled case for training a supervised learning algorithm, or as a single unlabeled case for an unsupervised algorithm.)
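The feature-vector construction can be sketched as follows. Here uid stands for the image distance of Algorithm 1 and is passed in as a callable; whether Algorithm 3 aggregates distances per category is not specified in this excerpt, so this sketch keeps one component per prototype.

```python
def feature_vector(image_string, prototypes, uid):
    """prototypes: list of (category, prototype_string) pairs, fixed once by
    the user. Returns one UID value per prototype, in a fixed order, so every
    image is mapped to a vector of the same dimension."""
    return [uid(image_string, p) for _, p in prototypes]

# Usage with a dummy stand-in distance (absolute length difference),
# just to show the shape of the output; the real uid is d of (1).
protos = [("urban", "abab"), ("sea", "cccc")]
vec = feature_vector("ababab", protos, lambda x, y: abs(len(x) - len(y)))
```

Because the prototype list is fixed once, every image in the corpus is mapped to a vector of the same dimension, which is exactly what downstream learners require.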
6 Supervised and unsupervised learning on images
Given a corpus of images and a set of labeled prototypes, we use Algorithm 3 to generate the feature vector corresponding to each image in the corpus. At this point we have a database, of size equal to the number of images, which consists of the feature vectors of all the images. This database can be used for unsupervised learning, for instance to discover interesting clusters of images. It can also be used for supervised learning, provided that each of the cases can be labeled with a value of some target class variable, which in general may be different from the feature categories. We call the feature-vector database augmented with the corresponding target class values the labeled database. The following Algorithm 4 describes the classification-learning procedure on this labeled database.
7 Experimental setup and results
We created a corpus of fixed-size images from Google Earth© of various types of areas (Figure 2 displays a few scaled-down examples of such images). From these images we let a user define four feature categories: sea, urban, arid and roads, and choose three relatively small image prototypes, all of the same fixed pixel size, from each feature category; that is, we ran Algorithm 2 with four categories and three prototypes per category. We then ran Algorithm 3 to generate the feature vector for each image in the corpus and obtained an unlabeled database.
We then let the user label the images by a target variable Humidity with two possible values: an image is labeled as low humidity or as higher humidity. We note that an image of a low-humidity region may be in an arid (dry) area, or in a higher-elevation area which is not necessarily arid. Since elevation information is not available in the feature categories that the user has chosen, the classification problem is hard: the learning algorithm needs to discover the dependency between humid regions and areas characterized only by the above four feature categories.
With this labeling information at hand we produced the labeled database. We used Algorithm 4 to learn an image classifier with target Humidity. As learning algorithms we used the following standard supervised algorithms: J48 and CART, which learn decision trees, NaiveBayes, and a Multi-Layer Perceptron (backpropagation), all of which are available in the WEKA© toolkit.
We performed cross-validation and compared the accuracies of these algorithms to a baseline classifier (denoted ZeroR), which has a single decision that corresponds to the class value with the highest prior empirical probability. As seen in Table 1 (generated by WEKA©), J48, CART, NaiveBayes and Backpropagation performed with accuracies of 86.50%, 81.50%, 89.25% and 87.25%, respectively, compared to 50.00% achieved by the baseline classifier. We conclude that all four learning algorithms are significantly better than the baseline classifier, based on a paired T-test.
Table 1: Classification accuracy (%) for target Humidity:

  (1) rules.ZeroR                                                                   50.00
  (2) trees.J48 '-C 0.25 -M 2'                                                      86.50
  (3) trees.SimpleCart '-S 1 -M 2.0 -N 5 -C 1.0'                                    81.50
  (4) bayes.NaiveBayes                                                              89.25
  (5) functions.MultilayerPerceptron '-L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a'    87.25

(Markers in WEKA's output denote a statistically significant improvement or degradation relative to the baseline (1).)
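The WEKA runs themselves are not reproduced here, but the baseline is easy to state precisely: ZeroR always predicts the majority class of the training fold. A stdlib-only sketch of that baseline under k-fold evaluation follows; the fold count and all function names are our own, not WEKA's.

```python
from collections import Counter

def zero_r(train_X, train_y):
    """Return a classifier that always predicts the majority training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda _features: majority

def k_fold_accuracy(X, y, fit, k=10):
    """Average accuracy over k contiguous folds of a classifier factory
    `fit`, which receives the training split and returns a predictor."""
    n = len(y)
    scores = []
    for f in range(k):
        lo, hi = f * n // k, (f + 1) * n // k       # test fold [lo, hi)
        clf = fit(X[:lo] + X[hi:], y[:lo] + y[hi:])  # train on the rest
        hits = sum(clf(X[i]) == y[i] for i in range(lo, hi))
        scores.append(hits / (hi - lo))
    return sum(scores) / k
```

With an even class prior, ZeroR scores about 50%, consistent with the 50.00% baseline in Table 1; any learner that exploits the UID features must beat this figure to be of interest.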
Next, we performed clustering on the unlabeled database . Using the k-means algorithm, we obtained 3 significant clusters, shown in Table 2.
The first cluster captures images of highly urban areas next to concentrations of roads, highways and interchanges, while the second cluster contains less populated (urban) areas in arid locations, with no sea features at all and a very low concentration of roads. The third cluster captures the coastal areas; here we see a mixture of urban areas (less populated than in the first cluster) with roads and an extremely low percentage of arid land.
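The clustering itself was run with an off-the-shelf k-means implementation; as an illustration of what that entails on the UID feature vectors, here is a minimal stdlib-only k-means sketch (the deterministic initialization, iteration count and function name are our own simplifications).

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means: points is a list of equal-length feature vectors.
    Returns the cluster index assigned to each point."""
    centers = [list(p) for p in points[:k]]   # deterministic init: first k points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center in Euclidean distance.
        for idx, p in enumerate(points):
            assign[idx] = min(range(k), key=lambda c: math.dist(p, centers[c]))
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for idx, p in enumerate(points) if assign[idx] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```

On well-separated feature vectors (e.g., low distance to urban prototypes versus low distance to sea prototypes), the assignment stabilizes after a few iterations, which is the behavior the three clusters above rely on.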
The fact that such interesting knowledge can be extracted from raw images using our feature-extraction method is very significant since as mentioned above our method is fully automatic and requires no image analysis or any sophisticated preprocessing stages that are common in image pattern analysis.
8 Conclusions

We introduced a method for automatically defining and measuring features of colored images. The method is based on a universal image distance that is measured by computing the complexity of the string representations of two images and of their concatenation. An image is represented by a feature vector which consists of the distances from the image to a fixed set of small image prototypes, defined once by a user. There is no need for any sophisticated mathematical image analysis or pre-processing, since the universal image distance regards the image as a string of symbols which contains all the relevant information of the image. The simplicity of our method makes it attractive for fast and scalable implementation, for instance on a special-purpose hardware acceleration chip. We applied our method to supervised and unsupervised machine learning on satellite images. The results show that standard machine learning algorithms perform well based on our feature-vector representation of the images.
[1] U.A. Chester and J. Ratsaby. Universal distance measure for images. In Proc. 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel (IEEEI), pages 1–4, Nov. 2012.
[2] M.J. Canty. Image Analysis, Classification and Change Detection in Remote Sensing: With Algorithms for ENVI/IDL. CRC/Taylor & Francis, 2007.
[3] T.M. Lillesand, R.W. Kiefer, and J.W. Chipman. Remote Sensing and Image Interpretation. John Wiley & Sons, 2008.
[4] K. Sayood and H.H. Otu. A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19(16):2122–2130, 2003.
[5] J. Ziv and A. Lempel. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75–81, 1976.
[6] J. Ratsaby and V. Sirota. FPGA-based data compressor based on prediction by partial matching. In Proc. 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel (IEEEI), pages 1–5, Nov. 2012.
[7] J. Ratsaby and D. Zavielov. An FPGA-based pattern classifier using data compression. In Proc. IEEE Convention of Electrical and Electronics Engineers in Israel, Eilat, Nov. 17–20, pages 320–324, 2010.
[8] G. Kaspi and J. Ratsaby. Parallel processing algorithm for Bayesian network inference. In Proc. 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel (IEEEI), pages 1–5, Nov. 2012.
[9] R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.
[10] M. Deza and E. Deza. Encyclopedia of Distances. Springer-Verlag, 2009.