Machine learning on images using a string-distance

Uzi Chester and Joel Ratsaby¹
Electrical and Electronics Engineering Department, Ariel University of Samaria, Ariel 40700, Israel

¹Corresponding author.

We present a new method for image feature-extraction which is based on representing an image by a finite-dimensional vector of distances that measure how different the image is from a set of image prototypes. We use the recently introduced Universal Image Distance (UID) [1] to measure the similarity between an image and a prototype image. The advantage of using the UID is that no domain knowledge and no image analysis are needed. Each image is represented by a finite-dimensional feature vector whose components are the UID values between the image and a finite set of image prototypes from each of the feature categories. The method is automatic: once the user selects the prototype images, the feature vectors are computed without any further image analysis. The prototype images may be of different sizes, in particular smaller than the images being represented. Based on a collection of such cases, any supervised or unsupervised learning algorithm can be used to train and produce an image classifier or an image cluster analysis. In this paper we present the image feature-extraction method and apply it to several supervised and unsupervised learning experiments on satellite image data.

1 Introduction

Image classification research aims at finding representations of images that can be automatically used to categorize images into a finite set of classes. Typically, algorithms that classify images require some form of pre-processing of an image prior to classification. This process may involve extracting relevant features and segmenting images into sub-components based on some prior knowledge about their context [2, 3].

In [1] we introduced a new distance function, called the Universal Image Distance (UID), for measuring the distance between two images. The UID first transforms each of the two images into a string of characters from a finite alphabet and then uses the string distance of [4] to give the distance value between the images. According to [4], the distance between two strings X and Y is a normalized difference between the complexity of the concatenation XY and the minimal complexity of each of X and Y. By complexity of a string we mean its Lempel-Ziv (LZ) complexity [5].

In the current paper we use the UID to create a finite-dimensional representation of an image. The i-th component of this vector acts as a feature that measures how different the image is from the i-th image prototype. One of the advantages of the UID is that it can compare two images of different sizes, and thus the prototypes which are representative of the different feature categories may be relatively small. For instance, the prototypes of an urban category can be small images of various parts of cities.

In this paper we introduce a process to convert an image into a labeled case (feature vector). Doing this systematically for a set of images, each labeled by its class, yields a data set which can be used for training any supervised or unsupervised learning algorithm. After describing our method in detail we report the accuracy results of several classification-learning algorithms on such data. As an example, we apply our method to satellite image classification and clustering.

We note that our process for converting an image into a finite-dimensional feature vector is very straightforward and does not involve any domain knowledge about the images. In contrast to other image classification algorithms that extract features based on sophisticated mathematical analysis, such as analyzing the texture or spatial properties of an image, doing edge-detection, or any of the many other methods employed in the immense research literature on image processing, our approach is basic and universal. It is based on the complexity of the 'raw' string-representation of an image. Our method extracts features automatically just by computing distances from a set of prototypes. It is therefore scalable and can be implemented using parallel processing techniques, for instance on system-on-chip and FPGA hardware [6, 7, 8].

Our method extracts image features that are unbiased in the sense that they do not employ any heuristics, in contrast to other common image-processing techniques [2]. The features that we extract are based on information implicit in the image and obtained via the complexity-based UID distance, which is an information-theoretic measure. In our method, the feature-vector representation of an image is based on the distances of the image from a fixed set of representative class-prototypes that are picked once, at the start, by the human user running the learning algorithm.

Let us now summarize the organization of the paper: in section 2 we review the definitions of LZ-complexity and a few string distances. In section 3 we define the UID distance. In section 4 we describe the algorithm for selecting class prototypes. In section 5 we describe the algorithm that generates a feature-vector representation of an image. In section 6 we discuss the classification learning method, and in section 7 we conclude by reporting the classification accuracy results.

2 LZ-complexity and string distances

The UID distance function [1] is based on the LZ-complexity of a string. The definition of this complexity follows [5]: let S, Q and R be strings of characters defined over the alphabet A. Denote by l(S) the length of S, and by S(i) the i-th element of S. We denote by S(i, j) the substring of S which consists of the characters of S between positions i and j. An extension R = SQ of S is reproducible from S (denoted S → R) if there exists an integer p ≤ l(S) such that Q(k) = R(p + k − 1) for k = 1, ..., l(Q). For example, abc → abcbcb with p = 2 and aab → aabaab with p = 1. R is obtained from S (the seed) by copying elements, starting from position p in S, to the end of R; the copying may extend past l(S), since copied elements are appended to R and may themselves be copied.

A string R is producible from its prefix S (denoted S ⇒ R) if S → R(1, l(R) − 1). For example, abc ⇒ abcbcx with pointer p = 2. The production adds an extra 'different' character at the end of the copying process, which is not permitted in a reproduction.

Any string S can be built using a production process where at its i-th step we have the production S(1, h_{i−1}) ⇒ S(1, h_i), where h_i is the location of the last character produced at the i-th step. (Note that S(1, 0) ⇒ S(1, 1), i.e., the first character is always producible from the empty string.)

An m-step production process of S results in a parsing of S in which H(S) = S(1, h_1) · S(h_1 + 1, h_2) ··· S(h_{m−1} + 1, h_m) is called the history of S, and H_i(S) = S(h_{i−1} + 1, h_i) is called the i-th component of H(S). For example, for S = aabab one history is H(S) = a · ab · ab.

If S(1, h_i) is not reproducible from S(1, h_i − 1), then the component H_i(S) is called exhaustive, meaning that the copying process cannot be continued and the component must be halted with a single-character innovation. A history is called exhaustive if each of its components (except possibly the last one) is exhaustive. Every string S has a unique exhaustive history [5].

Let us denote by c_H(S) the number of components in a history H of S. Then the LZ-complexity of S is c(S) = min_H c_H(S), where the minimum is over all histories of S. It can be shown that c(S) = c_E(S), where c_E(S) is the number of components in the exhaustive history of S.

A distance for strings based on the LZ-complexity was introduced in [4] and is defined as follows: given two strings X and Y, denote by XY their concatenation, and define

    d(X, Y) := c(XY) − min{c(X), c(Y)}.

As in [1], we use the following normalized distance function:

    d(X, Y) := ( c(XY) − min{c(X), c(Y)} ) / max{c(X), c(Y)}.    (1)

We note in passing that (1) resembles the normalized compression distance of [9], except that here we do not use a compressor but rather the LZ-complexity of a string. Note that d is not a metric since it does not satisfy the triangle inequality, and a distance of 0 implies that the two strings are close but not necessarily identical.

The distance d is universal in the sense that it is not based on some specific representation of a string (such as the alphabet of symbols), nor on heuristics that are common to other string distances, e.g., edit distances [10]. Instead it relies only on the string's LZ-complexity, which is purely an information quantity, independent of the string's context or representation.
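As an illustration of the definitions above, the exhaustive-history parsing and the normalized distance (1) can be sketched in Python. This is a direct, unoptimized reading of the definitions, not the authors' implementation, and the function names are ours:

```python
def lz_complexity(s):
    """Number of components in the exhaustive history of s (LZ-complexity).

    Greedy parsing: each component is the longest extension reproducible
    from the already-parsed prefix, plus one innovative character.
    """
    n = len(s)
    if n == 0:
        return 0
    c, i = 1, 1            # the first character is always its own component
    while i < n:
        k = 1
        # s[i:i+k] is reproducible from the prefix iff it occurs inside
        # s[:i+k-1]; any such occurrence starts before position i, so
        # self-overlapping copying is allowed, as in the definition
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        c += 1
        i += k             # copied part plus one innovation character
    return c

def d(x, y):
    """Normalized LZ-distance of equation (1)."""
    cx, cy, cxy = lz_complexity(x), lz_complexity(y), lz_complexity(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

For instance, lz_complexity('aabab') returns 3, matching the exhaustive history a · ab · ab above, and d returns 0 for identical strings and larger values for unrelated ones.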

3 Universal Image Distance

Based on d we now define a distance between images. The idea is to convert each of two images I1 and I2 into strings X and Y of characters from a finite alphabet of symbols. Once in string format, we use d(X, Y) as the distance between I1 and I2. The details of this process are described in Algorithm 1 below.

  1. Input: two color images I1, I2 in jpeg format (RGB representation).

  2. Transform the RGB matrices into gray-scale by forming a weighted sum of the R, G, and B components according to the formula 0.2989 R + 0.5870 G + 0.1140 B (the coefficients used by Matlab©'s rgb2gray). Each pixel is now a single numeric value in the range 0 to 255. We refer to this set of values as the alphabet and denote it by A.

  3. Scan each of the grayscale images from top left to bottom right and form a string of symbols from A. Denote the two strings by X and Y.

  4. Compute the LZ-complexities c(X), c(Y), and the complexity of their concatenation, c(XY).

  5. Output: UID(I1, I2) := d(X, Y).

Algorithm 1 UID distance measure
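The pipeline of Algorithm 1 can be sketched as follows. For self-containment an image is taken here as a plain list of rows of (R, G, B) tuples rather than a decoded JPEG, the grayscale weights are the standard rgb2gray coefficients, the LZ-complexity helper is repeated, and all function names are ours:

```python
def lz_complexity(s):
    # number of components in the exhaustive history of s (see section 2)
    c, i, n = 1, 1, len(s)
    while i < n:
        k = 1
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        c, i = c + 1, i + k
    return c

def to_gray_string(rgb_rows):
    """Step 2 and 3: grayscale each pixel by the weighted sum of its
    R, G, B components, then scan top-left to bottom-right, emitting
    one symbol per pixel from the 0..255 alphabet."""
    return ''.join(
        chr(int(round(0.2989 * r + 0.5870 * g + 0.1140 * b)))
        for row in rgb_rows for (r, g, b) in row)

def uid(img1, img2):
    """Steps 4 and 5: the normalized LZ-distance between the strings."""
    x, y = to_gray_string(img1), to_gray_string(img2)
    cx, cy, cxy = lz_complexity(x), lz_complexity(y), lz_complexity(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

On two copies of the same uniform image, uid returns 0.0; on two uniform images of different colors it returns a strictly positive value.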

In the next section we describe the process of selecting the image prototypes.

4 Prototype selection

In this section we describe the algorithm for selecting image prototypes from each of the feature categories F_1, ..., F_M. This process runs only once, before the stage of converting the images into finite-dimensional vectors; that is, it does not run once per image but once for all images. For an image I we denote by W ⊆ I a sub-image of I, where W can be any rectangular image obtained by placing a window over I such that the window is totally enclosed by I.

  1. Input: M image feature categories, and a corpus C_N of N unlabeled colored images.

  2. for (i = 1 to M) do

    1. Based on any of the images I in C_N, let the user select L_i prototype images {P_j^(i) : 1 ≤ j ≤ L_i} and set them as feature category F_i. Each prototype is contained in some image, P_j^(i) ⊆ I, and the size of P_j^(i) can vary; in particular it can be much smaller than the size of the images I in C_N.

    2. end for;

  3. Enumerate all the prototypes into a single unlabeled set {P_k : 1 ≤ k ≤ L}, where L = Σ_{i=1}^{M} L_i, and calculate the distance matrix H whose (k, l) component is the UID distance UID(P_k, P_l) between the unlabeled prototypes P_k and P_l.

  4. Run hierarchical clustering on H and obtain the associated dendrogram.

  5. If there are M clusters, with the i-th cluster consisting precisely of the L_i prototypes of category F_i, then terminate and go to step 7.

  6. Else go to step 2.

  7. Output: the set of labeled prototypes P_L = {P_j^(i) : 1 ≤ i ≤ M, 1 ≤ j ≤ L_i}, where L is the total number of prototypes.

Algorithm 2 Prototype selection
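The acceptance test in steps 3 to 5 of Algorithm 2 can be sketched as follows. For brevity the hierarchical clustering is replaced by a naive single-linkage merge down to M clusters rather than a full dendrogram, and all names are ours:

```python
def single_linkage(dist, m):
    """Merge the two closest clusters until m remain (naive single
    linkage). `dist` is a symmetric distance matrix over the prototypes;
    returns a list of index sets."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > m:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage distance: closest pair across clusters
                gap = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters

def prototypes_ok(dist, labels, m):
    """Step 5 of Algorithm 2: accept the prototypes iff each of the m
    clusters contains prototypes of exactly one feature category."""
    return all(len({labels[i] for i in cluster}) == 1
               for cluster in single_linkage(dist, m))
```

If prototypes_ok returns False, the user re-selects prototypes and the process repeats (step 6).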

From the theory of learning pattern recognition, it is known that the dimensionality of a feature vector is usually taken to be small compared to the number of training cases. A larger number of prototypes yields a more accurate feature representation of the image, but it increases the running time of Algorithm 3 (described below).

The convergence of Algorithm 2 relies on the user's ability to select good prototype images. We note that, from our experiments, this is easily achieved, primarily because the UID permits prototypes which are considerably smaller in size, and hence simpler, than the full images. For instance, in our experiments we used small prototypes of a fixed size for all feature categories. This makes it easy for a user to quickly choose typical representative prototypes from every feature category. In this way it is easy to find informative prototypes, that is, prototypes that are distant when they come from different feature categories and close when they come from the same feature category. Thus Algorithm 2 typically converges rapidly.

As an example, Figure 1(a) displays prototypes selected by a user from a corpus of satellite images. The user labeled three prototypes as representative of the feature category urban, three as representatives of the category sea, three as representative of the category roads, and three as representative of the category arid. The user easily found these representative prototypes, as it is easy to fit a typical example of each category inside a single small picture. The dendrogram produced in step 4 of Algorithm 2 for this set of prototypes is displayed in Figure 1(b). Four clusters were found, one per feature category, which indicates that the prototypes selected in Algorithm 2 are good.

(a) Labeled prototypes of the feature categories urban, sea, roads, and arid (each feature has three prototypes, starting from top left and moving right in sequence)
(b) Dendrogram produced in step 4 of Algorithm 2.
Figure 1: Prototypes and dendrogram of Algorithm 2

5 Image feature-representation

In the previous section we described Algorithm 2, by which the prototypes are manually selected. These prototypes are now used to create a feature-vector representation of an image, as described in Algorithm 3 below. (In [1] we used a similar algorithm, UIC, to soft-classify an image, whereas here we use it only to produce a feature-vector representation of an image, which later serves as a single labeled case for training any supervised learning algorithm, or a single unlabeled case for training an unsupervised algorithm.)

  1. Input: an image I to be represented on the feature categories F_1, ..., F_M, and a set of labeled prototype images P_L = {P_j^(i) : 1 ≤ i ≤ M, 1 ≤ j ≤ L_i} (obtained from Algorithm 2).

  2. Initialize the count variables c_i := 0, 1 ≤ i ≤ M.

  3. Let W be a rectangle of size equal to the maximum prototype size.

  4. Scan the window W across I from top-left to bottom-right in a non-overlapping way, and let the sequence of obtained sub-images of I be denoted I_1, ..., I_T.

  5. for (t = 1 to T) do

    1. for (i = 1 to M) do

      1. for (j = 1 to L_i) do

        1. Compute the distance UID(I_t, P_j^(i));

        2. end for;

      2. end for;

    2. Let i*(t) := argmin_{1 ≤ i ≤ M} min_{1 ≤ j ≤ L_i} UID(I_t, P_j^(i)); this is the decided feature category for sub-image I_t.

    3. Increment the count, c_{i*(t)} := c_{i*(t)} + 1.

    4. end for;

  6. Normalize the counts, v_i := c_i / Σ_{j=1}^{M} c_j, 1 ≤ i ≤ M.

  7. Output: the normalized vector v(I) := (v_1, ..., v_M) as the feature-vector representation of image I.

Algorithm 3 Feature-vector generation
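Algorithm 3 can be sketched as follows. The image is taken as a 2-D grid of grayscale symbols and each prototype as a flattened symbol string; the UID helpers from Algorithm 1 are repeated for self-containment, and all names are ours:

```python
def lz_complexity(s):
    # number of components in the exhaustive history of s (see section 2)
    c, i, n = 1, 1, len(s)
    while i < n:
        k = 1
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        c, i = c + 1, i + k
    return c

def uid(x, y):
    # normalized LZ-distance (1) between two symbol strings
    cx, cy = lz_complexity(x), lz_complexity(y)
    return (lz_complexity(x + y) - min(cx, cy)) / max(cx, cy)

def feature_vector(gray, protos, w, h):
    """Algorithm 3 sketch: `gray` is a grid (list of rows) of grayscale
    symbols, `protos` maps a category name to its list of prototype
    strings, and (w, h) is the scanning-window size (the maximum
    prototype size). Returns the normalized count vector."""
    counts = {cat: 0 for cat in protos}
    rows, cols = len(gray), len(gray[0])
    for r in range(0, rows - h + 1, h):          # non-overlapping scan
        for c in range(0, cols - w + 1, w):
            sub = ''.join(gray[r + dr][c + dc]
                          for dr in range(h) for dc in range(w))
            # decided category: the one with the closest prototype
            best = min(protos, key=lambda cat: min(uid(sub, p)
                                                   for p in protos[cat]))
            counts[best] += 1
    total = sum(counts.values())
    return {cat: counts[cat] / total for cat in counts}
```

On a toy image whose left half matches one category's prototype and whose right half matches another's, the output vector splits the mass evenly between the two categories.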

6 Supervised and unsupervised learning on images

Given a corpus C_N of images and a set P_L of labeled prototypes, we use Algorithm 3 to generate the feature vectors corresponding to each image in C_N. At this point we have a database of size N which consists of the feature vectors of all the images in C_N. This database can be used for unsupervised learning, for instance to discover interesting clusters of images. It can also be used for supervised learning, provided that each of the cases can be labeled with a value of some target class variable, which in general may differ from the feature categories. Let us denote by T the target class variable and by D_T the database which consists of the feature vectors of C_N together with the corresponding target class values. The following Algorithm 4 describes the learning procedure.

  1. Input: (1) a target class variable T taking values in a finite set of class categories, (2) a database D_T which is based on the M-dimensional feature vectors of the images, labeled with values of T, (3) any supervised learning algorithm A.

  2. Partition D_T using k-fold cross-validation into Training and Testing sets of cases.

  3. Train and test algorithm A and produce a classifier C which maps the feature space [0, 1]^M into the set of class categories.

  4. Define the image classifier as follows: given any image I, the classification is C(v(I)), where v(I) is the M-dimensional feature vector of I.

  5. Output: the classifier C.

Algorithm 4 Image classification learning
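A minimal stand-in for Algorithm 4 is sketched below. Instead of WEKA's learners it plugs in a toy 1-nearest-neighbour classifier, the k-fold partition is a simple shuffled split, and all names are ours:

```python
import random

def kfold_accuracy(cases, k, train, classify):
    """Algorithm 4 sketch: k-fold cross-validation of any supervised
    learner over (feature_vector, label) cases; returns mean accuracy."""
    cases = list(cases)
    random.Random(0).shuffle(cases)      # fixed seed for repeatability
    folds = [cases[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train_set = [c for j, f in enumerate(folds) if j != i for c in f]
        model = train(train_set)
        hits = sum(classify(model, x) == y for x, y in test)
        accs.append(hits / len(test))
    return sum(accs) / k

# Toy stand-in learner (1-nearest neighbour); the paper plugs in WEKA's
# J48, CART, NaiveBayes and MultilayerPerceptron here instead.
def train_1nn(train_set):
    return train_set                     # "training" just stores the cases

def classify_1nn(model, x):
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(model, key=lambda case: sqdist(case[0], x))[1]
```

On a cleanly separable synthetic labeling, this wrapper reports perfect cross-validated accuracy, illustrating the plumbing rather than any particular learner.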

7 Experimental setup and results

We created a corpus C_N of images from Google Earth© covering various types of areas (Figure 2 displays a few scaled-down examples of such images). From these images we let a user define four feature categories, sea, urban, arid and roads, and choose three relatively small image prototypes from each feature category; that is, we ran Algorithm 2 with M = 4 and L_i = 3 for all 1 ≤ i ≤ M. We then ran Algorithm 3 to generate the feature vectors of each image in the corpus and obtained an unlabeled database D.

Figure 2: Examples of images in the corpus

We then let the user label the images by a binary target variable Humidity: an image is labeled with one value if the area is of low humidity and with the other if it is of higher humidity. We note that an image of a low-humidity region may be in an arid (dry) area, or in higher-elevation areas which are not necessarily arid. Since elevation information is not available in the feature categories that the user has chosen, the classification problem is hard: the learning algorithm needs to discover the dependency between humid regions and areas characterized only by the above four feature categories.

With this labeling information at hand we produced the labeled database D_T. We used Algorithm 4 to learn an image classifier with target Humidity. As the learning algorithm we used the following standard supervised algorithms: J48 and CART, which learn decision trees, NaiveBayes, and a Multi-Layer Perceptron (backpropagation), all of which are available in the WEKA© toolkit.

We performed k-fold cross-validation and compared the accuracies to a baseline classifier (denoted as ZeroR) which has a single decision that corresponds to the class value with the highest prior empirical probability. As seen in Table 1 (generated by WEKA©), J48, CART, NaiveBayes and Backpropagation achieved accuracies of 86.5%, 81.5%, 89.25%, and 87.25%, respectively, compared to 50% achieved by the baseline classifier. All four learning algorithms are significantly better than the baseline classifier, based on a paired T-test.

Dataset                         (1)     (2)     (3)     (4)     (5)
Classify Image into Humidity:   50.00   86.50   81.50   89.25   87.25
◦, • statistically significant improvement or degradation
(1) rules.ZeroR
(2) trees.J48 '-C 0.25 -M 2'
(3) trees.SimpleCart '-S 1 -M 2.0 -N 5 -C 1.0'
(4) bayes.NaiveBayes
(5) functions.MultilayerPerceptron '-L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a'
Table 1: Percent correct results for classifying Humidity

Next, we performed clustering on the unlabeled database D of feature vectors. Using the k-means algorithm, we obtained 3 significant clusters, shown in Table 2.

Feature   Full data   Cluster 1   Cluster 2   Cluster 3
urban     0.3682      0.6219      0.1507      0.2407
sea       0.0490      0.0085      0.0000      0.1012
roads     0.4074      0.2873      0.0164      0.6550
arid      0.1754      0.0824      0.8329      0.0030
Table 2: k-means cluster centroids found on the unsupervised database of feature vectors

The first cluster captures images of highly urban areas adjacent to concentrations of roads, highways and interchanges, while the second cluster contains less populated (urban) areas in arid locations (with no sea feature at all) and a very low concentration of roads. The third cluster captures the coastal areas, where there can be a mixture of urban areas (less populated than in the first cluster) with roads and an extremely low percentage of arid land.
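The clustering step can be sketched with a plain Lloyd's k-means over the M-dimensional feature vectors. This is a generic stand-in, not the exact tool used to produce Table 2, and initialization from the first k points is for brevity only:

```python
def kmeans(points, k, iters=20):
    """Plain k-means (Lloyd's algorithm) on feature vectors given as
    lists of floats; returns (centers, groups). Deterministic
    initialization from the first k points, for illustration."""
    centers = [list(p) for p in points[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            groups[j].append(p)
        # recompute each center as the mean of its group
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups
```

On two well-separated bundles of feature vectors, the two recovered groups coincide with the bundles, analogously to the clusters of Table 2.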

The fact that such interesting knowledge can be extracted from raw images using our feature-extraction method is significant since, as mentioned above, our method is fully automatic and requires none of the image analysis or sophisticated preprocessing stages that are common in image pattern analysis.

8 Conclusion

We introduced a method for automatically defining and measuring features of colored images. The method is based on a universal image distance that is measured by computing the complexity of the string-representations of two images and of their concatenation. An image is represented by a feature vector which consists of the distances from the image to a fixed set of small image prototypes, defined once by a user. There is no need for any sophisticated mathematical image analysis or pre-processing, since the universal image distance regards the image as a string of symbols which contains all the relevant information of the image. The simplicity of our method makes it very attractive for fast and scalable implementation, for instance on a special-purpose hardware acceleration chip. We applied our method to supervised and unsupervised machine learning on satellite images. The results show that standard machine learning algorithms perform well based on our feature-vector representation of the images.


  • [1] U.A. Chester and J. Ratsaby. Universal distance measure for images. In Electrical & Electronics Engineers in Israel (IEEEI), 2012 IEEE 27th Convention of, pages 1–4, Nov. 2012.
  • [2] M.J. Canty. Image Analysis, Classification and Change Detection in Remote Sensing: With Algorithms for ENVI/IDL. CRC/Taylor & Francis, 2007.
  • [3] T.M. Lillesand, R.W. Kiefer, and J.W. Chipman. Remote Sensing and Image Interpretation. John Wiley & Sons, 2008.
  • [4] K. Sayood and H. H. Otu. A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19(16):2122–2130, 2003.
  • [5] J. Ziv and A. Lempel. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75–81, 1976.
  • [6] J. Ratsaby and V. Sirota. FPGA-based data compressor based on prediction by partial matching. In Electrical & Electronics Engineers in Israel (IEEEI), 2012 IEEE 27th Convention of, pages 1–5, Nov. 2012.
  • [7] J. Ratsaby and D. Zavielov. An FPGA-based pattern classifier using data compression. In Proc. of IEEE Convention of Electrical and Electronics Engineers in Israel, Eilat, Nov. 17–20, pages 320–324, 2010.
  • [8] G. Kaspi and J. Ratsaby. Parallel processing algorithm for Bayesian network inference. In Electrical & Electronics Engineers in Israel (IEEEI), 2012 IEEE 27th Convention of, pages 1–5, Nov. 2012.
  • [9] R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.
  • [10] M. Deza and E. Deza. Encyclopedia of Distances, volume 15 of Series in Computer Science. Springer-Verlag, 2009.