Learning Semantics for Image Annotation

Amara Tariq and Hassan Foroosh

Amara Tariq was with the Department of Computer Science, University of Central Florida, Orlando, FL 32816 USA at the time this project was conducted (email: amara_tariq@knights.ucf.edu). Hassan Foroosh is with the Department of Computer Science, University of Central Florida, Orlando, FL 32816 USA (email: foroosh@cs.ucf.edu).
Abstract

Image search and retrieval engines rely heavily on textual annotation in order to match word queries to a set of candidate images. A system that can automatically annotate images with meaningful text can be highly beneficial for such engines. Current approaches to developing such systems try to establish relationships between keywords and visual features of images. In this paper, we make three main contributions to this area: (i) we transform this problem from the low-level keyword space to the high-level semantics space that we refer to as the “image theme”; (ii) instead of treating each possible keyword independently, we use latent Dirichlet allocation to learn image themes from the associated texts in a training phase. Images are then annotated with image themes rather than keywords, using a modified continuous relevance model, which takes into account the spatial coherence and the visual continuity among images of a common theme. (iii) To achieve more coherent annotations among images of a common theme, we have integrated ConceptNet in learning the semantics of images, and hence augment image descriptions beyond the annotations provided by humans. Images are thus further annotated with a few of the most significant words of the prominent image theme. Our extensive experiments show that coherent theme-based image annotation using high-level semantics results in improved precision and recall as compared with equivalent classical keyword annotation systems.

Image Annotation, High-Level Image Semantics, Image Themes, ConceptNet

I Introduction

With the advancement in information search and retrieval techniques, annotation of images with keywords has been a popular area of research [125, 124, 126, 123, 122]. Annotations often contain content-related information such as objects and shapes/patterns present in the scene [35, 34, 130, 95, 102, 2, 1, 51, 3, 33, 49, 114, 98, 100, 106, 99], structural scene models [83, 30, 69, 10, 72, 78, 119, 85, 118, 14, 121, 76, 75, 77, 71, 70, 15, 9, 116], humans and their actions [112, 16, 11, 120, 108, 117, 17, 107, 110, 12, 13, 32, 109, 111, 8], or information about the location of the image [80, 79, 86, 87, 84, 88]. As a result, there is often a semantic gap between the annotations and the image, since the annotations deal with content and seldom with context/theme. From an application point of view, search engines and retrieval systems rely on annotation with textual data to match images with textual queries. Such queries may be used in applications such as image editing and post-production [43, 44, 113, 27, 128, 101, 4, 5, 64, 21, 129, 41], or matching places by alignment [63, 58, 28, 29, 6, 18, 19, 105, 59, 104, 25, 24, 62, 23, 60, 56, 103, 22, 20, 57, 55]. Images of low resolution and low quality may be improved by preprocessing methods [63, 58, 28, 29, 6, 18, 19, 105, 59, 104, 25, 24, 62, 23, 60, 56, 103, 22, 20, 57, 55] in order to identify content, or may benefit from camera pose and motion quantification methods for scene modeling [42, 40, 47, 81, 37, 39, 36, 73, 82, 46, 61, 74, 90, 91, 92, 7, 89, 96, 26, 38, 45]. However, such preprocessing and restoration methods cannot help in extracting high-level semantics about the images, i.e. image themes. On the other hand, most images uploaded on the web have some sort of accompanying textual information, e.g. the image caption, neighboring text on the same web-page, etc. These forms of textual information can be extremely noisy, or may depend on user input, as in image captions. The main task of automatic image annotation (AIA) is thus to develop a system that automatically generates keywords for input images. Generated keywords need to be meaningful enough to be used to match images to queries. Most of the previously developed approaches for AIA have been tested over the Corel5K dataset, used initially by Duygulu et al. [50]. The most popular techniques for AIA have used translation models from natural language processing (NLP) to establish relationships between low-level visual features and keywords.

Many of the previously popular techniques have the shortcoming of treating each keyword independently of all the other keywords. It is evident that the keywords used as annotations for an image are heavily correlated with each other. For example, if a certain image has been annotated with ‘people’, ‘sand’ and ‘water’, the chances of ‘beach’ being another correct annotation are much higher than those of ‘snow’. Any system which completely ignores this correlation between keywords misses an important piece of evidence. Over the years, several papers have been published attempting to incorporate the correlation between keywords into automatic image annotation [67, 127]. The techniques used to exploit this correlation range from expectation maximization to the incorporation of natural language processing (NLP) tools such as WordNet [67, 68]. In this paper, we exploit the correlation between keywords by using the higher-level semantics of the available annotations. Our technique is based on image theme modeling using Latent Dirichlet Allocation. Thus, we transform the problem of low-level keyword annotation into high-level image theme annotation.

Our motivation is based on the fact that low-level visual features may not provide sufficient visual cues for each object in the image to be identified separately, and hence used for annotation. Objects can be partially occluded from view or may occupy too small an area in the image to generate enough evidence in the form of low-level visual features [67, 127]. Overall, the semantic gap between visual features and meaningful annotations remains unbridgeable. But all these visual features, combined with their spatial information, can provide enough information to identify a theme for an image. The question is therefore how we can find a good annotation for that theme. Since the Corel5K dataset provides only limited keyword annotations for each image, we decided to use the IAPR TC 12 dataset, because it provides each image with a more complete description in sentences. We have used Latent Dirichlet Allocation (LDA) to perform image theme modeling over the descriptions of these images. LDA provides a means of modeling all image themes present in the textual descriptions of images in the form of word distributions. The image theme models generated through this process are based on the words used in the documents provided for training of the system. Therefore, these image themes implicitly employ the correlation between words. We use these image themes for annotation of our training dataset, and then use a relevance model to establish the relationship between the annotated images and the visual themes. We thus transform the problem of relevance between keywords and visual objects into one between image themes and visual objects. Later, we tag each image with the most significant words of each image theme model associated with the image. Thus, our final output is similar to that of other popular image annotation systems and can be directly compared against them.

An important advantage of annotation with image themes versus keywords is that image themes can be elaborated using NLP resources like ConceptNet [94]. The use of ConceptNet shows that image semantics, when represented by the significant words of each image theme, are readily understandable and meaningful for humans. Therefore, annotating images with even the top few words of an image theme will be helpful for humans when searching for images. The elaborated image themes may contain words which are not present in the ground-truth image descriptions available with the datasets, but which provide additional contextual knowledge about the images. We have demonstrated this effect on a smaller part of the test datasets, through manual tagging.
We have used a relevance model similar to the Continuous Relevance Model (CRM) to annotate images with image themes. Our relevance model has two main distinctive features: (i) we take into account the spatial position of visual features; (ii) we incorporate image clustering based on visual features. Each image cluster contains images with a certain level of similarity between their visual features. The size of the pool of possible image themes for each of these clusters is smaller than the overall number of image themes for the whole dataset. Moreover, these image themes have some level of similarity between them, as they come from visually similar images. We show that incorporating spatial information and using clusters provides better performance for image annotation.

II Related Work

The idea of annotating images with keywords has been extensively studied in the literature using different approaches, with most papers using the Corel 5K dataset as their benchmark [50, 66, 93]. All of these approaches essentially try to learn the relationships between words and image features. Zhang et al. have provided a comprehensive review of the popular techniques used for automatic image annotation [131].

Relevance models from machine translation were introduced to solve this problem by Jeon et al. [66]. To apply the relevance models, it is necessary to represent images in terms of visual features in a manner similar to the way documents are represented in terms of word counts. Therefore, Jeon et al. and many other researchers used the bag-of-words approach for image representation, which clusters image features to produce a finite number of visual words. Blobworld by Carson et al. was widely used for dividing images into meaningful patches of similar color and texture [48]. Lavrenko et al. introduced a relevance model in the continuous space, named the Continuous Relevance Model (CRM) [93], and showed considerable improvement by removing the constraint of a finite number of visual words. Feng et al. introduced the Multiple Bernoulli Relevance Model and observed that dividing images into a fixed-size grid works better than the complex system of Blobworld [52].

The annotation problem has also sometimes been treated as a classification problem with class labels as the keywords to be used for annotating images [65]. This approach works well with primitive datasets containing a very small number of keywords. Some attempts have also been made to incorporate language models and natural language processing tools, such as WordNet, in the process of image annotation [67, 115]. Some researchers have tried to exploit the correlation between keywords during the process of image annotation, rather than treating each keyword independently of all others [67, 127, 68]. Latent Dirichlet Allocation based image theme modeling was introduced to produce annotations for news images [53]. In this case, each image is accompanied by a news article, which provides additional information regarding that image. Feng et al. worked to establish a similar approach to unify visual and linguistic characteristics of images [54]. Makadia et al. conducted a detailed survey of automatic image annotation techniques and arrived at the conclusion that greedy label-transfer approaches can beat complex relevance-based algorithms in many cases. They presented two such label-transfer techniques [97].

In this paper, we propose a solution to a related but new problem of theme-based image annotation, where the goal is to annotate images with textual information that models image semantics at a higher level than keywords associated with individual objects in the scene. We generate these image theme models using Latent Dirichlet Allocation (LDA), and each image theme is modeled in terms of word distributions. The process of image theme modeling implicitly employs the correlation between words. Therefore, our system overcomes the shortcoming of treating each keyword independently, while keywords are actually heavily correlated in human perception. These image themes can be represented by a group of a few significant words, i.e. the words with the highest probability values in the corresponding word distribution. These words may also be used to generate proper phrases through natural language generation techniques. Our motivation is that these image theme models provide better annotations for images, since they provide contextual information rather than reflecting only the content. Our approach integrates ConceptNet [94] in learning the semantics of images, to establish a commonsense basis for our annotations and to augment the ground truth beyond the descriptions provided by users. We have used the IAPR TC 12 data set (http://imageclef.org/photodata) for evaluation. This data set is considerably more challenging than the Corel 5K dataset.

In the remainder of this paper, we first discuss the problem of theme-based image annotation and our solution based on a modified CRM and ConceptNet for learning high-level image semantics. We then describe the data sets used to assess our solution, followed by the results of a comprehensive set of experiments to: (i) evaluate the performance of our method, (ii) compare our results, in terms of precision and recall, with state-of-the-art annotation methods that use low-level keywords, and (iii) demonstrate visually how augmenting low-level keywords with high-level image theme concepts can enrich the image semantics captured by annotations. We finish the paper with some concluding remarks.

III Image Theme Annotation

In this section, we discuss the various parts of our overall solution for learning high-level semantics for theme-based annotation of images.

III-A Image Theme Modeling

Image theme modeling through Latent Dirichlet Allocation (LDA) presented by Blei et al. [31] has gained tremendous popularity among natural language processing (NLP) researchers. The basic framework of LDA is a generative probabilistic model, which assumes that documents are random mixtures of latent image themes, while each latent image theme can be represented by a distribution over words. Given a set of documents, LDA assumes: (i) A word is the basic discrete unit of data. (ii) A document is a sequence of words. In our case, the description of each image is a document i.e. a sequence of words. (iii) A corpus is a collection of documents. In our case, the collection of descriptions of all images constitutes the corpus.

According to Blei et al., the system assumes the following generative process for generating all documents, with $\alpha$ and $\beta$ as system parameters.

  • Choose the document length $N$ from a Poisson distribution, $N \sim \mathrm{Poisson}(\xi)$.

  • Choose the image theme mixture $\theta$ from a Dirichlet distribution, $\theta \sim \mathrm{Dir}(\alpha)$.

  • For each of the $N$ words $w_n$:

    • Choose the image theme $z_n$ from $\mathrm{Multinomial}(\theta)$.

    • Choose the word $w_n$ from a multinomial distribution conditioned on $z_n$, i.e. $P(w_n \mid z_n, \beta)$.

Several simplifying assumptions are made, details of which can be found in the paper by Blei et al. [31]. The final expression for the joint distribution of an image theme mixture $\theta$, the set of image themes $\mathbf{z}$, and the set of words $\mathbf{w}$ is as follows [31]:

$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$   (1)

The process takes as input a corpus of documents and assumes that the above-mentioned generative process was at play when these documents were generated. The system estimates the image theme distribution for each document, as well as the word distributions conditioned on the image themes. Blei et al. have described a variational inference algorithm to estimate the posterior distribution of the hidden variables given a document.
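For concreteness, the following is a minimal sketch of this modeling step using the variational LDA implementation of scikit-learn; the loader load_image_descriptions and the choice of 50 themes are illustrative assumptions, not the toolbox or settings used for the experiments reported later in this paper.

```python
# Minimal sketch of image theme modeling over the descriptions, using scikit-learn's
# variational LDA; load_image_descriptions() is a hypothetical loader for the English
# IAPR captions, and 50 themes is an illustrative choice (the number is a user input).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

descriptions = load_image_descriptions()            # one English description per image

vectorizer = CountVectorizer(stop_words="english", min_df=5)
counts = vectorizer.fit_transform(descriptions)     # document-term count matrix

lda = LatentDirichletAllocation(n_components=50, learning_method="batch", random_state=0)
theme_mixture = lda.fit_transform(counts)           # per-image theme distribution (theta)

vocab = vectorizer.get_feature_names_out()
for k, word_weights in enumerate(lda.components_):  # per-theme (unnormalized) word weights
    top = word_weights.argsort()[::-1][:5]          # top-5 words of image theme k
    print(f"image theme {k}:", [vocab[i] for i in top])
```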

Image theme 1 | ‘shorts’, ‘cyclist’, ‘helmet’, ‘jersey’, ‘cycling’
Image theme 2 | ‘table’, ‘wooden’, ‘walls’, ‘restaurant’, ‘glasses’
Image theme 3 | ‘forest’, ‘bushes’, ‘dense’, ‘path’, ‘vegetation’
TABLE I: Sample word distributions conditioned over the image theme distribution, generated using LDA over image descriptions from the IAPR data set (top words shown, ranked by probability). These word distributions hint at three distinct visual themes.

We adopted a basic LDA model from the NLP community to apply to image annotation. For this purpose, we used the IAPR dataset, which provides a complete description of each image in three languages. We restricted our research to the English language only. We employed the basic LDA framework using the description of each image as one document. The image theme distribution for each document tells us which image themes are strongly present in each image description in the training set. The word distribution conditioned over the image theme distribution tells us which words are strongly associated with each of the image themes. Thus, we transform the annotation information from the word to the image theme space.

Table I provides sample word distributions generated for some of the image themes. Each list is ranked according to the strength of the probability values, and only the top few words are shown. The word distributions in Table I strongly hint at distinct visual themes such as ‘cycle race’, ‘inside of a restaurant’ and ‘dense forest’. These word distributions support our assumption that image themes correspond to visual themes of images. We use low-level visual features with spatial information to find visual themes in images, and later find the relevance between those visual themes and the image themes.

III-B Integration of ConceptNet

ConceptNet is a freely available commonsense knowledge-base [94], which has been widely used for reasoning tasks on documents. This knowledge-base is basically a semantic network, which connects various words and phrases if they are conceptually related; e.g., “learn”, “teacher” and “classroom” are strongly connected to each other in ConceptNet, although these words do not have the standard lexical relationships of synonymy, hypernymy or meronymy between them. The data for this knowledge-base was collected as part of the Open Mind Common Sense (OMCS) project, where Internet users were asked to fill in templates of information. For example, a certain template ‘– is used for –’ can be filled in as ‘KNIFE is used for CUTTING’. This simple template provides information regarding a certain type of relation between words, named the ‘UsedFor’ relation. Relationships between words have not only been extracted directly from templates filled in by users, but also through a complex system of inference using the templates as input. The idea is that commonsense knowledge is possessed by every person. Therefore, contributors to this knowledge base do not need any specific qualifications, as long as the application interface is easy enough to be understood by an average Internet user.

ConceptNet is dedicated to contextual reasoning. This semantic network has about 1.6 million assertions between nodes. A major portion of these assertions consists of generic conceptual assertions called k-lines [94]. Other natural language processing resources like WordNet do not have these conceptual assertions between words, but rely on the standard lexical relations of synonymy, hypernymy or meronymy. These conceptual assertions enable ConceptNet to perform reasoning over textual input. Overall, there are twenty different types of assertions between nodes. Every assertion is weighted based on the number of times it occurs in the OMCS corpus, and on how well it can be inferred indirectly from other assertions. These weights basically measure the relatedness between words, and we have used them to show that the top few words of each of our image theme models are strongly related to each other in this commonsense based semantic network.

We have used image theme modeling to transform keywords into image themes, in the same way that it is used in the NLP community to generate themes for the documents given as input. Therefore, the image themes generated through image theme modeling are good representatives of the semantic contents of the input image data. On the other hand, ConceptNet includes generally acceptable commonsense relationships between words. There may be image themes specific to a data set which are not represented with high confidence in ConceptNet. For example, if a data set contains many images of men wearing blue jackets (and described so in the image descriptions), one image theme generated through image theme modeling will have “men”, “blue”, and “jacket” as the top three words in terms of the probability distribution. ConceptNet, on the other hand, will not have these three words linked to each other with high confidence, because these words do not represent any generally popular concept. Still, we have shown that the significant words of many image themes generated for the given data set are indeed linked with high confidence in ConceptNet. Therefore, if an image is annotated with the top few words of an image theme, these annotations will convey a strong hint towards a commonsense-acceptable concept or theme, thus providing a commonsense basis for our idea of theme-based annotation of images. Moreover, the API of ConceptNet 2.0 provides a tool named ‘projection’, which takes as input a list of words and returns an extended list of words ranked according to their aggregate relatedness to all words in the input. We provide this tool with the annotation generated by using the significant words of the associated image themes, and then use the output list of words to augment the annotations generated for a specific image.

III-C Modified Relevance Model

Relevance models are basically statistical formalisms to model the relationship between the contents of two corpora. These models have been particularly popular in the natural language processing community for tasks such as machine translation, where it is necessary to establish the relationships between two corpora of text in different languages. In the case of text or language processing, data is usually represented in the form of word counts and is in the discrete domain. Lavrenko et al. transformed the relevance model from machine translation to adapt to visual features in the continuous space, and named the new model the Continuous Relevance Model (CRM) [93]. Suppose $\mathcal{T}$ is the set of training images, $J$ is a member of $\mathcal{T}$, $J$ is represented in the form of image regions $r_J = \{r_1, \ldots, r_n\}$, and the annotation for $J$ is $w_J = \{w_1, \ldots, w_m\}$. Lavrenko et al. assume that (i) the words in $w_J$ are i.i.d. random samples from an underlying multinomial distribution $P_V(\cdot \mid J)$; (ii) the regions in $r_J$ correspond to generator vectors $g_1, \ldots, g_n$, generated by some function $g$, with $g_i = g(r_i)$, which is independent of $J$; (iii) the generator vectors are also i.i.d. random samples from a multi-variate density function $P_G(\cdot \mid J)$.

Now if $A$ is an image not in $\mathcal{T}$, with regions $r_A = \{r_1, \ldots, r_{n_A}\}$ and some arbitrary sequence of words $w_B = \{w_1, \ldots, w_{n_B}\}$, the goal is to find the joint distribution of observing $r_A$ with the words in $w_B$. CRM computes an expectation over all images in $\mathcal{T}$ to estimate this joint distribution. The overall process of jointly generating $w_B$ and $r_A$ is as follows [93]:

  • Pick a training image $J$ from the set $\mathcal{T}$ with some probability $P_{\mathcal{T}}(J)$.

  • For $b = 1, \ldots, n_B$, pick a word $w_b$ from the multinomial distribution $P_V(\cdot \mid J)$; $\mathcal{V}$ denotes the overall vocabulary set.

  • For $a = 1, \ldots, n_A$:

    • Sample the generator vector $g_a$ from $P_G(\cdot \mid J)$.

    • Pick the region $r_a$ with probability $P_R(r_a \mid g_a)$.

The formal expression for the joint probability of the observation $\{w_B, r_A\}$ is an expectation over the training set:

$P(w_B, r_A) = \sum_{J \in \mathcal{T}} P_{\mathcal{T}}(J) \prod_{b=1}^{n_B} P_V(w_b \mid J) \prod_{a=1}^{n_A} P_R(r_a \mid J)$   (2)

The word probability $P_V(w \mid J)$ follows the Dirichlet posterior expectation with parameters $\mu p_w + N_{w,J}$, where $\mu$ is an empirically selected constant, $p_w$ is the relative frequency of $w$ in the training set, and $N_{w,J}$ is the number of times $w$ occurs in the observation $w_J$; the Dirichlet prior is used to get the Bayesian estimate of the multinomial [93]:

$P_V(w \mid J) = \dfrac{\mu\, p_w + N_{w,J}}{\mu + \sum_{w'} N_{w',J}}$   (3)

A Gaussian kernel is used for smoothing while estimating $P_G(\cdot \mid J)$ from the generator vectors $g_1, \ldots, g_n$ of image $J$ [93]:

$P_G(g \mid J) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{\sqrt{2^k \pi^k |\Sigma|}} \exp\!\big( -(g - g_i)^{T} \Sigma^{-1} (g - g_i) \big)$   (4)

where $k$ is the dimensionality of the feature space and $\Sigma$ is the kernel bandwidth matrix.

$P_{\mathcal{T}}(J)$ can be assumed constant (uniform over the training set), and Lavrenko et al. have used a particularly simple expression for the distribution $P_R(r \mid g)$:

$P_R(r \mid g) = C \cdot \delta\big(g = g(r)\big)$   (5)

where $C$ is a constant independent of $r$ and $g$. With this choice, $P_R(r \mid J) = \int P_R(r \mid g)\, P_G(g \mid J)\, dg$ reduces to $C\, P_G(g(r) \mid J)$, i.e. evaluating the kernel density of equation (4) at the generator vector of region $r$.

Another relevance model was introduced by Feng et al. under the name Multiple Bernoulli Relevance Model (MBRM). The difference between MBRM and CRM is that MBRM assumes a Bernoulli distribution over the vocabulary for $P_V(\cdot \mid J)$. Since it is rare for words to be repeated in the description of an image, the Bernoulli distribution is better suited to estimate $P_V(w \mid J)$, using the following expression (6), and MBRM shows improvement over CRM [52].

$P_V(w \mid J) = \dfrac{\mu\, \delta_{w,J} + N_w}{\mu + N}$   (6)

In equation (6), $\delta_{w,J}$ represents the presence/absence of the word $w$ in the annotations of image $J$, $N_w$ is the number of images in the training set containing $w$ as an annotation, and $N$ is the total number of images in the training set.
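For concreteness, the following is a minimal sketch of how the joint score of equation (2) can be computed with the estimates of equations (3) and (4); the data layout (a list of training images, each carrying a tile-feature matrix and an annotation list), the uniform choice of $P_{\mathcal{T}}(J)$, and the isotropic kernel bandwidth are illustrative assumptions rather than the exact implementation used in [93].

```python
# Minimal sketch of the CRM-style score of Eq. (2) with the estimates of Eqs. (3)-(4),
# computed in log space; the data layout (each training image J as a dict with a tile
# feature matrix J["feats"] and an annotation list J["words"]), the uniform P_T(J),
# and the isotropic kernel bandwidth `beta` are illustrative assumptions.
import numpy as np
from scipy.special import logsumexp

def log_kernel_density(g, centers, beta):
    """log P_G(g | J): Gaussian kernel density over the training image's tile features."""
    d = g.shape[0]
    diff = centers - g                                    # (n_tiles, d)
    log_terms = -0.5 * np.sum(diff * diff, axis=1) / beta \
                - 0.5 * d * np.log(2.0 * np.pi * beta)
    return logsumexp(log_terms) - np.log(len(centers))

def log_word_prob(word, annotation, word_freq, mu):
    """log P_V(word | J): Dirichlet-smoothed multinomial of Eq. (3).
    word_freq holds the relative frequency p_w of each vocabulary word (assumed > 0)."""
    return np.log((mu * word_freq[word] + annotation.count(word)) / (mu + len(annotation)))

def crm_log_score(test_feats, word, training, word_freq, mu=10.0, beta=1.0):
    """log P(word, r_A): expectation over all training images as in Eq. (2);
    a multi-word query would simply add one log_word_prob term per word."""
    log_pj = -np.log(len(training))                       # uniform P_T(J)
    per_image = []
    for J in training:
        log_regions = sum(log_kernel_density(g, J["feats"], beta) for g in test_feats)
        per_image.append(log_pj + log_word_prob(word, J["words"], word_freq, mu) + log_regions)
    return logsumexp(per_image)
```

Replacing log_word_prob with the Bernoulli estimate of equation (6) would give the MBRM variant of the word model.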

We propose two modifications to adapt the CRM to image theme modeling, based on the following two observations: (i) images of a common theme often have a similar spatial arrangement of visual features; (ii) images of a common theme often exhibit similar visual characteristics. The following two sections describe the modification of the CRM according to these two observations. To emphasize the spatial and visual coherence among images of common themes, we call this model the Coherent Continuous Relevance Model (CCRM).

III-C1 Relevance Model with Spatial Coherence

Over the years, different methods have been explored to divide images into regions from which generator vectors can be computed, e.g. Blobworld [48], a complex method that divides an image into regions of similar appearance and may generate a different number of blobs in different images. In our proposed model, we want to generate an equal number of regions in all images in order to preserve spatial coherence. Therefore, we generate visual features by dividing images into a fixed grid and then representing each tile of the grid with features describing the color and texture of that tile. Our features include color features (mean and standard deviation of the RGB, LUV and LAB color components) and texture features (responses of Gabor filters at different scales and orientations). It has previously been shown that grid-based visual features perform better than complex blob-based features [52].
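A minimal sketch of this feature extraction is given below, assuming OpenCV and NumPy; the grid size, the Gabor kernel parameters, and the use of the mean Gabor response per tile are illustrative choices rather than the exact settings of the original system.

```python
# Minimal sketch of fixed-grid tile features (color statistics plus Gabor texture
# responses); grid size and Gabor parameters below are illustrative assumptions.
import cv2
import numpy as np

def tile_features(img_bgr, grid=(5, 5), wavelengths=(4.0, 8.0), thetas=(0.0, np.pi / 2)):
    h, w = img_bgr.shape[:2]
    luv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LUV)
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # Gabor responses at a few scales/orientations (texture cues).
    gabor_maps = []
    for lam in wavelengths:
        for th in thetas:
            kern = cv2.getGaborKernel((15, 15), 3.0, th, lam, 0.5)  # ksize, sigma, theta, lambd, gamma
            gabor_maps.append(cv2.filter2D(gray, cv2.CV_32F, kern))

    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            ys = slice(r * h // grid[0], (r + 1) * h // grid[0])
            xs = slice(c * w // grid[1], (c + 1) * w // grid[1])
            tile = []
            for channels in (img_bgr, luv, lab):               # mean/std per color space
                patch = channels[ys, xs].reshape(-1, 3).astype(np.float32)
                tile += patch.mean(axis=0).tolist() + patch.std(axis=0).tolist()
            tile += [float(g[ys, xs].mean()) for g in gabor_maps]  # mean Gabor response per tile
            feats.append(tile)
    return np.asarray(feats)   # shape: (num_tiles, feature_dim), tiles in fixed row-major order
```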

Using the same notation as Lavrenko et al., $r_a$ represents a region of image $A$, and the function $g$ produces $g_a = g(r_a)$ as the corresponding visual feature (generator vector) for the region $r_a$. In CCRM, the joint probability of observing $w_B$ with image $r_A$ is estimated by the same process of expectation over all images in the training set as described for CRM in equation (2).

As argued earlier, visual themes are captured not only by the visual features of the tiles, but also by their relative spatial arrangement. Therefore, we modify the equation for $P_G(\cdot \mid J)$ (i.e. equation (4)) to incorporate the spatial information of the visual features. Suppose $a$ indexes one tile at a certain position in all images (as all images have the same number of tiles), $g_a^A$ represents the visual feature corresponding to the $a$-th tile in image $A$, and $r_a^J$ represents the corresponding image region in image $J$ from the training set, with $g_a^J = g(r_a^J)$.

$P_G(g_a^A \mid J) = \dfrac{1}{\sqrt{2^k \pi^k |\Sigma|}} \exp\!\big( -(g_a^A - g_a^J)^{T} \Sigma^{-1} (g_a^A - g_a^J) \big)$   (7)

In equation (7), $P_G(g_a^A \mid J)$ depends only on the corresponding tile of image $J$. An additional advantage of this modification to CRM is a large reduction in the time-complexity of the procedure. The complexity is reduced by a factor of $n$, where $n$ is the number of tiles in the grid. To execute this approach, it is necessary that all images are divided by the same size of grid, and that the ordering of the tiles is fixed in the representation of visual features. Another point to be noted is that in the case of theme-based image annotation, each ‘word’ is actually an image theme. Therefore, in estimating $P_V$ using the Dirichlet prior, the count $N_{w,J}$ of a word is replaced by the strength of the corresponding image theme in the description of $J$.
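A minimal sketch of the spatially coherent region term of equation (7) is given below; it assumes that the tile features of the test image and of a training image are stored as arrays with the same fixed tile ordering, and uses an isotropic kernel bandwidth beta as an illustrative choice. Note how the per-tile comparison removes the inner sum over all $n$ regions, which is the source of the factor-$n$ speedup mentioned above.

```python
# Minimal sketch of the spatially coherent region term of Eq. (7): tile a of the test
# image is compared only with tile a of the training image J, assuming both feature
# matrices use the same fixed tile ordering; `beta` is an illustrative isotropic bandwidth.
import numpy as np

def log_region_prob_spatial(test_feats, train_feats, beta=1.0):
    assert test_feats.shape == train_feats.shape          # same grid size in all images
    d = test_feats.shape[1]
    diff = test_feats - train_feats                        # tile-to-tile differences
    log_tiles = -0.5 * np.sum(diff * diff, axis=1) / beta \
                - 0.5 * d * np.log(2.0 * np.pi * beta)
    return log_tiles.sum()                                 # log of the product over tiles
```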

III-C2 Clustering Based on Visual Features

Theme-based image annotation assumes the presence of significant visual themes in images. To take advantage of such common visual themes among multiple images in a data set, we incorporate a step of clustering images based on their visual features before image theme annotation. The underlying assumption is that there is visual coherence among images belonging to one cluster. We fixed the size of clusters and dropped clusters with too-low membership based on the assumption that those clusters represent images with rarely-occurring themes. Image theme modeling is applied over each cluster to generate image themes specific to that cluster only. For annotation, an image is matched to a suitable cluster based on its visual features, and then the annotation process is carried out using the information from that particular cluster only. Once an image is matched to its corresponding cluster, the pool of possible image themes for annotation is smaller than the total number of image themes for the complete data set. Even within that smaller pool of image themes, image themes share an underlying common context, since these image themes are shared by images with some level of visual coherence.
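The clustering step can be sketched as follows, assuming one global descriptor per training image (e.g., the concatenated tile features); the use of k-means, the number of clusters, and the membership threshold are illustrative assumptions, since only Euclidean distance is fixed as the similarity measure in our description.

```python
# Minimal sketch of the clustering step, assuming one global descriptor per training
# image (e.g., concatenated tile features); k-means, the number of clusters, and the
# membership threshold are illustrative assumptions (only Euclidean distance is fixed).
import numpy as np
from sklearn.cluster import KMeans

def cluster_training_images(features, n_clusters=50, min_members=20):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    kept = np.where(sizes >= min_members)[0]               # drop clusters with rare themes
    return km, kept

def assign_cluster(km, kept, test_feature):
    # Euclidean distance to the kept centroids only; annotation then uses that
    # cluster's training images and its cluster-specific LDA image themes.
    dists = np.linalg.norm(km.cluster_centers_[kept] - test_feature, axis=1)
    return kept[int(np.argmin(dists))]
```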

The equations provided in the previous section to estimate the probabilities are therefore modified based on this additional clustering step. Let the test image $A$ be matched to the cluster $q$, and let $\mathcal{T}_q$ be the set of all training images belonging to this cluster. The following equation represents our modified probability estimation:

$P(w_B, r_A) = \sum_{J \in \mathcal{T}_q} P_{\mathcal{T}_q}(J) \prod_{b=1}^{n_B} P_V(w_b \mid J) \prod_{a=1}^{n} P_G(g_a^A \mid J)$   (8)

As mentioned earlier, each $w_b$ represents an image theme instead of a word in theme-based image annotation. $P_V(\cdot \mid J)$ is estimated over the samples of one particular cluster.

IV Data Set and Results

Our main data set for evaluation is the IAPR TC 12, consisting of 20,000 images. Each image has an accompanying description provided in three languages. At present, we have restricted our experiments to the English descriptions only. We have conducted experiments to confirm the quality of each of the modifications we have made to the relevance models. In this section, we present the details of these experiments.

IV-A Evaluating Spatial Coherence

The first set of experiments is used to demonstrate the validity of spatial coherence even within classical relevance models (i.e. MBRM [52] in this case). Makadia et al. have used the classical MBRM to annotate two rather challenging data sets, i.e. the IAPR TC 12 and the ESP game (http://www.espgame.org). We ran MBRM while preserving spatial coherence, and compared our results against those provided by Makadia et al. To generate comparable results, we used similar data set and vocabulary sizes. The number of annotations per image is the same as the average annotation length per image in the training set. Our results in Table II confirm that incorporating spatial information produces better results with much lower computational time, confirming the validity of the idea of spatial coherence. An additional advantage is the considerable reduction in computational complexity of the new method. Comparing equations (4) and (7), it is evident that preserving spatial coherence in the relevance model reduces the complexity by a factor of $n$, where $n$ is the total number of regions (tiles) in each image.

IV-B Evaluating CCRM Without Clustering

For the remaining experiments to evaluate the CCRM, we used the IAPR TC 12 data set restricted to the English descriptions. For this purpose, we generated image themes from the descriptions by using Latent Dirichlet Allocation (implementation: http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm) with a fixed number of image themes. As described in Section III-A, LDA takes as input a collection of documents, which in our case is the set of all image descriptions. LDA uses a variational inference algorithm to estimate word distributions conditioned over the image theme distribution. It returns as output word distributions for a certain number of image themes. The number of image themes is a user input and can therefore be changed. Examples of these word distributions have been provided in Table I. It also returns vectors indicating which image themes are present in each document, i.e. each image description.

For experimental purposes, we used LDA over the descriptions of all images, and generated a vector indicating the presence or absence of each image theme for each image. We thus converted our problem from the word domain to the theme domain, and generated the ground truth for evaluation. We then separated the test and training sets and used CCRM to annotate the images in the test set with image themes, using the expectation over all images in the training data, as described in Section III-C. Note that the annotations generated in this case are actually image themes and not words. Theme-based image annotations are useful, as we have already established that image themes correspond to visual themes. Therefore, a correct theme-based image annotation also means a correct identification of visual themes. Later on, such visual themes can be represented in terms of the top few words from the word distribution corresponding to the image themes. These words may also be used to generate phrases with some natural language generation process.
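As a minimal sketch of this conversion, the per-image theme mixture returned by LDA can be binarized into presence/absence vectors; the threshold value below is an illustrative assumption.

```python
# Minimal sketch of converting the LDA output into theme-domain ground truth:
# a binary presence/absence vector per image; the threshold is an illustrative choice.
import numpy as np

def theme_ground_truth(theme_mixture, threshold=0.1):
    # theme_mixture: (num_images, num_themes) per-image theme distributions (rows sum to 1)
    return (theme_mixture >= threshold).astype(int)
```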

We compared the image theme annotations of the test images generated through CCRM with the ground truth generated by using LDA in the first step. We measured the performance of the proposed theme-based annotation in terms of mean precision, recall and F-measure per image theme. Results are provided in Table III for different numbers of image themes generated through LDA.
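The per-theme evaluation can be sketched as follows, assuming binary ground-truth and predicted theme matrices; the means are taken over all themes here, whereas the tables also report the number of themes with recall > 0 separately.

```python
# Minimal sketch of the per-theme evaluation, assuming binary ground-truth and
# predicted matrices of shape (num_images, num_themes); means are taken over all themes.
import numpy as np

def per_theme_scores(gt, pred, eps=1e-12):
    tp = ((gt == 1) & (pred == 1)).sum(axis=0).astype(float)
    precision = tp / np.maximum(pred.sum(axis=0), eps)
    recall = tp / np.maximum(gt.sum(axis=0), eps)
    f1 = 2 * precision * recall / np.maximum(precision + recall, eps)
    return precision.mean(), recall.mean(), f1.mean()
```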

IV-C Evaluating CCRM With Clustering

We conducted another set of experiments to apply the second modification to the classical relevance model, suggested in Section III-C2. For these experiments, we first clustered the available images based on the similarity of their visual features. We used standard Euclidean distance as the measure of similarity between images for clustering, and then dropped clusters with extremely low membership. This left us with a large portion of the original data set, distributed over the remaining clusters.

For theme-based image annotation, we applied LDA over the image descriptions in each cluster separately, and the number of image themes was decided based on the membership of that cluster. Typically, a cluster with a larger number of images is likely to contain images representing a greater number of themes, and therefore a larger number of image themes is generated for it. Again, the test and training sets were separated after generating the ground truth in the theme domain. CCRM was then applied to estimate the joint probability of visual features and themes for each of the test images, using only the training images from its corresponding cluster.

Results are provided in Table IV. Image themes generated for each cluster are treated as distinct from image themes generated for other clusters. Therefore, a large number of image themes are generated, which makes the theme space more fine-grained. The results show improvement over the experiments described in the previous section, even though the theme space is now larger and more fine-grained. The performance also depends on the number of image themes generated. Overall, clustering helps the process of identifying the correct visual themes and associating images with the corresponding themes.

As described earlier, image themes are represented by their word-distribution. No two image themes generated will have the exact same word-distribution. If a large number of image themes are generated, then the average similarity between image themes increases, i.e. some image themes share similar but not equivalent word distributions.

IV-D Image Theme Annotation

We annotated images with the image theme number (label), because image themes correspond to visual themes and we assumed that each theme is sufficiently described by the few top words of the word distribution conditioned over that image theme, generated using LDA. The top few words of the word distribution of each image theme are conceptually highly related. Not all of the words assigned to an image, based on the word distributions of the themes for that image, may be present in the original description of the image. However, the additional words are commonsense extensions of the descriptions already provided by users. For example, “tree”, “trunk”, and “leaves” are the top three words of the word distribution of an image theme, and they have been assigned to an image that has a tree in it, but the original image description did not contain the word “trunk”. Standard precision and recall measures will consider “trunk” a wrong annotation, while it is actually a conceptually related detail of the visual theme of “tree”.

IV-E Comparison With Other Methods

We ran another set of experiments to compare our results against state-of-the-art algorithms for keyword annotation. Although image theme annotation is a higher-level semantic annotation, we can compare the results by treating the top words from the word distribution of the image themes as keywords. Table V shows that, on comparable problem sizes, our theme-based image annotation beats the CRM with spatial coherence, the MBRM, and the two greedy label transfer algorithms described by Makadia et al. [97], i.e. JEC and Lasso. To make the results comparable, we used the same data set sizes. Makadia et al. have used the most frequently occurring keywords as the vocabulary. We ran CRM with spatial coherence using the same vocabulary size. We annotated the images with image themes using CCRM with clustering, and used the top words of the image themes assigned to the image with the highest probability as the annotating keywords. There were a total of 360 distinct words appearing in the annotations in this case (see Table V). Results in Table V are expressed in terms of precision and recall, averaged over the total number of keywords used in the experiment, making the numerical results comparable.

The improvement generated by our method is even more significant considering the fact that the state-of-the-art relevance models were beaten in performance by simple greedy algorithms [97]. Many of the variations suggested over time use computationally expensive approaches for restricted data sets like Corel5K, where image contents basically belong to a few broad categories, e.g. ‘cars’, ‘animals’, etc. Methods using language models or ontologies like WordNet enjoyed limited success. We have used a larger dataset with significant variation in image contents. Our method shows not only a performance improvement over other relevance models, but also over the greedy label-transfer algorithms.

Dataset | Annotation algorithm | Mean precision per keyword | Mean recall per keyword | Mean F-measure per keyword | No. of keywords with recall > 0 | Total no. of keywords
ESP | MBRM | 21.0% | 17.0% | – | 218 | 268
ESP | MBRM with spatial coherence | 25.0% | 17.0% | 19.0% | 235 | 268
IAPR | MBRM | 21.0% | 14.0% | – | 186 | 291
IAPR | MBRM with spatial coherence | 24.0% | 15.0% | 16.0% | 213 | 291
TABLE II: MBRM vs. MBRM with spatial coherence. Results of the standard MBRM (described by Feng et al. [52]) have been taken from Makadia et al. [97], who do not report F-measure values. Similar vocabulary and dataset sizes have been used in both cases, as used by Makadia et al.
 | Mean precision per theme | Mean recall per theme | Mean F-measure per theme | Number of themes with recall > 0
25 themes over all dataset | 44.2% | 35.0% | 33.0% | 25
50 themes over all dataset | 31.0% | 24.0% | 22.6% | 50
75 themes over all dataset | 26.3% | 19.0% | 18.0% | 75
TABLE III: Image theme annotation performance
 | Mean precision per theme | Mean recall per theme | Mean F-measure per theme | Number of themes with recall > 0 | Total number of themes
2 themes per 100 images | 62.2% | 57.1% | 55.0% | 323 | 327
3 themes per 100 images | 51.0% | 47.0% | 44.0% | 472 | 498
4 themes per 100 images | 44.0% | 40.3% | 37.2% | 601 | 671
TABLE IV: Image theme annotation performance with clustering; the number of image themes for each cluster is decided on the basis of the number of members of the cluster
 | Mean precision per word | Mean recall per word | Mean F-measure per word | No. of words with recall > 0 | Total no. of words
MBRM | 21.0% | 14.0% | – | 186 | 291
JEC | 25.0% | 16.0% | – | 196 | 291
Lasso | 26.0% | 16.0% | – | 199 | 291
CRM with spatial coherence | 26.0% | 13.0% | 14.0% | 181 | 291
Words generated from image themes | 35.0% | 16.0% | 20.0% | 253 | 360
TABLE V: Keyword annotation performance comparison between CRM with spatial coherence, MBRM [97], Lasso [97], JEC [97], and words generated from image annotation with image themes using CCRM with clustering. All results have been computed over images of the IAPR dataset, with part of the dataset used as the testing set. Results have been averaged over the total number of keywords, as explained in Section IV. Results of MBRM, JEC and Lasso have been taken from Makadia et al., who do not report F-measure [97].
No. of top words | Avg. score for all themes | Avg. score for best 10 themes | Avg. score for best 5 themes
5 | 28.1 | 37.5 | 41.3
6 | 27.8 | 36.3 | 38.5
TABLE VI: Average relatedness score of the top few words of image themes generated through image theme modeling over the complete IAPR TC 12 dataset

IV-F Ground Truth Augmentation Using ConceptNet

Our assumption is that images annotated with image themes provide useful and understandable information to humans. To support this point, we employed ConceptNet. We explained in detail in Section III-B that connections between nodes are weighted in the semantic network of ConceptNet. The relatedness score provided by ConceptNet takes into account all paths between two nodes in the network, as well as their weights. Heavier weights mean more conceptual relatedness between nodes. Table VI provides a few statistics of the relatedness scores between the top few words of each image theme. We estimated that the average numerical relatedness score of the top words over all image themes was about 28 (this estimation was performed over 50 image themes generated over the entire data set; see Table VI).

We explained in Section III-B the difference between image themes generated through LDA and concepts present in ConceptNet. Our observations indicate that the top few words of each image theme convey a strong, commonsense-acceptable hint about image content to humans. There are two ways to represent an image annotated with an image theme. One simple way is to annotate it with a fixed number of top words from the word distributions of the selected image themes. Although these words may not describe all objects in the image, they provide a strong indication of the image context. The second way is to generate a phrase encompassing the top words of each image theme to annotate the image. For example, the top few words of one of the image themes were “sand”, “beach”, and “water”, for which “sandy beach” could be an appropriate annotation. Phrase generation is a problem that has been studied in NLP.

An additional advantage of using image theme modeling with the aid of ConceptNet is that we can augment the ground truth for a given annotated data set. ConceptNet can be used to elaborate at least those image themes that are present in the image. ConceptNet provides the possibility of extending a word list given as input with conceptually related words, which is called ‘projection’ within the context of ConceptNet. The API of ConceptNet 2.0 provides this facility under the same name. This API even distinguishes between different types of projections, e.g. spatial projections, where projected words have spatial relations to the input words (e.g. California is part of the spatial projection of Los Angeles [94]). Other types of projections include consequence, detail, etc., where projected words are consequences or parts of detailed descriptions of the input words, respectively. The output is a list of words sorted according to their aggregate relatedness scores with the words in the input list. We have used this API to augment the annotations generated by CCRM and image theme modeling. We first annotated each image with the top few words of the image themes found using CCRM. Then we provided this list of annotations as input to the ‘projection’ facility of the ConceptNet API, and obtained an extended list of conceptually related words. We argue that using the top words from this extended list as annotations provides an even more detailed description of the images. The words provided by ‘projection’ may not have been used in the user-provided descriptions of the images in our test data set. However, these words are clearly conceptually related to the visual themes of the images. Therefore, these words were used to augment the ground-truth information, i.e. the image descriptions provided by users. We have included some examples to demonstrate this fact.
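The augmentation step can be illustrated with the following sketch; the helper relatedness(a, b) is a hypothetical stand-in for a ConceptNet weight-based relatedness lookup, since the actual system relied on the ready-made ‘projection’ facility of the ConceptNet 2.0 API rather than on custom code.

```python
# Illustrative sketch of aggregate-relatedness ranking; `relatedness(a, b)` is a
# hypothetical stand-in for a ConceptNet weight-based relatedness lookup (the actual
# system used the ready-made 'projection' facility of the ConceptNet 2.0 API).
def project(annotation_words, candidate_vocabulary, relatedness, top_k=3):
    scored = []
    for cand in candidate_vocabulary:
        if cand in annotation_words:
            continue
        # aggregate relatedness of the candidate to all current annotation words
        score = sum(relatedness(cand, w) for w in annotation_words)
        scored.append((score, cand))
    scored.sort(reverse=True)
    return [cand for _, cand in scored[:top_k]]

# For example, project(['forest', 'bushes', 'dense', 'path'], vocabulary, relatedness)
# might surface a word like 'nature', which augments the original description.
```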

Table VII provides an example list of top words from the word distributions conditioned over a few image themes in our data set, and useful projections provided by ConceptNet. Another observation that we made was that ConceptNet can provide more abstract ideas in projected words, e.g. ‘nature’ is a word projected for images of forest. When added to the description of an image, this word can appropriately make the image relevant to queries dealing with the general idea of ‘nature’.

Top words of image theme | Augmented words using ConceptNet
‘walls’, ‘children’, ‘classroom’, ‘board’, ‘desk’ | ‘in-school’, ‘student’
‘clouds’, ‘sky’, ‘sun’, ‘shade’ | ‘sunset’, ‘yellow’, ‘blue’
‘forest’, ‘bushes’, ‘dense’, ‘path’ | ‘nature’
‘spectators’, ‘stadium’, ‘grandstand’, ‘court’ | ‘game-play’, ‘watch’
‘room’, ‘wood’, ‘walls’, ‘lamp’ | ‘furniture’
‘building’, ‘city’, ‘view’, ‘night’ | ‘look-through-telescope’, ‘dark’
‘road’, ‘gravel’, ‘car’, ‘dirt’ | ‘ride’, ‘track’
‘streets’, ‘building’, ‘people’, ‘pavement’ | ‘road’, ‘walk’, ‘in-city’
‘shorts’, ‘cyclist’, ‘helmet’, ‘jersey’ | ‘bike-ride’, ‘athlete’
‘man’, ‘woman’, ‘shirt’, ‘hands’, ‘clothes’ | ‘wear’, ‘outfit’, ‘dress’
TABLE VII: Sample of top words of a few image themes and augmented words from ConceptNet; these image themes have been selected from the image themes generated over the whole dataset

Figures 1–5 show samples of images annotated with a certain image theme, with the top few words from the word distribution of the image theme and the additional projected words provided in the captions. It is evident that the projected words, when added to the ground truth, provide useful context about the images. Space restrictions prohibit us from providing more samples.

Fig. 1: Top words from image theme: ‘shorts’, ‘cyclist’, ‘helmet’, ‘jersey’; Augmented words: ‘bike-ride’, ‘athlete’
Fig. 2: Top words from image theme: ‘building’, ‘city’, ‘view’, ‘night’; Augmented words: ‘look-through-telescope’, ‘dark’
Fig. 3: Top words from image theme: ‘room’, ‘wood’, ‘walls’, ‘lamp’; Augmented words: ‘furniture’
Fig. 4: Top words from image theme: ‘forest’, ‘bushes’, ‘dense’, ‘path’; Augmented words: ‘nature’
Fig. 5: Top words from image theme: ‘spectators’, ‘stadium’, ‘grandstand’, ‘court’; Augmented words: ‘game-play’, ‘watch’

V Conclusion

Automatic image annotation has been a focus of research because of its potential to benefit image search and retrieval engines, as well as many other applications in image/video processing. Most previously presented algorithms perform unsatisfactorily when tested over challenging data sets like IAPR TC 12. We have radically transformed the problem of automatic image annotation from the keyword space to the image theme space. We have employed techniques popular in natural language processing (NLP) to annotate images with image themes corresponding to visual themes, rather than independent keywords corresponding to individual objects. Annotated images, when represented by a few significant words from the word distributions of the image themes, provide a strong, conceptually acceptable hint of the overall theme of the image. We have employed, for the first time, a semantic network (i.e. ConceptNet) to provide a commonsense basis for our image theme annotation idea. We have also shown that matching an image to a cluster of images with similar visual themes helps narrow down the possible image themes for annotation, while providing a performance boost. By using the top words from the word distributions of the image themes as annotations, we have compared the performance against standard keyword-based annotation methods, and have shown superior results. The performance improvement is even more significant considering that most previously developed methods are beaten by greedy label-transfer approaches on a challenging data set like IAPR, whereas our system is able to beat these greedy approaches.

References

  • [1] Muhamad Ali and Hassan Foroosh. Natural scene character recognition without dependency on specific features. In Proc. International Conference on Computer Vision Theory and Applications, 2015.
  • [2] Muhamad Ali and Hassan Foroosh. A holistic method to recognize characters in natural scenes. In Proc. International Conference on Computer Vision Theory and Applications, 2016.
  • [3] Muhammad Ali and Hassan Foroosh. Character recognition in natural scene images using rank-1 tensor decomposition. In Proc. of International Conference on Image Processing (ICIP), pages 2891–2895, 2016.
  • [4] Mais Alnasser and Hassan Foroosh. Image-based rendering of synthetic diffuse objects in natural scenes. In Proc. IAPR Int. Conference on Pattern Recognition, volume 4, pages 787–790, 2006.
  • [5] Mais Alnasser and Hassan Foroosh. Rendering synthetic objects in natural scenes. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 493–496, 2006.
  • [6] Mais Alnasser and Hassan Foroosh. Phase shifting for non-separable 2d haar wavelets. IEEE Transactions on Image Processing, 16:1061–1068, 2008.
  • [7] Nazim Ashraf and Hassan Foroosh. Robust auto-calibration of a ptz camera with non-overlapping fov. In Proc. International Conference on Pattern Recognition (ICPR), 2008.
  • [8] Nazim Ashraf and Hassan Foroosh. Human action recognition in video data using invariant characteristic vectors. In Proc. of IEEE Int. Conf. on Image Processing (ICIP), pages 1385–1388, 2012.
  • [9] Nazim Ashraf and Hassan Foroosh. Motion retrieval using consistency of epipolar geometry. In Proceedings of IEEE International Conference on Image Processing (ICIP), pages 4219–4223, 2015.
  • [10] Nazim Ashraf, Imran Junejo, and Hassan Foroosh. Near-optimal mosaic selection for rotating and zooming video cameras. Proc. of Asian Conf. on Computer Vision, pages 63–72, 2007.
  • [11] Nazim Ashraf, Yuping Shen, Xiaochun Cao, and Hassan Foroosh. View-invariant action recognition using weighted fundamental ratios. Journal of Computer Vision and Image Understanding (CVIU), 117:587–602, 2013.
  • [12] Nazim Ashraf, Yuping Shen, Xiaochun Cao, and Hassan Foroosh. View invariant action recognition using weighted fundamental ratios. Computer Vision and Image Understanding, 117(6):587–602, 2013.
  • [13] Nazim Ashraf, Yuping Shen, and Hassan Foroosh. View-invariant action recognition using rank constraint. In Proc. of IAPR Int. Conf. Pattern Recognition (ICPR), pages 3611–3614, 2010.
  • [14] Nazim Ashraf, Chuan Sun, and Hassan Foroosh. Motion retrieval using low-rank decomposition of fundamental ratios. In Proc. IEEE International Conference on Image Processing (ICIP), pages 1905–1908, 2012.
  • [15] Nazim Ashraf, Chuan Sun, and Hassan Foroosh. Motion retrival using low-rank decomposition of fundamental ratios. In Image Processing (ICIP), 2012 19th IEEE International Conference on, pages 1905–1908, 2012.
  • [16] Nazim Ashraf, Chuan Sun, and Hassan Foroosh. View-invariant action recognition using projective depth. Journal of Computer Vision and Image Understanding (CVIU), 123:41–52, 2014.
  • [17] Nazim Ashraf, Chuan Sun, and Hassan Foroosh. View invariant action recognition using projective depth. Computer Vision and Image Understanding, 123:41–52, 2014.
  • [18] Vildan Atalay and Hassan Foroosh. In-band sub-pixel registration of wavelet-encoded images from sparse coefficients. Signal, Image and Video Processing, 2017.
  • [19] Vildan A. Aydin and Hassan Foroosh. Motion compensation using critically sampled dwt subbands for low-bitrate video coding. In Proc. IEEE International Conference on Image Processing (ICIP), 2017.
  • [20] Murat Balci, Mais Alnasser, and Hassan Foroosh. Alignment of maxillofacial ct scans to stone-cast models using 3d symmetry for backscattering artifact reduction. In Proceedings of Medical Image Understanding and Analysis Conference, 2006.
  • [21] Murat Balci, Mais Alnasser, and Hassan Foroosh. Image-based simulation of gaseous material. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 489–492, 2006.
  • [22] Murat Balci, Mais Alnasser, and Hassan Foroosh. Subpixel alignment of mri data under cartesian and log-polar sampling. In Proc. of IAPR Int. Conf. Pattern Recognition, volume 3, pages 607–610, 2006.
  • [23] Murat Balci and Hassan Foroosh. Estimating sub-pixel shifts directly from phase difference. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 1057–1060, 2005.
  • [24] Murat Balci and Hassan Foroosh. Estimating sub-pixel shifts directly from the phase difference. In Proc. of IEEE Int. Conf. Image Processing (ICIP), volume 1, pages I–1057, 2005.
  • [25] Murat Balci and Hassan Foroosh. Inferring motion from the rank constraint of the phase matrix. In Proc. IEEE Conf. on Acoustics, Speech, and Signal Processing, volume 2, pages ii–925, 2005.
  • [26] Murat Balci and Hassan Foroosh. Metrology in uncalibrated images given one vanishing point. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 361–364, 2005.
  • [27] Murat Balci and Hassan Foroosh. Real-time 3d fire simulation using a spring-mass model. In Proc. of Int. Multi-Media Modelling Conference, pages 8–pp, 2006.
  • [28] Murat Balci and Hassan Foroosh. Sub-pixel estimation of shifts directly in the fourier domain. IEEE Trans. on Image Processing, 15(7):1965–1972, 2006.
  • [29] Murat Balci and Hassan Foroosh. Sub-pixel registration directly from phase difference. Journal of Applied Signal Processing-special issue on Super-resolution Imaging, 2006:1–11, 2006.
  • [30] Adeel A Bhutta, Imran N Junejo, and Hassan Foroosh. Selective subtraction when the scene cannot be learned. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 3273–3276, 2011.
  • [31] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.
  • [32] Hakan Boyraz, Syed Zain Masood, Baoyuan Liu, Marshall Tappen, and Hassan Foroosh. Action recognition by weakly-supervised discriminative region localization.
  • [33] Ozan Cakmakci, Gregory E. Fasshauer, Hassan Foroosh, Kevin P. Thompson, and Jannick P. Rolland. Meshfree approximation methods for free-form surface representation in optical design with applications to head-worn displays. In Proc. SPIE Conf. on Novel Optical Systems Design and Optimization XI, volume 7061, 2008.
  • [34] Ozan Cakmakci, Brendan Moore, Hassan Foroosh, and Jannick Rolland. Optimal local shape description for rotationally non-symmetric optical surface design and analysis. Optics Express, 16(3):1583–1589, 2008.
  • [35] Ozan Cakmakci, Sophie Vo, Hassan Foroosh, and Jannick Rolland. Application of radial basis functions to shape description in a dual-element off-axis magnifier. Optics Letters, 33(11):1237–1239, 2008.
  • [36] X Cao and H Foroosh. Metrology from vertical objects. In Proceedings of the British Machine Vision Conference (BMVC), pages 74–1.
  • [37] Xiaochun Cao and Hassan Foroosh. Camera calibration without metric information using 1d objects. In Proc. International Conf. on Image Processing (ICIP), volume 2, pages 1349–1352, 2004.
  • [38] Xiaochun Cao and Hassan Foroosh. Camera calibration without metric information using an isosceles trapezoid. In Proc. International Conference on Pattern Recognition (ICPR), volume 1, pages 104–107, 2004.
  • [39] Xiaochun Cao and Hassan Foroosh. Simple calibration without metric information using an isoceles trapezoid. In Proc. of IAPR Int. Conf. Pattern Recognition (ICPR), volume 1, pages 104–107, 2004.
  • [40] Xiaochun Cao and Hassan Foroosh. Camera calibration using symmetric objects. IEEE Transactions on Image Processing, 15(11):3614–3619, 2006.
  • [41] Xiaochun Cao and Hassan Foroosh. Synthesizing reflections of inserted objects. In Proc. IAPR Int. Conference on Pattern Recognition, volume 2, pages 1225–1228, 2006.
  • [42] Xiaochun Cao and Hassan Foroosh. Camera calibration and light source orientation from solar shadows. Journal of Computer Vision & Image Understanding (CVIU), 105:60–72, 2007.
  • [43] Xiaochun Cao, Yuping Shen, Mubarak Shah, and Hassan Foroosh. Single view compositing with shadows. The Visual Computer, 21(8-10):639–648, 2005.
  • [44] Xiaochun Cao, Lin Wu, Jiangjian Xiao, Hassan Foroosh, Jigui Zhu, and Xiaohong Li. Video synchronization and its application on object transfer. Image and Vision Computing (IVC), 28(1):92–100, 2009.
  • [45] Xiaochun Cao, Jiangjian Xiao, and Hassan Foroosh. Camera motion quantification and alignment. In Proc. International Conference on Pattern Recognition (ICPR), volume 2, pages 13–16, 2006.
  • [46] Xiaochun Cao, Jiangjian Xiao, and Hassan Foroosh. Self-calibration using constant camera motion. In Proc. of IAPR Int. Conf. Pattern Recognition (ICPR), volume 1, pages 595–598, 2006.
  • [47] Xiaochun Cao, Jiangjian Xiao, Hassan Foroosh, and Mubarak Shah. Self-calibration from turn table sequence in presence of zoom and focus. Computer Vision and Image Understanding (CVIU), 102(3):227–237, 2006.
  • [48] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026–1038, 2002.
  • [49] Kristian L Damkjer and Hassan Foroosh. Mesh-free sparse representation of multidimensional LIDAR data. In Proc. of International Conference on Image Processing (ICIP), pages 4682–4686, 2014.
  • [50] P. Duygulu, K. Barnard, J. De Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. Computer Vision—ECCV 2002, pages 349–354, 2002.
  • [51] Farshideh Einsele and Hassan Foroosh. Recognition of grocery products in images captured by cellular phones. In Proc. International Conference on Computer Vision and Image Processing (ICCVIP), 2015.
  • [52] SL Feng, R. Manmatha, and V. Lavrenko. Multiple bernoulli relevance models for image and video annotation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II-1002, 2004.
  • [53] Yansong Feng and Mirella Lapata. Topic models for image annotation and text illustration. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 831–839, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
  • [54] Yansong Feng and Mirella Lapata. Visual information in semantic representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 91–99. Association for Computational Linguistics, 2010.
  • [55] H Foroosh. Adaptive estimation of motion using generalized cross validation. In 3rd International (IEEE) Workshop on Statistical and Computational Theories of Vision, 2003.
  • [56] Hassan Foroosh. A closed-form solution for optical flow by imposing temporal constraints. In Proc. of IEEE International Conf. on Image Processing (ICIP), volume 3, pages 656–659, 2001.
  • [57] Hassan Foroosh. An adaptive scheme for estimating motion. In Proc. of IEEE International Conf. on Image Processing (ICIP), volume 3, pages 1831–1834, 2004.
  • [58] Hassan Foroosh. Pixelwise adaptive dense optical flow assuming non-stationary statistics. IEEE Trans. on Image Processing, 14(2):222–230, 2005.
  • [59] Hassan Foroosh and Murat Balci. Sub-pixel registration and estimation of local shifts directly in the fourier domain. In Proc. International Conference on Image Processing (ICIP), volume 3, pages 1915–1918, 2004.
  • [60] Hassan Foroosh and Murat Balci. Subpixel registration and estimation of local shifts directly in the fourier domain. In Proc. of IEEE International Conference on Image Processing (ICIP), volume 3, pages 1915–1918, 2004.
  • [61] Hassan Foroosh, Murat Balci, and Xiaochun Cao. Self-calibrated reconstruction of partially viewed symmetric objects. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages ii–869, 2005.
  • [62] Hassan Foroosh and W Scott Hoge. Motion information in the phase domain. In Video registration, pages 36–71. Springer, 2003.
  • [63] Hassan Foroosh, Josiane Zerubia, and Marc Berthod. Extension of phase correlation to subpixel registration. IEEE Trans. on Image Processing, 11(3):188–200, 2002.
  • [64] Tao Fu and Hassan Foroosh. Expression morphing from distant viewpoints. In Proc. of IEEE International Conference on Image Processing (ICIP), volume 5, pages 3519–3522, 2004.
  • [65] M.P. Gangan and R. Karthi. Automatic image annotation by classification using mpeg-7 features.
  • [66] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, SIGIR ’03, pages 119–126, New York, NY, USA, 2003. ACM.
  • [67] R. Jin, J.Y. Chai, and L. Si. Effective automatic image annotation via a coherent language model and active learning. In Proceedings of the 12th annual ACM international conference on Multimedia, pages 892–899. ACM, 2004.
  • [68] Yohan Jin, Latifur Khan, Lei Wang, and Mamoun Awad. Image annotations by combining multiple evidence & wordnet. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 706–715. ACM, 2005.
  • [69] I Junejo, A Bhutta, and Hassan Foroosh. Dynamic scene modeling for object detection using single-class svm. In Proc. of IEEE International Conference on Image Processing (ICIP), volume 1, pages 1541–1544, 2010.
  • [70] Imran Junejo, Xiaochun Cao, and Hassan Foroosh. Configuring mixed reality environment. In Proc. of IEEE International Conference on Advanced Video and Signal-based Surveillance, pages 884–887, 2006.
  • [71] Imran Junejo, Xiaochun Cao, and Hassan Foroosh. Geometry of a non-overlapping multi-camera network. In Proc. of IEEE International Conference on Advanced Video and Signal-based Surveillance, pages 43–48, 2006.
  • [72] Imran Junejo, Xiaochun Cao, and Hassan Foroosh. Autoconfiguration of a dynamic non-overlapping camera network. IEEE Trans. Systems, Man, and Cybernetics, 37(4):803–816, 2007.
  • [73] Imran Junejo and Hassan Foroosh. Dissecting the image of the absolute conic. In Proc. of IEEE Int. Conf. on Video and Signal Based Surveillance, pages 77–77, 2006.
  • [74] Imran Junejo and Hassan Foroosh. Robust auto-calibration from pedestrians. In Proc. IEEE International Conference on Video and Signal Based Surveillance, pages 92–92, 2006.
  • [75] Imran Junejo and Hassan Foroosh. Euclidean path modeling from ground and aerial views. In Proc. International Conference on Computer Vision (ICCV), pages 1–7, 2007.
  • [76] Imran Junejo and Hassan Foroosh. Trajectory rectification and path modeling for surveillance. In Proc. International Conference on Computer Vision (ICCV), pages 1–7, 2007.
  • [77] Imran Junejo and Hassan Foroosh. Using calibrated camera for euclidean path modeling. In Proceedings of IEEE International Conference on Image Processing (ICIP), pages 205–208, 2007.
  • [78] Imran Junejo and Hassan Foroosh. Euclidean path modeling for video surveillance. Image and Vision Computing (IVC), 26(4):512–528, 2008.
  • [79] Imran Junejo and Hassan Foroosh. Camera calibration and geo-location estimation from two shadow trajectories. Computer Vision and Image Understanding (CVIU), 114:915–927, 2010.
  • [80] Imran Junejo and Hassan Foroosh. Gps coordinates estimation and camera calibration from solar shadows. Computer Vision and Image Understanding (CVIU), 114(9):991–1003, 2010.
  • [81] Imran Junejo and Hassan Foroosh. Optimizing ptz camera calibration from two images. Machine Vision and Applications (MVA), pages 1–15, 2011.
  • [82] Imran N Junejo, Nazim Ashraf, Yuping Shen, and Hassan Foroosh. Robust auto-calibration using fundamental matrices induced by pedestrians. In Proc. International Conf. on Image Processing (ICIP), volume 3, pages III–201, 2007.
  • [83] Imran N. Junejo, Adeel Bhutta, and Hassan Foroosh. Single-class svm for dynamic scene modeling. Signal Image and Video Processing, 7(1):45–52, 2013.
  • [84] Imran N Junejo, Xiaochun Cao, and Hassan Foroosh. Calibrating freely moving cameras. In Proc. International Conference on Pattern Recognition (ICPR), volume 4, pages 880–883, 2006.
  • [85] Imran N. Junejo and Hassan Foroosh. Trajectory rectification and path modeling for video surveillance. In Proc. International Conference on Computer Vision (ICCV), pages 1–7, 2007.
  • [86] Imran N. Junejo and Hassan Foroosh. Estimating geo-temporal location of stationary cameras using shadow trajectories. In Proc. European Conference on Computer Vision (ECCV), 2008.
  • [87] Imran N. Junejo and Hassan Foroosh. Gps coordinate estimation from calibrated cameras. In Proc. International Conference on Pattern Recognition (ICPR), 2008.
  • [88] Imran N Junejo and Hassan Foroosh. Gps coordinate estimation from calibrated cameras. In Proc. International Conference on Pattern Recognition (ICPR), pages 1–4, 2008.
  • [89] Imran N. Junejo and Hassan Foroosh. Practical ptz camera calibration using givens rotations. In Proc. IEEE International Conference on Image Processing (ICIP), 2008.
  • [90] Imran N. Junejo and Hassan Foroosh. Practical pure pan and pure tilt camera calibration. In Proc. International Conference on Pattern Recognition (ICPR), 2008.
  • [91] Imran N. Junejo and Hassan Foroosh. Refining ptz camera calibration. In Proc. International Conference on Pattern Recognition (ICPR), 2008.
  • [92] Imran N. Junejo and Hassan Foroosh. Using solar shadow trajectories for camera calibration. In Proc. IEEE International Conference on Image Processing (ICIP), 2008.
  • [93] V. Lavrenko, R. Manmatha, and J. Jeon. A model for learning the semantics of pictures. NIPS, 2003.
  • [94] H. Liu and P. Singh. Conceptnet-a practical commonsense reasoning tool-kit. BT technology journal, 22(4):211–226, 2004.
  • [95] Sina Lotfian and Hassan Foroosh. View-invariant object recognition using homography constraints. In Proc. IEEE International Conference on Image Processing (ICIP), 2017.
  • [96] Fei Lu, Xiaochun Cao, Yuping Shen, and Hassan Foroosh. Camera calibration from two shadow trajectories. In Proc. of IEEE International Conference on Advanced Video and Signal-based Surveillance, volume 2.
  • [97] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. Computer Vision–ECCV 2008, pages 316–329, 2008.
  • [98] Brian Millikan, Aritra Dutta, Qiyu Sun, and Hassan Foroosh. Compressed infrared target detection using stochastically trained least squares. IEEE Transactions on Aerospace and Electronic Systems, accepted for publication, 2017.
  • [99] Brian Millikan, Aritra Dutta, Nazanin Rahnavard, Qiyu Sun, and Hassan Foroosh. Initialized iterative reweighted least squares for automatic target recognition. In Military Communications Conference, MILCOM, IEEE, pages 506–510, 2015.
  • [100] Brian A. Millikan, Aritra Dutta, Nazanin Rahnavard, Qiyu Sun, and Hassan Foroosh. Initialized iterative reweighted least squares for automatic target recognition. In Proc. of MILICOM, 2015.
  • [101] Brendan Moore, Marshall Tappen, and Hassan Foroosh. Learning face appearance under different lighting conditions. In Proc. IEEE Int. Conf. on Biometrics: Theory, Applications and Systems, pages 1–8, 2008.
  • [102] Dustin Morley and Hassan Foroosh. Improving ransac-based segmentation through cnn encapsulation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [103] H Shekarforoush and R Chellappa. A multifractal formalism for stabilization and activity detection in flir sequences. In Proceedings, ARL Federated Laboratory 4th Annual Symposium, pages 305–309, 2000.
  • [104] Hassan Shekarforoush, Marc Berthod, and Josiane Zerubia. Subpixel image registration by estimating the polyphase decomposition of the cross power spectrum. Technical report, INRIA, 1995.
  • [105] Hassan Shekarforoush, Marc Berthod, and Josiane Zerubia. Subpixel image registration by estimating the polyphase decomposition of cross power spectrum. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 532–537, 1996.
  • [106] Hassan Shekarforoush and Rama Chellappa. A multi-fractal formalism for stabilization, object detection and tracking in flir sequences. In Proc. of International Conference on Image Processing (ICIP), volume 3, pages 78–81, 2000.
  • [107] Yuping Shen, Nazim Ashraf, and Hassan Foroosh. Action recognition based on homography constraints. In Proc. of IAPR Int. Conf. Pattern Recognition (ICPR), pages 1–4, 2008.
  • [108] Yuping Shen and Hassan Foroosh. View-invariant action recognition using fundamental ratios. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–6, 2008.
  • [109] Yuping Shen and Hassan Foroosh. View invariant action recognition using fundamental ratios. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [110] Yuping Shen and Hassan Foroosh. View-invariant recognition of body pose from space-time templates. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–6, 2008.
  • [111] Yuping Shen and Hassan Foroosh. View invariant recognition of body pose from space-time templates. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [112] Yuping Shen and Hassan Foroosh. View-invariant action recognition from point triplets. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31(10):1898–1905, 2009.
  • [113] Yuping Shen, Fei Lu, Xiaochun Cao, and Hassan Foroosh. Video completion for perspective camera under constrained motion. In Proc. of IAPR Int. Conf. Pattern Recognition (ICPR), volume 3, pages 63–66, 2006.
  • [114] Chen Shu, Luming Liang, Wenzhang Liang, and Hassan Foroosh. 3d pose tracking with multitemplate warping and sift correspondences. IEEE Trans. on Circuits and Systems for Video Technology, 26(11):2043–2055, 2016.
  • [115] M. Srikanth, J. Varner, M. Bowden, and D. Moldovan. Exploiting ontologies for automatic image annotation. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 552–558. ACM, 2005.
  • [116] Chuan Sun and Hassan Foroosh. Should we discard sparse or incomplete videos? In Proceedings of IEEE International Conference on Image Processing (ICIP), pages 2502–2506, 2014.
  • [117] Chuan Sun, Imran Junejo, and Hassan Foroosh. Action recognition using rank-1 approximation of joint self-similarity volume. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 1007–1012, 2011.
  • [118] Chuan Sun, Imran Junejo, and Hassan Foroosh. Motion retrieval using low-rank subspace decomposition of motion volume. In Computer Graphics Forum, volume 30, pages 1953–1962. Wiley, 2011.
  • [119] Chuan Sun, Imran Junejo, and Hassan Foroosh. Motion sequence volume based retrieval for 3d captured data. Computer Graphics Forum, 30(7):1953–1962, 2012.
  • [120] Chuan Sun, Imran Junejo, Marshall Tappen, and Hassan Foroosh. Exploring sparseness and self-similarity for action recognition. IEEE Transactions on Image Processing, 24(8):2488–2501, 2015.
  • [121] Chuan Sun, Marshall Tappen, and Hassan Foroosh. Feature-independent action spotting without human localization, segmentation or frame-wise tracking. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2689–2696, 2014.
  • [122] Amara Tariq and Hassan Foroosh. Scene-based automatic image annotation. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 3047–3051, 2014.
  • [123] Amara Tariq and Hassan Foroosh. Feature-independent context estimation for automatic image annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1958–1965, 2015.
  • [124] Amara Tariq, Asim Karim, and Hassan Foroosh. A context-driven extractive framework for generating realistic image descriptions. IEEE Transactions on Image Processing, 26(2):619–632, 2017.
  • [125] Amara Tariq, Asim Karim, and Hassan Foroosh. Nelasso: Building named entity relationship networks using sparse structured learning. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2017.
  • [126] Amara Tariq, Asim Karim, Fernando Gomez, and Hassan Foroosh. Exploiting topical perceptions over multi-lingual text for hashtag suggestion on twitter. In The Twenty-Sixth International FLAIRS Conference, 2013.
  • [127] Changhu Wang, Feng Jing, Lei Zhang, and Hong-Jiang Zhang. Content-based image annotation refinement. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
  • [128] Jiangjian Xiao, Xiaochun Cao, and Hassan Foroosh. 3d object transfer between non-overlapping videos. In Proc. of IEEE Virtual Reality Conference, pages 127–134, 2006.
  • [129] Jiangjian Xiao, Xiaochun Cao, and Hassan Foroosh. A new framework for video cut and paste. In Proc. of Int. Multi-Media Modelling Conference, 8 pp., 2006.
  • [130] Changqing Zhang, Xiaochun Cao, and Hassan Foroosh. Constrained multi-view video face clustering. IEEE Transactions on Image Processing, 24(11):4381–4393, 2015.
  • [131] Dengsheng Zhang, Md Monirul Islam, and Guojun Lu. A review on automatic image annotation techniques. Pattern Recognition, 45(1):346–362, 2012.