Utilizing Complex-valued Network for Learning to Compare Image Patches
At present, the success of convolutional neural networks (CNNs) in feature and metric learning has attracted many researchers. However, the vast majority of deep network architectures represent data with real values. Complex-valued networks have received little attention because effective models and a suitable distance for complex-valued vectors have been lacking.
Motivated by recent works showing that complex vectors have a richer representational capacity, and by recently reported efficient complex blocks, we propose a new approach for learning image descriptors with complex numbers to compare image patches. We also propose a new architecture that learns an image similarity function directly with a complex-valued network. We show that our models can significantly outperform the state of the art on benchmark datasets. We make the source code of our models publicly available.
One of the fundamental tasks in computer vision is image matching, which plays an important role in larger tasks such as image retrieval, robot navigation and texture classification. The general idea explored here is to learn an embedding vector for each image with a deep neural network, in particular a deep convolutional neural network. Traditionally, siamese architectures are widely adopted [9, 32, 13, 6, 30, 2]. As shown in Figure 1, optimizing the weights of the siamese network reduces the positive distance (D1) between the descriptors of similar images and enlarges the negative distances (D2, D3) between the descriptors of different images. In this way, different images are dispersed as far as possible in the embedding space. The descriptors can also serve other tasks, including image clustering, face verification and semantic similarity.
With the introduction of complex blocks, solutions to the problems of initialization, batch normalization and activation in complex networks have become available, and deep learning based on complex values is receiving increasing attention from researchers.
According to [24, 12, 22, 14, 26], complex networks have the potential to enable easier learning and better generalization, and to preserve detail. In a real-valued encoding network, details of the image may be lost when the descriptor length is poorly chosen, whereas in a complex-valued network the phase of the descriptor vector can help the decoder recover them. Recent works have shown that the phase provides a detailed description of objects because it carries information about the shapes, edges and orientations in images.
To make full use of complex-valued feature representation, we propose a distance metric for complex vectors, adapt the triplet loss to complex numbers, and apply both to learning image matching and descriptors. Our main contributions in this paper are as follows:
1. We propose the complex triple net, which is used to extract complex descriptors.
2. We utilize a 2-ch net to process the real part and the imaginary part of the complex feature.
3. We present a new metric for the distance between complex-valued embeddings, modeled on the absolute distance.
4. We achieve a state-of-the-art result in image patch comparison on the Photo-Tour dataset.
2 Related Work
In the past, the primary approach to patch comparison was to use hand-crafted descriptors such as SIFT and to compare them by squared Euclidean distance. Early methods of using CNNs to extract features had many problems, but with the development of neural networks in recent years, new architectures have emerged, such as AlexNet, VggNet and ResNet. At the same time, training techniques in deep learning such as dropout and batch normalization have gradually improved. Utilizing CNNs for feature extraction has become the major approach to image patch matching.
As a branch of the development of convolutional networks, the siamese network is a common method for descriptor learning and patch comparison. In 1993, Bromley, LeCun et al. utilized siamese networks for signature verification on American checks, that is, to verify whether the signature on a check is consistent with the signature held by the bank. This is believed to be the first appearance of siamese networks. Siamese networks were later utilized for face recognition: by comparing the Euclidean distances between image descriptors, the descriptors of face images can be used to match new samples of previously unseen categories. However, because descriptors are difficult to learn during training, the 2-ch net was proposed. Instead of comparing descriptor distances, it learns an image similarity function through the neural network to match image patches.
In [13, 23], the triplet network is proposed, which changes the two-branch structure into three branches. Its advantage lies in distinguishing details: if the input images are similar, the triplet network can describe the details better. The idea of the triplet network is equivalent to adding two measures of the differences between inputs and learning a better representation of them. The triplet network is also utilized for face recognition due to its advantage in descriptor learning. In further work, the loss function of the triplet network was improved, giving the PNSoft Loss.
Similar to siamese networks, complex networks also started in the 1990s. The theory and error propagation mechanism of complex networks have been explored by researchers [10, 20, 18, 17]. In the complex number field, the characteristics of the data representation are easier to abstract. In some tasks, complex-valued networks have shown significant advantages [14, 26, 8]. Although research has shown the potential and room for extension of complex networks, their development has been marginalized because the techniques for training them still needed improvement. In addition, complex networks require longer training time than real-valued networks. Recently, C. Trabelsi et al. proposed solutions for initialization, activation and batch normalization in complex networks, which greatly reduced the difficulty of training complex-valued networks.
Based on the above work, we propose a new approach that utilizes complex networks for learning to compare image patches. We present two architectures for comparing image patches:
1. Complex Channel Net: learning to compare image patches directly
2. Complex Triple Net: learning to compare image patches via descriptors
For the representation problem of complex-valued vectors, we propose two solutions. The first, called complex channel net, outputs an index of image similarity by learning a similarity function. The second, called complex triple net, learns complex-valued descriptors of images. The two differ in both training and testing. In the training step, the input of complex channel net is a pair of image patches, while the input of complex triple net is a triplet of image patches. In the test step, complex channel net outputs the similarity index directly from a pair of images, whereas complex triple net computes a descriptor for each image and then compares the distance between descriptors to get the result. Drawing on the absolute value distance in the real number field, we propose a distance for complex-valued vectors. The two networks are introduced in sections 3.4 and 3.5, respectively.
In addition, the two structures are built from three modules: the complex feature module, the complex decision module and the complex metric module. In section 3.1, we introduce the complex feature module, which consists of three complex-valued blocks. Both of the networks we propose use it as the underlying network. In section 3.2, we propose the decision module, which we use to combine the real and imaginary parts of the complex-valued vector. In section 3.3, we introduce the metric module and describe in detail the formula for measuring the distance between complex-valued vectors.
3.1 Complex Feature Module
He et al. proposed the deep residual network, which allows the number of layers to increase while avoiding degradation by means of identity mappings. It has been widely used in various network frameworks. The traditional residual block is shown in Figure 2(a). The output of this block can be expressed as:
$$y = \mathcal{F}(x, \{W_i\}) + x$$
where $\mathcal{F}$ represents the operations within a single real-valued residual block and $\{W_i\}$ represents the parameters of the network to be learned.
The structure of the real-valued residual block provides ideas for building a complex-valued residual block. Complex-valued networks have proven to be competitive, but the initialization, the complex BN algorithm and the activation of complex-valued neural networks have hindered their development. We utilized the complex blocks proposed by Trabelsi et al. and made some improvements. As shown in Figure 2(b), the structure of the complex-valued block is the same as that of the real-valued one, except that the BN, Relu and Conv operations in it are based on complex numbers. The specific approaches are as follows.
a. Complex BN Layer
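In the real field, the formula for the BN algorithm is:
$$\mathrm{BN}(x) = \gamma \hat{x} + \beta$$
where $\gamma$ and $\beta$ are learnable scale and shift parameters. $\hat{x}$ is the result of normalizing the current sample $x$ over the entire batch $B$ with mean $\mu_B$ and variance $\sigma_B^2$, represented as $\hat{x} = (x - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}$. For complex inputs, Trabelsi et al. used the covariance matrix $V$ to calculate it, and the formula is as follows:
$$\tilde{x} = V^{-1/2}\,(x - \mathbb{E}[x]), \qquad V = \begin{pmatrix} \mathrm{Cov}(\Re(x), \Re(x)) & \mathrm{Cov}(\Re(x), \Im(x)) \\ \mathrm{Cov}(\Im(x), \Re(x)) & \mathrm{Cov}(\Im(x), \Im(x)) \end{pmatrix}$$
where $\Re(x)$ and $\Im(x)$ represent the real and imaginary parts of the feature vector $x$, respectively. In contrast, to simplify the calculation, we apply the BN algorithm to the real part and the imaginary part separately. The expression is:
$$\mathrm{CBN}(x) = \mathrm{BN}(\Re(x)) + i\,\mathrm{BN}(\Im(x))$$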
The improved complex BN algorithm also achieved competitive performance.
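The separate treatment of the two parts can be illustrated with a minimal sketch. It uses only the Python standard library and a toy batch of complex scalars; the function names, the default `gamma`, `beta` and `eps` values, and the sample batch are assumptions made for illustration and are not taken from the released code.

```python
import statistics

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Real-valued batch normalization of a 1-D batch of scalars."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)  # biased (population) variance
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in values]

def split_complex_bn(batch, eps=1e-5):
    """Normalize the real and imaginary parts independently, then recombine."""
    real = batch_norm([z.real for z in batch], eps=eps)
    imag = batch_norm([z.imag for z in batch], eps=eps)
    return [complex(r, i) for r, i in zip(real, imag)]

# Toy batch whose imaginary parts have a much larger scale than the real parts.
batch = [1 + 10j, 2 + 20j, 3 + 30j, 4 + 40j]
normalized = split_complex_bn(batch)
```

After this step both parts have approximately zero mean and unit variance. Unlike the covariance-matrix formulation, the sketch leaves any correlation between the real and imaginary parts untouched, which is the price of the simpler computation.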
b. CRelu Layer
A comparison of different complex activation functions has been given in prior work. The activation we use in the CRelu layer is the one that performed best in that comparison. The output of the complex BN layer is separated into its real part and imaginary part, each of which is activated by the Relu function:
$$\mathrm{CRelu}(z) = \mathrm{Relu}(\Re(z)) + i\,\mathrm{Relu}(\Im(z))$$
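As a small illustration of this activation, the sketch below applies Relu to each part of a few complex scalars, one from each quadrant of the complex plane. The function name `crelu` and the sample inputs are chosen here for illustration only.

```python
def crelu(z):
    """Apply Relu to the real and imaginary parts of a complex number separately."""
    return complex(max(z.real, 0.0), max(z.imag, 0.0))

# One input per quadrant of the complex plane.
inputs = [1 + 2j, -1 + 2j, 1 - 2j, -1 - 2j]
outputs = [crelu(z) for z in inputs]
```

Only the first-quadrant input passes through unchanged; the others are projected onto the nearest non-negative axis or onto the origin.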
c. Complex Conv Layer
In the complex conv layer, we define a complex convolution kernel $W = A + iB$, where $A$ and $B$ are both real-valued matrices, and a complex-valued vector $h = x + iy$, where $x$ and $y$ are both real vectors. The complex convolution operation is then:
$$W * h = (A * x - B * y) + i\,(B * x + A * y)$$
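The decomposition into four real convolutions can be checked numerically. The following sketch uses a one-dimensional full convolution for brevity, whereas the network itself operates on 2-D feature maps; the helper names and the toy kernel and input values are assumptions for illustration.

```python
def conv1d(kernel, signal):
    """Full 1-D convolution of two real sequences."""
    out = [0.0] * (len(kernel) + len(signal) - 1)
    for i, k in enumerate(kernel):
        for j, s in enumerate(signal):
            out[i + j] += k * s
    return out

def complex_conv1d(A, B, x, y):
    """Convolve W = A + iB with h = x + iy using four real convolutions."""
    real = [p - q for p, q in zip(conv1d(A, x), conv1d(B, y))]
    imag = [p + q for p, q in zip(conv1d(B, x), conv1d(A, y))]
    return [complex(r, i) for r, i in zip(real, imag)]

A, B = [1.0, 2.0], [0.5, -1.0]             # real and imaginary parts of the kernel
x, y = [1.0, 0.0, -1.0], [2.0, 1.0, 0.0]   # real and imaginary parts of the input
result = complex_conv1d(A, B, x, y)

# Reference: the same convolution carried out directly on complex numbers.
W = [complex(a, b) for a, b in zip(A, B)]
h = [complex(p, q) for p, q in zip(x, y)]
reference = [0j] * (len(W) + len(h) - 1)
for i, w in enumerate(W):
    for j, v in enumerate(h):
        reference[i + j] += w * v
```

Both computations give the same values, which is why a complex convolution layer can be built from ordinary real-valued convolution routines.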
We construct the complex feature module by stacking three complex blocks; its input and output are both complex-valued. It is the basis of both complex channel net and complex triple net.
3.2 Decision Module
The decision module sits at the top of complex channel net. In the complex feature module, each pair of input images yields the corresponding complex feature vectors. We believe that these complex-valued vectors can express the similarity features of the two images. The decision module relies on these complex vectors to determine whether the input images are similar.
As shown in Figure 3, the decision module separates the complex vectors into real and imaginary parts, which serve as the input of a siamese network. According to the two forms of siamese network, we present the following two schemes:
Siamese net: As shown in Figure 3(a), the part in the dashed box is a siamese network. Real and Imag denote the real part and the imaginary part of the feature vector, which are fed to the network separately. After passing through three fully connected layers with shared weights, the two outputs are concatenated and sent to a fully connected layer with a single output and sigmoid activation.
Pseudo-siamese net: This scheme creates two networks that do not share variables, as shown in Figure 3(b). Real and Imag are fed into the two networks in the dotted box, respectively. The structure is similar to that of the siamese net, except that the weights of the networks are not shared. Their outputs are concatenated and fed into the top-level fully connected layer with a single output and sigmoid activation.
In this paper, we use the siamese net. Unlike metric learning, the decision module learns a similarity function that outputs the index of similarity end to end. The advantage of this method is that it is easier to train. However, when classifying an unknown sample, we need to compare it exhaustively against all known samples, which is very expensive.
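The siamese scheme can be sketched as a forward pass in plain Python. Everything concrete here is a toy assumption: the layer widths (4, 8, 8, 8), the Gaussian initialization, the random seed and the example feature vector are illustrative and do not reflect the sizes used in the paper.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (toy initialization)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def branch(x, layers):
    """Stack of fully connected layers with Relu, applied to one part of the feature."""
    for weights, bias in layers:
        x = [max(v, 0.0) for v in dense(x, weights, bias)]
    return x

def decision_module(feature, shared_layers, head):
    """Siamese scheme: real and imaginary parts pass through the same layers."""
    real_out = branch([z.real for z in feature], shared_layers)
    imag_out = branch([z.imag for z in feature], shared_layers)
    weights, bias = head
    logit = dense(real_out + imag_out, weights, bias)[0]  # concatenate, single output
    return 1.0 / (1.0 + math.exp(-logit))                 # sigmoid similarity index

rng = random.Random(0)
shared_layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 8, rng)]
head = init_layer(16, 1, rng)
feature = [0.5 + 1.0j, -0.2 + 0.3j, 1.5 - 0.7j, 0.0 + 0.1j]
score = decision_module(feature, shared_layers, head)
```

A pseudo-siamese variant would differ only in passing a second, independently initialized list of layers to the imaginary branch instead of reusing `shared_layers`.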
3.3 Metric Module
The metric module sits at the top of complex triple net. Its main role is to learn descriptors from the complex-valued feature vectors extracted by the complex feature module.
In the training step, the input of complex triple net is a triplet of images. The complex feature module is used to extract the complex-valued vectors. The functions of the metric module include dimensionality reduction, l2-normalization and loss optimization.
In the test step, the metric module converts the complex feature vector extracted by the complex feature module into a complex descriptor. Image similarity can then be determined by comparing the distance between descriptors.
As shown in Figure 4(a), during the training step three images are fed into the complex feature module, two of which are similar while the third is different. The complex feature module produces three complex vectors, P1, P2 and N, where P1 and P2 are the complex-valued feature vectors of the two similar images and N is the complex-valued feature vector of the third image.
The metric module has two layers: a complex fully connected layer and a complex l2-normalization layer. The PNSoft Loss has been used for descriptor learning and has achieved state-of-the-art results. We use this loss function but replace the Euclidean distance in it with a distance for complex-valued vectors. Finally, image descriptors are learned by minimizing the modified PNSoft Loss.
The PNSoft Loss is calculated by the following formula:
Here the descriptors are the outputs of the metric module for P1, P2 and N, namely the descriptors we learn. The loss uses the Euclidean distance between pairs of descriptors, which gives three different distances: a pair of negative distances, between each of the two similar images and the third image, and one positive distance, between the two similar images. Only the minimum of the two negative distances enters the loss.
This formula means that, among these three distances, the smaller negative distance should be greater than the positive distance. When the positive distance is 0, the first term of the formula is 0, and when the negative distance is infinite, the second term of the formula is 0. Minimizing the loss therefore reduces the positive distance and increases the negative distance.
However, the Euclidean distance cannot describe the distance between complex vectors. Therefore, we modified the distance function D in the PNSoft Loss and propose the following formula to measure the distance between vectors in the complex field.
We also modified the l2-normalization layer. When dealing with complex-valued features, the complex vector is separated into its real part and imaginary part, which are treated separately. Our complex l2-normalization function is as follows:
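The separate treatment of the two parts can be sketched in code. This is only one possible reading of the description, namely that the real part and the imaginary part are each scaled to unit l2 norm; the exact normalization used in the paper is not reproduced here, and the function names and the `eps` guard are assumptions.

```python
import math

def l2_normalize(values, eps=1e-12):
    """Scale a real vector to unit l2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / max(norm, eps) for v in values]

def complex_l2_normalize(vec):
    """Assumed reading: normalize the real and imaginary parts independently."""
    real = l2_normalize([z.real for z in vec])
    imag = l2_normalize([z.imag for z in vec])
    return [complex(r, i) for r, i in zip(real, imag)]

descriptor = complex_l2_normalize([3 + 1j, 4 - 1j])
```

Under this reading the real parts and the imaginary parts of a descriptor each lie on a unit sphere, so distances between descriptors are bounded regardless of the scale of the raw features.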
In the test step, as shown in Figure 4(b), the images to be compared (T1, T2) are fed into the complex feature module at the bottom, and the metric module produces the image descriptors from the complex feature vectors. The similarity of the images can be determined by comparing the distance between their descriptors.
3.4 Complex Channel Net
The complex channel net consists of the complex feature module and the decision module. Its main function is to learn a similarity function and directly output whether the input images are similar.
As shown in Figure 5(a), its input is a pair of images stacked together, also known as a 2-channel image. After the first convolutional layer, the 2-channel image is converted into multi-channel feature maps. We convert them to complex-valued form with the Fast Fourier Transform (FFT) and send them to the complex feature module.
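The conversion from a real-valued feature map to a complex-valued one can be illustrated on a tiny example. The sketch below uses a naive discrete Fourier transform, which returns the same values as an FFT but in quadratic time, so that it runs with the standard library alone; the 2x2 feature map is a made-up example and the function names are assumptions.

```python
import cmath

def dft(seq):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(seq)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(seq))
        for k in range(n)
    ]

def dft2(feature_map):
    """2-D DFT: transform every row, then every column."""
    rows = [dft(row) for row in feature_map]
    cols = [dft(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

feature_map = [[1.0, 2.0], [3.0, 4.0]]  # one real-valued channel
spectrum = dft2(feature_map)            # complex-valued input for the complex blocks
```

In practice an optimized FFT routine from a numerical library would replace `dft2`; the point of the sketch is only that each real channel becomes a complex array of the same shape.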
Consistent with the previous description, the complex feature module in complex channel net consists of three complex blocks. Its output is divided into real and imaginary parts, which serve as the input to the decision module. In the decision module, a siamese network processes the real and imaginary parts separately; their outputs are concatenated and fed to the fully connected layer with a single output at the top of the network, which is activated with the sigmoid function.
In the complex feature module, we add a max-pooling layer after each complex block. Instead of pooling according to the modulus (amplitude), we pool the real part and the imaginary part separately. This can be expressed as:
$$\mathrm{CMaxPool}(z) = \mathrm{MaxPool}(\Re(z)) + i\,\mathrm{MaxPool}(\Im(z))$$
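A minimal one-dimensional sketch of this pooling is given below. The window size of 2, the non-overlapping stride and the sample vector are assumptions made for illustration; the network itself pools 2-D feature maps.

```python
def max_pool_1d(values, size=2):
    """Non-overlapping max pooling over a 1-D real sequence."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def complex_max_pool_1d(vec, size=2):
    """Pool the real and imaginary parts separately, then recombine."""
    real = max_pool_1d([z.real for z in vec], size)
    imag = max_pool_1d([z.imag for z in vec], size)
    return [complex(r, i) for r, i in zip(real, imag)]

pooled = complex_max_pool_1d([1 + 4j, 3 + 2j, -1 - 1j, 0 - 5j])
```

Because the two parts are pooled independently, a pooled value need not equal any single input: in the first window the real part comes from the second element and the imaginary part from the first. Modulus-based pooling, by contrast, would always select one of the original complex values.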
3.5 Complex Triple Net
The complex triple net consists of the complex feature module and the metric module. Its main function is to learn the descriptor of the image. As shown in Figure 5(b), in the training step the input of complex triple net is a triplet of images, two of which are similar while the third differs from both. They enter through different branches, and the weights of the first convolutional layer and of the complex feature module are shared across the three branches. In this model, the descriptor is obtained by sending the output of the complex feature module to the metric module, which reduces the dimension and applies equation (10). The result is a complex-valued descriptor. Using the distances between complex-valued descriptors given by equation (7), we increase the negative distance and reduce the positive distance by minimizing the loss function described in equation (8).
In the test step, each image is directly converted to a descriptor. We can analyze the similarity of images by comparing the distances between descriptors, as shown in Figure 5(c).
The structures of the convolutional layer and the complex feature module are the same in the two networks, and complex triple net also uses complex pooling. Complex channel net trains faster than complex triple net. However, it cannot give a representation of the image in the embedding space, so in the test step the test images must be compared with each group of images in the training set. Complex triple net can embed images into complex space, so when performing image classification we can quickly obtain results with a simple fully connected network, KNN, SVM or other classification algorithms.
| FC / Complex FC | 4096, relu | 128, Cl2-norm |
In this section, we provide the experimental details and the evaluation of image matching. In section 4.1, we introduce the dataset used in this paper. Training details of complex channel net and complex triple net are given in sections 4.2 and 4.3, respectively. The experimental results can be found in section 4.4.
The Photo-Tour dataset has been widely used in image matching and descriptor learning. It consists of three subsets, named Notredame, Liberty and Yosemite, totaling more than 1.5 million patches. Each subset contains multiple sets of similar images. Many researchers use it as a standard benchmark for testing metric learning [32, 30, 2, 4]. For each subset, the authors of the dataset provide 100k, 200k and 500k labeled matching pairs, with positive and negative samples each making up 50%, which is helpful for training. We used the 500k pairs of one subset as the training set and the 100k patch pairs of the other subsets as the test sets to verify the effect of our model.
4.2 Complex Channel Net
Complex channel net, as described in section 3.4, is trained end to end to directly learn a similarity function whose input is two pictures and whose output is a probability value. It is trained with the Adam optimizer. We set the learning rate to 1e-3 and the batch size to 128, and use mean squared error (MSE) as the loss function of the network.
After 20 epochs, we achieved better results than previous descriptor learning methods without data augmentation.
Although our model has a few percentage points more error than 2-ch Net, as shown in table 1, it is worth noting that we do not use any data augmentation.
4.3 Complex Triple Net
Complex triple net, as described in section 3.5, determines the similarity of images based on the descriptor distance. Although it is slower to train than complex channel net, it produces descriptors for different images and measures the distances between them.
In complex triple net, we use the modified PNSoft Loss as the loss function. The image descriptors are learned in complex space by optimizing the distances among the descriptors of each triplet. As with complex channel net, we use the Adam optimizer with a learning rate of 1e-3 and a batch size of 128. Training of complex triple net completed in around 20 epochs. The training triplets were derived from the 500k image pairs provided by the authors of the dataset.
The test results of our proposed architectures on Photo-Tour are shown in table 1. The structures of complex channel net (CCN) and complex triple net (CTN) are described in detail in table 2. We report the false positive rate at 95% recall (FPR95) on each of the six combinations of training and test sets, as well as the mean across all combinations. Complex triple net achieves state-of-the-art results. The ROC curves of our tests are shown in Figure 6. The results show that, by measuring distances in complex space, complex triple net can learn a richer description of image features, especially details.
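For readers unfamiliar with the metric, the following sketch shows one common way to compute FPR95 from labeled pair distances. The function name, the threshold convention (the smallest distance threshold that accepts at least 95% of the matching pairs) and the toy data are assumptions for illustration and are not taken from the evaluation code of the paper.

```python
import math

def fpr_at_recall(labels, distances, recall=0.95):
    """False positive rate at the smallest threshold that reaches `recall`.

    labels: 1 for matching pairs, 0 for non-matching pairs.
    distances: descriptor distances, where smaller means more similar.
    """
    positives = sorted(d for lab, d in zip(labels, distances) if lab == 1)
    negatives = [d for lab, d in zip(labels, distances) if lab == 0]
    # Number of matching pairs that must be accepted to reach the recall level.
    k = math.ceil(recall * len(positives))
    threshold = positives[k - 1]
    false_positives = sum(1 for d in negatives if d <= threshold)
    return false_positives / len(negatives)

# Toy data: 20 matching pairs and 10 non-matching pairs.
pos = [float(d) for d in range(1, 21)]
neg = [10.5, 18.5, 19.5, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
labels = [1] * len(pos) + [0] * len(neg)
fpr95 = fpr_at_recall(labels, pos + neg)
```

For a network that outputs a similarity score rather than a distance, such as complex channel net, the negated score can be passed as the distance so that smaller still means more similar.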
- [1] A. K. A. Schmidt, M. Kraft. An evaluation of image feature detectors and descriptors for robot navigation, 2010. Proc. Int. Conf. Computer Vision and Graphics, pp. 251-259.
- [2] J. E. T. L. M. K. Balntas, V. PN-Net: Conjoined triple deep network for learning local image descriptors, 2016. arXiv preprint.
- [3] B. J. W. B. L. G. I. L. Y. M. C. S. E. Bromley, J. and R. Shah. Signature verification using a siamese time delay neural network, 1993. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):669-688.
- [4] G. H. M. Brown and S. Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), pp. 43-57, 2011.
- [5] V. N. Cai, Z. Cascade R-CNN: Delving into high quality object detection. arXiv preprint arXiv:1712.00726 (2017).
- [6] H. R. Chopra, S. and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, pp. 539-546, Washington, DC, USA, 2005. IEEE Computer Society.
- [7] C. T. et al. Deep complex networks, 2017. ArXiv e-prints (May 2017). arXiv:1705.09792.
- [8] C. T. et al. Deep complex networks, 2017. ArXiv e-prints (May 2017). arXiv:1705.09792.
- [9] D. K. F. Schroff and J. Philbin. FaceNet: A unified embedding for face recognition and clustering, 2015. arXiv preprint arXiv:1503.03832.
- [10] G. M. Georgiou and C. Koutsougeras. Complex domain backpropagation. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 39(5):330-334, 1992.
- [11] Z. X. R. S. S. J. He, K. Deep residual learning for image recognition. In: CVPR (2016).
- [12] A. Hirose and S. Yoshida. Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence, 2012. IEEE Transactions on Neural Networks and Learning Systems, 23(4):541-551.
- [13] E. Hoffer and N. Ailon. Deep metric learning using triplet network. CoRR, abs/1412.6622, 2015.
- [14] B. U. N. K. Ivo Danihelka, Greg Wayne and A. Graves. Associative long short-term memory, 2016. arXiv preprint arXiv:1602.03032.
- [15] S. H. e. a. J. Wan, D. Wang. Deep learning for content-based image retrieval: a comprehensive study, 2014. Proceedings of the Multimedia.
- [16] A. V. K. Simonyan and A. Zisserman. Learning local feature descriptors using convex optimisation, 2014. PAMI.
- [17] T. Kim and T. Adali. Approximation by fully complex multilayer perceptrons. Neural Computation, 15(7):1641-1666, 2003.
- [18] T. Kim and T. Adali. Fully complex multi-layer perceptron network for nonlinear signal processing, 2002. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 32(1-2):29-43.
- [19] S. I. Krizhevsky, A. and G. E. Hinton. ImageNet classification with deep convolutional neural networks, 2012. In NIPS, pp. 1106-1114.
- [20] H. Leung and S. Haykin. The complex backpropagation algorithm. IEEE Transactions on Signal Processing, 39(9):2101-2104, 1991.
- [21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
- [22] A. S. Martin Arjovsky and Y. Bengio. Unitary evolution recurrent neural networks.
- [23] D. J. F. Mohammad Norouzi and R. Salakhutdinov. Hamming distance metric learning. In Advances in Neural Information Processing Systems (NIPS) 25, pages 1070-1078, 2012.
- [24] T. Nitta. On the critical points of the complex-valued neural network. Proceedings of the 9th International Conference on Neural Information Processing, 2002. ICONIP '02.
- [25] F. P. O. Tuzel and P. Meer. Region covariance: A fast descriptor for detection and classification, 2006. ECCV.
- [26] J. H. J. L. R. Scott Wisdom, Thomas Powers and L. Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880-4888, 2016.
- [27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2015. ICLR.
- [28] H. G. K. A. S. I. Srivastava, Nitish and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929-1958, January 2014.
- [29] R. G. K. H. T.-Y. Lin, P. Goyal and P. Dollar. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
- [30] Y. J. R. S. X. Han, T. Leung and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching, 2015. In CVPR.
- [31] X. W. X. Yu, T. Liu and D. Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7370-7379, 2017.
- [32] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks.