Hyperdimensional computing as a framework for systematic aggregation of image descriptors

Abstract

Image and video descriptors are an omnipresent tool in computer vision and its application fields like mobile robotics. Many hand-crafted and in particular learned image descriptors are numerical vectors with a potentially (very) large number of dimensions. Practical considerations like memory consumption or time for comparisons call for the creation of compact representations. In this paper, we use hyperdimensional computing (HDC) as an approach to systematically combine information from a set of vectors in a single vector of the same dimensionality. HDC is a known technique to perform symbolic processing with distributed representation in numerical vectors with thousands of dimensions. We present an HDC implementation that is suitable for processing the output of existing and future (deep-learning based) image descriptors. We discuss how this can be used as a framework to process descriptors together with additional knowledge by simple and fast vector operations. A concrete outcome is a novel HDC-based approach to aggregate a set of local image descriptors together with their image positions in a single holistic descriptor. The comparison to available holistic descriptors and aggregation methods on a series of standard mobile robotics place recognition experiments shows a 20% improvement in average performance compared to the runner-up and a 3.6x better worst-case performance.

1 Introduction

Image descriptors are very useful tools for recognition tasks in computer vision. Many hand-crafted and in particular deep-learning based descriptors are numerical vectors with a potentially large number of dimensions, e.g., NetVLAD [1] uses 4,096-D vectors (after PCA) and DELF [42] uses 1,024-D vectors (before PCA). Approaches like BoW [54], VLAD [21], or ASMK [58] aggregate the information from multiple vectors in a single holistic vector representation to reduce memory consumption and computational effort during comparison. For example, deciding whether two images show the same place based on a set of local landmarks from each image can then be done by a single distance measure between the two aggregated vectors. Although these techniques are able to combine large numbers of descriptors in a compact vector, for certain tasks, like place recognition, it is beneficial to encode additional information in the final vector representation, e.g., information about the image locations of the aggregated vectors.

The central idea of this paper is to use binding and bundling of vectors as a flexible framework to combine image descriptors and additional information. The underlying technique of binding and bundling vectors is taken from a field known as hyperdimensional computing (HDC) or vector symbolic architectures (VSA). This is an established class of approaches to solve symbolic computational problems using mathematical operations on large numerical vectors with thousands of dimensions [23, 40]. The bundling operator superposes the information of a variable number of vectors in a single vector; we can think of it as some form of averaging. The binding operator can, for example, express role-filler or variable-value pairs as required in symbolic processing. An important property is that the outputs of the operations are vectors from the same vector space. This allows chaining HDC operations and enables versatile encoding of structured data from a set of d-dimensional vectors in a single d-dimensional vector.

We will present an HDC implementation that allows the processing of existing and future (deep-learning based) image descriptors in Sec. 3. This section will also describe how HDC can be used as a framework to aggregate holistic or local image descriptors and to combine them with additional information. A concrete outcome is a novel approach to create a holistic image descriptor from a set of local descriptors with position information in Sec. 3.2.2. For example, we can create a holistic descriptor $\mathbf{H}$ from three local descriptors $\mathbf{D}_i$ with poses $\mathbf{P}_i$ as simply as $\mathbf{H} = \mathbf{P}_1 \odot \mathbf{D}_1 + \mathbf{P}_2 \odot \mathbf{D}_2 + \mathbf{P}_3 \odot \mathbf{D}_3$. The poses serve as "roles" that are associated with landmarks as "fillers". When comparing two such holistic descriptors (e.g., by simple cosine similarity), the similarity of the roles decides to what extent the similarities of the associated local descriptors are incorporated in the overall similarity. Prerequisites are an appropriate preprocessing of the descriptors as well as a suitable encoding of image positions in the same vector space as the descriptors; both will be presented in Sec. 3.1. The experiments in Sec. 4 will evaluate properties in a series of mobile robotics place recognition experiments.

2 Related Work

2.1 Descriptors for place recognition

Visual place recognition [34] is a basic problem in mobile robotics, e.g., for loop closure detection in SLAM or candidate selection for visual localization [49]. In contrast to 6-DOF pose estimation that often uses local features (e.g. keypoints [33, 10, 42, 11]), place recognition typically builds upon holistic image descriptors that compute a single descriptor vector for a whole image [1, 60, 55, 38, 41]. Important reasons are the memory consumption and the required time for exhaustively comparing local features.

The existing and steadily increasing (e.g., [47, 35, 53]) variety of local feature extractors can also be the basis for holistic image descriptors. Existing approaches include BoW [54, 8], Fisher vectors [44], and VLAD [21, 2]. Aggregated selective match kernels [58] aim at unifying aggregation-based techniques with matching-based approaches like Hamming Embedding [19]. VLAD in combination with soft-assignment is fully differentiable and seamlessly integrates in deep learning approaches, e.g. NetVLAD [1]. Other deep learning variants of local feature aggregation for image matching include sum-pooling [59], max-pooling [18], and mean-pooling [6]. The latter also outputs global and local descriptors.

Besides their descriptors, the location of local features can provide important information, e.g. for geometric verification [48]. Regarding holistic descriptors, BoW can integrate spatial information via voting [52]. Pyramid match kernels [16] can evaluate matchings at multiple resolutions. Based on this, spatial pyramid matching [31] can approximate global geometric correspondence between sets of local features. Multi-VLAD [2] extends this idea to VLAD, Pyramid-Enhanced NetVLAD [63] extends it to deep learning. The typical usage of flattened AlexNet-conv3 [29] descriptors (or similar) as in [55, 51] is an implicit encoding of local landmarks (one landmark per feature map vector) together with their image location (encoded by the position in the concatenated output vector). Similar encodings can be applied to other local features.

To reduce memory consumption and runtime for comparisons, descriptors are often combined with dimensionality reduction approaches like PCA [42] or Gaussian random projections [55], or compression techniques like product quantization [20]. Approximate nearest neighbor and inverted indexes play an important role for large-scale image matching [39, 32, 54].

2.2 Hyperdimensional Computing

Hyperdimensional computing (HDC) is also known as Vector Symbolic Architectures (VSA) or computing with large random vectors. It is an established class of approaches to solve symbolic computational problems using mathematical operations on large numerical vectors with thousands of dimensions [23, 45, 14, 50]. Using embeddings in high-dimensional vector spaces to deal with ambiguities is well established in natural language processing [5]. HDC makes use of additional operations on high-dimensional vectors. So far, HDC has been applied in various fields including addressing catastrophic forgetting in deep neural networks [7], medical diagnosis [62], robotics [40], fault detection [26], analogy mapping [46], reinforcement learning [25], long-short term memory [9], text classification [27], and synthesis of finite state automata [43]. HDC approaches have been used in combination with deep-learned descriptors before, e.g. for sequence encoding [40]. Another related HDC work is spatial semantic pointers [28], a variant of the Semantic Pointer Architecture [12] that processes vector encodings of symbols with positions in images using a complex vector space and fractional binding.

3 Algorithmic Approach

We will first describe the basic elements of the proposed HDC framework in Sec. 3.1, followed by examples of how these elements can be used to approach image retrieval tasks with different types of available information in Sec. 3.2.

3.1 HDC architecture

In simple words, we use elementwise addition and elementwise multiplication of 4,096-dimensional real vectors of small magnitude (typical vector entries are in the range $[-1, 1]$) to systematically encode information. The vectors are either systematic encodings of systematic information (e.g., images or distances) or random vectors for discrete symbols (e.g., a descriptor type). The way we process vectors is borrowed from the HDC literature. We will use the HDC terminology of binding and bundling operators for the multiplication and addition operators. The main reason is that there are other HDC implementations available that implement binding and bundling differently, and which could be used to replace our particular HDC architecture in the aggregation approaches presented later in Sec. 3.2. Our implementation is similar to the Multiply-Add-Permute (MAP) architecture by Gayler [13, 14]. However, we change the vector space for the combination with image descriptors, which has some ramifications for the operators as well.

Figure 1: Distributions of distances for descriptors of the same place (true matchings) or of different places (false matchings).

3.1.1 Vector space and random vectors for symbols

For the vector space $V$ we use real-valued $d$-dimensional vectors with $d$ in the order of thousands (we will use $d = 4{,}096$ in our experiments). However, based on the mechanisms by which vectors are created and processed, most (if not all) vector entries will be in the range $[-1, 1]$. For measuring the similarity of vectors, we use cosine similarity (normalized dot product). Vectors can be created by three mechanisms: (1) Systematic encoding of a mathematical entity, a sensor measurement, or similar. Image encodings are the topic of Sec. 3.1.4, position encodings of Sec. 3.1.5. (2) The VSA operations binding and bundling combine multiple vectors of space $V$ to a new vector from the same space. (3) Random vectors are used to encode symbolic entities, e.g., to represent elements of a finite (enumerable) set of classes. We create random vectors by sampling each dimension independently from the two-element set $\{-1, 1\}$ with equal probability for each of the two values. The reason for this very special initialization lies in the properties of the binding operator.
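For illustration, the following is a minimal numpy sketch of this vector space and the random symbol vectors (all names are ours, not from the paper): generating bipolar random vectors and checking their quasi-orthogonality via cosine similarity.

```python
import numpy as np

d = 4096                        # dimensionality used in our experiments
rng = np.random.default_rng(0)

def random_symbol():
    """Random vector for a discrete symbolic entity: each of the d entries
    is sampled independently from {-1, +1} with equal probability."""
    return rng.choice([-1.0, 1.0], size=d)

def cos_sim(a, b):
    """Cosine similarity (normalized dot product)."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = random_symbol(), random_symbol()
print(cos_sim(a, b))  # close to 0: random high-dim vectors are quasi-orthogonal
```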

3.1.2 Binding

Binding is the first of the two implemented HDC operations. In the general context of HDC [23], this operation is used to bind fillers to roles, e.g., to assign a particular value to a variable. More specifically, a vector that represents the value is bound to another vector that represents the variable. (Vector representations for variables are also an example of the above-mentioned symbolic entities that are encoded with random vectors.) The result of binding is a new vector with two important properties:

(1) It is not similar to the two input vectors but allows to (approximately) recover any of the input vectors given the output vector and the other input vector. In the general HDC context, recovering is done by an unbinding operator; for those HDC implementations where vectors are also (approximately) self-inverse, unbinding and binding are the same operation (e.g., [22, 13]). Self-inverse means $\mathbf{X} \odot \mathbf{X} = \mathbf{1}$, where $\mathbf{1}$ is the neutral element of binding in the space $V$.

(2) Binding is similarity preserving. The similarity of the outputs of binding a vector to two different vectors depends on the similarity of these two vectors: $\text{sim}(\mathbf{X} \odot \mathbf{A}, \mathbf{X} \odot \mathbf{B}) \approx \text{sim}(\mathbf{A}, \mathbf{B})$.

We use elementwise multiplication for binding two vectors. In the vector space $\{-1, 1\}^d$, binding by elementwise multiplication is exactly self-inverse ($\mathbf{X} \odot \mathbf{X} = \mathbf{1}$) and the resulting vector has a large distance to each of the input vectors (since the sign of an entry is switched whenever the corresponding entry of the other vector is $-1$, which is expected to happen for about 50% of the dimensions). When using the vector space $\mathbb{R}^d$ instead of $\{-1, 1\}^d$, both properties hold only approximately [50]. However, this modification is required to implement the bundling operator.
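Continuing the sketch from above, both binding properties can be verified directly (a toy example under our naming, not the authors' code):

```python
x = random_symbol()      # a "role", e.g., a pose or descriptor-type vector
v = random_symbol()      # a "filler", e.g., a descriptor

bound = x * v            # binding by elementwise multiplication

# Property 1: the result is dissimilar to both inputs ...
print(cos_sim(bound, x), cos_sim(bound, v))      # both close to 0
# ... but binding again recovers the filler exactly (self-inverse in {-1,1}^d).
print(cos_sim(x * bound, v))                     # 1.0

# Property 2: binding is similarity preserving.
w = random_symbol()
print(np.isclose(cos_sim(x * v, x * w), cos_sim(v, w)))  # True
```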

3.1.3 Bundling

The purpose of the bundling operation is to combine (“superpose”) two input vectors such that the output vector is similar to both inputs. In almost all HDC implementations, the bundling operator is some kind of elementwise sum or averaging of the vector elements.

We use the elementwise sum. When adding two vectors $\mathbf{A} + \mathbf{B}$, the vector $\mathbf{B}$ can be seen as noise that disturbs the similarity between $\mathbf{A}$ and the sum. In high-dimensional vector spaces, random vectors are very likely almost orthogonal (a property called quasi-orthogonality) [40]. This has two important effects: (1) If $\mathbf{A}$ and $\mathbf{B}$ are quasi-orthogonal and of similar magnitude, the influence of adding the noise $\mathbf{B}$ on the direction of $\mathbf{A}$ is limited; in particular, $\mathbf{A} + \mathbf{B}$ does not point anywhere close to the opposite direction of $\mathbf{A}$. This limits the influence of $\mathbf{B}$ on the angular similarity between $\mathbf{A}$ and $\mathbf{A} + \mathbf{B}$. (2) If the expected angle between random (unrelated) vectors is close to 90°, then any considerably smaller angular distance indicates a relation between the compared vectors. Thus the noise $\mathbf{B}$ can reduce the similarity, but the remaining similarity will stay considerably above chance, and $\mathbf{A}$ and $\mathbf{A} + \mathbf{B}$ can be considered similar.
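A small continuation of the sketch illustrates this noise argument: bundling a vector with several unrelated vectors still leaves a similarity well above the ~0 chance level.

```python
a = random_symbol()
noise = sum(random_symbol() for _ in range(5))   # five unrelated vectors

bundle = a + noise                               # bundling by elementwise sum
print(cos_sim(bundle, a))   # roughly 1/sqrt(6) ~ 0.41, clearly above chance
```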

Bundling several image descriptors is the simplest way to use HDC for feature aggregation (cf. Sec. 3.2.1). However, for a more sophisticated encoding of information, we will combine the bundling and binding operators. According to Kanerva [24], bundling and binding should "form an algebraic field or approximate a field". In particular, our bundling and binding are associative and commutative, and binding distributes over bundling.

3.1.4 Preprocessing of image descriptors

We will use different image descriptors (e.g. NetVLAD [1] or DELF [42]) in our experiments. Each outputs a high-dimensional vector. We use Gaussian random projection to control the number of dimensions and to distribute information across dimensions. We use L2 normalization to standardize the descriptor magnitudes, followed by mean-centering.

Very much in line with our requirements, typical image descriptors aim to encode multiple images of the same scene with similar vectors and those of different scenes with different vectors. A typical distribution can be seen in Fig. 1 (left). The middle part of this figure also illustrates the effect of mean-centering the descriptors (i.e. subtracting the mean descriptor of the database and query sets). This is known to considerably improve place recognition results (i.e. it improves the separation of the blue and the red distributions) [51]. However, it also improves the quasi-orthogonality property for descriptors of different places. We include mean-centering for both reasons.
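The preprocessing chain could be sketched as follows (our own minimal numpy version; in practice, the projection matrix is shared between all images, and the mean is computed over the database and query sets):

```python
def preprocess(D, d=4096, seed=0):
    """Preprocess a set of descriptors (one per row of D):
    Gaussian random projection to d dimensions, L2 normalization,
    and mean-centering (here over the rows of D)."""
    rng = np.random.default_rng(seed)             # fixed seed: the same
    P = rng.standard_normal((D.shape[1], d)) / np.sqrt(d)  # projection for all
    X = D @ P                                     # random projection
    X /= np.linalg.norm(X, axis=1, keepdims=True) # L2 normalization
    return X - X.mean(axis=0)                     # mean-centering
```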

Figure 2: Illustration of the pose encoding from Sec. 3.1.5. (left) Layout of basis vectors and combination of two basis vectors $\mathbf{B}_i$ and $\mathbf{B}_{i+1}$ to create the encoding of the horizontal image position of the location marked in white. (right) The Hamming distance of this vector to all other image location encodings (in the HDC framework, however, angular distance will be used).

3.1.5 Encoding of positions

The goal is to reflect the spatial distance of different landmarks with positions $(x, y)$ in the image by the angular distance of the pose vector encodings $\mathbf{P}$. There are several alternatives for the creation of such encodings; this section proposes a simple yet flexible approach.

We create vectors $\mathbf{X}$ and $\mathbf{Y}$ to encode $x$ and $y$ independently and compute the final pose encoding by $\mathbf{P} = \mathbf{X} \odot \mathbf{Y}$. This paragraph and Fig. 2 explain the creation of $\mathbf{X}$; $\mathbf{Y}$ is computed accordingly. To encode $x$ from the range $[0, w]$ for an image of width $w$, we equally divide this range into $s_x$ subintervals and associate each border of a subinterval with one of $s_x + 1$ random basis vectors $\mathbf{B}_1, \dots, \mathbf{B}_{s_x+1}$ (including the beginning of the first and the end of the last subinterval). The encoding of $x$ is then computed by concatenating parts of the two basis vectors $\mathbf{B}_i$ and $\mathbf{B}_{i+1}$ of the subinterval in which $x$ is located. With Matlab-style syntax this is:

$\mathbf{X} = [\mathbf{B}_i(1:k),\ \mathbf{B}_{i+1}(k+1:d)]$   (1)

$k$ is the splitting index based on the distances $\delta_i$ and $\delta_{i+1}$ of $x$ to the two subinterval borders and the dimensionality $d$:

$k = \operatorname{round}\!\left(d \cdot \frac{\delta_{i+1}}{\delta_i + \delta_{i+1}}\right)$   (2)

This approach is flexible since the parameters $s_x$ and $s_y$ can be used to weight the spatial distances in each direction for a particular application. In our place recognition experiments, we will use a smaller $s_x$ than $s_y$, which allows larger horizontal displacements of landmarks than vertical displacements. An example of a resulting distance map can be seen in Fig. 2. The normalized dot product of this encoding approximates the rectilinear distance (1-norm) of the encoded image locations. The distortions in Fig. 2 result from interference between the X and Y encodings. It is important to note that although we divide the image into a grid, this approach is able to evaluate similarities across grid borders.
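A possible implementation of this pose encoding (Eqs. 1 and 2), continuing the numpy sketches above, could look as follows. The basis vectors must be created once and shared across all images; separate seeds keep the X and Y bases independent.

```python
def encode_axis(x, length, s, rng, d=4096):
    """Encode coordinate x in [0, length]: divide the range into s
    subintervals with s+1 random basis vectors and concatenate parts of
    the two basis vectors enclosing x (Eqs. 1 and 2)."""
    B = rng.choice([-1.0, 1.0], size=(s + 1, d))     # basis vectors
    i = min(int(x / length * s), s - 1)              # subinterval index
    lo, hi = i * length / s, (i + 1) * length / s
    d_lo, d_hi = x - lo, hi - x                      # distances to the borders
    k = round(d * d_hi / (d_lo + d_hi))              # splitting index (Eq. 2)
    return np.concatenate([B[i][:k], B[i + 1][k:]])  # Eq. 1

def encode_pose(x, y, w, h, sx, sy, d=4096):
    """Pose encoding P = X (bind) Y by elementwise multiplication."""
    X = encode_axis(x, w, sx, np.random.default_rng(1), d)
    Y = encode_axis(y, h, sy, np.random.default_rng(2), d)
    return X * Y
```

Note how the boundary cases behave: for $x$ exactly on border $i$, we get $k = d$ and the encoding equals $\mathbf{B}_i$; positions between two borders interpolate by the share of copied dimensions.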

3.2 HDC for feature aggregation

3.2.1 Unordered aggregation of multiple descriptors

Based on the effect of bundling on similarities explained in Sec. 3.1.3, this operator can be used to combine the information of multiple image descriptors in a single vector $\mathbf{H}$. This can, e.g., be used to aggregate multiple holistic image descriptors (e.g., NetVLAD, AlexNet-conv3, and DenseVLAD) for a single image. Fig. 1 shows a typical distribution of similarities when comparing different descriptors. Not surprisingly, computing the distance between a NetVLAD descriptor and an AlexNet-conv3 descriptor is not useful since they behave toward each other very much like random vectors. However, as described in Sec. 3.1.3, this allows us to safely bundle these different descriptors in a single vector (after the preprocessing from Sec. 3.1.4):

$\mathbf{H} = \mathbf{D}_{\text{NetVLAD}} + \mathbf{D}_{\text{AlexNet}} + \mathbf{D}_{\text{DenseVLAD}}$   (3)

Basically, this is a simple averaging of descriptors. However, due to the quasi-orthogonality of different descriptor classes, evaluating the angular distance of such bundled vectors of two images by normalized dot product will approximate the average distance of the bundled descriptor classes if they were evaluated individually.

We want to emphasize that we do not claim a performance benefit of this approach compared to, e.g., concatenating dimensionality-reduced input descriptors (the experiments will show roughly equal performance). It is rather intended to illustrate the type of computation in the proposed HDC framework and its flexibility. For example, if we want to be able to recover an approximate version of each input descriptor from the resulting combined descriptor, we can bind each input descriptor (before bundling) to a fixed random vector $\mathbf{R}_i$ that represents the descriptor type: $\mathbf{H} = \mathbf{R}_1 \odot \mathbf{D}_1 + \mathbf{R}_2 \odot \mathbf{D}_2 + \dots$. A particular descriptor of type $i$ can then be approximately recovered by unbinding: $\mathbf{D}_i \approx \mathbf{R}_i \odot \mathbf{H}$.
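As a toy continuation of the earlier sketches, bundling with descriptor-type bindings and approximate recovery could look like this (the stand-in descriptors are random; in practice they would be the preprocessed NetVLAD, AlexNet-conv3, and DenseVLAD vectors):

```python
# Stand-ins for three preprocessed holistic descriptors of one image.
d_nv, d_an, d_dv = (rng.standard_normal(d) for _ in range(3))
# Fixed random "type" vectors, one per descriptor class.
r_nv, r_an, r_dv = (random_symbol() for _ in range(3))

H = d_nv + d_an + d_dv                              # plain bundling (Eq. 3)

H_typed = r_nv * d_nv + r_an * d_an + r_dv * d_dv   # bind types, then bundle
recovered = r_nv * H_typed                          # unbinding the NetVLAD part
print(cos_sim(recovered, d_nv))   # ~1/sqrt(3): approximate recovery plus
                                  # quasi-orthogonal noise from the other terms
```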

3.2.2 Systematic encoding of local feature descriptors and positions

This section describes an approach to compute a holistic descriptor $\mathbf{H}$ that encodes a variably sized set of local descriptors $\mathbf{D}_i$ together with their poses $\mathbf{P}_i$, $i = 1, \dots, n$. $\mathbf{H}$ and $\mathbf{D}_i$ are from the same vector space (i.e., they have the same number of dimensions), and the angular distance of two holistic vectors $\mathbf{H}^A$ and $\mathbf{H}^B$ of images A and B will approximate the distance from an exhaustive pairwise comparison of the local features and their poses.

To generate a holistic descriptor $\mathbf{H}$, all local feature descriptors are first preprocessed according to Sec. 3.1.4. The mean-centering is done using all descriptors from the current image. Each pose is encoded in a vector $\mathbf{P}_i$ as described in Sec. 3.1.5. The holistic descriptor of $n$ local features $\mathbf{D}_i$ with poses $\mathbf{P}_i$ is then computed by:

$\mathbf{H} = \sum_{i=1}^{n} \mathbf{P}_i \odot \mathbf{D}_i$   (4)

To use this for image matching applications like place recognition, we compute a holistic descriptor for each image from all local features. The number and spatial arrangement of local features can vary between images. Holistic descriptors are then compared by normalized dot product.

The result is an approximation of the exhaustive pairwise comparison of all feature pairs using their descriptor and spatial distances. We want to give an intuition for this approximation using the example of comparing an image A with descriptor $\mathbf{H}^A = \mathbf{P}_1^A \odot \mathbf{D}_1^A + \mathbf{P}_2^A \odot \mathbf{D}_2^A$ and an image B with descriptor $\mathbf{H}^B = \mathbf{P}_1^B \odot \mathbf{D}_1^B + \mathbf{P}_2^B \odot \mathbf{D}_2^B$. From an HDC perspective, comparing $\mathbf{H}^A$ and $\mathbf{H}^B$ multiplies out and creates individual comparisons of the form $\mathbf{P}_i^A \odot \mathbf{D}_i^A$ vs. $\mathbf{P}_j^B \odot \mathbf{D}_j^B$, and the overall similarity is the accumulated similarity of all such comparisons. We can assume that each pose vector is quasi-orthogonal to each descriptor vector. Based on this and the two properties of binding from Sec. 3.1.2, the two vectors in each term can only be similar if the descriptors are similar ($\mathbf{D}_i^A \approx \mathbf{D}_j^B$) and the poses are similar ($\mathbf{P}_i^A \approx \mathbf{P}_j^B$). The normalized dot product of the holistic descriptors from Eq. 4 is only an approximation of the exhaustive comparison since, in reality, multiplying out is prevented by the required normalization of the vectors. The later experiments in Fig. 5 will show the effect on local feature similarity. For more information, please refer to the HDC literature, e.g. [23, 50].
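Putting the pieces together, a compact sketch of Eq. 4 using the helper functions from the earlier snippets (preprocess, encode_pose, cos_sim) might be:

```python
def holistic_descriptor(local_descriptors, positions, w, h, sx, sy):
    """Eq. 4: H = sum_i P_i (bind) D_i.
    local_descriptors: (n, d_in) array of raw local features (e.g., DELF);
    positions: list of n image coordinates (x, y)."""
    D = preprocess(local_descriptors)        # Sec. 3.1.4, per-image centering
    H = np.zeros(D.shape[1])
    for Di, (x, y) in zip(D, positions):
        H += encode_pose(x, y, w, h, sx, sy, D.shape[1]) * Di  # bind + bundle
    return H

# Two images are then compared by cos_sim of their holistic descriptors.
```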

3.2.3 Extensions and framework character

The concept of bundling and binding to aggregate information can easily be extended to information other than the position of local features. For example, we can integrate information about local feature scale or orientation by binding to appropriate encodings (e.g., created similarly to $\mathbf{X}$ in Eq. 1). This also applies to sequences of images. Exploiting the similarity of temporally neighboring images can significantly improve place recognition performance in mobile robotics [38, 17, 41]. SeqSLAM [38] is a simple yet powerful approach that accumulates similarities of short sequences of image comparisons. This requires the computation of all similarities individually. An appropriate bundling of image descriptors (each bound to its position within the sequence) can achieve very similar results with a single vector comparison [40, 50]; see the sketch below. We want to emphasize that our choices of vector space, bundling, and binding are only one possible HDC implementation; there are several others available [50] that can potentially also be used in the presented approaches.
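As one example of this framework character, a sequence encoding in the spirit of [40, 50] could be sketched as follows, reusing the helpers from above (the position symbols are fixed random vectors shared across all sequences):

```python
seq_pos = [random_symbol() for _ in range(5)]   # one symbol per seq. position

def encode_sequence(image_descriptors):
    """Bundle the (preprocessed) holistic descriptors of a short image
    sequence, each bound to the random vector of its sequence position.
    Comparing two such vectors approximates the accumulated similarity
    of the aligned image pairs (cf. SeqSLAM)."""
    return sum(p * v for p, v in zip(seq_pos, image_descriptors))
```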

4 Experimental Results

4.1 Experimental setup

We will evaluate the HDC approach on standard place recognition datasets from mobile robotics. We use 23 sequence comparisons from six datasets with different characteristics regarding environment, appearance changes, single or multiple visits of places, possible stops, or viewpoint changes: Nordland1k [56], StLucia (Various Times of the Day) [15], CMU Visual Localization [3], GardensPointWalking¹, OxfordRobotCar [37], and SFUMountain [4]. For OxfordRobotCar, we sampled sequences at 1Hz with the recently published accurate ground truth data [36]. For Nordland1k, we sampled 1k images of unique places from each season (without tunnels).

We decided to use DELF [42] for local features since it provides good results using standard exhaustive pairwise comparison (see Table 1; in our experiments (not shown), it performed better than, e.g., [11] and [57]). Moreover, with DELG, there is already a deep-learned holistic descriptor available for comparison that builds upon DELF.

We compare against the following descriptors. NetVLAD NV [1]: We use the authors' VGG-16 version² with whitening trained on the Pitts30k dataset (4,096-D). AlexNet AN [30]: We use the conv3 output of Matlab's ImageNet model and the full 65k-dimensional descriptor. DenseVLAD DV [60]: We use the authors' version³ with 128-dimensional SIFT descriptors and 128 words trained on the 24/7 Tokyo dataset, as well as PCA projection to 4,096-D. DELG [6]: We use the implementation from TensorFlow models with ResNet101 trained on a subset of the Google Landmarks Dataset v2 (GLDv2-clean), which was amongst the best in [6].

Besides these holistic descriptors from the literature, we use two ways to exhaustively compute image similarities from local features. DELF [42]: We use the implementation from TensorFlow Hub⁴. For each image, we extract the 200 1,024-dimensional descriptors with the highest score at scale 1. The descriptors are standardized per image [51]. Following [57], the image similarity $S^{AB}$ is computed from the set $\mathcal{M}^{AB}$ of mutual matchings for exhaustive pairwise comparison of features, with uniform position weighting $w_{ij} = 1$:

$S^{AB} = \sum_{(i,j) \in \mathcal{M}^{AB}} w_{ij} \cdot \cos\!\left(\mathbf{D}_i^A, \mathbf{D}_j^B\right)$   (5)

DELF-Pos: Same as before, but we incorporate the spatial distance of local features with the weighting $w_{ij}$ as follows:

$w_{ij} = \max\!\left(0,\ 1 - \frac{s_x\,|x_i - x_j|}{w} - \frac{s_y\,|y_i - y_j|}{h}\right)$   (6)

$w$ and $h$ are the image dimensions; $s_x$ and $s_y$ are the same as for HDC.

Further, we compare against four existing methods that create a holistic descriptor from local features. We combine all of them with DELF. DELF-V is DELF with VLAD [21]: We use the same 200 1,024-D DELF descriptors as before in a VLAD representation that is trained from all datasets that are solely used as database, as well as additional night and winter images from Oxford and the Örebro Seasons dataset [61]. We use VLFeat for the kmeans and VLAD implementations, with vocabulary size $k$, VLAD with hard assignment, L2-normalized components, as well as L2-normalization of the resulting $(k \cdot 1{,}024)$-D descriptors. DELF-V-PCA: Same as above but with PCA projection to 4,096-D; the PCA is trained on the vocabulary training data. DELF-MV: Following [2], we compute a MultiVLAD representation by concatenating 14 VLAD-PCA descriptors over different regions of the images as described in [2]. DELF-Grid: Computes a regular grid of DELF descriptors at scale 1. The descriptors are dimensionality-reduced with PCA to 40-D (provided by the DELF implementation) and concatenated to get a holistic descriptor.

For evaluation, we compute pairwise similarity matrices between database and query image sets and compare them to ground-truth knowledge about place matchings using a series of thresholds. We report average precision (AP), computed as the area under the resulting precision-recall curve, as well as the achieved recall using the best k matchings.
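For reference, a minimal way to compute this AP measure from a similarity matrix (a sketch assuming scikit-learn; not the authors' evaluation code):

```python
from sklearn.metrics import average_precision_score

def average_precision(S, GT):
    """S: similarity matrix (database x query); GT: boolean matrix of
    ground-truth place matchings of the same shape. Sweeping thresholds
    over S yields the precision-recall curve; AP is the area under it."""
    return average_precision_score(GT.ravel(), S.ravel())
```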

4.2 Evaluation of unordered aggregation of multiple holistic descriptors

For a first impression of the potential of the simple HDC operators, the boxplots in Fig. 3 show a considerable improvement in median average precision as well as in outlier statistics when combining multiple different holistic descriptors as described in Sec. 3.2.1. Before bundling, all descriptors are projected to 4,096-D using Gaussian random projection. The boxplots include all datasets from Table 1 (details not shown). However, very similar results can be achieved by concatenating the descriptors (right part of Fig. 3).

Figure 3: Average precision statistics over a large number of datasets. A simple HDC bundling of multiple descriptors can considerably improve the performance. However, this is also possible by concatenating descriptors.
Dataset DB Query NV AN DV DELG DELF-V DELF-V-PCA DELF-MV DELF-Grid DELF-HDC (ours) DELF (exhaustive) DELF-Pos (exhaustive)
GardensPointWalking dayright dayleft 0.99 0.46 0.98 0.95 0.93 0.95 0.94 0.56 0.82 0.94 0.95
dayright nightright 0.59 0.62 0.52 0.44 0.41 0.41 0.62 0.70 0.79 0.52 0.80
dayleft nightright 0.48 0.12 0.22 0.32 0.24 0.26 0.38 0.23 0.46 0.29 0.73
OxfordRobotCar 2014-12-09-13-21-02 2015-05-19-14-06-38 0.89 0.77 0.85 0.70 0.75 0.76 0.87 0.87 0.90 0.94 0.97
2014-12-09-13-21-02 2015-08-28-09-50-22 0.66 0.41 0.62 0.23 0.33 0.33 0.36 0.56 0.70 0.37 0.60
2014-12-09-13-21-02 2014-11-25-09-18-32 0.91 0.67 0.90 0.68 0.77 0.78 0.81 0.88 0.80 0.79 0.88
2014-12-09-13-21-02 2014-12-16-18-44-24 0.11 0.27 0.11 0.12 0.05 0.06 0.09 0.66 0.78 0.17 0.64
2015-05-19-14-06-38 2015-02-03-08-45-10 0.93 0.84 0.33 0.72 0.44 0.38 0.71 0.66 0.75 0.81 0.92
2015-08-28-09-50-22 2014-11-25-09-18-32 0.59 0.34 0.46 0.38 0.43 0.38 0.52 0.57 0.69 0.53 0.72
SFUMountain dry dusk 0.48 0.54 0.79 0.34 0.30 0.29 0.45 0.76 0.79 0.74 0.81
dry jan 0.22 0.40 0.63 0.10 0.11 0.10 0.24 0.64 0.55 0.61 0.82
dry wet 0.40 0.42 0.75 0.25 0.22 0.21 0.29 0.66 0.73 0.71 0.82
CMU 20110421 20100901 0.71 0.52 0.69 0.80 0.56 0.61 0.69 0.59 0.74 0.80 0.82
20110421 20100915 0.77 0.65 0.76 0.78 0.59 0.62 0.73 0.70 0.73 0.75 0.77
20110421 20101221 0.54 0.36 0.49 0.59 0.54 0.57 0.55 0.49 0.63 0.62 0.67
20110421 20110202 0.62 0.39 0.49 0.64 0.42 0.44 0.51 0.45 0.71 0.72 0.82
Nordland1k spring winter 0.02 0.25 0.06 0.07 0.03 0.03 0.05 0.08 0.74 0.54 0.86
spring summer 0.20 0.66 0.43 0.45 0.29 0.27 0.28 0.67 0.72 0.45 0.71
summer winter 0.05 0.57 0.05 0.16 0.06 0.05 0.04 0.21 0.46 0.17 0.44
summer fall 0.53 0.92 0.82 0.79 0.54 0.49 0.51 0.87 0.89 0.83 0.91
StLucia 1009090845 1808091545 0.08 0.46 0.28 0.15 0.10 0.10 0.25 0.43 0.44 0.29 0.45
1009091000 1908091410 0.19 0.57 0.46 0.22 0.37 0.39 0.61 0.55 0.63 0.48 0.64
1009091210 2108091210 0.61 0.66 0.84 0.63 0.76 0.77 0.81 0.62 0.69 0.65 0.68
worst case 0.02 0.12 0.05 0.07 0.03 0.03 0.04 0.08 0.44 0.17 0.44
best case 0.99 0.92 0.98 0.95 0.93 0.95 0.94 0.88 0.90 0.94 0.97
average case (mAP) 0.50 0.52 0.55 0.46 0.40 0.40 0.49 0.58 0.70 0.60 0.76
Table 1: Average precision of the proposed DELF-HDC and other methods on standard place recognition datasets from mobile robotics. The best result of all holistic descriptors per dataset is highlighted; DELF and DELF-Pos are exhaustive local comparisons.
Figure 4: Achieved recall using the best k matchings, averaged over all datasets from Table 1. DELF and DELF-Pos are exhaustive local comparisons; all others are fast holistic descriptors.

4.3 Evaluation of systematic aggregation of local descriptors and positions

We implement the HDC approach to local feature aggregation from Sec. 3.2 using the same 200 highest-scored DELF descriptors as for the compared methods. We refer to our approach as DELF-HDC. We use $d = 4{,}096$-dimensional vectors and set the spatial weighting parameter $s_x$ smaller than $s_y$ to allow more horizontal than vertical viewpoint deviation (in accordance with the mobile robot place recognition task). An evaluation of all three parameters follows in Sec. 4.3.3.

4.3.1 Place recognition performance

Table 1 shows the average precision for each of the compared methods on each dataset. The proposed DELF-HDC approach is the best holistic descriptor for 11 of 23 comparisons and provides the best average-case (+20% over the runner-up) and worst-case performance (3.6x the runner-up); this excludes DELF and DELF-Pos, which are exhaustive local comparisons and much more time consuming, cf. Sec. 4.4. Fig. 4 evaluates the application to candidate selection for visual pose estimation [49]. Again, DELF-HDC is only outperformed by the (prohibitively expensive) exhaustive pairwise comparison of DELF-Pos features.

4.3.2 The importance of binding

Why does this simple HDC approach work? To provide some intuition, Fig. 5 shows the outcome of a simplified image matching experiment: We take the encoding of a single local feature in a database image and find the most similar feature encoding in a truly matching query image. The similarity of these two HDC encodings is shown at x-axis value 1; for all other points on the curve, we bundle the query encoding with an increasing number of other feature encodings from the query image, which act as noise on the similarity of the only true matching pair. Fig. 5 shows average results on Nordland spring-summer for this experiment. As illustrated by the yellow line, after the preprocessing of descriptors, the expected or average similarity (normalized dot product) of random pairs of descriptors from an image pair is close to 0; they are quasi-orthogonal. A simple bundling of preprocessed descriptors (similar to the experiments in Sec. 4.2) is able to maintain a similarity considerably above chance for a few vectors (red). When we additionally use binding to control similarities (according to the properties from Sec. 3.1.3) by incorporating additional position information, we can bundle significantly more "noise" vectors and still maintain a considerable similarity (compared to random pairs) of the true matching pair (blue). Similar effects can be expected when binding with scale or sequential information.

Figure 5: The similarity to an included single true matching in a bundle with an increasing number of distracting descriptors stays considerably above the average similarity of descriptors.

4.3.3 Properties and parameter evaluation

A basic assumption in HDC is a high-dimensional vector space. The left part of Fig. 6 evaluates the performance for a varying number of dimensions. With 512 dimensions, the mean performance is equal to or above that of the compared holistic algorithms. However, the lower the number of dimensions, the larger the variation on the individual datasets.

The right part of Fig. 6 evaluates the dependency on the number of features on the GardensPointWalking dataset. As expected, the performance of DELF-HDC increases with an increasing number of features. The roughly constant distance to the exhaustive comparison (DELF-Pos) indicates that the capacity of the HDC representation is not exceeded.

Figure 6: (left) Average precision statistics over a varying number of dimensions in the HDC representations. Blue is the mean of all individual (gray) dataset evaluations (all datasets). (right) Influence of a varying number of used features for the exhaustive and the HDC approach on GardensPointWalking, thin curves are individual comparisons, thick curves are means.

In the presented DELF-HDC approach, the parameters $s_x$ and $s_y$ can be used to control the sensitivity to viewpoint changes. Fig. 7 shows the results for all combinations of their values for the three sequence comparisons of the GardensPointWalking dataset. This dataset is particularly interesting for this evaluation since it provides several sequences from day and night of a hand-held camera on the same pathway, but either on the left side or the right side of the pathway, which results in a considerable horizontal viewpoint change. For place recognition where only small viewpoint changes are expected, higher values (e.g., 7) of these parameters are preferable since they assign more different encodings to features at large spatial distance (cf. Fig. 2). To account for larger horizontal viewpoint changes, smaller values for $s_x$ can be used. We use the same parameters for all datasets in Table 1. However, tuning these parameters to a particular dataset is intuitive and can considerably improve results; e.g., with a smaller $s_x$ we can achieve AP=0.59 (+0.13) on the GardensPointWalking Day Left vs. Night Right comparison.

4.4 Computational effort

Using holistic descriptors allows comparing two images by a single vector comparison with the normalized dot product. This is significantly more efficient than an exhaustive pairwise comparison of local features; e.g., in our setup with 200 local features, we would have to compute $200 \times 200 = 40{,}000$ vector distances per image comparison. Even if the local vectors are smaller and we can use ANN [32] techniques, there remains a discrepancy.

With the presented HDC approach, computing a holistic HDC descriptor from $n$ landmarks requires $n - 1$ vector sums and $2n$ elementwise multiplications: per landmark, one within the pose encoding and one for binding the descriptor to the pose. The pose encoding also requires two concatenations of vector parts. Additionally, the descriptor preprocessing requires L2 normalization and mean-centering, as well as potentially a projection to the used vector space. The latter might be the most time-consuming step in the overall computation. In the HDC approach, binding and bundling operations can be computed in a single run over the vector. They operate locally on the vectors, i.e., only corresponding vector dimensions influence each other. This allows for massive parallelization. Since we work with distributed representations, an approximate similarity of holistic vectors can easily be computed by evaluating only a reduced number of dimensions (cf. Fig. 6). We did not yet evaluate the combination with techniques like product quantization [20] for very large-scale image retrieval (the largest sequences used were of size 4k).

Figure 7: The heatmaps show the average precision for different combinations of the parameters $s_x$ and $s_y$ on GardensPointWalking. The graphs below show the mean column (for $s_x$) and row (for $s_y$) values of the heatmaps. To address viewpoint changes in the left-right combinations, a lower $s_x$ should be used.

5 Conclusions

We presented HDC as a simple-to-implement and flexible approach to combine descriptors and other information. The presented HDC architecture was used to implement the DELF-HDC approach that combines local descriptors with their poses in a single holistic descriptor. It can benefit from the advantages of local features, like viewpoint and occlusion robustness as well as feature selection (attention), without the drawbacks of exhaustive pairwise comparisons or fixed grid layouts. The parameters $s_x$, $s_y$ can be used to adjust the weighting of the pose similarity computation (which works across grid borders). The DELF-HDC approach showed improved performance on a series of standard place recognition datasets from mobile robotics, on average and in the worst case.

We consider HDC a flexible framework since it allows integrating further information like the scale or orientation of local features, or aggregating information across multiple images. It can be combined with different existing and future image descriptors. Further, the notation in HDC operations like bundling and binding makes it possible to improve our simple implementation using other HDC architectures (e.g., from [50]).

Footnotes

  1. https://goo.gl/tqmWyq
  2. https://github.com/Relja/netvlad
  3. http://www.ok.ctrl.titech.ac.jp/~torii/project/247/
  4. https://tfhub.dev/google/delf/1

References

  1. R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. Trans. on Pattern Analysis and Machine Intelligence, 40(6), 2018.
  2. R. Arandjelovic and A. Zisserman. All about vlad. In Conf. on Computer Vision and Pattern Recognition, 2013.
  3. H. Badino, D. Huber, and T. Kanade. Visual topometric localization. In Intelligent Vehicles Symposium (IV), 2011.
  4. Jake Bruce, Jens Wawerla, and Richard Vaughan. The SFU mountain dataset: Semi-structured woodland trails under changing environmental conditions. In Int. Conf. on Robotics and Automation 2015, Workshop on Visual Place Recognition in Changing Environments, 2015.
  5. José Camacho-Collados and Mohammad Taher Pilehvar. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res., 63:743–788, 2018.
  6. Bingyi Cao, André Araujo, and Jack Sim. Unifying deep local and global features for image search. In European Conference on Computer Vision (ECCV), 2020.
  7. Brian Cheung, Alexander Terekhov, Yubei Chen, Pulkit Agrawal, and Bruno A. Olshausen. Superposition of many models into one. In NeurIPS, 2019.
  8. Mark Cummins and Paul M. Newman. Appearance-only slam at large scale with fab-map 2.0. Int. J. Robotics Res., 30(9):1100–1123, 2011.
  9. Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative Long Short-Term Memory. In Int. Conf. on Machine Learning, 2016.
  10. Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR Workshops, 2018.
  11. M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler. D2-net: A trainable cnn for joint description and detection of local features. In Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
  12. Chris Eliasmith. How to build a brain: from function to implementation. Synthese, 159(3):373–388, 2007.
  13. Ross W. Gayler. Multiplicative binding, representation operators, and analogy. In Advances in analogy research: Integr. of theory and data from the cogn., comp., and neural sciences, Bulgaria, 1998.
  14. Ross W. Gayler. Vector Symbolic Architectures answer Jackendoff’s challenges for cognitive neuroscience. In Int. Conf. on Cognitive Science, 2003.
  15. A. J. Glover, W. P. Maddern, M. J. Milford, and G. F. Wyeth. Fab-map + ratslam: Appearance-based slam for multiple times of day. In Int. Conf. on Robotics and Automation (ICRA), 2010.
  16. Kristen Grauman and Trevor Darrell. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8(26):725–760, 2007.
  17. P. Hansen and B. Browning. Visual place recognition using HMM sequence matching. In International Conference on Intelligent Robots and Systems (IROS), 2014.
  18. Syed Sameed Husain and Miroslaw Bober. Remap: Multi-layer entropy-guided pooling of dense cnn features for image retrieval. IEEE Trans. Image Process., 28(10):5201–5213, 2019.
  19. Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Improving bag-of-features for large scale image search. Int. J. Comput. Vis., 87(3):316–336, 2010.
  20. Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011.
  21. H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Conference on Computer Vision and Pattern Recognition, 2010.
  22. Pentti Kanerva. Fully distributed representation. In Real World Computing Symposium, 1997.
  23. Pentti Kanerva. Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors. Cognitive Computation, 1(2):139–159, 2009.
  24. P. Kanerva. Computing with 10,000-bit words. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept 2014.
  25. Denis Kleyko, Evgeny Osipov, Ross W. Gayler, Asad I. Khan, and Adrian G. Dyer. Imitation of honey bees’ concept learning processes using Vector Symbolic Architectures. Biologically Inspired Cognitive Architectures, 14:57 – 72, 2015.
  26. D. Kleyko, E. Osipov, N. Papakonstantinou, V. Vyatkin, and A. Mousavi. Fault detection in the hyperspace: Towards intelligent automation systems. In International Conference on Industrial Informatics (INDIN), 2015.
  27. Denis Kleyko, Abbas Rahimi, Dmitri A. Rachkovskij, Evgeny Osipov, and Jan M. Rabaey. Classification and Recall With Binary Hyperdimensional Computing: Tradeoffs in Choice of Density and Mapping Characteristics. IEEE Transactions on Neural Networks and Learning Systems, 29(12):5880–5898, 2018.
  28. Brent Komer, Terrence C. Stewart, Aaron Voelker, and Chris Eliasmith. A neural representation of continuous space using fractional binding. In CogSci, pages 2038–2043, 2019.
  29. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  30. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 2012.
  31. S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
  32. W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin. Approximate nearest neighbor search on high dimensional data — experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering, 32(8):1475–1488, 2020.
  33. David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
  34. S. Lowry, N. Sünderhauf, P. Newman, John J. Leonard, David Cox, Peter Corke, and Michael J. Milford. Visual place recognition: A survey. Trans. Rob., 32(1), 2016.
  35. Zixin Luo, Lei Zhou, Xuyang Bai, Hongkai Chen, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. Aslfeat: Learning local features of accurate shape and localization. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  36. Will Maddern, Geoffrey Pascoe, Matthew Gadd, Dan Barnes, Brian Yeomans, and Paul Newman. Real-time Kinematic Ground Truth for the Oxford RobotCar Dataset. arXiv preprint arXiv: 2002.10152, 2020.
  37. Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The oxford robotcar dataset. The Int. Journal of Robotics Research, 36(1):3–15, 2017.
  38. M. Milford and G. F. Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In Int. Conf. on Robotics and Automation, 2012.
  39. Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In Int. Conf. on Computer Vision Theory and Applications, 2009.
  40. Peer Neubert, Stefan Schubert, and Peter Protzel. An introduction to hyperdimensional computing for robotics. Künstliche Intell., 33(4):319–330, 2019.
  41. Peer Neubert, Stefan Schubert, and Peter Protzel. A neurologically inspired sequence processing model for mobile robot place recognition. IEEE Robotics and Automation Letters, 4(4):3200–3207, 2019.
  42. H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han. Large-scale image retrieval with attentive deep local features. In Int. Conf. on Computer Vision (ICCV), 2017.
  43. E. Osipov, D. Kleyko, and A. Legalov. Associative synthesis of finite state automata model of a controlled object with hyperdimensional computing. In Conference of the IEEE Industrial Electronics Society (IECON), 2017.
  44. F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In Conference on Computer Vision and Pattern Recognition, 2007.
  45. Tony Alexander Plate. Distributed Representations and Nested Compositional Structure. PhD thesis, Toronto, Ont., Canada, Canada, 1994.
  46. Dmitri A. Rachkovskij and Serge V. Slipchenko. Similarity-based retrieval with structure-sensitive sparse binary distributed representations. Computational Intelligence, 28(1):106–129, 2012.
  47. Jérôme Revaud, César Roberto de Souza, Martin Humenberger, and Philippe Weinzaepfel. R2d2: Reliable and repeatable detector and descriptor. In NeurIPS, 2019.
  48. Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Improving image-based localization by active correspondence search. In European Conf. on Computer Vision (ECCV), 2012.
  49. Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  50. Kenny Schlegel, Peer Neubert, and Peter Protzel. A comparison of vector symbolic architectures. arXiv preprint, 2020.
  51. Stefan Schubert, Peer Neubert, and Peter Protzel. Unsupervised learning methods for visual place recognition in discretely and continuously changing environments. In Int. Conf. on Robotics and Automation (ICRA), 2020.
  52. X. Shen, Z. Lin, J. Brandt, S. Avidan, and Y. Wu. Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking. In Conference on Computer Vision and Pattern Recognition, 2012.
  53. X. Shen, C. Wang, X. Li, Z. Yu, J. Li, C. Wen, M. Cheng, and Z. He. RF-Net: An end-to-end image matching network based on receptive field. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  54. Josef Sivic and Andrew Zisserman. Video google: Efficient visual search of videos. In Toward Category-Level Object Recognition, volume 4170 of Lecture Notes in Computer Science, pages 127–144. Springer, 2006.
  55. N. Sünderhauf, F. Dayoub, S. Shirazi, B. Upcroft, and M. Milford. On the Performance of ConvNet Features for Place Recognition. CoRR, abs/1501.04158, 2015.
  56. Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are we there yet? challenging seqslam on a 3000 km journey across all four seasons. Int. Conf. on Robotics and Automation (ICRA) Workshop on Long-Term Autonomy, 2013.
  57. Niko Sünderhauf, Sareh Shirazi, Adam Jacobson, Feras Dayoub, Edward Pepperell, Ben Upcroft, and Michael Milford. Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free. In Robotics: Science and Systems (RSS), 2015.
  58. Giorgos Tolias, Yannis Avrithis, and Hervé Jégou. Image search with selective match kernels: Aggregation across single and multiple images. Int. J. Comput. Vis., 116(3):247–261, 2016.
  59. G. Tolias, T. Jenícek, and O. Chum. Learning and aggregating deep local descriptors for instance-level recognition. In European Conf. on Computer Vision, 2020.
  60. A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In Conf. on Computer Vision and Pattern Recognition, 2015.
  61. C. Valgren and A. J. Lilienthal. Sift, surf and seasons: Long-term outdoor localization using local features. In European Conference on Mobile Robots (ECMR), 2007.
  62. Dominic Widdows and Trevor Cohen. Reasoning with Vectors: A Continuous Model for Fast Robust Inference. Logic journal of the IGPL / Interest Group in Pure and Applied Logics, (2):141–173, 2015.
  63. J. Yu, C. Zhu, J. Zhang, Q. Huang, and D. Tao. Spatial pyramid-enhanced netvlad with weighted triplet loss for place recognition. IEEE Transactions on Neural Networks and Learning Systems, 31(2):661–674, 2020.