The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search
Abstract
With the everincreasing quantity and variety of data worldwide, the Web has become a rich repository of mathematical formulae. This necessitates the creation of robust and scalable systems for Mathematical Information Retrieval, where users search for mathematical information using individual formulae (querybyexpression) or a combination of keywords and formulae. Often, the pages that best satisfy users’ information needs contain expressions that only approximately match the query formulae. For users trying to locate or refind a specific expression, browse for similar formulae, or who are mathematical nonexperts, the similarity of formulae depends more on the relative positions of symbols than on deep mathematical semantics.
We propose the Maximum Subtree Similarity (MSS) metric for querybyexpression that produces intuitive rankings of formulae based on their appearance, as represented by the types and relative positions of symbols. Because it is too expensive to apply the metric against all formulae in large collections, we first retrieve expressions using an inverted index over tuples that encode relationships between pairs of symbols, ranking hits using the Dice coefficient. The top formulae are then reranked using MSS. Our approach obtains stateoftheart performance on the NTCIR11 Wikipedia formula retrieval benchmark and is efficient in terms of both index space and overall retrieval time. Retrieval systems for other graphical forms, including chemical diagrams, flowcharts, figures, and tables, may also benefit from adopting our approach.
The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search
Richard Zanibbi 
Kenny Davila 
Rochester Institute of Technology, USA 
{rxzvcs,kxd7282}@rit.edu 
Andrew Kane 
Frank Wm. Tompa 
University of Waterloo, Canada 
{arkane,fwtompa}@uwaterloo.ca 
\@float
copyrightbox[b]
\end@floatCategories and Subject Descriptors H.2.4 [Database Management]: Systems—Query Processing ; H.3.3 [Information Search and Retrieval]: Retrieval models; H.3.4 [Systems and Software]: Performance evaluation (efficiency and effectiveness)

mathematical information retrieval (MIR), inverted index, querybyexpression, subtree similarity
Mathematical Information Retrieval (MIR) is an important emerging area of Information Retrieval research [?, ?, ?, ?]. Technical documents often include a substantial amount of mathematics, but math is difficult to use directly in queries. For the most part, largescale search engines do not support formula search other than indirectly, e.g., through matching LaTeX strings. Formula queries allow documents with similar expressions or mathematical models to be discovered automatically, providing a new way to search and browse technical literature [?]. For mathematical nonexperts, querying based on the appearance of expressions may also be useful, for example when students try to interpret unfamiliar notation [?]. Many have had the experience of wishing they could search through technical documents for similar formulae rather than find words to describe them.
Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search shows the top of a results page from the new Tangent [?, ?] formula retrieval engine.^{1}^{1}1http://www.cs.rit.edu/~dprl/Software.html The 17 hits shown are grouped by their structure (exact match, variable substitution, operator substitution), and groups are ordered by the similarity of the contained formulae to the query. Efficient and effective retrieval becomes more difficult when the best matches are even less similar to the query formula (e.g., the repository includes larger expressions that include pieces similar to one or more parts of the query formula) or when wildcards that can match arbitrary symbols or subexpressions are included in the query [?].
For scalability, Tangent now employs a twolevel cascading search system [?] that provides both query runtime efficiency and ranking effectiveness for formula search. The first level is the core engine, which uses an uncompressed inverted index over tuples representing pairs of symbols in an expression. This level provides limited support for wildcard symbols and can quickly produce an ordered list of candidate results using a simple ranking algorithm. The second level reranks the top candidate results using Maximum Subtree Similarity (MSS), a new metric for computing the similarity of mathematical formulae based on their appearance. The system architecture is summarized in Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search.
Contributions. This paper includes three primary contributions. Our first is the incorporation of substantially smaller indices than those used previously[?, ?] (Section The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search), which can obtain strong retrieval results in a scalable system. The second contribution is the MSS metric (Section The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search), which produces an intuitive ordering for retrieved formula based on the visual structure of expressions, taking unifiable symbol types into account. The third is a new symbol pair retrieval model (Section The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search) that incorporates the first two contributions in an efficient and effective twostage cascaded implementation, as demonstrated experimentally (Section The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search). In addition, we believe that the form of output adopted, namely grouping results by similarity and match structure, is an improvement over existing MIR interfaces.
Interest in Mathematical Information Retrieval (MIR) has been increasing in recent years, as witnessed by the NTCIR10 [?] and NTCIR11 [?] Math Retrieval Tasks held in 2013 and 2014, respectively.
Math representations are naturally hierarchical, and often represented by trees that may be encoded as text strings. As a result, approaches to querybyexpression may be categorized as treebased or textbased, as determined by the structures used to represent formulae. The encoded hierarchies commonly represent either the arrangement of symbols on writing lines (as in LaTeX or Presentation MathML) or the underlying mathematical semantics as nested applications of operations to arguments (as in OpenMath or Content MathML). Both appearance and semantic representations have been used for retrieval.
TextBased Approaches. In textbased approaches, math expression trees are linearized, and often normalized, before indexing and retrieval. Common normalizations include defining synonyms for symbols (e.g., function names), using canonical orderings for commutative operators and spatial relationships (e.g., to group
+b } with {\verb b+
and_i^2 } with {\verb
^2_i ), enumerating variables, and replacing symbols by their mathematical type (e.g., numbers, variables, and classes of operators) [?, ?].Although linearization masks significant amounts of structural information, it allows text and math retrieval to be carried out efficiently by a single search engine (commonly Lucene^{2}^{2}2https://lucene.apache.org/). As a result, most textbased formula retrieval methods use TFIDF (term frequencyinverse document frequency) retrieval after linearizing expressions [?, ?]. In an alternative approach, the largest common substring between the query formula and each indexed expression is used to retrieve LaTeX strings [?]. This captures more structural information, but also requires evaluating all expressions in the index using a quadratic algorithm.
TreeBased Approaches. Treebased formula retrieval approaches use explicit trees to represent expression appearance or semantics directly. These approaches index complete formula trees, often along with their subexpressions to support partial matching. Methods have been developed to compress tree indices by storing identical subtrees uniquely [?] and to match expressions using treeedit distances with early stopping for fast retrieval [?]. The substitution tree data structure, first designed for unification of predicates [?], has been used to create treestructured indices for formulae [?]. Descendants of an index tree node contain expressions that unify with the parameterized expression stored at that node.
A recent treebased technique adapts TFIDF retrieval for vectors of subexpressions and generalized subexpressions in which arguments are represented by untyped placeholders [?]. In this method a Symbol Layout Tree is modified to capture some semantic properties, normalizing the order of arguments for commutative operators and representing operator precedences explicitly.
‘Spectral’ TreeBased Approaches. An emerging subclass of the treebased approach uses paths or small subtrees rather than complete subtrees for retrieval. One system converts subexpressions in operator trees to words representing individual arguments and operatorargument triples [?]. A lattice over the sets of generated words is used to define similarity, and a breadthfirst search constructs a neighbor graph traversed during retrieval. Another system employs an inverted index over paths in operator trees from the root to each operator and operand, using exact matching of paths for retrieval [?]. The large number of possible unique paths combined with exact matching make this technique brittle.
Rather than indexing paths from the root of the tree, the Tangent math retrieval system stores relative positions of symbol pairs in Symbol Layout Trees to create a “bag of symbol pairs” representation [?, ?]. This symbol pair representation supports partial matches in a flexible way, while preserving enough structural information to return exact matches for queries. Set agreement metrics are applied to the bags of symbol pairs to compute formula similarities. For example, the harmonic mean for the percentage of matched pairs in the query and a candidate (i.e., Dice’s coefficient for set similarity^{3}^{3}3Given a query tree and a candidate tree , let and , respectively, denote a set of their features (such as a set of node and edge labels) and let denote the set of features they have in common. Dice’s coefficient of similarity () can then serve as the score for .) prefers large matches of the query with few additional symbols in the candidate. Tangent (starting with Version 2) also accommodates matrices, isolated symbols, and wildcard symbols and augments formula search with text search. Formula retrieval based on bags of symbol pairs combined with keyword retrieval using Lucene allowed Tangent to produce the highest Precision@5 result for the NTCIR11 Math2 main retrieval task with combined text and formula queries (92%) [?].
In this paper, we address needed improvements for Tangent, described in the next section.
The math retrieval task we address is to search a corpus to produce a ranked list of formulae (and the pages on which those formulae are located) that match a query formula expressed in LaTeX or Presentation MathML, with or without the inclusion of wildcard symbols. Formulae ranked highly should match the query formula exactly or, failing that, closely resemble it. The system should be scalable in terms of index size, indexing speed, and querying speed.
Scalability and Retrieval Effectiveness. As originally implemented, Tangent is not scalable: indexing time is less than 200 formulae per second, producing indices of over 1GB for the NTCIR11 Wikipedia corpus and 30 GB for the NTCIR11 arXiv corpus. Retrieval time is also slow, averaging 5 seconds per query for the Wikipedia task (under 400 thousand distinct formulae) and averaging 3 minutes per query for the NTCIR main task (3 million distinct formulae). Furthermore, while retrieval effectiveness is very good, there is substantial room for improvement.
Symbols and Containers. Tangent uses a Symbol Layout Tree (SLT) to represent the appearance of a mathematical formula. Tree nodes represent individual symbols and visually explicit aggregates, such as fractions, matrices, function arguments, and parenthesized expressions. In Tangent Version 3, all symbols except those representing operators or separators (e.g., commas) are prefixed with their type, represented by a single character followed by an exclamation point. More specifically, SLT nodes represent:

typed mathematical symbols: numbers (N!); variable names (V!); text fragments, such as lim, otherwise, and such that (T!)

fractions (F!)

container objects: radicals (R!); matrices, tabular structures, and parenthesized expressions (M!x)

explicitly specified whitespace (W!)

wildcard symbols (?)

mathematical operators
Because of their visual similarity, all tabular structures, including matrices, binomial coefficients, and piecewise defined functions are encoded using the matrix indicator M!. If a matrixlike structure is surrounded by fence characters, then those symbols are indicated after the exclamation mark. Finally, the indicator includes a pair of numbers separated by an , indicating the number of rows and the number of columns in the structure. For example, M!2x3 represents a 2x3 table with no surrounding delimiters and M!()1x5 represents a 1x5 table surrounded by parentheses. Importantly, all parenthesized subexpressions are treated as if they were 1x1 matrices surrounded by parentheses, and, in particular, the arguments for any ary function are represented as if they were a 1x matrix surrounded by parentheses.
As well as associating a label (e.g., V!x) with every SLT node, every node has an associated type (number, variable, operator, etc.). A node’s type is reflected in its label, usually represented by the part of the label up to an exclamation point (e.g., V!), but node labels preceded by a question mark (?) have type wildcard; a matrix node’s type includes the matrix dimensions, but not its fence characters (e.g., M!2x3); and other node labels without exclamation marks have type operator.
Spatial Relationships. Labeled edges in the SLT capture the spatial relationships between objects represented by the nodes:

next () references the adjacent object that appears to the right on the same line

within ( ) references the radicand of a root or to the first element appearing in rowmajor order in a structure represented by M!

element ( ) references the next element appearing in rowmajor order in a structure represented by M!

above ( ) references the leftmost object on a higher line (e.g., superscript, over symbol, fraction numerator, or index for a radical)

below ( ) references the leftmost object on a lower line (e.g., subscript, under symbol, fraction denominator)

preabove ( ) references the leftmost object of a prescripted superscript

prebelow ( ) references the leftmost object of a prescripted subscript
An SLT is rooted at the leftmost object on the main baseline (writing line) of the formula it represents. Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search shows an example of an SLT, where for simplicity, unlabeled edges represent the next relationship and types other than wildcard are not displayed.
(a) Query Formula and Symbol Layout Tree (SLT)
Sym1 Sym2 Path Count V! V!i 1 V! = 1 = N!2 1 N!2 ?x0 1 N!2 M!()2x1 1 M!()2x1 V!N 1 V!N V!i 1
V!N!0 1 V!i !0 2 M!()2x1 !0 1 ?x0 !0 1
V!N!2 1 = ?x0 1 ⋮ ⋮ ⋮ ⋮
V!V!i 1
(b) Tuples for SLT Symbol Relationships with Counts
(c) Example Search Hits. Green: exact matches, Orange: unified matches, Dashed Nodes: unmatched symbolsFigure \thefigure: Query Formula with Corresponding Symbol Layout Tree (SLT), Symbol Pair Tuples, and Sample Search Results. Creating SLTs. SLTs can be created straightforwardly from Presentational MathML by a recursive descent parser. For other input formats, we assume that converters such as LaTeXML^{4}^{4}4http://dlmf.nist.gov/LaTeXML/ exist to produce Presentational MathML.
In most circumstances, whitespace is not represented in an SLT. As a result, although unicode whitespace and related characters, such as “invisible times” (U+2062), occasionally appear as operators in Presentational MathML expressions, they are all ignored for the purpose of matching expressions in Tangent.
As described above, a node in a Symbol Layout Tree can have up to seven labeled outgoing edges (with no edge label repeating for any node). For a given Symbol Layout Tree, Tangent produces a set of tuples that each encodes the relationship between a pair of symbols occurring on some path from the root to a leaf. Given two nodes on such a path, we define the relative path between the nodes by the sequence of edge labels traversed from the ancestor node to the descendant.
As an optimization that saves both space and time, and following the practice of searching via ngrams [?], the new version of Tangent does not store all tuples defined, but only those for which the distance between symbols (measured by the number of edges separating them) is less than or equal to a specified window size .
In addition to normal tuples for an SLT, endofline (EOL) information can optionally be captured by introducing special tuples of the form . Such endofline tuples are likely to improve retrieval, particularly for individual symbols and small expressions.
Tuple information for the example expression in Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search along with tuple counts is shown in Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Searchb. The maximum path length between two symbols is five, yielding 23 distinct tuples (19 symbol pair tuples plus four EOL tuples with one repetition for ‘i’). However, if the window size is set to 2, then only 16 distinct symbol pairs are stored.
In order to improve both query runtime efficiency and ranking effectiveness, our search system uses a two level cascading approach [?]. After parsing a query formula, the first level searches the corpus and returns candidate results. The second level then reranks those candidates, and finally the results are displayed on an HTML page using grouping and color coding of match structures for improved clarity.
The first level of our search system, referred to as the core engine, uses an inverted index over tuples defined from symbol pairs of the Symbol Layout Trees of expressions. The core engine supports limited query functionality in order to produce a small list of candidate expression results quickly using the simple ranking metric defined by Dice’s coefficient over tuple matches. These candidate expressions are returned together with the lists of documents that contain each expression.
The second level of our system is a reranker that implements the full query functionality to identify tuple matches in expressions. The reranker scores the candidate expression results using the more accurate Maximum Subtree Similarity ranking metric, as defined in Section The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search. The reranker then combines several expression scores in the candidate expression results to produce a final document ranking.
The final query results can be ordered by either formula rank or document rank. When the results are presented in formula rank order, they are grouped by their maximum query subtree overlaps as shown in Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search; when presented in document rank order they are grouped by document. In either case, expressions are displayed color coded to highlight their maximum subtree overlaps, as shown in Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Searchc.
The core engine for our system quickly finds a small number of highly relevant candidate results for a math search query, which are later reranked. The engine returns these top formulae determined using a simple ranking algorithm, along with the list of documents containing each formula and the first position of that formula in the document.
Since runtime performance is a high priority, the core engine uses a customized inverted index data structure implemented in C++. In addition, the engine evaluates only a subset of the query language functionality to allow the use of a fast and simple ranking algorithm that can still find a good set of candidate results.
The input to the indexer is a set of document names and the extracted mathematical formulae found in each document: , and the input to the search component is a single query formula. Each formula is converted to a set of tuples (see Section The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search) that serve as words do in a normal search engine.
Index Structures: At index time, an inverted index is built over the given documentformulatuple relationships. The index includes postings lists PL1 that map each tuple to all formulae containing that tuple. A query containing only nonwildcard tuples can thus be implemented by combining the corresponding tuples’ postings lists using an OR operator. We store these postings lists as ordered lists of formula identifiers (integers), so that the lists can be easily combined using a merge algorithm. The engine uses a dictionary D1 to always assign the same formula to the same identifier, thus saving both space and time in the engine.
In order to return document information for query results, the engine stores postings lists PL2 mapping each formula identifier to the identifiers of the documents containing those formulae, along with their first position in the document. To improve compression, a dictionary D2 is used for document names and another dictionary D3 is used for tuples.
The core engine supports limited wildcard functionality. Query tuples containing a single wildcard as either the ancestor or descendant symbol are implemented as iterator expansions. The engine stores postings lists PL3 that map each single wildcard tuple to the set of tuple identifiers that match. Assigning tuple identifiers using a dictionary D4 again gives some compression benefits. Implementing even this restricted wildcard functionality can be expensive, since the iterator expansion could be quite large.^{5}^{5}5The engine does not try to enforce wildcard variable agreement between tuples (wildcard joins), and it ignores multiwildcard tuples. An initial implementation handling multiwildcard tuples and wildcard joins was found to be approximately a hundred times slower than the current engine for a small dataset.
In summary, the core engine uses two main data structures: dictionaries convert objects (such as strings or tuples) into a compact 0based range of internal identifiers (integers) and postings lists are lists of integer tuples ordered by the first integer in the tuple.

dictionaries

formula formID

document docID

tuple tupleID

wildcardtuple wildcardtupleID


postings lists

tupleID formID, count

formID docID, position

wildcardtupleID tupleID

These data structures can be combined to produce compression, ease of storage, and fast access speeds.
Searching: Query processing follows the architecture shown in Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search. First, the query is parsed into an SLT and tuples are extracted. Then wildcard tuples are expanded, the associated postings lists for each tuple are found, iterators over these lists are created, and an iterator tree that implements the query is formed. Next, the iterator tree is advanced along formula identifiers in order, the scores are calculated, and the top formulae are stored in a heap. During this process, nonwildcard iterators are advanced first so that wildcard iterators only match unallocated tuples. As optimizations, iterators may skip over some formulae based on thresholds and maxscore calculations (see below). After the iterators are finished, matching formulae and scores are returned along with the associated document names.
The engine uses Dice’s coefficient over tuples as a simple ranking algorithm, counting the number of tuples that overlap between the query and a candidate formula using the query iterators. The engine also stores the tuple count for each formula in an array A1 and uses these values in the ranking calculation:

formID tuplecount
Since wildcards can often match multiple tuples in a query and overlap with other wildcards, there could be multiple ways to count the tuples that overlap. The engine implements a greedy counting approach by simply assigning the matches for tuples when each of the iterators is advanced.
Parameters: The engine has three configuration parameters: the window size for formulatotuple conversion, the optional use of endofline tuples, and the number of formulae to return for each query. The runtime efficiency and ranking effectiveness of configurations using various settings of the first two parameters with are examined in Section The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search.
Optimizations: By using dictionaries and postings lists, the engine’s data structures are small enough to be run in memory for the datasets being examined, so we do not examine additional techniques to compress these data structures here. Nevertheless, query processing might still be slow, even though the data structures are in memory, the ranking algorithm is fast, and the use of a dictionary avoids repeated processing of duplicate formulae. As a result, various techniques are employed to improve query execution time:

Avoid processing all postings by allowing skipping in query iterators. This functionality is implemented using doubling (galloping) search [?].

Skip formulae based on size thresholds. We use the current top candidate list to define a minimum score that defines minimum and maximum tuple size thresholds from the definition of Dice’s coefficient. We also improve on the effectiveness of these thresholds by reordering formula identifiers: sort the formulae by size, split into quartiles , and then reorder .

Avoid formulae that match only wildcard tuples when the score threshold allows. This is similar to portions of the maxscore [?] optimization, only at a coarser granularity.

Avoid processing all wildcard tuple expansions. If a tuple is matched to a wildcard for the next formula, do not process the remaining iterators for this wildcard.

Process iterators for large postings lists first. Evaluate the binary operator tree leftfirst and order tree operators descending by size when possible.
Various improvements to the engine have been left for future work, including compression of the postings lists and implementing more of the query functionality in the engine. Additional improvements in query runtime are also possible by using an implementation of weakAND [?] or a more finegrained implementation of maxscore [?].
Effective information retrieval depends on ranking documents based on their similarity to a user’s query. For example, when using treebased formula retrieval, one could extract a set of features from a query tree and each candidate indexed tree , apply Dice’s coefficient of similarity as the score for , and rank candidates by their scores. In this section we describe an alternative to Dice’s metric that is particularly effective in ranking mathematical formulae.

The label on node in SLT is denoted . The number of nodes in SLT is denoted . For simplicity, we write if is a node in and if is an edge in .
Approximate matches of formulae might involve isolating corresponding parts of SLTs representing a query and a candidate match. Therefore we need a basis for describing such a correspondence.

SLTs and are aligned if there is an isomorphism mapping nodes from onto nodes from such that for every edge , there is a corresponding edge that has the same label. (Note that node labels in aligned trees need not match.) For a subset of nodes in , we define .
Approximate matches might also involve simple replacements of symbols in one SLT by alternative symbols (e.g., for or 3 for 2). Naturally, a wildcard symbol can be replaced by any symbol.

Node in SLT can be unified with node in SLT , denoted , if one of the following conditions holds:

Both and have type variable name (V!),

Both and have type number (N!),

has type wildcard (?), or

has a type other than variable name, number, or wildcard and .

However the SLT for an arbitrary query formula will not necessarily align with the SLT for an arbitrary candidate match formula. Therefore, we need to consider parts of the SLTs that can be aligned. When considering many candidate match trees, we are most interested in those parts of the query and candidate trees that are similar to the tree representing the whole query.

Given SLTs and and aligned SLTs and with isomorphism from to , where is a pruned subtree^{6}^{6}6Given a tree , a pruned subtree is any connected subset of nodes from together with the edges connecting those nodes. Thus a pruned subtree is itself a tree, but it need not extend to the leaves of . Henceforth, we will use “subtree” to mean “pruned subtree.” of and is a pruned subtree of , let . Let be a measure of similarity of to with respect to (Dice’s measure). is then maximally similar to if the root of can be unified with the root of and there is no other pair of aligned SLTs and with corresponding measure , where is a subtree of , is a subtree of , the root of is the same as the root of , the root of is the same as the root of , , and .
Theorem 1
Given SLTs and and aligned SLTs and with isomorphism from to , where is a subtree of and is a subtree of , determining that is maximally similar to can be performed in time .

Let be the root of . The maximally similar subtree to and rooted at can be determined in time :
missingbase case: If is a leaf of , then and is maximally similar to iff . This can be determined in time.
recursion: Let is a maximally similar subtree of , the root of is a child of , the root of the aligned SLT for is a child of , and the label on the edge from to the root of is the same as the label on the edge from to the root of the aligned SLT for . (This construction is unambiguous because the edge labels on all edges starting at a node are unique.) The isomorphism can then be extended to include all the nodes in all subtrees in , and the subtree of consisting of and all the subtrees in will be aligned with the subtree of having nodes . Let . The maximally similar subtree to and rooted at then includes iff (i.e., we can evaluate the similarity for each subtree independently to determine whether or not it is part of the maximally similar subtree). Because the outdegree of is bounded by a constant, each step of the recursion can be performed in time.
is then maximally similar to iff it is the tree thus constructed and , and the construction can be performed in time .
Next, when matching with substituted symbols, it is important that the substitutions are consistent when determining that two formulae match approximately.

Given and , two aligned SLTs with isomorphism from to , an alignment partition is a subset of nodes in such that . For node , we define to be the alignment partition containing if it exists and otherwise. (Note that .) For alignment partition , denotes the label that is common to all nodes in and denotes the label that is common to all nodes in .

Given aligned SLTs and with isomorphism from to and the set of all corresponding alignment partitions, we define a matched set of nodes as
In preparation to preferring matches of large connected parts of SLTs, let .
We need to accommodate situations in which symbol in the query formula is replaced by symbol in some parts of a candidate matching formula and by other symbols elsewhere, and where superfluous instances of or might appear in the candidate match. We suggest the following properties for a scoring function, as illustrated in Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search: alignments with more matched symbols in close proximity to each other score higher than those with fewer matched symbols or more disconnected matches; if two candidates score equally with respect to matched symbols and their proximity, the one with fewer superfluous symbols scores higher; and everything else being equal, alignments with identical node labels score higher than alignments with distinct node labels that can be unified. Tangent uses such a scoring function:

Given a query SLT , an SLT for a candidate match, and two aligned SLTs and where is a subtree of and is a subtree of , let be a matched set of nodes for and . The score of with respect to , , , and is denoted and defined as the triple composed of the following parts:

the harmonic mean of the fraction of nodes from preserved by and the fraction of edges preserved by , i.e., if , otherwise 0.

the negation of the number of unmatched nodes in , i.e., .

the number of nodes that match exactly, i.e., .
The scores (triples) assigned to any two candidate matches can be computed in time if , , and are given, and they can be compared lexicographically to determine which candidate ranks higher.

For aligned SLTs and with isomorphism from to and the set of all corresponding alignment partitions, we would like to choose a matched set of nodes that produces a high score, but evaluating all matched sets induced by an alignment is too expensive. Therefore we use a greedy algorithm to select which partitions to include in the matched set of nodes, based on the properties we use for scoring:

Let be the alignment partition that contains the most nodes; or if more than one partition has the most nodes, then let be one of those partitions for which if it exists; otherwise let be any of the largest alignment partitions. Initialize to include all nodes in .

Repeatedly identify the largest alignment partition such that is not the label of any node in and is not the label of any node unified with a node in , choosing to be one where if it exists; replace by .

Stop when no more alignment partitions can be included in .
If hash tables are used to record which node labels have been included in and in , checking for duplicate labels can be performed in time. Partitions can be considered one by one in decreasing order of size, which requires time to initialize and then to enumerate since the number of partitions cannot exceed the number of nodes in .
Finally, to compare a query SLT against a candidate SLT , we choose a pair of aligned subtrees that maximizes the score for the candidate with respect to the query.

Given SLTs and , consider pairs of aligned subtrees and as follows.

The root of can be unified with the root of .

is maximally similar to .
The Maximum Subtree Similarity score MSS of with respect to is over all such pairs, where is a greedily chosen matched set of nodes.

Theorem 2
Computing Maximum Subtree Similarity for a candidate formula requires time .

The number of pairs of aligned subtrees is at most . For each pair, checking whether the roots can be unified requires time, checking maximum similarity requires time, and computing the score requires constant time plus time to choose .
We show experimentally that this similarity metric performs very well.
missingmissingTable \thetable: missingNTCIR11 Wikipedia Formula Retrieval Benchmark Results (100 Queries). For each system the top row shows % recall (over all hits, top), and the bottom row shows the Mean Reciprocal Rank missing
DocumentCentric FormulaCentric Participant Total Easy Frequent Variable Hard Total Easy Frequent Variable Hard TUW Vienna 97 100 100 93 88 93 100 96 89 63 (mrr) 82 97 50 96 54 88 96 72 94 71 NII Japan 97 98 100 93 100 94 98 96 89 88 (mrr) 76 99 49 82 67 77 89 92 78 48 Tangent2 (RIT) 88 98 79 89 63 78 95 50 81 63 (mrr) 80 96 31 92 83 86 94 47 96 83 Tangent3: Using exact formula location on target document w=1, EOL 100 100 100 100 100 89 95 67 100 88 (mrr) 83 100 55 95 41 85 100 58 93 32 w=1 , NoEOL 98 98 96 100 100 87 93 63 100 88 (mrr) 82 100 56 94 31 84 100 59 92 32 Tangent3: Matching equivalent formulae on target document w=1, EOL 100 100 100 100 100 (mrr) 82 100 55 93 28 w=1 , NoEOL 98 98 96 100 100 (mrr) 82 100 56 92 28
In this Section we present experiments designed to observe the effect of system parameters on index size, retrieval time, and search results. We do this using a combination of benchmarks, and a human experiment to evaluate the similarity of the Top10 formulae returned by our system to query expressions.
Computational Resources. We use a Ubuntu Linux 12.04.5 server with 24 Intel Xeon processors (2.93GHz) and 96GB of RAM. While some indexing operations were parallelized (as noted below), all retrieval times are reported for a single process.
The NTCIR11 Wikipedia benchmark [?] is 2.5 GB, with 30,000 articles containing roughly 387,947 unique LaTeX expressions. The benchmark includes 100 queries for measuring specificitem retrieval performance, where each query is associated with a single target formula in a specific document. One or more wildcard symbols are present in 35 of the queries. Easy queries (41) and Frequent queries (24) have no wildcard symbols, and are distinguished by whether one or multiple formulae in the corpus match the target expression. Variable queries (27) and Hard queries (8) contain wildcards, and are again distinguished by whether one or multiple formulae match the target. Search results are returned as a ranked list of pairs.
Systems are evaluated using two metrics. First, by the percentage of targets located (at any rank), and second by the Mean Reciprocal Rank (mrr) of successfully retrieved targets. The ‘documentcentric’ evaluation filters results so that document identifiers are listed in their order of appearance in the results. The ‘formulacentric’ results are computed using the complete ranked list of matches. Previously, ‘formulacentric’ results were computed using specific formula identifiers, so for example, given a query and target formula, if the target formula is found at a different location within the target document, this is considered a miss.
At the top of Table The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search, NTCIR11 results from the two best systems and from Tangent version 2 are shown as fractions rounded to the nearest percentage [?]. The results for Tangent3 are shown below this in Table The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search. For the formulacentric evaluation, we present results when defining hits using specific formula identifiers, and when accepting any identical formula on the target document. The large difference with these two definitions of hits indicates that our system should index all occurrences of formulae in documents, rather than just the first unique occurrence of each formula in a document as done currently.
For Tangent3, all combinations of window sizes (where is all tuples) and including or excluding endofline symbols (EOL/NoEOL) were used. Surprisingly, different window sizes had very little effect on performance, although adding EOL tuples increased the number of formulae retrieved by two (e.g. the query ‘s’ could be located). We show results for windows size 1 () and with and without endofline symbols in Table The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search.
Tangent3 obtains the best document recall and mrr results to date, with perfect recall when using EOL tuples. The slightly lower Variable and Hard query mrr values are an artifact of additional challenging formulae being located at lower ranks. For formula evaluation, the formula recall is improved over Tangent2 when using exact formula id matches for hits, but when we treat equivalent formula on a document as a match, we again obtain perfect recall using EOL tuples. As before, the mrr is slightly lower than in some other results because additional formulae have been found which are located at relatively low ranks.
From a user’s perspective, after formulae are grouped according to their SLT matches (see Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search), 89 of the target formulae appear in the first (Top1) group, and 95 within the Top3 groups. If EOL symbols are added, then 97 of the queries are found in the Top3 formula groups.
We used both NTCIR11 collections to test the index sizes and retrieval times for our system. In addition to the Wikipedia collection described above, we use the much larger NTCIR11 arXiv collection to test the scalability of Tangent. The arXiv collection is 174 GB uncompressed, with 8,301,578 documents (fragments from arXiv articles) and roughly 60 million formulae including isolated symbols.
Indexing Time. The arXiv data took 43 hours to preprocess the documents (using 10 processes), and at most an additional 3.5 hours to generate the index using a single process (all tuples, with endofline tuples). Wikipedia was much faster, requiring 260 seconds for preprocessing, and at most 95 seconds for index creation. As our document preprocessor and reranker are implemented in Python, we believe that a faster implementation (e.g., in C++) could reduce run times by a factor of 410 in both cases.
Index Sizes. As seen in Table The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search, when all tuples are stored with EOL tuples, the Wikipedia index is 499 MB on disk; in contrast, for the arXiv this maximum index size is 29 GB. When these index files are loaded into memory, they consume 2  2.5 times their space on disk. Index size increases roughly linearly from window sizes 14, with endofline tuples increasing storage by a constant amount. For smaller window sizes storage is much smaller; for without endofline tuples the index file is 63 MB for Wikipedia, and 5.2 GB for arXiv; these are much smaller than Tangent2 (1.3 GB for Wikipedia, and roughly 36 GB for the arXiv dataset [?]).
Retrieval and Reranking Times. We ran all 100 of the NTCIR11 Wikipedia queries over the arXiv collection to test retrieval speed. Retrieval is now much faster than Tangent2 (see Section The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search). We see in Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Searchb that median retrieval times are less than 1 second in all conditions, but much faster without endofline symbols. For the smaller Wikipedia collection retrieval is much faster, with average retrieval time without EOL at being 7ms () and 9ms at ().
Reranking times are consistent across core parameters because the number of candidates reranked is fixed at . For Wikipedia, rerank times are (median, , ) = ms. The mean is skewed significantly by a small number of outliers. In one extreme case, reranking took 46 seconds, while retrieval from the core ( with EOL) was just 1.7 seconds. This particular expression is very large, with 16 wildcard symbols (Query 52), producing large hits with many possible unifications. We do not expect to see many queries of this type in common use.
Index Sizes () Wikipedia arXiv w NoEOL EOL NoEOL EOL 1 63.1 72.6 5,238 6,036 2 94.4 103.9 7,419 8,216 3 126.9 136.4 9,491 10,288 4 159.7 169.1 11,397 12,194 489.2 498.7 28,099 28,897 Table \thetable: Index Sizes for NTCIR11 Collections a) MSS Distributions (nDCG) b) Retrieval times (ms) Figure \thefigure: Distribution of Top100 nDCG (MSS) Scores and Wikipedia Query Retrieval Times for the NTCIR11 arXiv Collection. We now consider how well MSSbased rankings correspond to human perceptions of formula similarity, through evaluating Top10 results.
To first select which combinations of parameters to use for our human evaluation, we examined the MSS scores of formulae returned by the core before reranking. In Figure The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Searcha, from the Top100 hits returned by the core for the NTCIR11 Wikipedia task, we compute normalized Discounted Cumulative Gain (nDCG@100) distributions for the Maximum Subtree Similarity Scores in each of the Top100 hits compared to an MSS ‘gold standard.’ In the gold standard all formulae in the Wikipedia collection have been scored for each of the 100 Wikipedia queries, and the top100 formulae for each query are used for normalization. We again consider a number of different window size and EOL parameters (, with and without EOL tuples). The first five columns show increasing values without EOL tuples, followed by the same range of values with EOL tuples.
Adding EOL tuples shifts values around the median down. We took this as evidence that including EOL tuples was not helping return more similar formulae as measured by nDCG over the MSS scores. Further, as moving from to reduces the variance most dramatically, to keep the number of hits for individual participants to evaluate reasonable, we chose to consider only and .
Data. A set of 10 queries were selected using random sampling from the Wikipedia query set. Five of the queries contained wildcards, and the other five did not. Some queries were manually rejected and then randomly replaced to insure that a diverse set of expression sizes and structures were collected. Using the Wikipedia collection, for the three versions of the core compared (, no EOL tuples), we applied reranking to the Top100 hits, and then collected the Top10 hits returned by each query for rating.
Evaluation Protocol. Participants completed the study alone in a private, quiet room with a desktop computer running the evaluation interface in a web browser. The web pages first provided an overview, followed by a demographic questionnaire, instructions on evaluating hits, and then familiarization trials (10 hits; 5 for each of two queries). After familiarization participants evaluated hits for the 10 queries, and finally completed a brief exit questionnaire. Participants were paid $10 at the end of their session.
Participants rated the similarity of queries to results using a fivepoint Likert scale (Very Dissimilar, Dissimilar, Neutral, Similar, Very Similar). It has been shown that presenting search results in an ordered list affects the likelihood of hits being identified as relevant [?]. Instead we presented queries along with each hit in isolation. To avoid other presentation order effects, the order of query presentation was randomized, followed by the order in which hits for each query were presented.
Demographics and Exit Questionnaire. 21 participants (5 female, 16 male) were recruited from the Computing and Science colleges at our institution. Their age distribution was: 1824 (8), 2534 (9), 3544 (1), 4554 (1), 5564 (1) and 6574 (1). Their highest levels of education completed were: Bachelor’s degree (9), Master’s degree (9), PhD (2), and Professional Degree (1). Their reported areas of specialization were: Computer Science (13), Electrical Engineering (2), Psychology (1), Sociology (1), Mechanical Engineering (1), Computer Engineering (1), Math (1) and Professional Studies (1).
In the postquestionnaire, participants rated the evaluation task as Very Difficult (3), Somewhat Difficult (10), Neutral (6), Somewhat Easy (2) or Very Easy (0). They reported different approaches to assessing similarity. Many considered whether operations and operands were of the same type or if two expressions would evaluate to the same result. Others reported considering similarity primarily based on similar symbols, and shared structure between expressions.
Rank/Position in Top10 Hits 1 2 3 4 5 4.54 (0.78) 3.79 (1.16) 3.48 (1.31) 3.20 (1.30) 2.83 (1.24) 4.54 (0.78) 3.71 (1.22) 3.48 (1.30) 3.16 (1.28) 2.90 (1.25) 4.54 (0.78) 3.78 (1.18) 3.59 (1.22) 3.27 (1.15) 2.98 (1.24) 6 7 8 9 10 2.94 (1.22) 2.65 (1.19) 2.78 (1.20) 2.78 (1.20) 2.85 (1.24) 2.93 (1.19) 2.85 (1.25) 2.57 (1.18) 2.74 (1.22) 2.80 (1.13) 2.92 (1.23) 2.80 (1.17) 2.98 (1.23) 2.92 (1.21) 2.87 (1.17) Table \thetable: Mean and Standard Deviation Likert Ratings for Top10 NTCIR11 Wikipedia Hits (21 Participants, 10 queries). Similarity Ratings. As seen in Table The Tangent Search Engine: Improved Similarity Metrics and Scalability for Math Formula Search, the Likertbased similarity rating distributions are very similar, and identical in a number of places. In all three conditions, average ratings increase consistently from the 5th to 1st hits. The top 4 formula hits all have an average rating higher than ’3,’ suggesting that a number of participants felt these formula had some similarity with the query expression. After this the ratings are less than ‘3’ and sometimes shift. Perhaps because matches were not highlighted, in a number of cases exact matches were rated as ‘4’ rather than ‘5.’ As was found for the NTCIR11 Wikipedia benchmark, it appears that a window size of 1 is able to obtain strong results. This is appealing, because this requires the smallest index size and has the fastest retrieval times.
We have presented a new technique for ranking appearancebased formula retrieval results, using the candidate formula subtree with the harmonic mean for matched symbols and edges after greedy unification of symbols by type. This Maximum Subtree Similarity (MSS) metric prefers large connected matches of the query within the formula. In an experiment we found that for the Top10 hits, the human ratings of similarity were consistent with the ranking produced by our metric. We have also described an efficient twostage implementation of our retrieval model that produces stateoftheart results for the NTCIR11 Wikipedia formula retrieval task, using a much smaller index.
In the future we plan to explore using endofline symbols, but only for small expressions. This will not require much additional space in the index, while greatly reducing the cost of wildcard endofline tuples. We also plan to support multiple copies of a formula in a document, devise new methods for ranking documents based on multiple matches and/or query expressions, and integrate our formula retrieval system with keyword search.
Acknowledgements
This material is based upon work supported by the National Science Foundation (USA) under Grant No. IIS1016815 and HCC1218801. Financial support from the Natural Sciences and Engineering Research Council of Canada under Grant No. 9292/2010, Mitacs, and the University of Waterloo is gratefully acknowledged.
 [1] A. Aizawa, M. Kohlhase, and I. Ounis. NTCIR10 math pilot task overview. In NTCIR, pages 654–661, 2013.
 [2] A. Aizawa, M. Kohlhase, I. Ounis, and M. Schubotz. NTCIR11 Math2 task overview. In NTCIR, pages 88–98, 2014.
 [3] J. L. Bentley. An almost optimal algorithm for unbounded searching. Inf. Process. Lett., 5(3):82–87, 1976.
 [4] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a twolevel retrieval process. In CIKM, pages 426–434, 2003.
 [5] P. Graf. Substitution tree indexing. In RTA, pages 117–131, 1995.
 [6] Z. Guan and E. Cutrell. An eye tracking study of the effect of target rank on web search. In SIGCHI, pages 417–420, 2007.
 [7] H. Hiroya and H. Saito. Partialmatch retrieval with structurereflected indices at the NTCIR10 math task. In NTCIR, pages 692–695, 2013.
 [8] S. Kamali and F. W. Tompa. A new mathematics retrieval system. In CIKM, pages 1413–1416, 2010.
 [9] S. Kamali and F. W. Tompa. Structural similarity search for mathematics retrieval. In MKM/Calculemus/DML, pages 246–262. 2013.
 [10] M. Kohlhase and I. Sucan. A search engine for mathematical formulae. In AISC, pages 23–24, 2006.
 [11] X. Lin, L. Gao, X. Hu, Z. Tang, Y. Xiao, and X. Liu. A mathematics retrieval system for formulae in layout presentations. In SIGIR, pages 697–706, 2014.
 [12] B. R. Miller and A. Youssef. Technical aspects of the digital library of mathematical functions. Ann. Math. Artif. Intell., 38(1):121–136, 2003.
 [13] T. T. Nguyen, S. C. Hui, and K. Chang. A latticebased approach for mathematical search using formal concept analysis. Expert Syst. Appl., 39(5):5820 – 5828, 2012.
 [14] N. Pattaniyil and R. Zanibbi. Combining TFIDF text retrieval with an inverted index over symbol pairs in math expressions: The Tangent math search engine at NTCIR 2014. In NTCIR, pages 135–142, 2014.
 [15] P. Pavan Kumar, A. Agarwal, and C. Bhagvati. A structure based approach for mathematical expression retrieval. In MIWAI, pages 23–34. 2012.
 [16] M. Schubotz. Challenges of mathematical information retrieval in the NTCIR11 Math Wikipedia Task. In SIGIR, to appear, 2015.
 [17] P. Sojka and M. Líška. Indexing and searching mathematics in digital libraries. In Calculemus/MKM, pages 228–243. 2011.
 [18] D. Stalnaker and R. Zanibbi. Math expression retrieval using an inverted index over symbol pairs. In DRR, volume 9402, pages 940207–1–12, 2015.
 [19] H. Turtle and J. Flood. Query evaluation: strategies and optimizations. Inf. Process. Manage., 31(6):831–850, 1995.
 [20] L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In SIGIR, pages 105–114, 2011.
 [21] K. Wangari, R. Zanibbi, and A. Agarwal. Discovering realworld use cases for a multimodal math search interface. In SIGIR, pages 947–950, 2014.
 [22] P. Willett. Document retrieval experiments using vocabularies of varying size. II. Hashing, truncation, digram and trigram encoding of index terms. J. Documentation, 35:296–305, 1979.
 [23] A. Youssef. Roles of math search in mathematics. In Proc. Math. Knowl. Manage., pages 2–16. 2006.
 [24] R. Zanibbi and D. Blostein. Recognition and retrieval of mathematical expressions. Int. J. Document Anal. Recognit., 15(4):331–357, 2012.

