# Why Machines Cannot Learn Mathematics, Yet

André Greiner-Petter, University of Wuppertal, Wuppertal, Germany
Terry Ruas, University of Michigan-Dearborn, Dearborn, USA
Moritz Schubotz, University of Wuppertal, Wuppertal, Germany
Akiko Aizawa, National Institute of Informatics, Tokyo, Japan
William Grosky, University of Michigan-Dearborn, Dearborn, USA
Bela Gipp, University of Wuppertal, Wuppertal, Germany
###### Abstract

Nowadays, Machine Learning (ML) is seen as the universal solution for improving the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed through less accurate and imprecise descriptions: mathematical documents generally communicate their knowledge in an ambiguous, context-dependent, and non-formal language. This contributes to the relative dearth of machine learning applications for IR in this domain. Given recent advances in ML, it seems natural to apply ML techniques to represent and retrieve mathematics semantically. In this work, we apply popular text embedding techniques to the arXiv collection of STEM documents and explore why these techniques are unable to properly understand mathematics in that corpus. In addition, we investigate the missing aspects that would allow mathematics to be learned by computers.

###### Keywords:
Mathematical Information Retrieval, Machine Learning, Word Embeddings, Math Embeddings, Mathematical Objects of Interest

## Introduction

Mathematics is capable of explaining complex concepts and relations in a compact, precise, and accurate way. Learning this idiom takes time and is often difficult, even for humans. The general applicability of mathematics allows a certain level of ambiguity in its expressions. This ambiguity is regularly mitigated by short explanations following or preceding the mathematical expressions, which serve as context for the reader. Along with context dependency, inherent issues of linguistics (e.g., ambiguity, non-formality) make it even more challenging for computers to understand mathematical expressions. That said, a system capable of automatically capturing the semantics of mathematical expressions would be suitable for several applications, from improving search engines to recommender systems.

During our evaluations of MathMLBen [SchubotzGSMCG18], a benchmark for converting mathematical LaTeX expressions into MathML, we noticed several fundamental problems that generally affect prominent ML approaches for learning the semantics of mathematical expressions. For instance, the first entry of the benchmark,

$$W(2,k) > 2^k / k^{\varepsilon} \qquad (1)$$

is extracted from the English Wikipedia page about Van der Waerden's theorem. Without further explanation, the symbols $W$, $k$, and $\varepsilon$ might have several possible meanings. Depending on which one is considered, even the structure of the formula may be different. If we consider $W$ as a variable instead of a function, the interpretation of $W(2,k)$ changes to a multiplication operation.

Learning connections, such as between $W(2,k)$ and the entity '*Van der Waerden's number*', requires a large, specifically labeled scientific database that contains these mathematical objects. Furthermore, a fundamental understanding of the mathematical expression would increase the performance during the learning process, e.g., recognizing that different invocations of $W$ contain the same function.

Word embedding techniques have received significant attention over the last years in the Natural Language Processing (NLP) community, especially after the publication of word2vec [Mikolov-b:13]. Recently, more and more projects try to adapt this knowledge to solve Mathematical Information Retrieval (MIR) tasks [DBLP:journals/corr/GaoJYYYT17, DBLP:journals/corr/abs-1803-09123, DBLP:conf/mkm/YoussefM18, DBLP:journals/corr/abs-1902-06034]. While all of these projects follow similar approaches and obtain promising results, all of them fail to understand mathematical expressions because of the same fundamental issues. In this paper, we explore some of the main aspects that we believe are necessary to leverage the learning of mathematics by computer systems.
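The structural ambiguity of Equation (1) discussed above can be made concrete by sketching the two readings of $W(2,k)$ as distinct expression trees. This is a purely illustrative sketch of our own; the tuple-based tree encoding is an assumption, not a representation used by any of the cited systems.

```python
# Two possible parse trees for the notation W(2,k), sketched as nested tuples.

# Reading 1: W is a function applied to the arguments 2 and k.
function_reading = ("apply", "W", ("2", "k"))

# Reading 2: W is a variable, so juxtaposition denotes multiplication
# and (2,k) is a tuple object rather than an argument list.
variable_reading = ("times", "W", ("tuple", ("2", "k")))

# The surface string is identical, but the structures differ entirely,
# so any downstream semantic analysis diverges at this point.
assert function_reading != variable_reading
```

Any system that commits to one of the two trees before consulting the surrounding context has already lost the information needed to recover the other reading.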
We explain, with our evaluations of word embedding techniques on the arXMLiv 2018 [SML:arXMLiv:08.2018] dataset, why current ML approaches are not applicable to MIR tasks, yet.

## Background & Related Work

Understanding mathematical expressions essentially means comprehending the semantic value of their internal components, which can be accomplished by linking their elements with the corresponding mathematical definitions. Current MIR approaches [kristianto2014extracting, disSigir16, Schubotz2017] try to extract textual descriptors of the parts that compose mathematical equations. Intuitively, there are questions that arise from this scenario, such as (i) how to determine the parts which have their own descriptors, and (ii) how to identify correct descriptors over others.

Answers to (i) are chiefly concerned with choosing the correct definitions for which parts of a mathematical expression should be considered as one mathematical object [DBLP:conf/lwa/Kohlhase17, POM-Tagger, SchubotzGSMCG18]. Current definitions, such as the content MathML 3.0 specification, are often imprecise (note that OpenMath is another specification specifically designed for encoding the semantics of mathematics; however, content MathML is an encoding of OpenMath, and the inherent problems of content MathML also apply to OpenMath, see https://www.openmath.org/om-mml/). For example, content MathML 3.0 uses csymbol elements for functions and specifies them as expressions that *refer to a specific, mathematically-defined concept with an external definition*. However, it is not clear whether $W$ alone or the sequence $W(2,k)$ (Equation 1) should be declared as a csymbol. Another example is content identifiers, which MathML specifies as *mathematical variables which have properties, but no fixed value*.
While content identifiers are allowed to have complex rendered structures, it is not permitted to enclose identifiers within other identifiers. Consider $x_i$, where $x$ is a vector and $x_i$ its $i$-th element. In this case, $x_i$ should be considered as a composition of three content identifiers, each one carrying its own individualized semantic information, namely the vector $x$, the element $x_i$ of the vector, and the index $i$. However, with the current specification, the definition of these identifiers would not be canonical. One possible workaround for representing such expressions with content MathML is to use a structure of four nodes, interpreting $x_i$ as a function with the csymbol *vector-selector*. However, ML algorithms and MIR approaches would benefit from more precise definitions and a unified answer for (i). Most of the related work relies on these relatively vague definitions and on the analysis of content identifiers, focusing their efforts on (ii).

In [disSigir16], an approach is presented for scoring pairs of identifiers and definiens (a definiens is a phrase that defines an identifier or mathematical object; considering Equation (1), the correct definiens for $W(2,k)$ is the phrase '*Van der Waerden's number*') by the number of words between them. Their approach is based on the assumption that correct definiens appear close to the identifier and to the complex mathematical expression that contains this same identifier. Kristianto et al. [kristianto2014extracting] introduce an ML approach in which they train a Support Vector Machine (SVM) that considers sentence patterns and other characteristics as features (e.g., part-of-speech (POS) tags, parse trees).
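The four-node *vector-selector* workaround for $x_i$ described above can be sketched in content MathML. This is a minimal sketch under our own assumptions about the surrounding markup; only the csymbol name follows the discussion in the text.

```python
import xml.etree.ElementTree as ET

# Build the content MathML workaround for x_i: an <apply> node whose head
# is a csymbol "vector-selector", applied to the vector x and the index i.
apply_node = ET.Element("apply")

selector = ET.SubElement(apply_node, "csymbol")
selector.text = "vector-selector"   # hypothetical external definition name

vector = ET.SubElement(apply_node, "ci")
vector.text = "x"                   # the vector identifier

index = ET.SubElement(apply_node, "ci")
index.text = "i"                    # the index identifier

# Four nodes in total: apply, csymbol, and two content identifiers.
print(ET.tostring(apply_node, encoding="unicode"))
# → <apply><csymbol>vector-selector</csymbol><ci>x</ci><ci>i</ci></apply>
```

Note that the element $x_i$ itself no longer appears as a single identifier in this encoding, which is exactly the non-canonical aspect criticized above.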
Later, [Schubotz2017] combine the aforementioned approaches and use pattern recognition based on the POS tags of common identifier-definiens pairs, distance measurements, and an SVM, reporting precision and recall of 48.60% and 28.06%, respectively. These results can be considered a baseline for MIR tasks.

More recently, some projects try to use embedding techniques to learn patterns of the correlations between context and mathematics. In the work of [DBLP:journals/corr/GaoJYYYT17], they embed single symbols and train a model that is able to discover similarities between mathematical symbols. Similarly to this approach, Krstovski and Blei [DBLP:journals/corr/abs-1803-09123] use a variation of word embeddings (briefly discussed below) to represent complex mathematical expressions as single-unit tokens for IR. In 2019, M. Yasunaga et al. [DBLP:journals/corr/abs-1902-06034] explore an embedding technique based on recurrent neural networks to improve topic models by considering mathematical expressions. They state that their approach outperforms topic models that do not consider mathematics in text and report a topic coherence improvement over the LDA (Latent Dirichlet Allocation) baseline. What all these embedding projects have in common is that they show promising examples and suggest a high potential, but do not evaluate their results for MIR.

Questions (i), (ii), and other pragmatic issues are already under discussion in a bigger context, as data production continues to rise and digital repositories seem to be the future of any archive structure. The National Research Council is making efforts to establish what they call the *Digital Mathematics Library* (DML), a project under the International Mathematical Union.
The goal of this future project is to take advantage of new technologies and help solve the inability to search, relate, and aggregate information about mathematical expressions in documents over the web.

## Machine Learning on Embeddings

The word2vec [Mikolov-b:13] technique computes real-valued vectors for words in a document using two main approaches: skip-gram and continuous bag-of-words (CBOW). Both produce a fixed-length $n$-dimensional vector representation for each word in a corpus. In the skip-gram training model, one tries to predict the context of a given word, while CBOW predicts a target word given its context. In word2vec, context is defined as the adjacent neighboring words within a defined range, called a sliding window. The main idea is that the numerical vectors representing similar words should have close values if the words have similar contexts, often illustrated by the *king-queen* relationship ($v_{king} - v_{man} + v_{woman} \approx v_{queen}$).

Extending word2vec's approaches, Le and Mikolov [Le:14] propose Paragraph Vectors (PV), a framework that learns continuous distributed vector representations for any size of text segment (e.g., sentences, paragraphs, documents). This technique alleviates the inability of word2vec to embed documents as one single entity. It also comes in two distinct variations: Distributed Memory (DM) and Distributed Bag-of-Words (DBOW), which are analogous to the CBOW and skip-gram training models, respectively. However, in both approaches, an extra feature vector representing the text segment, named paragraph-id, is included as another word.
This paragraph-id is updated throughout the entire document, based on the currently evaluated context window for each word, and is used to represent the whole text segment.

Recently, researchers have been trying to improve their semantic representations by producing multiple vectors (multi-sense embeddings) based on the word's sense, context, and distribution in the corpus [Huang:12, Reisinger:10]. Another concern with traditional techniques is that they often neglect lexical structures with valuable prior knowledge about semantic relations, such as WordNet [DBLP:journals/cacm/Miller95], ConceptNet [LiuCN:04], and BabelNet [Navigli:12]. These lexical structures offer a rich semantic environment that illustrates the word senses, their use, and how they relate to each other. Some publications take advantage of the robustness provided by word embedding approaches and lexical structures, combining them into multi-sense representations and improving their overall performance in many NLP downstream tasks [Mancini:17, Ruas:19, Taher:16].

The lack of solid references and applications that provide the same semantic structure of natural language for mathematical identifiers makes their disambiguation process even more challenging. In natural texts, one can try to infer the most suitable sense for a word based on the lemma (the canonical form, dictionary form, or citation form of a set of words) itself, the adjacent words, dictionaries, thesauri, and so on. However, in the mathematical arena, the scarcity of resources and the flexibility of redefining identifiers take this issue to a more delicate scenario. The text preceding or following a mathematical equation is essential for its understanding.

More recently, [DBLP:journals/corr/abs-1803-09123] propose a variation of word embeddings for mathematical expressions.
Their main idea relies on the construction of a distributed representation of equations, considering the word context vector of an observed word and its word-equation context window. They treat equations as single-unit words (EqEmb), which eventually appear in the context of different words. They also try to explore the effects of considering the elements of mathematical expressions separately (EqEmb-U). In this scenario, mathematical equations are represented using a Symbol Layout Tree (SLT) [DBLP:conf/sigir/ZanibbiDKT16], which contains the spatial relationships between their symbols. While they present some interesting findings for retrieving entire equations, little is said about the vectors representing equation units and how they are described in their model. Word embedding techniques seem to have potential for semantic distance measures between complex mathematical expressions. However, they are not appropriate for extracting the semantics of identifiers separately. This is an indication that the problems of representing mathematical identifiers are tied to more fundamental issues, which are explained later in this paper.

The overall performance of word embedding algorithms has shown superior results in many different NLP tasks, such as machine translation [Mikolov-b:13], relation similarity [Iacobacci:15], word sense disambiguation [Camachob:15], word similarity [Neela:14, Ruas:19], and topic categorization [Taher:17]. In the same direction, we explore how well mathematical tokens can be embedded according to their semantic information. However, mathematical formulae are highly ambiguous and, if not properly processed, their representation is jeopardized.

### How to Embed Mathematics

There are two main standard formats for representing mathematics in science: LaTeX and MathML.
The former is used by humans for writing scientific documents. The latter, on the other hand, is popular in web representations of mathematics due to its machine readability and XML structure. There has been a major effort to automatically convert LaTeX expressions to MathML [SchubotzGSMCG18]. However, neither LaTeX nor MathML is a practical format for embeddings. Considering the equation embedding techniques in [DBLP:journals/corr/abs-1803-09123], we devise three main types of mathematical embeddings.

**Mathematical Expressions as Single Tokens:** EqEmb [DBLP:journals/corr/abs-1803-09123] uses entire mathematical expressions as one token. In this type, the inner structure of the mathematical expression is not taken into account. For example, Equation (1) is represented as one single token. Any other expression, even a closely related one appearing in the surrounding text of (1), is an entirely independent token. Therefore, this approach does not learn any connections between such expressions and (1). While this approach seems to hold interesting results for comparing mathematical expressions, it fails to represent the semantic aspects of the inner elements of mathematical equations.

**Stream of Tokens:** Instead of embedding a mathematical expression as a single token, we can represent it through the sequence of its inner tokens. For example, considering only the identifiers in Equation (1), we would have a stream of three tokens: $W$, $k$, and $\varepsilon$. This approach has the advantage of learning all mathematical tokens. However, this method also has some drawbacks. Complex mathematical expressions may lead to long chains of elements, which can be especially problematic when the window size of the training model is too small. Naturally, there are approaches to reduce the length of such chains.
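The stream-of-tokens reduction for Equation (1) could be sketched as follows. This is a rough, regex-based sketch of our own, not the tokenizer used in any of the cited works; it keeps single Latin letters and a few Greek-letter macros, drops numbers and operators, and deduplicates the result to obtain the three identifiers mentioned above.

```python
import re

def identifier_stream(latex):
    """Reduce a LaTeX expression to its stream of identifiers (a sketch)."""
    # Tokenize: macros like \varepsilon, single letters, numbers, other symbols.
    tokens = re.findall(r"\\[A-Za-z]+|[A-Za-z]|\d+|\S", latex)
    # A tiny, illustrative whitelist of Greek-letter identifiers (assumption).
    greek = {r"\varepsilon", r"\epsilon", r"\alpha", r"\beta", r"\pi"}
    ids = [t for t in tokens if (len(t) == 1 and t.isalpha()) or t in greek]
    # Deduplicate while preserving the order of first occurrence.
    return list(dict.fromkeys(ids))

# Equation (1): W(2,k) > 2^k / k^\varepsilon
print(identifier_stream(r"W(2,k) > 2^k / k^\varepsilon"))
# → ['W', 'k', '\\varepsilon']
```

Even this toy reduction shows the trade-off: the operators and the argument `2` are discarded, so structural information is lost in exchange for a short chain.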
Later in this paper, we present our own model, which uses a stream of mathematical identifiers and cuts out all other expressions. In [DBLP:journals/corr/GaoJYYYT17], L. Gao et al. use CBOW and embed all mathematical symbols, including identifiers and operators, such as arithmetic operators or variations of equality signs. In [DBLP:journals/corr/abs-1902-06034], they do not cut out symbols and train their model on the entire sequence of tokens that the LaTeX tokenizer generates. Considering Equation (1), this would result in a stream of 13 tokens. They use a long short-term memory (LSTM) architecture to handle longer chains of tokens and also to limit their length. Usually, in word embeddings, such behaviour is not preferred, since it increases the noise in the data (i.e., the data consists of many uninteresting tokens that affect the trained model negatively). We will see later in the paper that a typical model trained on mathematical embeddings is able to detect similarities between mathematical objects but does not perform well in detecting connections to word descriptors. Therefore, we consider close relations of mathematical symbols to other mathematical symbols as noise. To mitigate this issue, we only work with mathematical identifiers and not with any other symbols or structures.

**Semantic Groups of Tokens:** The third approach for embedding mathematics is only theoretical, and concerns the aforementioned problems related to the vague definitions of identifiers and functions in a standardized format (e.g., MathML). As previously discussed, current MIR and ML approaches would benefit from a basic structural knowledge of mathematical expressions, such that variations of function calls can be recognized as the same function.
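A semantic grouping could, in principle, normalize different invocations of a function to a shared head symbol before training. The following is a purely illustrative sketch under our own assumptions (including the hypothetical second invocation `W(r,k)`), since, as noted above, no standardized system for this exists.

```python
import re

def function_head(expr):
    """Return the head symbol of a function-style application like W(2,k),
    or None if the expression does not look like an application (a sketch)."""
    m = re.match(r"\s*([A-Za-z]\w*)\s*\(", expr)
    return m.group(1) if m else None

# Different invocations share the same head and could be grouped together
# under one embedding token before training.
print(function_head("W(2,k)"), function_head("W(r,k)"))
# → W W
```

Of course, this naive matching cannot decide whether `W` is actually a function or a variable multiplied by a tuple, which is precisely the ambiguity that would need a standardized answer first.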
Instead of defining a unified standard, current techniques use their own ad-hoc interpretations of structural connections, e.g., that $x_i$ is one identifier rather than three [SchubotzGSMCG18, Schubotz2017]. We assume that an embedding technique would benefit from a system that is able to detect the parts of interest in mathematical expressions prior to any training process. However, such a system does not yet exist.

## Performance of Math Embeddings

The examples illustrated in [DBLP:journals/corr/GaoJYYYT17, DBLP:journals/corr/abs-1803-09123, DBLP:journals/corr/abs-1902-06034] seem to be feasible as a new approach for distance calculations between complex mathematical expressions. While comparing mathematical expressions is essentially practical for search engines or automatic plagiarism detection systems, these approaches do not seem to capture the components of complex structures separately, which is necessary for other applications, such as automated reasoning. Another aspect to be considered is that [DBLP:journals/corr/abs-1803-09123] do not train mathematical identifiers, preventing their system from learning connections between identifiers and definiens (e.g., $W(2,k)$ and the definiens '*Van der Waerden's number*'). Additionally, the connection between entire equations and definiens is, at some level, questionable: entire equations are rarely explicitly named (exceptions are groundbreaking findings, such as the *Pythagorean theorem* or the *energy-mass equivalence*).
However, in the extension EqEmb-U [DBLP:journals/corr/abs-1803-09123], they use an SLT representation to tokenize mathematical equations and obtain specific unit vectors, which is similar to our *identifiers as tokens* approach.

In order to investigate the discussed approaches, we apply variations of a word2vec implementation to extract mathematical relations from the arXMLiv 2018 [SML:arXMLiv:08.2018] dataset, an HTML collection of the arXiv.org preprint archive, which is used as our training corpus. We consider only the subsets that do not report errors during the document conversion (i.e., *no_problem* and *warning*), which represent 70% of arXiv.org. There are other approaches that also produce word embeddings given a training corpus as input, such as fastText [Bojanowski:17], ELMo [Matthew:18], and GloVe [Penni:14]. The choice of word2vec is justified by its implementation, general applicability, and robustness in several NLP tasks [Iacobacci:15, Iacobacci:16, Li:15, Mancini:17, Taher:16, Ruas:19]. Additionally, fastText learns word representations as a sum of the $n$-grams of their constituent characters (sub-words), which would introduce a certain noise into our experiments. ELMo computes its word vectors as the average of their character representations, which are obtained through a two-layer bidirectional language model (biLM). This would bring even more granularity than fastText, as each character in a word has its own $n$-dimensional vector representation. Another factor that prevents us from using ELMo, for now, is its expensive training process. Closer to the word2vec technique, GloVe [Penni:14] is also considered, but its co-occurrence matrix would escalate the memory usage, making its training on arXiv not possible at the moment.
We also examine the recently published Universal Sentence Encoder (USE) [Cer:18] from Google, but its implementation does not allow one to use a new training corpus, only to access its pre-calculated vectors.

As a pre-processing step, mathematical expressions are represented in MathML notation (the source TeX file has to use mathematical environments for its expressions). Firstly, we replace each mathematical expression by the sequence of identifiers it contains, e.g., Equation (1) is replaced by '$W$ $k$ $\varepsilon$'. Secondly, we remove all common English stopwords from the training corpus. Finally, we train a word2vec model (skip-gram) using the following hyperparameter configuration (hyperparameters not mentioned here are used with their default values, as described in the Gensim API [rehurek_lrec]): a vector size of 300 dimensions, a window size of 15, a minimum word count of 10, and negative sampling.

The trained model is able to partially incorporate the semantics of mathematical identifiers. For instance, the 27 closest vectors (by cosine similarity) to a given mathematical identifier are mathematical identifiers themselves, and its fourth closest noun vector is a plausible descriptor. Inspired by the classic *king-queen* example, we explore which tokens perform best to model a known relation. Consider an approximation $v_{\text{variable}} - v_x + v_y \approx v_?$, where $v_{\text{variable}}$ represents the word *variable* and $v_x$, $v_y$ represent two mathematical identifiers. We are looking for the token whose vector fits the approximation $v_?$ best. We call this measure the *semantic distance* with respect to a given relation between two vectors. Table 1 shows the top 10 semantically closest results with respect to the relation between the word *variable* and an identifier.
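The identifier-extraction step described above can be sketched as follows. This is a minimal sketch of our own (real arXMLiv markup is considerably more involved), and the gensim call in the final comment merely mirrors the hyperparameters stated above rather than reproducing our actual pipeline.

```python
import xml.etree.ElementTree as ET

def math_to_identifier_stream(mathml):
    """Replace a (presentation) MathML expression by the sequence of
    identifiers (<mi> elements) it contains, in document order."""
    root = ET.fromstring(mathml)
    return " ".join(mi.text for mi in root.iter("mi") if mi.text)

# A simplified MathML rendering of Equation (1): W(2,k) > 2^k / k^ε.
mathml = ("<math><mi>W</mi><mo>(</mo><mn>2</mn><mo>,</mo><mi>k</mi><mo>)</mo>"
          "<mo>&gt;</mo><msup><mn>2</mn><mi>k</mi></msup><mo>/</mo>"
          "<msup><mi>k</mi><mi>&#x3B5;</mi></msup></math>")
print(math_to_identifier_stream(mathml))
# → W k k k ε

# The resulting token streams, merged with the surrounding text, would then
# be trained with, e.g., gensim's
#   Word2Vec(sentences, vector_size=300, window=15, min_count=10, sg=1)
```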


We also perform an extensive evaluation on the first 100 entries (the same entries used in [Schubotz2017]) of the *MathMLBen* benchmark [SchubotzGSMCG18]. We evaluate the average of the *semantic distances* with respect to the relations between the identifiers and their definiens. In addition, we consider only results at or above a minimum cosine similarity to maintain a minimum quality in our results. The overall results were poor, with low precision and recall. For one of the identifiers in Equation (1), the evaluation presents four semantically close results: *functions*, *variables*, *form*, and a mathematical identifier. Even though poor results were expected, their scale is astonishing.

Additionally, we also try the Distributed Bag-of-Words version of Paragraph Vectors (DBOW-PV) [Le:14], considering the approach of [Schubotz2017]. In [Schubotz2017], they analyze all occurrences of mathematical identifiers and consider the entire article at once. We assume this prevents the algorithm from finding the right descriptor in the text, since later or prior occurrences of an identifier might appear in a different context and therefore potentially introduce different meanings. Instead of using the entire document, we apply the algorithm of [Schubotz2017] only to the input paragraph and to similar paragraphs given by our DBOW-PV model. Unfortunately, the variance within the retrieved paragraphs brings a high number of false positives to the list of candidates, which negatively affects our performance.

We also experiment with other hyperparameters when training our word embedding model to see if it is possible to improve the overall results. However, while the performance decreases, no drastic structural changes appear in the model.
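The analogy-style ranking with a minimum cosine-similarity cutoff, as used in the evaluation above, can be sketched with toy vectors. All vector values here are hypothetical and chosen only to make the ranking and the threshold filter visible.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def closest_above_threshold(target, vocab, threshold):
    """Rank vocabulary tokens by cosine similarity to the target vector,
    keeping only candidates at or above the threshold."""
    scored = [(w, cosine(target, vec)) for w, vec in vocab.items()]
    return sorted([p for p in scored if p[1] >= threshold],
                  key=lambda p: p[1], reverse=True)

# Toy vocabulary (hypothetical values). The offset vector plays the role of
# the analogy result v_word - v_id1 + v_id2 described earlier.
vocab = {"variable": [0.9, 0.1], "function": [0.7, 0.5], "form": [0.1, 0.9]}
offset = [0.8, 0.2]

results = closest_above_threshold(offset, vocab, threshold=0.8)
print([w for w, _ in results])
# → ['variable', 'function']
```

The threshold discards low-similarity candidates such as *form* here, which is exactly the quality filter applied in our benchmark evaluation.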
Figure 1 illustrates a t-SNE plot of the model trained with 400 dimensions, a window size of 25, and a minimum count of 10 words, without any filters applied to the text (note that t-SNE plots may misleadingly create clusters that do not exist in the model; to overcome this issue, we created several plots with different settings, and the results remained similar to the plot shown, which is an indication that the visual clusters also exist in the model). The plot is similar to the visualized model presented in [DBLP:journals/corr/GaoJYYYT17], even though they use a different embedding technique. Compared to [DBLP:journals/corr/GaoJYYYT17], we provide a bigger picture of the model, revealing dense clusters for numbers (with the math token for *invisible times* nearby), for equation abbreviations such as eq1, and for logical operators. We highlight mathematical tokens in the model in red and word tokens in blue. The plot in Figure 1 illustrates that mathematical tokens are close to each other.


## Acknowledgments

This work was supported by the German Research Foundation (DFG grant GI-1259-1). We thank Howard Cohl, who provided insights and expertise.
