Automating the search for a patent’s prior art with a full text similarity search
More than ever, technical inventions are the symbol of our society’s advance. Patents guarantee their creators protection against infringement. For an invention to be patentable, its novelty and inventiveness have to be assessed. Therefore, a search for published work that describes similar inventions to a given patent application needs to be performed. Currently, this so-called search for prior art is executed with semi-automatically composed keyword queries, which is not only time consuming, but also prone to errors. In particular, errors may systematically arise from the fact that different keywords for the same technical concepts may exist across disciplines.
In this paper, a novel approach is proposed, where the full text of a given patent application is compared to existing patents using machine learning and natural language processing techniques to automatically detect inventions that are similar to the one described in the submitted document. Various state-of-the-art approaches for feature extraction and document comparison are evaluated. In addition to that, the quality of the current search process is assessed based on ratings of a domain expert. The evaluation results show that our automated approach, besides accelerating the search process, also improves the search results for prior art with respect to their quality.
Machine Learning Group, Technische Universität Berlin, Berlin, Germany
Pfenning, Meinig & Partner mbB, Berlin, Germany
Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Korea
Max-Planck-Institut für Informatik, Saarbrücken, Germany
A patent is the exclusive right to manufacture, use, or sell an invention and is granted by the government’s patent offices [wipohandbook]. For a patent to be granted, it is indispensable that the described invention is not known or easily inferred from the so-called prior art, where prior art includes any written or oral publication available before the filing date of the submission. Therefore, for each application that is submitted, the responsible patent office performs a search for related work to check if the subject matter described in the submission is inventive enough to be patentable [wipohandbook]. Before handing in the application to the patent office, the inventors will usually consult a patent attorney, who represents them in obtaining the patent. In order to assess the chances of the patent being granted, the patent attorney often also performs a search for prior art.
When searching for prior art, patent officers and patent attorneys currently rely mainly on simple keyword searches such as those implemented by the Espacenet tool from the European Patent Office, the TotalPatent software developed by LexisNexis, or the PatSnap patent search, all of which provide very limited semantic search options. These search engines often fail to return relevant documents, and because of constraints on the length of the entered search text, it is usually not possible to use a patent application’s entire text for the search; instead, the database can only be queried for specific keywords.
Current search approaches for prior art therefore require a significant amount of manual work and time, as given a patent application, the patent officer or attorney has to manually formulate a search query by combining words that should match documents describing similar inventions [Alberts2017book1]. Furthermore, these queries often have to be adapted several times to optimize the output of the search [golestan2015term, Tseng2007]. A main problem here is that regular keyword searches do not inherently take into account synonyms or more abstract terms related to the given query words. This means, if for an important term in the patent application a synonym, such as wire instead of cable, or a more specialized term, such as needle instead of sharp object, has been used in an existing document of prior art, a keyword search might fail to reveal this relation unless the alternative term was explicitly included in the search query. This is relevant as it is quite common in patent texts to use very abstract and general terms for describing an invention in order to maximize the protective scope [tannebaum2015patnet, Andersson2017book9]. A line of research [kando2000workshop, lupu2011current, lupu2017current, lupu2013patent, shalaby2017patent] has focused on automatically expanding the manually composed queries, e.g., to take into account synonyms collected in a thesaurus [magdy2011study, Lupu2017book2] or include keywords occurring in related patent documents [fujii2007enhancing, mahdabi2012learning, mahdabi2014effect]. Yet, with iteratively augmented queries – be it by manual or automatic extension of the query – the search for prior art remains a very time consuming process.
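To make the synonym problem concrete, the following toy sketch contrasts a plain boolean keyword search with a thesaurus-expanded one. It is only an illustration under simplifying assumptions: the two documents, the one-entry thesaurus, and the function names `keyword_search` and `expand_and_search` are made up and do not correspond to any of the search tools mentioned above.

```python
def keyword_search(query_terms, documents):
    """Return ids of documents that contain every query term (boolean AND)."""
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(term in words for term in query_terms):
            hits.append(doc_id)
    return hits


def expand_and_search(query_terms, documents, thesaurus):
    """Like keyword_search, but a term also matches any synonym in the thesaurus."""
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # A term is satisfied if the document contains it or one of its synonyms.
        if all(words & ({term} | thesaurus.get(term, set())) for term in query_terms):
            hits.append(doc_id)
    return hits


# Hypothetical mini corpus and thesaurus (illustration only).
documents = {
    "doc1": "a cable connects the sensor to the pump",
    "doc2": "a wire connects the sensor to the pump",
}
thesaurus = {"cable": {"wire"}}

plain_hits = keyword_search(["cable", "sensor"], documents)
expanded_hits = expand_and_search(["cable", "sensor"], documents, thesaurus)
```

The plain search misses `doc2` although it describes the same arrangement, whereas the expanded search finds both documents, but only because the synonym was entered into the thesaurus by hand.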
Furthermore, a keyword-based search for prior art, even if done with the utmost professional care, will often produce suboptimal results (as we will see later in this paper and in Supporting Information S4b). With possibly imperfect queries, it must be assumed that relevant documents are missed in the search, leading to false negatives (FN). On the other hand, query words can also appear in texts that, nonetheless, have quite different topics, which means the search will additionally yield many false positives (FP). When searching for prior art for a patent application, the consequences of false positives and false negatives are quite different. While false positives cause additional work for the patent examiner, who has to exclude the irrelevant documents from the report, false negatives may lead to an erroneous grant of a patent, which can have profound legal and financial implications for both the owner of said patent and its competitors [Trippe2017book5].
1.1 An approach to automate the search for prior art
To overcome some of these disadvantageous aspects of current keyword-based search approaches, it is necessary to decrease the manual work and time required for conducting the search itself, while increasing the quality of the search results by avoiding irrelevant patents from being returned, as well as automatically accounting for synonyms to reduce false negatives. This can be achieved by comparing the patent application with existing publications based on their entire texts rather than just searching for specific keywords. By considering the entire texts of the documents, much more information, including the context of keywords used within the respective documents, is taken into account. For humans it is of course infeasible to read the whole text of each possibly relevant document. Instead, state-of-the-art text processing techniques can be used for this task.
This paper describes a novel approach to automate the search for prior art with natural language processing (NLP) and machine learning (ML) techniques in order to make it more efficient and accurate. The essence of this idea is illustrated in Fig 1. We first obtain a dataset of related patents from a patent database by using a few seed patents and then recursively adding the patents or patent applications that are cited by the documents already included in the dataset. The patent texts are then transformed into numerical feature vectors, based on which the similarity between two documents can be computed. We evaluate different similarity measures by comparing the documents that our automated approach considers as being very similar to some patent to those documents that were originally cited in this patent’s search report and, in a second step, to documents considered relevant for this patent by a patent attorney.
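The recursive dataset construction can be pictured as a breadth-first traversal of the citation graph. The sketch below is a minimal illustration under assumptions: `cited_by_doc` is a hypothetical in-memory mapping that stands in for lookups against a real patent database, and the optional `max_docs` cap is an addition for the example, not something stated in the text.

```python
from collections import deque


def crawl_citations(seed_ids, cited_by_doc, max_docs=None):
    """Collect the seed patents plus every patent reachable via their citations.

    cited_by_doc: hypothetical mapping {patent_id: [ids of the patents it cites]}.
    max_docs: optional cap on the number of collected documents (assumption).
    """
    collected = set(seed_ids)
    queue = deque(seed_ids)
    while queue:
        current = queue.popleft()
        for cited in cited_by_doc.get(current, []):
            if cited in collected:
                continue  # already part of the dataset
            if max_docs is not None and len(collected) >= max_docs:
                return collected
            collected.add(cited)
            queue.append(cited)  # follow this document's citations later
    return collected


# Toy citation graph with made-up ids: "E" cites "A" but is never cited itself.
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": [], "E": ["A"]}
corpus_ids = crawl_citations(["A"], toy_graph)
```

Starting from the seed `"A"`, the traversal collects `"B"`, `"C"`, and `"D"`, while `"E"` is left out because citations are only followed in the "cites" direction.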
The remainder of the paper is structured as follows: After briefly reviewing existing strategies for prior art search as well as machine learning methods for full text similarity search and its applications, we discuss our approach for computing the similarities between the patents using different feature extraction methods. These methods are then evaluated on an example corpus of patents including their citations, as well as a second corpus where relevant patents were identified by a patent attorney. Furthermore, we assess the quality of the original citation process itself based on both corpora. A discussion of the relevance of the obtained results and a brief outlook conclude this manuscript.
1.2 Related work
Most research concerned with facilitating and improving the search for a patent’s prior art has focused on automatically composing and extending the search queries. For example, a manually formulated query can be improved by automatically including synonyms for the keywords using a thesaurus [magdy2011study, tannebaum2015patnet, Lupu2017book2, magdy2009exploring, wang2013semantic]. A potential drawback of such an approach, however, is that the thesaurus itself has to be manually curated and extended [Zhang2016]. Another line of research focuses on pseudo-relevance feedback, where, given an initial search, the first search results are used to identify additional keywords that can be used to extend the original query [mahdabi2012learning, ganguly2011patent, golestan2015term]. Similarly, past queries [tannebaum2014using] or meta data such as citations can be used to augment the search query [fujii2007enhancing, mahdabi2014effect, mahdabi2014query]. A recent study has also examined the possibility of using the word2vec language model [Mikolov2013a, Mikolov2013b, Mikolov2013c] to automatically identify relevant words in the search results that can be used to extend the query [singh2016relevance].
Approaches for automatically adapting and extending queries still require the patent examiner to manually formulate the initial search query. To make this step obsolete, heuristics can be used to automatically extract keywords from a given patent application [mahdabi2011building, konishi2005query, verma2011applying] or a bag-of-words (BOW) approach can be used to transform the entire text of a patent into a list of words that can then be used to search for its prior art [verberne2009prior, bouadjenek2015study, xue2009transforming]. Often, partial patent applications, such as an extended abstract, may already suffice to conduct the search [bouadjenek2015study]. The search results can also be further refined with a graph-based ranking model [mihalcea2004textrank] or by using the patents’ categories to filter the results [verma2011exploring]. Different prior art search approaches have previously been discussed and benchmarked within the CLEF project, see e.g. [piroi2010clef] and [piroi2013overview].
In our approach, detailed in the following sections, we also reduce the work and time needed to manually compose a search query by simply operating on the patent application’s entire text. However, instead of only searching the database for relevant keywords extracted from this text, we transform the texts of all other documents into numerical feature representations as well, which allow us to compute the full text similarities between the patent application and its possible prior art.
Calculating the similarity between texts is at the heart of a wide range of information retrieval tasks, such as search engine development, question answering, document clustering, or corpus visualization. Approaches for computing text similarities can be divided into similarity measures relying on word similarities and those based on document feature vectors [gomaa2013survey].
To compute the similarity between two texts using individual word similarities, the words in both texts first have to be aligned by creating word pairs based on semantic similarity and then these similarity scores are combined to yield a similarity measure for the whole text. Corley and Mihalcea [Corley2005] propose a text similarity measure, where the most similar word pairs in two texts are determined based on semantic word similarity measures as implemented in the WordNet similarity package [Patwardhan2003]. The similarity score of two texts is then computed as the weighted and normalized sum of the single word pairs’ similarity scores. This approach can be further refined using greedy pairing [lintean2012measuring]. Recently, instead of using WordNet relations to obtain word similarities, the similarity between semantically meaningful word embeddings, such as those created by the word2vec language model [Mikolov2013a], was used. Kusner et al. [kusner2015word] defined the word mover’s distance for computing the similarity between two sentences as the minimum distance the individual word embeddings have to move to match those of the other sentence. While similarity measures based on the semantic similarities of individual words are advantageous when comparing short texts, finding an optimal word pairing for longer texts is computationally very expensive and therefore these similarity measures are less practical in our setting, where the full texts of whole documents have to be compared.
To compute the similarity between longer documents, these can be transformed into numerical feature vectors, which serve as input to a similarity function. Rieck and Laskov [Rieck2008] give a comprehensive overview of similarity measures for sequential data, some of which are widely used in information retrieval applications. Achananuparp et al. [Achananuparp2008] test some of these similarity measures for comparing sentences on three corpora, using accuracy, precision, recall, and rejection as metrics to evaluate how many of the retrieved documents are relevant in relation to the number of relevant documents missed. Huang [Huang2008] uses several of these similarity measures to perform text clustering on tf-idf vectors. Interested in how well similarity measures reproduce human similarity ratings, Lee et al. [lee2005empirical] create a text similarity corpus based on all possible pairs of 50 different documents rated by 83 students. They test different feature extraction methods in combination with four of the similarity measures described in Rieck and Laskov [Rieck2008] and calculate the correlation of the human ratings with the resulting scoring. They conclude that using the cosine similarity, high precision can be achieved, while recall is still not satisfying.
Full text similarity measures have previously been used to improve search results for MEDLINE articles, where a two step approach using the cosine similarity measure between tf-idf vectors in combination with a sentence alignment algorithm yielded superior results compared to the boolean search strategy used by PubMed [lewis2006text]. The Science Concierge [achakulvisut2016science] computes the similarities between papers’ abstracts to provide content based recommendations, however it still requires an initial keyword search to retrieve articles of interest. The PubVis web application by Horn [Horn2017], developed for visually exploring scientific corpora, also provides recommendations for similar articles given a submitted abstract by measuring overlapping terms in the document feature vectors. While full text similarity search approaches have shown potential in domains such as scientific literature, only few studies have explored this approach for the much harder task of retrieving prior art for a new patent application [moldovan2005latent], where much less overlap between text documents is to be expected due to the usage of very abstract and general terms when describing new inventions. Specifically, document representations created using recently developed neural network language models such as word2vec [Mikolov2013a, Mikolov2013b, horn2017conecRepL4NLP] or doc2vec [Mikolov2014] were not yet evaluated on patent documents.
In order to study our hypothesis that the search for prior art can be improved by automatically determining, for a given patent application, the most similar documents contained in the database based on their full texts, we need to evaluate multiple approaches for comparing the patents’ full texts and computing similarities between the documents. To do this, we test multiple approaches for creating numerical feature representations from the documents’ raw texts, which can then be used as input to a similarity function to compute the documents’ similarity.
All raw documents first have to be preprocessed by lower casing and removing non-alphanumeric characters. The simplest way of transforming texts into numerical vectors is to create high dimensional but sparse bag-of-words (BOW) vectors with tf-idf features [Manning2008]. These BOW representations can also be reduced to their most expressive dimensions using dimensionality reduction methods such as latent semantic analysis (LSA) [Landauer1998, moldovan2005latent] or kernel principal component analysis (KPCA) [Schoelkopf1998, Mueller2001, scholkopf2002learning, Scholkopf2003]. Alternatively, the neural network language models (NNLM) [bengio2003neural] word2vec [Mikolov2013a, Mikolov2013b] (combined with BOW vectors) or doc2vec [Mikolov2014] can be used to transform the documents into feature vectors. All these feature representations are described in detail in the Supporting Information S1a.
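As an illustration of the preprocessing step and the tf-idf BOW representation, here is a minimal pure-Python sketch. It rests on several assumptions that are not taken from the paper: non-alphanumeric characters are replaced by spaces before splitting, only ASCII letters and digits are kept, the inverse document frequency is taken as log(N/df), and the vectors are length-normalised. The exact weighting used in the experiments is the one described in the Supporting Information; the function names here are illustrative.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lower-case the text, replace non-alphanumeric runs by spaces, and split."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def tfidf_vectors(texts):
    """Return one sparse {term: weight} dict per document, scaled to unit length."""
    tokenised = [preprocess(t) for t in texts]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = {term: tf * math.log(n_docs / doc_freq[term])
               for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else vec)
    return vectors


tokens = preprocess("Cable-Tie, 2nd!")
vectors = tfidf_vectors(["a wire", "a cable"])
```

With this weighting, a term that occurs in every document (here `"a"`) receives weight zero, so only the discriminative terms contribute to the vector.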
Using any of these feature representations, the pairwise similarity between two documents’ feature vectors $x_i$ and $x_j$ can be calculated using the cosine similarity:
$$\mathrm{sim}(x_i, x_j) = \frac{x_i^\top x_j}{\|x_i\| \, \|x_j\|},$$
which is $1$ for documents that are (almost) identical, and $0$ (in the case of non-negative BOW feature vectors) or below $0$ for unrelated documents [Crocetti2015, Huang2008, Yates1999]. Other possible similarity functions for comparing sequential data [Rieck2008, Pele2011] are discussed in the Supporting Information S1b.
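The cosine similarity is the dot product of two feature vectors divided by the product of their Euclidean norms. A minimal sketch for sparse vectors stored as dictionaries follows; the dictionary representation, the toy vectors, and the choice to return 0 for an empty document are assumptions made for this example.

```python
import math


def cosine_similarity(x_i, x_j):
    """Cosine similarity between two sparse vectors given as {feature: value} dicts."""
    dot = sum(value * x_j.get(feature, 0.0) for feature, value in x_i.items())
    norm_i = math.sqrt(sum(v * v for v in x_i.values()))
    norm_j = math.sqrt(sum(v * v for v in x_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # convention chosen here for empty documents (assumption)
    return dot / (norm_i * norm_j)


# Toy feature vectors (made up): doc_b is a scaled copy of doc_a,
# doc_c shares no features with either of them.
doc_a = {"needle": 2.0, "syringe": 1.0}
doc_b = {"needle": 4.0, "syringe": 2.0}
doc_c = {"bandage": 3.0}

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Because the measure normalises by vector length, `doc_a` and `doc_b` are maximally similar even though their raw weights differ, while documents without shared features such as `doc_a` and `doc_c` obtain a similarity of zero.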
Our experiments are conducted on two datasets, created using a multi-step process as briefly outlined here and further discussed in the Supporting Information S2. For ease of notation, we use the term patent when really referring to either a granted patent or a patent application.
We first obtained a patent corpus containing more than 100,000 patent documents from the Cooperative Patent Classification scheme (CPC) category A61 (medical or veterinary science and hygiene), published between 2000 and 2015. From this corpus we selected altogether 28,381 documents for our first dataset: The roughly 2,500 patents published in 2015 constitute our set of “target patents”. Each target patent cites on average 17.5 (± 28.4) other patents in our corpus (i.e. published after 2000). For each target patent, we selected the set of patents that are cited in its search report. Additionally, we randomly selected another 1,000 patents from the corpus, which were not cited by any of the selected target patents. All target patents are then paired up with their respective cited patents, as well as the 1,000 random patents. Each pair $(p_t, p)$ of a target patent $p_t$ and another patent $p$ is then assigned the label cited if $p$ is cited in the search report of $p_t$, and the label random otherwise. This yields our first dataset consisting of 2,470,736 patent pairs with a ‘cited/random’ labelling. The patent documents in this dataset contain on average 13,530 (± 18,750) words.
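The pairing scheme for the cited/random dataset can be sketched as follows. This is a simplified illustration, not the code used for the experiments: the function name, the toy ids, and the fixed random seed are assumptions, and the random patents are drawn once and shared by all target patents, as the description suggests.

```python
import random


def build_labelled_pairs(citations, corpus_ids, n_random=1000, seed=0):
    """Build (target, other, label) triples with labels 'cited' or 'random'.

    citations: {target_id: set of ids cited in the target's search report}.
    corpus_ids: ids of all documents available in the corpus.
    """
    all_cited = set().union(*citations.values())
    # Random candidates: neither a target nor cited by any target.
    candidates = sorted(set(corpus_ids) - all_cited - set(citations))
    rng = random.Random(seed)
    random_ids = rng.sample(candidates, min(n_random, len(candidates)))
    pairs = []
    for target, cited in citations.items():
        pairs += [(target, other, "cited") for other in sorted(cited)]
        pairs += [(target, other, "random") for other in random_ids]
    return pairs


# Toy example with made-up ids.
citations = {"T1": {"P1", "P2"}, "T2": {"P2"}}
corpus_ids = ["T1", "T2", "P1", "P2", "P3", "P4", "P5"]
pairs = build_labelled_pairs(citations, corpus_ids, n_random=2)
```

Every target patent contributes one pair per cited patent plus one pair per random patent, which is consistent in order of magnitude with the reported size of the first dataset (roughly 2,500 targets, each paired with 1,000 random patents and its cited patents).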
The second dataset is created by obtaining additional, more consistent human labels from a patent attorney for a small subset of the first dataset. These labels should show which of the cited patents are truly relevant to the target patent and whether important prior art is missing from the search reports. For ten patents, we selected their respective cited patents as well as several random patents that either obtained a high, medium, or low similarity score as computed with the cosine similarity on tf-idf BOW features. These 450 patent pairs were then manually assigned ‘relevant/irrelevant’ labels and constitute our second dataset.
A pair of patents should have a high similarity score if the two texts address a similar or almost identical subject matter, and a low score if they are unrelated. Furthermore, if two patent documents address a similar subject matter, then one document of said pair should have been cited in the search report of the other. To evaluate the similarity computation with different feature representations, the task of finding similar patents can be modelled as a classification problem, where the samples correspond to pairs of patents. A patent pair is given a positive label if one of the patents was cited by the other, and a negative label otherwise. We can then compute similarity scores for all pairs of patents and select a threshold such that all patent pairs with a similarity score above this threshold are considered relevant for each other, while similarity scores below the threshold indicate that the patents in this pair are unrelated. With a meaningful similarity measure, it should be possible to choose a threshold such that most patent pairs associated with a positive label have a similarity score above the threshold and the pairs with negative labels score below the threshold. For a given threshold, we can compute the true positive rate (TPR), also called recall, and the false positive rate (FPR) of the classifier. By plotting the TPR against the FPR for different decision thresholds, we obtain the graph of the receiver operating characteristic (ROC) curve, where the area under the ROC curve (AUC) conveniently translates the performance of the classifier into a number between 0.5 (no separation between classes) and 1 (clear distinction between positive and negative samples). Further details on this performance measure can be found in the Supporting Information S3.
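The threshold-based evaluation can be illustrated with a small sketch. The AUC is computed here through its equivalent rank interpretation, namely the probability that a randomly drawn positive pair receives a higher similarity score than a randomly drawn negative pair (ties counted as one half), which equals the area under the ROC curve. The scores and labels are made up, and the quadratic-time loop is chosen for readability rather than for datasets with millions of pairs.

```python
def tpr_fpr(scores, labels, threshold):
    """True and false positive rate when pairs scoring above the threshold count as relevant."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy similarity scores; label 1 = cited pair, label 0 = random pair.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]

tpr, fpr = tpr_fpr(scores, labels, threshold=0.5)
auc = roc_auc(scores, labels)
```

In this toy example one cited pair scores below a random pair, so five of the six positive-negative comparisons are ordered correctly and the AUC is 5/6.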
While the AUC is a very useful measure to select a similarity function based on which relevant and irrelevant patents can be reliably separated, the exact score also depends on characteristics of the dataset and may therefore seem overly optimistic [saito2015precision]. Especially in our first dataset, many of the randomly selected patents contain little overlap with the target patents and can therefore be easily identified as irrelevant. With only a small fraction of the random pairs receiving a medium or high similarity score, this means that for most threshold values the FPR will be very low, resulting in larger AUC values. To give a further perspective on the performance of the compared similarity measures, we therefore additionally report the average precision (AP) score for the final results. For a specific threshold, precision is defined as the number of TP relative to the number of all returned documents, i.e., TP+FP. As we rank the patent pairs based on their similarity score, precision and recall can again be plotted against each other for different thresholds and the area under this curve can be computed as the weighted average of precision ($P$) and recall ($R$) for all threshold values $k$ [zhu2004recall]:
$$AP = \sum_k P_k \, (R_k - R_{k-1}).$$
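A minimal sketch of the average precision computation follows, assuming the common formulation in which the precision at each rank is weighted by the increase in recall at that rank. Each rank is treated as its own threshold, so tied scores are not handled specially; the scores and labels are made up.

```python
def average_precision(scores, labels):
    """Sum over ranks of precision at that rank times the increase in recall."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    true_positives = 0
    ap = 0.0
    previous_recall = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / rank
        recall = true_positives / n_pos
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


# Toy similarity scores; label 1 = relevant pair, label 0 = irrelevant pair.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
ap = average_precision(scores, labels)
```

Only ranks at which a relevant pair is retrieved change the recall and therefore contribute to the sum; here the three relevant pairs are found at ranks 1, 2, and 4, giving an AP of (1 + 1 + 3/4) / 3 = 11/12.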
The aim of our study is to identify a robust approach for computing the full text similarity between two patents. To this end, in the following we evaluate different document feature representations and similarity functions by assessing how well the computed similarity scores are aligned with the labels of our two datasets, i.e., whether a high similarity score is assigned to pairs that are labelled as cited (relevant) and low similarity scores to random (irrelevant) pairs. Furthermore, we examine the discrepancies between patents cited in a patent application’s search report and truly relevant prior art. The data and code to replicate the experiments are available online at https://github.com/helmersl/patent_similarity_search.
5.1 Using full text similarity to identify cited patents
The similarities between the patents in each pair contained in the cited/random dataset are computed using the different feature extraction methods together with the cosine similarity, and the obtained similarity scores are then evaluated by computing the AUC with respect to the pairs’ labels (see the results table below). The similarity scores are computed using either the full texts of the patents to create the feature vectors, or only parts of the documents, such as the patents’ abstracts or their claims, to identify which sections are most relevant for this task [dhondt2010sections, bouadjenek2015study]. Additionally, the results on this dataset using BOW feature vectors together with other similarity measures can be found in the Supporting Information S4a.
|BOW + word2vec||0.9410||0.8618||0.8525|