A Deep Patent Landscaping Model using Transformer and Graph Convolutional Network
Patent landscaping is a method that is employed for searching related patents during the process of a research and development (R&D) project. To avoid the risk of patent infringement and to follow the current trends of technology development, patent landscaping is a crucial task that needs to be conducted during the early stages of an R&D project. Generally, the process of patent landscaping requires several advanced resources and can be tedious. Furthermore, the patent landscaping process has to be repeated throughout the duration of an R&D project. Owing to such reasons, the demand for automated patent landscaping is gradually increasing. However, the shortage of well-defined benchmarking datasets and comparable models makes it difficult to find related research studies.
In this paper, an automated patent landscaping model based on deep learning is proposed. The proposed model comprises a transformer encoder for analyzing textual data present in patent documents and a graph convolutional network for analyzing patent metadata. Six patent landscaping benchmarking datasets, produced by Korean patent attorneys, are also proposed to provide resources for comparing related research studies. The results indicate that the proposed model attains state-of-the-art performance on the proposed datasets, with a mean classification accuracy of 98%.
A patent is a significant deliverable in research and development (R&D) projects. A patent protects an assignee's legal rights and also represents current technology development trends. To study technological trends and identify patents that pose a potential infringement risk, the majority of R&D projects conduct patent landscaping, which involves collecting and analyzing patent documents related to the projects. Generally, patent landscaping is a human-centric, tedious, and expensive process [Trippe, 2015; Abood and Feltenberger, 2018]. Researchers and patent attorneys create keyword sets for querying related patents in large patent databases, eliminate unrelated patent documents, and extract valid patents related to their project. The valid retrieved patents can then be represented as a contour map by employing various visualization techniques [Yang et al., 2010; Wittenburg and Pekhteryev, 2015]. However, because the participants in the process must also be familiar with the scientific domain, the process is costly. Furthermore, the patent landscaping task has to be repeated regularly during an ongoing project to search for newly published patents.
In this paper, a semi-supervised deep learning model for patent landscaping is proposed. After the patent keywords and valid patent documents have been selected initially, the proposed model eliminates repeated patent landscaping tasks by employing a classification model based on deep learning. The proposed model incorporates a transformer encoder [Vaswani et al., 2017] and a graph convolutional network (GCN) [Kipf and Welling, 2016]. Since a patent document contains several textual features as well as bibliometric data, a modified transformer structure was applied for processing textual data and a GCN was applied for processing bibliometric data fields that can be represented as networks. To compare the performance of the proposed model with existing models, benchmarking datasets for patent landscaping were introduced. Such benchmarking datasets are generally created by human experts such as patent attorneys and, owing to issues such as high cost and data security, appropriate ones are generally not available. Through the six benchmarking datasets generated through the Korea Intellectual property STrategy Agency (KISTA), we aim to contribute research resources towards machine learning based patent landscaping research. Experimental results indicate that the proposed model attains state-of-the-art performance on the proposed datasets, and that the average classification accuracy for each dataset is improved by more than 15%.
2 Patent Landscaping
The entire process of patent landscaping is shown in Figure 1. Generally, when searching for patent documents, a keyword query is generated using various technological keywords. Because many assignees do not want their patents to be discovered easily in a search, so as to gain an advantage in any infringement issues that may arise, they tend to write patent titles and abstracts very generically or omit technical details [Tseng et al., 2007]. To deal with this problem, researchers complete search queries with complicated words or key phrases to reinforce the search [Magdy and Jones, 2011]. For example, the following search query is created, as shown in the box below.
The primary focus here is the regularly repeated loop that returns from valid patent selection to search query formulation, as shown in Figure 1. Once the search formula is created, newly published patents must be tracked regularly using a similar search formula. Because the first valid patent selection is similar to creating a training dataset for supervised learning, the repeated selection is a task that can be handled with text classification.
For this reason, patent classification is one of the areas that have been studied steadily in the past decade [Sureka et al., 2009; Chen and Chiu, 2011; Lupu et al., 2013]. Relatively recent studies have tackled International Patent Classification (IPC) prediction by using long short-term memory (LSTM) [Shalaby et al., 2018] and a text convolutional neural network (text-CNN) [Li et al., 2018]. Google has developed automated patent landscaping technology using a deep neural network [Abood and Feltenberger, 2018].
However, the biggest weakness of previous studies is the lack of a suitable benchmarking dataset. Unlike actual patent landscaping, an IPC classification study predicts a suitable IPC code for each patent, even though such a code is already assigned to every patent by assignees and patent examiners. This is not a useful application for patent landscaping in the real world. In the case of Google's work, heuristic algorithms are used to generate the training dataset from technology keywords, such as "Internet of Things". Because the keywords and datasets are too general, it is uncertain whether the model can be applied to real-world R&D projects, which deal with specific and detailed technologies. To the best of the authors' knowledge, a suitable benchmarking dataset generated by human experts for patent landscaping in a detailed technology field is not yet available.
3 Proposed Datasets for Patent Landscaping
3.1 Data sources
We provide a benchmarking dataset for patent landscaping based on the KISTA patent trends reports. Every year, the Korean Patent Office publishes more than 100 patent landscaping reports through KISTA.
|ENDV||Ensuring Technology of Nighttime Driver Visibility|
|NGUG||Next Generation Technology of Underwater Glider|
|MPUART||Marine Plant Technology Using Augmented Reality Technology|
|1MWDFS||Technology for 1MW Dual Frequency System|
|MRRG||Technology for Micro Radar Rain Gauge|
|GOCS||Technology for Geostationary Orbit Complex Satellite|
3.2 Data acquisition
To ensure that the patent datasets can be rebuilt reproducibly, we built the benchmarking datasets using the Google BigQuery public datasets. Most of the patent data in the KISTA reports were obtained with search queries written for WIPS, a local Korean patent database service. We first constructed a Python module that converts a WIPS query into a Google BigQuery query, extracted the patent dataset from BigQuery, and marked the valid patents among the extracted patents. In patent search, the extracted dataset may differ depending on the publication date and the database searched. Therefore, we excluded the queried patents published after the publication date of the original report.
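The last two steps of this acquisition procedure (dropping patents published after the report and marking the valid ones) can be sketched as follows. This is a minimal illustration, not the authors' module: the record layout, the field names `id` and `publication_date`, the function name `filter_and_label`, and the sample records are all assumptions made for the example.

```python
from datetime import date


def filter_and_label(patents, cutoff, valid_ids):
    """Keep patents published on or before `cutoff` and attach a 0/1 label."""
    labelled = []
    for patent in patents:
        if patent["publication_date"] > cutoff:
            continue  # published after the report date, so it is excluded
        labelled.append({**patent, "valid": int(patent["id"] in valid_ids)})
    return labelled


# Hypothetical records; the ids and dates are invented for illustration.
patents = [
    {"id": "US-1", "publication_date": date(2015, 3, 1)},
    {"id": "US-2", "publication_date": date(2016, 7, 9)},
    {"id": "US-3", "publication_date": date(2018, 1, 20)},
]
dataset = filter_and_label(
    patents, cutoff=date(2017, 12, 31), valid_ids={"US-2"}
)
for row in dataset:
    print(row["id"], row["valid"])
```

Fixing the cutoff to the report date keeps the extracted set stable even though the underlying database keeps growing.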
3.3 Dataset description
We extracted patent datasets for six technological areas from Google BigQuery; the number of patents in each dataset is given in Table 2.
|Dataset name||# of patents||# of valid patents|
4 Deep Patent Landscaping Model
4.1 Model overview
Our proposed deep patent landscaping model is composed of two parts: a transformer [Vaswani et al., 2017] and a graph convolutional network (GCN) [Kipf and Welling, 2016]. These parts generate embedding vectors from the features of a patent. The model contains a concatenation layer for the embedding vectors and stacked neural network layers to classify valid patents, as shown in Figure 2. A patent, which is a scientific document, has metadata fields called bibliometrics. They include the International Patent Classification (IPC), the Cooperative Patent Classification (CPC), and citations, alongside textual data such as the abstract and document title. The metadata fields can be represented as graph-based information. For example, citation information can be converted into a citation network that represents how one patent cites another patent [Yoon and Park, 2004]. IPC and CPC, which are technical classification codes for patents, are used as sources of graph information for the co-occurrence matrix; this matrix represents whether patents have the same IPC code. To determine the patent features, we apply the transformer encoder to extract the embedding vectors for textual data, and the GCN structure to extract the embedding vectors for metadata. Finally, we use binary cross-entropy as the loss function for the binary classification model.
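As an illustration of how classification codes can become graph information, the sketch below builds a binary co-occurrence matrix in which two codes are linked when they appear in the same patent. It follows the code-level reading of co-occurrence; the function name `cooccurrence_adjacency` and the sample code strings are assumptions for the example, not part of the original work.

```python
from itertools import combinations


def cooccurrence_adjacency(patent_codes):
    """Build a symmetric 0/1 matrix linking codes that share a patent."""
    codes = sorted({code for code_list in patent_codes for code in code_list})
    index = {code: i for i, code in enumerate(codes)}
    adj = [[0] * len(codes) for _ in codes]
    for code_list in patent_codes:
        # Every unordered pair of distinct codes in one patent co-occurs.
        for a, b in combinations(sorted(set(code_list)), 2):
            adj[index[a]][index[b]] = 1
            adj[index[b]][index[a]] = 1
    return codes, adj


# Hypothetical code lists for three patents (illustrative strings only).
patent_codes = [
    ["B63G8/00", "B63B22/00"],
    ["G01S13/95"],
    ["B63G8/00", "G01S13/95"],
]
codes, adj = cooccurrence_adjacency(patent_codes)
print(codes)
for row in adj:
    print(row)
```

The diagonal is left at zero here; self-connections are typically added later, when the matrix is preprocessed for the GCN.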
4.2 Base features
A patent has different types of features.
Among the metadata fields, we used IPC, CPC, and USPC. They are technological classification codes devised by the patent management organizations. Because these codes are determined directly by the inventors and examiners, they are a good means of representing the characteristics of a patent. We also examined citations, but found that accuracy is lower when citations are used together with the classification codes. We use FamilyID as the feature matrix of the GCN. It is a useful identification code that represents the collection of technological patents filed by the same applicant.
4.3 Computational structures of Transformer and GCN
The core of our model is the embedding output of the transformer and GCN layers. First, we used only the encoder part of the transformer to learn the abstract data of the patent, as shown in Figure 2(a). As in the original transformer structure, we stacked six encoder layers. We also used multi-head self-attention and scaled dot-product attention without modifying the transformer encoder. We set the number of heads of multi-head self-attention to 8. As input, we used word2vec embedding vectors pretrained on the entire patent datasets. When we tokenized the abstract text, the tag [CLS] was inserted at the beginning, and the tag [SEP] was inserted at the end of the sentence. We set the sequence length to 128 and the hidden size to 512. To concatenate the output of the transformer with the GCN embedding vectors, we adopted the squeeze technique from BERT [Devlin et al., 2018] and converted the matrix (sequence length, embedding size) to a vector (embedding size) based on [CLS].
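To make the two operations named above concrete, here is a small dependency-free sketch of scaled dot-product attention and of the squeeze step that reduces a (sequence length, embedding size) matrix to one vector by picking the row of the [CLS] tag. It is a toy illustration under stated assumptions, not the authors' implementation: there is a single head, no learned projections, and the helper names and the four-token example are hypothetical. The position of [CLS] is looked up rather than assumed.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


def squeeze_on_cls(tokens, hidden_states):
    """Return the hidden vector at the position of the [CLS] tag."""
    return hidden_states[tokens.index("[CLS]")]


# Toy input: four tokens with 2-dimensional embeddings (made-up values).
tokens = ["[CLS]", "underwater", "glider", "[SEP]"]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
attended = scaled_dot_product_attention(x, x, x)  # self-attention
pooled = squeeze_on_cls(tokens, attended)
print(pooled)
```

In the model described above the same idea runs with 8 heads, a sequence length of 128, and a hidden size of 512.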
The GCN layer first constructs a preprocessed adjacency matrix for each metadata field (CPC, IPC, and USPC) using the co-occurrence information in a patent. It then constructs a one-hot vector for the occurrence of FamilyIDs for each metadata field and stacks the one-hot vectors $N$ times to create a feature matrix, $X \in \mathbb{R}^{N \times F}$. As shown in Figure 4, $N$ denotes the number of elements for each metadata field, and $F$ denotes the number of FamilyIDs. Next, a layer-wise propagation rule is used for the dot-product calculations, as shown in Figure 4.
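$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right), \qquad \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \tag{1}$$

In Equation 1, $H^{(l)}$ is the output of the $l$-th layer (the feature matrix for $l = 0$), $W^{(l)}$ is the trainable weight matrix of that layer, and $\hat{A}$ is a preprocessed adjacency matrix. $\tilde{A} = A + I$ denotes the sum of the adjacency matrix $A$ and an identity matrix $I$, called the self-connection matrix. The adjacency matrix is a binary matrix representing the co-occurrence relationship that exists between the elements of a metadata field, that is, the classification codes of the same patent. If two CPC codes exist in the same patent, the entry of the binary matrix representing the relationship between the two codes is 1. The degree matrix, $\tilde{D}$, is a diagonal matrix whose entries are the row sums of $\tilde{A}$, that is, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and its size is the same as that of the adjacency matrix. We used ReLU as the activation function $\sigma$, and the final output size of the GCN structure was 128.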
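The propagation rule of Kipf and Welling can be traced by hand on a tiny graph. The sketch below adds self-connections, applies the symmetric degree normalization, and runs one ReLU layer. It is a minimal illustration in plain Python: the 3-node graph, the one-hot features, and the weight values are invented for the example, and a real model would learn the weights and use a numerical library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j]
             for j in range(n)] for i in range(n)]


def gcn_layer(a_hat, h, w):
    """One propagation step: ReLU(A_hat H W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, value) for value in row] for row in z]


# Toy graph: codes 0 and 1 co-occur in a patent, code 2 stands alone.
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # one-hot rows (made up)
weights = [[1.0, -2.0], [0.5, 1.0]]              # fixed, illustrative values

a_hat = normalize_adjacency(adj)
hidden = gcn_layer(a_hat, features, weights)
for row in hidden:
    print([round(value, 3) for value in row])
```

Nodes 0 and 1 end up with identical rows because, after normalization, each averages its own features with its neighbour's, which is how co-occurring codes come to share an embedding.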
Using the GCN structure, we extracted useful features of the patent by learning the relationship between classification codes that occurred together. As the values of GCN embedding vectors were close to zero, we applied batch normalization [Ioffe and Szegedy, 2015] to the embedding vectors before concatenating them with other embedding vectors from the transformer structure.
We measured the performance of the proposed model for the classification of valid patents in the six KISTA datasets. We extracted 8,000 to 40,000 patents for each dataset from Google BigQuery and, after downsampling each dataset, split it into training and test sets at a ratio of 8:2. The proposed datasets can be downloaded via the URLs provided in Table 3.
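The downsampling and 8:2 split can be sketched as below. The exact sampling ratio is not stated in the text, so keeping as many non-valid patents as valid ones is an assumption of this example, as are the function name `downsample_and_split`, the fixed seed, and the synthetic identifiers.

```python
import random


def downsample_and_split(valid, invalid, train_ratio=0.8, seed=0):
    """Balance the classes by sampling non-valid patents, then split."""
    rng = random.Random(seed)
    # Assumption: keep at most as many non-valid patents as valid ones.
    kept_invalid = rng.sample(invalid, min(len(invalid), len(valid)))
    examples = [(p, 1) for p in valid] + [(p, 0) for p in kept_invalid]
    rng.shuffle(examples)
    cut = int(len(examples) * train_ratio)
    return examples[:cut], examples[cut:]


# Synthetic identifiers standing in for patent records.
valid = [f"valid-{i}" for i in range(50)]
invalid = [f"other-{i}" for i in range(400)]
train_set, test_set = downsample_and_split(valid, invalid)
print(len(train_set), len(test_set))
```

Seeding the random generator keeps the split reproducible across runs.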
5.2 Hyperparameter settings
We extracted embedding vectors from the transformer and the GCN, which were trained in separate learning processes. In the transformer, six encoder layers were stacked, and the number of heads of multi-head attention was 8. The total number of learning epochs, the batch size, the optimizer, the learning rate, and the epsilon were 20, 64, the Adam optimizer [Kingma and Ba, 2014], 0.0001, and 1e-8, respectively. We set the sequence length, which is the maximum length of the input sentence, to 128, and padded shorter inputs with zeros. As a result, 512-dimensional embedding vectors were extracted for each word.
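The fixed sequence length implies a simple pad-or-truncate step on the token ids, sketched below. The function name `pad_or_truncate` and the example ids are hypothetical; the maximum length of 128 and the padding value 0 come from the settings above.

```python
def pad_or_truncate(token_ids, max_len=128, pad_id=0):
    """Clip a token-id list to `max_len`, then right-pad it with `pad_id`."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short_abstract = [17, 4, 256, 9]      # made-up token ids
long_abstract = list(range(1, 201))   # 200 made-up token ids

padded = pad_or_truncate(short_abstract)
truncated = pad_or_truncate(long_abstract)
print(len(padded), len(truncated))
```

Truncation discards everything after the 128th token, so information in very long abstracts never reaches the encoder.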
In the GCN, two GCN layers were used, and the total number of learning epochs was 200. 128-dimensional embedding vectors were extracted for each metadata field from the GCN, and an average of the extracted embedding vectors was taken. We used the early stopping technique to decrease the total time of the learning process.
5.3 Evaluation metric
We used accuracy and the F1 score, which are commonly used in binary classification problems, as evaluation metrics. Because the datasets are highly imbalanced, we trained the model after downsampling the patents that are not valid to a number similar to the number of valid patents.
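For reference, both metrics can be computed from the confusion counts as in the following sketch, where label 1 stands for a valid patent. The function name `accuracy_and_f1` and the label vectors are invented for illustration.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, treating 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Made-up labels: 3 valid patents among 8, one missed and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, round(f1, 3))
```

On imbalanced data the F1 score is the more informative of the two, because accuracy can stay high even when most valid patents are missed.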
5.4 Overall results
We compared our proposed model (TRF+GCN in Table 4) with Google's Automated Patent Landscaping model (APL [Abood and Feltenberger, 2018] in Table 4), which served as the baseline. The results showed that the proposed model had the highest performance, over 98% accuracy, on all datasets, as shown in Table 4. In particular, the proposed model improved performance by more than 20% compared with using only the GCN structure with metadata. In addition, our model attained an average improvement of 10% or more over the APL baseline.
We will investigate the following subjects empirically, with regard to the layers and features that affect the patent landscaping model.
Which indicators of textual data and metadata have more influence on the model?
Which features, including citation information, have the most impact on the classification of patent landscaping?
When a misclassification occurs, what are the characteristics of the dataset that is misclassified?
5.5 Textual data vs. metadata represented in graph structure
In the first experiment, we found that a significant level of performance can be obtained from textual data (the abstract) alone, compared with the bibliometric data that can be represented as a graph. This is because most of the datasets that we use cover convergence technologies, which cannot be distinguished by the existing patent technology classifications alone. In other words, in the patent landscaping process, it is difficult to classify the technology type of patents using only the classification codes defined by existing patent-related organizations such as the United States Patent and Trademark Office (USPTO) and the World Intellectual Property Organization (WIPO). For example, an electric vehicle battery technology may include not only a device but also various technology groups such as electric car battery materials and coolants. Therefore, by considering the contents of the patent along with the patent technology classification code, we can classify the patent according to its exact purpose. As shown in Table 4, there is a certain performance correlation between the transformer-only model and the APL model, whereas the GCN-only model shows a relatively weak correlation. This also shows the difficulty of classifying a patent in R&D projects using only the classification codes of existing patents.
5.6 Classification codes vs. citation information
Our second question is how effective the widely used citation information is in patent landscaping. In conclusion, although citation information is as effective as the classification codes, it is more convenient to use only classification codes. In Table 6, 3CODE means that the model uses the transformer with the GCN including IPC, CPC, and the United States Patent Classification system (USPC). REF means that the model uses the transformer with the GCN including citation information only. 4CODE means that the model uses the transformer with the GCN including all the metadata mentioned above. The performances of 3CODE and REF are not significantly different, but in most cases 3CODE performs better than REF. In addition, in some cases 4CODE performs worse than either REF or 3CODE alone.
We assume the reason is that citation information can contain several kinds of noise. In patent citations, inventors cite not only prior patents in the same application area but also patents related to the underlying principle technology. Sometimes, the principle technology patent is rare and is not related to all patents in the same technology application. In addition, technology classification codes have the advantage that the learning space is relatively small compared to citations, because the range of codes that can appear in a similar patent field is narrow. For this reason, technology classification codes are more suitable features than citation information in patent landscaping.
5.7 Which factors result in misclassification?
Our model achieved a performance improvement of 10% compared to the existing model and over 98% accuracy in patent landscaping on the proposed datasets. However, there are misclassifications. The following characteristics are considered to be the reasons for the misclassifications.
First, when the abstract sequence is too long or too short, the model does not perform well. We set the length of the abstract to 128 words and force padding or truncation if the abstract is shorter or longer than that. When there is significant zero padding or truncation, misclassification seems more likely to occur.
Second, in the misclassified category, the average abstract length is about 110, which is longer than that of well-classified data. Finally, since numbers and special characters are removed during preprocessing, an abstract that contains many of them loses information, resulting in a drop in classification performance. In the case of metadata, when only the abstract and classification codes are used, misclassification may occur if the source of features is not sufficient.
6 Conclusion and Future Work
In this paper, we proposed a deep patent landscaping model that solves the classification problem in patent landscaping using transformer and GCN structures. This research suggested a new benchmarking dataset for the automated patent landscaping task and worked to make it a practical study for automated patent landscaping. In particular, it is anticipated that it will be possible to reduce the repetitive patent analysis tasks of practitioners performing patent analysis tasks.
The main contributions of our research are as follows.
We released a new patent landscaping dataset that has been categorized by experts.
We propose a state-of-the-art model that is at a level applicable to the real world.
We examined features of a patent that can be used for patent landscaping.
We discovered the following points about the use of patent features for automated patent landscaping. To achieve the highest classification performance, both graph-based information such as bibliometrics and textual data need to be considered. The classification codes specified by inventors in static metadata are a useful feature in patent landscaping. Combining citation information with classification codes is not recommended.
We believe that we have developed a model for patent landscaping that has the possibility to be commercialized at a laboratory scale. In the future, we will build a system that can support building benchmark datasets of over 2000 KISTA datasets to verify the scalability of the model. We will also build models that can classify not only US patents but also European, Japanese, Chinese, and Korean patents. In particular, Japanese patents are classified with an additional code scheme, namely F-Term, which does not necessarily share similar characteristics with IPC and needs further consideration. In the long term, our goal is to construct a fully automated classification system for patent landscaping that is customized for a company or government organization.
- Co-corresponding Authors
- Most of the search queries were based on the WIPS (https://www.wipson.com) service, which is a local Korean patent database company.
- Aaron Abood and Dave Feltenberger. Automated patent landscaping. Artificial Intelligence and Law, 26(2):103–125, June 2018.
- Yen-Liang Chen and Yu-Ting Chiu. An IPC-based vector space model for patent retrieval. Information Processing & Management, 47(3):309–322, 2011.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, October 2018.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
- D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv e-prints, December 2014.
- Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Shaobo Li, Jie Hu, Yuxin Cui, and Jianjun Hu. DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics, 117(2):721–744, 2018.
- Mihai Lupu, Allan Hanbury, et al. Patent retrieval. Foundations and Trends® in Information Retrieval, 7(1):1–97, 2013.
- Walid Magdy and Gareth JF Jones. A study on query expansion methods for patent retrieval. In Proceedings of the 4th workshop on Patent information retrieval, pages 19–24. ACM, 2011.
- Marawan Shalaby, Jan Stutzki, Matthias Schubert, and Stephan Günnemann. An LSTM approach to patent classification based on fixed hierarchy vectors. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 495–503. SIAM, 2018.
- Ashish Sureka, Pranav Prabhakar Mirajkar, Prasanna Nagesh Teli, Girish Agarwal, and Sumit Kumar Bose. Semantic based text classification of patent documents to a user-defined taxonomy. In International Conference on Advanced Data Mining and Applications, pages 644–651. Springer, 2009.
- Anthony Trippe. Guidelines for preparing patent landscape reports. Patent landscape reports. Geneva: WIPO, 2015.
- Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin. Text mining techniques for patent analysis. Information Processing & Management, 43(5):1216–1247, 2007.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
- K Wittenburg and G Pekhteryev. Multi-dimensional comparative visualization for patent landscaping. merl.com, 2015.
- Y Y Yang, L Akers, C B Yang, T Klose, and S Pavlek. Enhancing patent landscape analysis with visualization output. 2010.
- Byungun Yoon and Yongtae Park. A text-mining-based patent network: Analytical tool for high-technology trend. The Journal of High Technology Management Research, 15(1):37–50, 2004.