Cross-domain Aspect Category Transfer and Detection via Traceable Heterogeneous Graph Representation Learning
Aspect category detection is an essential task for sentiment analysis and opinion mining. However, the cost of categorical data labeling, e.g., labeling the review aspect information for a large number of product domains, can be inevitable but unaffordable. In this study, we propose a novel problem, cross-domain aspect category transfer and detection, which faces three challenges: various feature spaces, different data distributions, and diverse output spaces. To address these problems, we propose an innovative solution, Traceable Heterogeneous Graph Representation Learning (THGRL). Unlike prior text-based aspect detection works, THGRL explores latent domain aspect category connections via massive user behavior information on a heterogeneous graph. Moreover, an innovative latent variable, “Walker Tracer,” is introduced to characterize the global semantic/aspect dependencies and capture the informative vertexes on the random walk paths. By using THGRL, we project different domains’ feature spaces into a common one, while allowing data distributions and output spaces to remain different. Experiment results show that the proposed method outperforms a series of state-of-the-art baseline models.
As an essential task of aspect-based sentiment analysis (Pontiki et al., 2016), aspect category detection provides an important means to identify the referred aspects from a free-text review (Zhou et al., 2015). For instance, in a review of “The length of dress is perfect for me, it goes right above knee. At this price it is a bargain…”, the aspect categories “price” and “wearing effect” can be detected for further sentiment analysis and opinion mining.
To address the aspect detection problem, a number of machine learning approaches (Ganu et al., 2009; Kiritchenko et al., 2014; Zhou et al., 2015; Ruder et al., 2016), i.e., learning a specific aspect classifier based on review representation for each target domain, have been proposed. Although the mathematical mechanisms behind the models can be different, they all share the same prerequisite: a decent labeled training dataset is available for the target product domain. Unfortunately, such a cost can be inevitably high for e-commerce platforms because different product domains may have different aspect categories and users may use distinctly different words to express the same aspect category across different domains. Take the world-leading e-commerce services, eBay and Taobao, as an example. They both feature more than 20,000 product domains, and the cost of data annotation can be very high.
Intuitively, facing a (target) domain with limited labeled training data (cold start problem), one could improve the model’s performance by enhancing the training process with supplementary labeled data from a related (source) domain. However, in the e-commerce scenario, the following reasons make the cross-domain aspect knowledge transfer a challenging task: (1) Various feature (vocabulary) spaces. While shopping in different domains, customers may use different vocabularies to review the target products or services. (2) Different data distributions. Even for the same aspect, term usage can vary in different domains. For instance, for aspect “size,” customers will use phrases such as “extra large (XL)” in clothing domain, while “270mm” in shoes domain. (3) Diverse output (aspect) spaces. Each domain can have its own aspect categories, e.g., aspect “pilling” for clothing, and aspect “functionality” for shoes.
Because of these reasons, most existing transfer learning methods can hardly be applied to cross-domain aspect transfer. For instance, existing Domain Adaptation (DA) models (Glorot et al., 2011; Pan et al., 2010; Zellinger et al., 2017), which train and test models on different distributions across domains, can hardly be adopted for the cross-domain aspect category detection task because the DA approach assumes that source and target domains share the same feature and output spaces (Daume III and Marcu, 2006).
Although the textual feature itself, along with the existing methods, cannot fully solve this problem, the abundant user behavior information (e.g., browse and purchase information) can offer important potential for cross-domain aspect category detection. Unlike review text, user behavior information can be domain-independent. For instance, a fashionable customer may pay more attention to appearance, style, and similar aspect information, regardless of domains of the purchased products; while a pragmatist may focus on the quality- and price-related aspects of the target products. This observation motivates us to propose a novel transfer model by leveraging user behavior information, which can be used as intermediate paths for domain integration.
In this study, we propose a novel solution, Traceable Heterogeneous Graph Representation Learning (THGRL), for cross-domain aspect category knowledge transfer and the improvement of review representation learning. By using e-commerce user behavior information, one can connect different domains’ textual review information and construct a novel cross-domain heterogeneous graph. Then, supplementary aspect information (knowledge) can travel from one domain to another via random walks. Meanwhile, to efficiently characterize knowledge transfer on the graph, we introduce an innovative latent variable (we call it “Walker Tracer”) to capture global semantic dependencies in the graph. Because of the randomness of the random walk-based generation mechanism, the graph-generated paths (vertex sequences) can contain considerable noise, which makes the fixed context window based approach problematic. For instance, although an unlabeled review can be walked to a related aspect, the distance (in the walking path) between these two vertexes could be very long. The proposed “Walker Tracer” aims to summarize, organize, and navigate the graph-generated path collections. Its basic goal is to find the global semantic coherency pattern across different types of vertexes and relations while eliminating the noisy information. From an algorithm viewpoint, the proposed method can project vertexes (on the heterogeneous graph) and tracers (latent variables on the global level) into a low-dimensional joint embedding space. Unlike prior graph embedding approaches (Perozzi et al., 2014; Tang et al., 2015; Kipf and Welling, 2017; Jiang et al., 2018a), which focus more on local graph structure representation, the proposed algorithm can capture global random walk patterns across heterogeneous vertexes (customer, review, aspect, item, etc.). Meanwhile, it is fully automatic and, unlike (Dong et al., 2017), does not rely on handcrafted features.
By using THGRL, the sparse (labeled) aspect information can be deliberately diffused to the most related vertexes based on the local/global semantic plus topology information for cross-domain aspect category transfer and detection.
The contribution of this paper is four-fold.
First, we propose a novel cross-domain aspect transfer problem for aspect category detection in which the feature spaces, data distributions, and output spaces (of the source and target domains) can be all different.
Second, we investigate user behavior information and review textual information to construct a heterogeneous graph, which can enhance aspect category detection performance and address the cold-start problem.
Third, an innovative model, Traceable Heterogeneous Graph Representation Learning (THGRL), is proposed to address the cross-domain aspect transfer problem. THGRL can not only auto-embed the graph heterogeneity (i.e., various types of information) but also characterize the global graphical pattern plus topological information for representation learning. By using THGRL, we project different domains’ feature spaces into a common one while allowing data distributions and output spaces to remain different.
Last but not least, we validate the proposed method by experiments on multiple real-world e-commerce datasets. In order to help other scholars reproduce the experiment outcome, we release the datasets via GitHub (https://github.com/lzswangjian/THGRL). To the best of our knowledge, these are the first aspect category detection datasets associated with user behavior information.
2. Problem Formulation
Definition 1 (Aspect Category Detection).
For a given product domain $D$ with its own feature space $\mathcal{X}$, an aspect category detection task can be defined as a multi-label classification problem, which is specified by two parts: an output space $\mathcal{Y}$ and a classification function $f: \mathcal{X} \rightarrow \mathcal{Y}$. A review instance can be represented as a feature vector $\mathbf{x} = (x_1, \dots, x_n) \in \mathcal{X}$, where $x_i$ denotes a feature and $n$ is the feature dimension. $A = \{a_1, \dots, a_m\}$ is the set of predefined aspect categories (labels); $\mathbf{y} = (y_1, \dots, y_m) \in \mathcal{Y}$ represents a binary vector (assigning a value of 0 or 1 to each element); $a_j$ denotes an aspect category; and $m$ is the number of aspect categories. $f$ is learned from a training set $T = \{(\mathbf{x}_k, \mathbf{y}_k)\}_{k=1}^{|T|}$; each element of $T$ is a pair of a feature vector and a label vector; and $|T|$ is the training sample number.
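The binary label vector in Definition 1 can be illustrated with a minimal Python sketch. The function name `encode_aspects` and the four-category inventory are hypothetical and chosen only for illustration; the two active aspects come from the dress review quoted in the introduction.

```python
def encode_aspects(review_aspects, categories):
    """Return the 0/1 label vector of a review over a fixed category order."""
    unknown = set(review_aspects) - set(categories)
    if unknown:
        raise ValueError(f"unknown aspect categories: {sorted(unknown)}")
    return [1 if c in review_aspects else 0 for c in categories]


# Hypothetical category inventory for a clothing domain (an assumption).
categories = ["price", "wearing effect", "pilling", "logistics"]

# The dress review from the introduction mentions "price" and "wearing effect".
y = encode_aspects({"price", "wearing effect"}, categories)
print(y)  # [1, 1, 0, 0]
```

Because a review may activate several positions at once, the task is multi-label rather than multi-class.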
Definition 2 (Cross-Domain Aspect Category Transfer and Detection).
$D_t$ is the target product domain with its feature space $\mathcal{X}_t$. In $D_t$, there is a labeled training set $T_t$ and an unlabeled data set $U_t$ (the labeled sample size $|T_t|$ can be very small). Cross-domain aspect category transfer and detection aims to improve the performance of the target classification function $f_t$ by introducing an auxiliary source product domain $D_s$.
In this study, we assume:
$\mathcal{X}_t \neq \mathcal{X}_s$. The feature spaces of target and source domains can be different.
$\mathcal{Y}_t \neq \mathcal{Y}_s$. The output spaces (i.e., aspect categories) of target and source domains can be different.
$P_t(\mathbf{x}) \neq P_s(\mathbf{x})$. The data distributions of target and source domains can be different.
In this study, we aim to investigate a novel method to project the different feature spaces of the source and target domains into a common joint graph/text embedding space while reducing the negative effect of the different data distributions and output spaces by cross-domain knowledge transfer. The proposed method focuses on enhancing heterogeneous graph representation. The method is introduced in detail in Section 3.
Definition 3 (Heterogeneous Graph).
Following the works (Sun et al., 2011; Dong et al., 2017), a heterogeneous graph, namely a heterogeneous information network, is defined as a graph $G = (V, E)$, where $V$ denotes the vertex set and $E \subseteq V \times V$ denotes the edge (relation) set. $\tau: V \rightarrow T_V$ is the vertex type mapping function, and $T_V$ denotes the set of vertex types. $\gamma: E \rightarrow T_E$ is the relation type mapping function, and $T_E$ denotes the set of relation types. $|T_V| + |T_E| > 2$.
3.1. Cross-domain E-commerce Heterogeneous Graph
In this study, by leveraging e-commerce behavior information, we enrich the review textual content representation via various types of relations on the heterogeneous graph, which enables cross-domain sentiment aspect transfer and graphical aspect category augmentation. For instance, following a sequence of a customer’s purchase and review writing behaviors, the knowledge of a source domain review’s aspect categories can be diffused to another unlabeled review in the target domain.
As Table 1 shows, six types of objects (vertexes) and eight types of relations (edges) are encapsulated in the proposed heterogeneous graph. Note that the vertexes of two different product domains could be shared. For instance, a “seller” may sell shoes and clothes simultaneously, and a “customer” can also purchase products from both domains. Furthermore, the reviews of two domains may contain the same “words.” Meanwhile, they may also share the same aspect categories, such as “logistics,” “seller’s services,” etc. The graph enables cross-domain aspect knowledge transfer.
|Customer||Seller (Online Store)|
|Product receives a review|
|Seller gets a review|
|Customer writes a review|
|Customer purchases a product|
|Review mentions an aspect category|
|Word is related to an aspect category (The word is contained in a review which mentions an aspect category)|
|Review contains a word|
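A minimal Python sketch of how such a graph could be assembled from purchase and review records is given below. It is an illustration under stated assumptions, not the authors' construction code: the record fields, the relation names, and the function `build_hetero_graph` are hypothetical, edges are stored in both directions, and only the seven relations listed in the table are covered.

```python
from collections import defaultdict


def build_hetero_graph(records):
    """Return adj[vertex][relation] -> list of neighbours; a vertex is a (type, id) pair."""
    adj = defaultdict(lambda: defaultdict(list))

    def link(u, v, relation):
        # Store both directions so a random walker can traverse the relation either way.
        adj[u][relation].append(v)
        adj[v][relation].append(u)

    for r in records:
        customer = ("customer", r["customer"])
        seller = ("seller", r["seller"])
        product = ("product", r["product"])
        review = ("review", r["review"])
        link(product, review, "product-review")
        link(seller, review, "seller-review")
        link(customer, review, "customer-review")
        link(customer, product, "customer-product")
        aspects = [("aspect", a) for a in r.get("aspects", [])]
        for aspect in aspects:
            link(review, aspect, "review-aspect")
        for w in sorted(set(r["words"])):
            word = ("word", w)
            link(review, word, "review-word")
            for aspect in aspects:
                link(word, aspect, "word-aspect")
    return adj


# Toy records (hypothetical): one labeled clothing review, one unlabeled shoes review.
records = [
    {"customer": "c1", "seller": "s1", "product": "dress-1", "review": "r1",
     "words": ["length", "perfect", "bargain"], "aspects": ["price"]},
    {"customer": "c1", "seller": "s2", "product": "shoe-9", "review": "r2",
     "words": ["bargain", "comfortable"]},
]
graph = build_hetero_graph(records)
print(graph[("word", "bargain")]["review-word"])
```

In this toy graph the shared customer `c1` and the shared word `bargain` connect the unlabeled shoes review `r2` to the aspect `price` labeled in the clothing domain, which is the kind of cross-domain path the text describes.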
3.2. Traceable Heterogeneous Graph Representation Learning
To better learn the representation of the constructed e-commerce heterogeneous graph, we propose a Traceable Heterogeneous Graph Representation Learning (THGRL) model. The proposed model consists of three main components: a hierarchical random walk generator, a walker tracer capture mechanism, and a vertex-tracer representation co-learning procedure. The pseudocode for the overall algorithm is given in Algorithm 1.
Hierarchical Random Walk. Unlike text, a heterogeneous graph ($G$) has more structural and topological characteristics. The graph neighborhood of a vertex $v$, $N(v)$, can be defined in various ways, e.g., as the direct (one-hop) neighbors of $v$. It is critical to model the vertex neighborhood for graph representation learning. Unlike prior works for homogeneous graph mining, we employ a hierarchical random walk strategy for every vertex $v \in V$ to generate the walking path and vertex embedding. As Algorithm 1 shows, the key step of the hierarchical random walk is to sample a relation type $r$ from the relation type set $T_E$. Then, we use the transition distribution of $r$ to generate the next move on the graph. A similar hierarchical random walk strategy has proved an effective means to address random walk on the heterogeneous graph (Jiang et al., 2018b).
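The two-level sampling step can be sketched in Python as follows. This is a simplified illustration, not Algorithm 1 itself: it assumes the relation type is drawn uniformly from the types available at the current vertex and the next vertex is drawn uniformly under that relation, whereas the model's actual distributions are not specified here. The adjacency layout and the toy vertex names are hypothetical.

```python
import random


def hierarchical_random_walk(adj, start, walk_length, rng):
    """Walk on adj[vertex][relation_type] -> list of neighbouring vertexes."""
    path = [start]
    while len(path) < walk_length:
        options = adj.get(path[-1], {})
        relation_types = sorted(r for r, nbrs in options.items() if nbrs)
        if not relation_types:
            break  # dead end: the current vertex has no outgoing relation
        relation = rng.choice(relation_types)        # level 1: sample a relation type
        path.append(rng.choice(options[relation]))   # level 2: sample a neighbour under it
    return path


# Toy graph (hypothetical): an unlabeled review r2 reaches the aspect "price"
# through a word it shares with the labeled review r1.
toy_adj = {
    "review:r1": {"review-word": ["word:bargain"], "review-aspect": ["aspect:price"]},
    "review:r2": {"review-word": ["word:bargain"]},
    "word:bargain": {"review-word": ["review:r1", "review:r2"],
                     "word-aspect": ["aspect:price"]},
    "aspect:price": {"review-aspect": ["review:r1"], "word-aspect": ["word:bargain"]},
}
walk = hierarchical_random_walk(toy_adj, "review:r2", 8, random.Random(0))
print(walk)
```

Sampling the relation type first keeps a vertex with thousands of `review-word` edges from drowning out its few `review-aspect` edges, which is one plausible motivation for the hierarchical design.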
Walker Tracer Capturing Distribution Learning. Although the graph heterogeneity has the potential to enhance the random walk performance, the noisy text/behavior information (e.g., noisy vertexes and edges) may pollute the algorithm outcomes. In this work, we propose a novel set of global latent variables, “Walker Tracer,” to address this problem. Such variables can capture global random walk dependencies and eliminate the noise.
As Figure 1 (a) shows, in the generated vertex sequences, each vertex can be captured by the corresponding walker tracer(s). A similar method has been utilized to capture latent topics in a given text (Blei et al., 2003; Blei, 2012). We assume that related vertexes tend to appear in the same walking path, and the different appearance patterns of vertexes can be represented as different probability distributions. Such a vertex distribution is a (global) walker tracer. Given a path, the dominant tracer(s) have a higher chance to capture the informative vertex(es) for embedding. Vertex walking paths are then represented as mixtures over these latent tracers. In this study, the generative process for walking paths of length $L$ with $K$ tracers is defined as follows:
For each walking path $p$, draw a tracer mixture $\theta_p \sim \mathrm{Dirichlet}(\alpha)$:
  for each vertex $v_n$ in the walking path ($n = 1, \dots, L$):
    Probabilistically draw a specific walker tracer $z_n \sim \mathrm{Multinomial}(\theta_p)$
    Probabilistically draw a vertex $v_n$ from $P(v_n \mid z_n, \beta)$
Here, $\alpha$ is the parameter of the Dirichlet prior on the per-path tracer distributions, and $\beta$ ($\beta_{z,v} = P(v \mid z)$) is the vertex capturing distribution for each tracer. $\theta_p$ ($\theta_{p,z} = P(z \mid p)$) is the tracer mixture distribution for a walking path $p$.
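The generative process above can be simulated with a short Python sketch. It assumes the standard topic-model reading of the process (a Dirichlet draw per path, then a tracer and a vertex per position); the function name `sample_walking_path`, the toy distributions, and the use of normalized Gamma draws to obtain a Dirichlet sample are choices made here for illustration.

```python
import random


def sample_walking_path(alpha, beta, length, rng):
    """alpha: K Dirichlet parameters; beta: K dicts mapping vertex -> P(vertex | tracer)."""
    # A Dirichlet sample is a vector of independent Gamma draws, normalized to sum to 1.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]  # tracer mixture of this path

    tracers, path = [], []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # draw a walker tracer
        vertexes = list(beta[z])
        v = rng.choices(vertexes, weights=[beta[z][u] for u in vertexes])[0]  # draw a vertex
        tracers.append(z)
        path.append(v)
    return theta, tracers, path


# Two toy tracers (hypothetical): one concentrated on size vertexes, one on price vertexes.
beta = [
    {"word:XL": 0.5, "word:270mm": 0.2, "aspect:size": 0.3},
    {"word:bargain": 0.6, "aspect:price": 0.4},
]
theta, tracers, path = sample_walking_path([0.5, 0.5], beta, 10, random.Random(1))
print([round(t, 3) for t in theta], path)
```

Inference runs this process in reverse: given observed walking paths, it estimates which tracer most plausibly produced each vertex.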
As mentioned in the introduction, the classical random walk process may bring unexpected noisy information because of the graph heterogeneity. For instance, in a path generated by random walk, the aspect-related vertexes may lie far apart, while the immediate context of a vertex can be irrelevant because of the randomness of heterogeneous path generation. This phenomenon could threaten cross-domain aspect category detection, and existing graph embedding methods (Perozzi et al., 2014; Grover and Leskovec, 2016; Dong et al., 2017) can hardly cope with this problem because they can only characterize local random walk information (in a fixed window). In contrast, the proposed global walker tracer is able to estimate the structure of a walking path and capture the long-range informative vertexes.
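The probability of vertex-tracer capturing can be calculated as:
$$P(\theta_p, \mathbf{z}, \mathbf{v} \mid \alpha, \beta) = P(\theta_p \mid \alpha) \prod_{n=1}^{L} P(z_n \mid \theta_p)\, P(v_n \mid z_n, \beta)$$
The learning of the distributions (i.e., $\theta$ and $\beta$) is a problem of Bayesian inference. Variational inference (Blei et al., 2003) or Gibbs sampling (Porteous et al., 2008) can be used to address this problem. With the learned distributions, each vertex $v$ is captured by a latent tracer $z$, and the capture probability can be calculated as:
$$P(z \mid v, p) \propto P(v \mid z, \beta)\, P(z \mid \theta_p) = \beta_{z,v}\, \theta_{p,z}$$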
Vertex-Tracer Representation Co-Learning. THGRL obtains the representations of vertexes (local information) and latent walker tracers (global information) by mapping them into a low-dimensional space $\mathbb{R}^d$, $d \ll |V|$. The learned representations are able to preserve the topology and semantic information in $G$. Motivated by (Liu et al., 2015), we propose to learn representations for vertexes and tracers separately and simultaneously. For each target vertex $v$ with its corresponding tracer $z_v$, the objective of THGRL is defined to maximize the following log probability:
$$\mathcal{L} = \sum_{v \in V} \sum_{t \in T_V} \sum_{c \in N_t(v)} \Big( \log P(c \mid v; \Phi) + \log P(c \mid z_v; \Phi) \Big)$$
We use $\Phi$ as the mapping function from multi-typed vertexes and walker tracers to feature representations in $\mathbb{R}^d$. Here, $d$ is a parameter specifying the number of dimensions. $N_t(v)$ denotes $v$’s network neighborhood (context) with the $t$-th type of vertexes. As Figure 1 (b) shows, the feature learning method is an upgraded version of the skip-gram architecture, which was originally developed for natural language processing and word embedding (Mikolov et al., 2013a, b; Bengio et al., 2013). Compared with merely using the target vertex to predict context vertexes in the original skip-gram model, the proposed approach also employs the corresponding tracer of the target vertex. In other words, a tracer is used as a ‘pseudo vertex’ for collective global information representation. So, in the THGRL framework, the vertex’s context will encapsulate both local (vertex) and global (tracer) information, which can be critical for cross-domain knowledge transfer and graphical aspect augmentation.
$P(c \mid v; \Phi)$ defines the conditional probability of having a context vertex $c$ given the vertex $v$’s representation, which is commonly modeled as a softmax function:
$$P(c \mid v; \Phi) = \frac{\exp\big(\Phi(c) \cdot \Phi(v)\big)}{\sum_{u \in V} \exp\big(\Phi(u) \cdot \Phi(v)\big)}$$
Similarly, given the representation of vertex $v$’s corresponding tracer $z_v$, the conditional probability of having a context vertex $c$ is modeled as:
$$P(c \mid z_v; \Phi) = \frac{\exp\big(\Phi(c) \cdot \Phi(z_v)\big)}{\sum_{u \in V} \exp\big(\Phi(u) \cdot \Phi(z_v)\big)}$$
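The two softmax terms can be made concrete with a small Python sketch that evaluates the contribution of a single (vertex, context) pair. It assumes a single embedding table shared by target and context roles and tiny hand-picked vectors; both are simplifications for illustration, and the function names are hypothetical.

```python
import math


def log_softmax_prob(context, source_vec, emb):
    """log P(context | source): softmax of dot products over all vertexes in emb."""
    scores = {u: sum(a * b for a, b in zip(vec, source_vec)) for u, vec in emb.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[context] - log_norm


def pair_objective(context, vertex, tracer, vertex_emb, tracer_emb):
    """Contribution of one pair: log P(context | vertex) + log P(context | tracer)."""
    return (log_softmax_prob(context, vertex_emb[vertex], vertex_emb)
            + log_softmax_prob(context, tracer_emb[tracer], vertex_emb))


# Toy embeddings (hypothetical values, 2 dimensions).
vertex_emb = {"review:r1": [0.2, 0.1], "word:bargain": [0.3, -0.1], "aspect:price": [0.4, 0.0]}
tracer_emb = {"tracer-0": [0.5, 0.2]}
value = pair_objective("aspect:price", "review:r1", "tracer-0", vertex_emb, tracer_emb)
print(value)
```

The tracer term is what distinguishes this from plain skip-gram: the same context vertex is predicted once from the local vertex and once from the global tracer assigned to it.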
Stochastic gradient ascent is used for optimizing the model parameters. Negative sampling (Mikolov et al., 2013b) is applied for optimization efficiency. The size of the model parameters scales with the embedding dimension. The computational complexity has two terms: the first is the cost of walker tracer capturing distribution learning, which depends on the number of iterations and the number of walking paths; the second is the cost of vertex-tracer representation co-learning, which depends on the context window size.
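For reference, the standard negative-sampling surrogate of Mikolov et al. (2013b), written for the vertex term, is shown below. The symbol $\Phi$ for the embedding mapping, the number of negative samples $M$, and the noise distribution $P_n$ are notation assumed here; the values used in the experiments are not stated in this section.

```latex
\log P(c \mid v)
\;\longrightarrow\;
\log \sigma\big(\Phi(c) \cdot \Phi(v)\big)
+ \sum_{m=1}^{M} \mathbb{E}_{u_m \sim P_n(u)}
  \Big[ \log \sigma\big(-\Phi(u_m) \cdot \Phi(v)\big) \Big],
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

The tracer term is treated in the same way with $\Phi(z_v)$ in place of $\Phi(v)$, so each update touches only $M + 1$ vertex embeddings instead of the full softmax normalizer over all vertexes.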
Vertex-Tracer Representation Integration. As aforementioned, a walker tracer is defined as a vertex distribution, which indicates the probability of a vertex being captured by this tracer. We select the most probable tracer of vertex $v$ to generate the integrated vertex-tracer representation:
$$\mathbf{v}^{z^*}, \qquad z^* = \arg\max_{z} P(z \mid v)$$
where $\mathbf{v}^{z}$ is the combined representation of vertex $v$ under tracer $z$, obtained by concatenating the embeddings of $v$ and $z$, i.e., $\mathbf{v}^{z} = \Phi(v) \oplus \Phi(z)$, where $\oplus$ is the concatenation operation.
Finally, for the aspect category detection task, given a review instance, the vertex-tracer representations of all review words are further down-sampled with a global average-pooling operation. The learned review representation can be used as input for the classification function $f_t$ for training and testing. Meanwhile, as stated above, the proposed method focuses on cross-domain aspect category transfer via enhanced heterogeneous graph representation. Without loss of generality, $f_t$ can be a multi-label classification model with an arbitrary structure, e.g., a kernel-based model like SVM (Steinwart and Christmann, 2008) (for limited training samples) or a neural network based model like BiLSTM-Attention (Yang et al., 2016) (for sufficient training samples).
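The integration and pooling steps can be sketched in Python as follows. The capture probabilities are taken as given inputs, and the function name, the dictionary layout, and the toy numbers are assumptions made for illustration; words without an embedding are simply skipped here, which is a choice of this sketch.

```python
def review_representation(words, capture_prob, vertex_emb, tracer_emb):
    """Concatenate each word embedding with its most probable tracer, then average-pool."""
    combined = []
    for w in words:
        if w not in vertex_emb:
            continue  # assumption: words without an embedding are skipped
        best_tracer = max(capture_prob[w], key=capture_prob[w].get)
        combined.append(vertex_emb[w] + tracer_emb[best_tracer])  # list concatenation
    if not combined:
        raise ValueError("no word of the review has an embedding")
    return [sum(column) / len(combined) for column in zip(*combined)]


# Toy inputs (hypothetical): 2-dimensional word embeddings, 1-dimensional tracer embeddings.
vertex_emb = {"length": [1.0, 0.0], "bargain": [0.0, 1.0]}
tracer_emb = {0: [2.0], 1: [4.0]}
capture_prob = {"length": {0: 0.9, 1: 0.1}, "bargain": {0: 0.2, 1: 0.8}}

rep = review_representation(["length", "bargain", "unseen"], capture_prob, vertex_emb, tracer_emb)
print(rep)  # [0.5, 0.5, 3.0]
```

The pooled vector has a fixed size regardless of review length, so it can be passed directly to a classifier such as an SVM.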
With THGRL, objects from different domains can be mapped into the same feature space (addressing the “various feature spaces” problem), and the global semantic/aspect dependencies can be characterized for aspect category knowledge transfer (reducing the negative effect of the “different data distributions” and “diverse output spaces” problems).
4.1. Dataset and Experiment Setting
Dataset. In Table 2, we summarize statistics of three e-commerce datasets (product domains) from Taobao (an online consumer-to-consumer platform of Alibaba). For a specific domain, we collected the customers’ purchase behaviors. For each purchase behavior record, the product ID, the seller ID (online store), and the customer ID were collected. Meanwhile, the reviews of the purchased product and the words contained in the target reviews were also collected. The related aspect categories of reviews were manually labeled by a third party. For this study, we validated the proposed algorithm in four cross-domain aspect category detection tasks: clothing→shoes, shoes→clothing, clothing→bags, and bags→clothing.
Footnote 2. To the best of our knowledge, behavior information, e.g., customer purchase and seller sales information, is not publicly available, and none of the public datasets is associated with this kind of information. To address this problem, we collected new datasets (https://github.com/lzswangjian/THGRL) from Taobao.
|Cross-Domain Aspect Category Detection Tasks (Graph)|
* There are no shared reviews and products.
Baselines and Comparison Groups. We chose two groups of baseline algorithms, from text and graph viewpoints, to comprehensively evaluate the performance of the proposed method.
Textual Content Based Baseline Group: this group of baselines only utilized textual information for aspect category detection tasks.
Footnote 3. In the textual content based group, the SVM model used TFIDF feature vectors, and the neural network based models used 300-dimensional dense feature vectors pre-trained by FastText (Grave et al., 2017) on an enormous corpus provided by Taobao.
1. Support Vector Machine (Steinwart and Christmann, 2008): We followed (Ganu et al., 2009) to train a “one vs. all” classifier on target domain reviews. A similar approach achieved the top ranking in the aspect category detection subtask of SemEval-2014 (Kiritchenko et al., 2014). This baseline was denoted as SVM.
2. Convolutional Neural Network (Kim, 2014): We trained the convolutional neural networks (CNN) on top of the pre-trained word vectors for aspect category detection task. (Ruder et al., 2016) utilized this approach for aspect-level sentiment analysis. This baseline was denoted as CNN.
3. Bidirectional Recurrent Neural Network (BiLSTM) with Attention Mechanism (Liu and Zhang, 2017): We used a bidirectional LSTM to represent the word sequence in a review then calculated the weighted values over each word in reviews by an attention model. This baseline was denoted as BiLSTM-Attention.
4. Transformer (Vaswani et al., 2017): We used an encoder structure of Transformer model for learning the review textual representation. As a state-of-the-art model to encode deep semantic information using self-attention mechanism, a similar process has been utilized in several research works for different downstream tasks (Young et al., 2018). This baseline was denoted as Transformer.
5. Semi-Supervised Learning: First, an SVM model was trained on the limited “real” training data. Second, 1,000 unlabeled reviews were selected for pseudo-labeling by the previously trained model. Third, the pseudo-labeled data was added into the training set for model re-training. We repeated this process until the best-performing model was found. This baseline only used target domain reviews for training, denoted as Semi-Supervised.
6. Domain Adaptation (Zellinger et al., 2017): This recent model is a state-of-the-art domain adaptation method for sentiment classification, which learns domain-invariant representations with neural networks. This baseline requires training review data from both source and target domains, denoted as Domain Adaptation. We tuned its weighting parameter and picked the best-performing parameter setting for the experiment.
Graph Embedding Based Baseline Group: This group of baselines generated the graphical embeddings based on the constructed heterogeneous graphs, which integrated multiple types of information (including textual information and user behavior information).
Footnote 4. All graph based baseline algorithms employed SVM as the classification model for aspect category detection tasks. The input representation was obtained by concatenating the TFIDF feature vector and the graph embedding feature vector.
7. DeepWalk (Perozzi et al., 2014): We used a DeepWalk algorithm to learn the graph embeddings, denoted as DeepWalk.
8. LINE (Tang et al., 2015): This model was aimed at preserving first-order and second-order proximity in concatenated embeddings, denoted as LINE.
9. Node2vec (Grover and Leskovec, 2016): We used the node2vec algorithm to learn graph embeddings via second-order random walks in the graph, denoted as Node2vec. We tuned the return parameter $p$ and the in-out parameter $q$ with a grid search and picked the best-performing parameter setting for the experiment, as suggested by (Grover and Leskovec, 2016).
10. Metapath2vec++ (Dong et al., 2017): This model was originally designed for heterogeneous graphs. It learns heterogeneous graph embeddings via metapath based random walk and heterogeneous negative sampling in the graph. Metapath2vec++ requires a human-defined metapath scheme to guide random walks. We tried two different metapaths for this experiment: (1) a metapath associated solely with textual information, denoted as Metapath2Vec++(T), and (2) a metapath related not only to textual information but also to behavior information, denoted as Metapath2Vec++(B).
11. Graph Convolutional Networks (Kipf and Welling, 2017): We trained convolutional neural networks on an adjacency matrix of graph and a topological feature matrix of vertexes in a vertex classification task (semi-supervised task), as suggested by (Kipf and Welling, 2017). This more recent baseline was denoted as GCN.
Please note that, because DeepWalk, LINE, Node2vec, and GCN were originally designed for homogeneous graphs, for a fair comparison, we had to construct a homogeneous graph based on the heterogeneous one. We first integrated all relations between two vertexes into one edge, then estimated the edge weight (transition probability) by summing over all integrated relations.
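The collapse into a homogeneous graph can be sketched in Python as follows. The sketch assumes that each typed relation carries a numeric weight and that summed weights are normalized per source vertex to obtain transition probabilities; the normalization step, the edge tuple layout, and the function name are assumptions rather than details given in the text.

```python
from collections import defaultdict


def collapse_to_homogeneous(typed_edges):
    """typed_edges: iterable of (source, target, relation_type, weight) tuples."""
    weight = defaultdict(float)
    for u, v, _relation, w in typed_edges:
        weight[(u, v)] += w  # merge all relations between u and v into one edge

    out_total = defaultdict(float)
    for (u, _v), w in weight.items():
        out_total[u] += w

    # Normalize per source vertex so outgoing weights form a transition distribution.
    return {(u, v): w / out_total[u] for (u, v), w in weight.items()}


# Toy typed edges (hypothetical): the customer both purchased from and reviewed seller s1.
typed_edges = [
    ("customer:c1", "seller:s1", "purchase", 1.0),
    ("customer:c1", "seller:s1", "review", 1.0),
    ("customer:c1", "seller:s2", "purchase", 1.0),
]
transition = collapse_to_homogeneous(typed_edges)
print(transition)
```

After the collapse the relation types are no longer visible to the walker, which is exactly the information the homogeneous baselines cannot exploit.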
Comparison Groups: We compared the performances of several variants of the proposed method in order to highlight our technical contributions.
THGRL: The default setting of the proposed THGRL model, which used the hierarchical random walk generator for walking path generation, the integrated vertex-tracer representation for graph embedding, and an SVM model for the aspect category detection task.
THGRL (without walker tracer): We removed the walker tracer representation information from the vertex-tracer representation (only the vertex representation was left).
THGRL (ordinary random walk): We replaced the hierarchical random walk generator with an ordinary random walk generator.
THGRL (BiLSTM-Attention classifier): We replaced the SVM classification model with a BiLSTM-Attention based classification model.
|Table 3||clothing→shoes (Train:70; Test:8397)||shoes→clothing (Train:100; Test:6972)||clothing→bags (Train:90; Test:5523)||bags→clothing (Train:100; Test:6972)|
|Text||Textual Content Based Baseline Group||Micro-F1||Macro-F1||Micro-F1||Macro-F1||Micro-F1||Macro-F1||Micro-F1||Macro-F1|
|SVM (Ganu et al., 2009)||0.4255||0.2938||0.4736||0.3636||0.2045||0.1982||0.5210||0.3700|
|CNN (Kim, 2014)||0.3056||0.1457||0.2588||0.1529||0.2463||0.1529||0.2830||0.1632|
|BiLSTM-Attention (Liu and Zhang, 2017)||0.4239||0.2520||0.4357||0.2601||0.3349||0.2652||0.4094||0.2464|
|Transformer (Vaswani et al., 2017)||0.3997||0.2106||0.3447||0.2252||0.3183||0.2699||0.3456||0.2113|
|Domain Adaptation (Zellinger et al., 2017)||0.4700||0.3298||0.5525||0.4429||0.3412||0.2931||0.5043||0.4252|
|Graph||Graph Embedding Based Baseline Group||Micro-F1||Macro-F1||Micro-F1||Macro-F1||Micro-F1||Macro-F1||Micro-F1||Macro-F1|
|DeepWalk (Perozzi et al., 2014)||0.4992||0.3858||0.6295||0.5592||0.3730||0.3599||0.6332||0.5291|
|LINE (Tang et al., 2015)||0.5057||0.3950||0.6489||0.5646||0.3848||0.3603||0.6293||0.5223|
|Node2vec (Grover and Leskovec, 2016)||0.4858||0.3658||0.6308||0.5701||0.3787||0.3678||0.6326||0.5498|
|Metapath2Vec++(T) (Dong et al., 2017)||0.4574||0.3854||0.6702||0.5477||0.3573||0.3401||0.5360||0.4387|
|Metapath2Vec++(B) (Dong et al., 2017)||0.4514||0.3302||0.5444||0.4199||0.2360||0.2343||0.5678||0.4133|
|GCN (Kipf and Welling, 2017)||0.4424||0.3242||0.5027||0.3840||0.2116||0.2176||0.5411||0.3841|
|THGRL (Proposed Method)||0.5376*||0.4381*||0.6982*||0.6122*||0.4244*||0.3951*||0.6579*||0.5920*|
Unless otherwise noted, all graph-based algorithms (including the baseline group and the comparison group) used SVM as the classification model, and the input representation was obtained by concatenating the TFIDF feature vector and the graph embedding feature vector.
Training and Testing Set. An additional source domain may bring many more training samples for shared aspect categories. For a fair comparison and to better address the research questions, we only detected the target domain-specific aspect categories. From all labeled review instances in the target domain, a limited number of training samples were randomly selected (cold start). Each target domain-specific aspect category had only 10 labeled review samples for training, while the rest of the reviews were used as the testing set (all aspect-category relations of the testing set were removed for a fair comparison). This experimental setting can be very challenging. For instance, in the clothing→shoes task, the target domain featured seven specific aspects, and there were only 70 training instances (the number of testing instances was 8,397). The classification models (including baselines and the proposed model) were trained on labeled data from the target domain. The only exception was the baseline model “Domain Adaptation” (Zellinger et al., 2017), which required training data from both domains. For evaluation, the different models were evaluated by two commonly used measures, Micro-F1 and Macro-F1.
Experimental Set-up. For the proposed THGRL model, we utilized the following setting: (1) the number of walks per vertex: 10; (2) the walk length: 80; (3) the embedding dimension: 128; (4) the context window size: 10. For experiment fairness, all the random walk based embedding baselines shared the same parameters. Please note that we did not tune those parameters. Most graph-embedding methods reported the above parameter settings in their original papers (Perozzi et al., 2014; Grover and Leskovec, 2016). The number of walker tracers was 100.
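The two evaluation measures can be computed for multi-label predictions as in the Python sketch below. It assumes labels are given as 0/1 vectors and that an F1 with no true or predicted positives is counted as 0; that zero-division convention and the function names are choices of this sketch, not details stated in the text.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all (an assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 label vectors, one per review."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per aspect category
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    macro = sum(f1(*c) for c in counts) / n_labels          # average of per-category F1
    micro = f1(*[sum(c[i] for c in counts) for i in range(3)])  # F1 of pooled counts
    return micro, macro


# Toy example (hypothetical): four reviews, two aspect categories.
y_true = [[1, 0], [1, 0], [0, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 4), round(macro, 4))  # 0.5714 0.3333
```

Macro-F1 weights every aspect category equally, so a rare category that is never predicted (the second one here) pulls it well below Micro-F1.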
4.2. Experiment Result and Analysis
Overall Results. The cross-domain aspect category detection performance results of different models are reported in Table 3. Based on the experiment results, we have the following observations:
THGRL vs. Baselines. The proposed method outperformed the other baseline models for all evaluation metrics in all tasks. For instance, in terms of Micro-F1, THGRL outperformed the best-performing baseline by 10.3% for the “clothing→bags” task. With respect to Macro-F1, THGRL outperformed the best-performing baseline by 11.0% for the “clothing→shoes” task.
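The Micro-F1 figure can be checked against Table 3 under two assumptions: the improvement is relative rather than absolute, and the column with (Train:90; Test:5523) is the clothing→bags task, where the strongest baseline among the rows shown is LINE at 0.3848.

```latex
\frac{\text{Micro-F1}_{\text{THGRL}} - \text{Micro-F1}_{\text{LINE}}}{\text{Micro-F1}_{\text{LINE}}}
= \frac{0.4244 - 0.3848}{0.3848}
= \frac{0.0396}{0.3848}
\approx 0.103 = 10.3\%
```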
Graph vs. Text. The baselines (e.g., SVM (Ganu et al., 2009), CNN (Kim, 2014), Transformer (Vaswani et al., 2017), and BiLSTM-Attention (Liu and Zhang, 2017)) solely relied on textual information and didn’t perform well. Correspondingly, most graph-embedding approaches with different mechanisms can improve task performance. This observation proves the hypothesis that user behavior information can provide important potentials for cross-domain aspect category transfer and detection.
Deep vs. Simple. Unfortunately, the deep neural network approach in this experiment cannot outperform the simple classification models, mainly because of the training data sparsity. As section 1 mentioned, the goal of this work is to address the training data sparseness problem in the target domain (e.g., 70 training instances and 8,397 testing instances for task). Compared with the simple models, the deep learning family needs more training data for optimization.
Different Mining Methods on Textual Information. By introducing more pseudo-labeling data, the semi-supervised approach could somehow improve the detection performance. Meanwhile, domain adaptation model (Zellinger et al., 2017) was proposed to learn the invariant knowledge from both domains, which could address the data distribution difference. However, this model cannot efficiently cope with the feature and aspect (label) difference of the source and target domains. The performances of these two baselines were still unsatisfactory. This phenomenon shows, when the labeled samples are quite limited, the marginal effect that can be obtained by mining the textual information is also limited.
Cross Domain vs. Single Domain. The experimental results of the domain adaptation model and all graph embedding-based models indicate that an additional related source domain can be useful for the target domain’s task. More specifically, most graph embedding-based models (including the baselines and the comparison group) outperformed the domain adaptation model, which uses only textual information. This observation further supports the usefulness of user behavior information.
Metapath2Vec++. Although designed for heterogeneous graph embedding, the two Metapath2Vec++ (Dong et al., 2017) baselines did not perform significantly differently from the homogeneous graph embedding models. A possible explanation is that, in this task, no single metapath can cover the aspect detection requirement. In addition, a metapath-based random walk can be too strict to explore potentially useful neighbourhoods for graph representation learning.
GCN. GCN (Kipf and Welling, 2017) did not perform well in the experiment, and there may be several reasons. First, GCN is designed for homogeneous graphs, whereas this study uses heterogeneous graphs. Second, the semi-supervised task of the original GCN model is vertex classification, which is inconsistent with the final task of this study.
Components of THGRL. We evaluated the components of the proposed method as follows. (1) When we removed the tracer representation information from the vertex-tracer representation, the aspect detection performance decreased in all tasks. This result shows that the proposed “walker tracer” does capture useful global semantic information, which plays an important role in eliminating noise in graph random walks and enhances the task performance. (2) When we replaced the hierarchical random walk generator with an ordinary random walk generator, the performance also declined in all tasks. The hierarchical random walk therefore contributes significantly to the heterogeneous graph-based random walk and to the accuracy of the graph representation. (3) Because of the sparseness of the training data, a relatively simple classification model can outperform the sophisticated neural network-based models.
Comparison of Different Feature Vectors with Various Classification Models. To gain a deeper understanding of the representation capacity of the proposed THGRL method, we compared the aspect category detection performance of different classification models with different features. As Figure 2 shows: (1) Compared to using only the textual information (TFIDF feature vectors for SVM and pretrained dense word feature vectors for the neural network-based models), using the THGRL embedding significantly improves performance in all tasks. For instance, the CNN classification model achieves an improvement of up to 115% (from 0.2588 to 0.5554) in the “shoes clothing” task, while the SVM classification model achieves an improvement of up to 107% (from 0.2046 to 0.4244) in the “clothing bags” task. (2) The SVM model with THGRL features achieved the best performance in all tasks. This observation once again confirms that a relatively simple model can outperform complicated models when training samples are limited. (3) Overall, the CNN model achieves the greatest improvement (an average increase of 79.6%). The SVM model comes second (an average increase of 51.9%). The improvement of BiLSTM-Attention is relatively small (an average increase of 18.9%).
The above comparison results also indicate that user behavior information is vital in the aspect category detection task, and that the proposed method can learn a better representation to effectively accomplish the cross-domain transfer and information augmentation tasks.
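The relative improvements quoted for Figure 2 follow from the usual formula (new − old) / old. As a quick check, the short sketch below recomputes the two examples from the scores given in the text; the helper name is hypothetical.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Scores quoted in the text for text-only features vs. THGRL features.
cnn = relative_improvement(0.2588, 0.5554)
svm = relative_improvement(0.2046, 0.4244)

print(f"CNN: {cnn:.1f}%")  # rounds to the reported 115%
print(f"SVM: {svm:.1f}%")  # rounds to the reported 107%
```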
Trends under Increasing Training Samples. To further validate the performance of the proposed method, we compared the aspect category detection performance while continually adding training samples in two cross-domain tasks. As Figure 3 shows: (1) As more training samples are added, the performance of all models improves. The proposed method is consistently better than SVM and BiLSTM-Attention, which employ only textual features. (2) As the number of training samples increases, the gap between THGRL and SVM narrows. (3) The performance of BiLSTM-Attention is not very stable. For instance, in the “clothing shoes” task, it performs worst in most cases, while in the “clothing bags” task, the growth rate of this model is relatively small.
Parameter Sensitivity Analysis. We also conducted a sensitivity analysis of THGRL by varying the number of tracers and the representation dimension. Figure 4 depicts their impact on the aspect category detection performance. In the “clothing shoes” task, the proposed method was not very sensitive to these two hyper-parameters, and 100 tracers with 128 dimensions was the best-performing setting. In the “clothing bags” task, 150 tracers with 128 dimensions was the best-performing setting, and changes in the hyper-parameters had a relatively large impact on task performance. This may be caused by the different characteristics of the cross-domain tasks. As shown in Table 2, the “clothing shoes” task has more shared customers and sellers (bridges on the graph); hence, graph representation learning could be easier. Furthermore, the heterogeneous graph constructed for “clothing bags” has more vertexes and edges, which may require more tracers to achieve optimal performance.
Embedding Visualization. We used a heat map to visualize the traceable heterogeneous graph representation of four tracers and their associated vertexes in the experiment. In this heat map, each row is a representation, and colors depict data values. As Figure 5 shows, each walker tracer successfully captures a group of closely related heterogeneous vertexes (e.g., aspects, words, products, customers). The vertexes in the same group tend to deliver similar semantic knowledge. For instance, the vertexes with a high capturing probability for Tracer#1 are all closely related to service, e.g., the aspect vertex “seller’s service” and the word vertexes “custom-service” and “thoughtful.” Meanwhile, different types of vertexes can be captured simultaneously. For instance, the vertexes with a high capturing probability for Tracer#4 are of very diverse types, including product, word, and customer.
From the similarity viewpoint, the representations (color patterns) differ markedly between groups, whereas within the same group they are very similar.
These observations indicate that: (1) The e-commerce behavior information (among customers, sellers, and products) follows certain patterns, and these behavior patterns have the potential to mirror the aspect information for review mining; (2) The proposed traceable heterogeneous graph representation learning approach can successfully capture this kind of global information, which can be used for aspect augmentation and aspect knowledge transfer.
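The visual impression from the heat map (similar color patterns within a tracer group, different patterns between groups) can also be expressed numerically. The sketch below is one possible way to do so, comparing the mean cosine similarity of vertex pairs within the same group against pairs from different groups. The vectors and group names are made up for illustration and are not taken from Figure 5; the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def within_between(groups: Dict[str, List[Vector]]) -> Tuple[float, float]:
    """Mean cosine similarity of same-group pairs and of cross-group pairs."""
    tagged = [(name, vec) for name, vecs in groups.items() for vec in vecs]
    within, between = [], []
    for (g1, v1), (g2, v2) in combinations(tagged, 2):
        (within if g1 == g2 else between).append(cosine(v1, v2))
    return mean(within), mean(between)


# Made-up 3-dimensional embeddings for vertexes captured by two tracers.
toy_groups = {
    "tracer_1": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "tracer_4": [[0.0, 0.1, 0.9], [0.1, 0.2, 0.8]],
}
within_sim, between_sim = within_between(toy_groups)
print(f"within-group: {within_sim:.3f}, between-group: {between_sim:.3f}")
```

A large gap between the two averages would correspond to the block structure seen in the heat map.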
5. Related Work
Aspect-level review analysis. Aspect-level review analysis is a fine-grained opinion mining task. Identifying the aspect category helps to obtain target-dependent sentiment and contributes to aspect-specific opinion summarization (Zhou et al., 2015). Prior works mainly focused on aspect extraction (Hu and Liu, 2004; Titov and McDonald, 2008). More recently, the International Workshop on Semantic Evaluation (SemEval) developed a series of tasks to facilitate and encourage research in aspect-based sentiment analysis and related fields (Maria et al., 2014; Pontiki et al., 2016). Aspect category detection, which aims to identify the aspects expressed in a given review, is a fundamental task for aspect-level sentiment analysis (Tang et al., 2016). Ganu et al. (Ganu et al., 2009) used SVM to train one-vs-all classifiers on restaurant review datasets for aspect category detection, with TFIDF vectors of stemmed words as features. A similar algorithm was applied in (Kiritchenko et al., 2014) with a Yelp word-aspect association lexicon to boost the performance. McAuley et al. (McAuley et al., 2012) proposed a discriminative model to predict product aspects. With the rise of deep learning, recent works began to adopt neural network structures. (Ruder et al., 2016) used continuous word representations and a convolutional neural network (CNN) for review representation learning, while (Nguyen and Shirai, 2015) used a Recursive Neural Network (RNN). Bi-LSTM with an attention mechanism (Yang et al., 2016) was applied in (Liu and Zhang, 2017) for aspect-level sentiment analysis. Although the existing methods have achieved promising performance, they all rely on textual information for aspect-level review analysis. Unlike existing works, this study investigates the novel and important potential of user behavior information for aspect category detection.
Transfer learning. It is often expensive and time-consuming to obtain enough labeled reviews for aspect category detection tasks. Domain adaptation, a.k.a. homogeneous transfer learning (Daume III and Marcu, 2006; Weiss et al., 2016), was proposed for training and testing models on different domain distributions with the same feature and output spaces. For instance, Glorot et al. (Glorot et al., 2011) proposed a deep learning model for large-scale cross-domain sentiment polarity classification. Zellinger et al. (Zellinger et al., 2017) learned domain-invariant representations in the context of domain adaptation with neural networks. However, such methods can hardly be employed for the cross-domain aspect category detection task, because different domains have different feature spaces, data distributions, and aspect spaces.
Heterogeneous transfer learning (Weiss et al., 2016) was proposed for cases in which the feature spaces or label spaces are not equivalent. However, according to the transfer learning survey (Day and Khoshgoftaar, 2017), few existing methods address the issue of differing label spaces. Furthermore, the existing methods that can directly address differing label spaces usually impose additional restrictions. For instance, (Feuz and Cook, 2015) required the construction of meta-features, and (Moon and Carbonell, 2016) required the output label to be a pre-trained word embedding. Therefore, the existing heterogeneous transfer learning methods cannot be directly applied to the proposed cross-domain aspect category transfer and detection problem.
Graph embedding. Graph embedding algorithms, also known as network representation learning models, aim to learn low-dimensional feature representations of the nodes in a network. Although the techniques utilized in the models differ, most existing graph embedding models focus on local graph structure representation: DeepWalk (Perozzi et al., 2014) and Node2vec (Grover and Leskovec, 2016) consider a fixed-size context window over node sequences generated by random walks; LINE (Tang et al., 2015) models first- and second-order graph neighbourhoods; GCN (Kipf and Welling, 2017) uses a convolutional operation to capture local adjacency information. Moreover, the above algorithms were all designed for homogeneous graphs and may experience problems when applied to a heterogeneous graph. Metapath2vec++ (Dong et al., 2017) designed a global random walk pattern on heterogeneous graphs to enhance the representation performance. However, this method relies on human-defined rules, which can be time-consuming to build, incomplete, and biased.
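To make the contrast between these walk strategies concrete, the following simplified sketch shows a uniform random walk with fixed-size context windows (the local view taken by DeepWalk-style models) next to a walk constrained by a human-defined metapath. It illustrates the general ideas only and is not the implementation of any cited model; the toy graph, the `type:id` vertex naming, and all function names are assumptions made for this example.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy heterogeneous graph: vertex -> neighbours.
# The vertex type is encoded as the prefix before the colon.
GRAPH: Dict[str, List[str]] = {
    "customer:c1": ["product:p1", "product:p2"],
    "customer:c2": ["product:p2"],
    "product:p1": ["customer:c1", "seller:s1"],
    "product:p2": ["customer:c1", "customer:c2", "seller:s1"],
    "seller:s1": ["product:p1", "product:p2"],
}


def vtype(vertex: str) -> str:
    return vertex.split(":")[0]


def uniform_walk(graph, start: str, length: int, rng: random.Random) -> List[str]:
    """Type-agnostic walk: every neighbour is equally likely at each step."""
    walk = [start]
    while len(walk) < length and graph[walk[-1]]:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk


def metapath_walk(graph, start: str, length: int,
                  metapath: List[str], rng: random.Random) -> List[str]:
    """Walk that only follows the cyclic type pattern given by `metapath`."""
    walk = [start]
    while len(walk) < length:
        wanted = metapath[len(walk) % len(metapath)]
        candidates = [n for n in graph[walk[-1]] if vtype(n) == wanted]
        if not candidates:  # the pattern cannot be continued from here
            break
        walk.append(rng.choice(candidates))
    return walk


def window_pairs(walk: List[str], window: int) -> List[Tuple[str, str]]:
    """(center, context) pairs within a fixed-size window, skip-gram style."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


rng = random.Random(0)
print(uniform_walk(GRAPH, "customer:c1", 6, rng))
print(metapath_walk(GRAPH, "customer:c1", 6, ["customer", "product"], rng))
```

In this toy graph, a walk following the customer-product pattern never reaches the seller vertex, which illustrates how a fixed metapath can exclude potentially useful neighbourhoods.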
Unlike prior studies, we utilized e-commerce user behavior information to bridge the gap between the source and target domains. The proposed THGRL method enables graph-based aspect information transfer, which not only projects different domains’ feature spaces into a common one but also allows the data distributions and output spaces to remain different. Meanwhile, THGRL is fully automatic and requires no handcrafted features. To the best of our knowledge, few existing studies have investigated user behavior information with a graph-based approach for the cross-domain aspect category detection problem, and the proposed heterogeneous graph mining algorithm is innovative.
6. Conclusion
In this paper, we propose a traceable heterogeneous graph representation learning model (THGRL) for cross-domain aspect category transfer and detection. Unlike most of the prior studies, which only employ text information, THGRL leverages user behavior information, which offers important potential to address the cold-start problem. The proposed model can project heterogeneous objects (aspect, review, word, product, customer, and seller) from different domains into a joint embedding space. An innovative latent variable “Walker Tracer” is introduced to characterize the global semantic/aspect dependencies and capture the informative vertexes on the random walk paths.
The performance of the proposed method is comprehensively evaluated on three real-world datasets with four challenging tasks. The experimental results show that the proposed model significantly outperforms a series of state-of-the-art methods. Meanwhile, the case study shows empirically that the proposed model can successfully capture the global semantic/aspect dependencies (coherency patterns) in the heterogeneous graph, which is essential for cross-domain aspect knowledge transfer to overcome the problems of different data distributions and diverse output spaces.
The proposed THGRL algorithm is an unsupervised graph embedding model for heterogeneous graph mining. Theoretically, other tasks can adopt this algorithm as long as the dataset can be represented in a heterogeneous graph form.
In the future, we will explore the proposed method on other heterogeneous graph based tasks, e.g., product or social recommendations. Meanwhile, we will investigate a more sophisticated method to improve the heterogeneous graph representation learning performance, such as enabling personalized heterogeneous graph navigation for random walk optimization. Furthermore, we will conduct in-depth studies of the user behavioral information, such as the impact of different entities (seller, customer, etc.) or different relations (, , etc.) on cross-domain aspect transfer problem.
Acknowledgements. This work is supported by the National Natural Science Foundation of China (61876003, 81971691), the China Department of Science and Technology Key Grant (2018YFC1704206), and Fundamental Research Funds for the Central Universities (18lgpy62).
- Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §3.2.
- Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022. Cited by: §3.2, §3.2.
- Probabilistic topic models. Communications of the ACM 55 (4), pp. 77–84. Cited by: §3.2.
- Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research 26, pp. 101–126. Cited by: §1, §5.
- A survey on heterogeneous transfer learning. Journal of Big Data 4 (29), pp. 1–42. Cited by: §5.
- Metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144. Cited by: §1, §2, §3.2, §4.1, §4.2, Table 3, §5.
- Transfer learning across feature-rich heterogeneous feature spaces via feature-space remapping (fsr). ACM Transactions on Intelligent Systems and Technology (TIST) 6 (1), pp. 1:42. Cited by: §5.
- Beyond the stars: improving rating predictions using review text content. In WebDB, Vol. 9, pp. 1–6. Cited by: §1, §4.1, §4.2, Table 3, §5.
- Domain adaptation for large-scale sentiment classification: a deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 513–520. Cited by: §1, §5.
- Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL, pp. 3–7. Cited by: footnote 3.
- Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Cited by: §3.2, §4.1, §4.1, Table 3, §5.
- Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168–177. Cited by: §5.
- Mathematics content understanding for cyberlearning via formula evolution map. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 37–46. Cited by: §1.
- Cross-language citation recommendation via hierarchical representation learning on heterogeneous graph. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 635–644. Cited by: §3.2.
- Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Cited by: §4.1, §4.2, Table 3.
- Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), pp. 1–14. Cited by: §1, §4.1, §4.2, Table 3, §5.
- NRC-canada-2014: detecting aspects and sentiment in customer reviews. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 437–442. Cited by: §1, §4.1, §5.
- Attention modeling for targeted sentiment. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Vol. 2, pp. 572–577. Cited by: §4.1, §4.2, Table 3, §5.
- Topical word embeddings. In AAAI, pp. 2418–2424. Cited by: §3.2.
- SemEval-2014 task 4: aspect based sentiment analysis. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pp. 27–36. Cited by: §5.
- Learning attitudes and attributes from multi-aspect reviews. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pp. 1020–1025. Cited by: §5.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §3.2.
- Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §3.2, §3.2.
- Proactive transfer learning for heterogeneous feature and label spaces. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 706–721. Cited by: §5.
- Phrasernn: phrase recursive neural network for aspect-based sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2509–2514. Cited by: §5.
- Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th international conference on World wide web, pp. 751–760. Cited by: §1.
- Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1, §3.2, §4.1, §4.1, Table 3, §5.
- SemEval-2016 task 5: aspect based sentiment analysis. In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), pp. 19–30. Cited by: §1, §5.
- Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 569–577. Cited by: §3.2.
- Insight-1 at semeval-2016 task 5: deep learning for multilingual aspect-based sentiment analysis. In Proceedings of International Workshop on Semantic Evaluation 2016 (SemEval-2016), pp. 330–336. Cited by: §1, §4.1, §5.
- Support vector machines. Springer Science & Business Media. Cited by: §3.2, §4.1.
- Pathsim: meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment 4 (11), pp. 992–1003. Cited by: §2.
- Aspect level sentiment classification with deep memory network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 214–224. Cited by: §5.
- Line: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. Cited by: §1, §4.1, Table 3, §5.
- Modeling online reviews with multi-grain topic models. In Proceedings of the 17th international conference on World Wide Web, pp. 111–120. Cited by: §5.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §4.1, §4.2, Table 3.
- A survey of transfer learning. Journal of Big Data 3 (9), pp. 1–40. Cited by: §5, §5.
- Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489. Cited by: §3.2, §5.
- Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13 (3), pp. 55–75. Cited by: §4.1.
- Central moment discrepancy (cmd) for domain-invariant representation learning. In International Conference on Learning Representations (ICLR 2017 - Conference Track), pp. 1–13. Cited by: §1, §4.1, §4.1, §4.2, Table 3, §5.
- Representation learning for aspect category detection in online reviews. In AAAI, pp. 417–424. Cited by: §1, §1, §5.