Knowledge Infused Learning (K-IL): Towards Deep Incorporation of Knowledge in Deep Learning


Learning the underlying patterns in data goes beyond instance-based generalization to external knowledge represented in structured graphs or networks. Deep learning, which primarily constitutes the neural computing stream in AI, has shown significant advances in probabilistically learning latent patterns using a multi-layered network of computational nodes (i.e., neurons/hidden units). Structured knowledge, which underlies symbolic computing approaches and often supports reasoning, has also seen significant growth in recent years, in the form of broad-based (e.g., DBpedia, YAGO) and domain-, industry- or application-specific knowledge graphs. A common substrate with careful integration of the two will raise opportunities to develop neuro-symbolic learning approaches for AI, where conceptual and probabilistic representations are combined. While the incorporation of external knowledge will aid in supervising the learning of features for the model, deep infusion of representational knowledge from knowledge graphs within hidden layers will further enhance the learning process. Although much work remains, we believe that knowledge graphs will play an increasing role in developing hybrid neuro-symbolic intelligent systems (bottom-up deep learning with top-down symbolic computing) as well as in building explainable AI systems for which knowledge graphs will provide scaffolding for punctuating neural computing. In this position paper, we describe our motivation for such a neuro-symbolic approach and a framework that combines knowledge graphs and neural networks.

1 Introduction

Data-driven bottom-up machine/deep learning (ML) and top-down knowledge-driven approaches to creating reliable models have shown remarkable success in specific areas, such as search, speech recognition, language translation, computer vision, and autonomous vehicles. On the other hand, they have had limited success in understanding and deciphering contextual information, such as the detection of abstract concepts in online/offline human interactions. Current challenges in the translation of research methods and resources into practice often draw from a class of rarely studied problems that do not yield to contemporary bottom-up ML methods. Policymakers and practitioners assert serious usability concerns that constrain adoption, notably in high-consequence domains [71]. In most cases, data-dependent ML algorithms require high computing power and large datasets, in which the crucial signals may still be sparse or ambiguous, threatening precision [13]. Moreover, ML models that are deployed in the absence of transparency and accountability [57] and trained on biased datasets can lead to grave consequences, such as potential social discrimination and unfair treatment [52]. Further, the potentially severe implications of false alarms in an ML-integrated real-world application may affect millions of people [35, 39].

The fundamental challenges are common to a majority of problems in a variety of domains with real-world impact. Specifically, these challenges are: (1) dependency on large datasets required for bottom-up, data-dependent ML algorithms [72, 16], (2) bias in the dataset, enabling the model to potentially cause social discrimination and unfair treatment, (3) multidimensionality, ambiguity and sparsity, as the data involves unconstrained concepts and relationships with meaning from different contextual dimensions of the content such as religion, history and politics [35, 39]; further, the limited number of labeled instances available for training may fail to represent the true nature of concepts and relationships in datasets, leading to ambiguous or sparse true signals, (4) the lack of information traceability for model explainability, (5) the coverage of information specific to a domain that would be missed otherwise, (6) the complexity of model architecture in time and space, and (7) false alarms in model performance. Consequently, we believe that standard knowledge graph (KG) and ML methods, applied separately, are vulnerable to deducing or learning spurious concepts and relationships that appear deceptively good on a KG or training dataset, yet do not provide adequate results when the dataset contains contextual and dynamically changing concepts and relations.

In this position paper, we describe innovations that will operationalize more abstract models built upon the characteristics of a domain to render them computationally accessible within neural network architectures. To address the aforementioned challenges, we propose a neuro-symbolic method, knowledge-infused learning, that measures information loss in latent features learned by neural networks through KGs with conceptual and structural relationship information. The infusion of knowledge during the representation learning phase raises the following central research questions: (i) How do we decide whether to infuse knowledge or not at a particular stage while learning between layers, and how do we quantify the knowledge to be infused? (ii) How do we merge latent representations between layers with external knowledge representations? (iii) How do we propagate the knowledge through the learned latent representation? Considering the future deployment of AI in applications, the potential impact of this approach is significant. As stated in [32], the deeper the network, the denser the representation and the better the learning. A large number of parameters and the layered nature of neural networks make them modifiable based on specific problem characteristics. However, challenges 1, 3, 5 and 7 make neural networks vulnerable to the sudden appearance of relevant-but-sparse or ambiguous features in often noisy big data [72, 16, 37]. On the other hand, KG-based approaches structure search within a feature space defined by domain experts. To compensate for the vulnerabilities arising from these challenges, knowledge must be incorporated into the learned representation in a principled fashion. A promising approach is to base this on a measurable discrepancy between the knowledge captured in the neural network and external resources.

Computational modeling coupled with knowledge infusion in a neural network will disambiguate important concepts defined in a KG with their different semantic meanings through its structural relations. Knowledge infusion will redefine the emphasis of sparse but essential, and irrelevant but frequently occurring, terms and concepts, boosting recall without reducing precision. Further, it will provide explanatory insight into the model and robustness to noise, and reduce dependency on frequency in the learning process. This neuro-symbolic learning approach will potentially transform existing methods for data analysis and building computational models. While the impact of this approach is transferable (and replicable) to a majority of domains, the explicit implications are particularly apparent for social science [35] and healthcare domains [21].

2 Related Work

Since the incorporation of knowledge has been explored in various forms in prior research, this section describes the methodologies and applications specifically related to knowledge-infused learning: neural language models, neural attention models, and knowledge-based neural networks, all of which utilize external knowledge before or after the representation has been generated.

2.1 Neural Language Models (NLMs)

NLMs are a category of neural networks capable of learning sequential dependencies in a sentence and preserving such information while learning a representation. In particular, LSTM (Long Short-Term Memory) networks [28] emerged from the failure of RNNs (Recurrent Neural Networks) to remember long-term information. Concerning the loss of contextual information while learning, [14] proposed a context-feed-forward LSTM architecture in which the context learned by the previous layer is merged with the forgetting and modulation gates of the next layer. However, if erroneous contextual information is learned in previous layers, it is difficult to correct [47], a problem magnified by noisy data and content sparsity (e.g., Twitter, Reddit, blogs).

As the inclusion of structured knowledge (e.g., knowledge graphs) in deep learning improves information retrieval [63], prior research has shown the significance of knowledge in the pursuit of improving NLMs, such as in commonsense reasoning [41]. Transformer NLMs such as BERT [17] (including its variants BioBERT and SciBERT) are still data dependent. BERT has been utilized in hybrid frameworks such as [60] in the creation of sense embeddings using BabelNet and NASARI. [43] proposed K-BERT, which enriches the representations by injecting triples from KGs into the sentence. As this incorporation of knowledge for BERT takes place in the form of attention, we consider K-BERT to be semi-deep infusion [62]. Similarly, ERNIE [68] incorporated external knowledge to capture lexical, syntactic, and semantic information, enriching BERT.

2.2 Neural Attention Models (NAM)

NAM [58] highlights particular features that are important for pattern recognition/classification based on a hierarchical architecture. The manipulation of attentional focus is effective in solving real world problems involving massive amounts of data [25, 67]. On the other hand, some applications demonstrate the limitation of attentional manipulation in a set of problems such as sentiment (mis)classification [48] and suicide risk [15], where feature presence is inherently ambiguous, just as in the online radicalization problem [35]. For example, in the suicide risk prediction task, references to suicide-related terminology appear in social media posts of both victims as well as supportive listeners, and the existing NAMs fail to capture semantic relations between terms that help differentiate the suicidal user from a supportive user [20]. To overcome such limitations in a sentiment classification task, [73] adds sentiment scores into the feature set for enhancing the learned representation and modifies the loss function to respond to values of the sentiment score during learning. However, [64, 33] have pointed out the importance of using domain-specific knowledge especially in cases where the problem is complex in nature [55]. [5] has empirically demonstrated the effectiveness of combining richer semantics from domain knowledge with morphological and syntactic knowledge in the text, by modeling knowledge as an auxiliary task that regularizes the learning of the main objective in a deep neural network.

2.3 Knowledge-based Neural Networks

[75] introduced a knowledge-based recurrent attention neural network (KB-RANN) that modifies the attentional mechanism by incorporating domain knowledge to improve model generalization. However, their domain knowledge is statistically derivable from the input data itself and is analogous to merely learning an interpolation function over the existing data. [18] proposed a modification in the neural network by adopting Lipschitz functions for its activation function. [30] proposed a combination of deep neural networks with logic rules by employing a knowledge distillation procedure [27] that transfers the learned tacit knowledge from a larger neural network to the weights of a smaller neural network in data-limited settings. These studies on incorporating knowledge in a deep learning framework have not involved declarative knowledge structures in the form of KGs (e.g., DBpedia) [12]. However, [11] recently showed how the Cardiovascular Disease Ontology (CDO) provided context and reduced ambiguity, improving performance on a synonym detection task. [61] employed embeddings of entities in a KG, derived through Bi-LSTMs, to enhance the efficacy of NAMs. [59] presented a conceptual framework for explaining artificial neural networks' classification behavior using background knowledge on the semantic web. [46] described a deep learning approach to learn RDFS rules from both synthetic and real-world semantic web data. They also claim that their approach improves the noise-tolerance capabilities of RDFS reasoning.

All of the frameworks in the above subsections utilized external knowledge before or after the representation has been generated by NAMs, rather than within the deep neural network as in our approach [62]. We propose a learning framework that infuses domain knowledge within the latent layers of neural networks for modeling.

3 Preliminaries

Symbolic representation of a domain, besides its probabilistic representation, is crucial for neuro-symbolic learning. In our approach, we propose to homogenize symbolic information from KGs (see Section Knowledge Graphs) and contextual neural representations (see Section Contextual Modeling), in neural networks.

3.1 Knowledge Graphs

A Knowledge graph (KG) is a conceptual model of a domain that stores and structures declarative knowledge in a human and machine-readable format, constituting factual ground truth and embodying a domain ontology of objects, attributes, and relations. KGs rely on symbolic propositions, employing generic conceptual relationships in taxonomies, partonomies and specific content with labeled links. Examples include DBpedia, UMLS, and ICD-10. The factual information about the domain is represented in the form of instances (or individuals) of those concepts (or classes) and relationships [24, 65]. Therefore, a domain can be described or modeled through KGs in a way that both computers and humans can understand. As KGs differentiate contextual nuances of concepts in the content, they play a key role in our framework with extensive use by several functions.

3.2 Contextual Modeling

Capturing contextual cues in the language is crucial in our approach; hence, we utilize NLMs to generate embeddings of the content. Several embedding algorithms have emerged to create such representations, including Word2Vec [22], GloVe [54], FastText [1] and BERT [17].

Figure 1: Contextual Dimension Modeling Diagram [35] Embedding algorithm above (W2V: Word2Vec) can be replaced by other algorithms such as BERT. For each dimension, a specific corpus is utilized to create the model and the generated representation of content is concatenated. Generating the three contextual dimension representations of a social media post will emphasize the weights of such essential lexical cues.

Modeling context-sensitive problems in different domains (e.g., healthcare, cyber social threats, online extremism and harassment), depends heavily on carefully designed features to extract meaningful information, based on characteristics of the problems and a ground truth dataset. Moreover, identifying these characteristics and differentiating the content requires different levels of granularity in the organization of features. For instance, in the problem of online Islamist extremism, the information being shared in social media posts by users in extremism-related social networks displays an intent that depends on the user’s type (e.g., recruiter, follower). Hence, as these user types show different characteristics [36], for reliable analysis, it is critical to consider different contextual dimensions [35, 39]. Moreover, the ambiguity of diagnostic terms (e.g., jihad) also mandates representation of terms in different contexts. Hence, to better reflect these differences, creating multiple models enables us to represent the multiple contextual dimensions for a reliable analysis. Figure 1 details the contextual dimension modeling workflow.
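The concatenation of per-dimension representations described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimension names (`religion`, `history`, `politics`), the toy two-dimensional vectors, and the helper names `embed_content` and `contextual_representation` are all assumptions made for the example, and simple averaging stands in for whichever embedding algorithm (e.g., Word2Vec or BERT) is actually used.

```python
from typing import Dict, List

Vector = List[float]
Model = Dict[str, Vector]  # token -> embedding, one model per contextual dimension


def embed_content(tokens: List[str], model: Model) -> Vector:
    """Average the embeddings of the tokens that one dimension model knows."""
    dim = len(next(iter(model.values())))
    known = [model[t] for t in tokens if t in model]
    if not known:
        return [0.0] * dim  # no lexical cue for this dimension
    return [sum(column) / len(known) for column in zip(*known)]


def contextual_representation(tokens: List[str], models: Dict[str, Model]) -> Vector:
    """Concatenate the per-dimension representations in a fixed (sorted) order."""
    representation: Vector = []
    for name in sorted(models):
        representation.extend(embed_content(tokens, models[name]))
    return representation


# Toy 2-d models (hypothetical); real models would be trained on
# dimension-specific corpora.
toy_models = {
    "religion": {"jihad": [1.0, 0.0], "prayer": [0.5, 0.5]},
    "history": {"jihad": [0.0, 1.0]},
    "politics": {"jihad": [0.2, 0.8], "attack": [1.0, 1.0]},
}
post_vector = contextual_representation(["jihad", "attack"], toy_models)
```

The same term receives a different vector in each dimension model, so the concatenated representation keeps the dimension-specific senses side by side instead of collapsing them into one.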

Figure 2: Overall Architecture: Contextual representations of data are generated, and domain knowledge amplifies the significance of specific important concepts that are missed in the learning model. Classification error determines the need for updating a Seeded SubKG with more relevant knowledge, resulting in a Seeded SubKG that is more refined and informative to our model.

4 A Proposed Comprehensive Approach

Although existing research [21, 4] shows the contribution of incorporating external knowledge in ML, this incorporation mostly takes place before or after the actual learning process (e.g., feature extraction, validation), and thus remains shallow. We believe that deep knowledge infusion, within the hidden layers of neural networks, will greatly improve performance by: (i) reducing false alarms and information loss, (ii) boosting recall without sacrificing precision, (iii) providing finer-grained representation, (iv) enabling explainability [31, 38] and (v) reducing bias. Specifically, we believe that it will become a critical and integral component of AI models that are integrated in deployed tools, e.g., in healthcare, where domain knowledge is crucial and indispensable in decision-making processes. Fortunately, these domains are rich in terms of their respective machine-readable knowledge resources, such as manually curated medical KGs (e.g., UMLS [49], ICD-10 [9] and DataMed [51]). In our prior research [21], we utilized ML models coupled with these KGs to predict mental health disorders among 20 Mental Disorders (defined in the DSM-5) for Reddit posts. Typical approaches for such predictions employ word embeddings, such as Word2Vec, resulting in sub-optimal performance when they are used in domain-specific tasks. We incorporated knowledge into the embeddings of Reddit posts by (i) using Zero Shot learning [53] and (ii) modulating (e.g., re-weighting) their embeddings, similar to NAMs, and obtained a significant reduction in the false alarm rate, from 13% (without knowledge) to 2.5% (with knowledge). In another study, we leveraged the domain knowledge in KGs to validate model weights that explain diverse crowd behavior among Fantasy Premier League (FPL) participants [3]. However, very little previous work has tried to integrate such functional knowledge into an existing deep learning framework.

We propose to further develop an innovative deep knowledge-infused learning approach that will reveal patterns that are missed by traditional approaches because of sparse feature occurrence, feature ambiguity and noise. This approach will support the following integrated aims: (i) Infusion of Declarative Domain Knowledge in a Deep Learning framework, and (ii) Optimal Sub-Knowledge Graph Creation and Evolution. The overall architecture in Figure 2 guides our proposed research on these two aims. Our methods will disambiguate important concepts defined in the respective KGs with their different semantic meanings through its structural relations. Knowledge infusion will redefine the emphasis of sparse-but-essential, and irrelevant-but-frequently-occurring terms and concepts, boosting recall without reducing precision.

4.1 Knowledge-Infused Learning

Each layer in a neural network architecture produces a latent representation of the input vector. The infusion of knowledge during the representation learning phase raises the following central research questions: R1: How do we decide whether to infuse knowledge or not at a particular stage while learning between layers, and how do we quantify the knowledge to be infused? R2: How do we merge latent representations between layers with external knowledge representations? R3: How do we propagate the knowledge through the learned latent representation? We propose to define two functions to address these questions: the Knowledge-Aware Loss Function (K-LF) and the Knowledge Modulation Function (K-MF).

Configurations of neural networks can be designed in various ways depending on the problem. As our aim is to infuse knowledge within the neural network, such an operation can take place (i) before the output layer (e.g., SoftMax), or (ii) between hidden layers (e.g., reinforcing the gates of an NLM layer, modulating the hidden states of NLM layers, knowledge-driven NLM dropout and recurrent dropout between layers). To illustrate (i), we describe our initial approach to neural language models that infuses knowledge before the output layer, which we believe will shed light on a reliable and robust solution with more research and rigorous experimentation.

Seeded Sub-Knowledge Graph

The Seeded Sub-Knowledge Graph is a subset of KGs, which participate broadly in our technical approach. Generic KGs (e.g., DBpedia [6], YAGO2 [29], Freebase [7]) may contain over a million entities and close to a billion relationships. Using the entire graph of linked data on the web can cause (1) unnecessary computation and (2) noise due to irrelevant knowledge, and has sometimes failed to benefit intelligent applications [56]. However, real-world problems are domain-specific and require only a relevant (sub) portion of the full graph. A Seeded Sub-KG [40] needs to be created based on a ground truth dataset to represent a particular domain, using information-theoretic approaches (e.g., KL divergence) and probabilistic soft logic [34]. Further, a sub-graph discovery approach [10, 40] that utilizes probabilistic graphical models (e.g., deep belief networks, conditional random fields) can also be used. In our approach, the Seeded SubKG will be updated with more knowledge based on the difference between the learned representation and the relevant knowledge representation from the KG (see Section Differential Knowledge Engine).

Knowledge Embedding Creation

Representation of knowledge in the Seeded SubKG will be generated as embedding vectors. Specific contextual dimension models and/or more generic models can be utilized to create an embedding of each concept and their relations in the Seeded SubKG. Unlike traditional approaches that compute the representation of each concept in the KGs by simply taking an average of embedding vectors of concepts, we leverage the existing structural information of the graph. This procedure is formally defined:


In this definition, the knowledge embedding is the representation of the concepts enriched by the relationships in the Seeded-KG; it is computed over each relevant pair of concepts in the Seeded-KG, using a distance measure (e.g., Least Common Subsumer [2]) between the two concepts. Novel methods will further be examined, building upon this initial approach as well as existing tools, including TransE [8], TransH [74], and HolE [50], for the creation of embeddings from KGs.
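To make the idea of a structure-aware knowledge embedding concrete, the sketch below aggregates concept pairs weighted by their graph distance, rather than taking a plain average of concept vectors. It is only one possible reading under stated assumptions: the defining equation is not reproduced in this text, so the pair combination (mean of the two vectors), the weighting `1 / (1 + distance)`, and the names `knowledge_embedding`, `concept_vecs` and `pairs` are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def knowledge_embedding(
    concept_vecs: Dict[str, Vector],
    pairs: List[Tuple[str, str, float]],
) -> Vector:
    """Distance-weighted aggregate over related concept pairs.

    pairs holds (concept_i, concept_j, distance) for each relevant pair in
    the sub-KG. Assumption: closer pairs (smaller distance) contribute more.
    """
    dim = len(next(iter(concept_vecs.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for ci, cj, distance in pairs:
        weight = 1.0 / (1.0 + distance)  # hypothetical weighting scheme
        pair_vec = [(a + b) / 2.0 for a, b in zip(concept_vecs[ci], concept_vecs[cj])]
        total = [t + weight * p for t, p in zip(total, pair_vec)]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no pairs: zero vector
    return [t / weight_sum for t in total]


# Toy concepts and distances (hypothetical values).
vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
k_e = knowledge_embedding(vecs, [("a", "b", 0.0), ("a", "c", 1.0)])
```

Because the weights come from the graph, two sub-KGs with the same concepts but different relationships yield different embeddings, which a plain average cannot express.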

Knowledge Infusion Layer

In a many-to-one NLM [66] network with multiple hidden layers, the last hidden layer contains the learned representation before the output layer. The output layer (e.g., SoftMax) of the NLM model estimates the error to be back-propagated. While techniques for knowledge infusion between hidden layers or just before the output layer will be explored, in this subsection we explain the Knowledge Infusion Layer (K-IL), which takes place just before the output layer.

1:procedure KnowledgeInfusion
4:      for ne=1 to #Epochs do
5:            , TrainingNLM(,)
6:            while ( do
8:                  -             
10:      return:
Algorithm 1 Routine for Infusion of Knowledge in NLMs

Algorithm 1 takes the type of neural language model, the number of epochs, the number of iterations and the seeded knowledge graph embedding as input, and returns a knowledge-infused representation of the hidden state. In line 4, the infusion of knowledge takes place after each epoch without obstructing the learning of the vanilla NLM model, as explained in lines 5-10. Within the knowledge infusion process (lines 7-9), we optimize the loss function in Equation 2, with the convergence condition defined as the reduction in the difference between the divergences of the last two hidden representations in the presence of the knowledge embedding. Considering the vanilla structure of an NLM [23], the knowledge-infused representation is utilized by the fully connected layer for classification.

To illustrate an initial approach in Figure 3, we use LSTMs as NLMs in our neural network. K-IL functions add an additional layer before the output layer of our proposed neural network architecture. This layer takes the latent vector of the penultimate layer, the latent vector of the last hidden layer and the knowledge embedding as input.

In this layer, we define two particular functions that will be critical for merging the latent vectors from the hidden layers and the knowledge embedding vector from the KG. Note that the dimensions of these vectors are the same because they are created from the same models (e.g., contextual models), which makes the merge operation of those vectors possible and valid.

Figure 3: Inner Mechanism of the Knowledge Infusion Layer

K-LF: Knowledge-Aware Loss Function

In neural networks, hidden layers may de-emphasize important patterns due to the sparsity of certain features during learning, which causes information loss. In some cases, such patterns may not even appear in the data. However, such relations or patterns may be defined in KGs with even more relevant knowledge. We call this information gap between the learned representation of the data and the knowledge representation differential knowledge. Information loss in a learning process is relative to the distribution that suffered the loss. Hence, we propose a measure to determine the differential knowledge and guide the degree of knowledge infusion in learning. As our initial approach to this measure, we developed a two-state regularized loss function by utilizing Kullback-Leibler (KL) divergence. Our choice of the KL divergence measure is largely influenced by the Markov assumptions made in language modeling, as highlighted in [45]. The K-LF measure estimates the divergence between the hidden representations and the knowledge representation to determine the differential knowledge to be infused.

Formally we define it as:
, where is an input for convergence constraint.


We minimize the relative entropy for information loss to maximize the information gain from the knowledge representation. We will compute the differential knowledge through such an optimization approach; thus, the computed differential knowledge will also determine the degree of knowledge to be infused. The differential knowledge will be computed in the form of embedding vectors, and the dimensionality of the embeddings will be preserved.
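As a concrete reading of the divergence measure, recall that for discrete distributions P and Q the KL divergence is the sum over i of p_i log(p_i / q_i). The sketch below compares how far the penultimate and last hidden representations are from the knowledge embedding. It is a minimal illustration under stated assumptions: vectors are turned into distributions with a softmax, the divergence direction (hidden representation relative to knowledge) is chosen for the example, and the names `h_prev`, `h_last`, `k_e` and `differential_knowledge_gain` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(vec: Vector) -> Vector:
    """Normalize a real-valued vector into a probability distribution."""
    shift = max(vec)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Vector, q: Vector, eps: float = 1e-12) -> float:
    """D_KL(P || Q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def differential_knowledge_gain(h_prev: Vector, h_last: Vector, k_e: Vector) -> float:
    """Positive when the last hidden layer is closer to the knowledge
    embedding than the penultimate layer (a two-state comparison)."""
    knowledge = softmax(k_e)
    return kl_divergence(softmax(h_prev), knowledge) - kl_divergence(
        softmax(h_last), knowledge
    )


# Toy vectors (hypothetical): the last layer already matches the knowledge.
gain = differential_knowledge_gain([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
```

A shrinking gain across iterations is one way to express a convergence condition: once further infusion no longer brings the hidden representation closer to the knowledge embedding, the process can stop.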

K-MF: Knowledge Modulation Function

We need to merge the differential knowledge representation with the partially learned representation. However, this operation cannot be done arbitrarily, as the vector spaces of the two representations may differ in both dimension and distribution [19]. We explain an initial approach for the K-MF to modulate the learned weight matrix of the neural network with the hidden vector through an appropriate operation (e.g., Hadamard pointwise multiplication).

The learned weight matrix infusing knowledge is computed through the learning epochs from the learning momentum [69] and the differential knowledge embedding. We then merge this weight matrix with the hidden vector through the K-MF. Considering that we use Hadamard pointwise multiplication as our initial approach, the output of the K-MF, the Knowledge-Modulated representation, is the pointwise product of the hidden vector and the learned weight matrix infusing knowledge. Further investigation of techniques for the K-MF constitutes a central research topic for the research community.
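The modulation step can be sketched as an iterative update of modulation weights followed by a Hadamard product with the hidden vector. This is a hedged illustration, not the authors' formulation: it uses a weight vector rather than a weight matrix, plain gradient descent with a fixed learning rate in place of the learning momentum mentioned in the text, and minimizes the KL divergence from the softmax-normalized knowledge embedding to the softmax-normalized modulated vector. The names `modulate`, `h`, `k_e` and `w` are assumptions for the example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(vec: Vector) -> Vector:
    shift = max(vec)
    exps = [math.exp(v - shift) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Vector, q: Vector, eps: float = 1e-12) -> float:
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def hadamard(a: Vector, b: Vector) -> Vector:
    """Pointwise (Hadamard) product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


def modulate(
    h: Vector, k_e: Vector, lr: float = 0.5, tol: float = 1e-6, max_steps: int = 200
) -> Tuple[Vector, Vector]:
    """Return (m, w) with m = h * w pointwise, pulled toward the knowledge."""
    target = softmax(k_e)
    w = [1.0] * len(h)  # start from the unmodulated hidden vector
    previous = kl(target, softmax(hadamard(h, w)))
    for _ in range(max_steps):
        q = softmax(hadamard(h, w))
        # Gradient of KL(target || softmax(h * w)) w.r.t. w is (q - target) * h.
        w = [wi - lr * (qi - ti) * hi for wi, qi, ti, hi in zip(w, q, target, h)]
        current = kl(target, softmax(hadamard(h, w)))
        if previous - current < tol:  # divergence no longer shrinking
            break
        previous = current
    return hadamard(h, w), w


hidden = [1.0, 2.0, 0.5]
knowledge = [2.0, 0.5, 0.1]
modulated, weights = modulate(hidden, knowledge)
```

The modulated vector keeps the dimensionality of the hidden vector, so it can be passed to the output layer unchanged; only the relative emphasis of its components shifts toward what the knowledge embedding considers important.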

4.2 Differential Knowledge Engine

In deep neural networks, each epoch generates an error that is back-propagated, and the error is reduced over successive epochs until the model converges to a stationary point (e.g., a local minimum or a saddle point). The error indicates the difference between the probabilities of the actual and predicted labels, and this difference can be used to enrich the Seeded SubKG in our proposed knowledge-infused learning (K-IL) framework.

In this subsection, we discuss the sub-knowledge graph operations that are based on the difference between the learned representation of our knowledge-infused model and the representation of the relevant sub-knowledge graph from the KG, which we name the differential sub-knowledge graph. We define a Knowledge Proximity function to generate the Differential Sub-knowledge Graph, and an Update Seeded SubKG function to insert the differential sub-knowledge graph into the Seeded SubKG.

Knowledge Proximity

Upon the arrival of the learned representation from the knowledge-infused learning model, we query the KG to retrieve information related to the respective data point. In this particular step, it is important to find the optimal proximity between the concept and its related concepts. For example, from the "South Carolina" concept, we may traverse the surrounding concepts with a varying number of hops (empirically decided). Finding the optimal number of hops in each direction from the concept in question is still an open research question. Once we find the optimal proximity of a particular concept in the KG, we traverse the KG based on this proximity, starting from the concept in question.
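The hop-bounded traversal described above can be sketched as a breadth-first search over KG triples. This is a minimal illustration assuming the KG is available as a list of (subject, predicate, object) triples and that edges may be followed in both directions; the function name `k_hop_subgraph`, the predicates, and the toy triples are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def k_hop_subgraph(
    triples: List[Triple], seed: str, max_hops: int
) -> Tuple[Dict[str, int], List[Triple]]:
    """Return hop counts for concepts within max_hops of seed, plus the
    triples whose endpoints both fall inside that neighborhood."""
    neighbors: Dict[str, Set[str]] = {}
    for subject, _, obj in triples:
        # Treat edges as undirected for the purpose of traversal.
        neighbors.setdefault(subject, set()).add(obj)
        neighbors.setdefault(obj, set()).add(subject)

    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    sub_kg = [t for t in triples if t[0] in hops and t[2] in hops]
    return hops, sub_kg


# Toy KG fragment (hypothetical predicates).
toy_kg = [
    ("South Carolina", "partOf", "United States"),
    ("Columbia", "capitalOf", "South Carolina"),
    ("United States", "locatedIn", "North America"),
    ("University of South Carolina", "locatedIn", "Columbia"),
]
hops_1, sub_kg_1 = k_hop_subgraph(toy_kg, "South Carolina", max_hops=1)
hops_2, sub_kg_2 = k_hop_subgraph(toy_kg, "South Carolina", max_hops=2)
```

Varying `max_hops` changes how much surrounding knowledge is retrieved, which is where the empirically decided hop count mentioned in the text would enter.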

Differential SubKG

Once we obtain the SubKG from the graph propagation, we create a differential SubKG that will reflect the difference in knowledge from the Seeded SubKG. For this procedure, research is needed to formulate the problem using variational autoencoders to extract a differential subKG, which we believe will provide the information missing in the Seeded-KG.

Update function

The differential subKG generated as a result of minimizing knowledge proximity is considered as an input factual graph to the update procedure. As a result, the procedure dynamically evolves the Seeded subKG with the missing information from the differential subKG. We propose to utilize the Lyapunov stability theorem [42] and Zero Shot learning to update the Seeded-KG using the differential subKG. The differential subKG and the Seeded-KG represent two knowledge structures, requiring a process of transferring the knowledge from one structure to another [26]. We define this process as the generation of semantic mapping weights that encode and decode the two semantic spaces, utilizing the Lyapunov stability constraint and the Sylvester optimization approach: given two semantic spaces belonging to a domain D (e.g., online extremism, mental health), we aim to attain an equilibrium position defined as:


Here, the norm is the Frobenius norm, and the definition includes a proportionality constant. Equation 4 reflects the Lyapunov stability theorem; to achieve such a stable state, we define our optimization function as follows:


Equation 5 is solvable using Sylvester optimization and its derivation is defined in a recent study [21].
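For readers unfamiliar with the term, the following worked derivation shows how an encoder-decoder mapping objective can reduce to a Sylvester equation. It is an illustration under stated assumptions, not a reproduction of Equation 5: the objective below is a generic encode-and-decode form, and the symbols X (one semantic space, one column per item), S (the other semantic space), W (the mapping weights) and the trade-off constant \lambda are chosen for the example.

```latex
% Assumed objective: decode S back to X with W^T, encode X to S with W.
\min_{W} \; \lVert X - W^{\top} S \rVert_F^2 + \lambda \, \lVert W X - S \rVert_F^2

% Setting the gradient with respect to W to zero:
-2\, S \,(X - W^{\top} S)^{\top} + 2\lambda \,(W X - S)\, X^{\top} = 0

% Rearranging gives a Sylvester equation A W + W B = C with
%   A = S S^{\top}, \quad B = \lambda X X^{\top}, \quad C = (1 + \lambda)\, S X^{\top}:
S S^{\top} W + \lambda\, W X X^{\top} = (1 + \lambda)\, S X^{\top}
```

A Sylvester equation of the form AW + WB = C has a unique solution exactly when A and -B share no eigenvalue, and standard numerical routines solve it directly, so no iterative training is needed for the mapping weights in a setting of this shape.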

5 Applications for K-IL

Artificial intelligence models will be widely deployed in real-world decision-making processes in the foreseeable future, once the challenges described in Section 1 are overcome. As we argue that the incorporation of external structured knowledge will address these challenges, it will benefit various application domains, such as social and health sciences, by automating processes that require knowledge and intelligence. Specifically, it will have a potentially significant impact on predictive analysis of online communications (such as misinformation and extremism), conversational modeling, and disease prediction.

As predicting online extremism is challenging and false alarms have serious implications, potentially affecting millions of individuals, [35] showed that the (shallow) infusion of external domain-specific knowledge improves precision, reducing potential social discrimination. Further, in the prediction of mental health disorders defined in the DSM-5, shallow knowledge infusion [21] reduces false alarms by 30%. Conversational models are another important application area: [44] proposed a conversation framework in which the fusion of KGs and text mutually reinforce each other to generate knowledge-aware responses, improving the generalizability and explainability of the model. In another study, [76] integrated commonsense knowledge into conversational models to select the most appropriate response. While machine learning finds many application areas in medicine for disease prediction, large datasets are not always available. In this case, knowledge-infused learning generates more representative features, thereby avoiding overfitting. A study [70] on the early diagnosis of lung cancer using computed tomography images infused knowledge in the form of expert-curated features into the learning process through a CNN. Despite the small dataset, the enriched feature space in their knowledge-infused learning process improved the sensitivity and specificity of the model.

In contrast to the applications above, we believe that the deep infusion of external knowledge from KGs within latent layers will broaden the coverage of the information learned by the model. This, in turn, will provide better generalizability, reduced bias and fewer false alarms, disambiguation, less reliance on large data, explainability, reliability, and robustness for real-world applications in the critical, high-impact domains mentioned above.

6 Conclusion

Combining deep learning and knowledge graphs in a hybrid neuro-symbolic learning framework will further enhance performance and accelerate the convergence of the learning process. The impact of this improvement will be significant in highly sensitive domains such as health and social science, given the implications of real-world deployment. Adoption of tools that automate tasks requiring knowledge and intelligence, traditionally done by humans, will improve with the help of this framework that marries deep learning and knowledge graph techniques. Specifically, we envision that the infusion of knowledge as described in this framework will capture information for the corresponding domain at a finer granularity of abstraction. We believe that this approach will provide reliable solutions to the problems faced in deep learning, as described in Sections 1 and 5. Hence, in real-world applications, resolving these issues with both knowledge graphs and deep learning in a hybrid neuro-symbolic framework will greatly contribute to fulfilling AI’s promise.


We acknowledge partial support from the National Science Foundation (NSF) award CNS-1513721: “Context-Aware Harassment Detection on Social Media”. Any opinions, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.



  1. B. Athiwaratkun, A. G. Wilson and A. Anandkumar (2018) Probabilistic fasttext for multi-sense word embeddings. arXiv preprint arXiv:1806.02901. Cited by: §3.2.
  2. F. Baader, B. Sertkaya and A. Turhan (2007) Computing the least common subsumer wrt a background terminology. Journal of Applied Logic 5 (3), pp. 392–420. Cited by: §4.1.2.
  3. S. Bhatt, M. Gaur, B. Bullemer, V. L. Shalin, A. P. Sheth and B. Minnery (2018) Enhancing crowd wisdom using explainable diversity inferred from social media. In IEEE/WIC/ACM International Conference on Web Intelligence, Santiago, Chile. Cited by: §4.
  4. S. Bhatt, M. Gaur, B. Bullemer, V. Shalin, A. Sheth and B. Minnery (2018) Enhancing crowd wisdom using explainable diversity inferred from social media. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 293–300. Cited by: §4.
  5. J. Bian, B. Gao and T. Liu (2014) Knowledge-powered deep learning for word embedding. In Joint European conference on machine learning and knowledge discovery in databases, pp. 132–148. Cited by: §2.2.
  6. C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann (2009) DBpedia-a crystallization point for the web of data. Web Semantics: science, services and agents on the world wide web 7 (3), pp. 154–165. Cited by: §4.1.1.
  7. K. Bollacker, C. Evans, P. Paritosh, T. Sturge and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250. Cited by: §4.1.1.
  8. A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: §4.1.2.
  9. K. L. Brouch (2000) Where in the world is icd-10?. Where in the World Is ICD-10?/AHIMA, American Health Information Management Association. Cited by: §4.
  10. D. Cameron, R. Kavuluru, T. C. Rindflesch, A. P. Sheth, K. Thirunarayan and O. Bodenreider (2015) Context-driven automatic subgraph creation for literature-based discovery. Journal of biomedical informatics 54, pp. 141–157. Cited by: §4.1.1.
  11. M. A. Casteleiro, G. Demetriou, W. Read, M. J. F. Prieto, N. Maroto, D. M. Fernandez, G. Nenadic, J. Klein, J. Keane and R. Stevens (2018) Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature. Journal of biomedical semantics 9 (1), pp. 13. Cited by: §2.3.
  12. B. Chen, Z. Hao, X. Cai, R. Cai, W. Wen, J. Zhu and G. Xie (2019) Embedding logic rules into recurrent neural networks. IEEE Access 7, pp. 14938–14946. Cited by: §2.3.
  13. J. Cheng (2018) AI reasoning systems: pac and applied methods. arXiv preprint arXiv:1807.05054. Cited by: §1.
  14. K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §2.1.
  15. D. J. Corbitt-Hall, J. M. Gauthier, M. T. Davis and T. K. Witte (2016) College students’ responses to suicidal content on social networking sites: an examination using a simulated facebook newsfeed. Suicide and Life-Threatening Behavior 46 (5), pp. 609–624. Cited by: §2.2.
  16. G. De Palma, B. Kiani and S. Lloyd (2019) Random deep neural networks are biased towards simple functions. In Advances in Neural Information Processing Systems, pp. 1962–1974. Cited by: §1, §1.
  17. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.1, §3.2.
  18. C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau and R. Garcia (2009) Incorporating functional knowledge in neural networks. Journal of Machine Learning Research 10 (Jun), pp. 1239–1262. Cited by: §2.3.
  19. S. Dumančić and H. Blockeel (2017) Demystifying relational latent representations. In International Conference on Inductive Logic Programming, pp. 63–77. Cited by: §4.1.5.
  20. M. Gaur, A. Alambo, J. P. Sain, U. Kursuncu, K. Thirunarayan, R. Kavuluru, A. Sheth, R. Welton and J. Pathak (2019) Knowledge-aware assessment of severity of suicide risk for early intervention. In The World Wide Web Conference, pp. 514–525. Cited by: §2.2.
  21. M. Gaur, U. Kursuncu, A. Alambo, A. Sheth, R. Daniulaityte, K. Thirunarayan and J. Pathak (2018) “Let me tell you about your mental health!” contextualized classification of reddit posts to dsm-5 for web-based intervention. Cited by: §1, §4.2.3, §4, §5.
  22. Y. Goldberg and O. Levy (2014) Word2vec explained: deriving mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722. Cited by: §3.2.
  23. K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink and J. Schmidhuber (2017) LSTM: a search space odyssey. IEEE transactions on neural networks and learning systems 28 (10), pp. 2222–2232. Cited by: §4.1.3.
  24. T. Gruber (2008) Ontology. In Encyclopedia of Database Systems, Ling Liu and M. Tamer Özsu (Eds.). Cited by: §3.1.
  25. A. Halevy, P. Norvig and F. Pereira (2009) The unreasonable effectiveness of data. IEEE Intelligent Systems 24 (2), pp. 8–12. Cited by: §2.2.
  26. T. Hamaguchi, H. Oiwa, M. Shimbo and Y. Matsumoto (2017) Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach. arXiv preprint arXiv:1706.05674. Cited by: §4.2.3.
  27. G. Hinton, O. Vinyals and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.3.
  28. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1.
  29. J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum (2013) YAGO2: a spatially and temporally enhanced knowledge base from wikipedia. Artificial Intelligence 194, pp. 28–61. Cited by: §4.1.1.
  30. Z. Hu, X. Ma, Z. Liu, E. Hovy and E. Xing (2016) Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318. Cited by: §2.3.
  31. S. R. Islam, W. Eberle, S. Bundy and S. K. Ghafoor (2019) Infusing domain knowledge in ai-based “black box” models for better explainability with application in bankruptcy prediction. arXiv preprint arXiv:1905.11474. Cited by: §4.
  32. A. Karpathy (2015) The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog 21. Cited by: §1.
  33. S. J. Kho, S. Padhee, G. Bajaj, K. Thirunarayan and A. Sheth (2019) Domain-specific use cases for knowledge-enabled social media analysis. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining, pp. 233–246. Cited by: §2.2.
  34. A. Kimmig, S. Bach, M. Broecheler, B. Huang and L. Getoor (2012) A short introduction to probabilistic soft logic. In Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications, pp. 1–4. Cited by: §4.1.1.
  35. U. Kursuncu, M. Gaur, C. Castillo, A. Alambo, K. Thirunarayan, V. Shalin, D. Achilov, I. B. Arpinar and A. Sheth (2019) Modeling islamist extremist communications on social media using contextual dimensions: religion, ideology, and hate. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 151. Cited by: §1, §1, §1, §2.2, Figure 1, §3.2, §5.
  36. U. Kursuncu, M. Gaur, U. Lokala, A. Illendula, K. Thirunarayan, R. Daniulaityte, A. Sheth and I. B. Arpinar (2018) “What’s ur type?” contextualized classification of user types in marijuana-related communications using compositional multiview embedding. In IEEE/WIC/ACM International Conference on Web Intelligence (WI’18). Cited by: §3.2.
  37. U. Kursuncu, M. Gaur, U. Lokala, K. Thirunarayan, A. Sheth and I. B. Arpinar (2019) Predictive analysis on twitter: techniques and applications. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining, pp. 67–104. Cited by: §1.
  38. U. Kursuncu, M. Gaur, K. Thirunarayan and A. Sheth (2019) Explainability of medical ai through domain knowledge. Ontology Summit 2019, Medical Explanation. Cited by: §4.
  39. U. Kursuncu (2018) Modeling the persona in persuasive discourse on social media using context-aware and knowledge-driven learning. Ph.D. Thesis, University of Georgia. Cited by: §1, §1, §3.2.
  40. S. Lalithsena (2018) Domain-specific knowledge extraction from the web of data. Ph.D. Thesis, Wright State University. Cited by: §4.1.1.
  41. H. Liu and P. Singh (2004) Commonsense reasoning in and over natural language. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 293–306. Cited by: §2.1.
  42. M. Liu, D. Zhang and S. Chen (2014) Attribute relation learning for zero-shot classification. Neurocomputing 139, pp. 34–46. Cited by: §4.2.3.
  43. W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng and P. Wang (2019) K-bert: enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606. Cited by: §2.1.
  44. Z. Liu, Z. Niu, H. Wu and H. Wang (2019) Knowledge aware conversation generation with explainable reasoning over augmented graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1782–1792. Cited by: §5.
  45. C. Longworth (2010) Kernel methods for text-independent speaker verification. Ph.D. Thesis, University of Cambridge. Cited by: §4.1.4.
  46. B. Makni and J. Hendler Deep learning for noise-tolerant rdfs reasoning. Cited by: §2.3.
  47. N. Y. Masse, G. D. Grant and D. J. Freedman (2018) Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences 115 (44), pp. E10467–E10475. Cited by: §2.1.
  48. A. K. Maurya (2018) Learning low dimensional word based linear classifiers using data shared adaptive bootstrap aggregated lasso with application to imdb data. arXiv preprint arXiv:1807.10623. Cited by: §2.2.
  49. B. T. McInnes, T. Pedersen and S. V. Pakhomov (2009) UMLS-interface and umls-similarity: open source software for measuring paths and semantic similarity. In AMIA Annual Symposium Proceedings, Vol. 2009, pp. 431. Cited by: §4.
  50. M. Nickel, L. Rosasco and T. A. Poggio (2016) Holographic embeddings of knowledge graphs. In AAAI, Vol. 2, pp. 3–2. Cited by: §4.1.2.
  51. L. Ohno-Machado, S. Sansone, G. Alter, I. Fore, J. Grethe, H. Xu, A. Gonzalez-Beltran, P. Rocca-Serra, A. E. Gururaj and E. Bell (2017) Finding useful data across multiple biomedical data repositories using datamed. Nature genetics 49 (6), pp. 816. Cited by: §4.
  52. A. Olteanu, C. Castillo, F. Diaz and E. Kiciman (2019) Social data: biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2, pp. 13. Cited by: §1.
  53. M. Palatucci, D. Pomerleau, G. E. Hinton and T. M. Mitchell (2009) Zero-shot learning with semantic output codes. In Advances in neural information processing systems, pp. 1410–1418. Cited by: §4.
  54. J. Pennington, R. Socher and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §3.2.
  55. S. Perera, P. N. Mendes, A. Alex, A. P. Sheth and K. Thirunarayan (2016) Implicit entity linking in tweets. In International Semantic Web Conference, pp. 118–132. Cited by: §2.2.
  56. A. Roy, Y. Park and S. Pan (2017) Learning domain-specific word embeddings from sparse cybersecurity texts. arXiv preprint arXiv:1709.07470. Cited by: §4.1.1.
  57. C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. Cited by: §1.
  58. A. M. Rush, S. Chopra and J. Weston (2015) A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685. Cited by: §2.2.
  59. M. K. Sarker, N. Xie, D. Doran, M. Raymer and P. Hitzler (2017) Explaining trained neural networks with semantic web technologies: first steps. arXiv preprint arXiv:1710.04324. Cited by: §2.3.
  60. B. Scarlini, T. Pasini and R. Navigli (2020) SENSEMBERT: context-enhanced sense embeddings for multilingual word sense disambiguation. In Proc. of AAAI. Cited by: §2.1.
  61. Y. Shen, Y. Deng, M. Yang, Y. Li, N. Du, W. Fan and K. Lei (2018) Knowledge-aware attentive neural network for ranking question answer pairs. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 901–904. Cited by: §2.3.
  62. A. Sheth, M. Gaur, U. Kursuncu and R. Wickramarachchi (2019) Shades of knowledge-infused learning for enhancing deep learning. IEEE Internet Computing 23 (6), pp. 54–63. Cited by: §2.1, §2.3.
  63. A. Sheth and P. Kapanipathi (2016) Semantic filtering for social data. IEEE Internet Computing 20 (4), pp. 74–78. Cited by: §2.1.
  64. A. Sheth, S. Perera, S. Wijeratne and K. Thirunarayan (2017) Knowledge will propel machine understanding of content: extrapolating from current examples. arXiv preprint arXiv:1707.05308. Cited by: §2.2.
  65. A. Sheth and K. Thirunarayan (2012) Semantics empowered web 3.0: managing enterprise, social, sensor, and cloud-based data and services for advanced applications. Synthesis Lectures on Data Management 4 (6), pp. 1–175. Cited by: §3.1.
  66. P. G. Shivakumar, H. Li, K. Knight and P. Georgiou (2018) Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling. arXiv preprint arXiv:1802.02607. Cited by: §4.1.3.
  67. C. Sun, A. Shrivastava, S. Singh and A. Gupta (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 843–852. Cited by: §2.2.
  68. Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian and H. Wu (2019) Ernie: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. Cited by: §2.1.
  69. I. Sutskever, J. Martens, G. Dahl and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §4.1.5.
  70. J. Tan, Y. Huo, Z. Liang and L. Li (2019) Expert knowledge-infused deep learning for automatic lung nodule detection. Journal of X-ray science and technology 27 (1), pp. 17–35. Cited by: §5.
  71. E. J. Topol (2019) High-performance medicine: the convergence of human and artificial intelligence. Nature medicine 25 (1), pp. 44–56. Cited by: §1.
  72. L. G. Valiant (2000) Robust logics. Artificial Intelligence 117 (2), pp. 231–253. Cited by: §1, §1.
  73. K. Vo, D. Pham, M. Nguyen, T. Mai and T. Quan (2017) Combination of domain knowledge and deep learning for sentiment analysis. In International Workshop on Multi-disciplinary Trends in Artificial Intelligence, pp. 162–173. Cited by: §2.2.
  74. Z. Wang, J. Zhang, J. Feng and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI, Vol. 14, pp. 1112–1119. Cited by: §4.1.2.
  75. K. Yi, Z. Jian, S. Chen, Y. Chen and N. Zheng (2018) Knowledge-based recurrent attentive neural network for traffic sign detection. arXiv preprint arXiv:1803.05263. Cited by: §2.3.
  76. T. Young, E. Cambria, I. Chaturvedi, H. Zhou, S. Biswas and M. Huang (2018) Augmenting end-to-end dialogue systems with commonsense knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §5.