Knowledge Representation on the Web revisited:
Tools for Prototype Based Ontologies
In recent years RDF and OWL have become the most common knowledge representation languages in use on the Web, propelled by the recommendation of the W3C. In this paper we present a practical implementation of a different kind of knowledge representation based on Prototypes. In particular, we present a concrete syntax which is easily and effectively parsable by applications. We also present extensible implementations of a prototype knowledge base, specifically designed for the storage of Prototypes. These implementations are written in Java and can be extended by using the implementation as a library; alternatively, the software can be deployed as-is. Further, results of benchmarks for both local and web deployment are presented. This paper augments a research paper in which we describe the more theoretical aspects of our Prototype system.
Keywords: Linked Data, Knowledge Representation, Prototypes
Recently, we proposed Prototypes as a way to represent knowledge on the web [researchPaper] (please use that paper as a reference for all definitions). That paper focuses on theoretical aspects and analysis. In this resource paper we describe the tools we developed to deploy prototypes. First, in section 2 we present the implementation of a knowledge base, based on a Java interface, which can be used for storing prototypes. Then, we reuse this system to show how the knowledge base can be used in remote and distributed settings (section 3). Each of the sections includes some amount of benchmarking to give the reader an impression of the practical re-usability of the provided solutions. We do assume that the reader has some familiarity with the ideas behind prototypes. The implementation and the code used for the benchmarks are licensed under the LGPLv3 license and can be downloaded from https://github.com/miselico/knowledgebase.
2 A Standalone Knowledge Base
A prototype knowledge base (KB) consists of a collection of prototypes. To mirror this, the IKnowledgeBase interface, which we define as the basis for a KB, has only one method which must be implemented (the other methods are Java 8 default methods). The signature of this method is Optional<? extends Prototype> isDefined(ID id);. The provided Java source code contains five implementations of this interface, namely
An empty KB, which is a KB without any content. However, as per the definition, it still contains the empty prototype.
PredefinedKB is a KB containing string and integer constants and is described in more detail in this section.
KnowledgeBase stores prototypes. It can be constructed using another IKnowledgeBase as a basis. The underlying basis will be queried in case the requested prototype is not directly defined in the KnowledgeBase.
RemoteKB gets its prototypes from a remote KB. This implementation is described further in section 3.
ChainedKB is an IKnowledgeBase which connects multiple IKnowledgeBases together. The KBs are checked in turn until one is found in which the Prototype is defined. If none is found, an empty Optional is returned, indicating that no Prototype could be found.
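The interface and two of the implementations above can be sketched as follows. This is a simplified sketch, not the library's actual code: the ID and Prototype stand-ins, the MapKB helper, and the empty-prototype IRI proto:P_EMPTY are illustrative assumptions; only the interface name, the isDefined signature, and the chained lookup behavior come from the description above.

```java
import java.util.*;

// The single method that must be implemented; the real interface
// additionally offers Java 8 default methods.
interface IKnowledgeBase {
    Optional<? extends Prototype> isDefined(ID id);
}

// Simplified stand-ins for the library's ID and Prototype types.
record ID(String iri) {}
record Prototype(ID id, ID base) {}

// Illustrative map-backed KB used as a building block below.
class MapKB implements IKnowledgeBase {
    private final Map<ID, Prototype> defs;
    MapKB(Map<ID, Prototype> defs) { this.defs = defs; }
    public Optional<Prototype> isDefined(ID id) { return Optional.ofNullable(defs.get(id)); }
}

// A chained KB in the spirit of ChainedKB: each KB is checked in turn.
class ChainedKB implements IKnowledgeBase {
    private final List<IKnowledgeBase> kbs;
    ChainedKB(List<IKnowledgeBase> kbs) { this.kbs = kbs; }

    public Optional<? extends Prototype> isDefined(ID id) {
        for (IKnowledgeBase kb : kbs) {
            Optional<? extends Prototype> p = kb.isDefined(id);
            if (p.isPresent()) return p; // first definition found wins
        }
        return Optional.empty(); // no KB defines the prototype
    }
}
```

Because the first KB containing a definition wins, the order of the chained KBs expresses a simple precedence between sources.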
Each prototype consists of four components, namely 1) its own ID, 2) the ID of its base, 3) the change set for adding parts, and 4) the change set for removing parts [researchPaper]. This structure is closely mimicked in our implementation. The IDs are essentially represented using String types and the change sets using multimaps (i.e., maps which can associate multiple values with a given key). The formal definition allows the creation of a prototype which removes all values for a given property. According to the theoretical definition, that computation would involve enumerating the set of all possible IDs, which is infeasible. Hence, our implementation has two distinct change set implementations and treats 'remove all' as a special case which does not require enumeration.
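The four-component structure and the special-cased 'remove all' can be sketched as follows; the class and field names are illustrative assumptions, not the library's exact API.

```java
import java.util.*;

// Sketch of a prototype definition: own ID, base ID, and two change sets,
// with 'remove all' represented explicitly instead of enumerating all IDs.
final class PrototypeDef {
    final String id;                        // own ID (an absolute IRI)
    final String base;                      // ID of the base prototype
    final Map<String, Set<String>> add;     // change set for adding parts
    final Map<String, Set<String>> remove;  // change set for removing parts
    final Set<String> removeAll;            // properties stripped of all values

    PrototypeDef(String id, String base, Map<String, Set<String>> add,
                 Map<String, Set<String>> remove, Set<String> removeAll) {
        this.id = id; this.base = base; this.add = add;
        this.remove = remove; this.removeAll = removeAll;
    }

    // Apply removals and additions to property values inherited from the base.
    Map<String, Set<String>> apply(Map<String, Set<String>> inherited) {
        Map<String, Set<String>> result = new HashMap<>();
        inherited.forEach((p, vs) -> result.put(p, new HashSet<>(vs)));
        for (String p : removeAll) result.remove(p); // special case, no enumeration needed
        remove.forEach((p, vs) -> { if (result.containsKey(p)) result.get(p).removeAll(vs); });
        add.forEach((p, vs) -> result.computeIfAbsent(p, k -> new HashSet<>()).addAll(vs));
        return result;
    }
}
```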
Another aspect which a concrete implementation should cover is the use of literals. In its current state, the formal Prototype KB definition does not support literals as the value of a property. Instead, one has to represent a literal by using an agreed prototype. Therefore, we designed PredefinedKB, which acts like a KB that implicitly contains all possible string and integer literals encoded as prototypes. Extending the supported literal types to arbitrary types is straightforward.
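The idea of a KB which implicitly contains all literals can be sketched as follows; the value: IRI scheme and class name are made-up assumptions, not PredefinedKB's actual encoding. The point is that literal prototypes are synthesized on demand rather than stored.

```java
import java.util.*;

// Sketch of a PredefinedKB-style KB: every well-formed literal ID is
// "contained" implicitly, so no literal prototype is ever materialized.
class LiteralKB {
    // Returns the ID itself if it encodes a supported literal, empty otherwise.
    Optional<String> isDefined(String id) {
        if (id.startsWith("value:integer:")) {
            try {
                Long.parseLong(id.substring("value:integer:".length()));
                return Optional.of(id); // valid integer literal
            } catch (NumberFormatException e) {
                return Optional.empty(); // malformed integer
            }
        }
        if (id.startsWith("value:string:")) return Optional.of(id); // any string is valid
        return Optional.empty(); // not a literal ID
    }
}
```

Supporting a further literal type then amounts to adding one more prefix branch, which is why the extension to other types is straightforward.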
2.1 Consistency Checking
When a KB is created out of a set of prototypes, it should be checked that the result is in accordance with our Prototype Knowledge Base definition [researchPaper, Definition 4]. In this paper we say that the KB must be checked for consistency. This consistency check is performed in the KnowledgeBase implementation. First, all IDs and property names must be valid absolute IRIs, which is enforced by the type system. Next, if there is any prototype with a definition involving an ID which cannot be found in the KB, then the creation will be refused. Then, it is checked whether all inheritance chains eventually (recursively) end up at the empty prototype. If that is also the case, then a check is performed to ensure that no ID is used twice (this includes checking the underlying KB). In practice, one might want to remove this last check, and the issue of duplicates could be resolved in favor of prototypes in a given KB. This is possible using the ChainedKB.
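The inheritance-chain part of this check can be sketched as follows. This is a simplified illustration, not the actual KnowledgeBase code: prototypes are reduced to an id-to-base map, and proto:empty is a made-up ID standing in for the empty prototype.

```java
import java.util.*;

// Checks that every referenced base exists and that every inheritance
// chain eventually reaches the empty prototype (i.e., no cycles).
class ConsistencyChecker {
    static final String EMPTY = "proto:empty"; // placeholder empty-prototype ID

    static boolean check(Map<String, String> baseOf) {
        Set<String> known = new HashSet<>(); // IDs known to reach the empty prototype
        known.add(EMPTY);
        for (String id : baseOf.keySet()) {
            Set<String> chain = new LinkedHashSet<>();
            String cur = id;
            while (!known.contains(cur)) {
                if (!chain.add(cur)) return false; // cycle detected
                cur = baseOf.get(cur);
                if (cur == null) return false;     // base not defined in the KB
            }
            known.addAll(chain); // memoize the whole verified chain
        }
        return true;
    }
}
```

Recording every verified chain in `known` mirrors the reuse described in section 2.2: once a prototype is known to derive from the empty prototype, no prototype based on it needs to be traced again.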
The design of our software helps to build KBs which are consistent. The KB provides a builder class which can be used for construction. Further, the KB itself is completely immutable; changes are made by creating a new KB. This ensures consistency at any point in time.
2.2 Fixpoint Computation
Given a knowledge base, we implemented a method to compute its interpretation. For each prototype definition in the KB, this interpretation contains a new prototype definition with the same ID which derives directly from the empty prototype and only adds parts, such that both definitions have the same meaning under the interpretation. This boils down to computing the fixpoint for each of the prototype expressions. However, a direct implementation of the definition would not work, since there is a universal quantification over all IDs (an infinite set). Hence, we implement this such that simple change expressions are created only for those IDs which are actually used.
We implemented both the consistency check and the fixpoint computation in a scalable fashion. For both, the implementation is optimized such that it will not compute things twice. When the fixpoint for a prototype has already been computed, then the computation of the fixpoint for any prototype deriving from it will reuse this result. Similarly during the consistency check for recursive derivation from the empty prototype: if it is already known that a prototype derives recursively from the empty prototype, then we reuse this information to conclude that any prototype based on it also derives from the empty prototype. Next, we will introduce the data sets and the benchmarks in which they are used.
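A minimal sketch of such a memoized fixpoint computation is given below, under simplifying assumptions: prototypes are reduced to a base map plus add change sets (removals omitted for brevity), and proto:empty is a made-up ID for the empty prototype.

```java
import java.util.*;

// The fixpoint of a prototype is its own additions merged with the
// (already computed and cached) fixpoint of its base.
class FixpointComputer {
    final Map<String, String> baseOf;                 // prototype -> base
    final Map<String, Map<String, Set<String>>> adds; // prototype -> add change set
    final Map<String, Map<String, Set<String>>> cache = new HashMap<>();

    FixpointComputer(Map<String, String> baseOf,
                     Map<String, Map<String, Set<String>>> adds) {
        this.baseOf = baseOf;
        this.adds = adds;
    }

    Map<String, Set<String>> fixpoint(String id) {
        if (id.equals("proto:empty")) return Map.of(); // empty prototype has no parts
        Map<String, Set<String>> cached = cache.get(id);
        if (cached != null) return cached;             // reuse earlier computation
        Map<String, Set<String>> result = new HashMap<>();
        fixpoint(baseOf.get(id)).forEach((p, vs) -> result.put(p, new HashSet<>(vs)));
        adds.getOrDefault(id, Map.of())
            .forEach((p, vs) -> result.computeIfAbsent(p, k -> new HashSet<>()).addAll(vs));
        cache.put(id, result);
        return result;
    }
}
```

With the cache in place, computing the fixpoints of all prototypes touches each definition only once, which is what makes the near-linear benchmark behavior below plausible.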
2.3 Data Sets and Benchmarks
Since the prototype system is new, there are no existing real-world data sets which make use of its features. It would be possible to use existing RDF data sets as prototypes, but this would result in a KB which does not use the inheritance feature specific to prototypes. Therefore, we decided to use synthetic data sets for our benchmarks. We created three types of data sets in three different sizes, resulting in nine data sets altogether. Note that we do not use the remove capabilities in our data sets. This is not required since a remove will only reduce the burden on the system. An overview of the data sets can be found in table 1.
The first type of data sets has a structure that is favorable for consistency checking and fixpoint computation. Further, it does not have any properties attached to the prototypes. Hence, we will call these baseline data sets. To generate the data we start with one prototype which derives from the empty prototype; next we create two prototypes which derive from that one, then four prototypes which derive from these two, and so on, doubling the number of prototypes in each layer until the desired data set size is reached.
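The doubling scheme can be sketched as follows; the prototype IDs and the proto:empty IRI are made up for illustration.

```java
import java.util.*;

// Generates the layered baseline data: layer 0 has one prototype deriving
// from the empty prototype, and each subsequent layer doubles in size,
// deriving from prototypes of the previous layer.
class BaselineGenerator {
    // Returns a map from each generated prototype ID to its base ID.
    static Map<String, String> generate(int layers) {
        Map<String, String> baseOf = new LinkedHashMap<>();
        List<String> previous = List.of("proto:empty"); // placeholder empty-prototype ID
        int counter = 0;
        for (int layer = 0; layer < layers; layer++) {
            List<String> current = new ArrayList<>();
            for (int i = 0; i < (1 << layer); i++) {
                String id = "proto:p" + (counter++);
                // each prototype derives from one prototype of the previous layer
                baseOf.put(id, previous.get(i % previous.size()));
                current.add(id);
            }
            previous = current;
        }
        return baseOf;
    }
}
```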
For the second type we change the set-up to be less ideal and introduce properties. We create increasingly many blocks of prototypes, and hence this type will be called blocks. All prototypes in each block derive from a randomly chosen prototype in a lower block. Then, each of the prototypes has a property with a value randomly chosen from the block below. In the lowest block, the base is always the empty prototype and the value for the property is always the same fixed prototype.
The third type of data sets, which we call incremental, is more demanding for the fixpoint computation and consistency check. This time we add 1, 2, and 3 million prototypes to the KB, one at a time. Each prototype gets a randomly selected, earlier created one as its base. Furthermore, each prototype gets a random number of properties chosen from a fixed set of distinct ones (with replacement). The value of each property is chosen randomly among the prototypes.
For each data set we measure how long it takes to perform the consistency check and to compute the fixpoint of all prototypes (i.e., compute the whole interpretation). We also measure the final average number of properties per prototype. These results can be found in table 2.
As can be seen from the table, the consistency check scales linearly with the number of prototypes in the system. In some cases the behavior even appears sub-linear; this is likely caused by just-in-time compilation.
For the fixpoint, the baseline data set shows close to linear performance, which is expected. Again, larger sets seem to compute even faster (or rather, the small set is handled more slowly because the JIT compiler has not yet optimized the code). The blocks and incremental experiments also show the expected scalability. They do not, however, scale linearly, because in the larger experiments the number of properties per prototype in the fixpoints is larger.
The results are obtained by running the code on a single ‘Intel(R) Xeon(R) E5-2670 @ 2.60GHz’ core (using taskset). To keep results comparable we allowed the JVM to use large amounts of memory. The memory was mainly used to store the fixpoint of the KB, something one would in practice rarely keep in memory. These results show that the prototype system can scale well, even to millions of prototypes.
3 Distributed Knowledge Bases
Our goal is to create a KB which can be used on the web. Hence, it is not sufficient to show that our implementation can be used locally. In this section we present how we implemented the client-server version of our KB. In our implementation we use the ubiquitous HTTP protocol and well-known data formats. We also illustrate that KBs can have explicit links to each other through the Link header of the HTTP protocol. To show the knowledge sharing scenario in action, we perform benchmarks in which we query the KB in a simulated web environment.
3.1 Prototype Serialization and Joining
3.1.1 Joining of Prototype Definitions
Joining of prototype definitions is needed when retrieving the representation of a prototype from multiple sources. In general, the approach to this problem depends on the data sources and the amount of trust the client has in them. Imagine that one queries data source A for information about Germany, and one of the properties of Germany returned by A states that the country has 80M inhabitants. When service B is asked about the same prototype, a population of 20M is claimed. Now, suppose the client has a specification of the schema it wants countries to fulfill; concretely, it could be that a country can only have one value for the population count property. Hence, the application would choose the value from the most trusted source. Several implementations of joining the change sets of prototypes are provided.
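One such joining strategy, preferring values from more trusted sources, might look as follows. This is an illustrative sketch, not one of the provided implementations: prototype change sets are reduced to single-valued property maps, and the ordering-by-trust convention is an assumption.

```java
import java.util.*;

// Joins property maps from several sources; sources are given in
// decreasing order of trust, so on conflicting single-valued properties
// the most trusted source wins.
class TrustJoin {
    static Map<String, String> join(List<Map<String, String>> sourcesByTrust) {
        Map<String, String> joined = new HashMap<>();
        for (Map<String, String> source : sourcesByTrust) {
            // putIfAbsent keeps the value seen first, i.e. from the most trusted source
            source.forEach(joined::putIfAbsent);
        }
        return joined;
    }
}
```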
3.2 Deployment On Web Architecture
To serve prototypes on the web we use the HTTP protocol. To get the (serialized form of) a prototype, one needs to send a GET request with the prototype ID as a query parameter named p. For example, if the server is located at http://example.com/ and one wants to request the prototype isbn:123-4-56-789012-3, then the request URL will be http://example.com/?p=isbn%3A123-4-56-789012-3. We also implemented a way to serve fixpoints of prototypes. Since we are using HTTP, we can also use the existing optimizations and caching strategies available. From the server perspective, we use gzip compression in case the client supports it. Further, the server indicates how long the prototype will remain unchanged using the Cache-Control header [rfc7234]. Besides, the ETag header [rfc7232] is used: if the client wants to use the prototype but the cache time has expired, then it only needs to check with the server whether the ETag has changed to know whether it is up-to-date. The server implementation uses an embedded Jetty server which can be configured as desired. For instance, it is possible to deploy the implemented handler using HTTPS or HTTP/2. The client side (RemoteKB