A Neural-Symbolic Architecture for Inverse Graphics Improved by Lifelong Meta-Learning
We follow the idea of formulating vision as inverse graphics and propose a new type of element for this task, a neural-symbolic capsule. It is capable of de-rendering a scene into semantic information feed-forward, as well as rendering it feed-backward. An initial set of capsules for graphical primitives is obtained from a generative grammar and connected into a full capsule network. Lifelong meta-learning continuously improves this network’s detection capabilities by adding capsules for new and more complex objects it detects in a scene using few-shot learning. Preliminary results demonstrate the potential of our novel approach.
The idea of inverting grammar parse-trees to generate neural networks is not new [23, 24], but has been largely abandoned. We revisit this idea and invert a generative grammar into a network of neural-symbolic capsules. Instead of labels, this capsule network outputs an entire scene-graph of the image, a representation that is commonplace in modern game engines such as Godot. Our approach is an inverse graphics pipeline for the prospective idea of an inverse game-engine.
We begin by introducing the generative grammar and how to invert symbols and rules to obtain neural-symbolic capsules (Section 3). They are internally different from the ones proposed by Hinton et al., as they essentially act as containers for regression models. Next, we present a modified routing-by-agreement and training protocol (Section 4), coupled to a lifelong meta-learning pipeline (Section 5). Through meta-learning, the capsule network continuously grows and trains individual capsules. We finally demonstrate the potential of the approach by presenting some results based on an “Asteroids”-like environment (Section 6) before ending with a conclusion.
2 Related Work
Capsule Networks. Hinton et al. introduced capsules, extending the idea of classical neurons by allowing them to output vectors instead of scalars. These vectors can be interpreted as attributes of an object and aim to reduce information loss between the layers of convolutional neural networks (CNN) [6, 8]. Capsules require specialized routing protocols, such as routing-by-agreement, where the activation probability of a capsule depends on the agreement of its real inputs with their expected values. Further extensions for capsules have been proposed, such as using matrices internally or using 3D input.
Neural-Symbolic Methods. There has been a strong effort to make a purely neural approach to vision more interpretable [9, 12, 15, 18, 20, 30]. An alternative approach to interpretability has been to deeply intertwine symbolic methods with connectionist methods. For computer vision, this is proving fruitful for de-rendering and scene decomposition [7, 10, 11, 22, 25, 27, 33]. Many of these approaches use shape programs, a decomposition of the scene into a set of rendering instructions. The scene-graphs we construct can be viewed as such shape programs, but in a different representation. This symbolic information is well suited for more complex tasks, such as visual question answering (VQA) [13, 29] and scene manipulation. We follow many of the presented ideas in our work.
3 Generative Grammar
The neural-symbolic capsules are derived from a modified generative attributed context-free grammar for image generation. We require that our grammar is non-recursive and has a finite number of symbols to avoid infinite productions. The grammar consists of the following components:
Axiom: The axiom (starting) symbol of our grammar is some object name for which we want to generate further, more detailed symbols.
Non-terminal symbols: The finite set of non-terminal symbols represents parts of the axiom symbol or parts of parts. The further down the grammar parse-tree a symbol resides, the more primitive it becomes.
Terminal symbols: The set of terminal symbols consists of graphical primitives. Whichever terminal symbols are chosen will determine the possible complexity that can be represented by the non-terminal symbols.
Rules: A production rule replaces one left-hand-side (LHS) symbol with an ordered sequence of right-hand-side (RHS) symbols, each of which is either terminal or non-terminal. There may be multiple rules that have the same LHS symbol. We also introduce a special function called draw that applies to terminal symbols, i.e., primitives, forcing them to produce a set of pixels corresponding to their graphical representation (cf. Figure 1).
Attributes: Every terminal and non-terminal symbol on the LHS of a rule, as well as each symbol on its RHS, is associated with an attribute vector.
Constraints: For the rules to produce meaningful attributes, they must be constrained by a set of realistic laws that still allow for a wide spectrum of results. In particular, we introduce a set of non-linear equations which constrain the attributes. Each attribute of a symbol produced by a rule is associated with a constraint and is calculated by applying that constraint to the attributes of the rule's LHS symbol. The draw function is also considered to be such a constraint.
The order of the symbols produced by a rule represents depth-sorting. For example, a rule whose RHS lists a triangle followed by a square first draws the triangle, then the square. We interpret two rules with the same LHS symbol to be equivalent to either drawing different unique viewpoints (from the front, from the back) of the same object, drawing the object using different part configurations (chair with padding, chair without padding) or drawing it in different styles (sketch, photo).
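The grammar components above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class names `Rule` and `Grammar`, the helper methods and the symbol names in the usage note are all assumptions made for the example. It shows the non-recursion requirement as a cycle check and the depth-sorting convention as a left-to-right expansion.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    lhs: str      # left-hand-side symbol
    rhs: tuple    # ordered right-hand-side symbols (order = depth-sorting)


@dataclass
class Grammar:
    axiom: str
    terminals: set
    rules: list = field(default_factory=list)

    def rules_for(self, symbol):
        """All rules that share the given LHS symbol."""
        return [r for r in self.rules if r.lhs == symbol]

    def is_recursive(self):
        """Return True if some symbol can, directly or indirectly, produce itself."""
        def visit(symbol, stack):
            if symbol in stack:
                return True
            return any(
                visit(child, stack | {symbol})
                for rule in self.rules_for(symbol)
                for child in rule.rhs
            )
        return any(visit(r.lhs, frozenset()) for r in self.rules)

    def draw_order(self, symbol=None):
        """Expand the first rule of each symbol and list primitives in draw order."""
        symbol = self.axiom if symbol is None else symbol
        if symbol in self.terminals:
            return [symbol]
        order = []
        for child in self.rules_for(symbol)[0].rhs:
            order.extend(self.draw_order(child))
        return order
```

With a hypothetical axiom `house` that produces `roof` and `wall`, which in turn produce the primitives `triangle` and `square`, `draw_order()` lists the triangle before the square, matching the depth-sorting convention described above.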
4 Neural-Symbolic Capsules
In this section we introduce our neural-symbolic capsules. We “invert” each symbol of our grammar to create a capsule and connect the capsules in the reverse order of the parse-tree to form a capsule network. Our approach to the capsule’s internals differs from the one introduced by Hinton et al.; however, the overall idea of routing-by-agreement and having vectorized outputs remains unchanged.
Terminal Symbols → Primitive Capsules. Each terminal symbol represents a renderable graphical primitive that is connected directly to a layer of pixels and we refer to its inversion as a primitive capsule. These capsules perform detection based on pixel inputs.
Non-Terminal Symbols → Semantic Capsules. We invert non-terminal symbols to form semantic capsules. We henceforth use the same name interchangeably for both the capsule and its corresponding symbol.
Rules → Routes. The constraints of a rule take the attributes of the LHS symbol and produce new attributes for the RHS symbols. After inversion, the capsule takes those same RHS attributes as input and has to regenerate the LHS attributes using the inverse of the constraints. However, the constraints are not invertible in most cases, and we instead introduce a best approximation of the inverse, chosen such that the error between the capsule's inputs and the inputs reproduced by applying the approximation followed by the original constraints
is minimized (cf. Figure 2). We refer to the inverted rule as a route and, depending on context, use the same name for both the rule and the route. This also holds true for primitive capsules, where detection means the inversion of the draw function.
We use a modified version of the routing-by-agreement protocol to find the best fitting route and attributes. A symbol may appear on the LHS of multiple rules, and thus on the RHS of multiple routes, but as only one route leads to the activation of the capsule, we introduce an activation probability for each route. Our goal is equivariance for the attributes and invariance for the activation probabilities under any feature-preserving transformation. For this we propose the following routing protocol as the internals of our capsule (cf. Figure 3):
1. The output of a route is calculated by applying the route's approximate inverse (the encoder) to its inputs.
2. For each input of a route, we estimate the expected input value as if that input were unknown, by applying the rule's constraints (the decoder) to the output of step 1.
3. The activation probability of a route is calculated from the set of all inputs that contribute to the route, each input capsule’s probability of activation, an agreement function with a vector-valued output, a norm of that output, a window function, and the past mean probability for that input.
4. Steps 1 to 3 are repeated for each route.
5. Find the route that was most likely used, i.e., the one with the highest activation probability,
and set the final output of the capsule to the output of that route.
Steps 1 and 2 correspond to an architecture equivalent to a de-rendering autoencoder, but with a known interpretation for the latent variables (attributes). Here, the approximate inverse acts as encoder and the constraints act as decoder.
For now, assume that the past mean probability is known in step 3. The agreement function measures how well the actual inputs of a route correspond to the expected inputs produced by the decoder. For semantic capsules, we choose an agreement function
that is built from a multi-dimensional window function and evaluated over the set of rotationally equivalent inputs. For primitive capsules, finding an appropriate agreement function depends very much on the design and symmetries of the decoder (draw function).
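The routing steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's exact protocol: the triangular window, the per-attribute agreement and the way per-input scores are averaged into a route probability are all choices made for the example, and the past mean probability is left out. The names `triangular_window` and `route_capsule` are hypothetical.

```python
def triangular_window(x, width=1.0):
    """Window that is 1 at x = 0 and falls linearly to 0 at |x| >= width."""
    return max(0.0, 1.0 - abs(x) / width)


def route_capsule(routes, inputs, input_probs):
    """Return (route name, output attributes, activation probability).

    routes:      dict mapping a route name to an (encoder, decoder) pair.
                 The encoder maps the list of input attribute vectors to one
                 output vector, the decoder maps that output back to the
                 list of expected input vectors.
    inputs:      list of attribute vectors, one per input capsule.
    input_probs: activation probability of each input capsule.
    """
    best = None
    for name, (encoder, decoder) in routes.items():
        output = encoder(inputs)        # step 1: de-render inputs to attributes
        expected = decoder(output)      # step 2: re-render the expected inputs
        scores = []
        for actual, exp, prob in zip(inputs, expected, input_probs):
            # Assumed agreement: one window value per attribute.
            agreement = [triangular_window(a - e) for a, e in zip(actual, exp)]
            scores.append(prob * sum(agreement) / len(agreement))
        route_prob = sum(scores) / len(scores)   # step 3 (assumed aggregation)
        if best is None or route_prob > best[2]:  # step 5: most likely route
            best = (name, output, route_prob)
    return best
```

A route whose decoder reproduces the observed inputs exactly receives probability 1 in this sketch, while a route whose expected inputs lie outside the window receives probability 0.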
4.2 Connecting Capsules
Individual capsules are connected as shown in Figure 2. A full capsule network is constructed from multiple grammars with different axioms. To avoid multiple capsules for the same symbol in the network, we merge them and introduce an observation table that stores all occurrences on a per-image basis (cf. Figure 3). For instance, a parent capsule does not need to connect to four separate copies of the same part capsule, but only to one with four entries in its observation table.
As these observations are reset at the beginning of each pass, we assume that all past entries of the observation table are stored in permanent memory elsewhere,
allowing us to calculate the past mean probability, as well as to perform meta-learning.
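A minimal sketch of such an observation table is given below. The class name `ObservationTable` and its methods are hypothetical, and storing the permanent memory in the same object is a simplification made for the example.

```python
class ObservationTable:
    """Per-capsule table of detections for the current image, plus a memory."""

    def __init__(self):
        self.current = []   # entries of the current forward pass
        self.memory = []    # all entries from earlier passes

    def begin_pass(self):
        """Move the current entries to memory and reset the table."""
        self.memory.extend(self.current)
        self.current = []

    def observe(self, attributes, probability):
        """Record one occurrence of this capsule's symbol in the image."""
        self.current.append((dict(attributes), probability))

    def mean_probability(self):
        """Mean activation probability over all remembered entries."""
        if not self.memory:
            return 0.0
        return sum(p for _, p in self.memory) / len(self.memory)
```

One merged capsule with four entries in `current` then replaces four separate capsules for the same symbol.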
During a forward pass the entries in all the observation tables form one or many tree structures, which we call the observed parse-trees. Their topmost symbol is not necessarily one of the axioms of the grammars the capsule network is based on (cf. LHS of Figure 4). Each observed parse-tree, thus, induces its own observed grammar with the topmost symbol being its observed axiom.
For the multitude of observed grammars in the capsule network, we postulate that we can always define a higher-level grammar by simply taking their union and defining a new axiom with a rule that produces the previous axioms (cf. Figure 4). For example, the grammars of two different objects allow us to define a meaningful higher-level grammar whose new axiom produces both objects.
4.3 Training Capsules
Training Primitive Capsules. We assume the decoder (draw) for primitive capsules is known. Finding an analytical solution for its inverse, the encoder, is out of reach. Instead, we use a regression model for the encoder and synthesize training sets with the decoder. We define the inputs to the decoder using quantile functions for each attribute, which we may refine according to our prior knowledge. Next, with a uniform random variable and a function that applies random backgrounds, occlusions and special effects, we generate the encoder’s virtually infinite training set by sampling attributes through the quantile functions, rendering them with the decoder and applying the effects function to the rendered pixels.
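The synthesis of training pairs for a primitive capsule can be sketched as follows. Everything here is an assumption made for illustration: the circle primitive with attributes `x`, `y` and `radius`, the uniform quantile functions, the 16 by 16 pixel grid and the simple noise function that stands in for backgrounds, occlusions and special effects.

```python
import random


def quantile_uniform(lo, hi):
    """Quantile function of a uniform distribution on [lo, hi]."""
    return lambda u: lo + u * (hi - lo)


def draw_circle(attrs, size=16):
    """Decoder: render a filled circle into a size x size grid of 0/1 pixels."""
    cx, cy, r = attrs["x"] * size, attrs["y"] * size, attrs["radius"] * size
    return [
        [1.0 if (i + 0.5 - cx) ** 2 + (j + 0.5 - cy) ** 2 <= r * r else 0.0
         for i in range(size)]
        for j in range(size)
    ]


def add_noise(image, rng, amount=0.05):
    """Stand-in for the effects function: add bounded pixel noise."""
    return [
        [min(1.0, max(0.0, px + rng.uniform(-amount, amount))) for px in row]
        for row in image
    ]


def synthesize(quantiles, draw, effects, count, seed=0):
    """Generate (pixels, attributes) pairs for training the encoder."""
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        # Map uniform random numbers through each attribute's quantile function.
        attrs = {name: q(rng.random()) for name, q in quantiles.items()}
        samples.append((effects(draw(attrs), rng), attrs))
    return samples
```

The regression model for the encoder would then be trained to predict the attribute dictionary from the pixel grid. Refining the prior knowledge amounts to swapping in different quantile functions.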
Training Semantic Capsules. If only the encoder of a route is known and the decoder is unknown, we use a method similar to the case above. Ideally, we calculate the outputs using the encoder and train the decoder using the resulting pairs of outputs and inputs as training sets.
We must, however, first find a suitable encoder. The initial output attributes of our semantic capsules consist of the distinct set of all input attributes. We are free in our choice of encoder, which is non-injective in most cases. It is expected that there will be collisions, i.e., different sets of inputs leading to the same output. These collisions are the main focus of our meta-learning pipeline. To minimize these collisions, we choose an encoder that calculates the mean of inputs of the same type, weighted by their size (width, height, depth). This weighting by size is to ensure that, for example, a wooden chair with many metallic screws is still considered wooden instead of metallic. A general attribute of the output is thus the size-weighted mean of that attribute over the inputs.
However, we use special functions for size and position
to ensure that they are in the correct reference frame. These functions operate on the vectorized subsets of the attribute vector for rotation, size and position, and make use of the Euler rotation matrix calculated from the rotation attributes as well as the corner position vectors of each input's bounding box (i.e., combinations of its position and size attributes).
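The size-weighted mean for a general attribute can be illustrated with the chair example. This sketch makes two assumptions that are not stated in the text: the weight of a part is taken to be the product of its width, height and depth, and a part that does not carry an attribute contributes the value 0. The function names are hypothetical.

```python
def size_weight(part):
    """Assumed weight of a part: the volume of its bounding box."""
    return part["width"] * part["height"] * part["depth"]


def weighted_mean_attribute(parts, name):
    """Size-weighted mean of one attribute over all parts."""
    total_weight = sum(size_weight(p) for p in parts)
    if total_weight == 0.0:
        return 0.0
    weighted_sum = sum(
        size_weight(p) * p["attributes"].get(name, 0.0) for p in parts
    )
    return weighted_sum / total_weight


# A large wooden frame and four small metallic screws (illustrative values).
frame = {"width": 1.0, "height": 1.0, "depth": 1.0, "attributes": {"wooden": 1.0}}
screw = {"width": 0.1, "height": 0.1, "depth": 0.1, "attributes": {"metallic": 1.0}}
chair_parts = [frame] + [screw] * 4

wooden = weighted_mean_attribute(chair_parts, "wooden")
metallic = weighted_mean_attribute(chair_parts, "metallic")
```

Because the screws are tiny compared to the frame, `wooden` stays close to 1 and `metallic` close to 0, so the chair is still considered wooden.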
We cannot create arbitrary inputs for training. Instead, we use observations from memory (cf. Equation 12) for augmentation. Detection needs to be invariant under changes of the output rotation, position and size, so we introduce transformation functions that rotate, translate and scale all parts, while leaving their rotation, position and size relative to each other unchanged.
For the other attributes we have little knowledge on how to perform feature-preserving transformations. However, if our original training set (cf. Equation 14) contains an output attribute for which all entries are smaller than some threshold, we can safely assume that we have never encountered an object with this attribute and are free to “invent” possible values for this attribute type by simply setting all inputs uniformly to some constant. For example, if our training set is filled with real apples, for which the stem is brown and the body red or green, we can invent a metallic apple by simply assuming that both the stem and body are metallic, as we have no idea what it really would look like. We therefore introduce a linear “style” transformation that sets all unused attributes of the same type to a common constant value and thereby either activates or deactivates that attribute type. Applying the spatial and style transformations to the observations from memory finally yields our fully augmented set for training.
A single example is sufficient for the above training regime to begin augmentation by translating, rotating and resizing the object, as well as trying out different styles.
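The spatial part of this augmentation can be sketched in two dimensions. The representation of a part as a dictionary with `x`, `y`, `size` and `rotation` entries, and the function name `transform_parts`, are assumptions made for the example. The same rotation, translation and scale is applied to every part, so the arrangement of the parts relative to each other is preserved.

```python
import math


def transform_parts(parts, angle=0.0, shift=(0.0, 0.0), scale=1.0):
    """Scale, rotate and translate all parts about the origin as one object."""
    c, s = math.cos(angle), math.sin(angle)
    transformed = []
    for part in parts:
        x, y = part["x"] * scale, part["y"] * scale
        transformed.append({
            **part,
            "x": c * x - s * y + shift[0],
            "y": s * x + c * y + shift[1],
            "size": part["size"] * scale,
            "rotation": part["rotation"] + angle,
        })
    return transformed


def distance(a, b):
    """Euclidean distance between the positions of two parts."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])
```

Calling `transform_parts` repeatedly with random angles, shifts and scales turns one observed example into many training samples whose distance-to-size ratios and relative rotations are unchanged.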
New Attributes and Re-Training. For semantic capsules, the set of attributes is not static and can grow. We differentiate between adding attributes due to inheritance and due to a trigger from the meta-learning agent.
Inheritance occurs automatically when one of the input capsules is extended by an attribute that is unknown to the current capsule. We simply expand the capsule’s internals by said attribute and retrain it. This inheritance propagates through the network, forcing subsequent capsules to inherit the attribute as well.
The more interesting case arises when the capsule is triggered by meta-learning to expand its attributes by adding a new one. First, we expand every attribute vector in memory for this capsule by the new attribute, set to a constant default value. Next, the internal attribute vector of the capsule and its encoder and decoder functions are extended. We begin by replacing the decoder with a new regression model of increased input width. The problem here is that we require the extended encoder to train it, which at this point has not been extended yet. Instead, we split the encoder and its output into two parts, one containing the newly added attribute as an output and one with the previous attributes as output,
where the full output is the concatenation of the two vectors. At this point, all quantities needed to form training pairs for the new decoder are known, and this suffices to start training it.
Finally, we need to determine the encoder part for the new attribute. We add a regression model with one output that runs in parallel to the existing encoder and train it using the new decoder.
5 Lifelong Meta-Learning
It is far too difficult to define the entire grammar with all rules and constraints from scratch to generate a complete capsule network. Instead, our approach works bottom-up and we only define the terminal symbols (primitive capsules), letting the meta-learning agent learn all semantic capsules and routes. This means that our initial set of primitive capsules limits what the network can eventually analyze and learn. Ideally, we would define primitive capsules for the most basic set of primitives from which we are able to construct every kind of object. We can, however, refine this set later on. A grammar with a given terminal symbol produces the same results even if that symbol is later refined by a rule that produces more basic terminal symbols. For the draw functions, we rely on the current state of computer graphics. Here we have access to a near endless supply of parameterizable primitives and graphics pipelines capable of physically-based rendering.
We postulated above that there always exists a higher-level grammar with an axiom that includes all the symbols of the observed grammars. We go a step further and view the capsule network as incomplete if there is more than one observed axiom (cf. Figure 4). There are four possible causes for this:
(A.1) A non-activated parent capsule is lacking a route.
“What existing symbol best describes these parts?”
(A.2) A parent capsule is missing.
“What new symbol best describes these parts?”
(B.1) An attribute is lacking training data.
“What existing attribute best describes this style or pose?”
(B.2) An attribute is missing.
“What new attribute best describes this style or pose?”
We may remedy these causes using one of two methods, either triggering the creation of a new route in an existing or new capsule ((A.1), (A.2)) or triggering the training of an existing or new attribute ((B.1), (B.2)).
However, deciding which of the four causes is responsible for the multiple observed axioms in the current forward pass is subjective even for humans. For example, consider a capsule network that has capsules for two kinds of parts and for the object they form. It encounters a new scene in which the observed parse-tree contains four activations of the first part and one activation of the second. The object capsule, however, did not activate, triggering the meta-learning pipeline due to multiple observed axioms. Is this just the known object with a previously unknown style (B.2)? Or is this a completely new object that requires its own capsule (A.2)?
We, thus, introduce a decision matrix (cf. Table 1). Akin to child development, we train this matrix by querying an oracle in the early stages of the capsule network’s training process and update the entries. Decisions are made by summing up all rows of features that evaluate to true and finding the column with maximum value. We may remove the oracle at any point in time, as it does not impair the learning capability of the network itself, only the ability of the agent to make human-like decisions and learn the correct names.
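The decision step can be sketched as a table of weights with one row per feature and one column per cause. The feature names used in the usage note, the class name `DecisionMatrix` and the additive update from oracle answers are assumptions made for illustration, since Table 1 and the exact update rule are not reproduced here.

```python
CAUSES = ("A.1", "A.2", "B.1", "B.2")


class DecisionMatrix:
    """Rows are boolean features, columns are the four possible causes."""

    def __init__(self, features):
        self.weights = {f: [0.0] * len(CAUSES) for f in features}

    def decide(self, active_features):
        """Sum the rows of all active features and pick the largest column."""
        totals = [
            sum(self.weights[f][i] for f in active_features)
            for i in range(len(CAUSES))
        ]
        best = max(range(len(CAUSES)), key=totals.__getitem__)
        return CAUSES[best]

    def update(self, active_features, oracle_cause, step=1.0):
        """Assumed training rule: reinforce the oracle's answer for each active row."""
        column = CAUSES.index(oracle_cause)
        for f in active_features:
            self.weights[f][column] += step
```

For example, after the oracle has answered (A.2) for scenes in which a hypothetical feature `no_parent_exists` was active, `decide(["no_parent_exists"])` returns `"A.2"` without consulting the oracle again.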
Lexical Interpretation. We interpret our grammar lexically. It is easy to see that each symbol represents a compound noun. For attributes, this analysis is more involved. Note that we have three attributes which we treated differently in Equations 15-17: rotation, size and position. We interpret these as prepositions. This becomes obvious once we have multiple objects in a scene and are able to refer to their spatial relationship using words such as “on” or “near”, based purely on these attributes.
For static scenes, we interpret all remaining attributes as adjectives, such as “wooden” or “red”. Their magnitude is then related to adverbs, such as “very”. However, in dynamic scenes, some attributes of an object change over time and describe new poses for the parts. Thus, we interpret these time-dependent attributes as verbs, such as “walk”. Their value is equivalent to the normalized time evolution of an animation.
These interpretations are interesting both semantically and for querying the oracle. Instead of presenting a choice between (A.1) - (B.2) and some values, an actual question can be formed from the activated features (cf. Table 1). Consider a capsule network that has thus far only seen a modern chair, made out of a blend of metal and wood. We now show it a chair made of the same parts, but with less metal and in a classical design. Even though the capsule was trained with basic style transformations, the design is still too complex to grasp. Instead, meta-learning is triggered by cause (B.2), because of a mismatch of the attributes “metallic”, “wooden” and “modern” in the chair capsule. As we have access to a lexical interpretation, we can make these abstract pieces of information easier to understand for a human oracle, by letting the meta-learning agent pose an actual question: “This object looks similar to a modern chair, but is very wooden instead. What adjective best describes this style?”. The answer “classical” then adds a new attribute to the capsule.
6 Implementation, Results and Comparison
Implementation. We implement the renderer of primitive capsules using signed distance fields and their encoder using an AlexNet-like convolutional neural network for regression. For semantic capsules we use Equations 15-17 for the encoder and a 4-layer deep dense regression neural network for the decoder. The training data is generated synthetically using the process described in this paper and all hyperparameters, such as the learning rate, are fine-tuned by hand. Our implementation, called VividNet, can be found on GitHub at github.com/Kayzaks/VividNet.
Results. In the initial phase, our capsule network has three primitive capsules: a square, a triangle and a circle. We begin by showing it an image of a spaceship (LHS of Figure 5), upon which it detects all the relevant graphical primitives, namely three triangles, one circle and one square, but has no semantic understanding of their relation. As this constellation leads to five activated capsules with no common parent, the meta-learning agent is called into action. In this case, it is obvious that (A.2) is triggered, as there are no semantic capsules yet. The exact split, however, is subjective and up to the oracle. We may treat these primitives as one spaceship (top row of Figure 5) or group them into two independent parts, booster and shuttle, which make up the spaceship (bottom row of Figure 5). In either case, the capsule network is extended by new capsules and trained using only this one example.
Now, the capsule network is shown a new scene, which includes an asteroid made up of three circles. The routes of the existing semantic capsules find no agreement, as none of them have three circles as their parts. Again, the meta-learning agent queries the oracle, which concludes that a new capsule is required (A.2). The asteroids, however, vary quite a bit. In a new scene with a different asteroid, three circles are detected. This time, however, a parent capsule (the asteroid) does exist that admits all three circles as its parts, but it did not activate. This leads to a different set of activated features in our decision matrix and we find, after querying the oracle, that the asteroid capsule is merely missing a route (A.1). Alternatively, the agent could have concluded that the capsule is missing some style attribute (B.2).
In Figure 5 we show two of the many possible timelines the training process could have taken, depending on the choice of features in the decision matrix, as well as the response by the oracle during the meta-learning process. It was sufficient to show the capsule network one spaceship (or its parts) and a few asteroids to construct the entire network and correctly identify these objects and all their attributes in a new scene.
Comparison. Our approach differs too much from current classification methods to allow a direct numeric comparison. The neural-symbolic capsule network can only express confidence, but has no notion of accuracy, as any inaccuracy is remedied by the meta-learning pipeline and its oracle. This does not mean it has perfect accuracy, but rather that it continues to learn forever. Further, the initial choice of primitive capsules is very important for the overall performance of the network. Any comparison would, thus, need to fix the capsule network in a subjective configuration, eliminating the benefit of lifelong meta-learning.
7 Conclusion and Outlook
In this work we showed the internal workings of our neural-symbolic capsule network and how it extends itself through lifelong meta-learning. The proposed network is bi-directional: feed-forward (i.e., the capsule network) it is a pattern recognition algorithm, and feed-backward (i.e., the generative grammar) it is a procedural graphics engine. We also showed how the network is capable of learning to detect new objects using a one/few-shot approach and that the training process is very human-like. This allows it to grow indefinitely with less training data, but requires the presence of an oracle to provide nouns, adjectives or verbs for new discoveries.
- [1] Battaglia, P., Pascanu, R., Lai, M., Rezende, D.J., Kavukcuoglu, K.: Interaction networks for learning about objects, relations and physics. NIPS (2016)
- [2] Battaglia, P.W., Hamrick, J.B., Tenenbaum, J.B.: Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences 110(45), 18327–18332 (2013)
- [3] Hamrick, J.B., Ballard, A.J., Pascanu, R., Vinyals, O., Heess, N., Battaglia, P.W.: Metacontrol for adaptive imagination-based optimization. ICLR (2017)
- [4] Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. International Conference on Artificial Neural Networks pp. 44–51 (2011)
- [5] Hinton, G.E., Sabour, S., Frosst, N.: Matrix capsules with EM routing. ICLR (2018)
- [6] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. NIPS pp. 1097–1105 (2012)
- [7] Kulkarni, T.D., Whitney, W.F., Kohli, P., Tenenbaum, J.B.: Deep convolutional inverse graphics network. NIPS (2015)
- [8] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
- [9] Lipton, Z.C.: The mythos of model interpretability. CoRR abs/1606.03490 (2017)
- [10] Liu, Y., Wu, Z., Ritchie, D., Freeman, W.T., Tenenbaum, J.B., Wu, J.: Learning to describe scenes with programs. ICLR (2019)
- [11] Liu, Z., Freeman, W.T., Tenenbaum, J.B., Wu, J.: Physical primitive decomposition. ECCV (2018)
- [12] Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. CVPR pp. 5188–5196 (2015)
- [13] Mao, J., Gan, C., Kohli, P., Tenenbaum, J.B., Wu, J.: The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. ICLR (2019)
- [14] Martinovic, A., Gool, L.V.: Bayesian grammar learning for inverse procedural modeling. CVPR (2013)
- [15] Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, 1–15 (2018)
- [16] Pharr, M., Humphreys, G., Jakob, W.: Physically Based Rendering 3rd Edition. Morgan Kaufmann (2016)
- [17] Quílez, I.: Rendering signed distance fields (2017), www.iquilezles.org
- [18] Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?” Explaining the predictions of any classifier. Knowledge Discovery and Data Mining pp. 1135–1144 (2016)
- [19] Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. NIPS (2017)
- [20] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034 (2014)
- [21] Team, G.E.: Godot engine (2019), godotengine.org
- [22] Tian, Y., Luo, A., Sun, X., Ellis, K., Freeman, W.T., Tenenbaum, J.B., Wu, J.: Learning to infer and execute 3d shape programs. ICLR (2019)
- [23] Towell, G.G., Shavlik, J.W.: Extracting refined rules from knowledge-based neural networks. Machine Learning 13(1), 71–101 (1993)
- [24] Towell, G.G., Shavlik, J.W.: Knowledge-based artificial neural networks. Artificial Intelligence 70(1), 119–165 (1994)
- [25] Tulsiani, S., Su, H., Guibas, L.J., Efros, A.A., Malik, J.: Learning shape abstractions by assembling volumetric primitives. CVPR (2017)
- [26] Ullman, T.D., Spelke, E., Battaglia, P., Tenenbaum, J.B.: Mind games: Game engines as an architecture for intuitive physics. Trends in Cognitive Science 21(9), 649–665 (2017)
- [27] Wu, J., Tenenbaum, J.B., Kohli, P.: Neural scene de-rendering. CVPR (2017)
- [28] Yao, S., Hsu, T.M.H., Zhu, J.Y., Wu, J., Torralba, A., Freeman, W.T., Tenenbaum, J.B.: 3d-aware scene manipulation via inverse graphics. NIPS (2018)
- [29] Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.B.: Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. NIPS (2018)
- [30] Zhang, Q., Wu, Y.N., Zhu, S.C.: Interpretable convolutional neural networks. CVPR pp. 8827–8836 (2018)
- [31] Zhao, Y., Birdal, T., Deng, H., Tombari, F.: 3d point-capsule networks. arXiv:1812.10775 (2018)
- [32] Zhou, Y., Zhu, Z., Bai, X., Lischinski, D., Cohen-Or, D., Huang, H.: Non-stationary texture synthesis by adversarial expansion. SIGGRAPH (2018)
- [33] Zou, C., Yumer, E., Yang, J., Ceylan, D., Hoiem, D.: 3d-prnn: Generating shape primitives with recurrent neural networks. ICCV (2017)