Inference Over Programs That Make Predictions
Abstract.
This abstract extends the previous work (Perov, 2015; Perov and Wood, 2016) on program induction (Muggleton et al., [n. d.]) using probabilistic programming. It describes possible further steps to extend that work so that, ultimately, automatic probabilistic program synthesis can generalise over any reasonable set of inputs and outputs, in particular with regard to text, image and video data.
1. Introduction
Probabilistic programming provides a natural framework for program induction, in which models can be automatically generated in the form of probabilistic programs given a specification. The specification can be expressed as input-output pairs, e.g. as “23-year-old male has lower right abdominal pain (LRAP) and vomiting” ⇒ “Appendicitis”; “34-year-old female has stiff neck, severe headache” ⇒ “Meningitis”, etc. We expect that, given a set of such input-output examples, a program induction system will generate a model which is capable of predicting an output for a new input. In the ideal scenario, the prediction will be a distribution over values rather than a specific deterministic value; for instance, “23-year-old male has LRAP and vomiting” can be mapped to both “Appendicitis” and “Gastroenteritis” with corresponding probabilities.
In the previous work (Perov, 2015; Perov and Wood, 2016) it was shown how to define an adaptor (Johnson et al., 2007), strongly-typed grammar as a probabilistic program, which can generate other probabilistic programs given a specification in the form of a set of observations.
By performing inference over the programs generated by this grammar, it was possible to infer simple probabilistic programs (specifically, samplers from one-dimensional distributions, e.g. Bernoulli, Poisson, etc.) that can generalise over any available training data (input and output pairs) and predict new outputs given new inputs. The results were comparable to another state-of-the-art approach to program induction, genetic programming.
2. Further Extensions
To facilitate the research into program induction using probabilistic programming, the following improvements to the methodology are suggested:
2.1. Using nonparametric, hierarchical distributions for the grammar
In the previous work, the adaptor, strongly-typed grammar was used with an enhancement of adding an option of drawing variables from local environments. For example, if we were looking for a model that takes two integers as arguments and outputs another integer, we would define an initial local environment scope with two variables and predefined constants, and sample a candidate program from the grammar, i.e. as from (grammar (local-scope (typed 'x1 'int) (typed 'x2 'int)) 'int). The method randomly generates an expression that produces an integer (or any other type); with some probability, the generated expression is:

- an integer constant,

- an integer variable drawn from the scope,

- a call to a predefined method which returns an integer, with arguments for each of which the grammar is called recursively to construct the full expression,

- a new extended local scope with one more variable of any supported type (including functions themselves), such that the returned expression of that scope is still of integer type,

- a short-circuit conditional expression whose branches are again of integer type,

- a recursive call to the current function, assuming the type signature of the current function returns an integer.
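The sampling procedure above can be illustrated with a minimal sketch. All names here (`generate`, `CONSTANTS`, `METHODS`) are hypothetical stand-ins, not the original implementation; the sketch only shows the recursive, typed, scope-extending generation pattern.

```python
import random

# Hypothetical sketch of the typed-grammar sampler: generate(scope, typ)
# returns a random expression of the requested type, where scope is a list
# of (variable, type) pairs available at this point in the program.
CONSTANTS = {"int": [0, 1, 2]}
METHODS = {"int": [("+", ["int", "int"]), ("-", ["int", "int"])]}

def generate(scope, typ, depth=0):
    # Bound the depth so that the recursion terminates.
    options = ["const", "var"] if depth >= 4 else ["const", "var", "method", "let", "if"]
    choice = random.choice(options)
    if choice == "const":
        return random.choice(CONSTANTS[typ])
    if choice == "var":
        vars_of_type = [v for v, t in scope if t == typ]
        return random.choice(vars_of_type) if vars_of_type else random.choice(CONSTANTS[typ])
    if choice == "method":
        # A predefined method returning `typ`; recurse to fill in its arguments.
        name, arg_types = random.choice(METHODS[typ])
        return [name] + [generate(scope, t, depth + 1) for t in arg_types]
    if choice == "let":
        # Extend the local scope with one fresh variable, then generate the body.
        fresh = "v%d" % depth
        return ["let", fresh, generate(scope, "int", depth + 1),
                generate(scope + [(fresh, "int")], typ, depth + 1)]
    # Conditional: both branches must again be of the requested type.
    return ["if", ["<", generate(scope, "int", depth + 1), generate(scope, "int", depth + 1)],
            generate(scope, typ, depth + 1), generate(scope, typ, depth + 1)]

expr = generate([("x1", "int"), ("x2", "int")], "int")
```

A call such as `generate([("x1", "int"), ("x2", "int")], "int")` corresponds to sampling from the grammar with the initial two-variable local scope described above.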
To make the priors over programs more flexible, we suggest using nonparametric (Mansinghka et al., 2012; Johnson et al., 2007), hierarchical (Liang et al., 2010) priors. That is, instead of adding a local environment with a variable of arbitrary type and a randomly grammar-generated expression of that type, we suggest drawing expressions from a nonparametric distribution defined as a memoized function with a base function being the grammar itself. The arguments of a call to that function might be another expression generated using the same grammar, which ensures that the “keys” of that memoized function will be generated based on the program induction inputs. The hierarchical property of such a prior might be achieved by ensuring that the same memoized function might be called in its own body; by doing so, we allow this function to decide whether to return an expression or to make another call to itself with different arguments, hence going deeper in the hierarchy.
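The memoization idea can be sketched as a Chinese-restaurant-process-style wrapper around a base sampler: repeated calls with the same key tend to reuse previously generated expressions, while occasionally drawing fresh ones. The function names and the concentration parameter `alpha` below are assumptions for illustration only.

```python
import random
from collections import defaultdict

# Hypothetical sketch of a stochastically memoized grammar: a CRP-style
# wrapper around base_sampler, so calls with the same key reuse earlier
# draws with probability proportional to their counts, and otherwise draw
# a fresh expression from the base grammar.
def stochastically_memoize(base_sampler, alpha=1.0):
    tables = defaultdict(list)  # key -> list of previously drawn expressions

    def sampler(key):
        draws = tables[key]
        # Reuse an old draw (uniformly over the list, i.e. proportional to
        # counts) or generate a new one with probability alpha/(n + alpha).
        if draws and random.random() < len(draws) / (len(draws) + alpha):
            value = random.choice(draws)
        else:
            value = base_sampler(key)
        draws.append(value)
        return value

    return sampler

# Base sampler stands in for the grammar itself.
mem_grammar = stochastically_memoize(lambda key: ("expr-for", key))
```

A hierarchical variant would let `base_sampler` itself call `mem_grammar` with a different key, giving the self-referential behaviour described above.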
2.2. Extending types supported, including higher-order typing
Another improvement over the previous work can be achieved by extending the types which can be used by the grammar. This includes adding more types such as dictionaries, lists, matrices, sets, queues, etc. Ideally, we would also like to support “recursive” type definitions, such that the grammar not only produces expressions to be evaluated, but is also capable of producing expressions that generate other expressions to be evaluated.
2.3. Using discriminative model proposals
To facilitate inference in a probabilistic program induction framework, we can use modern advances in probabilistic programming inference. In particular, we can use discriminative models, such as neural networks, to facilitate the inference (Perov et al., 2015; Gu et al., 2015; Le et al., 2016; Douglas et al., 2017).
2.4. Incorporating the compressed knowledge in the form of embeddings
The set of functions that can be used by the grammar can also be extended. Specifically, we believe one of the most interesting additions to that set might be pretrained embeddings. For example, we can incorporate word2vec (Mikolov et al., 2013) functions which would map a type “word” to a type “float”. This should allow the program induction to benefit from the compressed knowledge which word2vec and similar embedding models represent.
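As a sketch, such an embedding could be exposed to the grammar as one more typed primitive. The lookup table below contains toy stand-in vectors, not real word2vec weights, and the registration comment assumes the hypothetical method table from the earlier grammar sketch.

```python
# Hypothetical sketch: expose a pretrained embedding lookup as a typed
# primitive so that induced programs can map words to dense vectors.
# The vectors here are toy stand-ins for real pretrained weights.
PRETRAINED = {
    "pain":  [0.12, -0.40],
    "fever": [0.10, -0.35],
    "neck":  [-0.80, 0.90],
}

def embed(word):
    # Typed primitive 'word' -> vector of floats; unknown words map to zeros.
    return PRETRAINED.get(word, [0.0, 0.0])

# Registered with a hypothetical grammar's method table, e.g.:
# METHODS["vec"] = [("embed", ["word"])]
```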
2.5. “Ultimate” task for the induction
In the previous work, the probabilistic program induction was performed over simple one-dimensional distributions.
We believe that the most effective and cheap way to provide as much training data as possible is to set a task of predicting the next 1–20 words given the previous 20–500 words for an “arbitrary piece of text”. These pieces might be extracted from any source, e.g. from Wikipedia, news websites, books, etc. The observational likelihood might be a Binomial(n, p) distribution, where n is the number of words to predict for that particular input-output pair, and p is the probability of “success”. This approach follows the methods of noisy Approximate Bayesian Computation (Marjoram et al., 2003). Parameter p also might be varied in the process of inference, hence we might be performing simulated annealing. We believe that with enough computational power and with rich enough priors, the inference will be able to find models that predict reasonably well what the next word or a list of a few words should be.
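Under the assumed Binomial(n, p) form, the observational log-likelihood of a prediction can be sketched as counting word matches against the held-out continuation; the function name and matching rule here are illustrative assumptions.

```python
import math

# Sketch of the assumed noisy-ABC likelihood: compare the n predicted words
# with the n observed words position by position, count the k matches, and
# score them with a Binomial(n, p) likelihood. The "temperature" parameter p
# can be annealed during inference.
def log_likelihood(predicted, observed, p):
    n = len(observed)
    k = sum(1 for a, b in zip(predicted, observed) if a == b)
    return (math.log(math.comb(n, k))
            + k * math.log(p)
            + (n - k) * math.log(1.0 - p))
```

For example, with two predicted words of which one matches and p = 0.5, the likelihood is C(2, 1) · 0.5 · 0.5 = 0.5.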
Once a good model that can predict next words is well trained, this task can be extended to predicting audio and video (Perov et al., 2015) sequences, image reconstruction (Mansinghka et al., 2013; Kulkarni et al., 2015), text-to-image and image-to-text tasks, and ultimately to performing actions in environments like OpenAI Gym.
2.6. Distributing the computations
The inference over such a gigantic set of input-output pairs will require a massive amount of computation, which needs to be distributed. One approach to running the inference in parallel might be to run multiple Markov chains (e.g. using the Metropolis-Hastings algorithm) where each chain is given some subset of observations (similar to the stochastic gradient descent approach), while the chains “lazily” share the hidden states of the nonparametric, hierarchical, adaptor grammar. By “lazy” sharing we mean that the hidden states of the nonparametric components of the grammar are only updated from time to time.
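The “lazy” sharing scheme can be sketched as each chain working on a local snapshot of the shared nonparametric state and merging its pending updates into the global state only every few steps. Everything below (class name, the `Counter`-based state, the stand-in for an MH move) is a hypothetical illustration of the synchronisation pattern, not of the actual inference.

```python
from collections import Counter

# Hypothetical sketch of "lazy" sharing: a chain keeps a local copy of the
# grammar's nonparametric state (here just usage counts) and merges pending
# updates into the global state only every sync_every steps, rather than
# after every MH transition.
class LazySharedState:
    def __init__(self):
        self.global_counts = Counter()

    def run_chain(self, observations, steps, sync_every=100):
        local = Counter(self.global_counts)       # snapshot at chain start
        pending = Counter()
        for step in range(steps):
            # Stand-in for one MH transition over this chain's observations.
            key = observations[step % len(observations)]
            local[key] += 1
            pending[key] += 1
            if (step + 1) % sync_every == 0:
                self.global_counts.update(pending)  # lazy merge into global state
                pending.clear()
                local = Counter(self.global_counts) # refresh local snapshot
        self.global_counts.update(pending)          # final merge
```

In a real deployment the merge step would go through a shared store or message passing between workers; the point of the sketch is only that local and global states diverge between synchronisations.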
2.7. Discussion over proposals
While this abstract has focused on possible enhancements to improve priors over models as well as possible ways of setting the inference objective, it is also important to allow the proposal over a new probabilistic program candidate (e.g., as in Metropolis-Hastings) to be flexible, ideally by sampling the proposal function from the grammar as well. In that case, it will be “inference over inference”, i.e. nested inference (Rainforth, 2018) over the grammar and over the proposal. Another way of improving the process of inducing a new program candidate is merging two existing programs as in (Hwang et al., 2011).
2.8. Conclusion
This short abstract extends the previous work by suggesting some enhancements to allow more practical probabilistic program induction.
Implementing a system capable of such complex probabilistic program induction will require substantial resources, with computation and its distribution presumably being the most expensive.
Careful consideration should also be given to the choice of three languages:

- the language in which the system is to be written,

- the language in which the grammar is defined,

- the language of the inferred probabilistic programs (models).
It might be beneficial if all three are the same language, such that the system can benefit from recursive use of the same grammar components (e.g. for doing inference over the inference proposal itself, as suggested before). Also, ideally it is a “popular” language (or a language that can be easily transformed into such), such that all publicly available source code may be incorporated (Maddison and Tarlow, 2014) into the priors. Examples of such language candidates are Church (Goodman et al., 2012), Venture (Mansinghka et al., 2014), Anglican (Wood et al., 2014; Tolpin et al., 2015), Probabilistic Scheme (Paige and Wood, 2014) or WebPPL (Goodman and Stuhlmüller, 2014).
Acknowledgements.
This abstract is an extension to the work (Perov, 2015), and also incorporates ideas which had been published online (Perov, [n. d.]b, [n. d.]a, [n. d.]c).
References
 Douglas et al. (2017) Laura Douglas, Iliyan Zarov, Konstantinos Gourgoulias, Chris Lucas, Chris Hart, Adam Baker, Maneesh Sahani, Yura Perov, and Saurabh Johri. 2017. A Universal Marginalizer for Amortized Inference in Generative Models. arXiv preprint arXiv:1711.00695 (2017).
 Goodman et al. (2012) Noah Goodman, Vikash Mansinghka, Daniel M Roy, Keith Bonawitz, and Joshua B Tenenbaum. 2012. Church: a language for generative models. arXiv preprint arXiv:1206.3255 (2012).
 Goodman and Stuhlmüller (2014) Noah D Goodman and Andreas Stuhlmüller. 2014. The design and implementation of probabilistic programming languages.
 Gu et al. (2015) Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. 2015. Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems. 2629–2637.
 Hwang et al. (2011) Irvin Hwang, Andreas Stuhlmüller, and Noah D Goodman. 2011. Inducing probabilistic programs by Bayesian program merging. arXiv preprint arXiv:1110.5667 (2011).
 Johnson et al. (2007) Mark Johnson, Thomas L Griffiths, and Sharon Goldwater. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in neural information processing systems. 641–648.
 Kulkarni et al. (2015) Tejas D Kulkarni, Pushmeet Kohli, Joshua B Tenenbaum, and Vikash Mansinghka. 2015. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4390–4399.
 Le et al. (2016) Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. 2016. Inference compilation and universal probabilistic programming. arXiv preprint arXiv:1610.09900 (2016).
 Liang et al. (2010) Percy Liang, Michael I Jordan, and Dan Klein. 2010. Learning programs: A hierarchical Bayesian approach. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 639–646.
 Maddison and Tarlow (2014) Chris Maddison and Daniel Tarlow. 2014. Structured generative models of natural source code. In International Conference on Machine Learning. 649–657.
 Mansinghka et al. (2012) Vikash Mansinghka, Charles Kemp, Thomas Griffiths, and Joshua Tenenbaum. 2012. Structured priors for structure learning. arXiv preprint arXiv:1206.6852 (2012).
 Mansinghka et al. (2014) Vikash Mansinghka, Daniel Selsam, and Yura Perov. 2014. Venture: a higherorder probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099 (2014).
 Mansinghka et al. (2013) Vikash K Mansinghka, Tejas D Kulkarni, Yura N Perov, and Josh Tenenbaum. 2013. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In Advances in Neural Information Processing Systems. 1520–1528.
 Marjoram et al. (2003) Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. 2003. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100, 26 (2003), 15324–15328.
 Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
 Muggleton et al. ([n. d.]) Stephen H Muggleton, Ute Schmid, and Rishabh Singh. [n. d.]. Approaches and Applications of Inductive Programming. ([n. d.]).
 Paige and Wood (2014) Brooks Paige and Frank Wood. 2014. A compilation target for probabilistic programming languages. arXiv preprint arXiv:1403.0504 (2014).
 Perov ([n. d.]a) Yura Perov. [n. d.]a. AI1 project draft. http://yuraperov.com/docs/ai/SolvingAI.pdf. [Online; accessed 01August2018].
 Perov ([n. d.]b) Yura Perov. [n. d.]b. AI1 proposal draft. http://yuraperov.com/docs/ai/AI1Proposal.pdf. [Online; accessed 01August2018].
 Perov ([n. d.]c) Yura Perov. [n. d.]c. AI1 proposal draft (two additional pieces). http://yuraperov.com/docs/ai/AI1Proposal_TwoAdditionalPieces.pdf. [Online; accessed 01August2018].
 Perov (2015) Yura Perov. 2015. Applications of probabilistic programming. Ph.D. Dissertation. University of Oxford.
 Perov and Wood (2016) Yura Perov and Frank Wood. 2016. Automatic sampler discovery via probabilistic programming and approximate Bayesian computation. In Artificial General Intelligence. Springer, 262–273.
 Perov et al. (2015) Yura N Perov, Tuan Anh Le, and Frank Wood. 2015. Data-driven sequential Monte Carlo in probabilistic programming. arXiv preprint arXiv:1512.04387 (2015).
 Rainforth (2018) Tom Rainforth. 2018. Nesting Probabilistic Programs. arXiv preprint arXiv:1803.06328 (2018).
 Tolpin et al. (2015) David Tolpin, Jan-Willem van de Meent, and Frank Wood. 2015. Probabilistic programming in Anglican. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 308–311.
 Wood et al. (2014) Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. 2014. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics. 1024–1032.