Inference Over Programs That Make Predictions

Yura Perov, Babylon Health, 60 Sloane Avenue, London SW3 3DD, United Kingdom

This abstract extends previous work (Perov, 2015; Perov and Wood, 2016) on program induction (Muggleton et al., [n. d.]) using probabilistic programming. It describes possible further steps so that, ultimately, automatic probabilistic program synthesis can generalise over any reasonable set of inputs and outputs, in particular text, image and video data.

Keywords: Probabilistic Programming, Program Induction, Program Synthesis
Conference: The International Conference on Probabilistic Programming, 2018, Boston, MA, USA


1. Introduction

Probabilistic programming provides a natural framework for program induction, in which models are generated automatically in the form of probabilistic programs given a specification. The specification can be expressed as input-output pairs, e.g. “23-year male has lower right abdominal pain (LRAP) and vomiting” => “Appendicitis”; “34-year female has stiff neck, severe headache” => “Meningitis”, etc. We expect that, given a set of such input-output examples, a program induction system will generate a model that is capable of predicting an output for a new input. In the ideal scenario, the prediction will be a distribution over values rather than a single deterministic value: for example, “23-year male has LRAP and vomiting” can be mapped to both “Appendicitis” and “Gastroenteritis”, each with its own probability.

In the previous work (Perov, 2015; Perov and Wood, 2016) it was shown how to define an adaptor (Johnson et al., 2007), strongly-typed grammar as a probabilistic program, which can generate other probabilistic programs given a specification in the form of a set of observations:

(assume model (run-grammar array word))
(noisy-observe (apply model inp1) outp1)
...
(noisy-observe (apply model inpN) outpN)
(predict (apply model inp0))

By performing inference over such models, it was possible to infer simple probabilistic programs (specifically, samplers from one-dimensional distributions, e.g. Bernoulli, Poisson, etc.) that can generalise over any available training data (input and output pairs) and predict new outputs given new inputs. The results were comparable to those of genetic programming, another state-of-the-art approach to program induction.

2. Further Extensions

To facilitate research into program induction using probabilistic programming, the following improvements to the methodology are suggested:

2.1. Using non-parametric, hierarchical distributions for the grammar

In the previous work, the adaptor, strongly-typed grammar was extended with the option of drawing variables from local environments. For example, if we were looking for a model that takes two integers as arguments and outputs another integer, we would define an initial local environment scope with two variables (x1 and x2) and a set of predefined constants, and sample a candidate program from the grammar, i.e. from (grammar (localscope (typed x1 int) (typed x2 int)) int). The method grammar randomly generates an expression that produces an integer (or a value of any other requested type):

  • either it will be an integer constant,

  • an integer variable from the local scope,

  • a predefined method that returns an integer, applied to arguments (for each of which the grammar is called recursively to construct the full expression),

  • a new extended local scope with one more variable of any supported type (including functions themselves), such that the returned expression is still of integer type,

  • a short-circuit conditional, such that the returned expression is still of integer type,

  • a recursive call to the current function, assuming the return type of the current function is integer.
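
The production rules above can be illustrated with a small Python sketch. It is not the grammar from the previous work: the primitive set, the constants, the uniform choice between rules and the depth cap are all assumptions made for illustration, and the scope-extension and recursive-call rules are omitted for brevity.

```python
import random

# Hypothetical primitives: name -> (argument types, return type, implementation).
PRIMITIVES = {
    "+": (("int", "int"), "int", lambda a, b: a + b),
    "*": (("int", "int"), "int", lambda a, b: a * b),
    "<": (("int", "int"), "bool", lambda a, b: a < b),
}
# Hypothetical predefined constants for each type.
CONSTANTS = {"int": [0, 1, 2], "bool": [True, False]}


def grammar(scope, typ, rng, depth=0, max_depth=4):
    """Sample an expression of type `typ`; `scope` maps variable names to types."""
    variables = [name for name, t in scope.items() if t == typ]
    rules = ["const"]
    if variables:
        rules.append("var")
    if depth < max_depth:  # the depth cap (an assumption) guarantees termination
        rules += ["apply", "if"]
    rule = rng.choice(rules)
    if rule == "const":
        return rng.choice(CONSTANTS[typ])
    if rule == "var":
        return rng.choice(variables)
    if rule == "apply":
        # Pick a primitive with the requested return type, then fill in its
        # arguments by calling the grammar recursively with each argument type.
        ops = [op for op, (_, ret, _) in PRIMITIVES.items() if ret == typ]
        op = rng.choice(ops)
        args = [grammar(scope, t, rng, depth + 1, max_depth)
                for t in PRIMITIVES[op][0]]
        return (op, *args)
    # Short-circuit conditional: both branches have the requested type.
    return ("if",
            grammar(scope, "bool", rng, depth + 1, max_depth),
            grammar(scope, typ, rng, depth + 1, max_depth),
            grammar(scope, typ, rng, depth + 1, max_depth))


def evaluate(expr, env):
    """Evaluate a sampled expression; strings are variables, tuples are calls."""
    if isinstance(expr, str):
        return env[expr]
    if not isinstance(expr, tuple):
        return expr  # a constant
    head, *args = expr
    if head == "if":
        cond, then_branch, else_branch = args
        chosen = then_branch if evaluate(cond, env) else else_branch
        return evaluate(chosen, env)  # only the chosen branch is evaluated
    return PRIMITIVES[head][2](*(evaluate(a, env) for a in args))
```

For example, `grammar({"x1": "int", "x2": "int"}, "int", random.Random(0))` returns a nested tuple representing an integer-valued expression, which `evaluate` can run on concrete values of `x1` and `x2`.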

To make the priors over programs more flexible, we suggest using non-parametric (Mansinghka et al., 2012; Johnson et al., 2007), hierarchical (Liang et al., 2010) priors. That is, instead of adding a local environment with a variable of arbitrary type and a randomly grammar-generated expression of that type, we suggest drawing expressions from a non-parametric distribution defined as a memoized function whose base function is the grammar itself. The arguments of a call to that function might be another expression generated using the same grammar, which ensures that the “keys” of that memoized function will be generated based on the program induction inputs. The hierarchical property of such a prior might be achieved by ensuring that the same memoized function might be called in its body; by doing so, we allow this function to decide whether to return an expression or to make another call to itself with different arguments, hence going deeper in the hierarchy.
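
One possible reading of such a memoized function is stochastic memoization in the style of a Chinese restaurant process: for each key, a call either reuses a previously drawn value (with probability proportional to how often it has been reused) or makes a fresh draw from the base function. The Python sketch below illustrates that reading only. The class name `DPMem`, the concentration parameter `alpha`, the `(type, depth)` keys and the toy base function standing in for the grammar are all assumptions.

```python
import random


class DPMem:
    """Stochastic memoizer: per key, reuse an old draw or make a fresh one (CRP)."""

    def __init__(self, base, alpha, rng):
        self.base = base      # base sampler, called as base(key, rng)
        self.alpha = alpha    # concentration: larger values mean more fresh draws
        self.rng = rng
        self.tables = {}      # key -> list of [value, reuse count]

    def __call__(self, key):
        tables = self.tables.setdefault(key, [])
        total = sum(count for _, count in tables)
        u = self.rng.random() * (total + self.alpha)
        for table in tables:
            u -= table[1]
            if u < 0:         # reuse an existing value, proportionally to its count
                table[1] += 1
                return table[0]
        value = self.base(key, self.rng)   # otherwise draw a fresh value
        tables.append([value, 1])
        return value


def base(key, rng):
    """Toy stand-in for the grammar: return a leaf, or call `mem` one level deeper."""
    typ, depth = key
    if depth >= 3 or rng.random() < 0.5:
        return rng.choice(["x1", "x2", "0", "1"])
    deeper = (typ, depth + 1)
    return "(+ " + mem(deeper) + " " + mem(deeper) + ")"


mem = DPMem(base, alpha=1.0, rng=random.Random(0))
expressions = [mem(("int", 0)) for _ in range(20)]
```

Because `base` calls `mem` with a different key, the function itself decides whether to return an expression or to descend one level in the hierarchy, and frequently reused sub-expressions become more probable over time.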

2.2. Extending types supported, including higher-order typing

Another improvement over the previous work can be achieved by extending the set of types that the grammar can use. This includes adding more types such as dictionaries, lists, matrices, sets, queues, etc. Ideally, we would also like to support “recursive” type definitions, such that the grammar not only produces expressions to be evaluated, but is also capable of producing expressions that generate other expressions to be evaluated.

2.3. Using discriminative model proposals

To facilitate inference in a probabilistic program induction framework, we can use modern advances in probabilistic programming inference. In particular, we can use discriminative models, such as neural networks, as proposals that guide the inference (Perov et al., 2015; Gu et al., 2015; Le et al., 2016; Douglas et al., 2017).

2.4. Incorporating the compressed knowledge in the form of embeddings

The set of functions that can be used by the grammar can also be extended. Specifically, we believe one of the most interesting additions to that set might be pre-trained embeddings. For example, we can incorporate word2vec (Mikolov et al., 2013) functions which would map a type “word” to a type “float”. This should allow the program induction to benefit from the compressed knowledge which word2vec and similar embedding models represent.
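
As an illustration of how embeddings could enter the grammar as typed primitives, the Python sketch below wraps an embedding table into functions with explicit type signatures. The table holds toy, made-up vectors rather than real pre-trained embeddings, and the primitive names (`emb-0`, `emb-sim`), the signature format and the treatment of unknown words are assumptions.

```python
import math

# Toy 3-dimensional vectors; hypothetical values, NOT real pre-trained embeddings.
TOY_EMBEDDINGS = {
    "pain": (0.9, 0.1, 0.0),
    "ache": (0.8, 0.2, 0.1),
    "neck": (0.0, 0.7, 0.6),
}


def embedding_primitives(table, dims):
    """Build grammar primitives: name -> (argument types, return type, function)."""
    primitives = {}
    for d in range(dims):
        # One word -> float primitive per embedding dimension. The default
        # argument binds the current d; unknown words map to 0.0 (an assumption).
        primitives[f"emb-{d}"] = (
            ("word",), "float",
            lambda w, d=d: table.get(w, (0.0,) * dims)[d],
        )

    def similarity(a, b):
        """Cosine similarity of two words; 0.0 if either word is unknown."""
        va, vb = table.get(a), table.get(b)
        if va is None or vb is None:
            return 0.0
        dot = sum(x * y for x, y in zip(va, vb))
        norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
        return dot / norm if norm else 0.0

    primitives["emb-sim"] = (("word", "word"), "float", similarity)
    return primitives


prims = embedding_primitives(TOY_EMBEDDINGS, dims=3)
```

Each entry carries its argument and return types, so a strongly-typed grammar can select these primitives whenever it needs an expression of type “float” and has expressions of type “word” available.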

2.5. “Ultimate” task for the induction

In the previous work, the probabilistic program induction was performed over simple one-dimensional distributions.

We believe that the most effective and cheapest way to provide as much training data as possible is to set the task of predicting 1-20 words given the previous 20-500 words of an “arbitrary piece of text”. These pieces might be extracted from any source, e.g. Wikipedia, news websites, books, etc. The observational likelihood might be a Binomial distribution Binomial(n, p), where n is the number of words to predict for that particular input-output pair, and p is the probability of “success”. This approach follows the methods of noisy Approximate Bayesian Computation (Marjoram et al., 2003). The parameter p might also be varied in the process of inference, in which case we would effectively be performing simulated annealing. We believe that with enough computational power and with rich enough priors, the inference will be able to find models that predict reasonably well what the next word or the next few words should be.
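
A minimal Python sketch of such a Binomial observational likelihood follows. It assumes that a “success” means a predicted word matching the target word at the same position, writes n for the number of target words, k for the number of matches and p for the success probability, and uses a linear schedule for p. The matching rule, the symbol names and the schedule are illustrative assumptions rather than the method of the previous work.

```python
import math


def log_binomial_likelihood(predicted, target, p):
    """log Binomial(k; n, p), where k of the n target words were predicted correctly.

    Requires 0 < p < 1. Target positions with no prediction count as failures.
    """
    n = len(target)
    k = sum(1 for a, b in zip(predicted, target) if a == b)
    return (math.log(math.comb(n, k))
            + k * math.log(p)
            + (n - k) * math.log(1.0 - p))


def annealed_p(step, total_steps, p_start=0.6, p_end=0.99):
    """Hypothetical linear schedule: tighten the noise level as inference proceeds."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


target = ["the", "patient", "recovered"]
good = ["the", "patient", "recovered"]
poor = ["the", "patient", "died"]
gap_early = (log_binomial_likelihood(good, target, annealed_p(0, 100))
             - log_binomial_likelihood(poor, target, annealed_p(0, 100)))
gap_late = (log_binomial_likelihood(good, target, annealed_p(100, 100))
            - log_binomial_likelihood(poor, target, annealed_p(100, 100)))
```

With this schedule, the log-likelihood gap between the exact and the imperfect prediction grows as p approaches 1, so early in the inference imperfect programs are still tolerated, while later they are penalised more heavily.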

Once a good model that predicts the next words has been trained, this task can be extended to predicting audio and video (Perov et al., 2015) sequences, image reconstruction (Mansinghka et al., 2013; Kulkarni et al., 2015), text-to-image and image-to-text tasks, and ultimately to performing actions in environments like OpenAI Gym.

2.6. Distributing the computations

The inference over such a gigantic set of input-output pairs will require a massive amount of computation, which needs to be distributed. One approach to running the inference in parallel is to run multiple Markov chains (e.g. using the Metropolis-Hastings algorithm), where each chain is given some subset of the observations (similar to the stochastic gradient descent approach) and the chains “lazily” share the hidden states of the non-parametric, hierarchical, adaptor grammar. By “lazy” sharing we mean that the hidden states of the non-parametric components of the grammar are synchronised between chains only from time to time.
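
The synchronisation pattern behind “lazy” sharing can be sketched in Python as below. This is a sequential simulation of the bookkeeping only: the Metropolis-Hastings update is replaced by a placeholder that touches one random item of the chain's shard, the shared hidden state is reduced to a table of counts, and the names `shards` and `sync_every` are assumptions.

```python
from collections import Counter
import random


def run_lazy_chains(shards, n_steps, sync_every, seed=0):
    """Simulate chains that publish their updates to a shared state only occasionally."""
    rng = random.Random(seed)
    shared = Counter()                    # global statistics of the adaptor
    local = [Counter() for _ in shards]   # each chain's (possibly stale) view
    delta = [Counter() for _ in shards]   # each chain's unpublished updates
    for step in range(1, n_steps + 1):
        for i, shard in enumerate(shards):
            # Placeholder for one Metropolis-Hastings update on chain i, which
            # would read local[i] and change the statistics of some item.
            item = rng.choice(shard)
            local[i][item] += 1
            delta[i][item] += 1
        if step % sync_every == 0:        # lazy sharing: merge from time to time
            for d in delta:
                shared.update(d)
                d.clear()
            local = [Counter(shared) for _ in shards]
    return shared, local, delta


shards = [["a", "b"], ["b", "c"], ["c", "d"]]
shared, local, delta = run_lazy_chains(shards, n_steps=10, sync_every=4)
```

Between synchronisation points each chain works with a stale copy of the shared state plus its own changes, which is the price paid for avoiding communication at every step.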

2.7. Discussion over proposals

While this abstract has focused on possible enhancements to the priors over models and on possible ways of setting the inference objective, it is also important to allow the proposal of a new probabilistic program candidate (e.g. as in Metropolis-Hastings) to be flexible, ideally by sampling the proposal function from the grammar as well. In that case, it will be “inference over inference”, i.e. nested inference (Rainforth, 2018) over the grammar and over the proposal. Another way of improving the process of inducing a new program candidate is the merging of two existing programs, as in (Hwang et al., 2011).

2.8. Conclusion

This short abstract extends the previous work by suggesting some enhancements to allow more practical probabilistic program induction.

Implementing a system capable of such complex probabilistic program induction will require a lot of resources, of which computation and its distribution will presumably be the most expensive.

Careful consideration should also be given to the choice of three languages:

  • the language in which the system is to be written,

  • the language in which the grammar is defined,

  • the language of the inferred probabilistic programs (models).

It might be beneficial if all three are the same language, such that the system can benefit from recursive use of the same grammar components (e.g. for doing inference over the inference proposal itself, as suggested before). Also, ideally it is a “popular” language (or a language that can be easily transformed into one), such that all publicly available source code may be incorporated (Maddison and Tarlow, 2014) into the priors. Examples of such language candidates are Church (Goodman et al., 2012), Venture (Mansinghka et al., 2014), Anglican (Wood et al., 2014; Tolpin et al., 2015), Probabilistic Scheme (Paige and Wood, 2014) or WebPPL (Goodman and Stuhlmüller, 2014).

This abstract is an extension of the work in (Perov, 2015) and also incorporates ideas that were published online (Perov, [n. d.]b, [n. d.]a, [n. d.]c).


References
  • Douglas et al. (2017) Laura Douglas, Iliyan Zarov, Konstantinos Gourgoulias, Chris Lucas, Chris Hart, Adam Baker, Maneesh Sahani, Yura Perov, and Saurabh Johri. 2017. A Universal Marginalizer for Amortized Inference in Generative Models. arXiv preprint arXiv:1711.00695 (2017).
  • Goodman et al. (2012) Noah Goodman, Vikash Mansinghka, Daniel M Roy, Keith Bonawitz, and Joshua B Tenenbaum. 2012. Church: a language for generative models. arXiv preprint arXiv:1206.3255 (2012).
  • Goodman and Stuhlmüller (2014) Noah D Goodman and Andreas Stuhlmüller. 2014. The design and implementation of probabilistic programming languages.
  • Gu et al. (2015) Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. 2015. Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems. 2629–2637.
  • Hwang et al. (2011) Irvin Hwang, Andreas Stuhlmüller, and Noah D Goodman. 2011. Inducing probabilistic programs by Bayesian program merging. arXiv preprint arXiv:1110.5667 (2011).
  • Johnson et al. (2007) Mark Johnson, Thomas L Griffiths, and Sharon Goldwater. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems. 641–648.
  • Kulkarni et al. (2015) Tejas D Kulkarni, Pushmeet Kohli, Joshua B Tenenbaum, and Vikash Mansinghka. 2015. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4390–4399.
  • Le et al. (2016) Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. 2016. Inference compilation and universal probabilistic programming. arXiv preprint arXiv:1610.09900 (2016).
  • Liang et al. (2010) Percy Liang, Michael I Jordan, and Dan Klein. 2010. Learning programs: A hierarchical Bayesian approach. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 639–646.
  • Maddison and Tarlow (2014) Chris Maddison and Daniel Tarlow. 2014. Structured generative models of natural source code. In International Conference on Machine Learning. 649–657.
  • Mansinghka et al. (2012) Vikash Mansinghka, Charles Kemp, Thomas Griffiths, and Joshua Tenenbaum. 2012. Structured priors for structure learning. arXiv preprint arXiv:1206.6852 (2012).
  • Mansinghka et al. (2014) Vikash Mansinghka, Daniel Selsam, and Yura Perov. 2014. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099 (2014).
  • Mansinghka et al. (2013) Vikash K Mansinghka, Tejas D Kulkarni, Yura N Perov, and Josh Tenenbaum. 2013. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In Advances in Neural Information Processing Systems. 1520–1528.
  • Marjoram et al. (2003) Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. 2003. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100, 26 (2003), 15324–15328.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • Muggleton et al. ([n. d.]) Stephen H Muggleton, Ute Schmid, and Rishabh Singh. [n. d.]. Approaches and Applications of Inductive Programming. ([n. d.]).
  • Paige and Wood (2014) Brooks Paige and Frank Wood. 2014. A compilation target for probabilistic programming languages. arXiv preprint arXiv:1403.0504 (2014).
  • Perov ([n. d.]a) Yura Perov. [n. d.]a. AI1 project draft. [Online; accessed 01-August-2018].
  • Perov ([n. d.]b) Yura Perov. [n. d.]b. AI1 proposal draft. [Online; accessed 01-August-2018].
  • Perov ([n. d.]c) Yura Perov. [n. d.]c. AI1 proposal draft (two additional pieces). [Online; accessed 01-August-2018].
  • Perov (2015) Yura Perov. 2015. Applications of probabilistic programming. Ph.D. Dissertation. University of Oxford.
  • Perov and Wood (2016) Yura Perov and Frank Wood. 2016. Automatic sampler discovery via probabilistic programming and approximate Bayesian computation. In Artificial General Intelligence. Springer, 262–273.
  • Perov et al. (2015) Yura N Perov, Tuan Anh Le, and Frank Wood. 2015. Data-driven sequential Monte Carlo in probabilistic programming. arXiv preprint arXiv:1512.04387 (2015).
  • Rainforth (2018) Tom Rainforth. 2018. Nesting Probabilistic Programs. arXiv preprint arXiv:1803.06328 (2018).
  • Tolpin et al. (2015) David Tolpin, Jan-Willem van de Meent, and Frank Wood. 2015. Probabilistic programming in Anglican. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 308–311.
  • Wood et al. (2014) Frank Wood, Jan-Willem van de Meent, and Vikash Mansinghka. 2014. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics. 1024–1032.