\section{Introduction}



Software complexity increases with every requirement, feature, revision, module, or software 2.0 (i.e., Artificial Intelligence (AI)) component that is integrated. Traditional software engineering offers many tools and methodologies that mitigate complexity-related challenges (e.g., requirements engineering, version control systems, unit testing). However, the tight integration of AI components into programs is still in its infancy, and so are the methodologies and tools that allow combined analysis, development, testing, integration, and maintenance.

We present \emph{Probabilistic Software Modeling} (PSM), a data-driven modeling paradigm for predictive and generative methods in software engineering. PSM is an analysis methodology for traditional software (e.g., Java \cite{Arnold2000}) that builds a \emph{Probabilistic Model} (PM) of a program. The PM allows developers to reason about their program's semantics on the same level of abstraction as their source code (e.g., methods, fields, or classes) without changing the development process or programming language. This brings the advantages of probabilistic modeling and causal reasoning, fundamental in other domains (such as medical biology, material simulation, economics, and meteorology), to traditional software development. PSM enables applications such as test-case generation, semantic clone detection, or anomaly detection seamlessly for both traditional software and AI components with their inherent randomness. Our experiments indicate that PMs can model programs and allow for the causal reasoning and consistent data generation that these applications are built on.

PSM has four main aspects: \emph{Code} (structure), \emph{Runtime} (behavior), \emph{Modeling}, and \emph{Inference}. First, PSM extracts a program's \emph{structure} via static code analysis (\emph{Code}). The abstraction level is properties, executables, and types (e.g., fields, methods, and classes in Java); statements are ignored, which allows PSM to scale. Second, PSM inspects the program's \emph{behavior} by observing its runtime (\emph{Runtime}). This includes property accesses and executable invocations. Then, PSM combines the static structure and dynamic behavior into a probabilistic model (\emph{Modeling}). This step represents the main contribution of this work. Finally, predictive or generative applications (e.g., a test-case generator or anomaly detector) leverage the models via statistical inference (\emph{Inference}).

The prototype used for the evaluation is called \emph{Gradient} and is openly available.

First, Section \ref{sec:related work} views our contribution from the perspective of existing related domains. Section \ref{sec:example} introduces an illustrative example that we use throughout this paper. In Section \ref{sec:applications}, we motivate our contribution by providing an outlook on possible applications and research opportunities that PSM enables. Then we briefly discuss the nomenclature and background needed to understand PSM (Section \ref{sec:background}). Section \ref{sec:approach} presents the main contribution, containing the general usage pragmatics and construction methodologies for PSM models on a conceptual level. A comprehensive evaluation of whether software can be transformed into statistical models is given in Section \ref{sec:study} and discussed in Section \ref{sec:discussion}. Section \ref{sec:conclusion} concludes the paper.


\section{Related Work}\label{sec:related work}
To position PSM, it is useful to distinguish between \emph{programming paradigms} and \emph{software analysis methods}. A programming paradigm is a collection of programming languages that share common traits (e.g., object-oriented, logical, or functional programming). Analysis methods extract information from programs (e.g., design pattern detection, clone detection). PSM is an \emph{analysis method} that analyzes a program given in an object-oriented programming language and \emph{synthesizes} a probabilistic model from it.


Probabilistic programming is a programming paradigm in which probabilistic models are specified. Developers describe probabilistic programs in a domain-specific language (e.g., BUGS \cite{Lunn2009}) or via a library in a host language (e.g., Pyro \cite{Bingham2019}, PyMC \cite{Salvatier2016}, Edward \cite{Tran2017b}). In contrast, PSM analyzes a program written in a traditional programming language and translates it into a probabilistic program. This difference also holds for modeling concepts like \emph{Bayesian Networks} \cite{Koller2009} or \emph{Object-Oriented Bayesian Networks} \cite{Koller1997, Musella2015} that can be implemented via a probabilistic programming language.


Formal methods are a programming paradigm that leverages logic as a programming language (e.g., TLA+ \cite{Lamport2002} or Alloy \cite{Jackson2002}). \emph{Stochastic model checking} \cite{Kwiatkowska2007} introduces uncertainty into the rigid formalism to model, e.g., natural phenomena. Developers specify the behavior and provide the state transition probabilities in a special-purpose language (e.g., PRISM \cite{Kwiatkowska2011}, PAT \cite{Liu2011}, CADP \cite{Garavel2013}). Again, PSM analyzes a program and synthesizes a PM, allowing developers to work with the programming language of their choice.


Symbolic execution \cite{King1976} is an analysis method that executes a program with symbols rather than concrete values (e.g., JPF-SE \cite{Anand2007}, KLEE \cite{Cadar2008}, Pex \cite{Tillmann2008}). It can be used to determine which input values cause specific branching points (if-else branches) to be taken in a program. \emph{Probabilistic symbolic execution} \cite{Geldenhuys2012} is an extension that quantifies the execution, e.g., branching points, in terms of probabilities. This is useful for applications that quantify program changes \cite{Filieri2015} or performance \cite{Chen2016}. Probabilistic symbolic execution operates on the statement level, while PSM abstracts statements, capturing, e.g., the inputs and outputs of methods. This abstraction makes PSM computationally scalable, whereas symbolic execution suffers from state explosion. Furthermore, the abstraction shifts the analysis focus from statement semantics to program semantics (e.g., what happens between methods vs. what happens at an if statement).


Probabilistic debugging \cite{Xu2018, Andrzejewski2007} is an analysis method that supports developers in debugging sessions. The debugger assigns probabilities to each statement and updates them toward the most likely erroneous statement. Again, in contrast to PSM, these methods operate on the statement level. Another difference lies in the methodologies' life cycles. Debugging has an operational life cycle that is only valid until the bug is found. PSM models, in contrast, are intended to be persisted along with the matching source code revision. This allows, e.g., method-level error localization by comparing multiple revisions of the same model.


Invariant detectors \cite{Hangal2002, Ernst2001, Le2018, Zuo2014, Gore2011, Lo2012} learn assertions and add them to the source code. This helps to pinpoint erroneous regions in the source code. Invariant detectors learn rules about the value boundaries of statements (i.e., pre- and post-conditions), not the actual value distribution. This distribution, however, is what allows PSM to generate new data, enabling causal reasoning across multiple code elements.


\section{Illustrative Example}\label{sec:example}
Consider as our running example the \emph{Nutrition Advisor}, which takes a person's anthropometric measurements (height and weight) and returns textual advice based on the \emph{Body Mass Index} (BMI). Figure \ref{fig:nutrition advisor structural} shows the class diagram of the Nutrition Advisor, consisting of three core classes and the \code{Servlet} class. Classes considered by PSM are annotated with \emph{Model} (e.g., \code{Person}). Figure \ref{fig:nutrition advisor behavioral} depicts a sequence diagram of one program trace with concrete values. The \code{Servlet} receives properties (e.g., height, weight, or gender) with which it instantiates a \code{Person} object (not shown). \code{NutritionAdvisor.advice()} takes this \code{Person} object, extracts the \code{height} (\num{168.59}) and \code{weight} (\num{69.54}), and computes the person's BMI (\num{24.466}) via \code{BmiService.bmi()}. The result is textual advice based on the BMI (``You are healthy, try a \ldots''). Note that, for the sake of simplicity, Figure \ref{fig:nutrition advisor structural} only shows a subset of the code elements of the real Nutrition Advisor (e.g., \code{Person.name} and \code{Person.age} are omitted). Given a program such as the Nutrition Advisor, PSM can be used to build a network of probabilistic models with the same structure and behavior.
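The running example can be sketched in a few lines. This is a hypothetical Python rendition (the paper's subject systems are written in Java); the class and method names mirror the class diagram, and the advice strings are illustrative.

```python
# Hypothetical Python sketch of the Nutrition Advisor running example.
from dataclasses import dataclass


@dataclass
class Person:
    height: float  # in cm
    weight: float  # in kg


class BmiService:
    def bmi(self, height: float, weight: float) -> float:
        # Body Mass Index: weight (kg) divided by squared height (m).
        return weight / (height / 100.0) ** 2


class NutritionAdvisor:
    def __init__(self):
        self.bmi_service = BmiService()

    def advice(self, person: Person) -> str:
        bmi = self.bmi_service.bmi(person.height, person.weight)
        if bmi < 18.5:
            return "You are underweight, try a ..."
        if bmi < 25.0:
            return "You are healthy, try a ..."
        return "You are overweight, try a ..."
```

Calling `advice` with the trace values from the sequence diagram (height \num{168.59}, weight \num{69.54}) reproduces the BMI of roughly \num{24.466} and the "healthy" advice branch.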


\begin{figure}
  \begin{subfigure}[t]{0.37\textwidth}
    \includegraphics[width=\linewidth]{structural}
    \caption{[Code] The static structure of the Nutrition Advisor, consisting of three core classes and a context class (e.g., a web interface) calling the program.}
    \label{fig:nutrition advisor structural}
  \end{subfigure}
  \begin{subfigure}[t]{0.62\textwidth}
    \includegraphics[width=\linewidth]{behavioral}
    \caption{[Runtime] The dynamic behavior of the Nutrition Advisor, visualized by one execution trace. The \code{NutritionAdvisor} handles \code{advice} requests in which \code{Person} objects are received and textual advice is returned.}
    \label{fig:nutrition advisor behavioral}
  \end{subfigure}
  \caption{The Nutrition Advisor receives a person with its anthropometric measurements and computes textual advice regarding the person's diet. For simplicity, some properties and executables are omitted.}
\end{figure}

\section{Motivating Applications}\label{sec:applications}

PSM is a generic framework that enables a wide range of predictive and generative applications. This section lists a selection of possible applications.

\subsection{Predictive Applications}

Predictive applications seek to quantify, visualize, infer, and predict the behavior and quality of a system. \emph{Visualization and Comprehension} \cite{Jayaraman2017, Brown1985, Mukherjea1994} applications help to understand software and its behavior. This includes the visualization of code elements and non-functional attributes such as performance. The PMs are the source of the visualization, showing the global but also the contextual behavior across code elements. For example, Figure Document visualizes the \code{height} property, in which typical and less typical values can be seen at a glance. A conditioned view of the same model visualizes context-aware behavior, e.g., how gender affects height. \emph{Semantic Clone Detection} \cite{Gabel2008, Kim2011} applications detect syntactically different but semantically equivalent code fragments, e.g., the iterative and recursive versions of an algorithm. Traditionally, clone detection compares source code fragments, focusing on exact or slightly adapted clones. However, semantic equality is beyond purely static properties of source code. PSM can detect method-level clones by comparing their models. The comparison can be realized, for example, via statistical tests on sampled data \cite{Mann1947, Kruskal1952, Massey1951} (simple automated decision), via visualization techniques such as Q-Q plots \cite{Wilk1968} (comprehensive manual decision), or a combination of these. \emph{Anomaly Detection} \cite{Hangal2002, Aniello2016, Kotu2019, Chandola2009} applications measure the divergence between a persisted PSM model and newly collected observations. These applications can be deployed in a live system, in which components are monitored and checked against their models. A threshold flags unlikely runtime observations (i.e., observations whose likelihood under the model falls below the threshold), triggering additional actions in case of a failure. The anomalous observation and its effects on other elements can then be investigated with, e.g., visualization and comprehension techniques, for further decision-making.
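The anomaly-detection idea can be sketched with a toy density model. This sketch assumes a univariate Gaussian as a stand-in for a code element's PM (the prototype uses NVP density estimators instead); the observations and threshold are illustrative.

```python
# Likelihood-based anomaly detection sketch: fit a simple density model
# to a code element's runtime values and flag low-likelihood observations.
import math
import statistics


def fit_gaussian(observations):
    """Fit a Gaussian stand-in model to runtime observations."""
    return statistics.mean(observations), statistics.stdev(observations)


def likelihood(x, mu, sigma):
    """Density of x under the fitted model."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def is_anomaly(x, model, threshold=1e-3):
    """Flag runtime observations whose likelihood falls below the threshold."""
    mu, sigma = model
    return likelihood(x, mu, sigma) < threshold


# Heights observed at runtime vs. a corrupted observation (cm vs. m mix-up).
model = fit_gaussian([168.6, 172.3, 165.1, 170.8, 174.5, 169.9])
assert not is_anomaly(170.0, model)
assert is_anomaly(1.70, model)  # meters instead of centimeters
```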

\subsection{Generative Applications}

Generative applications leverage observations drawn from the models, e.g., executable inputs or property values. \emph{Test-Case Generation} \cite{Cseppento2017, Fraser2012} applications draw observations from executable and property models to generate test data. PSM can generate scoped test data with a specific likelihood or for a specific system scenario (system state). For instance, likelihood-scoped data can be used to generate different test suites, such as typical, rare, or unseen, by sampling observations $x$ with $a \leq P(x) \leq b$, where $a$ and $b$ are predefined boundaries of the likelihood. This helps to strengthen test suites with meaningful, automatically generated tests based on real (un)likely behavior. \emph{Simulation} applications sample execution traces from the network of models in a structured fashion to reproduce the running system. This probabilistically executes the original program without actually running it. Simulations can bridge boundaries between hardware and software interfaces, reducing the number of hardware dependencies during development.
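Likelihood-scoped sampling can be sketched as rejection-style filtering against the boundaries $a$ and $b$. The Gaussian model of \code{Person.weight} and the concrete boundaries below are assumptions for illustration; the prototype samples from NVPs instead.

```python
# Likelihood-scoped test-data generation sketch: keep only model samples
# whose density lies within predefined boundaries a and b.
import math
import random

MU, SIGMA = 70.0, 10.0  # assumed fitted model of Person.weight (kg)


def pdf(x):
    return math.exp(-((x - MU) ** 2) / (2 * SIGMA ** 2)) / (SIGMA * math.sqrt(2 * math.pi))


def scoped_samples(a, b, n=50, seed=0):
    """Draw model samples x with a <= P(x) <= b."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = rng.gauss(MU, SIGMA)
        if a <= pdf(x) <= b:
            out.append(x)
    return out


peak = pdf(MU)                                   # highest attainable density
typical = scoped_samples(0.5 * peak, peak)       # weights near the mode
rare = scoped_samples(0.01 * peak, 0.1 * peak)   # unusual but plausible weights
```

Here, the "typical" suite contains weights close to the mode, while the "rare" suite contains weights from the tails that are still covered by the model.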

\section{Background}\label{sec:background}


\begin{figure}
  \begin{subfigure}[t]{0.55\textwidth}
    \includegraphics[width=\linewidth]{structural-prob}
    \caption{[Modeling] The Probabilistic Model Network of the Nutrition Advisor (simplified). Elements within the Probabilistic Modeling Universe are modeled according to their probabilistic expressions. Triangles are properties, circles are executables, and rectangles are types. The superscripts represent property reads $\bm{R}$, executable invocations $\bm{Inv}$, parameters $\bm{Pa}$, and return values $\bm{Ret}$.}
  \end{subfigure}
  \begin{subfigure}[t]{0.43\textwidth}
    \includegraphics[width=\linewidth]{behavioral-prob}
    \caption{[Modeling] The distribution of the \code{Person.weight} property. The histogram shows the runtime observations that were sampled from the True Distribution (usually unknown). The Fitted Distribution is the model approximation based on the data.}
  \end{subfigure}
  \caption{The Nutrition Advisor system as a Probabilistic Model Network (left) and the model of the \code{Person.weight} node (right).}
\end{figure}

PSM combines two major domains: Software Engineering (SE) and Machine Learning (ML). Naturally, some terms can be misinterpreted depending on the reader's background. The following terminology was chosen as the best common ground and might be atypical in the respective domains.

\subsection{Code}

Types, properties, and executables are object-oriented terms (e.g., classes, fields, and methods in Java \cite{Arnold2000}; see Figure \ref{fig:nutrition advisor structural}). In the context of PSM, these are referred to as code elements. Code elements can be organized in an \emph{Abstract Semantics Graph} (ASG), which is a high-level version of an abstract syntax tree (AST). An ASG contains no lexical nodes but has additional semantic relationships (e.g., typing information of expressions). Also, in the context of PSM, we define that each code element has a symbol. A symbol is a unique numerical identifier for a code element.
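Symbol assignment can be sketched as a simple registry. This sketch assumes code elements are keyed by their fully qualified names; the ids are arbitrary, only uniqueness and stability matter.

```python
# Sketch of symbol assignment: every code element receives a unique,
# stable numerical identifier.
class SymbolTable:
    def __init__(self):
        self._symbols = {}

    def symbol(self, qualified_name: str) -> int:
        """Return the element's symbol, assigning a fresh id on first use."""
        if qualified_name not in self._symbols:
            self._symbols[qualified_name] = len(self._symbols)
        return self._symbols[qualified_name]


table = SymbolTable()
s_advice = table.symbol("NutritionAdvisor.advice")
s_weight = table.symbol("Person.weight")
assert s_advice != s_weight
assert table.symbol("NutritionAdvisor.advice") == s_advice  # ids are stable
```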

\subsection{Runtime}

Runtime monitoring (or dynamic code analysis) \cite{Ball1999} is the process of observing a running program. The program is executed by a trigger (parameters and environment), which is the context of the monitoring session. A running program spawns event streams, which are sequences of monitoring events (e.g., Figure \ref{fig:nutrition advisor behavioral}). These events contain information such as which properties were changed or which executables were invoked. Also, the stream shows which parts of the underlying source code are active under the given trigger. Tracing tracks every possible event at runtime, whereas sampling records events according to a specific rate.
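Event collection can be sketched with a decorator that records invocation events, roughly the role that aspect weaving plays for the Java subject systems. The event-dictionary layout is an assumption for illustration.

```python
# Runtime monitoring sketch: instrument an executable so that every
# invocation appends an event (name, parameters, return value) to a stream.
import functools

EVENT_STREAM = []


def monitored(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        EVENT_STREAM.append({
            "executable": func.__name__,
            "parameters": args,
            "return": result,
        })
        return result
    return wrapper


@monitored
def bmi(height_cm, weight_kg):
    return weight_kg / (height_cm / 100.0) ** 2


bmi(168.59, 69.54)
assert EVENT_STREAM[0]["executable"] == "bmi"
assert abs(EVENT_STREAM[0]["return"] - 24.47) < 0.01
```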

\subsection{Modeling}

A probabilistic model uses the theory of probability to model a complex system (e.g., the Nutrition Advisor). A random variable (e.g., $X$) captures an aspect of the system's event space. The value range of a random variable $X$ is given by $Val(X)$. A probability distribution $P$ is a mapping from events in the system to real values (e.g., in Figure Document, the histogram elements map to points on the Fitted Distribution line). These values are between $0$ and $1$, and all values sum up to $1$. The marginal distribution $P(X)$ describes the probability distribution of the random variable $X$. The joint distribution $P(X, Y)$ represents the probability distribution that can be described with all of the variables. A conditional distribution $P(X \mid Y)$ describes the probability distribution of $X$ given that some additional information about the random variable $Y$ was observed. $Y$ is called the conditional and scopes the distribution of $X$. More background information is given, e.g., by Koller and Friedman \cite{Koller2009}, Murphy \cite{Murphy2012}, or Bishop \cite{Bishop2006}. PSM is mostly interested in the conditional distribution of a code element given its invoking context, e.g., a property access with its context (e.g., the \code{advice} method). A probabilistic expression such as $P(X \mid Y)$ is the equivalent of pseudocode in SE. Pseudocode describes a process (e.g., a sorting algorithm) that can be parameterized with a concrete implementation and technology (e.g., a functional implementation in Haskell \cite{PeytonJones2003} or an object-oriented implementation in Java \cite{Arnold2000}). Similarly, a probabilistic expression can be parameterized via a stochastic model representing its quantity and, in consequence, its process. This work presents the modeling strategies (see Section \ref{sec:approach}) in the form of probabilistic expressions that our prototype parameterizes via \emph{Real Non-Volume Preserving} transformations (NVPs) \cite{Dinh2016}. NVPs are density estimators that allow efficient and exact inference, sampling, and likelihood estimation of data points.
NVPs learn an invertible bijective function $f\colon X \rightarrow Z$ (with inverse $g = f^{-1}$) that maps the original input variables $x$ to simpler latent variables $z$. The latent variables are often isotropic unit Gaussians, which are well understood in terms of sampling and likelihood evaluation. An NVP is a combination of multiple small neural networks, called coupling layers, that are composed via simple scale and translation transformations. Conditional NVPs are an extension that estimates $P(X \mid C)$.
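The coupling idea can be illustrated with a toy affine coupling layer. In a real NVP, the scale $s(\cdot)$ and translation $t(\cdot)$ are small neural networks; here they are fixed functions so that exact invertibility is easy to verify.

```python
# Toy affine coupling layer: one half of the input passes through
# unchanged; the other half is scaled and translated based on it.
import math


def s(x1):  # stand-in for the scale network
    return 0.5 * math.tanh(x1)


def t(x1):  # stand-in for the translation network
    return 0.1 * x1


def forward(x1, x2):
    """Couple (x1, x2) -> (z1, z2); x1 passes through unchanged."""
    z1 = x1
    z2 = x2 * math.exp(s(x1)) + t(x1)
    return z1, z2


def inverse(z1, z2):
    """Exact inverse of the coupling transformation."""
    x1 = z1
    x2 = (z2 - t(z1)) * math.exp(-s(z1))
    return x1, x2


x = (0.8, -1.3)
assert all(abs(a - b) < 1e-12 for a, b in zip(inverse(*forward(*x)), x))
```

Because $x_1$ is untouched, $s(x_1)$ and $t(x_1)$ can be recomputed in the inverse pass, which is what makes coupling layers invertible regardless of how complex the two networks are.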

\subsection{Inference}

Every PSM application in Section \ref{sec:applications} is built upon inference, which is the combination of sampling, conditioning, and likelihood evaluation. Each node in a PSM network is an NVP. Sampling with NVPs is done by sampling from the Gaussian latent space, $z \sim p_Z$, and applying the NVP in inverse, $x = f^{-1}(z)$. NVPs can be conditioned statically and dynamically. Static conditioning is achieved by adding additional features to the network during training. Dynamic conditioning finds latent-space configurations that match the condition via, e.g., variational inference \cite{Blei2017, Rezende2015}. Finally, likelihood evaluation is achieved by evaluating the likelihood under the Gaussian latent space times the absolute determinant of the NVP's Jacobian:
\begin{equation}
  p_X(x) = p_Z\big(f(x)\big)\,\left|\det\left(\frac{\partial f(x)}{\partial x^{T}}\right)\right|.
\end{equation}
More details are given by Dinh et al. \cite{Dinh2016}.
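The change-of-variables rule can be checked on a trivially invertible map. This sketch uses $f(x) = (x - \mu)/\sigma$ with a standard-Gaussian latent space (the NVP case replaces $f$ with a stack of coupling layers); the density recovered via the Jacobian matches the directly computed Gaussian density.

```python
# Change-of-variables likelihood sketch for a one-dimensional invertible map.
import math

MU, SIGMA = 70.0, 10.0


def f(x):
    """Invertible map into the latent space."""
    return (x - MU) / SIGMA


def latent_pdf(z):
    """Standard Gaussian latent density p_Z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


def model_pdf(x):
    jacobian = 1.0 / SIGMA  # |det df/dx| for this one-dimensional map
    return latent_pdf(f(x)) * jacobian


def direct_pdf(x):
    """N(x; MU, SIGMA) computed directly, for comparison."""
    return math.exp(-((x - MU) ** 2) / (2 * SIGMA ** 2)) / (SIGMA * math.sqrt(2 * math.pi))


assert abs(model_pdf(83.0) - direct_pdf(83.0)) < 1e-12
```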

\section{Approach}\label{sec:approach}

\begin{figure}
  \centering
  % overview graphic
  \caption{Source Code (1) has a Program Structure (2) and a Runtime Behavior (3) that are extracted via Static and Dynamic Code Analysis. These result in a Probabilistic Model Network (empty) (4) and Behavior Datasets (5) that are combined by Optimization (6) into the final Probabilistic Model Network (fitted) (7).}
\end{figure}

PSM is a four-fold approach illustrated in Figure Document in which:

  1. [Code] static code information is extracted and analyzed;

  2. [Runtime] runtime behavior is collected and transformed;

  3. [Modeling] probabilistic models are built by combining code and runtime data;

  4. [Inference] applications are built by leveraging causal reasoning and data generation.

The main contributions of this work are concepts and realizations in the Modeling aspect.

\subsection{Code}

The input is the Source Code (1) of a program (e.g., of the Nutrition Advisor). Then, Static Code Analysis extracts the Program Structure (2) in the form of an ASG. The class diagram in Figure \ref{fig:nutrition advisor structural} may act as an abstract substitute for the structure in this example. Elements that are to be modeled are annotated with the label \emph{Model}. In that regard, PSM is selective about the code elements considered for static and dynamic code analysis. The selection depends on the application context (see Section \ref{sec:applications}) or the developer's interest. The set of all code elements PSM considers is called the Modeling Universe.
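Selecting the Modeling Universe can be sketched with a class decorator standing in for the \emph{Model} annotation. The `model` decorator and the global registry are assumptions for illustration; the prototype performs this selection during static analysis of Java annotations.

```python
# Sketch of Modeling Universe selection: only annotated classes opt in.
MODELING_UNIVERSE = set()


def model(cls):
    """Class decorator marking a type for probabilistic modeling."""
    MODELING_UNIVERSE.add(cls.__name__)
    return cls


@model
class Person:
    pass


@model
class BmiService:
    pass


class Servlet:  # context class, intentionally not modeled
    pass


assert MODELING_UNIVERSE == {"Person", "BmiService"}
```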

\subsection{Runtime}

Dynamic Code Analysis extracts the Runtime Behavior (3) by executing the program with a trigger and monitoring the internal events. This results in an event stream similar to the sequence diagram in Figure \ref{fig:nutrition advisor behavioral}. Events are property accesses and executable invocations of code elements in the modeling universe. Depending on the application context (see Section \ref{sec:applications}), execution triggers can be, e.g., tests (weak) or the runtime of a deployed system (strong). For example, Visualization and Comprehension demands a trigger as close as possible to the real environment (manual understanding). In contrast, Semantic Clone Detection makes differential comparisons between models for which synthetic data suffices (automatic comparisons).

\subsection{Modeling}

PSM extracts the code element topology from the Program Structure and builds the Probabilistic Model Network (empty) (see Figure Document, step 4). From a software engineering perspective, this process is comparable to traversing the ASG and attaching an empty (unfitted) PM to every node. An example network is shown in Figure Document, where each node is a PM. The actual construction rules (probabilistic expressions) to build such a PSM network are given below. The Dataset Creation step tallies and pre-processes the event stream into Behavior Datasets (5) for each code element. The Model Parameter Optimization (6) fits each PM, i.e., each node in the Probabilistic Model Network, to the Behavior Dataset of the associated code element. This results in the Probabilistic Model Network (fitted) (7) with the same topology found in the Program Structure, optimized towards the observed Runtime Behavior.

\subsubsection{Construction Rules}

The construction rules define how each node in the Probabilistic Model Network (4), i.e., a given code element, is transformed into a probabilistic expression. This expression is a description of the model (random) variables and its approximating quantity (e.g., see Figure Document). Hence, building the PM network equals (i) a traversal of the program's ASG; (ii) an application of the construction rules creating a probabilistic expression (per node); and (iii) the parameterization of the expressions with a concrete model (e.g., VAEs). The property construction rule defines a property model by the property value itself, conditioned on the symbol of the accessing executable (conditional):
\begin{equation}
  P(\text{Property} \mid C) = P(\bm{R}, \bm{W} \mid C),
\end{equation}
where $\bm{R}$ and $\bm{W}$ are the read and write accesses to the property. For example, the \code{Person.weight} model is defined by $P(\bm{R}, \bm{W} \mid C)$ over the observed weight values. The value range of the property depends on the property itself, whereas the range of the conditional $C$ is the set $S$ of all (executable) symbols that exist in the project. This includes executable symbols that live outside the PSM Universe. The conditional allows PSM to differentiate between call sites, such that each call site can have a different distribution. For example, \code{NutritionAdvisorAdolesence} and \code{NutritionAdvisorAdult} both use the \code{BmiService}, leading to two slightly shifted weight distributions in the same model. The executable construction rule defines an executable model by a joint distribution of the inputs and outputs, conditioned on the symbol of the invoking executable:
\begin{align}
  P(\text{Executable} \mid C) &= P(I, O \mid C)\\
  &= P(\underbrace{\bm{Pa}, \bm{Inv}, \bm{R}}_{I}, \underbrace{\bm{W}, \bm{Ret}}_{O} \mid C),
\end{align}
where $\bm{Pa}$ are parameters, $\bm{Inv}$ are (executable) invocations, $\bm{R}$ are property reads, $\bm{W}$ are property writes, $\bm{Ret}$ are the return values, and $C$ ranges over the set $S$ of all (executable) symbols that exist in the project. An example would be the \code{bmi} method with $P(\bm{Pa}, \bm{Ret} \mid C)$, where $\bm{Pa}$ are the height and weight parameters and $\bm{Ret}$ is the returned BMI value (see Figure Document). The type construction rule defines a type model by the joint distribution of the properties the type declares, conditioned on the symbol of the accessing executable:
\begin{equation}
  P(\text{Type} \mid C) = P(\bm{Pr}_1, \ldots, \bm{Pr}_n \mid C),
\end{equation}
where $\bm{Pr}_1, \ldots, \bm{Pr}_n$ are the properties the type declares. For example, a \code{Person} object is defined by $P(\mathit{height}, \mathit{weight} \mid C)$. The type distribution is empty in the case where no properties exist, as Figure Document shows for the \code{bmiService} property in \code{NutritionAdvisor}. Sampling from a type distribution instantiates a new object of a given type by assigning the sampled values to the properties.
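The dataset side of the property rule can be sketched by tallying an event stream into rows of (conditional symbol, value). The tuple layout of the events is an assumption for illustration; the prototype's stream format differs.

```python
# Sketch of Behavior Dataset creation for a property model P(R, W | C):
# each row pairs an accessed value with the symbol C of the accessing
# executable.
events = [  # (kind, property, caller symbol C, value) -- assumed format
    ("read", "Person.weight", 7, 69.54),
    ("read", "Person.weight", 7, 81.20),
    ("write", "Person.weight", 3, 75.00),
    ("read", "Person.height", 7, 168.59),
]


def property_dataset(events, prop):
    """Rows (C, value) for one property, across reads and writes."""
    return [(caller, value)
            for kind, name, caller, value in events
            if name == prop]


rows = property_dataset(events, "Person.weight")
assert rows == [(7, 69.54), (7, 81.20), (3, 75.00)]
```

The resulting rows are exactly what the node's density estimator is later fitted on: the value dimension is modeled, and the caller symbol enters as the conditional $C$.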

\subsubsection{Technical Modeling Considerations}

PSM estimates the density of the values that code elements emit during runtime in the form of generative models. It searches for a model from which new samples can be drawn, and that compresses the original monitoring data into a fixed set of parameters. This goal stipulates a set of requirements with which the network nodes can be parameterized. The model should be a scalable, parametric, decidable, generative, (conditional) density estimator.

  • Scalable such that it can handle the enormous amounts of data running systems produce.

  • Parametric such that it has a fixed set of parameters that can be stored and shared.

  • Decidable such that the parameter optimization has a clear convergence criterion.

  • Generative such that it allows for efficient sampling of the approximated distribution.

  • A (Conditional) Density estimator that is capable of approximating arbitrary data.

In addition, the learning process should be as robust as possible to reduce human intervention. Each requirement is tied to a functional (generative, density estimator) or non-functional (scalable, parametric, decidable) concern of PSM. One class of models that fits many of these requirements is likelihood-based deep generative networks, such as Variational Auto-Encoders (VAEs) \cite{Kingma2013, Doersch2016, Sohn2015} or flow-based methods like the Real Non-Volume Preserving transformation \cite{Dinh2016} and its derivatives \cite{Grathwohl2018, Germain2015a, Papamakarios2019}. Another technical consideration is that the property, executable, and type expressions can be factorized into one another. That is, a real implementation does not need a model for each property, executable, and type but may combine them into one model. The prototype in this work exclusively uses executable models (see Section \ref{sec:study}).

\subsection{Inference}

Inference is the foundation of all applications motivated in Section \ref{sec:applications} and illustrated in Figure Document. The three tightly connected main aspects of inference are sampling (generation), conditioning (information propagation), and likelihood evaluation (criticism). Sampling draws observations from one (local) or multiple (global) nodes (NVPs) in the PSM network. This enables the probabilistic execution of, e.g., an executable or a subsystem. Conditioning sets the models into a specific state. For example, Figure Document illustrates the \code{height} property in its unconditioned and conditioned states. Local conditioning sets one node into a state. Global conditioning propagates a state across multiple nodes. Likelihood evaluation quantifies samples in terms of their likelihood under a given node (i.e., a model). Figure Document illustrates these aspects and combines them into causal forward (8) and backward (9) reasoning. Forward reasoning (8) (e.g., from \code{Person.height} to \code{BmiService.bmi}) samples a conditional distribution and propagates it through the network to set downstream nodes into a conditioned state. Backward reasoning (9) starts at a conditioned downstream node and searches for the most likely cause. At every step, it is possible to draw conditional or unconditional samples. The directional aspect (forward and backward) is based on the source code's dependency graph. PSM networks themselves, however, are undirected (a network of joint distributions).
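Backward reasoning can be sketched with rejection sampling: condition a downstream node (an observed BMI) and search upstream for heights whose forward simulation reproduces it. The Gaussian stand-in models and tolerances are assumptions for illustration; the prototype uses variational inference over NVP latent spaces instead.

```python
# Backward-reasoning sketch: given an observed downstream BMI, collect
# upstream heights that forward-simulate to (approximately) that BMI.
import random


def sample_height(rng):
    return rng.gauss(170.0, 8.0)    # assumed Person.height model (cm)


def sample_weight(rng):
    return rng.gauss(70.0, 10.0)    # assumed Person.weight model (kg)


def bmi(height, weight):
    return weight / (height / 100.0) ** 2


def likely_heights(observed_bmi, tolerance=0.5, n=50, seed=0):
    """Heights whose forward simulation reproduces the observed BMI."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n:
        h, w = sample_height(rng), sample_weight(rng)
        if abs(bmi(h, w) - observed_bmi) < tolerance:
            accepted.append(h)
    return accepted


heights = likely_heights(observed_bmi=30.0)
assert all(120.0 < h < 220.0 for h in heights)
```

Forward reasoning is the loop body read left to right (sample upstream, propagate downstream); backward reasoning keeps only the upstream samples consistent with the downstream condition.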

\section{Study}\label{sec:study}

The core hypothesis of PSM is that programs can be transformed into probabilistic models. This study (i.e., the prototype, research questions, analyses, and discussions) focuses on evaluating the core PSM methodologies presented in Section \ref{sec:approach}. Specifically, the study answers the following questions, providing evidence for the core hypothesis:

  1. RQ1 [Code] Are projects exposing enough code elements that are eligible for PSM?

  2. RQ2 [Runtime] Are code elements creating enough runtime data with which the model parameters can be optimized?

  3. RQ3 [Modeling] Are probabilistic models capable of capturing the runtime data of eligible code elements?

  4. RQ4 [Inference] Is the network of probabilistic models capable of solving inferential tasks?

RQ1 addresses the precondition of whether projects expose enough data (i.e., number or text) code elements that can be modeled. RQ2 addresses the precondition of whether these (data) code elements create a sufficient amount of runtime data that can be modeled. RQ3 addresses the central question of whether the behavior of a program, in the form of its runtime data, can be approximated via the concrete models. Finally, RQ4 evaluates the usefulness of the approach and whether PSM is a sound basis for the applications presented in Section \ref{sec:applications}. The four questions are scoped to structured programs that can be executed and support runtime monitoring. The empirical evidence in this work is essential for any future endeavor related to the statistical modeling of software. The evaluation of concrete applications of PSM described in Section \ref{sec:applications} is beyond the scope of this study.

\subsection{Setup}



\begin{table}
  \caption{Hyper-parameters used in the experiments.}
  \begin{tabular}{llll}
    \toprule
    \# & Stage & Name & Values \\
    \midrule
    1 & Data & Size & \numrange{20}{10000} \\
    2 & Data & Test Split & \num{10}\% \\
    3 & Preprocessing & Number & Standardization \\
    4 & Preprocessing & Discretization Threshold & 16 \\
    5 & Preprocessing & Discretization Encoding & Base 10 \\
    6 & Preprocessing & Text Encoding & Base 10 \\
    7 & Optimizer & Algorithm & Adam \cite{Kingma2014} \\
    8 & Optimizer & Learning Rate & \num{5e-4} \\
    9 & Optimizer & Weight Decay \cite{Krogh1992} & \num{5e-2} \\
    10 & Optimizer & Batch Size & full dataset \\
    11 & Optimizer & Max Epoch & \num{1000} \\
    12 & Optimizer & Early Stopping Patience & 20 epochs \\
    13 & NVP \cite{Dinh2016} & Coupling Count & \num{6} \\
    14 & Coupling Layer \cite{Dinh2016} & Linear Layer Count & \num{2} \\
    15 & Coupling Layer \cite{Dinh2016} & Hidden Units Count & \num{32} (low), \num{128} (high) \\
    16 & Coupling Layer \cite{Dinh2016} & Latent-Space & \\
    17 & Coupling Layer \cite{Dinh2016} & Translation Activations & GELU \cite{Hendrycks2016} \\
    18 & Coupling Layer \cite{Dinh2016} & Scale Activations & GELU \cite{Hendrycks2016}, Tanh \\
    \bottomrule
  \end{tabular}
\end{table}

We implemented a prototype called Gradient that reflects the process and data flow presented in Figure Document.

  1. The input Source Code comprised open-source subject systems written in Java (see the next section).

  2. The Program Structure was extracted using Spoon \cite{Pawlak2015}.

  3. \emph{AspectJ} was used to weave monitoring aspects (tracing) into the subject systems to capture their Runtime Behavior in the modeling universe.

  4. The Probabilistic Model Network (empty) was created by applying the rules from Section \ref{sec:approach} for each code element. The shape and size of the NVPs are given in Table Document.

  5. The Behavior Datasets were created by tallying the event stream. This includes splitting the dataset into training and evaluation partitions and preprocessing them. Preprocessing consisted of encoding text features by enumerating them (starting from 0) and encoding them in a base-10 vector space. The same procedure was applied to the conditional dimension. Number dimensions were considered discrete if \num{16} or fewer distinct values were found, and they underwent the same base-10 encoding procedure. Finally, all dimensions were standardized to have a mean of zero and a standard deviation of one.

  6. Model parameters were optimized with their datasets, and the best parameter setting (w.r.t. evaluation performance) was retained.

  7. Finally, the persisted models were used in the analysis scenarios (see Section Document).
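The preprocessing in step 5 can be sketched per data column. The thresholds follow the experiment table (at most 16 distinct values triggers discretization); the fixed-width digit layout of the base-10 encoding is an assumption for illustration.

```python
# Preprocessing sketch: discretize low-cardinality columns into base-10
# digit vectors, standardize all other number columns.
import statistics

DISCRETIZATION_THRESHOLD = 16
ENCODING_BASE = 10


def encode_base10(index, width):
    """Encode a category index as a fixed-width base-10 digit vector."""
    digits = []
    for _ in range(width):
        digits.append(index % ENCODING_BASE)
        index //= ENCODING_BASE
    return digits[::-1]


def preprocess(column):
    values = sorted(set(column))
    if len(values) <= DISCRETIZATION_THRESHOLD:
        width = max(1, len(str(len(values) - 1)))
        lookup = {v: i for i, v in enumerate(values)}
        return [encode_base10(lookup[v], width) for v in column]
    mu, sigma = statistics.mean(column), statistics.stdev(column)
    return [(v - mu) / sigma for v in column]


assert preprocess([5, 7, 5]) == [[0], [1], [0]]          # discrete branch
standardized = preprocess(list(range(100)))              # continuous branch
assert abs(statistics.mean(standardized)) < 1e-9
```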

The hyper-parameters of the experiments are given in Table Document. The chosen values are based on additional, unreported experiments evaluated on a synthetic dataset. All experiments were executed on a single machine (Intel i7, Nvidia GTX 970).

\subsection{Subject Systems}



\begin{table}
  \caption{Overview of the projects used in the study. LoC are the lines of code in a project.}
  \begin{tabular}{llrrr rrrr rrrr rrrrr}
    \toprule
    Project & Version & \#Files & \#LoC & Type & \multicolumn{4}{c}{Property} & \multicolumn{4}{c}{Parameter} & \multicolumn{5}{c}{Executable} \\
    \cmidrule(lr){6-9} \cmidrule(lr){10-13} \cmidrule(lr){14-18}
    & & & & & Data & Ref & Unk & Total & Data & Ref & Unk & Total & Data & Ref & Void & Unk & Total \\
    \midrule
    Nutrition Advisor & 0.1.0 & \num{5} & \num{154} & \num{5} & \num{11} & \num{3} & \num{1} & \num{15} & \num{19} & \num{1} & \num{0} & \num{20} & \num{10} & \num{0} & \num{19} & \num{1} & \num{30} \\
    Structurizr & 1.0.0 & \num{115} & \num{9941} & \num{123} & \num{229} & \num{85} & \num{24} & \num{338} & \num{725} & \num{342} & \num{26} & \num{1093} & \num{320} & \num{302} & \num{508} & \num{20} & \num{1150} \\
    jLatexmath & 1.0.7 & \num{156} & \num{21369} & \num{191} & \num{490} & \num{121} & \num{81} & \num{692} & \num{1115} & \num{556} & \num{153} & \num{1824} & \num{269} & \num{416} & \num{511} & \num{59} & \num{1255} \\
    PMD & 6.5.0 & \num{799} & \num{89349} & \num{981} & \num{1858} & \num{503} & \num{481} & \num{2842} & \num{2933} & \num{2910} & \num{1943} & \num{7786} & \num{3222} & \num{719} & \num{3445} & \num{2073} & \num{9459} \\
    \midrule
    & & \num{1075} & \num{120813} & \num{1300} & \num{2588} & \num{712} & \num{587} & \num{3887} & \num{4792} & \num{3809} & \num{2122} & \num{10723} & \num{3821} & \num{1437} & \num{4483} & \num{2153} & \num{11894} \\
    \bottomrule
  \end{tabular}

  Data = \{Number, Text\}, Ref = Reference, Unk = Unknown
\end{table}

The study uses four subject systems listed in Table Document. Nutrition Advisor is the running example introduced in Section Document. Structurizr [Structurizr2019] is a developer-focused software architecture visualization tool. jLatexmath [Opencollab2019] is a library for rendering LaTeX formulas. PMD [PMD2019] is a static code analysis tool for Java applications. All code elements of the projects were included in the modeling universe (excluding inherited third-party elements). Nutrition Advisor received \num{1000} advice requests as a trigger, with data based on the NHANES [Nhanes2013] dataset. jLatexmath and Structurizr were executed with examples provided in their documentation. PMD analyzed the Nutrition Advisor and output the results in HTML format. The subject systems and their triggers are openly available3 as a benchmark suite for future experiments and comparisons.

\thesubsection Controlled Variables

The study controls for one variable: Capacity.

  • Capacity: The capacity is the number of units in the linear layers of the NVPs (a low and a high setting).

\thesubsection Response Variables

The response is split into a quantitative and qualitative part. The quantitative part evaluates the Events per Code Element (ECE), Distinct Values per Code Element (DCE), and Negative Log-Likelihood (NLL). The qualitative part assesses the visual fidelity of the samples generated by the model compared to the original dataset and evaluates the usefulness of the PSM network via a scenario-based evaluation given in Section Document.

  • Events per Code Element (ECE): Measures the number of events emitted by code elements. This provides insight into the runtime activity of elements and how many models need to be fitted. We report ECE1 and ECE10 to distinguish between dependencies/constants and code elements that carry real behavior. ECE1 includes all code elements with at least one event (all code elements active at runtime). ECE10 includes only code elements that emitted at least 10 events at runtime.

  • Distinct Values per Code Element (DCE): Measures the number of distinct values emitted by code elements. This provides insight into the capacity models must have. We report DCE1 and DCE10 where DCE10 includes code elements with at least 10 distinct values.

  • Average Negative Log-Likelihood (NLL): Measures the average Negative Log-Likelihood (Equation Document) of data points under the model in natural units of information (nats; lower is better).
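The three response variables can be computed directly from a tallied event stream. The following is a minimal sketch; the `(element, value)` event format and the function names are simplifying assumptions for illustration, not the prototype's actual schema.

```python
import math
from collections import Counter

def ece(events, threshold=1):
    """Events per Code Element: event counts for elements with at
    least `threshold` events (ECE1, ECE10, ...)."""
    counts = Counter(element for element, _ in events)
    return {e: n for e, n in counts.items() if n >= threshold}

def dce(events, threshold=1):
    """Distinct Values per Code Element (DCE1, DCE10, ...)."""
    distinct = {}
    for element, value in events:
        distinct.setdefault(element, set()).add(value)
    return {e: len(vs) for e, vs in distinct.items() if len(vs) >= threshold}

def average_nll(log_likelihoods):
    """Average negative log-likelihood in nats (lower is better)."""
    return -sum(log_likelihoods) / len(log_likelihoods)
```

The thresholded variants (ECE10, DCE10) are just the same tallies filtered at `threshold=10`.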

\thesubsection Experiment Results

The study results are split into four groups: Code, Runtime, Modeling, and Inference.


The projects contained a total of \num{27804} property, parameter, and executable code elements. PMD is the largest project, containing \SI{76}{\percent} of the total code elements; Nutrition Advisor is the smallest, containing \SI{0.25}{\percent}. Most elements were executables (\SI{43}{\percent}) or parameters (\SI{39}{\percent}). \SI{42}{\percent} of the elements were data elements, i.e., had either a number or text type that is eligible for PSM modeling. \SI{22}{\percent} were references within the modeling universe, and the remaining \SI{36}{\percent} were elements of unknown type that were not within the modeling universe. Table Document shows detailed results per subject system, element type, and data type.




Table \thetable: Events is the number of events observed at runtime. ACT10 is the number of events observed on code elements with at least 10 events. DCT10 is the number of distinct values on code elements with at least 10 distinct values.
\toprule Project Data Type Events ACT10 DCT10
\cmidrule(lr){4-7} \cmidrule(lr){9-12} \cmidrule(lr){14-17} Mdn Q1 Q3 Total Mdn Q1 Q3 Total Mdn Q1 Q3 Total
\midrule \multirow{2}{*}{Nutrition Advisor} Data \num{1000} \num{1000} \num{1000} \num{21000} \num{1000} \num{1000} \num{1000} \num{21000} \num{524} \num{363} \num{824} \num{8040}
Others \num{1000} \num{252} \num{1001} \num{9008} \num{1001} \num{1000} \num{1501} \num{9002} \num{1000} \num{1000} \num{1000} \num{1000}
\multirow{2}{*}{Structurizr} Data \num{6} \num{2} \num{17} \num{35852} \num{25} \num{16} \num{67} \num{35041} \num{21} \num{13} \num{46} \num{2514}
Others \num{12} \num{3} \num{36} \num{58489} \num{34} \num{17} \num{104} \num{57607} \num{29} \num{16} \num{59} \num{3331}
\multirow{2}{*}{jLatexmath} Data \num{130} \num{15} \num{526} \num{6415336} \num{274} \num{61} \num{1297} \num{6414919} \num{39} \num{18} \num{81} \num{24495}
Others \num{66} \num{6} \num{530} \num{1377280} \num{257} \num{56} \num{1064} \num{1376553} \num{107} \num{30} \num{408} \num{42592}
\multirow{2}{*}{PMD} Data \num{35} \num{5} \num{154} \num{15069591} \num{117} \num{37} \num{267} \num{15068209} \num{39} \num{18} \num{91} \num{24511}
Others \num{18} \num{5} \num{117} \num{1882176} \num{64} \num{20} \num{185} \num{1879058} \num{30} \num{16} \num{123} \num{69569}
\midrule Total \num{21} \num{5} \num{138} \num{24868732} \num{83} \num{25} \num{306} \num{24861389} \num{39} \num{17} \num{102} \num{176052}

Mdn = Median, Q1/Q3 = Quartiles; Data = {Number, Text}, Others = {Reference, Unknown}

Monitoring sessions lasted for a median duration of \SI{136.55}{\second} (\IQR{3.27}{369.35}) and were executed concurrently with the modeling sessions of other projects. The median processing speed was \num{25101} events per second (\IQR{24727}{26283}). During the monitoring sessions, a total of \num{24868732} events were emitted by \num{6002} code elements (\SI{22}{\percent} of all code elements). \SI{36}{\percent} of these \num{6002} code elements emitted data (text or number) events. \SI{68}{\percent} of the events were generated by the PMD project, while the fewest (\SI{0.12}{\percent}) were generated by the Nutrition Advisor. \SI{87}{\percent} of the events were data (text or number) events, while the remaining \SI{13}{\percent} were either reference or unknown events. The event analysis shows that most events (\num{24861389}) occurred on \num{3868} (\SI{14}{\percent} of total) code elements; this excludes elements that emitted fewer than 10 events (ECE10). \SI{36}{\percent} of these \num{3868} code elements generated data (text or number) events. The percentages for the largest and smallest projects, as well as for the data types, match those of the full event set; differences in the central tendencies are given in Table Document. The distinct-value analysis shows that a total of \num{176052} distinct values were generated by \num{914} code elements (\SI{3.29}{\percent}); this excludes elements with fewer than 10 distinct values (DCE10). \SI{44}{\percent} of these \num{914} code elements generated data events. Most distinct values (\SI{53}{\percent}) came from the PMD project; the fewest (\SI{3.32}{\percent}) were generated by Structurizr. \SI{34}{\percent} of the distinct values were data values, while the remaining \SI{66}{\percent} were others.




Table \thetable: Model analysis results split across projects and capacity settings. Lower is better for NLL results.

\toprule Capacity Project Models Data Points Dimensions Training NLL Test NLL
\cmidrule(lr){5-8} \cmidrule(lr){10-13} \cmidrule(lr){15-18} \cmidrule(lr){20-23} Mdn Q1 Q3 Total Mdn Q1 Q3 Total Mdn Q1 Q3 Total Mdn Q1 Q3 Total
\midrule \multirow{4}{*}{Low} Nutrition Advisor \num{4} \num{1000} \num{1000} \num{1000} \num{4000} \num{6} \num{5} \num{8} \num{27} \num{-1.37} \num{-4.40} \num{1.92} \num{-4.44} \num{-1.61} \num{-4.51} \num{1.69} \num{-4.80}
Structurizr \num{50} \num{67} \num{31} \num{137} \num{14715} \num{3} \num{3} \num{4} \num{179} \num{-0.83} \num{-2.77} \num{1.75} \num{-48.86} \num{-0.93} \num{-2.95} \num{2.08} \num{-39.27}
jLatexmath \num{146} \num{393} \num{82} \num{1248} \num{206820} \num{4} \num{3} \num{7} \num{763} \num{-3.10} \num{-7.64} \num{1.06} \num{-617.12} \num{-3.10} \num{-7.91} \num{1.29} \num{-598.81}
PMD \num{574} \num{133} \num{56} \num{337} \num{454545} \num{4} \num{3} \num{5} \num{2511} \num{-3.96} \num{-6.84} \num{-3.15} \num{-3080.96} \num{-3.96} \num{-6.69} \num{-2.94} \num{-3034.99}
\midrule Low \num{774} \num{151} \num{56} \num{472} \num{680080} \num{4} \num{3} \num{5} \num{3480} \num{-3.95} \num{-6.67} \num{-1.96} \num{-3751.38} \num{-3.95} \num{-6.58} \num{-1.96} \num{-3677.88}
High \num{-3.95} \num{-7.22} \num{-2.03} \num{-3985.55} \num{-3.99} \num{-7.30} \num{-1.99} \num{-3946.18}
\bottomrule
Mdn = Median, Q1/Q3 = Quartiles

Table Document contains the detailed results of the low-capacity setting and the margins for the high-capacity setting. The total wall time to optimize the parameters of all models was \SI{195}{\minute} (\SI{111}{\minute} for high capacity). The median time one model needed to optimize in the low-capacity setting was \average{72.42}{55.21}{93.16} (\average{38.60}{29.11}{50.16} for high capacity). A total of \num{774} models were fitted; PMD accounted for \SI{74}{\percent} of them. In sum, \num{680080} data points were used in the process, where the Nutrition Advisor had the most data points available per model (\num{1000}). A total of \num{3480} dimensions exist across all models, where PMD accounts for \SI{72}{\percent} of all dimensions. However, the Nutrition Advisor models had the highest number of dimensions per model. \SI{62}{\percent} of the dimensions were related to continuous features and the remainder to discrete features. A total of \num{12787800} parameters were used (\average{15780}{15000}{16560}) in the low-capacity setting. The high-capacity setting had a total of \num{165172056} parameters (\average{210468}{207384}{213552}). Finally, all projects yielded a total test NLL of \num{-3677.88} (low capacity). On average, the models in the PMD project had the best NLL with \num{-3.96}, and the models in Structurizr the worst with \num{-0.93} (lower is better). No significant divergence between training and test NLL can be seen. The qualitative inspection of the models revealed a good approximation with two caveats. First, approximations are imprecise for categorical dimensions that include high-mass levels: the high-mass levels cause an increase of mass in the surrounding levels compared to the original data. Proximity in categorical data is introduced by the 10-ary encoding and the continuous nature of NVPs. Second, continuous dimensions with disconnected high-density modes are imprecise, as the modes become connected. This issue occurs more frequently in the low-capacity setting than in the high-capacity setting, indicating underfitted models.
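The NVP models are built from affine coupling layers in the spirit of RealNVP: half the dimensions pass through unchanged and parameterize a scale and shift for the other half, keeping the Jacobian determinant cheap to compute. The following numpy sketch uses randomly initialized stand-in weights and a single layer; it is not the prototype's actual architecture or training code.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """One RealNVP-style coupling layer. The first half of the
    dimensions is passed through and drives scale/shift for the rest.
    Weights are random stand-ins for trained linear layers."""
    def __init__(self, dim, hidden):
        self.half = dim // 2
        out = dim - self.half
        self.w1 = rng.normal(0, 0.1, (self.half, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, 2 * out))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        h = np.tanh(x1 @ self.w1) @ self.w2
        log_s, t = np.split(h, 2, axis=1)
        y2 = x2 * np.exp(log_s) + t
        log_det = log_s.sum(axis=1)  # log |det Jacobian| of the transform
        return np.concatenate([x1, y2], axis=1), log_det

def nll(x, layer):
    """Average NLL under a standard-normal base distribution after
    mapping the data through one coupling layer (change of variables)."""
    z, log_det = layer.forward(x)
    log_p = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=1) + log_det
    return -log_p.mean()
```

Capacity in the sense of the controlled variable corresponds to the `hidden` width of the linear layers; stacking several such layers with alternating splits yields the full flow.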




Figure \thefigure: Shows an inference example with a condition caused by a latent variable starting at the handle-method. Gender, only accessible in the handle-method, is conditioned to females. Height and weight are propagated while bmi jointly adapts to the condition. The last column shows a roundtrip of 10 (40 propagation hops) and its effect compared to the original distribution.

The qualitative assessment of the inference capabilities of PSM is split into two scenarios presented in Figure Document and Figure Document. These scenarios extend the running example by adding the \code{Servlet} to the Modeling Universe. The first scenario in Figure Document shows a simulation in which the Nutrition Advisor is conditioned on requests by women. The circles at the top illustrate the original call hierarchy and parts of the PSM network from Figure Document. Each node was fitted on the original data without any restrictions or conditions. The contour plots below show the height and weight variables in each model conditioned by gender (see Figure Document for the unconditional version). The density plots at the bottom present the bmi variable of the same respective model. In the background is the original unconditioned distribution (i.e., including males). Only the handle-model has direct access to the gender property. By iteratively sampling observations, propagating them, and conditioning the next model, the original conditional information (i.e., the restriction to females) flows through the network. This equals (probabilistic) executions of the program. Finally, Figure Document on the right shows the degree of information degradation in a forward and backward inference setting with 10 round-trips (40 information hops). Centers and shapes are mostly preserved, but a slight shift of variance can be seen. The density of the bmi variable was preserved over the 40 hops without any crucial loss of information.
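The propagation mechanism can be illustrated with a toy chain of two stand-in models (hypothetical Gaussian parameters, not the fitted NVPs): sampling from the conditioned upstream model and feeding the samples downstream lets the gender condition reach the bmi variable, even though the downstream model never sees gender directly.

```python
import random

random.seed(0)

# Toy stand-ins for two fitted models in the PSM network: the
# upstream model sees gender, the downstream model sees only
# height and weight. All distribution parameters are invented.
def sample_handle(gender):
    """Sample (height_cm, weight_kg) conditioned on gender."""
    if gender == "female":
        return random.gauss(162, 6), random.gauss(62, 8)
    return random.gauss(176, 7), random.gauss(78, 10)

def sample_bmi(height_cm, weight_kg):
    """Downstream model: bmi jointly adapts to the propagated samples."""
    return weight_kg / (height_cm / 100) ** 2

# Condition on females upstream, then propagate hop by hop.
samples = [sample_bmi(*sample_handle("female")) for _ in range(1000)]
mean_bmi = sum(samples) / len(samples)
```

Each sample-and-feed step corresponds to one information hop in the figure; chaining more models (and reversing the direction) yields the round-trips whose degradation the last column visualizes.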



Figure \thefigure: Shows an example for semantic testing and criticism where the null-model and the alt-model come from different teams. The clear difference between the return values was detected automatically and works identically for traditional software and software 2.0.

The second scenario in Figure Document assumes that Servlet and NutritionAdvisor are developed by Company A while BmiService is developed by Company B, which is specialized in AI. Company A uses the simple height/weight formula to stub the BmiService until Company B delivers its service based on a regression model. Company A has a PSM model of the system. Company A builds a second revision of its PSM model, including the new component it received from Company B (BmiService). The automated compatibility checks during continuous integration failed for the bmi code elements (in bmi(…) and advice(…)) but succeeded for all other elements. Revisiting the call graph in reverse order reveals a semantic error in the new component, illustrated in Figure Document. The inputs match (contour plots on the left), but the outputs diverge drastically (density plot on the right). The issue was that Company A uses the metric measurement system while Company B uses the imperial system. The scenario is based on real data; however, the regression model was substituted by the simple BMI formula given in its imperial form. Compatibility checks were done with Kolmogorov-Smirnov tests [Massey1951]. The remarkable aspect of this scenario is the ignorance of PSM regarding the true underlying implementation (code vs. AI model). Unit tests of the component and integration tests of depending components would need to ask the model for the correct assertion values given an input. Not only are such tests flawed, but every update of the model's parameters would trigger cascading changes in the tests. In contrast, PSM tests the behavior, not the code (semantic tests).
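The compatibility check boils down to a two-sample Kolmogorov-Smirnov comparison between the null-model and alt-model outputs. A self-contained sketch of the statistic, with the unit mismatch of the scenario reproduced on illustrative inputs (the significance-testing machinery of the real check is omitted):

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in sorted(set(a) | set(b)):
        fa = bisect.bisect_right(a, v) / len(a)
        fb = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap

# Illustrative metric inputs (kg, m); values are invented.
metric_inputs = [(70, 1.75), (62, 1.62), (80, 1.80)]
null_model = [w / h ** 2 for w, h in metric_inputs]          # metric formula
# The delivered component applies the imperial formula (703 * lb / in^2)
# to the same metric inputs -- the unit mismatch of the scenario:
alt_model = [703 * w / h ** 2 for w, h in metric_inputs]
```

Because the two output samples do not overlap at all, the statistic saturates at 1.0, which is the drastic divergence the continuous-integration check flags.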

\thesection Discussion

The results presented in Section Document provide direct or indirect evidence for the research questions in Section Document.

\thesubsection Code

The results of the code analysis (see Section Document) show that the total project size is secondary for PSM. Nearly half (\SI{43}{\percent}) of the code elements in a project are text or numbers and can be modeled. The remaining elements either reference eligible code elements or are external dependencies. This large proportion justifies the use of PSM for projects independent of their size (Document). In conclusion, projects, independent of their size, expose enough code elements eligible for PSM.

\thesubsection Runtime

The results of the runtime analysis (see Section Document) show that most events are related to actual data (\SI{87}{\percent}), providing evidence for Document and support for PSM. These data events are emitted by a rather small portion of the active code elements (\SI{14}{\percent}, ACT10). Regarding Document, this means that few models capture most of a program's behavior. Most of the variability is generated by few code elements (\SI{3.29}{\percent}). Nearly half of the variability is related to data (\SI{44}{\percent}) while the other half is mostly object references. In terms of Document, this means that the average capacity (free optimizable parameters) of models can be low, simplifying model maintenance and interpretation. In conclusion, active code elements create enough data (text or number) values for PSM.

\thesubsection Modeling

The results of the modeling analysis (see Section Document) show that most models have few dimensions, providing further empirical support for low-capacity models. The selected capacity does not hint at overfitting to specific portions of the data, given that training and test NLL are not significantly different. However, many low-dimensional discrete-only models can be replaced by Conditional Probability Tables (CPTs)4 [Koller2009] for a more efficient and precise representation. The qualitative inspections revealed high-quality models with good approximations, with two caveats (mass leakage and mode connectivity). Both issues are related to the capacity of the model (too high for discrete, too low for continuous dimensions), which adaptive model type and parameter selection can solve. In conclusion, the qualitative and quantitative assessments suggest that probabilistic models can approximate the behavior of a program.
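A CPT in this sense is simply a table of normalized tallies. A minimal sketch of fitting one from (condition, value) event pairs (the pair-based event format is a hypothetical simplification):

```python
from collections import Counter, defaultdict

def fit_cpt(pairs):
    """Fit a conditional probability table P(value | condition)
    from tallied (condition, value) events."""
    counts = defaultdict(Counter)
    for cond, value in pairs:
        counts[cond][value] += 1
    return {cond: {v: n / sum(c.values()) for v, n in c.items()}
            for cond, c in counts.items()}
```

For discrete-only code elements with few levels, such a table is exact and needs no optimization, which is why it can be both more efficient and more precise than a fitted NVP.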

\thesubsection Inference

The inference analysis (see Section Document) evaluated the usefulness of PSM models in two illustrative scenarios. The first scenario (Figure Document) illustrated multi-dimensional information (height and weight) propagation with latent factors (gender, only visible in \code{request}) across multiple models. The second scenario (Figure Document) focused on model/data evaluation in a software development context in which software and AI components are integrated. The scenarios distill the foundations on which any PSM application (see Section Document) is built: sampling (generation), conditioning (information propagation), and likelihood evaluation (criticism). In conclusion, the results show that local (within-model) and global (between-model) generation is sensitive to conditions, allowing consistent causal reasoning in PSM models.

\thesection Limitations

There are several limitations to the approach and the current prototype. The approach needs a structured program, and the program must be observable at runtime. Large methods that handle multiple tasks reduce the usefulness of PSM. The current prototype is focused on data. References are handles to objects that might contain data or further references. PSM naturally dereferences these handles since models only contain, e.g., properties that are accessed. This means that PSM is not useful for libraries whose only purpose is reference management, e.g., a collection library. The current prototype explodes lists into singular value assignments, i.e., a list of two elements acts as two assignments to a non-list variable. No order relationship between list elements is preserved, as is typical for distributions. Sequential models can alleviate this limitation. However, the usefulness is subject to the actual application that is realized.

\thesection Threats to Validity

An external threat to validity is the number of projects used in the study. Rigorous internal evaluation and projects of different sizes and types minimize this threat. Different sizes control for the expectation that larger projects have more elements and events, resulting in better models. Different project types (e.g., PMD as system software or jLatexmath as application software) control for the element-type distribution and the runtime content (user vs. synthetic data). Finally, the evaluation modeled all eligible code elements and measured the variance across the projects. The NLL across projects in Table Document does not hint at a by-chance good project selection.

\thesection Conclusion and Future Work

In this work, we presented Probabilistic Software Modeling (PSM), a data-driven approach for predictive and generative methods in software engineering. We have discussed applications, pragmatics, construction details, and technical considerations of PSM. We evaluated the viability and usability of PSM on multiple projects and discussed scenarios that provide insight into how PSM is used. The results have shown that PSM is not only viable but naturally integrates with software 2.0 (AI components). Our future work will focus on the realization and evaluation of applications and their comparison to the current state-of-the-art. In conclusion, PSM analyzes a program and synthesizes a probabilistic model that is capable of simulating and quantifying it. The resulting models are repeatable, persistable, shareable, and quantifiable representations and act as a foundation from which solutions can be derived.


  1. https://github.com/<blinded ORG>/gradient
  2. https://github.com/<blinded ORG>/gradient
  3. https://github.com/<blinded ORG>/gradient-benchmark
  4. A table encoding the probability per categorical level.