# Visual Causal Feature Learning

## Abstract

We provide a rigorous definition of the *visual cause* of a behavior that is broadly applicable to the visually driven behavior in humans, animals, neurons, robots and other perceiving systems. Our framework generalizes standard accounts of causal learning to settings in which the causal variables need to be constructed from micro-variables. We prove the Causal Coarsening Theorem, which allows us to gain causal knowledge from observational data with minimal experimental effort. The theorem provides a connection to standard inference techniques in machine learning that identify features of an image that *correlate* with, but may not *cause*, the target behavior. Finally, we propose an active learning scheme to learn a manipulator function that performs optimal manipulations on the image to automatically identify the visual cause of a target behavior. We illustrate our inference and learning algorithms in experiments based on both synthetic and real data.

## 1 Introduction

Visual perception is an important trigger of human and animal behavior. The visual cause of a behavior can be easy to define, say, when a traffic light turns green, or quite subtle: apparently it is the increased symmetry of features that leads people to judge some faces more attractive than others [10]. Significant scientific and economic effort is focused on visual causes in advertising, entertainment, communication, design, medicine, robotics and the study of human and animal cognition. Visual causes profoundly influence our daily activity, yet our understanding of what constitutes a visual cause lacks a theoretical basis. In practice, it is well known that images are composed of millions of variables (the pixels) but it is functions of the pixels (often called ‘features’) that have meaning, rather than the pixels themselves.

We present a theoretical framework and inference algorithms for visual causes in images. A visual cause is defined (more formally below) as a function (or *feature*) of raw image pixels that has a *causal effect* on the target behavior of a perceiving system of interest. We present three advances:

We illustrate our ideas using synthetic and real-data experiments. Python code that implements our algorithms, as well as reproduces some of the experimental results, is available online at http://vision.caltech.edu/~kchalupk/code.html.

We chose to develop the theory within the context of *visual* causes as this setting makes the definitions most intuitive and is itself of significant practical interest. However, the framework and results can be equally well applied to extract causal information from any aggregate of micro-variables on which manipulations are possible. Examples include auditory, olfactory and other sensory stimuli; high-dimensional neural recordings; market data in finance; consumer data in marketing. There, causal feature learning is both of theoretical (“What is the cause?”) and practical (“Can we automatically manipulate it?”) importance.

### 1.1 Previous Work

Our framework extends the theory of causal graphical models [28] to a setting in which the input data consists of raw pixel (or other micro-variable) data. In contrast to the standard setting, in which the macro-variables in the statistical dataset already specify the candidate causal relata, the causal variables in our setting have to be constructed from the micro-variables they supervene on, before any causal relations can be established. We emphasize the difference between our method of causal feature *learning* and methods for causal feature *selection* [12]. The latter choose the best (under some causal criterion) features from a restricted set of plausible macro-variable candidates. In contrast, our framework efficiently searches the whole space of all the possible macro-variables that can be constructed from an image.

Our approach derives its theoretical underpinnings from computational mechanics [26], but supports a more explicitly causal interpretation by incorporating the possibility of confounding and interventions. Since we allow for unmeasured common causes of the features in the image and the target behavior, we have to distinguish between the plain conditional probability distribution of the target behavior ($T$) given the (observed) image ($I$) and the distribution of the target behavior given that the observed image was manipulated (i.e. $P(T \mid I)$ vs. $P(T \mid \mathrm{man}(I))$). The authors of [15], who develop a similar model to investigate the relationship between causal micro- and macro-variables, avoid this distinction by assuming that all their data was generated from what in our setting would be the manipulated distribution $P(T \mid \mathrm{man}(I))$. We take the distinction between interventional and observational distributions to be one of the key features of a causal analysis. The extant literature on causal learning from image or video data does not generally consider the aggregation from pixel variables into causal macro-variables, but instead starts from annotated or pre-defined features of the image (see e.g. [6]).

### 1.2 Causal Feature Learning: An Example

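Figure 1 presents a paradigmatic case study in visual causal feature learning, which we will use as a running example. The contents of an image $I$ are caused by external, non-visual binary hidden variables $H_1$ and $H_2$ such that if $H_1$ is on, $I$ contains a vertical bar (v-bar), and if $H_2$ is on, $I$ contains a horizontal bar (h-bar). The target behavior $T$ is caused by $H_1$ and by the presence of an h-bar in the image.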

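We deliberately constructed this example such that the visual cause is clearly identifiable: manipulating the presence of an h-bar in the image will influence the distribution of the target behavior $T$. Thus, we can call the following function $C$ the *causal feature* of the image $I$ or the *visual cause* of $T$:

$$C(I) = \begin{cases} 1 & \text{if } I \text{ contains an h-bar,} \\ 0 & \text{otherwise.} \end{cases}$$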

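The presence of a v-bar, on the other hand, is not a causal feature. Manipulating the presence of a v-bar in the image has no effect on the hidden variable that generates v-bars or on the target behavior $T$. Still, the presence of a v-bar is as strongly correlated with the value of $T$ (via that hidden common cause) as the presence of an h-bar is. We will call the following function $S$ the *spurious correlate* of $T$ in the image $I$:

$$S(I) = \begin{cases} 1 & \text{if } I \text{ contains a v-bar,} \\ 0 & \text{otherwise.} \end{cases}$$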

Both the presence of h-bars and the presence of v-bars are good individual (and even better joint) predictors of the target variable, but only one of them is a cause. Identifying the visual cause from the image thus requires the ability to distinguish among the correlates of the target variables those that are actually causal, even if the non-causal correlates are (possibly more strongly) correlated with the target.
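The distinction between correlate and cause can be checked numerically on a feature-level caricature of the running example. The sketch below is illustrative only: the function names and every probability in it (hidden variables on with probability 0.5, a baseline of 0.1, and increments of 0.4) are assumptions chosen for the demonstration, not values from the experiments. It compares the difference in the probability of the behavior when a bar is merely observed with the difference when the bar is drawn or erased by manipulation, with the hidden variables left at their prior.

```python
from itertools import product

# All numbers below are illustrative assumptions, not values from the paper.
P_HIDDEN_ON = 0.5  # each hidden binary variable is on with this probability


def p_target(v_cause_on, hbar):
    """P(T = 1) given the hidden cause of the v-bar and the presence of an h-bar."""
    return 0.1 + 0.4 * v_cause_on + 0.4 * hbar


def weight(h1, h2):
    """Prior probability of one configuration of the two hidden variables."""
    p1 = P_HIDDEN_ON if h1 else 1 - P_HIDDEN_ON
    p2 = P_HIDDEN_ON if h2 else 1 - P_HIDDEN_ON
    return p1 * p2


def observational_gap(feature):
    """P(T=1 | feature present) - P(T=1 | feature absent) in observed images."""
    num, den = [0.0, 0.0], [0.0, 0.0]
    for h1, h2 in product((0, 1), repeat=2):
        vbar, hbar = h1, h2  # the hidden variables draw the bars
        value = vbar if feature == "vbar" else hbar
        num[value] += weight(h1, h2) * p_target(h1, hbar)
        den[value] += weight(h1, h2)
    return num[1] / den[1] - num[0] / den[0]


def manipulation_gap(feature):
    """The same difference when the bar is set by manipulating the image."""
    effect = [0.0, 0.0]
    for value in (0, 1):
        for h1, h2 in product((0, 1), repeat=2):
            # Manipulating the v-bar leaves the h-bar (and the hidden world) alone.
            hbar = value if feature == "hbar" else h2
            effect[value] += weight(h1, h2) * p_target(h1, hbar)
    return effect[1] - effect[0]


obs_v, obs_h = observational_gap("vbar"), observational_gap("hbar")
man_v, man_h = manipulation_gap("vbar"), manipulation_gap("hbar")
print(obs_v, obs_h, man_v, man_h)
```

Under these assumed numbers both bars are equally good predictors of the behavior, but only the manipulation of the h-bar changes its distribution.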

While the values of the causal feature and the spurious correlate in our example stand in a bijective correspondence to the values of the hidden variables that generate the h-bar and the v-bar, respectively, this is only to keep the illustration simple. In general, the visual cause and the spurious correlate can be probabilistic functions of any number of (not necessarily independent) hidden variables, and can share the same hidden causes.

## 2 A Theory of Visual Causal Features

In our example the identification of the visual cause with the presence of an h-bar is intuitively obvious, as the model is constructed to have an easily describable visual cause. But the example does not provide a theoretical account of what it takes to be a visual cause in the general case when we do not know what the causally relevant pixel configurations are. In this section, we provide a general account of how the visual cause is related to the pixel data.

### 2.1 Visual Causes as Macro-Variables

A visual cause is a high-level random variable that is a function (or feature) of the image, which in turn is defined by the random micro-variables that determine the pixel values. The functional relation between the image and the visual cause is, in general, surjective, though in principle it could be bijective. While we are interested in identifying the visual causes of a target behavior, the functional relation between the image pixels and the visual cause should not itself be interpreted as causal. Pixels do not *cause* the features of an image, they *constitute* them, just as the atoms of a table constitute the table (and its features). The difference between the causal and the constitutive relation is that the former requires the possibility of independent manipulation (at least to some extent), whereas by definition one cannot manipulate the visual cause without manipulating the image pixels.

The probability distribution over the visual cause is induced by the probability distribution over the pixels in the image and the functional mapping from the image to the visual cause. But since a visual cause stands in a constitutive relation with the image, we cannot without further explanation describe interventions on the visual cause in terms of the standard $do$-operation [21]. Our goal will be to define a macro-variable $C$, which contains all the causal information available in an image about a given behavior $T$, and define its manipulation. To make the problem approachable, we introduce two (natural) assumptions about the causal relation between the image and the behavior: (i) The value of the target behavior $T$ is determined subsequently to the image in time, and (ii) the variable $T$ is in no way represented in the image. These assumptions exclude the possibility that $T$ is a cause of features in the image or that $T$ can be seen as causing itself.

### 2.2 Generative Models: From Micro- to Macro-Variables

Let $T$ represent a target behavior.

Independent noise that may contribute to the target behavior is marginalized and omitted for the sake of simplicity in the above equation. The noise term incorporates any hidden variables which influence the behavior but stand in no causal relation to the image. Such variables are not directly relevant to the problem. Figure 2 shows this generative model.

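Under this model, we can define an *observational partition* of the space of images that groups images into classes that have the same conditional probability $P(T \mid I)$. Two images $i$ and $j$ belong to the same observational class if and only if

$$P(T \mid I = i) = P(T \mid I = j).$$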

In standard classification tasks in machine learning, the observational partition is associated with class labels. In our case, two images that belong to the same cell of the observational partition assign equal *predictive* probability to the target behavior. Thus, knowing the observational class of an image allows us to predict the value of $T$. However, the predictive probability assigned to an image does not tell us the *causal* effect of the image on $T$. For example, a barometer is widely taken to be an excellent predictor of the weather. But changing the barometer needle does not cause an improvement of the weather. It is not a (visual or otherwise) cause of the weather. In contrast, seeing a particular barometer reading may well be a *visual cause* of whether we pack an umbrella.

Our notion of a visual cause depends on the ability to manipulate the image.

The manipulation changes the values of the image pixels, but does not change the underlying “world”, represented in our model by the hidden variables that generated the image. Formally, the manipulation is similar to the $do$-operator for standard causal models. However, we here reserve the $do$-operation for interventions on causal *macro*-variables, such as the visual cause of $T$. We discuss the distinction in more detail below.

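We can now define the *causal partition* of the image space (with respect to the target behavior $T$). Writing $\mathrm{man}(I = i)$ for the manipulation that sets the image to $i$, two images $i$ and $j$ belong to the same causal class if and only if

$$P(T \mid \mathrm{man}(I = i)) = P(T \mid \mathrm{man}(I = j)).$$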

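The underlying idea is that images are considered causally equivalent with respect to $T$ if they have the same causal effect on $T$. Given the causal partition of the image space, we can now define the visual cause $C$ of $T$ as the random variable whose value stands in a bijective relation to the causal class of the image.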

The visual cause is thus a function over the image space, whose values correspond to the post-manipulation distributions $P(T \mid \mathrm{man}(I = i))$. We will write $C(i) = c$ to indicate that the causal class of image $i$ is $c$, or in other words, that in image $i$, the visual cause $C$ takes value $c$. Knowing $C$ allows us to predict the effects of a visual manipulation $\mathrm{man}(I = i)$, as long as we have estimated $P(T \mid \mathrm{man}(I = i^*))$ for one representative $i^*$ of each causal class.

### 2.3 The Causal Coarsening Theorem

Our main theorem relates the causal and observational partitions for a given image space and target behavior. It turns out that in general the causal partition is a coarsening of the observational partition. That is, the causal partition aligns with the observational partition, but the observational partition may subdivide some of the causal classes.

Throughout this article, we use “almost all” to mean “all except for a subset of Lebesgue measure zero”. Figure 3 illustrates the relation between the causal and the observational partition implied by the theorem. We note that the measure-zero subset where the causal partition does not coarsen the observational partition can indeed be non-empty. We provide such counter-examples in Appendix 7.
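Both partitions can be computed directly for a small discrete model. The following sketch is a toy illustration under stated assumptions, not the paper's implementation: the function names, the four-state model and its weights are hypothetical, and the manipulated distribution is computed by leaving the hidden variables at their prior, $\sum_h P(T \mid h, i) P(h)$. The first parameter setting yields a causal partition that coarsens the observational one; the second is a hand-tuned tie of the measure-zero kind, in which one observational class straddles two causal classes.

```python
import numpy as np


def partitions(p_h, p_i_given_h, p_t_given_hi, decimals=9):
    """Return P(T=1 | I=i) and P(T=1 | man(I=i)) for every image i.

    p_h:           shape (M,),   prior over hidden states
    p_i_given_h:   shape (M, N), P(I=i | H=h)
    p_t_given_hi:  shape (M, N), P(T=1 | H=h, I=i)
    """
    joint_hi = p_h[:, None] * p_i_given_h                      # P(H=h, I=i)
    obs = (p_t_given_hi * joint_hi).sum(axis=0) / joint_hi.sum(axis=0)
    man = (p_t_given_hi * p_h[:, None]).sum(axis=0)            # H keeps its prior
    return np.round(obs, decimals), np.round(man, decimals)


def causal_coarsens_observational(obs, man):
    """True if every observational class lies inside a single causal class."""
    return all(len(np.unique(man[obs == value])) == 1 for value in np.unique(obs))


def toy_model(w_hidden, w_hbar):
    """Four hidden states (h1, h2), four images (vbar, hbar); weights are hypothetical."""
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    p_h = np.full(4, 0.25)
    p_i_given_h = np.eye(4)                                    # hidden state draws the bars
    p_t = np.array([[0.1 + w_hidden * h1 + w_hbar * hbar for (_, hbar) in states]
                    for (h1, _) in states])
    return p_h, p_i_given_h, p_t


obs, man = partitions(*toy_model(0.3, 0.5))
generic_ok = causal_coarsens_observational(obs, man)

obs_tie, man_tie = partitions(*toy_model(0.4, 0.4))            # hand-tuned tie
tie_ok = causal_coarsens_observational(obs_tie, man_tie)
```

With the first weights the four images fall into four observational classes and two causal classes. With the tied weights two images with different causal effects happen to share the same predictive probability, so the coarsening fails, which is only possible because the parameters were chosen to coincide exactly.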

We prove the CCT in Appendix 6 using a technique that extends that of [19]: We show that (1) restricting the space of all the possible joint distributions of the model to only the distributions compatible with a fixed observational partition puts a linear constraint on the distribution space; (2) requiring that the CCT be false puts a non-trivial polynomial constraint on this subspace, and finally, (3) it follows that the theorem holds for almost all distributions that agree with the given observational partition. The proof strategy indicates a close connection between the CCT and the faithfulness assumption [28].

Two points are worth noting here: First, the CCT is interesting inasmuch as the visual causes of a behavior do not contain all the information in the image that predicts the behavior. Such information, though not itself a cause of the behavior, can be informative about the state of other non-visual causes of the target behavior. Second, the CCT allows us to take any classification problem in which the data is divided into observational classes, and assume that the causal labels do not change within each observational class. This will help us develop efficient causal inference algorithms in Section 3.

### 2.4 Visual Causes in a Causal Model Consisting of Macro-Variables

We can now simplify our generative model by omitting all the information in the image $I$ unrelated to behavior $T$. Assume that the observational partition refines the causal partition. Each of the causal classes delineates a region in the image space such that all the images belonging to that region induce the same $P(T \mid \mathrm{man}(I))$. Each of those regions, say the k-th one, can be further partitioned into sub-regions such that all the images in the m-th sub-region of the k-th causal region induce the same observational probability $P(T \mid I)$. By assumption, the observational partition has a finite number of classes, and we can arbitrarily order the observational classes within each causal class. Once such an ordering is fixed, we can assign an integer $m$ to each image $i$ belonging to the k-th causal class such that $i$ belongs to the m-th observational class among the observational classes contained in that causal class. By construction, this integer explains all the variation of the observational class within a given causal class. This suggests the following definition:

The spurious correlate $S$ is a well-defined function on the image space, whose value ranges between $1$ and the largest number of observational classes contained in any single causal class. Like the visual cause $C$, the spurious correlate is a macro-variable constructed from the pixels that make up the image. $C$ and $S$ together contain all and only the visual information in $I$ relevant to $T$, but only $C$ contains the causal information:

We prove the theorem in Appendix 8. It guarantees that the visual cause $C$ and the spurious correlate $S$ constitute the smallest-entropy macro-variables that encompass all the information about the relationship between $T$ and $I$. Figure 4 shows the relationship between $C$ and $S$, the image space and the observational and causal partitions schematically. $C$ is now a cause of $T$, $S$ correlates with $T$ due to the unobserved common causes, and any information irrelevant to $T$ is pushed into the independent noise variables (commonly not shown in graphical representations of structural equation models).
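The construction of the spurious correlate from the two partitions is mechanical once an ordering of the observational classes within each causal class has been fixed. A minimal sketch follows, assuming each image already carries a causal-class label and an observational-class label; the function name, the labels and the choice of sorted order are all hypothetical.

```python
def spurious_correlate(causal_classes, obs_classes):
    """Index each image's observational class within its causal class (1-based)."""
    members = {}
    for c, o in zip(causal_classes, obs_classes):
        members.setdefault(c, set()).add(o)
    # The ordering within a causal class is arbitrary; sorted order is one choice.
    rank = {c: {o: m for m, o in enumerate(sorted(obs), start=1)}
            for c, obs in members.items()}
    return [rank[c][o] for c, o in zip(causal_classes, obs_classes)]


causal = ["low", "high", "low", "high"]          # hypothetical causal classes
observed = [0.1, 0.6, 0.4, 0.9]                   # hypothetical values of P(T | I)
spurious = spurious_correlate(causal, observed)   # [1, 1, 2, 2]
```

In this toy case the pair of causal class and spurious-correlate value identifies the observational class of every image, which is the sense in which the two macro-variables jointly carry all the predictive information.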

The macro-variable model lends itself to the standard treatment of causal graphical models described in [21]. We can define interventions on the causal variables using the standard $do$-operation. The $do$-operator only sets the value of the intervened variable to the desired value, making it independent of its causes, but it does not (directly) affect the other variables in the system or the relationships between them (see the *modularity assumption* in [21]). However, unlike the standard case where causal variables are separated in location (e.g. *smoking* and *lung cancer*), the causal variables in an image may involve the same pixels: one macro-variable may be the average brightness of the image, whereas another may indicate the presence or absence of particular shapes in the image. An intervention on a causal variable using the $do$-operator thus requires that the underlying manipulation of the image respects the state of the other causal variables. That is, an intervention $do(C = c')$ on an image $i$ corresponds to a manipulation $\mathrm{man}(I = i')$ such that $C(i') = c'$ and $S(i') = S(i)$.

In some cases it can be impossible to manipulate the visual cause to a desired value without changing the spurious correlate. We do not take this to be a problem special to our case. In fact, in the standard macro-variable setting of causal analysis we would expect interventions to be much more restricted by physical constraints than we are with our interventions in the image space.

## 3 Causal Feature Learning: Inference Algorithms

Given the theoretical specification of the concepts of interest in the previous section, we can now develop algorithms to learn $C$, the visual cause of a behavior. In addition, knowledge of $C$ will allow us to specify a *manipulator function*: a function that, given any image, can return a maximally similar image with the desired causal effect.

The manipulator searches for an image closest to the input image among all the images with the desired causal effect. The meaning of “closest” depends on the image metric and is discussed further in Section 3.2 below. Note that the manipulator function can find candidates for the image manipulation underlying the desired causal manipulation but it does not check whether other variables in the system (in particular, the spurious correlate) remain in fact unchanged. Using the closest possible image with desired causal effect is a heuristic approach to fulfilling that requirement.

There are several reasons why we might want such a manipulator function:

The problem of visual causal feature learning can now be posed as follows: Given an image space and a metric $d$ on it, learn $C$, the visual cause of $T$, and the manipulator function.

### 3.1 Causal Effect Prediction

A standard machine learning approach to learning the relation between the image $I$ and the behavior $T$ would be to take an *observational dataset* of images paired with their observed behavior and learn a predictor whose training performance guarantees a low test error (so that its output approximates $P(T \mid I = i)$ for a test image $i$). In causal feature learning, low test error on observational data is insufficient; it is entirely possible that the image contains spurious information useful in predicting test labels which is nevertheless not causal. That is, the prediction may be highly accurate for observational data, but completely inaccurate for a prediction of the effect of a manipulation of the image (recall the barometer example). However, we can use the CCT to obtain a causal dataset from the observational data, and then train a predictor on that dataset. Algorithm ? uses this strategy to learn a function $C$ that, presented with any image $i$, returns its causal class $C(i)$. We use a fixed neural network architecture to learn $C$, but any differentiable hypothesis class could be substituted instead. Differentiability of $C$ is necessary in Section 3.2 in order to learn the manipulator function.

In Step ? the algorithm picks a representative member of each observational class. The CCT tells us that the causal partition coarsens the observational one. That is, in principle (ignoring sampling issues) it is sufficient to estimate $C(i)$ for just one image $i$ in an observational class in order to know that $C(j) = C(i)$ for any other image $j$ in the same observational class. The choice of the experimental method of estimating the causal class in Step ? is left to the user and depends on the behaving agent and the behavior in question. If, for example, $T$ represents whether the spiking rate of a recorded neuron is above a fixed threshold, estimating $P(T \mid \mathrm{man}(I = i))$ could consist of recording the neuron’s response to $i$ in a laboratory setting multiple times, and then calculating the probability of spiking from the finite sample. The causal dataset created in Step ? consists of the observational inputs and their causal classes. The causal dataset is acquired through a number of experiments that grows with the number of observational classes rather than with the number of images. The final step of the algorithm trains a neural network that predicts the causal labels on unseen images. The choice of the method of training is again left to the user.
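The labeling strategy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `build_causal_dataset` and `fake_experiment` are hypothetical names, the observational classes are assumed to be given, and the experiment is replaced by a stand-in that reads the causal class off a toy image encoded as a `(vbar, hbar)` pair. The final training step is omitted.

```python
def build_causal_dataset(images, obs_classes, query_causal_class):
    """Label every image with the causal class of its observational class.

    One experiment is run per observational class, on the first image seen
    from that class; by the CCT the result is shared by the whole class.
    """
    causal_label = {}
    for image, obs in zip(images, obs_classes):
        if obs not in causal_label:
            causal_label[obs] = query_causal_class(image)
    return [(image, causal_label[obs]) for image, obs in zip(images, obs_classes)]


queries = []


def fake_experiment(image):
    """Stand-in for a real experiment: the causal class is the h-bar flag."""
    queries.append(image)
    return image[1]


images = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 1), (1, 0)]   # toy (vbar, hbar) images
obs_classes = ["a", "b", "c", "d", "b", "c"]                 # assumed to be given
dataset = build_causal_dataset(images, obs_classes, fake_experiment)
```

Six images are labeled with four experiments here; in a realistic dataset the saving is the ratio of images to observational classes.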

### 3.2 Causal Feature Manipulation

Once we have learned $C$ we can use the causal neural network to create synthetic examples of images as similar as possible to the originals, but with a different causal label. The meaning of “as similar as possible” depends on the image metric $d$ (see Definition ?). The choice of $d$ is task-specific and crucial to the quality of the manipulations. In our experiments, we use a metric induced by a vector norm on the pixel values. Alternatives include other norm-induced metrics, distances in implicit feature spaces induced by image kernels [13] and distances in learned representation spaces [1].

Algorithm ? proposes one way to learn the manipulator function using a simple manipulation procedure that approximates the requirements of Definition ? up to local minima.

The algorithm, inspired by the active learning techniques of uncertainty sampling [18] and density weighting [24], starts off by training a causal neural network in Step ?. If only observational data is available, this can be achieved using Algorithm ?. Next, it randomly chooses a set of images to be manipulated, and their target post-manipulation causal labels. The loop that starts in Step ? then takes each of those images and searches for the image that, among the images with the same desired causal class, is closest to the original image. Note that the causal class boundaries are defined by the current causal neural net $C$. Since $C$ is in general a highly nonlinear function and it can be hard to find its inverse sets, we use an approximate solution. The algorithm thus finds the minimum of a weighted sum of $|C(j) - c|$ (the difference of the output image $j$’s label and the desired label $c$) and $d(i, j)$ (the distance of the output image $j$ from the original image $i$).
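The weighted-sum objective can be minimized by gradient descent whenever the causal predictor is differentiable. The sketch below makes several simplifying assumptions that are not taken from the paper: a single logistic unit stands in for the causal neural network, both terms are squared to keep the objective smooth, the distance is squared Euclidean, and the weight `alpha`, the step size and the toy vectors are arbitrary.

```python
import numpy as np


def predict(image, w, b):
    """Differentiable stand-in for the causal network: a logistic unit."""
    return 1.0 / (1.0 + np.exp(-(image @ w + b)))


def objective(candidate, original, target, w, b, alpha):
    """Weighted sum of label mismatch and distance from the original image."""
    label_term = (predict(candidate, w, b) - target) ** 2
    distance_term = np.sum((candidate - original) ** 2)
    return (1 - alpha) * label_term + alpha * distance_term


def manipulate(original, target, w, b, alpha=0.1, lr=0.1, steps=2000):
    """Gradient descent on the objective, starting from the original image."""
    candidate = original.astype(float).copy()
    for _ in range(steps):
        p = predict(candidate, w, b)
        grad = (2 * (1 - alpha) * (p - target) * p * (1 - p) * w
                + 2 * alpha * (candidate - original))
        candidate -= lr * grad
    return candidate


w = np.array([2.0, -1.0, 0.5, 0.0])          # last "pixel" is causally irrelevant
b = 0.0
original = np.array([-1.0, 1.0, 0.0, 3.0])   # predicted label close to 0
manipulated = manipulate(original, target=1.0, w=w, b=b)
```

Because the gradient of the label term is proportional to `w`, the pixel with zero weight is never touched, which mirrors the goal of changing the causal label while preserving causally irrelevant content. Like any local search, this finds a minimum only up to local optima.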

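At each iteration, the algorithm performs a batch of manipulations and the same number of causal queries to the agent, which result in new datapoints, each a manipulated image paired with its true causal label. It is natural to claim that the manipulator performs well if, for many of these datapoints, the target causal label agrees with the true causal label. We thus define the *manipulation error* of an iteration as the fraction of the manipulations performed in that iteration whose true causal label differs from the target label.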

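While it is important that our manipulations are accurate, we also want them to be minimal. Another measure of interest is thus the *average manipulation distance*, the mean distance under the image metric between each original image and its manipulated version over the manipulations of an iteration.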

A natural variant of Algorithm ? is to set the number of iterations to a large integer and break the loop when one or both of these performance criteria reaches a desired value.
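Both performance criteria are simple to compute from one iteration's batch. The following sketch assumes the error is measured as a fraction of mismatched labels and the distance as a Euclidean norm; the function names and the toy batch are hypothetical.

```python
import numpy as np


def manipulation_error(target_labels, true_labels):
    """Fraction of manipulations whose true causal label misses the target."""
    target_labels = np.asarray(target_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean(target_labels != true_labels))


def average_manipulation_distance(originals, manipulated):
    """Mean Euclidean distance between original and manipulated images."""
    diffs = np.asarray(manipulated, dtype=float) - np.asarray(originals, dtype=float)
    flat = diffs.reshape(len(diffs), -1)           # one row per image
    return float(np.mean(np.linalg.norm(flat, axis=1)))


# Toy batch of four manipulations on two-pixel "images".
targets = [1, 0, 1, 1]
answers = [1, 0, 0, 1]                              # labels returned by the agent
originals = np.zeros((4, 2))
manipulated = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

error = manipulation_error(targets, answers)
distance = average_manipulation_distance(originals, manipulated)
```

A stopping rule for the variant described above would compare `error` and `distance` against user-chosen thresholds after each iteration.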

## 4 Experiments

In order to illustrate the concepts presented in this article we perform two causal feature learning experiments. The first experiment, called grating, uses observational and causal data generated by the model from Section 1.2. The grating experiment confirms that our system can learn the ground truth cause and ignore the spurious correlates of a behavior. The second experiment, mnist, uses images of hand-written digits [17] to exemplify the use of the manipulator function on slightly more realistic data: in this example, we transform an image into a maximally similar image with another class label.

We chose problems that are simple from the computer vision point of view. Our goal is to develop the theory of visual causal feature learning and show that it has feasible algorithmic solutions; we are at this point not engineering advanced computer vision systems.

### 4.1 The Grating Experiment

In this experiment we generate data using the model of Figure 1, with two minor differences: each hidden variable induces only one v-bar or h-bar in the image, and we restrict our observational dataset to images with only about 3% of the pixels filled with random noise (see Fig. ?). Both restrictions increase the clarity of presentation. We use Algorithms ? and ? (with minor modifications imposed by the binary nature of the images) to learn the visual cause of the target behavior.

Figure ? (top) shows the progress of the training process. The first step (not shown in the figure) uses the CCT to learn the causal labels on the observational data. We then train a simple neural network (a fully connected network with one hidden layer of 100 units) on this data. The same network is used on Iteration 1 to create new manipulated exemplars. We then follow Algorithm ? to train the manipulator iteratively. Fig. ? (bottom) illustrates the difference between the manipulator on Iteration 1 (which fails almost 40% of the time) and Iteration 20, where the error is about 6%. Each column shows example manipulations of a particular kind. Columns with green labels indicate successful manipulations of which there are two kinds: switching the causal variable on (“adding the h-bar”), or switching it off (“removing the h-bar”). Red-labeled columns show cases in which the manipulator failed to influence the cause: That is, each red column shows an original image and its manipulated version which the manipulator believes should cause a change in the target behavior, but which does not induce such change. The red/green horizontal bars show the percentage of success/error for each manipulation direction. Fig. ? (bottom, a) shows that after training on the causally-coarsened observational dataset, the manipulator fails about 40% of the time. In Fig. ? (b), after twenty manipulator learning iterations, only six manipulations out of a hundred are unsuccessful. Furthermore, the causally irrelevant image pixels are also much better preserved than at iteration 1. The fully-trained manipulator correctly learned to manipulate the presence of the h-bar to cause changes in the target behavior, and ignores the v-bar that is strongly correlated with the behavior but does not cause it.

### 4.2 The MNIST on MTurk Experiment

In this experiment we start with the MNIST dataset of handwritten digits. In our terminology, this – as well as any standard vision dataset – is already causal data: the labels are assigned in an experimental setting, not “in nature”.

Consider the following binary human behavior: $T = 1$ if a human observer answers affirmatively to the question “Does this image contain the digit ‘7’?”, while $T = 0$ if the observer judges that the image does not contain the digit ‘7’. For simplicity we will assume that the response to any image is deterministic, so that the probability of an affirmative answer is either 0 or 1. Our task is to learn the manipulator function that will take any image and modify it minimally such that it will become a ‘7’ if it was not before, or will stop resembling a ‘7’ if it did originally.

We conduct the manipulator training separately for all ten MNIST digits using human annotators on Amazon Mechanical Turk. The exact training procedure is described in Appendix 10. Figure 5 (top) shows training progress. As in Fig. ?, the manipulation error decreases with training. Figure 5 (bottom) visualizes the manipulator training progress. In the first row we see a randomly chosen MNIST “9” being manipulated to resemble a “0”, pushed through successive “0-vs-all” manipulators trained at iterations 0, 1, ..., 5 (iteration 1 shows what the neural net takes to be the closest manipulation to change the “9” to a “0” purely on the basis of the non-manipulated data). Further rows perform similar experiments for the other digits. The plots show how successive manipulators progressively remove the original digits’ features and add target class features to the image.

## 5 Discussion

We provide a link between causal reasoning and neural network models that have recently enjoyed tremendous success in the fields of machine learning and computer vision [17]. Despite very encouraging results in image classification [16], object detection [5] and fine-grained classification [3], some researchers have found that visual neural networks can be easily fooled using adversarial examples [29]. The learning procedure for our manipulator function could be viewed as an attempt to train a classifier that is robust against such examples. The procedure uses causal reasoning to improve on the boundaries of a standard, correlational classifier (Fig. ? and Figure 5 show the improvement). However, the ultimate purpose of a causal manipulator network is to extract truly causal features from data and automatically perform causal manipulations based on those features.

A second contribution concerns the field of causal discovery. Modern causal discovery algorithms presuppose that the set of causal variables is well-defined and meaningful. What exactly this presupposition entails is unclear, but there are clear counter-examples: $x$ and $2x$ cannot be two distinct causal variables. There are also well understood problems when causal variables are aggregates of other variables [4]. We provide an account of how causal macro-variables can supervene on micro-variables.

This article is an attempt to clarify how one may construct a set of well-defined causal macro-variables that function as basic relata in a causal graphical model. This step strikes us as essential if causal methodology is to be successful in areas where we do not have clearly delineated candidate causes or where causes supervene on micro-variables, such as in climate science and neuroscience, economics and—in our specific case—vision.

#### Acknowledgements

KC’s work was funded by the Qualcomm Innovation Fellowship 2014. KC’s and PP’s work was supported by the ONR MURI grant N00014-10-1-0933. FE would like to thank Cosma Shalizi for pointers to many relevant results this paper builds on.

## 6 Appendix: Proof of the Causal Coarsening Theorem

Before we prove the Causal Coarsening Theorem, we prove a less general version of it in order to split the rather complex proof of the CCT into two parts. This Auxiliary Theorem can be proven using simpler techniques; here, however, we deliberately use techniques that transfer directly to the proof of the CCT.

Our proof is inspired by a proof used by [19] to prove that almost all distributions compatible with a given causal graph are faithful. The proof strategy is thus first to express the proposition that for a given distribution, the observational partition does not refine the causal partition as a polynomial equation on the space of all distributions compatible with the model. We then show that this polynomial equation is not trivial, i.e. there is at least one distribution that is not its root. By a simple algebraic lemma, this will prove the theorem. We extend [19]’s proof technique in our usage of Fubini’s Theorem for the Lebesgue integral. It allows us to “split” the polynomial constraint into multiple different constraints along several of the distribution parameters. This allows for additional flexibility in creating useful assumptions (in our proof, the assumption that the datapoints have well-defined causal classes, but the observational class can still vary freely).

Assume that is binary and , are discrete variables (say , though can be very large. We will use the notation for simplicity later on). The discreteness assumption is not crucial, but will simplify the reasoning. We can factorize the joint as . can be parametrized by parameters, by parameters, and by another parameters, all of which are independent. Call the parameters, respectively,

We will denote parameter vectors as

where the indices are arranged in lexicographical order. This creates a one-to-one correspondence of each possible joint distribution with a point , where is the -dimensional simplex of multinomial distributions.

To proceed with the proof, we first pick any point in the space: that is, we fix the values of and . The only free parameters are now for all values of ; varying these values creates a subset of the space of all the distributions which we will call

is a subset of isometric to the -dimensional simplex of multinomials. We will use the term to refer both the subset of and the lower-dimensional simplex it is isometric to, remembering that the latter comes equipped with the Lebesgue measure on .

Now we are ready to show that the subset of which does not satisfy the Causal Coarsening constraint is of measure zero with respect to the Lebesgue measure. To see this, first note that since and are fixed, each image has a well-defined causal class . The Causal Coarsening constraint says “For every pair of images such that it holds that .” The subset of of all distributions that do not satisfy the constraint consists of the for which for some it holds that

Take any pair for which (if such a pair does not exist, then the Causal Coarsening constraint holds for all the distributions in ). We can write

Since the same equation applies to , the constraint can be rewritten

which we can rewrite in terms of the independent parameters (after defining and ) and further simplify as

which is a polynomial constraint on (note that to keep the notation manageable, we have omitted the dependent term from the equations). By a simple algebraic lemma [20], if the above constraint is not trivial (that is, if there exists for which the constraint does not hold), the subset of on which it holds is measure zero.

To see that Eq. does not always hold, note that if for *any* we set (and thus for any ) and , the equation reduces to

Thus if Eq. was trivially true, we would have or for all . However, this implies , which contradicts our assumption.
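
To make the polynomial constraint and the non-triviality argument concrete, the sketch below writes them out in notation assumed here for illustration: $\alpha_{h,i} = P(T=0 \mid H=h, I=i)$, $\beta_{i,h} = P(I=i \mid H=h)$ and $\gamma_h = P(H=h)$, with $T$ the binary behavior, $H$ the hidden variable and $I$ the image. The particular substitution shown is one choice that yields the stated reduction and is an assumption of the sketch. As in the text, the dependent parameters are left out.

```latex
% Observational probability for image i (requires P(I=i) > 0)
P(T \mid I=i) = \frac{\sum_h \alpha_{h,i}\,\beta_{i,h}\,\gamma_h}{\sum_{h'} \beta_{i,h'}\,\gamma_{h'}}

% P(T | I=i) = P(T | I=j), cross-multiplied by P(I=i) P(I=j):
\Big(\sum_h \alpha_{h,i}\,\beta_{i,h}\,\gamma_h\Big)\Big(\sum_{h'} \beta_{j,h'}\,\gamma_{h'}\Big)
  = \Big(\sum_{h'} \alpha_{h',j}\,\beta_{j,h'}\,\gamma_{h'}\Big)\Big(\sum_{h} \beta_{i,h}\,\gamma_{h}\Big)

% Collecting terms gives a polynomial in the free beta parameters:
\sum_{h}\sum_{h'} \big(\alpha_{h,i} - \alpha_{h',j}\big)\,\gamma_h\,\gamma_{h'}\,\beta_{i,h}\,\beta_{j,h'} = 0

% Substitution: for a chosen h^*, put \beta_{i,h} = \beta_{j,h} = 0 for h \neq h^*
% and \beta_{i,h^*}, \beta_{j,h^*} > 0. Only the term h = h' = h^* survives:
\big(\alpha_{h^*,i} - \alpha_{h^*,j}\big)\,\gamma_{h^*}^{2}\,\beta_{i,h^*}\,\beta_{j,h^*} = 0

% If the constraint held everywhere, then for every h^*:
\alpha_{h^*,i} = \alpha_{h^*,j} \quad \text{or} \quad \gamma_{h^*} = 0

% and therefore the two images would share a causal class:
\sum_h \alpha_{h,i}\,\gamma_h = \sum_h \alpha_{h,j}\,\gamma_h
```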

We have now shown that the subset of which consists of distributions for which (even though ) is Lebesgue measure zero. Since there are only finitely many pairs of images for which , the subset of of distributions which violate the Causal Coarsening constraint is also Lebesgue measure zero. The remainder of the proof is a direct application of Fubini’s theorem.

For each , call the (measure zero) subset of that violates the Causal Coarsening constraint . Let be the set of all the joint distributions which violate the Causal Coarsening constraint. We want to prove that , where is the Lebesgue measure. To show this, we will use the indicator function

By the basic properties of positive measures we have

It is a standard application of Fubini’s Theorem for the Lebesgue integral to show that the integral in question equals zero. For simplicity of notation, let

We have

Equation follows as restricted to is the indicator function of .
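
The Fubini step can be summarized in one chain of equalities. The notation is assumed for this sketch: $\mathcal{P}$ is the full parameter space, $V \subset \mathcal{P}$ the set of violating distributions, $V_{\alpha,\gamma}$ its slice at fixed $(\alpha,\gamma)$, and $\mu_\beta$ the Lebesgue measure on the slice. The inner integral vanishes because each slice was shown above to be measure zero.

```latex
\mu(V) = \int_{\mathcal{P}} \mathbf{1}_V \, d\mu
       = \int_{(\alpha,\gamma)} \left( \int_{\beta} \mathbf{1}_V(\alpha,\beta,\gamma)\, d\beta \right) d(\alpha,\gamma)
       = \int_{(\alpha,\gamma)} \mu_\beta\big(V_{\alpha,\gamma}\big)\, d(\alpha,\gamma)
       = \int_{(\alpha,\gamma)} 0 \; d(\alpha,\gamma)
       = 0
```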

This completes the proof that , the set of joint distributions over and that violate the Causal Coarsening constraint, is measure zero.

We are now ready to prove the main theorem.

Any variables that appear in this proof without definition are defined in the proof of the Auxiliary Theorem. We take the same parametrization of distributions. Fixing an observational partition means fixing a set of observational constraints (OCs)

where is the number of observational classes. Since , is an independent parameter in the unrestricted , and the OCs reduce the number of independent parameters of the joint by . We want to express this parameter-space reduction in terms of the and parameterization and then apply the proof of the Auxiliary Theorem. To do this, for each observational class , choose a representative image such that

Then for each it holds that

or

Picking an arbitrary , we can separate the left-hand side as

Finally, this equation can be rewritten in terms of and as

or

for any . There are precisely such equations, altogether equivalent to the observational constraints. Thus we can express any distribution that is consistent with a given observational partition in terms of the full range of and parameters, and a restricted number of independent parameters. The rest of the proof now follows similarly to the proof of the Auxiliary Theorem and shows that, within this restricted parameter space, the set of parameters for which the (fixed) observational partition is not a refinement of the causal partition has measure zero.

## 7 Appendix: CCT Examples and Counter-Examples

In Figure 6 we provide examples of three distributions over binary variables and three-valued . The first model induces a causal partition that is a proper coarsening of the observational partition, and thus agrees with the CCT. The second model induces an observational partition that is a proper coarsening of the causal partition – CCT implies that this is a measure-zero case and that, after fixing the observational partition, we had to carefully tweak the parameters to align the causal partition as it is. The third model induces causal and observational partitions that are incompatible – that is, neither is a coarsening of the other. This is also a measure-zero case. We provide a Tetrad (http://www.phil.cmu.edu/tetrad/) file that contains these three models at http://vision.caltech.edu/~kchalupk/code.html. It can be used to verify our observational and causal partition computations.
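
The partition computations can also be checked numerically. The following is a minimal sketch, not the Figure 6 models: the parameter values are made up for illustration, and the notation (`alpha[h, i]` for $P(T=0 \mid H=h, I=i)$, `beta[h, i]` for $P(I=i \mid H=h)$, `gamma[h]` for $P(H=h)$) and function names are assumptions. It computes the observational and causal partitions of a three-valued image variable and tests whether one is a coarsening of the other.

```python
import numpy as np

def observational_probs(alpha, beta, gamma):
    """P(T=0 | I=i) = sum_h alpha[h,i] beta[h,i] gamma[h] / sum_h beta[h,i] gamma[h]."""
    w = beta * gamma[:, None]                 # P(H=h, I=i)
    return (alpha * w).sum(axis=0) / w.sum(axis=0)

def causal_probs(alpha, gamma):
    """P(T=0 | man(I=i)) = sum_h alpha[h,i] gamma[h]."""
    return gamma @ alpha

def partition(values, decimals=9):
    """Group image indices whose (rounded) probabilities coincide."""
    classes = {}
    for idx, v in enumerate(np.round(values, decimals)):
        classes.setdefault(float(v), set()).add(idx)
    return [frozenset(c) for c in classes.values()]

def is_coarsening(coarse, fine):
    """True if every class of `fine` is contained in some class of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)

# Hypothetical parameters: binary H, three-valued I.
gamma = np.array([0.5, 0.5])
alpha = np.array([[0.2, 0.8, 0.9],
                  [0.8, 0.2, 0.9]])
beta = np.array([[0.6, 0.2, 0.2],
                 [0.2, 0.5, 0.3]])

observational = partition(observational_probs(alpha, beta, gamma))
causal = partition(causal_probs(alpha, gamma))
```

With these values, images 0 and 1 share a causal class but fall into different observational classes, so the causal partition is a proper coarsening of the observational one, as in the first model above.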

## 8 Appendix: Proof of the Complete Macro-Variable Description Theorem

The first part follows by construction of . For the second part, note that by the CCT there is a bijective correspondence between the pairs of values and the observational probabilities . Call this correspondence , that is and . Further, define as the function on with . But since , we have . That is, the value of and is a function of the value of , and thus the entropy of and is smaller than the entropy of .

## 9 Appendix: Predictive Non-Causal Information in Causal Variable

In some cases retains predictive information that is not causal. Consider the following example: We have a causal graph consisting of three variables where the causal relations are and . All three variables are binary and we have a positive distribution over the variables. In the general case, distributions over this graph satisfy

, and importantly

.

If we view as an image (which can either be all black or all white), as the target behavior and as a hidden confounder, analogous to the set-up in the main article, then the observational partition has just two classes, namely . But in this case the observational partition *is the same* as the causal partition: . So by our definition of a spurious correlate, is a constant, since there are no further distinctions to be made within any of the causal classes. would be omitted from any standard causal model. Nevertheless, we have in our model still that , i.e. the causal variable still contains predictive information that is not causal. Given that there is by construction no other than the causal and the trivial partition in this example, it must be the case that retains predictive non-causal information. It follows that in our definitions of and , it is not the case that the predictive non-causal components of an image can always be completely separated from the causal features.
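
A small numerical instance makes the point concrete. The sketch below assumes a binary image `x`, a binary behavior `y` and a binary hidden confounder `z`, with edges from `z` to `x`, from `z` to `y` and from `x` to `y`. The variable names and the probability tables are hypothetical, chosen only to give a positive distribution.

```python
import numpy as np

p_z = np.array([0.5, 0.5])                  # P(Z=z)
p_x_given_z = np.array([[0.8, 0.2],         # P(X=x | Z=z), rows indexed by z
                        [0.3, 0.7]])
p_y1_given_xz = np.array([[0.1, 0.5],       # P(Y=1 | X=x, Z=z), rows x, columns z
                          [0.6, 0.9]])

# Joint P(X=x, Z=z), indexed [x, z].
p_xz = p_x_given_z.T * p_z

# Observational: P(Y=1 | X=x) averages over z with weights P(z | x).
p_y1_given_x = (p_y1_given_xz * p_xz).sum(axis=1) / p_xz.sum(axis=1)

# Causal: P(Y=1 | man(X=x)) averages over z with weights P(z).
p_y1_do_x = p_y1_given_xz @ p_z
```

Both vectors have two distinct entries, so the observational and causal partitions coincide (one class per image), yet the observational probabilities (about 0.21 and 0.83) differ from the causal ones (0.30 and 0.75). The causal variable therefore carries predictive information beyond its causal effect.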

## 10 Appendix: The MNIST on MTurk Experiment

For this experiment, we started off by training ten one-vs-all neural nets. We used cross-validation to choose among the following architectures: 100 hidden units (h.u.), 300 h.u. (one layer), 100-100 h.u. (two layers), 300-300 h.u. (two layers). We used maxout [8] activations (each of which computed the max of 5 linear functions). For training we used stochastic gradient descent in batches of 50 with 50% dropout [14] on the hidden units, momentum adjustment from 0.5 to 0.99 at iteration 100, learning rate decaying from 0.1 to 0.0001 with exponential coefficient of 1/0.9998, and no weight decay; we also constrained the norm of each column of hidden-unit weights to be at most 5. Training stopped after 1000 iterations, and the network from the iteration with the best validation error was chosen. We used the Pylearn2 package [8] to train the networks.
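
The optimization schedule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Pylearn2 configuration we used: it assumes the learning rate is divided by the decay coefficient once per iteration with 0.0001 as a floor, that momentum ramps linearly from 0.5 to 0.99 over the first 100 iterations, and that the norm constraint rescales each weight-matrix column. All function names are hypothetical.

```python
import numpy as np

def learning_rate(iteration, initial=0.1, decay=1 / 0.9998, floor=1e-4):
    """Exponentially decayed learning rate, never below `floor` (assumed per-iteration decay)."""
    return max(initial / decay ** iteration, floor)

def momentum(iteration, start=0.5, final=0.99, saturate=100):
    """Linear ramp from `start` to `final`, reached at iteration `saturate` (assumed shape)."""
    frac = min(iteration / saturate, 1.0)
    return start + frac * (final - start)

def clip_column_norms(W, max_norm=5.0):
    """Rescale every column of W whose Euclidean norm exceeds `max_norm`."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

# Example: schedule values at a few iterations, and a clipped weight matrix.
schedule = [(t, learning_rate(t), momentum(t)) for t in (0, 50, 100, 1000)]
W = clip_column_norms(np.full((4, 3), 10.0))
```

Under these assumptions the learning rate after 1000 iterations is about 0.082, so the 0.0001 floor would not be reached within the training run.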

This initial training was done on 5000 training points and 1250 validation points (both drawn from the MNIST dataset) for each machine. The training points were chosen at random to include 2500 images of a specific digit class (that is, 2500 zeros for the first machine, 2500 ones for the second machine and so on), and 2500 images of random other digits for each machine. The validation sets were composed similarly. Each machine then used Algorithm 2 to transform 1000 images of digits *from its training set* into maximally similar images of the opposing class.
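
The composition of one machine's training set can be sketched as follows. This is a minimal illustration with hypothetical names; the label array is synthetic stand-in data, not MNIST, and the sampling details (uniform sampling without replacement, shuffling) are assumptions.

```python
import numpy as np

def one_vs_all_indices(labels, target, n_per_side, rng):
    """Pick n_per_side images of `target` and n_per_side images of other digits."""
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == target)
    neg = np.flatnonzero(labels != target)
    chosen_pos = rng.choice(pos, size=n_per_side, replace=False)
    chosen_neg = rng.choice(neg, size=n_per_side, replace=False)
    idx = np.concatenate([chosen_pos, chosen_neg])
    y = np.concatenate([np.ones(n_per_side, dtype=int),
                        np.zeros(n_per_side, dtype=int)])
    perm = rng.permutation(idx.size)          # shuffle positives and negatives together
    return idx[perm], y[perm]

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=60000)      # synthetic digit labels (assumption)
idx, y = one_vs_all_indices(labels, target=3, n_per_side=2500, rng=rng)
```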

We thus started off with ten manipulated datasets of 1000 images each. The first dataset contained images of zeros manipulated to be non-zeros, and all the other digits manipulated to be zeros. The tenth dataset contained images of nines manipulated to be non-nines and the other digits manipulated to be nines. We then used Amazon Mechanical Turk to present all those images to human annotators, using the interface shown in Figure 7. The images created by all the manipulator networks were mixed together at random, so that each single annotator (annotating 250 images in one task) would see some images created by each machine. Finally, each of the 10000 images was shown to five annotators; we used $5 \times 40 = 200$ annotators in total on each iteration. The annotators labeled the images as either one of the ten digits, or the question mark ‘?’ if there was no recognizable digit in an image. The final label (“target digit” or “not target digit”) was chosen by majority vote of the annotators.
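
The label aggregation can be sketched as below. This is a minimal illustration with hypothetical function names; it assumes each vote is first reduced to "matches the target digit or not" (with ‘?’ counting as a non-match) before the majority is taken.

```python
def final_label(votes, target):
    """Majority vote over annotator labels, reduced to target vs. not target."""
    hits = sum(1 for v in votes if v == target)
    return "target digit" if hits > len(votes) / 2 else "not target digit"

def annotator_tasks(n_images, votes_per_image, images_per_task):
    """Number of annotation tasks needed per iteration."""
    return (n_images * votes_per_image) // images_per_task

example = final_label([3, 3, 3, 8, "?"], target=3)
tasks = annotator_tasks(10000, 5, 250)
```

With 10000 images, five votes per image and 250 images per task, this gives the 200 annotation tasks per iteration mentioned above.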

The annotated manipulated digits were then added to the datasets to which their respective original images belonged. We then trained the next iteration of neural network manipulators on the updated datasets, and so on until the manipulator training was complete.

### Footnotes

- We take a v-bar (h-bar) to consist of a complete column (row) of black pixels.
- An extension of the framework to non-binary, discrete is easy but complicates the notation significantly. An extension to the continuous case is beyond the scope of this article.
- We note that may retain predictive information about that is not causal, i.e. it is not the case that all spurious correlations can be accounted for in . See Appendix 9 for an example.

### References

1. **Representation learning: A review and new perspectives.** Y. Bengio, A. Courville, and P. Vincent. Pattern Analysis and Machine Intelligence.
2. **Representing shape with a spatial pyramid kernel.** A. Bosch, A. Zisserman, and X. Munoz. In *6th ACM International Conference on Image and Video Retrieval*, pages 401–408, 2007.
3. **The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization.** S. Branson, G. Van Horn, and C. Wah. International Journal of Computer Vision.
4. **A statistical problem for inference to regulatory structure from associations of gene expression measurements with microarrays.** T. Chu, C. Glymour, R. Scheines, and P. Spirtes. Bioinformatics.
5. **Pedestrian detection: An evaluation of the state of the art.** P. Dollar, C. Wojek, B. Schiele, and P. Perona. IEEE Transactions on Pattern Analysis and Machine Intelligence.
6. **Using causal induction in humans to learn and infer causality from video.** A. S. Fire and S. C. Zhu. The Annual Meeting of the Cognitive Science Society (CogSci).
7. **Learning Perceptual Causality from Video.** A. S. Fire and S. C. Zhu. AAAI Workshop: Learning Rich Representations from Low-Level Sensors.
8. **Maxout networks.** I. J. Goodfellow and D. Warde-Farley. arXiv preprint arXiv:1302.4389.
9. **Explaining and Harnessing Adversarial Examples.** I. J. Goodfellow, J. Shlens, and C. Szegedy. arXiv preprint arXiv:1412.6572.
10. **Human (Homo sapiens) facial attractiveness and sexual selection: The role of symmetry and averageness.** K. Grammer and R. Thornhill. Journal of Comparative Psychology.
11. **The pyramid match kernel: Efficient learning with sets of features.** K. Grauman and T. Darrell. Journal of Machine Learning Research.
12. **Causal feature selection.** I. Guyon, A. Elisseeff, and C. Aliferis. In *Computational Methods of Feature Selection Data Mining and Knowledge Discovery Series*, pages 63–85. Chapman and Hall/CRC, 2007.
13. **Image classification with segmentation graph kernels.** Z. Harchaoui and F. Bach. In *IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, pages 1–8, 2007.
14. **Improving neural networks by preventing co-adaptation of feature detectors.** G. E. Hinton and N. Srivastava. arXiv preprint arXiv:1207.0580.
15. **Quantifying causal emergence shows that macro can beat micro.** E. P. Hoel, L. Albantakis, and G. Tononi. Proceedings of the National Academy of Sciences.
16. **ImageNet Classification with Deep Convolutional Neural Networks.** A. Krizhevsky, I. Sutskever, and G. E. Hinton. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, *Advances in Neural Information Processing Systems 25*, pages 1097–1105. 2012.
17. **Gradient-based learning applied to document recognition.** Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Proceedings of the IEEE.
18. **A sequential algorithm for training text classifiers.** D. D. Lewis and W. A. Gale. In *ACM SIGIR Seventeenth Conference on Research and Development in Information Retrieval*, pages 3–12, 1994.
19. **Strong completeness and faithfulness in Bayesian networks.** C. Meek. In *Eleventh Conference on Uncertainty in Artificial Intelligence*, pages 411–418, 1995.
20. **Distinctness of the eigenvalues of a quadratic form in a multivariate sample.** M. Okamoto. The Annals of Statistics.
21. *Causality: Models, Reasoning and Inference.* J. Pearl.
22. **Using Markov blankets for causal structure learning.** J. P. Pellet and A. Elisseeff. Journal of Machine Learning Research.
23. **ImageNet large scale visual recognition challenge.** O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. arXiv preprint arXiv:1409.0575.
24. **An analysis of active learning strategies for sequence labeling tasks.** B. Settles and M. Craven. In *Conference on Empirical Methods in Natural Language Processing*, pages 1070–1079, 2008.
25. *Causal architecture, complexity and self-organization in the time series and cellular automata.* C. R. Shalizi.
26. **Computational mechanics: Pattern and prediction, structure and simplicity.** C. R. Shalizi and J. P. Crutchfield. Journal of Statistical Physics.
27. **Causal inference of ambiguous manipulations.** P. Spirtes and R. Scheines. Philosophy of Science.
28. *Causation, prediction, and search.* P. Spirtes, C. N. Glymour, and R. Scheines.
29. **Intriguing properties of neural networks.** C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. In *International Conference on Learning Representations*, 2014.
30. **Graph kernels.** S. V. N. Vishwanathan. Journal of Machine Learning Research.
31. **Part-based R-CNNs for fine-grained category detection.** N. Zhang, J. Donahue, R. Girshick, and T. Darrell. In *ECCV 2014*, pages 834–849, 2014.