Machine Learning in a data-limited regime: Augmenting experiments with synthetic data uncovers order in crumpled sheets


Machine learning has gained widespread attention as a powerful tool to identify structure in complex, high-dimensional data. However, these techniques are ostensibly inapplicable for experimental systems with limited data acquisition rates, due to the restricted size of the dataset. Here we introduce a strategy to resolve this impasse by augmenting the experimental dataset with synthetically generated data of a much simpler sister system. Specifically, we study spontaneously emerging local order in crease networks of crumpled thin sheets, a paradigmatic example of spatial complexity, and show that machine learning techniques can be effective even in a data-limited regime. This is achieved by augmenting the scarce experimental dataset with inexhaustible amounts of simulated data of flat-folded sheets, which are simple to simulate. This significantly improves the predictive power in a test problem of pattern completion and demonstrates the usefulness of machine learning in bench-top experiments where data is good but scarce.


Machine learning is a versatile tool for data analysis that has permeated applications in a wide range of domains LeCun et al. (2015). It has been particularly well suited to the task of mining large datasets to uncover underlying trends and structure, enabling breakthroughs in areas as diverse as speech and character recognition Mohamed et al. (2009); Dahl et al. (2011); Deng et al. (2013); LeCun et al. (1998), medicine Shameer et al. (2018), games Mnih et al. (2013); Silver et al. (2016, 2017), finance Heaton et al. (2017), and even romantic attraction Joel et al. (2017). The prospect of applying machine learning to similarly revolutionize research in the physical sciences has likewise gained attention and excitement. Data-driven approaches have been successfully applied to data-rich systems such as classifying particle collisions in the LHC Bhimji et al. (2017); Baldi et al. (2014), classifying galaxies Banerji et al. (2010), segmenting large microscopy datasets Sommer et al. (2011); Gulshan et al. (2016) or identifying states of matter Carrasquilla and Melko (2017); Spellings and Glotzer (2018). Machine learning has also enhanced our understanding of soft-matter systems: In a recent series of works, Cubuk, Liu, and collaborators have used data-driven techniques to define and analyze a novel “softness” parameter governing the mechanical response of disordered, jammed systems Cubuk et al. (2015); Sussman et al. (2017); Schoenholz et al. (2017).

All examples cited above address experimentally, computationally, or analytically well-developed scientific fields supplied by effectively unlimited data. By contrast, many systems of interest are characterized by scarce or poor-quality data, a lack of established tools, and a limited data acquisition rate that falls short of the demands of effective machine learning. As a result, the applicability of machine learning to such systems is problematic and would require additional tools. This would potentially be of high value to the experimental physics community and would require novel ways of circumventing the data limitations, either experimentally or computationally. In this manuscript, we study crumpling and the evolution of damage networks in thin sheets as a test case for machine-learning-aided science in complex, data-limited systems that lack a well established theoretical, or even a phenomenological, model.

Crumpling is a complicated and poorly understood process: As a thin sheet is confined to a small region of space, stresses spontaneously localize into one-dimensional regions of high curvature Amar and Pomeau (1997); Witten (2007); Aharoni and Sharon (2010) forming a damage network of sharp creases (Fig. 1B) that can be classified according to the sign of the mean curvature: creases with positive and negative curvature are commonly referred to as valleys and ridges, respectively. Previous work on crumpled sheets has established clear and robust statistical properties of these damage networks. For example, it has been shown that the number of creases at a given length follows a predictable distribution Andresen et al. (2007) and the cumulative amount of damage over repeated crumpling is described by an equation of state Gottesman et al. (). However, these works do not account for spatial correlations, which is the structure we are trying to unravel. The goal of this work is to learn the statistical properties of such networks by solving a problem of network completion: Separating the ridges from valleys, can a neural net be trained to accurately recover the location of the ridges, presented only with the valleys? For later use, we call this problem partial network reconstruction.

The predominant challenge we are addressing here is a severe data limitation. As detailed below, we were unable to perform this task using experimental data alone. However, by augmenting experimental data with computer-generated examples of a simple sister system which is well understood, namely flat-folding, we trained an appropriate neural network with significant predictive power.

Figure 1: (A) A sheet of Mylar that has undergone a succession of flat folds. (B) A sheet of Mylar that has been crumpled. (C) A simulated flat-folded sheet. The sheet has been folded 13 times. Ridges are colored red, and valleys are black.

The primary dataset used in this work was collected for a previous crumpling study Gottesman et al. (), where the experimental procedures are detailed and are only reviewed here for completeness. Mylar sheets are crumpled by rolling them into a 3 cm diameter cylinder and compressing them uni-axially to a specified depth within the cylindrical container, creating a permanent damage network of creasing scars embedded into the sheet. To extract the crease network, the sheet is carefully opened up and scanned using a custom-made laser profilometer, resulting in a topographic height map from which the mean curvature is calculated. The sheet is then successively re-crumpled and scanned between 4 and 24 times, following the same procedure. The curvature map is preprocessed with a custom script based on the Radon transform (for details see Sec. I in the Supplementary Information (SI)) to separate creases from the flat facets and fine texture in the data (Fig. 2A). The complete dataset consists of a total of 506 scans corresponding to 31 different sheets.


Failures with only experimental data

As stated above, the task we tried to achieve is partial network reconstruction: inferring the location of the ridges given only the valleys (Fig. 2A). Our first attempts were largely unsatisfactory and demonstrated little to no predictive power. Strategies for improving our results included subdividing the input data into small patches of different length scales, varying the network architecture, data representation, and loss function, and denoising the data in different ways. We approached variants of the original problem statement, trying to predict specific crease locations, distance from a crease, and changes in the crease network between successive frames. In all these cases our network invariably learned specific features of the training set rather than generic principles that hold for unseen test data, a common problem known as over-fitting. The main culprit for this failure is insufficient data: The dataset of a few hundred scans available for this study is small compared to standard practices in machine learning tasks (for example, the problem of hand-written digit classification, MNIST, which is commonly given as an introductory exercise in machine learning, consists of 70,000 images LeCun et al. (1998)). Moreover, as creases produce irreversible scars, images of successive crumples of the same sheet are highly correlated, rendering the effective size of our dataset considerably smaller.

Over-fitting can be addressed by constraining the model complexity through insights from physical laws, geometric rules, symmetries, or other relevant constraints. Alternatively, it can be mitigated by acquiring more data. Sadly, neither of these avenues is viable: current theory of crumpling cannot offer significant constraints on the structure or evolution of crease networks. Furthermore, adding a significant amount of experimental data is prohibitively costly: achieving a dataset of the size typically used in deep learning problems, say of order $10^4$ scans, would require thousands of lab hours, given that a single scan takes about ten minutes. Lastly, data cannot be efficiently simulated since, while preliminary work on simulating crumpling is promising Narain et al. (2013); Guo et al. (2018), generating a simulated crumpled sheet still takes longer than an actual experiment. A different approach is needed.

Turning to a sister system: flat-folding

An alternative strategy is to consider a reference system free from data limitations alongside the target system, with the idea that similarities between the target and reference systems allow a machine learning model of one to inform that of the other. This is similar to transfer learning Pan and Yang (2010), but in this case, rather than re-purposing a network, we supplement the training data with that of a reference system. In our case, a natural choice of such a system is the well understood, flat-folded thin sheet, effectively a more constrained version of crumpling. Flat-folding is the process of repeatedly folding a thin sheet along straight lines to create permanent creases. In this process, explicit rules dictate the structure of the resulting crease network: Creases cannot begin or terminate in the interior of a sheet: they must either reach the boundary or create closed loops; the number of ridge and valley creases that meet at each vertex differs by two (Maekawa’s theorem); finally, alternating sector angles around each vertex must sum to $\pi$ (Kawasaki’s theorem) Turner et al. (2016). Given these rigid geometric rules, we expect partial network reconstruction of flat-folded sheets to be a much more constrained problem than that of crumpled ones.
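The two vertex rules quoted above can be checked mechanically. The following is a minimal sketch, not part of the simulation code used in this work; the function names and the 'R'/'V' labels are illustrative assumptions. It tests Maekawa's and Kawasaki's theorems at a single interior vertex, given the crease types and the consecutive sector angles around it.

```python
import math


def satisfies_maekawa(crease_types):
    """Check Maekawa's theorem at one interior vertex.

    crease_types: iterable of 'R' (ridge) or 'V' (valley), one entry per
    crease meeting at the vertex (labels are an illustrative convention).
    """
    ridges = sum(1 for c in crease_types if c == "R")
    valleys = sum(1 for c in crease_types if c == "V")
    return abs(ridges - valleys) == 2


def satisfies_kawasaki(sector_angles, tol=1e-9):
    """Check Kawasaki's theorem at one interior vertex.

    sector_angles: consecutive angles (radians) between neighbouring
    creases, listed in order around the vertex.
    """
    if len(sector_angles) % 2 != 0:
        return False  # a flat-foldable vertex has an even number of creases
    if abs(sum(sector_angles) - 2.0 * math.pi) > tol:
        return False  # the sectors must close up around the vertex
    # The two alternating sums must each equal pi.
    first = sum(sector_angles[0::2])
    second = sum(sector_angles[1::2])
    return abs(first - math.pi) <= tol and abs(second - math.pi) <= tol
```

For a degree-four vertex the Kawasaki condition reduces to opposite sector angles summing to $\pi$, which is why two angles fix the other two.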

Figure 2: (A) A schematic of the processing pipeline. From the height map, a mean curvature map is calculated and denoised with a Radon-transform based method. Ridges (red) and valleys (black) are separated. The binary image of the valleys ($V$) is the input to the neural network ($\mathcal{N}$). The distance transform of the binary image of the ridges is the target ($D_R$). Warmer colors represent regions closer to ridges. These color conventions are consistent through all figures in this paper. (B) Two samples of predictions on generated data. The true fold network is superimposed on the predicted distance map. It is seen that the true ridges (red) coincide perfectly with the bright colors, demonstrating strong predictive power. Below the predictions we show confusion matrices, with the nearest third of pixels, the middle third, and the furthest third. (C) Two predictions, and their corresponding confusion matrices, using the network trained on generated data (without noise) and applied to experimental scans.

However, while experimentally collecting flat-folding data is only marginally less costly than collecting crumpling data, simulating it on a computer is a straightforward task. We wrote a custom C++ code to do this using the Voro++ library Rycroft (2009) for rapid manipulation of the polygonal folded facets. The code generates a complete and statistically representative dataset of practically unlimited size (see typical examples in Fig. 1C, Fig. 2B and Fig. S1). The code is described in Sec. II of the Supplemental Information.

Having flat-folding as a reference system provides foremost a convenient setting for comparing the performance of different network architectures. The vast parameter space of neural networks requires testing different hyperparameters, loss functions, optimizers, and data representations with no standard method for finding the optimal combination. This problem is exacerbated when it is not at all clear where the failure lies: Is the task at all feasible? If so, is the network architecture appropriate? If so, is the dataset sufficiently large? Answering these questions with our limited amount of experimental data is very difficult. However, with flat folding we are certain the task is feasible and our data is comprehensive, so experimentation with different networks is easier. Indeed, after testing many architectures, we identified a network capable of learning the desired properties of our data, reproducing linear features and maintaining even non-local angle relationships between features.

Network structure

The chosen network is a modified version of the fully convolutional SegNet Badrinarayanan et al. (2017) deep convolutional neural net. As outlined in Fig. 2A, each crease network is separated into its valleys and ridges. The neural net, $\mathcal{N}$, is given as input a binary image of the valleys, denoted $V$ (“input” in Fig. 2). The goal of $\mathcal{N}$ is to predict the distance transform of the ridges, $D_R$. That is, for each pixel, $D_R$ is the distance to the nearest ridge pixel (“target” in Fig. 2). The loss is chosen to be simply the squared $L^2$ distance between the predicted distance transform, $\mathcal{N}[V]$, and the real one,

$$\mathcal{L} = \sum_i \left( \mathcal{N}[V]_i - (D_R)_i \right)^2 ,$$

where the summation index $i$ represents image pixels. Training is performed by finding weights that minimize this loss. The motivation for this choice of representation is that creases are sharp and narrow features, and therefore if we demand from $\mathcal{N}$ to predict the precise location of a crease, even slight inaccuracies would lead to vanishing gradients of $\mathcal{L}$, making training harder. See Materials and Methods below for full details of the implementation.
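To make the choice of target concrete, the following is a minimal pure-Python sketch of a distance-transform target and a pixel-wise loss. It is not the Mathematica implementation used in this work: the function names are illustrative, the brute-force search is only suitable for small images, and the loss is assumed here to be a sum of squared pixel-wise differences. In practice a library routine such as `scipy.ndimage.distance_transform_edt` would normally be used instead.

```python
import math


def distance_transform(binary):
    """Euclidean distance from every pixel to the nearest crease pixel.

    binary: list of rows with 1 on crease pixels and 0 elsewhere.
    Brute force, O(pixels x crease pixels); fine for small illustrations.
    """
    rows, cols = len(binary), len(binary[0])
    crease = [(r, c) for r in range(rows) for c in range(cols) if binary[r][c]]
    if not crease:
        raise ValueError("image contains no crease pixels")
    return [
        [min(math.hypot(r - rr, c - cc) for rr, cc in crease) for c in range(cols)]
        for r in range(rows)
    ]


def squared_l2_loss(predicted, target):
    """Sum over pixels of the squared difference (assumed form of the loss)."""
    return sum(
        (p - t) ** 2
        for prow, trow in zip(predicted, target)
        for p, t in zip(prow, trow)
    )


# A single vertical crease in a 3 x 3 image.
creases = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
target = distance_transform(creases)
```

Because the target varies smoothly with distance from a crease, a prediction that misplaces a crease by one pixel incurs a small, informative loss rather than the all-or-nothing penalty of a binary target.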

In silico flat-folding

For exclusively in silico generated flat-folding data, the trained network performs partial network reconstruction with nearly perfect accuracy, as demonstrated in Fig. 2B: The agreement between the true location of the ridges (red lines) and their predicted location (bright colors) is visibly flawless. As a means of quantifying accuracy, we present the confusion matrices of the predicted and true output; the upper left (lower right) entries in the confusion matrix (Fig. 2B) contain the probability of correctly predicting regions closest to (most distant from) a crease, which is approximately 90%.
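A confusion matrix of the kind described above can be sketched as follows. This is an illustrative reconstruction, not the analysis code used in this work: it assumes that the pixels of the true and of the predicted distance map are each ranked separately and split into a nearest, middle, and furthest third, and that rows (true class) are normalized to sum to one.

```python
def tercile_labels(values):
    """Label each value 0, 1 or 2: nearest, middle or furthest third by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = min(3 * rank // n, 2)
    return labels


def confusion_matrix(true_dist, pred_dist):
    """3 x 3 matrix; entry [a][b] is the fraction of pixels of true class a
    that were predicted as class b. Inputs are flattened distance maps."""
    true_labels = tercile_labels(true_dist)
    pred_labels = tercile_labels(pred_dist)
    counts = [[0] * 3 for _ in range(3)]
    for a, b in zip(true_labels, pred_labels):
        counts[a][b] += 1
    return [[c / max(sum(row), 1) for c in row] for row in counts]
```

With this convention the upper left entry is the fraction of pixels truly nearest to a crease that are also predicted as nearest, and the lower right entry is the corresponding fraction for the most distant pixels.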

Although by itself a non-trivial task that requires learning a complicated set of geometrical rules, a task that would require non-negligible effort from a human if they were to write an explicit algorithm to solve it, partial network reconstruction of in silico flat-folding proved, as hoped, to be a problem that the neural network solves with relative ease.

Experimental flat-folding

As an intermediate step between in silico flat-folding and experimental crumpling data, we next examine the performance of the neural network on experimental flat-folding data. Fig. 2C reveals that the resulting prediction weakens by comparison, a consequence of noise present in experimental data that is absent from the in silico samples. Noise occurs in the form of varying crease widths, fine texture, and missing creases that are undetected in image processing. In some cases, true creases missed during processing are correctly predicted, which also introduces error to our accuracy metric (see for example, the center of the second panel of Fig. 2C). While sufficient data of experimental flat-folding would likely allow the network to distinguish signal from noise, in our data-limited regime noise must be added to the generated in silico data in order to help the network learn to accurately predict experimental scans and avoid over-fitting.

Figure 3: Effect of noise type on experimental data prediction. (A)–(E) The top, middle and bottom rows show, respectively, an example training image from each training set, an example prediction (the one used in the right panel of Fig. 2C), and the corresponding confusion matrix. The noise types are described concisely in the title of each panel and complete specifications are given in Materials and Methods. (F) Each pixel of the near perfect prediction from Fig. 2B was randomly toggled with probability $p$. The figure shows the upper left value of the resulting confusion matrix. (G) The network from (E) applied on an additional experimental scan (from left panel of Fig. 2C). The average confusion matrix on all experimental scans is shown.

We examine the effect of adding several types of noise on the prediction accuracy on experimental input (Figs. 3A–E). We observe significant improvement and find that adding experimentally realistic noise, as in Fig. 3E, is more effective than toggling individual pixels randomly (Figs. 3B,D). We found that the noise type that leads to optimal training is to randomly add and remove patches of input that are approximately the same length scale as the noise in the experimental scans. We also find that it is important to provide input data with lines of variable width, to prevent the network from expecting only creases of a particular width. For complete details of the different noise properties, see Materials and Methods.

While the values in the confusion matrices in Fig. 3E might seem low, it is important to note that the metric used here is not trivial to interpret: it compares distance maps, which are particularly sensitive to noise, since a localized noise speckle in a region remote from ridges perturbs a large region of space (essentially, of the size of its Voronoi cell). To gauge the effect of noise on the accuracy metric, we randomly toggle a fraction of pixels in an otherwise perfect flat-folding example and recompute the entries of its confusion matrix, as presented in Fig. 3F. At realistic noise levels, we can expect accuracy values between 0.75 and 0.80 in the upper left and lower right entries of the confusion matrix, comparable to the values reported in Fig. 3E. That is, for experimental flat-folding we achieve accuracy levels that are comparable to what is expected for a perfect prediction with noisy preprocessing.

Experimental crumpling

For crumpling, we train the neural network on a combination of experimental crumpling data (30%) and the same noised in silico flat-folding data (70%) that was used above. Training on this combined dataset, the resulting predictions accurately reconstruct key features of the crease networks in crumpled sheets, which was not achieved in prior attempts. In Figure 4, we present predictions on entire sheets (Fig. 4A) as well as a few close-ups of selected regions (Fig. 4B). The confusion matrices suggest that the network is often relatively accurate in predicting regions that are directly adjacent to a crease (upper left entry) as well as large open spaces (lower right entry), classifying such regions with 50%–60% accuracy. In addition, Fig. S3 shows the prediction on each of 16 successive crumples of the same sheet held out from training. These results demonstrate that augmenting the dataset with in silico generated flat-folding data allows the network to discern some underlying geometric order in the crease networks of experimental crumpling data.
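The 30%/70% mixture described above can be realized in several ways. The sketch below is one hypothetical possibility, not the training code used in this work: it assumes that the ratio is enforced within every training batch, and the function name and arguments are illustrative.

```python
import random


def mixed_batch(experimental, generated, batch_size,
                experimental_fraction=0.3, rng=None):
    """Draw one training batch with a fixed share of experimental samples.

    experimental, generated: sequences of training examples.
    Sampling is with replacement, so the scarce experimental set is reused
    across batches while the generated set can be arbitrarily large.
    """
    rng = rng or random.Random()
    n_experimental = round(batch_size * experimental_fraction)
    batch = [rng.choice(experimental) for _ in range(n_experimental)]
    batch += [rng.choice(generated) for _ in range(batch_size - n_experimental)]
    rng.shuffle(batch)  # avoid a fixed ordering of the two sources
    return batch
```

Fixing the ratio per batch, rather than only over the whole dataset, keeps every gradient step informed by both systems; whether this matters in practice would need to be tested.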

Figure 4: Predictions on crumpling (A) One sheet that was successively crumpled, shown after 4 and 7 crumpling iterations. Color code follows Fig. 2. (B) Close-ups on selected smaller patches from the same image, broken down to prediction, prediction and target, and prediction and input.

Throughout this work we used confusion matrices to quantify the network’s performance. However, there are other ways to do so which would give different results. One such way is comparing the network’s output to “random” network completion, i.e. to a network that construes a pattern having the statistical properties of a crease network, but is only weakly correlated with the input image. Though a generative model of crease networks is not available, we can sample crease patterns from the experimental data and compare the predicted distance maps to those measured from such randomly selected samples. This is discussed in Sec. V of the SI, where it is seen that our prediction for a given crease pattern is overwhelmingly closer to the truth than any sampled patch from other experiments (Fig. S5).

Our results demonstrate the capacity of a neural network to learn, at least partially, the structural relationship of ridges and valleys in a crease pattern of crumpled sheets. The next step is to understand the network’s decision process, with the aim of uncovering the physical principles responsible for the observed structure. However, while interpretation of trained weights is currently a heavily researched topic, see (Smilkov et al., 2016; Frosst and Hinton, 2017; Sundararajan et al., 2017, among many others), there is not yet a standard approach to do so. Our ongoing work seeks to probe the network’s inner workings by perturbing the input data. For example, we can individually alter each pixel in the input image and quantify the effect of the perturbation on the prediction relative to the original target. Alternatively, we can examine the effect of adding or removing creases, or test the network’s output given inputs that do not occur naturally in the crumpled sheets. Some preliminary results are discussed in Sec. IV of the SI.
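The single-pixel perturbation probe mentioned above can be sketched as follows. This is a hypothetical illustration rather than the analysis of Sec. IV of the SI: `model` stands for any callable that maps a binary input image to a predicted distance map, and the squared pixel-wise difference is an assumed measure of the change.

```python
def squared_loss(predicted, target):
    """Sum over pixels of the squared difference between two maps."""
    return sum(
        (p - t) ** 2
        for prow, trow in zip(predicted, target)
        for p, t in zip(prow, trow)
    )


def pixel_sensitivity(model, image, target):
    """Change in loss caused by toggling each input pixel in turn.

    model: callable taking a binary image (list of rows) and returning a
    predicted distance map of the same shape (hypothetical interface).
    """
    base = squared_loss(model(image), target)
    sensitivity = [[0.0] * len(image[0]) for _ in image]
    for r in range(len(image)):
        for c in range(len(image[0])):
            perturbed = [list(row) for row in image]  # copy, leave input intact
            perturbed[r][c] = 1 - perturbed[r][c]
            sensitivity[r][c] = squared_loss(model(perturbed), target) - base
    return sensitivity
```

The resulting map highlights which input pixels the prediction depends on most strongly, at the cost of one forward pass per pixel.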


Experimental data is paramount to our understanding of the physical world. However, prohibitive data acquisition rates in many experimental settings require augmenting experimental data in order to answer some questions. In particular, computer simulations now play a significant role in exploratory science; many experimental conditions can be accurately simulated to corroborate our understanding of empirical results.

Despite these advances, the simulation of certain phenomena is inhibited by insufficient theoretical knowledge of the system, or by demanding computational resources and development time. For crumpling, without a deeper understanding which would allow the use of simplified/reduced models, simulations require prohibitively small time steps, small domain discretization, or both to avoid unphysical phenomena such as self-intersection Narain et al. (2013). In this manuscript, we showcase a new method for circumventing limited data for the system of crumpled sheets. We demonstrate that by using a vast amount of in silico generated data from a related sister system with well-understood dynamics alongside a limited amount of costly experimental data, we are able to complete the missing crease pattern on experimental crumpled sheets.

Even with a small experimental training set, we show that by augmenting the dataset with computer-generated, artificially noised flat-folding data, salient features of the ridge network can be predicted from the surrounding valleys: the network successfully predicts the presence of certain creases, as well as their pronounced absence in certain locations (see Fig. 4B). Combined, these results suggest predictable geometric constraints in a hallmark disordered system.

Improving the experimental dataset by performing dedicated experiments, or replacing the simulated flat-folding with simulated crumpling data are also promising future directions. While we have only demonstrated the advantages of data augmentation for one problem, it is tempting to imagine how it may apply to other systems in experimental physics. In addition to providing insights into the predictability of crease patterns, a quantitative predictive model (i.e. an oracle) could serve as an important experimental tool that allows for targeted experiments, especially when experiments are costly or difficult. As shown above, a trained neural network is able to shed light on where order exists, even if the source of the order is not apparent.

Replacing the scientific discovery process with an automated procedure is risky. Frequently, the hypotheses initially proposed are not the focal points of the final works they germinated, as observations and insights along the way sculpt the research towards its final state. This serendipitous aspect of discovery has been of immense importance to the sciences and is difficult to include in automated data exploration methods, which is an area of ongoing research Raccuglia et al. (2016); Baltz et al. (2017); Ren et al. (2018). By showing that data-driven techniques are able to make non-trivial predictions on complicated systems, even in a severely data-limited regime, we hope to demonstrate that these techniques can become a valuable tool for experimentalists in many different fields.

Materials and Methods


Experimental flat-folding and crumpling were performed on sheets of 0.05 mm thick Mylar. Flat folds were performed successively at random, without allowing the sheet to unfold between successive creases. Crumpled sheets were obtained by first rolling the sheet into a 3 cm diameter cylinder and then applying axial compression to a specified depth between 7.5 mm and 55 mm. Sheets were successively crumpled between 4 and 24 times.

To image the experimental crease network, crumpled/flat-folded sheets were opened up and their height profile was scanned using a home-built laser profilometer. The mean curvature map was calculated by differentiating the height profile, and then denoised using a custom Radon-based denoiser, the implementation details of which are given in Sec. I of the SI. A total of 506 scans were collected from 31 different experiments.

Network architecture and training

Data was fed into a fully convolutional network, based on the SegNet architecture Badrinarayanan et al. (2017) with the final soft-max layer removed, as the task is not a classification problem. The depth of the network allows for long-range interactions to be incorporated without the additional free parameters from fully connected layers. The network was implemented in Mathematica and optimization was performed using the ADAM optimizer Kingma and Ba (2014) on a Tesla K40c GPU with 256 GB of RAM. Code is freely available git (2018).

Training data comprised approximately 70% generated data and 30% experimental data. For training, the in silico generated input data was also augmented with standard data-augmentation methods: images were flipped along both axes and transposed. All images were down-sampled to a common fixed size. For crumpling data, creases were also linearized to look more similar to the generated flat-folding input. An example of the effect of linearizing is shown in Fig. S2 of the SI.
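The flip and transpose augmentations can be sketched in a few lines. The version below is illustrative only: it assumes that all combinations of the two flips and the transpose are used (the eight symmetries of the square), which may be more than was applied in this work, and the function names are hypothetical.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [list(row[::-1]) for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [list(row) for row in img[::-1]]


def transpose(img):
    """Swap rows and columns."""
    return [list(col) for col in zip(*img)]


def augment(img):
    """Return the distinct images reachable by flips and transposition."""
    variants = []
    for base in ([list(row) for row in img], transpose(img)):
        for variant in (
            base,
            flip_horizontal(base),
            flip_vertical(base),
            flip_horizontal(flip_vertical(base)),
        ):
            if variant not in variants:  # symmetric images yield duplicates
                variants.append(variant)
    return variants
```

The same transformation must of course be applied to an input image and to its target so that the pair stays aligned.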


Noise was added to the input in a few different ways, presented in Figs. 3A–E. The noise of panels (A)–(E) was generated as follows, in order:

  1. No noise.

  2. “White” noise: Each pixel was randomly toggled with 5% probability.

  3. Random Blur: Input was convolved with a Gaussian with a width drawn uniformly between 0 and 3. The array was then thresholded at 0.1. Here and below “thresholded at $t$” means a pointwise threshold was imposed on the array, such that values smaller than $t$ were set to 0 and all other values were set to 1.

  4. Each pixel was randomly toggled with 1% probability, then passed through Random Blur (as in item 3).

  5. Input was Random Blurred (as in item 3) but thresholded at 0.55. The blurred-and-thresholded input was then noised using both additive and multiplicative noise, as follows: two random fields were drawn from a pointwise uniform distribution between 0 and 1, convolved with a Gaussian of width seven (pixels), and thresholded at 0.55. Finally, the “noised” input is obtained by combining the blurred-and-thresholded input with these two fields, one acting multiplicatively (removing patches) and the other additively (adding patches).
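The simpler noise types in the list above (white-noise toggling, random blur with thresholding, and their combination) can be sketched as follows. This is a pure-Python illustration, not the Mathematica code used in this work. It assumes that the Gaussian "width" is the standard deviation in pixels and that the image is zero-padded at its borders; the patch noise of the last item is omitted because its exact form is not reproduced here.

```python
import math
import random


def toggle_pixels(img, prob, rng):
    """Flip each binary pixel independently with probability prob."""
    return [[1 - px if rng.random() < prob else px for px in row] for row in img]


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with zero padding (sigma in pixels, assumed)."""
    if sigma <= 0:
        return [[float(px) for px in row] for row in img]
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]

    def blur_rows(array):
        out = []
        for row in array:
            n = len(row)
            out.append([
                sum(w * row[j + k - radius]
                    for k, w in enumerate(kernel) if 0 <= j + k - radius < n)
                for j in range(n)
            ])
        return out

    blurred = blur_rows(img)                         # blur along rows
    blurred_t = [list(col) for col in zip(*blurred)]
    return [list(col) for col in zip(*blur_rows(blurred_t))]  # then columns


def threshold(array, t):
    """Set values smaller than t to 0 and all others to 1."""
    return [[0 if v < t else 1 for v in row] for row in array]


def random_blur(img, rng, t=0.1, max_width=3.0):
    """Blur with a width drawn uniformly in [0, max_width], then threshold."""
    return threshold(gaussian_blur(img, rng.uniform(0.0, max_width)), t)


def toggle_then_blur(img, rng, prob=0.01):
    """Toggle pixels with probability prob, then apply the random blur."""
    return random_blur(toggle_pixels(img, prob, rng), rng)
```

Blurring before thresholding turns one-pixel-wide creases into lines of variable width, which is the property the text identifies as important for generalizing to experimental scans.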


This work was supported by the National Science Foundation through the Harvard Materials Research Science and Engineering Center (DMR-1420570). S.M.R. acknowledges support from the Alfred P. Sloan research foundation. The GPU computing unit was obtained through the NVIDIA GPU grant program. JH was supported by a Computational Science Graduate Fellowship (DOE CSGF). YBS was supported by the JSMF post-doctoral fellowship for the study of complex systems. CHR was partially supported by the Director, Office of Science, Computational and Technology Research, U. S. Department of Energy under Contract No. DE-AC02-05CH11231.

Supplemental Materials

Figure S1: On the left, we show nine different random flat-folding patterns generated by our algorithm, without inward folding. Ridge folds are colored red and valley folds are colored in blue. On the right we show four different folding patterns where inward folds have been allowed.

Appendix S-I Radon transform based detection method

Here we detail the detection method used to identify crease networks from maps of mean curvature prior to machine learning. We refer to our technique as a Radon-based detection method, as it repurposes the key principle behind a Radon transform (recovering a signal through integration along directed paths) for crease detection. By integrating a quantity of interest, in our case the mean curvature, along paths of regularly spaced orientations within local regions of the curvature map, we construct a signal array that enhances the signature of creases and reduces noise. A strong signal is recovered if an integration path coincides with the direction of an extended structure such as a crease; a weak signal is produced by features that are point-like or isotropic, representative of noise and fine texture in the data.

Prior to processing, the raw curvature maps are downsampled for computational efficiency. A downsampling factor of 4 was found to preserve the integrity of the crease pattern while providing a useful speedup in computation. Next, a linear integration path is centered about a given pixel of the curvature map, traversing the diameter of a fixed circular local window. The average curvature along a particular direction is computed by exact numerical integration of the bicubic interpolant on the grid defined by pixel centers. The integration direction is systematically rotated about the central pixel through 24 equally spaced path orientations on the interval of 0 to 180 degrees, and the maximum average curvature over all path orientations is selected as the signal. This process is repeated for all pixels in the curvature map, resulting in a signal array of only the average curvatures that are a maximum along local, linear paths.

We examined a range of integration path lengths up to 8 mm, as the integration window defines a length scale that must accommodate features of varied sizes. While smaller integration paths can detect finer details, particularly at low crease densities, they sacrifice some of the advantage afforded by longer paths in accruing a strong signal that is well separated from noise. An integration path length of 3.2 mm suitably mediated such effects and provided a clear crease network.

Finally, global and local thresholds are applied to the signal array to separate the real creases from the background noise. A combination of the two was observed to work well in retaining the desired crease network: The global threshold is more permissive of noise but acts uniformly across the signal array, while the local threshold accommodates variations in signal intensity, and thus provides sensitivity to softer (less sharp) creases. We use a global threshold of 0.12 as the minimum signal intensity retained as a crease (approximately 10% of the magnitude of the largest creases), and set the local threshold to label as noise any pixel whose intensity falls below a fixed fraction of the maximum signal in a neighborhood centered about the pixel. In training with crumpled sheets, the crease networks were also linearized as shown in Fig. S2. This was done with a custom script that skeletonized the input and used the Mathematica function MorphologicalGraph.
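The directional-averaging idea behind the detection method can be sketched as follows. This is a simplified illustration, not the custom script used in this work: it replaces exact integration of the bicubic interpolant with nearest-neighbour sampling at unit steps, measures the path length in pixels, and uses hypothetical parameter names; the local-threshold fraction and neighbourhood radius are placeholders to be chosen by the user.

```python
import math


def sample(img, y, x):
    """Nearest-neighbour lookup; zero outside the image (simplification)."""
    r, c = int(round(y)), int(round(x))
    if 0 <= r < len(img) and 0 <= c < len(img[0]):
        return img[r][c]
    return 0.0


def directional_signal(img, path_length, n_orientations=24, step=1.0):
    """Maximum over orientations of the mean value along a centred segment."""
    half = path_length / 2.0
    n_samples = int(path_length / step) + 1
    offsets = [-half + i * step for i in range(n_samples)]
    signal = [[0.0] * len(img[0]) for _ in img]
    for r in range(len(img)):
        for c in range(len(img[0])):
            best = -math.inf
            for k in range(n_orientations):
                theta = math.pi * k / n_orientations  # orientations in [0, 180) deg
                dy, dx = math.sin(theta), math.cos(theta)
                mean = sum(sample(img, r + s * dy, c + s * dx) for s in offsets)
                best = max(best, mean / n_samples)
            signal[r][c] = best
    return signal


def apply_thresholds(signal, global_t, local_fraction, radius):
    """Keep pixels above a global threshold and above a fraction of the
    maximum signal in a square neighbourhood of the given radius."""
    rows, cols = len(signal), len(signal[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            local_max = max(
                signal[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            )
            if signal[r][c] >= global_t and signal[r][c] >= local_fraction * local_max:
                out[r][c] = 1
    return out
```

An extended line yields a signal equal to its full amplitude when a path is aligned with it, whereas an isolated speckle is diluted by the path length, which is the separation of creases from point-like noise described above.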

Figure S2: Comparison between the preprocessed curvature map and the linearized version. The denoised curvature map of an entire crumpled sheet with three enlarged insets (a-c) for better visibility. Red and blue are creases retained after the Radon-based denoising, green and orange are the linearized representation.

Appendix S-II In silico generation of flat-folding data

A custom code was written in C++ to simulate flat folding. The code makes use of the Voro++ software library Rycroft (2009), which provides routines for fast manipulation of polygons. To begin, the sheet is represented as a single square. To simulate a simple flat fold on a given chord, the square is cut into two polygons, and one polygon is reflected about the chord. Subsequent flat folds are simulated by taking the collection of polygons representing the folded sheet, cutting them by a given chord, and reflecting those on one side about the chord. Throughout the process, each polygon keeps track of an affine transformation from its current configuration back to its position in the original square sheet. By transforming all polygons back to the original sheet, the flat folding map of valleys and ridges can be constructed. The code can also simulate inward folds where a ray is selected and the sheet is pinched in along this ray. For computational efficiency, the code computes a bounding circle during the folding process, whereby the collection of polygons representing the folded sheet is wholly within the circle.
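The cut-and-reflect step and the bookkeeping of the affine map can be illustrated with a short Python sketch. It is not the C++ code described above and does not use Voro++; it handles only convex polygons and simple folds, ignores layer ordering and inward folds, and all names (`clip`, `fold`, `apply`, `matmul`) are hypothetical. A fold line is written as n·x = d with unit normal n, and each facet stores a matrix A and vector b such that A x + b maps a point of the folded facet back to the original sheet.

```python
IDENTITY = ((1.0, 0.0), (0.0, 1.0))

def apply(M, t, p):
    """Return M p + t for a 2x2 matrix M and 2-vectors t, p."""
    return (M[0][0] * p[0] + M[0][1] * p[1] + t[0],
            M[1][0] * p[0] + M[1][1] * p[1] + t[1])

def matmul(A, B):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def clip(poly, n, d, keep_positive):
    """Clip a convex polygon to the half-plane n.x >= d (or n.x <= d)."""
    sign = 1.0 if keep_positive else -1.0
    out = []
    for i, p in enumerate(poly):
        q = poly[(i + 1) % len(poly)]
        sp = sign * (n[0] * p[0] + n[1] * p[1] - d)
        sq = sign * (n[0] * q[0] + n[1] * q[1] - d)
        if sp >= 0:
            out.append(p)
        if sp * sq < 0:  # the edge p-q crosses the fold line
            t = sp / (sp - sq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out if len(out) >= 3 else []

def fold(facets, n, d):
    """Reflect the part of every facet with n.x > d about the line n.x = d."""
    # Reflection x -> H x + shift, with H = I - 2 n n^T and shift = 2 d n.
    H = ((1 - 2 * n[0] * n[0], -2 * n[0] * n[1]),
         (-2 * n[0] * n[1], 1 - 2 * n[1] * n[1]))
    shift = (2 * d * n[0], 2 * d * n[1])
    result = []
    for poly, A, b in facets:
        stay = clip(poly, n, d, keep_positive=False)
        move = clip(poly, n, d, keep_positive=True)
        if stay:
            result.append((stay, A, b))
        if move:
            # Reversing the vertex order keeps the polygon orientation.
            reflected = [apply(H, shift, p) for p in reversed(move)]
            # A reflection is its own inverse, so the map back to the
            # original sheet is composed with the same reflection.
            result.append((reflected, matmul(A, H), apply(A, b, shift)))
    return result
```

Starting from `[([(0, 0), (1, 0), (1, 1), (0, 1)], IDENTITY, (0.0, 0.0))]` and calling `fold` repeatedly yields a list of facets; mapping each facet's edges back through its stored `A` and `b` gives the crease pattern on the unfolded square.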

While folding along a given chord is strictly well-defined, there is no natural way to draw a random chord from a distribution (cf. Bertrand’s paradox in probability theory), and a choice must be made regarding the way a chord is drawn. Our choice is the following: A fold is determined by a straight line in and therefore can be parameterized by its angle and offset. At each iteration the angle is drawn uniformly in the range radians and the offset uniformly over the bounding circle. If the chosen fold line does not actually create a fold (because the line misses all polygons), a new angle and offset are drawn, and this is repeated until a fold is created. For inward folds, we first choose a point uniformly inside the bounding circle, then determine if the point is inside any polygon; if not, we keep choosing new points until we find one that is. We then choose a random orientation for the ray from this point and two random angles and uniformly from for the first two folded segments that are counter-clockwise from the ray, after which the remaining two angles at the point are given by and .

Our data set was generated by folding the sheet times, where is chosen uniformly in the range from 7 to 15. Each fold has a random sign (ridge or valley) with equal probability. For each sheet, the probability of inward folds was chosen uniformly over the range . Figure S1 shows a selection of generated crease patterns.
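The sampling of simple folds can be sketched as follows. This is an illustration under assumptions rather than the authors' procedure: the angle is drawn from [0, π), the offset is measured from the center of the bounding circle along the fold normal, the range 7 to 15 is treated as inclusive, and inward folds are omitted. The function names are hypothetical.

```python
import math
import random

def line_cuts_polygon(poly, normal, d):
    """True if the line normal.x = d has polygon vertices strictly on both sides."""
    s = [normal[0] * x + normal[1] * y - d for x, y in poly]
    return min(s) < 0.0 < max(s)

def sample_num_folds(rng):
    """Number of folds per sheet, uniform on 7..15 (inclusive, by assumption)."""
    return rng.randint(7, 15)

def sample_fold(polygons, center, radius, rng):
    """Draw fold lines until one actually cuts the folded sheet."""
    while True:
        theta = rng.uniform(0.0, math.pi)          # assumed angle range
        normal = (math.cos(theta), math.sin(theta))
        offset = rng.uniform(-radius, radius)      # offset within the bounding circle
        d = normal[0] * center[0] + normal[1] * center[1] + offset
        if any(line_cuts_polygon(p, normal, d) for p in polygons):
            sign = rng.choice(("ridge", "valley"))  # equal probability
            return normal, d, sign
        # Otherwise the line misses every polygon: reject and redraw.
```

For an unfolded unit square the bounding circle has center (0.5, 0.5) and radius √0.5, so lines near the edge of the circle frequently miss the sheet and are rejected, as described in the text.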

Figure S3: Prediction on a sheet that was crumpled 16 times. The prediction is shown in blue for a given set of valleys (black). The true creases are overlaid in red. The first 16 of the 17 experiments are shown; the 17th experiment is very similar to the 16th. Confusion matrices for 8 of the 16 experiments are shown on the right. The color corresponds to the outline of the matrix.
Figure S4: Additional test results. A The result of approximate differentiation (see text in Sect. S-IV) on flat-fold (A) and crumpling (A’) inputs. Unfortunately, experimentally testing these results or correlating them with other physical quantities proved difficult. B Creases colored by the magnitude of the change caused by their removal (Eq. S1). Cooler colors correspond to weaker change and warmer colors to stronger change. While some trends are clearly discernible (e.g. there is a strong correlation between the change magnitude and the crease length), we are still trying to interpret these results in terms of the underlying physics.

Appendix S-III Prediction on 16 sheets

The validation set (an experiment held out from training) consists of 17 successive crumples of the same sheet of paper. In Fig. S3, we show the prediction on the first sixteen of these sheets. For each prediction of an entire sheet, the image was computed in overlapping patches of size . The value of each pixel was taken to be the average over the sequence of predictions covering it. Preliminary work was done on automatically detecting the best- and worst-predicted regions. This, along with the topics discussed below, is the subject of ongoing work.
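Averaging over overlapping patches can be sketched as below. The patch size and stride used here are placeholder values, `predict_patch` stands in for the trained network, and the function name is hypothetical; the sketch also assumes the image dimensions are compatible with the stride so that every pixel is covered by at least one patch.

```python
import numpy as np

def predict_full_sheet(image, predict_patch, patch=8, stride=4):
    """Tile the image with overlapping patches, predict each patch, and
    average the predictions that cover every pixel."""
    h, w = image.shape
    total = np.zeros((h, w))
    count = np.zeros((h, w))
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            window = image[r:r + patch, c:c + patch]
            total[r:r + patch, c:c + patch] += predict_patch(window)
            count[r:r + patch, c:c + patch] += 1
    # Pixels never covered (count == 0) are left at zero.
    return total / np.maximum(count, 1)
```

With a stride smaller than the patch size, interior pixels are predicted several times in different contexts, and the average smooths out artifacts at patch boundaries.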

Appendix S-IV Probing the network: Ongoing work

Lehman et al. (2018) discuss some computational oddities in the field of computational evolution. They present a series of important ideas through short tales in which the computer produced unexpected behavior that, when understood, was a key step in learning how to successfully use the computational tools. We think examples similar to these are important to share, as the use of machine learning in the experimental sciences is still in its infancy. In this spirit, we discuss some of our attempts to tie the predictions of the network back to the underlying physics of crumpling.

A potential pitfall of using neural networks is that they will provide an output for any input, no matter how absurd either the input or output is. No warning appears. This is powerful, but requires caution, as neural networks allow for predictions on inputs that are physically impossible to create. Thus, one should take all the following probing attempts with a grain of salt.

It is tempting to “differentiate” the input signal to see if perturbations at any particular location cause large changes in the neural network’s prediction. In Fig. S4 A and A’ we do this, perturbing the input (empty space and white lines) by making each pixel slightly more crease-like if it is not a crease or less crease-like if it is a crease. The background color shown is the magnitude of the change relative to the original prediction. Our hope was that this map might correlate with some known aspect of the physics. However, we do not think that this is the case. We tried aligning sequential images and estimating whether new creases tend to form with higher probability in regions that correlate with this sensitivity map, and found no such tendency. We are currently exploring more sophisticated ways of differentiating the trained network, such as those presented by Sundararajan et al. (2017).
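One way to realize this kind of approximate differentiation is a per-pixel finite perturbation, sketched below. Several choices here are assumptions made for illustration: the input is taken to lie in [0, 1] with 1 meaning crease, pixels are perturbed one at a time by a fixed `eps`, the change is measured as a sum of absolute differences, and `model` is any callable mapping an input array to a prediction array.

```python
import numpy as np

def sensitivity_map(x, model, eps=0.1):
    """Change in the prediction caused by nudging each input pixel
    toward the opposite class (crease <-> empty)."""
    x = np.asarray(x, dtype=float)
    base = model(x)
    sens = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        perturbed = x.copy()
        # Non-crease pixels become slightly more crease-like and vice versa.
        perturbed[idx] += eps if x[idx] < 0.5 else -eps
        sens[idx] = np.abs(model(perturbed) - base).sum()
    return sens
```

This requires one forward pass per pixel, which is affordable for small patches but costly for full sheets; gradient-based attribution avoids that cost when the network is differentiable.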

Similarly, we can ask questions such as: What would happen if we translate a particular crease 5 mm to the left? What if we artificially remove parts of folds in flat-folding? What if we remove entire creases? In Fig. S4B, we present the results of removing entire creases from a crumpled sheet. The ridges are colored by their total effect on the prediction, that is, if the original input is and the perturbed input is (after removing a crease), we define the total magnitude of the change as


where runs over all pixels. The hope is that these unphysical perturbations to the input can provide insight into the working of the network. However, as stated above, interpreting them should be done with care, since in these cases the input to the neural net might be too dissimilar to anything in the training data, making predictions less reliable.
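The crease-removal experiment can be sketched as follows. The exact form of Eq. S1 is not reproduced here; the sketch assumes the change is the sum over pixels of the absolute difference between the two predictions. The `labels` array, which assigns an integer identifier to the pixels of each crease (0 for background), and the callable `model` are hypothetical inputs.

```python
import numpy as np

def crease_removal_effect(x, labels, model):
    """Total change in the prediction when each labeled crease is erased."""
    x = np.asarray(x, dtype=float)
    base = model(x)
    effects = {}
    for lab in np.unique(labels):
        if lab == 0:  # background, not a crease
            continue
        perturbed = x.copy()
        perturbed[labels == lab] = 0.0  # remove the entire crease
        effects[int(lab)] = float(np.abs(model(perturbed) - base).sum())
    return effects
```

The returned dictionary can be used directly to color each crease by its effect, as in Fig. S4B.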

Figure S5: Prediction accuracy. A The loss (orange line) of a given reconstruction (bottom) compared to the distribution of losses from all other patches from similarly crumpled sheets. From this data we calculate the z-score of this patch to be 3.6. B Repeating this procedure for all patches, we calculate the distribution of z-scores, giving an average z-score of nearly 3. Three representative patches are shown at their z-score location.

Appendix S-V Another approach to error quantification

It is common to benchmark machine learning prediction accuracies with respect to a suitably defined random guess. For example, in the MNIST digit recognition task, even a blindfolded monkey can achieve 10% accuracy simply because there are only 10 classes to choose from. In our case, however, there exists no generative model for crease networks, so there is no random guess that we can compare the output of our network to. As a surrogate, we can draw a random crease network from our data. That is, we compare our predictions on a given patch to many patches from other, similar experiments. This is presented in Fig. S5: For a given patch, we compute the loss of our prediction (Eq. 1 in the main text) compared to the true value. In Fig. S5A we compare this loss with the distribution of losses obtained by comparing other patches to the true value. Examining hundreds of different predictions, in Fig. S5B, we find that our predictions have an average z-score of nearly 3. The z-score for a patch is defined as where is the loss for this patch and , are, respectively, the mean and standard deviation of all losses calculated from other patches on the same true value.
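The comparison against surrogate patches reduces to a standard score, sketched below. Two points are assumptions: the sign is chosen so that a prediction with a lower loss than the surrogates receives a positive score, and the sample standard deviation is used. The loss values themselves are taken as given numbers.

```python
import statistics

def z_score(prediction_loss, surrogate_losses):
    """Standard score of the prediction loss relative to the losses of
    surrogate patches evaluated against the same true value."""
    mu = statistics.mean(surrogate_losses)
    sigma = statistics.stdev(surrogate_losses)  # sample standard deviation
    # Assumed sign convention: better (lower) loss gives a positive score.
    return (mu - prediction_loss) / sigma
```

A score near 3 then means that the prediction loss lies about three standard deviations below the mean loss of randomly drawn patches.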

Appendix S-VI Code and Data Availability

The code and some data can be found at For more training data, either flat-folding or crumpling, email the authors.


  1. Y. LeCun, Y. Bengio,  and G. Hinton, Nature 521, 436 (2015).
  2. A.-R. Mohamed, G. Dahl,  and G. Hinton, NIPS workshop on deep learning for speech recognition and related applications 1, 39 (2009).
  3. G. Dahl, D. Yu, L. Deng,  and A. Acero, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 4688 (2011).
  4. L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, et al., 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 8604 (2013).
  5. Y. LeCun, L. Bottou, Y. Bengio,  and P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
  6. K. Shameer, K. W. Johnson, B. S. Glicksberg, J. T. Dudley,  and P. P. Sengupta, Heart  (2018).
  7. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra,  and M. Riedmiller, arXiv preprint arXiv:1312.5602  (2013).
  8. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Nature 529, 484 (2016).
  9. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Nature 550, 354 (2017).
  10. J. Heaton, N. Polson,  and J. H. Witte, Applied Stochastic Models in Business and Industry 33, 3 (2017).
  11. S. Joel, P. W. Eastwick,  and E. J. Finkel, Psychological science 28, 1478 (2017).
  12. W. Bhimji, S. A. Farrell, T. Kurth, M. Paganini,  and E. Racah, arXiv preprint arXiv:1711.03573  (2017).
  13. P. Baldi, P. Sadowski,  and D. Whiteson, Nature Communications 5, 4308 (2014).
  14. M. Banerji, O. Lahav, C. J. Lintott, F. B. Abdalla, K. Schawinski, S. P. Bamford, D. Andreescu, P. Murray, M. J. Raddick, A. Slosar, et al., Monthly Notices of the Royal Astronomical Society 406, 342 (2010).
  15. C. Sommer, C. Straehle, U. Koethe,  and F. A. Hamprecht, 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro , 230 (2011).
  16. V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, R. Kim, R. Raman, P. Q. Nelson, J. Mega,  and D. Webster, JAMA 316, 2402 (2016).
  17. J. Carrasquilla and R. G. Melko, Nature Physics 13, 431 (2017).
  18. M. Spellings and S. C. Glotzer, AIChE J. 64, 2198 (2018).
  19. E. D. Cubuk, S. S. Schoenholz, J. M. Rieser, B. D. Malone, J. Rottler, D. J. Durian, E. Kaxiras,  and A. J. Liu, Physical Review Letters 114, 108001 (2015).
  20. D. M. Sussman, S. S. Schoenholz, E. D. Cubuk,  and A. J. Liu, Proceedings of the National Academy of Sciences 114, 10601 (2017).
  21. S. S. Schoenholz, E. D. Cubuk, E. Kaxiras,  and A. J. Liu, Proceedings of the National Academy of Sciences 114, 263 (2017).
  22. M. B. Amar and Y. Pomeau, Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 453, 729 (1997).
  23. T. A. Witten, Reviews of Modern Physics 79, 643 (2007).
  24. H. Aharoni and E. Sharon, Nat. Mater. 9, 993 (2010).
  25. C. A. Andresen, A. Hansen,  and J. Schmittbuhl, Physical Review E 76, 026108 (2007).
  26. O. Gottesman, J. Andrejevic, C. H. Rycroft, and S. M. Rubinstein, arXiv preprint arXiv:1807.00899 (2018).
  27. R. Narain, T. Pfaff,  and J. F. O’Brien, ACM Transactions on Graphics (TOG) 32, 51 (2013).
  28. Q. Guo, X. Han, C. Fu, T. Gast, R. Tamstorf,  and J. Teran, ACM Transactions on Graphics  (2018).
  29. S. J. Pan and Q. Yang, IEEE Transactions on knowledge and data engineering 22, 1345 (2010).
  30. N. Turner, B. Goodwine,  and M. Sen, Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 230, 2345 (2016).
  31. C. H. Rycroft, Chaos 19, 041111 (2009).
  32. V. Badrinarayanan, A. Kendall,  and R. Cipolla, IEEE transactions on pattern analysis and machine intelligence 39, 2481 (2017).
  33. D. Smilkov, N. Thorat, C. Nicholson, E. Reif, F. B. Viégas,  and M. Wattenberg, arXiv preprint arXiv:1611.05469  (2016).
  34. N. Frosst and G. Hinton, arXiv preprint arXiv:1711.09784  (2017).
  35. M. Sundararajan, A. Taly,  and Q. Yan, arXiv preprint arXiv:1703.01365  (2017).
  36. P. Raccuglia, K. C. Elbert, P. D. Adler, C. Falk, M. B. Wenny, A. Mollo, M. Zeller, S. A. Friedler, J. Schrier,  and A. J. Norquist, Nature 533, 73 (2016).
  37. E. Baltz, E. Trask, M. Binderbauer, M. Dikovsky, H. Gota, R. Mendoza, J. Platt,  and P. Riley, Scientific Reports 7, 6425 (2017).
  38. F. Ren, L. Ward, T. Williams, K. J. Laws, C. Wolverton, J. Hattrick-Simpers,  and A. Mehta, Science Advances 4 (2018).
  39. D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980  (2014).
  40.  (2018).
  41. J. Lehman, J. Clune, D. Misevic, C. Adami, J. Beaulieu, P. J. Bentley, S. Bernard, G. Belson, D. M. Bryson, N. Cheney, et al., arXiv preprint arXiv:1803.03453  (2018).