Accelerated Discovery of Sustainable Building Materials

Xiou Ge (University of Illinois at Urbana-Champaign)
Richard T. Goodwin (IBM T.J. Watson Research Center)
Jeremy R. Gregory (Massachusetts Institute of Technology)
Randolph E. Kirchain (Massachusetts Institute of Technology)
Joana Maria (IBM T.J. Watson Research Center)
Lav R. Varshney (University of Illinois at Urbana-Champaign)

Concrete is the most widely used engineered material in the world with more than 10 billion tons produced annually. Unfortunately, with that scale comes a significant burden in terms of energy, water, and release of greenhouse gases and other pollutants. As such, there is interest in creating concrete formulas that minimize this environmental burden, while satisfying engineering performance requirements. Recent advances in artificial intelligence have enabled machines to generate highly plausible artifacts, such as images of realistic looking faces. Semi-supervised generative models allow generation of artifacts with specific, desired characteristics. In this work, we use Conditional Variational Autoencoders (CVAE), a type of semi-supervised generative model, to discover concrete formulas with desired properties. Our model is trained using open data from the UCI Machine Learning Repository joined with environmental impact data computed using a web-based tool. We demonstrate CVAEs can design concrete formulas with lower emissions and natural resource usage while meeting design requirements. To ensure fair comparison between extant and generated formulas, we also train regression models to predict the environmental impacts and strength of discovered formulas. With these results, a construction engineer may create a formula that meets structural needs and best addresses local environmental concerns.


The building sector accounts for a significant proportion of overall energy consumption and pollution worldwide. Concrete, including its primary ingredient, cement, is one of the most energy-intensive and polluting building materials to fabricate. Building environmentally friendly infrastructure and reducing pollution due to rapid urbanization are two of the Sustainable Development Goals (SDGs) to be achieved by 2030. Given a desire for more sustainable development, there is growing interest in discovering concrete formulas that minimize pollution. As an example, using concrete with a low carbon footprint will lead to improvements in SDG Indicator 9.4.1, which tracks the world's progress toward the SDG target by measuring emissions per unit of economic activity.

In the automated material discovery domain, the Materials Project, a main program of the Materials Genome Initiative (Jain et al., 2013), is a web-based platform that provides both open source data sets and data analysis tools for researchers to design novel materials. The platform has a relatively large and comprehensive database and interactive tools for materials such as inorganic compounds, nonporous materials, and electrodes. However, the Materials Project has not been extended to the concrete mixture design domain, which we think is equally important.

To this end, the Cement Sustainability Initiative (CSI) developed the Environmental Product Declaration (EPD) tool to facilitate the generation of sector-specific EPDs for cement and concrete, as well as for clinker, lime, and plaster. An EPD is a voluntary declaration that provides quantitative information about the environmental impact of a product, using life-cycle assessment (LCA) methodology and verified by an independent third party. The cloud-based tool was designed to be easy to use, to facilitate the overall process, and to reduce the cost of preparing cement and concrete EPDs. In this work, we join the output of this tool with data from the open UCI repository.

Furthermore, recent advances in artificial intelligence have enabled machines to generate highly plausible artifacts, such as images of realistic looking faces (Yan et al., 2016), and natural language (Bowman et al., 2015). In this work, we use Conditional Variational Autoencoders (CVAE), a type of semi-supervised generative model, to generate concrete formulas with desired properties. We demonstrate CVAEs can design concrete formulas with lower emissions and natural resource usage while meeting design requirements. To ensure fair comparison between extant and generated formulas, we also train regression models to predict the environmental impacts and strength of generated formulas. With these results, a construction engineer may create a formula that meets structural needs and best addresses local environmental concerns.

The rest of this paper is organized as follows. We first survey past work on applying data science to scientific discovery. We then describe the data set and the details of the CVAE model. Next we give results, first showing the average percentage reduction in environmental impact achieved by the better-performing generated concrete formulas. We then show strength spectrum plots in the 3D environmental impact space, which could be turned into a visualization tool for concrete designers. Lastly, we evaluate how well the trained model performs strength-conditioned generation.

Related Work

Data science has been applied to scientific discovery for some time but is now gaining popularity. Within materials discovery, the Discovery through Eigenbasis Modeling of Uninteresting Data (DEMUD) algorithm proposed by Wagstaff et al. (2013) guides scientific discovery by prioritizing the most informative data points for investigation and backing up each selection with an explanation of its novelty, using dimensionality reduction techniques. Varshney (2018) proposes using Bayesian surprise (Itti and Baldi, 2006) as an objective for selecting the most interesting data point for investigation. Balachandran et al. (2016) use a regressor-selector pair to locate the desired material in the fewest iterations by alternating between exploration and exploitation strategies.

Recently, deep generative models have been applied to the discovery of materials and molecules. Gómez-Bombarelli et al. (2016) use variational autoencoders (VAEs) based on recurrent neural networks (RNNs) for chemical design, in which molecules are encoded as strings. Rampasek et al. (2017) use VAEs to improve the accuracy of drug response predictions. Moreover, semi-supervised generative models allow generation of artifacts with specific, desired characteristics (Yan et al., 2016). We transfer this idea to the concrete formulation generation task to make the synthesis controllable.

Comprehensive physics-based models that predict concrete performance from formulas have been elusive for a century (Mehta and Mehta, 1986). Recently, discriminative machine learning models have been applied to predict the compressive strength of concrete and have demonstrated good performance (Chou et al., 2014). However, the environmental impacts of concrete have never been considered in predictive machine learning models. Moreover, the success of deep generative models in generating realistic visual and audio data has given hope that other artifacts, such as concrete formulas, can be generated as well. Here we develop a novel generative machine learning model, a form of computational creativity or accelerated discovery (Besold, Schorlemmer, and Smaill, 2015), with the capability to design environmentally friendly concrete that may help in meeting sustainable development targets.


Figure 1: CVAE Model Structure

Data set

We train our model using the Concrete Compressive Strength Data Set (Yeh, 1998) openly available from the UCI Machine Learning Repository. It has 1,030 training examples, with seven continuous features describing the amount of constituent material such as cement, aggregates, and water. Compressive strength, after a particular curing time (age), of each concrete formula is also given. In addition, we use the CSI EPD tool to estimate the environmental impact of each concrete formula. The EPD tool produces 12 continuous features characterizing the concrete environmental impact. Among these, we largely focus on global warming potential (GWP), acidification potential (AP), and concrete batching water (CBW) consumption.

Generative Model

Our model is based on a variant of the VAE (Kingma and Welling, 2013) called the CVAE (Sohn, Lee, and Yan, 2015), as shown in Fig. 1. Like other generative models, the goal is to estimate the data distribution and to generate realistic new samples from that distribution (Doersch, 2016). What distinguishes the CVAE from the VAE is that, instead of merely generating realistic samples at random from the data distribution $p(x)$, we generate from the conditional distribution $p(x \mid y)$, which gives us control over the underlying properties of the generated data by conditioning on different values of $y$.

We interpret the variables in the conditional generative model as follows: $y$ represents the side information of a formula, including the strength, age, and environmental impacts; $x$ represents the constituent material amounts of a formula; and $z$ is the latent variable. Like the VAE, a CVAE consists of an encoder $q_\phi(z \mid x, y)$ that maps data points to latent codes and a decoder $p_\theta(x \mid z, y)$ that reconstructs data points from latent codes. The decoder and encoder are implemented as neural networks, where $\theta$ and $\phi$ are the respective network parameters. Since the goal is to generate realistic concrete samples with desired properties, we want to maximize the log-likelihood of the data distribution model, $\log p_\theta(x \mid y)$. Since the data distribution and the posterior distribution $p_\theta(z \mid x, y)$ are both intractable, we maximize the Evidence Lower Bound (ELBO), $\mathcal{L}_{\mathrm{ELBO}}(x, y; \theta, \phi)$, instead. The loss function of the CVAE is the negative ELBO:

$$\mathcal{L}_{\mathrm{CVAE}}(x, y; \theta, \phi) = -\mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p_\theta(x \mid z, y)\right] + D_{\mathrm{KL}}\left(q_\phi(z \mid x, y) \,\|\, p(z)\right)$$
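As a rough illustration of the two terms in a CVAE loss of this kind, the sketch below computes a single-sample negative ELBO in plain Python. It assumes a diagonal Gaussian encoder output (`mu`, `logvar`), a standard normal prior, and a squared-error reconstruction term standing in for the negative log-likelihood; the paper does not state which reconstruction loss was used, and the function names and sample values are hypothetical.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def negative_elbo(x, x_recon, mu, logvar):
    """Single-sample loss: reconstruction error plus KL regularizer.

    Assumption: squared error stands in for -log p(x | z, y).
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + kl_to_standard_normal(mu, logvar)


# Hypothetical scaled formula (seven constituents) and its reconstruction.
x = [0.30, 0.50, 0.20, 0.40, 0.10, 0.70, 0.60]
x_recon = [0.32, 0.48, 0.25, 0.40, 0.12, 0.66, 0.60]
loss = negative_elbo(x, x_recon, mu=[0.1, -0.2], logvar=[-0.1, 0.05])
print(f"negative ELBO (sketch): {loss:.4f}")
```

The KL term vanishes only when the encoder output matches the prior exactly, which is what pulls the latent codes toward a region from which new formulas can later be sampled.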


Implementation Details

In our model, the encoder network consists of four fully-connected layers: 25 neurons on the first layer, 20 neurons on the second layer, followed by two parallel layers with two neurons each, which represent the mean and log variance respectively. The prior $p(z)$ is set to be an isotropic Gaussian distribution with zero mean and unit variance, $\mathcal{N}(0, I)$. The reparameterization trick is used to make the sampling step differentiable and enable backpropagation for training. The decoder network consists of two fully-connected layers with 20 neurons on the first layer and 25 neurons on the second layer. ReLU activation functions are applied to all layers except the output layer of the decoder, where we use a sigmoid activation since we scale our data to $[0, 1]$. The model is trained end to end with the Adam optimizer, with a learning rate of 0.001 and a batch size of 10.
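The reparameterization step and the data scaling mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-dimensional latent code matches the description, but the function names, the example `mu` and `logvar` values, and the cement amounts are hypothetical, and min-max scaling is assumed as the way the data is mapped onto the unit interval.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(logvar / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def min_max_scale(column):
    """Map a column of raw values onto the unit interval (assumed scaling)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


rng = random.Random(0)
mu, logvar = [0.5, -1.0], [-2.0, 0.0]  # hypothetical encoder outputs
samples = [reparameterize(mu, logvar, rng) for _ in range(20000)]

# Empirical mean and standard deviation of the first latent dimension.
mean_z0 = sum(s[0] for s in samples) / len(samples)
std_z0 = math.sqrt(sum((s[0] - mean_z0) ** 2 for s in samples) / len(samples))

cement_scaled = min_max_scale([102.0, 540.0, 281.0])  # hypothetical kg amounts
```

Because the randomness enters only through `eps`, the sample is a deterministic function of `mu` and `logvar`, so gradients can flow through both during training.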

Figure 2: Generating new concrete formulas and evaluating their properties

Property Predictors

We also trained neural network-based regression models, as shown in Fig. 2, on the data set described above to predict the environmental impact and strength of concrete formulas. Since the compressive strength depends on the age of the concrete, we trained a separate compressive strength predictor for each age group. The purpose of the predictors is twofold. First, we can measure how well the properties of generated samples match the desired properties given as conditioning variables during data generation. Second, we can make fair comparisons between extant and generated concrete formulas in terms of environmental impact. We experimented with three types of regression models: linear regression, decision tree regression, and neural network regression. Although linear regression achieves performance comparable to decision tree regression and neural network regression, it often predicts values far out of range for newly generated concrete formulas. Neural network regression performs slightly better than decision tree regression, and therefore we use the former for prediction tasks. The performance of the neural network regression models for global warming potential (GWP), acidification potential (AP), and concrete batching water (CBW) consumption is shown in Table 1. The performance of the strength predictors is shown in Table 2.
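For reference, the three metrics reported for the predictors (MAE, RMSE, and the coefficient of determination) can be computed as in the following sketch. The strength values are made up for illustration and are not taken from the data set.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted compressive strengths (MPa).
measured = [30.0, 40.0, 50.0, 60.0]
predicted = [32.0, 38.0, 50.0, 64.0]
print(mae(measured, predicted), rmse(measured, predicted), r2(measured, predicted))
```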

Metric   GWP (kg CO2 eq.)   AP (kg SO2 eq.)   CBW (m³)
MAE      7.187              0.019             0.003
RMSE     9.374              0.040             0.006
R²       0.979              0.974             0.881
Table 1: Environmental Impacts Predictor Performance

Predictor Performance (MPa), by concrete age in days
Metric   3       7       14      28      56      90
MAE      2.985   3.850   3.378   6.015   5.093   4.457
RMSE     0.222   0.201   0.163   0.227   0.124   0.125
R²       0.819   0.870   0.703   0.679   0.795   0.789
Table 2: Strength Predictor Performance


Generating environmental impact reducing concrete formulas

                     Average Reduction (%)
Age     Strength     GWP             AP              CBW
(day)   (MPa)        (kg CO2 eq.)    (kg SO2 eq.)    (m³)
3       30 ± 1        0.80            1.83            5.47
        40 ± 1        7.74            1.59            0.26
7       30 ± 1       19.69            3.94            7.58
        40 ± 1       25.45           11.33            5.03
14      20 ± 1        2.20            5.72           10.64
        60 ± 1       42.45           21.09            5.17
28      70 ± 1       21.62            6.66            3.32
        80 ± 1       27.44            8.40            4.15
56      40 ± 1        4.38            2.95            7.04
        50 ± 1       14.38            3.23            3.64
        70 ± 1       30.26           23.75            1.32
        80 ± 1        5.88            1.33            3.46
90      80 ± 1       30.58            6.91            4.11
Table 3: Average environmental impact reduction achieved by better performing generated samples

Constituent Material Amount (kg per m³)
                      Strength (MPa)
Material              30 ± 1    40 ± 1
Cement                186.4     259.0
Blast Furnace Slag    236.7     288.6
Fly Ash               107.1      58.8
Water                 142.3     142.5
Superplasticizer       22.3      26.1
Coarse Aggregate      901.4     868.6
Fine Aggregate        717.2     763.0
Table 4: Sample concrete formula with reduced environmental impact

Figure 3: Approximated hull of generated samples from archetypal analysis, training samples, and all generated samples for specific curing time and strength level. Panels: (a) curing time = 7 days, strength = 30 ± 1 MPa; (b) curing time = 7 days, strength = 40 ± 1 MPa.

To demonstrate that the generative algorithm discovers new concrete formulas with reduced environmental impacts, we compared the GWP, AP, and CBW values between extant concrete formulas and generated formulas with the same age and similar strength. For each concrete age group, we generate 60,000 concrete formulas. Both the strength and the environmental impact inputs to the generator are produced by randomly sampling from the standard uniform distribution whereas the latent code input is produced by randomly sampling from the standard bivariate normal distribution. We then use the trained environmental impact predictor and strength predictor for the corresponding age group to evaluate environmental impact and strength of the newly generated formulas. We count the number of generated samples having lower environmental impact than the best observed values for extant samples in all 3 dimensions. We also measured the average percentage reduction in environmental impact for the better-performing samples as compared to extant samples.
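The comparison described above can be sketched as follows. This is one reading of the procedure, not the authors' code: it assumes the "best observed value" is the per-dimension minimum over extant formulas of the same age and similar strength, and that the percentage reduction is measured relative to that best value. All numbers and function names are hypothetical.

```python
def better_performing(generated, extant):
    """Return the per-dimension best extant impacts and the generated samples
    that beat them in all three dimensions (GWP, AP, CBW)."""
    best = [min(row[i] for row in extant) for i in range(3)]
    winners = [g for g in generated if all(g[i] < best[i] for i in range(3))]
    return best, winners


def average_percent_reduction(best, winners):
    """Mean percentage reduction of the winners relative to the best extant value."""
    if not winners:
        return None
    return [
        100.0 * sum((best[i] - w[i]) / best[i] for w in winners) / len(winners)
        for i in range(3)
    ]


# Hypothetical predicted impacts as (GWP, AP, CBW) tuples.
extant = [(300.0, 0.80, 0.17), (280.0, 0.90, 0.16), (320.0, 0.75, 0.18)]
generated = [
    (252.0, 0.60, 0.152),   # lower in all three dimensions
    (290.0, 0.50, 0.150),   # GWP too high
    (266.0, 0.72, 0.160),   # CBW ties the best value, so it is not strictly lower
    (224.0, 0.675, 0.144),  # lower in all three dimensions
]

best, winners = better_performing(generated, extant)
reduction = average_percent_reduction(best, winners)
```

Requiring a strict improvement in every dimension is a conservative filter; a Pareto-style comparison would admit more candidates but would no longer guarantee that each one beats every extant formula on every impact.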

Results shown in Table 3 indicate that there is significant opportunity to reduce environmental impact. We constructed an approximated convex hull that encloses a majority of the better-performing points in the 3D space, as shown in Fig. 3. From the diagram we can also see that there is an opportunity to trade off different types of environmental impact. In Table 4, we show one specific generated concrete formula that is the nearest neighbor to one of the extremal points used to construct the convex hull, for strengths of 30 ± 1 MPa and 40 ± 1 MPa respectively.

Visualization for concrete designers

On top of the 3D environmental impact design space that we mentioned earlier, we also color each data point based on the predicted strength of the corresponding formula. Fig. 4 shows the strength spectrum of the newly generated concrete formulas plotted in the environmental impact space for each concrete curing time group. These plots could serve as a visualization tool for the concrete designers to quickly select newly generated formulas that meet the design requirements.

Figure 4: Strength spectrum of generated concrete formulas for different concrete curing times plotted in the 3D environmental impact space, where color indicates strength. Panels: (a) 3 days, (b) 7 days, (c) 14 days, (d) 28 days, (e) 56 days, (f) 90 days.

Strength-conditioned progression generation

Figure 5: Strength-conditioned progression for different concrete curing times. Panels: (a) 3 days, (b) 7 days, (c) 14 days, (d) 28 days, (e) 56 days, (f) 90 days.

Attribute-conditioned image progression was explored by Sohn, Lee, and Yan (2015). In that experiment, the value of one attribute dimension, such as gender, facial expression, or hair color, is modified by interpolating between the minimum and maximum attribute values, i.e., $y_\alpha = (1 - \alpha)\, y_{\min} + \alpha\, y_{\max}$, where $\alpha \in [0, 1]$. Indeed, one can see that the attributes of the generated images change progressively as the conditioning attribute value changes.

To further demonstrate that our concrete generator can produce concrete designs with desired properties, we perform similar experiments. For the purpose of illustration, we limit our conditioning variables to the strength and curing time of the concrete. We again generate 10,000 samples for each curing time group, by uniformly sampling the desired strength over its scaled range $[0, 1]$. Fig. 5 shows how well the predicted strength of the generated formulations matches the desired strength given as the conditioning variable during generation. The performance varies across curing time groups, and RMSE is computed to evaluate it quantitatively. For a well-performing model, the cloud of scattered points should follow the diagonal line. The results suggest that the generator works best for a concrete curing time of 7 days.
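The evaluation loop for strength-conditioned generation can be sketched as below: sweep the desired strength across its range, then compare it with the strength predicted for the corresponding generated formulas. The interpolation form and the `predicted` values are assumptions made for illustration; in the actual pipeline the predictions would come from the strength predictor applied to the generator output.

```python
import math


def interpolate(y_min, y_max, steps):
    """Return `steps` values (1 - a) * y_min + a * y_max for a evenly spaced in [0, 1]."""
    alphas = (i / (steps - 1) for i in range(steps))
    return [(1 - a) * y_min + a * y_max for a in alphas]


def rmse(desired, predicted):
    """Root mean squared gap between desired and predicted strength."""
    return math.sqrt(sum((d - p) ** 2 for d, p in zip(desired, predicted)) / len(desired))


# Desired strengths on the scaled [0, 1] range.
desired = interpolate(0.0, 1.0, 5)

# Hypothetical predictor outputs for the formulas generated at each desired value.
predicted = [0.05, 0.20, 0.55, 0.70, 0.95]

score = rmse(desired, predicted)
```

A score near zero corresponds to points lying on the diagonal of a desired-versus-predicted scatter plot.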


We have demonstrated that, with data obtained from an open source database and cloud-based tools, we are able to train a CVAE model and discover new concrete formulations with reduced environmental impact. However, there is still room for improving our model. Our data contains both continuous and categorical values, and the CVAE may not be the best choice for capturing such mixed data. The VAE-ROC model proposed by Suh and Choi (2016) is said to be better at handling mixed data. We hope that by modifying the CVAE model in line with the specifics of the VAE-ROC, the generator will synthesize more realistic concrete designs and achieve better performance in attribute-conditioned generation. Experimental verification, by actually making the newly discovered concrete formulations, also remains to be done.


  • Balachandran et al. (2016) Balachandran, P. V.; Xue, D.; Theiler, J.; Hogden, J.; and Lookman, T. 2016. Adaptive strategies for materials design using uncertainties. Scientific Rep. 6(19660).
  • Besold, Schorlemmer, and Smaill (2015) Besold, T. R.; Schorlemmer, M.; and Smaill, A. 2015. Computational Creativity Research: Towards Creative Machines. Springer.
  • Bowman et al. (2015) Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Józefowicz, R.; and Bengio, S. 2015. Generating sentences from a continuous space. arXiv:1511.06349 [cs.LG].
  • Chou et al. (2014) Chou, J.-S.; Tsai, C.-F.; Pham, A.-D.; and Lu, Y.-H. 2014. Machine learning in concrete strength simulations: Multi-nation data analytics. Construction and Building Materials 73:771–780.
  • Doersch (2016) Doersch, C. 2016. Tutorial on Variational Autoencoders. arXiv:1606.05908 [stat.ML].
  • Gómez-Bombarelli et al. (2016) Gómez-Bombarelli, R.; Duvenaud, D. K.; Hernández-Lobato, J. M.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2016. Automatic chemical design using a data-driven continuous representation of molecules. arXiv:1610.02415 [cs.LG].
  • Itti and Baldi (2006) Itti, L., and Baldi, P. 2006. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems 18. 547–554.
  • Jain et al. (2013) Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; and Persson, K. A. 2013. The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1(1):011002.
  • Kingma and Welling (2013) Kingma, D. P., and Welling, M. 2013. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML].
  • Mehta and Mehta (1986) Mehta, P., and Mehta, P. 1986. Concrete: Structure, Properties, and Materials. Prentice-Hall.
  • Rampasek et al. (2017) Rampasek, L.; Hidru, D.; Smirnov, P.; Haibe-Kains, B.; and Goldenberg, A. 2017. Dr.VAE: Drug Response Variational Autoencoder. arXiv:1706.08203 [stat.ML].
  • Sohn, Lee, and Yan (2015) Sohn, K.; Lee, H.; and Yan, X. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28. 3483–3491.
  • Suh and Choi (2016) Suh, S., and Choi, S. 2016. Gaussian Copula Variational Autoencoders for Mixed Data. arXiv:1604.04960 [stat.ML].
  • Varshney (2018) Varshney, L. R. 2018. Dimensions, Bits, and Wows in Accelerating Materials Discovery. Cham: Springer International Publishing. 1–14.
  • Wagstaff et al. (2013) Wagstaff, K. L.; Lanza, N. L.; Thompson, D. R.; Dietterich, T. G.; and Gilmore, M. S. 2013. Guiding scientific discovery with explanations using DEMUD. In Proc. 27th AAAI Conf. Artif. Intell., 905–911. AAAI Press.
  • Yan et al. (2016) Yan, X.; Yang, J.; Sohn, K.; and Lee, H. 2016. Attribute2image: Conditional image generation from visual attributes. In Computer Vision – ECCV 2016, 776–791. Cham: Springer International Publishing.
  • Yeh (1998) Yeh, I.-C. 1998. Modeling of strength of high-performance concrete using artificial neural networks. Cem. Concr. Res. 28(12):1797–1808.