Exploring helical dynamos with machine learning
Abstract
We use ensemble machine learning algorithms to study the evolution of magnetic fields in forced magnetohydrodynamic (MHD) turbulence that is helically forced. Using mean field formalism, we model the electromotive force (EMF) both as a linear and nonlinear function of the mean magnetic field and current density. The form of the EMF is determined using regularized linear regression and random forests. We also compare various analytical models to the data using Bayesian inference with Markov Chain Monte Carlo (MCMC) sampling. Our results demonstrate that linear regression is largely successful at predicting the EMF and the use of more sophisticated algorithms (random forests, MCMC) do not lead to significant improvement in the fits. We conclude that the data we are looking at is effectively low dimensional and essentially linear. Finally, to encourage further exploration by the community, we provide all of our simulation data and analysis scripts as open source IPython notebooks.
F. Nauman and J. Nättilä]Farrukh Nauman^{1}^{1}1Email: naumanf@chalmers.se,
Joonas Nättilä^{2}^{2}2Email: joonas.nattila@su.se
Department of Space, Earth and Environment, Chalmers University, SE41296 Gothenburg, Sweden.
Nordita, KTH Royal Institute of Technology and Stockholm University, Roslagstullsbacken 23, SE106 91 Stockholm, Sweden
1 Introduction
Large scale magnetic fields are observed on the Earth, the Sun, galaxies. Understanding the origin and sustenance of these magnetic fields is the subject of dynamo theory. Magnetic fields observed in the galaxies are thought to have their origin in the early phases of cosmological evolution [1]. On relatively smaller scales, magnetic fields likely play a significant role in star [2] and planet formation [3]. For the Sun, a long standing question has been to understand the origin of the cylical behavior of the magnetic North and South poles [4].
A number of outstanding questions remain about dynamo theory. Classical models of dynamo theory [5, 6] assume that the background flow is stationary and the magnetic field is too weak to feedback on the flow. This reduces the problem of magnetic field generation to a closure problem: how does the fluctuating electromotive force (EMF) depend on the mean field? Previous work on dynamo theory has often modeled the EMF as a linear function of the mean field (see [7] for an extensive review). This approximation is reasonable in the growth (kinematic) phase in a flowdriven magnetohydrodynamic (MHD) system where the initial magnetic energy is orders of magnitude smaller than the kinetic energy of the forced fluid. But the limitations become apparent when the magnetic fields have gained enough energy to modify the background flow through the Lorentz force. To model this backreaction of the mean magnetic field on the flow, various proposals have been put forward including models where EMF is a nonlinear function of the mean field [8, 9, 10].
Two methods have been widely used in numerical simulations of dynamos to get an explicit form for the electromotive force and both are essentially linear: the test field method [11] and linear regression using the least squares method [12]. The kinematic test field method uses a set of orthogonal ‘test’ fields that are predetermined and can be used to compute the various coefficients in the EMF expansion corresponding to terms linear in the large scale field and its gradients. As input, the test field only requires the small scale velocity field and dynamically solves (on the fly) the test field equations to compute the test field coefficients. This has the advantage of retaining dynamical information [13, 14]. Linear regression, in contrast, is mostly used in postprocessing to model the ‘inverse’ problem of determining the coefficients corresponding to different terms in the linear expansion of the EMF.
Machine learning and deep learning have revolutionized image processing, object detection, natural language processing [15]. Recently there has been a surge of interest in using machine and deep learning tools in dynamical systems and computational fluid dynamics [16, 17, 18]. Datadriven modeling is not new to fluid dynamics; in meteorology, observational data has been used for decades to guide weather predictions (see [19] for a historical overview). Considerable increase in computational power with advances in hardware (GPUs) has allowed modern optimization methods to become increasingly tractable. The end goal is to not only build predictive datadriven models but also to understand underlying complex physics that cannot be captured by simple linear regression.
The motivation to study reduced descriptions of the full 3D MHD simulations comes primarily from the extremely large, and currently numerically intractable, magnetic Reynolds number of most astrophysical systems (, [7]). Reducedorder modeling of dynamical systems is currently a highly active field of research that aims to provide description of physical systems in terms of the minimal possible degrees of freedom [20, 17]. In this paper, we attempt to build reducedorder models for MHD dynamos by applying nonlinear regression tools to the widely studied dynamo [7]. In this context, machine learning methods are particularly advantageous as they value generalization. This is usually tested by separating the data into distinct training and validation sets. A high score on the training set but a poor score on validation set indicates poor generalization. This is in contrast to standard linear regression techniques employed in the dynamo literature where the entire data set is taken to be the training set with no validation set to test overfitting [12].
We use data from Direct Numerical Simulations (DNS) of forced helical turbulence to construct a reduced model where the EMF is an unknown (linear and nonlinear) function of the large scale field. This allows us to quantitatively assess the standard assumption of modeling EMF as linearly proportional to the mean magnetic field and current density. We employ three classes of machine learning models: (i) Linear regression, (ii) Random forests, (iii) Bayesian inference (with Markov Chain Monte Carlo model minimization). Our selection of a wellstudied systems like forced helical MHD turbulence supports our second main motivation: we can compare our results from machine learning tools against relatively wellknown results in the literature. The insights gained from such a comparison could then help guide future work on more complex MHD turbulent systems such as MHD turbulence with differential rotation [21, 22].
We describe our data and numerical simulations in the next section. This is followed by a section that begins by explaining the fitting methods considered in this paper. We then apply these machine learning and statistical methods on the data. We discuss our results in section 4 putting out work in the context of previous work and some caveats. Section 5 is conclusions. All of our analysis scripts and data are freely available as interactive IPython notebooks.^{3}^{3}3https://github.com/fnauman/ML_alpha2
2 Data
We describe the equations, setup and code that was used to conduct DNS of forced turbulence in magnetized flows.
2.1 Description of DNS
We use the publicly available 6th order finite difference pencilcode[23]^{4}^{4}4http://github.com/pencilcode. The MHD equations are solved in a triply periodic cubic domain of size :
(1)  
(2)  
(3) 
where is the viscosity, , is the forcing function, is the resistivity, and is the current density.
The forcing function is taken to be homogeneous and isotropic with explicit control over the fractional helicity determined by the parameter. It is defined as:
(4) 
where the wavevectors and the phase are random at each time step. The wavevectors are chosen within a band to specify a range of forcing centered around . The normalization factor is defined such that the forcing term matches the physical dimensions of the other terms in the Navier Stokes equation, , where is the forcing amplitude, is the sound speed, is the time step. The Fourier space forcing has the form [24]:
(5) 
where is a measure of the helicity of the forcing; for positive maximum helicity, , and .
Name  

R5e2  500  1.68  4.97  0.0337 
R1e3  1000  4.44  9.94  0.0446 
R2e3  2000  10.31  19.88  0.052 
R3e3  3000  16.71  30.12  0.055 
R4e3  4000  22.64  39.76  0.057 
R5e3  5000  28.72  49.70  0.058 
R6e3  6000  34.61  59.63  0.058 
R7e3  7000  40.41  69.58  0.058 
R8e3  8000  46.22  79.52  0.058 
R9e3  9000  50.16  89.55  0.056 
R1e4*  10000  55.79  99.40  0.056 
R15e4*  15000  82.12  149.1  0.055 
The initial velocity and density are set to zero while the magnetic field is initialized through a vector potential with a small amplitude () Gaussian noise. We define the units [25]:
(6) 
such that the magnetic field, , is in the units of Alfv́en speed (). The domain size is , which leads to . Moreover, we use the following dimensionless numbers:
(7) 
where is the forcing wavemode. We provide a summary of the simulations in Table 1.
2.2 Reduced model
Numerical simulations of large scale astrophysical flows with magnetic Reynolds number well in the excess of are prohibitive. Moreover, stateoftheart simulations requires O() CPU hours to compute for moderate Reynolds numbers. To make modeling such systems more tractable, it is therefore necessary to look for simpler surrogate models that can capture the most significant properties of the high fidelity simulations (DNS).
Large scale dynamo theory concerns itself with the evolution equation of the filtered induction equation,
(8) 
where EMF is given as , is the turbulent resistivity, and represents a filtered quantity. The filter could be ensemble, temporal or spatial depending on the problem. The first term on the righthand side is typically ignored unless there is a strong nonconstant mean velocity field such as in the case of galaxies (the mean differentially rotating flow for the galactic dynamo is considered in [26]).
In keeping with previous work that has used horizontal averaging as the filter in equation 8 [25, 7], our data has been averaged over the whole plane:
(9) 
This averaging scheme has the interesting property that since (due to incompressibility and periodic boundary conditions). We only consider the fields reduced to 1D by furthermore averaging them in two different ways:

averaged: where , are chosen such that they do not cover the entire zdomain since that will correspond to volume averaged quantities that are zero.

averaged: where , are chosen to correspond to either the kinematic or saturation regime for different runs.
For horizontally averaged (plane) fields with periodic boundary conditions, due to incompressibility. This leaves two independent magnetic fields and corresponding EMF components. Analytical models that use are termed dynamos and have been extensively studied in the literature. In principle, a filter that retains independent components of the data while using will be termed dynamo model.
In Fig. 1, we show a threedimensional phasespace visualization of vs. vs. that shows the relation between the EMF and the mean magnetic fields (situation is the same when considering too). The circle in plane represents the fact that the total magnetic energy . The slight tilt with respect to the EMF (i.e., towards direction) is a result of the linear correlation with the EMF. The increasing disorder with increasing is characteristic of highly turbulent systems.
3 Model fits
We describe the 1D linear and nonlinear regression for temporal and vertical profiles in the following subsections. In the following, we will refer to the magnetic fields and EMFs as both ‘fields’ and ‘features’, ‘target’ interchangeably. For most of the following, we will focus on but similar results hold for since we are looking at isotropically forced MHD turbulence with no background field.
3.1 Description of algorithms considered in this paper
Ordinary Least Squares (OLS) assumes that the data is linear, features/terms are independent, variance for each point is roughly constant (homoscedasticity), features/terms do not interact with one another. Moreover, causal inference based on linear regression can be misleading [27], a problem shared by all curve fitting methods. We should highlight the distinction between linearity of the regression as opposed to linearity of the features. For example, one can construct a nonlinear basis (example: or ) and do linear regression on it. Linear regression on nonlinear data can potentially represent nonlinear data well assuming that the nonlinear basis terms are independent and have low variance. Constructing the correct interacting/nonlinear terms requires domain expertise. For this reason, it is important to also consider nonlinear regression methods such as random forests that are robust to noise and are capable of modeling interactions and nonlinearities.
3.1.1 Lasso
Most real data have considerable variance that can lead to misleading fits from linear regression. Regularized linear regression aims to construct models that are more robust to outliers. Two of the most common regularization schemes for linear regression are [28]: (i) LASSO (Least Absolute Shrinkage and Selection Operator; also known simply as L1 norm), (ii) Ridge (also known as L2 norm). As opposed to ridge, LASSO has the advantage of doing feature selection by reducing coefficients of insignificant features. We will use LASSO regression in this work.
Intuitively LASSO reduces variance at the cost of introducing more bias. LASSO works better than ridge regression if the true model is indeed low dimensional (or the coefficient matrix is sparse). For dense data that requires high dimensional complex models, neither ridge nor LASSO do well. The “Bet on sparsity” principle [28] suggests trying LASSO for all datasets since if the data turn out to be dense, it is unlikely that any regularization scheme will do well. Formally, LASSO optimizes for:
(10) 
where represents the target vector (in our case, ), is the feature matrix ( are the columns), is coefficient vector and is a parameter with a range between and . The subscripts ‘1’ and ‘2’ represent L1 () and L2 () norms, respectively. A higher penalizes outliers more strongly and sets coefficients of unimportant features to zero while a lower is similar to linear regression with no regularization (OLS).
3.1.2 Random forests
Random forests belong to the class of ensemble machine learning algorithms [29, 28]. A random forest consists of a large number of decision trees that each take a subset of features with a bootstrapped sample of the data [30]. Moreover, each decision tree only picks a random subset of features to make decisions. This helps in getting rid of strong correlations among different trees but some correlations might still remain. Random forests are one of the most popular and widely used machine learning algorithms because they require little finetuning and can handle large noisy data sets. From a computational point of view, each decision tree in the ensemble is independent and can thus be easily processed in parallel. Moreover, because of the weighted average over the ensemble, random forests are robust to strong variance in the data but the use of only a few deep decision trees can lead to bias.
Ensembles of decision trees (random forests and gradient boosting) are nonparametric models meaning they do not require information about the underlying data distribution. In other words, unlike in linear regression where the shape of the function that connects the independent variables to the dependent variable is fixed, the functional form is not known in random forests and is instead determined through training. This offers a different perspective on modeling, termed ‘algorithmic’ modeling as opposed to ‘data’ modeling [31]. The nonparametric nature of random forests and gradient boosting make them applicable to a wide class of problems since in most realworld situations, data distribution is not known apriori. However, this comes at the cost of interpretability making ensembles of decision trees harder to understand than linear regression [32].
Feature importance and selection: As opposed to linear regression, random forests do not directly measure the coefficient of the features. A single decision tree is highly interpretable but a whole ensemble of decision trees is harder to interpret [31]. Random forests offer some insight through feature importances [28]. We use the “mean decrease impurity” method implemented in python library scikitlearn [33] that is based on [34, 35]. With this method, the feature importance is computed using an average over all trees based on the reduction of error corresponding to a particular feature weighted by the fraction of samples in that particular node. One popular alternative is to randomly permute the data for a particular feature and see whether it has an effect on the mean squared error (or information gain/gini importance for classification tasks). If the mean squared error increases or remains the same, it implies that the particular feature is not important. This is termed as ‘permutation’ importance [29, 32].
3.1.3 Bayesian inference
Bayesian inference is a powerful statistical framework that presents an alternative way to perform the model parameter estimation. The main strengths of Bayesianbased approaches are their ability to quantify the uncertainty of the model and the possibility to incorporate previous apriori knowledge into the estimation. The downside, when compared to many conventional machine learning methods, is that the model needs to be known before the fit.
In general, let us present our model with and our empirical (simulation) data with . We are then interested in how well can our model describe the data, or quantitatively what is the probability that our model is valid given the data, . By directly comparing our model and the data, what we actually get is, however, the probability of our model given the data, known as the likelihood . Our original question of the validity of our model, can be answered by applying the Bayes’ theorem presented as (see e.g., [36])
(11) 
where is our prior probability of the model and is the prior probability. Therefore, because can be computed and is known a priori, we can use the Bayes’ theorem to quantitatively asses the validity of our dynamo models given the simulation data results.
In practice, not only are we interested in the validity of the model, but also what are the parameters of the model, , that best describe the data. In Bayesian inference, these model parameters are determined using a marginal estimation where the resulting posterior parameter for model parameter are obtained by integrating over the probability of all the other model parameters, (). Then, this (onedimensional) distribution represents the probability that the th model parameter, , will take a particular value given our data .
Here we solve the Eq. (11) (and therefore obtain also posterior distributions for our model parameters) using Monte Carlo Markov Chain (MCMC) integration techniques. To perform the MCMC fit we use the publicly available emcee library [37]. We employ the affineinvariant stretchmove ensemble sampler with typically number of walkers (members of ensemble), where is the number of fit parameters. No prior knowledge is incorporated into the fits and so we use uniform priors for all of our model parameters.
3.2 Vertical profiles
In this subsection, we apply various methods on time averaged data that only has vertical dependence ( grid points).
In Fig. 2, we show the linear (Pearson) correlation coefficients among the various fields. The correlation between the current densities and the magnetic fields is nearly indicating they are linearly dependent. Strong linear correlation does not rule out the physical importance of a variable  it only implies that from a curve fitting point of view, it has redundant information. Since we are using helical forcing leading to a helical state, . At high , one might expect that the power in the velocity and magnetic field spectrum will be distributed across multiple modes implying weaker correlations. However, Fig. 2 (right) indicates the correlation between and is still perfect. This means that we can eliminate current density as an independent variable. Nonetheless, in keeping with previous work, for linear basis we do consider the current density as an independent variable but eliminate it for the nonlinear (polynomial) basis where its inclusion will lead to several redundant terms.
We consider two sets of basis in this subsection:

Linear: , , , .

Polynomial of : , , , , , .
We note that for kinetically forced simulations, , which means that the total energy conservation equation yields, in the saturated state. This makes and linearly dependent terms and thus we do not consider them separately. Another motivation to use is because work on dynamo quenching often considers , which when expanded for the leads to cubic terms of the form ().
We will be using the publicly available python library scikitlearn (see [38], [39] for introductions) for modeling the data.
Previous work on dynamo theory has often relied on linear regression methods [12, 40] with some effort to incorporate nonlinear effects through magnetic helicity conservation ([9, 41]; see also nonlinear test field method [42]). In this section, we use linear regression with a regularization term for both the linear and polynomial basis discussed previously.
In Fig. 3, we show the fits using random forests regression and LASSO. Because our data is small (only points), we use twofold cross validation to determine various hyperparameters of LASSO and random forests. We found that bootstrapping leads to spurious results so we turned it off. A similar problem was found with random train/test splits: the spatial structure of the fields is like one full sinusoidal wave and randomly picking a fraction of the points leads to strange patterns. Moreover, while deep trees can lead to overfitting, in our case we found that a maximum depth of was optimal through 2fold cross validation. The end results are still not as good as LASSO perhaps owing to the fact that the training set is only entries ( of the data)  this is too low for an algorithm like random forest to really excel. That is the key lesson we have learnt while modeling this dataset throughout: “small” datasets mean simpler algorithms will outperform more sophisticated algorithms like random forests.
Feature Ranking
Feature  OLS  Lasso  Stability  RanFor  MIC  Corr  mean  

0.0076  0.0048  0.0  0.0049  0.2  0.6  0.0000  0.1019  0.1149  
0.6345  0.2577  0.0  1.0000  0.8  1.0  1.0000  0.9972  0.8362  
0.0871  0.0240  0.0  0.0000  0.6  0.0  0.3979  0.1210  0.1538  
1.0000  1.0000  1.0  0.4404  1.0  0.8  1.0000  0.9982  0.7793  
0.0818  0.0000  0.0  0.0000  0.4  0.4  0.4048  0.1179  0.1756  
0.0000  0.0145  0.0  0.0001  0.0  0.2  0.2089  0.1366  0.0700 
In Table 2, we compare the coefficients from OLS, LASSO, randomized LASSO (stability), Recursive Feature Elimination (RFE) from OLS and from random forests, Mutual Information Criterion (MIC), Pearson correlation (Corr) and the average of all these predictions in the last column. The meaning of these numbers is different. For OLS and LASSO, the numbers corresponding to the fields represent coefficients from linear regression and can be negative or positive. For random forests, the numbers represent the relative feature importance based on ‘mean decrease impurity’ that is only positive. These feature importance numbers are scaled to sum to unity. Stability selection is a general term that refers to checking robustness of model predictions by testing it on different random (bootstrapped) subsets of data with different hyperparameter values. In our particular application, we are using randomized LASSO that randomly varies the the LASSO coefficient [43] ^{5}^{5}5influenced by: https://blog.datadive.net/selectinggoodfeaturespartivstabilityselectionrfeandeverythingsidebyside/.
RFE is a modelagnostic (wrapper) feature selection method that starts with a number of features and removes the most important features recursively. This is an example of ‘backward’ feature selection [44, 28] and we used it with OLS and random forests in this case. Maximal Information Coefficient (MIC) is a measure of mutual information between two (random) variables and is capable of accounting for nonlinear relationships that the Pearson correlation coefficient cannot (example: correlation between and will be nearly zero, but MIC will yield an answer close to one). We point out that we scaled the whole set of features using standard scaling before calculating coefficients/feature importances. Moreover, because LASSO, OLS, Corr returned negative as well as positive values and RFE returned values ranging from to , we used minmax scaling to rescale all values between .
Ensembles of machine learning models can lead to improved accuracy [45]. Here we take the simplest possible approach: we take the unweighted average of various predictors to compute the overall score for each feature. While stability selection shows that is the dominant feature, the mean of all predictors gives a slightly higher score. This could be because both and have the same sinusoidal form.
3.3 Time series analysis: Vertically averaged data
In contrast to the previous section, we square and then average the fields vertically and study the temporal dependence. The magnetic fields and the EMF show exponential growth in time and thus it was important to do some preprocessing where we took the logarithm of squared averaged fields. Because of this particular preprocessing, we do not consider polynomial expansions like we did before and instead focus on either the linear form, or the quenched form .
3.3.1 Preprocessing
For the vertically averaged data set, we find that the fluctuations in the EMF are quite significant (see Fig. 4). Such large fluctuations make any kind of statistical analysis quite prohibitive. In this section, we therefore decided to use the log of averaged over the direction. Moreover, to reduce the fluctuations we used a moving window of size (right panel of Fig. 4). This allows for cleaner fits.
3.3.2 LASSO and random forest
In Fig. 5, we show fits for the time series using LASSO and random forests. Time series data is particularly difficult to fit in the presence of fluctuations and trend as in the figure of . In the kinematic phase (left panel), the random forest fit is not considerably different from LASSO fit whereas the random forest fit does poorly when the same time series is extended up to the saturation phase (right panel). This is a relatively common issue with random forests that they do not perform well when there is a “trend” present in the series. Just like the spatial profiles in last subsection, the LASSO outperforms random forests.
3.3.3 Bayesian inference
In this section we focus on reconstructing square of the “true” EMF field as a function of time . As one of the simplest forms possible, we construct the reconstructed EMF, , from the corresponding magnetic field components as
(12)  
(13) 
where and are the reconstructed and components of the EMF field. Here and are our model parameters to optimize. This EMF form has the nice property that in the limit it reduces to the linear model.
After this, we are left with the freedom to define our likelihood or scorefunction that describes the quality of the reconstruction. Here we define it via a simple residual function
(14) 
that penalizes the fit from over and underestimating the reconstructed EMF. Note that other forms of the residual functions are also possible. For discrete simulation data results (defined on an array) the likelihood then simplifies to , where the summation index is taken to be over the time dimension or the direction, depending on the type of fit we are considering. Finally, when employing the MCMC fit note that we are actually minimizing , as is typically done, to get a more shallow and slowly varying scorefunction.
Examples of the fit results are shown in Fig. 6 for three different values of , , and depicting the maximum a posteriori model. The (quenched) EMF model does reproduce the qualitative behavior of the saturating field but especially for higher the fits become more worse. Most importantly, we have found that constraints on the saturation field strength are always very loose. Results without using the quenching form also lead to similar fits. This gives an indirect support for the previous machine learning results: The data strongly favors isotropic with a simplified model of .
3.4 Results summary
For both the vertical and temporal profiles, various models suggest that is a good fit. But if , this will lead to exponential growth without any saturation. Is there an inconsistency here? In the induction equation, the fields at time ‘’ are related to the fields at earlier times. The machine learning models we use here are not capable of storing long term memory. The fits shown in this paper are comparing and in some nearest neighbor sense and because both and are growing exponentially in time, they appear to be linearly correlated. Capturing the dynamical information requires more sophisticated algorithms that we leave for future work.
4 Discussion
4.1 Caveats
Property  OLS  LASSO  RF  Bayesian 

Nonlinear  No  No  Yes  Yes 
Interpretable  Yes  Yes  Yes^{6}^{6}6Feature importances and partial dependence plots can help interpret random forests  Yes 
Suitable for all sizes of data?  Yes  Yes  Intermediate to big data  Small to intermediate data 
Interactions  No  No^{7}^{7}7Including cross terms like can help model interactions but it still assumes each feature is linearly independent  Yes  Yes 
Prediction of trends  No  No  No  Yes 
We summarize the properties of the different machine learning algorithms considered in this work in Table 3. Linear regression (both OLS and LASSO) are by definition capable of dealing with linear data only. However, they can handle some nonlinear properties using a nonlinear basis (for example, polynomial regression). The biggest advantage of linear models is the interpretability: the sensitivity of the target variable (EMF) with respect to the features () is as easy as determining the coefficients of these terms. Linear models are also good at dealing with data of all sizes but matrix inverses are expensive for large data. Linear models might also be able to model trends in data if the features they consider have the same trend as the target variable.
Random forests are based on ensembles of decision trees. While random forests are quite robust to outliers and missing data, the feature importances computed from random forests can be misleading [32]. Moreoever, random forests are not as interpretable as linear regression or single decision trees. A typical random forest fit can involve trees implying that interpretation is difficult. But computation of feature importances and partial dependence plots help in making sense of random forest results [31, 28]. Random forests typically bootstrap data and use a random subset of features for each decision tree to determine which one leads to the lowest error. These two sources of randomness imply that random forests are best suited for intermediate to big data. Indeed, in our work we find that bootstrap had to be turned off and the best fit models used less than decision trees.
Bayesian methods offer an interesting complementary toolkit to various machine learning techniques. They also enable us to properly take the noisy nature of DNSs into account. Here the big caveat is, that the model(s) needs to be known beforehand. This, however, further arguments in favour of our hybrid approach applied in this paper. We have used machine learning techniques to detect important features and then applied our physical knowledge to build valid models out from those. Biggest caveat here is that Bayesian model optimization becomes more and more computationally demanding with multidimensional (and often multimodal) data. This puts a limitation on the reallife usability of the method when dealing with increasingly complex data.
The data considered in this paper is either spatially or temporally averaged reducing to entries. Big data typically refers to the case where the observations/rows are . “Small” data faces problems of overfitting and is more likely to be sensitive to outliers. Because of these issues, simpler models like linear regression with regularization tend to do well. It is, therefore, no surprise that LASSO has outperformed random forests. For larger datasets with irregular nonlinear patterns, random forests and gradient boosting tend to do much better [46].
4.2 Comparison with previous work
We used the simplest model: isotropic with no nonlocal effects in either space and/or time. While earlier work has focused primarily on linear models of dynamos where using linear regression and the test field method [7], nonlinear models have also been considered [42]. We assumed isotropy, which could be a misleading assumption in the presence of shear (see [12, 47]). Furthermore, in our model, we did not incorporate nonlocal effects that were considered by [12, 13, 14]. Combining machine learning algorithms with anisotropic, local, nonlinear models might lead to new insights, and we leave this for future work.
The dynamo considered in this work is an idealized system but has the advantage that due to maximally helical forcing and periodic boundary conditions, magnetic helicity is both conserved and is gauge invariant (see [41]). This simplistic setup allows developing analytical theory to explain the origin and saturation of magnetic fields [9, 48]. Numerical results seem to be consistent with theory [25, 49]. In this work, we considered existing analytical formulations and used machine learning tools to test whether the EMF is linear or nonlinear in the mean magnetic fields. Machine learning models can complement analytical theory in looking for reduced descriptions of high dimensional systems [18].
5 Conclusions
Our main result can be summarized as follows: because we are in the “small” data regime and we are dealing with ordered data (high signal to noise ratio due to helically forced turbulence), regularized linear regression (LASSO) provides the best fit. For small organized data that we have considered in this work, random train/test splits, bootstrapping and other sources of randomness do not help. For this particular dataset, many of the features are strongly correlated with one another complicating the model fits even further. Strong correlations also imply that feature engineering to construct polynomial basis is not particularly useful.
What we intended to demonstrate using the example of is that sophisticated machine learning algorithms can be used to analyze data from DNS of MHD turbulence thanks to the publicly available machine learning libraries. This provides an exciting opportunity to revisit some longstanding problems in astrophysical flows where linear regression has dominated modeling efforts. For example, one can look at the full data cube without planar averaging and study whether 3D localized structures play a dominant role in the sustenance of the turbulence. Here, deep learning [50] will be a useful alternative as it is known to outperform classical machine learning methods in image processing, voice recognition, natural language processing. One particular problem of interest is sheardriven dynamos [51, 52].
acknowledgments
We thank A. Brandenburg for several stimulating discussions and comments on an earlier draft of the paper. The work has been performed under the Project HPCEUROPA3 (INFRAIA20161730897), with the support of the EC Research Innovation Action under the H2020 Programme; in particular, the author gratefully acknowledges the support of Dhrubaditya Mitra (NORDITA, Stockholm) and the computer resources and technical support provided by PDC, Stockholm. We used the following python packages: numpy [53], jupyter notebook [54], matplotlib [55], pandas [56], scikitlearn [33], emcee [37], minepy [57].
References
References
 [1] Subramanian K 2019 Galaxies 7 47 (Preprint 1903.03744)
 [2] Wurster J and Li Z Y 2018 Frontiers in Astronomy and Space Sciences 5 39 ISSN 2296987X URL https://www.frontiersin.org/article/10.3389/fspas.2018.00039
 [3] Ercolano B and Pascucci I 2017 Royal Society Open Science 4 170114 (Preprint https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.170114) URL https://royalsocietypublishing.org/doi/abs/10.1098/rsos.170114
 [4] Brun A S and Browning M K 2017 Living Reviews in Solar Physics 14 4 ISSN 16144961 URL https://doi.org/10.1007/s4111601700078
 [5] Krause F and Steenbeck M 1967 Zeitschrift Naturforschung Teil A 22 671
 [6] Krause F and Raedler K H 1980 Meanfield magnetohydrodynamics and dynamo theory
 [7] Brandenburg A and Subramanian K 2005 Phys. Rep. 417 1–209 (Preprint astroph/0405052)
 [8] Jepps S A 1975 Journal of Fluid Mechanics 67 625–646
 [9] Pouquet A, Frisch U and Leorat J 1976 Journal of Fluid Mechanics 77 321–354
 [10] Vainshtein S I and Cattaneo F 1992 ApJ 393 165–171
 [11] Schrinner M, Rädler K H, Schmitt D, Rheinhardt M and Christensen U R 2007 Geophysical and Astrophysical Fluid Dynamics 101 81–116 (Preprint astroph/0609752)
 [12] Brandenburg A and Sokoloff D 2002 Geophysical and Astrophysical Fluid Dynamics 96 319–344 (Preprint astroph/0111568)
 [13] Hubbard A and Brandenburg A 2009 ApJ 706 712–726 (Preprint 0811.2561)
 [14] Rheinhardt M and Brandenburg A 2012 Astronomische Nachrichten 333 71–77 (Preprint 1110.2891)
 [15] LeCun Y, Bengio Y and Hinton G E 2015 Nature 521 436–444 URL https://doi.org/10.1038/nature14539
 [16] Hennigh O 2017 ArXiv eprints (Preprint 1705.09036)
 [17] Kutz J N 2017 Journal of Fluid Mechanics 814 1â4
 [18] Brunton S L and Kutz J N 2019 DataDriven Science and Engineering: Machine Learning, Dynamical Systems, and Control (Cambridge University Press)
 [19] Navon I M 2009 Data Assimilation for Numerical Weather Prediction: A Review (Berlin, Heidelberg: Springer Berlin Heidelberg) pp 21–65 ISBN 9783540710561 URL https://doi.org/10.1007/9783540710561_2
 [20] Kutz J N 2013 DataDriven Modeling & Scientific Computation: Methods for Complex Systems & Big Data (New York, NY, USA: Oxford University Press, Inc.) ISBN 0199660344, 9780199660346
 [21] Blackman E G and Brandenburg A 2002 ApJ 579 359–373 (Preprint astroph/0204497)
 [22] Charbonneau P 2014 Annual Review of Astronomy and Astrophysics 52 251–290 (Preprint https://doi.org/10.1146/annurevastro081913040012) URL https://doi.org/10.1146/annurevastro081913040012
 [23] Brandenburg A and Dobler W 2002 Computer Physics Communications 147 471–475 (Preprint astroph/0111569)
 [24] Haugen N E, Brandenburg A and Dobler W 2004 Phys. Rev. E 70 016308 (Preprint astroph/0307059)
 [25] Brandenburg A 2001 ApJ 550 824–840 (Preprint astroph/0006186)
 [26] Shukurov A, Sokoloff D, Subramanian K and Brandenburg A 2006 A&A 448 L33–L36 (Preprint astroph/0512592)
 [27] Bollen K A and Pearl J 2013 Eight Myths About Causality and Structural Equation Models (Dordrecht: Springer Netherlands) pp 301–328 ISBN 9789400760943 URL https://doi.org/10.1007/9789400760943_15
 [28] Hastie T, Tibshirani R and Friedman J 2009 The Elements of Statistical Learning (Springer, New York)
 [29] Breiman L 2001 Machine Learning 45 5–32 ISSN 15730565 URL https://doi.org/10.1023/A:1010933404324
 [30] Raschka S 2018 CoRR abs/1811.12808 (Preprint 1811.12808) URL http://arxiv.org/abs/1811.12808
 [31] Breiman L 2001 Statist. Sci. 16 199–231 URL https://doi.org/10.1214/ss/1009213726
 [32] Strobl C, Boulesteix A L, Kneib T, Augustin T and Zeileis A 2008 BMC Bioinformatics 9 307 ISSN 14712105 URL https://doi.org/10.1186/147121059307
 [33] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M and Duchesnay E 2011 Journal of Machine Learning Research 12 2825–2830
 [34] Breiman L, Friedman J, Stone C J and Olshen R A 1984 Classification and Regression Trees (Chapman and Hall/CRC) ISBN 9780412048418
 [35] Louppe G 2014 Understanding Random Forests: From Theory to Practice Ph.D. thesis University of Liege, Belgium arXiv:1407.7502
 [36] Grinstead C and Snell J 1997 Introduction to Probability 2nd ed (Providence, RI: American Mathematical Society) ISBN 9780821807491 URL http://books.google.fi/books?id=14oq4uWGCkwC
 [37] ForemanMackey D, Hogg D W, Lang D and Goodman J 2013 PASP 125 306 (Preprint 1202.3665)
 [38] Guido S and Müller A C 2017 Introduction to Machine Learning with Python (O’Reilly Media) ISBN 9781491962282
 [39] Géron A 2017 HandsOn Machine Learning with ScikitLearn and TensorFlow (O’Reilly Media) ISBN 9781491962282
 [40] Brandenburg A 2018 ArXiv eprints arXiv:1801.05384 (Preprint 1801.05384)
 [41] Blackman E G 2015 Space Sci. Rev. 188 59–91 (Preprint 1402.0933)
 [42] Rheinhardt M and Brandenburg A 2010 A&A 520 A28 (Preprint 1004.0689)
 [43] Meinshausen N and BÃ¼hlmann P 2010 Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 417–473 (Preprint https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.14679868.2010.00740.x) URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.14679868.2010.00740.x
 [44] Guyon I and Elisseeff A 2003 J. Mach. Learn. Res. 3 1157–1182 ISSN 15324435 URL http://dl.acm.org/citation.cfm?id=944919.944968
 [45] Wolpert D H 1992 Neural Networks 5 241 – 259 ISSN 08936080 URL http://www.sciencedirect.com/science/article/pii/S0893608005800231
 [46] Olson R S, Cava W L, Mustahsan Z, Varik A and Moore J H 2017 Datadriven advice for applying machine learning to bioinformatics problems Biocomputing 2018 (WORLD SCIENTIFIC) URL https://doi.org/10.1142%2F9789813235533_0018
 [47] Karak B B, Rheinhardt M, Brandenburg A, Käpylä P J and Käpylä M J 2014 ApJ 795 16 (Preprint 1406.4521)
 [48] Blackman E G and Field G B 2002 Phys. Rev. E 89 265007 (Preprint astroph/0207435)
 [49] Brandenburg A, Sokoloff D and Subramanian K 2012 Space Sci. Rev. 169 123–157 (Preprint 1203.6195)
 [50] Lusch B, Kutz J N and Brunton S L 2018 Nature Communications 9 URL https://doi.org/10.1038/s41467018072100
 [51] Tobias S M and Cattaneo F 2013 Nature 497 463–465
 [52] Nauman F and Blackman E G 2014 MNRAS 441 1855–1860 (Preprint 1403.4288)
 [53] Walt S v d, Colbert S C and Varoquaux G 2011 Computing in Science & Engineering 13 22–30 (Preprint https://aip.scitation.org/doi/pdf/10.1109/MCSE.2011.37) URL https://aip.scitation.org/doi/abs/10.1109/MCSE.2011.37
 [54] Kluyver T, RaganKelley B, Pérez F, Granger B E, Bussonnier M, Frederic J, Kelley K, Hamrick J B, Grout J, Corlay S, Ivanov P, Avila D, Abdalla S, Willing C and et al 2016 Jupyter notebooks  a publishing format for reproducible computational workflows ELPUB
 [55] Hunter J D 2007 Computing in Science & Engineering 9 90–95 (Preprint https://aip.scitation.org/doi/pdf/10.1109/MCSE.2007.55) URL https://aip.scitation.org/doi/abs/10.1109/MCSE.2007.55
 [56] McKinney W 2010 Data structures for statistical computing in python Proceedings of the 9th Python in Science Conference ed van der Walt S and Millman J pp 51 – 56
 [57] Albanese D, Filosi M, Visintainer R, Riccadonna S, Jurman G and Furlanello C 2012 Bioinformatics 29 407–408 ISSN 13674803 (Preprint http://oup.prod.sis.lan/bioinformatics/articlepdf/29/3/407/17105876/bts707.pdf) URL https://doi.org/10.1093/bioinformatics/bts707