Active image restoration
We study active restoration of noise-corrupted images generated via the Gibbs probability of an Ising ferromagnet in external magnetic field. Ferromagnetism accounts for the prior expectation of data smoothness, i.e. a positive correlation between neighbouring pixels (Ising spins), while the magnetic field refers to the bias. The restoration is actively supervised by requesting the true values of certain pixels after a noisy observation. This additional information improves restoration of other pixels. The optimal strategy of active inference is not known for realistic (two-dimensional) images. We determine this strategy for the mean-field version of the model and show that it amounts to supervising the values of spins (pixels) that do not agree with the sign of the average magnetization. The strategy leads to a transparent analytical expression for the minimal Bayesian risk, and shows that there is a maximal number of pixels beyond of which the supervision is useless. We show numerically that this strategy applies for two-dimensional images away from the critical regime. Within this regime the strategy is outperformed by its local (adaptive) version, which supervises pixels that do not agree with their Bayesian estimate. We show on transparent examples how active supervising can be essential in recovering noise-corrupted images and advocate for a wider usage of active methods in image restoration.
Many inference problems in machine learning amount to restoration of a hidden structure based on noisy observations. They are similar (sometimes isomorphic) to models of equilibrium statistical physics, where the role of noise is played by frozen disorder. In particular, the problem of noise-corrupted image restoration can be mapped to the two-dimensional Ising ferromagnet geman (); besag (); scot (); tanaka (); cohen (); nishimori (); nishi (). The Gibbs probability of this model serves as a prior probability for images. Spins (two-value variables equal to ) of the Ising model refer to the black-white pixels of a digital image. The ferromagnetic feature of the model means that neighbouring pixels are correlated, a legitimate minimal assumption on the prior probability.
Methods of solving inference problems cannot perform without prior information. Since its amount is limited, it is natural to study informed strategies for requesting prior information. In particular, prior information can be requested on the ground of the previous functioning of the inference method. This notion of active inference review_active () emerged in statistics wald () and has been applied to various inference problems including Hidden Markov Models Anderson2005 (); Hoffman2001 (); ag_active (), network inference Zhu2003 (); Bilgic2009 (); Bilgic2010 (); Moore2011 (), economics, experiment optimization etc review_active (). In the context of noisy image restoration, the concept of active inference assumes that there is a costly noiseless channel by which the correct values for some of pixels can be communicated directly to the image restorator. In practice, this noiseless channel may mean an additional precise (but costly) measurement of pixels. Since the channels is costly, only a small fraction (say ) of original pixels is communicated with a hope that—if employed properly—they can lead to a sizable improvement in restoration of other (noisy) pixels. The subject of active inference has two related issues: which prior information is to be requested and to what extent this information improves the method performance review_active (). The first question refers to the meaning of (prior) information, the second one to its value.
Most existing theoretical results on active inference in graphical models are concerned with sub-modular cost functions that allow to establish certain optimality guarantees Krause2005_UAI (); Krause2009_JAIR (). The sub-modularity assumption does not hold in several interesting situations, e.g. for systems with long-range correlations and scale-invariance, a range of phenomena that is collectively designated as critical behavior. The clearest example of such a behavior is the second-order phase transitions in statistical mechanics nishi (). It was found that natural images (e.g. nature scenes) are organized in a scale-invariant way ruderman () and do have long-range correlations bialek (). Statistical mechanics (e.g. the Ising model) is the most appropriate way of describing such images bialek (). Generally, the optimal strategy of active inference—i.e. which variables are to be supervised given a noisy observation—is not known beyond a direct enumeration Krause2009_JAIR ().
Here we shall study an active recognition (estimation or filtering) of a noise-corrupt image, which was generated via the two-dimensional Ising ferromagnet. The model does allow for a critical regime depending on the value of the ferromagnetic coupling constant, which is taken as the known hyperparameter of the model. For a large number of pixels, this critical regime is realized via a second-order phase-transition, with magnetization being an order parameter.
We analytically determined the optimal active supervising strategy of the Ising model in the mean-field limit. This mean-field limit of the Ising model was already studied for a Gaussian noise image-restoration problem nishi (); nishimori (). More generally, mean-field methods are widely applied in statistical inference problems jung (); forbes (); inoue (); geiger (); mitter (); see opper () for a book presentation. The inferred image is determined by minimizing the Bayesian risk robert (). The optimal mean-field strategy amounts to supervising only those pixels whose observed (noisy) value does not agree with the sign of the magnetization. It also shows that there is a maximal number of supervised variables, beyond of which the supervising is not meaningful. Without active supervising, the optimal inference of the mean-field model shows only two (in a sense trivial) regimes, viz. observation-dominated (where the prior information is not needed) and prior-dominated (where observations are redundant). With active supervising, there is a non-trivial coupling between observations and the prior knowledge, i.e. no separation into two different regimes is possible.
The mean-field solution of the actively supervised model is transparent and guides the understanding of a more complex, two-dimensional situation which is studied numerically. We saw that the mean-field optimal active supervising strategy performs well in the two-dimensional case, if the model is away from the critical regime. Otherwise, we found a strategy that is superior to the mean-field optimal one. This strategy amounts to comparing the observed (noisy) value of pixel with its Bayesian estimate: only those pixels are supervised for which these values disagree. We show on concrete examples how usual (not active) methods fail to restore noise-corrupted images, while the active supervising does lead to a satisfactory restoration.
This text is organized as follows. Section II recalls the general theory of optimal Bayesian inference with and without supervising. Here we also recall how this theory can be represented via Ising models. Next section finds out the optimal active supervising strategy in the mean-field version of the Ising model. Section IV studies numerically the two-dimensional Ising model. It shows that several results deduced in the mean-field situation also apply in this more realistic case. Section IV also works out a local version of that optimal supervising strategy and shows that it yields better results in the critical regime of the model, where the mean-field method does not apply. Section V studies the Ising model, where the prior probability has a small bias (in the value of spins) given by a weak external magnetic field. Note that sections IV and V (numerical results on a square Ising lattice) can be read independently from III (analytic results in the mean-field limit).
We summarize in the last section. In preparation for future research, Appendix studies hyperparameter estimation of the mean-field Ising model. The analysis here is deeper than normally done in literature, because we study the influence of non-identifiability on the optimal restoration.
Ii General formulation of image restoration and representation via Ising models
ii.1 Without supervising
Here we recall general ideas of the Bayesian inference for two-valued random variables with and without supervising. Next sections will specify them for the image restoration problem.
Let there are two dependent vector random variables and . Now is called the hidden variable, since it cannot be observed directly. Instead, we assume that a concrete value of is observed. On the ground of we want to find a representative value (an estimate) of that is likely to be the source of . We shall assume that , and call them spins (pixels). The joint probability is ; e.g. the conditional probability describes a noisy channel, where is the noise-corrupted observation of .
The quality of the estimate is given by the average Hamming distance (or overlap):
so that a larger means a better restoration. We stress that the averaging over in (1) is done “by hands”: since the hidden variables are not known, one generates them via (for fixed observations) and then takes the average. Eq. (1) coincides with the inverse Bayesian risk robert (). The latter is minimized, while (1) is maximized over .
In applications one also frequently studies the average of over . This averaging can also be done by hands, i.e. over sufficiently many applications, but it is important to stress that if is sufficiently large (and is an ergodic process), does self-average, i.e. according to the law of large numbers the sum (1) over a large number is replaced by its average .
Maximizing (1) over leads to the optimal method () with the largest overlap for given observations :
Eqs. (3) can serve, e.g. for evaluating the maximum-likelihood method (ML). Recall that the ML does not employ the prior probability , and is based on maximizing for given observations :
For completeness we mention an equivalent approach to introducing the overlap (1). Let us temporarily assume that we know the original realization of the random variable . Once it passes through the noisy channel, we obtain , where is the noise variable and means elemetwise multiplication of two vectors (Hadamard product). Now the overlap can be defined as
The joint distribution of and is .
To make independent from the concrete choice of and we look at the average
which brings us back to the averaged (3). However, there is no direct relation between and .
ii.2 With supervising
Given observations one requests an additional information on certain (supervising) so as to improve the quality of restoration for the remaining variables. This is described by the vector
where () means that the true value of the corresponding is requested (not requested). Naturally, the number of supervised spins is fixed (i.e. not all spins are supervised)
The supervising strategy is described by conditional probability , i.e. the supervising is generally probabilistic. Let us divide into supervised () and not supervised () parts. and depend on , but this dependence is not indicated explicitly to avoid excessive notations. E.g. if and , then and . Let and be (respectively) the joint probability of non-supervised and supervised spins and observations. These probabilities are found from .
Let the values of were requested and send via a noiseless channel (we explain below the meaning of this assumption). They appeared to be . Together with and , also is by now given. Hence the distribution of the non-supervised spins is conditioned both by observations and by supervised variables .
We estimate non-supervised spins as . The corresponding overlap reads:
Now if , and are sufficiently long, will self-average over .
The major problem of supervising is to find the best , which holds (8) and provides the largest average overlap (11). For understanding this problem one should rely on models, because there is no general solution for it under (for a small value of one can proceed with straightforward calculations).
Let us now explain the practical meaning of the above assumption on the existence of a noiseless channel. We note that frequently such channels do exist (e.g. because they relate to a sufficiently precise equipment), but they are costly, i.e. the cost per pixel (or spin) for using such a channel is high. In that case it is useful to employ the noisy channel for pixels, but still to use the noiseless channel for actively supervised pixels, where .
ii.3 Ising model
where is a constant hyperparameter (coupling constant), and is the Hamiltonian of the Ising ferromagnet, where means summation over neighbours of a lattice. In the context of the image recognition problem this is a square lattice. Here refer to black-white pixels of the original image, and the ferromagnetic coupling means that there is a prior information on positive pixel-pixel correlations. This is just the standard smoothness assumption.
We assume that the observed variables (noise-corrupted image) relate to via a symmetric and independent noise
where relates to the error probability , and where we recall that .
Now refers to the Gibbs distribution of a statistical system at temperature and with Hamiltonian . We shall assume that and are known hyperparameters. More generally, they are not known, but should be inferred from data; see berg () for a recent review. Standard methods for doing that are discussed in Appendix together with their limitations.
Iii Mean-field analysis
iii.1 The fully coupled (mean-field) Ising model
where we also assumed to show that we have to have for . The first sum in (16) goes through all under . Eq. (16) refers to the mean-field-interaction version of the random-field Ising model. It is known to be solvable nishi (); nishimori ().
For the latter integral is taken by the saddle-point method:
where the magnetization is determined from the saddle-point equation as
Now formally is a function of , but for this model (and in the limit ) it self-averages and becomes (almost) independent from . This known fact can be confirmed via calculating correlation functions between and . Thus we return to (17) employ there the saddle-point method, use (20) and end up with
where is to be determined from (20) via self-averaging:
One can verify from obtained via (17) that coincides with the average collective spin of the original image: .
For , (23) predicts . For , (23) predicts two solutions with and . They refer to two different ergodic components. Thus we have a second-order phase transition at . It belongs to the mean-field universality class.
and the self-averaged optimal overlap reads from (3)
Hence the message of (24, 25) is that there are only two extreme situations: for —which means a weak noise according to (14)—the prior information is irrelevant, since observations are reliable. Hence the optimal estimation method coincides with the maximum-likelihood; see (4, 13). For (strong noise, as seen from (14)) observations are irrelevant, since the estimate (24) depends only on the parameter of the prior . Put differently, either the prior information (given by in ) is irrelevant, or observations are irrelevant.
iii.2.1 The optimal overlap after supervising
We recall from section II.2 that supervising is described by the conditional probability of non-supervised spins given the values of supervised spins and observations , as well as by the conditional probability of the coordinates of supervised spins. Now (21, 22) show that for the present model the probabilities and factorize. This implies that does not depend on . Using again (21, 22) we get from (11):
where is found from (22), and where we assumed that (the probability of , given the observations ) is symmetric in the sense that all do have the same marginal-conditional probability: . The constraint (8) will be implemented in average. Hence holds two constraints:
The meaning of (28) is intuitively clear: the optimal supervising amounts to changing—from to —the (effective) distribution of observations, where is the probability of not supervising a given spin. The same idea can be implemented as an anzatz for more general models, but there it does not have to be optimal.
iii.2.2 Random supervising provides no advantage
iii.2.3 Active supervising
We turn to the active supervising and maximize given by (28) over under constraints (29, 30). To understand the idea of maximization, assume . Then , and (28) maximizes upon taking such that is maximally large and compatible with (29), and after that taking as large as constraints (29, 30) still allow. The general solution is written via introducing
which is equal to if . Then the probabilities maximizing (28) read
When starts to increase from , (35) monotonously increases from its value given by (25) till its maximal value . According to the non-supervised maximal overlap (25), for , but (i.e. the noise is weak, as seen from (14)) the prior is not relevant. This is improved after supervising, now the prior is always relevant provided that . It is not meaningful to supervise for , since having reached , the overlap there does not anymore increase with increasing ; see (35). Hence the values of and in (33)—as well as and in (34)—are redundant. Then (33, 34) show that observations that agree with the sign of the average magnetization should not be supervised: .
So far we assumed that the hyperparameters and in (respectively) and are known precisely. Generally, this is not the case, and hyperparameters themselves are to be found from data; see berg () for a recent review. Appendix discusses this problem for the considered model.
Iv Numerical results for a square lattice
iv.1 Two methods for active supervising
The above results concerned the mean-field model, i.e. an unrealistic situation if we take into account that real images are two-dimensional. Hence we turn to studying numerically the Ising model (15) on a square lattice with periodic boundary conditions. We start from (9) and (15), and study two different strategies of active supervising. In the first (global) strategy we supervise spins selecting them via the following criterion: given the observation vector , the supervised spins are chosen randomly among those that hold
i.e. those spins are supervised which do not agree with the sign of the collective value . Thus variables in (7) are determined via (36). This strategy is clearly inspired by the above mean-field solution (33, 34), where is the observed magnetization; see Fig. 1 for the value of averaged over many samples, as well as a single-sample form of it. For (which we assume to be the case from (13)) and a sufficiently large , we get that and have the same sign. Note that strategy (36) is easy to implement.
Within the second (local) strategy we randomly select supervised spins among those that hold
i.e. now one first calculates the optimal estimate according to (2, 15) and then supervises those spins that do not agree with observations. This strategy did not show up within the optimal solution for the mean-field situation. We choose it with an expectation that can work, where the mean-field method does not apply, i.e. fluctuations are essential.
Naturally, (36, 37) are to be compared with the random supervising, where supervised spins are chosen completely randomly, i.e. without any dependence on . Note that besides (36, 37) we also studied several other supervising strategies, e.g. when in the right-hand-side of (36) or (37) is changed to . We shall not discuss such strategies, since they proved to be sub-optimal; frequently they are worse than the random supervising, and sometimes they are worse than having no supervising at all.
iv.2.1 Generation of images
With a given coupling , we generated images by Monte Carlo simulation (Metropolis algorithm) from the two-dimensional Ising model (12) on a square lattice with periodic boundary conditions. The overall number of spins is spins. Given the noise probability (and hence ) from (13, 14), we flip each spin of the original image generating the noisy image . The conditional averages in (3, 10, 11) are calculated by averaging over samples (we did check that this number suffices and the averages saturate).
iv.2.2 Three regimes of the square Ising ferromagnet
Recall that the square (two-dimensional) Ising model (12) has three regimes depending on the ferromagnetic coupling constant landau (); see Fig. 1. For a small values of the system is in the paramagnetic state, where each spin is equally likely to assume values or . In the vicinity of a certain critical value we enter into the critical regime, where the paramagnetic state gets unstable and there are long-range fluctuations (hence correlations) landau (). It is well-known that the infinite square lattice has a second-order phase-transition point at landau (). This is close to the value observed in our numerics done on a finite lattice; see Fig. 1. Another indication of the critical regime is that the averaged magnetization does not coincide with the single-sample magnetization , as shown by Fig. 1. This is a sign of strong-fluctuations.
The third regime is set for a sufficiently large . Here fluctuations are relatively small. Indeed, Fig. 1 shows that for we get that a single-sample behavior coincides with the averaged one. The symmetry of the Hamiltonian (12) is spontaneously broken (this process starts from the critical regime). Hence within each given sample has a definite sign landau ().
Images generated in the third regime show pixels of one color on the background of another color; see Fig. 2. Such figures are interesting also because they are very susceptible to noise, and cannot be recovered by standard (i.e. not active) methods; see Fig. 2. In contrast, images generated in the critical regime do show an interesting fine-grained structure. To some extent they can be recovered via standard methods, though the active supervising still leads to a serious improvement; see Figs. 3 and 4.
iv.2.3 Images and overlaps
The performance measure of strategies (36, 37) is checked via overlaps (3, 11). Overlap close to one is a necessary condition for a good restoration. For definiteness, we compared the performance (i.e. the overlap) of the present non-supervised method with those of several standard filters—e.g. the median filter or the Gaussian filter—that are employed in the image-recognition; see e.g. tanaka (). Numerical packages for implementing these filters were taken from numeric (). In agreement with the fact that the considered method optimizes the overlap (3), we found that these filtering methods produce a smaller overlap than (3).
We note that in the present situation we have a possibility to look at additional performance measures: since we generate the data ourselves—i.e. we have the original image —we can employ directly an analogue of (5) that checks the restored image (after supervising) with the original image :
where is defined in (10). Note that since we do not average over and , the overlaps (11) and (38) are generally not equal for a considered finite value of . We confirmed that (11) and (38) are not precisely equal to each other, though they are normally quite close, since is still sufficiently large, and the self-averaging applies approximately. In all relevant situations, if a strategy has a superior performance according to (11), then it is also superior according to (38). We also stress that all the results on differences between the strategies were checked against varying the initial image .
We identified two regimes in the behaviour of and , i.e. the overlaps (11) within the local and global strategies, respectively. In both regimes the maximal (better) of them is larger than the overlap obtained via the random supervising at the same number of supervised spins [see Tables LABEL:tab1 and LABEL:tab2]:
Table LABEL:tab1 shows the first regime, where is sufficiently large, so that the system is away of the critical regime, which realized for ; see Fig. 1. In these regime the strategy (36) is dominant over (37), though their predictions are close to each other:
Altogether, for the situation is close to the mean-field regime. This is additionally confirmed by the fact that the random supervising does not provide any substantial improvement: ; see Table LABEL:tab1 and Fig. 2, and compare these with the discussion around (31). Fig. 2 describes regime (40) and shows that without supervising the real image restoration is absent, despite of the fact that the overlap for this restoration regime is sizable; see Table LABEL:tab1. The reason for this is that the non-supervised restoration is over-dominated by the prior information; cf. the discussion after (26). Fig. 2 also shows that the image restoration greatly improves after applying active methods. Both local and global methods lead to similar results here.
For smaller values of , the local strategy is better than the global one: ; see Table LABEL:tab2. Now the random supervising can improve over no supervising, and there are cases (for a small ), where the global strategy is worse than the random: . The reason why the global strategy does not apply is clear, since for the mean-field method does not apply, in particular because the model is within the crtical regime; see Fig. 1, where the critical regime can be identified by the region, where and sharply increase from zero to values larger than . It is encouraging that even for the local strategy performs well, e.g. it is visibly better than the random supervising; see Table LABEL:tab2. Fig. 3 illustrates this situation for . It is seen that the performance of the non-supervised restoration is still poor, although better than what was seen for Fig. 3. Now the local active scenario is clearly better than the global one. Finally, Fig. 4 presents a representative example of . Here all methods perform more or less reasonably, but it is clearly seen that fine details of the original image are captured only by the local supervising method.
Thus we suggest that the local strategy (37) is to be applied for this and similar models, because even when it is sub-optimal it is close to the optimal strategy.
V Numerical results with external magnetic field
So far we worked with the Hamiltonian (15) on a square lattice. The corresponding prior density is generated by the Ising Hamiltonian , which is symmetric with respect to the inversion . This symmetry can be broken with an external field that introduces a bias in the distribution of . Now instead of (15) we look at
which leads to the joint probability , and to the prior probability .
If assumes a small but generic value, the magnetization and its observed value become smoother functions of that are closer to their average values; see Fig. 5 and compare it with Fig. 1. This is expected, because the second-order phase-transition regime is now replaced by a smooth crossover from lower to higher values of and ; see Fig. 5. Since the situation is more stable (than for ), we can apply a smaller amount of supervising (i.e. smaller values of ).
Fig. 6 shows the original and recovered images for a small but generic value of the external field and for . It is seen that the original image has a fine-grained structure. At such a structure is seen for being roughly between and . As compared to other presented examples, here we applied a smaller amount of supervising (i.e. the of spins is supervised). Still this supervising is clearly useful, especially in its active scenario. In Fig. 6 we are still far from the mean-field, since ; cf. Table LABEL:tab5 for further data.
(b) The same as in (a), but for . It is seen that single realizations of and converge to the their averages upon increasing .
Restoration of noise-corrupted images is a fundamental problem of statistics, and it is also of obvious practical importance geman (); besag (); scot (); tanaka (); cohen (); nishimori (); nishi (). It joins together mathematical and physical statistics, since the simplest non-trivial approach to this problem is based on the two-dimensional, square-lattice Ising model with a ferromagnetic interaction between neighbouring binary pixels (spins) geman (); scot (). The prior probability of images within this model is determined via the Gibbs distribution, where the ferromagnetic coupling constants account for natural smoothness (correlation) between neighbouring spins. Hence the Ising model was extensively studied in the context of image restoration geman (); besag (); scot (); tanaka (); cohen (); nishimori (); nishi (). The optimal approach to restoration within the Ising model is based on maximizing the average overlap (the inverse Bayesian risk). This approach demands calculating Gibbsian (or equilibrium) averages of all Ising spins. In the sense of the average overlap, this approach outperforms those based on various types of filtering (e.g. median filtering) tanaka (). However, filtering approaches are easier to implement in practice.
Here we studied a ferromagnetic Ising model for active restoration of images, where prior information on certain (supervised) spins is requested using the initial observation of the noise-corrupted image. For a given number of supervised spins, the optimal supervising strategy maximizes the average overlap of non-supervised spins with their true (requested) values. Generally, this optimal strategy is not known. We found the optimal supervising strategy within the mean-field approximation, which is introduced via letting all spins couple with each other. It applies to realistic (two-dimensional) images, whenever the inter-spin coupling is large (the image is smooth) and hence fluctuations are small. The optimal strategy in this situation amounts to supervising those pixels (spins) that do not agree with the sign of the overall magnetization. The strategy shows that there is an upper limit on the number of supervised spins, beyond of which it is meaningless to supervise, since the performance will not change.
The mean-field approximation allows to study the active supervision in detail. It also leads to strategies that can apply more generally, i.e. in the critical regime of the two-dimensional model, where correlations are long-range and probability distributions are self-similar bialek (). The critical regime applies to natural images ruderman (). We explored one such strategy, where we supervise only those spins (pixels) that disagree with their Bayesian estimates. The performance of this strategy is comparable with the optimal one where the latter is known, i.e. the mean-field applies. Otherwise (e.g. in the critical regime), it outperforms over all other methods we tried.
In this study we assumed that hyperparamaters of the model—the smoothness parameter of the prior probability (ferromagnetic coupling), the bias (external magnetic field) and the noise parameter —are known. Generally, this is not the case: hyperparameters are to be determined from data. This is one instance of the hyperparameter learning for the Ising model that recently attracted much attention hyper_naive_mf (); hyper (); see hyper_review (); berg () for reviews. In Appendix we studied the hyperparameter learning within the mean-field approximation, and saw that this problem is non-trivial due to non-identifiability: if (for taken to be zero for simplicity) both hyperparameters and are unknown, then neither of them can be found via the standard maximum likelihood approach. The non-identifiability problem is crucial for image restoration, since imprecisely known hyperparameters may lead to a larger value of the average overlap (over-confidence), thereby creating a false impression about the performance of the restoration. Appendix also shows that within the mean-field approximation the non-identifiability problem can be resolved via the active supervising. A pertinent open problem is how the active supervising alters hyperparameter learning for the two-dimensional situation, especially in its critical regime.
A.E.A. acknowledges discussions with A. Galstyan. A.E.A. was supported by SCS of Armenia (Grant No. 18RF-015). This work is supported by National Natural Science Foundation of China (Grant No. 11505071), the Programme of Introducing Talents of Discipline to Universities under Grant No. B08033 and the Fundamental Research Funds for the Central Universities.
Appendix: Unknown hyperparameters
.1 The maximum-likelihood method for estimation of hyperparameters
The standard maximum-likelihood method of estimating unknown hyper-parameters is to start with the observations and probabilities at trial values of the hyper-parameters ephraim_review (). For the mean-field model studied in section III.1 these trial values are and , while the correct values of hyperparameters are still denoted by and .
Now if , then self-averages:
with the true (i.e. containing the correct hyper-parameters and ) probability . The global maximum of (42) is reached (among other possibilities) at and ephraim_review (). This fact follows from the positivity of the relative entropy: . However, the maximum is generally not unique, so the maximization over and does not need to return true values and . In practice the global optimization is generally intractable; hence it is implemented via an expectation–maximization (EM) method (Baum-Welch algorithm) ephraim_review ().
Note that self-averages with probability containing true values of and .
Now it is clear from (22, 45) that for 111For we get . Then (43) leads to . This satisfies (46), but there is no constraint on and on , i.e. no determination of hyperparameters is possible whatsoever. This is expected, because for the initial distribution of the spins is completely unbiased. the maximization of (44) will lead to , and this global maximum is achieved for
Eq. (46) cannot determine two unknowns and . If one of them is known precisely, e.g. , the other one will be found via (46). But if both and are unknown, neither of them will be found. This is an example of the non-identifiability problem in inference 1957 (); ito ().
.2 Overlap and over-confidence
The overlap and magnetization still self-average under the correct values and [cf. (25)]:
Recall that and hold (46). Under this constraint in (49) can be either larger or smaller than the optimal overlap given by (25); e.g. one can take sufficiently large and small so as to hold (46) and get from (49). Hence the overlap is a good criterion for distinguishing the optimal solution only if the hyperparameters are known precisely.
Once hyperparameters cannot be found precisely, there can be two possibilities for the overlap under trial values [see (46, 49)] and the optimal overlap given by (25): and . The former is a fair situation, where imprecise knowledge of hyperparameters ( and ) brings in a reduction in the overlap. In contrast, the latter is a dangerous situation, where imprecise knowledge leads to over-confidence (over-fitting). We feel that so far not enough attention was devoted to the over-confidence phenomenon both computationally and theoretically. One of the advantages of the present model is that it makes the phenomenon obvious.