Information Theory and Statistical Physics –
Lecture Notes
Abstract
This document consists of lecture notes for a graduate course, which focuses on the relations between Information Theory and Statistical Physics. The course is aimed at EE graduate students in the area of Communications and Information Theory, as well as at graduate students in Physics who have basic background in Information Theory. Strong emphasis is given to the analogy and parallelism between Information Theory and Statistical Physics, as well as to the insights, the analysis tools and techniques that can be borrowed from Statistical Physics and ‘imported’ to certain problem areas in Information Theory. This is a research trend that has been very active in the last few decades, and the hope is that by exposing the student to the meeting points between these two disciplines, we will enhance his/her background and perspective to carry out research in the field.
A short outline of the course is as follows: Introduction; Elementary Statistical Physics and its Relation to Information Theory; Analysis Tools in Statistical Physics; Systems of Interacting Particles and Phase Transitions; The Random Energy Model (REM) and Random Channel Coding; Additional Topics (optional).
Neri Merhav
Department of Electrical Engineering
Technion – Israel Institute of Technology
Haifa 32000, ISRAEL
merhav@ee.technion.ac.il
Contents
 1 Introduction

2 Elementary Stat. Physics and Its Relation to IT
 2.1 What is Statistical Physics?
 2.2 Basic Postulates and the Microcanonical Ensemble
 2.3 The Canonical Ensemble
 2.4 Properties of the Partition Function and the Free Energy
 2.5 The Energy Equipartition Theorem
 2.6 The Grand–Canonical Ensemble (Optional)
 2.7 Gibbs’ Inequality, the 2nd Law, and the Data Processing Thm
 2.8 Large Deviations Theory and Physics of Information Measures
 3 Analysis Tools and Asymptotic Methods

4 Interacting Particles and Phase Transitions
 4.1 Introduction – Origins of Interactions
 4.2 A Few Models That Will be Discussed in This Subsection Only
 4.3 Models of Magnetic Materials – General
 4.4 Phase Transitions – A Qualitative Discussion
 4.5 The One–Dimensional Ising Model
 4.6 The Curie–Weiss Model
 4.7 Spin Glass Models With Random Parameters and Random Code Ensembles
 5 The Random Energy Model and Random Coding

6 Additional Topics (Optional)
 6.1 The REM With a Magnetic Field and Joint Source–Channel Coding
 6.2 The Generalized Random Energy Model (GREM) and Hierarchical Coding
 6.3 Phase Transitions of the Rate–Distortion Function
 6.4 Capacity of the Sherrington–Kirkpatrick Spin Glass
 6.5 Generalized Temperature, de Bruijn’s Identity, and Fisher Information
 6.6 The Gibbs Inequality and the Log–Sum Inequality
 6.7 Dynamics, Evolution of Info Measures, and Simulation
1 Introduction
This course is intended for EE graduate students in the field of Communications and Information Theory, and also for graduates of the Physics Department (in particular, graduates of the EE–Physics program) who have basic background in Information Theory, which is a prerequisite to this course. As its name suggests, this course focuses on relationships and interplay between Information Theory and Statistical Physics – a branch of physics that deals with many–particle systems using probabilistic/statistical methods at the microscopic level.
The relationships between Information Theory and Statistical Physics (+ thermodynamics) are by no means new, and many researchers have been exploiting them for many years. Perhaps the first relation, or analogy, that crosses our minds is that in both fields, there is a fundamental notion of entropy. Actually, in Information Theory, the term entropy was coined after the thermodynamic entropy. The thermodynamic entropy was first introduced by Clausius (around 1850), whereas its probabilistic–statistical interpretation is due to Boltzmann (1872). It is virtually impossible to miss the functional resemblance between the two notions of entropy, and indeed it was recognized by Shannon and von Neumann. The well–known anecdote on this tells that von Neumann advised Shannon to adopt this term because it would provide him with “… a great edge in debates because nobody really knows what entropy is anyway.”
But the relationships between the two fields go far beyond the fact that both share the notion of entropy. In fact, these relationships have many aspects, and we will not cover all of them in this course, but to give an idea of their scope, we will mention just a few.

The Maximum Entropy (ME) Principle. This is perhaps the oldest concept that ties the two fields and it has attracted a great deal of attention, not only of information theorists, but also that of researchers in related fields like signal processing, image processing, and the like. It is about a philosophy, or a belief, which, in a nutshell, is the following: If in a certain problem, the observed data comes from an unknown probability distribution, but we do have some knowledge (that stems, e.g., from measurements) of certain moments of the underlying quantity/signal/random–variable, then assume that the unknown underlying probability distribution is the one with maximum entropy subject to (s.t.) moment constraints corresponding to this knowledge. For example, if we know the first and second moments, then the ME distribution is Gaussian with matching first and second order moments. Indeed, the Gaussian model is perhaps the most widespread model for physical processes in Information Theory as well as in signal– and image processing. But why maximum entropy? The answer to this philosophical question is rooted in the second law of thermodynamics, which asserts that in an isolated system, the entropy cannot decrease, and hence, when the system reaches equilibrium, its entropy reaches its maximum. Of course, when it comes to problems in Information Theory and other related fields, this principle becomes quite heuristic, and so, one may question its relevance, but nevertheless, this approach has had an enormous impact on research trends throughout the last fifty years, after being proposed by Jaynes in the late fifties of the previous century, and further advocated by Shore and Johnson afterwards. In the book by Cover and Thomas, there is a very nice chapter on this, but we will not delve into this any further in this course.
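As a small numerical illustration (a sketch in Python; the support $\{0,\ldots,9\}$ and the target mean are arbitrary choices, not part of the text), one can verify that under a single mean constraint the ME distribution has the exponential (Gibbs) form, and that simple competitors with the same mean never attain a larger entropy:

```python
import math

xs = list(range(10))
target_mean = 3.0

def gibbs(b):
    # exponential-family distribution p(x) proportional to exp(-b*x)
    w = [math.exp(-b * x) for x in xs]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p):
    return sum(x * pi for x, pi in zip(xs, p))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Bisection on b: mean(gibbs(b)) decreases monotonically from ~9 to ~0.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean(gibbs(mid)) > target_mean:
        lo = mid
    else:
        hi = mid
p_me = gibbs(0.5 * (lo + hi))
h_me = entropy(p_me)

# Simple competitors with the same mean never beat the ME entropy.
competitors = [
    [0, 0, 0, 1.0, 0, 0, 0, 0, 0, 0],    # point mass at 3
    [1 / 7.0] * 7 + [0, 0, 0],           # uniform on {0,...,6}
    [0.5, 0, 0, 0, 0, 0, 0.5, 0, 0, 0],  # half-half on {0, 6}
]
for q in competitors:
    assert abs(mean(q) - target_mean) < 1e-9
    assert entropy(q) < h_me
print(h_me)
```

The same bisection-on-the-multiplier idea extends to several moment constraints, which is how the Gaussian case arises.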

Landauer’s Erasure Principle. Another aspect of these relations has to do with a piece of theory whose underlying guiding principle is that information is a physical entity. Every information bit in the universe carries a certain amount of energy. Specifically, Landauer’s erasure principle (from the early sixties of the previous century), which is based on a physical theory of information, asserts that every bit that one erases increases the entropy of the universe by $k\ln 2$, where $k$ is Boltzmann’s constant. It is my personal opinion that these kinds of theories should be taken with a grain of salt, but this is only my opinion. At any rate, this is not going to be included in the course either.

Large Deviations Theory as a Bridge Between Information Theory and Statistical Physics.
Both Information Theory and Statistical Physics have an intimate relation to large deviations theory, a branch of probability theory which focuses on the assessment of the exponential rates of decay of probabilities of rare events, where the most fundamental mathematical tool is the Chernoff bound. This is a topic that will be covered in the course and quite soon. 
Random Matrix Theory. How do the eigenvalues (or, more generally, the singular values) of random matrices behave when these matrices have very large dimensions or if they result from products of many randomly selected matrices? This is a hot area in probability theory with many applications, both in Statistical Physics and in Information Theory, especially in modern theories of wireless communication (e.g., MIMO systems). This is again outside the scope of this course, but whoever is interested in a ‘taste’ of it is invited to read the 2004 paper by Tulino and Verdú in Foundations and Trends in Communications and Information Theory, a relatively new journal for tutorial papers.

Spin Glasses and Coding Theory. It turns out that many problems in channel coding theory (and also, to some extent, source coding theory) can be mapped almost verbatim to parallel problems in the field of physics of spin glasses – amorphous magnetic materials with a high degree of disorder and very complicated physical behavior, which is customarily treated using statistical–mechanical approaches. For many years, researchers have made attempts to ‘import’ analysis techniques rooted in the statistical physics of spin glasses and to apply them to analogous coding problems, with various degrees of success. This is one of the main subjects of this course and we will study it extensively, at least from some aspects.
We can go on and on with this list and add more items in the context of these very fascinating meeting points between Information Theory and Statistical Physics, but for now, we stop here. We just mention that the last item will form the main core of the course. We will see that not only are these relations between Information Theory and Statistical Physics academically interesting in their own right, but they also prove useful and beneficial in that they provide us with new insights and mathematical tools to deal with information–theoretic problems. These mathematical tools sometimes prove a lot more efficient than traditional tools used in Information Theory, and they may give either simpler expressions for performance analysis, or improved bounds, or both.
At this point, let us have a brief review of the syllabus of this course, where as can be seen, the physics and the Information Theory subjects are interlaced with each other, rather than being given in two continuous, separate parts. This way, it is hoped that the relations between Information Theory and Statistical Physics will be seen more readily. The detailed structure of the remaining part of this course is as follows:

Elementary Statistical Physics and its Relation to Information Theory: What is statistical physics? Basic postulates and the micro–canonical ensemble; the canonical ensemble: the Boltzmann–Gibbs law, the partition function, thermodynamical potentials and their relations to information measures; the equipartition theorem; generalized ensembles (optional); Chernoff bounds and the Boltzmann–Gibbs law: rate functions in Information Theory and thermal equilibrium; physics of the Shannon limits.

Analysis Tools in Statistical Physics: The Laplace method of integration; the saddle–point method; transform methods for counting and for representing non–analytic functions; examples; the replica method – overview.

Systems of Interacting Particles and Phase Transitions: Models of many–particle systems with interactions (general) and examples; a qualitative explanation for the existence of phase transitions in physics and in information theory; ferromagnets and Ising models: the 1D Ising model, the Curie–Weiss model; randomized spin–glass models: annealed vs. quenched randomness, and their relevance to coded communication systems.

The Random Energy Model (REM) and Random Channel Coding: Basic derivation and phase transitions – the glassy phase and the paramagnetic phase; random channel codes and the REM: the posterior distribution as an instance of the Boltzmann distribution, analysis and phase diagrams, implications on code ensemble performance analysis.

Additional Topics (optional): The REM in a magnetic field and joint source–channel coding; the generalized REM (GREM) and hierarchical ensembles of codes; phase transitions in the rate–distortion function; Shannon capacity of infinite–range spin–glasses; relation between temperature, de Bruijn’s identity, and Fisher information; the Gibbs inequality in Statistical Physics and its relation to the log–sum inequality of Information Theory.
As already said, there are also plenty of additional subjects that fall under the umbrella of relations between Information Theory and Statistical Physics, which will not be covered in this course. One very hot topic is that of codes on graphs, iterative decoding, belief propagation, and density evolution. The main reason for not including these topics is that they are already covered in the course of Dr. Igal Sason: “Codes on graphs.”
I would like to emphasize that prior basic background in Information Theory will be assumed, therefore, Information Theory is a prerequisite for this course. As for the physics part, prior background in statistical mechanics could be helpful, but it is not compulsory. The course is intended to be self–contained as far as the physics background goes. The bibliographical list includes, in addition to a few well known books in Information Theory, also several very good books in elementary Statistical Physics, as well as two books on the relations between these two fields.
As a final note, I feel compelled to clarify that the material of this course is by no means intended to be presented from a very comprehensive perspective and to consist of a full account of methods, problem areas and results. Like in many advanced graduate courses in our department, here too, the choice of topics, the approach, and the style strongly reflect the personal bias of the lecturer and his/her perspective on research interests in the field. This is also the reason that a considerable fraction of the topics and results that will be covered, are taken from articles in which I have been involved.
2 Elementary Stat. Physics and Its Relation to IT
2.1 What is Statistical Physics?
Statistical physics is a branch of Physics which deals with systems with a huge number of particles (or any other elementary units), e.g., of the order of magnitude of Avogadro’s number, that is, about $6\times 10^{23}$ particles. Evidently, when it comes to systems with such an enormously large number of particles, there is no hope to keep track of the physical state (e.g., position and momentum) of each and every individual particle by means of the classical methods in physics, that is, by solving a gigantic system of differential equations pertaining to Newton’s laws for all particles. Moreover, even if these differential equations could have been solved (at least approximately), the information that they would give us would be virtually useless. What we normally really want to know about our physical system boils down to a bunch of macroscopic parameters, such as energy, heat, pressure, temperature, volume, magnetization, and the like. In other words, while we continue to believe in the good old laws of physics that we have known for some time, even the classical ones, we no longer use them in the ordinary way that we are familiar with from elementary physics courses. Rather, we think of the state of the system, at any given moment, as a realization of a certain probabilistic ensemble. This is to say that we approach the problem from a probabilistic (or a statistical) point of view. The beauty of statistical physics is that it derives the macroscopic theory of thermodynamics (i.e., the relationships between thermodynamical potentials, temperature, pressure, etc.) as ensemble averages that stem from this probabilistic microscopic theory – the theory of statistical physics, in the limit of an infinite number of particles, that is, the thermodynamic limit. As we shall see throughout this course, this thermodynamic limit is parallel to the asymptotic regimes that we are used to in Information Theory, most notably, the one pertaining to a certain ‘block length’ that goes to infinity.
2.2 Basic Postulates and the Microcanonical Ensemble
For the sake of concreteness, let us consider the example where our many–particle system is a gas, namely, a system with a very large number $N$ of mobile particles, which are free to move in a given volume. The microscopic state (or microstate, for short) of the system, at each time instant $t$, consists, in this example, of the position $\vec r_i(t)$ and the momentum $\vec p_i(t)$ of each and every particle, $1\le i\le N$. Since each one of these is a vector of three components, the microstate is then given by a $6N$–dimensional vector $\vec x(t)=\{(\vec r_i(t),\vec p_i(t)):\ i=1,2,\ldots,N\}$, whose trajectory along the time axis, in the phase space $\mathbb{R}^{6N}$, is called the phase trajectory.
Let us assume that the system is closed, i.e., isolated from its environment, in the sense that no energy flows inside or out. Imagine that the phase space $\mathbb{R}^{6N}$ is partitioned into very small hypercubes (or cells) of size $\Delta$. One of the basic postulates of statistical mechanics is the following: In the very long range, the relative amount of time that $\vec x(t)$ spends at each such cell converges to a certain number between $0$ and $1$, which can be given the meaning of the probability of this cell. Thus, there is an underlying assumption of equivalence between temporal averages and ensemble averages, namely, this is the assumption of ergodicity.
What are then the probabilities of these cells? We would like to derive these probabilities from first principles, based on as few as possible basic postulates. Our first such postulate is that for an isolated system (i.e., whose energy is fixed) all microscopic states are equiprobable. The rationale behind this postulate is twofold:

In the absence of additional information, there is no apparent reason that certain regions in phase space would have preference relative to any others.

This postulate is in harmony with a basic result in kinetic theory of gases – the Liouville theorem, which we will not touch upon in this course, but in a nutshell, it asserts that the phase trajectories must lie along hypersurfaces of constant probability density. (Footnote: This is a result of the energy conservation law along with the fact that probability mass behaves like an incompressible fluid in the sense that whatever mass flows into a certain region from some direction must be equal to the outgoing flow in some other direction. This is reflected in the so–called continuity equation.)
Before we proceed, let us slightly broaden the scope of our discussion. In a more general context, associated with our $N$–particle physical system is a certain instantaneous microstate, generically denoted by $x=(x_1,x_2,\ldots,x_N)$, where each $x_i$, $1\le i\le N$, may itself be a vector of several physical quantities associated with particle number $i$, e.g., its position, momentum, angular momentum, magnetic moment, spin, and so on, depending on the type and the nature of the physical system. For each possible value of $x$, there is a certain Hamiltonian (i.e., energy function) that assigns to $x$ a certain energy $\mathcal{E}(x)$. (Footnote: For example, in the case of an ideal gas, $\mathcal{E}(x)=\sum_{i=1}^N\frac{\|\vec p_i\|^2}{2m}$, independently of the positions, namely, it accounts for the contribution of the kinetic energies only. In more complicated situations, there might be additional contributions of potential energy, which depend on the positions.) Now, let us denote by $\Omega(E)$ the density–of–states function, i.e., the volume of the shell $\{x:\ \mathcal{E}(x)=E\}$, or, slightly more precisely, $\Omega(E)\delta E=\mathrm{Vol}\{x:\ E\le\mathcal{E}(x)\le E+\delta E\}$, where the dependence on $\delta E$ will normally be ignored since $\Omega(E)$ is typically exponential in $N$ and $\delta E$ will have virtually no effect on its exponential order as long as it is small. Then, our above postulate concerning the ensemble of an isolated system, which is called the microcanonical ensemble, is that the probability density $P(x)$ is given by
$$P(x)=\begin{cases}\frac{1}{\Omega(E)\delta E} & E\le\mathcal{E}(x)\le E+\delta E\\ 0 & \text{elsewhere}\end{cases}\qquad(1)$$
In the discrete case, things are, of course, a lot easier: Then, $\Omega(E)$ would be the number of microstates with $\mathcal{E}(x)=E$ (exactly) and $P(x)$ would be the uniform probability mass function across this set of states. In this case, $\Omega(E)$ is analogous to the size of a type class in Information Theory, and $P(x)$ is the uniform distribution across this type class.
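The type–class analogy can be checked numerically (a sketch; the ternary composition below is an arbitrary example): the number of sequences with a given empirical distribution is a multinomial coefficient, and its normalized logarithm approaches the empirical Shannon entropy from below, as the method of types predicts:

```python
import math

counts = [50, 30, 20]   # composition of a length-100 ternary sequence
N = sum(counts)

# log of the multinomial coefficient N!/(n1! n2! n3!) -- the "type class" size
log_omega = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
# Shannon entropy (in nats) of the empirical distribution
H = -sum((c / N) * math.log(c / N) for c in counts)

print(log_omega / N, H)  # close; log_omega/N lies slightly below H
```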
Back to the continuous case, note that $\Omega(E)$ is, in general, not dimensionless: In the above example of a gas, it has the physical units of $[\mathrm{length}\times\mathrm{momentum}]^{3N}$, but we must get rid of these physical units because very soon we are going to apply non–linear functions on $\Omega(E)$, like the logarithmic function. Thus, we must normalize this volume by an elementary reference volume. In the gas example, this reference volume is taken to be $h^{3N}$, where $h$ is Planck’s constant ($h\approx 6.62\times 10^{-34}$ Joule$\cdot$sec). Informally, the intuition comes from the fact that $h$ is our best available “resolution” in the plane spanned by each component of $\vec r_i$ and the corresponding component of $\vec p_i$, owing to the uncertainty principle in quantum mechanics, which tells us that the product of the standard deviations of each such component pair, $\sigma_r\sigma_p$, is lower bounded by $\hbar/2$, where $\hbar=h/2\pi$. More formally, this reference volume is obtained in a natural manner from quantum statistical mechanics: by changing the integration variable $\vec p$ to $\vec k$ by using $\vec p=\hbar\vec k$, where $\vec k$ is the wave vector. This is a well–known relationship pertaining to particle–wave duality. Now, having redefined $\Omega(E)$ in units of this reference volume, which makes it then a dimensionless quantity, the entropy is defined as
$$S(E)\stackrel{\triangle}{=}k\ln\Omega(E)\qquad(2)$$
where $k$ is Boltzmann’s constant ($k\approx 1.381\times 10^{-23}$ Joule/degree). We will soon see what is the relationship between $S(E)$ and the information–theoretic entropy.
To get some feeling of this, it should be noted that normally, $\Omega(E)$ behaves as an exponential function of $N$ (at least asymptotically), and so, $S(E)$ is roughly linear in $N$. For example, if $\mathcal{E}(x)=\sum_{i=1}^N\frac{\|\vec p_i\|^2}{2m}$, then $\Omega(E)$ is the volume of a shell or surface of a $3N$–dimensional sphere with radius $\sqrt{2mE}$, which is proportional to $(2mE)^{3N/2}\cdot V^N$, but we should divide this by $N!$ to account for the fact that the particles are indistinguishable and we don’t count permutations as distinct physical states in this case. (Footnote: Since the particles are mobile and since they have no colors and no identity certificates, there is no distinction between a state where particle no. 15 has position $\vec r$ and momentum $\vec p$ while particle no. 437 has position $\vec r^{\,\prime}$ and momentum $\vec p^{\,\prime}$, and a state where these two particles are swapped.) More precisely, one obtains:
$$S(E)\approx Nk\ln\left[\frac{V}{N}\left(\frac{4\pi mE}{3Nh^2}\right)^{3/2}\right]+\frac{5}{2}Nk\qquad(3)$$
Assuming $E\propto N$ and $V\propto N$, we get $S\propto N$. A physical quantity like this, that has a linear scaling with the size of the system $N$, is called an extensive quantity. So, energy, volume and entropy are extensive quantities. Other quantities, which are not extensive, i.e., independent of the system size, like temperature and pressure, are called intensive.
It is interesting to point out that from the function $\Omega(E)$, or actually, the function $S(E)$, one can obtain the entire information about the relevant macroscopic physical quantities of the system, e.g., temperature, pressure, and so on. The temperature $T$ of the system is defined according to:
$$\frac{1}{T}=\left(\frac{\partial S(E)}{\partial E}\right)_V\qquad(4)$$
where $(\cdot)_V$ means that the derivative is taken at constant volume. (Footnote: This definition of temperature is related to the classical thermodynamical definition of entropy as $dS=\delta Q/T$, where $\delta Q$ is an increment of heat: in the absence of external work, when the volume is fixed, all the energy comes from heat, and so, $dE=\delta Q$.) Intuitively, in most situations, we expect $S(E)$ to be an increasing function of $E$ (although this is not strictly always the case), which means $T\ge 0$. But $T$ is also expected to be increasing with $E$ (or equivalently, $E$ is increasing with $T$, as otherwise, the heat capacity $dE/dT$ would be negative). Thus, $1/T$ should decrease with $E$, which means that the increase of $S$ in $E$ slows down as $E$ grows. In other words, we expect $S(E)$ to be a concave function of $E$. In the above example, indeed, $S(E)$ is logarithmic in $E$ and we get $\frac{1}{T}=\frac{3Nk}{2E}$, which means $E=\frac{3}{2}NkT$. Pressure is obtained by $P=T\cdot\frac{\partial S}{\partial V}$, which in our example, gives rise to the equation of state of the ideal gas, $PV=NkT$.
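These two relations are easy to cross–check numerically (a sketch in units where $k=1$; the values of $N$ and $E$ are arbitrary): keeping only the $E$–dependent part of the ideal–gas entropy, $S(E)=\frac{3}{2}N\ln E+\mathrm{const}$, and differentiating numerically recovers $E=\frac{3}{2}NT$:

```python
import math

N = 1000.0
def S(E):
    return 1.5 * N * math.log(E)   # E-dependent part only; constants drop out of dS/dE

E = 750.0
dE = 1e-4
invT = (S(E + dE) - S(E - dE)) / (2 * dE)   # central difference for 1/T = dS/dE
T = 1.0 / invT
print(E, 1.5 * N * T)   # agree
```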
How can we also see mathematically that, under “conceivable conditions”, $S(E)$ is a concave function? We know that the Shannon entropy is also a concave functional of the probability distribution. Is this related?
As both $E$ and $S$ are extensive quantities, let us define $u=E/N$ and
$$s(u)\stackrel{\triangle}{=}\lim_{N\to\infty}\frac{S(N,Nu)}{N}\qquad(5)$$
i.e., the per–particle entropy as a function of the per–particle energy, where $S(N,E)$ denotes the entropy of an $N$–particle system with total energy $E$. Consider the case where the Hamiltonian is additive, i.e.,
$$\mathcal{E}(x)=\sum_{i=1}^N\mathcal{E}(x_i)\qquad(6)$$
just like in the above example where $\mathcal{E}(x)=\sum_{i=1}^N\frac{\|\vec p_i\|^2}{2m}$. Then, obviously,
$$\Omega(N_1+N_2,\ N_1u_1+N_2u_2)\ge\Omega(N_1,N_1u_1)\cdot\Omega(N_2,N_2u_2)\qquad(7)$$
and so, we get:
$$S(N_1+N_2,\ N_1u_1+N_2u_2)\ge S(N_1,N_1u_1)+S(N_2,N_2u_2)\qquad(8)$$
and so, by taking $N_1$ and $N_2$ to infinity, with $\frac{N_1}{N_1+N_2}\to\lambda\in(0,1)$, we get:
$$s(\lambda u_1+(1-\lambda)u_2)\ge\lambda s(u_1)+(1-\lambda)s(u_2)\qquad(9)$$
which establishes the concavity of $s(u)$, at least in the case of an additive Hamiltonian. This means that the entropy of mixing two systems of particles is greater than the total entropy before they are mixed (the second law). A similar proof can be generalized to the case where $\mathcal{E}(x)$ includes also a limited degree of interactions (short range interactions), e.g., $\mathcal{E}(x)=\sum_{i=1}^N\mathcal{E}(x_i,x_{i+1})$, but this requires somewhat more caution. In general, however, concavity may no longer hold when there are long range interactions, e.g., where some terms of $\mathcal{E}(x)$ depend on a linear subset of particles. Simple examples can be found in: H. Touchette, “Methods for calculating nonconcave entropies,” arXiv:1003.0382v1 [cond-mat.stat-mech], 1 Mar 2010.
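The counting fact behind this concavity argument can be spot–checked numerically (a sketch; the binomial density of states $\Omega(N,m)=\binom{N}{m}$, as in the defects example below, is an illustrative choice): merging two subsystems never decreases the log number of microstates:

```python
import math
from math import comb

# ln C(N1+N2, m1+m2) >= ln C(N1, m1) + ln C(N2, m2): the combined system
# contains, in particular, all pairs of separate configurations.
for (N1, m1, N2, m2) in [(40, 10, 60, 30), (50, 5, 50, 45), (30, 15, 70, 20)]:
    lhs = math.log(comb(N1 + N2, m1 + m2))
    rhs = math.log(comb(N1, m1)) + math.log(comb(N2, m2))
    assert lhs >= rhs
```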
Example – Schottky defects. In a certain crystal, the atoms are located in a lattice, and at any positive temperature there may be defects, where some of the atoms are dislocated (see Fig. 1). Assuming that defects are sparse enough, such that around each dislocated atom all neighbors are in place, the activation energy, $\epsilon_0$, required for dislocation is fixed. Denoting the total number of atoms by $N$ and the number of defected ones by $N_0$, the total energy is then $E=N_0\epsilon_0$, and so,
$$\Omega(E)=\binom{N}{N_0}=\frac{N!}{N_0!(N-N_0)!}\qquad(10)$$
or, equivalently,
$$S(E)=k\ln\binom{N}{N_0}\approx k\left[N\ln N-N_0\ln N_0-(N-N_0)\ln(N-N_0)\right]$$
using the Stirling approximation. Thus,
$$\frac{1}{T}=\frac{\partial S}{\partial E}=\frac{1}{\epsilon_0}\cdot\frac{\partial S}{\partial N_0}=\frac{k}{\epsilon_0}\ln\frac{N-N_0}{N_0}\qquad(11)$$
which gives the number of defects as
$$N_0=\frac{N}{e^{\epsilon_0/kT}+1}\qquad(12)$$
At $T=0$, there are no defects, but their number increases gradually with $T$, approximately according to $e^{-\epsilon_0/kT}$ when $kT\ll\epsilon_0$. Note that from a slightly more information–theoretic point of view,
$$S(E)=k\ln\binom{N}{N_0}\approx kN\cdot h_2\left(\frac{N_0}{N}\right)=kN\cdot h_2\left(\frac{E}{N\epsilon_0}\right)\qquad(13)$$
where
$$h_2(x)\stackrel{\triangle}{=}-x\ln x-(1-x)\ln(1-x).$$
Thus, the thermodynamical entropy is intimately related to the Shannon entropy. We will see shortly that this is no coincidence. Note also that $S(E)$ is indeed concave in this example.
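The formulas of this example are easy to cross–check numerically (a sketch with $k=1$, $\epsilon_0=1$; the values of $N$ and $T$ are arbitrary): the defect count of eq. (12) should reproduce $1/T=\partial S/\partial E$ computed directly from the counting entropy:

```python
import math
from math import comb

N, eps0, T = 500, 1.0, 0.8
N0 = N / (math.exp(eps0 / T) + 1.0)     # eq. (12)

m = round(N0)
# discrete derivative of S = ln C(N, m) with respect to E = m * eps0
dS = math.log(comb(N, m + 1)) - math.log(comb(N, m))
invT = dS / eps0
print(1.0 / invT, T)  # close, up to discreteness
```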
What happens if we have two independent systems with total energy $E$, which lie in equilibrium with each other? What is the temperature $T$? How does the energy split between them? The number of combined microstates where system no. 1 has energy $E_1$ and system no. 2 has energy $E-E_1$ is $\Omega_1(E_1)\cdot\Omega_2(E-E_1)$. If the combined system is isolated, then the probability of such a combined microstate is proportional to $\Omega_1(E_1)\cdot\Omega_2(E-E_1)$. Keeping in mind that normally, $\Omega_1$ and $\Omega_2$ are exponential in $N$, then for large $N$, this product is dominated by the value of $E_1$ for which it is maximum, or equivalently, for which the sum of logarithms, $S_1(E_1)+S_2(E-E_1)$, is maximum, i.e., it is a maximum entropy situation, which is the second law of thermodynamics. This maximum is normally achieved at the value of $E_1$ for which the derivative vanishes, i.e.,
$$\frac{\partial S_1(E_1)}{\partial E_1}-\frac{\partial S_2(E-E_1)}{\partial E_1}=0\qquad(14)$$
or
$$\frac{\partial S_1(E_1)}{\partial E_1}=\frac{\partial S_2(E_2)}{\partial E_2}\bigg|_{E_2=E-E_1}\qquad(15)$$
which means
$$\frac{1}{T_1}=\frac{1}{T_2}\qquad(16)$$
Thus, in equilibrium, which is the maximum entropy situation, the energy splits in a way that temperatures are the same.
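This equilibrium condition can be illustrated numerically (a sketch; $k=1$ and the two Schottky–style subsystems with activation energies 1 and 2 are an arbitrary choice): maximizing $S_1(E_1)+S_2(E-E_1)$ by brute force yields a split at which the two values of $1/T_i=\partial S_i/\partial E_i$ nearly coincide:

```python
import math

def h2(x):
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

# entropies of two defect systems (activation energies 1 and 2)
N1, N2, e1, e2 = 1000.0, 1000.0, 1.0, 2.0
def S1(E1): return N1 * h2(E1 / (N1 * e1))
def S2(E2): return N2 * h2(E2 / (N2 * e2))

E = 600.0
grid = [E * i / 10000 for i in range(1, 10000)]
E1_star = max(grid, key=lambda E1: S1(E1) + S2(E - E1))

d = 1e-3
invT1 = (S1(E1_star + d) - S1(E1_star - d)) / (2 * d)
invT2 = (S2(E - E1_star + d) - S2(E - E1_star - d)) / (2 * d)
print(invT1, invT2)  # nearly equal: the split equalizes the temperatures
```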
2.3 The Canonical Ensemble
So far we have assumed that our system is isolated, and therefore has a strictly fixed energy $E$. Let us now relax this assumption and assume that our system is free to exchange energy with its large environment (heat bath) and that the total energy of the heat bath, $E_0$, is by far larger than the typical energy of the system. The combined system, composed of our original system plus the heat bath, is now an isolated system at temperature $T$. So what happens now?
Similarly as before, since the combined system is isolated, it is governed by the microcanonical ensemble. The only difference is that now we assume that one of the systems (the heat bath) is very large compared to the other (our test system). This means that if our small system is in microstate $x$ (for whatever definition of the microstate vector) with energy $\mathcal{E}(x)$, then the heat bath must have energy $E_0-\mathcal{E}(x)$ to complement the total energy to $E_0$. The number of ways that the heat bath may have energy $E_0-\mathcal{E}(x)$ is $\Omega_B(E_0-\mathcal{E}(x))$, where $\Omega_B(\cdot)$ is the density–of–states function pertaining to the heat bath. In other words, the number of microstates of the combined system for which the small subsystem is in microstate $x$ is $\Omega_B(E_0-\mathcal{E}(x))$. Since the combined system is governed by the microcanonical ensemble, the probability of this is proportional to $\Omega_B(E_0-\mathcal{E}(x))$. More precisely:
$$P(x)=\frac{\Omega_B(E_0-\mathcal{E}(x))}{\sum_{x'}\Omega_B(E_0-\mathcal{E}(x'))}\qquad(17)$$
Let us focus on the numerator for now, and normalize the result at the end. Then,
$$\Omega_B(E_0-\mathcal{E}(x))=e^{S_B(E_0-\mathcal{E}(x))/k}\approx\exp\left\{\frac{S_B(E_0)}{k}-\frac{\mathcal{E}(x)}{k}\cdot\frac{\partial S_B(E)}{\partial E}\bigg|_{E=E_0}\right\}=e^{S_B(E_0)/k}\cdot e^{-\mathcal{E}(x)/kT}\qquad(18)$$
where the approximation is a first order Taylor expansion, justified since $\mathcal{E}(x)\ll E_0$.
It is customary to work with the so–called inverse temperature:
$$\beta\stackrel{\triangle}{=}\frac{1}{kT}\qquad(19)$$
and so,
$$\Omega_B(E_0-\mathcal{E}(x))\propto e^{-\beta\mathcal{E}(x)}\qquad(20)$$
Thus, all that remains to do is to normalize, and we then obtain the Boltzmann–Gibbs (B–G) distribution, or the canonical ensemble, which describes the underlying probability law in equilibrium:
$$P(x)=\frac{e^{-\beta\mathcal{E}(x)}}{Z(\beta)}$$
where $Z(\beta)$ is the normalization factor:
$$Z(\beta)=\sum_x e^{-\beta\mathcal{E}(x)}\qquad(21)$$
in the discrete case, or
$$Z(\beta)=\int dx\ e^{-\beta\mathcal{E}(x)}\qquad(22)$$
in the continuous case.
This is one of the most fundamental results in statistical mechanics, which was obtained solely from the energy conservation law and the postulate that in an isolated system the distribution is uniform. The function is called the partition function, and as we shall see, its meaning is by far deeper than just being a normalization constant. Interestingly, a great deal of the macroscopic physical quantities, like the internal energy, the free energy, the entropy, the heat capacity, the pressure, etc., can be obtained from the partition function.
The B–G distribution tells us then that the system “prefers” to visit its low energy states more than the high energy states. And what counts is only energy differences, not absolute energies: If we add to all states a fixed amount of energy $E_0$, this will result in an extra factor of $e^{-\beta E_0}$ both in the numerator and in the denominator of the B–G distribution, which will, of course, cancel out. Another obvious observation is that whenever the Hamiltonian is additive, that is, $\mathcal{E}(x)=\sum_{i=1}^N\mathcal{E}(x_i)$, the various particles are statistically independent: Additive Hamiltonians correspond to non–interacting particles. In other words, the $x_i$’s behave as if they were drawn from a memoryless source. And so, by the law of large numbers, $\frac{1}{N}\sum_{i=1}^N\mathcal{E}(x_i)$ will tend (almost surely) to $\langle\mathcal{E}(X)\rangle$. Nonetheless, this is different from the microcanonical ensemble where $\frac{1}{N}\sum_{i=1}^N\mathcal{E}(x_i)$ was held strictly at the value $E/N$. The parallelism to Information Theory is as follows: The microcanonical ensemble is parallel to the uniform distribution over a type class and the canonical ensemble is parallel to a memoryless source.
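The factorization implied by an additive Hamiltonian can be verified by brute force on a toy system (a sketch; the three single–particle levels and the parameters are arbitrary): the $N$–particle partition function equals the $N$–th power of the single–particle one, so the particles are indeed i.i.d. under the B–G distribution:

```python
import math, itertools

levels = [0.0, 1.0, 2.5]   # single-particle energies (illustrative)
beta, N = 0.7, 4

z1 = sum(math.exp(-beta * e) for e in levels)
# brute-force sum over all |levels|^N microstates of the additive Hamiltonian
ZN = sum(math.exp(-beta * sum(x)) for x in itertools.product(levels, repeat=N))
print(ZN, z1 ** N)  # equal up to rounding
```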
The two ensembles are asymptotically equivalent as far as expectations go. They continue to be such even in cases of interactions, as long as these are short range. It is instructive to point out that the B–G distribution could have been obtained also in a different manner, owing to the maximum–entropy principle that we mentioned in the Introduction. Specifically, consider the following optimization problem:
$$\max\ H(X)\quad\mathrm{s.t.}\quad\langle\mathcal{E}(X)\rangle=E\qquad(23)$$
By formalizing the equivalent Lagrange problem, where $\beta$ now plays the role of a Lagrange multiplier:
$$\max_P\left\{H(X)+\beta\left[E-\sum_x P(x)\mathcal{E}(x)\right]\right\}\qquad(24)$$
or equivalently,
$$\min_P\left\{\sum_x P(x)\mathcal{E}(x)-\frac{H(X)}{\beta}\right\}\qquad(25)$$
one readily verifies that the solution to this problem is the B–G distribution, where the choice of $\beta$ controls the average energy $E$. In many physical systems, the Hamiltonian is a quadratic (or “harmonic”) function, e.g., $\frac{1}{2}mv^2$, $\frac{1}{2}kx^2$, $\frac{1}{2}CV^2$, $\frac{1}{2}LI^2$, $\frac{1}{2}I\omega^2$, etc., in which case the resulting B–G distribution turns out to be Gaussian. This is at least part of the explanation why the Gaussian distribution is so frequently encountered in Nature. Note also that indeed, we have already seen in the Information Theory course that the Gaussian density maximizes the (differential) entropy s.t. a second order moment constraint, which is equivalent to our average energy constraint.
2.4 Properties of the Partition Function and the Free Energy
Let us now examine more closely the partition function and make a few observations about its basic properties. For simplicity, we shall assume that $x$ is discrete. First, let’s look at the limits: Obviously, $Z(0)$ is equal to the size of the entire set of microstates, which is also $\sum_E\Omega(E)$. This is the high temperature limit, where all microstates are equiprobable. At the other extreme, we have:
$$\lim_{\beta\to\infty}\frac{\ln Z(\beta)}{\beta}=-\min_x\mathcal{E}(x)\stackrel{\triangle}{=}-E_{\mathrm{GS}}\qquad(26)$$
which describes the situation where the system is frozen at the absolute zero. Only states with minimum energy – the ground–state energy $E_{\mathrm{GS}}$ – prevail.
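Both limits are easy to see on a toy discrete system (a sketch; the energy levels are an arbitrary choice, with a doubly degenerate ground state):

```python
import math

levels = [1.0, 1.0, 2.0, 3.5]   # two degenerate ground states at E = 1

def Z(beta):
    return sum(math.exp(-beta * e) for e in levels)

print(Z(0.0))                     # 4.0: the total number of microstates
beta = 200.0
print(-math.log(Z(beta)) / beta)  # close to 1.0, the ground-state energy
```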
Another important property of $Z(\beta)$, or more precisely, of $\ln Z(\beta)$, is that it is a log–moment generating function: By taking derivatives of $\ln Z(\beta)$, we can obtain moments (or cumulants) of $\mathcal{E}(X)$. For the first moment, we have
$$\langle\mathcal{E}(X)\rangle=\frac{\sum_x\mathcal{E}(x)e^{-\beta\mathcal{E}(x)}}{\sum_x e^{-\beta\mathcal{E}(x)}}=-\frac{\partial\ln Z(\beta)}{\partial\beta}\qquad(27)$$
Similarly, it is easy to show (exercise) that
$$\mathrm{Var}\{\mathcal{E}(X)\}=\langle\mathcal{E}^2(X)\rangle-\langle\mathcal{E}(X)\rangle^2=\frac{\partial^2\ln Z(\beta)}{\partial\beta^2}\qquad(28)$$
This in turn implies that $\frac{\partial^2\ln Z(\beta)}{\partial\beta^2}\ge 0$, which means that $\ln Z(\beta)$ must always be a convex function of $\beta$. Higher order derivatives provide higher order moments.
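A finite–difference check on a toy system (a sketch; the levels and the value of $\beta$ are arbitrary) confirms the log–moment generating property:

```python
import math

levels = [0.0, 0.5, 1.0, 2.0]
beta = 1.3

def lnZ(b):
    return math.log(sum(math.exp(-b * e) for e in levels))

# exact moments under the B-G distribution
w = [math.exp(-beta * e) for e in levels]
Z = sum(w)
p = [wi / Z for wi in w]
mean = sum(pi * e for pi, e in zip(p, levels))
var = sum(pi * (e - mean) ** 2 for pi, e in zip(p, levels))

d = 1e-5
fd_mean = -(lnZ(beta + d) - lnZ(beta - d)) / (2 * d)
fd_var = (lnZ(beta + d) - 2 * lnZ(beta) + lnZ(beta - d)) / d ** 2
print(mean, fd_mean)   # match: <E> = -d lnZ / d beta
print(var, fd_var)     # match: Var(E) = d^2 lnZ / d beta^2 >= 0, so lnZ is convex
```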
Next, we look at $Z(\beta)$ slightly differently than before. Instead of summing $e^{-\beta E(x)}$ across all states, we go by energy levels (similarly as in the method of types). This amounts to:
$$Z(\beta) = \sum_x e^{-\beta E(x)} = \sum_E \Omega(E)e^{-\beta E} \doteq \sum_\epsilon e^{Ns(\epsilon)/k}\cdot e^{-\beta N\epsilon} \doteq \exp\left\{N\cdot\max_\epsilon\left[\frac{s(\epsilon)}{k}-\beta\epsilon\right]\right\} = \exp\left\{-\beta N\cdot\min_\epsilon\left[\epsilon - Ts(\epsilon)\right]\right\}. \qquad (29)$$
The quantity $f \triangleq \min_\epsilon[\epsilon - Ts(\epsilon)]$ is the (per–particle) free energy. Similarly, the entire free energy, $F$, is defined as
$$F = E - TS = Nf = -\frac{\ln Z(\beta)}{\beta} = -kT\ln Z(\beta). \qquad\qquad (30)$$
The physical meaning of the free energy is this: A change, or a difference, $\Delta F = F_2 - F_1$, in the free energy means the minimum amount of work it takes to transfer the system from equilibrium state 1 to another equilibrium state 2 in an isothermal (fixed temperature) process. This minimum is achieved when the process is quasistatic, i.e., so slow that the system is always almost in equilibrium. Equivalently, $-\Delta F$ is the maximum amount of work that can be exploited from the system, namely, the part of the energy that is free for doing work (i.e., not dissipated as heat) at fixed temperature. Again, this maximum is attained by a quasistatic process.
We see that the value $\epsilon^*$ of $\epsilon$ that minimizes $\epsilon - Ts(\epsilon)$ dominates the partition function and hence captures most of the probability. As $N$ grows without bound, the energy probability distribution becomes sharper and sharper around $N\epsilon^*$. Thus, we see that equilibrium in the canonical ensemble amounts to minimum free energy. This extends the second law of thermodynamics from the microcanonical ensemble of isolated systems, whose equilibrium obeys the maximum entropy principle. The maximum entropy principle is replaced, more generally, by the minimum free energy principle. Note that the Lagrange minimization problem that we formalized before, i.e.,
$$\min_P\left\{\sum_x P(x)E(x) - \frac{1}{\beta}\left[-\sum_x P(x)\ln P(x)\right]\right\}, \qquad\qquad (31)$$
is nothing but minimization of the free energy, provided that we identify the Shannon entropy $H(P)$ with the physical entropy $S$ (to be done very soon) and the Lagrange multiplier $1/\beta$ with $kT$. Thus, the B–G distribution minimizes the free energy for a given temperature.
Although we have not yet seen this explicitly, there were already hints, and the terminology suggests, that the thermodynamical entropy $S$ is intimately related to the Shannon entropy $H$. We will also see this shortly in a more formal manner. But what is the information–theoretic analogue of the free energy?
Here is a preliminary guess based on a very rough consideration: The last chain of equalities reminds us what happens when we sum over probabilities type–by–type in IT problems: The exponentials $e^{-\beta E(x)}$ are analogous (up to a normalization factor) to probabilities, which in the memoryless case, are given by $P(x) = e^{-N[\hat{H}+D(\hat{P}\|P)]}$. Each such probability is weighted by the size of the type class, which as is known from the method of types, is exponentially $e^{N\hat{H}}$, whose physical analogue is $\Omega(E) \doteq e^{Ns(\epsilon)/k}$. The product gives $e^{-ND(\hat{P}\|P)}$ in IT and $e^{-\beta Nf}$ in statistical physics. This suggests that perhaps the free energy has some analogy with the divergence. Is this true? We will see shortly a somewhat more rigorous argument.
More formally, let us define
$$\phi(\beta) \triangleq \lim_{N\to\infty}\frac{\ln Z(\beta)}{N}, \qquad\qquad (32)$$
and, in order to avoid dragging the constant $k$, let us define $\Sigma(\epsilon) \triangleq s(\epsilon)/k$. Then, the above chain of equalities, written slightly differently, gives
$$\phi(\beta) = \lim_{N\to\infty}\frac{\ln Z(\beta)}{N} = \max_\epsilon\left[\Sigma(\epsilon) - \beta\epsilon\right].$$
Thus, $\phi(\beta)$ is (a certain variant of) the Legendre transform⁵ of $\Sigma(\epsilon)$. As $\Sigma(\epsilon)$ is (normally) a concave function, it can readily be shown (exercise) that the inverse transform is:
$$\Sigma(\epsilon) = \min_\beta\left[\beta\epsilon + \phi(\beta)\right]. \qquad\qquad (33)$$
⁵More precisely, the 1D Legendre transform of a real function $f(x)$ is defined as $g(y) = \sup_x[xy - f(x)]$. If $f$ is convex, it can readily be shown that: (i) the inverse transform has the very same form, i.e., $f(x) = \sup_y[xy - g(y)]$, and (ii) the derivatives $f'(x)$ and $g'(y)$ are inverses of each other.
The achiever, $\epsilon^*(\beta)$, of $\Sigma(\epsilon) - \beta\epsilon$ in the forward transform is obtained by equating the derivative to zero, i.e., it is the solution to the equation
$$\Sigma'(\epsilon) = \beta, \qquad\qquad (34)$$
or in other words, $\epsilon^*(\beta)$ is the inverse function of $\Sigma'(\cdot)$. By the same token, the achiever, $\beta^*(\epsilon)$, of $\beta\epsilon + \phi(\beta)$ in the backward transform is obtained by equating the other derivative to zero, i.e., it is the solution to the equation
$$\phi'(\beta) = -\epsilon, \qquad\qquad (35)$$
or in other words, $\beta^*(\epsilon)$ is the inverse function of $-\phi'(\cdot)$.
Exercise: Show that the functions $\epsilon^*(\beta)$ and $\beta^*(\epsilon)$ are inverses of one another.
This establishes a relationship between the typical per–particle energy $\epsilon$ and the inverse temperature $\beta$ that gives rise to $\epsilon$ (cf. the Lagrange interpretation above, where we said that $\beta$ controls the average energy).
Now, observe that whenever $\beta$ and $\epsilon$ are related as explained above, we have:
$$\Sigma(\epsilon) = \beta\epsilon + \phi(\beta). \qquad\qquad (36)$$
On the other hand, if we look at the per–particle Shannon entropy pertaining to the B–G distribution, we get:
$$\bar{H} \triangleq \lim_{N\to\infty}\frac{H(X)}{N} = \lim_{N\to\infty}\frac{1}{N}\sum_x P(x)\left[\beta E(x) + \ln Z(\beta)\right] = \beta\epsilon + \phi(\beta),$$
which is exactly the same expression as before, and so, $\Sigma(\epsilon)$ and $\bar{H}$ are identical whenever $\beta$ and $\epsilon$ are related accordingly. The former, as we recall, was defined as the normalized logarithm of the number of microstates with per–particle energy $\epsilon$. Thus, we have learned that the number of such microstates is exponentially $e^{N\Sigma(\epsilon)}$, a result that looks familiar from what we learned in the method of types in IT, using combinatorial arguments for finite–alphabet sequences. Here we obtained the same result from substantially different considerations, which are applicable in situations far more general than those of finite alphabets (continuous alphabets included). Another look at this relation is the following:
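The counting statement $\Omega \doteq e^{N\Sigma(\epsilon)}$ can be sanity-checked against the method of types directly. For a two-level system with unit level energy, $E(x) = \sum_i x_i$ with $x_i \in \{0,1\}$, the number of microstates at per-particle energy $\epsilon = k/N$ is the binomial coefficient $\binom{N}{k}$, and its normalized logarithm should approach the binary entropy. A short sketch:

```python
import math

# For the two-level system (unit level energy), Omega at eps = k/N is C(N,k);
# (1/N) ln C(N,k) -> h2(eps), i.e., Sigma(eps) = binary entropy, as claimed.
def h2(q):
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

N, k = 2000, 600                     # eps = 0.3
eps = k / N
# ln C(N,k) via log-gamma, avoiding huge integers
ln_omega = math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
# ln_omega / N agrees with h2(eps) up to O(log N / N) corrections
```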
$$\Omega(\epsilon)\cdot e^{-\beta N\epsilon} \le Z(\beta) \doteq e^{N\phi(\beta)}, \qquad\qquad (37)$$
which means that $\Omega(\epsilon) \le e^{N[\phi(\beta)+\beta\epsilon]}$ for all $\beta$, and so,
$$\Omega(\epsilon) \le e^{N\min_\beta[\phi(\beta)+\beta\epsilon]} = e^{N\Sigma(\epsilon)}. \qquad\qquad (38)$$
A compatible lower bound is obtained by observing that the minimizing $\beta$ gives rise to $\langle E(X_1)\rangle = \epsilon$, which makes the event $\{x:\ E(x)\approx N\epsilon\}$ a high–probability event, by the weak law of large numbers.
A good reference for further study, and from a more general perspective, is: M. J. W. Hall, “Universal geometric approach to uncertainty, entropy, and information,” Phys. Rev. A, vol. 59, no. 4, pp. 2602–2615, April 1999.
Having established the identity between the Shannon–theoretic entropy and the thermodynamical entropy, we now move on, as promised, to the free energy and seek its information–theoretic counterpart. More precisely, we will look at the difference between the free energies of two different probability distributions, one of which is the B–G distribution. Consider first the following chain of equalities concerning the B–G distribution $P(x) = e^{-\beta E(x)}/Z(\beta)$:
$$F(P) \triangleq \langle E(X)\rangle_P - kT\cdot H(P) = \sum_x P(x)E(x) + kT\sum_x P(x)\ln P(x) = \sum_x P(x)\left[E(x) - E(x) - kT\ln Z(\beta)\right] = -kT\ln Z(\beta). \qquad (39)$$
Consider next another probability distribution $Q$, different in general from $P$ and hence corresponding to non–equilibrium. Let us now look at the divergence:
$$D(Q\|P) = \sum_x Q(x)\ln\frac{Q(x)}{P(x)} = -H(Q) + \beta\sum_x Q(x)E(x) + \ln Z(\beta) = \beta\left[F(Q) - F(P)\right],$$
or equivalently,
$$F(Q) = F(P) + kT\cdot D(Q\|P).$$
Thus, the free energy difference is indeed related to the divergence. For a given temperature, the free energy away from equilibrium is always larger than the free energy at equilibrium. Since the system “wants” to minimize the free energy, it eventually converges to the B–G distribution. More details on this can be found in:

H. Qian, “Relative entropy: free energy …,” Phys. Rev. E, vol. 63, 042103, 2001.

G. B. Bağcı, arXiv:cond-mat/070300v1, 1 Mar. 2007.
Another interesting relation between the divergence and physical quantities is that the divergence is proportional to the dissipated work (average work minus free energy difference) between two equilibrium states at the same temperature but corresponding to two different values of some external control parameter. Details can be found in: R. Kawai, J. M. R. Parrondo, and C. Van den Broeck, “Dissipation: the phase–space perspective,” Phys. Rev. Lett., vol. 98, 080602, 2007.
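The identity $F(Q) = F(P) + kT\cdot D(Q\|P)$ above is exact for any finite state space, so it is easy to verify numerically. The sketch below (toy energies, $k = 1$) computes both sides for an arbitrary non-equilibrium $Q$:

```python
import math

# Verify F(Q) = F(P) + kT*D(Q||P) on a toy state space (k = 1).
E = [0.0, 1.0, 3.0]
T = 2.0
beta = 1.0 / T

Z = sum(math.exp(-beta * e) for e in E)
P = [math.exp(-beta * e) / Z for e in E]   # equilibrium (B-G) distribution
Q = [0.5, 0.3, 0.2]                        # arbitrary non-equilibrium law

def free_energy(R):
    avg_E = sum(r * e for r, e in zip(R, E))
    H = -sum(r * math.log(r) for r in R)
    return avg_E - T * H                   # F = <E> - T*S  (k = 1)

D = sum(q * math.log(q / p) for q, p in zip(Q, P))
# free_energy(P) equals -T*ln Z, and F(Q) - F(P) equals T*D exactly.
```

In particular `free_energy(Q) >= free_energy(P)`, illustrating that equilibrium minimizes the free energy at fixed temperature.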
Let us now summarize the main properties of the partition function that we have seen thus far:

$Z(\beta)$ is a continuous function. $Z(0) = |\mathcal{X}^N|$ and $\lim_{\beta\to\infty}\frac{\ln Z(\beta)}{\beta} = -E_{GS}$.

Generating moments: $\langle E(x)\rangle = -d\ln Z/d\beta$, $\mbox{Var}\{E(x)\} = d^2\ln Z/d\beta^2$, convexity of $\ln Z(\beta)$, and hence also of $\phi(\beta)$.

$\phi$ and $\Sigma$ are a Legendre–transform pair. $\Sigma$ is concave.

$\Sigma(\epsilon)$ coincides with the per–particle Shannon entropy of the B–G distribution.

Exercise: Consider $Z(\beta)$ for an imaginary temperature $\beta = i\omega$, where $i = \sqrt{-1}$, and define $z(E)$ as the inverse Fourier transform of $Z(i\omega)$. Show that $z(E)$ is the density of states, i.e., for $E_1 < E_2$, the number of states with energy between $E_1$ and $E_2$ is given by $\int_{E_1}^{E_2} z(E)\,dE$.
Thus, $Z(\beta)$ can be related to energy enumeration in two different ways: one is by the Legendre transform of $\phi(\beta)$ for real $\beta$, and the other is by the inverse Fourier transform of $Z(\beta)$ for imaginary $\beta$. This double connection between $Z(\beta)$ and $\Omega(E)$ is no coincidence, as we shall see later on.
Example – A two level system. Similarly to the earlier example of Schottky defects, which was previously given in the context of the microcanonical ensemble, consider now a system of $N$ independent particles, each having two possible states: state $0$ of zero energy and state $1$, whose energy is $\epsilon_0$, i.e., $E(x) = \epsilon_0 x$, $x\in\{0,1\}$. The $x_i$'s are independent, each having a marginal:
$$P(x) = \frac{e^{-\beta\epsilon_0 x}}{1 + e^{-\beta\epsilon_0}}, \qquad x\in\{0,1\}. \qquad\qquad (40)$$
In this case,
$$\phi(\beta) = \ln\left(1 + e^{-\beta\epsilon_0}\right) \qquad\qquad (41)$$
and
$$\Sigma(\epsilon) = \min_\beta\left[\beta\epsilon + \ln\left(1 + e^{-\beta\epsilon_0}\right)\right]. \qquad\qquad (42)$$
To find $\beta^*(\epsilon)$, we take the derivative and equate to zero:
$$\epsilon - \frac{\epsilon_0 e^{-\beta\epsilon_0}}{1 + e^{-\beta\epsilon_0}} = 0, \qquad\qquad (43)$$
which gives
$$\beta^*(\epsilon) = \frac{1}{\epsilon_0}\ln\frac{\epsilon_0 - \epsilon}{\epsilon}. \qquad\qquad (44)$$
On substituting this back into the above expression of $\Sigma(\epsilon)$, we get:
$$\Sigma(\epsilon) = \frac{\epsilon}{\epsilon_0}\ln\frac{\epsilon_0 - \epsilon}{\epsilon} + \ln\left(1 + \frac{\epsilon}{\epsilon_0 - \epsilon}\right), \qquad\qquad (45)$$
which after a short algebraic manipulation, becomes
$$\Sigma(\epsilon) = h_2\left(\frac{\epsilon}{\epsilon_0}\right) \triangleq -\frac{\epsilon}{\epsilon_0}\ln\frac{\epsilon}{\epsilon_0} - \left(1 - \frac{\epsilon}{\epsilon_0}\right)\ln\left(1 - \frac{\epsilon}{\epsilon_0}\right), \qquad\qquad (46)$$
just like in the Schottky example. In the other direction:
$$\phi(\beta) = \max_\epsilon\left[h_2\left(\frac{\epsilon}{\epsilon_0}\right) - \beta\epsilon\right], \qquad\qquad (47)$$
whose achiever $\epsilon^*(\beta)$ solves the zero–derivative equation:
$$\frac{1}{\epsilon_0}\ln\frac{\epsilon_0 - \epsilon}{\epsilon} = \beta, \qquad\qquad (48)$$
or equivalently,
$$\epsilon^*(\beta) = \frac{\epsilon_0 e^{-\beta\epsilon_0}}{1 + e^{-\beta\epsilon_0}}, \qquad\qquad (49)$$
which is exactly the inverse function of $\beta^*(\epsilon)$ above, and which when plugged back into the expression of $\phi(\beta)$, indeed gives
$$\phi(\beta) = \ln\left(1 + e^{-\beta\epsilon_0}\right). \qquad\qquad (50)$$
Comment: A very similar model (and hence with similar results) pertains to non–interacting spins (magnetic moments), where the only difference is that $x\in\{-1,+1\}$ rather than $x\in\{0,1\}$. Here, the meaning of the parameter $\epsilon_0$ becomes that of a magnetic field, which is more customarily denoted by $B$ (or $H$), and which is either parallel or antiparallel to that of the spin, and so the potential energy (in the appropriate physical units), $-Bx$, is either $-B$ or $+B$. Thus,
$$P(x) = \frac{e^{\beta Bx}}{2\cosh(\beta B)}; \qquad Z(\beta) = 2\cosh(\beta B). \qquad\qquad (51)$$
The net magnetization per–spin is defined as
$$m \triangleq \left\langle\frac{1}{N}\sum_{i=1}^N X_i\right\rangle = \langle X_1\rangle = \tanh(\beta B). \qquad\qquad (52)$$
This is the paramagnetic characteristic of the magnetization as a function of the magnetic field: As $B\to\pm\infty$, the magnetization $m\to\pm 1$ accordingly. When the magnetic field is removed ($B = 0$), the magnetization vanishes too. We will get back to this model and its extensions in the sequel.
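As a quick numerical sketch (toy parameter values), the magnetization can be obtained either by direct averaging over $x \in \{-1,+1\}$ or as a derivative of $\ln Z$ with respect to the field; both reproduce $\tanh(\beta B)$:

```python
import math

# Spin x in {-1,+1}, P(x) = e^{beta*B*x} / (2*cosh(beta*B)).
beta, B, h = 1.5, 0.4, 1e-6

def lnZ(b_field):
    return math.log(2 * math.cosh(beta * b_field))

Z = 2 * math.cosh(beta * B)
m_direct = sum(x * math.exp(beta * B * x) for x in (-1, 1)) / Z
# m = (1/beta) * d lnZ / dB, by the same moment-generating argument as before
m_deriv = (lnZ(B + h) - lnZ(B - h)) / (2 * h) / beta
```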
Exercise: Consider a system of $N$ non–interacting particles, each having a quadratic Hamiltonian, $E(x) = \frac{1}{2}\alpha x^2$, $x\in\mathbb{R}$. Show that here,
$$\phi(\beta) = \frac{1}{2}\ln\frac{2\pi}{\alpha\beta} \qquad\qquad (53)$$
and
$$\Sigma(\epsilon) = \frac{1}{2}\ln\frac{4\pi\epsilon}{\alpha} + \frac{1}{2}. \qquad\qquad (54)$$
Show that $\epsilon^*(\beta) = \frac{1}{2\beta}$ and hence $\langle E(X_1)\rangle = \frac{kT}{2}$.
2.5 The Energy Equipartition Theorem
From the last exercise, we have learned that for a quadratic Hamiltonian, $E(x) = \frac{1}{2}\alpha x^2$, we have $\epsilon^*(\beta) = \frac{1}{2\beta} = \frac{kT}{2}$, namely, the average per–particle energy is $kT/2$, independently of $\alpha$. If we have $N$ such quadratic terms, then of course, we end up with $NkT/2$. In the case of the ideal gas, we have 3 such terms (one for each dimension) per particle, thus a total of $3N$ terms, and so, $E = 3NkT/2$, which is exactly what we obtained also in the microcanonical ensemble, which is equivalent (recall that this was obtained then by equating $1/T$ to the derivative of $S(E)$). In fact, we observe that in the canonical ensemble, whenever we have a Hamiltonian of the form $\frac{1}{2}\alpha x_i^2$ plus some arbitrary terms that do not depend on $x_i$, then $X_i$ is Gaussian (with variance $kT/\alpha$) and independent of the other guys, i.e., $p(x_i) \propto e^{-\beta\alpha x_i^2/2}$. Hence it contributes an amount of
$$\left\langle\frac{1}{2}\alpha X_i^2\right\rangle = \frac{1}{2}\alpha\cdot\frac{kT}{\alpha} = \frac{kT}{2} \qquad\qquad (55)$$
to the total average energy, independently of $\alpha$. It is more precise to refer to this as a degree of freedom rather than a particle. This is because in the 3D world, the kinetic energy, for example, is given by $\frac{p_x^2}{2m} + \frac{p_y^2}{2m} + \frac{p_z^2}{2m}$, that is, each particle contributes three additive quadratic terms rather than one (just like three independent one–dimensional particles) and so, it contributes $3kT/2$. This principle is called the energy equipartition theorem. In the sequel, we will see that it is quite intimately related to rate–distortion theory for quadratic distortion measures.
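A Monte Carlo sketch of equipartition (toy parameters, $k = 1$): since the Boltzmann weight $e^{-\beta\alpha x^2/2}$ is a zero-mean Gaussian with variance $1/(\beta\alpha)$, sampling it directly should give $\langle\frac{1}{2}\alpha X^2\rangle \approx \frac{1}{2\beta} = \frac{T}{2}$ for every stiffness $\alpha$:

```python
import math
import random

# Equipartition check: <a*X^2/2> = 1/(2*beta), independently of a.
random.seed(0)
beta = 2.0          # so T/2 = 0.25 in units with k = 1
n = 200000

avg_E = {}
for a in (0.5, 1.0, 4.0):
    sigma = math.sqrt(1.0 / (beta * a))      # std-dev of the B-G Gaussian
    total = sum(0.5 * a * random.gauss(0.0, sigma) ** 2 for _ in range(n))
    avg_E[a] = total / n
# every avg_E[a] should be close to 0.25, regardless of a
```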
Below is a direct derivation of the equipartition theorem:
$$\left\langle\frac{1}{2}\alpha X^2\right\rangle = \frac{\int_{-\infty}^{\infty}\mathrm{d}x\,\left(\frac{1}{2}\alpha x^2\right)e^{-\frac{1}{2}\beta\alpha x^2}}{\int_{-\infty}^{\infty}\mathrm{d}x\,e^{-\frac{1}{2}\beta\alpha x^2}} = -\frac{\partial}{\partial\beta}\ln\left[\int_{-\infty}^{\infty}\mathrm{d}x\,e^{-\frac{1}{2}\beta\alpha x^2}\right] = -\frac{\partial}{\partial\beta}\ln\left[\frac{1}{\sqrt{\beta}}\int_{-\infty}^{\infty}\mathrm{d}u\,e^{-\frac{1}{2}\alpha u^2}\right] = \frac{1}{2\beta} = \frac{kT}{2},$$
where the third equality follows from the change of variable $u = x\sqrt{\beta}$. This simple trick, which bypasses the need to calculate integrals, can easily be extended in at least two directions (exercise):

Let $x = (x_1,\ldots,x_N)$ and let $E(x) = \frac{1}{2}x^TAx$, where $A$ is a positive definite matrix. This corresponds to a physical system with a quadratic Hamiltonian, which includes also interactions between pairs (e.g., harmonic oscillators or springs, which are coupled because they are tied to one another). It turns out that here, regardless of $A$, we get:
$$\langle E(X)\rangle = \left\langle\frac{1}{2}X^TAX\right\rangle = \frac{NkT}{2}. \qquad\qquad (56)$$
Back to the case of a scalar $x$, but suppose now a more general power–law Hamiltonian, $E(x) = \alpha|x|^\theta$. In this case, we get