An Elementary Introduction to Kalman Filtering
Abstract
Kalman filtering is a classic state estimation technique used widely in engineering applications such as statistical signal processing and control of vehicles. It is now being used to solve problems in computer systems, such as controlling the voltage and frequency of processors to minimize energy while meeting throughput requirements.
Although there are many presentations of Kalman filtering in the literature, they are usually focused on particular problem domains such as linear systems with Gaussian noise or robot navigation, which makes it difficult to understand the general principles behind Kalman filtering. In this paper, we first present the general statistical ideas behind Kalman filtering at a level accessible to anyone with a basic knowledge of probability theory and calculus, and then show how these abstract concepts can be applied to state estimation problems in linear systems. This separation of abstract concepts from applications should make it easier to apply Kalman filtering to other problems in computer systems.
1 Introduction
Kalman filtering is a state estimation technique invented in 1960 by Rudolf E. Kálmán [Kalman (1960)]. It is used in many areas including spacecraft navigation, motion planning in robotics, signal processing, and wireless sensor networks [Souza et al. (2016), Nagarajan et al. (2011), Thrun et al. (2005), Welch and Bishop (1995), Hess and Rantzer (2010)] because of its small computational and memory requirements, and its ability to extract useful information from noisy data. Recent work shows how Kalman filtering can be used in controllers for computer systems [Bergman (2009), Pothukuchi et al. (2016), Imes and Hoffmann (2016), Imes et al. (2015)].
Although many presentations of Kalman filtering exist in the literature [Barker et al. (1994), Welch and Bishop (1995), Balakrishnan (1987), Eubank (2005), Grewal and Andrews (2014), Evensen (2006), Faragher (2012), Chui and Chen (2017), Lindquist and Picci (2017), Nakamura et al. (2007), Cao and Schwartz (2004)], they are usually focused on particular applications like robot motion or state estimation in linear systems with Gaussian noise. This can make it difficult to see how to apply Kalman filtering to other problems. The goal of this paper is to present the abstract statistical ideas behind Kalman filtering independently of particular applications, and then show how these ideas can be applied to solve particular problems such as state estimation in linear systems.
Abstractly, Kalman filtering can be viewed as an algorithm for combining imprecise estimates of some unknown value to obtain a more precise estimate of that value. We use informal methods similar to Kalman filtering in everyday life. When we want to buy a house for example, we may ask a couple of real estate agents to give us independent estimates of what the house is worth. For now, we use the word “independent” informally to mean that the two agents are not allowed to consult with each other. If the two estimates are different, as is likely, how do we combine them into a single value to make an offer on the house? This is an example of a data fusion problem.
One solution to our real-estate problem is to take the average of the two estimates; if these estimates are $x_1$ and $x_2$, they are combined using the formula $0.5\,x_1 + 0.5\,x_2$. This gives equal weight to the estimates. Suppose however we have additional information about the two agents; perhaps the first one is a novice while the second one has a lot of experience in real estate. In that case, we may have more confidence in the second estimate, so we may give it more weight by using a formula such as $0.25\,x_1 + 0.75\,x_2$. In general, we can consider a convex combination of the two estimates, which is a formula of the form $(1-\alpha)\,x_1 + \alpha\,x_2$, where $0 \le \alpha \le 1$; intuitively, the more confidence we have in the second estimate, the closer $\alpha$ should be to $1$. In the extreme case, when $\alpha = 1$, we discard the first estimate and use only the second estimate.
The expression $(1-\alpha)\,x_1 + \alpha\,x_2$ is an example of a linear estimator [Kay (1993)]. The statistics behind Kalman filtering tell us how to pick the optimal value of $\alpha$: the weight given to an estimate should be proportional to the confidence we have in that estimate, which is intuitively reasonable.
To quantify these ideas, we need to formalize the concepts of estimates and confidence in estimates. Section 2 describes the model used in Kalman filtering. Estimates are modeled as random samples from certain distributions, and confidence in estimates is quantified in terms of the variances and covariances of these distributions.

How should uncertain estimates be fused optimally?
Section 3 shows how to fuse scalar estimates such as house prices optimally. It is also shown that the problem of fusing more than two estimates can be reduced to the problem of fusing two estimates at a time, without any loss in the quality of the final estimate.
Section 4 extends these results to estimates that are vectors, such as state vectors representing the estimated position and velocity of a robot or spacecraft. The math is more complicated than in the scalar case but the basic ideas remain the same, except that instead of working with variances of scalar estimates, we must work with covariance matrices of vector estimates.

In some applications, estimates are vectors but only a part of the vector may be directly observable. For example, the state of a spacecraft may be represented by its position and velocity, but only its position may be directly observable. In such cases, how do we obtain a complete estimate from a partial estimate?
Section 5 introduces the Best Linear Unbiased Estimator (BLUE), which is used in Kalman filtering for this purpose. It can be seen as a generalization of the ordinary least squares (OLS) method to problems in which data comes from distributions rather than being a set of discrete points.
Section 6 shows how these results can be used to solve state estimation problems for linear systems, which is the usual context for presenting Kalman filters. First, we consider the problem of state estimation when the entire state is observable, which can be solved using the data fusion results from Sections 3 and 4. Then we consider the more complex problem of state estimation when the state is only partially observable, which requires in addition the BLUE estimator from Section 5. Section 6.3 illustrates Kalman filtering with a concrete example.
2 Formalization of estimates
This section makes precise the notions of estimates and confidence in estimates, which were introduced informally in Section 1.
2.1 Scalar estimates
One way to model the behavior of an agent producing scalar estimates such as house prices is through a probability distribution function (usually shortened to distribution) like the ones shown in Figure 1, in which the $x$-axis represents values that can be assigned to the house, and the $y$-axis represents the probability density. Each agent has its own distribution, and obtaining an estimate from an agent corresponds to drawing a random sample $x_i$ from the distribution for agent $i$.
Most presentations of Kalman filters assume distributions are Gaussian, but we assume that we know only the mean $\mu_i$ and the variance $\sigma_i^2$ of each distribution $p_i$. We write $x_i \sim p_i(\mu_i, \sigma_i^2)$ to denote that $x_i$ is a random sample drawn from distribution $p_i$, which has a mean of $\mu_i$ and a variance of $\sigma_i^2$. The reciprocal of the variance of a distribution is sometimes called the precision of that distribution.
The informal notion of “confidence in the estimates made by an agent” is quantified by the variance of the distribution from which the estimates are drawn. An experienced agent making high-confidence estimates is modeled by a distribution with a smaller variance than one used to model an inexperienced agent; notice that in Figure 1, the inexperienced agent is “all over the map.”
This approach to modeling confidence in estimates may seem nonintuitive since there is no mention of how close the estimates made by an agent are to the actual value of the house. In particular, an agent can make estimates that are very far off from the actual value of the house but as long as his estimates fall within a narrow range of values, we would still say that we have high confidence in his estimates. In statistics, this is explained by making a distinction between accuracy and precision. Accuracy is a measure of how close an estimate of a quantity is to the true value of that quantity (the true value is sometimes called the ground truth). Precision on the other hand is a measure of how close the estimates are to each other, and is defined without reference to ground truth. A metaphor that is often used to explain this difference is shooting at a bullseye. In this case, ground truth is provided by the center of the bullseye. A precise shooter is one whose shots are clustered closely together even if they may be far from the bullseye. In contrast, the shots of an accurate but not precise shooter would be scattered widely in a region surrounding the bullseye. For the problems considered in this paper, there may be no ground truth, and confidence in estimates is related to precision, not accuracy.
The informal notion of getting independent estimates from the two agents is modeled by requiring that estimates $x_1$ and $x_2$ be uncorrelated; that is, $E[(x_1-\mu_1)(x_2-\mu_2)] = 0$. This is not the same thing as requiring them to be independent random variables, as explained in Appendix A.1. Lemma 2.1 shows how the mean and variance of a linear combination of pairwise uncorrelated random variables can be computed from the means and variances of the random variables.
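Lemma 2.1. Let $x_1 \sim p_1(\mu_1, \sigma_1^2), \ldots, x_n \sim p_n(\mu_n, \sigma_n^2)$ be a set of pairwise uncorrelated random variables. Let $y = \sum_{i=1}^{n} \alpha_i x_i$ be a random variable that is a linear combination of the $x_i$'s. The mean and variance of $y$ are the following:
$$\mu_y = \sum_{i=1}^{n} \alpha_i \mu_i \tag{1}$$
$$\sigma_y^2 = \sum_{i=1}^{n} \alpha_i^2 \sigma_i^2 \tag{2}$$
Equation 1 follows from the fact that expectation is a linear operator:
$$\mu_y = E[y] = E\Big[\sum_{i=1}^{n} \alpha_i x_i\Big] = \sum_{i=1}^{n} \alpha_i E[x_i] = \sum_{i=1}^{n} \alpha_i \mu_i.$$
Equation 2 follows from linearity of the expectation operator and the fact that the estimates are pairwise uncorrelated:
$$\sigma_y^2 = E[(y-\mu_y)^2] = E\Big[\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j (x_i-\mu_i)(x_j-\mu_j)\Big] = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j E[(x_i-\mu_i)(x_j-\mu_j)].$$
Since the variables $x_1, \ldots, x_n$ are pairwise uncorrelated, $E[(x_i-\mu_i)(x_j-\mu_j)] = 0$ if $i \neq j$, from which the result follows.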
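As a quick numerical illustration of Lemma 2.1, the following minimal Python sketch evaluates Equations 1 and 2 for given weights, means, and variances. The function name `linear_combination_stats` and the sample numbers are illustrative assumptions, not part of the paper, and the inputs are assumed to describe pairwise uncorrelated variables.

```python
def linear_combination_stats(alphas, means, variances):
    """Mean and variance of y = sum(alpha_i * x_i).

    Assumes the x_i are pairwise uncorrelated, so no cross terms
    appear in the variance (Lemma 2.1).
    """
    if not (len(alphas) == len(means) == len(variances)):
        raise ValueError("alphas, means and variances must have equal length")
    mean_y = sum(a * m for a, m in zip(alphas, means))          # Equation 1
    var_y = sum(a * a * v for a, v in zip(alphas, variances))   # Equation 2
    return mean_y, var_y


# Averaging two uncorrelated estimates with equal variance halves the variance.
mean_y, var_y = linear_combination_stats([0.5, 0.5], [100.0, 100.0], [4.0, 4.0])
print(mean_y, var_y)
```

With equal weights and equal variances of 4, the combined variance drops to 2, which is the basic reason that fusing estimates is worthwhile.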
2.2 Vector estimates
In some applications, estimates are vectors. For example, the state of a robot moving along a single dimension might be represented by a vector containing its position and velocity. Similarly, the vital signs of a person might be represented by a vector containing their temperature, pulse rate and blood pressure. In this paper, we denote a vector by a boldfaced lowercase letter, and a matrix by an uppercase letter. The covariance matrix of a random variable $\mathbf{x}$ with mean $\boldsymbol{\mu}_x$ is the matrix $\Sigma_{xx} = E[(\mathbf{x}-\boldsymbol{\mu}_x)(\mathbf{x}-\boldsymbol{\mu}_x)^T]$.
Estimates: An estimate $\mathbf{x}_i$ is a random sample drawn from a distribution with mean $\boldsymbol{\mu}_i$ and covariance matrix $\Sigma_i$, written as $\mathbf{x}_i \sim p_i(\boldsymbol{\mu}_i, \Sigma_i)$. The inverse of the covariance matrix, $\Sigma_i^{-1}$, is called the precision or information matrix. Note that if the dimension of $\mathbf{x}_i$ is one, the covariance matrix reduces to variance.
Uncorrelated estimates: Estimates $\mathbf{x}_i$ and $\mathbf{x}_j$ are uncorrelated if $E[(\mathbf{x}_i-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_j)^T] = 0$. This is equivalent to saying that every component of $\mathbf{x}_i$ is uncorrelated with every component of $\mathbf{x}_j$.
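Lemma 2.2. Let $\mathbf{x}_1 \sim p_1(\boldsymbol{\mu}_1, \Sigma_1), \ldots, \mathbf{x}_n \sim p_n(\boldsymbol{\mu}_n, \Sigma_n)$ be a set of pairwise uncorrelated random vectors of length $m$. Let $\mathbf{y} = \sum_{i=1}^{n} A_i \mathbf{x}_i$. Then, the mean and covariance matrix of $\mathbf{y}$ are the following:
$$\boldsymbol{\mu}_y = \sum_{i=1}^{n} A_i \boldsymbol{\mu}_i \tag{3}$$
$$\Sigma_{yy} = \sum_{i=1}^{n} A_i \Sigma_i A_i^T \tag{4}$$
The proof is similar to the proof of Lemma 2.1.
Equation 3 follows from the linearity of the expectation operator.
Equation 4 can be proved as follows:
$$\Sigma_{yy} = E[(\mathbf{y}-\boldsymbol{\mu}_y)(\mathbf{y}-\boldsymbol{\mu}_y)^T] = E\Big[\sum_{i=1}^{n}\sum_{j=1}^{n} A_i (\mathbf{x}_i-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_j)^T A_j^T\Big] = \sum_{i=1}^{n}\sum_{j=1}^{n} A_i\, E[(\mathbf{x}_i-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_j)^T]\, A_j^T.$$
The variables $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are pairwise uncorrelated, therefore $E[(\mathbf{x}_i-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_j)^T] = 0$ if $i \neq j$, from which the result follows.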
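The covariance rule in Equation 4 can be checked on small matrices. The sketch below uses plain Python lists as matrices; the helper names (`matmul`, `transpose`, `matadd`, `combined_covariance`) and the sample covariance matrices are assumptions made for illustration only.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def combined_covariance(weights, covariances):
    """Sigma_yy = sum_i A_i Sigma_i A_i^T for pairwise uncorrelated x_i (Equation 4)."""
    total = None
    for a, sigma in zip(weights, covariances):
        term = matmul(matmul(a, sigma), transpose(a))
        total = term if total is None else matadd(total, term)
    return total


half = [[0.5, 0.0], [0.0, 0.5]]          # A_1 = A_2 = I/2 (plain averaging)
sigma1 = [[4.0, 1.0], [1.0, 2.0]]
sigma2 = [[2.0, 0.0], [0.0, 6.0]]
sigma_yy = combined_covariance([half, half], [sigma1, sigma2])
print(sigma_yy)
```

Here both weight matrices are half the identity, so the result is simply one quarter of the sum of the two covariance matrices.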
3 Fusing Scalar Estimates
Section 3.1 discusses the problem of fusing two scalar estimates. Section 3.2 generalizes this to the problem of fusing $n$ scalar estimates. Section 3.3 shows that fusing $n > 2$ estimates can be done iteratively by fusing two estimates at a time without any loss of quality in the final estimate.
3.1 Fusing two scalar estimates
We now consider the problem of choosing the optimal value of the parameter $\alpha$ in the formula $y_\alpha(x_1,x_2) = (1-\alpha)\,x_1 + \alpha\,x_2$ for fusing uncorrelated estimates $x_1$ and $x_2$. How should optimality be defined? One reasonable definition is that the optimal value of $\alpha$ minimizes the variance of $y_\alpha(x_1,x_2)$; since confidence in an estimate is inversely proportional to the variance of the distribution from which the estimates are drawn, this definition of optimality will produce the highest-confidence fused estimates. The variance of $y_\alpha(x_1,x_2)$ is called the mean square error (MSE) of that estimator, and it obviously depends on $\alpha$; the minimum value of this variance as $\alpha$ is varied is called the minimum mean square error (MMSE) below.
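Theorem 3.1. Let $x_1 \sim p_1(\mu_1, \sigma_1^2)$ and $x_2 \sim p_2(\mu_2, \sigma_2^2)$ be uncorrelated estimates, and suppose they are fused using the formula $y_\alpha(x_1,x_2) = (1-\alpha)\,x_1 + \alpha\,x_2$. The value of $MSE(y_\alpha)$ is minimized for $\alpha = \frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2}$.
From Lemma 2.1,
$$\sigma_y^2(\alpha) = (1-\alpha)^2\sigma_1^2 + \alpha^2\sigma_2^2 \tag{5}$$
Differentiating $\sigma_y^2(\alpha)$ with respect to $\alpha$ and setting the derivative to zero proves the result. The second derivative, $2(\sigma_1^2+\sigma_2^2)$, is positive, showing that $\sigma_y^2(\alpha)$ reaches a minimum at this point.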
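In the literature, this optimal value of $\alpha$ is called the Kalman gain $K$.
Substituting $K$ into the linear fusion model, we get the optimal linear estimator $y_K(x_1,x_2)$:
$$y_K(x_1,x_2) = \frac{\sigma_2^2}{\sigma_1^2+\sigma_2^2}\,x_1 + \frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2}\,x_2$$
As a step towards fusion of $n > 2$ estimates, it is useful to rewrite this as follows:
$$y_K(x_1,x_2) = \frac{1/\sigma_1^2}{1/\sigma_1^2+1/\sigma_2^2}\,x_1 + \frac{1/\sigma_2^2}{1/\sigma_1^2+1/\sigma_2^2}\,x_2 \tag{6}$$
Substituting $K$ into Equation 5 gives the following expression for the variance of $y_K$:
$$\sigma_{y_K}^2 = \frac{1}{\dfrac{1}{\sigma_1^2}+\dfrac{1}{\sigma_2^2}} \tag{7}$$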
The expressions for $y_K$ and $\sigma_{y_K}^2$ are complicated because they contain the reciprocals of variances. If we let $\nu_1 = 1/\sigma_1^2$ and $\nu_2 = 1/\sigma_2^2$ denote the precisions of the two distributions, the expressions for $y_K$ and $\nu_{y_K}$ can be written more simply as follows:
$$y_K(x_1,x_2) = \frac{\nu_1}{\nu_1+\nu_2}\,x_1 + \frac{\nu_2}{\nu_1+\nu_2}\,x_2 \tag{8}$$
$$\nu_{y_K} = \nu_1 + \nu_2 \tag{9}$$
These results say that the weight we should give to an estimate is proportional to the confidence we have in that estimate, which is intuitively reasonable. Note that if $\mu_1 = \mu_2 = \mu$, the expectation $E[y_\alpha]$ is $\mu$ regardless of the value of $\alpha$. In this case, $y_\alpha$ is said to be an unbiased estimator, and the optimal value of $\alpha$ minimizes the variance of the unbiased estimator.
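To make the result of this section concrete, here is a minimal Python sketch that fuses two uncorrelated scalar estimates using the optimal weight from Theorem 3.1. The function name `fuse_two` and the house-price numbers (in thousands) are hypothetical and chosen only for illustration.

```python
def fuse_two(x1, var1, x2, var2):
    """Fuse two uncorrelated scalar estimates with the variance-minimizing weight."""
    k = var1 / (var1 + var2)                        # optimal alpha (the Kalman gain)
    y = (1 - k) * x1 + k * x2                       # convex combination of the estimates
    var_y = (1 - k) ** 2 * var1 + k ** 2 * var2     # Equation 5 evaluated at alpha = k
    return y, var_y, k


# A novice agent (variance 400) says 300; an experienced one (variance 100) says 340.
y, var_y, k = fuse_two(300.0, 400.0, 340.0, 100.0)
print(y, var_y, k)
```

The fused value lies closer to the experienced agent's estimate, and the fused variance (about 80) is smaller than either input variance, in agreement with Equation 7.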
3.2 Fusing multiple scalar estimates
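The approach in Section 3.1 can be generalized to optimally fuse multiple pairwise uncorrelated estimates $x_1, x_2, \ldots, x_n$. Let $y_{\boldsymbol{\alpha}}(x_1,\ldots,x_n)$ denote the linear estimator given parameters $\alpha_1, \ldots, \alpha_n$, which we denote by $\boldsymbol{\alpha}$.
Theorem 3.2. Let pairwise uncorrelated estimates $x_i\ (1 \le i \le n)$ drawn from distributions $p_i(\mu_i, \sigma_i^2)$ be fused using the linear model $y_{\boldsymbol{\alpha}}(x_1,\ldots,x_n) = \sum_{i=1}^{n} \alpha_i x_i$ where $\sum_{i=1}^{n} \alpha_i = 1$. The value of $MSE(y_{\boldsymbol{\alpha}})$ is minimized for
$$\alpha_i = \frac{1/\sigma_i^2}{\sum_{j=1}^{n} 1/\sigma_j^2}.$$
From Lemma 2.1, $\sigma_y^2(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i^2\sigma_i^2$. To find the values of $\alpha_i$ that minimize the variance $\sigma_y^2$ under the constraint that the $\alpha_i$'s sum to $1$, we use the method of Lagrange multipliers. Define
$$f(\alpha_1,\ldots,\alpha_n) = \sum_{i=1}^{n} \alpha_i^2\sigma_i^2 + \lambda\Big(\sum_{i=1}^{n}\alpha_i - 1\Big)$$
where $\lambda$ is the Lagrange multiplier. Taking the partial derivatives of $f$ with respect to each $\alpha_i$ and setting these derivatives to zero, we find $\alpha_1\sigma_1^2 = \alpha_2\sigma_2^2 = \cdots = \alpha_n\sigma_n^2 = -\lambda/2$. From this, and the fact that the sum of the $\alpha_i$'s is $1$, the result follows.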
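The variance of the optimal estimator is given by the following expression:
$$\sigma_{y_{\boldsymbol{\alpha}}}^2 = \frac{1}{\sum_{j=1}^{n} \dfrac{1}{\sigma_j^2}} \tag{10}$$
As in Section 3.1, these expressions are more intuitive if the variance is replaced with precision.
$$y_{\boldsymbol{\alpha}}(x_1,\ldots,x_n) = \sum_{i=1}^{n} \frac{\nu_i}{\nu_1+\cdots+\nu_n}\,x_i \tag{11}$$
$$\nu_{y_{\boldsymbol{\alpha}}} = \sum_{i=1}^{n} \nu_i \tag{12}$$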
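Equations 10 to 12 amount to a precision-weighted average. The following short Python sketch shows one way to compute it; `fuse_many` and the sample estimates are illustrative assumptions rather than anything defined in the paper.

```python
def fuse_many(estimates, variances):
    """Precision-weighted fusion of pairwise uncorrelated scalar estimates."""
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one positive variance per estimate")
    precisions = [1.0 / v for v in variances]       # nu_i = 1 / sigma_i^2
    total_precision = sum(precisions)               # Equation 12
    y = sum(nu * x for nu, x in zip(precisions, estimates)) / total_precision  # Equation 11
    return y, 1.0 / total_precision                 # fused estimate and its variance (Equation 10)


y_all, var_all = fuse_many([300.0, 340.0, 320.0], [400.0, 100.0, 200.0])
print(y_all, var_all)
```

Each weight is the estimate's share of the total precision, so an estimate with a very large variance contributes almost nothing to the result.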
3.3 Incremental fusing is optimal
In many applications, the estimates become available successively over a period of time. While it is possible to store all the estimates and use Equations 11 and 12 to fuse all the estimates from scratch whenever a new estimate becomes available, it is possible to save both time and storage if one can do this fusion incrementally. In this section, we show that just as a sequence of numbers can be added by keeping a running sum and adding the numbers to this running sum one at a time, a sequence of $n > 2$ estimates can be fused by keeping a “running estimate” and fusing estimates from the sequence one at a time into this running estimate without any loss in the quality of the final estimate. In short, writing $y_n$ for the optimal fusion of $n$ estimates, we want to show that $y_n(x_1, x_2, \ldots, x_n) = y_2(y_2(\ldots y_2(y_2(x_1, x_2), x_3) \ldots), x_n)$. Note that this is not the same thing as showing that $y_2$, interpreted as a binary function, is associative.
Figure 2 shows the process of incrementally fusing $n$ estimates. Imagine that time progresses from left to right in this picture. Estimate $x_1$ is available initially, and the other estimates $x_i$ become available in succession; the precision of each estimate is shown in parentheses next to each estimate. When the estimate $x_2$ becomes available, it is fused with $x_1$ using Equation 8. In Figure 2, the labels on the edges connecting $x_1$ and $x_2$ to $y_2(x_1,x_2)$ are the weights given to these estimates in Equation 8. When estimate $x_3$ becomes available, it is fused with $y_2(x_1,x_2)$ using Equation 8; as before, the labels on the edges correspond to the weights given to $y_2(x_1,x_2)$ and $x_3$ when they are fused to produce $y_2(y_2(x_1,x_2),x_3)$.
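The claim that incremental fusion loses nothing can be checked numerically. The sketch below represents each estimate as an (estimate, precision) pair and compares a left-to-right fold using Equations 8 and 9 against a one-shot fusion using Equations 11 and 12. Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison is not blurred by rounding; the function names and sample values are assumptions for illustration.

```python
from fractions import Fraction
from functools import reduce


def fuse_pair(a, b):
    """Fuse two (estimate, precision) pairs using Equations 8 and 9."""
    (x1, nu1), (x2, nu2) = a, b
    nu = nu1 + nu2
    return ((nu1 * x1 + nu2 * x2) / nu, nu)


def fuse_batch(pairs):
    """Fuse all (estimate, precision) pairs at once using Equations 11 and 12."""
    nu = sum(p for _, p in pairs)
    return (sum(p * x for x, p in pairs) / nu, nu)


pairs = [(Fraction(300), Fraction(1, 400)),
         (Fraction(340), Fraction(1, 100)),
         (Fraction(320), Fraction(1, 200))]

running = reduce(fuse_pair, pairs)   # y2(y2(x1, x2), x3), one estimate at a time
batch = fuse_batch(pairs)            # y3(x1, x2, x3), all at once
print(running, batch)
```

The key point is that the running estimate must carry its own precision forward; fusing the running value without its accumulated precision would not reproduce the batch result.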
3.4 Summary
The main result in this section can be summarized informally as follows. When using a linear model to fuse uncertain scalar estimates, the weight given to each estimate should be inversely proportional to the variance of that estimate. Furthermore, when fusing $n > 2$ estimates, the estimates can be fused incrementally without any loss in the quality of the final result. More formally, the results in this section for fusing scalar estimates are often expressed in terms of the Kalman gain, as shown below; these equations can be applied recursively to fuse multiple estimates.
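$$K = \frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2} = \frac{\nu_2}{\nu_1+\nu_2} \tag{13}$$
$$y(x_1,x_2) = x_1 + K\,(x_2 - x_1) \tag{14}$$
$$\sigma_y^2 = (1-K)\,\sigma_1^2 \tag{15}$$
$$\nu_y = \nu_1 + \nu_2 \tag{16}$$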
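The gain form of the update lends itself to a streaming implementation that stores only the running estimate and its variance. The following Python sketch is one possible rendering of Equations 13 to 15; `kalman_update` and the sample stream are hypothetical names and values.

```python
def kalman_update(y, var_y, x, var_x):
    """Fold a new uncorrelated estimate (x, var_x) into the running estimate (y, var_y)."""
    k = var_y / (var_y + var_x)      # Equation 13: Kalman gain
    y_new = y + k * (x - y)          # Equation 14: correct the running estimate
    var_new = (1 - k) * var_y        # Equation 15: variance shrinks by the factor (1 - K)
    return y_new, var_new


estimate, variance = 300.0, 400.0    # first estimate and its variance
for x, var_x in [(340.0, 100.0), (320.0, 200.0)]:
    estimate, variance = kalman_update(estimate, variance, x, var_x)
print(estimate, variance)
```

Since $0 < K < 1$ whenever both variances are positive, every update reduces the variance of the running estimate.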
4 Fusing Vector Estimates
This section addresses the problem of fusing estimates when the estimates are vectors. Although the math is more complicated, the conclusion is that the results in Section 3 for fusing scalar estimates can be extended to the vector case simply by replacing variances with covariance matrices, as shown in this section.
4.1 Fusing multiple vector estimates
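For vectors, the linear data fusion model is
$$\mathbf{y}_A(\mathbf{x}_1,\ldots,\mathbf{x}_n) = \sum_{i=1}^{n} A_i\mathbf{x}_i \quad \text{where} \quad \sum_{i=1}^{n} A_i = I \tag{17}$$
Here $A$ stands for the matrix parameters $(A_1, \ldots, A_n)$. All the vectors $\mathbf{x}_i$ are assumed to be of the same length. If the means $\boldsymbol{\mu}_i$ of the random variables $\mathbf{x}_i$ are identical, this is an unbiased estimator.
Optimality: The MSE in this case is the expected value of the squared two-norm of $(\mathbf{y}_A - \boldsymbol{\mu}_{y_A})$, which is $E[(\mathbf{y}_A - \boldsymbol{\mu}_{y_A})^T(\mathbf{y}_A - \boldsymbol{\mu}_{y_A})]$. Note that if the vectors have length 1, this reduces to variance. The parameters $A_1, \ldots, A_n$ in the linear data fusion model are chosen to minimize this MSE.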
Theorem 4.1 generalizes Theorem 3.2 to the vector case. The proof of this theorem uses matrix derivatives and is given in Appendix A.3 since it is not needed for understanding the rest of this paper. What is important is to compare Theorems 4.1 and 3.2 and notice that the expressions are similar, the main difference being that the role of variance in the scalar case is played by the covariance matrix in the vector case.
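Theorem 4.1. Let pairwise uncorrelated estimates $\mathbf{x}_i\ (1 \le i \le n)$ drawn from distributions $p_i(\boldsymbol{\mu}_i, \Sigma_i)$ be fused using the linear model $\mathbf{y}_A(\mathbf{x}_1,\ldots,\mathbf{x}_n) = \sum_{i=1}^{n} A_i\mathbf{x}_i$, where $\sum_{i=1}^{n} A_i = I$. The $MSE(\mathbf{y}_A)$ is minimized for
$$A_i = \Big(\sum_{j=1}^{n} \Sigma_j^{-1}\Big)^{-1}\Sigma_i^{-1} \tag{18}$$
The covariance matrix of the optimal estimator can be determined by substituting the optimal $A_i$ values into the expression for $\Sigma_{yy}$ in Lemma 2.2.
$$\Sigma_{yy} = \Big(\sum_{j=1}^{n} \Sigma_j^{-1}\Big)^{-1} \tag{19}$$
In the vector case, precision is the inverse of a covariance matrix, denoted by $N$. Equations 20–21 use precision to express the optimal estimator and its variance, and generalize Equations 11–12 to the vector case.
$$\mathbf{y}(\mathbf{x}_1,\ldots,\mathbf{x}_n) = \sum_{i=1}^{n} \Big(\sum_{j=1}^{n} N_j\Big)^{-1} N_i\,\mathbf{x}_i \tag{20}$$
$$N_y = \sum_{j=1}^{n} N_j \tag{21}$$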
As in the scalar case, fusion of vector estimates can be done incrementally without loss of precision. The proof is similar to the one in Section 3.3, and is omitted.
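There are several equivalent expressions for the Kalman gain for the fusion of two estimates. The following one, which is easily derived from Equation 18, is the vector analog of Equation 13:
$$K = \Sigma_1(\Sigma_1+\Sigma_2)^{-1} \tag{22}$$
The covariance matrix of $\mathbf{y}_K$ can be written as follows.
$$\Sigma_{yy} = \Sigma_1(\Sigma_1+\Sigma_2)^{-1}\Sigma_2 \tag{23}$$
$$\phantom{\Sigma_{yy}} = K\Sigma_2 = \Sigma_1 - K\Sigma_1 \tag{24}$$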
4.2 Summary
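The results in this section can be summarized in terms of the Kalman gain $K$ as follows:
$$K = \Sigma_1(\Sigma_1+\Sigma_2)^{-1} = (N_1+N_2)^{-1}N_2 \tag{25}$$
$$\mathbf{y}(\mathbf{x}_1,\mathbf{x}_2) = \mathbf{x}_1 + K(\mathbf{x}_2 - \mathbf{x}_1) \tag{26}$$
$$\Sigma_{yy} = (I-K)\,\Sigma_1 \tag{27}$$
$$N_y = N_1 + N_2 \tag{28}$$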
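The vector update has the same shape as the scalar one, with matrix products and an inverse in place of multiplication and division. The sketch below restricts itself to two-dimensional estimates so that it can use a hand-written 2x2 inverse and plain Python lists; all helper names and the sample covariance matrices are illustrative assumptions.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def fuse_vectors(x1, sigma1, x2, sigma2):
    """Fuse two uncorrelated 2-vector estimates in Kalman gain form."""
    k = matmul(sigma1, inv2(matadd(sigma1, sigma2)))           # gain: Sigma_1 (Sigma_1 + Sigma_2)^-1
    diff = [b - a for a, b in zip(x1, x2)]                     # x2 - x1
    y = [a + sum(k[i][j] * diff[j] for j in range(2))          # x1 + K (x2 - x1)
         for i, a in enumerate(x1)]
    i_minus_k = [[(1.0 if i == j else 0.0) - k[i][j] for j in range(2)]
                 for i in range(2)]
    sigma_y = matmul(i_minus_k, sigma1)                        # (I - K) Sigma_1
    return y, sigma_y, k


sigma1 = [[2.0, 1.0], [1.0, 2.0]]
sigma2 = [[2.0, -1.0], [-1.0, 2.0]]
y_vec, sigma_y, gain = fuse_vectors([10.0, 20.0], sigma1, [12.0, 18.0], sigma2)
print(y_vec, sigma_y, gain)
```

In this example the two input covariances have opposite correlations, and the fused covariance comes out diagonal with entries 0.75, smaller than the variance of 2 that each input has in every component.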
5 Best linear unbiased estimator (BLUE)
In some applications, estimates are vectors but only part of the vector may be given to us directly, and it is necessary to estimate the hidden portion. This section introduces a statistical method called the Best Linear Unbiased Estimator (BLUE).
Consider the general problem of determining a value for vector $\mathbf{y}$ given a value for a vector $\mathbf{x}$. If there is a functional relationship between $\mathbf{x}$ and $\mathbf{y}$ (say $\mathbf{y} = F(\mathbf{x})$ and $F$ is given), it is easy to compute $\mathbf{y}$ given a value for $\mathbf{x}$. In our context however, $\mathbf{x}$ and $\mathbf{y}$ are random variables so such a precise functional relationship will not hold. The best we can do is to estimate the likely value of $\mathbf{y}$, given a value of $\mathbf{x}$ and the information we have about how $\mathbf{x}$ and $\mathbf{y}$ are correlated.
Figure 3 shows an example in which $x$ and $y$ are scalar-valued random variables. The grey ellipse in this figure, called a confidence ellipse, is a projection of the joint distribution of $x$ and $y$ onto the $(x, y)$ plane that shows where some large proportion of the $(x, y)$ values are likely to be. For a given value $x_1$, there are in general many points $(x_1, y)$ that lie within this ellipse, but these $y$ values are clustered around the line shown in the figure so the value $\hat{y}_1$ is a reasonable estimate for the value of $y$ corresponding to $x_1$. This line, called the best linear unbiased estimator (BLUE), is the analog of ordinary least squares (OLS) for distributions. Given a set of discrete points $(x_i, y_i)$ where each $x_i$ and $y_i$ are scalars, OLS determines the “best” linear relationship between these points, where best is defined as minimizing the square error between the predicted and actual values of $y_i$. This relation can then be used to predict a value for $y$ given a value for $x$. The BLUE estimator presented below generalizes this to the case when $\mathbf{x}$ and $\mathbf{y}$ are vectors, and are random variables obtained from distributions instead of a set of discrete points.
5.1 Computing BLUE
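Let $\mathbf{x} \sim p_x(\boldsymbol{\mu}_x, \Sigma_{xx})$ and $\mathbf{y} \sim p_y(\boldsymbol{\mu}_y, \Sigma_{yy})$ be random variables. Consider a linear estimator $\hat{\mathbf{y}}_{A,\mathbf{b}}(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$. How should we choose $A$ and $\mathbf{b}$? As in the OLS approach, we can pick values that minimize the MSE between random variable $\mathbf{y}$ and the estimate $\hat{\mathbf{y}}$:
$$MSE_{A,\mathbf{b}}(\hat{\mathbf{y}}) = E[(\mathbf{y} - (A\mathbf{x}+\mathbf{b}))^T(\mathbf{y} - (A\mathbf{x}+\mathbf{b}))]$$
Setting the partial derivatives of $MSE_{A,\mathbf{b}}(\hat{\mathbf{y}})$ with respect to $\mathbf{b}$ and $A$ to zero, we find that $\mathbf{b} = \boldsymbol{\mu}_y - A\boldsymbol{\mu}_x$, and $A = \Sigma_{yx}\Sigma_{xx}^{-1}$, where $\Sigma_{yx}$ is the covariance between $\mathbf{y}$ and $\mathbf{x}$. Therefore, the best linear estimator is
$$\hat{\mathbf{y}} = \boldsymbol{\mu}_y + \Sigma_{yx}\Sigma_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x) \tag{29}$$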
This is an unbiased estimator because the mean of $\hat{\mathbf{y}}$ is equal to $\boldsymbol{\mu}_y$. Note that if $\Sigma_{yx} = 0$ (that is, $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated), the best estimate of $\mathbf{y}$ is just $\boldsymbol{\mu}_y$, so knowing the value of $\mathbf{x}$ does not give us any additional information about $\mathbf{y}$, as one would expect. In Figure 3, this corresponds to the case when the BLUE line is parallel to the $x$-axis. At the other extreme, suppose that $\mathbf{y}$ and $\mathbf{x}$ are functionally related so $\mathbf{y} = C\mathbf{x}$. In that case, it is easy to see that $\Sigma_{yx} = C\Sigma_{xx}$, so $\hat{\mathbf{y}} = C\mathbf{x}$ as expected. In Figure 3, this corresponds to the case when the confidence ellipse shrinks down to the BLUE line.
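The covariance matrix of the error of this estimator, $\mathbf{y} - \hat{\mathbf{y}}$, which measures the uncertainty that remains in $\mathbf{y}$ once $\mathbf{x}$ is known, is the following:
$$E[(\mathbf{y}-\hat{\mathbf{y}})(\mathbf{y}-\hat{\mathbf{y}})^T] = \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy} \tag{30}$$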
Intuitively, knowing the value of x permits us to reduce the uncertainty in the value of y by an additive term that depends on how strongly y and x are correlated.
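For scalar $x$ and $y$, the BLUE estimate and the reduction in uncertainty can be computed in a few lines. The sketch below is a scalar specialization of Equations 29 and 30, in which the matrix inverse becomes a division; the function name `blue_scalar` and the sample statistics are assumptions made for illustration.

```python
def blue_scalar(x, mu_x, mu_y, var_x, var_y, cov_yx):
    """Scalar BLUE: estimate y from an observed x, plus the remaining variance of y."""
    if var_x <= 0:
        raise ValueError("var_x must be positive")
    slope = cov_yx / var_x                       # scalar form of Sigma_yx Sigma_xx^-1
    y_hat = mu_y + slope * (x - mu_x)            # Equation 29
    remaining_var = var_y - slope * cov_yx       # Equation 30: var_y - cov^2 / var_x
    return y_hat, remaining_var


# Correlated case: observing x = 2 shifts the estimate of y and reduces its uncertainty.
y_hat, remaining = blue_scalar(2.0, mu_x=0.0, mu_y=10.0, var_x=4.0, var_y=9.0, cov_yx=3.0)
# Uncorrelated case: the estimate stays at the mean and no uncertainty is removed.
y_hat0, remaining0 = blue_scalar(2.0, mu_x=0.0, mu_y=10.0, var_x=4.0, var_y=9.0, cov_yx=0.0)
print(y_hat, remaining, y_hat0, remaining0)
```

The subtracted term `slope * cov_yx` equals the squared covariance divided by the variance of $x$, so it is never negative: observing $x$ can only reduce the uncertainty in $y$ or leave it unchanged.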
6 Kalman filters for linear systems
In this section, we apply the algorithms developed in Sections 3-5 to the particular problem of estimating the state of linear systems, which is the classical application of Kalman filtering.
Figure 4(a) shows how the evolution over time of the state of such a system can be computed if the initial state $\mathbf{x}_0$ and the model of the system dynamics are known precisely. Time advances in discrete steps. The state of the system at any time step is a function of the state of the system at the previous time step and the control inputs applied to the system during that interval. This is usually expressed by an equation of the form $\mathbf{x}_{t+1} = f_t(\mathbf{x}_t, \mathbf{u}_t)$ where $\mathbf{u}_t$ is the control input. The function $f_t$ is nonlinear in the general case, and can be different for different steps. If the system is linear, the relation for state evolution over time can be written as $\mathbf{x}_{t+1} = F_t\mathbf{x}_t + B_t\mathbf{u}_t$, where $F_t$ and $B_t$ are (time-dependent) matrices that can be determined from the physics of the system. Therefore, if the initial state $\mathbf{x}_0$ is known exactly and the system dynamics are modeled perfectly by the $F_t$ and $B_t$ matrices, the evolution of the state with time can be computed precisely.
In general however, we may not know the initial state exactly, and the system dynamics and control inputs may not be known precisely. These inaccuracies may cause the computed and actual states to diverge unacceptably over time. To avoid this, we can make measurements of the state after each time step. If these measurements were exact and the entire state could be observed after each time step, there would of course be no need to model the system dynamics. However, in general, (i) the measurements themselves are imprecise, and (ii) some components of the state may not be directly observable by measurements.
6.1 Fusing complete observations of the state
If the entire state can be observed through measurements, we have two imprecise estimates for the state after each time step, one from the model of the system dynamics and one from the measurement. If these estimates are uncorrelated and their covariance matrices are known, we can use Equations 25-28 to fuse these estimates and compute the covariance matrix of this fused estimate. The fused estimate of the state is fed into the model of the system to compute the estimate of the state and its covariance at the next time step, and the entire process is repeated.
Figure 4(b) shows the dataflow diagram of this computation. For each state $\mathbf{x}_{t+1}$ in the precise computation of Figure 4(a), there are three random variables in Figure 4(b): the estimate from the model of the dynamical system, denoted by $\mathbf{x}_{t+1|t}$, the estimate from the measurement, denoted by $\mathbf{z}_{t+1}$, and the fused estimate, denoted by $\mathbf{x}_{t+1|t+1}$. Intuitively, the notation $\mathbf{x}_{t+1|t}$ stands for the estimate of the state at time $(t+1)$ given the information at time $t$, and it is often referred to as the a priori estimate. Similarly, $\mathbf{x}_{t+1|t+1}$ is the corresponding estimate given the information available at time $(t+1)$, which includes information from the measurement, and is often referred to as the a posteriori estimate. To set up this computation, we introduce the following notation.

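The initial state is denoted by $\mathbf{x}_0$ and its covariance by $\Sigma_{0|0}$.

Uncertainty in the system model and control inputs is represented by making $\mathbf{x}_{t+1}$ a random variable and introducing a zero-mean noise term $\mathbf{w}_t$ into the state evolution equation, which becomes
$$\mathbf{x}_{t+1|t} = F_t\mathbf{x}_{t|t} + B_t\mathbf{u}_t + \mathbf{w}_t \tag{31}$$
The covariance matrix of $\mathbf{w}_t$ is denoted by $Q_t$, and $\mathbf{w}_t$ is assumed to be uncorrelated with $\mathbf{x}_{t|t}$.

The imprecise measurement at time $t+1$ is modeled by a random variable $\mathbf{z}_{t+1} = \mathbf{x}_{t+1} + \mathbf{v}_{t+1}$ where $\mathbf{v}_{t+1}$ is a noise term. $\mathbf{v}_{t+1}$ has a covariance matrix $R_{t+1}$ and is uncorrelated with $\mathbf{x}_{t+1|t}$.
Examining Figure 4(c), we see that if we can compute $\Sigma_{t+1|t}$, the covariance matrix of $\mathbf{x}_{t+1|t}$, from $\Sigma_{t|t}$, we have everything we need to implement the vector data fusion technique described in Section 4. This can be done by applying Equation 4, which tells us that $\Sigma_{t+1|t} = F_t\Sigma_{t|t}F_t^T + Q_t$. This equation propagates uncertainty in the input of $f_t$ to its output.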
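The predict-then-fuse loop described in this section can be sketched for a one-dimensional state, where all matrices reduce to scalars. This is a minimal illustration under stated assumptions: the dynamics and noise parameters are constant over time, the whole state is measured directly, and the names `predict`, `update`, and `kalman_filter` are hypothetical.

```python
def predict(x, var, f, b, u, q):
    """Prediction: push the estimate and its variance through the linear model."""
    x_pred = f * x + b * u           # state evolution without the (zero-mean) noise term
    var_pred = f * var * f + q       # scalar form of F Sigma F^T + Q
    return x_pred, var_pred


def update(x_pred, var_pred, z, r):
    """Fusion: combine the a priori estimate with the measurement z of variance r."""
    k = var_pred / (var_pred + r)    # Kalman gain
    return x_pred + k * (z - x_pred), (1 - k) * var_pred


def kalman_filter(x0, var0, controls, measurements, f, b, q, r):
    """Run the filter and return the a posteriori (estimate, variance) after each step."""
    x, var = x0, var0
    history = []
    for u, z in zip(controls, measurements):
        x, var = predict(x, var, f, b, u, q)
        x, var = update(x, var, z, r)
        history.append((x, var))
    return history


history = kalman_filter(x0=0.0, var0=1.0, controls=[1.0, 1.0],
                        measurements=[1.5, 2.5], f=1.0, b=1.0, q=1.0, r=2.0)
print(history)
```

Prediction increases the variance by the process noise `q`, and each fusion with a measurement reduces it again, so the uncertainty stays bounded as long as measurements keep arriving.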
6.2 Fusing partial observations of the state
If some components of the state cannot be measured directly, the prediction phase remains unchanged from Section 6.1 but the fusion phase is different and can be understood intuitively in terms of the following steps.

The portion of the a priori state estimate corresponding to the observable part is fused with the measurement, using the techniques developed in Sections 3 and 4, to obtain the a posteriori estimate of the observable state.

The BLUE estimator in Section 5 is used to obtain the a posteriori estimate of the hidden state from the a posteriori estimate of the observable state.

The a posteriori estimates of the observable and hidden portions of the state are composed to produce the a posteriori estimate of the entire state.
The actual implementation produces the final result directly without going through these steps, as shown in Figure 4(d) but these incremental steps are useful for understanding how all this works.
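The fusion steps described above can be traced for a state with one observable and one hidden scalar component. The sketch below computes only the a posteriori state estimate, not its covariance; the function name `update_partial`, the argument names, and the sample numbers are assumptions introduced for illustration.

```python
def update_partial(obs_prior, hid_prior, var_obs, cov_obs_hid, z, r):
    """A posteriori estimate of a 2-component state when only the first component is measured.

    obs_prior, hid_prior : a priori estimates of the observable and hidden components
    var_obs              : a priori variance of the observable component
    cov_obs_hid          : a priori covariance between the two components
    z, r                 : measurement of the observable component and its variance
    """
    # Fuse the observable part of the a priori estimate with the measurement.
    k_obs = var_obs / (var_obs + r)
    obs_post = obs_prior + k_obs * (z - obs_prior)
    # Use BLUE to carry the correction over to the hidden component.
    hid_post = hid_prior + (cov_obs_hid / var_obs) * (obs_post - obs_prior)
    # Compose the two parts into the full a posteriori state estimate.
    return [obs_post, hid_post]


posterior = update_partial(obs_prior=10.0, hid_prior=2.0, var_obs=4.0,
                           cov_obs_hid=1.0, z=12.0, r=4.0)
print(posterior)
```

If the covariance between the two components is zero, the hidden component is left unchanged, which matches the observation in Section 5 that uncorrelated variables carry no information about each other.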
Example: 2D state
Figure 5 illustrates these steps for a two-dimensional problem in which the state vector has two components, and only the first component can be measured directly. We use the simplified notation below to focus on the essentials.

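a priori state estimate: $\mathbf{x}_i = \begin{pmatrix} h_i \\ c_i \end{pmatrix}$

covariance matrix of a priori estimate: $\Sigma_i = \begin{pmatrix} \sigma_h^2 & \sigma_{hc} \\ \sigma_{ch} & \sigma_c^2 \end{pmatrix}$

a posteriori state estimate: $\mathbf{x}_o = \begin{pmatrix} h_o \\ c_o \end{pmatrix}$

measurement: $z$

variance of measurement: $r^2$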
The three steps discussed above for obtaining the a posteriori estimate involve the following calculations, shown pictorially in Figure 5.