# Online Algorithms For Parameter Mean And Variance Estimation In Dynamic Regression Models

## Abstract

We study the problem of estimating the parameters of a regression model from a set of observations, each consisting of a response and a predictor. The response is assumed to be related to the predictor via a regression model of unknown parameters. Often, in such models the parameters to be estimated are assumed to be constant. Here we consider the more general scenario where the parameters are allowed to evolve over time, a more natural assumption for many applications. We model these dynamics via a linear update equation with additive noise that is often used in a wide range of engineering applications, particularly in the well-known and widely used Kalman filter (where the system state it seeks to estimate maps to the parameter values here). We derive an approximate algorithm to estimate both the mean and the variance of the parameter estimates in an online fashion for a generic regression model. This algorithm turns out to be equivalent to the extended Kalman filter. We specialize our algorithm to the multivariate exponential family distribution to obtain a generalization of the generalized linear model (GLM). Because the common regression models encountered in practice such as logistic, exponential and multinomial all have observations modeled through an exponential family distribution, our results are used to easily obtain algorithms for online mean and variance parameter estimation for all these regression models in the context of time-dependent parameters. Lastly, we propose to use these algorithms in the contextual multi-armed bandit scenario, where so far model parameters are assumed static and observations univariate and Gaussian or Bernoulli. Both of these restrictions can be relaxed using the algorithms described here, which we combine with Thompson sampling to show the resulting performance on a simulation.

## 1 Introduction

Regression models are one of the main tools of statistical modeling and supervised machine learning. In regression models, responses are related to predictors by a probabilistic model that is a function of model parameters that we want to estimate from the data. The most common regression model, linear regression, assumes the response follows a Gaussian distribution that can take values anywhere in the real numbers. However, different applications have a response of a different nature. For example, the response might take values only in the positive real numbers (e.g., the time between the arrivals of two buses), or in the non-negative integers (e.g., the number of thunders in a day, or the number and identity of the items chosen by someone from a catalog of items). In such cases, assuming that the response is Gaussian might produce an inferior model to one obtained assuming a different distribution that better matches the nature of the response data. When the response takes on two values, it may be modeled as a Bernoulli random variable. When it is a non-negative real number, it may be modeled as an exponential, Erlang, or gamma distribution. When the response is a non-negative integer it may be modeled as a Poisson, negative binomial, or binomial random variable. When the response is a category it may be modeled by a multinomial or categorical distribution.

All these distributions, and more, are part of the so-called exponential family ([14], [15], [21]). The exponential family also includes distributions for random vectors with vector entries that are correlated, e.g., as in the multivariate Gaussian distribution, as well as with independent entries with different distribution types. The generalized linear model (GLM) introduced in [17] provides the theory to build regression models with static parameters where the response follows a distribution in the exponential family. The GLM can then be seen as a generalization of linear regression.

Many applications of regression models, indeed those that rely on the GLM, assume the parameters are static. This assumption is too restrictive when modeling time-series, which often exhibit trends, seasonal components, and local correlations. Similarly, in many applications the model parameters represent variables that describe a portion of the state of the underlying system that cannot be or are not directly measured, but with well-known relationships that describe their dynamics. The response can then be thought of as a noisy function of the system state and the predictors, and the goal of the regression model is to estimate the system state over time based on the observation time-series.

Situations where an underlying dynamical system is only observed through noisy measurements are often encountered in engineering applications, and the celebrated and widely used Kalman filter, introduced in [9] solves the corresponding regression model when the parameter dynamics (linear with additive Gaussian noise) and the observation distribution (Gaussian with a mean that is a linear function of the state) are simple enough. The Kalman filter can be seen as an algorithm that efficiently estimates the mean and covariance matrix of the parameters in a linear regression model, where the parameters evolve in time through linear dynamics, based on the time-series of responses and predictors. Furthermore, the Kalman filter is an online algorithm that updates the parameter estimates in a relatively simple fashion when new observations arrive, without having to re-analyze all previous data. In addition, it is a second-order algorithm, in contrast to stochastic gradient descent (e.g., [3]). So while the Kalman filter takes more computation per observation than stochastic gradient descent, it can converge to an accurate estimate of the parameters with fewer observations, while allowing for much more flexibility in the modeling of the underlying parameter dynamics. Lastly but importantly, because the Kalman filter provides an estimate of the mean and covariance matrix, unlike other approaches that only focus on the mean, its results can be used to construct confidence intervals or samples of the parameters or of the model predictions — these statistics are often necessary in several applications such as in contextual multi-armed bandits (e.g., [12]). Many generalizations to the Kalman filter have now been developed, e.g., to allow for non-linear state dynamics state dynamics, or for observations that are noisy and non-linear and/or non-Gaussian functions of the state as in the extended Kalman filter, e.g., see [19] for a good overview.

More recently, there has been interest in merging the ideas from Kalman filtering, namely modeling dynamics in the parameters of a regression model, with the flexibility to model observations from the wide class of probability distributions that the GLM allows through its treatment of the exponential family ([22], [7], [5], [6], [10], [8]). It turns out that approximate algorithms that are very similar to the extended Kalman filter can be derived for a wide range of choices for the observation distributions. Introducing and describing some of these algorithms for a fairly general class of models, including the multivariate GLM, is the main focus of this paper. The derivation we follow is novel and simpler (e.g., it does not invoke any Kalman filter theory nor uses conjugacy). And the form of the exponential family we use is slightly more general than that used in other references on these methods, because the nuissance parameter matrix included in our model is absent in other references that deal with multivariate responses.

The second focus of this paper is to propose the application of these methods to the contextual multi-armed bandit problem ([12], [4]), and show the resulting performance through simulations. This application is novel to the best of our knowledge. It broadens the class of models that can be used to describe the rewards, separates the concept of a response and a reward, and allows for the explicit modeling of dynamics in the parameters that map the context to the rewards.

Section 2 introduces the class of models we study, and reviews the exponential family and the generalized linear model. Section 3 describes the general online algorithm to estimate the mean and variance of the parameters, and specializes it to the multivariate exponential family, the univariate exponential family, and to several examples of commonly used distributions. We sketch the derivation of the algorithm in Section 4. We apply the methods developed to the contextual multi-armed bandit problem in Sections 5, and conclude in Section 6.

## 2 Model Setup

We assume all vector and matrix entries are real numbers. We denote vectors using boldface small caps, and assume them to be column vectors. We use small caps for scalars, and boldface large caps for matrices. If is a matrix, we denote its inverse by , and its transpose by .

### 2.1 Regression Models For The Response

We assume we have received pairs of observations , for where is the -th response that we want to explain based on the -th predictor , for some positive integers , and . We denote by the information or history after observations, i.e., is the set that includes the first responses and predictors.

We postulate a model that relates the response to the predictor via a probability distribution where are the model parameters at the -th observation, with one parameter for each row in the predictor matrix is a -by- matrix and nuisance parameter that is assumed known and that we will often omit. As we will see later, the nuisance parameter plays the role of the covariance of the observations in linear regression, and is the identity matrix in many other cases of interest.

We consider regression models where the probability of the response depends on the predictors and the parameters only through the -by- signal namely, models where

Rather than working with , we will typically work with its logarithm, denoted by which we assume to be a well-behaved function, particularly that the first and second derivatives with respect to exist and are finite.

Many of the commonly used regression models fit the model above, including univariate and multivariate linear regression with response logistic or binomial, categorical or multinomial, exponential, Poisson, negative binomial, gamma, etc. Our model also includes cases where the different entries in the response vector are conditionally independent given the signal, and follow any distribution that is a function of the signal or a subset of the signal entries. For example, we can have as many predictor vectors as entries in , i.e., and have the -the response entry depend only on the -th signal entry so that , with being a different function for different response entries . Because all the entries still depend on the parameters this setup allows for combining the information from different types of measurements that depend on the parameters to obtain more accurate parameter estimates. Also, the predictor matrix may have a lot of structure, e.g., to allow for some parameters to be shared across entries in the response vector, and others that are specific to subsets of the response.

### 2.2 The Natural Exponential Family

Here we drop the time subscript of our vectors and matrices to avoid notational clutter, so, e.g., becomes simply . All of the probability distributions of interest to model the response mentioned above, and more, can be re-arranged so that their log-likelihood has the following so-called natural exponential form

(1) |

Here, is a -by-1 vector referred to as the natural parameter, with in its -th entry. It is a function of the signal in our models — this will be made specific in the next section. Crucially, the function is independent of and the function is independent of We assume that is twice differentiable with respect to its argument. The -by- nuisance parameter matrix is assumed symmetric and known. When is unknown, it can be estimated through several methods that are not online, e.g., see Chapter 13 in [6].

It can be shown, e.g., see Appendix B, that the mean and covariance matrix of are given by

(2) | ||||

(3) |

where is a column vector with in its -th entry, and is the -by- matrix with in its -th row and -th column.

Most of the frequently encountered univariate and multivariate distributions have the exponential form above. In addition, a union of independent random vectors that are in the natural exponential family is also in the natural exponential family. E.g., a random vector with entries distributed according to different members of the exponential family is still in the natural exponential family with a natural parameter given by the union of the natural parameters of its entries. This will allow us to estimate shared parameters in a regression model from multiple time series of a potentially different nature.

We consider models where the response is distributed according to Equation 1, which is a function of . We use the GLM to relate to the signal

### 2.3 The Generalized Linear Model

In the GLM, introduced in [17], we assume that the signal is a function of the mean of the observation , in addition to assuming Equation 1 for the observation . The GLM then assumes that there is a known one-to-one mapping between the natural parameter in Equation 1 and the signal so we can view the likelihood or any statistic of either as a function of or of . Specifically, we have

(4) |

for known functions and , and where the so-called link function maps the mean of to the signal. The mean and covariance in Equations 2 and 3 can then be seen to be either a function of the natural parameter or of the signal e.g.,

(5) |

Here the function is the inverse of . We refer to as the response function; it maps the signal to the mean of the response and plays a prominent role in our algorithms.

Any invertible function can be used as the link function in a GLM, but one that often makes sense, and where the mathematics to learn the model simplifies, is the canonical link that results in the signal being equal to the natural parameter, i.e., in

Previous treatments of the GLM in the context of dynamic parameters consider either a univariate response with the univariate case of Equation 1 (e.g., see [22]), or a multivariate response with Equation 1 but with the nuissance parameter matrix equal to the identity (e.g., see Chapter 2 in [10]). In this sense our treatment is a slight generalization.

### 2.4 Parameter Dynamics: The Kalman Filter

We assume that the parameters evolve according to

(6) |

where is a zero-mean random vector with known covariance matrix We also assume that the noise is uncorrelated with itself over time, and uncorrelated with the observation parameters. The known input vector drives the parameters through the appropriately sized and known matrix and is a known -by- square matrix. We also assume that at time zero the mean and variance of are known, i.e., that The general setup above includes several special cases of interest, described next.

When (the identity matrix), , and we end up with the simple parameter dynamics which is the standard regression problem with static parameters. This becomes the GLM when we also assume that the response is in the exponential family, with natural parameter that is a function of the signal. In this sense, our model is a generalization of the GLM where the parameters are allowed to vary over time.

When and , we end up with the simple parameter dynamics . This allows the parameters to drift or diffuse over time in an unbiased (zero-mean) way, according to the noise . This model is appealing for a range of applications, e.g., it could model the conversion rate of visitors to the Netflix signup page as a function of their browser, country, day of week, time of day, etc.

Equation 6 is central to the study of linear dynamical systems, control theory, and other related areas. Specifically, it is core to the Kalman filter, which also assumes a dynamical system that evolves according to Equation 6, but with a response that is a linear function of the parameters (the state in Kalman filter parlance) and additive Gaussian noise. The Kalman filter is an online algorithm that estimates the mean and variance of the system state from the noisy observations. So our setup is very related. The main difference is that we do not restrict our response to be Gaussian, but rather a non-linear function of , itself a linear function of the state. The non-linearity and the noise characteristics of the observations follow from the choice of regression model made, e.g., from the specific mapping between the signal and the response: choosing a Gaussian distribution for the response with a mean equal to the signal yields the standard Kalman filter. In this sense, our model is a generalization of the standard Kalman filter. A variant of the Kalman filter known as the extended Kalman filter deals with general non-linear observations, and has been applied to responses modeled through the exponential family ([6], [7]), yielding an algorithm that can be shown to be equivalent to ours.

Based on the assumptions above, we obtain the following factorization of the joint probability function of predictors, responses and parameters.

(7) |

## 3 Estimating Model Parameters

We seek an algorithm to compute the mean and covariance matrix of the model parameters using all the observations we have at any given time. We want this algorithm to be online, i.e., to perform a relatively simple update to the previous mean and covariance estimates when a new observation arrives, without having to re-analyze previous observations.

### 3.1 The General Algorithm

Let and denote the mean and covariance matrix of First, we initialize the mean and covariance of to the known values and .

We proceed by induction: we assume we know that has parameter mean and covariance matrix and , and use them and the new observation to compute and through a two stepped process suggested by the following simple relation:

(8) |

Equation 8 relates to its prior which predicts based on all previous information up to but ignoring the observation at time , and the log-likelihood of the latest observation .

#### Step 1: Prediction

We compute the mean and covariance of the prior via

(9) | ||||

(10) |

This equation is exact and does not require assuming any functional form for It follows fairly directly from Equation 6, e.g., see Appendix A for a derivation of these and other equations in this section. When the parameter dynamics are non-linear, the mean and covariance of can be approximated through expressions identical to Equations 9 and 10 by suitably re-defining the matrices that appear in them, e.g., by linearizing the parameter dynamics around or via numerical simulation as in the so-called unscented Kalman filter ([6], [10] and [19]).

Note that when is the identity matrix, Equation 10 shows that the variance of the parameter estimates always increases in the prediction step, unless In addition, because the system input is deterministic, it does not contribute to the covariance matrix.

When the predictor becomes known, we can use Equations 9 and 10 to determine the mean and covariance matrix of the signal given and :

(11) | ||||

(12) |

Lastly, the covariance matrix between the signal and the parameters is given by . The latter follows from the fact that signal is a linear function of the parameters, with as the weights.

#### Step 2: Estimation

Now we update the estimates from the prediction step to incorporate the new observation, obtaining the mean and covariance of the posterior We first compute the following matrices

(13) |

Here is the -by- Hessian matrix of the log-likelihood, evaluated at the predicted value of the signal As we will see, in many models of interest, this matrix is the negative of the variance of evaluated at the predicted signal value The matrix then grows with the expected variance of the predicted signal response , but decreases when the expected variance of the response increases.

We then compute the covariance via:

(14) |

Computing the inverse of starting from Equation 13 can often be numerically unstable, e.g., because the determinant of can be very small in magnitude. A more robust way to compute is via

(15) |

This expression follows directly from Equations 14 and 13 after applying the Kailath variant of the Woodbury identity (e.g., see [18]).

We finally compute the mean of the parameters by:

(16) | ||||

Our main algorithm proceeds by executing the prediction and estimation steps for each arriving observation, namely evaluating Equations 9, 10, 14 and 16 with every new observation. Equations 16 and 14 are approximate, and follow from (1) a second-order Taylor expansion of around , and (2) assuming the prior is Gaussian with mean and covariance given by and . A sketch of the argument is described in Section 4. Because the two assumptions we make are exact in the case of linear regression with a Gaussian prior for Equations 16 and 14 are exact in that case and correspond to the standard Kalman filter equations.

West et al. ([22]) make the different approximation that the prior is conjugate to the likelihood obtaining a slightly different algorithm that has only been developed for the univariate response scenario.

### 3.2 The Univariate Signal Case

Many regression models involve a scalar signal and a scalar response where the predictor is now simply a vector . This is the situation for the most commonly encountered regression models, such as linear, logistic, Poisson or exponential. The matrices and then also become scalars, so the update Equations 16 and 14 simplify to

(17) | ||||

(18) |

where the predicted signal is . The result is very appealing because no matrix inverses need to be computed.

### 3.3 The Dynamic Generalized Linear Model

Here we consider models where the response is in the exponential family of Equation 1, and where the natural parameter is related to the signal via Equations 4 and 5. The gradient in these models can be shown to be given by

(19) |

So the gradient is always proportional to the error , the difference between the response and its mean according to the model at the given signal. The covariance matrix is in general a function of the signal but we drop that dependence in our notation below to reduce clutter.

The Hessian is then obtained by differentiating Equation 19 with respect to the signal once more, resulting in

(20) |

Evaluating Equations 19 and 20 for a given choice of the likelihood and link function (which determines the response function ) at the predicted signal , and plugging the resulting expressions into Equations 14 and 16 completes the algorithm.

#### The canonical Link

When the canonical link is used, the natural parameter is equal to the signal, so , and . Equations 19 and 20 simplify to

(21) | ||||

(22) |

So the gradient is proportional to the error, as before, and the Hessian is proportional to the negative of the covariance of Evaluating the gradient and Hessian above at the predicted signal , and plugging in the resulting expressions into Equations 14 and 16 yields the update equations

(23) | ||||

(24) | ||||

(25) | ||||

(26) | ||||

(27) | ||||

(28) |

Equation 26 is the numerically stable analog of Equation 14 that avoids inverses of potentially close-to-singular matrices.

Multivariate linear regression is one of many examples that falls in this class of models. There, and can be shown to be in the natural exponential family (e.g., see Equation 54 in the Appendix), with and and covariance matrix equal to which in this case is independent of the signal. So the equations above yield the standard Kalman filter equations.

(29) | ||||

(30) | ||||

(31) |

Other distributions in the natural exponential family, e.g., the multinomial, have variances that are a function of the signal — linear regression is the exception.

The univariate signal case covers the majority of applications encountered in practice. Here the signal , the response , and the nuisance parameter are all scalars, and the predictor is a vector, so the update equations become:

(32) | ||||

(33) |

with being the variance of the response evaluated at the predicted signal . Equation 32 shows that the effect of the new observation is to reduce the covariance of the parameters by an amount proportional to , and a gain that gets smaller when there is more variance in the predicted signal, as captured by and larger when the response is expected to have a higher variance .

Many common regression models fall in this category. Univariate linear regression with and

This is already in natural exponential form with , and with playing the role of the natural parameter (i.e., the canonical link was used to map the mean of the response to the signal). Substituting and in Equations 32 and 33 yields the univariate Kalman filter.

(34) | ||||

(35) |

In Poisson Regression is a positive integer that follows a Poisson distribution with mean . The likelihood is which is again in natural exponential form with as the natural parameter, and with . The variance of a Poisson random variable is equal to its mean, so Equations 32 and 33 become

(36) | ||||

(37) |

In Exponential Regression is a non-negative real number that follows an exponential distribution with mean , so with mean and variance . Note that here , so unlike other models here, the update in the mean is negatively proportional to the error, namely:

(38) | ||||

(39) |

In Logistic Regression the response is a Bernoulli random variable. It takes on the value 1 with probability and the value 0 with probability . So is the response function, and its inverse is the link function. The likelihood becomes

(40) |

This is again in the natural exponential family with , and variance . The last equation above implies that is indeed the canonical link. So Equations 32 and 33 become

(41) | ||||

(42) |

## 4 Sketch Of Derivation

We start from Equation 8, and view the likelihood as a function of the model parameters, namely . We then approximate about via the second-order Taylor expansion:

(43) |

Because the signal is given by we have that

(44) |

We also make the second approximation that . This approximation is what is needed to make the mathematics below work out, but could be justified in that the Gaussian distribution is the continuous distribution that has maximum entropy given a mean and covariance matrix, and these are the only known statistics of .

Using the two approximations in Equation 8 results in being proportional to

(45) | |||

(46) |

where

(47) |

Equation 45 follows from completing squares, e.g., see Appendix C, and the proportional sign indicates that terms independent of were dropped. The result shows that under our approximations, To finish the argument, we substitute the expressions in Equation 44 into Equation 47, and apply the Woodbury matrix inversion formula to the expression for in Equation 47 to finally get the update Equations 14 and 16.

## 5 Contextual Multi-Armed Bandits

The models we discuss here can be and have been applied to a wide range of situations to model, analyze and forecast univariate and multivariate time series. E.g., see Chapter 14 in [6] or [8] for a range of examples. Here we apply the models discussed to the contextual multi-armed bandits scenario, where so far only univariate time series modeled through a linear or logistic regression have been considered. In the latter case, the only treatment known to us approximates the covariance matrix as diagonal. The models we have discussed enable explore/exploit algorithms for contextual multi-armed bandit scenarios where the reward depends on a multivariate response vector distributed according to the exponential family, and where the true parameters of the different arms are dynamic. We hope this broadens the situations where contextual multi-armed bandit approaches can be helpful.

The standard setup involves a player interacting with a slot machine with arms over multiple rounds. Every time an arm is played a reward gets generated. Different plays of the same arm generate different rewards, i.e., the reward is a random variable. Different arms have different and unknown reward distributions, which are a function of an observed context. At every time step, the player must use the observed context for each arm and all the history of the game to decide which arm to pull and then collect the reward. We seek algorithms that the player can use to decide what arm to play at every round in order to maximize the sum of the rewards received. These algorithms build statistical models to predict the reward for each arm based on the context, and decide how to balance exploring arms about which little is known with the exploitation of arms that have been explored enough to be predictable. The exploration/exploitation trade-off requires having a handle on the uncertainty of the predicted reward, so the models used need to predict at least the mean and variance of the reward for each arm. Real applications such as personalized news recommendations or digital advertising often have tight temporal and computational constraints per round, so the methods to update the statistical models with every outcome need to be online.

Popular and useful model choices describe the univariate reward for each arm as a linear function of the context plus Gaussian noise (i.e., through a linear regression, e.g., see [12]), or through a logistic regression ([4]). In the latter case, the algorithm that updates the model based on new observations uses a diagonal approximation of the parameter covariance matrix. In all these references, model parameters are assumed static (although their estimates change with every observation). In the non-contextual multi-armed bandit problem, recent efforts have tried to generalize the distributions for the rewards to the univariate exponential family [11], and as far as we know this is the first treatment for the contextual case.

We consider the following scenario. The parameters of all arms are compiled in the single parameter vector that, unlike other settings, is allowed to change over time according to Equation 6. Some entries in correspond to parameters for a single arm, and others are parameters shared across multiple or all arms. We describe the model parameters via where is the history of contexts and responses seen up to and including round . At the start of round , we observe the context matrix for each arm , and combine this information with our knowledge of to decide which arm to play. Denote the arm played by , and its corresponding context matrix simply by to make it consistent with the notation in the rest of this paper. Playing arm results in a response with a distribution in the (possibly multivariate) exponential family that depends on the context The relation between the response and the context is given by the dynamic GLM in Section 3.3, so the mean of is a function of the signal The response is used to update our estimates of the model parameters , according to the algorithm described in Section 3.3, to be used in round .

We assume the reward received in round is a known deterministic function of the response, e.g., a linear combination of the entries in . If we knew the actual model parameters, the optimal strategy to maximize the rewards collected throughout the game would be to play the arm with the highest average reward, i.e., We define the regret i.e., the difference between the means of the rewards of the optimal arm and the arm played given the context and the model parameters.

Unlike the more standard contextual setup, ours allow for the explicit modeling of parameter dynamics. It also broadens the choice of probability distribution to use for the response or reward to more naturally match the model choice to the nature and dimensionality of the reward data. E.g., we can use a Poisson regression when the reward is a positive integer, or have a response with multiple entries each with a different distribution, use all response entries to update our parameter estimates, and then define the reward to be a single entry in the response.

#### Thompson Sampling

A contextual multi-armed bandit algorithm uses the knowledge of the parameters at each round and the context to decide which arm to play. The widely used upper confidence bound (UCB) approach constructs an upper bound on the reward for each arm using the mean and covariance of the parameter estimates at every round, and selects as the arm with the highest upper bound, e.g., see [12]. Another approach that has gained recent popularity ([4], [2]) is the so-called Thompson sampling introduced in [20], where arm is selected at round with a probability that it is optimal given the current distribution for the model parameters. It is only recently that asymptotic bounds for its performance have been developed both for the contextual ([2], for the linear regression case only) and the non-contextual ([1]) case. The studies mentioned have found Thompson sampling to perform at pair or better relative to other approaches, and to be more robust than UCB when there is a delay in observing the rewards.

Thompson sampling is also very easy to implement. In one variant we sample a parameter value from the distribution and let In another variant, rather than sampling the model parameters once for all arms from we generate independent samples from the same distribution, the sample for arm denoted by , and then let The latter approach is found in [2] for the linear regression case to have a total regret that asymptotically scales with the number of model parameters rather than with its square as in the first variant, so we use the second variant in our simulations.

### 5.1 Simulations

Our goal here is to demonstrate how our online regression models work in the contextual bandits case when the observations are multivariate and not Gaussian, and when the model parameters are allowed to be dynamic. The goal is not to compare different contextual bandit algorithms, so we only focus on Thompson sampling. The model we simulate is inspired by the problem of optimizing the Netflix sign-up experience. Each arm corresponds to a variant of the sign-up pages that a visitor experiences — a combination of text displayed, supporting images, language chosen, etc. The context corresponds to the visitor’s type of device and/or browser, the day of week, time of day, country where the request originated, etc. Some of these predictors are continuous, such as the time of day, and others are categorical, such as the day of the week. The goal is maximizing signups by choosing the sign-up variant the is most likely to lead to a conversion given the context. We also observe other related outcomes associated to each visitor, such as the time spent on the sign-up experience, and whether they provide their email before signing up. We assume that these other observations are also related to the model parameters (though possibly with different context vectors), and use them to improve our parameter estimates. So our response is multivariate, even if the reward is based on a single entry of the response vector. Lastly, we want to let the model parameters drift over time, because we know that different aspects of the Netflix product are relevant over time, e.g., different videos in our streaming catalog will be the most compelling in a month than today.

Denote the response by and the reward by We model through a logistic regression, with a probability of taking the value 1 of and variance . The two other response entries do not affect the reward, but we use them to improve our estimates of the model parameters. We model as a linear regression with mean and variance , and through another logistic regression, with mean and variance . The signal is We assume that the entries of the response are independent of each other conditioned on the signal, so the nuisance parameter matrix is diagonal and time-independent, with the vector as its diagonal, and the covariance matrix of the response is diagonal with the vector as its diagonal.

The context matrix for arm has one row for each model parameter entry and one column per response entry. Some rows correspond to parameters shared by all arms, and others to parameters corresponding to a single arm. To construct we simulate continuous and categorical predictors that we sample at every round. We let play the role of the continuous predictors, and sample each column from a zero-mean Gaussian with covariance . The diagonal entries in are sampled independently from an exponential distribution with rate of 1, and the off-diagonal entries all have a correlation of . We let the categorical predictor be a sample from a uniform categorical distribution with entries, i.e., all entries in are zero except for one that is set to 1. We also let be an indicator vector that specifies that arm is being evaluated. It has entries that are all zero except for its -th entry which is set to 1. Letting be a column vector with entries, all set to 1, we define the context matrix for arm as

(48) |

Here denotes the Kronecker product between two vectors or matrices. The first rows of simply specify what arm is being evaluated, the next rows correspond to the continuous predictors, followed by rows for the categorical predictors. The next rows are the interaction terms between the continuous predictors and the arm (only rows corresponding to the arm are non-zero), and the last rows are the interaction terms between the categorical predictor and the arm chosen (all these rows are zero except one that is set to 1). The number of rows and model parameters is then . We let and . Note that without the interaction terms, the optimal arm would be independent of the context.

We set the model parameter dynamics to where has diagonal entries that are independent exponential random variables with rate , and a correlation coefficient of 0.2 for its off-diagonal entries. We sample a different matrix at every round. We assume the first visitor arrives at , and start the game by sampling from a zero-mean Gaussian with diagonal covariance matrix. The diagonal entries are independent samples from an exponential distribution with rate equal to 1. We initialize the mean and covariance estimates of as and , where is the identity matrix.

At round , starting from the mean and covariance estimates of the parameters, we compute the mean and covariance of the parameters. We then sample one value of the model parameters for each arm from the resulting prior distribution, and construct the context matrices for each arm. We use the context matrices and the parameter samples to choose (which defines ) based on Thompson sampling, we play and observe the response to obtain the round’s reward, and update the parameter estimates to obtain and covariance and start the next round.