A Tutorial on Online Supervised Learning with Applications to Node Classification in Social Networks
We revisit the elegant observation of T. Cover which, perhaps, is not as well known to the broader community as it should be. The first goal of the tutorial is to explain, through the prism of this elementary result, how to solve certain sequence prediction problems by modeling sets of solutions rather than the unknown data-generating mechanism. We extend Cover’s observation in several directions and focus on computational aspects of the proposed algorithms. The applicability of the methods is illustrated on several examples, including node classification in a network.
The second aim of this tutorial is to demonstrate the following phenomenon: it is possible to predict as well as a combinatorial “benchmark” for which we have a certain multiplicative approximation algorithm, even if the exact computation of the benchmark given all the data is NP-hard. The proposed prediction methods, therefore, circumvent some of the computational difficulties associated with finding the best model given the data. These difficulties arise rather quickly when one attempts to develop a probabilistic model for graph-based or other problems with a combinatorial structure.
1 The basics of bit prediction
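Consider the task of predicting an unknown sequence $\mathbf{y} = (y_1, \ldots, y_n)$ of $\pm 1$’s in a streaming fashion. At time $t = 1, \ldots, n$, a forecaster chooses $\hat{y}_t \in \{\pm 1\}$ based on the history $y_1, \ldots, y_{t-1}$ observed so far. After this prediction is made, the value $y_t$ is revealed to the forecaster. The average number of mistakes incurred on the sequence is
$$\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{\hat{y}_t \neq y_t\},$$
where $\mathbf{1}\{S\}$ is $1$ if $S$ is true, and $0$ otherwise. A randomized algorithm $\mathcal{A}$ is determined by the means
$$q_t = q_t(y_1, \ldots, y_{t-1}) \in [-1, 1]$$
of the distributions $\mathcal{A}$ puts on the $\pm 1$ outcomes at time $t$. The expected average number of mistakes made on the sequence $\mathbf{y}$ by a randomized algorithm $\mathcal{A}$ is
$$\mu_{\mathcal{A}}(\mathbf{y}) = \mathbb{E}\left[\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{\hat{y}_t \neq y_t\}\right],$$
where the expectation is with respect to the random choices $\hat{y}_t$, drawn from the distributions with means $q_t$, $t = 1, \ldots, n$.
Whenever a prediction algorithm has low expected error on some sequence, it must be at the expense of being worse on other sequences. Why? On average over the $2^n$ sequences, the algorithm necessarily incurs an error of $1/2$. Indeed, denoting by $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)$ a sequence of independent unbiased $\pm 1$-valued (Rademacher) random variables, it holds that
$$\mathbb{E}_{\boldsymbol{\epsilon}}\, \mu_{\mathcal{A}}(\boldsymbol{\epsilon}) = \frac{1}{2}$$
by an elementary inductive calculation, keeping in mind that $q_t$ depends only on $\epsilon_1, \ldots, \epsilon_{t-1}$. As a consequence, it is impossible to compare prediction algorithms when all sequences are treated equally.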
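The averaging claim can be checked numerically on a small hypercube. The following is a minimal sketch, assuming a hypothetical "follow the majority" forecaster (the names `expected_mistakes` and `follow_the_majority` are illustrative and not from the text). It uses the fact that a $\pm 1$ prediction with mean $q_t$ differs from $y_t$ with probability $(1 - q_t y_t)/2$.

```python
from itertools import product


def expected_mistakes(mean_fn, y):
    """Expected average number of mistakes on the sequence y.

    mean_fn(history) returns the mean q_t in [-1, 1] of the +/-1 prediction,
    so the prediction differs from y_t with probability (1 - q_t * y_t) / 2.
    """
    n = len(y)
    return sum((1.0 - mean_fn(y[:t]) * y[t]) / 2.0 for t in range(n)) / n


def follow_the_majority(history):
    # Hypothetical strategy (an assumption for illustration):
    # lean towards the labels seen so far.
    return sum(history) / len(history) if history else 0.0


n = 6
cube = list(product([-1, 1], repeat=n))

# Averaged over all 2^n sequences, the error is 1/2 for any strategy.
average_error = sum(expected_mistakes(follow_the_majority, y) for y in cube) / len(cube)

# A low error on one sequence is paid for on another.
error_on_constant = expected_mistakes(follow_the_majority, (1,) * n)
error_on_alternating = expected_mistakes(follow_the_majority, (1, -1) * (n // 2))
```

The strategy does well on the constant sequence and worse than random guessing on the alternating one, while the average over the hypercube stays at one half.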
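Evidently, any algorithm $\mathcal{A}$ induces a function $\mu_{\mathcal{A}}$ on the hypercube $\{\pm 1\}^n$, whose average value is $1/2$. Cover asked the converse: given a function $\phi: \{\pm 1\}^n \to \mathbb{R}$, is there an algorithm $\mathcal{A}$ with the property
$$\forall \mathbf{y}: \quad \mu_{\mathcal{A}}(\mathbf{y}) = \phi(\mathbf{y})?$$
In words, if we specify the average number of mistakes we are willing to tolerate for each sequence, is there an algorithm that achieves the goal? If such an algorithm exists, we shall say that $\phi$ is achievable. Let us call $\phi$ stable if
$$|\phi(\ldots, +1, \ldots) - \phi(\ldots, -1, \ldots)| \leq \frac{1}{n}$$
for any coordinate, keeping the rest fixed. Cover’s observation is now summarized as
Lemma (Cover). Let $\phi$ be stable. Then $\phi$ is achievable if and only if $\mathbb{E}\,\phi(\boldsymbol{\epsilon}) = 1/2$.
That is, for any function $\phi$ that does not change too fast along any edge of the hypercube, there exists an algorithm that attains the average number of mistakes given by $\phi$ if and only if $\phi$ is $1/2$ on average over all sequences.
As an immediate consequence, for any stable $\phi$, the condition $\mathbb{E}\,\phi(\boldsymbol{\epsilon}) \geq 1/2$ is equivalent to existence of an algorithm with
$$\forall \mathbf{y}: \quad \mu_{\mathcal{A}}(\mathbf{y}) \leq \phi(\mathbf{y}).$$
This latter version of the Lemma will be used in the sequel, and we shall say that $\phi$ is achievable even if the guarantee holds only as an inequality.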
Perhaps, it is worth emphasizing the following message of the lemma:
Existence of a forecasting strategy with a given mistake bound for an arbitrary sequence can be verified by checking a probabilistic inequality.
The proof of the more general multi-class statement (Lemma ?) appears in the appendix; it uses backward induction, and may be viewed as a “potential function” argument.
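Both conditions of the lemma, stability and the value of the average over the hypercube, can be verified by brute force when $n$ is small. The sketch below assumes a hypothetical target function built from the imbalance of the sequence, $\min\{p, 1-p\}$ with $p$ the fraction of $+1$'s, plus a constant chosen so that the average equals one half. The helper names are illustrative.

```python
from itertools import product


def is_stable(phi, n, tol=1e-12):
    """Check |phi(.., +1, ..) - phi(.., -1, ..)| <= 1/n along every edge."""
    for y in product([-1, 1], repeat=n):
        for t in range(n):
            if y[t] == 1:
                flipped = y[:t] + (-1,) + y[t + 1:]
                if abs(phi(y) - phi(flipped)) > 1.0 / n + tol:
                    return False
    return True


def mean_over_cube(phi, n):
    """Average of phi over all 2^n sign sequences."""
    cube = list(product([-1, 1], repeat=n))
    return sum(phi(y) for y in cube) / len(cube)


def imbalance(y):
    # Hypothetical building block: error of the better constant predictor.
    p = sum(1 for v in y if v == 1) / len(y)
    return min(p, 1 - p)


n = 8
# Smallest additive constant that lifts the average to 1/2.
penalty = 0.5 - mean_over_cube(imbalance, n)
phi = lambda y: imbalance(y) + penalty

stable = is_stable(phi, n)
average = mean_over_cube(phi, n)
too_steep = is_stable(lambda y: 2 * imbalance(y), n)
```

Here `phi` passes both checks and is therefore achievable, while the rescaled function `2 * imbalance` changes by $2/n$ along some edges and fails the stability test.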
2 Modeling solutions through $\phi$
As we have seen, there is no algorithm that can predict sequences uniformly better than another algorithm. Thankfully, we do not care about all sequences. A typical prediction problem has some structure that informs us of the sequences we should focus on. The structure is often captured through a stochastic description of the generative process, such as an i.i.d. or an autoregressive assumption. The stochastic assumption, however, may not be justified in applications that involve complex interactions and time dependencies, such as in the social network example below.
The approach in this tutorial is different: we provide a non-stochastic description of the “prior knowledge” via the function $\phi$. The function specifies the expected proportion of mistakes we are willing to tolerate on each sequence. We tilt $\phi$ down towards the sequences we care about, at the expense of making it larger on some other sequences that we are unlikely to see anyway. Lemma ? guarantees existence of a prediction strategy with proportion of mistakes given by $\phi$, as long as $\phi$ is stable and at least $1/2$ on average. Furthermore, given $\phi$, the algorithm is simple to state, as we will see below.
In the 1950s, David Hagelbarger and Claude Shannon at Bell Labs built the so-called “mind reading machines” to play the game of matching pennies. According to some accounts,
How can we design such a machine? Using the approach outlined above, we would need to capture the possible patterns of behavior we might see in the sequences and encode this knowledge in $\phi$. We have already seen in Example ? how to take advantage of imbalanced sequences. Of course, this may not be the only structure available, and we shall now describe a few general approaches to building $\phi$.
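The first basic construction will be called aggregation. Suppose $\phi_1, \ldots, \phi_N$ are stable functions, each satisfying $\mathbb{E}\,\phi_j(\boldsymbol{\epsilon}) \geq 1/2$. It is then possible to show that the best-of-all aggregate
$$\phi(\mathbf{y}) = \min_{j \in \{1, \ldots, N\}} \phi_j(\mathbf{y}) + C_n$$
is stable and satisfies $\mathbb{E}\,\phi(\boldsymbol{\epsilon}) \geq 1/2$ for
$$C_n = c \sqrt{\frac{\log N}{n}}$$
with an absolute constant $c$. The penalty for aggregating $N$ “experts” depends only logarithmically on $N$, and diminishes when $n$ is large. A reader familiar with the literature on prediction with expert advice will recognize the form of this penalty as a regret bound of the Exponential Weights algorithm.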
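The logarithmic dependence on the number of experts can be observed numerically. In the sketch below, each expert is a hypothetical fixed sequence, and its target function is the normalized Hamming distance to that sequence, which is stable and averages to one half. The exact penalty needed by the aggregate is computed by enumeration and compared with $\sqrt{\log N / (2n)}$. This particular constant comes from Massart's finite-class lemma and is an assumption of the sketch, not a constant stated in the text.

```python
import math
import random
from itertools import product


def hamming(y, f):
    """Normalized Hamming distance between two +/-1 sequences."""
    return sum(1 for a, b in zip(y, f) if a != b) / len(y)


def mean_over_cube(fn, n):
    """Average of fn over all 2^n sign sequences."""
    cube = list(product([-1, 1], repeat=n))
    return sum(fn(y) for y in cube) / len(cube)


n, N = 10, 5
rng = random.Random(0)  # fixed seed so the experts are reproducible
experts = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(N)]

# One stable function per expert, each with average exactly 1/2.
phis = [lambda y, f=f: hamming(y, f) for f in experts]
expert_means = [mean_over_cube(phi, n) for phi in phis]

# The aggregate min_j phi_j averages below 1/2; the penalty restores it.
best_of_all = lambda y: min(phi(y) for phi in phis)
penalty = 0.5 - mean_over_cube(best_of_all, n)
bound = math.sqrt(math.log(N) / (2 * n))
```

The computed `penalty` is the smallest constant that makes the aggregate achievable for these particular experts, and it never exceeds `bound`.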
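Another useful (and most-studied) way to construct $\phi$ is by taking a subset $F \subseteq \{\pm 1\}^n$ and letting
$$\phi(\mathbf{y}) = d_H(\mathbf{y}, F) + C_n(F),$$
the normalized Hamming distance between $\mathbf{y}$ and the set $F$, penalized by $C_n(F)$. Recall that the normalized Hamming distance is
$$d_H(\mathbf{y}, \mathbf{f}) = \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{y_t \neq f_t\},$$
where $d_H(\mathbf{y}, F) = \min_{\mathbf{f} \in F} d_H(\mathbf{y}, \mathbf{f})$. This definition automatically ensures stability of $\phi$, and the smallest $C_n(F)$ that guarantees $\mathbb{E}\,\phi(\boldsymbol{\epsilon}) \geq 1/2$ is
$$C_n(F) = \mathbb{E} \max_{\mathbf{f} \in F} \frac{1}{2n} \sum_{t=1}^{n} \epsilon_t f_t,$$
the Rademacher averages of the set $F$.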
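Observe that the function in Example ? can be written as $\phi(\mathbf{y}) = d_H(\mathbf{y}, F) + C_n(F)$ with $F = \{\mathbf{1}, -\mathbf{1}\}$, the two constant sequences. This is the simplest nontrivial set $F$, since matching the performance of a singleton simply amounts to outputting this very sequence.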
As one makes $F$ a larger subset of the hypercube, the Hamming distance from any $\mathbf{y}$ decreases, yet the overall penalty $C_n(F)$ becomes larger. On the extreme of this spectrum is $F = \{\pm 1\}^n$. Insisting on a small error on this set is not possible, and, indeed,
$$\phi(\mathbf{y}) = 0 + C_n(\{\pm 1\}^n) = \frac{1}{2},$$
the performance of random guessing.
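The trade-off between the distance term and the penalty can be made concrete by computing Rademacher averages exhaustively for small $n$. The sketch below is illustrative only: the function names are assumptions, and the enumeration over all sign sequences is feasible only for very short sequences.

```python
from itertools import product


def hamming(y, f):
    """Normalized Hamming distance between two +/-1 sequences."""
    return sum(1 for a, b in zip(y, f) if a != b) / len(y)


def dist_to_set(y, F):
    """Normalized Hamming distance from y to the closest member of F."""
    return min(hamming(y, f) for f in F)


def rademacher_average(F, n):
    """E max_{f in F} (1/2n) sum_t eps_t f_t, by enumerating all sign vectors."""
    cube = list(product([-1, 1], repeat=n))
    total = 0.0
    for eps in cube:
        total += max(sum(e * v for e, v in zip(eps, f)) for f in F) / (2 * n)
    return total / len(cube)


n = 6
full_cube = list(product([-1, 1], repeat=n))

# Small set: the two constant sequences.
F_small = [(1,) * n, (-1,) * n]
C_small = rademacher_average(F_small, n)
phi = lambda y: dist_to_set(y, F_small) + C_small
mean_phi = sum(phi(y) for y in full_cube) / len(full_cube)

# Largest set: the whole hypercube, where the distance is always zero.
C_full = rademacher_average(full_cube, n)
```

With the penalty set to the Rademacher average, the function averages to exactly one half, and for the full hypercube the penalty alone already equals one half.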
The goal is now clear: for the problem at hand, we would like to define $F$ to be large enough to capture the underlying structure of solutions, yet not too large. In Section 6 we come back to this issue when discussing combinatorial relaxations of $F$.
3 Application: node classification
We now discuss an application of Lemma ?. Let $G = (V, E)$ be a known undirected graph representing a social network. At each time step $t$, a user in the network opens her Facebook page, and the system needs to decide whether to classify the user as type “$+1$” or “$-1$”, say, in order to decide on an advertisement to display. We assume here that the feedback on the “correct” type is revealed to the system after the prediction is made. The more natural partial information version of this problem is outside the scope of this short tutorial, and we refer the reader to .
The prediction should be made based on all the information revealed thus far (the types of users in the network that appeared before time $t$), the global graph structure, and the position of the current user in the network. In Section 8 we will also discuss the version of this problem where covariate information about the users is revealed, but at the moment assume that the graph itself provides enough signal to form a prediction. A fascinating question is: what types of functions $\phi$ capture the graph structure relevant to the problem? Below we provide two examples, only scratching the surface of what is possible.
Suppose we have a hunch that the type of the user ($+1$ or $-1$) is correlated with the community to which she belongs. For simplicity, suppose there are two communities, more densely connected within than across (see Figure 1). To capture the idea of correlating communities and labels, we set $\phi$ to be small on labelings that assign homogeneous values within each community. We make the following simplifying assumptions: (i) $|V| = n$, (ii) we only predict the label of each node once, and (iii) the order in which the nodes are presented is fixed (this assumption is easily removed). Smoothness of a labeling $\mathbf{f} \in \{\pm 1\}^n$ with respect to the graph may be computed via
$$\sum_{(u,v) \in E} \mathbf{1}\{f_u \neq f_v\},$$
where $f_v$ is the label in $\mathbf{f}$ that corresponds to vertex $v$. This function counts the number of disagreements in labels at the endpoints of each edge. Its value is also known as the size of the cut induced by $\mathbf{f}$ (the smallest possible being MinCut). As desired, the function gives a smaller value to the labelings that are homogeneous within the communities. A more concise way to write it is in terms of the graph Laplacian:
$$\sum_{(u,v) \in E} \mathbf{1}\{f_u \neq f_v\} = \frac{1}{4}\, \mathbf{f}^\top L\, \mathbf{f},$$
where $L = D - A$, the diagonal matrix $D$ contains degrees of the nodes, and $A$ is the adjacency matrix.
Unfortunately, the cut-size function is not stable. It also has an undesirable property, illustrated by the following example. The cut size is $n-1$ for a star graph, where $n-1$ nodes, labeled as $+1$, are connected to the center node, labeled as $-1$. The large value of the cut does not capture the simplicity of this labeling, which is only one bit away from being a constant $+1$.
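The star-graph example is easy to reproduce. The sketch below counts edge disagreements directly and also through the quadratic form of the Laplacian $L = D - A$. For $\pm 1$ labels the quadratic form equals four times the cut size, because each disagreeing edge contributes $(f_u - f_v)^2 = 4$. Function names and the vertex numbering are assumptions made for illustration.

```python
def cut_size(edges, labels):
    """Number of edges whose endpoints carry different labels."""
    return sum(1 for u, v in edges if labels[u] != labels[v])


def laplacian_quadratic_form(edges, labels):
    """f^T L f for the unweighted Laplacian L = D - A, built as a dense matrix."""
    n = len(labels)
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1  # degree of u
        L[v][v] += 1  # degree of v
        L[u][v] -= 1  # minus adjacency
        L[v][u] -= 1
    return sum(labels[i] * L[i][j] * labels[j] for i in range(n) for j in range(n))


n = 7
star = [(0, v) for v in range(1, n)]   # vertex 0 is the center
one_bit_off = [-1] + [1] * (n - 1)     # center disagrees with every leaf
constant = [1] * n

star_cut = cut_size(star, one_bit_off)
star_form = laplacian_quadratic_form(star, one_bit_off)
```

Flipping the single center label moves the cut from $0$ to $n-1$, which is why the cut size cannot serve directly as a stable target function.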
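Instead, we opt for the indirect definition of $\phi$ through the Hamming distance to a set of labelings. More precisely, we define
$$F_\kappa = \{\mathbf{f} \in \{\pm 1\}^n : \mathbf{f}^\top L\, \mathbf{f} \leq \kappa\}$$
for $\kappa \geq 0$, and then set
$$\phi(\mathbf{y}) = d_H(\mathbf{y}, F_\kappa) + C_n(F_\kappa).$$
Parameter $\kappa$ should be larger than the value of MinCut, for otherwise the set $F_\kappa$ is empty. The function $\phi$ has the interpretation as the proportion of vertices whose labels need to be flipped to achieve the value at most $\kappa$ for the cut, compensated by the Rademacher averages of the set $F_\kappa$. While we can give some straightforward bounds on the Rademacher averages of $F_\kappa$, the investigation of this value for various graphs, including random ones, is an interesting research question.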
While MinCut is computationally easy, the calculation of becomes NP-hard in general if we allow -valued weights on the edges and define with respect to the weighted Laplacian
Such a definition can be used to model trust-distrust networks, and we certainly would like to develop computationally efficient methods for this problem. Somewhat surprisingly, this is possible in certain cases, even though evaluating $\phi$ is computationally hard. See Sections 6 and 7 for details.
3.2 Exercise: predicting voting preferences
Suppose individuals (connected via the known social network as in the previous example) arrive at the voting station one by one, and we are predicting whether they will vote for Grump or for Blinton. After our prediction is made, the voter reveals her true binary preference. Suppose we know the individual’s place in the network and the voting preferences of the individuals observed thus far. Our task is to design an online prediction algorithm that makes as few mistakes as possible.
Suppose we have prior knowledge that Grump supporters may be described well by a ball in the network (a ball with center $v$ and radius $r$ is the set of vertices at most $r$ hops away from $v$), but the center of this ball in the network is not known. Suppose each individual has at most $d$ friends. We leave it as an exercise to design a function $\phi$ for this prediction problem.
4 Extension to multi-class prediction
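We now extend the result of Cover to $k$-ary outcomes, i.e. $y_t \in \{1, \ldots, k\}$. As before, the expected prediction error is given by
$$\mu_{\mathcal{A}}(\mathbf{y}) = \mathbb{E}\left[\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{\hat{y}_t \neq y_t\}\right],$$
but uniformly random guessing now incurs an expected cost of $1 - 1/k$. By the same token, on average over the $k^n$ sequences, any algorithm must incur the expected cost of $1 - 1/k$.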
Let us define a couple of shorthands. We shall use the notation , , and denote the set of probability distributions on outcomes by .
We shall say that is stable if for any coordinate (and holding the rest fixed),
We now overload the notation and define a randomized forecasting strategy as a collection of distribution-valued functions of histories:
Once again, it follows from the Lemma that is equivalent to existence of a strategy with . Lemma ? can be seen as a special case of Lemma ? for .
By repeatedly referring to “existence of a prediction strategy” in the previous sections, we, perhaps, gave the impression that these methods are difficult to find. Thankfully, this is not the case.
5.1 The exact algorithm
The proofs of Lemma ? and ? are constructive, and the algorithms are easy to state. For binary prediction ($k = 2$), the randomized algorithm is defined on round $t$ by the mean
$$q_t = n\, \mathbb{E}\big[\phi(y_1, \ldots, y_{t-1}, -1, \epsilon_{t+1}, \ldots, \epsilon_n) - \phi(y_1, \ldots, y_{t-1}, +1, \epsilon_{t+1}, \ldots, \epsilon_n)\big]$$
of the distribution on the outcomes $\{\pm 1\}$, where the expectation is over independent Rademacher random variables $\epsilon_{t+1}, \ldots, \epsilon_n$. The prediction is then a random draw $\hat{y}_t \in \{\pm 1\}$ such that $\mathbb{E}\,\hat{y}_t = q_t$.
The two evaluations of $\phi$ are performed at neighboring vertices of the hypercube differing in the $t$-th coordinate. If the function values are equal in expectation, the mean $q_t$ is equal to zero, which corresponds to the uniform distribution on $\{\pm 1\}$. In this case, the function does not provide any guidance on which prediction to prefer. On the other hand, if the absolute difference is $1/n$ in expectation (the largest allowed by stability), the distribution is supported on one of the outcomes and prediction is deterministic. Between these extremes, the difference in values of $\phi$ measures the influence of the $t$-th coordinate on the potential function $\phi$, where the past outcomes have been fixed and the future is uniformly random. We emphasize that Rademacher random variables for future rounds are purely an outcome of the minimax analysis, as we assume no generative mechanism for the sequence $\mathbf{y}$.
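For small $n$, the exact strategy can be implemented by enumerating all continuations of the sequence. The sketch below assumes the mean on round $t$ is $n$ times the expected difference between the target function evaluated with $-1$ and with $+1$ in the $t$-th coordinate, and it uses a hypothetical stable target (the imbalance of the sequence plus a constant making the hypercube average one half). It then checks that the expected proportion of mistakes matches the target on every sequence.

```python
from itertools import product


def imbalance(y):
    # Hypothetical building block: error of the better constant predictor.
    p = sum(1 for v in y if v == 1) / len(y)
    return min(p, 1 - p)


def make_phi(n):
    """Stable target with hypercube average exactly 1/2."""
    cube = list(product([-1, 1], repeat=n))
    penalty = 0.5 - sum(imbalance(y) for y in cube) / len(cube)
    return lambda y: imbalance(y) + penalty


def exact_mean(phi, history, n):
    """q_t = n * E[phi(history, -1, eps) - phi(history, +1, eps)] over futures eps."""
    futures = list(product([-1, 1], repeat=n - len(history) - 1))
    diff = sum(phi(history + (-1,) + eps) - phi(history + (1,) + eps)
               for eps in futures)
    return n * diff / len(futures)


def expected_mistakes(phi, y):
    """Expected average mistakes when round t predicts +/-1 with mean q_t."""
    n = len(y)
    return sum((1 - exact_mean(phi, y[:t], n) * y[t]) / 2 for t in range(n)) / n


n = 6
phi = make_phi(n)
cube = list(product([-1, 1], repeat=n))
gaps = [abs(expected_mistakes(phi, y) - phi(y)) for y in cube]
means = [exact_mean(phi, y[:t], n) for y in cube for t in range(n)]
```

The enumeration over futures costs $2^{n-t}$ evaluations on round $t$, so this version is only a correctness check and not a practical method.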
For , the randomized algorithm is defined on round by a distribution
where the expectation is with respect to , each independent uniform on . Given that the values have been computed for each , the minimization in is performed by a simple water-filling -time algorithm which can be found in the proof of Lemma ? (see also ). The actual prediction is then a random draw .
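Computing the expectations in the exact algorithm may be costly. Thankfully, a doubly-randomized strategy works by drawing a random sequence per iteration. For binary prediction, the algorithm on round $t$ becomes: draw independent Rademacher random variables $\epsilon_{t+1}, \ldots, \epsilon_n$, compute
$$q_t = n \big[\phi(y_1, \ldots, y_{t-1}, -1, \epsilon_{t+1}, \ldots, \epsilon_n) - \phi(y_1, \ldots, y_{t-1}, +1, \epsilon_{t+1}, \ldots, \epsilon_n)\big]$$
and draw $\hat{y}_t$ from the distribution on $\{\pm 1\}$ with the mean $q_t$. This randomized strategy was essentially proposed in .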
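A minimal simulation of the doubly-randomized strategy is sketched below, assuming the target function is the normalized Hamming distance to a set $F$ plus its Rademacher average. The set `F`, the test sequence `y`, the seed, and the number of runs are arbitrary choices made for illustration. Since the penalty cancels in the difference, each round only compares two distances, and for this target the computed mean is always $-1$, $0$, or $+1$.

```python
import random
from itertools import product


def flips_to_set(y, F):
    """Smallest number of coordinates in which y differs from a member of F."""
    return min(sum(1 for a, b in zip(y, f) if a != b) for f in F)


def rademacher_average(F, n):
    """C_n(F) = E max_f (1/2n) sum_t eps_t f_t, by enumeration (small n only)."""
    cube = list(product([-1, 1], repeat=n))
    return sum(max(sum(e * v for e, v in zip(eps, f)) for f in F)
               for eps in cube) / (2 * n * len(cube))


def randomized_mean(history, n, F, rng):
    """One-sample version of q_t: hallucinate the future, compare two distances."""
    eps = tuple(rng.choice((-1, 1)) for _ in range(n - len(history) - 1))
    minus = flips_to_set(history + (-1,) + eps, F)
    plus = flips_to_set(history + (1,) + eps, F)
    # n * (d_H(minus) - d_H(plus)) simplifies to a difference of flip counts.
    return minus - plus


def average_mistakes(y, F, rng):
    """Play one pass over y and return the proportion of mistakes."""
    n, mistakes = len(y), 0
    for t in range(n):
        q = randomized_mean(y[:t], n, F, rng)
        prediction = 1 if rng.random() < (1 + q) / 2 else -1
        mistakes += prediction != y[t]
    return mistakes / n


n = 8
F = [(1,) * n, (-1,) * n]          # the two constant labelings
y = (1, 1, 1, -1, 1, 1, 1, 1)      # one bit away from constant
rng = random.Random(0)
runs = 4000
estimate = sum(average_mistakes(y, F, rng) for _ in range(runs)) / runs
target = flips_to_set(y, F) / n + rademacher_average(F, n)
```

Averaged over many independent runs, the empirical mistake rate `estimate` should be close to `target`, the value of the target function on this sequence.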
For , we draw uniform independent , solve for
and then draw prediction from the distribution .
In the binary prediction case, the proof of Lemma ? is immediate by the linearity of expectation. The analogous argument for $k > 2$ is trickier and follows from a more general random playout technique introduced in . This technique also yields a proof of Lemma ? (for both binary and multi-class cases) for adaptively chosen sequences of outcomes, an issue we have not yet discussed (see also Section 8 below).
6 Relaxations and the power of improper learning
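In the rest of the tutorial, we focus on the binary prediction problem for simplicity. Recall that computing the mean $q_t$ involves drawing random bits and evaluating the function $\phi$ on two neighboring vertices of the hypercube. If $\phi$ is defined as
$$\phi(\mathbf{y}) = d_H(\mathbf{y}, F) + C_n(F),$$
then computing $q_t$, in either the exact or the randomized algorithm, involves comparing two distances to the set $F$, as shown in Figure 2. Note that the knowledge of $C_n(F)$ is not needed, as this value cancels off in the difference.
Since
$$d_H(\mathbf{y}, F) = \frac{1}{2} - \frac{1}{2n} \max_{\mathbf{f} \in F} \langle \mathbf{f}, \mathbf{y} \rangle,$$
we may extend the function $d_H(\mathbf{y}, F)$ to any $F \subseteq [-1, 1]^n$ by defining it as the right-hand side of this identity. The extended $\phi$ is still stable in the sense defined in Section 1.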
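The extension rests on a short computation, sketched here with notation that is an assumption of this note ($\mathbf{f}$ for a member of the set $F$, $\langle \cdot, \cdot \rangle$ for the inner product). For $\pm 1$ values the mistake indicator is linear in the product of the labels, which turns the minimum Hamming distance into a maximum correlation. The last line shows why the extended function remains stable when the coordinates of $\mathbf{f}$ lie in $[-1, 1]$.

```latex
\begin{align*}
\mathbf{1}\{y_t \neq f_t\} &= \frac{1 - y_t f_t}{2}
  \qquad \text{for } y_t, f_t \in \{\pm 1\}, \\
d_H(\mathbf{y}, F)
  &= \min_{\mathbf{f} \in F} \frac{1}{n} \sum_{t=1}^{n} \frac{1 - y_t f_t}{2}
   = \frac{1}{2} - \frac{1}{2n} \max_{\mathbf{f} \in F} \langle \mathbf{f}, \mathbf{y} \rangle, \\
\text{flipping } y_t:\quad
  \bigl| \Delta\, d_H(\mathbf{y}, F) \bigr|
  &\leq \frac{1}{2n} \max_{\mathbf{f} \in F} \bigl| 2 f_t \bigr|
   \leq \frac{1}{n}
  \qquad \text{whenever } F \subseteq [-1, 1]^n .
\end{align*}
```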
Suppose that calculating the distance is computationally expensive, due to the combinatorial nature of . Let be a set containing , and suppose that is easier to compute. The following observation is immediate:
By construction, is achievable, and
By relaxing the set to a larger set , we may gain in computation while suffering a multiplicative factor in the Rademacher complexity. Crucially, this factor does not multiply the Hamming distance but only the term . The latter is typically of lower order and diminishing with . We may summarize the observation as
There is a reason we belabor this simple observation. In the literature on online learning, it has been noted that one may guarantee
when one has a multiplicative approximation algorithm for the benchmark. The performance of the algorithm in this case is compared to
However, the bound easily becomes vacuous (say, the error rate of the offline benchmark is and the multiplicative factor is a constant or logarithmic in ). The version where only enters the remainder term seems much more attractive.
The key to obtaining is the improper nature of the prediction methods or : the prediction need not be in any way consistent with any of the models in . Informally:
Using improper learning methods, it may be possible to predict as well as a combinatorial benchmark (plus a lower-order term) even when computing this very benchmark given all the data is NP-hard.
To make the statement more meaningful, we will show that can be upper bounded—for some interesting examples—in a way that does not render the mistake guarantee vacuous. We start with a simple example in Section 6.1, and then present a more complex machinery based on Constraint Satisfaction in Section 7.
6.1 Example: node classification
Consider the example in Section 3.1, and suppose, additionally, that the undirected graph has weights on edges. The weight on edge is denoted by , and when is not an edge. Positive and negative edges may model friend/foe or trust/distrust networks.
We define as in , with the understanding that is now the weighted graph Laplacian: with the diagonal matrix, . As before, set . Why would evaluation of be computationally hard? First, if we can evaluate for any , we can also find the value
However, if all the edge weights are , then becomes
which may be recognized as the value of MinUnCut, an NP-hard problem in general. Hence, we cannot hope to evaluate the Hamming distance to exactly. Our first impulse is to approximate the value in ; in the case of MinUnCut this can be done with a multiplicative factor of . However, it is not clear how to turn such a multiplicative approximation into a mistake bound with the factor in front of the combinatorial benchmark. Yet, such a bound is possible, as we show next.
Following Observation ?, we set
as in . Note that evaluating amounts to maximization of a linear function subject to a quadratic constraint and a box constraint . This can be accomplished with standard optimization toolboxes.
It remains to estimate . To this end, notice that
with . Hence, we can upper bound
where is the th eigenvalue of . The upper bound in depends on the spectrum of the underlying network’s Laplacian with an added regularization term . It is an interesting research direction to find tighter upper bounds, especially when the social network evolves according to a random process.
To summarize, we relaxed to a larger set , for which computation can be performed with an off-the-shelf optimization toolbox. Further, we derived a (rather crude) upper bound on the Rademacher averages of this set. In our calculations, however, we did not obtain an upper bound on the multiplicative gap between the original Rademacher averages and the larger value . Hence, the price we paid for efficiently computable solutions remains unknown. In the next section, we present a generic way to glean this payment from known approximations to integer programs.
Unlike the rest of the tutorial, this section is not self-contained. Our aim is to sketch the technique in , sweeping the details under the rug. We also refer the reader to the literature on approximation algorithms based on semidefinite and linear programming (see e.g.  and references therein).
7.1 Relaxing the optimization problem
Consider the set
for some , which we call a constraint. The definitions and in terms of graph Laplacian and weighted graph Laplacian are two examples of such a definition. A more general example is Constraint Satisfaction: let be a collection of functions and set
Recall that computing a prediction amounts to evaluating the weighted Hamming distance from to , which—in view of —is equivalent to finding the value
and then setting
Observation ? suggests that if cannot be easily computed for the set , we should aim to find a larger set for which this optimization is easier. A twist here is that we will not write down the definition of explicitly (although it can be understood as a projection of a certain higher-dimensional object). Instead, let us replace in by a value of some other optimization problem (to be specified in a bit) and set
As before, the condition for achievability of this function implies that the smallest value of the constant is
Before defining , let us state a version of Observation ?:
Immediate from and , and the fact that defined by contains .
We now define , along with two more auxiliary optimization problems, and prove .
7.2 Setting up auxiliary optimization problems
is an optimization of a linear objective over vertices of the hypercube, under the restriction . In the literature, much effort has been devoted to analyzing a dual problem: minimization of , possibly subject to a linear constraint. Our plan of attack is to define dual auxiliary optimization formulations, then use “integrality gap” for these problems, and pass back to the primal objective in order to prove .
Let us define the set of probability distributions on those vertices of the hypercube that yield the value of at least for the linear objective:
Define the optimization problem
the minimum constraint value achievable on the vertices of the hypercube, given that the linear objective value is at least . The second equality in holds true because the minimum of a linear (in ) objective is attained at a singleton, a vertex of the hypercube.
Both and are combinatorial optimization problems, which may be computationally intractable. A common approach to approximating these hard problems is to pass from distributions to pseudo-distributions. Roughly speaking, a pseudo-distribution at “level ” only behaves like a distribution for tuples of variables of size up to . Associated to a pseudo-distribution is a notion of a pseudo-expectation, denoted by . We refer to  for details.
Let be the set of suitably defined pseudo-distributions with the property
whenever (see  for the precise definitions of these sets in terms of semidefinite programs). Define a relaxation of as
We write “SDP” here because relaxations we have in mind arise from semidefinite relaxations, but the arguments below are generic and hold for other approximations of combinatorial optimization problems.
The integrality gap of the dual formulation is
with . We emphasize that we define the gap for the dual problems. Next, we show that the gap appears when relating to .
Let us fix and for brevity omit it from and definitions. First, it holds that
Indeed, is the value of that guarantees existence of some pseudo-distribution in that respects the constraint of on average. Hence, for this value of , the minimum in will include this pseudo-distribution, and, hence, the value of the minimum is at most . By the same token,
Indeed, is the value of achieved by some satisfying . The maximization in includes this by the definition of , and, hence, the maximum is larger than .
Next, by the definition of the integrality gap and ,
By and ,
because the value of is nondecreasing in .
It remains to provide concrete examples where is bounded in a non-trivial manner.
7.3 Back to node classification
Recall the node classification example discussed in Section 6.1. In view of Lemma ?, to conclude the mistake bound we only need to get an estimate on . For the case , the integrality gap defined in is the ratio of an integer quadratic program (IQP) subject to linear constraint and its relaxed version. We turn to , which tells us that within levels of Lasserre hierarchy one can solve the IQP subject to linear constraints with the gap of at most
where is the -th eigenvalue of the normalized graph Laplacian. One can verify that grows as , and thus we essentially pay an extra factor of for having a polynomial-time algorithm, with the computational complexity of .
There are several points worth emphasizing:
As another manifestation of improper learning, the prediction algorithm does not need to “round the solution.” The integrality gap only appears in the mistake analysis. This means that any improvement in the gap analysis of semidefinite or other relaxations immediately translates to a tighter mistake bound for the same prediction algorithm.
It is the expected gap (with respect to a random direction ) that enters the bound, a quantity that may be smaller than the worst-case gap. It would be interesting to quantify this gain.
As an alternative to definition , one may combine the linear objective and the constraint in a single “penalized” form. Such an approach allows one to use Metric Labeling approximation algorithms , and we refer to  for more details.
In most applications of online prediction, some additional side information is available to the decision-maker before she makes the prediction. Consider the following generalization of the problem introduced in Section 1. At time $t$, the forecaster observes side information $x_t$, makes a forecast $\hat{y}_t$ (based on the history and $x_t$), and then the value $y_t$ is observed.
Let be a function of two sequences: . The function is stable if
where . We will prove the following generalization of Lemma ?.
Above, the expectation on the left-hand side of is over the randomization of the algorithm and the ’s, while on the right-hand side the expectation is over the ’s. In , the expectation is both over the ’s and over the independent Rademacher random variables.
An attentive reader will notice that the guarantee is not very interesting because the $x_t$’s are chosen independently while the $y_t$’s are fixed ahead of time. The issue would be resolved if the $y_t$’s could be chosen by Nature after seeing $x_t$. Let us call such a Nature semi-adaptive, and reserve the word adaptive for Nature that chooses the $y_t$’s based also on the full history (including the learner’s predictions).
Inspecting the proof of Lemma ?, we see that the optimal doubly-randomized strategy is to draw and and then set the mean of the distribution to be
Notice that the coordinates of are the actual observations, while are hallucinated. If ’s are i.i.d., these hypothetical observations are available, for instance, if one has access to a pool of unlabeled data. In fact, the statement of Lemma ? holds verbatim for any stochastic process governing evolution of ’s, as long as we can “roll out the future” according to this process.
Finally, we state one more extension of Cover’s result, lifting any stochastic assumptions on the generation of the ’s.
In Section 1, we introduced a “canonical” way to define for the case of no side information. The analogous canonical definition that takes side information into account is as follows. Let be a class of functions . If is chosen well, one of the functions in this class will explain, approximately, the relationship between ’s and ’s. It is then natural to take the projection
the set of vertices of the hypercube achieved by evaluating some on the given data. We may now define
as before. The function defined in this way is indeed small if is close to the values of some on the data.
It remains to give an expression for . For the i.i.d. side information case of Lemma ?, the condition means that the smallest value of ensuring achievability is
the Rademacher averages of . For the adversarial case of Lemma ?, condition means that the smallest value of is
is the sequential Rademacher complexity .
9 Discussion and Research Directions
The prediction results discussed in this tutorial hold for arbitrary sequences – even for those chosen adversarially and adaptively in response to forecaster’s past predictions. Treating the prediction problem as a multi-stage game against Nature has been very fruitful, both for the theoretical analysis and for the algorithmic development. Even though we discuss maliciously chosen sequences, it is certainly not our aim to paint any prediction problem as adversarial. Rather, we view the “individual sequence” results as being robust and applicable in situations when modeling the underlying stochastic process is difficult. For instance, one may try to model the node prediction problem described in Section 3.1 probabilistically—e.g. as a Stochastic Block Model—but such a model is unlikely to be true in the real world. Of course, the ultimate test is how the two approaches perform on real data. In the node classification example, the methods discussed in this tutorial performed very well in our own experiments, often surpassing the performance of more classical machine learning methods. Perhaps it is worth emphasizing that the prediction algorithms developed here are very distinct from these classical methods, and, if anything, this tutorial serves the purpose of enlarging the algorithmic toolkit.
We presented some very basic ideas, only scratching the surface of what is possible. Among some of the most interesting (to us) and promising research directions are:
Develop linear or sublinear time methods for solving prediction problems on large-scale graphs.
Run more experiments on real-world data and explore the types of functions that lead to good prediction performance.
Develop partial-information versions of the problem. Some initial steps for contextual bandits were taken in .
Analyze the setting of constrained sequences. That is, develop methods when Nature is not fully adversarial, yet also not i.i.d.
Develop efficient prediction methods that go beyond i.i.d. covariates.
For additional questions or clarifications, please feel free to email us.
Define functions as
with being a constant. We desire to prove that there is an algorithm such that
Consider the last time step and write the above expression as
Let denote the conditional expectation given . We shall prove that there exists a randomized strategy for the last step such that for any ,
This last statement is translated as
Writing , the left-hand side of is
The stability condition means that we can choose to equalize the choices of . Let be the sorted values of
in non-increasing order. In view of the stability condition,
Hence, can be chosen so that all have the same value (see Figure 3). One can check that this is the minimizing choice for .
Let denote this optimal choice. The common value of can then be written as
and hence is equal to
This value is precisely , as per Eq. , thus verifying . Repeating the argument for until , we find that
thus ensuring existence of an algorithm with equal to zero. The other direction of the statement is proved by taking sequences uniformly at random from , concluding the proof.
When , the solution takes on a simpler form
which is found by equating the two alternatives in .
As in the proof of Lemma ?, define functions as
with being a constant. Having observed and at the present time step, we solve
The same steps as in Lemma ? (for binary prediction) lead to the solution
We remark that depends on , as given by the protocol of the problem. Then equals to
We now take expectation over on both sides:
The argument continues back to , with