A Survey on Application of Machine Learning Techniques in Optical Networks
Abstract
Today, the amount of data that can be retrieved from communications networks is extremely high and diverse (e.g., data regarding users' behavior, traffic traces, network alarms, signal quality indicators, etc.). Advanced mathematical tools are required to extract useful information from this large set of network data. In particular, Machine Learning (ML) is regarded as a promising methodological area to perform network-data analysis and enable, e.g., automated network self-configuration and fault management. In this survey we classify and describe relevant studies dealing with the applications of ML to optical communications and networking. Optical networks and systems are facing an unprecedented growth in complexity due to the introduction of a huge number of adjustable parameters (such as routing configurations, modulation format, symbol rate, coding schemes, etc.), mainly brought about by the adoption of, among others, coherent transmission/reception technology and advanced digital signal processing, and by the presence of nonlinear effects in optical fiber systems. Although a good number of research papers have appeared in recent years, the application of ML to optical networks is still in its early stage. In this survey we provide an introductory reference for researchers and practitioners interested in this field. To stimulate further work in this area, we conclude the paper by proposing new possible research directions.
I Introduction
Machine learning (ML) is a branch of Artificial Intelligence built on the idea that, given access to the right data, machines can learn by themselves how to solve a specific problem [1]. By leveraging complex mathematical and statistical tools, ML renders machines capable of independently performing intellectual tasks that have traditionally been solved by human beings. This idea of automating complex tasks has generated much interest in the networking field, on the expectation that several activities involved in the design and operation of communication networks can be offloaded to machines. Some applications of ML have already met these expectations in networking areas such as intrusion detection [2], traffic classification [3] and cognitive radios [4].
Among the various networking areas, in this survey we focus on machine learning for optical networking. Optical networks constitute the basic physical infrastructure of all large-provider networks worldwide, thanks to their high capacity, low cost and many other attractive properties [5]. They are now penetrating important new telecom markets such as datacom and the access segment, and there is no sign that they could find a substitute in the foreseeable future. Different approaches to improve the performance of optical networks have been investigated, concerning, e.g., routing, wavelength assignment, traffic grooming and survivability [6, 7]. In this survey, we cover ML applications in both optical communication and optical networking, to potentially stimulate new cross-layer research directions. In fact, ML applications can be especially useful in cross-layer settings, where data analysis at the physical layer, e.g., monitoring of the Bit Error Rate (BER), can trigger changes at the network layer, e.g., in routing, spectrum and modulation format assignments. The application of ML to optical communication and networking is still in its infancy, and this survey aims to constitute an introductory reference for researchers and practitioners willing to become acquainted with existing ML applications as well as to investigate new research directions.
A legitimate question that arises in the optical networking field today is: why is machine learning, a methodological area that has been applied and investigated for at least three decades, only gaining momentum now? The answer is certainly multifaceted, and it most likely involves more than purely technical aspects [8]. From a technical perspective, though, recent progress at both the optical communication/system and network levels is at the basis of an unprecedented growth in the complexity of optical networks. On the system side, while optical channel modeling has always been complex, the recent adoption of coherent technologies [9] has made modeling even more difficult by introducing a plethora of adjustable design parameters (such as modulation formats, symbol rates, adaptive coding rates and flexible channel spacing) to optimize transmission systems in terms of the bitrate times transmission distance product. In addition, what makes this optimization even more challenging is that the optical channel is nonlinear due to the Kerr effect. On the networking side, the increased complexity of the underlying transmission systems is reflected in a series of advancements in both the data plane and the control plane. At the data plane, the Elastic Optical Network (EON) concept [10] has emerged as a novel optical network architecture able to respond to the increased need for elasticity in allocating optical network resources. In contrast to traditional fixed-grid Wavelength Division Multiplexing (WDM) networks, EON offers flexible (almost continuous) bandwidth allocation.
Resource allocation in EON can be performed to adapt to the several above-mentioned decision variables made available by new transmission systems, including different transmission techniques, such as Orthogonal Frequency Division Multiplexing (OFDM) and Nyquist WDM (NWDM), transponder types (e.g., BVT and SBVT; for a complete list of acronyms, the reader is referred to the Glossary at the end of the paper), modulation formats (e.g., QPSK, QAM), and coding rates. This flexibility makes the resource allocation problems much more challenging for network engineers. At the control plane, dynamic control, as in Software-Defined Networking (SDN), promises to enable long-awaited on-demand reconfiguration and virtualization. But reconfiguring the optical substrate poses several challenges, in terms of, e.g., network re-optimization, spectrum fragmentation, amplifier power settings and unexpected penalties due to nonlinearities, which call for tight integration between the control elements (SDN controllers, network orchestrators) and the optical performance monitors working at the equipment level.
All these "degrees of freedom" and limitations pose severe challenges to system and network engineers when it comes to deciding on the best system and/or network design. Machine learning is currently perceived as a paradigm shift for the design of future optical networks and systems. These techniques should make it possible to infer, from data obtained by various types of monitors (e.g., signal quality, traffic samples, etc.), useful characteristics that could not be easily or directly measured. Some envisioned applications in the optical domain include fault prediction, intrusion detection, physical-flow security, impairment-aware routing, low-margin design and traffic-aware capacity reconfiguration, but many others can be envisioned and will be surveyed in the next sections.
The survey is organized as follows. In Section II, we overview some preliminary ML concepts, focusing especially on those targeted in the following sections. In Section III we discuss the main motivations behind the application of ML in the optical domain and we classify the main areas of applications. In Section IV and Section V, we classify and summarize a large number of studies describing applications of ML at the transmission layer and network layer. Section VI discusses some possible open areas of research and future directions. Section VII concludes the survey.
II Overview of Learning Methods Used in Optical Networks
This section provides an overview of some of the most popular algorithms that are commonly classified as machine learning. The literature on ML is so extensive that even a superficial overview of all the main ML approaches goes far beyond the possibilities of this section; the reader can refer to a number of fundamental books on the subject [11, 12, 13, 14]. Our aim here is instead to provide some insight into the inner workings of the techniques used in the optical networking literature surveyed in the remainder of this paper, as basic background that should help the reader better understand the remaining parts of this survey. We divide the algorithms into three main categories, described in the following subsections and represented in Fig. 1: supervised learning, unsupervised learning and reinforcement learning. Semi-supervised learning, a hybrid of supervised and unsupervised learning, is also introduced. ML algorithms have been successfully applied to a wide variety of problems. Before delving into the different machine learning methods, it is worth pointing out that, in the context of telecommunication networks, there has been over a decade of research on the application of ML techniques to wireless networks, ranging from opportunistic spectrum access [15] to channel estimation and signal detection in orthogonal frequency-division multiplexing systems [16], to Multiple-Input Multiple-Output communications [17].
II-A Supervised Learning
Supervised learning is used in a variety of applications, such as speech recognition, spam detection and object recognition. The goal is to predict the value of one or more output variables y given the value of a vector of input variables x. The output variable can be a continuous variable (regression problem) or a discrete variable (classification problem). A training data set comprises samples of the input variables and the corresponding output values. Different learning methods construct a function that allows one to predict the value of the output variables for a new value of the inputs. Supervised learning can be broken down into two main classes, described below: parametric models, where the number of parameters is fixed, and non-parametric models, where the number of parameters depends on the training set.
II-A1 Parametric models
In this case, the function is a combination of a fixed number of parametric basis functions. These models use training data to estimate a fixed set of parameters w. After the learning stage, the training data can be discarded, since the prediction for new inputs is computed using only the learned parameters w. Linear models for regression and classification, which consist of a linear combination of fixed nonlinear basis functions, are the simplest parametric models in terms of analytical and computational properties. However, their applicability is limited to problems with a low-dimensional input space. In the remainder of this subsection we focus on neural networks (NNs), since they are the most successful example of parametric models.
NNs apply a series of functional transformations to the inputs. An NN is a network of units, or neurons. The basis function, or activation function, used by each unit is a nonlinear function of a linear combination of the unit's inputs. Each neuron has a bias parameter that allows for any fixed offset in the data. The bias is incorporated in the set of parameters w by adding a dummy input of unitary value to each unit (see Figure 2). The coefficients of the linear combination are the parameters estimated during the training. The most commonly used nonlinear activation functions are the logistic sigmoid and the hyperbolic tangent. The activation function of the output units of the NN is the identity function, the logistic sigmoid function, or the softmax function, for regression, binary classification and multi-class classification problems, respectively.
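As a concrete illustration of the unit computation described above, the following sketch (a toy example of our own, not taken from any of the surveyed works) implements a single neuron that applies the logistic sigmoid to a linear combination of its inputs, with the bias folded in via a dummy unitary input:

```python
import numpy as np

def neuron(inputs, weights):
    """Single unit: logistic sigmoid of a linear combination.

    The bias is folded into `weights` by prepending a dummy
    input fixed to 1, as described in the text.
    """
    x = np.concatenate(([1.0], inputs))  # dummy unitary input for the bias
    a = np.dot(weights, x)               # linear combination (activation)
    return 1.0 / (1.0 + np.exp(-a))      # logistic sigmoid

out = neuron(np.array([0.5, -1.2]), np.array([0.1, 0.4, 0.3]))
```

With all weights at zero the linear combination vanishes and the sigmoid returns 0.5, its midpoint.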
Different types of connections between the units result in different NNs with distinct characteristics. All units between the inputs and outputs of the NN are called hidden units. In the case of a feedforward NN, the network is a directed acyclic graph. Typically, NNs are organized in layers, with units in each layer receiving inputs only from units in the immediately preceding layer and forwarding their outputs only to the immediately following layer. NNs with one layer of hidden units and linear output units can approximate arbitrarily well any continuous function on a compact domain, provided that a sufficient number of hidden units is used.
Given a training set, an NN is trained by minimizing an error function with respect to the set of parameters w. Depending on the type of problem and the corresponding choice of activation function of the output units, different error functions are used. Typically, in the case of regression models the sum-of-squares error is used, whereas for classification the cross-entropy error function is adopted. It is important to note that the error function is a non-convex function of the network parameters, for which multiple local optima exist. Iterative numerical methods based on gradient information are the most common methods used to find the vector w that minimizes the error function. For an NN, the error backpropagation algorithm, which provides an efficient method for evaluating the derivatives of the error function with respect to w, is the most commonly used. We should at this point mention that, before training the network, the training set is typically preprocessed by applying a linear transformation that rescales each of the input variables independently, in the case of continuous or discrete ordinal data, so that the transformed variables have zero mean and unit standard deviation. The same procedure is applied to the target values in the case of regression problems. In the case of discrete categorical data, a 1-of-K coding scheme is used. This form of preprocessing is known as feature normalization, and it is used before training most machine learning algorithms, since most models are designed under the assumption that all features have comparable scales (decision-tree-based models are a well-known exception).
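The feature normalization step mentioned above can be sketched in a few lines (a minimal illustration; the function and variable names are our own):

```python
import numpy as np

def normalize(X):
    """Rescale each input variable independently so that the
    transformed variables have zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Xn, mu, sigma = normalize(X)
```

The stored `mu` and `sigma` are reused to apply the same transformation to new inputs at prediction time.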
II-A2 Non-parametric models
In non-parametric methods the number of parameters depends on the training set. These methods keep a subset or the entirety of the training data and use it during prediction. The most widely used approaches are nearest-neighbor models and support vector machines (SVMs). Both can be used for regression and classification problems.
In the case of k-nearest-neighbor methods, all training data samples are stored (training phase). During prediction, the k samples nearest to the new input value are retrieved. For classification problems, a voting mechanism is used; for regression problems, the mean or median of the k nearest samples provides the prediction. To select the best value of k, cross-validation can be used. Depending on the size of the training set, iterating through all samples to find the k nearest neighbors might not be feasible. In this case, k-d trees or locality-sensitive hash tables can be used to compute the k nearest neighbors efficiently.
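A minimal brute-force k-nearest-neighbor predictor, covering both the voting and the averaging cases described above, might look as follows (an illustrative sketch of our own, not an implementation from the surveyed literature):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """k-nearest-neighbor prediction by brute-force search.

    All training samples are stored; at prediction time the k closest
    ones are retrieved: majority vote for classification, mean of the
    neighbors' targets for regression.
    """
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    if task == "classification":
        return Counter(y_train[nearest]).most_common(1)[0][0]
    return y_train[nearest].mean()

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y_cls = np.array([0, 0, 0, 1, 1])
label = knn_predict(X, y_cls, np.array([0.05]), k=3)
```

For large training sets the `argsort` over all distances is the step that k-d trees or locality-sensitive hashing would replace.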
In SVMs, basis functions are centered on training samples; the training procedure selects a subset of the basis functions. The number of selected basis functions, and the number of training samples that have to be stored, is typically much smaller than the cardinality of the training dataset. SVMs build a linear decision boundary with the largest possible distance from the training samples. Only the closest points to the separators, the support vectors, are stored. An important feature of SVMs is that by applying a kernel function they can embed data into a higher dimensional space, in which data points can be linearly separated. To determine the parameters of SVMs, a nonlinear optimization problem with a convex objective function has to be solved, for which efficient algorithms exist.
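The kernel idea mentioned above, embedding data into a higher-dimensional space where they become linearly separable, can be illustrated with an explicit feature map (here the toy map phi(x) = (x, x^2), our own example of what a polynomial kernel computes implicitly):

```python
import numpy as np

# 1-D points: class 1 in the middle, class 0 on both sides --
# no single threshold on x separates the two classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Embed into 2-D via phi(x) = (x, x^2), the explicit form of the
# mapping that a polynomial kernel applies implicitly.
phi = np.column_stack([x, x ** 2])

# In the embedded space the classes ARE linearly separable: the
# line x2 = 1.0 has all class-1 points below it and all class-0
# points above it.
separable = (phi[y == 1, 1] < 1.0).all() and (phi[y == 0, 1] > 1.0).all()
```

An SVM with a polynomial or RBF kernel finds such a separating hyperplane in the embedded space without ever computing the embedding explicitly.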
II-B Unsupervised Learning
Social network analysis, gene clustering and market research are among the most successful applications of unsupervised learning methods.
In the case of unsupervised learning, the training dataset consists only of a set of input vectors x. While unsupervised learning can address different tasks, clustering or cluster analysis is the most common.
Clustering is the process of grouping data so that the intra-cluster similarity is high, while the inter-cluster similarity is low. The similarity is typically expressed as a distance function, which depends on the type of data. There exists a variety of clustering approaches. Here, we focus on two algorithms, k-means and the Gaussian mixture model, as examples of partitioning approaches and model-based approaches, respectively, given their wide applicability. The reader is referred to [18] for a comprehensive overview of cluster analysis.
k-means is perhaps the best-known clustering algorithm. It is an iterative algorithm starting from an initial partition of the data into k clusters. The centre of each cluster is computed, and data points are then assigned to the cluster with the closest centre. This procedure (centre computation followed by data assignment) is repeated until the assignment does not change or a predefined maximum number of iterations is exceeded; as a consequence, the algorithm may terminate at a locally optimal partition. Moreover, k-means is well known to be sensitive to outliers. It is worth noting that there exist ways to compute k automatically [19], and that an online version of the algorithm exists.
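The iteration just described can be sketched as follows (a plain illustrative implementation; for simplicity the centres are initialized deterministically at the first k samples, whereas random initialization is more common in practice):

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    """Plain k-means: alternate centre computation and point
    assignment until assignments stop changing or a maximum
    number of iterations is exceeded."""
    centres = X[:k].astype(float).copy()  # simplistic deterministic init
    assign = None
    for _ in range(max_iter):
        # distance of every point to every centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # converged (possibly to a locally optimal partition)
        assign = new_assign
        for j in range(k):
            if (assign == j).any():
                centres[j] = X[assign == j].mean(axis=0)
    return assign, centres

# two well-separated groups of 2-D points
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
assign, centres = kmeans(X, k=2)
```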
While k-means assigns each point uniquely to one cluster, probabilistic approaches allow a soft assignment and provide a measure of the uncertainty associated with the assignment. The Gaussian Mixture Model (GMM), a linear superposition of Gaussian distributions, is one of the most widely used probabilistic approaches to clustering. The parameters of the model are the mixing coefficient of each Gaussian component and the mean and covariance of each Gaussian distribution. Since no closed-form solution exists in this case, the expectation-maximization (EM) algorithm is used to maximize the log likelihood function with respect to the parameters given a dataset. The initialization of the parameters can be done using k-means: the mean and covariance of each Gaussian component can be initialized to the sample mean and covariance of the corresponding cluster obtained by k-means, and the mixing coefficients can be set to the fraction of data points assigned by k-means to each cluster. After initializing the parameters and evaluating the initial value of the log likelihood, the algorithm alternates between two steps. In the expectation step, the current values of the parameters are used to determine the "responsibility" of each component for the observed data (i.e., the conditional probability of the latent variables given the dataset). The maximization step uses these responsibilities to compute a maximum likelihood estimate of the model's parameters. Convergence is checked with respect to the log likelihood function or the parameters.
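A minimal 1-D, two-component version of the EM procedure described above (an illustrative sketch with a crude fixed initialization, rather than the k-means-based initialization discussed in the text, and with a fixed iteration budget instead of a convergence check) is:

```python
import numpy as np

def gmm_em_1d(x, pi, mu, var, n_iter=50):
    """Expectation-maximization for a 1-D Gaussian mixture.

    E step: responsibilities (posterior of each component for each
    point); M step: maximum-likelihood update of mixing coefficients,
    means and variances from those responsibilities.
    """
    pi = np.array(pi, dtype=float)
    mu = np.array(mu, dtype=float)
    var = np.array(var, dtype=float)
    for _ in range(n_iter):
        # E step: responsibilities, shape (n_points, n_components)
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters from the responsibilities
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return pi, mu, var

# two well-separated groups of points around -5 and +5
x = np.concatenate([np.linspace(-5.5, -4.5, 50), np.linspace(4.5, 5.5, 50)])
pi, mu, var = gmm_em_1d(x, pi=[0.5, 0.5], mu=[-1.0, 1.0], var=[4.0, 4.0])
```

On this toy dataset the means converge to the two group centres and the mixing coefficients to the group proportions.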
II-C Semi-supervised Learning
Semi-supervised learning methods are a hybrid of the two categories introduced above, and address problems in which most of the training samples are unlabeled, while only a few labeled data points are available. The obvious advantage is that in many domains a wealth of unlabeled data points is readily available. Semi-supervised learning is used for the same types of applications as supervised learning, and it is particularly useful when labeled data points are scarce or too expensive to obtain and the use of the available unlabeled data can improve performance.
Self-training is the oldest form of semi-supervised learning [20]. It is an iterative process: during the first stage, only the labeled data points are used by a supervised learning algorithm. Then, at each step, some of the unlabeled points are labeled according to the predictions of the trained decision function, and these points are used along with the original labeled data to retrain, using the same supervised learning algorithm.
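The self-training loop described above can be sketched as follows; `fit`, `predict` and `confidence` are hypothetical placeholders for any supervised learner's interface, and the toy 1-D threshold learner used below is our own:

```python
def self_train(labeled, unlabeled, fit, predict, confidence, threshold=0.9):
    """Self-training sketch: train on the labeled points, then
    repeatedly pseudo-label the unlabeled points predicted with
    high confidence and retrain on the enlarged labeled set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = fit(labeled)
    while unlabeled:
        confident = [x for x in unlabeled if confidence(model, x) >= threshold]
        if not confident:
            break  # no prediction is confident enough: stop
        for x in confident:
            labeled.append((x, predict(model, x)))  # pseudo-label
            unlabeled.remove(x)
        model = fit(labeled)  # retrain with pseudo-labels included
    return model, labeled

# toy 1-D learner: decision threshold halfway between the class means
def fit(data):
    m0 = [x for x, y in data if y == 0]
    m1 = [x for x, y in data if y == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2

def predict(model, x):
    return 1 if x > model else 0

def confidence(model, x):
    return min(1.0, abs(x - model) / 5.0)

model, enlarged = self_train([(-10, 0), (10, 1)], [-8, -1, 1, 8],
                             fit, predict, confidence)
```

The points near the decision boundary never reach the confidence threshold, so the loop stops rather than risk propagating wrong pseudo-labels.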
Since the introduction of self-training, the idea of using labeled and unlabeled data has resulted in many semi-supervised learning algorithms. According to the classification proposed in [20], semi-supervised learning techniques can be organized into four classes: i) methods based on generative models (generative methods estimate the joint distribution of the input and output variables; from the joint distribution one can obtain the conditional distribution p(y|x), which is then used to predict the output values for new input values, and both labeled and unlabeled data can be exploited); ii) methods based on the assumption that the decision boundary should lie in a low-density region; iii) graph-based methods; iv) two-step methods (first an unsupervised learning step to change the data representation or construct a new kernel, then a supervised learning step based on the new representation or kernel).
II-D Reinforcement Learning
Reinforcement learning (RL) is used, in general, to address applications such as robotics, finance (investment decisions) and inventory management, where the goal is to learn a policy, i.e., a mapping from states of the environment to actions, while directly interacting with the environment.
The RL paradigm allows agents to learn by exploring the available actions and refining their behavior using only an evaluative feedback, referred to as the reward. The agent's goal is to maximize its long-term performance. Hence, the agent does not just take into account the immediate reward, but also evaluates the consequences of its actions on the future. Delayed reward and trial-and-error constitute the two most significant features of RL.
RL is usually performed in the context of Markov decision processes (MDPs). The agent's perception at time t is represented as a state s_t belonging to S, the finite set of environment states. The agent interacts with the environment by performing actions: at time t the agent selects an action a_t from A, the finite set of actions of the agent, which may trigger a transition to a new state. The agent then receives a reward r_{t+1} as a result of the transition, according to the reward function. The agent's goal is to find the sequence of state-action pairs that maximizes the expected discounted reward, i.e., the optimal policy. In the context of MDPs, it has been proved that an optimal deterministic and stationary policy exists. A number of algorithms exist that learn the optimal policy, both in case the state transition and reward functions are known (model-based learning) and in case they are not (model-free learning). The most widely used RL algorithm is Q-learning, a model-free algorithm that estimates the optimal action-value function. An action-value function, named Q-function, is the expected return of a state-action pair for a given policy. The optimal action-value function, Q*, corresponds to the maximum expected return for a state-action pair. After learning Q*, the agent selects, in the current state, the action with the highest corresponding Q-value.
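A tabular Q-learning sketch, including epsilon-greedy exploration and the update toward the discounted target, might look as follows (the `chain_step` environment and the `step(s, a) -> (next_state, reward, done)` interface are toy examples of our own, not a specific library API):

```python
import random

def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning (model-free): learn Q(s, a) from
    interaction only, without a model of the environment."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection (trial and error)
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = step(s, a)
            # move Q(s, a) toward the delayed-reward target
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

# toy chain environment: action 1 moves right (reward 1 on reaching
# the terminal state 2), action 0 stays in place with no reward
def chain_step(s, a):
    if a == 1:
        return s + 1, (1.0 if s + 1 == 2 else 0.0), s + 1 == 2
    return s, 0.0, False

Q = q_learning(n_states=3, n_actions=2, step=chain_step)
```

After training, the greedy policy in each state prefers "move right", and the Q-value of the state one step from the goal approaches the full reward while earlier states receive the discounted value.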
A table-based solution such as the one described above is only suitable for problems with a limited state-action space. In order to generalize the learned policy to states not previously experienced by the agent, RL methods can be combined with existing function approximation methods, e.g., neural networks.
II-E Overfitting, underfitting and model selection
In this section, we discuss a well-known problem of ML algorithms, along with its solutions. Although we focus on supervised learning techniques, the discussion is also relevant for unsupervised learning methods.
Overfitting and underfitting are two sides of the same coin: model selection. Overfitting happens when the model we use is too complex for the available dataset (e.g., a high polynomial order in the case of linear regression with polynomial basis functions, or too large a number of hidden neurons for a neural network). In this case, the model will fit the training data too closely (as an extreme example, consider a simple regression problem for predicting a real-valued target variable as a function of a real-valued observation variable, using a linear regression model with polynomial basis functions of the input variable: if we have N samples and we select N-1 as the order of the polynomial, we can fit the model perfectly to the data points), including noisy samples and outliers, but will exhibit very poor generalization, i.e., it will provide inaccurate predictions for new data points. At the other end of the spectrum, underfitting is caused by the selection of models that are not complex enough to capture important features in the data (e.g., when we use a linear model to fit quadratic data).
Since the error measured on the training samples is a poor indicator of generalization, to evaluate the model performance the available dataset is split in two: the training set and the test set. The model is trained on the training set and then evaluated on the test set. Typically, the majority of the samples are assigned to the training set and the remaining ones to the test set. Another option, very useful in case of a limited dataset, is cross-validation, so that as much as possible of the available data is exploited for training. In this case, the dataset is divided into k subsets. The model is trained k times, using in turn each of the k subsets for validation and the remaining subsets for training; the performance is then averaged over the k runs. In case of overfitting, the error measured on the test set is high while the error on the training set is small. In case of underfitting, instead, both the error measured on the training set and that measured on the test set are high.
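The k-fold cross-validation procedure just described can be sketched as follows (`train_and_score` is a hypothetical placeholder for any routine that fits a model on the training folds and returns its error on the validation fold):

```python
def k_fold_cross_validation(data, k, train_and_score):
    """k-fold cross-validation: split the dataset into k subsets
    (folds), train k times holding out each fold in turn for
    validation, and average the k scores."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal subsets
    scores = []
    for i in range(k):
        valid = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(train, valid))
    return sum(scores) / k

# toy check: the "error" is simply the mean of the validation fold
data = list(range(10))
avg = k_fold_cross_validation(data, 5, lambda tr, va: sum(va) / len(va))
```

Each sample is used for validation exactly once and for training k-1 times, which is what makes the scheme attractive for small datasets.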
There are different ways to select a model that exhibits neither overfitting nor underfitting. One possibility is to train a range of models, compare their performance on an independent dataset (the validation set), and then select the one with the best performance. However, the most common technique is regularization. It consists in adding an extra term, the regularization term, to the error function used in the training stage. The simplest form of regularization term is the sum of the squares of all parameters, which is known as weight decay and drives parameters towards zero. Another common choice is the sum of the absolute values of the parameters (lasso). An additional parameter, the regularization coefficient λ, weighs the relative importance of the regularization term and the data-dependent error. A large value of λ heavily penalizes large absolute values of the parameters. It should be noted that the data-dependent error computed over the training set increases with λ. The error computed over the validation set, instead, is high for both small and large values of λ: in the first case, the regularization term has little impact, potentially resulting in overfitting; in the latter case, the data-dependent error has little impact, resulting in poor model performance. A simple automatic procedure for selecting the best λ consists in training the model with a range of values of the regularization coefficient and selecting the value that corresponds to the minimum validation error. In the case of NNs with a large number of hidden units, dropout, a technique that consists in randomly removing units and their connections during training, has been shown to outperform other regularization methods [21].
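For the weight-decay case, the regularized error has a closed-form minimizer for linear models, w = (X^T X + lam*I)^{-1} X^T y (ridge regression), which makes the effect of the regularization coefficient easy to illustrate (a sketch on toy data of our own):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Linear regression with weight-decay (sum-of-squares)
    regularization: minimizes ||Xw - y||^2 + lam * ||w||^2,
    whose closed-form solution is w = (X^T X + lam I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# toy data exactly fit by w = [1, 2]
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

w_small = ridge_fit(X, y, lam=1e-6)   # almost unregularized
w_large = ridge_fit(X, y, lam=100.0)  # heavy weight decay
```

As the text notes, a large coefficient drives the parameters towards zero: the norm of `w_large` is much smaller than that of `w_small`, at the cost of a larger data-dependent error on the training set.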
III Motivation for using machine learning in optical networks and high-speed communication systems
In optical communication systems, one of the motivations for using machine learning is that, due to the optical fiber nonlinearities, a closed-form expression for the optical channel cannot be obtained. This has implications for the performance prediction of optical communication systems in terms of BER and quality factor (Q-factor), and also for signal demodulation [22], [23], [24]. Mathematically speaking, this means that the conditional probability, or likelihood function, p(y|x) of the data is unknown and must be learned from the data (y and x represent the received and transmitted information sequences, respectively). Moreover, for optimum resource allocation and lightpath establishment, physical parameters of the optical fiber channel, such as the employed modulation format, chromatic dispersion, polarization mode dispersion, optical signal-to-noise ratio and polarization dependent loss, must be estimated and passed to the higher networking layers [25]. Estimation of the optical fiber channel parameters needs to be performed prior to signal demodulation, which is not trivial due to the presence of transmission impairments. In that case, it is useful to employ a machine learning approach and learn the mapping between the signal features and the optical fiber channel parameters [26].
Moving from the physical layer to the networking layer, the same motivation applies to the application of machine learning techniques. As described in more detail below, machine learning allows network designers to build data-driven models for more accurate and optimized network provisioning and management. Design and management of optical networks is continuously evolving, driven by the enormous increase of transported traffic and by drastic changes in traffic requirements, e.g., in terms of capacity, latency, user experience and Quality of Service (QoS). Current optical networks are expected to run at much higher utilization than in the past, while providing strict guarantees on the quality of service. While aggressive optimization and traffic-engineering methodologies are required to achieve these objectives, such complex methodologies may suffer from scalability issues and involve unacceptable computational complexity. In this context, ML is regarded as a promising methodological area to address these issues, as it enables automated network self-configuration and fast decision-making by leveraging the plethora of data that can be retrieved via network monitors.
In this context, several use cases can benefit from the application of machine learning and data analytics techniques. In this paper we divide these use cases into physical-layer and network-layer use cases. The remainder of this section provides a high-level introduction to the main use cases of machine learning in optical networks, as graphically shown in Fig. 3, while a detailed survey of existing studies is provided in Sections IV and V.
III-A Physical Layer Domain
As mentioned in the previous section, several challenges need to be addressed at the physical layer of an optical network when performing lightpath provisioning, such as estimation of the Quality of Transmission (QoT) and optical performance monitoring. In the following, a more detailed description of the use cases for the application of ML at the physical layer is presented.

QoT estimation.
Prior to the deployment of a new lightpath, a system engineer must identify a set of design parameters (e.g., modulation format, baud rate, coding rate, etc.) that guarantees that the lightpath satisfies a certain QoT. Lately, due to continuous advances in optical transmission, the number of alternative design parameters for lightpath deployment has been significantly increasing, and this large variety of possible parameters challenges the ability of a system engineer to manually address all the possible combinations. As of today, existing (pre-deployment) estimation techniques for lightpath QoT belong to two categories: 1) "exact" analytical models estimating physical-layer impairments, which provide accurate results but incur heavy computational requirements, and 2) marginated formulas, which are computationally faster but typically introduce high margins that lead to under-utilization of network resources. Moreover, it is worth noting that, due to the complex interaction of multiple system parameters (e.g., input signal power, number of channels, link type, modulation format, symbol rate, channel spacing, etc.) and, most importantly, due to the nonlinear signal propagation through the optical channel, deriving accurate analytical models is a challenging task, and assumptions about the system under consideration must be made in order to adopt approximate models. ML-based classifiers promise to provide a means to automatically predict whether unestablished lightpaths will meet the required system threshold. Moreover, even for lightpaths that are already established, continuously monitoring the quality of signal transmission, e.g., by measuring the received Optical Signal-to-Noise Ratio (OSNR), BER or Q-factor, allows early detection of QoT degradation. Based on ML processing of the monitored data, network operators can activate reaction procedures, such as lightpath rerouting or transmission power adjustment, to maintain the required signal quality. 
Optical amplifiers control.
In current optical networks, lightpath provisioning is becoming more dynamic, in response to the emergence of new services that require huge amounts of bandwidth over limited periods of time (consider, e.g., a video-content provider requiring additional bandwidth during peak hours, or a large file transfer for scientific purposes). Unfortunately, dynamic setup and teardown of lightpaths over different wavelength channels forces network operators to reconfigure network devices “on the fly” to maintain physical-layer stability. In response to rapid changes in lightpath deployment, Erbium-Doped Fiber Amplifiers (EDFAs) suffer from wavelength-dependent power excursions, e.g., when a new lightpath is provisioned or when an existing lightpath is dropped. Thus, automatic control of pre-amplification signal power levels is required, especially when a cascade of multiple EDFAs is traversed, to prevent excessive post-amplification power discrepancies between different lightpaths from causing signal distortion. Thanks to the availability of historical data retrieved by monitoring network status, ML regression can be applied to accurately predict the responses of the amplifiers and evaluate the effect of channel changes on the post-EDFA power levels of lightpaths. 
Modulation Format Recognition (MFR).
Modern optical transmitters and receivers provide high flexibility in the utilized bandwidth, carrier frequency and modulation format, mainly to adapt the transmission to the required bit rate and optical signal reach in a flexible/elastic networking environment. Since an arbitrary coherent optical modulation format can be adopted at the transmission side, knowing this decision in advance at the receiver side is not always possible. Therefore, ML can help recognize the modulation format of an incoming optical signal in order to perform accurate signal demodulation and, consequently, signal processing and detection. 
Nonlinearity mitigation.
Due to optical fiber nonlinearities, the behaviour of several performance parameters, such as BER, Q-factor, Chromatic Dispersion (CD) and Polarization Mode Dispersion (PMD), is highly unpredictable, and complex analytical models must be adopted to react to signal degradation and/or compensate undesired nonlinear effects. While approximate analytical models are usually adopted to solve such complex nonlinear problems, ML provides the advantage of directly capturing the effects of such nonlinearities, typically exploiting knowledge of historical data and creating direct input-output relations between the monitored parameters and the desired outputs. 
Optical performance monitoring (OPM).
With increasing capacity requirements for optical communication systems, performance monitoring is vital to ensure robust and reliable networks. An optical performance monitor estimates the parameters of the optical fiber channel, which are then passed to the higher networking layers for lightpath establishment. Typically, optical fiber link parameters need to be estimated at monitoring points along the link, but current optical performance monitoring techniques, which require full signal demodulation, are too complex and expensive to be deployed at such points. There is a great need for cheaper and simpler solutions, so that a large number of optical performance monitoring devices can be employed. Ideally, an optical performance monitor should contain just a single photodiode together with machine learning algorithms that learn the mapping between the detected signal and the optical fiber channel parameters, and finally predict the channel parameters from power eye diagrams.
III-B Network Layer Domain
At the network layer, several other use cases for ML arise. Provisioning of new lightpaths or restoration of existing ones upon network failure requires complex and fast decisions that depend on several quickly evolving data, since, e.g., operators must take into consideration the impact of newly inserted traffic on existing connections. In general, an estimation of user and service requirements is very desirable for effective network operation, as it allows operators to avoid over-provisioning of network resources and to deploy resources with adequate margins at a reasonable cost. Also at the network layer, we identify the following main use cases.

Traffic Prediction.
Accurate traffic prediction in the time-space domain allows operators to effectively plan and operate their networks. In the design phase, traffic prediction helps reduce over-provisioning as much as possible. During network operation, resource utilization can be optimized by performing traffic engineering based on real-time data, possibly rerouting existing traffic and reserving resources for future incoming connections. Applications of traffic prediction, and the related ML techniques, vary substantially according to the considered network segment (e.g., approaches for intra-datacenter networks may differ from those for access networks), as traffic characteristics strongly depend on the network segment under consideration. 
Virtual Topology Design (VTD) and Reconfiguration.
The abstraction of communication network services by means of a virtual topology is widely adopted by network operators and service providers. This abstraction consists of representing the connectivity between two endpoints (e.g., two data centers) via an adjacency in the virtual topology (i.e., a virtual link), although the two endpoints are not necessarily physically connected. After the set of all virtual links has been defined, i.e., after all the lightpath requests have been identified, VTD requires solving a Routing and Wavelength Assignment (RWA) problem for each lightpath on top of the underlying physical network. In general, many virtual topologies can coexist in the same physical network, and they may represent, e.g., services required by different customers, or even different services, each with a specific set of requirements (e.g., in terms of QoS, bandwidth, and/or latency), provisioned to the same customer. Furthermore, VTD is not only necessary when a new service is provisioned and new resources are allocated in the network. In some cases, e.g., when network failures occur or when the utilization of network resources undergoes re-optimization procedures, existing (i.e., already designed) virtual topologies are rearranged; in these cases we speak of virtual topology reconfiguration. Note that, to perform design and reconfiguration of virtual topologies, network operators not only need to provision (or reallocate) network capacity for the required services, but may also need to provide additional resources according to the specific service characteristics, e.g., to guarantee service protection and/or meet QoS or latency requirements. This type of service provisioning is often referred to as network slicing, since each provisioned service (i.e., each VT) represents a slice of the overall network. 
To address VTD, ML is helpful as it makes it possible to simultaneously consider a large number of different and heterogeneous service requirements for a variety of virtual topologies, thus enabling fast decision making and resource provisioning. 
Failure detection and localization.
When managing a network, the ability to perform failure detection and localization, or even to determine the cause of a network failure, is crucial, as it may enable operators to promptly perform traffic rerouting, in order to maintain service status and meet Service Level Agreements (SLAs), and to rapidly recover from the failure. Handling network failures can be accomplished at different levels. For example, performing failure detection, i.e., identifying the set of lightpaths that were affected by a failure, is a relatively simple task, which allows network operators to reconfigure only the affected lightpaths by, e.g., rerouting the corresponding traffic. Moreover, the ability to also perform failure localization enables the activation of recovery procedures. This way, the pre-failure network status can be restored, which is, in general, an optimized situation from the point of view of resource utilization. Furthermore, also determining the cause of a network failure, e.g., temporary traffic congestion, device disruption, or even anomalous behaviour of failure monitors, is useful for adopting the proper restoration and traffic reconfiguration procedures, as sometimes remote reconfiguration of lightpaths is enough to handle the failure, while in other cases in-field intervention is necessary. In this context, ML can help handle the large amount of information derived from the continuous activity of a huge number of network monitors and alarms. 
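As an illustration of the kind of mapping ML can learn here, the sketch below trains a decision tree to separate invented failure causes from synthetic monitored indicators. The classes, feature statistics and thresholds are all made up for illustration and do not reproduce the scheme of any surveyed study:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
n = 3000
# Invented failure classes: 0 = normal, 1 = filter drift, 2 = amplifier fault.
classes = rng.integers(0, 3, n)
# Synthetic monitored indicators per lightpath (illustrative statistics):
# pre-FEC BER (log10), received power (dBm), BER variance over a window.
log_ber = np.where(classes == 0, -9.0, np.where(classes == 1, -6.0, -4.0))
log_ber = log_ber + rng.normal(0, 0.6, n)
rx_power = np.where(classes == 2, -18.0, -10.0) + rng.normal(0, 1.0, n)
ber_var = np.where(classes == 1, 1.0, 0.2) + rng.normal(0, 0.1, n)
X = np.column_stack([log_ber, rx_power, ber_var])

# Train on the first 2400 samples, evaluate on the remaining 600.
clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X[:2400], classes[:2400])
accuracy = clf.score(X[2400:], classes[2400:])
```

In practice the classes and indicator statistics would come from field monitors and alarm logs rather than a synthetic generator.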
Traffic flow classification.
When different types of services coexist in the same network infrastructure, classifying the corresponding traffic flows before their provisioning may enable efficient resource allocation, while reducing the risk of under- and over-provisioning. Moreover, accurate flow classification is also exploited for already provisioned services to apply flow-specific policies, e.g., to handle packet priority, to perform flow and congestion control, and to guarantee proper QoS to each flow according to the SLAs. In this context, ML is useful as it enables fast classification and flow differentiation, based on the various traffic characteristics and exploiting the large amount of information carried by data packets. 
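A minimal sketch of this idea trains a logistic-regression classifier to separate “mice” from “elephant” flows using invented early-flow features; the feature set and the label-generating rule are illustrative stand-ins, not those of any surveyed study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 4000
# Invented features observable from the first packets of a flow:
# mean packet size (bytes), mean inter-arrival time (ms), bulk-transfer port flag.
pkt_size = rng.uniform(60, 1500, n)
inter_arrival = rng.exponential(5.0, n)
bulk_port = rng.integers(0, 2, n)

# Illustrative ground truth: elephant flows tend to have large packets,
# short inter-arrival times and bulk-transfer ports.
logit = 0.004 * pkt_size - 0.3 * inter_arrival + 2.0 * bulk_port - 2.0
y = (logit + rng.logistic(0, 1, n) > 0).astype(int)  # 1 = elephant flow
X = np.column_stack([pkt_size, inter_arrival, bulk_port])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```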
Path computation.
When performing network resource allocation for an incoming service request, a proper path should be selected in order to efficiently exploit the available network resources, accommodating the requested traffic with the desired QoS and without affecting the existing services previously provisioned in the network. Traditionally, path computation is performed by using cost-based routing algorithms, such as the Dijkstra, Bellman-Ford, and Yen algorithms, which rely on the definition of a predefined cost metric (e.g., based on the distance between source and destination, the end-to-end delay, the energy consumption, or even a combination of several metrics) to discriminate between alternative paths. In this context, ML can be helpful as it makes it possible to simultaneously consider several parameters characterizing the incoming service request together with current network state information, with no need for complex network-cost evaluations, thus enabling fast path selection and service provisioning.
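As a point of reference for the ML-based alternatives, the classical cost-based approach can be sketched with a textbook Dijkstra implementation over an invented topology, whose link costs stand in for any chosen metric (distance, delay, etc.):

```python
import heapq

def dijkstra(graph, src, dst):
    """Shortest path by cumulative link cost; graph maps node -> {neighbor: cost}."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Walk predecessors back from the destination to recover the path.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return list(reversed(path)), dist[dst]

# Illustrative 4-node topology; link costs could be, e.g., km of fiber.
topology = {
    "A": {"B": 10, "C": 25},
    "B": {"A": 10, "C": 10, "D": 30},
    "C": {"A": 25, "B": 10, "D": 15},
    "D": {"B": 30, "C": 15},
}
path, cost = dijkstra(topology, "A", "D")  # A-B-C-D, total cost 35
```

An ML-based path selector would instead rank candidate paths directly from service-request features and network state, avoiding the explicit single-metric cost model above.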
III-C A bird's-eye view of the surveyed studies
The physical- and network-layer use cases described above have been tackled in existing studies by exploiting several machine learning tools (i.e., supervised and/or unsupervised learning, etc.) and leveraging different types of monitored network data (e.g., BER, OSNR, link load, network alarms, etc.).
In Tables I and II we summarize the various physical- and network-layer use cases and highlight the features of the machine learning approaches which have been used in the literature to solve these problems. In the tables we also indicate specific reference papers addressing these issues, which will be described in more detail in the following sections.
Use Case  ML category  ML methodology  Input data  Output data  Training data  Ref. 
QoT estimation  supervised  kriging/norm minimization  OSNR (historical data)  OSNR  synthetic  [27] 
OSNR/Q-factor  BER  synthetic  [28, 29]  
OSNR/PMD/CD/SPM  blocking prob.  synthetic  [30]  
CBR  error vector magnitude, OSNR  Q-factor  real  [31]  
lightpath route, length, number of co-propagating lightpaths  Q-factor  synthetic  [32, 33]  
RF  lightpath route, length, modulation format, traffic volume  BER  synthetic  [34]  
regression with gradient descent  SNR (historical data)  SNR  synthetic  [35]  
NN  lightpath route and length, number of traversed EDFAs, degree of destination, used channel wavelength  Q-factor  synthetic  [36, 37]  
OPM  supervised  NN  eye diagram and amplitude histogram param.  OSNR/PMD/CD  real  [38] 
NN, SVM  asynchronous amplitude histogram  MF  real  [26]  
NN  asynchronous constellation diagram and amplitude histogram param.  OSNR/PMD/CD  synthetic  [39, 40, 41, 42]  
Kernel-based ridge regression  eye diagram and phase portraits param.  PMD/CD  real  [43]  
Optical amplifiers control  supervised  CBR  power mask param. (NF, GF)  OSNR  real  [44, 45] 
NNs  EDFA input/output power  EDFA operating point  real  [46, 47]  
Ridge regression, Kernelized Bayesian regr.  WDM channel usage  post-EDFA power discrepancy  real  [48]  
unsupervised  evolutionary alg.  EDFA input/output power  EDFA operating point  real  [49]  
MF recognition  unsupervised  6 clustering alg.  Stokes space param.  MF  synthetic  [50] 
k-means  received symbols  MF  real  [51]  
supervised  NN  asynchronous amplitude histogram  MF  synthetic  [52]  
NN, SVM  asynchronous amplitude histogram  MF  real  [53, 54, 26]  
variational Bayesian techn. for GMM  Stokes space param.  MF  real  [55]  
Nonlinearity mitigation  supervised  Bayesian filtering, NNs, EM  received symbols  OSNR, Symbol error rate  real  [23, 24, 56] 
ELM  received symbols  self-phase modulation  synthetic  [57]  
k-nearest neighbors  received symbols  BER  real  [58]  
Newton-based SVM  received symbols  Q-factor  real  [59]  
binary SVM  received symbols  symbol decision boundaries  synthetic  [60]  
NN  received subcarrier symbols  Q-factor  synthetic  [61] 
Use Case  ML category  ML methodology  Input data  Output data  Training data  Ref. 
Traffic prediction and virtual topology (re)design  supervised  ARIMA  historical real-time traffic matrices  predicted traffic matrix  synthetic  [62], [63] 
NN  historical end-to-end maximum bit-rate traffic  predicted end-to-end traffic  synthetic  [64], [65]  
Reinforcement learning  previous solutions of a multi-objective GA for VTD  updated VT  synthetic  [66], [67]  
unsupervised  NMF, clustering  CDR, PoI matrix  similarity patterns in base station traffic  real  [68]  
Failure detection  supervised  Bayesian Inference  BER, received power  list of failures for all lightpaths  real  [69] 
Bayesian Inference, EM  FTTH network dataset with missing data  complete dataset  real  [70], [71]  
Kriging  previously established lightpaths with already available failure localization and monitoring data  estimate of failure localization at link level for all lightpaths  real  [72]  
(1) LUCIDA: regression and classification; (2) BANDO: anomaly detection  (1) LUCIDA: historic BER and received power, notifications from BANDO; (2) BANDO: maximum BER, threshold BER at setup, monitored BER  (1) LUCIDA: failure classification; (2) BANDO: anomalies in BER  real  [73]  
Regression, decision tree, SVM  BER, frequency-power pairs  localized set of failures  real  [74]  
Flow classification  supervised  HMM, EM  packet loss data  loss classification: congestion loss or contention loss  synthetic  [75] 
NN  source/destination IP addresses, source/destination ports, transport layer protocol, packet sizes, and a set of intra-flow timings within the first 40 packets of a flow  classified flow for DC (mice flow or elephant flow)  synthetic  [76]  
Path Computation  supervised  Q-learning  traffic requests, set of candidate paths between each source-destination pair  optimum paths for each source-destination pair to minimize burst-loss probability  synthetic  [77] 
unsupervised  FCM  traffic requests, path lengths, set of modulation formats, OSNR, BER  mapping of an optimum modulation format to a lightpath  synthetic  [78] 
IV Detailed Survey of ML in Physical Layer Problems
IV-A Quality of Transmission Estimation
QoT estimation consists of computing transmission quality metrics such as OSNR, BER, Q-factor, CD or PMD based on measurements directly collected from the field by means of optical performance monitors installed at the receiver side [79] and/or on lightpath characteristics. QoT estimation is typically applied in two scenarios:

predicting the transmission quality of unestablished lightpaths based on historical observations and measurements collected from already deployed ones;

monitoring the transmission quality of alreadydeployed lightpaths with the aim of identifying faults and malfunctions.
QoT prediction of unestablished lightpaths relies on intelligent tools capable of predicting whether a candidate lightpath will meet the required quality of service guarantees (mapped onto OSNR, BER or Q-factor threshold values): the problem is typically formulated as a binary classification problem, where the classifier outputs a yes/no answer based on the lightpath characteristics (e.g., its length, number of links, modulation format used for transmission, overall spectrum occupation of the traversed links, etc.).
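A minimal sketch of this binary-classification formulation uses a random forest on synthetic lightpath features; the feature set, the label-generating rule and the threshold below are invented stand-ins for field data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Synthetic lightpath features: total length (km), hop count,
# modulation order (bits/symbol), spectrum occupation of traversed links (0-1).
length = rng.uniform(50, 2000, n)
hops = rng.integers(1, 10, n)
mod_order = rng.choice([2, 4, 6], n)
occupation = rng.uniform(0, 1, n)
X = np.column_stack([length, hops, mod_order, occupation])

# Illustrative label rule standing in for a BER-threshold check:
# longer, more loaded paths with denser modulation are more likely to fail.
score = 0.002 * length + 0.3 * hops + 0.8 * mod_order + 2.0 * occupation
y = (score + rng.normal(0, 0.5, n) < 8.0).astype(int)  # 1 = QoT above threshold

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
new_lightpath = [[1200.0, 4, 4, 0.3]]      # a candidate lightpath to screen
decision = clf.predict(new_lightpath)[0]   # 1 = deploy, 0 = reject/redesign
```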
In [32] a cognitive Case Based Reasoning (CBR) approach is proposed, which relies on the maintenance of a knowledge database where information on the measured Q-factor of deployed lightpaths is stored, together with their route, selected wavelength, total length, and total number and standard deviation of the number of co-propagating lightpaths per link. Whenever a new traffic request arrives, the most “similar” one (where similarity is computed by means of the Euclidean distance in the multidimensional space of normalized features) is retrieved from the database, and a decision is made by comparing the associated Q-factor measurement with a predefined system threshold. As correct dimensioning and maintenance of the database greatly affect the performance of the CBR technique, algorithms are proposed to keep it up to date and to remove old or useless entries. The trade-off between database size, computational time and effectiveness of the classification performance is extensively studied: in [33], the technique is shown to outperform state-of-the-art ML algorithms such as Naive Bayes, J4.8 trees and Random Forests (RFs). Experimental results achieved with data obtained from a real testbed are discussed in [31].
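The retrieval step of such a CBR scheme can be sketched as a nearest-neighbor lookup in a normalized feature space; the knowledge-base entries, feature values and Q-factor threshold below are invented:

```python
import numpy as np

# Knowledge base: one row per deployed lightpath, with illustrative features
# [route length (km), wavelength index, mean co-propagating lightpaths],
# plus the measured Q-factor (dB) for each entry.
cases = np.array([
    [400.0, 12, 3.0],
    [1200.0, 40, 8.0],
    [800.0, 25, 5.0],
])
q_factor_db = np.array([9.8, 6.1, 8.2])

def retrieve(query, cases, q):
    # Normalize each feature to [0, 1] so no single one dominates the distance.
    lo, hi = cases.min(axis=0), cases.max(axis=0)
    norm = (cases - lo) / (hi - lo)
    qn = (np.asarray(query) - lo) / (hi - lo)
    d = np.linalg.norm(norm - qn, axis=1)  # Euclidean distance to each case
    best = int(np.argmin(d))
    return best, q[best]

idx, q_est = retrieve([780.0, 24, 5.0], cases, q_factor_db)
meets_threshold = q_est >= 7.0  # compare against a system Q-factor threshold
```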
A database-oriented approach is also proposed in [35] to reduce uncertainties on network parameters and design margins; here, field data are collected by a software-defined network controller and stored in a central repository. Then, a QTool is used to produce an estimate of the field-measured Signal-to-Noise Ratio (SNR) based on educated guesses of the (unknown) network parameters, and such guesses are iteratively updated by means of a gradient descent algorithm until the difference between the estimated and the field-measured SNR falls below a predefined threshold. The newly estimated parameters are stored in the database and yield new design margins, which can be used for future demands. The trade-off between database size and the range of the SNR estimation error is evaluated via numerical simulations.
Similarly, in the context of multicast transmission in optical networks, a NN is trained in [36, 37], using as features the lightpath total length, the number of traversed EDFAs, the maximum link length, the degree of the destination node and the channel wavelength used for transmission of candidate lightpaths, to predict whether the Q-factor will exceed a given system threshold. The NN is trained online with data mini-batches, according to the network evolution, to allow for sequential updates of the prediction model. A dropout technique is adopted during training to avoid overfitting. The classification output is exploited by a heuristic algorithm for dynamic routing and spectrum assignment, which decides whether the request must be served or blocked. The algorithm performance is assessed in terms of blocking probability.
A random forest binary classifier is adopted in [34] to predict the probability that the BER of unestablished lightpaths will exceed a system threshold. The classifier takes as input a set of features including the total length and maximum link length of the candidate lightpath, the number of traversed links, the amount of traffic to be transmitted and the modulation format to be adopted for transmission. Several alternative combinations of routes and modulation formats are considered, and the classifier identifies the ones that will most likely satisfy the BER requirements.
Two alternative approaches, namely network kriging (first described in [80]) and norm minimization (typically used in network tomography [81]), are applied in [29, 30] in the context of QoT estimation: they rely on the installation of probe lightpaths that do not carry user data but are used to gather field measurements. The proposed inference methodologies exploit the spatial correlation between the QoT metrics of probes and data-carrying lightpaths sharing some physical links to provide an estimate of the Q-factor of already deployed or prospective lightpaths. These methods can be applied either assuming a centralized decisional tool or in a distributed fashion, where each node has only local knowledge of the network measurements. As installing probe lightpaths is costly and occupies spectral resources, the trade-off between the number of probes and the accuracy of the estimation is studied. Several heuristic algorithms for the placement of the probes are proposed in [27]. A further refinement of the methodologies, which takes into account the presence of neighboring channels, appears in [28].
Additionally, a data-driven approach using a machine learning technique, Gaussian process nonlinear regression (GPR), is proposed and experimentally demonstrated for performance prediction of WDM optical communication systems [26]. The core of the proposed approach (and indeed of any ML technique) is generalization: first the model is learned from measured data acquired under one set of system configurations, and then the inferred model is applied to perform predictions for a new set of system configurations. The advantage of the approach is that complex system dynamics can be captured from measured data more easily than from simulations. Accurate BER predictions as a function of input power, transmission length, symbol rate and inter-channel spacing are reported using numerical simulations and a proof-of-principle experimental validation for a 24×28 GBd QPSK WDM optical transmission system.
Finally, a control and management architecture integrating an intelligent QoT estimator is proposed in [82] and its feasibility is demonstrated with implementation in a real testbed.
IV-B Optical amplifiers control
The operating point of EDFAs influences their Noise Figure (NF) and gain flatness (GF), which have a considerable impact on the overall lightpath QoT. The adaptive adjustment of the operating point based on the signal input power can be accomplished by means of ML algorithms. Most of the existing studies [46, 47, 44, 45, 49] rely on a preliminary amplifier characterization process aimed at experimentally evaluating the value of the metrics of interest (e.g., NF, GF and gain control accuracy) within the power mask (i.e., the amplifier operating region, depicted in Fig. 4).
The characterization results are then represented as a set of discrete values within the operating region. In EDFA implementations, state-of-the-art microcontrollers cannot easily obtain GF and NF values for points that were not measured during characterization. Unfortunately, producing a large number of fine-grained measurements is time-consuming. To address this issue, ML algorithms can be used to interpolate the mapping function over non-measured points.
For the interpolation, the authors of [46, 47] adopt a NN implementing both feed-forward and backward error propagation. Experimental results with single and cascaded amplifiers report interpolation errors below 0.5 dB. Conversely, a cognitive methodology is proposed in [44], which is applied in dynamic network scenarios upon arrival of a new lightpath request: a knowledge database is maintained where measurements of the amplifier gains of already established lightpaths are stored, together with the lightpath characteristics (e.g., number of links, total length, etc.) and the OSNR value measured at the receiver. The database entries showing the highest similarity with the incoming lightpath request are retrieved, the vectors of gains associated with their respective amplifiers are considered, and a new choice of gains is generated by perturbation of such values. Then, the OSNR value that would be obtained with the new vector of gains is estimated via simulation and stored in the database as a new entry. After this, the vector associated with the highest OSNR is used for tuning the amplifier gains when the new lightpath is deployed.
An implementation of real-time EDFA setpoint adjustment using the GMPLS control plane and an interpolation rule based on a weighted Euclidean distance computation is described in [45] and extended to cascaded amplifiers in [49].
Differently from the previous references, in [48] the issue of modelling the channel dependence of EDFA power excursions is approached by defining a regression problem, where the input feature set is an array of binary values indicating the occupation of each spectrum channel in a WDM grid, and the predicted variable is the post-EDFA power discrepancy. Two learning approaches (i.e., Ridge regression and Kernelized Bayesian regression models) are compared for a setup with 2 and 3 amplifier spans, in the case of single-channel and superchannel add/drops. Based on the predicted values, suggestions on the spectrum allocation ensuring the least power discrepancy among channels can be provided.
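This regression formulation can be sketched as follows, with a synthetic dataset standing in for monitored EDFA measurements; the channel count, per-channel coefficients and noise level are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_channels, n_samples = 40, 500
# Each sample: binary occupation of a 40-channel WDM grid.
X = rng.integers(0, 2, size=(n_samples, n_channels)).astype(float)
# Synthetic ground truth: each occupied channel shifts the post-EDFA power
# discrepancy by a channel-dependent amount (coefficients are invented).
true_coef = rng.normal(0, 0.1, n_channels)
y = X @ true_coef + rng.normal(0, 0.02, n_samples)  # discrepancy in dB

model = Ridge(alpha=1.0).fit(X, y)
# Predicted discrepancy for a few candidate channel configurations; the
# allocation with the smallest predicted value would be preferred.
pred = model.predict(X[:5])
mae = np.mean(np.abs(model.predict(X) - y))
```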
IV-C Modulation Format Recognition
The issue of autonomous modulation format identification in digital coherent receivers (i.e., without requiring information from the transmitter) has been addressed by means of a variety of ML algorithms, including k-means clustering [51] and neural networks [53, 54]. Papers [50] and [55] take advantage of the Stokes space signal representation (see Fig. 5 for the representation of DP-BPSK, DP-QPSK and DP-8QAM), which is not affected by frequency and phase offsets.
The first reference compares the performance of 6 unsupervised clustering algorithms in discriminating among 5 different formats (i.e., BPSK, QPSK, 8-PSK, 8-QAM, 16-QAM) in terms of True Positive Rate and running time, depending on the OSNR at the receiver. For some of the considered algorithms, the issue of predetermining the number of clusters is solved by means of the silhouette coefficient, which evaluates the tightness of different clustering structures by considering the inter- and intra-cluster distances. The second reference adopts an unsupervised variational Bayesian expectation maximization algorithm to count the number of clusters in the Stokes space representation of the received signal, which provides an input to a cost function used to identify the modulation format. The experimental validation is conducted over PSK and QAM modulated signals.
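The cluster-counting step can be sketched with k-means plus the silhouette coefficient on synthetic two-dimensional points standing in for a Stokes-space representation (a real Stokes-space signal lives in three dimensions; the four-cluster layout and noise level below are invented):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Synthetic received points: four noisy clusters, as a QPSK-like signal
# would produce; centers and spread are illustrative.
centers = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
points = np.vstack([c + rng.normal(0, 0.15, (200, 2)) for c in centers])

# Pick the cluster count that maximizes the silhouette coefficient,
# which balances intra-cluster tightness against inter-cluster separation.
best_k, best_score = None, -1.0
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    s = silhouette_score(points, labels)
    if s > best_score:
        best_k, best_score = k, s
# best_k can then be mapped to a candidate modulation format.
```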
Conversely, features extracted from asynchronous amplitude histograms sampled from the eye diagram after equalization in digital coherent transceivers are used in [53, 54, 52] to train NNs. In [53, 54], a NN is used for hierarchical extraction of the amplitude histograms’ features, in order to obtain a compressed representation, aimed at reducing the number of neurons in the hidden layers with respect to the number of features. In [52], a NN is combined with a genetic algorithm to improve the efficiency of the weight selection procedure during the training phase. Both studies provide numerical results over experimentally generated data: the former obtains a 0% error rate in discriminating among three modulation formats (PM-QPSK, 16-QAM and 64-QAM); the latter shows the trade-off between error rate and number of histogram bins considering six different formats (NRZ-OOK, ODB, NRZ-DPSK, RZ-DQPSK, PM-RZ-QPSK and PM-NRZ-16-QAM).
IV-D Nonlinearity Mitigation
One of the performance metrics commonly used for optical communication systems is the product of data rate and distance. Due to fiber loss, optical amplification needs to be employed and, for increasing transmission distances, an increasing number of optical amplifiers must be employed accordingly. Optical amplifiers add noise, and to retain the signal-to-noise ratio the optical signal power must be increased. However, increasing the optical signal power beyond a certain value will enhance optical fiber nonlinearities, which leads to Nonlinear Interference (NLI) noise. NLI impacts symbol detection, and the focus of many papers, such as [58, 60, 23, 56, 24, 57, 59], has been on applying ML approaches to perform optimum symbol detection.
In general, the task of the receiver is to perform optimum symbol detection. In the case when the noise has a circularly symmetric Gaussian distribution, optimum symbol detection is performed by minimizing the Euclidean distance between the received symbol and all the possible symbols of the constellation alphabet. This type of symbol detection then has linear decision boundaries. In the case of memoryless nonlinearities, such as nonlinear phase noise, I/Q modulator and driving electronics nonlinearities, the noise associated with the symbol may no longer be circularly symmetric. This means that the clusters in the constellation diagram become distorted (elliptically shaped instead of circularly symmetric in some cases). In those particular cases, optimum symbol detection is no longer based on the Euclidean distance metric, and knowledge and full parametrization of the likelihood function is necessary. To determine and parameterize the likelihood function and finally perform optimum symbol detection, ML techniques such as SVMs, kernel density estimators, k-nearest neighbors and Gaussian mixture models can be employed. A gain of approximately 3 dB in the input power to the fiber has been achieved by employing a Gaussian mixture model in combination with expectation maximization, for 14 GBd DP-16QAM transmission over an 800 km dispersion-compensated link [23].
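The Gaussian-mixture idea can be sketched as follows: fit one Gaussian component per constellation point with EM, then detect symbols by maximum posterior probability. The constellation, the elliptical noise and all numerical values below are invented for illustration and do not reproduce the setup of [23]:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Noisy QPSK-like constellation with non-circular (elliptical) noise,
# so Euclidean decision boundaries are suboptimal and per-cluster
# covariances help.
ideal = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
tx = rng.integers(0, 4, 4000)
noise = np.column_stack([rng.normal(0, 0.30, 4000),
                         rng.normal(0, 0.10, 4000)])
rx = ideal[tx] + noise

# EM fit with one Gaussian per constellation point; full covariances
# capture the elliptical cluster shapes.
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      means_init=ideal, random_state=0).fit(rx)
detected = gmm.predict(rx)  # maximum-posterior component per symbol

# With means_init=ideal, component k tracks constellation point k.
symbol_error_rate = np.mean(detected != tx)
```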
Furthermore, in [58] a distance-weighted k-nearest-neighbors classifier is adopted to compensate system impairments in zero-dispersion, dispersion-managed and dispersion-unmanaged links with 16-QAM transmission, whereas in [61] NNs are proposed for nonlinear equalization in 16-QAM OFDM transmission (one neural network per subcarrier is adopted, with a number of neurons equal to the number of symbols). To reduce the computational complexity of the training phase, an Extreme Learning Machine (ELM) equalizer is proposed in [57]. An ELM is a NN whose weights minimizing the input-output mapping error can be computed by means of a generalized matrix inversion, without requiring any iterative weight optimization step.
SVMs are adopted in [60, 59]: in [60], a battery of binary SVM classifiers is used to identify decision boundaries separating the points of a PSK constellation, whereas in [59] fast Newton-based SVMs are employed to mitigate inter-subcarrier intermixing in 16-QAM OFDM transmission.
All the above-mentioned approaches lead to a 0.5 to 3 dB improvement in terms of BER/Q-factor.
IV-E Optical Performance Monitoring
Artificial neural networks are well-suited machine learning tools for optical performance monitoring, as they can be used to learn the complex mapping between samples, or features extracted from the symbols, and optical fiber channel parameters such as OSNR, PMD, Polarization-Dependent Loss (PDL), baud rate and CD. The features that are fed into the neural network can be derived using different approaches relying on feature extraction from: 1) the power eye diagrams (e.g., Q-factor, closure, variance, root-mean-square jitter and crossing amplitude, as in [40, 39, 42, 41, 26, 56]); 2) the two-dimensional eye diagram and phase portrait [43]; 3) asynchronous constellation diagrams (i.e., vector diagrams also including transitions between symbols [40]); and 4) histograms of the asynchronously sampled signal amplitudes [41, 42]. The advantage of manually providing the features to the algorithm is that the NN can be relatively simple, e.g., consisting of one hidden layer with up to 10 hidden units, and does not require a large amount of data to be trained. Another approach is to simply pass the samples at the symbol level and then use more layers that act as feature extractors (i.e., performing deep learning) [38]. Note that this approach requires a large amount of data due to the high dimensionality of the input vector to the NN.
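A minimal sketch of the feature-based approach, assuming a handful of invented eye-diagram features whose dependence on OSNR is purely illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 3000
osnr = rng.uniform(10, 30, n)  # dB; the channel parameter to be monitored

# Invented eye-diagram features with an illustrative dependence on OSNR.
q_factor = 0.4 * osnr + rng.normal(0, 0.5, n)
eye_closure = 20.0 / osnr + rng.normal(0, 0.05, n)
rms_jitter = 2.0 - 0.03 * osnr + rng.normal(0, 0.05, n)
X = np.column_stack([q_factor, eye_closure, rms_jitter])

X_tr, X_te, y_tr, y_te = train_test_split(X, osnr, random_state=0)
# One hidden layer with 10 units, matching the "simple NN on extracted
# features" approach described above.
nn = make_pipeline(StandardScaler(),
                   MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000,
                                random_state=0))
nn.fit(X_tr, y_tr)
mae = np.mean(np.abs(nn.predict(X_te) - y_te))  # mean absolute OSNR error, dB
```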
V Detailed Survey of ML in Networking Problems
V-A Traffic Prediction and Virtual Topology Design
Traffic prediction in optical networks is an important task, especially for planning resources and upgrading them optimally. Since the inherent philosophy of ML techniques is to learn a model from a set of data and ‘predict’ future behavior from the learned model, ML can be effectively applied to traffic prediction.
For example, the authors in [62], [63] propose the Autoregressive Integrated Moving Average (ARIMA) method, a supervised learning approach applied to time series data [83]. In both [62] and [63] the authors use ML algorithms to predict traffic for carrying out virtual topology reconfiguration. They propose a network planner and decision maker (NPDM) module for predicting traffic using ARIMA models. The NPDM then interacts with other modules to perform virtual topology reconfiguration.
Since the virtual topology should adapt to time-varying traffic, the input datasets in [62] and [63] are in the form of time-series data. More specifically, the inputs are the real-time traffic matrices observed over a window of time just prior to the current period. ARIMA is a forecasting technique that works very well with time series data [83] and hence it becomes a preferred choice in applications such as traffic prediction and virtual topology reconfiguration. Furthermore, the relatively low complexity of ARIMA is also preferable in applications where maintaining a low operational expenditure is important, as mentioned in [62] and [63].
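The forecasting idea can be sketched in a few lines of pure Python on an invented periodic traffic trace (a full ARIMA fit, e.g., via statsmodels, also estimates moving-average terms; here only the differencing and autoregressive parts are shown):

```python
import math

# invented traffic trace: daily periodicity plus a slow growth trend
series = [10 + 0.05 * t + 3 * math.sin(2 * math.pi * t / 24) for t in range(200)]

# difference once (the "I" in ARIMA) to remove the trend
diff = [series[t] - series[t - 1] for t in range(1, len(series))]

# fit an AR(2) model on the differenced series by ordinary least squares
p = 2
X = [[diff[t - 1], diff[t - 2]] for t in range(p, len(diff))]
y = [diff[t] for t in range(p, len(diff))]
s11 = sum(r[0] * r[0] for r in X)
s12 = sum(r[0] * r[1] for r in X)
s22 = sum(r[1] * r[1] for r in X)
b1 = sum(r[0] * v for r, v in zip(X, y))
b2 = sum(r[1] * v for r, v in zip(X, y))
det = s11 * s22 - s12 * s12          # solve the 2x2 normal equations
a1 = (s22 * b1 - s12 * b2) / det
a2 = (-s12 * b1 + s11 * b2) / det

# one-step-ahead forecast: predict the next difference, then undo differencing
next_diff = a1 * diff[-1] + a2 * diff[-2]
forecast = series[-1] + next_diff
actual = 10 + 0.05 * 200 + 3 * math.sin(2 * math.pi * 200 / 24)
print(f"forecast={forecast:.2f} actual={actual:.2f}")
```

In a VNT-reconfiguration loop, such one-step forecasts (one per source-destination pair) would form the predicted traffic matrix handed to the decision module.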
In general, the choice of an ML algorithm is always governed by the trade-off between learning accuracy and complexity, and the application of ML in optical networks is no exception. For example, in [64] and [65], the authors present traffic prediction in the same context as [62] and [63], i.e., virtual topology reconfiguration, using Neural Networks (NNs). A prediction module based on NNs is proposed which generates the source-destination traffic matrix. This predicted traffic matrix for the next period is then used by a decision maker module to assess whether the current virtual network topology (VNT) needs to be reconfigured. According to [65], the main motivations for using NNs are their better adaptability to changes in input traffic and the accuracy of the prediction (error below 3%, as reported in the reference) of the output traffic based on the inputs (which are historical traffic).
Reference [84] reports a cognitive network management module in relation to the Application-Based Network Operations (ABNO) framework, with specific focus on ML-based traffic prediction for VNT reconfiguration. However, [84] does not provide details of any specific ML algorithm used for VNT reconfiguration. Along similar lines, [85] proposes Bayesian inference to estimate network traffic and decide whether to reconfigure a given virtual network.
While most of the literature focuses on traffic prediction using ML algorithms with a specific view to virtual network topology reconfiguration, [68] presents a general framework of traffic pattern estimation from call data records (CDR). The work uses real datasets from service providers and applies matrix factorization and clustering-based algorithms to draw useful insights from those datasets, which can be utilized to better engineer network resources. More specifically, [68] uses CDRs from different base stations in the city of Milan. The dataset contains information such as cell ID, time interval of calls, country code, received SMS, sent SMS, received calls, sent calls, etc., in the form of a matrix called the CDR matrix. Apart from the CDR matrix, the input dataset also includes a point-of-interest (POI) matrix which contains information about the points of interest, or regions most likely visited, corresponding to each base station. All these input matrices are then fed to an ML clustering algorithm called non-negative matrix factorization (NMF) and a variant of it called collective NMF (C-NMF). The algorithms factor the input matrices into two non-negative matrices, one of which gives the basic traffic patterns while the other gives the similarities between base stations in terms of those traffic patterns.
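The factorization step can be sketched with the classical Lee-Seung multiplicative updates on a toy "CDR matrix" (all numbers invented; the real matrices in [68] are far larger, and C-NMF adds coupling across matrices):

```python
import random

random.seed(1)

# toy "CDR matrix": 4 base stations x 6 time slots, mixing 2 latent patterns
patterns = [[5, 4, 1, 0, 0, 1],   # daytime-heavy traffic pattern
            [0, 1, 3, 5, 4, 2]]   # evening-heavy traffic pattern
mix = [[1.0, 0.2], [0.8, 0.5], [0.1, 1.0], [0.3, 0.9]]
V = [[sum(mix[i][k] * patterns[k][j] for k in range(2)) for j in range(6)]
     for i in range(4)]

n, m, r = 4, 6, 2
W = [[random.random() for _ in range(r)] for _ in range(n)]  # station factors
H = [[random.random() for _ in range(m)] for _ in range(r)]  # traffic patterns

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

for _ in range(500):
    # H <- H * (W^T V) / (W^T W H)   (multiplicative update keeps H >= 0)
    WtV = matmul(transpose(W), V)
    WtWH = matmul(transpose(W), matmul(W, H))
    H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + 1e-9) for j in range(m)]
         for i in range(r)]
    # W <- W * (V H^T) / (W H H^T)
    VHt = matmul(V, transpose(H))
    WHHt = matmul(W, matmul(H, transpose(H)))
    W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + 1e-9) for j in range(r)]
         for i in range(n)]

WH = matmul(W, H)
err = sum((V[i][j] - WH[i][j]) ** 2 for i in range(n) for j in range(m)) ** 0.5
print(f"reconstruction error: {err:.4f}")
```

The rows of H recover the basic traffic patterns and the rows of W express each base station as a non-negative mixture of them, mirroring the interpretation given in [68].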
While many of the references in the literature focus on one or a few specific features when developing ML algorithms for traffic prediction and virtual topology (re)configuration, others just outline a general framework with some form of ‘cognition’ incorporated in association with regular optimization algorithms. For example, [66] and [67] describe a multiobjective Genetic Algorithm (GA) for virtual topology design. No specific machine learning algorithm is mentioned in [66] and [67], but they adopt an adaptive fitness function update for the GA. Here they use the principles of reinforcement learning, where previous GA solutions for virtual topology design are used to update the fitness function for future solutions.
V-B Failure Detection
ML techniques can be adopted either to identify the exact location of a fault or malfunction within the network or even to infer the specific type of failure. In [72], network kriging is exploited to localize the exact position of a fault along network links, under the assumption that the only information available at the receiving nodes (which work as monitoring nodes) of already established lightpaths is the number of faults encountered along the lightpath route. If unambiguous localization cannot be achieved, lightpath probing may be operated in order to provide additional information, which increases the rank of the routing matrix. Depending on the network load, the number of monitoring nodes necessary to ensure unambiguous localization is evaluated. Similarly, in [69] the measured time series of BER and received power at lightpath end nodes are provided as input to a Bayesian network which determines whether a fault is occurring along the lightpath and tries to identify the cause (e.g., tight filtering or channel interference), based on specific attributes of the measurement patterns (such as maximum, average and minimum values, and the presence and amplitude of steps). The effectiveness of the Bayesian classifier is assessed in an experimental testbed: results show that only 0.8% of the tested instances were misclassified.
Other instances of the application of Bayesian models to detect and diagnose faults in optical networks, especially GPON/FTTH, are reported in [70] and [71]. In [70], the GPON/FTTH network is modeled as a Bayesian network using a layered approach identical to one of the authors' previous works [86]. Layer 1 in this case corresponds to the physical network topology consisting of ONTs, ONUs and fibers. Fault propagation between the different network components depicted by layer-1 nodes is modeled in layer 2 using a set of directed acyclic graphs interconnected via layer 1. The uncertainties of fault propagation are then handled by quantifying the strengths of dependencies between layer-2 nodes with conditional probability distributions estimated from network-generated data. However, some of these network-generated data can be missing because of improper measurements or non-reporting. An Expectation Maximization (EM) algorithm is therefore used to handle missing data for root-cause analysis of network faults and helps in self-diagnosis. Basically, the EM algorithm estimates the missing data such that the estimate maximizes the expected log-likelihood function based on a given set of parameters. In [71] a similar combination of Bayesian probabilistic models and EM is used for fault diagnosis in GPON/FTTH networks.
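The EM idea can be illustrated with a deliberately tiny example, far simpler than the Bayesian-network setting of [70] (the numbers are invented): estimating a Gaussian mean when some observations are missing. The E-step fills the missing entries with their expected value under the current estimate; the M-step re-maximizes the likelihood.

```python
# toy EM for a mean estimate with missing data (None marks a missing value)
data = [3.1, None, 2.9, 3.3, None, 3.0]

mu = 0.0
for _ in range(20):
    filled = [x if x is not None else mu for x in data]  # E-step: expected values
    mu = sum(filled) / len(filled)                       # M-step: maximize likelihood

print(round(mu, 3))  # → 3.075 (the mean of the observed values)
```

Each iteration increases the expected log-likelihood, and the estimate converges to the mean of the observed samples, exactly the monotone-improvement property that makes EM suitable for the incomplete network data in [70] and [71].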
In the context of failure detection, machine learning algorithms beyond Bayesian networks have also been used. For example, [73] describes two ML-based algorithms based on regression, classification, and anomaly detection. The authors propose a BER anomaly detection algorithm which takes as input historical information, such as maximum BER, threshold BER at setup, and monitored BER per lightpath, and detects any abrupt changes in BER which might result from failures of components along a lightpath. This BER anomaly detection algorithm, termed BANDO, runs on each node of the network. The outputs of BANDO are events denoting whether the BER is above a certain threshold, below it, or within a predefined boundary.
This information is then passed on as input to another ML-based algorithm, which the authors term LUCIDA. LUCIDA runs in the network controller and takes historic BER, historic received power, and the outputs of BANDO as input. These inputs are converted into three features that can be quantified by time series: 1) received power above the reference level (PRX-high); 2) BER positive trend (BER-Trend); and 3) BER periodicity (BER-Period). LUCIDA computes the probabilities of these features and of the possible failure classes, and finally maps the feature probabilities to failure probabilities. In this way, LUCIDA detects the most likely failure cause from a set of failure classes.
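A toy version of BANDO-style BER event detection might look as follows (the thresholds, rules and BER values are invented for illustration; the actual event logic in [73] is more elaborate):

```python
def bando_events(ber_series, thr, boundary=2.0):
    """Toy BANDO-style detector (invented rules): flag BER above the setup
    threshold, or abrupt jumps beyond a band around the running maximum."""
    events, running_max = [], ber_series[0]
    for t, ber in enumerate(ber_series):
        if ber > thr:
            events.append((t, "threshold-exceeded"))
        elif ber > boundary * running_max:
            events.append((t, "abrupt-change"))
        running_max = max(running_max, ber)
    return events

# a stable lightpath, then a sudden degradation (e.g., a component failure)
series = [1e-6, 1.1e-6, 0.9e-6, 1e-6, 8e-6, 2e-4]
events = bando_events(series, thr=1e-4)
print(events)  # → [(4, 'abrupt-change'), (5, 'threshold-exceeded')]
```

Events of this kind, together with the raw BER and power histories, are what a LUCIDA-like controller component would consume to infer failure probabilities.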
Another notable use case of ML concepts for failure detection in optical networks appears in [74]. Two algorithms are proposed, viz., Testing optIcal Switching at connection SetUp time (TISSUE) and FailurE causE Localization for optIcal NetworkinG (FEELING). The TISSUE algorithm compares the estimated BER calculated at each node across a lightpath with the measured BER. If the difference between the slopes of the estimated and theoretical BER is above a certain threshold, a failure is anticipated. While it is not clear from [74] whether the BER estimation in the TISSUE algorithm is based on ML methods, the FEELING algorithm applies two very well-known ML methods, viz., decision trees and Support Vector Machines.
In FEELING, the first step is to process the input dataset, given as ordered pairs of frequency and power for each optical signal, and transform it into a set of features. The features include primary features, such as the power levels around the central frequency of the signal and around other cut-off points of the signal spectrum (interested readers are referred to [74] for further details). In the context of the FEELING algorithm, secondary features are also defined in [74] as linear combinations of the primary features. The feature-extraction process is undertaken by a module named FeX. The next step is to input these features into a multi-class classifier in the form of a decision tree, which outputs: i) a predicted class among three options: ‘Normal’, ‘LaserDrift’ and ‘FilterFailure’; and ii) a subset of relevant signal points for the predicted class. Basically, the decision tree contains a number of decision rules that map specific combinations of feature values to classes. This decision-tree-based component runs in another module named the signal spectrum verification (SSV) module. The FeX and SSV modules are located in the network nodes. Two more modules, the signal spectrum comparison (SSC) module and the laser drift estimator (LDE) module, run on the network controller.
In the SSC module, a classification process similar to that in SSV takes place, but here a signal is diagnosed with respect to the classes of failures due solely to filtering. The three classes are: ‘Normal’, ‘FilterShift’ and ‘TightFiltering’. The SSC module uses Support Vector Machines to classify the signals into these three classes. First, the SVM determines whether the signal is ‘Normal’ or has suffered a filter-related failure. Next, the SVM classifies the signals suffering from filter-related failures into two classes, depending on whether the failure is due to tight filtering or to a filter shift. Once these classifications are done, the magnitude of the failure for each class is estimated using a linear-regression-based estimator module per failure class. Finally, all the information provided by the different modules described so far is used by the FEELING algorithm to return a final list of failures.
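The flavor of the decision-tree stage can be conveyed with a hand-written rule set. The feature names and thresholds below are invented for illustration only; FEELING learns its decision rules from spectral data rather than hard-coding them:

```python
def classify_spectrum(center_power_db, edge_power_db, bandwidth_ghz):
    """Toy stand-in for a spectrum-failure decision tree (invented rules)."""
    if center_power_db < -3.0:
        return "LaserDrift"       # signal peak shifted away from expected center
    if bandwidth_ghz < 30.0 or edge_power_db < -20.0:
        return "FilterFailure"    # spectrum narrowed or edges over-attenuated
    return "Normal"

print(classify_spectrum(-1.0, -15.0, 35.0))  # → Normal
print(classify_spectrum(-5.0, -15.0, 35.0))  # → LaserDrift
print(classify_spectrum(-1.0, -25.0, 35.0))  # → FilterFailure
```

A learned tree encodes exactly this kind of threshold cascade, with thresholds chosen to maximize class purity on the training spectra; the SVM stage then refines the filter-related branch into ‘FilterShift’ versus ‘TightFiltering’.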
V-C Flow Classification
Another popular area of ML application in optical networks is flow classification. In [75], for example, a framework is described that observes different types of packet loss in optical burst-switched (OBS) networks. It then classifies the packet loss data as congestion loss or contention loss using a Hidden Markov Model (HMM) and EM algorithms.
Another example of flow classification is presented in [76]. Here a NN is trained to classify flows in an optical data center network. The feature vector includes a 5-tuple (source IP address, destination IP address, source port, destination port, transport layer protocol). Packet sizes and a set of intra-flow timings within the first 40 packets of a flow, which roughly corresponds to the first 30 TCP segments, are also used as inputs to improve the training speed and to mitigate the problem of ‘vanishing gradients’ when using gradient descent for backpropagation.
The main outcome of the NN used in [76] is the classification of mice and elephant flows in the data center (DC). The type of neural network used is a multi-layer perceptron (MLP) with four hidden layers, as MLPs are relatively simple to implement. The authors of [76] also mention the high levels of true-negative classification associated with MLPs, and comment on the importance of ensuring that mice flows do not flood the optical interconnections in the DC network. In practical DC networks, mice flows actually outnumber elephant flows; the authors in [76] therefore suggest overcoming this class imbalance by training the NN with a non-proportional amount of mice and elephant flows.
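The class-rebalancing step can be sketched as follows (the flow records and the 95/5 split are invented; a real pipeline would extract the 5-tuple and timing features described above before training):

```python
import random

random.seed(2)

# hypothetical labeled flows: many "mice", few "elephants" (class imbalance)
flows = [("mouse", i) for i in range(95)] + [("elephant", i) for i in range(5)]

mice = [f for f in flows if f[0] == "mouse"]
elephants = [f for f in flows if f[0] == "elephant"]

# oversample the minority class so both classes are equally represented
balanced = mice + [random.choice(elephants) for _ in range(len(mice))]
random.shuffle(balanced)

counts = {"mouse": 0, "elephant": 0}
for label, _ in balanced:
    counts[label] += 1
print(counts)  # → {'mouse': 95, 'elephant': 95} (order may vary)
```

Training the MLP on the balanced set prevents it from trivially predicting the majority class; undersampling mice flows or weighting the loss function are common alternatives.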
V-D Path Computation
Path computation or selection, based on different physical- and network-layer parameters, is a commonly studied problem in optical networks. In Section IV, for example, physical layer parameters like QoT, modulation format, OSNR, etc., are estimated using ML techniques. The main aim is to decide on the best optical path among different alternatives. The overall path computation process can therefore be viewed as a cross-layer method applying machine learning techniques in multiple layers. In this subsection we identify references [77] and [78], which address path computation/selection in optical networks from a network layer perspective.
In [77] the authors propose a path and wavelength selection strategy for Optical Burst Switching (OBS) networks to minimize the burst-loss probability. The problem is formulated as a multi-arm bandit problem (MABP) and solved using Q-learning. The MABP originates from the context of gambling, where a player chooses which arm of a slot machine to pull with the objective of maximizing the sum of rewards over many such pulls. In the OBS network scenario, the authors in [77] treat the selection of a path for each source-destination pair as pulling one of the arms of a slot machine, the reward being the minimization of the burst-loss probability. The MABP is a classical problem in reinforcement learning, and the authors propose Q-learning to solve it because other methods do not scale well for complex problems. Furthermore, other methods of solving the MABP, such as dynamic programming, Gittins indices, and learning automata, prove difficult when the reward distributions (i.e., the distributions of the burst-loss probability in the OBS scenario) are unknown. The authors in [77] also argue that the Q-learning algorithm has guaranteed convergence, in contrast to other methods of solving the MABP.
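A minimal sketch of the bandit formulation, assuming two candidate paths for one source-destination pair with invented (and, to the agent, unknown) burst-loss probabilities, can be written as a stateless Q-learning loop with epsilon-greedy exploration:

```python
import random

random.seed(3)

# two candidate paths; the agent does not know these loss probabilities
loss_prob = [0.3, 0.05]          # path 1 loses far fewer bursts
Q = [0.0, 0.0]                   # estimated value (delivery rate) per path
alpha, eps = 0.1, 0.1            # learning rate and exploration probability

for _ in range(5000):
    # epsilon-greedy arm selection
    if random.random() < eps:
        a = random.randrange(2)
    else:
        a = max(range(2), key=lambda i: Q[i])
    reward = 0.0 if random.random() < loss_prob[a] else 1.0  # burst delivered?
    Q[a] += alpha * (reward - Q[a])  # stateless (bandit) Q-learning update

best = max(range(2), key=lambda i: Q[i])
print(f"Q = {[round(q, 2) for q in Q]}, chosen path = {best}")
```

After enough bursts the Q-values approach the true delivery rates and the greedy choice settles on the low-loss path, without any prior model of the loss distributions, which is exactly the advantage claimed in [77].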
In [78] a control plane decision-making module for QoS-aware path computation is proposed using a Fuzzy C-Means Clustering (FCM) algorithm. The FCM algorithm is added to the software-defined optical network (SDON) control plane in order to achieve better network performance compared with a non-cognitive control plane. The FCM algorithm takes traffic requests, lightpath lengths, the set of modulation formats, OSNR, BER, etc., as input and then classifies each lightpath with the best possible physical layer parameters. The output of the classification is a mapping of each lightpath to different physical layer parameters, together with a membership score indicating how closely a lightpath is associated with each parameter. This membership information is then used to generate rules on which real-time decisions to set up lightpaths are based.
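The distinguishing feature of FCM, soft membership scores rather than hard cluster labels, can be shown on a toy dataset (two invented lightpath features, normalized length and OSNR; the real input space in [78] is richer):

```python
# Fuzzy C-Means on toy lightpath features (invented data, 2 clusters)
data = [(0.10, 0.90), (0.15, 0.85), (0.20, 0.95),   # short, high-OSNR paths
        (0.80, 0.20), (0.85, 0.25), (0.90, 0.15)]   # long, low-OSNR paths
C, m = 2, 2.0                                        # clusters and fuzzifier

centers = [data[0], data[-1]]
for _ in range(30):
    # membership U[i][k]: how strongly point i belongs to cluster k
    U = []
    for x in data:
        d = [max(1e-9, sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5)
             for c in centers]
        U.append([1.0 / sum((d[k] / d[j]) ** (2 / (m - 1)) for j in range(C))
                  for k in range(C)])
    # update centers as membership-weighted means
    centers = []
    for k in range(C):
        w = [U[i][k] ** m for i in range(len(data))]
        centers.append(tuple(
            sum(w[i] * data[i][dim] for i in range(len(data))) / sum(w)
            for dim in range(2)))

print("memberships of first path:", [round(u, 3) for u in U[0]])
```

Unlike hard clustering, each lightpath receives a membership score for every cluster; the controller in [78] turns these graded associations into rules for real-time lightpath setup.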
As the discussion in this section shows, different ML algorithms and policies can be used depending on the use cases and applications of interest. Therefore, one can envisage a concise control plane for next-generation optical networks with a repository of different ML algorithms and policies, as shown in Fig. 6. The envisaged control plane in Fig. 6 can be thought of as the ‘brain’ of the network, which constantly interacts with the ‘network body’ (i.e., components such as transponders, amplifiers, links, etc.), reacts to ‘stimuli’ (i.e., data generated by the network) and performs certain ‘actions’ (i.e., path computation, virtual topology (re)configuration, flow classification, etc.).
VI Discussion and Future Directions
In this section we summarize some lessons learned from our literature analysis and discuss our vision of how this research area will expand in the coming years.
First, we notice that the vast majority of existing studies adopting ML at the networking level use offline supervised learning methods, i.e., they assume that the ML algorithms are trained with historical data before being used to take decisions in the field. This assumption is often unrealistic for optical communication networks, where scenarios evolve dynamically over time due, e.g., to traffic variations or to changes in the behavior of optical components caused by aging. Moreover, in practical deployments, it is difficult to collect extensive datasets during faulty operational conditions, since networks are typically dimensioned and managed via conservative design approaches which make the probability of faults negligible (at the price of under-utilization of network resources). We thus envisage that, after learning from a batch of available past samples, other types of algorithms, in the field of semi-supervised and/or unsupervised ML, could be implemented to gradually take in novel input data as they are made available by the network control plane. Moreover, scarce attention has so far been devoted to the fact that different applications may have very different timescales over which monitored data show observable and useful pattern changes: e.g., aging makes component behaviour vary slowly over time, while traffic varies quickly, and at multiple time scales (burst, daily, weekly and yearly level).
Another important consideration is that previously proposed ML-based solutions have addressed specific and isolated issues in optical communications and networking. Considering that software-defined networking has been demonstrated capable of successfully converging control through multiple network layers and technologies, such unified control could also coordinate (orchestrate) several different applications of ML, to provide a holistic design for flexible optical networks. In fact, as seen in the literature, ML algorithms can be adopted to estimate different system characteristics at different layers, such as QoT, fault occurrences, traffic patterns, etc., some of which are mutually dependent (e.g., the QoT of a lightpath is highly related to the presence of faults along its links or in the traversed nodes), whereas others do not exhibit dependency (e.g., traffic patterns and fluctuations typically do not show any dependency on the status of the transmission equipment). More research is needed to explore the applicability and assess the benefits of ML-based unified control frameworks where all the estimated variables can be taken into account when making decisions such as where to route a new lightpath (e.g., in terms of spectrum assignment and core/mode assignment), when to reroute an existing one, or when to modify transmission parameters such as modulation format and baud rate.
Another promising and innovative area for ML application paired with SDN control is network fault recovery. State-of-the-art optical network control tools are typically configured as rule-based expert systems, i.e., sets of expert rules (IF conditions THEN actions) covering typical fault scenarios. Such rules are specialized and deterministic, usually number in the few tens, and cannot cover all possible cases of malfunction. The application of ML to this issue, in addition to its ability to take into account relevant data across all the layers of a network, could also bring in probabilistic characterization (e.g., Gaussian processes, which output probability distributions rather than single numerical/categorical values), thus providing much richer information with respect to currently adopted threshold-based models.
Finally, an interesting, though speculative, area of future research is the application of ML to all-optical devices and networks. Due to their inherent nonlinear behaviour, optical components could be interconnected to form structures capable of implementing learning tasks [87]. This approach represents an all-optical alternative to traditional software implementations. In [88], for example, semiconductor laser diodes were used to create a photonic neural network via time-multiplexing, taking advantage of their nonlinear reaction to power injection due to the coupling of the amplitude and phase of the optical field. In [89], a ML method called “reservoir computing” is implemented via a nanophotonic reservoir constituted by a network of coupled photonic crystal cavities. Thanks to their resonating behavior, power is stored in the cavities and generates nonlinear effects. The network is trained to reproduce periodic patterns (e.g., sums of sine waves).
To conclude, the application of ML to optical networking is a fast-growing research topic, which sees increasingly strong participation from industry and academic researchers. While in this section we could only provide a short discussion of possible future directions, we envisage that many more research topics will soon emerge in this area.
VII Conclusion
Over the past decade, optical networks have been growing ‘smarter’ with the introduction of software-defined networking, coherent transmission and the flexible grid, to name a few arising technical directions. The combined progress towards high-performance hardware and intelligent software, integrated through an SDN platform, provides a solid base for promising innovations in optical networking. Advanced machine learning algorithms can make use of the large quantity of data available from network monitoring elements to ‘learn’ from experience and make networks more agile and adaptive.
Researchers have already started exploring the application of machine learning algorithms to enable smart optical networks; in this paper we have summarized some of the work carried out in the literature and provided insight into new potential research directions.
Glossary
API  Application Programming Interface 
ARIMA  Autoregressive Integrated Moving Average 
BER  Bit Error Rate 
BPSK  Binary Phase Shift Keying 
BVT  Bandwidth Variable Transponders 
CBR  Case Based Reasoning 
CD  Chromatic Dispersion 
CDR  Call Data Records 
CO  Central Office 
DC  Data Center 
DP  Dual Polarization 
DQPSK  Differential Quadrature Phase Shift Keying 
EDFA  Erbium Doped Fiber Amplifier 
ELM  Extreme Learning Machine 
EM  Expectation Maximization 
EON  Elastic Optical Network 
FCM  Fuzzy C-Means Clustering 
FTTH  Fiber-to-the-home 
GA  Genetic Algorithm 
GF  Gain flatness 
GMM  Gaussian Mixture Model 
GMPLS  Generalized Multi-Protocol Label Switching 
GPON  Gigabit Passive Optical Network 
GPR  Gaussian processes nonlinear regression 
HMM  Hidden Markov Model 
IP  Internet Protocol 
LDE  Laser drift estimator 
MABP  Multi-arm bandit problem 
MDP  Markov decision processes 
MF  Modulation Format 
MFR  Modulation Format Recognition 
ML  Machine learning 
MLP  Multilayer perceptron 
MPLS  Multi-Protocol Label Switching 
NF  Noise Figure 
NLI  Nonlinear Interference 
NN  Neural Network 
NPDM  Network planner and decision maker 
NRZ  Non-Return to Zero 
NWDM  Nyquist Wavelength Division Multiplexing 
OBS  Optical Burst Switching 
ODB  Optical Dual Binary 
OFDM  Orthogonal Frequency Division Multiplexing 
ONT  Optical Network Terminal 
ONU  Optical Network Unit 
OOK  On-Off Keying 
OPM  Optical Performance Monitoring 
OSNR  Optical SignaltoNoise Ratio 
PDL  Polarization-Dependent Loss 
PM  Polarization-multiplexed 
PMD  Polarization Mode Dispersion 
POI  Point of Interest 
PON  Passive Optical Network 
PSK  Phase Shift Keying 
QAM  Quadrature Amplitude Modulation 
Q-factor  Quality factor 
QoS  Quality of Service 
QoT  Quality of Transmission 
QPSK  Quadrature Phase Shift Keying 
RF  Random Forest 
RL  Reinforcement Learning 
RWA  Routing and Wavelength Assignment 
RZ  Return to Zero 
SBVT  Sliceable Bandwidth Variable Transponders 
SDN  Software-defined Networking 
SDON  Software-defined Optical Network 
SLA  Service Level Agreement 
SNR  SignaltoNoise Ratio 
SSC  Signal spectrum comparison 
SSV  Signal spectrum verification 
SVM  Support Vector Machine 
VNT  Virtual Network Topology 
VT  Virtual Topology 
VTD  Virtual Topology Design 
WDM  Wavelength Division Multiplexing 
References
 [1] S. Marsland, Machine learning: an algorithmic perspective. CRC press, 2015.
 [2] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153–1176, Oct. 2015.
 [3] T. T. Nguyen and G. Armitage, “A survey of techniques for internet traffic classification using machine learning,” IEEE Communications Surveys & Tutorials, vol. 10, no. 4, pp. 56–76, 4th Q 2008.
 [4] M. Bkassiny, Y. Li, and S. K. Jayaweera, “A survey on machinelearning techniques in cognitive radios,” IEEE Communications Surveys & Tutorials, vol. 15, no. 3, pp. 1136–1159, Oct. 2012.
 [5] B. Mukherjee, Optical WDM networks. Springer Science & Business Media, 2006.
 [6] K. Zhu and B. Mukherjee, “Traffic grooming in an optical WDM mesh network,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 1, pp. 122–133, Jan. 2002.
 [7] S. Ramamurthy and B. Mukherjee, “Survivable WDM mesh networks. Part I: Protection,” in Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM) 1999, vol. 2, Mar. 1999, pp. 744–751.
 [8] [Online]. Available: http://www.lightreading.com/artificialintelligencemachinelearning/cspsembracemachinelearningandai/a/did/737973
 [9] [Online]. Available: http://wwwfile.huawei.com//media/CORPORATE/PDF/white%20paper/WhitePaperonTechnologicalDevelopmentsofOpticalNetworks.pdf
 [10] O. Gerstel, M. Jinno, A. Lord, and S. B. Yoo, “Elastic optical networking: A new dawn for the optical layer?” IEEE Communications Magazine, vol. 50, no. 2, Feb. 2012.
 [11] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
 [12] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification. John Wiley & Sons, 2012.
 [13] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer series in statistics, New York, 2001, vol. 1.
 [14] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [15] I. Macaluso, D. Finn, B. Ozgul, and L. A. DaSilva, “Complexity of spectrum activity and benefits of reinforcement learning for dynamic channel selection,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 11, pp. 2237–2248, Nov. 2013.
 [16] H. Ye, G. Y. Li, and B.H. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Communications Letters, Sep. 2017.
 [17] T. J. O’Shea, T. Erpek, and T. C. Clancy, “Deep learning based MIMO communications,” arXiv preprint arXiv:1707.07980, July 2017.
 [18] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.
 [19] T. Hastie, R. Tibshirani, and J. Friedman, “Unsupervised Learning,” in The elements of statistical learning. Springer, 2009, pp. 485–585.
 [20] O. Chapelle, B. Scholkopf, and A. Zien, Semisupervised learning. MIT Press, 2006.
 [21] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, June 2014.
 [22] J. Wass, J. Thrane, M. Piels, R. Jones, and D. Zibar, “Gaussian Process Regression for WDM System Performance Prediction,” in Optical Fiber Communication Conference (OFC) 2017. Optical Society of America, Mar. 2017, p. Tu3D.7. [Online]. Available: http://www.osapublishing.org/abstract.cfm?URI=OFC2017Tu3D.7
 [23] D. Zibar, O. Winther, N. Franceschi, R. Borkowski, A. Caballero, V. Arlunno, M. N. Schmidt, N. G. Gonzales, B. Mao, Y. Ye, K. J. Larsen, and I. T. Monroy, “Nonlinear impairment compensation using expectation maximization for dispersion managed and unmanaged PDM 16QAM transmission,” Optics Express, vol. 20, no. 26, pp. B181–B196, Dec. 2012. [Online]. Available: http://www.opticsexpress.org/abstract.cfm?URI=oe2026B181
 [24] D. Zibar, M. Piels, R. Jones, and C. G. Schaeffer, “Machine Learning Techniques in Optical Communication,” IEEE/OSA Journal of Lightwave Technology, vol. 34, no. 6, pp. 1442–1452, Mar. 2016. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7359099
 [25] I. T. Monroy, D. Zibar, N. G. Gonzalez, and R. Borkowski, “Cognitive Heterogeneous Reconfigurable Optical Networks (CHRON): Enabling technologies and techniques,” in 13th International Conference on Transparent Optical Networks (ICTON) 2011, June 2011, pp. 1–4.
 [26] J. Thrane, J. Wass, M. Piels, J. C. M. Diniz, R. Jones, and D. Zibar, “Machine Learning Techniques for Optical Performance Monitoring From Directly Detected PDMQAM Signals,” IEEE/OSA Journal of Lightwave Technology, vol. 35, no. 4, pp. 868–875, Feb. 2017.
 [27] M. Angelou, Y. Pointurier, D. Careglio, S. Spadaro, and I. Tomkos, “Optimized monitor placement for accurate QoT assessment in core optical networks,” IEEE/OSA Journal of Optical Communications and Networking, vol. 4, no. 1, pp. 15–24, 2012.
 [28] I. Sartzetakis, K. Christodoulopoulos, C. Tsekrekos, D. Syvridis, and E. Varvarigos, “Quality of transmission estimation in WDM and elastic optical networks accounting for space–spectrum dependencies,” IEEE/OSA Journal of Optical Communications and Networking, vol. 8, no. 9, pp. 676–688, Sep. 2016.
 [29] Y. Pointurier, M. Coates, and M. Rabbat, “Crosslayer monitoring in transparent optical networks,” IEEE/OSA Journal of Optical Communications and Networking, vol. 3, no. 3, pp. 189–198, Feb. 2011.
 [30] N. Sambo, Y. Pointurier, F. Cugini, L. Valcarenghi, P. Castoldi, and I. Tomkos, “Lightpath establishment assisted by offline QoT estimation in transparent optical networks,” IEEE/OSA Journal of Optical Communications and Networking, vol. 2, no. 11, pp. 928–937, Mar. 2010.
 [31] A. Caballero, J. C. Aguado, R. Borkowski, S. Saldaña, T. Jiménez, I. de Miguel, V. Arlunno, R. J. Durán, D. Zibar, J. B. Jensen et al., “Experimental demonstration of a cognitive quality of transmission estimator for optical communication systems,” Optics Express, vol. 20, no. 26, pp. B64–B70, Nov. 2012.
 [32] T. Jiménez, J. C. Aguado, I. de Miguel, R. J. Durán, M. Angelou, N. Merayo, P. Fernández, R. M. Lorenzo, I. Tomkos, and E. J. Abril, “A cognitive quality of transmission estimator for core optical networks,” IEEE/OSA Journal of Lightwave Technology, vol. 31, no. 6, pp. 942–951, Jan. 2013.
 [33] I. de Miguel, R. J. Durán, T. Jiménez, N. Fernández, J. C. Aguado, R. M. Lorenzo, A. Caballero, I. T. Monroy, Y. Ye, A. Tymecki et al., “Cognitive dynamic optical networks,” IEEE/OSA Journal of Optical Communications and Networking, vol. 5, no. 10, pp. A107–A118, Oct. 2013.
 [34] L. Barletta, A. Giusti, C. Rottondi, and M. Tornatore, “QoT estimation for unestablished lightpaths using machine learning,” in Optical Fiber Communications Conference (OFC) 2017, Mar. 2017, pp. 1–3.
 [35] E. Seve, J. Pesic, C. Delezoide, and Y. Pointurier, “Learning process for reducing uncertainties on network parameters and design margins,” in Optical Fiber Communications Conference (OFC) 2017. IEEE, Mar. 2017, pp. 1–3.
 [36] T. Panayiotou, G. Ellinas, and S. P. Chatzis, “A data-driven QoT decision approach for multicast connections in metro optical networks,” in International Conference on Optical Network Design and Modeling (ONDM) 2016. IEEE, May 2016, pp. 1–6.
 [37] T. Panayiotou, S. Chatzis, and G. Ellinas, “Performance analysis of a data-driven quality-of-transmission decision approach on a dynamic multicast-capable metro optical network,” IEEE/OSA Journal of Optical Communications and Networking, vol. 9, no. 1, pp. 98–108, Jan. 2017.
 [38] T. Tanimura, T. Hoshida, J. C. Rasmussen, M. Suzuki, and H. Morikawa, “OSNR monitoring by deep neural networks trained with asynchronously sampled data,” in Opto-Electronics and Communications Conference (OECC) 2016. IEEE, Oct. 2016, pp. 1–3.
 [39] X. Wu, J. A. Jargon, R. A. Skoog, L. Paraschis, and A. E. Willner, “Applications of artificial neural networks in optical performance monitoring,” IEEE/OSA Journal of Lightwave Technology, vol. 27, no. 16, pp. 3580–3589, 2009.
 [40] J. A. Jargon, X. Wu, H. Y. Choi, Y. C. Chung, and A. E. Willner, “Optical performance monitoring of QPSK data channels by use of neural networks trained with parameters derived from asynchronous constellation diagrams,” Optics Express, vol. 18, no. 5, pp. 4931–4938, Mar. 2010.
 [41] T. S. R. Shen, K. Meng, A. P. T. Lau, and Z. Y. Dong, “Optical performance monitoring using artificial neural network trained with asynchronous amplitude histograms,” IEEE Photonics Technology Letters, vol. 22, no. 22, pp. 1665–1667, Nov. 2010.
 [42] F. N. Khan, T. S. R. Shen, Y. Zhou, A. P. T. Lau, and C. Lu, “Optical performance monitoring using artificial neural networks trained with empirical moments of asynchronously sampled signal amplitudes,” IEEE Photonics Technology Letters, vol. 24, no. 12, pp. 982–984, June 2012.
 [43] T. B. Anderson, A. Kowalczyk, K. Clarke, S. D. Dods, D. Hewitt, and J. C. Li, “Multi impairment monitoring for optical networks,” IEEE/OSA Journal of Lightwave Technology, vol. 27, no. 16, pp. 3729–3736, Aug. 2009.
 [44] U. Moura, M. Garrich, H. Carvalho, M. Svolenski, A. Andrade, A. C. Cesar, J. Oliveira, and E. Conforti, “Cognitive methodology for optical amplifier gain adjustment in dynamic DWDM networks,” IEEE/OSA Journal of Lightwave Technology, vol. 34, no. 8, pp. 1971–1979, Jan. 2016.
 [45] J. R. Oliveira, A. Caballero, E. Magalhães, U. Moura, R. Borkowski, G. Curiel, A. Hirata, L. Hecker, E. Porto, D. Zibar et al., “Demonstration of EDFA cognitive gain control via GMPLS for mixed modulation formats in heterogeneous optical networks,” in Optical Fiber Communication Conference (OFC) 2013. Optical Society of America, Mar. 2013, pp. OW1H–2.
 [46] E. d. A. Barboza, C. J. Bastos-Filho, J. F. Martins-Filho, U. C. de Moura, and J. R. de Oliveira, “Self-adaptive erbium-doped fiber amplifiers using machine learning,” in SBMO/IEEE MTT-S International Microwave & Optoelectronics Conference (IMOC) 2013. IEEE, Oct. 2013, pp. 1–5.
 [47] C. J. Bastos-Filho, E. d. A. Barboza, J. F. Martins-Filho, U. C. de Moura, and J. R. de Oliveira, “Mapping EDFA noise figure and gain flatness over the power mask using neural networks,” Journal of Microwaves, Optoelectronics and Electromagnetic Applications (JMOe), vol. 12, pp. 128–139, July 2013.
 [48] Y. Huang, C. L. Gutterman, P. Samadi, P. B. Cho, W. Samoud, C. Ware, M. Lourdiane, G. Zussman, and K. Bergman, “Dynamic mitigation of EDFA power excursions with machine learning,” Optics Express, vol. 25, no. 3, pp. 2245–2258, Feb. 2017.
 [49] U. C. de Moura, J. R. Oliveira, J. C. Oliveira, and A. C. César, “EDFA adaptive gain control effect analysis over an amplifier cascade in a DWDM optical system,” in SBMO/IEEE MTT-S International Microwave & Optoelectronics Conference (IMOC) 2013. IEEE, Oct. 2013, pp. 1–5.
 [50] R. Boada, R. Borkowski, and I. T. Monroy, “Clustering algorithms for Stokes space modulation format recognition,” Optics Express, vol. 23, no. 12, pp. 15521–15531, June 2015.
 [51] N. G. Gonzalez, D. Zibar, and I. T. Monroy, “Cognitive digital receiver for burst mode phase modulated radio over fiber links,” in European Conference on Optical Communication (ECOC) 2010. IEEE, Sep. 2010, pp. 1–3.
 [52] S. Zhang, Y. Peng, Q. Sui, J. Li, and Z. Li, “Modulation format identification in heterogeneous fiber-optic networks using artificial neural networks and genetic algorithms,” Photonic Network Communications, vol. 32, no. 2, pp. 246–252, Feb. 2016.
 [53] F. N. Khan, Y. Zhou, A. P. T. Lau, and C. Lu, “Modulation format identification in heterogeneous fiber-optic networks using artificial neural networks,” Optics Express, vol. 20, no. 11, pp. 12422–12431, May 2012.
 [54] F. N. Khan, K. Zhong, W. H. AlArashi, C. Yu, C. Lu, and A. P. T. Lau, “Modulation format identification in coherent receivers using deep machine learning,” IEEE Photonics Technology Letters, vol. 28, no. 17, pp. 1886–1889, Sep. 2016.
 [55] R. Borkowski, D. Zibar, A. Caballero, V. Arlunno, and I. T. Monroy, “Stokes spacebased optical modulation format recognition for digital coherent receivers,” IEEE Photonics Technology Letters, vol. 25, no. 21, pp. 2129–2132, Nov. 2013.
 [56] D. Zibar, J. Thrane, J. Wass, R. Jones, M. Piels, and C. Schaeffer, “Machine learning techniques applied to system characterization and equalization,” in Optical Fiber Communications Conference (OFC) 2016, Mar. 2016, pp. 1–3.
 [57] T. S. R. Shen and A. P. T. Lau, “Fiber nonlinearity compensation using extreme learning machine for DSP-based coherent communication systems,” in Opto-Electronics and Communications Conference (OECC) 2011. IEEE, July 2011, pp. 816–817.
 [58] D. Wang, M. Zhang, M. Fu, Z. Cai, Z. Li, H. Han, Y. Cui, and B. Luo, “Nonlinearity Mitigation Using a Machine Learning Detector Based on Nearest Neighbors,” IEEE Photonics Technology Letters, vol. 28, no. 19, pp. 2102–2105, Apr. 2016.
 [59] E. Giacoumidis, S. Mhatli, M. F. Stephens, A. Tsokanos, J. Wei, M. E. McCarthy, N. J. Doran, and A. D. Ellis, “Reduction of Nonlinear Intersubcarrier Intermixing in Coherent Optical OFDM by a Fast Newton-Based Support Vector Machine Nonlinear Equalizer,” IEEE/OSA Journal of Lightwave Technology, vol. 35, no. 12, pp. 2391–2397, Mar. 2017.
 [60] D. Wang, M. Zhang, Z. Li, Y. Cui, J. Liu, Y. Yang, and H. Wang, “Nonlinear decision boundary created by a machine learning-based classifier to mitigate nonlinear phase noise,” in European Conference on Optical Communication (ECOC) 2015. IEEE, Oct. 2015, pp. 1–3.
 [61] M. A. Jarajreh, E. Giacoumidis, I. Aldaya, S. T. Le, A. Tsokanos, Z. Ghassemlooy, and N. J. Doran, “Artificial neural network nonlinear equalizer for coherent optical OFDM,” IEEE Photonics Technology Letters, vol. 27, no. 4, pp. 387–390, Feb. 2015.
 [62] N. Fernández, R. J. Durán, I. de Miguel, N. Merayo, P. Fernández, J. C. Aguado, R. M. Lorenzo, E. J. Abril, E. Palkopoulou, and I. Tomkos, “Virtual Topology Design and reconfiguration using cognition: Performance evaluation in case of failure,” in 5th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT) 2013, Sep. 2013, pp. 132–139.
 [63] N. Fernández, R. J. D. Barroso, D. Siracusa, A. Francescon, I. de Miguel, E. Salvadori, J. C. Aguado, and R. M. Lorenzo, “Virtual topology reconfiguration in optical networks by means of cognition: Evaluation and experimental validation [Invited],” IEEE/OSA Journal of Optical Communications and Networking, vol. 7, no. 1, pp. A162–A173, Jan. 2015.
 [64] F. Morales, M. Ruiz, and L. Velasco, “Virtual network topology reconfiguration based on big data analytics for traffic prediction,” in Optical Fiber Communications Conference (OFC) 2016, Mar. 2016, pp. 1–3.
 [65] F. Morales, M. Ruiz, L. Gifre, L. M. Contreras, V. Lopez, and L. Velasco, “Virtual network topology adaptability based on data analytics for traffic prediction,” IEEE/OSA Journal of Optical Communications and Networking, vol. 9, no. 1, pp. A35–A45, Jan. 2017.
 [66] N. Fernández, R. J. Durán, I. de Miguel, N. Merayo, D. Sánchez, M. Angelou, J. C. Aguado, P. Fernández, T. Jiménez, R. M. Lorenzo, I. Tomkos, and E. J. Abril, “Cognition to design energetically efficient and impairment aware virtual topologies for optical networks,” in International Conference on Optical Network Design and Modeling (ONDM) 2012, Apr. 2012, pp. 1–6.
 [67] N. Fernández, R. J. Durán, I. de Miguel, N. Merayo, J. C. Aguado, P. Fernández, T. Jiménez, I. Rodríguez, D. Sánchez, R. M. Lorenzo, E. J. Abril, M. Angelou, and I. Tomkos, “Survivable and impairment-aware virtual topologies for reconfigurable optical networks: A cognitive approach,” in IV International Congress on Ultra Modern Telecommunications and Control Systems 2012, Oct. 2012, pp. 793–799.
 [68] S. Troía, G. Sheng, R. Alvizu, G. A. Maier, and A. Pattavina, “Identification of tidal-traffic patterns in metro-area mobile networks via Matrix Factorization based model,” in International Conference on Pervasive Computing and Communications Workshops (PerCom WS) 2017, Mar. 2017, pp. 297–301.
 [69] M. Ruiz, F. Fresi, A. P. Vela, G. Meloni, N. Sambo, F. Cugini, L. Poti, L. Velasco, and P. Castoldi, “Service-triggered failure identification/localization through monitoring of multiple parameters,” in European Conference on Optical Communication (ECOC) 2016. VDE, Sep. 2016, pp. 1–3.
 [70] S. R. Tembo, S. Vaton, J. L. Courant, and S. Gosselin, “A tutorial on the EM algorithm for Bayesian networks: Application to self-diagnosis of GPON-FTTH networks,” in International Wireless Communications and Mobile Computing Conference (IWCMC) 2016, Sep. 2016, pp. 369–376.
 [71] S. Gosselin, J. L. Courant, S. R. Tembo, and S. Vaton, “Application of probabilistic modeling and machine learning to the diagnosis of FTTH GPON networks,” in International Conference on Optical Network Design and Modeling (ONDM) 2017, May 2017, pp. 1–3.
 [72] K. Christodoulopoulos, N. Sambo, and E. M. Varvarigos, “Exploiting network kriging for fault localization,” in Optical Fiber Communication Conference (OFC) 2016. Optical Society of America, 2016, pp. W1B–5.
 [73] A. P. Vela, M. Ruiz, F. Fresi, N. Sambo, F. Cugini, G. Meloni, L. Potì, L. Velasco, and P. Castoldi, “BER degradation detection and failure identification in elastic optical networks,” IEEE/OSA Journal of Lightwave Technology, vol. 35, no. 21, pp. 4595–4604, Nov. 2017.
 [74] A. P. Vela, B. Shariati, M. Ruiz, F. Cugini, A. Castro, H. Lu, R. Proietti, J. Comellas, P. Castoldi, S. J. B. Yoo, and L. Velasco, “Soft failure localization during commissioning testing and lightpath operation,” IEEE/OSA Journal of Optical Communications and Networking, vol. 10, no. 1, pp. A27–A36, Jan. 2018. [Online]. Available: http://jocn.osa.org/abstract.cfm?URI=jocn101A27
 [75] A. Jayaraj, T. Venkatesh, and C. S. Murthy, “Loss Classification in Optical Burst Switching Networks Using Machine Learning Techniques: Improving the Performance of TCP,” IEEE Journal on Selected Areas in Communications, vol. 26, no. 6, pp. 45–54, Aug. 2008. [Online]. Available: http://dx.doi.org/10.1109/JSACOCN.2008.033508
 [76] H. Rastegarfar, M. Glick, N. Viljoen, M. Yang, J. Wissinger, L. Lacomb, and N. Peyghambarian, “TCP flow classification and bandwidth aggregation in optically interconnected data center networks,” IEEE/OSA Journal of Optical Communications and Networking, vol. 8, no. 10, pp. 777–786, Oct. 2016.
 [77] Y. V. Kiran, T. Venkatesh, and C. S. Murthy, “A Reinforcement Learning Framework for Path Selection and Wavelength Selection in Optical Burst Switched Networks,” IEEE Journal on Selected Areas in Communications, vol. 25, no. 9, pp. 18–26, Dec. 2007. [Online]. Available: http://dx.doi.org/10.1109/JSACOCN.2007.028806
 [78] T. R. Tronco, M. Garrich, A. C. César, and M. d. L. Rocha, “Cognitive algorithm using fuzzy reasoning for software-defined optical network,” Photonic Network Communications, vol. 32, no. 2, pp. 281–292, Oct. 2016. [Online]. Available: https://doi.org/10.1007/s11107-016-0628-1
 [79] K. Christodoulopoulos et al., “ORCHESTRA – Optical performance monitoring enabling flexible networking,” in 17th International Conference on Transparent Optical Networks (ICTON) 2015, Budapest, Hungary, July 2015, pp. 1–4.
 [80] D. B. Chua, E. D. Kolaczyk, and M. Crovella, “Network kriging,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 12, pp. 2263–2272, Dec. 2006.
 [81] R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network tomography: Recent developments,” Statistical science, pp. 499–517, Aug. 2004.
 [82] M. Bouda, S. Oda, O. Vasilieva, M. Miyabe, S. Yoshida, T. Katagiri, Y. Aoki, T. Hoshida, and T. Ikeuchi, “Accurate prediction of quality of transmission with dynamically configurable optical impairment model,” in Optical Fiber Communications Conference (OFC) 2017. IEEE, Mar. 2017, pp. 1–3.
 [83] “ARIMA models for time series forecasting.” [Online]. Available: http://people.duke.edu/~rnau/411arim.htm
 [84] L. Gifre, F. Morales, L. Velasco, and M. Ruiz, “Big data analytics for the virtual network topology reconfiguration use case,” in 18th International Conference on Transparent Optical Networks (ICTON) 2016, July 2016, pp. 1–4.
 [85] T. Ohba, S. Arakawa, and M. Murata, “A Bayesianbased approach for virtual network reconfiguration in elastic optical path networks,” in Optical Fiber Communications Conference (OFC) 2017, Mar. 2017, pp. 1–3.
 [86] S. R. Tembo, J. L. Courant, and S. Vaton, “A 3-layered self-reconfigurable generic model for self-diagnosis of telecommunication networks,” in 2015 SAI Intelligent Systems Conference (IntelliSys), Nov. 2015, pp. 25–34.
 [87] D. Woods and T. J. Naughton, “Optical computing: Photonic neural networks,” Nature Physics, vol. 8, no. 4, pp. 257–259, July 2012.
 [88] D. Brunner, M. Soriano, C. Mirasso, and I. Fischer, “High speed, high performance all-optical information processing utilizing nonlinear optical transients,” in The European Conference on Lasers and Electro-Optics. Optical Society of America, 2013, p. CD_10_3.
 [89] M. Fiers, K. Vandoorne, T. Van Vaerenbergh, J. Dambre, B. Schrauwen, and P. Bienstman, “Optical information processing: Advances in nanophotonic reservoir computing,” in 14th International Conference on Transparent Optical Networks (ICTON) 2012. IEEE, Aug. 2012, pp. 1–4.