Multivariate, Multi-Step Forecasting, Reconstruction and Feature Selection of Ocean Waves via Recurrent and Sequence-to-Sequence Networks
Abstract
This article explores the concepts of ocean wave multivariate multi-step forecasting, reconstruction, and feature selection. We introduce recurrent neural network frameworks, integrated with Bayesian hyperparameter optimization and Elastic Net methods. We consider both short- and long-term forecasts and reconstruction, for significant wave height and output power of the ocean waves. Sequence-to-sequence neural networks are developed, for the first time, to reconstruct the missing characteristics of ocean waves based on information from nearby wave sensors. Our results indicate that the Adam and AMSGrad optimization algorithms are the most robust ones for optimizing the sequence-to-sequence network. For the case of significant wave height reconstruction, we compare the proposed methods with alternatives on a well-studied dataset and show the superiority of the proposed methods with respect to several error metrics. We design a new case study based on measurement stations along the east coast of the United States and investigate the feature selection concept. Comparisons substantiate the benefit of utilizing Elastic Net. Moreover, case study results indicate that when the number of features is considerable, deeper structures improve performance.
Keywords: Sequence-to-Sequence Deep Networks, Multivariate Multi-Step Forecasting, Feature Selection, Elastic Net, Spearmint Bayesian Optimization.
1 Introduction
The intricate and ever-changing character of irregular waves necessitates a framework to estimate wave features in advance. Bringing the energy generated by ocean waves into the consumer’s portfolio requires a consistent system capable of predicting ocean wave uncertainties. Most research on ocean wave prediction has focused on model-based approaches, such as physics-based models that attempt to reproduce the frequency spectra of ocean waves and then predict the resulting wave energy. However, these models are difficult to generalize.
Machine learning (ML) techniques make model-free predictions of ocean wave characteristics practical and computationally efficient by exploiting the vast number of records collected in diverse bodies of water all around the world. We introduce recurrent networks integrated with Bayesian optimization and Elastic Net techniques to forecast future wave features and reconstruct missing data simultaneously.
There is a substantial literature on SWH forecasting; see Section 2 for a review. These papers investigate short- to long-term predictions of SWH at coastal and offshore sites with different water depths. Although there are marine applications based solely on SWH estimation, in most cases SWH is used as an input to evaluate other characteristics of an ocean wave, such as energy flux. Energy flux is the expected total sum of the wave energy available at a given location, and it plays a major role in designing marine structures from at least two broad perspectives. First, by modeling the energy flux of any potential location, one may identify remote locations where Wave Energy Converters (WEC) can be installed. Second, renewable energy sources of all sorts are well known to be less reliable and less resilient against disruptions than thermal power plants. Although wave energy is more predictable than wind or solar, it still suffers from randomness. Having a concrete framework to forecast significant wave height and ocean energy flux may directly improve the resiliency of electricity produced by this green source.
We propose a model that uses neural networks (NN) to forecast wave characteristics. Most studies that have used NNs to forecast or reconstruct wave characteristics use fully connected networks receiving inputs such as significant wave height ($H_s$) and/or average wave period ($T$) and producing outputs such as power ($P$) and/or $H_s$. Recurrent Neural Networks (RNN), which contain natural temporal properties, have been considered in a few cases as well. The majority of these cases focus solely on predicting $H_s$. To the best of our knowledge, no studies have used sequence-to-sequence NNs to predict energy flux or reconstruct features. In this paper, we investigate the concept of multi-step-ahead univariate and multivariate forecasting and feature reconstruction. Our contributions are as follows:

We introduce two main networks. The first one is a simple combination of Long Short-Term Memory plus Fully Connected Layers at the end (LSTM+FCL). The LSTM part can be a single- or multi-layered structure. The second network is a sequence-to-sequence (seq-to-seq) [1] method containing two LSTMs as its encoder and decoder.

We expand the framework to tackle the problem of reconstructing missing ocean wave data and feature selection based on information from nearby buoys. We perform a detailed comparison between our models and the state-of-the-art papers on wave feature reconstruction [2, 3] and demonstrate that our approach outperforms those in the literature in terms of capturing missing information.

We modify an epoch-scheduled training scheme that is suitable for time series analysis in general, as well as an Elastic Net concept, and we train the models based on these proposed schemes. A comparison between single- and multi-layered RNN, LSTM, and seq-to-seq methods is conducted.

We conduct Spearmint Bayesian optimization (GP-EI) to hypertune the models’ parameters. Throughout the paper, we indicate the $i$th parameter that must be tuned as (*$i$*).

We integrate the Elastic Net concept into the model for the purpose of feature selection.

We investigate the performance of four different optimization algorithms (SGD, RMSProp, Adam, and AMSGrad) for the seq-to-seq network based on years of ocean wave data.

We design a new dataset and a case study. The dataset is general and can be used for any feature selection and/or multivariate regression purposes. (Datasets are at the repository: NOAARefinedStations)
The remainder of this paper is structured as follows. In Section 2, we discuss the necessary background on ocean waves and provide a concise literature review. We then introduce the LSTM+FCL and seq-to-seq models and epoch-scheduled training in Section 3. Section 4 is dedicated to several comparisons, including long-term forecasts. The reconstruction framework is addressed in Section 5. Section 6 discusses feature selection with the Elastic Net technique. The paper concludes in Section 7.
2 Background and Literature Review
Statistical properties of the ocean surface suggest that for a given time and location, ocean waves may be viewed as the summation of a considerable number of independent regular waves caused by wind sources or wave interactions [4]. Therefore, the ocean surface can be modeled as a zero-mean Gaussian stochastic process, considering all these waves [4, 5]. Buoys that contain wave-measurement sensors take a large number of samples from this process over time, and then one may recover the spectral wave density $S(f)$, which is the fast Fourier transform of the covariance matrix of the sea surface elevation [6, 7, 8].
Once $S(f)$ of a surface wave has been estimated from the measurements collected, the Spectral Density Moment (SDM) of order $n$ is calculated as:
$m_n = \int_0^{\infty} f^n S(f)\, df$  (1)
By definition, we have $H_s = 4\sqrt{m_0}$ and average wave period $T = m_{-1}/m_0$. The energy flux $P$ is then calculated as follows:

$P = \frac{\rho g^2}{64\pi} H_s^2 T$  (2)

where $\rho$ is the ocean density and $g$ is the gravity constant.
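To make the pipeline from spectrum to energy flux concrete, the following sketch computes the spectral moments and the energy flux from a discretized spectral density estimate, assuming the deep-water relations above ($H_s = 4\sqrt{m_0}$, $T = m_{-1}/m_0$, $P = \rho g^2 H_s^2 T / 64\pi$). The function names, the trapezoidal integration, and the default seawater density are illustrative choices, not the paper's code.

```python
import numpy as np

def spectral_moment(freqs, S, n):
    """n-th spectral density moment m_n = integral of f^n S(f) df,
    approximated with the trapezoidal rule on a discretized spectrum."""
    y = freqs ** n * S
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(freqs)))

def wave_features(freqs, S, rho=1025.0, g=9.81):
    """Significant wave height, average (energy) period, and energy flux
    from a spectral density estimate S(f).  rho defaults to a typical
    seawater density (an assumption, not from the paper)."""
    m0 = spectral_moment(freqs, S, 0)
    m_neg1 = spectral_moment(freqs, S, -1)
    hs = 4.0 * np.sqrt(m0)                         # H_s = 4 sqrt(m_0)
    t = m_neg1 / m0                                # T = m_{-1} / m_0
    p = rho * g ** 2 / (64.0 * np.pi) * hs ** 2 * t  # energy flux (W/m)
    return hs, t, p
```

For NDBC data, `freqs` and `S` would come from a station's spectral wave density file.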
The rest of this section contains three subsections. First, we provide a concise review of the literature on forecasting ocean wave characteristics. Second, we explore the novel work on oceanic wave feature reconstruction. Third, we discuss seq-to-seq models from a machine learning point of view.
2.1 Ocean Characteristic Forecasting
Wave forecasting models can be categorized into model-based and model-free frameworks. Model-based approaches aim to use physical concepts such as climatic pressure, frictional dissipation and environmental interactions to find precise equations mimicking the behavior of wind waves. These methods are further classified, based on their efforts to explicitly parameterize ocean wave interactions [9], into first, second and third generations. First-generation models try to construct the spectral wave structure solely based on linear wave interactions. These models overestimate the power generated by water because of this linear simplification, which ignores any nonlinear transfer [10]. Second-generation models such as JONSWAP [11] examine coupled discrete spectral structures in such a way that the wave nonlinearities can be parameterized [12]. The most mature models are third-generation wave models such as WAM and Simulating WAves Nearshore (SWAN) [13], which consider all possible generation, dissipation and nonlinear wave-wave interactions [10] along with current-wave interactions [9].
In contrast, data-driven, model-free approaches have become more popular recently with the advent of ML techniques. Fuzzy Systems (FS) [14, 15, 16], Evolutionary Algorithms (EA) [2, 3], Support Vector Machines (SVM) [17], and Deep Neural Networks [18, 19, 20] are so-called “soft computing techniques,” which focus on the data structure in order to investigate possible relations and dependencies to forecast uncertain future events. Neural Networks are among the most powerful tools to approximate almost any nonlinear and complex functional relation. Recurrent Neural Networks (RNNs), a subclass of Neural Networks, exploit their internal memories to express temporal dynamic behavior, which makes them a suitable framework for forecasting complex systems. [21] explore the efficacy of several ML methods in terms of their accuracy in forecasting the significant wave height.
The general goal of any forecasting method is to find an accurate short- or long-term forecast of the variable under study. (See [22, 23] for recent forecasting studies.) However, there exist ML approaches particularly designed for forecasting wave characteristics. For a more thorough discussion of the literature on ocean power and significant wave height forecasting, one may refer to past reviews [24, 25, 26, 27, 28, 29, 30]. [31] implements a Nonlinear Autoregressive (NAR) neural network to forecast exponentially smoothed ocean wave power using Irish Marine Institute data. A Minimal Resource Allocation Network (MRAN) and a Growing and Pruning Radial Basis Function (GAP-RBF) network are implemented and tested on three geographical locations by [20]. The significance of a node in the GAP-RBF network is measured as its contribution to the network output, and the node is added or pruned accordingly. Cascade-forward and feed-forward neural networks are implemented in [32] to predict the wave power itself in the absence of spectral wave data. An ensemble of Extreme Learning Machines (ELM) is presented in [33] to predict the daily wave height. The authors use the previous hours’ wave heights along with features such as the air-sea temperature difference, atmospheric pressure, and wind speed to predict the next 6-hour wave height. Two computationally efficient supervised ML approaches are introduced by [34] and compared with the SWAN model [35]. An integrated numerical and ANN approach is introduced by [36] to predict waves 24 hours in advance at different buoys along the Indian coastline. Nonlinear and nonstationary $H_s$ series are studied in [37] based on integrated Empirical Mode Decomposition Support Vector Regression (EMD-SVR). Forecasting of extreme events such as hurricanes is examined by Dixit et al. [38] via a Neuro-Wavelet Technique (NWT). A recent work [39] aims at predicting $H_s$ 48 hours into the future utilizing a hybrid model combining neural networks with wavelets (WLNN).
As mentioned before, none of these works considers sequence-to-sequence networks or investigates their short- to long-term forecasting performance, let alone their combination with Spearmint Bayesian optimization and the Elastic Net technique.
2.2 Reconstruction of Ocean Characteristics
In contrast to forecasting frameworks, in which the goal is typically to estimate wave features such as $H_s$ and $P$ in the future based on historical data, reconstruction models aim to use available information about wave features to reconstruct $H_s$, $P$, or other (usually missing) features. Here, the assumption is that the model has access to up-to-date measurements of the wave features, except for the one(s) it wants to reconstruct. This is commonly due to missing measurement data. Therefore, the prediction extends at most one step (or a few steps) into the future. Hence, the aim is to tackle the problem of extracting the ocean wave information of a location purely based on other available features. The framework is useful for estimating the missing data of a station using the knowledge obtained from its neighbors, but any available information from the same station can be utilized as well.
The paper [2] and its subsequent improvement [3] are dedicated to reconstructing ocean waves based on Evolutionary Algorithms (EA). The authors address the problem primarily via a Grouping Genetic Algorithm (GGA) and a Bayesian-optimization Grouping Genetic Algorithm (BO-GGA). They utilize GGA and/or BO-GGA to select the wave features of nearby stations suitable for predicting the desired wave feature of the location with missing data, and then obtain their predictions via simple ELM or SVM. The paper reconstructs the significant wave height of the NOAA buoy 46069 for the year 2010 based solely upon the information provided by two adjacent buoys.
2.3 SequencetoSequence Neural Networks
Deep Neural Networks (DNN) are among the most successful tools for classification and regression, achieving state-of-the-art performance in many applications. Although a simple feed-forward DNN can be applied in many systems, it requires the system to have fixed input and output sizes. Recurrent Neural Networks such as Long Short-Term Memory (LSTM) networks tackle this limitation in the sense that they do not need a fixed input size [40]. LSTMs can observe an input sequence of arbitrary size sequentially (one time step at a time) to provide the rest of the network with a large, fixed-size vector representing the input [41, 42]. Furthermore, LSTMs remember long-range feature propagation based on a sigmoid layer called the “forget gate.” Seq-to-seq models use two separate recurrent structures. They differ from basic recurrent networks in the sense that the network fully reads the input sequences before it generates any outputs. The first network is usually an LSTM which reads (encodes) inputs of any size and maps them to a fixed-size output. The second structure generally receives the fixed-size output vector of the first LSTM and maps (decodes) it to a desirable output space [1]. Having encoding and decoding as separate steps gives the model flexibility and stability when handling complex sequence structures [43].
An important technique for training recurrent networks is Teacher Forcing (TF), which forces the network to observe the previous ground-truth output instead of the one it has already predicted. In other words, TF, while keeping the network’s complex structure, converts any long-term prediction structure into one-step-ahead forecasts. This can greatly increase the network’s ability to learn, thereby reducing its learning time [44, 45]. On the other hand, one can argue that the new model is not solving the same problem anymore. Therefore, there often exists a large gap between the testing error and the training error for models trained by TF; that is, the model encounters a major overfitting problem. Hence, [46] introduces a modification, scheduled training, that captures the benefit of TF while avoiding overfitting. Scheduled training is a soft technique. That is, it starts with a TF scheme and, after the model passes the warm-up stage and the network’s weights have moved in the correct update direction, it alters the original network according to a specified scheme. Therefore, the model enjoys stability and the probability of overfitting decreases. In our work, we revisit scheduled training and introduce epoch-scheduled training for forecasting.
3 The Models
3.1 Overview
In this section, we discuss the proposed models. We denote an input sequence to the model as $x_1, \ldots, x_T$ and its output sequence as $y_1, \ldots, y_{T'}$, where $T$ and $T'$ need not be equal. In the case of predicting $P$, we have $x_t \in \mathbb{R}$ and $y_t \in \mathbb{R}$.
The purpose of the models is to calculate the conditional distribution $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$. $T'$ indicates how far into the future the forecast should go. For example, for hourly-resolution data, if the model is a day-ahead energy flux forecast, then $T' = 24$. $T$, on the other hand, is a model parameter and must be tuned (*1*).
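As a concrete illustration of how $T$ and $T'$ shape the training pairs, the following hypothetical windowing helper slices a series into input/output sequences. The `horizon` convention and all names are our own assumptions for illustration, not the paper's code.

```python
import numpy as np

def make_windows(series, t_in, t_out, horizon=1):
    """Split a 1-D series into (input, target) pairs: t_in past steps
    are used to predict t_out future steps starting `horizon` steps
    after the last input.  horizon=1 means the target starts at the
    very next time step."""
    series = np.asarray(series)
    n_pairs = len(series) - t_in - (horizon - 1) - t_out + 1
    X, Y = [], []
    for s in range(n_pairs):
        X.append(series[s : s + t_in])
        start = s + t_in + horizon - 1
        Y.append(series[start : start + t_out])
    return np.array(X), np.array(Y)
```

For a day-ahead hourly forecast, one would set `t_out=24` and choose `t_in` as the tuned time-step parameter (*1*).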
$f_t = \sigma(W_f [h_{t-1}, x_t])$  (3)
$i_t = \sigma(W_i [h_{t-1}, x_t])$  (4)
$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t])$  (5)
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$  (6)
$o_t = \sigma(W_o [h_{t-1}, x_t])$  (7)
$h_t = o_t \circ \tanh(c_t)$  (8)
where $x_t$, $h_t$, $f_t$, $i_t$, $o_t$, $c_t$, and $\tilde{c}_t$ for all $t$ are the input, hidden state, forget activation, input activation, output activation, cell state and auxiliary cell state vectors, respectively, for the LSTM network. $W_f$, $W_i$, $W_c$, and $W_o$ are the weight matrices with dimensions corresponding to the gate vectors and $[h_{t-1}, x_t]$. The sigmoid function is given by $\sigma(z) = 1/(1 + e^{-z})$; it is monotonically increasing, with return values in the open interval $(0, 1)$. The notation $\circ$ indicates the elementwise (Hadamard) product, which exists only if the operands have the same dimensions.
The weight matrices can be initialized randomly with a Gaussian distribution. Further, $h_0$ and $c_0$ are initialized with zero vectors. In energy flux forecasting, the input $x_t$ is a scalar, and the dimension of $h_t$ is the hidden vector size, which needs to be tuned (*2*). The biases are omitted in the formulas because one may incorporate them easily through the weight matrices by adding one extra element to each vector mentioned.
A forget gate responds to the question, “which data should the model keep from the previous cell state vector?” In (3), the sigmoid function outputs a number between 0 and 1, which indicates the importance of the previous plus current inputs. This value is directly multiplied with the cell state vector of the previous step in the first term of (6). The input gate (4) decides which new information should be collected. The second term of (6) adds this new information. The updated cell state vector is then ready to go through the next part of the network. Furthermore, the updated cell state is merged into the hidden state through (8).
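The gate recurrences above can be sketched as a single NumPy step. Biases are folded into the weight matrices, as described in the text; the helper names are illustrative and the weight layout ($W$ acting on the concatenation $[h_{t-1}, x_t]$) is one common convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o):
    """One LSTM time step: gates act on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z)               # forget gate
    i_t = sigmoid(W_i @ z)               # input gate
    c_tilde = np.tanh(W_c @ z)           # auxiliary (candidate) cell state
    c_t = f_t * c_prev + i_t * c_tilde   # keep old info + add new info
    o_t = sigmoid(W_o @ z)               # output gate
    h_t = o_t * np.tanh(c_t)             # hidden state from updated cell
    return h_t, c_t
```

In a full network this step is applied sequentially over $t = 1, \ldots, T$, starting from zero vectors $h_0$ and $c_0$.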
The purpose of the encoder is to find a representation of the input sequence as a fixed-size vector $v$. For the “LSTM+FCL” network, we create a single fully connected layer that receives the fixed-size vector $v$ as its input and outputs the forecast sequence.
For the seq-to-seq network, we have another recurrent network as the decoder. The decoder takes vector $v$ as its input and the last encoder hidden state ($h_T$) as its initial hidden state, and runs a similar recurrent construction to decode the output sequence. We omit the equations, as they are similar to those above. Furthermore, we let the decoder update its weight parameters independently. In other words, the encoder and decoder do not share parameters. This doubles the number of parameters to be tuned and multiplies the training time by a constant, but we allow this in the hope of achieving better performance. The decoder operates a static recurrent structure rather than a dynamic one. That is, the decoder creates an unrolled computational graph of fixed length due to the fixed-size input vector $v$. In addition, both the encoder and decoder can be deep structures. That is, the number of stacked LSTM layers (*3*) for each of them can vary.
The seq-to-seq model attempts to find the output sequence distribution as follows:
$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = p(y_1, \ldots, y_{T'} \mid v)$  (9)
$p(y_1, \ldots, y_{T'} \mid v) = \prod_{t=1}^{T'} p\big(y_t \mid v,\ \epsilon y_1 + (1-\epsilon)\hat{y}_1, \ldots, \epsilon y_{t-1} + (1-\epsilon)\hat{y}_{t-1};\ \theta\big)$  (10)
where $\theta$ represents all the model weights to be tuned and $\epsilon$ is a binary indicator representing whether the model sees the actual previous measurements or their predictions. The first equality (9) relies on the encoder and emphasizes once again that the model encodes the whole input into the vector $v$ before starting to decode, and drops the input thereafter. The concept of multiplying conditional probabilities in the second equality (10) comes from the recurrent structure of the decoder. In other words, based on the trained weights and the vector $v$, the model tries to forecast the first token (element) of the output sequence. In general, at time step $t$, the model has access to the first $t-1$ tokens of the output sequence. In the TF strategy [44, 45], the model has access to the actual true outputs at training time (i.e., $\epsilon = 1$) and then to the predicted ones (i.e., $\epsilon = 0$) at testing time. In the scheduled strategy, however, one may flip a coin [46] with probability $\epsilon_t$ for all $t$ to decide whether to use the actual output (with probability $\epsilon_t$) or its prediction by the model itself (with probability $1 - \epsilon_t$). For instance, the probability of having all true outputs during training is $\prod_t \epsilon_t$. This scheme has been introduced for Machine Translation (MT) tasks, where the number of possible output tokens is as large as the dictionary size. Furthermore, in the MT framework the model faces embedded input and output sequences. In contrast, in a forecasting framework, each output token belongs to $\mathbb{R}$. Therefore, using a separate $\epsilon_t$ for each token may result in a combination of true outputs along with predicted ones, which is not particularly useful. Instead, we flip a coin at the beginning of each epoch and stick to the plan for the entire epoch. So, from now on we denote the sequence as $\epsilon_e$, where $e$ is the training epoch number. Intuitively, the sequence should be decreasing, which encourages the model to use the predicted output towards the end of the training. [46] introduces Linear, Exponential and Inverse Sigmoid Decay sequences.
We modify the Inverse Sigmoid Decay for use in epoch-scheduled training as follows:

$\epsilon_e = \frac{k}{k + \exp(e/k)}$

where $k$ is a parameter to tune (*4*). Increasing $k$ increases the probability of receiving true values. For example, for a training scheme of 20 epochs, a suitably chosen $k$ means that we receive the true outputs with at least 0.86 probability. Hence, in the tuning process, we compare the magnitude of $k$ with the number of epochs.
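Assuming the standard inverse sigmoid decay of [46], the epoch-scheduled scheme can be sketched as follows: one coin is flipped per epoch and the outcome is kept for the whole epoch. All names are illustrative.

```python
import math
import random

def epsilon(epoch, k):
    """Inverse sigmoid decay: probability of feeding the true
    (teacher-forced) outputs during the given training epoch."""
    return k / (k + math.exp(epoch / k))

def teacher_forcing_flags(n_epochs, k, seed=0):
    """Epoch-scheduled training: flip one coin per epoch and keep the
    outcome for the entire epoch (True = feed ground-truth outputs)."""
    rng = random.Random(seed)
    return [rng.random() < epsilon(e, k) for e in range(1, n_epochs + 1)]
```

Because `epsilon` is decreasing in `epoch`, later epochs are more likely to train on the model's own predictions.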
3.2 Loss Function & Optimizer
We use Mean Squared Error (MSE) plus a 2-norm regularizer as the loss function for the whole network:

$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - P_i)^2$  (11)

where $N$ is the size of the measurements and $\hat{y}$ is a vector of size $N$ representing the model’s prediction of $P$, the vector of actual power outputs. The data, however, is chopped up and fed via mini-batches, the size of which is tunable (*5*) as a power of 2, which is common practice in optimization algorithms. Regardless of the algorithm used, MSE is clearly convex with respect to $\hat{y}$. But we do not have direct control over $\hat{y}$. Instead, $\hat{y}_i = f(x^{(i)}; \theta)$, where $f$ is the seq-to-seq network structure and $x^{(i)}$ is the $i$th network input. The loss function is generally nonconvex with respect to $\theta$, which justifies our use of a 2-norm regularizer [48, 49]. Hence, a term $\lambda \|\theta\|_2^2$ is added to the loss function, where $\theta$ is a vector that stacks all the trainable parameters within the weight matrices and $\lambda$ (*6*) is a regularization parameter to be tuned. Therefore, the loss function equals $\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - P_i)^2 + \lambda \|\theta\|_2^2$.
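A minimal sketch of this regularized loss; the names are illustrative, and `theta` stands for a flattened vector of all trainable weights.

```python
import numpy as np

def regularized_mse(y_true, y_pred, theta, lam):
    """MSE over the batch plus a squared 2-norm penalty on the stacked
    trainable parameters; lam is the tunable regularization weight."""
    mse = np.mean((y_true - y_pred) ** 2)
    return mse + lam * np.sum(theta ** 2)
```

In the actual network the penalty is applied to the weight matrices of both the encoder and decoder.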
Through backpropagation, we calculate the gradient of the loss function and update the weights via a first-order optimization algorithm. Stochastic Gradient Descent (SGD) directly updates $\theta$ in the direction of the negative of the mini-batches’ gradient in an iterative manner. That is, if $B_i$ is the $i$th mini-batch, then we have

$\theta_{t+1} = \theta_t - \eta_t g_t$

where

$g_t = \nabla_\theta L(B_i; \theta_t)$

and $\eta_t$ is the learning rate (*7*) at step $t$, which needs to be tuned. There exist several modifications of SGD, such as RMSProp. Adaptive moment estimation (Adam) [50] has been successfully implemented as another first-order method. In Adam, the model’s weights are updated as follows:
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$  (12)

where

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$  (13)
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$  (14)
$\hat{m}_t = m_t / (1 - \beta_1^t)$  (15)
$\hat{v}_t = v_t / (1 - \beta_2^t)$  (16)
where $\beta_1$ and $\beta_2$ are momentum-like parameters, and $\epsilon$ serves to reduce numerical issues. Adam aims to update each element of $\theta$ with respect to its size (adaptively); it stores only an exponentially decaying average of past squared gradients. The authors of [50] propose $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ as efficient choices. In some cases, such as machine translation, in which we are dealing with RNN structures similar to our forecasting context, Adam suffers from using exponential averaging over the past squared gradients [51]. AMSGrad has been proposed to tackle this issue by simply using the maximum of the past squared gradients. In other words, we update $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$. Below, we first compare the efficiency of the SGD, RMSProp, Adam, and AMSGrad algorithms, and then optimize the energy flux forecasting model mostly with Adam and AMSGrad.
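The Adam update, and the AMSGrad variant that keeps a running maximum of the second moment, can be sketched as below. This is a simplified single-step sketch; in particular, the exact placement of bias correction under AMSGrad varies across implementations, so treat it as illustrative rather than definitive.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999,
              eps=1e-8, amsgrad=False, v_max=None):
    """One Adam update on parameter vector theta given gradient g.
    With amsgrad=True the (bias-corrected) second moment uses the
    running maximum of past values instead of the current one."""
    m = b1 * m + (1 - b1) * g                 # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2            # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction
    if amsgrad:
        v_max = np.maximum(v_max, v)          # AMSGrad: max of past v
        v_hat = v_max / (1 - b2 ** t)
    else:
        v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_max
```

The state vectors `m`, `v` (and `v_max` for AMSGrad) are initialized to zero and carried across iterations.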
3.3 Data
Satellite altimeters and buoy measurements are the two most common sources of data for wave feature forecasting [52]. In this study, we primarily use buoy measurements from the National Oceanic and Atmospheric Administration (NOAA). Each buoy provides measurements of significant wave height, wind speed, wind direction, average wave period ($T$), sea level pressure, gust speed, air temperature, and sea surface temperature, at resolutions of 10 minutes to 1 hour.
The National Data Buoy Center (NDBC) maintains three major data sets, consisting of moored buoys, drifting buoys, and Coastal-Marine Automated Network (C-MAN) stations, which are located alongside U.S. coastal structures. Along with other air and water features, they monitor wave energy spectra, from which $H_s$ and $T$ can be obtained via equation (1). Each site has an identifier (ID). IDs are in the form of five digits, except for C-MAN stations, which have alphanumeric IDs. The first two digits are assigned to a continental region, and the last three indicate a specific location. For instance, 41001, 41002 and 41004 are Atlantic Ocean sites near the southeastern U.S. The sites typically have hourly resolution, which produces 8760 data points per year for each feature. The data sets use 99.00 to indicate missing measurements. Some sites, however, began to report 10-minute-resolution data in 2017. For instance, among the above-mentioned stations, 41002 continues to provide hourly data, while 41001 and 41004 aim to provide 10-minute-resolution data; for the most part, their data is still hourly, with the non-hourly values filled by 99.00. Quite a few refined stations have been collected for this study. We consider a buoy to be a “refined station” when it is active and has a year with at least 1000 meaningful data points (approximately 11.4%).
We calculate energy flux via equation (2) for the datasets that remain after this process.
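A hypothetical sketch of the cleaning rule just described: drop the 99.00 sentinels (and NaNs) from a raw yearly series and check the 1000-meaningful-points criterion for a “refined station.” The names and the exact filtering interface are our own.

```python
import numpy as np

MISSING = 99.00  # NDBC sentinel for missing measurements (per the text)

def clean_yearly_series(values, refined_threshold=1000):
    """Remove missing-value sentinels and NaNs from one year of raw
    measurements; also report whether the year meets the
    'refined station' criterion of >= 1000 meaningful points."""
    values = np.asarray(values, dtype=float)
    keep = ~np.isclose(values, MISSING) & ~np.isnan(values)
    kept = values[keep]
    return kept, bool(kept.size >= refined_threshold)
```

Energy flux would then be computed via equation (2) on the retained points only.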
Table 1 displays the buoys investigated in this research, along with some information about them. The locations are chosen from the nearshore South Pacific Ocean, the Gulf of Mexico, and the North Atlantic Ocean, where the water depths range from approximately 100 m to 5 km. We divided the data into 3 parts: 60% for training, 20% for hyperparameter tuning, and 20% for testing, unless we compare with alternative approaches that used a different division. The model has no access to the testing data in any manner. Table 1 indicates the training and testing data sizes as well. The validation (tuning) sets are similar in size to the testing sets, hence we do not show them.
Usage  Station ID  Lon (W)  Lat (N)  Water Depth (m)  Years Interval  Training Points  Test Points
Forecasting:  41043  64.830  21.124  5271  [2007,2017]  52911  17904 
41040  53.045  14.554  5112  [2005,2017]  54794  18617  
42056  84.938  19.918  4526  [2005,2017]  58206  19548  
32012  85.078  19.425  4534  [2007,2017]  49280  16428  
41049  62.938  27.490  5459  [2009,2017]  44816  14996  
41060  51.017  14.824  5021  [2012,2017]  24817  8272  
41001  72.617  34.625  4453  [1976,2017]\{1979,2008}  134018  44791  
51002  157.696  17.037  4934  [1984,2017]\{2013}  132560  44507  
46086  118.036  32.507  1838  [2003,2017]  69044  23233  
46013  123.307  38.238  122.5  [1981,2017]  164022  54980  
41048  69.590  31.860  5340  [2007,2017]  49820  16705  
46084  136.102  56.622  1158  [2002,2017]\{2009}  64186  21606  
46083  137.997  58.300  136  [2001,2017]\{2014}  58440  18273  
Reconstruction:  46042  122.398  36.785  1645.9  [2009,2010]  4687  4576 
46025  119.049  33.761  888  [2009,2010]  4687  4576  
46069  120.229  33.663  986  [2009,2010]  4687  4576  
Feature Selection:  44007  70.141  43.525  26.5  [2018]  2737  685 
44008  69.248  40.504  74.7  [2018]  2737  685  
44009  74.702  38.457  30  [2018]  2737  685  
44013  70.651  42.346  64  [2018]  2737  685  
44014  74.840  36.606  47  [2018]  2737  685  
44017  72.049  40.693  48  [2018]  2737  685  
\{·\}: The years in braces are excluded due to an unsatisfactory amount of available data.
We argue that some of the studies in the literature are insufficient from a data point of view in two broad senses. First, for papers that use real buoy measurements, there is usually extensive data manipulation and preprocessing. Sometimes the proposed models see only subsamples of measurements in order to produce smaller forecasting metric errors. For example, [53, 54] choose a subsample of 50 daily points out of 8 months of data (Jan 1–Aug 30, 2015) and report Root Mean Squared Error (RMSE) as an error measurement metric. Not only is this metric scale-dependent and changeable by shrinking the data size, but this selection of points can break the temporal notion of the data as well. Therefore it has a direct effect on the nonlinear functional relation that we aim to predict. Moreover, [31] produces the data using the power matrix method; hence, the accuracy of the method heavily depends on the accuracy of the data-producing procedure. In our study, after selecting a buoy, we do not conduct any data manipulation or preprocessing, other than cleaning not-a-number (NaN) values from the NOAA data. Otherwise, the input to the model is the raw data collected from NOAA. (See https://www.ndbc.noaa.gov/wave.shtml to see how spectral wave data are derived from buoy motion measurements.)
3.4 Error Metrics
We use the following forecasting error metrics:

Root Mean Square Error (RMSE) is one of the most commonly used regression error metrics. It equals the square root of the MSE, given in (11), and it provides a useful measure of forecasting quality. The closer the RMSE gets to 0, the better the prediction fits. In general, it is difficult to choose a threshold such that the prediction is deemed “accurate” if the RMSE is below it, because the RMSE is scale-dependent and not robust to outliers.

HUBER loss, given two sets of points $y = (y_1, \ldots, y_N)$ and $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_N)$, is defined as

$\text{HUBER}(y, \hat{y}) = \sum_{i=1}^{N} L_\delta(y_i, \hat{y}_i)$  (17)

where

$L_\delta(y_i, \hat{y}_i) = \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2 & \text{if } |y_i - \hat{y}_i| \le \delta \\ \delta\left(|y_i - \hat{y}_i| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$  (18)

Generally, $\delta = 1$ is an acceptable threshold. HUBER loss is scale-dependent but is less sensitive to outliers than RMSE is.

Pearson Correlation Coefficient (CC) is defined as:

$\text{CC} = \frac{\mathrm{cov}(y, \hat{y})}{\sigma_y \sigma_{\hat{y}}}$  (19)

where $\mathrm{cov}$ is covariance. We have $-1 \le \text{CC} \le 1$. Although CC captures linear similarities of its inputs very well, several characteristics of nonlinear relations are ignored.

Mean Arctangent Absolute Percentage Error (MAAPE), given two sets of points $y$ and $\hat{y}$, is defined as:

$\text{MAAPE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \arctan\left( \left| \frac{y_i - \hat{y}_i}{y_i} \right| \right)$  (20)

MAAPE is scale-independent and, unlike MAPE, overcomes the problematic cases that arise as $y_i$ goes to zero for some $i$.
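The four metrics can be sketched in a few lines of NumPy. The `delta=1.0` default and the summed form of HUBER follow the definitions above; the function names are illustrative.

```python
import numpy as np

def rmse(y, p):
    """Root Mean Square Error (scale-dependent)."""
    return np.sqrt(np.mean((y - p) ** 2))

def huber(y, p, delta=1.0):
    """Summed HUBER loss: quadratic for small residuals, linear beyond delta."""
    r = np.abs(y - p)
    per_point = np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
    return np.sum(per_point)

def cc(y, p):
    """Pearson correlation coefficient, in [-1, 1]."""
    return np.corrcoef(y, p)[0, 1]

def maape(y, p):
    """Mean Arctangent Absolute Percentage Error; arctan bounds each
    term (by pi/2) even as y_i approaches zero, unlike MAPE."""
    return np.mean(np.arctan(np.abs((y - p) / y)))
```

Note that MAAPE is bounded above by $\pi/2$, which makes it convenient as a validation objective.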
3.5 Hyperparameter Tuning
In previous sections we identified tunable parameters (*$i$*), $i = 1, \ldots, 7$. Here, we use Spearmint Bayesian optimization [55] as a tuning tool. The algorithm is capable of handling integer parameters, as in our case [56]. The algorithm treats the seq-to-seq forecasting model as a random black-box function and places a Gaussian Process (GP) prior over it. After collecting function evaluations, it extracts posterior information based on the Expected Improvement (EI) observed. Hence, GPEIOptChooser with Monte Carlo approximation is selected, which tells Spearmint which of the candidates to execute next. The values of each parameter that we consider are listed in Table 2.
hidden size  {32,64} 

time step  {10,20,30,40,50,60} 
batch size  {16,32,64,128,256} 
learning rate  {0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009,0.010} 
# stacked layers  {1,2,4} 
2norm regularizer  {0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009,0.010} 
scheduled training $k$  {0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0} × #epochs 
Spearmint only has access to the validation dataset and the model that has already been trained on the training dataset. Any hyperparameter tuning algorithm such as Spearmint can have its own objective function. That is, Spearmint’s objective function during validation can be separate from the seq-to-seq model’s during training (which, recall, is MSE plus a 2-norm regularizer). Thus the model enjoys the convergence properties of MSE plus 2-norm regularization during training, while benefiting from MAAPE’s scale independence and upper bound ($\pi/2$) during the tuning process. Figure 3 illustrates the process of using Spearmint for hyperparameter tuning. We call the number of function evaluations the “algorithm budget.”
4 Experimental Results
In this section, we first report on the model’s performance utilizing several optimization algorithms and compare them to those of standard neural networks on buoy measurements. Next, we demonstrate the performance of our model for long-term forecasting (48 steps ahead) of significant wave height. Finally, we tackle the problem of reconstructing wave characteristics, such as significant wave height, for unknown locations based on information from known locations. We wrote the code for our method and its alternatives in Python 3.6 using the TensorFlow package (running on a GeForce GTX 1050 GPU), except the code for hyperparameter tuning via Spearmint, whose most stable version for our framework is written in Python 2.7.
4.1 Comparison of Optimization Algorithms
In this section, we compare the effectiveness of the SGD, RMSProp, Adam, and AMSGrad optimization schemes based on test error for four of the refined buoys. Figure 4 displays the algorithms’ test errors versus the number of training epochs, considering all five error metrics introduced earlier. For the RMSE and HUBER losses, the values on the y-axis do not carry a clear meaning; we only hope to reach the smallest possible values for these two loss functions, but the actual values do not tell us much. One can observe that, even at the end of the training epochs, the RMSE and HUBER losses are not negligible. Note that the error we experience during the calculation of energy flux is of the third power of the error we encounter forecasting significant wave height. A similar difference has been pointed out by [3]. Moreover, RMSE and HUBER are both scale-dependent, and the test sets include years of data, which may result in errors piling up over time. In contrast, MAPE and MAAPE are scale-independent, and therefore their values have more intuitive meaning. In addition, their behavior is similar to each other, with MAAPE values consistently lower than those of MAPE. Therefore, we mainly focus on MAPE and MAAPE for subsequent experiments. SGD and, in a few cases, AMSGrad require more epochs to reach their best test accuracy, while Adam usually reaches its peak faster. RMSProp and Adam perform similarly, with Adam holding a marginal lead. AMSGrad and Adam exhibit excellent performance in terms of RMSE error. We used the same initial learning rate and decay schedule for all the algorithms; this makes the algorithms more stable, since a fixed learning rate would cause most of them, especially SGD, to encounter drastic changes towards the end of the optimization process. The HUBER losses are the most challenging to deal with: one can see sizable fluctuations, even near the end of training, for stations 41049 and 32012.
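The decaying schedule mentioned above can be expressed as a simple exponential decay; the starting rate and decay factor below are illustrative placeholders, not the values used in the paper:

```python
# Exponential learning-rate decay: lr_t = lr0 * decay**t.
# lr0 = 0.005 and decay = 0.95 are illustrative, not the paper's values.
def decayed_lr(lr0, decay, epoch):
    return lr0 * decay ** epoch

schedule = [decayed_lr(0.005, 0.95, e) for e in range(20)]
```

A monotonically shrinking step size damps the late-stage oscillations that a fixed rate would cause, especially for SGD.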
SGD shows inferiority in terms of test error; however, owing to its simple update rule, it achieves the best training time. SGD is also especially sensitive to its parameters. Adam, in contrast, exhibits stability and is among the first algorithms to converge to its best result (usually within its first 10 training epochs). Although RMSProp performs particularly well for station 42056, there are cases where the algorithm cannot reach the desired test error. Based on these results, from now on we use either Adam or AMSGrad as the optimizer.
4.2 Networks Comparison
In this section, we conduct an experiment to evaluate the effect of different neural network structures on the performance of the algorithm. We compare single- and multi-layer RNNs, single- and multi-layer LSTMs, and the epoch-scheduled seq-to-seq model with the Adam and AMSGrad optimization algorithms. The RNN and LSTM networks contain a fully connected last layer.
Tables 3 and 4 demonstrate the performance of these neural network structures for 5- and 10-step-ahead forecasting of energy flux over data gathered from 5 refined buoys (collected from Table 1), using MAPE and MAAPE as the error metrics.
To evaluate the networks fairly, during hyperparameter tuning we used the same budget in Spearmint for each network, set to 100 function evaluations. One may argue that this experimental design might favor the alternative networks, because they have many fewer hyperparameters to tune, whereas for the seq-to-seq framework Spearmint must extract information from a more complex function given the same number of function evaluations.
The multi-layer networks consist of four layers. The first three layers are recurrent, of structure [time steps, time steps, hidden size], with Rectified Linear Unit (ReLU) activation functions, and the last layer is a fully connected one attaching the previous output to a single output node. Training was done with a budget of at most 15 epochs; in other words, if a structure reaches its best performance anywhere before the 15th epoch, that result is collected. The best parameters for the epoch-scheduled seq-to-seq model are in the same order as shown in Table 2.
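A minimal numpy sketch of this four-layer structure follows; the weight shapes and initializations are illustrative assumptions, not the trained model:

```python
import numpy as np

def relu_rnn_layer(x_seq, w_in, w_rec, b):
    # simple recurrent layer: h_t = ReLU(x_t W_in + h_{t-1} W_rec + b)
    h = np.zeros(w_rec.shape[0])
    outputs = []
    for x_t in x_seq:
        h = np.maximum(0.0, x_t @ w_in + h @ w_rec + b)
        outputs.append(h)
    return np.stack(outputs)

def forward(x_seq, time_steps, hidden_size, rng):
    # three recurrent layers of widths [time steps, time steps, hidden size]
    seq = x_seq
    for width in (time_steps, time_steps, hidden_size):
        d_in = seq.shape[1]
        seq = relu_rnn_layer(seq,
                             rng.normal(scale=0.1, size=(d_in, width)),
                             rng.normal(scale=0.1, size=(width, width)),
                             np.zeros(width))
    # fully connected last layer: final hidden state -> one output node
    w_fc = rng.normal(scale=0.1, size=(hidden_size, 1))
    return float((seq[-1] @ w_fc)[0])
```

The fully connected layer at the end is the component that, per the discussion below, translates the learned recurrent features into a single forecast.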
From Table 3, one can see that seq-to-seq networks result in smaller MAPE and MAAPE values, though the single-layer LSTM plus a fully connected last layer also performs very well. We believe a fully connected last layer assists the LSTM and simple RNN networks in capturing all learned features and translating them into superior forecasts. In addition, multi-layer structures are not superior to single-layer ones. This shows up in the Spearmint tuning process as well: Spearmint mostly prefers single- or double-layer structures for energy flux forecasting, regardless of architecture. Hence, when we force the number of stacked layers to be four, we see a small reduction in the performance of the models. Moreover, Adam and AMSGrad perform closely to each other, and one cannot claim the superiority of either.
Station ID  SL-RNN+FCL  SL-LSTM+FCL  ML-RNN+FCL  ML-LSTM+FCL  Seq-to-seq (Adam)  Seq-to-seq (AMSGrad)  

MAPE  MAAPE  MAPE  MAAPE  MAPE  MAAPE  MAPE  MAAPE  MAPE  MAAPE  MAPE  MAAPE  
46013  0.296  0.259  0.295  0.258  0.303  0.266  0.301  0.265  0.269  0.248  0.297  0.271 
41048  0.336  0.279  0.308  0.259  0.334  0.278  0.354  0.289  0.217  0.205  0.221  0.211 
46084  0.395  0.306  0.392  0.304  0.397  0.307  0.399  0.308  0.311  0.279  0.281  0.258 
46083  0.389  0.310  0.402  0.318  0.397  0.314  0.464  0.356  0.291  0.266  0.271  0.253 
41060  0.215  0.193  0.221  0.197  0.209  0.189  0.201  0.180  0.174  0.169  0.163  0.159 
In Table 4, we can observe that when we move from 5- to 10-step-ahead forecasting, the epoch-scheduled seq-to-seq model fits the actual outputs even more accurately. For instance, at station 41048, seq-to-seq improves on the best alternative method in terms of both MAPE and MAAPE for 5-step-ahead forecasting, and the margin widens for 10-step-ahead forecasting. These extra improvements may be a direct result of scheduled training managing deeper forecasting horizons.
Station ID  SL-RNN+FCL  SL-LSTM+FCL  ML-RNN+FCL  ML-LSTM+FCL  Seq-to-seq (Adam)  Seq-to-seq (AMSGrad)  

MAPE  MAAPE  MAPE  MAAPE  MAPE  MAAPE  MAPE  MAAPE  MAPE  MAAPE  MAPE  MAAPE  
46013  0.487  0.360  0.459  0.347  0.473  0.359  0.473  0.359  0.376  0.328  0.402  0.345 
41048  0.470  0.336  0.459  0.333  0.707  0.470  0.691  0.450  0.436  0.345  0.299  0.278 
46084  0.734  0.434  0.742  0.437  0.769  0.454  0.727  0.432  0.494  0.377  0.393  0.341 
46083  0.592  0.399  0.569  0.393  0.624  0.410  0.602  0.406  0.425  0.360  0.388  0.341 
41060  0.353  0.282  0.375  0.295  0.374  0.298  0.428  0.331  0.207  0.199  0.209  0.201 
4.3 LongTerm Forecasting
Next, we investigate 2-day-ahead forecasting of significant wave height. As noted by [38], 2 days can be considered an acceptable range for long-term forecasting of significant wave height. Table 5 presents a 48-step-ahead test error comparison over 5 refined buoys for significant wave height forecasting using the seq-to-seq model.
Station ID  Seq-to-seq (Adam)  Seq-to-seq (AMSGrad)  

RMSE (m)  MAPE  MAAPE  HUBER  RMSE (m)  MAPE  MAAPE  HUBER  
41043  0.593  0.249  0.235  0.147  0.598  0.247  0.233  0.150 
41040  0.431  0.182  0.177  0.089  0.431  0.183  0.176  0.090 
41001  1.197  0.459  0.343  0.457  1.197  0.459  0.343  0.457 
51002  0.513  0.186  0.180  0.126  0.511  0.186  0.179  0.125 
46086  0.658  0.238  0.226  0.177  0.666  0.240  0.230  0.181 
Note the large gap between the RMSE and HUBER losses that the model experiences when it predicts energy flux compared to significant wave height. The reason is the difference in the value ranges of output power and significant wave height: energy flux is on the order of 10 kW/m, while significant wave height is usually on the order of 1 m. Hence, we focus on MAAPE. The results indicate that the model is robust for long-term prediction: the values presented in Table 5 are comparable to those for 5- and 10-step-ahead prediction in Section 4.2. Further, by definition, the HUBER metric should behave robustly with regard to outliers; the results support this fact, and the model attains minimal HUBER loss values.
From the table, it is clear that Adam and AMSGrad produce very similar errors. Therefore, we conclude that one will not miss useful details by considering only the Adam optimization algorithm. Hence, in the last section, we optimize the models using Adam.
5 Reconstruction of Significant Wave Height
Our framework, on its surface, is not designed to reconstruct missing measurements of a station based on information from other stations; rather, it uses historical data at the same station. Therefore, to address this goal, we modify the input of the neural networks so that the models no longer require historical data. Figure 5 illustrates three buoys—46069, 46025 and 46042—that are close to one another. For more details on these buoys, see Table 1. Similar to existing methods [2, 3], we aim to reconstruct the significant wave height of buoy 46069 at different time steps, treating those values as missing data. The networks have access to all information from the two adjacent buoys, 46025 and 46042, as their input, as well as buoy 46069’s SWH as their labels in the training process, which uses data from the entire year 2009. The networks then predict buoy 46069’s SWH for the entire year 2010, based on the inputs from the adjacent buoys. In other words, at time step $t$, each model is allowed to see the first $t$ measurements of SWH at the two nearby buoys.
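One plausible way to assemble such training pairs is sketched below in numpy; the fixed window length is our simplifying assumption, since the paper lets the model see all measurements up to time $t$:

```python
import numpy as np

def make_pairs(swh_a, swh_b, swh_target, window):
    # input at time t: a window of recent SWH readings from the two
    # adjacent buoys; label: the target buoy's SWH at time t
    X, y = [], []
    for t in range(window, len(swh_target)):
        X.append(np.stack([swh_a[t - window:t],
                           swh_b[t - window:t]], axis=-1))
        y.append(swh_target[t])
    return np.array(X), np.array(y)
```

Each sample then has shape (window, 2), i.e., a short multivariate sequence drawn only from the adjacent buoys, with no history of the target buoy itself.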
Table 6 compares the performance of our seq-to-seq and SL-LSTM networks with that of benchmark methods from the literature. The methods are [2, 3]: 1) all-featured Extreme Learning Machine (ELM), 2) all-featured Support Vector Regression (SVR), 3) Grouping Genetic Algorithm Extreme Learning Machine with final prediction by ELM (GGA-ELM-ELM), 4) Grouping Genetic Algorithm Extreme Learning Machine with final prediction by SVR (GGA-ELM-SVR), 5) Bayesian-optimized Grouping Genetic Algorithm Extreme Learning Machine with final prediction by ELM (BO-GGA-ELM-ELM) and 6) Bayesian-optimized Grouping Genetic Algorithm Extreme Learning Machine with final prediction by SVR (BO-GGA-ELM-SVR). The table reports RMSE, mean absolute error (MAE), and the correlation coefficient (CC) as error metrics; these are the metrics used in the benchmark papers.
Note that [2, 3] use buoys 46025, 46042, and 46069, and claim that the NOAA dataset has no missing measurements for these buoys in the years 2009–10. However, we found this to be untrue. First, there are significant gaps in the 46069 station data, which result in fewer than 5000 data points instead of the 8760 hourly points that should be present in a given year. Second, as with the other refined stations, roughly 1–2% of the values contain 99.00, which indicates missing data. Therefore, we carefully removed the missing data points from all three stations in a way that preserves the position of each time step: if a measurement is missing at any one station, we exclude that time step from all stations. The resulting dataset has 9263 data points for each buoy, of which 4687 points are for training (2009) and 4576 are for testing (2010).
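The cleaning procedure above can be sketched in pandas; the sentinel value 99.00 follows the NOAA convention noted in the text, while the frame and column names are hypothetical:

```python
import numpy as np
import pandas as pd

def clean_aligned(frames, sentinel=99.00):
    # replace the missing-data sentinel with NaN, align all stations on
    # their common time index, then drop any time step that is missing
    # at one or more stations -- preserving positional correspondence
    cleaned = [f.replace(sentinel, np.nan) for f in frames]
    merged = pd.concat(cleaned, axis=1, join="inner")
    return merged.dropna()
```

Dropping a time step from every station at once, rather than per station, is what keeps the series positionally aligned across buoys.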
Methods  RMSE (m)  MAE (m)  CC 

All features-ELM  0.4653  0.3582  0.6624 
All features-SVR  0.6519  0.4986  0.3949 
GGA-ELM-ELM  0.3650  0.2858  0.7049 
GGA-ELM-SVR  0.3599  0.2727  0.7056 
BO-GGA-ELM-ELM  0.3324  0.2519  0.7429 
BO-GGA-ELM-SVR  0.3331  0.2461  0.7396 
Seq-to-seq (Adam)  0.3419  0.2528  0.8106 
LSTM+FCL (Adam)  0.2784  0.2392  0.7922 
From the table, it is clear that the performance of the seq-to-seq network is very promising, as it obtains the best CC value among the methods. Moreover, LSTM+FCL significantly outperforms all the alternatives, and even the seq-to-seq network, in terms of RMSE and MAE simultaneously. It is worth mentioning that both of our methods have access only to the SWH of the stations, while the benchmark methods use more measurements. For example, the GGA methods obtain their best performance using the wind speed, significant wave height, average wave period, air temperature and atmospheric pressure at buoy 46025, plus the significant wave height and average wave period of buoy 46042 [2]. This makes the superiority of our methods even clearer, but it raises another question as well: would proper feature selection improve the performance of deep networks even further? We conducted some preliminary experiments to answer this question, considering combinations of the features that the GGA approaches use, and found inferior accuracy compared to solely using SWH. In Section 6, we discuss a more efficient approach for feature selection.
Figure 6 illustrates the performance of both the LSTM and seqtoseq networks in reconstructing the significant wave height for buoy 46069. Note that the models never “saw” the solid green line, which is the year 2010 data (i.e., the test data).
We also illustrate the RMSE behavior of LSTM+FCL and seq-to-seq. Figure 7 displays the RMSE test errors over the number of training epochs, up to 50. Both networks reach their best performance around the tenth epoch, with the sequence-to-sequence model arriving slightly faster. This happens even though the sequence-to-sequence model has more weights. We argue that epoch-scheduled training may be responsible for this quicker convergence, because by definition it allows the network to see ground-truth outputs, updating the parameters better while simultaneously avoiding excessive overfitting.
6 Feature Selection
In this section, we modify the proposed framework to investigate feature selection via the elastic net [57] methodology. We alter the loss function to include two separate regularizers, the 1-norm and the 2-norm, as follows:
$$\mathcal{L} = \mathrm{MSE} + \lambda_1 \lVert W \rVert_1 + \lambda_2 \lVert W \rVert_2^2,$$
where $\lambda_1, \lambda_2 \ge 0$ and $W$ is the set of all the network weights. The 1-norm induces a naturally sparse collection of variables: $\lVert W \rVert_1$ attains its minimum when all the weights are zero (unselected), so every single nonzero weight should improve the first term of the objective function. Hence, the model chooses only those weights for which the decrease in the first term outweighs the increase in the last term. Using only the 1-norm regularizer would result in instability among the multiple solutions, because a tiny change in the parameters may flip some weights between zero and nonzero values [58]. Therefore, the 2-norm is also used in the objective.
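The sparsifying effect of the 1-norm can be demonstrated on a toy linear model with proximal gradient descent (ISTA); the data, $\lambda$ values and step size below are illustrative, and the soft-thresholding step is what produces exact zeros:

```python
import numpy as np

def ista_elastic_net(X, y, lam1=0.3, lam2=0.01, lr=0.01, iters=3000):
    # minimize MSE + lam1*||w||_1 + lam2*||w||_2^2 via proximal gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / len(y) + 2 * lam2 * w  # smooth part
        w = w - lr * grad
        # proximal step for the 1-norm: soft-thresholding sets small
        # weights exactly to zero, yielding a sparse solution
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam1, 0.0)
    return w
```

On data where only the first feature matters, the remaining weights land at exactly zero, mirroring the drop in nonzero trainable variables reported in Table 7.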
One important difference between our feature selection method and those used in the wave energy literature (e.g., [3]) is that the models in the literature either use all of the data related to a specific feature, such as SWH from a nearby buoy, or ignore that feature entirely; our model, in contrast, can partially utilize any of the features through the nonzero weights of the optimal solution.
To investigate the performance of our proposed framework for feature selection, we design a new experiment of a kind that has also recently been explored for feature selection via ensemble structures [59]. Figure 8 illustrates six stations—44007, 44008, 44009, 44013, 44014 and 44017—on the east coast of the United States. For their exact locations and water depths, refer to Table 1. We reconstruct the significant wave height and power output of buoy 44008 based on the available features of the other nearby buoys. We chose these buoys for two main reasons. First, on average they are rich in features with usable data. Second, the resolution and the data-collection times for these stations agree up to 1-minute accuracy, which is crucial when we want to reconstruct a feature of one buoy based on others. We consider wind direction (WDIR), wind speed (WSPD), gust speed (GST), significant wave height (WVHT), dominant wave period (DPD), average wave period (APD), direction of dominant-period waves (MWD), sea level pressure (PRES), air temperature (ATMP), sea surface temperature (WTMP) and dewpoint temperature (DEWP). The datasets are for the year 2018. We include all the features for all the buoys, except DEWP for buoys 44009 and 44013, because those data are unusable. We exclude a measurement for all the stations if a feature is missing in at least one of them. In the end, the dataset consists of 3422 measurements of 53 features. We divide the dataset into training (60%), validation (20%) and test (20%) sets. We do not alter the real measurements in any way, and the methods never “see” the test set during training and validation. In this experiment, for both the LSTM+FCL and seq-to-seq networks we have one more parameter to tune during validation, namely the coefficient of the 1-norm regularizer.
Table 7 compares the performance of all of the methods, along with an alternative method, Random Forest, for reconstructing both significant wave height and output power. There is a column dedicated to the percentage of nonzero trainable variables: we consider a single weight or bias variable to be nonzero if its absolute value exceeds a small fixed threshold, and we report the result as the percentage of nonzero variables among all trainable variables. When we use the elastic net concept for selecting weights and biases, we observe a significant drop in the nonzero values. This means exactly what we discussed at the beginning of this section: the models invest only in those variables which are beneficial to the objective value. Here, we report the best results found for each error metric (RMSE, HUBER and MAAPE) independently. In other words, the reported values of RMSE, HUBER or MAAPE are not necessarily gathered from a single experiment, but in
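The nonzero-percentage column can be computed with a helper like the following (the threshold value here is illustrative; the paper's exact cutoff is not reproduced):

```python
import numpy as np

def percent_nonzero(weight_arrays, threshold=1e-3):
    # a weight or bias counts as nonzero if |value| exceeds the threshold
    total = sum(w.size for w in weight_arrays)
    nonzero = sum(int((np.abs(w) > threshold).sum()) for w in weight_arrays)
    return 100.0 * nonzero / total
```

Applied to all trainable variables of a model, this yields the percentage reported in Table 7.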