Improving Route Choice Models by Incorporating Contextual Factors via Knowledge Distillation
Route Choice Models predict the route choices of travelers traversing an urban area. Most of the route choice models link route characteristics of alternative routes to those chosen by the drivers. The models play an important role in prediction of traffic levels on different routes and thus assist in development of efficient traffic management strategies that result in minimizing traffic delay and maximizing effective utilization of transport system. High fidelity route choice models are required to predict traffic levels with higher accuracy. Existing route choice models do not take into account dynamic contextual conditions such as the occurrence of an accident, the socio-cultural and economic background of drivers, other human behaviors, the dynamic personal risk level, etc. As a result, they can only make predictions at an aggregate level and for a fixed set of contextual factors. For higher fidelity, it is highly desirable to use a model that captures significance of subjective or contextual factors in route choice. This paper presents a novel approach for developing high-fidelity route choice models with increased predictive power by augmenting existing aggregate level baseline models with information on drivers’ responses to contextual factors obtained from Stated Choice Experiments carried out in an Immersive Virtual Environment through the use of knowledge distillation.
It is widely known that traffic congestion has significant environmental, economic, and public health consequences. The total cost and time loss associated with traffic congestion in the US have been reported to be more than $121 billion per year and 38 hours per person, respectively. In the US, most drivers prefer to use freeways and highways. However, when congestion occurs, they also take alternative routes to avoid travel delay [2, 3, 4]. Mainstream research shows growing interest in, and need for, a better understanding of drivers’ route choice behavior [5, 6, 7, 8, 9].
Route Choice Models [10, 11, 12, 13, 5, 14] predict the route choices of travelers traversing an urban area. Most of the route choice models link route characteristics of alternative routes to those chosen by the drivers. The models play an important role in prediction of traffic levels on different routes and thus assist in development of efficient traffic management strategies that result in minimizing traffic delay and maximizing effective utilization of transport system.
High fidelity route choice models are required to predict traffic levels with higher accuracy. Existing route choice models use revealed preference behavior to model route choice. The use of revealed choice data limits the accuracy of the prediction as it fails to capture subjective factors of drivers at individual level and allows prediction only at an aggregate level. Fig.1 shows the route choice predictions made by a basic aggregate level route choice model (blue line) compared with real data collected from the field (red line). More precisely, Fig.1 shows the probability of drivers exiting a freeway segment through one of the four available exits as predicted by a baseline aggregate route choice model (blue line); the red line in Fig.1 shows the ground truth. It can be seen from Fig.1 that the predictions made by the basic model deviate widely from the ground truth. Existing route choice models do not take into account dynamic contextual conditions such as the occurrence of an accident, the socio-cultural and economic background of drivers, other human behaviors, the dynamic personal risk level, etc. As a result, they can only make predictions at an aggregate level and for a fixed set of contextual factors. Therefore, for higher fidelity, it is highly desirable to use a methodology that captures significance of subjective or contextual factors in route choice.
Adding subjective or contextual factors requires data at the individual or disaggregate level. Stated Choice Experiments (SCEs) are a scientific methodology for capturing the effect of context-sensitive factors on route choice. Current advances in virtual reality technology can enhance stated choice experiments by presenting them realistically, which helps elicit information about the route choices made by a driver. Immersive Virtual Environments (IVEs) [15, 16] provide a good platform to conduct SCEs and elicit responses to route choice experiments as realistically as possible. The promise of IVE applications in collecting data includes, but is not limited to, providing a safe and user-friendly experimental platform, being inexpensive and highly portable, as well as generating context-aware and high-fidelity data.
This paper presents a novel approach for developing high-fidelity route choice models with increased predictive power by augmenting existing aggregate level baseline models with information on drivers’ responses to contextual factors obtained from SCE carried out in an IVE through the use of knowledge distillation. Our approach uses the prior knowledge acquired by a teacher neural network pretrained on data about drivers’ responses to contextual factors to augment a student neural network (a baseline model) in a guided fashion. We demonstrate experimentally that the predictions of the augmented model are much closer to reality than that of the baseline.
The paper makes the following contributions.
It presents a novel approach using knowledge distillation for developing high-fidelity route choice models by augmenting existing baseline models with information about drivers’ reaction to contextual factors acquired from SCEs in IVEs.
We present a general end-to-end knowledge distillation framework that uses a multilayer perceptron as a feature extraction network to provide a feature learning architecture for teacher and student networks and then transfers knowledge from the former to the latter by optimizing distillation loss.
II Related Work
In this section, we discuss related work on route choice models and knowledge distillation frameworks.
Route Choice Models
Transportation engineers have studied commuter route choice behavior for four decades. Engineers developing early route choice models theorized that travel time plays a crucial role in the selection of a route. Route choice behavior theories began to evolve in the late eighties and early nineties as engineers’ understanding of route choice behavior improved through the study of empirical route choice data. Pursula and Talvitie developed a mathematical route model by postulating that drivers consider factors other than travel time in making a route choice. In [13, 17], the authors discovered that commuters prefer to use habitual routes when traveling in familiar areas as opposed to choosing a route that provides them with maximum utility. Other researchers, such as Doherty and Miller, found that apart from travel time, factors such as residential location, familiarity with the route, and employment location are significant in the route choice process.
Deep learning techniques have achieved success in a variety of tasks [19, 20, 21, 22, 23, 24]. Authors in [25, 26] showed that it is possible to compress heavy ensemble models, which require large amounts of storage and computational power, into a single small and efficient model without significant loss. Hinton et al. proposed a different compression technique, knowledge distillation in a neural network. This method showed excellent performance in distilling knowledge from heavy models.
Adaptive distillation loss (ADL) was proposed in  for single-stage detectors in the knowledge distillation setting. It magnifies distillation loss for “harder” examples while reducing the same for “easier” ones.
In , the authors proposed a method for selective knowledge distillation for solving the problem of low-resolution face recognition in the wild with low computational and memory cost. In , the authors showed that knowledge distillation can help optimize neural networks. In , the authors present a method for training smaller neural networks by distilling knowledge from a relatively bigger network. In , the authors present a framework for training models with specialized information in the question-answering space through knowledge distillation.
III Proposed Method
In this section, we present an overview of our proposed method. This is followed by a description of the IVE for SCE. Finally, we present our knowledge distillation-based framework for developing high-fidelity route choice models in Sections III-C and III-D.
We present a framework for developing high-fidelity route choice models with increased predictive power by augmenting existing aggregate level baseline models with contextual information obtained from SCE carried out in an IVE through the use of knowledge distillation.
The overall architecture of our framework is shown in Fig.2. As shown in the figure, for both the teacher and the student we use feature extraction networks (see below). The teacher is first pre-trained on data acquired from SCE in IVE (called VR data; see below). The basic data (see below) acquired from predictions by the baseline route choice model is partitioned into a training set and a test set. The training set is used for training a student as well as for distilling knowledge to the student from a teacher through knowledge distillation. The student is evaluated on the test set.
Basic Route Choice Model and Data We considered a basic mathematical route choice model adapted from one of the common route choice models  in literature that serves as the baseline model.
The baseline model predicts the probability () of exiting a highway through a given exit using the following equation,
where the constant is 0.601 , and is the travel time on the alternative route to a fixed destination after exiting the highway. We uniformly sampled 10,000 driving record samples from the probability distribution predicted by the baseline model for a highway segment with four exits, along with associated travel times for the alternative routes after exiting the highway through the available exits, for a fixed destination. We call this dataset the basic data.
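As a rough illustration of how the basic data could be generated, the sketch below assumes that the baseline takes a multinomial-logit form in which the exit probability decays with the alternative-route travel time, scaled by the constant 0.601. That functional form, the travel times, and the names `exit_probabilities` and `sample_records` are assumptions made for illustration only; they are not taken from the baseline model's definition.

```python
import math
import random

THETA = 0.601  # constant quoted in the text; its role as a logit scale is assumed


def exit_probabilities(travel_times, theta=THETA):
    """Assumed logit form: P_i is proportional to exp(-theta * t_i)."""
    # Subtracting the minimum time improves numerical stability and cancels out.
    t_min = min(travel_times)
    weights = [math.exp(-theta * (t - t_min)) for t in travel_times]
    total = sum(weights)
    return [w / total for w in weights]


def sample_records(travel_times, n, seed=0):
    """Draw n (exit, travel_time) records from the predicted distribution."""
    rng = random.Random(seed)
    probs = exit_probabilities(travel_times)
    exits = rng.choices(range(1, len(travel_times) + 1), weights=probs, k=n)
    return [(e, travel_times[e - 1]) for e in exits]


# Hypothetical travel times in minutes for the four exits.
times = [14.0, 12.0, 11.0, 13.0]
probs = exit_probabilities(times)
records = sample_records(times, 10_000)
```

Under this assumed form, the exit with the shortest alternative-route travel time receives the largest probability, and each sampled record carries the travel time of the exit it was assigned.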
III-B Immersive Virtual Environment and VR Data
Immersive virtual environments (IVEs) are popular for simulating environment spaces and objects in 3D and provide immersion in interactive surroundings with abundant multimedia information. We created multiple scenarios for a highway segment having four exits, together with embedded contextual factors (such as traffic conditions: normal, medium, heavy), to collect data from participant volunteer drivers about their interactions with these factors. These interactions, together with driver demographic backgrounds, are hard to acquire in the real world.
The dataset acquired from the IVE environment was augmented by using a Gaussian mixture model to learn the overall distribution from which IID (independent and identically distributed) samples are drawn. We call this set of samples the VR Data.
III-C Multilayer Perceptron Models for Feature Extraction
Let be the input features to a fully connected MLP (see Fig.3) with hidden layers (not including a softmax layer; only two hidden layers shown in Fig.3), called the feature extraction network. Let the parameters of be denoted by . Let the output response vector for be . Let the neurons in the hidden layers of have the nonlinear activation function . Then the feature vector from output layer of can be described by the following equation.
For the feature extraction network , we adopt softmax regression (as shown in Fig.3 as the function layer with parameters ) to obtain probabilities of exiting through the available “highway exits”. The probability that an individual driver exits through highway exit () for a feature vector (that characterizes the individual as well as the contextual factors they are subjected to) is given by the following equation.
For training , we use the cross-entropy loss, described by the following equation.
where is the feature vector, and is the response vector.
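The feature extraction network, its softmax layer, and the cross-entropy loss described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the layer sizes (12 inputs, hidden layers of 10 and 20 units, 4 exits), the random initialization, and all function names are illustrative choices, not the authors' implementation.

```python
import math
import random


def relu(v):
    return [max(0.0, a) for a in v]


def dense(x, weights, bias):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, hidden_layers, output_layer):
    """ReLU hidden layers (feature extraction) followed by a softmax layer."""
    h = x
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    return softmax(dense(h, *output_layer))


def cross_entropy(one_hot, probs, eps=1e-12):
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))


rng = random.Random(0)
hidden = [init_layer(12, 10, rng), init_layer(10, 20, rng)]
output = init_layer(20, 4, rng)
x = [rng.random() for _ in range(12)]  # one hypothetical feature vector
probs = forward(x, hidden, output)     # probabilities over the four exits
loss = cross_entropy([0, 1, 0, 0], probs)
```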
III-D Knowledge Distillation
Knowledge distillation, also known as model compression, aims to learn a small or shallow neural network (normally called the student model) with limited training examples and computational power by transferring the generalization ability from a large well-trained neural network (called the teacher model), as shown in Fig.2 and Fig.4. During training, the student model is guided by the teacher. The student attempts to match its softened softmax outputs with those of the teacher, and its hard softmax outputs with the ground truth. Given the outputs $z_i$ before the last layer of a neural network, usually called logits, the softmax transforms $z_i$ to a probability $q_i$ using the following equation.
$$q_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
Normal softmax tends to set the probability for one class to one and that for the rest to zero. This makes it hard to distill hidden knowledge to the student with the teacher as the source. To improve the generalization ability of the student model and efficiently use the hidden knowledge, Hinton et al. proposed a high-temperature softmax function in lieu of the normal softmax (temperature = 1 in this case). Then the probability is given by the following equation,
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
where $T$ is the temperature.
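The effect of the temperature can be seen in a small sketch. The logits and the temperature value below are hypothetical and chosen only to show how a higher temperature spreads probability mass across the exits.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [6.0, 2.0, 1.0, 0.5]  # hypothetical logits for four exits
hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=5.0)
```

With temperature 1 almost all of the mass sits on the first exit, while the softened output keeps the same ranking but assigns visibly more probability to the other exits, which is the information the student is meant to learn from.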
Let the input features be denoted by . Let the output from the teacher model with parameters be . Let the output from the student model with parameters be . Knowledge distillation from the teacher to the student is achieved by minimizing the distillation loss , between the networks, given by the following equation,
where is a batch of training data that has training examples, is the vector of corresponding labels for the training data . The distillation loss comprises of two losses, and (both cross entropy loss functions), which correspond to supervised loss constrained by the ground truth data and softened loss constrained by the teacher model, respectively. The constants and are non-negative and are used for balancing these two losses. Here, we adopt the cross-entropy loss () given by the following equation,
where is the th element of the th training example, is the size of training data, and is the input size.
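A minimal sketch of a distillation loss of the kind described above, assuming it is a weighted sum of a hard cross-entropy term against the labels and a soft cross-entropy term between the temperature-softened teacher and student outputs, averaged over the batch. The names `alpha`, `beta`, and `temperature`, their default values, and the example logits are placeholders rather than values reported in the paper.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, beta=0.5, temperature=5.0):
    """Batch mean of alpha * hard loss + beta * soft loss."""
    total = 0.0
    for z_s, z_t, y in zip(student_logits, teacher_logits, labels):
        # Hard term: student prediction against the one-hot ground truth.
        hard = cross_entropy(y, softmax(z_s))
        # Soft term: softened student output against softened teacher output.
        soft = cross_entropy(softmax(z_t, temperature), softmax(z_s, temperature))
        total += alpha * hard + beta * soft
    return total / len(labels)


# Hypothetical batch of two examples over four exits.
teacher = [[4.0, 1.0, 0.0, 0.0], [0.0, 3.0, 1.0, 0.0]]
labels = [[1, 0, 0, 0], [0, 1, 0, 0]]
good_student = [[4.0, 1.0, 0.0, 0.0], [0.0, 3.0, 1.0, 0.0]]
bad_student = [[0.0, 0.0, 4.0, 0.0], [3.0, 0.0, 0.0, 1.0]]

loss_good = distillation_loss(good_student, teacher, labels)
loss_bad = distillation_loss(bad_student, teacher, labels)
```

A student that agrees with both the labels and the teacher yields the smaller loss, which is the behavior the two balancing constants are meant to trade off.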
Both teacher and student models use feature extraction networks for extracting features. The teacher and student models are MLPs with softmax layer for classification. During each training iteration, the training set from the basic data is fed into both the teacher model and student model. The student model is trained using backpropagation with the ground truth as the hard target and the output of the teacher as the soft target. The algorithm for knowledge distillation is shown in Algorithm 1.
As shown in Algorithm 1, our framework is trained end to end. In each iteration, the algorithm computes the logits for the input data. The cross-entropy loss is then computed and its gradients are backpropagated through the student network. The parameters of the student network are updated using gradient descent.
After training, the student network is tested standalone on the test set.
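The training procedure described in this section can be sketched as the loop below. To keep the example short, it assumes a single linear layer as the student, a fixed linear function standing in for the pretrained teacher, and a toy two-feature dataset; `train_student`, `toy_teacher`, and all hyperparameter values are hypothetical and do not reproduce Algorithm 1 exactly.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]


def train_student(data, teacher_logits_fn, n_classes, epochs=30, lr=0.1,
                  alpha=0.5, beta=0.5, temperature=5.0):
    """Gradient descent on alpha * hard CE + beta * soft CE for a linear student."""
    n_features = len(data[0][0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    bias = [0.0] * n_classes
    history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, label in data:
            z_s = linear(x, weights, bias)
            q_hard = softmax(z_s)
            q_soft = softmax(z_s, temperature)
            p_soft = softmax(teacher_logits_fn(x), temperature)
            hard_loss = -math.log(q_hard[label] + 1e-12)
            soft_loss = -sum(p * math.log(q + 1e-12) for p, q in zip(p_soft, q_soft))
            epoch_loss += alpha * hard_loss + beta * soft_loss
            for k in range(n_classes):
                # Gradient of the combined loss with respect to student logit k.
                target = 1.0 if k == label else 0.0
                grad = alpha * (q_hard[k] - target)
                grad += beta * (q_soft[k] - p_soft[k]) / temperature
                for j in range(n_features):
                    weights[k][j] -= lr * grad * x[j]
                bias[k] -= lr * grad
        history.append(epoch_loss / len(data))
    return weights, bias, history


# Toy stand-in for a pretrained teacher: a fixed linear model over two features.
TEACHER_W = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]


def toy_teacher(x):
    return linear(x, TEACHER_W, [0.0] * 4)


rng = random.Random(0)
data = []
for _ in range(200):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    z = toy_teacher(x)
    data.append((x, z.index(max(z))))  # hard label taken from the teacher's argmax

weights, bias, history = train_student(data, toy_teacher, n_classes=4)
```

The list `history` holds the mean combined loss per epoch, so comparing its first and last entries gives a quick check that the student is actually learning from both targets.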
Table I: Description of the contextual factors.
| No. | Factor | Levels |
|---|---|---|
| 1 | Traffic | 1 Normal, 2 Medium, 3 Heavy |
| 2 | Urgency | 1 Urgency, 2 Non-Urgency |
| 3 | Social impact | 1 No, 2 Yes |
| 4 | Age | 1 less than 25, 2 greater than or equal to 25 |
| 5 | Gender | 1 Male, 2 Female |
| 6 | Race | 1 Middle Eastern, 2 White, 3 Other |
| 7 | Education | 1 Post graduate degree, 2 High school graduate, 3 College graduate |
| 8 | Employment status | 1 Employed part time, 2 Employed full time, 3 Student, 4 Unemployed looking for work |
| 9 | Concern while stuck in traffic | 1 Hours of extra travel time, 2 Chaos, 3 Monetized value of delay, 4 Speed reduction due to congestion |
| 10 | Familiarity with the environment | 1 Once a week, 2 Once a year, 3 Once a month, 4 More than once a week, 5 Never |
| 11 | Financial Concerns | 1 Sometimes, 2 Always, 3 Most of the time, 4 About half the time, 5 Never |
| 12 | Choice | 1 First exit, 2 Second exit, 3 Third exit, 4 Fourth exit, 5 Fifth exit |
IV Experimental Evaluation
Experimental evaluation will illustrate the performance of our framework with respect to predicting the probability that an individual driver takes each of the four exits in the highway segment considered. We start with the details of data collection.
IV-A Data Collection
The highway segment chosen for our experiment corresponds to the route of I-10 in Baton Rouge, Louisiana, United States, between the Horace Wilkinson Bridge and the intersection of Perkins Rd and Staring Ln, which has four exits in the middle. The travel times measured on the alternative route after taking each exit, provided by Google Maps (https://www.google.com/maps/) on September 20th, 2018, are min, min, min, and min.
IVE Experimental Setting In this study, we used a driving environment (see Fig.5) that is based on the I-10, starting off the Mississippi River (Horace Wilkinson) bridge all the way to the intersection of Perkins Rd and Staring Ln. Along the way, four alternate routes were introduced to the participants, Exits 1, 2, 3, and 4, the latter of which would be College Dr.
In our Immersive Virtual Environments (IVEs), we constructed ten experimental scenarios combining sets of contextual factors (see Table I for a description of the contextual factors).
We had forty-one volunteers (20 males and 21 females; age: ) participate in the SCE. At the beginning of the experiment, the following information was elicited from the participants through a survey: 1) demographic characteristics (age, gender, race, education, employment status); 2) top concerns while stuck in traffic congestion, with choices including hours of extra travel time, speed reduction, monetized value of delay, and additional vehicle operating cost; 3) familiarity with the area; 4) socio-economic status (concern about spending less money on gas).
Participants were presented with the same origin and the destination in all the scenarios. However, distinct contextual factor(s) were presented in each scenario and participants were required to choose their preferred route.
Each participant encountered each of the driving scenarios in addition to a baseline scenario. The baseline scenario was designed to collect information about a participant’s route choice in a normal traffic situation under a non-urgent condition. The traffic flow factor was varied over three levels, i.e., normal, medium, and heavy density, in the context of work-bound and home-bound trips; during the work-bound trip, participants were asked to consider the importance of meeting the time of arrival commitment, while no such factor was considered in the home-bound trip. The social impact factor considered the impact of other drivers’ route choices on a driver’s own route choice.
VR Data From the SCE in IVE involving volunteers, a total of 410 driving records were collected. Since the data collected from the SCE in the IVE is limited, we augmented it using a Gaussian mixture model (GMM) to better train the teacher network. The data collected was categorical; we preprocessed and transformed it to ordinal data before augmenting it using the GMM. After data augmentation, 10,000 synthetic driving records were generated. Each driving record was associated with the travel time of the alternative route for the exit taken by the driver. The 10,000 synthetic driving records together with their associated travel times are called the VR data. The VR data was divided by for training and for testing and the training set was used to train the teacher.
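The augmentation step can be illustrated with the sketch below, which draws synthetic records from an already-fitted Gaussian mixture and maps each draw back to a valid ordinal code. The fitting step itself (for example by expectation-maximization) is omitted, and the mixture parameters, the restriction to two factors, and the helper names are hypothetical.

```python
import random

# Ordinal codes for one categorical factor, following Table I.
TRAFFIC_CODES = {"Normal": 1, "Medium": 2, "Heavy": 3}


def sample_gmm(weights, means, stds, n, rng):
    """Draw n vectors from a diagonal-covariance Gaussian mixture."""
    samples = []
    for _ in range(n):
        # Pick a component, then sample each dimension independently.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append([rng.gauss(m, s) for m, s in zip(means[k], stds[k])])
    return samples


def to_ordinal(value, n_levels):
    """Round a continuous draw to the nearest valid code in 1..n_levels."""
    return min(max(int(round(value)), 1), n_levels)


# Categorical answers are first encoded as ordinal codes.
encoded = [TRAFFIC_CODES[t] for t in ["Heavy", "Normal"]]

rng = random.Random(0)
# Hypothetical fitted parameters over two factors: traffic (3 levels), urgency (2 levels).
weights = [0.6, 0.4]
means = [[1.2, 1.8], [2.8, 1.1]]
stds = [[0.4, 0.3], [0.3, 0.3]]
levels = [3, 2]

raw = sample_gmm(weights, means, stds, 10_000, rng)
synthetic = [[to_ordinal(v, n) for v, n in zip(row, levels)] for row in raw]
```

Rounding and clipping is one simple way to return to the ordinal scale; other mappings are possible and the text does not specify which one was used.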
For each exit, the probability of taking it was calculated based on the data collected in the IVE as shown in Fig.6. From Fig.6(a), it can be seen that the majority of drivers act consistently under heavy traffic, but use more options when they encounter medium traffic (Fig.6(b)). As indicated in Fig. 6(c), we did not consider the social impact factor in the normal traffic scenario (that is the impact on an individual driver on seeing a large number of drivers taking an exit) in our experiment due to the high cost in creating such a scenario.
Basic Data The probabilities for the four exits are then computed according to formula 1. Furthermore, for the basic data, we randomly sampled 10,000 driving records from the probability distribution predicted by the baseline route choice model. Each record was associated with the travel time of the alternative route for the exit taken by the driver. For each driving record, we assigned the value one to the Urgency variable if its actual value was less than or equal to 13 on a scale of 1 to 60; otherwise, it was assigned the value two. The contextual variables that are present in the VR data but do not occur in the basic data were set to zero. We used this augmented dataset for knowledge distillation. We divided this dataset into training () and testing sets ().
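The preparation of a basic-data record described above can be sketched as follows. The field names, the feature order, and the choice of eleven contextual factors plus travel time as the twelve inputs are assumptions made for illustration; only the threshold of 13 on the 1 to 60 scale and the zero-filling rule come from the text.

```python
# Hypothetical names for the eleven contextual factors of Table I.
FEATURES = ["traffic", "urgency", "social_impact", "age", "gender", "race",
            "education", "employment", "concern", "familiarity", "financial"]


def encode_urgency(raw_value, threshold=13):
    """Map a raw urgency score on the 1-60 scale to code 1 or 2."""
    if not 1 <= raw_value <= 60:
        raise ValueError("urgency score must lie in 1..60")
    return 1 if raw_value <= threshold else 2


def to_feature_vector(record):
    """Build a fixed-order vector; factors missing from the record become 0."""
    vector = [record.get(name, 0) for name in FEATURES]
    vector.append(record["travel_time"])
    return vector


# A basic-data record carries only urgency and travel time (values hypothetical).
record = {"urgency": encode_urgency(9), "travel_time": 12.0}
vec = to_feature_vector(record)
```

Zero is outside the coded range of every factor in Table I, so it acts as an explicit "not observed" marker for the contextual variables that exist only in the VR data.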
During knowledge distillation, the teacher model, pretrained on the augmented VR data, provides the prior knowledge for our framework.
While training the student model on the basic data, our framework uses the pretrained teacher model to guide the student. Specifically, it computes the cross-entropy loss between the softened softmax outputs of the teacher and the student during backpropagation in order to distill knowledge into the student model. During inference, we executed the student model standalone. It extracts features of each test data point through its dense layers and then predicts the route choice probability distribution.
Real Data We calculated the real probabilities of taking the exits from the data provided by the (name omitted to preserve anonymity). Given the traffic volumes captured at the four exits, the probability of taking exit $e$ is computed by,
$$P(e) = \frac{V_e}{\sum_{e' \in E} V_{e'}}$$
where $E$ is the set of exits and $V_e$ is the traffic volume at exit $e$. The real probabilities for the four exits are shown in Fig.7(d). The real probabilities will be used for evaluating the accuracy of the predictions by our framework.
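As a small worked example of this normalization, the sketch below converts per-exit traffic volumes into probabilities. The volume counts are hypothetical and are not the field data used in the paper.

```python
def exit_probabilities_from_volumes(volumes):
    """Share of the total traffic volume observed at each exit."""
    total = sum(volumes.values())
    if total <= 0:
        raise ValueError("total traffic volume must be positive")
    return {exit_id: v / total for exit_id, v in volumes.items()}


# Hypothetical vehicle counts at Exits 1 to 4.
volumes = {1: 1200, 2: 800, 3: 600, 4: 1400}
real_probs = exit_probabilities_from_volumes(volumes)
```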
IV-B Implementation details
Table II shows the architectures of the different teacher networks that we considered in our experiments. During experimentation, we varied the number of neurons in each ReLU-activated layer as well as the number of Dropout and hidden layers. In each architecture, the output from the last layer of the feature extraction network is input into a 4-way softmax layer that transforms the logits to a probability distribution over the four exits. From Table II, we can see that the teacher network architecture described in the last row has the best prediction result of 94.95% for VR data. Notation such as “10n-0.25DP” indicates that the network has a dense layer with 10 neurons followed by a Dropout layer with a ratio of 0.25. For the student network, we implemented a small feature extraction network with two dense layers of 10 and 20 neurons, both activated by ReLU functions. No dropout layer was used in the student network. Instead, we added a Batch Normalization layer to its second dense layer, and a 4-way softmax layer on top of the last layer. The inputs to both the teacher and student networks have 12 dimensions.
In our experiments, we computed the softened logits output from the last layer of our feature extraction networks, and then the softened softmax output is obtained by applying the softmax layer on the softened logits. The original ground truth data concatenated with the softened softmax outputs from teacher network is used for training and testing the student network. The predictions from the student network concatenated with its softened softmax outputs are used to compute the distillation loss in each iteration. Then we use backpropagation for updating the parameters of the student network using gradient descent. Finally, the standalone trained student network is used for inference.
We first trained our teacher network on VR data. The performance of the teacher network on the testing set of the VR data is shown in Fig.7(a) and in Table II. From Fig.7(a), we can see that the prediction accuracy of teacher network trained on the VR data improves fast along with increased epochs before stabilizing after epoch . Table III shows that the teacher network trained on the VR data achieves the best prediction accuracy of 97.93% on the test set of the basic data. For the student network in Fig.7(b), we can see that its prediction accuracy on the test set of the basic data improves with increasing number of epochs. The accuracy peaks at epoch 4, decreases after that before achieving a positive slope again at epoch 5. The best accuracy achieved by the student is 77.45% (Table II). The early peaking is due to the smaller size and the simplicity of the student network. Our framework uses knowledge distillation to direct the training of the student network on the prior knowledge acquired by the teacher from the VR data. Fig.7(c) shows that after the 6th epoch, the prediction accuracy of the augmented model (teacher-student network) on the test set of the basic data increases with a higher slope with respect to number of epochs than the student network. Initially, the prediction accuracy increases with increasing number of epochs as the ground truth loss (loss from hard target) dominates. Starting at epoch 4, where we see a dip, the loss from soft target starts dominating as the teacher guides the student.
After knowledge distillation from the teacher network to the student network, the prediction accuracy of the student network on the test set of the basic data improves abruptly: it achieves a classification accuracy of 95.2% on the basic data (see Fig.7(d)). Based on the prediction results from our framework on the basic data, the probability of an individual driver taking each exit is computed and plotted in Fig.7(d). It can be seen from Fig.7(d) that the predictions of our framework are better than those of the baseline model: our results are closer to the real data and follow a similar trend except at Exit 1. The prediction at Exit 1 was heavily dominated by a large number of discrepancies in drivers’ actions as seen from the VR data. However, overall, our framework shows better fidelity than the baseline route choice model. Thus, using knowledge distillation, we have augmented a baseline model with contextual information acquired from SCE to obtain a model with higher fidelity.
In this paper, we proposed a novel approach for developing high-fidelity route choice models with increased predictive power by augmenting existing aggregate level models with contextual information obtained from SCE carried out in an IVE through the use of knowledge distillation. To this end, we presented a general end-to-end knowledge distillation framework that uses a multilayer perceptron as a feature extraction network to provide a feature learning architecture for teacher and student networks and then transfers knowledge from the former to the latter by optimizing distillation loss. Experimental results have demonstrated that the predictions of the augmented models produced by our approach are much closer to reality than those of the baseline.
This research was supported by Transportation Consortium of South-Central States (Tran-SET) Award No 18ITSLSU09/69A3551747016. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsor.
-  D. Schrank, T. Lomax, and S. Turner, “Urban mobility report texas transportation institute,” Texas: Texas Transportation Institute, 2009.
-  V. P. Ravindra Gudishala, Chester Wilmot, “Diverted traffic measurement,” n.d.
-  S. Shojaat, J. Geistefeldt, S. A. Parr, C. G. Wilmot, and B. Wolshon, “Sustained flow index: Stochastic measure of freeway performance,” Transportation Research Record: Journal of the Transportation Research Board, no. 2554, pp. 158–165, 2016.
-  E. Ben-Elia and Y. Shiftan, “Which road do i take? a learning-based model of route-choice behavior with real-time information,” Transportation Research Part A: Policy and Practice, vol. 44, no. 4, pp. 249–264, 2010.
-  C. G. Prato, “Route choice modeling: past, present and future research directions,” Journal of choice modelling, vol. 2, no. 1, pp. 65–100, 2009.
-  Q. Liu, S. Kumar, and V. Mago, “Safernet: Safe transportation routing in the era of internet of vehicles and mobile crowd sensing,” in 2017 14th IEEE Annual Consumer Communications & Networking Conference (CCNC). IEEE, 2017, pp. 299–304.
-  M. Ben-Akiva and M. Bierlaire, “Discrete choice models with applications to departure time and route choice,” in Handbook of transportation science. Springer, 2003, pp. 7–37.
-  A. Lima, R. Stanojevic, D. Papagiannaki, P. Rodriguez, and M. C. González, “Understanding individual routing behaviour,” Journal of The Royal Society Interface, vol. 13, no. 116, 2016.
-  K. Park, M. Bell, I. Kaparias, and K. Bogenberger, “Learning user preferences of route choice behaviour for adaptive route guidance,” IET Intelligent Transport Systems, vol. 1, no. 2, pp. 159–166, 2007.
-  M. E. Ben-Akiva and S. R. Lerman, Discrete choice analysis: theory and application to travel demand. MIT press, 1985, vol. 9.
-  X. Di and H. X. Liu, “Boundedly rational route choice behavior: A review of models and methodologies,” Transportation Research Part B: Methodological, vol. 85, pp. 142–179, 2016.
-  M. Pursula and A. Talvitie, “Urban route choice modelling with multinomial logit models,” LIIKENNETEKNIIKKA, TIEDOTE, no. 28, 1993.
-  A. J. Khattak, J. L. Schofer, and F. S. Koppelman, “Commuters’ enroute diversion and return decisions: analysis and implications for advanced traveler information systems,” Transportation Research Part A: Policy and Practice, vol. 27, no. 2, pp. 101–111, 1993.
-  P. H. Bovy and E. Stern, Route Choice: Wayfinding in Transport Networks. Springer Science & Business Media, 2012, vol. 9.
-  F. Weidner, A. Hoesch, S. Poeschl, and W. Broll, “Comparing vr and non-vr driving simulations: An experimental user study,” in Virtual Reality (VR), 2017 IEEE. IEEE, 2017, pp. 281–282.
-  Q. C. Ihemedu-Steinke, R. Erbach, P. Halady, G. Meixner, and M. Weber, “Virtual reality driving simulator based on head-mounted displays,” in Automotive User Interfaces. Springer, 2017, pp. 401–428.
-  P. Bonsall and T. Parry, “Drivers’ requirements for route guidance,” in Road Traffic Control, 1990., Third International Conference on. IET, 1990, pp. 1–5.
-  S. T. Doherty and E. J. Miller, “A computerized household activity scheduling survey,” Transportation, vol. 27, no. 1, pp. 75–97, 2000.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  S. Basu, S. Mukhopadhyay, M. Karki, R. DiBiano, S. Ganguly, R. R. Nemani, and S. Gayaka, “Deep neural networks for texture classification - A theoretical analysis,” Neural Networks, vol. 97, pp. 173–182, 2018.
-  S. Basu, S. Ganguly, R. R. Nemani, S. Mukhopadhyay, G. Zhang, C. Milesi, A. R. Michaelis, P. Votava, R. Dubayah, L. Duncanson, B. D. Cook, Y. Yu, S. Saatchi, R. DiBiano, M. Karki, E. Boyda, U. Kumar, and S. Li, “A semiautomated probabilistic framework for tree-cover delineation from 1-m NAIP imagery using a high-performance computing architecture,” IEEE Trans. Geoscience and Remote Sensing, vol. 53, no. 10, pp. 5690–5708, 2015.
-  M. Karki, Q. Liu, R. DiBiano, S. Basu, and S. Mukhopadhyay, “Pixel-level reconstruction and classification for noisy handwritten bangla characters,” in 16th International Conference on Frontiers in Handwriting Recognition, ICFHR 2018, Niagara Falls, NY, USA, August 5-8, 2018, 2018, pp. 511–516.
-  H. Nguyen, L. Kieu, T. Wen, and C. Cai, “Deep learning methods in transportation domain: a review,” IET Intelligent Transport Systems, vol. 12, no. 9, pp. 998–1004, 2018.
-  S. Basu, S. Ganguly, S. Mukhopadhyay, R. DiBiano, M. Karki, and R. R. Nemani, “Deepsat: a learning framework for satellite imagery,” in Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, Bellevue, WA, USA, November 3-6, 2015, 2015, pp. 37:1–37:10.
-  C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 535–541.
-  Q. Liu and S. Mukhopadhyay, “Unsupervised learning using pretrained cnn and associative memory bank,” in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 01–08.
-  G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
-  S. Tang, L. Feng, W. Shao, Z. Kuang, W. Zhang, and Y. Chen, “Learning efficient detector with semi-supervised adaptive distillation,” arXiv preprint arXiv:1901.00366, 2019.
-  S. Ge, S. Zhao, C. Li, and J. Li, “Low-resolution face recognition in the wild via selective knowledge distillation,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 2051–2062, 2019.
-  J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 7130–7138.
-  Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” arXiv preprint arXiv:1706.00384, vol. 6, 2017.
-  J. Mun, K. Lee, J. Shin, and B. Han, “Learning to specialize with knowledge distillation for visual question answering,” in Advances in Neural Information Processing Systems, 2018, pp. 8092–8102.
-  M. E. Ben-Akiva, M. S. Ramming, and S. Bekhor, “Route choice models,” in Human Behaviour and Traffic Networks. Springer, 2004, pp. 23–45.