ExpertMatcher: Automating ML Model Selection for Users in Resource Constrained Countries

Vivek Sharma
MIT
vvsharma@mit.edu
Praneeth Vepakomma
MIT
vepakom@mit.edu
Tristan Swedish
MIT
tswedish@mit.edu
Ken Chang
MIT
kenchang@mit.edu
Jayashree Kalpathy-Cramer
MGH/Harvard Medical School
kalpathy@nmr.mgh.harvard.edu
Ramesh Raskar
MIT
raskar@mit.edu
Abstract

In this work we introduce ExpertMatcher, a method for automating deep learning model selection using autoencoders. Specifically, we are interested in performing inference on data sources that are distributed across many clients using pretrained expert ML networks on a centralized server. The ExpertMatcher assigns the most relevant model(s) in the central server given the client’s data representation. This allows resource-constrained clients in developing countries to utilize the most relevant ML models for their given task without having to evaluate the performance of each ML model. The method is generic and can be beneficial in any setup where there are local clients and numerous centralized expert ML models.

1 Introduction

In developing countries, there are often scenarios where (1) the amount of data is scarce and training a deep model from scratch is not feasible; (2) there is a lack of computational resources to train effective deep learning models; or (3) there are beginner users who have minimal or no expertise in ML. In contrast, the abundance of pre-trained models makes inference accessible to users who do not have access to data, computational resources, or domain expertise. Thanks to modern cloud infrastructure and distributed learning methods (Vepakomma et al., 2018b, 2019, a; Chang et al., 2018), “expert model hubs” can make inference available to any remote client connected to the internet. This allows the remote client to utilize powerful expert models for their local applications.

Model selection is an important problem that cannot be sidestepped simply by training a large-capacity multi-task model to perform the inference directly. In practice, training a single model for multiple tasks while maintaining performance across tasks is challenging due to catastrophic forgetting, a phenomenon in which sequential learning of a new task leads to lower performance on previously learned tasks (Goodfellow et al., 2013). As such, most deep learning models are trained to be domain-specific. Even within a domain, models trained with different datasets may differ in performance due to variations in demographics, class prevalence, data collection instrument, and data acquisition settings (Zech et al., 2018; Tomašev et al., 2019; AlBadawy et al., 2018; Ting et al., 2017).

The key idea in this paper is to match input data based on its likelihood of being drawn from the distribution of the expert model's training data. To this end, we seek a general representation of each dataset that can be easily compared to an input data sample. We propose an autoencoder (AE) based expert matcher that learns the underlying representation for a given dataset and automatically triggers the expert network when a client's data representation matches the AE representation. This allows the client to effectively utilize centralized expert networks for the given task. Figure 2 sketches an overview of the proposed method. The main contributions of the paper are: (1) we describe the landscape of expert matching; and (2) we propose an expert matcher for automatic ML model selection for users in resource-constrained settings when sharing client data with the server is not a concern. Our proposed approach is evaluated on 6 benchmark datasets: STL-10 (Coates et al., 2011), MNIST (LeCun et al., 1998), HAR (Anguita et al., 2013), Reuters (Lewis et al., 2004), Non Line of Sight (Tancik et al., 2018) and Diabetic Retinopathy (Graham, 2015). We show that ExpertMatcher can be used to match both the task (coarse-level) and the class (fine-grained). Matching the class mimics the scenario where there are multiple trained models for the same task, each with a differently biased training set. Fine-grained matching allows finding the model trained on data with a class distribution most similar to the local data.

The rest of the paper is organized as follows. In Section 2, we discuss related work. Section 3 describes our proposed approach. Experimental results and their analysis are presented in Section 4.

2 Related Work

Jacobs et al. (1991) proposed the first examples of using multiple expert models, with each expert model handling a subset of tasks. They trained an adaptive mixture of experts for speaker vowel recognition and used a gating network to determine which of the networks should be used for each sample. The gating function learned which training sample needed to be passed to which expert. For a more detailed review, please see Jacobs (1995, 1997). To avoid the gating function, Hinton et al. (2015) and Ahmed et al. (2016) trained a mixture in which one oracle model provides common knowledge to many specialist experts in the form of shared features. The oracle model acts as a gating function for passing the sample to the expert network. The oracle model must see the whole training data to make an accurate assignment and must be retrained when a new dataset is added. The same holds for Aljundi et al. (2017), who also learn a gating function to make expert network assignments. We are inspired by all of these works, but our work differs substantially from them in scope and technical approach: we use a simple autoencoder and assign the expert network based on the reconstruction loss for each sample.

Figure 1: Landscape of ExpertMatcher

3 ExpertMatcher Problem

We aim to assign an expert network to a given sample from the client data, with the goal of dynamically handling sample-specific correction.

Figure 1 shows the landscape of the ExpertMatching problem. Given the distributed nature of the problem, it is important to consider what data/model the client and server share with each other and the respective privacy trade-offs. In general, three main aspects guide the ExpertMatcher: (1) Resolution: coarse- and fine-level assignment; (2) Fusion: using the top-1 or top-K expert models; (3) Metric: an ad hoc (e.g., MSE or cosine similarity) or learnable assignment metric. In this work we consider the first sharing scenario (row one in Figure 1), where the client and server share the data and model.

Figure 2: ExpertMatcher. Overview of the proposed unified distributed learning using the expert matcher. The ExpertMatcher works hierarchically: it first triggers the best model for the client's data representation in coarse-level assignment (CA), and then handles the client's data at the class level in fine-grained assignment (FA).

Our approach has three key qualities: (i) modularity: the client can easily benefit when new expert models are added on the server; (ii) efficiency: it does not need the client to train a model; and (iii) expert-free: specialized or expert knowledge can help the clients without any overhead to solve their task at hand.

We describe our proposed solution to the ExpertMatcher problem as follows, considering coarse and fine matching resolution: (1) Coarse-level expert matcher (CA), and (2) Fine-grained expert matcher (FA). Figure 2 sketches the pipeline.

Coarse-level expert matcher (CA). We assume that we have $K$ pre-trained autoencoders (AEs) for $K$ datasets on the server. The AEs are trained with a reconstruction loss, i.e., the mean squared error (MSE) between the target data $x$ and the network’s predicted output data $\hat{x}$. The reconstruction loss is given by $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n} \lVert x - \hat{x} \rVert_2^2$, where $n$ is the input dimension. The intermediate features extracted from a hidden layer for a sample $x$ are given as $h \in \mathbb{R}^{d}$. To compare the reconstruction output $\hat{x}$ with the ideal $x$, we use MSE loss as a measure of quality, although we note that more complex loss functions could be used.

For expert assignment of client data, we assign the AE with the minimum reconstruction error, which indicates that the semantic feature space of the client data matches that of the underlying AE.
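The coarse-level rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the two toy "autoencoders" (`ae_low`, `ae_high`), the dataset names, and the helper names `mse`, `rank_experts`, and `coarse_assign` are all hypothetical stand-ins for trained models on the server.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Autoencoder = Callable[[Vector], List[float]]  # maps a sample to its reconstruction


def mse(x: Vector, x_hat: Vector) -> float:
    """Mean squared error between a sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def rank_experts(x: Vector, autoencoders: Dict[str, Autoencoder]) -> List[Tuple[str, float]]:
    """Return (dataset name, reconstruction error) pairs, lowest error first."""
    errors = [(name, mse(x, ae(x))) for name, ae in autoencoders.items()]
    return sorted(errors, key=lambda pair: pair[1])


def coarse_assign(x: Vector, autoencoders: Dict[str, Autoencoder]) -> str:
    """Pick the dataset whose autoencoder reconstructs x best (top-1)."""
    return rank_experts(x, autoencoders)[0][0]


# Hypothetical stand-ins for trained autoencoders: each one can only
# reproduce part of the input, as if it had learned a single data source.
def ae_low(x: Vector) -> List[float]:
    half = len(x) // 2
    return list(x[:half]) + [0.0] * (len(x) - half)


def ae_high(x: Vector) -> List[float]:
    half = len(x) // 2
    return [0.0] * half + list(x[half:])


experts = {"dataset_low": ae_low, "dataset_high": ae_high}
sample = [0.9, 0.8, 0.0, 0.1]
ranking = rank_experts(sample, experts)
print(coarse_assign(sample, experts), ranking)
```

Returning the full ranking rather than only the winner keeps the sketch compatible with top-K fusion: taking the first K entries of `ranking` would select the K closest experts.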

Fine-grained expert matcher (FA). Suppose an autoencoder is trained on a dataset with $N$ classes; e.g., MNIST has 10 digits ($N = 10$). Once an AE is trained for each dataset, we obtain the hidden representation of the samples in each class and compute an average representation of each class in the dataset; e.g., for MNIST we have 10 mean representations, one for each class, $\mu_c \in \mathbb{R}^{d}$ for $c = 1, \dots, N$, where $d$ is the hidden layer dimension and $N$ is the number of object classes, which varies for each dataset.

For fine-level expert assignment of client data $x$ with hidden representation $h$, we assign the expert model whose class mean $\mu_c$ has the highest cosine similarity with $h$.
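As a rough illustration of the fine-grained step, the sketch below averages hidden representations per class and then assigns a new sample to the class mean with the highest cosine similarity. The two-dimensional hidden vectors, the labels, and the function names (`class_means`, `fine_assign`) are hypothetical; in practice the hidden vectors would come from the bottleneck layer of the AE selected at the coarse level.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def class_means(hidden: List[Vector], labels: List[Hashable]) -> Dict[Hashable, List[float]]:
    """Average the hidden representations of the server samples per class."""
    groups = defaultdict(list)
    for h, y in zip(hidden, labels):
        groups[y].append(h)
    return {y: [sum(col) / len(hs) for col in zip(*hs)] for y, hs in groups.items()}


def fine_assign(h: Vector, means: Dict[Hashable, List[float]]) -> Hashable:
    """Return the class whose mean representation is most similar to h."""
    return max(means, key=lambda y: cosine_similarity(h, means[y]))


# Hypothetical hidden representations for two classes on the server.
server_hidden = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
server_labels = [0, 0, 1, 1]
means = class_means(server_hidden, server_labels)

client_hidden = [0.2, 0.9]  # hidden representation of one client sample
print(fine_assign(client_hidden, means))
```

Cosine similarity ignores vector length, so under this sketch two samples whose hidden activations differ only in scale are assigned to the same class.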

This mimics a scenario in which there are different centralized models that have biased training sets (i.e., class imbalances) and you want to locate the model trained on data that has the most similar class distribution to your local dataset.

In the current setup, privacy is not a concern. The client shares their dataset with online services.

4 Experiments

Our experimental setting is illustrated in Figure 2: a random sample is selected by the client from a collection of data sources, and the ExpertMatcher “Server” must select the correct corresponding model. We demonstrate that the ExpertMatcher can distinguish between a set of data sources that cover domains such as text, digits, objects, sensor data, and medical images. We first introduce the datasets and evaluation metric, followed by a thorough evaluation of the proposed method for both coarse assignment (CA) and fine-grained assignment (FA).

Figure 3: Example images for a few samples from our Non-Line of Sight (NLOS) and Diabetic Retinopathy (DB) datasets. We show one sample per class for both datasets. The coarse similarity between classes makes the datasets challenging.

Datasets. We present a summary of the datasets used in this work in Table 1. We report the number of samples used by the server to train an autoencoder for each dataset. We also report the number of samples for both clients. We divide each dataset into non-overlapping splits of 50/25/25% for Server, Client A, and Client B, respectively. Additionally, some datasets have a varying class distribution, indicated by the skew between the largest class (LC) and the smallest class (SC). Figure 3 shows a few example Non-Line of Sight (NLOS) and Diabetic Retinopathy (DB) samples included in our datasets.

| | STL-10 | MNIST | HAR | REUTERS | NLOS | DB | ALL |
|---|---|---|---|---|---|---|---|
| Type | Object | Digits | Sensors | Text | Sensor | Biological | |
| #C | 10 | 10 | 6 | 4 | 3 | 3 | |
| #S | 13k | 10k | 10299 | 10k | 45096 | 3540 | |
| Dim. | 32px | 28px | 561 | 2000 | 512px | | |
| LC/SC (%) | 10/10 | 11.35/8.92 | 19/14 | 43.12/8.14 | 33.33/33.33 | 33.33/33.33 | |
| Server | 6500 | 5000 | 5151 | 5000 | 22548 | 1770 | |
| Client A | 3250 | 2500 | 2574 | 2500 | 11274 | 885 | 22983 |
| Client B | 3250 | 2500 | 2574 | 2500 | 11274 | 885 | 22983 |

Table 1: Datasets. “#S” denotes the number of samples, “#C” denotes the true number of classes/clusters, and largest class (LC) / smallest class (SC) is the class balance in percent of the given data.
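The Server/Client counts in Table 1 are consistent with a simple rule in which each client receives a quarter of the samples, rounded down, and the server keeps the remainder. The sketch below reproduces those counts; the rounding rule, the shuffling, the seed, and the function names are assumptions, since the text only states the 50/25/25% proportions.

```python
import random
from typing import List, Tuple


def split_sizes(n: int) -> Tuple[int, int, int]:
    """Sizes for (server, client A, client B) under a 50/25/25% split.

    Assumption: each client gets floor(n / 4) samples and the server
    keeps whatever is left, which matches the counts listed in Table 1.
    """
    client = n // 4
    server = n - 2 * client
    return server, client, client


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and cut them into three non-overlapping parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary choice
    server_n, client_n, _ = split_sizes(n)
    server = indices[:server_n]
    client_a = indices[server_n:server_n + client_n]
    client_b = indices[server_n + client_n:]
    return server, client_a, client_b


# Sample counts (#S) from Table 1.
dataset_sizes = {
    "STL-10": 13000, "MNIST": 10000, "HAR": 10299,
    "REUTERS": 10000, "NLOS": 45096, "DB": 3540,
}
for name, n in dataset_sizes.items():
    print(name, split_sizes(n))
```

Summing the client size over all six datasets gives 22983, the value in the ALL column of Table 1.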

Evaluation Metric. For CA, we use the minimum reconstruction error (i.e., MSE) to make a model assignment and then compute the accuracy between the predicted dataset and the target dataset. For FA, we use the maximum cosine similarity to make a class assignment and then compute the accuracy between the predicted class and the target class.

Multiple Clients. The performance is evaluated for two subsets of non-overlapping client data, providing a rough measure of the expected variance of ExpertMatcher performance if clients draw different samples from the same underlying distribution.
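For concreteness, the accuracy described above can be computed as below. The helper names and the toy prediction lists are hypothetical; the per-dataset values passed to `mean_accuracy` are the Client A entries of Table 3, used only to illustrate that an unweighted mean over datasets is consistent with the Average column reported there.

```python
from typing import Hashable, Sequence


def accuracy(predicted: Sequence[Hashable], target: Sequence[Hashable]) -> float:
    """Percentage of samples whose predicted label equals the target label."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        raise ValueError("at least one sample is required")
    correct = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * correct / len(target)


def mean_accuracy(per_dataset: Sequence[float]) -> float:
    """Unweighted mean of per-dataset accuracies."""
    return sum(per_dataset) / len(per_dataset)


# Toy coarse-assignment example: predicted vs. true source dataset.
predicted = ["MNIST", "HAR", "MNIST", "DB"]
target = ["MNIST", "HAR", "STL-10", "DB"]
print(accuracy(predicted, target))

# Client A accuracies from Table 3 (MNIST, STL-10, HAR, REUTERS, NLOS, DB).
client_a = [100.0, 100.0, 100.0, 99.64, 99.92, 96.49]
print(round(mean_accuracy(client_a), 2))
```

The same `accuracy` function applies to FA by passing predicted and target class labels instead of dataset names.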

Implementation Details. We adapt the input images by resizing them to $28 \times 28$ and then flattening them to 784 dimensions. For HAR and Reuters, we apply 1D adaptive average pooling (AdaptiveAvgPool1d) to transform the input data to 784 dimensions. AE: Our AE uses a single-layer MLP encoder-decoder. MLP: The network comprises fully-connected layers followed by an $n$-way softmax layer, where $n$ is the number of dataset categories. The AE and MLP model parameters are trained using the Adam optimizer. The learning rate is manually decreased by a factor of 10 every 15 epochs. The maximum number of epochs is set to 45. We use batch normalization.
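To make the preprocessing and the schedule concrete, the following pure-Python sketch shows one way 1D adaptive average pooling can map a vector of any length to 784 values, together with the step decay described above. The floor/ceil bin boundaries are an assumption about how such a layer partitions its input rather than a statement about the authors' code, and the initial learning rate of 1.0 is a placeholder, not the paper's setting.

```python
from typing import List, Sequence


def adaptive_avg_pool_1d(values: Sequence[float], output_size: int) -> List[float]:
    """Average `values` into `output_size` bins that together cover the input.

    Bin i spans [floor(i * n / out), ceil((i + 1) * n / out)), so the same
    rule shrinks long inputs and stretches short ones by repeating entries.
    """
    n = len(values)
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size
        end = ((i + 1) * n + output_size - 1) // output_size  # integer ceil
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


def step_decay_lr(initial_lr: float, epoch: int,
                  factor: float = 10.0, step: int = 15) -> float:
    """Learning rate after dividing by `factor` every `step` epochs."""
    return initial_lr / factor ** (epoch // step)


har_like = [float(i) for i in range(561)]       # HAR feature length in Table 1
reuters_like = [float(i) for i in range(2000)]  # Reuters feature length in Table 1
print(len(adaptive_avg_pool_1d(har_like, 784)),
      len(adaptive_avg_pool_1d(reuters_like, 784)))

# Placeholder initial rate; 45 epochs gives three plateaus of 15 epochs each.
print([step_decay_lr(1.0, e) for e in (0, 15, 30)])
```

With 45 epochs and a decay every 15 epochs, training under this schedule runs at three learning-rate levels: the initial rate, one tenth of it, and one hundredth of it.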

4.1 Coarse-level dataset assignment (CA)

| Method | Client A | Client B | Samples |
|---|---|---|---|
| AE-MSE | 99.94 | 99.91 | 10824 |
| MLP-Softmax | 99.95 | 99.97 | 10824 |

Table 2: ExpertMatcher. Average accuracy (%) for assigning STL-10, MNIST, HAR, REUTERS datasets.

In Table 3, we show the results for the coarse-level dataset assignment. Autoencoders are very effective at learning the semantics of the underlying data representation even with relatively low input resolution (e.g., $28 \times 28$ for images). Samples from both Client A and Client B are assigned to their corresponding autoencoders with an average accuracy above 99%.

In Table 2, we compare autoencoder assignment based on MSE loss with an MLP that uses softmax classification for dataset assignment. AE-based assignment is nearly as accurate as MLP-based assignment, within 0.06 percentage points on both clients. However, the MLP cannot do fine-grained assignment unless it is trained to recognize both dataset and class as a single multi-class classification problem.

| | MNIST | STL-10 | HAR | REUTERS | NLOS | DB | Average |
|---|---|---|---|---|---|---|---|
| Client A | 100.0 | 100.0 | 100.0 | 99.64 | 99.92 | 96.49 | 99.34 |
| Client B | 100.0 | 100.0 | 100.0 | 99.56 | 99.89 | 95.36 | 99.13 |

Table 3: Coarse level dataset assignment using MSE loss as the assignment metric for computing accuracy (%).

| Dataset | #Classes | Client A | Client B |
|---|---|---|---|
| MNIST | 10 | 84.36 | 83.40 |
| NLOS | 3 | 71.78 | 71.26 |
| DB | 3 | 41.47 | 44.41 |

Table 4: Fine-grained class assignment using cosine-similarity as the assignment metric for computing accuracy (%).

4.2 Fine-grained class assignment (FA)

In Table 4, we show the results for the fine-grained class assignment. Table 3 showed that autoencoders learn the underlying dataset representation; Table 4 shows that they also capture class identity, reaching about 84% accuracy on MNIST and 71% on NLOS, but only 41 to 44% on DB. Note that fine-grained recognition is a very hard problem: the examples in Figure 3 show that the class categories are very similar.

5 Conclusion

We demonstrate ExpertMatcher, a model selection method for deep learning models in a distributed setting. Our method matches data inputs from remote clients to pre-trained models on a central server for inference. ExpertMatcher trains autoencoders for each model’s training dataset, and uses either reconstruction error or cosine similarity of the autoencoder’s bottleneck layer for either coarse or fine-grained matching respectively. Our method is able to distinguish between many common benchmark datasets at a coarse and fine level with reasonable accuracy. We also demonstrate that our method can be applied to practical domain specific models for Non-line-of-sight and Diabetic Retinopathy.

Acknowledgements V. Sharma would like to thank Karlsruhe House of Young Scientists (KHYS) for funding his research stay at MIT.

References
