Privacy-Preserving Deep Visual Recognition: An Adversarial Learning Framework and A New Dataset


Haotao Wang*, Zhenyu Wu*, Zhangyang Wang, Zhaowen Wang, and Hailin Jin

Haotao Wang, Zhenyu Wu, and Zhangyang Wang are with the Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77840. E-mail: {htwang, wuzhenyu_sjtu, atlaswang}
Zhaowen Wang and Hailin Jin are with Adobe Research, San Jose, CA 95110. E-mail: {zhawang, hljin}
The first two authors contributed equally to this work. Correspondence to: Zhangyang Wang.

This paper aims to boost privacy-preserving visual recognition, an increasingly demanded feature in smart camera applications, using deep learning. We formulate a unique adversarial training framework that learns a degradation transform for the original video inputs, in order to explicitly optimize the trade-off between target task performance and the associated privacy budgets on the degraded video. We carefully analyze and benchmark three different optimization strategies to train the resulting model. Notably, the privacy budget, often defined and measured in task-driven contexts, cannot be reliably indicated by any single model's performance, because a strong protection of privacy has to sustain against any possible model that tries to hack privacy information. To tackle this problem, we propose two strategies, model restarting and model ensemble, which can be easily plugged into our training algorithms and further improve the performance. Extensive experiments have been carried out and analyzed.

On the other hand, few public datasets are available with both utility and privacy labels provided, leaving the power of data-driven (supervised) learning not yet fully unleashed on this task. We first discuss an innovative heuristic of cross-dataset training and evaluation, which jointly utilizes two datasets with target task and privacy labels, respectively, for adversarial training. To further alleviate this challenge, we have constructed a new dataset, termed PA-HMDB51, with both target task (action) and selected privacy attributes (gender, age, race, nudity, and relationship) labeled on a frame-wise basis. This first-of-its-kind video dataset further validates the effectiveness of our proposed framework, and opens up new opportunities for the research community. Our codes, models, and the PA-HMDB51 dataset will all be made publicly available.

Visual privacy, action recognition, privacy-preserving learning, adversarial learning.

1 Introduction

Smart surveillance or smart home cameras, such as Amazon Echo and Nest Cam, are now found in millions of locations to remotely link users to their homes or offices, providing monitoring services to enhance security and/or notify environment changes, as well as lifelogging and intelligent services. The convenience and benefits, however, come at the heavy price of privacy intrusion from time to time. Due to their computationally demanding nature, not all visual recognition tasks can be run at the resource-limited local device end, making transmitting (part of) the data to the cloud indispensable. However, users have expressed growing concerns about the abuse of their uploaded data by the cloud service provider - a potentially malicious authorized party. That is different from traditional privacy concerns, which mostly arise from the unsecured channel between cloud and device (e.g., malicious third-party eavesdropping), and therefore demands new solutions.

Is it at all possible to alleviate the privacy concerns, without compromising user convenience? At first glance, the question itself is posed as a dilemma: we would like a camera system to recognize important events and assist human daily life by understanding its videos, while preventing it from obtaining sensitive visual information (such as faces, gender, skin color, etc.) that can intrude on individual privacy. It thus becomes a new and appealing problem to find an appropriate transform to "sanitize" the collected raw visual data at the local end, so that the transformed data will only enable certain target tasks while obstructing undesired privacy-related tasks. Recently, some new video acquisition approaches [1, 2, 3] proposed to intentionally capture or process videos at extremely low resolutions to create privacy-preserving "anonymized videos", and showed promising empirical results.

This paper seeks to take a first step towards addressing this brand new challenge of privacy-preserving visual recognition, via the following multi-fold contributions:

  • A General Adversarial Training and Evaluation Framework. We formulate the privacy-preserving visual recognition in a unique adversarial training framework. The framework explicitly optimizes the trade-off between target task performance and associated privacy budgets, by learning a transform from the original videos to anonymized videos. To mitigate the training instability, we then carefully design and experimentally compare three different optimization strategies. We eventually define a novel two-fold protocol, to evaluate the trained models in terms of not only the utility, but also the privacy protection generalizability against unseen privacy hackers.

  • Practical Approximations of "Universal" Privacy Protection. The privacy budget in our framework cannot be simply defined w.r.t. one single privacy attribute prediction model, as the ideal protection of privacy has to be universal and model-agnostic, i.e., obstructing every possible attacker model from predicting privacy information. To resolve this so-called "∀ challenge", we propose two strategies, restarting and ensembling, to enhance the generalization capability of the learned degradation in defending against unseen models.

  • A New Dataset for this New Problem. There are few off-the-shelf datasets that have both utility and privacy attribute annotations. To alleviate the lack of appropriate training data, we (in our previous work [4]) introduce a “cross-dataset training” alternative, based on the hypothesis that privacy attributes have good “transferability”. In this paper, we construct the very first dataset named PA-HMDB51 (Privacy Attribute HMDB51), for the task of privacy-preserving action recognition from video. The dataset consists of 580 videos originally from HMDB51. For each video, we have annotated both utility (action) and privacy (five attributes including skin color, face, gender, nudity and relationship) labels on a frame-wise basis. We benchmark our proposed framework on the new PA-HMDB51 and validate its effectiveness.

The paper is a significant extension of the previous conference version [4]. We have included: (1) a detailed discussion and comparison of three optimization strategies for the proposed framework (only one heuristic was presented in [4]); (2) a much expanded experimental and analysis section; and most importantly (3) the construction of the new PA-HMDB51 dataset, and the associated benchmarking efforts.

2 Related Work

2.1 Privacy Protection in Computer Vision

With pervasive cameras in surveillance and smart home devices, privacy-preserving visual recognition has drawn increasing interest from both industry and academia. Most classical cryptographic solutions secure the communication against unauthorized access from attackers. However, they are not immediately applicable to preventing authorized agents (such as the backend analytics) from the unauthorized abuse of information, which causes privacy breach concerns. A few encryption-based solutions, such as Homomorphic Encryption (HE) [5, 6], were developed to locally encrypt visual information. The server can only access the enciphered data and conduct the utility task on it. However, many encryption-based solutions incur high computational costs on local platforms. It is also challenging to generalize the cryptosystems to more complicated classifiers. [7] combined the detection of regions of interest with real encryption techniques to improve privacy while allowing general surveillance to continue. A seemingly reasonable, and computationally cheaper, option is to extract feature descriptors from raw images and transmit those features only. Unfortunately, a previous study [8] revealed that considerable details of the original images could still be recovered from standard HOG or SIFT features (even though they look visually distinct from natural images).

An alternative toward a privacy-preserving vision system concerns the concept of anonymized videos. Such videos are intentionally captured or processed to be in special low-quality conditions that only allow for the recognition of some target events or activities, while avoiding the unwanted leak of the identity information of the human subjects in the video [3, 2, 1]. [1] showed that even at extremely low resolutions, reliable action recognition could be achieved by learning appropriate downsampling transforms, with neither unrealistic activity-location assumptions nor extra specific hardware resources. The authors empirically verified that conventional face recognition easily failed on the generated low-resolution videos. [2] uses image operations like blurring and superpixel clustering to get anonymized videos, while [3] uses low-resolution camera hardware to capture extremely low-resolution videos as anonymized videos. [9] uses cartoon-like effects with a customized version of mean shift filtering. [10, 11] proposed to use privacy-preserving optics to filter sensitive information from the incident light-field before sensor measurements are made, by k-anonymity and defocus blur. Earlier work [12] explored privacy-preserving tracking and coarse pose estimation using a network of ceiling-mounted time-of-flight low-resolution sensors. [13] adopted a network of ceiling-mounted binary passive infrared sensors. However, both works [12, 13] handled only a limited set of activities performed in specific constrained areas of a room. The usage of low-resolution anonymized videos [3, 1] is computationally cheaper, and is also compatible with sensor and bandwidth constraints. However, [3, 1] remain empirical in protecting privacy. In particular, neither were their models learned towards protecting any visual privacy, nor were the privacy-preserving effects carefully analyzed and evaluated.
In other words, privacy protection in [3, 1] came as a "side product" of down-sampling, and was not a result of any optimization. The authors of [3, 1] also did not extend their efforts to studying deep learning-based recognition, making their task performance less competitive. The recent progress of low-resolution object recognition [14, 15] also puts their privacy protection effects in jeopardy.

2.2 Privacy Protection in Social Media/Photo Sharing

User privacy protection is also a topic of extensive interest in the social media field, especially for photo sharing. The most common means to protect user privacy in an uploaded photo is to add empirical obfuscations, such as blurring, mosaicing or cropping out certain regions (usually faces) [16]. However, extensive research showed that such empirical means can be easily hacked too [17, 18]. A recent work [19] described a game-theoretical system in which the photo owner and the recognition model strive for the antagonistic goals of disabling and enabling recognition, and better obfuscation could be learned from their competition. However, it was only designed to confuse one specific recognition model, via finding its adversarial perturbations. That can cause obvious overfitting, as simply changing to another recognition model will likely render the learning efforts in vain: such perturbations cannot even protect privacy from human eyes. Their problem setting thus deviates far from our target problem. Another notable difference is that in social photo sharing, we usually hope to cause minimum perceptual quality loss to those photos after applying any privacy-preserving transform to them. The same concern does not exist in our scenario, allowing us to explore much freer, even aggressive, image distortions.

A useful resource to us was found in [20], which defined concrete privacy attributes and correlated them to image content. The authors categorized possible private information in images, and then ran a user study to understand the privacy preferences. They then provided a sizable set of 22k images annotated with 68 privacy attributes, on which they trained privacy attribute predictors.

3 Method

3.1 Problem definition

Assume our training data $X$ (raw visual data captured by the camera) are associated with a target task $T$ and a privacy budget $B$. Since $T$ is usually a supervised task, e.g., action recognition or visual tracking, a label set $Y_T$ is provided on $X$, and a standard cost function $L_T$ (e.g., cross-entropy) is defined to evaluate the task performance. There usually also exists a state-of-the-art deep neural network $f_T$ which takes $X$ as input and predicts the target labels. On the other hand, we need to define a budget cost function $L_B$ to evaluate the privacy leakage of its input data: the smaller $L_B$ is, the less privacy information its input contains.

We seek an optimal degradation function $f_d$ that transforms the original $X$ into $f_d(X)$, which serves as the common input for both the target model and the budget cost, and an optimal target model $f_T$, such that:

  • $f_d(X)$ has filtered out the privacy information contained in $X$, i.e., the privacy budget cost $L_B(f_d(X))$ is low;

  • the performance of the target task is minimally affected when using the degraded visual data $f_d(X)$ compared to when using the original data $X$, i.e., $L_T(f_T(f_d(X)), Y_T)$ stays close to the loss attainable on $X$.

To achieve these two goals, we mathematically formulate the problem as solving the following optimization problem:

$$\min_{f_d,\, f_T} \; L_T\big(f_T(f_d(X)),\, Y_T\big) + \gamma\, L_B\big(f_d(X)\big) \quad (1)$$

The definition of the privacy budget cost $L_B$ is not straightforward. Practically, it needs to be placed in concrete application contexts, often in a task-driven way. For example, in smart workplaces or smart homes with video surveillance, one might often want to avoid a disclosure of the face or identity of persons. Therefore, to reduce $L_B$ could be interpreted as to suppress the success rate of identity recognition or verification. Other privacy-related attributes, such as race, gender, or age, can be similarly defined too. We denote the privacy-related annotations (such as identity labels) as $Y_B$, and rewrite $L_B(f_d(X))$ as $L_b(f_b(f_d(X)), Y_B)$, where $f_b$ denotes the budget model, which takes (degraded or original) visual data as input and predicts the corresponding privacy information. Different from $L_T$, minimizing $L_B$ will encourage the prediction $f_b(f_d(X))$ to diverge from $Y_B$.

Such a supervised, task-driven definition of $L_B$ poses at least two challenges: (1) Dataset challenge: The privacy budget-related annotations $Y_B$ often have less availability than target task labels. Specifically, it is often challenging to have both $Y_T$ and $Y_B$ ready on the same $X$; (2) ∀ challenge: Considering the nature of privacy protection, it is not sufficient to merely suppress the success rate of one $f_b$ model. Instead, defining a privacy prediction function family $\mathcal{P}$, the ideal privacy protection of $f_d$ should be reflected as suppressing every possible model $f_b \in \mathcal{P}$. That diverges from the common supervised training goal, where only one model needs to be found to successfully fulfill the target task.

We address the Dataset challenge in two ways: (1) cross-dataset training and validation: see section 3.6; and, more importantly, (2) building a new dataset containing both utility and privacy labels: see section 5. We defer their discussion to the respective experimental paragraphs.

Handling the ∀ challenge is harder. We first re-write the general form (1) with the task-driven definition of $L_B$ as follows:

$$\min_{f_d,\, f_T} \; L_T\big(f_T(f_d(X)),\, Y_T\big) + \gamma \sup_{f_b \in \mathcal{P}} \big[-L_b\big(f_b(f_d(X)),\, Y_B\big)\big] \quad (2)$$

The ∀ challenge is, in essence, the infeasibility of directly solving (2), due to the supremum taken over the whole function family $\mathcal{P}$. So we propose to solve the following relaxed problem instead:

$$\min_{\theta_d,\, \theta_T} \max_{\theta_b} \; L_T\big(f_T(f_d(X; \theta_d); \theta_T),\, Y_T\big) - \gamma\, L_b\big(f_b(f_d(X; \theta_d); \theta_b),\, Y_B\big) \quad (3)$$

where $f_d$ has a fixed form and is parameterized by $\theta_d$. Similarly, $f_T$ and $f_b$ are parameterized by $\theta_T$ and $\theta_b$, correspondingly. Note that problem (3) is a rather loose approximation to problem (2), since a single parameterized $f_b$ stands in for the whole family. But experimental results demonstrate that optimizing equation 3 can already achieve results much better than the baseline methods. We further propose "model ensemble" and "model restarting" (see section 3.4) to handle the ∀ challenge better and further boost the experimental results.

Without loss of generality, we assume both $f_T$ and $f_b$ to be classification models that output class labels. To optimize the target task performance, $L_T$ can simply be chosen as the KL divergence between the (one-hot) target labels and the prediction: $L_T = \mathrm{KL}(Y_T \,\|\, f_T(f_d(X)))$. The definition of $L_B$, however, is not as trivial. We discuss different forms of $L_B$, and correspondingly different optimization strategies, in sections 3.3.1-3.3.3.
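As a concrete illustration of these loss choices, the following is a minimal pure-Python sketch of the KL divergence and entropy computations on discrete prediction vectors (the function names and the small smoothing constant are our own, not taken from the paper's code):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) of a discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)

# A one-hot target label and two candidate target-model predictions:
y = [1.0, 0.0, 0.0]
good = [0.9, 0.05, 0.05]   # close to the label -> small L_T
bad  = [0.1, 0.45, 0.45]   # far from the label -> large L_T
assert kl_divergence(y, good) < kl_divergence(y, bad)
```

In practice these quantities would be computed on softmax outputs of deep networks; the scalar version above only illustrates the direction of each loss.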

3.2 Basic framework

Figure 1 depicts the basic framework implementing the proposed formulation 2. The framework consists of three parts: the degradation model $f_d$, the target model $f_T$ and the budget model $f_b$. $f_d$ takes the raw video $X$ as input, filters out the privacy information contained in $X$, and outputs the anonymized video $X' = f_d(X)$. $f_T$ takes $X'$ as input and carries out the target task. $f_b$ also takes $X'$ as input and tries to predict the privacy information from it. All three models are implemented as deep neural networks and their parameters are learnable during the training procedure. The entire pipeline is trained under the guidance of the hybrid loss of $L_T$ and $L_B$. The goal of the training procedure is to find a degradation model that filters out the privacy information contained in the original video while keeping the information useful for the utility task, and a target model that achieves good performance on the target task using the degraded videos. Similar frameworks have been used in feature disentanglement [21, 22, 23, 24]. After training, the learned degradation model can be deployed on the local device (e.g., a smart camera). We can convert the raw video to the degraded video locally and only transfer the degraded video over the Internet to the backend (e.g., the cloud) for target task analysis, so that the privacy information contained in the raw videos is never visible to the backend.

Fig. 1: Basic adversarial training framework for privacy-preserving visual recognition.

Specifically, $f_d$ is implemented using the model in [25], which can be taken as a 2D-convolution based frame-level filter. In other words, $f_d$ converts each frame in $X$ into a feature map of the same shape as the original frame. We use the state-of-the-art human action recognition model C3D [26] as $f_T$, and state-of-the-art image classification models, such as ResNet [27] and MobileNet [28], as $f_b$. Since the action recognition model we use is C3D, we need to split the videos into clips with a fixed number of frames. Each clip is a 4D tensor of shape $T \times W \times H \times C$, where $T$ is the number of frames in each clip and $W$, $H$, $C$ are the width, height and number of color channels of each frame. Unlike $f_T$, which takes a 4D tensor as one input sample, $f_b$ takes a 3D tensor (i.e., a single frame) as input. We average the logits over the temporal dimension of each video clip to calculate $L_B$ and predict the budget task label.
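The temporal averaging of the frame-level budget logits can be sketched as follows (a toy illustration with hypothetical names, standing in for real network outputs):

```python
def clip_budget_logits(frame_logits):
    """Average per-frame budget-model logits over the temporal axis.

    frame_logits: list of T per-frame logit vectors (each a list of C floats).
    Returns one C-dimensional logit vector for the whole clip.
    """
    T = len(frame_logits)
    C = len(frame_logits[0])
    return [sum(f[c] for f in frame_logits) / T for c in range(C)]

# A toy 3-frame clip with 2 budget classes:
logits = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
avg = clip_budget_logits(logits)
assert avg == [1.0, 1.0]
```

The clip-level budget prediction is then the argmax (or softmax) of the averaged vector, so a single privacy label is produced per clip even though $f_b$ operates frame by frame.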

3.3 Optimization Strategies

Similar to GANs [29], our model is prone to collapsing and getting stuck at bad local minima during training. We thus carefully design three different optimization schemes to solve this hard optimization problem.

3.3.1 Gradient reverse layer (GRL)

We can consider problem 3 as a saddle point problem:

$$\min_{\theta_d,\, \theta_T} \max_{\theta_b} \; L_T(\theta_d, \theta_T) - \gamma\, L_b(\theta_d, \theta_b),$$

where both $L_T$ and $L_b$ are KL divergence loss functions.

GRL [30] is a state-of-the-art algorithm for solving such saddle point problems. The underlying mathematical gist is simply to solve the problem by following these update rules:

$$\theta_d \leftarrow \theta_d - \alpha_d \nabla_{\theta_d}\big(L_T(\theta_d, \theta_T) - \gamma L_b(\theta_d, \theta_b)\big), \quad (4a)$$
$$\theta_T \leftarrow \theta_T - \alpha_T \nabla_{\theta_T} L_T(\theta_d, \theta_T), \quad (4b)$$
$$\theta_b \leftarrow \theta_b - \alpha_b \nabla_{\theta_b} L_b(\theta_d, \theta_b). \quad (4c)$$

We denote this method as Ours-KL in the following parts; the formal description is in algorithm 1.

Input: Target labels $Y_T$, budget labels $Y_B$, visual data $X$, step sizes $\alpha_d$, $\alpha_T$, $\alpha_b$, accuracy thresholds $acc_T$, $acc_b$, maximum iteration max_iter
Output: Degradation model parameter $\theta_d$, target model parameter $\theta_T$ and budget model parameter $\theta_b$
1 initialize $\theta_d$, $\theta_T$ and $\theta_b$;
2 for $i \leftarrow 1$ to max_iter do
3       Update $\theta_d$ using equation 4a
4       while target task validation accuracy $< acc_T$ do
5             Update $\theta_T$ using equation 4b
6       end while
7       while budget task training accuracy $< acc_b$ do
8             Update $\theta_b$ using equation 4c
9       end while
10 end for
Algorithm 1 Ours-KL algorithm
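The mechanism of the gradient reverse layer itself is simple enough to sketch directly: the layer is the identity in the forward pass, and multiplies the incoming gradient by a negative factor in the backward pass, so that an ordinary descent step on the combined loss *ascends* the budget loss w.r.t. the degradation parameters. A minimal illustration (names and the toy gradient values are ours):

```python
def grl_backward(upstream_grad, lam=1.0):
    """Gradient reversal layer: identity in the forward pass; in the
    backward pass, multiply the incoming gradient by -lam so that the
    parameters upstream of the layer maximize the downstream loss."""
    return [-lam * g for g in upstream_grad]

# Toy check: the degradation parameters receive the *negated* budget
# gradient, turning gradient descent into adversarial ascent on L_b.
budget_grad = [0.5, -0.2]
assert grl_backward(budget_grad, lam=2.0) == [-1.0, 0.4]
```

In a deep learning framework this would be implemented as a custom autograd operation inserted between the degradation model and the budget model, which yields exactly update rule (4a) when combined with a standard optimizer.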

3.3.2 Maximize entropy

Updating $\theta_d$ to maximize the cross-entropy loss might not be the best choice in our setting. We do not need the budget model to misclassify data to a false class with high confidence. For example, if a sample has the one-hot ground-truth label $(1, 0, \ldots, 0)$, one of the global optima for maximizing the cross-entropy is the prediction $(0, 1, 0, \ldots, 0)$. We do not need to push $f_b$'s output all the way to this point, where $f_b$ misclassifies this training sample with high confidence. Instead, a more reasonable output for $f_b$ is somewhere near the uniform distribution, which means $f_d$ has filtered out most of the information that is necessary for the budget task. Based on these intuitions, we formulate a new optimization scheme:

$$\min_{\theta_d,\, \theta_T} \; L_T(\theta_d, \theta_T) - \gamma\, H\big(f_b(f_d(X; \theta_d); \theta_b)\big), \quad (5)$$

where $L_T$ and $L_b$ are still KL divergence loss functions as in equation 4, and $H(\cdot)$ is the entropy of the budget model's prediction.

In practice, we update $\theta_d$ and $\theta_T$ in an end-to-end way when minimizing $L_T$. That is to say, the optimization scheme is actually as follows:

$$\theta_d \leftarrow \theta_d - \alpha_d \nabla_{\theta_d}\big(L_T(\theta_d, \theta_T) - \gamma H(f_b(f_d(X; \theta_d); \theta_b))\big), \quad (6a)$$
$$(\theta_d, \theta_T) \leftarrow (\theta_d, \theta_T) - \alpha_T \nabla_{(\theta_d, \theta_T)} L_T(\theta_d, \theta_T), \quad (6b)$$
$$\theta_b \leftarrow \theta_b - \alpha_b \nabla_{\theta_b} L_b(\theta_d, \theta_b). \quad (6c)$$

We denote this method as Ours-Entropy in the following parts and the formal description is in algorithm 2.

Input: Target labels $Y_T$, budget labels $Y_B$, visual data $X$, step sizes $\alpha_d$, $\alpha_T$, $\alpha_b$, accuracy thresholds $acc_T$, $acc_b$, maximum iteration max_iter
Output: Degradation model parameter $\theta_d$, target model parameter $\theta_T$ and budget model parameter $\theta_b$
1 initialize $\theta_d$, $\theta_T$ and $\theta_b$;
2 for $i \leftarrow 1$ to max_iter do
3       Update $\theta_d$ using equation 6a
4       while target task validation accuracy $< acc_T$ do
5             Update $\theta_d$, $\theta_T$ using equation 6b
6       end while
7       while budget task training accuracy $< acc_b$ do
8             Update $\theta_b$ using equation 6c
9       end while
10 end for
Algorithm 2 Ours-Entropy algorithm
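The intuition behind the entropy objective can be checked numerically: entropy is maximized exactly at the uniform distribution, so pushing the budget model's output toward uniform (rather than toward a confident wrong class) is the "leave it guessing" target. A small sketch (the example distributions are ours):

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi + eps) for pi in p)

confident_wrong = [0.0, 1.0, 0.0]     # high-confidence misclassification
near_uniform    = [0.34, 0.33, 0.33]  # budget model left guessing
assert entropy(near_uniform) > entropy(confident_wrong)

# Entropy peaks exactly at the uniform distribution, at log(C):
assert abs(entropy([1/3, 1/3, 1/3]) - math.log(3)) < 1e-6
```

Maximizing this entropy in (6a) therefore drives the degraded video toward containing no budget-task-discriminative information, rather than toward adversarially fooling one specific classifier.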

3.3.3 Alternative optimization of two loss functions

The goal in equation 3 can also be formulated as alternately solving the following two optimization problems:

$$\min_{f_d,\, f_T} \; L_T\big(f_T(f_d(X)),\, Y_T\big), \quad (7a)$$
$$\min_{f_d} \max_{f_b} \; -L_b\big(f_b(f_d(X)),\, Y_B\big). \quad (7b)$$

With a slight abuse of notation, we rewrite these two loss functions in terms of the neural network parameters:

$$\min_{\theta_d,\, \theta_T} \; L_T(\theta_d, \theta_T), \quad (8a)$$
$$\min_{\theta_d} \max_{\theta_b} \; -L_b(\theta_d, \theta_b). \quad (8b)$$

Equation 8a is an ordinary minimization problem, which can be solved by training $f_d$ and $f_T$ in an end-to-end fashion. Equation 8b is a minimax problem, which we solve with the recent state-of-the-art minimax algorithm K-beam [31]. K-beam keeps track of $K$ different sets of budget model parameters, denoted as $\theta_b^{(1)}, \ldots, \theta_b^{(K)}$, during training, and alternately updates $\theta_d$ and the $\theta_b^{(k)}$. More specifically, each iteration of the training procedure can be divided into two phases: the $\theta_d$ step and the $\theta_b$ step. Suppose at the $i$-th iteration the parameters of the degradation and budget models are $\theta_d^i$ and $\theta_b^{(1),i}, \ldots, \theta_b^{(K),i}$, correspondingly. During the $\theta_d$ step, it first selects $k^* = \arg\min_k L_b(\theta_d^i, \theta_b^{(k),i})$, i.e., the strongest current attacker, and updates $\theta_d$ by gradient descent on $-L_b(\theta_d, \theta_b^{(k^*),i})$ to get $\theta_d^{i+1}$. During the $\theta_b$ step, it updates all budget model parameters separately by gradient descent on $L_b(\theta_d^{i+1}, \theta_b^{(k),i})$ to get $\theta_b^{(k),i+1}$. We refer the readers to the original paper for more details.

Based on the K-beam algorithm, we design algorithm 3 to alternately solve the two problems in equation 8. We denote this method as Ours-Kbeam in the following parts.

Input: Target labels $Y_T$, budget labels $Y_B$, visual data $X$, step sizes $\alpha_d$, $\alpha_T$, $\alpha_b$, budget model beam number $K$, accuracy thresholds $acc_T$, $acc_b$, numbers of iterations max_iter, d_iter
Output: Degradation model parameter $\theta_d$, target model parameter $\theta_T$ and budget model parameters $\theta_b^{(1)}, \ldots, \theta_b^{(K)}$
1 initialize $\theta_d$, $\theta_T$ and $\theta_b^{(1)}, \ldots, \theta_b^{(K)}$;
2 for $i \leftarrow 1$ to max_iter do
3       $\theta_d$ step:
4       for $j \leftarrow 1$ to d_iter do
5             Select $k^* = \arg\min_k L_b(\theta_d, \theta_b^{(k)})$ and update $\theta_d$ by gradient descent on $-L_b(\theta_d, \theta_b^{(k^*)})$
6       end for
7       $\theta_b$ step:
8       for $k \leftarrow 1$ to $K$ do
9             while budget task training accuracy $< acc_b$ do
10                   Update $\theta_b^{(k)}$ by gradient descent on $L_b(\theta_d, \theta_b^{(k)})$
11             end while
12       end for
13       $\theta_T$ step:
14       while target task validation accuracy $< acc_T$ do
15             Update $\theta_T$ by gradient descent on $L_T(\theta_d, \theta_T)$
16       end while
17 end for
Algorithm 3 Ours-Kbeam algorithm
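The key non-obvious piece of the $\theta_d$ step is the beam selection: among the $K$ tracked budget models, the degradation model is updated only against the one with the smallest budget loss, i.e., the strongest current attacker. A minimal sketch of that selection (function name and loss values are ours, not from the K-beam reference implementation):

```python
def select_strongest_attacker(budget_losses):
    """K-beam theta_d step: among K tracked budget models, return the
    index of the one with the smallest budget loss (the strongest
    current attacker); theta_d is then updated against it only."""
    return min(range(len(budget_losses)), key=lambda k: budget_losses[k])

# Losses of K = 4 tracked budget models on the current degraded batch:
losses = [1.7, 0.4, 2.9, 1.1]
assert select_strongest_attacker(losses) == 1
```

Keeping $K > 1$ candidates and always defending against the current best one is what distinguishes K-beam from plain alternating gradient descent/ascent on a single adversary.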

3.4 Addressing the ∀ Challenge

To improve the generalization ability of the learned $f_d$ over all possible budget models $f_b \in \mathcal{P}$ (i.e., privacy cannot be reliably predicted by any model), we hereby discuss two simple and easy-to-implement options. Other more sophisticated model re-sampling or model search approaches, such as [32], will be explored in future work.

3.4.1 Budget Model Restarting

At a certain point of training (e.g., when the privacy budget stops decreasing any further), we replace the current weights in $f_b$ with random weights. Such random restarting aims to avoid trivial overfitting between $f_d$ and $f_b$ (i.e., $f_d$ being specialized only at confusing the current $f_b$), without incurring more parameters. We then train the new $f_b$ to be a strong competitor w.r.t. the current $f_d$: specifically, we freeze the training of $f_d$ and $f_T$, and change the objective for $f_b$ to minimizing $L_b$, until the new $f_b$ has been trained from scratch to become a strong privacy prediction model over the current degraded videos. We then resume adversarial training by unfreezing $f_d$ and $f_T$, as well as switching the loss used to update $f_d$ back to the negative entropy. This restarting can be repeated several times.

3.4.2 Budget Model Ensemble

The other strategy proposes to approximate the continuous function family $\mathcal{P}$ with a discrete set of sample functions. Assuming a budget model ensemble $\{f_b^{(1)}, \ldots, f_b^{(M)}\}$, we turn to minimizing the following discretized surrogate of 2:

$$\min_{f_d,\, f_T} \; L_T\big(f_T(f_d(X)),\, Y_T\big) + \gamma \max_{1 \le k \le M} \big[-L_b\big(f_b^{(k)}(f_d(X)),\, Y_B\big)\big]. \quad (9)$$

The previous basic framework is a special case of equation 9 with $M = 1$. The ensemble strategy can be easily combined with restarting.
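The surrogate in (9) only penalizes the worst case over the ensemble, i.e., the member currently leaking the most privacy. This can be sketched in a few lines (function name and the toy loss values are ours):

```python
def ensemble_budget_term(budget_losses, gamma=1.0):
    """Discretized surrogate of the sup over the model family:
    penalize only the ensemble member with the smallest budget loss,
    i.e., the one with the largest privacy leakage."""
    return gamma * max(-lb for lb in budget_losses)

losses = [2.0, 0.5, 1.3]  # member 1 (loss 0.5) leaks the most privacy
assert ensemble_budget_term(losses) == -0.5

# M = 1 recovers the basic single-budget-model framework:
assert ensemble_budget_term([0.5]) == -0.5
```

Because only the max term receives gradient, each update of $f_d$ defends against the strongest sampled attacker, while the other members keep training and can become the max at later iterations.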

3.4.3 Combining Budget Model Restarting and Budget Model Ensemble with Ours-Entropy

Budget Model Restarting and Budget Model Ensemble can be easily combined with all three optimization schemes described in section 3.3.1-3.3.3. We take Ours-Entropy as an example here.

When model ensemble is used, we denote the ensemble members as $f_b^{(1)}, \ldots, f_b^{(M)}$, and take $k^* = \arg\min_k H\big(f_b^{(k)}(f_d(X))\big)$ in equation 6a. That is to say, when updating the degradation model, we only suppress the model with the largest privacy leakage, i.e., the one "most confident" about its current privacy prediction. But we still update all budget models in equation 6c. So the parameter updating scheme is:

$$\theta_d \leftarrow \theta_d - \alpha_d \nabla_{\theta_d}\big(L_T(\theta_d, \theta_T) - \gamma H(f_b^{(k^*)}(f_d(X; \theta_d); \theta_b^{(k^*)}))\big), \quad (10a)$$
$$(\theta_d, \theta_T) \leftarrow (\theta_d, \theta_T) - \alpha_T \nabla_{(\theta_d, \theta_T)} L_T(\theta_d, \theta_T), \quad (10b)$$
$$\theta_b^{(k)} \leftarrow \theta_b^{(k)} - \alpha_b \nabla_{\theta_b^{(k)}} L_b(\theta_d, \theta_b^{(k)}), \quad k = 1, \ldots, M. \quad (10c)$$

The formal description of Ours-Entropy algorithm with model restarting/ensemble is given in algorithm 4.

Input: Target labels $Y_T$, budget labels $Y_B$, visual data $X$, step sizes $\alpha_d$, $\alpha_T$, $\alpha_b$, model ensemble number $M$, accuracy thresholds $acc_T$, $acc_b$, maximum iteration max_iter, restarting period $r$.
Output: Degradation model parameter $\theta_d$, target model parameter $\theta_T$ and budget model parameters $\theta_b^{(1)}, \ldots, \theta_b^{(M)}$
1 initialize $\theta_d$, $\theta_T$ and $\theta_b^{(1)}, \ldots, \theta_b^{(M)}$;
2 for $i \leftarrow 1$ to max_iter do
3       if $i \bmod r = 0$ then
4             Reinitialize $\theta_b^{(1)}, \ldots, \theta_b^{(M)}$
5       end if
6       Update $\theta_d$ using equation 10a
7       while target task validation accuracy $< acc_T$ do
8             Update $\theta_d$, $\theta_T$ using equation 10b
9       end while
10      for $k \leftarrow 1$ to $M$ do
11            while budget task training accuracy $< acc_b$ do
12                  Update $\theta_b^{(k)}$ using equation 10c
13            end while
14      end for
15 end for
Algorithm 4 Ours-Entropy algorithm (with model restarting and model ensemble)

3.5 Two-Fold Evaluation Protocol

Since it must balance two competing task models, the evaluation protocol for privacy-preserving visual recognition is inevitably more intricate than that of classical visual recognition tasks.

After we obtain $f_d^*$, $f_T^*$ and $f_b^*$ by solving problem 3, we need to evaluate the performance in two folds: (1) whether the learned target task model maintains satisfactory performance on degraded videos; (2) whether the performance of an arbitrary privacy prediction model on degraded videos deteriorates. Suppose we have a training set $X^{tr}$ with target and budget task ground-truth labels $Y_T^{tr}$ and $Y_B^{tr}$, and an evaluation set $X^{ev}$ with ground-truth labels $Y_T^{ev}$ and $Y_B^{ev}$. The first fold can follow the traditional evaluation routine: compare $f_T^*(f_d^*(X^{ev}))$ with $Y_T^{ev}$ to get the evaluation accuracy on the target task, denoted as $A_T$, which we expect to be as high as possible.

For the second fold, it is apparently insufficient to only observe that the learned $f_d^*$ and $f_b^*$ lead to poor classification accuracy on $X^{ev}$, because of the ∀ challenge: the attacker can select any budget model from $\mathcal{P}$ to try to steal privacy information from the degraded videos. To empirically verify that $f_d^*$ prohibits reliable privacy prediction by other possible budget models, we propose a novel procedure: we randomly re-sample $N$ models $f_b^{(1)}, \ldots, f_b^{(N)}$ from $\mathcal{P}$. Then we train these models on the degraded training videos $f_d^*(X^{tr})$ to make correct predictions on privacy information, i.e., to minimize $L_b(f_b^{(i)}(f_d^*(X^{tr})), Y_B^{tr})$ for $i = 1, \ldots, N$. (Note that $f_d^*$ is fixed during this training procedure.) After that, we apply each $f_b^{(i)}$ to the degraded evaluation videos $f_d^*(X^{ev})$ and compare the outputs with $Y_B^{ev}$ to get the evaluation budget accuracy of the $i$-th budget model. We select the highest accuracy among all $N$ budget models and use it as the final budget accuracy $A_B^N$, which we expect to be as low as possible.
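The second fold of the protocol reduces to a worst-case aggregation over the re-trained attackers, which can be sketched as follows (function names and the toy predictions are ours):

```python
def budget_accuracy(preds, labels):
    """Fraction of samples where a budget model predicts privacy correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def worst_case_budget_accuracy(per_model_preds, labels):
    """Second evaluation fold: re-train N budget models on the degraded
    videos, then report the *highest* accuracy any of them reaches."""
    return max(budget_accuracy(p, labels) for p in per_model_preds)

labels = [0, 1, 1, 0]
attackers = [[0, 0, 1, 0],   # 75% correct
             [1, 1, 1, 1],   # 50% correct
             [0, 1, 0, 0]]   # 75% correct
assert worst_case_budget_accuracy(attackers, labels) == 0.75
```

Reporting the max (rather than the mean) over attackers is the conservative choice: a privacy protection is only as strong as its weakest point against the best available attacker.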

3.6 Cross-Dataset Training and Evaluation: An initial step towards alleviating the dataset challenge

An ideal dataset to train our framework would be, for example, a set of human action videos with both action classes and privacy attributes labeled. However, to the best of our knowledge, no public dataset satisfies this condition well. We propose cross-dataset training and evaluation as a workaround. In brief, we train action recognition (target task) on human action datasets, such as UCF101 [33] and HMDB51 [34], and train privacy protection (budget task) on the visual privacy dataset VISPR [20], while letting the two interact via their shared component, the learned degradation model $f_d$.

More specifically, during training we have two pipelines: one is $f_d$ + $f_T$, trained on UCF101 or HMDB51 for action recognition; the other is $f_d$ + $f_b$, trained on VISPR to suppress multiple privacy attribute prediction. The two pipelines share the same parameters for $f_d$. The initialization and alternating training strategy remain unchanged from the SBU experiments, as shown in algorithm 2. During evaluation, we perform the first part of the two-fold evaluation, i.e., action recognition, on the UCF101 or HMDB51 testing set. We then evaluate the privacy protection performance against budget models using the VISPR testing set. Such cross-dataset training and evaluation sheds light on training privacy-preserving recognition models, even under the practical shortage of datasets annotated for both tasks.
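The alternation over the two pipelines can be sketched with a toy scalar stand-in for the shared degradation parameters (everything here, including the gradient callables, is a hypothetical illustration, not the paper's training code):

```python
def cross_dataset_epoch(theta_d, action_batches, privacy_batches,
                        grad_target, grad_budget, lr=0.1, gamma=1.0):
    """One pass of cross-dataset training: the target gradient comes from
    action-labeled batches (e.g. UCF101/HMDB51), the adversarial budget
    gradient from privacy-labeled batches (e.g. VISPR); both update the
    shared degradation parameter theta_d."""
    for a_batch, p_batch in zip(action_batches, privacy_batches):
        g = grad_target(theta_d, a_batch) - gamma * grad_budget(theta_d, p_batch)
        theta_d = theta_d - lr * g
    return theta_d

# Toy scalar gradients standing in for backprop through f_T and f_b:
theta = cross_dataset_epoch(
    theta_d=1.0,
    action_batches=[None], privacy_batches=[None],
    grad_target=lambda t, b: 2.0,   # utility pull on theta_d
    grad_budget=lambda t, b: 1.0)   # adversarial privacy pull
assert abs(theta - 0.9) < 1e-9
```

The essential point is that the two datasets never need joint labels: each batch contributes a gradient only to the term its labels support, and the shared $\theta_d$ accumulates both.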

Beyond this initial step, we further construct a new dataset dedicated to the privacy-preserving visual recognition task, presented in section 5.

4 Simulation Experiments

We show the effectiveness of our framework on privacy-preserving action recognition on existing datasets. The target task is human action recognition, since it is a highly demanded feature in smart home and smart workplace applications. Experiments are carried out on three widely used human action recognition datasets: the SBU Kinect Interaction Dataset [35], UCF101 [33] and HMDB51 [34]. The budget task varies across settings. In the experiments on the SBU dataset, the budget task is to prevent the videos from leaking human identity information. In the experiments on UCF101 and HMDB51, the budget task is to protect the visual privacy attributes defined in [20]. We emphasize that the general framework proposed in section 3.2 can be used for a large variety of target and budget task combinations, not limited to the two settings mentioned above.

4.1 Identity-Preserving Action Recognition on SBU: Single-Dataset Training

We compare our framework with three baseline methods proposed in [1] and [16] to show our method's significant superiority in balancing privacy protection and model utility. We apply the three different optimization schemes described in sections 3.3.1-3.3.3 to our framework and empirically show that all three largely outperform the baseline methods. We also show that adding model ensemble and model restarting, as described in section 3.4, to the optimization procedure further improves the performance of our method.

4.1.1 Dataset and Problem Setting

The SBU Kinect Interaction Dataset [35] is a two-person interaction dataset for video-based action recognition. Seven participants performed the actions, and the dataset is composed of 21 sets, each containing videos of a pair of different persons performing all 8 interactions. Some sets contain the same two actors but with the acting and reacting roles swapped. For example, in set 1, actor 1 is acting and actor 2 is reacting; in set 4, actor 2 is acting and actor 1 is reacting. These two sets have the same actors, so we combine them into one single class to better fit our experimental setting. In this way, we combine all sets with the same actors and finally get 13 different actor pairs. The target task on this dataset is action recognition, a classification task with 8 classes. The budget task is to recognize the actor pair in a video, a classification task with 13 classes. Note that the original resolution of SBU is 640×480; we first downsample the video frames and then crop each frame to the network input size.

4.1.2 Implementation Details

We compare the following six methods:

  • Naive Downsample: using raw RGB frames under different down-sampling rates, following [1].

  • Cropping-Face: detecting and cropping out faces from RGB frames, following [16].

  • Cropping-Whole: detecting and cropping out whole actor bodies from RGB frames, following [16].

  • Ours-KL: as described in section 3.3.1.

  • Ours-Entropy: as described in section 3.3.2.

  • Ours-Kbeam: as described in section 3.3.3.

In all of algorithms 1-3, we set step sizes , , , accuracy thresholds , and . In algorithm 3, we set to be and we tried . In algorithm 4, we set to be and experimented with . MobileNets [28] with different depth multipliers ranging from to are used as the ensembled models. Other hyper-parameters of algorithm 4 are identical to those in algorithm 2. We set = in loss function 3 and use the Adam optimizer [36] to update all parameters. For the Naive Downsample method, we use down-sampling rates ranging from 1 (i.e., no down-sampling) to 56.
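A minimal version of the Naive Downsample baseline (following [1]) can be written as below. We down-sample by an integer rate and upsample back with nearest neighbor so the degraded clip keeps its original shape; this round-trip formulation is one common way to implement the baseline, used here as an illustrative assumption.

```python
import numpy as np

def naive_downsample(frames, rate):
    """Degrade a clip of shape (T, H, W, C) by spatial down-sampling at
    integer `rate`, then nearest-neighbor upsampling back to full size."""
    if rate == 1:                      # rate 1 means no down-sampling
        return frames
    low = frames[:, ::rate, ::rate]    # keep every `rate`-th pixel
    up = low.repeat(rate, axis=1).repeat(rate, axis=2)
    _, h, w, _ = frames.shape
    return up[:, :h, :w]               # trim any rounding overshoot
```

At rate 4, for example, every 4x4 block of the output holds a single replicated pixel value.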

Fig. 2: Target and Budget Task Performance Trade-off on the SBU Dataset. For the Naive Downsample method, a larger marker means a larger down-sampling rate is adopted. For the Ours-Kbeam method, a larger marker means a larger in algorithm 3. Neither model restarting nor model ensemble is used.

4.1.3 Result Comparison and Discussion

We present the experimental results in figure 2, which displays the trade-off between the action recognition accuracy and the actor pair recognition accuracy . To interpret this figure, note that a desirable trade-off should retain high target accuracy (y-axis) while reducing budget accuracy (x-axis). Therefore, a point closer to the top-left corner represents a degradation model with a more desirable trade-off. The magenta dotted line indicates the target accuracy on the original unprotected videos. This can be roughly considered an upper bound for all privacy protection methods, under the assumption that the degradation will unavoidably filter out some information useful for the target task.

As we can see, all three of our methods largely outperform the baselines. Crop-Face and Naive Downsample with low downsample rates can lead to decent action accuracy, but the budget accuracy remains very high, meaning these methods fail to protect privacy. On the other hand, Crop-Whole and Naive Downsample with a downsample rate as high as 56 can effectively suppress to a low level, but also suffers a huge negative impact, which means the degraded videos are of little practical utility. Our methods, in contrast, achieve a good balance between utility and privacy protection. Ours-Entropy and Ours-Kbeam with can both decrease by around with nearly no harm to . Ours-KL and Ours-Kbeam with achieve a slightly worse trade-off compared with Ours-Entropy, but still largely outperform the baseline methods. Due to its ease of implementation and low complexity, we use Ours-Entropy as the default option in our framework, unless otherwise noted.

4.1.4 Effectiveness of Model Restarting and Ensemble

In this section, we add model restarting and model ensemble to Ours-Entropy, as shown in algorithm 4, to further improve the performance. Note that model restarting and model ensemble can be easily combined with all three of our methods; we pick Ours-Entropy here to demonstrate their effectiveness. The results are shown in figure 3. As we can see, using model restarting suppresses much further, with no additional harm to . Model ensemble also helps to improve the trade-off.
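Model restarting periodically re-initializes the budget model so the degradation cannot overfit to one specific attacker. A schematic sketch follows; the restart period and the weight shape are illustrative assumptions, and the actual adversarial update is elided.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_budget_model():
    """Draw fresh random weights for the budget ("attacker") model."""
    return rng.normal(size=(16, 13))    # hypothetical weight shape

restart_period = 100                     # hypothetical hyper-parameter
weights = init_budget_model()
history = []
for step in range(300):
    if step > 0 and step % restart_period == 0:
        weights = init_budget_model()    # restart: wipe the attacker
    # ... one adversarial training step would update `weights` here ...
    history.append(float(weights.sum()))
```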

Fig. 3: Target and Budget Task Performance Trade-off on the SBU Dataset. For the Naive Downsample method, a larger marker means a larger down-sampling rate is adopted.

4.2 Action Recognition on UCF101 with Multiple Privacy Attributes Protected: Cross-Dataset Training

4.2.1 Dataset and Problem Setting

UCF101 is an action recognition dataset with 13,320 real-life human action videos collected from YouTube. It contains videos of 101 different actions. We use the official train-test split for this dataset. Similar to what we did with SBU, we spatially downsample and crop video frames to resolution, and then we temporally clip the videos into clips of shape . The target task is human action recognition on UCF101, which can be treated as a video classification task with 101 classes.

VISPR is a dataset with 22,167 images annotated with 68 privacy attributes, e.g., semi-nudity, hobbies, face, race, gender, skin color, and so on. Each attribute of an image is labeled as 'present' or 'non-present' depending on whether that privacy attribute appears in the image. We select 7 of the 68 attributes to protect in our experiment, chosen because they frequently appear in UCF101 and fit the smart home application setting well. That is to say, we only consider these 7 attributes during training and evaluation. The 7 attributes are: semi-nudity, occupation, hobbies, sports, personal relationship, social relationship, and safe. The budget task is to predict privacy attributes on the VISPR dataset, which can be treated as a multi-label image classification task (7 labels, each a binary classification task). We use class-based mean average precision (cMAP), as defined in [20], to measure the performance of this budget task. The official train-test split is used on the VISPR dataset.
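For reference, cMAP averages the per-attribute average precision over the attribute columns; a plain NumPy version (our sketch of the standard definition, not the exact evaluation code used here) looks like:

```python
import numpy as np

def average_precision(y_true, scores):
    """AP for one binary attribute: precision averaged over the ranks
    at which positives are retrieved."""
    order = np.argsort(-scores)                    # high score first
    y = y_true[order]
    precision = np.cumsum(y) / (np.arange(len(y)) + 1.0)
    return (precision * y).sum() / max(y.sum(), 1)

def cmap(y_true, scores):
    """Class-based mAP: mean AP over attribute columns, as in [20]."""
    return float(np.mean([average_precision(y_true[:, c], scores[:, c])
                          for c in range(y_true.shape[1])]))
```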

Fig. 4: Performance Trade-off on UCF-101/VISPR dataset.

4.2.2 Implementation Details

Experiments on SBU have shown that Ours-Entropy performs slightly better than Ours-KL and Ours-Kbeam, so we only compare Ours-Entropy with the baseline methods on this dataset. In algorithm 4, we set step sizes , , , accuracy thresholds , and , and experimented with . Models from {ResNet-V1-50, ResNet-V2-50, MobileNet-1.0, MobileNet-0.75} are used as . We set in loss function 3 and use the Adam optimizer to update all parameters. For the Naive Downsample method, we use four different down-sampling rates: 1, 2, 4, and 6.
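With several budget models in play, the degradation needs a single suppression signal. One plausible aggregation is sketched below; the choice of reduction is an assumption on our part, not a prescription from the algorithms above.

```python
import numpy as np

def ensemble_budget_loss(per_model_losses, mode="mean"):
    """Aggregate suppression losses from the ensembled budget models.
    "mean" treats all attackers equally; "min" targets the attacker
    that currently leaks the most (lowest suppression loss)."""
    losses = np.asarray(per_model_losses, dtype=float)
    return float(losses.mean() if mode == "mean" else losses.min())
```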

4.2.3 Results Analyses

We present the experimental results in figure 4. Naive Downsampling causes to drop dramatically while only drops a little, meaning the utility of the videos is greatly damaged while the privacy information is hardly filtered out. On the contrary, with the help of model restarting and model ensemble, Ours-Entropy can decrease by while keeping as high as what we can get on the original undegraded videos, meaning the privacy is protected at almost no cost in utility. Hence, Ours-Entropy outperforms Naive Downsampling in this experiment.

5 PA-HMDB51: A New Benchmark

5.1 Motivation

To the best of our knowledge, there is no public dataset containing both human action and privacy attribute labels on the same videos. The lack of such datasets has not only made it difficult to employ a data-driven joint training method, but, more importantly, has made it impossible to directly evaluate, on a single dataset, how well a learned model preserves privacy without harming utility. To solve this problem, we annotate and present the first human action video dataset with privacy attributes labeled, named PA-HMDB51 (Privacy Attribute HMDB51). We evaluate our method on this newly built dataset and further demonstrate our method's effectiveness.

5.2 Selecting and Annotating Privacy Attributes

A recent work [20] defined 68 privacy attributes that could be disclosed by images. However, most of them seldom occur in public human action datasets. Out of the 68 attributes in [20], we carefully selected the 7 privacy attributes most relevant to our smart home setting: skin color, gender, face (partial), face (complete), nudity, personal relationship, and social circle. We further merged these 7 attributes into 5 to better fit human action videos: "face (partial)" and "face (complete)" were combined into a single attribute "face", and "personal relationship" and "social circle" into "relationship". This yields five privacy attributes that widely appear in public human action datasets and are closely relevant to our smart home setting. The detailed description of each attribute, its possible ground truth values, and their corresponding meanings are listed in table I.

Privacy attributes may vary within a video clip. For example, in some frames we may see a person's full face, while in the following frames the person may turn around so that the face is no longer visible. We therefore label all privacy attributes for each frame.

The privacy labels were annotated manually by a group of students in the CSE department of Texas A&M University. Each video was annotated by at least three individuals and then cross-checked.
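A frame-level annotation therefore ends up looking like the following record. The field names and video id are illustrative, not the released file format; the label values follow Table I.

```python
# Hypothetical per-frame annotation record; values follow Table I
# (e.g. face: 2 = whole face visible, 0 = no face; gender: 2 = female).
annotation = {
    "video": "brush_hair_001",          # illustrative video id
    "frames": [
        {"index": 0, "skin_color": 1, "face": 2, "gender": 2,
         "nudity": 2, "relationship": 0},
        # the actor turns around: the face label changes frame-to-frame
        {"index": 1, "skin_color": 1, "face": 0, "gender": 2,
         "nudity": 2, "relationship": 0},
    ],
}
```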

5.3 HMDB51 as the Data Source

Now that we have defined the five privacy attributes, we need to identify a source of human action videos to label. There are a number of choices available, such as [33, 34, 37, 38, 39]. We choose HMDB51 [34] to label privacy attributes on, since it contains more diverse privacy information, especially nudity/semi-nudity.

We provide frame-level annotations of the selected 5 privacy attributes on 592 videos selected from HMDB51. In this paper, we treat all 592 videos as testing samples; however, we do not exclude the future possibility of using them for training.

5.4 Dataset Statistics

5.4.1 Action Distribution

When selecting videos from the HMDB51 dataset, we consider two criteria on action labels. First, the action labels should be as balanced as possible. Second (and more implicitly), we select more videos with non-trivial privacy labels. For example, the "brush hair" action contains many videos with the "semi-nudity" attribute, and the "drink" action contains many videos with the "can tell relationship" attribute. Despite their practical importance, these two privacy attributes are relatively rare in the entire HMDB51 dataset, so we tend to select more videos with these attributes, regardless of their action classes.

The resulting distribution of action labels is depicted in Figure 5, showing a relative class balance.

Fig. 5: Action distribution of PA-HMDB51. Each column shows the number of videos with a certain action. For example, the first column shows there are 25 ”brush hair” videos in PA-HMDB51 dataset.

5.4.2 Privacy Attribute Distribution

We try to make the label distribution for each attribute as balanced as possible by deliberately selecting, from the original HMDB51, videos containing uncommon privacy attribute values to label. For instance, videos with semi-nudity are uncommon overall, so we deliberately include such videos in our PA-HMDB51 dataset. Since people are sensitive about releasing their privacy to the public, privacy attributes are by nature highly unbalanced in any public video dataset. Although this selection strategy alleviates the problem, PA-HMDB51 is still unbalanced. Figure 6 shows the frame-level label distribution of all five privacy attributes. We can see that videos with "cannot tell gender" and "cannot tell skin color" still account for a very small portion of the whole dataset.

Note that the total frame count differs across privacy attributes, since some ambiguous frames may not be labeled with certain attribute(s). For example, in some frames it is hard to tell whether the face is complete or partial (it is somewhere in between), while the actor's skin color can still be told in all of those frames. In this situation, we do not label the 'face' attribute on those ambiguous frames but still label the 'skin color' attribute.

Fig. 6: Label distribution per privacy attribute in PA-HMDB51. Definitions of label values (0,1,2,3) for each attribute are described in Table I.

5.4.3 Action-Attribute Correlation

If there is a strong correlation between a privacy attribute and an action, it would be harder to remove the privacy information from the videos without much harm to the action recognition task. For example, we would expect a high correlation between 'gender' and the action 'brush hair', since this action is carried out much more often by females than by males. Figure 7 shows the correlation between privacy attributes and actions.
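The correlation in Fig. 7 boils down to per-frame co-occurrence counts between attribute values and actions. A toy version over made-up labels (the actions and counts below are hypothetical, not the real dataset statistics):

```python
import numpy as np

# (action, gender value) per labeled frame -- made-up toy data
frame_labels = [("brush hair", 2), ("brush hair", 2), ("kiss", 1),
                ("kiss", 2), ("drink", 1)]
actions = sorted({a for a, _ in frame_labels})
counts = np.zeros((3, len(actions)), dtype=int)  # gender values 0..2
for action, g in frame_labels:
    counts[g, actions.index(action)] += 1        # rows: value, cols: action
```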

Fig. 7: Action-PA correlation in PA-HMDB51 dataset. The color represents the number of frames of each action containing a specific privacy attribute value. For example, in the ”relationship” subplot, the intersection block of row ”Exist” and column ”kiss” shows the number of frames with ”relationship exist” label in all kiss videos.
Attribute Possible Values Meaning
Skin Color 0 invisible Skin color of the actor is invisible.
1 white Skin color of the actor is white.
2 brown/yellow Skin color of the actor is brown/yellow.
3 black Skin color of the actor is black.
Face 0 No face Less than 10% of the actor’s face is visible.
1 Partial face Less than 70% but more than 10% of the actor’s face is visible.
2 Whole face More than 70% of the actor’s face is visible.
Gender 0 Cannot tell Cannot tell the person’s gender.
1 Male The actor is male.
2 Female The actor is female.
Nudity 0 The actor/actress is wearing long sleeves and pants.
1 The actor/actress is wearing short sleeves or shorts/short skirts.
2 The actor/actress is of semi-nudity.
Relationship 0 Cannot tell Relationships (such as friends, couples, etc.) between the actors/actresses cannot be told from the video.
1 Can tell Relationships between the actors/actresses can be told from the video.
TABLE I: Attribute Definition on PA-HMDB51 Dataset
Frame Action Privacy Attributes
Brush hair
  • skin color: white

  • face: no

  • gender: female

  • nudity: level 2

  • relationship: no

  • skin color: white

  • face: whole

  • gender: male

  • nudity: level 1

  • relationship: no

TABLE II: Example frame-level annotations in the PA-HMDB51 dataset

5.5 Benchmark Results on PA-HMDB51

5.5.1 Dataset and Problem Setting

We train our models using cross-dataset training on the HMDB51 and VISPR datasets as in section 4.2, except that we use the five attributes defined in Table I on VISPR instead of the seven used in section 4.2. The trained models are directly evaluated on the PA-HMDB51 dataset for both the target task and the budget task, without any re-training or adaptation. We use the HMDB51 videos not included in PA-HMDB51 as the training set. Similar to the UCF101 experiments, the target task (i.e., action recognition) can be treated as a video classification problem with 51 classes, and the budget task (i.e., privacy attribute prediction) can be treated as a multi-label image classification task with two classes per privacy attribute. Notably, although PA-HMDB51 provides concrete multi-class labels with specific privacy attribute classes, we convert them into binary labels during testing. For example, for the "gender" attribute, we have provided the ground truth labels "male", "female", and "cannot tell", but we only use "can tell" and "cannot tell" in our experiments, by merging "male" and "female" into the single class "can tell". This is because we have to keep the testing protocol on PA-HMDB51 consistent with the training protocol on VISPR (a multi-label, "either-or" type binary classification task), so that our models cross-trained on UCF101-VISPR can be evaluated directly. Meanwhile, we hope to extend training to PA-HMDB51 in the future, so that the budget task can be formulated and evaluated as a multi-label, multi-class classification problem.
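The label conversion described above is a one-liner: every value other than "cannot tell" collapses to "can tell". Assuming, as in Table I, that 0 encodes "cannot tell":

```python
def to_binary(label, cannot_tell=0):
    """Collapse a multi-class privacy label to the binary test protocol:
    0 = "cannot tell", 1 = "can tell" (e.g. male/female both map to 1)."""
    return 0 if label == cannot_tell else 1
```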

The inputs to our framework are clips of shape , just the same as in the SBU and UCF101 experiments. All implementation details are identical with the UCF101 case, except that we adjust and .

5.5.2 Results and Analysis

The results of the Ours-KL method with (with and without restarting) and the naive downsampling method with downsample rates are shown in figure 8. Our methods achieve a much better privacy-utility trade-off compared with the baseline methods. When , our methods are able to decrease privacy cMAP by around 8% with little harm to utility accuracy. Overall, the privacy gains are more limited compared to the previous two experiments, because no (re-)training is performed; but the overall comparison trends remain consistent.

Fig. 8: Performance trade-off on PA-HMDB51 dataset.

6 Conclusion

We proposed an innovative framework to solve the newly established task of privacy-preserving visual recognition. To tackle the challenging adversarial learning process, we investigated three different optimization schemes. To further tackle the challenge of universal privacy protection, we proposed model restarting and model ensemble, which are shown to further improve the obtained trade-off. Extensive experiments confirmed the effectiveness of the proposed framework. Last but not least, we established the first dataset for privacy-preserving video action recognition, an effort that we hope can engage a broader community in this research direction.

We note that there remains large room to improve the proposed framework before it can achieve practical usefulness. For example, the definition of privacy leakage risk is core to the framework. Considering the challenge, the current defined with any specific is insufficient; the budget model ensemble can only be viewed as a rough, discretized approximation of . More elegant ways to model this optimization may lead to further performance breakthroughs.


  • [1] M. S. Ryoo, B. Rothrock, C. Fleming, and H. J. Yang, “Privacy-preserving human activity recognition from extreme low resolution,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [2] D. J. Butler, J. Huang, F. Roesner, and M. Cakmak, “The privacy-utility tradeoff for remotely teleoperated robots,” in 10th ACM/IEEE International Conference on Human-Robot Interaction, 2015, pp. 27–34.
  • [3] J. Dai, B. Saghafi, J. Wu, J. Konrad, and P. Ishwar, “Towards privacy-preserving recognition of human activities,” in IEEE International Conference on Image Processing (ICIP), 2015, pp. 4238–4242.
  • [4] Z. Wu, Z. Wang, Z. Wang, and H. Jin, “Towards privacy-preserving visual recognition via adversarial training: A pilot study,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 606–624.
  • [5] C. Gentry et al., “Fully homomorphic encryption using ideal lattices.” in Annual ACM Symposium on the Theory of Computing (STOC), 2009, pp. 169–178.
  • [6] P. Xie, M. Bilenko, T. Finley, R. Gilad-Bachrach, K. Lauter, and M. Naehrig, “Crypto-nets: Neural networks over encrypted data,” arXiv preprint arXiv:1412.6181, 2014.
  • [7] A. Chattopadhyay and T. E. Boult, “Privacycam: a privacy preserving camera using uclinux on the blackfin dsp,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
  • [8] A. Mahendran and A. Vedaldi, “Visualizing deep convolutional neural networks using natural pre-images,” International Journal of Computer Vision (IJCV), vol. 120, no. 3, pp. 233–255, May 2016.
  • [9] T. Winkler, A. Erdélyi, and B. Rinner, “Trusteye. m4: protecting the sensor—not the camera,” in 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).   IEEE, 2014, pp. 159–164.
  • [10] F. Pittaluga and S. J. Koppal, “Privacy preserving optics for miniature vision sensors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 314–324.
  • [11] ——, “Pre-capture privacy for small vision sensors,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 39, no. 11, pp. 2215–2226, 2017.
  • [12] L. Jia and R. J. Radke, “Using time-of-flight measurements for privacy-preserving tracking in a smart room,” IEEE Transactions on Industrial Informatics, vol. 10, no. 1, pp. 689–696, 2014.
  • [13] S. Tao, M. Kudo, and H. Nonaka, “Privacy-preserved behavior analysis and fall detection by an infrared ceiling sensor network,” Sensors, vol. 12, no. 12, pp. 16 920–16 936, 2012.
  • [14] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang, “Studying very low resolution recognition using deep networks,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4792–4800, 2016.
  • [15] B. Cheng, Z. Wang, Z. Zhang, Z. Li, D. Liu, J. Yang, S. Huang, and T. S. Huang, “Robust emotion recognition from low quality and low bit rate video: A deep learning approach,” in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII).   IEEE, 2017, pp. 65–70.
  • [16] Y. Li, N. Vishwamitra, B. P. Knijnenburg, H. Hu, and K. Caine, “Blur vs. block: Investigating the effectiveness of privacy-enhancing obfuscation for images,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1343–1351.
  • [17] S. J. Oh, R. Benenson, M. Fritz, and B. Schiele, “Faceless person recognition: Privacy implications in social media,” in European Conference on Computer Vision.   Springer, 2016, pp. 19–35.
  • [18] R. McPherson, R. Shokri, and V. Shmatikov, “Defeating image obfuscation with deep learning,” arXiv preprint arXiv:1609.00408, 2016.
  • [19] S. J. Oh, M. Fritz, and B. Schiele, “Adversarial image perturbation for privacy protection a game theory perspective,” in 2017 IEEE International Conference on Computer Vision (ICCV).   IEEE, 2017, pp. 1491–1500.
  • [20] T. Orekondy, B. Schiele, and M. Fritz, “Towards a visual privacy advisor: Understanding and predicting privacy risks in images,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3686–3695.
  • [21] X. Xiang and T. D. Tran, “Linear disentangled representation learning for facial actions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 12, pp. 3539–3544, 2018.
  • [22] G. Desjardins, A. Courville, and Y. Bengio, “Disentangling factors of variation via generative entangling,” arXiv preprint arXiv:1210.5474, 2012.
  • [23] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio, “Image-to-image translation for cross-domain disentanglement,” arXiv preprint arXiv:1805.09730, 2018.
  • [24] N. Siddharth, B. Paige, A. Desmaison, J.-W. van de Meent, F. Wood, N. D. Goodman, P. Kohli, and P. H. Torr, “Learning disentangled representations in deep generative models,” 2016.
  • [25] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [26] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [28] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [29] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
  • [30] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 1180–1189.
  • [31] J. Hamm and Y.-K. Noh, “K-beam minimax: Efficient optimization for deep adversarial learning,” in Proceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 1881–1889.
  • [32] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8697–8710.
  • [33] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [34] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2556–2563.
  • [35] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, “Two-person interaction detection using body-pose features and multiple instance learning,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 28–35.
  • [36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference for Learning Representations (ICLR), 2015.
  • [37] B. G. Fabian Caba Heilbron, Victor Escorcia and J. C. Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.
  • [38] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, and et al., “Ava: A video dataset of spatio-temporally localized atomic visual actions,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018. [Online]. Available:
  • [39] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, “The kinetics human action video dataset,” 2017.

Haotao Wang received the B.E. degree in electronics engineering from Tsinghua University, China, in 2018. He is working toward the PhD degree at Texas A&M University, under the supervision of Zhangyang Wang. His research interests lie in computer vision and machine learning, especially privacy and adversarial attack and defense.

Zhenyu Wu received the M.S. and B.E. degrees from the Ohio State University and Shanghai Jiao Tong University, respectively. He is currently a Ph.D. student at Texas A&M University, advised by Zhangyang Wang. His research interests include visual privacy protection, neural network compression, object detection, fairness in generative models, and hand pose estimation.

Zhangyang Wang is an Assistant Professor of Computer Science and Engineering (CSE), at the Texas A&M University (TAMU). During 2012-2016, he was a Ph.D. student in the Electrical and Computer Engineering (ECE) Department, at the University of Illinois at Urbana-Champaign (UIUC), working with Professor Thomas S. Huang. Dr. Wang’s research has been addressing machine learning, computer vision and optimization problems, as well as their interdisciplinary applications. He has co-authored over 80 papers, and published 2 books and 1 chapter. He has been granted 3 patents, and has received around 20 research awards and scholarships.

Zhaowen Wang received the B.E. and M.S. degrees from Shanghai Jiao Tong University, China, in 2006 and 2009, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, in 2014. He is currently a Senior Research Scientist with the Creative Intelligence Lab, Adobe Inc. His research has been focused on understanding and enhancing images, videos and graphics via machine learning algorithms, with a particular interest in sparse coding and deep learning.

Hailin Jin is a Senior Principal Scientist at Adobe Research. He received his Master of Science and Doctor of Science degrees in Electrical Engineering from Washington University in Saint Louis in 2000 and 2003, respectively. His advisor was Professor Stefano Soatto. Between fall 2003 and fall 2004, he was a postdoctoral researcher at the Computer Science Department, University of California at Los Angeles. His current research interests include: deep learning, natural language processing, computer vision, video, image search, 3D reconstruction, structure and motion estimation, optical flow, stereo, and image-based modeling and rendering. His work can be found in many Adobe products including Photoshop, After Effects, Premiere Pro, and Photoshop Lightroom.
