Horizontal-to-Vertical Video Conversion
In this blooming age of social media and mobile platforms, mass consumers are migrating from horizontal video to vertical content delivered on hand-held devices. Accordingly, revitalizing the exposure of horizontal video becomes vital and urgent, which we address, for the first time, with our automated horizontal-to-vertical (abbreviated as H2V) video conversion framework. Essentially, the H2V framework performs subject-preserving video cropping instantiated in the proposed Rank-SS module. Rank-SS incorporates object detection to discover candidate subjects, from which we select the primary subject to preserve by leveraging location, appearance, and saliency cues in a convolutional neural network. Besides converting horizontal videos to vertical ones by cropping around the selected subject, automatic shot detection and multi-object tracking are also integrated in the H2V framework to accommodate long and complex videos. In addition, for the development of H2V systems, we release the H2V-142K dataset containing 125 videos (132K frames) and 9,500 cover images annotated with primary subject bounding boxes. On H2V-142K and public object detection datasets, our method demonstrates superior subject selection accuracy compared to related solutions. Beyond that, our H2V framework is also deployed industrially, serving millions of daily active users, and exhibits favorable H2V conversion performance. By releasing this dataset as well as our approach, we wish to pave the way for more horizontal-to-vertical video conversion solutions to come.
Vertical videos are created for viewing in portrait mode on hand-held devices, in contrast to the traditional horizontal formats popularized on big screens. Promoted by the unprecedented growth of social media platforms (such as TikTok, Instagram, and Youku), vertical videos have taken over the focus of mass video consumers, leaving abundant horizontal content less exposed. To restore its exposure, horizontal videos have been converted to vertical ones through manual processing and cropping, which is prohibitively labor-intensive and time-consuming. Accordingly, a fully automated horizontal-to-vertical video conversion method is urgently needed.
Nonetheless, automated horizontal-to-vertical video conversion is uncharted territory, in which the key challenge is subject preservation, i.e., keeping the main subject (mostly human) stable in the scene through the information-losing video cropping. As illustrated in Fig. 1, conversion solutions realize subject preservation by cropping horizontal sources around the primary subject to produce vertical outputs. To achieve subject-preserving conversion, one needs to develop a fully automated pipeline assembling video shot boundary detection, subject selection, subject tracking, and video cropping components. Among these, subject selection is of cardinal importance.
Notably, achieving subject selection in horizontal-to-vertical conversion is complicated for two reasons. Firstly, the primary subject in a video shifts constantly from shot to shot, so accurate shot boundary detection is indispensable as pre-processing. Secondly, in most cases, one has to select the primary subject out of numerous foreground distractors within each frame. To overcome this challenge, Salient Object Detection (SOD) [4, 65] and Fixation Prediction (FP) have been practiced. SOD performs pixel-level foreground-background binary classification to discover all objects, but fails to discriminate the primary subject from other candidates. FP is able to find the primary subject by imitating the human visual system, yet it can only provide a point-like fixation response and thus fails to obtain the subject in its entirety. A comparison between these strategies and our solution is shown in Fig. 2.
In this work, we propose the H2V framework as the first automatic horizontal-to-vertical video conversion system, which settles subject-preserving video cropping effectively and compactly. As shown in Fig. 3, the H2V framework first incorporates shot boundary detection to separate the horizontal input into disjoint shots, each containing its own set of subjects. Within each shot, the primary subject is selected at the first frame with the proposed Rank Sub-Select (Rank-SS) module. As depicted in Fig. 4, Rank-SS employs a convolutional architecture and integrates human detection, saliency detection, and traditional as well as deep appearance features to discover and select the primary subject. In the following frames of the shot, the primary subject is propagated via the subject tracking module. In developing the Rank-SS module, we start by investigating naive and deep regression approaches that emphasize location priors, which report inadequate performance. We then extend regression into a ranking formulation for primary subject selection, where the object-pair relation is taken into consideration, which delivers favorable performance. Detailed insights with respect to this extension are elaborated in Section IV-C.
To build and evaluate our H2V framework, as well as to encourage further research on horizontal-to-vertical video conversion, we collect and release a large-scale dataset named H2V-142K. The H2V-142K dataset contains 132K frames from 125 video sequences, each carefully labeled by human annotators with bounding boxes denoting the face and torso of the primary subject. As shown in Fig. 5, this dataset covers rich horizontal content, including TV series, variety shows, and user-made videos. Besides, another 9,500 video cover images with heavy distractors are also provided, promoting more robust primary subject selection. For detailed statistics of the dataset, please refer to Section V-A1. On top of the H2V-142K dataset, public detection datasets such as ASR, and extensive user feedback data collected from the Youku website, comprehensive experiments are conducted in which our H2V framework exhibits favorable performance, both quantitatively and qualitatively.
In summary, we highlight three main contributions in this paper: 1) We propose the H2V framework as the first unified solution to horizontal-to-vertical video conversion, which has been successfully commercialized at web scale; 2) We design the Sub-Select module in the H2V framework, among other well-performing components, which integrates rich visual cues to select the primary subject in a ranking manner; 3) We construct and release the H2V-142K dataset with 125 fully-annotated videos (more than 142k images), hoping to pave the way for future endeavors in the field of horizontal-to-vertical video conversion.
II Related Work
This section summarizes work related to our task, which is tackled as a content-aware video cropping problem. We first survey traditional cropping methods and illustrate their limitations for this specific problem. We then introduce research on salient object detection and fixation prediction, which is closely related to subject discovery and selection. Finally, since instance localization and ranking are required to obtain the most eye-catching subject, we review salient instance ranking.
II-A Image Cropping
Image cropping is a significant technique for improving the visual quality of raw images. Early methods leverage practical experience from photography experts to solve this problem, e.g., the rule of center, the rule of thirds, and the rule of grid. With the development of deep learning and the rise of large-scale aesthetic datasets such as AVA and AADB, researchers have recently solved this task in a data-driven manner and made great progress in this area. Modern DL-based image cropping methods can be categorized into two streams: structure-based and aesthetics-based. 1) Structure-based methods [47, 59, 6, 68] focus on preserving the most important or salient part after cropping. Attention mechanisms or saliency detection ideas are usually applied in these methods. 2) Aesthetics-based methods [19, 7, 78, 49, 72, 75, 12] improve the cropping results by increasing aesthetic quality; local factors are strongly considered, and these methods favor preserving visually attractive parts. There are also methods that combine both global structure and local aesthetics. One models image cropping in a determining-adjusting framework, which first uses attention-aware determining and then applies an aesthetics-based adjusting network. Another designs composition-aware and saliency-aware blocks to select more reasonable crops.
Image cropping has also been extended to the video retargeting task, which is the process of adapting a video from one screen resolution to another to fit different displays; video cropping is one of its important methods. Traditionally, researchers argue that important objects in the images, such as faces or text, should be preserved, and these algorithms are called content-aware algorithms. Unlike in single images, temporal stability across frames is highly relevant, and video flicker should be avoided. Most approaches optimize the cropping process shot by shot and then apply a path generation algorithm to obtain a smooth cropping result [10, 41, 54, 63, 29]. Though great efforts have been made, there are, to the best of our knowledge, few methods able to stably handle complex scenes such as those with multiple distracting objects.
II-B Salient Object Detection
Salient Object Detection (SOD) has a long history dating back to the work of Itti et al. The majority of SOD methods [4, 65] are designed to detect pixels that belong to salient objects without knowing the individual instances, so SOD is commonly treated as a pixel-wise binary classification problem. Traditional heuristic SOD research evolved from pixel-based [22, 46] and patch-based [44, 1] to region-based [9, 21, 51] methods. Recently, deep learning-based methods have dominated the state of the art in SOD, including Multi-Layer Perceptron-based [77, 81], Fully Convolutional Network-based [43, 20, 42, 79, 70], and Capsule-based [45, 52] methods.
Video salient object detection is also referred to as unsupervised video object segmentation (VOS), and is similar to the image SOD problem discussed above. How to encode motion saliency between frames is the central issue for video SOD. A bottom-up strategy is common practice for heuristic methods, which employ background removal, point tracking and clustering [15, 50], and object proposal ranking [11, 71] to tackle the problem. For deep learning-based models, motion encoding is achieved by optical flow [23, 8, 61, 35] or recurrent neural networks [35, 61, 57]. In addition, co-saliency [24, 32] estimation searches for the common salient object regions contained in an image set.
Fixation Prediction (FP) is another closely related area, investigating the attention behavior of the human visual system. Prevalent datasets such as DHF1K record participants' eye movements and save them as fixation maps. From a task perspective, fixation prediction only calculates fixation points or small areas rather than inferring the primary salient objects as SOD does; that is, FP models [25, 3, 31, 14] care about neither object contour nor object instance.
II-C Salient Instance Ranking
Both SOD and FP methods demand expensive pixel-level annotation, generalize poorly to complex scenes, and cannot adequately distinguish multiple objects. Therefore, general Salient Object Subitizing [76, 77, 34] methods have been proposed to address this. For known object categories, object detection [80, 36] is a more accurate solution. In our H2V framework, we employ a detector for subject discovery, as the target subject is human.
To solve the salient ranking task, Li et al. found that a strong correlation exists between fixations and salient objects. Similarly, Wang et al. proposed ranking video SOD with a ranking saliency module leveraging FP and SOD features, which presented promising results. However, fixation data is relatively difficult to label, which limits its application. Another solution, introduced by Amirul et al., uses a hierarchical representation of relative saliency and stage-wise refinement. In their novel Salient Object Subitizing dataset, annotators were asked to label prominent objects. Furthermore, relative rank scores are computed by averaging the degree of saliency within the instance mask. We summarize this as Ranking by Global Average Pooling (RGAP) scores [2, 69]. As baselines for the subject selection problem, the RGAP-based models N-SS and D-SS (details in Section IV-B) fail to rank hard cases with strong spatial characteristics, such as side faces. Our proposed RCNN-based [53, 13] Rank-SS model leverages spatial features to achieve better region-based ranking.
III H2V Framework
H2V video conversion crops horizontal video into a vertical format while keeping the most engaging content intact. Accordingly, one needs to identify and preserve the primary subject in every frame efficiently. To meet both production-level accuracy and speed standards, our proposed H2V framework executes in a shot-based fashion. At first, a shot boundary detector, TransNet, is employed to segment a horizontal input video into consecutive shots. Within each shot, we first apply our Rank-SS module to discover and select the primary subject. As the primary subject is mostly stable within a shot, we bypass frame-by-frame Rank-SS by tracking this subject throughout the shot with trajectory verification and smoothing.
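The shot-based control flow described above can be sketched as follows. This is a minimal illustration rather than the deployed implementation: `select_subject` and `track_subject` are hypothetical callables standing in for the Rank-SS and tracking components, the shot is assumed to be already segmented, and the 9:16 output aspect ratio and full-height crop are assumptions not stated in the text.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def crop_window(box: Box, frame_w: int, frame_h: int,
                aspect: float = 9 / 16) -> Tuple[int, int]:
    """Return (left, right) of a full-height vertical crop centred on the box.

    The 9:16 aspect ratio is an assumed default, not a value from the paper.
    """
    crop_w = min(frame_w, int(frame_h * aspect))
    center_x = (box[0] + box[2]) / 2
    left = int(center_x - crop_w / 2)
    left = max(0, min(left, frame_w - crop_w))  # keep the window inside the frame
    return left, left + crop_w


def convert_shot(frames: Sequence[object],
                 select_subject: Callable[[object], Box],
                 track_subject: Callable[[object, Box], Box],
                 frame_w: int, frame_h: int) -> List[Tuple[int, int]]:
    """Select the subject on the first frame, track it afterwards, crop each frame."""
    windows: List[Tuple[int, int]] = []
    box = None
    for index, frame in enumerate(frames):
        if index == 0:
            box = select_subject(frame)      # hypothetical stand-in for Rank-SS
        else:
            box = track_subject(frame, box)  # hypothetical stand-in for the tracker
        windows.append(crop_window(box, frame_w, frame_h))
    return windows
```

Running subject selection only once per shot is what keeps the per-frame cost low; the tracker and the clamped crop window handle the remaining frames.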
III-A Subject Selection Criteria
Since human actors are the primary subjects in most trending videos, H2V video conversion crops horizontal videos around the primary subject to reduce the loss of information and produce meaningful vertical content during the conversion process. To correctly identify the primary human subject, we first discover all human objects in the scene using the DSFD face detector and the FreeAnchor body detector. We prefer the face detector, since it is easier to keep a face complete within the cropped area than a body. Indeed, the ablation experiment in Section V-C1 also shows that the face detector is more effective than the body detector. Selecting the primary subject among all candidates is then a highly empirical and subjective task, for which we summarize the following criteria under guidance from professional video editors:
The Central Criterion: The primary subject tends to reside in the center of the scene.
The Focal Criterion: The primary subject appears within the focal range and is free from out-of-focus blur.
The Proportional Criterion: The primary subject tends to occupy the majority of the scene.
The Postural Criterion: The primary subject displays a more eye-catching posture rather than, for example, a side face or a back view.
The Stable Criterion: (Video only) Primary subject usually shows no abrupt displacement within the same shot.
To put the above criteria into practice effectively and compactly, we design a subject selection component, as shown in Fig. 4, which comprises subject discovery, feature extraction, and our proposed Sub-Select module. Details are elaborated in Section IV.
III-B Subject Tracking
As each video shot usually focuses on the same primary subject, we refrain from the complicated frame-by-frame subject selection and track the primary subject selected as above throughout the shot. Moreover, when professional editors perform H2V manually, temporal smoothness is deliberately ensured with frame calibration and interval smoothing. Mimicking this procedure, we design a subject tracking component integrating object tracking based on SiamMask, verification, and temporal smoothness modules. This component simultaneously tracks all subjects in the scene, and re-verifies the trajectory by triggering the subject selection component to re-initialize the primary subject in two situations: 1) a subject disappears or is contaminated by similar distractors, so that the tracking confidence falls below the threshold; 2) a subject exits the scene, meaning that the vertical scene cannot cover all tracked subjects. Either situation triggers the verification module to restart the Sub-Select module for subject re-initialization. Besides, to further smooth the trajectory and suppress motion jitter, a Kalman Filter based motion model is also incorporated.
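To make the smoothing and re-verification steps concrete, the following sketch filters a sequence of horizontal crop centers with a scalar constant-velocity Kalman filter and checks the two re-initialization triggers. The motion model, the noise variances, the confidence threshold of 0.5, and the interpretation of "exits the scene" as a tracked box leaving the crop window are all assumptions made for illustration; the text does not specify them.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def kalman_smooth(centers: List[float], process_var: float = 1e-2,
                  measure_var: float = 25.0) -> List[float]:
    """Filter 1-D crop centres with a constant-velocity Kalman filter."""
    if not centers:
        return []
    x, v = centers[0], 0.0                      # state: position and velocity
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0     # state covariance P
    out = [x]
    for z in centers[1:]:
        # Predict: x <- x + v and P <- F P F^T + Q, with F = [[1, 1], [0, 1]].
        x = x + v
        n00 = p00 + p01 + p10 + p11 + process_var
        n01 = p01 + p11
        n10 = p10 + p11
        n11 = p11 + process_var
        # Update with the measured centre z (observation matrix H = [1, 0]).
        s = n00 + measure_var                   # innovation variance
        k0, k1 = n00 / s, n10 / s               # Kalman gain
        resid = z - x
        x = x + k0 * resid
        v = v + k1 * resid
        p00, p01 = (1 - k0) * n00, (1 - k0) * n01
        p10, p11 = n10 - k1 * n00, n11 - k1 * n01
        out.append(x)
    return out


def needs_reselection(confidence: float, subject_boxes: Sequence[Box],
                      window: Tuple[float, float],
                      conf_threshold: float = 0.5) -> bool:
    """Return True if either re-verification trigger fires (assumed semantics)."""
    left, right = window
    outside = any(b[0] < left or b[2] > right for b in subject_boxes)
    return confidence < conf_threshold or outside
```

A larger `measure_var` relative to `process_var` yields a smoother but more sluggish crop trajectory, which is the usual trade-off when tuning such a filter.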
IV Subject Selection
In this section, we explain our subject selection component in depth (shown in Fig. 4), emphasizing feature extraction and the novel Sub-Select module. In particular, the Naive Sub-Select (N-SS), Deep Sub-Select (D-SS), and Rank Sub-Select (Rank-SS) modules are discussed in turn to share more insights into solving subject selection.
IV-A Feature Extraction
To meet the subject selection criteria described in Section III-A, we employ three different feature extraction strategies, as shown in Fig. 4. For saliency feature extraction, we adopt the Cascaded Partial Decoder model (CPD) to generate a saliency feature map. Blur detection aims to detect Just Noticeable Blur (JNB) caused by defocusing that spans a small number of pixels in images. We utilize the traditional Thresholded Gradient Magnitude Maximization algorithm (Tenengrad [60, 55]) to extract the blur feature map. Additionally, we employ an ImageNet pre-trained ResNet-50 to extract a deep semantic embedding. In all, the feature exploited in our subject selection component is a concatenation of the saliency feature map, the blur feature map, and the deep semantic embedding.
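As an illustration of the blur cue, a Tenengrad-style sharpness measure can be sketched in a few lines. The version below is a simplified scalar score over an image patch (mean squared Sobel gradient magnitude above a threshold), not the per-pixel blur feature map used in the framework; the `threshold` parameter and the list-of-lists image format are assumptions.

```python
from typing import List


def tenengrad(gray: List[List[float]], threshold: float = 0.0) -> float:
    """Mean squared Sobel gradient magnitude over interior pixels.

    Squared magnitudes not exceeding `threshold` contribute zero, so flat or
    defocused patches score low and sharp patches score high.
    """
    h, w = len(gray), len(gray[0])
    total, count = 0.0, 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal and vertical 3x3 Sobel responses.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            mag2 = gx * gx + gy * gy
            if mag2 > threshold:
                total += mag2
            count += 1
    return total / count if count else 0.0
```

On a step edge this score is markedly higher than on a linear ramp spanning the same intensity range, which is the behavior the focal criterion relies on.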
IV-B N-SS and D-SS
Given the extracted feature, subject selection amounts to computing, for each discovered human candidate, the probability of being the primary subject; H2V conversion is then resolved by cropping the video around the most probable subject. The Naive Sub-Select (N-SS) module settles this problem by calculating the probability as the weighted summation of saliency and blur feature vectors. For each discovered subject candidate bounding box, the features extracted from within the box are abstracted via Global Average Pooling (GAP) to produce a feature vector.
N-SS calculates the probability as a weighted sum over this pooled feature vector concatenated with the size and position of the bounding box. The concatenation incorporates position and size information, conforming to the central and proportional criteria, and the weights of the concatenated feature are set manually. In the experiment, weights of 0.3, 0.1, 0.3, and 0.3 for the saliency, blur, bounding box size, and bounding box position, respectively, yield the best performance.
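The N-SS scoring rule can be sketched as below, using the four weights reported in the text. How the size and position terms are normalized is not recoverable from the text, so the area fraction and the horizontal centrality used here are assumptions, as are the function names and the dictionary layout of a candidate.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Weights for saliency, blur, box size, and box position, as reported in the text.
WEIGHTS = (0.3, 0.1, 0.3, 0.3)


def nss_score(saliency: float, sharpness: float, box: Box,
              frame_w: float, frame_h: float,
              weights: Sequence[float] = WEIGHTS) -> float:
    """Weighted sum of pooled cues; the size/position encodings are assumed."""
    x1, y1, x2, y2 = box
    size = ((x2 - x1) * (y2 - y1)) / (frame_w * frame_h)  # area fraction
    center_x = (x1 + x2) / 2
    centrality = 1.0 - abs(center_x - frame_w / 2) / (frame_w / 2)
    features = (saliency, sharpness, size, centrality)
    return sum(w * f for w, f in zip(weights, features))


def select_primary(candidates: Sequence[Dict], frame_w: float,
                   frame_h: float) -> int:
    """Return the index of the highest-scoring candidate."""
    scores = [nss_score(c["saliency"], c["sharpness"], c["box"], frame_w, frame_h)
              for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, a sharp, salient face near the frame center outscores a blurry one at the left border, in line with the central, focal, and proportional criteria.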
The main shortcoming of N-SS is its dependency on manually set weights. As an improvement, the Deep Sub-Select (D-SS) module takes the same input as N-SS but adopts a Multi-Layer Perceptron (MLP) to learn the optimal weights in a data-driven way, greatly enhancing the capacity of the Sub-Select module w.r.t. feature integration. Specifically, we implement 3 fully-connected layers followed by ReLU activation, and adopt the Mean Squared Error (MSE) loss, which is commonly used in regression problems, as the cost function to optimize D-SS. D-SS thus maps the same concatenated feature vector used by N-SS to a subject probability through the learned MLP.
Although effective, both N-SS and D-SS generate unary probabilities while overlooking the pairwise relationship among subject candidates. Concretely, the prediction only considers the characteristics of the candidate itself, and the final predicted subject probability score is unrelated to those of the other candidates. Beyond having a larger absolute probability value, the primary subject is distinguished from the non-primary ones by a higher position in the probability ranking order. From this standpoint, the Rank Sub-Select (Rank-SS) module extends the D-SS module from regression into a ranking formulation, striving to select the primary subject more accurately with pairwise ranking supervision.
Specifically, in addition to utilizing the saliency feature, the blur feature, and the bounding box size and position information as our selection basis, we design an RCNN-like [13, 53] module (shown on the left side of Fig. 4) with a deep semantic embedding. Meanwhile, to better optimize the Sub-Select module in Rank-SS, we develop a new pairwise ranking-based supervision paradigm, as illustrated on the right side of Fig. 4, built on a Siamese architecture that has two identical Sub-Select module branches and accepts pairwise inputs. The bounding boxes of two subject candidates are simultaneously passed to the Rank-SS module, together with the extracted feature of the scene. Both branches of the Siamese architecture instantiate the same Sub-Select module, in which a regional feature map is pooled from each candidate bounding box on the scene feature with the RoIAlign operation.
The pooled regional feature is further vectorized through three cascaded ResNet bottleneck blocks (followed by GAP), then concatenated with the bounding box feature vector of each of the two candidates separately. The subject probability score of each candidate is regressed by the Rank-SS network, under its learned weights, from the input image and the candidate bounding box. For training the Rank-SS module, we implement both unary and pairwise loss functions.
Unary loss. The point-wise Mean Squared Error (MSE) loss is adopted to measure the absolute difference between the predicted and ground-truth probability scores. The MSE loss is averaged over all candidates, where the ground-truth label of a candidate is 1 for subjects and 0 for non-subjects.
Pairwise loss. Our H2V-142K dataset is annotated with subject primality ranking labels, enabling pairwise supervision that helps the Sub-Select module learn features and probabilities that better distinguish the primary subject from the non-primary ones. Concretely, we adopt the margin-ranking loss on the pair of scores generated by the Siamese Rank-SS module and the associated ranking labels. To make the ranking order of the Rank-SS outputs compatible with the annotation, a rank label is derived for every candidate pair from the annotated order.
This pairwise loss guides the Rank-SS ranking toward the given relative order: it penalizes a pair whenever the score of the higher-ranked candidate fails to exceed the score of the lower-ranked candidate by a margin.
Here the rank label encodes the relative order of the candidate pair, and the margin controls the distance between the two predicted scores.
Subject probability scores of candidates are optimized with the combination of both unary and pairwise losses, i.e., a weighted sum of the two.
Here, two scalar weights denote the respective contributions of the unary loss and the pairwise loss.
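The unary, pairwise, and combined objectives can be sketched as follows. The hinge form of the margin-ranking loss is the standard one; the margin value, the default weights, the averaging over pairs, and the `(i, j, rank_label)` pair encoding with labels in {+1, -1} are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def mse_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error between predicted scores and 0/1 subject labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def margin_ranking_loss(score_i: float, score_j: float, rank_label: int,
                        margin: float = 0.1) -> float:
    """Hinge loss on a score pair.

    rank_label is +1 if candidate i should rank above candidate j, else -1.
    """
    return max(0.0, -rank_label * (score_i - score_j) + margin)


def total_loss(scores: Sequence[float], labels: Sequence[float],
               pairs: List[Tuple[int, int, int]],
               w_unary: float = 1.0, w_pair: float = 1.0,
               margin: float = 0.1) -> float:
    """Weighted sum of the unary MSE loss and the mean pairwise ranking loss."""
    unary = mse_loss(scores, labels)
    pairwise = 0.0
    if pairs:
        pairwise = sum(margin_ranking_loss(scores[i], scores[j], r, margin)
                       for i, j, r in pairs) / len(pairs)
    return w_unary * unary + w_pair * pairwise
```

A correctly ordered pair separated by more than the margin contributes nothing to the pairwise term, so that term only acts on pairs that are misordered or too close.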
In this section, we first introduce our H2V-142K dataset in detail, including detailed statistics and evaluation metrics on both video and image data. On this dataset, we then present extensive experimental results w.r.t. our H2V framework, emphasizing the Sub-Select module with comprehensive subjective results.
The H2V-142K Dataset
The underdevelopment of H2V conversion is partially attributable to the lack of available data, which we address by providing the large-scale H2V-142K dataset containing 125 videos (132K frames) and 9.5K images. This dataset is collected from the Youku video-sharing website and carefully annotated by human annotators following the Sub-Select criteria explained in Section III-A. Specifically, the video subset covers more general scenes with fewer subjects, while the image subset better examines the performance of subject selection under heavier distractors. Regarding the train/test split, we randomly select 600 video covers from the image set and all videos in the video set for testing; the remaining images are used for training. Sample images and annotations from the H2V-142K dataset are visualized in Fig. 5.
Video Subset. We collect videos with diversified content and subject scales; detailed statistics are shown in Table I. Each video is first segmented into disjoint shots using TransNet. Within each shot, a group of three annotators, after watching all frames, separately annotate the primary subjects in every frame using pairs of face and torso bounding boxes. The ground-truth bounding boxes are determined by cross-validating over annotators, thresholding on box Intersection-over-Union (IoU). Should a disagreement occur on a frame, it is carefully reviewed and resolved by a new group of annotators. Finally, bounding boxes are smoothed temporally using a Kalman Filter.
|Type||Amount||Avg #Subject||Avg #Person|
Image Subset. We choose images containing more human distractors, as shown in Table I, to further examine the performance of the H2V framework w.r.t. subject selection. Statistics of the number of subjects in the H2V-142K dataset are shown in Fig. 6. For the majority of instances, only one subject is annotated. Meanwhile, co-subjects appear more frequently in the image subset because of the pre-filtering during data preparation. Concretely, a human detector, FreeAnchor, pre-trained on the COCO dataset, is applied to collect images containing more than three (at least two) human candidates, and the ground-truth subject is annotated in the same manner as in the video subset. Besides, annotators are required to rank the non-subjects (distractors) by the same criteria as complementary labels. When crowds appear, annotators are asked to rank the top 6 non-subjects and disregard the others, generating hard-ranking labels.
The ASR Dataset
The ASR dataset  is a large-scale salient object ranking dataset based on a combination of the widely used MS-COCO dataset  with the SALICON dataset . SALICON is built on top of MS-COCO to provide mouse-trajectory-based fixations in addition to original objects’ mask and bounding box annotations. The SALICON dataset  provides two sources of fixation data: 1) fixation point sequences and 2) fixation maps for each image. The ASR dataset exploits these two sources to generate ground-truth saliency rank annotations. As the ASR dataset is not human-centered, we verify the generalization ability of the proposed method on this object-centered dataset.
V-B Evaluation Metrics
To evaluate performance on our H2V-142K dataset, we adopt the max Intersection-over-Union (max-IoU), min Central Distance Ratio (min-CDR), min Boundary Displacement Error (min-BDE), and mean Average Precision (mAP) metrics. Since our new dataset contains images with multiple annotated subjects, we evaluate the subject selection result against each ground-truth subject by IoU, and then take the max IoU over all ground-truth subject instances to measure whether a subject is selected.
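The max-IoU computation can be sketched as follows, assuming boxes in `(x1, y1, x2, y2)` pixel format; the function names are illustrative.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def max_iou(pred: Box, ground_truths: Sequence[Box]) -> float:
    """Best IoU between the predicted subject and any annotated subject."""
    return max(iou(pred, g) for g in ground_truths)
```

Taking the maximum means a prediction counts as correct when it matches any one of the annotated co-subjects.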
Here, the IoU is computed between the bounding box of the predicted subject candidate and the bounding box of each ground-truth subject instance. Besides, min-CDR is deployed to evaluate the precision of predicted bounding boxes: it is the minimum, over all ground-truth subject instances, of the distance between the center coordinates of the predicted subject candidate and those of the ground-truth instance, with the width of the image used for normalization.
Also, we adopt the min-BDE metric from prior work to measure the accuracy of the predicted subject. The min-BDE is defined as the average displacement of the four edges between the predicted subject bounding box and the ground-truth rectangle, again taking the minimum over all ground-truth subjects.
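A possible implementation of min-CDR and min-BDE is sketched below. The exact formulas are not recoverable from this copy of the text, so two choices are assumptions: the center distance is taken as Euclidean, and each edge displacement is normalized by the image width (vertical edges) or height (horizontal edges) so that the result is a ratio.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def min_cdr(pred: Box, ground_truths: Sequence[Box], image_w: float) -> float:
    """Smallest centre distance to any ground-truth box, divided by image width."""
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    distances = [math.hypot(pcx - (g[0] + g[2]) / 2, pcy - (g[1] + g[3]) / 2)
                 for g in ground_truths]
    return min(distances) / image_w


def min_bde(pred: Box, ground_truths: Sequence[Box],
            image_w: float, image_h: float) -> float:
    """Smallest mean edge displacement to any ground-truth box (normalized)."""
    def bde(g: Box) -> float:
        left = abs(pred[0] - g[0]) / image_w
        right = abs(pred[2] - g[2]) / image_w
        top = abs(pred[1] - g[1]) / image_h
        bottom = abs(pred[3] - g[3]) / image_h
        return (left + right + top + bottom) / 4
    return min(bde(g) for g in ground_truths)
```

Lower is better for both metrics, which matches the way the ablation table reports them.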
In addition to the above image-oriented metrics, we also incorporate the average min-CDR (avg-min-CDR), Jitter Degree Ratio (JDR), and recall metrics to evaluate performance on videos. For average min-CDR, instead of using the center of the predicted bounding box, we use the center of the cropped vertical frame and calculate the mean value over the whole video sequence. For JDR, we compute the sum of frame-to-frame pixel displacements of the cropped center coordinates, normalized with respect to the total number of frames in the whole video sequence and the width of the frame.
The recall metric refers to the percentage of the main subject that can be displayed on the cropped screen. Ideally, the main subject is entirely displayed on the screen instead of being clipped out.
The bounding boxes involved in the recall metric have the same meaning as in max-IoU.
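The two video-oriented metrics can be sketched as below. Since the formulas are missing from this copy, the normalization of JDR by the number of frame pairs times the frame width, and the reduction of recall to horizontal overlap (which holds only if the crop spans the full frame height), are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def jitter_degree_ratio(centers_x: Sequence[float], frame_w: float) -> float:
    """Mean frame-to-frame displacement of the crop centre, relative to width."""
    n = len(centers_x)
    if n < 2:
        return 0.0
    total = sum(abs(b - a) for a, b in zip(centers_x, centers_x[1:]))
    return total / ((n - 1) * frame_w)


def subject_recall(window: Tuple[float, float], subject: Box) -> float:
    """Fraction of the subject box width that falls inside the crop window."""
    left, right = window
    visible = max(0.0, min(right, subject[2]) - max(left, subject[0]))
    width = subject[2] - subject[0]
    return visible / width if width > 0 else 0.0
```

A perfectly static crop gives a JDR of zero, and a subject fully inside the crop window gives a recall of one.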
On the H2V-142K dataset, we conduct extensive experiments to evaluate the performance of our Sub-Select module and our H2V conversion framework.
We compare our ranking-based module with the state-of-the-art salient object detection method CPD and fixation prediction-based competitors, as well as our naive and deep selection-based baselines, on the image subset of the H2V-142K dataset. Specifically, for both SOD and FP methods, probability maps are generated for input images with the released pre-trained models, due to a lack of annotated data for our task. Then, the biggest contour in the binarized probability map is selected as the subject. The resulting position is represented by the bounding box and the centroid of the contour.
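The post-processing applied to the SOD and FP baselines can be approximated as follows. This sketch swaps in the largest 4-connected region of the binarized map for the "biggest contour", which is a simplification; the binarization threshold of 0.5 and the list-of-lists map format are assumptions.

```python
from collections import deque
from typing import List, Optional, Tuple


def largest_region(prob_map: List[List[float]], threshold: float = 0.5
                   ) -> Optional[Tuple[Tuple[int, int, int, int], Tuple[float, float]]]:
    """Return (box, centroid) of the largest 4-connected region above threshold.

    box is (x1, y1, x2, y2) with exclusive right/bottom; centroid is (x, y).
    """
    h, w = len(prob_map), len(prob_map[0])
    seen = [[False] * w for _ in range(h)]
    best: List[Tuple[int, int]] = []
    for sy in range(h):
        for sx in range(w):
            if seen[sy][sx] or prob_map[sy][sx] < threshold:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue, region = deque([(sy, sx)]), []
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and prob_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    if not best:
        return None
    ys = [p[0] for p in best]
    xs = [p[1] for p in best]
    box = (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
    centroid = (sum(xs) / len(xs), sum(ys) / len(ys))
    return box, centroid
```

Because the largest region wins regardless of which object it belongs to, this baseline cannot tell the primary subject apart from a large distractor, which is the weakness the comparison is meant to expose.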
Implementation Details. In our Rank-SS module and the N-SS and D-SS baseline modules, we deploy DSFD and FreeAnchor as the face and torso detectors. For the integrated feature extraction described above, CPD is used to produce the saliency detection response, and the Tenengrad [60, 55] algorithm is used to produce the blur response in all three proposed subject selection modules. Moreover, in the Rank-SS module, an ImageNet pre-trained ResNet-50 backbone is additionally employed to extract the deep semantic embedding, and all feature maps are resized to stride 16, consistent with the embedding feature size. Input images are resized such that their shorter side is 600 pixels during both training and testing. The regional feature size pooled by the RoIAlign layer is 14x14.
For training the deep learning-based modules, we employ SGD as the optimizer with an initial learning rate of 1e-2 that decays by a factor of 0.1 every 10 epochs. The batch size is 4 input images, and the number of RoIs per image is increased to 20 by randomly perturbing subject candidates. To warm up the subject probability predictor, the weight of the pairwise loss is set to 0 for the first 30 epochs. Then, the pairwise loss is added to train the Rank-SS module, which takes another 50 epochs to reach convergence.
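The training schedule above can be written down as two small functions. Whether the learning-rate decay restarts when the pairwise loss is switched on is not stated, so the sketch assumes a single uninterrupted step schedule; the weight values used after warm-up are placeholders.

```python
from typing import Tuple


def learning_rate(epoch: int, base_lr: float = 1e-2, decay: float = 0.1,
                  step: int = 10) -> float:
    """Step schedule: multiply the learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


def loss_weights(epoch: int, warmup_epochs: int = 30, w_unary: float = 1.0,
                 w_pair: float = 1.0) -> Tuple[float, float]:
    """Return (unary weight, pairwise weight); pairwise is 0 during warm-up.

    The post-warm-up values of w_unary and w_pair are placeholder assumptions.
    """
    return (w_unary, 0.0 if epoch < warmup_epochs else w_pair)
```

Under these assumptions, epochs 0 to 29 train with the unary loss alone, and epochs 30 to 79 add the pairwise ranking term.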
Comparisons. From this point forward, N-SS and D-SS denote the naive and deep subject selection baselines, and Rank-SS is our final ranking-based module. As shown in Table II, even our naive baseline with traditional features outperforms the SOD and FP competitors in all four metrics, achieving at least a 0.86% improvement in mAP. The deep regression baseline further improves upon N-SS by 19.81% in max-IoU and 23.8% in mAP, demonstrating the efficacy of deep features in the H2V conversion task. The soft label refers to label information that does not contain the ranking order in the training data; in this case, we utilize the relation between the primary subject and the non-subjects to construct a relative ranking order for training. The hard label refers to the annotated absolute ranking order among all human proposals. Our final ranking-based subject selection module reports the best overall selection accuracy with hard-label training data, surpassing D-SS by 4.65%, exceeding the Rank-SS module trained with soft-label data by 2.24%, and largely outperforming the SOD-based model by 29.31% in terms of mAP. Our final ranking-based module also reports the best min-CDR and min-BDE, which means that it achieves the best accuracy in predicting the subject bounding box. As for the speed tests, our naive baseline with traditional features reports the best FPS because of its simple linear feature combination, and our ranking-based module maintains real-time processing performance while improving the accuracy.
In addition to testing on the H2V-142K Image Subset, we also conduct experiments on the ASR dataset to test the modules' generalization ability. As shown in Table III, there are three training settings: the H2V-142K Image Subset only, the ASR dataset only, and both datasets. Our Rank-SS trained with the H2V-142K Image Subset achieves 59.59% in mAP, which significantly outperforms the SOD and FP competitors. The Rank-SS trained with the ASR dataset further improves upon the module trained with the H2V-142K Image Subset by 5.72% in max-IoU and 7.57% in mAP, benefiting from the homogeneity of the dataset. The Rank-SS trained with both datasets reports the best overall selection accuracy, surpassing the module trained with the ASR dataset by 0.09% and largely outperforming the SOD-based model by 19.77% in terms of mAP.
Ablations. To investigate each component’s contribution in the Rank-SS module, we provide the ablation study results on our final ranking-based module in Table IV. The last row in the table resides the complete module, and the first three rows show the results of ablating the blur detection (BD), saliency detection (SD), and positional feature (PF), respectively, where each of them shows the contribution of different degrees. Notably, the PF component demonstrates up to 15.69% mAP decline upon ablation, proving that spatial position and size are the most effective clue to subject selection. BD and SD also each contribute 2.93% and 3.27% in terms of mAP. The sixth row reveals that the face detector outperforms torso detector by 9.34%, 1.92%, 1.52%, and 6.38% in max-IoU, min-CDR, min-BDE, as well as mAP accordingly, proving that the human face is more reliable evidence to support accurate and precise subject discovery. Finally, max-IoU and mAP decrease by 8.69% and 3.1% by ablating our margin-ranking loss (MRL), demonstrating that margin-ranking loss is especially valid in enforcing more spatially precise subject selection. The fifth row exposes that mean squared error loss (MSE) is equally indispensable in the Rank-SS module.
|Method||max-IoU||min-CDR||min-BDE||mAP|
|Rank-SS (w/o BD)||83.79%||1.74%||2.02%||91.55%|
|Rank-SS (w/o SD)||83.36%||1.75%||2.09%||91.21%|
|Rank-SS (w/o SF)||72.26%||5.26%||7.34%||78.79%|
|Rank-SS (w/o MRL)||83.68%||1.72%||2.08%||91.38%|
|Rank-SS (w/o MSE)||83.74%||1.83%||2.12%||91.55%|
|Rank-SS (with body)||83.03%||2.93%||2.37%||88.10%|
|Rank-SS (with face)||92.37%||1.01%||1.89%||94.48%|
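The ablation gaps quoted in the text can be re-derived from the table rows above. The short sketch below does this arithmetic; it assumes the four value columns are max-IoU, min-CDR, min-BDE, and mAP in that order (the order in which the text lists them), and the helper name `drop_vs_full` is ours, not part of the framework.

```python
# Values copied from the Table IV rows: (max-IoU, min-CDR, min-BDE, mAP), in percent.
# Column order is an assumption inferred from the ablation paragraph.
TABLE_IV = {
    "w/o BD": (83.79, 1.74, 2.02, 91.55),
    "w/o SD": (83.36, 1.75, 2.09, 91.21),
    "w/o SF": (72.26, 5.26, 7.34, 78.79),
    "w/o MRL": (83.68, 1.72, 2.08, 91.38),
    "w/o MSE": (83.74, 1.83, 2.12, 91.55),
    "with body": (83.03, 2.93, 2.37, 88.10),
    "with face": (92.37, 1.01, 1.89, 94.48),  # complete module (last row)
}
METRICS = ("max-IoU", "min-CDR", "min-BDE", "mAP")


def drop_vs_full(variant, metric):
    """Percentage-point gap between the complete module and an ablated variant."""
    i = METRICS.index(metric)
    return round(TABLE_IV["with face"][i] - TABLE_IV[variant][i], 2)


for name in ("w/o BD", "w/o SD", "w/o SF", "w/o MRL"):
    print(name, "mAP drop:", drop_vs_full(name, "mAP"))
print("w/o MRL max-IoU drop:", drop_vs_full("w/o MRL", "max-IoU"))
```

Under this reading, the reported gaps are absolute differences in percentage points relative to the last row.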
Analysis of loss weights. For our Rank-SS module, the weights of the point-wise loss and the pairwise loss are crucial hyper-parameters. We therefore explore the effect of these two hyper-parameters by varying the weights from 0.0 to 2.0. Fig. 7 shows the impact of the weights on the image subset of our H2V-142K dataset. From Table II, we note that Rank-SS achieves the best performance when the point-wise weight is set to 0.5, and that setting it too large or too small degrades subject selection. A point-wise weight that is too large cannot sufficiently exploit the ranking order of the subject candidates, whereas one that is too small causes the prediction score to lose its meaning as the subject probability.
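To make the role of the two weights concrete, the following is a minimal sketch of a weighted sum of a point-wise MSE term and a pairwise margin-ranking term. It is an illustration under our own assumptions rather than the exact Rank-SS objective: the names `w_point`, `w_pair`, and `margin`, the averaging over pairs, and the rank convention (rank 0 is the primary subject) are all hypothetical.

```python
from itertools import combinations


def mse_loss(scores, labels):
    """Point-wise term: mean squared error between scores and subject labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def margin_ranking_loss(scores, ranks, margin=0.1):
    """Pairwise term: hinge penalty whenever a better-ranked candidate does not
    outscore a worse-ranked one by at least `margin` (lower rank = better)."""
    pairs = [(i, j) for i, j in combinations(range(len(scores)), 2)
             if ranks[i] != ranks[j]]
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        better, worse = (i, j) if ranks[i] < ranks[j] else (j, i)
        total += max(0.0, margin - (scores[better] - scores[worse]))
    return total / len(pairs)


def combined_loss(scores, labels, ranks, w_point, w_pair, margin=0.1):
    """Weighted sum of the point-wise and pairwise terms."""
    return (w_point * mse_loss(scores, labels)
            + w_pair * margin_ranking_loss(scores, ranks, margin))


# Two candidates scored in the wrong order: both terms are penalized.
# w_pair=1.0 is an arbitrary illustrative value.
print(combined_loss([0.2, 0.9], [1, 0], [0, 1], w_point=0.5, w_pair=1.0))
```

With `w_pair` set to zero the objective only regresses each score toward its label and ignores the ordering between candidates; with `w_point` set to zero only score differences matter, so the absolute scores are no longer tied to a probability-like target.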
We evaluate the H2V framework on the video subset of our H2V-142K dataset, employing different Sub-Select variants, and compare it with other H2V frameworks based on SOD and FP subject selection. The video-based SOD (anchor-diff) and FP (Aclnet) modules are executed on each frame to obtain the video results. Their implementation is similar to that of the image-based SOD and FP methods.
Comparisons. As shown in Table V, the H2V framework with the ranking-based Sub-Select module performs much better than the SOD- and FP-based approaches in all of the avg-min-CDR, JDR, and Recall metrics. Notably, as the FP approach is a point-based solution free from bounding boxes, it shows superior temporal stability and surpasses the box-based SOD by 82.22% and 22.82% in terms of JDR and Recall. Although our Rank-SS-based H2V is also region-based, it still outperforms FP by 3.53%, 1.48%, and 23.8% in avg-min-CDR, JDR, and Recall, respectively, demonstrating the subject selection accuracy and temporal stability of our H2V framework.
|Method||avg-min-CDR||JDR||Recall||FPS|
|Video SOD||14.49%||4.939||22.91%||6.3|
|Video FP||14.47%||0.878||45.73%||15.6|
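As a sanity check on how the FP-versus-SOD gaps are computed, the sketch below re-derives them from the two rows above. It suggests that the 82.22% JDR figure is a relative reduction, whereas the 22.82% Recall figure is an absolute difference in percentage points. The dictionary keys and the helper `relative_reduction` are our own naming.

```python
# Values copied from the Video SOD and Video FP rows of Table V.
sod = {"avg_min_cdr": 14.49, "jdr": 4.939, "recall": 22.91}
fp = {"avg_min_cdr": 14.47, "jdr": 0.878, "recall": 45.73}


def relative_reduction(baseline, value):
    """Reduction of `value` relative to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


jdr_gain = relative_reduction(sod["jdr"], fp["jdr"])  # relative change
recall_gain = fp["recall"] - sod["recall"]            # absolute points

print(f"JDR reduction (relative): {jdr_gain:.2f}%")
print(f"Recall gain (absolute): {recall_gain:.2f} points")
```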
Ablations. Table VI shows the ablation results of our H2V framework, where we investigate the contributions of the shot boundary detection (SBD) and subject tracking (ST) components. When SBD is ablated, avg-min-CDR, JDR, and Recall all degrade significantly, by up to 3.24%, 61.74%, and 15.88%, which is caused by erroneous subject selection across shots. Specifically, the ground-truth subject varies from shot to shot, yet the framework now selects the subject only once, at the initial frame, and thus produces more errors. When the ST component is ablated, at the other extreme, we execute Sub-Select at every frame, which causes FPS to decrease from 28.6 to 9.6. Consequently, avg-min-CDR increases by 6.18% due to more accurate subject selection, while JDR and Recall decrease by 93.06% and 12.8% because of the absence of tracking-based temporal smoothness.
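The trade-off described above, per-frame selection versus temporally smooth tracking, can be illustrated with a toy sketch. The code below places a full-height vertical crop around a subject box and compares the frame-to-frame displacement of raw and smoothed crop centers. Everything here is an assumption made for illustration: the 9:16 aspect ratio, the exponential moving average (a stand-in for the tracker, not the framework's actual tracking component), and the `mean_jitter` proxy, which is not the JDR metric.

```python
def vertical_crop(box, frame_w, frame_h, aspect=9 / 16):
    """Return (left, right) of a full-height crop centered on the subject box
    (x1, y1, x2, y2), clamped so the window stays inside the frame."""
    crop_w = min(frame_w, round(frame_h * aspect))
    cx = (box[0] + box[2]) / 2.0
    left = round(cx - crop_w / 2.0)
    left = max(0, min(left, frame_w - crop_w))
    return left, left + crop_w


def smooth_centers(centers, alpha=0.2):
    """Exponential moving average over per-frame subject centers."""
    smoothed, prev = [], None
    for c in centers:
        prev = c if prev is None else alpha * c + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def mean_jitter(xs):
    """Mean absolute frame-to-frame displacement (a simple jitter proxy)."""
    return sum(abs(b - a) for a, b in zip(xs, xs[1:])) / (len(xs) - 1)


# Hypothetical per-frame subject centers from frame-wise selection.
raw = [1000, 1040, 990, 1030, 1000]
smoothed = smooth_centers(raw)
print("raw jitter:", mean_jitter(raw))
print("smoothed jitter:", mean_jitter(smoothed))
print("crop window:", vertical_crop((900, 100, 1100, 900), 1920, 1080))
```

Smoothing lowers the displacement between consecutive crop windows at the cost of lagging behind the per-frame selection, which mirrors the accuracy-versus-stability behavior reported in the ablation.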
V-D Subjective Evaluation
One-Way Repeated Measures ANOVA
An independent sample (n=50) is recruited to complete the questionnaire, which contains 30 subjective evaluation questions, each consisting of one original image and three vertical results generated by SOD, FP, and Rank-SS, respectively. The participants are asked to rate the three vertical images on a 5-point Likert scale (1=bad, 5=excellent). To assess the difference in performance among the three methods, we conduct a one-way repeated measures ANOVA. The results are listed in Table VII and suggest that there is a significant difference in quality among the three methods. The post-hoc analyses (Table VIII) reveal that the difference between Rank-SS and SOD is significant (p < 0.001), suggesting that Rank-SS is better than SOD. Likewise, there is a significant difference between Rank-SS and FP (p < 0.001), which demonstrates that Rank-SS also performs better than FP. However, no significant difference is found between SOD and FP (p = 0.058), indicating that the quality of the cropped results generated by SOD is similar to that of FP.
|Pair||Mean difference||SD||SE||t||p|
|FP - SOD||0.13||0.44||0.69||1.953||0.058*|
|RankSS - SOD||1.46||0.71||0.11||13.295||0.000***|
|RankSS - FP||1.33||0.78||0.12||11.015||0.000***|
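For readers unfamiliar with the test, the F statistic of a one-way repeated measures ANOVA can be sketched as follows. Each row holds one participant's ratings of the three methods. The ratings below are made-up toy values, not the study's data, and the function name `rm_anova_f` is ours; obtaining a p-value additionally requires the F distribution, which is omitted here.

```python
def rm_anova_f(ratings):
    """One-way repeated measures ANOVA.

    `ratings[i][j]` is the score given by participant i to condition j.
    Returns (F, df_conditions, df_error).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    cond_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    subj_means = [sum(row) / k for row in ratings]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    # Between-participant variance is removed from the error term.
    ss_error = ss_total - ss_cond - ss_subj

    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error


# Toy ratings on a 1-5 scale; columns could stand for SOD, FP, and Rank-SS.
toy = [[1, 2, 4], [2, 2, 5], [1, 3, 4], [2, 3, 5]]
f_stat, df1, df2 = rm_anova_f(toy)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

Because each participant rates all three methods, the participant-level variance is subtracted from the error term, which is what distinguishes this test from an ordinary one-way ANOVA.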
In applications, H2V video conversion is a user-oriented task whose performance is better evaluated by user feedback. Accordingly, after deploying our H2V framework onto the Youku video-sharing website, we have collected rich user-feedback data generated from converted films, TV series, and variety shows, covering in total more than 100 Occupationally-Generated Content (OGC) items with 10 million views. Specifically, the audit pass rates of converted images and videos are 98% and 94%, respectively. Image-wise, vertical video cover images converted by H2V gain video exposure of up to 1.5 million times per day. Video-wise, efficiency indexes such as the Click Through Rate (CTR) and Bounce Rate (BR) of vertical videos converted by H2V are on par with those of manually produced ones. In addition to the online data described above, we also invite a group of professional video practitioners to participate in an offline survey, wherein videos converted by H2V report a 3% bad case rate, well below the 5-10% commercially available rate.
Strength and Weakness
As illustrated in the first row of Fig. 8, our H2V framework successfully selects the primary subject from background distractors (1a and 1c). It can also discard pseudo-subjects who are not facing the camera directly (1d). Moreover, H2V can include humans located close to the selected primary subject. More visualizations of results on the MSCOCO 2017 Val dataset, the proposed H2V-142K dataset, and the ASR dataset are shown in Fig. 9. As can be seen, our method generalizes well to common objects and is not limited to humans.
The second row of Fig. 8 showcases several bad cases of our framework. By and large, bad cases occur when the criteria described in Section III-A contradict each other. In 2a and 2b, the central criterion overwhelms the proportional criterion, and 2d demonstrates the case in which the proportional criterion overwhelms the postural criterion. Sub-figure 2c depicts a missed detection.
Lastly, the limitations of our H2V framework include the following: 1) it is object-oriented and therefore cannot process scenery shots in documentaries or contents with rarely-seen subjects, such as plants, that cannot be detected by object detectors; 2) it may fail in strongly dynamic shots where camera moves such as Dolly, Truck, and Zoom are used, resulting in picture jitter; 3) the H2V-142K dataset covers only a subset of the H2V task, containing human subjects only. More general cases are still worth exploring in future work.
In this work, we introduce the first fully automatic and commercialized horizontal-to-vertical video conversion solution, the H2V framework, which settles subject-preserving clipping by integrating shot detection, subject selection, object tracking, and video cropping. Among all well-performing components, we highlight the Sub-Select module, which effectively discovers and selects the primary subject via multi-cue feature integration and region-based object ranking. For the development of horizontal-to-vertical solutions, we publicize a large-scale H2V-142K dataset, wherein 132K frames in 125 videos and 9,500 images are carefully annotated with primary subject bounding boxes. Extensive experiments with H2V are conducted on H2V-142K and related object detection datasets, where both accuracy metrics and vast user-feedback data reveal the efficacy and superiority of our H2V framework. Upon the completion of this paper, our H2V framework has been successfully deployed online, hosting massive throughput, and we hope this paper can pave the way for more successful endeavors to come.