Image retargeting adapts images to displays of different resolutions (e.g., cell phones, TV monitors). To fit an image to the target resolution, certain unimportant regions must be removed or distorted, and the key problem is determining the importance of each pixel. Existing methods predict pixel-wise importance in a bottom-up manner via eye fixation estimation or saliency detection. In contrast, the proposed algorithm estimates pixel-wise importance based on a top-down criterion: the target image should preserve the semantic meaning of the original image. To this end, several semantic components corresponding to foreground objects, action contexts, and background regions are extracted. The semantic component maps are integrated by a classification guided fusion network. Specifically, the deep network classifies the original image as object- or scene-oriented, and fuses the semantic component maps according to the classification result. The network output, referred to as the semantic collage, has the same size as the original image and is then fed into any existing optimization method to generate the target image. Extensive experiments are carried out on the RetargetMe dataset and the S-Retarget dataset developed in this work. Experimental results demonstrate the merits of the proposed algorithm over state-of-the-art image retargeting methods.

The S-Retarget Dataset

For a comprehensive evaluation of image retargeting methods, we construct the Semantic-Retarget (S-Retarget) dataset, which contains 1,527 images.

We select images from existing datasets as well as pictures retrieved via web search engines. Based on their content, all images are divided into 6 categories: single person, multiple people, single object, multiple objects, indoor scene, and outdoor scene. To obtain the semantic collage of each image, we ask 5 subjects to annotate a pixel relevance score based on the semantics of the image. Figure 1(b) shows the semantic collages marked by two annotators; overall, the annotations from different subjects are consistent. The relevance score of a pixel is obtained by averaging the relevance scores of all segments covering that pixel (see Figure 1(c)).
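As a concrete illustration of the averaging step, the per-pixel relevance can be sketched in a few lines of NumPy. The function name and array layout below are our own; this is not the annotation tool, just a minimal sketch assuming each annotator assigns one score per segment:

```python
import numpy as np

def pixel_relevance(segment_ids, segment_scores):
    """Average each segment's relevance scores over annotators,
    then broadcast them to pixels via the segmentation map.

    segment_ids    : (H, W) int array, segment label per pixel
    segment_scores : (num_annotators, num_segments) float array
    """
    mean_scores = segment_scores.mean(axis=0)  # average over annotators
    return mean_scores[segment_ids]            # look up per-pixel score

# toy example: 2 annotators, 3 segments on a 2x3 image
seg = np.array([[0, 0, 1],
                [2, 2, 1]])
scores = np.array([[1.0, 0.5, 0.0],
                   [0.8, 0.7, 0.2]])
collage = pixel_relevance(seg, scores)  # soft map in [0, 1]
```

Because the scores are averaged rather than thresholded, the resulting map is soft, matching the annotations shown in Figure 1(c).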

Figure 1. Some examples of semantic collage in the S-Retarget dataset. a) original images; b) annotations from two annotators; c) ground truth annotations; d) image caption.

The S-Retarget dataset can also be used as a semantic saliency dataset. Unlike saliency datasets such as MSRA-5000 [3] or ECSSD [4], which mainly contain dominant objects, the images in S-Retarget are quite diverse. Furthermore, as shown in Figure 1(c), the ground truth is labeled with soft rather than binary annotations.

Dataset Download [620MB]

Baidu Yun


Architecture Overview

Although state-of-the-art modules are used, the semantic components of an image may not always be extracted well. Thus, we combine all semantic component maps via a classification guided fusion network to generate the semantic collage. As object and scene images have different properties, the fusion network first classifies an image into one of these two types; the semantic component maps are then fused by the sub-network corresponding to the predicted category. In contrast to existing methods, we exploit semantic collages based on the three defined components for image retargeting. The generated semantic collage is fed into a carrier method to generate the target image.
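The routing logic can be sketched as follows. This is a deliberately simplified linear stand-in for the learned regression sub-networks (the real fusion weights are learned end to end, and the names below are ours), but it shows how the classifier's decision selects which fusion to apply:

```python
import numpy as np

def fuse_components(component_maps, p_object, w_object, w_scene):
    """Fuse semantic component maps, choosing the fusion weights by
    the object-vs-scene classification (linear toy stand-in for the
    two regression sub-networks).

    component_maps : (K, H, W) stacked semantic component maps
    p_object       : classifier probability the image is object-oriented
    w_object, w_scene : (K,) fusion weights of the two sub-networks
    """
    w = w_object if p_object >= 0.5 else w_scene  # route by predicted class
    collage = np.tensordot(w, component_maps, axes=1)  # weighted sum over K
    return np.clip(collage, 0.0, 1.0)

# toy example: two component maps on a 2x2 image
maps = np.stack([np.full((2, 2), 0.5),   # e.g. foreground-object map
                 np.full((2, 2), 1.0)])  # e.g. background map
collage = fuse_components(maps, p_object=0.9,
                          w_object=np.array([0.8, 0.4]),
                          w_scene=np.array([0.2, 0.9]))
```

In the actual network the two branches are separate regression sub-networks rather than fixed weight vectors, so the fusion can vary spatially as well as per category.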

Figure 1. Main steps of the SP-DIR algorithm. The semantic meaning of the original image is: a boy kicks a ball on a pitch. Three types of semantic components are extracted first: foreground objects (boy, ball), the action context (kick), and the background (pitch). These are fused via a classification guided fusion network to generate a semantic collage, which is fed into the carrier to render the target image.

Figure 2. Classification guided fusion network. The inputs are the multiple semantic component maps and the output is the semantic collage. The classification sub-network predicts whether the image is scene- or object-oriented. The regression sub-network then fuses the semantic component maps according to the classification result.


Quantitative Evaluation

We compare our importance maps with state-of-the-art saliency estimation methods under four evaluation criteria of the MIT saliency benchmark [5] and the Mean Absolute Error (MAE). For EMD, KL, and MAE, lower is better; for CC and SIM, higher is better.
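For reference, three of the simpler metrics can be written down directly. The sketch below follows the usual definitions (pixel-wise MAE, Pearson correlation for CC, and histogram intersection for SIM); it is our own illustration, not the benchmark's evaluation code:

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error between two maps in [0, 1]; lower is better."""
    return np.abs(pred - gt).mean()

def cc(pred, gt):
    """Pearson linear Correlation Coefficient; higher is better."""
    p = pred - pred.mean()
    g = gt - gt.mean()
    return (p * g).sum() / np.sqrt((p * p).sum() * (g * g).sum())

def sim(pred, gt):
    """Similarity (histogram intersection): sum of per-pixel minima
    after normalizing each map to sum to 1; higher is better."""
    p = pred / pred.sum()
    g = gt / gt.sum()
    return np.minimum(p, g).sum()
```

EMD and KL additionally treat the maps as distributions over pixel locations and are more involved, which is why the benchmark reports all five together.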

Our fusion network performs best on the S-Retarget dataset under all evaluation metrics.

Table 1. Evaluation on importance maps on the validation-set in the S-Retarget dataset.

Retargeting Results

We apply our system to the S-Retarget dataset as well as the RetargetMe [1] dataset. The following retargeting results show that our method better preserves the semantic meaning of images.

Figure 3. Comparisons with SOAT, ISC, Multi-operator, Warp, AAD, and OSS on the S-Retarget dataset. The 6 rows show results for single person, multiple people, single object, multiple objects, indoor scene, and outdoor scene, respectively.

Figure 4. Results on the RetargetMe dataset. Target images are obtained with 3 retargeting methods (AAD, Multi-Op, and IF) and 9 importance maps (eDN, GC, oriIF, DNEF, RCC, fine-tuned MC, fine-tuned Mr-cnn, fine-tuned SalNet, and our method).

We also conducted human evaluations on Amazon Mechanical Turk (AMT). Our target image and the result of a baseline method are shown in random order to AMT workers, who are asked to select the better one. The evaluation results are shown below. Each entry is a pairwise comparison; for example, "2985(255)" means our result is preferred 2,985 times while the corresponding baseline method is favored 255 times.
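Such a pairwise entry is easy to turn into a preference rate, which makes the tables comparable across baselines. A minimal sketch (the function name is ours):

```python
def preference_rate(ours, baseline):
    """Fraction of pairwise AMT votes won by our result."""
    return ours / (ours + baseline)

# e.g. the entry "2985(255)": 2985 votes for ours, 255 for the baseline
rate = preference_rate(2985, 255)  # ≈ 0.921, i.e. preferred in ~92% of pairs
```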

Table 2. Comparison between our importance map and 8 baseline maps when combined with 3 different carriers on S-Retarget dataset.

Table 3. Comparisons with state-of-the-art retargeting systems on S-Retarget dataset.

Table 4. Comparison between our importance map and 8 baseline maps when combined with 3 carriers on RetargetMe dataset.

Click here for more visualization and analysis


  • [1] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir, "A comparative study of image retargeting," in ACM TOG, 2010.
  • [2] A. Mansfield, P. Gehler, L. Van Gool, and C. Rother, "Visibility maps for improving seam carving," in ECCV Workshops, 2010.
  • [3] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, "Learning to detect a salient object," in CVPR, 2007.
  • [4] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in CVPR, 2013.
  • [5] Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba, MIT Saliency Benchmark, http://saliency.mit.edu/.
2016, by Zhen Wei