Growing Interpretable Part Graphs on ConvNets via Multi-Shot Learning

Quanshi Zhang, Ruiming Cao, Ying Nian Wu, and Song-Chun Zhu in AAAI 2017

Download the paper and the code here (a revised version).

Abstract
This paper proposes a learning strategy that extracts object-part concepts from a pre-trained convolutional neural network (CNN), in an attempt to 1) explore explicit semantics hidden in CNN units and 2) gradually grow a semantically interpretable graphical model on the pre-trained CNN for hierarchical object understanding. Given part annotations on very few (e.g., 3–12) objects, our method mines certain latent patterns from the pre-trained CNN and associates them with different semantic parts. We use a four-layer And-Or graph to organize the mined latent patterns, so as to clarify their internal semantic hierarchy. Our method is guided by a small number of part annotations, and it achieves superior performance (about 13%–107% improvement) in part center prediction on the PASCAL VOC and ImageNet datasets.

Figure 1, Semantic And-Or graph (AOG) grown on a pre-trained CNN. The AOG has four layers. In the AOG, each OR node encodes its alternative representations as children, and each AND node is decomposed into its constituents. The top OR node describes the semantic part (e.g. the head of a sheep) and lists a number of part templates as children. AND nodes in the 2nd layer represent part templates, which correspond to different poses or local appearances of the part (e.g. a black sheep head from a front view). OR nodes in the 3rd layer describe latent patterns, which represent sub-parts of the sheep head or a contextual region (e.g. the neck region). Terminal nodes are CNN units. A latent pattern selects a CNN unit within its deformation range in a certain conv-slice to account for shape deformation.
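The four-layer structure in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names, the use of the raw unit response as a score, the square deformation window, and the sum and max rules at AND and OR nodes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration only (not from the paper's code).
FeatureMap = List[List[float]]  # one conv-slice: rows x cols of unit responses


@dataclass
class LatentPattern:
    """3rd-layer OR node: selects one CNN unit inside its deformation range."""
    slice_index: int          # conv-slice (channel) this pattern looks at
    center: Tuple[int, int]   # expected (row, col) position of the pattern
    radius: int               # assumed square deformation range around center

    def infer(self, slices: List[FeatureMap]) -> Tuple[float, Tuple[int, int]]:
        fmap = slices[self.slice_index]
        rows, cols = len(fmap), len(fmap[0])
        r0, c0 = self.center
        best_score, best_pos = float("-inf"), self.center
        # OR over terminal nodes: keep the strongest unit within range.
        for r in range(max(0, r0 - self.radius), min(rows, r0 + self.radius + 1)):
            for c in range(max(0, c0 - self.radius), min(cols, c0 + self.radius + 1)):
                if fmap[r][c] > best_score:
                    best_score, best_pos = fmap[r][c], (r, c)
        return best_score, best_pos


@dataclass
class PartTemplate:
    """2nd-layer AND node: composed of all of its latent patterns."""
    name: str
    patterns: List[LatentPattern]

    def infer(self, slices: List[FeatureMap]) -> float:
        # AND: aggregate (here, sum) the scores of all children.
        return sum(p.infer(slices)[0] for p in self.patterns)


@dataclass
class SemanticPart:
    """Top OR node: chooses the best-scoring part template."""
    name: str
    templates: List[PartTemplate]

    def infer(self, slices: List[FeatureMap]) -> Tuple[float, str]:
        # OR: pick the alternative with the highest score.
        return max((t.infer(slices), t.name) for t in self.templates)


# Toy responses for two 3x3 conv-slices (made-up numbers).
slices = [
    [[0.0, 0.25, 0.0], [0.125, 0.75, 0.0], [0.0, 0.0, 0.5]],
    [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.25]],
]
front = PartTemplate("front_view", [LatentPattern(0, (0, 0), 1), LatentPattern(1, (2, 2), 0)])
side = PartTemplate("side_view", [LatentPattern(0, (2, 2), 0), LatentPattern(1, (1, 1), 0)])
head = SemanticPart("sheep_head", [front, side])

best_score, best_template = head.infer(slices)
print(best_template, best_score)
```

In this toy setup the first latent pattern of the front-view template expects its unit at (0, 0) but, with a deformation radius of 1, settles on the stronger response at (1, 1). That is the role the deformation range plays in the caption above.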

Figure 2, Incremental growth of AOGs on the CNN. Given a small number (e.g. 3–12) of object-part annotations provided on demand, we incrementally grow new interpretable semantic AOGs on a pre-trained CNN, which associate certain CNN units with new parts. In this way, we do not need to train a new CNN to model a new semantic part. Instead, we gradually grow different AOGs on the same CNN to explain the semantic hierarchies of different parts that are hidden in the CNN.

Table 1, Part center prediction accuracy of 3-shot learning on the ILSVRC 2013 DET Animal-Part dataset.

Tables 2 and 3, Part center prediction accuracy of 3-shot learning on the Pascal VOC Part dataset (left) and the CUB200-2011 dataset (right).

Figure 3, Image reconstruction based on the mined latent patterns in the AOG (for pattern visualization)

Figure 4, Heat map of CNN units within a part template. To generate the heat map, we sum the CNN units that the AOG associates with the part template across all conv-slices of the 5th conv-layer.
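The summation behind such a heat map can be sketched as follows. This is a rough illustration under stated assumptions: the function name, the (slice, row, col) encoding of selected units, and the choice to accumulate raw unit responses (rather than, say, counts or weighted scores) are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]   # one conv-slice: rows x cols of unit responses
Unit = Tuple[int, int, int]      # assumed encoding: (slice_index, row, col)


def part_template_heat_map(slices: List[FeatureMap], selected_units: List[Unit]) -> FeatureMap:
    """Accumulate the responses of the selected units over all conv-slices."""
    rows, cols = len(slices[0]), len(slices[0][0])
    heat = [[0.0] * cols for _ in range(rows)]
    for k, r, c in selected_units:
        # Units from different slices that share a position add up.
        heat[r][c] += slices[k][r][c]
    return heat


# Toy example: two 2x2 conv-slices and three selected units (made-up numbers).
toy_slices = [
    [[0.5, 0.0], [0.0, 0.25]],
    [[0.25, 0.0], [0.0, 0.75]],
]
toy_units = [(0, 0, 0), (1, 0, 0), (1, 1, 1)]
heat = part_template_heat_map(toy_slices, toy_units)
print(heat)
```

The result has the spatial size of one conv-slice. Upsampling it to the image resolution for display is a separate step that is not shown here.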

Figure 5, Localization of semantic parts based on the AOG

Please contact Dr. Quanshi Zhang if you have any questions.