Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer
Christoph H. Lampert, Hannes Nickisch, Stefan Harmeling
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
{firstname.lastname}@tuebingen.mpg.de
Abstract
We study the problem of object classification when training and test classes are disjoint, i.e. no training examples of the target classes are available. This setup has hardly been studied in computer vision research, but it is the rule rather than the exception, because the world contains tens of thousands of different object classes and for only a very few of them image collections have been formed and annotated with suitable class labels.
In this paper, we tackle the problem by introducing attribute-based classification. It performs object detection based on a human-specified high-level description of the target objects instead of training images. The description consists of arbitrary semantic attributes, like shape, color or even geographic information. Because such properties transcend the specific learning task at hand, they can be pre-learned, e.g. from image datasets unrelated to the current task. Afterwards, new classes can be detected based on their attribute representation, without the need for a new training phase. In order to evaluate our method and to facilitate research in this area, we have assembled a new large-scale dataset, “Animals with Attributes”, of over 30,000 animal images that match the 50 classes in Osherson’s classic table of how strongly humans associate 85 semantic attributes with animal classes. Our experiments show that by using an attribute layer it is indeed possible to build a learning object detection system that does not require any training images of the target classes.
1. Introduction
Learning-based methods for recognizing objects in natural images have made large progress over the last years. For specific object classes, in particular faces and vehicles, reliable and efficient detectors are available, based on the combination of powerful low-level features, e.g. SIFT or HoG, with modern machine learning techniques, e.g. boosting or support vector machines. However, in order to achieve good classification accuracy, these systems require a lot of manually labeled training data, typically hundreds or thousands of example images for each class to be learned.
It has been estimated that humans distinguish between at least 30,000 relevant object classes [3]. Training conventional object detectors for all these would require millions of well-labeled training images and is likely out of reach for years to come. Therefore, numerous techniques for reducing the number of necessary training images have been developed, some of which we will discuss in Section 3. However, all of these techniques still require at least some labeled training examples to detect future object instances.

            black  white  brown  stripes  water  eats fish
otter       yes    no     yes    no       yes    yes
polar bear  no     yes    no     no       yes    yes
zebra       yes    yes    no     yes      no     no

Figure 1. A description by high-level attributes allows the transfer of knowledge between object categories: after learning the visual appearance of attributes from any classes with training examples, we can also detect object classes that do not have any training images, based on which attribute description a test image fits best.
Human learning is different: although humans can learn and abstract well from examples, they are also capable of detecting completely unseen classes when provided with a high-level description. E.g., from the phrase “eight-sided red traffic sign with white writing”, we will be able to detect stop signs, and when looking for “large gray animals with long trunks”, we will reliably identify elephants. We build on this paradigm and propose a system that is able to detect objects from a list of high-level attributes. The attributes serve as an intermediate layer in a classifier cascade and they enable the system to detect object classes for which it had not seen a single training example.
Clearly, a large number of possible attributes exist and collecting separate training material to learn an ordinary classifier for each of them would be as tedious as for all object classes. But, instead of creating a separate training set for each attribute, we can exploit the fact that meaningful high-level concepts transcend class boundaries. To learn such attributes, we can therefore make use of existing training data by merging images of several object classes. To learn, e.g., the attribute striped, we can use images of zebras, bees and tigers. For the attribute yellow, zebras would not be included, but bees and tigers would still prove useful, possibly together with canary birds. It is this possibility to obtain knowledge about attributes from different object classes, and, vice versa, the fact that each attribute can be used for the detection of many object classes that makes our proposed learning method statistically efficient.
2. Information Transfer by Attribute Sharing
We begin by formalizing the problem setting and our intuition from the previous section that the use of attributes allows us to transfer information between object classes. We first define the problem of our interest:
Learning with Disjoint Training and Test Classes:
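Let $(x_1, l_1), \dots, (x_n, l_n) \subset \mathcal{X} \times \mathcal{Y}$ be training samples, where $\mathcal{X}$ is an arbitrary feature space and $\mathcal{Y} = \{y_1, \dots, y_K\}$ consists of $K$ discrete classes. The task is to learn a classifier $f: \mathcal{X} \to \mathcal{Z}$ for a label set $\mathcal{Z} = \{z_1, \dots, z_L\}$ that is disjoint from $\mathcal{Y}$ (see footnote 1).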
Clearly, this task cannot be solved by an ordinary multi-class classifier. Figure 2(a) provides a graphical illustration of the problem: typical classifiers learn one parameter vector (or other representation) $\alpha_k$ for each training class $y_1, \dots, y_K$. Because the classes $z_1, \dots, z_L$ were not present during the training step, no parameter vector can be derived for them, and it is impossible to make predictions about these classes for future samples.
In order to make predictions about classes for which no training data is available, we need to introduce a coupling between classes in $\mathcal{Y}$ and $\mathcal{Z}$. Since no training data for the unobserved classes is available, this coupling cannot be learned from samples, but has to be inserted into the system by human effort. This introduces two severe constraints on what kind of coupling mechanisms are feasible: 1) the amount of human effort to specify new classes should be minimal, because otherwise collecting and labeling training samples would be a simpler solution; 2) coupling data that requires only common knowledge is preferable over specialized expert knowledge, because the latter is often difficult and expensive to obtain.
2.1. Attribute-Based Classification
We achieve both goals by introducing a small set of high-level semantic per-class attributes. These can be, e.g., color and shape for arbitrary objects, or the natural habitat for animals. Humans are typically able to provide good prior knowledge about such attributes, and it is therefore possible to collect the necessary information without a lot of overhead. Because the attributes are assigned on a per-class basis instead of a per-image basis, the manual effort to add a new object class is kept minimal.

Footnote 1: The condition that $\mathcal{Y}$ and $\mathcal{Z}$ are disjoint is included only to clarify the later presentation. The problem described also occurs if just $\mathcal{Z} \not\subseteq \mathcal{Y}$.
For the situation where attribute data of this kind is available, we introduce attribute-based classification:
Attribute-Based Classification:
Given the situation of learning with disjoint training and test classes. If for each class $z \in \mathcal{Z}$ and $y \in \mathcal{Y}$ an attribute representation $a \in \mathcal{A}$ is available, then we can learn a non-trivial classifier $\alpha: \mathcal{X} \to \mathcal{Z}$ by transferring information between $\mathcal{Y}$ and $\mathcal{Z}$ through $\mathcal{A}$.
In the rest of this paper, we will demonstrate that attribute-based classification is indeed a solution to the problem of learning with disjoint training and test classes, and how it can be practically used for object classification. For this, we introduce and compare two generic methods to integrate attributes into multi-class classification:

Direct attribute prediction (DAP), illustrated by Figure 2(b), uses an in-between layer of attribute variables to decouple the images from the layer of labels. During training, the output class label of each sample induces a deterministic labeling of the attribute layer. Consequently, any supervised learning method can be used to learn per-attribute parameters $\beta_m$. At test time, these allow the prediction of attribute values for each test sample, from which the test class labels are inferred. Note that the classes during testing can differ from the classes used for training, as long as the coupling attribute layer is determined in a way that does not require a training phase.
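The deterministic labeling of the attribute layer during DAP training can be illustrated with a small sketch. The class names follow the striped/yellow example from the introduction; the table `CLASS_ATTRIBUTES` and the helper `induce_attribute_labels` are hypothetical names chosen for illustration and are not part of the dataset or of any published code.

```python
# Hypothetical toy class-attribute table (illustrative values only).
CLASS_ATTRIBUTES = {
    "zebra":  {"striped": 1, "yellow": 0},
    "bee":    {"striped": 1, "yellow": 1},
    "tiger":  {"striped": 1, "yellow": 1},
    "canary": {"striped": 0, "yellow": 1},
}


def induce_attribute_labels(class_labels, class_attributes):
    """Turn per-sample class labels into one binary label list per attribute."""
    attributes = sorted(next(iter(class_attributes.values())))
    return {
        m: [class_attributes[y][m] for y in class_labels]
        for m in attributes
    }


# Class labels of four training images; each attribute gets its own label list,
# on which any binary supervised learner could then be trained.
labels = ["zebra", "bee", "canary", "tiger"]
per_attribute = induce_attribute_labels(labels, CLASS_ATTRIBUTES)
```

Each attribute classifier then sees images of all training classes, which is what allows one attribute to draw on several classes at once.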
Indirect attribute prediction (IAP), depicted in Figure 2(c), also uses the attributes to transfer knowledge between classes, but the attributes form a connecting layer between two layers of labels, one for classes that are known at training time and one for classes that are not. The training phase of IAP is ordinary multi-class classification. At test time, the predictions for all training classes induce a labeling of the attribute layer, from which a labeling over the test classes can be inferred.
The major difference between both approaches lies in the relationship between training classes and test classes. Directly learning the attributes results in a network where all classes are treated equally. When class labels are inferred at test time, the decisions for all classes are based only on the attribute layer. We can therefore expect it to also handle the situation where training and test classes are not disjoint. In contrast, when predicting the attribute values indirectly, the training classes also occur at test time as an intermediate feature layer. On the one hand, this can introduce a bias, if training classes are also potential output classes during testing. On the other hand, one can argue that deriving the attribute layer from the label layer instead of from the samples will act as a regularization step that creates only sensible attribute combinations and therefore makes the system more robust. In the following, we will develop implementations for both methods and benchmark their performance.

Figure 2. Graphical representation of the proposed across-class learning task: dark gray nodes are always observed, light gray nodes are observed only during training. White nodes are never observed but must be inferred. An ordinary, flat, multi-class classifier (left) learns one parameter $\alpha_k$ for each training class. It cannot generalize to classes $(z_l)_{l=1,\dots,L}$ that are not part of the training set. In an attribute-based classifier (middle) with fixed class–attribute relations (thick lines), training labels $(y_k)_{k=1,\dots,K}$ imply training values for the attributes $(a_m)_{m=1,\dots,M}$, from which parameters $\beta_m$ are learned. At test time, attribute values can directly be inferred, and these imply output class labels even for previously unseen classes. A multi-class based attribute classifier (right) combines both ideas: multi-class parameters $\alpha_k$ are learned for each training class. At test time, the posterior distribution of the training class labels induces a distribution over the labels of unseen classes by means of the class–attribute relationship.
2.2. Implementation
Both cascaded classification methods, DAP and IAP, can in principle be implemented by combining a supervised classifier or regressor for the image–attribute or image–class prediction with a parameter-free inference method to channel the information through the attribute layer. In the following, we use a probabilistic model that reflects the graphical structures of Figures 2(b) and 2(c). For simplicity, we assume that all attributes have binary values such that the attribute representation $a^y = (a^y_1, \dots, a^y_M)$ for any training class $y$ is a fixed-length binary vector. Continuous attributes can in principle be handled in the same way by using regression instead of classification.
For DAP, we start by learning probabilistic classifiers for each attribute $a_m$. We use all images from all training classes as training samples with their label determined by the entry of the attribute vector corresponding to the sample’s label, i.e. a sample of class $y$ is assigned the binary label $a^y_m$. The trained classifiers provide us with estimates of $p(a_m|x)$, from which we form a model for the complete image–attribute layer as $p(a|x) = \prod_{m=1}^{M} p(a_m|x)$. At test time, we assume that every class $z$ induces its attribute vector $a^z$ in a deterministic way, i.e. $p(a|z) = [\![a = a^z]\!]$, making use of Iverson’s bracket notation: $[\![P]\!] = 1$ if the condition $P$ is true and it is 0 otherwise [19]. Applying Bayes’ rule we obtain $p(z|a) = \frac{p(z)}{p(a^z)} [\![a = a^z]\!]$ as representation of the attribute–class layer. Combining both layers, we can calculate the posterior of a test class given an image:

$$p(z|x) = \sum_{a \in \{0,1\}^M} p(z|a)\, p(a|x) = \frac{p(z)}{p(a^z)} \prod_{m=1}^{M} p(a^z_m|x). \qquad (1)$$

In the absence of more specific knowledge, we assume identical class priors, which allows us to ignore the factor $p(z)$ in the following. For the factor $p(a)$ we assume a factorial distribution $p(a) = \prod_{m=1}^{M} p(a_m)$, using the empirical means $p(a_m) = \frac{1}{K} \sum_{k=1}^{K} a^{y_k}_m$ over the training classes as attribute priors (see footnote 2). As decision rule $f: \mathcal{X} \to \mathcal{Z}$ that assigns the best output class from all test classes $z_1, \dots, z_L$ to a test sample $x$, we use MAP prediction:

$$f(x) = \operatorname*{argmax}_{l=1,\dots,L} \prod_{m=1}^{M} \frac{p(a^{z_l}_m|x)}{p(a^{z_l}_m)}. \qquad (2)$$
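The decision rule of Equation (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the attribute posteriors $p(a_m = 1|x)$ are assumed to be given by some already trained classifiers, the probability of a 0-valued attribute is taken as one minus that of the 1-valued attribute, and the product is evaluated as a sum of logarithms for numerical stability.

```python
import math


def attribute_priors(train_class_attributes):
    """Empirical mean of each binary attribute over the K training classes."""
    K = len(train_class_attributes)
    M = len(train_class_attributes[0])
    return [sum(a[m] for a in train_class_attributes) / K for m in range(M)]


def dap_predict(p_attr_given_x, test_class_attributes, priors, eps=1e-12):
    """Return the index l of the test class that maximizes Eq. (2).

    p_attr_given_x[m] is an estimate of p(a_m = 1 | x) for the test sample x.
    """
    best_l, best_score = None, -math.inf
    for l, a_z in enumerate(test_class_attributes):
        score = 0.0
        for m, bit in enumerate(a_z):
            # Probability that attribute m takes the value required by class z_l.
            post = p_attr_given_x[m] if bit else 1.0 - p_attr_given_x[m]
            prior = priors[m] if bit else 1.0 - priors[m]
            score += math.log(max(post, eps)) - math.log(max(prior, eps))
        if score > best_score:
            best_l, best_score = l, score
    return best_l


# Toy example with M = 3 attributes, K = 4 training and L = 2 test classes.
train_attrs = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
test_attrs = [[1, 1, 1], [0, 0, 0]]
priors = attribute_priors(train_attrs)
predicted = dap_predict([0.9, 0.8, 0.7], test_attrs, priors)
```

The clipping constant `eps` is an added safeguard against taking the logarithm of zero; it is not mentioned in the text.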
In order to implement IAP, we only modify the image–attribute stage: as a first step, we learn a probabilistic multi-class classifier estimating $p(y_k|x)$ for all training classes $y_1, \dots, y_K$. Again assuming a deterministic dependence between attributes and classes, we set $p(a_m|y) = [\![a_m = a^y_m]\!]$. The combination of both steps yields

$$p(a_m|x) = \sum_{k=1}^{K} p(a_m|y_k)\, p(y_k|x), \qquad (3)$$

so inferring the attribute posterior probabilities $p(a_m|x)$ requires only a matrix-vector multiplication. Afterwards, we continue in the same way as for DAP, classifying test samples using Equation (2).

Footnote 2: In practice, the prior $p(a)$ is not crucial to the procedure and setting $p(a_m) = \frac{1}{2}$ yields comparable results.
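Equation (3) can be illustrated as follows. Because $p(a_m|y_k)$ is 1 exactly when $a^{y_k}_m = 1$, the posterior $p(a_m = 1|x)$ is the class posterior vector multiplied by the binary class–attribute matrix. The function name and the toy numbers below are assumptions made for illustration; the class posteriors $p(y_k|x)$ would come from whatever multi-class classifier is used.

```python
def iap_attribute_posteriors(p_class_given_x, train_class_attributes):
    """p(a_m = 1 | x) = sum_k a^{y_k}_m * p(y_k | x), following Eq. (3)."""
    M = len(train_class_attributes[0])
    return [
        sum(a[m] * p for a, p in zip(train_class_attributes, p_class_given_x))
        for m in range(M)
    ]


# Toy example: K = 3 training classes, M = 2 attributes.
train_attrs = [[1, 0], [1, 1], [0, 1]]
p_y = [0.5, 0.25, 0.25]  # assumed output of a multi-class classifier
post = iap_attribute_posteriors(p_y, train_attrs)
```

The resulting attribute posteriors can then be passed to the decision rule of Equation (2) exactly as in the DAP case.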
3. Connections to Previous Work
Multi-layer or cascaded classifiers have a long tradition in pattern recognition and computer vision: multi-layer perceptrons [29], decision trees [5], mixtures of experts [17] and boosting [14] are prominent examples of classification systems built as feed-forward architectures with several stages. Multi-class classifiers are also often constructed as layers of binary decisions, from which the final output is inferred [7, 28]. These methods differ in their training methodologies, but they share the goal of decomposing a difficult classification problem into a collection of simpler ones. Because their emphasis lies on the classification performance in a fully supervised scenario, the methods are not capable of generalizing across class boundaries.
Especially in the area of computer vision, multi-layered classification systems have been constructed, in which intermediate layers have interpretable properties: artificial neural networks or deep belief networks have been shown to learn interpretable filters, but these are typically restricted to low-level properties like edge and corner detectors [27]. Popular local feature descriptors, such as SIFT [21] or HoG [6], can be seen as hand-crafted stages in a feed-forward architecture that transform an image from the pixel domain into a representation invariant to non-informative image variations. Similarly, image segmentation has been proposed as an unsupervised method to extract contours that are discriminative for object classes [37]. Such pre-processing steps are generic in the sense that they still allow the subsequent detection of arbitrary object classes. However, the basic elements, local image descriptors or segment shapes, alone are not reliable enough indicators of generic visual object classes, unless they are used as input to a subsequent statistical learning step.
On a higher level, pictorial structures [13], the constellation model [10] and recent discriminatively trained deformable part models [9] are examples of the many methods that recognize objects in images by detecting discriminative parts. In principle, humans can give descriptions of object classes in terms of such parts, e.g. arms or wheels. However, it is a difficult problem to build a system that learns to detect exactly the parts described. Instead, the identification of parts is integrated into the training of the model, which often reduces the parts to co-occurrence patterns of local feature points, not to units with a semantic meaning. In general, parts learned this way do generalize across class boundaries.
3.1. Sharing Information between Classes
The aspect of sharing information between classes has also been recognized as an interesting field before. A common idea is to construct multi-class classifiers in a cascaded way. By making similar classes share large parts of their decision paths, fewer classification functions need to be learned, thereby increasing the system’s performance [26]. Similarly, one can reduce the number of feature calculations by actively selecting low-level features that help discrimination for many classes simultaneously [33]. Combinations of both approaches are also possible [39].
In contrast, inter-class transfer does not aim at higher speed, but at better generalization performance, typically for object classes with only few available training instances. From known object classes, one infers prior distributions over the expected intra-class variance in terms of distortions [22] or shapes and appearances [20]. Alternatively, features that are known to be discriminative for some classes can be reused and adapted to support the detection of new classes [1]. To our knowledge, no previous approach allows the direct incorporation of human prior knowledge. Also, all methods require at least some training examples and cannot handle completely new object classes.
A notable exception is [8], which uses high-level attributes to learn descriptions of objects. Like our approach, this opens the possibility to generalize between categories.
3.2. Learning Semantic Attributes
A different line of relevant research occurring as one building block for attribute-based classification is the learning of high-level semantic attributes from images. Prior work in the area of computer vision has mainly studied elementary properties like colors and geometric patterns [11, 36, 38], achieving high accuracy by developing task-specific features and representations. In the field of multimedia retrieval, the annual TRECVID contest [32] contains a subtask of high-level feature extraction. It has stimulated a lot of research in the detection of semantic concepts, including the categorization of scenes, e.g. outdoor, urban, and high-level concepts, e.g. sports. Typical systems in this area combine many feature representations and, because they were designed for retrieval scenarios, they aim at high precision for low recall levels [34, 40].
Our own task of attribute learning targets a similar problem, but our final goal is not the prediction of a few individual attributes. Instead, we want to infer class labels by combining the predictions of many attributes. Therefore, we are relatively robust to prediction errors on the level of individual attributes, and we will rely on generic classifiers and standard image features instead of specialized setups.
In contrast to computer science, a lot of work in cognitive science has been dedicated to studying the relations between object recognition and attributes. Typical questions in the field are how human judgements are influenced by characteristic object attributes [23, 31]. A related line of research studies how the human performance in object detection tasks depends on the presence or absence of object properties and contextual cues [16]. Since one of our goals is to integrate human knowledge into a computer vision task, we would like to benefit from the prior work in this field, at least as a source of high quality data that, so far, cannot be obtained by an automatic process. In the following section, we describe a new dataset of animal images that allows us to make use of existing class-attribute association data, which was collected from cognitive science research.

[Figure 3: excerpt of the class–attribute matrix. Rows (classes): zebra, giant panda, deer, bobcat, pig, lion, mouse, polar bear, collie, walrus, raccoon, cow, dolphin. Columns (attributes): black, white, blue, brown, gray, orange, red, yellow, patches, spots, stripes, furry, hairless, toughskin, big, small, bulbous, lean, flippers, hands, hooves, pads, paws, longleg, longneck, tail, chewteeth, meatteeth, buckteeth, strainteeth, horns, claws, tusks.]

Figure 3. Class–attribute matrices from [24, 18]. The responses of persons were averaged to determine the real-valued association strength between attributes and classes. The darker the boxes, the less the attribute is associated with the class. Binary attributes are obtained by thresholding at the overall matrix mean.
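The binarization described in the caption of Figure 3, thresholding the real-valued association strengths at the overall matrix mean, can be sketched as below. The function name and the toy values are hypothetical, and the use of a strict "greater than" comparison at the threshold is an assumption, since the text does not say how ties are handled.

```python
def binarize_at_global_mean(matrix):
    """Threshold every entry of a class-attribute matrix at the overall mean."""
    values = [v for row in matrix for v in row]
    threshold = sum(values) / len(values)
    return [[1 if v > threshold else 0 for v in row] for row in matrix]


# Toy 2x2 matrix of association strengths (illustrative values only).
strengths = [[10.0, 70.0], [40.0, 0.0]]
binary = binarize_at_global_mean(strengths)
```

Note that a single global threshold, rather than one per attribute, means that rarely associated attributes may end up 0 for almost all classes.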
4. The Animals with Attributes Dataset
For their studies on attribute-based object similarity, Osherson and Wilkie [24] collected judgements from human subjects on the “relative strength of association” between 85 attributes and 48 animal classes. Kemp et al. [18] made use of the same data in a machine learning context and added 2 more animal classes. Figure 3 illustrates an excerpt of the resulting 50×85 class-attribute matrix. However, so far this data was not usable in a computer vision context, because the animals and attributes are only specified by their abstract names, not by example images. To overcome this problem, we have collected the Animals with Attributes data (see footnote 3).
4.1. Image Collection
We have collected example images for all 50 Osherson/Kemp animal classes by querying four large internet search engines, Google, Microsoft, Yahoo and Flickr, using the animal names as keywords. The resulting over 180,000 images were manually processed to remove outliers and duplicates, and to ensure that the target animal is in a prominent view in all cases. The remaining collection consists of 30475 images with a minimum of 92 images for any class. Figure 1 shows examples of some classes with the values of exemplary attributes assigned to this class. Altogether, animals are uniquely characterized by their attribute vector. Consequently, the Animals with Attributes dataset, formed by combining the collected images with the semantic attribute table, can serve as a testbed for the task of incorporating human knowledge into an object detection system.

Footnote 3: Available at attributes.kyb.tuebingen.mpg.de
4.2. Feature Representations
Feature extraction is known to have a big influence in computer vision tasks. For most image datasets, e.g. Caltech [15] and PASCAL VOC (see footnote 4), it has become difficult to judge the true performance of newly proposed classification methods, because results based on very different feature sets need to be compared. We have therefore decided to include a reference set of pre-extracted features into the Animals with Attributes dataset.
We have selected six different feature types: RGB color histograms, SIFT [21], rgSIFT [35], PHOG [4], SURF [2] and local self-similarity histograms [30]. The color histograms and PHOG feature vectors are extracted separately for all 21 cells of a 3-level spatial pyramid (1×1, 2×2, 4×4). For each cell, 128-dimensional color histograms are extracted and concatenated to form a 2688-dimensional feature vector. For PHOG, the same construction is used, but with 12-dimensional base histograms. The other feature vectors each are 2000-bin bag-of-visual-words histograms.

For the consistent evaluation of attribute-based object classification methods, we have selected 10 test classes: chimpanzee, giant panda, hippopotamus, humpback whale, leopard, pig, raccoon, rat, seal. The 6180 images of those classes act as test data, whereas the 24295 images of the remaining 40 classes can be used for training. Additionally, we also encourage the use of the dataset for regular large-scale multi-class or multi-label classification. For this we provide ordinary training/test splits with both parts containing images of all classes. In particular, we expect the Animals with Attributes dataset to be suitable to test hierarchical classification techniques, because the classes contain natural subgroups of similar appearance.
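The feature dimensions quoted above follow directly from the pyramid layout, which the short sketch below makes explicit. The helper names are hypothetical; the PHOG dimension is not stated in the text and is derived here only from the remark that the same construction is used with 12-dimensional base histograms.

```python
def pyramid_cells(grid_sizes=(1, 2, 4)):
    """Number of cells in a spatial pyramid with the given grid sizes per level."""
    return sum(g * g for g in grid_sizes)


def pyramid_feature_dim(bins_per_cell, grid_sizes=(1, 2, 4)):
    """Length of the concatenated per-cell histograms."""
    return pyramid_cells(grid_sizes) * bins_per_cell


cells = pyramid_cells()               # 1 + 4 + 16 cells
color_dim = pyramid_feature_dim(128)  # color histograms
phog_dim = pyramid_feature_dim(12)    # PHOG, derived under the stated assumption

# Consistency check of the split sizes given in the text.
total_images = 6180 + 24295
```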
5. Experimental Evaluation
In Section 2 we introduced DAP and IAP, two methods for attribute-based classification that allow the learning of object classification systems for classes for which no training samples are available. In the following, we evaluate both methods by applying them to the Animals with Attributes dataset. For DAP, we train a non-linear support vector machine (SVM) to predict each binary attribute $a_1, \dots, a_M$. All attribute SVMs are based on the same kernel, the sum of individual $\chi^2$-kernels for each feature, where the bandwidth parameters are fixed to five times the inverse of the median of the $\chi^2$-distances over the training samples. The SVM’s parameter $C$ is set to 10, which had been determined a priori by cross-validation on a subset of the training
Footnote 4: /challenges/VOC/
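The kernel construction described in Section 5 can be sketched as follows. Several details are assumptions, because the text does not spell them out: the $\chi^2$-distance is taken in the common form $\sum_i (x_i - y_i)^2 / (x_i + y_i)$, each individual kernel is taken as $\exp(-\gamma\, \chi^2(x, y))$, and the median is computed over all pairwise distances between distinct training samples. All function names are hypothetical.

```python
import math
from statistics import median


def chi2_distance(x, y):
    """One common form of the chi-squared distance between two histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi2_bandwidth(samples, factor=5.0):
    """factor times the inverse of the median pairwise chi-squared distance."""
    dists = [
        chi2_distance(samples[i], samples[j])
        for i in range(len(samples))
        for j in range(i + 1, len(samples))
    ]
    return factor / median(dists)


def summed_chi2_kernel(x_feats, y_feats, gammas):
    """Sum over feature types f of exp(-gamma_f * chi2(x_f, y_f))."""
    return sum(
        math.exp(-g * chi2_distance(x, y))
        for x, y, g in zip(x_feats, y_feats, gammas)
    )


# Toy histograms for a single feature type (illustrative values only).
samples = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
gamma = chi2_bandwidth(samples)

# Kernel value of a sample with itself when two feature types are summed.
k_same = summed_chi2_kernel(
    [samples[0], samples[0]], [samples[0], samples[0]], [gamma, gamma]
)
```

The sketch does not guard against a median distance of zero, which would need separate handling on degenerate data.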
