CoSy: Cognitive Systems for Cognitive Assistants
 
 
 

Cross-Modal Learning of Visual Categories

Today's object categorization methods use either supervised or unsupervised training. While supervised methods tend to produce more accurate results, unsupervised methods are highly attractive because they can exploit far larger amounts of unlabeled training data. Here we propose a new method that uses unsupervised training to obtain visual groupings of objects and a cross-modal learning scheme to overcome inherent limitations of purely unsupervised training.
The method uses a unified and scale-invariant object representation that makes it possible to handle labeled as well as unlabeled information in a coherent way. One potential scenario is learning object category models from many unlabeled observations and a few dialogue interactions that can be ambiguous or even erroneous.

System overview


Low-Level Vision

Feature Extraction:
When a new image is grabbed from the camera, SIFT descriptors (D. Lowe, IJCV 2004) are extracted at Hessian-Laplace interest points (K. Mikolajczyk and C. Schmid). An appearance-based visual codebook is then created by clustering these descriptors.
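As a rough illustration of the codebook step, the sketch below clusters descriptor vectors with plain k-means. The page only says "clustering", so k-means, the function names `build_codebook` and `nearest`, the deterministic initialisation, and the toy 2-D vectors are all assumptions made for this example, not the system's actual procedure.

```python
import math

def nearest(codebook, d):
    """Index of the codebook entry closest to descriptor d (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], d))

def build_codebook(descriptors, k, iterations=10):
    """Plain k-means (Lloyd's algorithm); an assumed stand-in for 'clustering'."""
    # Deterministic initialisation: the first k descriptors seed the clusters.
    codebook = [list(d) for d in descriptors[:k]]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for d in descriptors:
            members[nearest(codebook, d)].append(d)
        for i, group in enumerate(members):
            if group:  # keep the old centre if a cluster received no members
                codebook[i] = [sum(col) / len(group) for col in zip(*group)]
    return codebook

# Toy 2-D vectors stand in for real descriptors (SIFT uses 128 dimensions).
descriptors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
codebook = build_codebook(descriptors, k=2)
# Each descriptor is then represented by the index of its nearest codebook entry.
words = [nearest(codebook, d) for d in descriptors]
```

Once a codebook exists, every new descriptor can be quantised with `nearest`, which is what makes an appearance-based representation comparable across images.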
Object Representation:
We use a unified and scale-invariant object representation, the scale-invariant patterns (M. Fritz and B. Schiele, DAGM'06).
Object Discovery (unsupervised):
An adaptive approach estimates the potential object centers based on recently observed scenes. At this level, objects are defined as recurring patterns.

Visual Grouping (unsupervised)

An agglomerative clustering scheme is used to group object instances in an unsupervised manner. A result of such grouping is illustrated in the following image.
[Figure: Visual Grouping]
Our aim in the following is to label these automatically created visual groups based on language input by a human tutor.
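The agglomerative grouping described above can be sketched as follows. The page does not state the linkage criterion or the stopping rule, so average linkage, the `max_distance` threshold, and the toy 2-D points standing in for object instances are assumptions chosen only to keep the example small.

```python
import math

def average_linkage(a, b):
    """Mean pairwise Euclidean distance between two clusters (assumed linkage)."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, max_distance):
    """Repeatedly merge the two closest clusters until none are close enough."""
    clusters = [[p] for p in points]  # start with one cluster per instance
    while len(clusters) > 1:
        pairs = [(average_linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > max_distance:  # hypothetical stopping rule
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy 2-D points stand in for the object instance representations.
instances = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
groups = agglomerate(instances, max_distance=2.0)
```

No labels are involved at this stage; the resulting groups are what the language input is later used to name.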

Language sub-system (supervised input from dialogues)

Description: the mobile is left of the bottle
In human-assisted visual learning, a human tutor provides the system with descriptions of the current visual scene. To relate these descriptions to the visual input, the system constructs a representation of the meaning of an utterance. For this analysis we use a Combinatory Categorial Grammar (CCG) parser (J. Baldridge and G.-J. M. Kruijff, EACL'03). This parser uses the CCG grammar to relate the syntactic structure of an utterance to the propositional meaning it expresses. Meaning is represented as an ontologically richly sorted, relational structure similar to a description logic formula, which makes it possible to use ontologies to mediate between linguistically expressed meaning and the categories formed in the visual system.
In our system, we use
  • copulative sentences in indicative mood: X is Y,
  • and predications encoding spatial relations, e.g. the mobile is left of the bottle (see the corresponding logical form above).

Cross-Modal Associations

To visually represent spatial relations, such as left of, right of, above, and below, we employ triangular shaped distributions. Our formulation in (M. Fritz and B. Schiele, ICVS 2007) makes it possible to incorporate information and belief from previous interactions as well as to learn from scratch.
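To give a feel for what a triangular shaped model of a spatial relation might look like, the sketch below scores a pair of object centers against a relation using a triangular profile over the angle between them. The angular parameterisation, the 90 degree half-width, the coordinate convention (x to the right, y up), and the names `DIRECTIONS`, `triangular`, and `relation_score` are assumptions for this example; the formulation in the cited paper may differ, and the score here is not normalised into a probability distribution.

```python
import math

# Assumed canonical direction (degrees) per relation, with x right and y up.
DIRECTIONS = {"right of": 0.0, "above": 90.0, "left of": 180.0, "below": 270.0}

def triangular(offset_deg, half_width=90.0):
    """Triangular profile: 1 at zero offset, linearly down to 0 at half_width."""
    return max(0.0, 1.0 - abs(offset_deg) / half_width)

def relation_score(relation, target, landmark):
    """Score how well 'target is <relation> landmark' fits two object centers."""
    dx, dy = target[0] - landmark[0], target[1] - landmark[1]
    angle = math.degrees(math.atan2(dy, dx))
    # Wrap the angular difference into [-180, 180).
    offset = (angle - DIRECTIONS[relation] + 180.0) % 360.0 - 180.0
    return triangular(offset)

# "The mobile is left of the bottle": mobile at (0, 0), bottle at (5, 0).
score_left = relation_score("left of", (0.0, 0.0), (5.0, 0.0))
score_above = relation_score("above", (0.0, 0.0), (5.0, 0.0))
```

A score like this can be evaluated for every pair of discovered objects in a scene, which yields a soft belief about which pair an utterance refers to instead of a hard decision.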
The belief about associations between spatial expressions and objects is then propagated back to the visual groups created earlier in an unsupervised manner. We use simple count statistics and maximum likelihood (ML) estimates.
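The count statistics and ML estimates can be illustrated with a small sketch: each interaction adds a (possibly fractional) count for a label on a visual group, and the ML estimate of a label given a group is its share of that group's total count. The class name `LabelCounts`, the `weight` parameter, and the integer group identifiers are hypothetical and only serve this example.

```python
from collections import defaultdict

class LabelCounts:
    """Accumulates label evidence per visual group (illustrative names)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(float))

    def observe(self, group, label, weight=1.0):
        """Add evidence that `group` carries `label`; weight may be a soft belief."""
        self.counts[group][label] += weight

    def ml_estimate(self, group):
        """Maximum likelihood estimate of P(label | group) from the counts."""
        labels = self.counts.get(group, {})
        total = sum(labels.values())
        if total == 0:
            return {}
        return {label: count / total for label, count in labels.items()}

stats = LabelCounts()
# Three interactions associate group 3 with "mobile"; one ambiguous
# interaction associates it with "bottle" instead.
for label in ["mobile", "mobile", "bottle", "mobile"]:
    stats.observe(3, label)
estimate = stats.ml_estimate(3)
best_label = max(estimate, key=estimate.get)
```

Because the estimate is a ratio of counts, an occasional erroneous or ambiguous utterance shifts it only slightly once enough interactions have been accumulated.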

Further Information

A more detailed description of our system, as well as experiments, can be found in:

Mario Fritz, Geert-Jan M. Kruijff, and Bernt Schiele. Cross-Modal Learning of Visual Categories using Different Levels of Supervision. In Proceedings of the International Conference on Computer Vision Systems, Bielefeld, Germany, 2007.
Example - Scenario #1
Example - Scenario #2


 

Last modified: 2.4.2007 20:03:02