Low-Level Vision
- Feature Extraction:
- When a new image is grabbed from the camera, SIFT descriptors (D. Lowe, IJCV 2004) are extracted at Hessian-Laplace interest points (K. Mikolajczyk and C. Schmid). An appearance-based visual codebook is created by clustering.
- Object Representation:
- We use a unified, scale-invariant object representation, the scale-invariant patterns (M. Fritz and B. Schiele, DAGM'06).
- Object Discovery (unsupervised):
- An adaptive approach estimates the potential object centers based on recently observed scenes. At this level, objects are defined as reoccurring patterns.
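The codebook step in the feature extraction item can be illustrated with a small sketch. The page only says the codebook is "created by clustering", so the use of k-means, the farthest-first initialization, the function names, and the 2-D toy descriptors (real SIFT descriptors are 128-dimensional) are all assumptions made for illustration, not the system's actual implementation.

```python
# Minimal sketch: build a visual codebook by clustering local descriptors.
# ASSUMPTION: plain k-means with farthest-first initialization; the source
# does not name the clustering algorithm that the system uses.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, descriptor):
    """Index of the codebook entry closest to the descriptor."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(centers[i], descriptor))

def init_centers(descriptors, k):
    """Deterministic farthest-first initialization (hypothetical choice)."""
    centers = [list(descriptors[0])]
    while len(centers) < k:
        farthest = max(
            descriptors,
            key=lambda d: min(squared_distance(d, c) for c in centers),
        )
        centers.append(list(farthest))
    return centers

def build_codebook(descriptors, k, iterations=20):
    centers = init_centers(descriptors, k)
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for d in descriptors:
            buckets[nearest(centers, d)].append(d)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old center if no descriptor was assigned
                centers[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centers

def quantize(centers, descriptors):
    """Map each descriptor to the index of its nearest codebook entry."""
    return [nearest(centers, d) for d in descriptors]

# Toy 2-D "descriptors" forming two well-separated appearance clusters.
toy_descriptors = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
                   (5.0, 5.1), (5.1, 5.0), (4.9, 5.0)]
codebook = build_codebook(toy_descriptors, k=2)
words = quantize(codebook, toy_descriptors)
```

Each descriptor of a new image would then be replaced by the index of its nearest codebook entry, which is what `quantize` does for the toy data.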
Visual Grouping (unsupervised)
An agglomerative clustering scheme is used to group object instances in an unsupervised manner. A result of such grouping is illustrated in the following image.

Our aim in the following is to
label these automatically created visual groups based on language input
by a human tutor.
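As a rough illustration of the agglomerative grouping step, the sketch below repeatedly merges the two closest groups until the closest pair is farther apart than a threshold. The average-linkage criterion, the Euclidean distance, the `max_distance` stopping rule, and the 2-D toy instances are assumptions; the page does not state which linkage or distance the system uses.

```python
# Minimal sketch of agglomerative clustering over object instances.
# ASSUMPTIONS: average linkage, Euclidean distance, and a distance
# threshold as stopping criterion (none of these are given in the source).

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, max_distance):
    # Start with every instance in its own cluster (lists of indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_distance:
            break  # the closest pair is too far apart: stop merging
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy instance features: three instances of one object, two of another.
instances = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.1, 3.9), (0.1, 0.2)]
groups = agglomerate(instances, max_distance=1.0)
```

The resulting `groups` are lists of instance indices; these unlabeled groups are what the language input is later expected to name.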
Language sub-system (supervised input from dialogues)

In human-assisted visual learning, a human tutor provides the system with descriptions of the current visual scene. To relate these descriptions to the visual input, the system constructs a representation of the meaning of an utterance. For this analysis we use a Combinatory Categorial Grammar (CCG) parser (J. Baldridge and G-J. M. Kruijff, EACL'03). This parser uses the CCG grammar to relate the syntactic structure of an utterance to the propositional meaning it expresses. Meaning is represented as an ontologically richly sorted, relational structure similar to a description logic formula, which makes it possible to use ontologies to mediate between linguistically expressed meaning and the categories formed in the visual system.
In our system, we use
- copulative sentences in indicative mood: X is Y,
- and predications encoding spatial relations, e.g. the mobile is left of the bottle (see corresponding logical form above).
Cross-Modal Associations
To visually represent spatial relations, such as left of, right of, above, and below, we employ triangular-shaped distributions. Our formulation in (M. Fritz and B. Schiele, ICVS 2007) allows incorporating information and belief from previous interactions as well as learning from scratch.
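One way to picture a triangular-shaped model of a spatial relation is sketched below. It is only an illustration under stated assumptions: the relation is scored from the direction of the target relative to the landmark, the score peaks at a canonical direction and falls linearly to zero at 90 degrees away, the y axis points upward, and the score is left unnormalized. The actual parameterization is in the cited ICVS 2007 paper and may differ.

```python
import math

# ASSUMPTION: canonical directions in degrees, with x growing to the right
# and y growing upward (image coordinates usually have y growing downward).
CANONICAL_ANGLE = {
    "right of": 0.0,
    "above": 90.0,
    "left of": 180.0,
    "below": 270.0,
}

def triangular(angle_diff, half_width=90.0):
    """Triangular profile: 1 at the peak, 0 at +/- half_width and beyond."""
    return max(0.0, 1.0 - abs(angle_diff) / half_width)

def relation_score(relation, target, landmark):
    """Unnormalized score for 'target is <relation> landmark'."""
    dx = target[0] - landmark[0]
    dy = target[1] - landmark[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    # Signed angular difference to the canonical direction, in [-180, 180).
    diff = (angle - CANONICAL_ANGLE[relation] + 180.0) % 360.0 - 180.0
    return triangular(diff)

# Hypothetical object centers for "the mobile is left of the bottle".
mobile, bottle = (1.0, 2.0), (4.0, 2.0)
score_left = relation_score("left of", mobile, bottle)
score_right = relation_score("right of", mobile, bottle)
score_diagonal = relation_score("left of", (3.0, 3.0), (4.0, 2.0))
```

In this toy setting an object directly to the left gets the maximum score, an object on the opposite side gets zero, and an object up and to the left gets an intermediate score.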
The belief about spatial associations between spatial expressions and objects is then propagated back to the visual groups created earlier in an unsupervised manner. We use simple count statistics and maximum likelihood (ML) estimates.
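The count statistics and ML estimates mentioned above can be sketched as follows: each time a tutor's label is associated with a visual group, a counter is incremented, and the ML estimate of a label given a group is its relative frequency. The class name, method names, and toy observations are hypothetical; the page does not describe how the counts are stored or weighted.

```python
from collections import Counter, defaultdict

class LabelAssociations:
    """Count-based association between visual groups and tutor labels.

    ASSUMPTION: hard counts of (group, label) co-occurrences; the real
    system may weight observations by its belief in each association.
    """

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, group_id, label):
        self.counts[group_id][label] += 1

    def ml_estimate(self, group_id):
        """Maximum likelihood estimate of P(label | group)."""
        group_counts = self.counts.get(group_id, Counter())
        total = sum(group_counts.values())
        if total == 0:
            return {}
        return {label: n / total for label, n in group_counts.items()}

    def best_label(self, group_id):
        estimate = self.ml_estimate(group_id)
        return max(estimate, key=estimate.get) if estimate else None

# Toy interaction history: group 0 was mostly referred to as "mobile".
associations = LabelAssociations()
for label in ["mobile", "mobile", "bottle", "mobile"]:
    associations.observe(0, label)

estimate = associations.ml_estimate(0)
label = associations.best_label(0)
```

With these toy counts, the estimate for group 0 is 0.75 for "mobile" and 0.25 for "bottle", so the group would be labeled "mobile".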
Further Information
A more detailed description of our system, as well as experiments, can be found in:
Mario Fritz, Geert-Jan M. Kruijff, and Bernt Schiele. Cross-Modal Learning of Visual Categories using Different Levels of Supervision. In Proceedings of the International Conference on Computer Vision Systems, Bielefeld, Germany, 2007.