The PlayMate: An Object Manipulation Scenario
Activities
Learning to categorise objects

A robot able to handle novel objects must be able to identify not only
specific objects, for example my mobile phone, but also instances of a
category, such as any mobile phone. We have developed new methods that are
able to recognise objects at the category level. In addition we want
the robot to be able to learn to recognise everyday objects. We have
devised new methods for solving this problem, and shown how we can
tutor the robot via natural language. So to teach the robot we might
show it a phone, and say "This is a phone." The robot learns a visual
representation of the object that is suitable for category level
recognition. However, we do not want to teach the robot about every
member of a category of objects. Conveniently we can do without a
tutor and learn to categorise objects by grouping the objects based on
their inherent appearance. This unsupervised learning is quite good,
but does make mistakes. To get around this problem we allow the human
to tell the robot about a few examples of each object type, thus
making the learning both fast and precise. Our methods are
appearance based, which means that to recognise an object type from a
given viewpoint the robot has to be trained on broadly similar
viewpoints. They are, however, robust enough to allow some
variation in the orientation of the object.
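The combination of unsupervised grouping with a few tutor-provided labels can be pictured with a small sketch. It is an illustration only, not the project's method: it assumes plain k-means over hypothetical 2-D appearance features, and the names `kmeans`, `name_clusters` and `categorise` are invented for this example.

```python
from collections import Counter

def nearest(point, centres):
    """Index of the centre closest to point (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centres[i])))

def kmeans(points, centres, iterations=10):
    """Plain k-means starting from the given initial centres."""
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            groups[nearest(p, centres)].append(p)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centres)
        ]
    return centres

def name_clusters(centres, labelled):
    """Give each cluster the majority label among the tutor's examples in it."""
    votes = [Counter() for _ in centres]
    for features, label in labelled:
        votes[nearest(features, centres)][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Hypothetical appearance features for six unlabelled views (assumed data).
views = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
         (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
centres = kmeans(views, centres=[views[0], views[3]])

# The tutor names only one example per category.
names = name_clusters(centres, [((0.12, 0.18), "phone"), ((0.88, 0.82), "mug")])

def categorise(features):
    """Category name of the cluster that a new view falls into."""
    return names[nearest(features, centres)]
```

Here the grouping is done without any labels, and the two labelled views are used only to attach a name to each group, which is the division of labour described above.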
Appearance based learning of object parts for recognition

Much work on object recognition in the 1980s and early 1990s focussed
on the idea of part based recognition. The core idea is that we can
describe most objects from a small number of commonly used parts. If
we can recognise these parts and their configuration, we can recognise a
wide range of objects. At that time much work on object recognition
attempted to reconstruct a 3D model of the object. This is a very
difficult problem, and is still essentially unsolved. Because of the
difficulty of 3D reconstruction many researchers turned to recognition
methods that encode the appearance of an object from a particular
viewpoint, rather than its 3D structure. In this project we have
developed new methods for learning the parts of which an object is
composed, but in an appearance based framework.
Situated dialogue and spatial reference

When humans make references to objects they often do so using spatial
references that employ other objects as landmarks, for example I might
say "Pass me the mug next to the phone". In addition humans are quite
efficient communicators in that we will prefer references to an object
that are easy for the listener to process visually. So I would refer
to "the red mug" in preference to "the mug to the left of the
phone". Typically we only use spatial referencing when other forms of
reference are ambiguous, if for example there is more than one red mug
in the scene. For a robot to understand references to objects in
natural dialogue it must be able to process and make the right kinds
of reference. In addition interpreting spatial references requires
that the robot has a model of what it means for an object to be to the
left of another object. We have built a system that is able to
match references to objects in dialogue with the objects it can see in
the scene. This means that the robot can have a relatively natural
dialogue with a human.
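The preference for simple references over spatial ones can be sketched as follows. This is a minimal illustration under stated assumptions, not the system we built: the `Obj` record, the geometric test in `left_of` (including its `margin` parameter) and the `describe` function are all hypothetical, and only the "left of" relation is covered.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    kind: str
    colour: str
    x: float  # horizontal image position; smaller means further left
    y: float

def left_of(a, b, margin=0.05):
    """Assumed model: a is clearly to the left of b, and the vertical
    offset between them is no larger than the horizontal one."""
    dx = b.x - a.x
    return dx > margin and abs(a.y - b.y) <= dx

def describe(target, scene):
    """Prefer the cheapest unambiguous reference; use a landmark last."""
    same_kind = [o for o in scene if o.kind == target.kind]
    if len(same_kind) == 1:
        return f"the {target.kind}"
    same_colour = [o for o in same_kind if o.colour == target.colour]
    if len(same_colour) == 1:
        return f"the {target.colour} {target.kind}"
    # Colour is ambiguous, so look for a landmark whose kind is unique.
    for landmark in scene:
        if landmark.kind == target.kind:
            continue
        if sum(o.kind == landmark.kind for o in scene) != 1:
            continue
        if [o for o in same_colour if left_of(o, landmark)] == [target]:
            return (f"the {target.colour} {target.kind} "
                    f"to the left of the {landmark.kind}")
    return None  # no unambiguous description in this simplified sketch

scene = [
    Obj("mug", "red", 0.2, 0.5),
    Obj("mug", "red", 0.8, 0.5),
    Obj("phone", "black", 0.5, 0.5),
    Obj("mug", "blue", 0.9, 0.2),
]
```

With this scene, `describe(scene[3], scene)` yields "the blue mug", while `describe(scene[0], scene)` has to fall back on the phone as a landmark because there are two red mugs.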
Planning high and low level actions

Suppose you tell a robot to "Put the fork to the left of the plate,
and the knife to the right". To plan this activity the robot has to
reason both at the level of qualitative spatial reference "left of",
and at the continuous level --- where precisely it should put the
fork. In our approach we use a mapping between separate qualitative
and continuous models of space. This allows the robot to look at
the scene and extract both the precise spatial positions of the objects
and the resulting qualitative spatial model --- the fork is behind the
plate. When carrying out a task the robot plans at the high level
first --- for example the robot plans to pick up the fork and put it
down to the left of the plate --- and then it uses the mapping to the
precise model of space to pick a precise location in which to place
the fork. We then use a probabilistic road map planner to generate
the precise trajectory for the robot arm avoiding obstacles on the
way.
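The two directions of the mapping can be sketched in a few lines. This is an illustration under assumptions rather than our implementation: positions are 2-D table coordinates with larger y taken to mean "behind", the dominant-axis rule in `relation` is a stand-in for the real spatial model, and `place` with its `clearance`, `step` and `reach` parameters is hypothetical. Trajectory generation with the probabilistic road map planner is not shown.

```python
import math

def relation(obj, ref):
    """Continuous to qualitative: the dominant axis decides the label."""
    dx, dy = obj[0] - ref[0], obj[1] - ref[1]
    if abs(dx) >= abs(dy):
        return "left_of" if dx < 0 else "right_of"
    return "in_front_of" if dy < 0 else "behind"

def place(goal, ref, obstacles, clearance=0.08, step=0.05, reach=0.3):
    """Qualitative to continuous: the free grid point nearest to ref
    that satisfies the goal relation, or None if there is none."""
    n = int(round(reach / step))
    candidates = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            if (i, j) == (0, 0):
                continue
            p = (ref[0] + i * step, ref[1] + j * step)
            if relation(p, ref) != goal:
                continue
            # Reject positions too close to any object already on the table.
            if any(math.dist(p, o) < clearance for o in obstacles):
                continue
            candidates.append(p)
    return min(candidates, key=lambda p: math.dist(p, ref), default=None)

plate = (0.5, 0.5)
glass = (0.4, 0.5)  # occupies the most obvious spot to the left of the plate
spot = place("left_of", plate, obstacles=[plate, glass])
```

The high-level plan only needs the symbol `left_of`; the call to `place` is where that symbol is turned into coordinates that also respect what is already on the table.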
Architectures for robot cognition

How should we put together the pieces of an intelligent system? Our
approach is to group processes that share representations into groups
whose members communicate through a shared working memory. Each such group,
together with its memory, is called a sub-architecture. The complete
system is composed of a number of these. In AI terms this is very
similar to what is known as a distributed blackboard architecture. The
ability of each sub-architecture, or even of the processes within a
sub-architecture, to run concurrently, often on different computers,
is central to our approach. We have found that this parallelism or
concurrency enables the robot to process information from utterances
at the same time as looking at the scene, while reasoning about
spatial relations. This model of cognition creates many challenges,
not least of which is the challenge of engineering such large
distributed real-time systems. Our demonstration systems currently
contain about 35 basic components, distributed over seven
sub-architectures. This concurrency also means that the architecture
must have techniques for managing the way information and knowledge
flow around the system. Our work has resulted not just in the
conceptual architecture and demonstrable systems, but also in a software
toolkit that enables the relatively rapid engineering of such
cognitive systems. In the third year of the project we showed
experimentally that employing a multiple workspace model where
components are grouped according to their need for shared data results
in advantages in processing speed and response to change.
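The idea of components grouped around a shared working memory can be sketched as below. This is a deliberately simplified, single-threaded illustration and not the project's toolkit: the classes `WorkingMemory` and `SubArchitecture`, the entry types and the two components are all hypothetical, and real components would run concurrently rather than as direct callbacks.

```python
from collections import defaultdict

class WorkingMemory:
    """Shared store; components are notified when entries of a type appear."""
    def __init__(self):
        self.entries = {}
        self.listeners = defaultdict(list)
        self._next_id = 0

    def subscribe(self, entry_type, callback):
        self.listeners[entry_type].append(callback)

    def add(self, entry_type, data):
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = (entry_type, data)
        # Notify every component that registered interest in this type.
        for callback in list(self.listeners[entry_type]):
            callback(entry_id, data)
        return entry_id

class SubArchitecture:
    """A named group of components together with its working memory."""
    def __init__(self, name):
        self.name = name
        self.memory = WorkingMemory()

vision = SubArchitecture("vision")
binding = SubArchitecture("binding")

def recogniser(entry_id, region):
    # Component inside vision: turns a region into an object hypothesis.
    vision.memory.add("object", {"label": "mug", "region": entry_id})

def forwarder(entry_id, obj):
    # Passes object hypotheses on to another sub-architecture's memory.
    binding.memory.add("percept", {"source": vision.name, "label": obj["label"]})

vision.memory.subscribe("region", recogniser)
vision.memory.subscribe("object", forwarder)
vision.memory.add("region", {"bbox": (10, 20, 60, 80)})
```

Components never call each other directly here; they only read from and write to a memory, which is what lets them be moved to separate processes or machines.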
Cross-modal learning of visual qualities

How can we teach a robot what the word "red" means? We take
a simple approach in which information from the camera about the most
recently observed object is associated with the correct parts of the
utterances describing the object. This results in the ability to teach
the robot by showing it objects, and describing them. If I present the
robot with an object and tell it "This is a small yellow thing",
it will update the associations between simple visual properties of
the object (such as its bounding perimeter, hue, saturation and
intensity) and the qualities "yellow" and "small" in the communication
system. After a small number of objects, in the tens, the
learning system can learn descriptions for objects that are quite
reliable. In the third year of the project we showed how
unlearning can be incorporated to deal with overgeneralisation in
reference.
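One simple way to picture the association between words and visual properties is sketched below. It is an assumption-laden illustration, not our learning system: it keeps a running mean and variance of each feature for each word (Welford's method) and guesses that a word refers to the feature that varies least across the objects it was used for. The `Concept` class, the `teach` function and the normalised feature values are hypothetical, the quality words are assumed to have been extracted from the utterance already, and unlearning is not shown.

```python
class Concept:
    """Running mean and variance (Welford's method) per visual feature."""
    def __init__(self, features):
        self.n = 0
        self.mean = {f: 0.0 for f in features}
        self.m2 = {f: 0.0 for f in features}

    def update(self, observation):
        self.n += 1
        for f, x in observation.items():
            delta = x - self.mean[f]
            self.mean[f] += delta / self.n
            self.m2[f] += delta * (x - self.mean[f])

    def variance(self, f):
        return self.m2[f] / (self.n - 1) if self.n > 1 else float("inf")

    def best_feature(self):
        """The feature that is most consistent across examples of the word."""
        return min(self.mean, key=self.variance)

FEATURES = ("hue", "saturation", "intensity", "perimeter")
concepts = {}

def teach(words, observation):
    """Associate each quality word with the most recently observed object."""
    for w in words:
        concepts.setdefault(w, Concept(FEATURES)).update(observation)

# Assumed feature values, all normalised to the range 0..1.
teach(["small", "yellow"],
      {"hue": 0.16, "saturation": 0.8, "intensity": 0.7, "perimeter": 0.20})
teach(["large", "yellow"],
      {"hue": 0.17, "saturation": 0.5, "intensity": 0.4, "perimeter": 0.90})
teach(["small", "red"],
      {"hue": 0.01, "saturation": 0.6, "intensity": 0.9, "perimeter": 0.22})
teach(["yellow"],
      {"hue": 0.15, "saturation": 0.3, "intensity": 0.6, "perimeter": 0.50})
```

After these four examples, "yellow" is tied to hue and "small" to perimeter, because those are the only features that stay nearly constant whenever the respective word is used.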
Learning and Recognition of Intentional Human Actions

In order to understand or imitate human actions a robot needs to build
a model of them. We have developed a method for representing and
recognising actions that enables the robot to learn by watching video
clip examples of known actions, and then to classify new action
sequences that it sees. This will be used in later work to allow the
robot to watch a human performing some activity with objects, which it
is then able to reproduce. This will require that the robot is able to
identify the intention of an action, so that it achieves the purpose
rather than just slavishly copying the human.
Planning of sensory processing

Since visual scenes are so complex, and therefore beyond complete
visual analysis, robots must tailor their visual processing to the task
in hand. For a flexible robot this means that it must decide on the
fly which visual processing to perform. One way to do this is to use
planning. In this piece of work we used continual planning with
assertions as a way of generating plans for visual processing of a
scene so that the robot can answer queries. The work so far is only a
demonstration of the principle. Thus visual operators and plans are
both quite simple. If a plan fails because a step of the visual
processing does not return what is expected, then replanning is
triggered. In the next period we will look at a decision theoretic
approach to planning which takes into account the degree of
unreliability of visual processing.
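The plan, execute and replan cycle can be sketched as follows. This is not continual planning with assertions as used in the work itself; it swaps in a plain breadth-first planner over sets of facts, and the operator names, the `OPERATORS` table and the `answer_query` function are hypothetical.

```python
from collections import deque

# Hypothetical visual operators: name -> (preconditions, effects).
OPERATORS = {
    "segment_scene": (set(), {"regions"}),
    "colour_histogram": ({"regions"}, {"colour"}),
    "colour_by_mean_hue": ({"regions"}, {"colour"}),
    "match_shape": ({"regions"}, {"shape"}),
}

def plan(state, goal, operators):
    """Breadth-first search for a shortest operator sequence reaching goal."""
    start = frozenset(state)
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        facts, steps = frontier.popleft()
        if goal <= facts:
            return steps
        for name, (pre, eff) in operators.items():
            if pre <= facts:
                nxt = frozenset(facts | eff)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None

def answer_query(goal, run_operator):
    """Execute plans step by step; replan whenever a step fails."""
    state, available, trace = set(), dict(OPERATORS), []
    while True:
        steps = plan(state, goal, available)
        if steps is None:
            return None, trace
        for name in steps:
            trace.append(name)
            if run_operator(name):
                state |= available[name][1]
            else:
                del available[name]  # drop the failed operator, then replan
                break
        else:
            return state, trace

# Simulated execution in which the histogram operator fails on this scene.
facts, trace = answer_query({"colour"}, lambda name: name != "colour_histogram")
```

In this run the first plan breaks at `colour_histogram`, and the second plan reuses the regions already found and reaches the goal through `colour_by_mean_hue` instead.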