The PlayMate: An Object Manipulation Scenario
Objective

Sophisticated interaction with objects is one of the activities that
distinguishes intelligent creatures. Robots that are to work in human
environments will need to be able to manipulate objects, to learn
about their properties, and to do so in conjunction with humans. This
poses not just problems in manipulation, but significant challenges in
planning, language, learning, and vision. In particular it poses very
difficult problems in the area of architectures. How should a robot
that has experience of an object via vision, manipulation and dialogue
integrate the information from those different modalities? How can
such a robot plan purposeful activity, but be flexible enough to cope
with human interventions?
In this scenario we have not focussed on research in manipulation, but
on the system-level issues raised by a collaborative manipulation
task. Our work includes advances in vision for object categorisation,
part based recognition of objects, human robot interaction, planning
and execution monitoring, and architectures and representations for
binding them together.
Activities
Learning to categorise objects

A robot able to handle novel objects must be able to identify not only
specific objects, for example my mobile phone, but also instances of a
category, for example any mobile phone. We have developed new methods that are
able to recognise objects at the category level. In addition we want
the robot to be able to learn to recognise everyday objects. We have
devised new methods for solving this problem, and shown how we can
tutor the robot via natural language. So to teach the robot we might
show it a phone, and say "This is a phone." The robot learns a visual
representation of the object that is suitable for category level
recognition. However, we do not want to teach the robot about every
member of a category of objects. Conveniently we can do without a
tutor and learn to categorise objects by grouping the objects based on
their inherent appearance. This unsupervised learning is quite good,
but does make mistakes. To get around this problem we allow the human
to tell the robot about a few examples of each object type, thus
making the learning both fast and precise. Our methods are
appearance based, which means that to learn to recognise an object
type from a given viewpoint the robot has to be trained on broadly
similar viewpoints. They are, however, robust enough to allow some
variation in the orientation of the object.
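The combination of unsupervised grouping with a few tutor-provided examples can be illustrated with a minimal sketch. It uses seeded k-means, a standard semi-supervised technique chosen here purely for illustration; it is not necessarily the method used in the project. The feature vectors, labels and the function name seeded_kmeans are all hypothetical.

```python
import math


def _mean(vectors):
    """Component-wise mean of a list of equal-length feature tuples."""
    return tuple(sum(v) / len(vectors) for v in zip(*vectors))


def seeded_kmeans(points, seeds, iterations=10):
    """Group unlabelled points, using a few tutor-labelled seeds per category.

    points: list of appearance feature tuples (hypothetical features).
    seeds:  {label: [feature tuples]} supplied by the human tutor.
    Returns one label per point.
    """
    # The tutor's examples fix the initial cluster centres.
    centres = {label: _mean(examples) for label, examples in seeds.items()}
    assignment = []
    for _ in range(iterations):
        # Unsupervised step: assign each point to the nearest centre.
        assignment = [
            min(centres, key=lambda label: math.dist(p, centres[label]))
            for p in points
        ]
        # Update centres; tutor examples always stay in their own cluster.
        for label in centres:
            members = [p for p, a in zip(points, assignment) if a == label]
            centres[label] = _mean(members + seeds[label])
    return assignment


points = [(0.10, 0.20), (0.15, 0.25), (0.90, 0.80), (0.85, 0.90)]
seeds = {"phone": [(0.12, 0.22)], "mug": [(0.88, 0.85)]}
labels = seeded_kmeans(points, seeds)
```

Here one labelled example per category is enough to name both groups; with noisier features more examples per category would be needed.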
Appearance based learning of object parts for recognition

Much work on object recognition in the 1980s and early 1990s focussed
on the idea of part based recognition. The core idea is that we can
describe most objects from a small number of commonly used parts. If
we can recognise these and their configuration, we can recognise a
wide range of objects. At that time much work on object recognition
attempted to reconstruct a 3D model of the object. This is a very
difficult problem, and is still essentially unsolved. Because of the
difficulty of 3D reconstruction many researchers turned to recognition
methods that encode the appearance of an object from a particular
viewpoint, rather than its 3D structure. In this project we have
developed new methods for learning the parts of which an object is
composed, but in an appearance based framework.
Situated dialogue and spatial reference

When humans make references to objects they often do so using spatial
references that employ other objects as landmarks, for example I might
say "Pass me the mug next to the phone". In addition humans are quite
efficient communicators in that we will prefer references to an object
that are easy for the listener to process visually. So I would refer
to the "red mug" in preference to "the mug to the left of the
phone". Typically we only use spatial referencing when other forms of
reference are ambiguous, if for example there is more than one red mug
in the scene. For a robot to understand references to objects in
natural dialogue it must be able to process and make the right kinds
of reference. In addition interpreting spatial references requires
that the robot has a model of what it means for an object to be to the
left of another object. We have built a system that is able to
match references to objects in dialogue with the objects it can see in
the scene. This means that the robot can have a relatively natural
dialogue with a human.
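As an illustration of the idea, the following sketch resolves a reference first by category and colour, and falls back on a spatial relation to a landmark only when that is ambiguous. It is a simplified assumption of how such matching could work, not the project's implementation; SceneObject, left_of, resolve and the margin value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    name: str    # object category, e.g. "mug"
    colour: str
    x: float     # horizontal image position; smaller means further left


def left_of(a, b, margin=0.05):
    """Hypothetical model of 'left of': a lies left of b by at least margin."""
    return a.x < b.x - margin


def resolve(scene, name, colour=None, relation=None, landmark=None):
    """Return all scene objects that match the description."""
    candidates = [o for o in scene if o.name == name]
    if colour is not None:
        candidates = [o for o in candidates if o.colour == colour]
    if relation is not None and landmark is not None:
        landmarks = [o for o in scene if o.name == landmark]
        candidates = [
            o for o in candidates if any(relation(o, lm) for lm in landmarks)
        ]
    return candidates


scene = [
    SceneObject("mug", "red", 0.2),
    SceneObject("phone", "black", 0.5),
    SceneObject("mug", "red", 0.8),
]

# "the red mug" is ambiguous: two objects match.
ambiguous = resolve(scene, "mug", colour="red")
# "the red mug to the left of the phone" picks out exactly one.
unique = resolve(scene, "mug", colour="red", relation=left_of, landmark="phone")
```

The same function can serve generation as well as interpretation: a description is good enough to utter once resolve returns a single object.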
Planning high and low level actions

Suppose you tell a robot to "Put the fork to the left of the plate,
and the knife to the right". To plan this activity the robot has to
reason both at the level of qualitative spatial reference "left of",
and at the continuous level --- where precisely it should put the
fork. In our approach we use a mapping between separate qualitative
and continuous models of space. This allows the robot to look at a
scene, and extract both the precise spatial positions of the objects
and the resulting qualitative spatial model --- the fork is behind the
plate. When carrying out a task the robot plans at the high level
first --- for example the robot plans to pick up the fork and put it
down to the left of the plate --- and then it uses the mapping to the
precise model of space to pick a precise location in which to place
the fork. We then use a probabilistic road map planner to generate
the precise trajectory for the robot arm avoiding obstacles on the
way.
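A minimal sketch of such a two-way mapping is given below. One function abstracts continuous positions into a qualitative relation, and the other goes the opposite way by sampling a concrete position that satisfies a requested relation while keeping clear of other objects. The coordinate convention (y increasing away from the robot), the thresholds and all function names are assumptions made for illustration, and the trajectory planning step is not covered.

```python
import math
import random


def qualitative_relation(obj_xy, ref_xy):
    """Abstract two table positions into a qualitative relation.

    Assumes x increases to the right and y increases away from the robot.
    """
    dx = obj_xy[0] - ref_xy[0]
    dy = obj_xy[1] - ref_xy[1]
    if abs(dx) >= abs(dy):
        return "left_of" if dx < 0 else "right_of"
    return "in_front_of" if dy < 0 else "behind"


def sample_position(relation, ref_xy, obstacles, min_gap=0.1,
                    max_dist=0.4, tries=1000, rng=None):
    """Pick a continuous position that satisfies the qualitative relation.

    Rejection sampling around the reference object; returns None if no
    collision-free position is found within the given number of tries.
    """
    rng = rng or random.Random(0)
    for _ in range(tries):
        cand = (ref_xy[0] + rng.uniform(-max_dist, max_dist),
                ref_xy[1] + rng.uniform(-max_dist, max_dist))
        clear = all(math.dist(cand, o) >= min_gap for o in obstacles)
        if clear and qualitative_relation(cand, ref_xy) == relation:
            return cand
    return None


plate = (0.5, 0.5)
fork = (0.5, 0.8)

# Continuous to qualitative: the fork is currently behind the plate.
current = qualitative_relation(fork, plate)

# Qualitative to continuous: choose where to put the fork down.
target = sample_position("left_of", plate, obstacles=[plate])
```

The high-level planner only needs the relation names; the sampled target is what would be handed to the arm's trajectory planner.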
Architectures for robot cognition

How should we put together the pieces of an intelligent system? Our
approach is to group processes that share representations into groups
which communicate through a shared working memory. Each such group,
together with its memory, is called a sub-architecture. The complete
system is composed of a number of these. In AI terms this is very
similar to what is known as a distributed blackboard architecture. The
ability of each sub-architecture (or even of the processes within a
sub-architecture) to run concurrently, often on different computers,
is central to our approach. We have found that this parallelism or
concurrency enables the robot to process information from utterances
at the same time as looking at the scene, while reasoning about
spatial relations. This model of cognition creates many challenges,
not least of which is the challenge of engineering such large
distributed real-time systems. Our demonstration systems currently
contain about 35 basic components, distributed over seven
sub-architectures. This concurrency also means that the architecture
must have techniques for managing the way information and knowledge
flow around the system. Our work has resulted not just in the
conceptual architecture and demonstrable systems, but also in a software
toolkit that enables the relatively rapid engineering of such
cognitive systems. In the third year of the project we showed
experimentally that employing a multiple workspace model where
components are grouped according to their need for shared data results
in advantages in processing speed and response to change.
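The working memory idea can be sketched in a few lines. The example below is a single-process, synchronous simplification written for illustration only: the class and method names are hypothetical and do not reflect the project's software toolkit, and the real system runs its components concurrently across machines.

```python
from collections import defaultdict


class WorkingMemory:
    """Shared store for one sub-architecture; notifies subscribers on writes."""

    def __init__(self):
        self._entries = {}
        self._subscribers = defaultdict(list)

    def subscribe(self, entry_type, callback):
        """Register a component's callback for entries of a given type."""
        self._subscribers[entry_type].append(callback)

    def write(self, entry_id, entry_type, data):
        """Store an entry and notify every component subscribed to its type."""
        self._entries[entry_id] = (entry_type, data)
        for callback in list(self._subscribers[entry_type]):
            callback(entry_id, data)

    def read(self, entry_id):
        return self._entries[entry_id][1]


vision_wm = WorkingMemory()
log = []


def colour_classifier(entry_id, region):
    # A component that reacts to new regions and writes a derived entry
    # back to the same working memory.
    label = "red" if region["hue"] < 30 else "other"
    vision_wm.write(entry_id + ":colour", "colour_label", label)


vision_wm.subscribe("region", colour_classifier)
vision_wm.subscribe("colour_label", lambda eid, label: log.append((eid, label)))

# A segmentation component posts a region; the classifier fires in response.
vision_wm.write("obj1", "region", {"hue": 12})
```

Components never call each other directly: they interact only through entries in the shared memory, which is what lets them be grouped by the data they need.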
Cross-modal learning of visual qualities

How can we teach a robot what the meaning of the word red is? We take
a simple approach in which information from the camera about the most
recently observed object is associated with the correct parts of the
utterances describing the object. This results in the ability to teach
the robot by showing it objects, and describing them. If I present the
robot with an object and tell it "This is a small yellow thing", it
will update the associations between simple visual properties of the
object (such as its bounding perimeter, hue, saturation and intensity)
and the qualities "yellow" and "small" in the communication
system. Over a small number of objects (tens) the
learning system can learn descriptions for objects that are quite
reliable. In the third year of the project we have now shown how
unlearning can be incorporated to deal with overgeneralisation in
reference.
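One simple way to realise this kind of association, shown here only as a hedged sketch, is to keep running statistics of each visual feature for every word and to treat the feature that varies least across the examples as the one the word refers to. Unlearning then amounts to removing an example from the statistics. The feature names, the assumption that features are normalised to the range 0 to 1, and the class Concept are illustrative and not taken from the project.

```python
FEATURES = ("hue", "saturation", "intensity", "perimeter")


class Concept:
    """Running per-feature mean and spread for one word, e.g. 'yellow'."""

    def __init__(self):
        self.n = 0
        self.mean = {f: 0.0 for f in FEATURES}
        self.m2 = {f: 0.0 for f in FEATURES}  # sum of squared deviations

    def learn(self, obs):
        """Add one observed object (Welford's incremental update)."""
        self.n += 1
        for f in FEATURES:
            delta = obs[f] - self.mean[f]
            self.mean[f] += delta / self.n
            self.m2[f] += delta * (obs[f] - self.mean[f])

    def unlearn(self, obs):
        """Remove a previously learned object (inverse of learn)."""
        if self.n <= 1:
            self.n = 0
            self.mean = {f: 0.0 for f in FEATURES}
            self.m2 = {f: 0.0 for f in FEATURES}
            return
        for f in FEATURES:
            old_mean = (self.n * self.mean[f] - obs[f]) / (self.n - 1)
            self.m2[f] -= (obs[f] - old_mean) * (obs[f] - self.mean[f])
            self.mean[f] = old_mean
        self.n -= 1

    def best_feature(self):
        """The feature that is most consistent across the examples."""
        return min(FEATURES, key=lambda f: self.m2[f])


yellow = Concept()
examples = [
    {"hue": 0.16, "saturation": 0.9, "intensity": 0.8, "perimeter": 0.2},
    {"hue": 0.17, "saturation": 0.5, "intensity": 0.4, "perimeter": 0.9},
    {"hue": 0.15, "saturation": 0.7, "intensity": 0.6, "perimeter": 0.5},
]
for obs in examples:
    yellow.learn(obs)

# A wrongly described object is learned and later unlearned by the tutor.
mistake = {"hue": 0.60, "saturation": 0.7, "intensity": 0.6, "perimeter": 0.5}
yellow.learn(mistake)
yellow.unlearn(mistake)
```

After the three examples the word is tied to hue, because the objects differ in size and brightness but not in hue; unlearning the mistaken example restores the earlier statistics.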
Learning and Recognition of Intentional Human Actions

In order to understand or imitate human actions a robot needs to build
a model of them. We have developed a method for representing and
recognising actions that enables the robot to learn by watching video
clip examples of known actions, and then classifying new action
sequences that it sees. This will be used in later work to allow the
robot to watch a human performing some activity with objects, which it
is then able to reproduce. This will require that the robot is able to
identify the intention of an action, so that it achieves the purpose
rather than just slavishly copying the human.
Planning of sensory processing

Visual scenes are so complex that they are beyond complete visual
analysis, so robots must tailor their visual processing to the task
in hand. For a flexible robot this means that it must decide on the
fly which visual processing to perform. One way to do this is to use
planning. In this piece of work we used continual planning with
assertions as a way of generating plans for visual processing of a
scene so that the robot can answer queries. The work so far is only a
demonstration of the principle. Thus visual operators and plans are
both quite simple. If the plans fail because a step of the visual
processing does not return what is expected then replanning is
triggered. In the next period we will look at a decision theoretic
approach to planning which takes into account the degree of
unreliability of visual processing.
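The replanning behaviour can be sketched as a simple loop. The example below only illustrates the trigger (a visual operator failing to return what was expected causes a new plan to be requested); it does not implement continual planning with assertions, and the planner, the operator names and the query are all hypothetical.

```python
def plan_for(query, failed):
    """Hypothetical planner: return the first operator sequence for the
    query that avoids operators already known to have failed."""
    options = {
        "colour_of_mug": [
            ["detect_by_shape", "read_colour"],
            ["detect_by_features", "read_colour"],
        ],
    }
    for plan in options[query]:
        if not failed.intersection(plan):
            return plan
    return None


def answer(query, operators, max_replans=3):
    """Execute a visual processing plan, replanning when a step fails."""
    failed = set()
    for _ in range(max_replans + 1):
        plan = plan_for(query, failed)
        if plan is None:
            return None
        state = {}
        for step in plan:
            if not operators[step](state):
                failed.add(step)   # remember the failure and replan
                break
        else:
            return state.get("answer")
    return None


def detect_by_shape(state):
    return False  # simulated failure: the detector finds nothing


def detect_by_features(state):
    state["region"] = (40, 60, 32, 32)  # made-up bounding box
    return True


def read_colour(state):
    if "region" not in state:
        return False
    state["answer"] = "red"
    return True


operators = {
    "detect_by_shape": detect_by_shape,
    "detect_by_features": detect_by_features,
    "read_colour": read_colour,
}
result = answer("colour_of_mug", operators)
```

A decision theoretic planner would differ mainly in plan_for, which would weigh each operator's reliability rather than simply excluding operators that have failed.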
Hardware
The experiments with the integrated PlayMate systems are carried out
at the University of Birmingham, and DFKI using a PeopleBot, a B21r
and a Katana arm. Both the mobile platforms have one or two cameras
mounted on a pan-tilt unit, which allows the robot to scan the
scene. The Katana arm used for manipulation also uses a wrist mounted
camera to allow more precise grasping of objects. Because we are
limited to using a two finger gripper we have currently restricted the
graspable objects to everyday objects like cereal packets.
Development
Year 1
The emphasis for the first year was on integrating
spatial understanding from vision with that from language. This
required building systems that could maintain several models of the
world, and exchange information between them. We were able to
demonstrate a system that could couple the learning of the names of
objects with the learning of their appearance, and could then answer
questions about the relative locations and names of these objects. The
system allows the human speaker to move objects in and out of the
scene in real time.
Year 2
During the second year we worked on integrated
vision and dialogue together with planning, and learning of
ontologies. This required a new kind of software architecture that
maintains a distributed set of representations of the scene, the
intentions of the actors, and of its general knowledge. In addition to
all the functionality of the first year system the resulting system is
capable of learning about the meanings of colour words, and
descriptions of shape, and size. It is also able to follow
instructions to move objects. These can be specified in terms of spatial
relations that are quite natural for humans, for example "Put the blue
thing to the left of the red box".
Year 3
During the third year we worked on
integrating manipulation with continual planning. This means that the
robot can replan manipulations to achieve qualitative states --- put
the red thing to the left of the blue thing --- even when the human
interferes with its activities. If the world turns out differently
than expected it replans on the fly. The third year also allowed
integration of incremental processing of utterances with information
from other modalities via binding. Finally we used the PlayMate
scenario to investigate trade-offs in the space of architectures.
Videos
Year 1
- Visual Learning
- Spatial Relationships
- Context Sensitive Saliency
Year 2
- Cross Modal Ontology Learning
- Language driven manipulation
- Action learning and recognition
- Action recognition (.avi) [01'10/52.4MB]: Using a learned hierarchical graphical model of action to recognise push and pull actions with a variety of objects.
Year 3
- Continuous cross modal learning and unlearning
Publications
2007

D. Skocaj, G. Berginc, B. Ridge, A. Stimec, M. Jogan, O. Vanek,
A. Leonardis, M. Hutter, and N. Hawes.
A system for continuous learning of visual concepts.
In International Conference on Computer Vision Systems ICVS
2007, Bielefeld, Germany, March 2007.

M. Brenner, N. Hawes, J. Kelleher, and J. Wyatt.
Mediating between qualitative and quantitative representations for
task-orientated human-robot interaction.
In Proc. of the Twentieth International Joint Conference on
Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.

R. Triebel, R. Schmidt, O. Martínez Mozos, and W. Burgard.
Instance-based AMN classification for improved object recognition in
2D and 3D laser range data.
In Proc. of the Twentieth International Joint Conference on
Artificial Intelligence (IJCAI), pages 2225-2230, Hyderabad, India, 2007.

2006

S. Hongeng and J. L. Wyatt.
Learning causality and intention in human actions.
In Proceedings of the 6th IEEE-RAS International Conference on
Humanoid Robots, Genoa, Italy, December 2006.

M. Fritz and B. Schiele.
Towards unsupervised discovery of visual categories.
In Proceedings of 28th Annual Symposium of the German
Association for Pattern Recognition (DAGM), Berlin, Germany, September 2006.

K. Mikolajczyk, B. Leibe, and B. Schiele.
Multiple object class detection with a generative model.
In Proceedings of the Conference on Computer Vision and Pattern
Recognition, New York, USA, June 2006.

Aaron Sloman.
How to put the pieces of AI together again.
In Proceedings AAAI'06, Boston, July 2006.

E. Seemann, B. Leibe, and B. Schiele.
Multi-aspect detection of articulated objects.
In Proceedings of the Conference on Computer Vision and Pattern
Recognition, New York, USA, June 2006.

Danijel Skocaj, Martina Uray, Ales Leonardis, and Horst Bischof.
Why to combine reconstructive and discriminative information for
incremental subspace learning.
In CVWW 2006, Telc, Czech Republic, February 2006.

Aaron Sloman, Jeremy Wyatt, Nick Hawes, Jackie Chappell, and Geert-Jan M.
Kruijff.
Long term requirements for cognitive robotics.
In Proceedings CogRob2006, The Fifth International Cognitive
Robotics Workshop. The AAAI-06 Workshop on Cognitive Robotics, Boston,
Massachusetts, USA, July 2006.

B. Leibe, A. Leonardis, and B. Schiele.
Robust object detection by interleaving categorization and
segmentation.
International Journal of Computer Vision, 2006.

Geert-Jan M. Kruijff, John D. Kelleher, and Nick Hawes.
Information fusion for visual reference resolution in dynamic
situated dialogue.
In Elisabeth Andre, Laila Dybkjaer, Wolfgang Minker, Heiko Neumann,
and Michael Weber, editors, Perception and Interactive Technologies:
International Tutorial and Research Workshop, PIT 2006, volume 4021 of
Lecture Notes in Computer Science, pages 117-128, Kloster Irsee, Germany,
June 2006. Springer Berlin / Heidelberg.

B. Leibe, K. Mikolajczyk, and B. Schiele.
Efficient clustering and matching for object class recognition.
In Proceedings of the 17th British Machine Vision Conference,
Edinburgh, England, 2006.

B. Leibe, K. Mikolajczyk, and B. Schiele.
Segmentation based multi-cue integration for object detection.
In Proceedings of the 17th British Machine Vision Conference,
Edinburgh, England, 2006.

D. Skocaj, A. Leonardis, and H. Bischof.
Weighted and robust learning of subspace representations.
Pattern Recognition, pages 1556-1569, 2006.
(accepted for publication).

Nick Hawes and Jeremy Wyatt.
Towards context-sensitive visual attention.
In Proceedings of the Second International Cognitive Vision
Workshop (ICVW06), Graz, Austria, May 2006.

2005

M. Fritz, B. Leibe, B. Caputo, and B. Schiele.
Integrating representative and discriminant models for object
category detection.
In Proceedings of the 10th International Conference on Computer
Vision, Beijing, China, 2005.

K. Mikolajczyk, B. Leibe, and B. Schiele.
Local features for object class recognition.
In Proceedings of the 10th International Conference on Computer
Vision, Beijing, China, volume 2, pages 1792-1799, 2005.

Aaron Sloman and Jackie Chappell.
The altricial-precocial spectrum for robots.
In Proceedings IJCAI'05, pages 1187-1192, Edinburgh, July
2005.

Aaron Sloman and Bernt Schiele.
IJCAI-05 tutorial on representation and learning in robots and
animals.
Edinburgh, July 2005.

Peter Roth, Helmut Grabner, Danijel Skocaj, Horst Bischof, and Ales
Leonardis.
Conservative visual learning for object detection with minimal hand
labeling effort.
In DAGM 2005, Vienna, Austria, 2005.

J. Wyatt.
Planning clarification questions to resolve ambiguous references to
objects.
In Proceedings of the 4th Workshop on Knowledge and Reasoning in
Practical Dialogue Systems, held at IJCAI 05, 2005.