CoSy: Cognitive Systems for Cognitive Assistants


The PlayMate: An Object Manipulation Scenario

Objective

Sophisticated interaction with objects is one of the activities that distinguishes intelligent creatures. Robots that are to work in human environments will need to be able to manipulate objects, to learn about their properties, and to do so in conjunction with humans. This poses not just problems in manipulation, but significant challenges in planning, language, learning, and vision. In particular, it poses very difficult problems in the area of architectures. How should a robot that has experience of an object via vision, manipulation and dialogue integrate the information from those different modalities? How can such a robot plan purposeful activity, but be flexible enough to cope with human interventions? In this scenario we have not focussed on research in manipulation itself, but on the system-level issues raised by a collaborative manipulation task. Our work includes advances in vision for object categorisation, part-based recognition of objects, human-robot interaction, planning and execution monitoring, and architectures and representations for binding them together.

Activities

Learning to categorise objects


A robot able to handle novel objects must be able to identify not only specific objects, for example my mobile phone, but also instances of a category, any mobile phone. We have developed new methods that recognise objects at the category level. In addition, we want the robot to be able to learn to recognise everyday objects. We have devised new methods for solving this problem, and shown how we can tutor the robot via natural language. So to teach the robot we might show it a phone and say "This is a phone." The robot learns a visual representation of the object that is suitable for category-level recognition. However, we do not want to have to teach the robot about every member of a category of objects. Conveniently, we can do without a tutor and learn to categorise objects by grouping them according to their inherent appearance. This unsupervised learning is quite good, but does make mistakes. To get around this problem we allow the human to tell the robot about a few examples of each object type, making the learning both fast and precise. Our methods are appearance based, which means that to learn to recognise an object type from a given viewpoint the robot has to be trained with broadly similar viewpoints; the methods are nevertheless robust enough to allow some variation in the orientation of the object.
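The combination of unsupervised grouping with a few tutor-provided labels can be sketched as follows. This is only an illustration under stated assumptions, not the project's method: the 2-D feature vectors, the use of plain k-means for the grouping, and the names `kmeans` and `name_clusters` are all hypothetical.

```python
import math

def nearest(point, centres):
    """Index of the centre closest to point (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))

def kmeans(points, centres, iterations=10):
    """Plain k-means from given initial centres; returns (centres, assignments)."""
    centres = [tuple(c) for c in centres]
    for _ in range(iterations):
        assign = [nearest(p, centres) for p in points]
        for i in range(len(centres)):
            members = [p for p, a in zip(points, assign) if a == i]
            if members:  # keep the old centre if a cluster is empty
                centres[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centres, [nearest(p, centres) for p in points]

def name_clusters(assign, tutor_labels):
    """Name each cluster by majority vote over the few tutor-labelled examples."""
    votes = {}
    for idx, label in tutor_labels.items():
        cluster_votes = votes.setdefault(assign[idx], {})
        cluster_votes[label] = cluster_votes.get(label, 0) + 1
    return {c: max(v, key=v.get) for c, v in votes.items()}

# Hypothetical 2-D appearance features for six observed objects.
features = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
            (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
centres, assign = kmeans(features, centres=[features[0], features[3]])

# The tutor names only two of the six objects ("This is a phone.").
names = name_clusters(assign, {0: "phone", 4: "mug"})
labels = [names.get(a, "unknown") for a in assign]
```

Here two spoken labels are enough to name all six objects, because the grouping has already been done without supervision; a cluster that receives no tutor label stays "unknown".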

Appearance based learning of object parts for recognition


Much work on object recognition in the 1980s and early 1990s focussed on the idea of part-based recognition. The core idea is that we can describe most objects using a small number of commonly used parts. If we can recognise these parts and their configuration, we can recognise a wide range of objects. At that time much work on object recognition attempted to reconstruct a 3D model of the object. This is a very difficult problem, and is still essentially unsolved. Because of the difficulty of 3D reconstruction, many researchers turned to recognition methods that encode the appearance of an object from a particular viewpoint, rather than its 3D structure. In this project we have developed new methods for learning the parts of which an object is composed, but within an appearance-based framework.

Situated dialogue and spatial reference


When humans make references to objects they often do so using spatial references that employ other objects as landmarks. For example, I might say "Pass me the mug next to the phone". In addition, humans are quite efficient communicators, in that we prefer references to an object that are easy for the listener to process visually. So I would refer to "the red mug" in preference to "the mug to the left of the phone". Typically we only use spatial referencing when other forms of reference are ambiguous, for example if there is more than one red mug in the scene. For a robot to understand references to objects in natural dialogue it must be able to process and make the right kinds of reference. In addition, interpreting spatial references requires that the robot has a model of what it means for an object to be to the left of another object. We have built a system that is able to match references to objects in dialogue with the objects it can see in the scene. This means that the robot can have a relatively natural dialogue with a human.
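The preference for simple references over spatial ones can be illustrated with a small sketch. It assumes a scene reduced to object kind, colour and one horizontal coordinate, and a deliberately crude model of "left of"; the class `Obj` and the functions `resolve` and `describe` are hypothetical names, not part of the system described here.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    kind: str
    colour: str
    x: float  # horizontal position in the robot's view; larger is further right

def left_of(a, b):
    """Crude model of 'left of': a lies to the left of b in the view."""
    return a.x < b.x

def resolve(scene, kind, colour=None, left_of_kind=None):
    """Return the scene objects that match a parsed reference."""
    matches = [o for o in scene if o.kind == kind]
    if colour is not None:
        matches = [o for o in matches if o.colour == colour]
    if left_of_kind is not None:
        landmarks = [o for o in scene if o.kind == left_of_kind]
        matches = [o for o in matches
                   if any(left_of(o, lm) for lm in landmarks)]
    return matches

def describe(scene, target):
    """Prefer a colour reference; use a spatial one only if colour is ambiguous."""
    if len(resolve(scene, target.kind, colour=target.colour)) == 1:
        return f"the {target.colour} {target.kind}"
    for landmark in scene:
        if landmark.kind == target.kind:
            continue  # an object of the same kind is a poor landmark
        if resolve(scene, target.kind, left_of_kind=landmark.kind) == [target]:
            return f"the {target.kind} to the left of the {landmark.kind}"
    return None  # neither strategy yields an unambiguous description

mug1 = Obj("mug1", "mug", "red", 0.2)
phone = Obj("phone1", "phone", "black", 0.5)
mug2 = Obj("mug2", "mug", "red", 0.8)
scene = [mug1, phone, mug2]

phrase_for_phone = describe(scene, phone)
phrase_for_mug1 = describe(scene, mug1)
```

With two red mugs in the scene the colour reference is ambiguous, so the sketch falls back to the phone as a landmark, mirroring the behaviour described above.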

Planning high and low level actions


Suppose you tell a robot to "Put the fork to the left of the plate, and the knife to the right". To plan this activity the robot has to reason both at the level of qualitative spatial reference ("left of") and at the continuous level (where precisely it should put the fork). In our approach we use a mapping between separate qualitative and continuous models of space. This allows the robot to look at the scene and extract both the precise spatial positions of the objects and the resulting qualitative spatial model, for example that the fork is currently behind the plate. When carrying out a task the robot plans at the high level first, for example to pick up the fork and put it down to the left of the plate, and then uses the mapping to the precise model of space to pick a precise location in which to place the fork. We then use a probabilistic road map planner to generate the precise trajectory for the robot arm, avoiding obstacles on the way.
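The two directions of such a mapping can be sketched in a few lines. The sketch makes several assumptions that are not taken from the project: table coordinates in metres with x growing to the right and y growing away from the robot, a dominant-axis rule for choosing the relation, and rejection sampling for picking a placement. The function names are hypothetical, and trajectory generation with the road map planner is not covered.

```python
import math
import random

def qualitative_relation(obj, ref, margin=0.02):
    """Map continuous (x, y) positions to a qualitative relation of obj to ref."""
    dx, dy = obj[0] - ref[0], obj[1] - ref[1]
    if abs(dx) >= abs(dy):  # the horizontal offset dominates
        if dx < -margin:
            return "left_of"
        if dx > margin:
            return "right_of"
    else:                   # the depth offset dominates
        if dy > margin:
            return "behind"
        if dy < -margin:
            return "in_front_of"
    return "at"

def sample_position(relation, ref, obstacles, min_gap=0.08, reach=0.3,
                    tries=1000, rng=None):
    """Pick a continuous position that satisfies the qualitative goal
    and keeps a minimum distance from every obstacle."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    for _ in range(tries):
        cand = (ref[0] + rng.uniform(-reach, reach),
                ref[1] + rng.uniform(-reach, reach))
        clear = all(math.dist(cand, o) >= min_gap for o in obstacles)
        if clear and qualitative_relation(cand, ref) == relation:
            return cand
    return None  # no suitable position found within the given number of tries

plate, fork, knife = (0.5, 0.4), (0.5, 0.6), (0.8, 0.7)

# Continuous to qualitative: describe the scene as observed.
observed = qualitative_relation(fork, plate)

# Qualitative to continuous: choose where to put the fork for "left of".
target = sample_position("left_of", plate, obstacles=[plate, knife])
```

The high-level planner would only ever see relations such as `left_of`; the sampled `target` is what a motion planner would then be asked to reach.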

Architectures for robot cognition


How should we put together the pieces of an intelligent system? Our approach is to collect processes that share representations into groups, each of which communicates through a shared working memory. Such a group, together with its memory, is called a sub-architecture. The complete system is composed of a number of these. In AI terms this is very similar to what is known as a distributed blackboard architecture. The ability of each sub-architecture, or even the processes within a sub-architecture, to run concurrently, often on different computers, is central to our approach. We have found that this concurrency enables the robot to process information from utterances at the same time as looking at the scene and reasoning about spatial relations. This model of cognition creates many challenges, not least of which is the challenge of engineering such large distributed real-time systems. Our demonstration systems currently contain about 35 basic components, distributed over seven sub-architectures. This concurrency also means that the architecture must have techniques for managing the way information and knowledge flow around the system. Our work has resulted not just in the conceptual architecture and demonstrable systems, but also in a software toolkit that enables the relatively rapid engineering of such cognitive systems. In the third year of the project we showed experimentally that employing a multiple workspace model, where components are grouped according to their need for shared data, results in advantages in processing speed and response to change.
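A single sub-architecture, with concurrent processes communicating only through a shared working memory, can be sketched with threads and queues. This is an illustrative toy under assumed names (`WorkingMemory`, `component`, the key prefixes `region:`, `colour:` and `label:`); it is not the API of the project's toolkit, and it runs in one process rather than across several computers.

```python
import queue
import threading

class WorkingMemory:
    """Shared store for one sub-architecture; notifies subscribers of changes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}
        self._subscribers = []

    def subscribe(self):
        """Register a listener and return the queue it will receive changes on."""
        inbox = queue.Queue()
        with self._lock:
            self._subscribers.append(inbox)
        return inbox

    def write(self, key, value):
        with self._lock:
            self._entries[key] = value
            subscribers = list(self._subscribers)
        for inbox in subscribers:
            inbox.put((key, value))

    def read(self, key):
        with self._lock:
            return self._entries.get(key)

def component(memory, inbox, wanted_prefix, transform):
    """One process: react to matching entries and write a derived entry back."""
    while True:
        key, value = inbox.get()
        if key == "stop":
            break
        if key.startswith(wanted_prefix):
            memory.write(*transform(key, value))

def name_colour(key, value):
    obj = key.split(":", 1)[1]
    return "colour:" + obj, "red" if value["hue"] < 30 else "other"

def name_object(key, value):
    obj = key.split(":", 1)[1]
    return "label:" + obj, value + " thing"

vision = WorkingMemory()
watch = vision.subscribe()  # the main thread listens for the final result
threads = [
    threading.Thread(target=component,
                     args=(vision, vision.subscribe(), "region:", name_colour)),
    threading.Thread(target=component,
                     args=(vision, vision.subscribe(), "colour:", name_object)),
]
for t in threads:
    t.start()

vision.write("region:obj1", {"hue": 10})  # e.g. produced by a segmenter
while True:
    key, _ = watch.get(timeout=5)
    if key.startswith("label:"):
        break
vision.write("stop", None)  # ask both components to finish
for t in threads:
    t.join()
result = vision.read("label:obj1")
```

Neither component calls the other directly: each reacts to changes in the shared memory, which is what lets components be added, removed or run in parallel without rewiring the rest.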

Cross-modal learning of visual qualities


How can we teach a robot the meaning of the word "red"? We take a simple approach in which information from the camera about the most recently observed object is associated with the correct parts of the utterances describing the object. This results in the ability to teach the robot by showing it objects and describing them. If I present the robot with an object and tell it "This is a small yellow thing", it will update the associations between simple visual properties of the object (such as its bounding perimeter, hue, saturation and intensity) and the qualities "yellow" and "small" in the communication system. Over a small number of objects, in the tens, the learning system can learn descriptions for objects that are quite reliable. In the third year of the project we showed how unlearning can be incorporated to deal with overgeneralisation in reference.
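One simple way to picture this kind of incremental association is to keep, for every quality word, running statistics of the visual features seen with it. The sketch below is an assumption-laden illustration, not the project's learning method: it uses Welford's running mean and variance, treats the feature with the lowest variance as the one the word refers to, models unlearning as removing a previously learned example, and assumes all features are scaled to [0, 1]. The names `Concept` and `teach` are hypothetical.

```python
FEATURES = ("hue", "saturation", "intensity", "size")  # assumed scaled to [0, 1]

class Concept:
    """Running per-feature mean and variance for one quality word."""

    def __init__(self):
        self.n = 0
        self.mean = [0.0] * len(FEATURES)
        self.m2 = [0.0] * len(FEATURES)  # sums of squared deviations

    def learn(self, x):
        """Add one example (Welford's update)."""
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def unlearn(self, x):
        """Remove a previously learned example (reverse of the update above)."""
        if self.n <= 1:
            self.__init__()
            return
        for i, xi in enumerate(x):
            old_mean = (self.n * self.mean[i] - xi) / (self.n - 1)
            self.m2[i] = max(0.0, self.m2[i] - (xi - old_mean) * (xi - self.mean[i]))
            self.mean[i] = old_mean
        self.n -= 1

    def most_consistent_feature(self):
        """Feature with the lowest variance; meaningful once n >= 2."""
        variances = [m2 / self.n for m2 in self.m2]
        return FEATURES[variances.index(min(variances))]

def teach(concepts, quality_words, features):
    """Associate the features of the current object with each quality word."""
    for word in quality_words:
        concepts.setdefault(word, Concept()).learn(features)

concepts = {}
# "This is a small yellow thing."
teach(concepts, ["small", "yellow"], (0.16, 0.9, 0.3, 0.1))
# Two further yellow objects of different size and brightness.
teach(concepts, ["yellow"], (0.17, 0.5, 0.8, 0.5))
teach(concepts, ["yellow"], (0.16, 0.7, 0.5, 0.9))
yellow_feature = concepts["yellow"].most_consistent_feature()
```

After three examples only the hue stays nearly constant across the objects called "yellow", so the sketch ties the word to hue; `unlearn` lets a tutor retract an example that caused the concept to spread too far.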

Learning and Recognition of Intentional Human Actions


In order to understand or imitate human actions a robot needs to build a model of them. We have developed a method for representing and recognising actions that enables the robot to learn by watching video clip examples of known actions, and then to classify new action sequences that it sees. This will be used in later work to allow the robot to watch a human performing some activity with objects, which it is then able to reproduce. This will require that the robot is able to identify the intention of an action, so that it achieves the purpose rather than just slavishly copying the human.

Planning of sensory processing


Since visual scenes are so complex, and therefore beyond complete visual analysis, robots must tailor their visual processing to the task in hand. For a flexible robot this means that it must decide on the fly which visual processing to perform. One way to do this is to use planning. In this piece of work we used continual planning with assertions as a way of generating plans for visual processing of a scene so that the robot can answer queries. The work so far is only a demonstration of the principle, so the visual operators and plans are both quite simple. If a plan fails because a step of the visual processing does not return what is expected, then replanning is triggered. In the next period we will look at a decision-theoretic approach to planning which takes into account the degree of unreliability of visual processing.
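The plan, execute, and replan cycle for visual processing can be sketched as follows. The sketch deliberately simplifies: it uses breadth-first search over a handful of made-up operators and replans whenever a step does not deliver its expected result. It does not implement assertions or any particular continual planner, and the operator names and the `execute` callback are hypothetical.

```python
from collections import deque

# Hypothetical visual operators: name -> (preconditions, expected effects).
OPERATORS = {
    "segment_scene": (set(), {"regions"}),
    "detect_colour": ({"regions"}, {"colour"}),
    "detect_shape": ({"regions"}, {"shape"}),
}

def plan(state, goal):
    """Breadth-first search for a shortest operator sequence reaching the goal."""
    start = frozenset(state)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        facts, steps = frontier.popleft()
        if goal <= facts:
            return steps
        for name, (pre, eff) in OPERATORS.items():
            nxt = frozenset(facts | eff)
            if pre <= facts and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, steps + [name]))
    return None

def run(goal, execute, max_replans=5):
    """Execute a plan step by step; replan when a step returns less than expected."""
    state, log = set(), []
    for _ in range(max_replans + 1):
        steps = plan(state, goal)
        if steps is None:
            return None, log
        for name in steps:
            log.append(name)
            observed = execute(name)
            if not OPERATORS[name][1] <= observed:
                break  # unexpected result: abandon this plan and replan
            state |= observed
        else:
            return state, log  # every step delivered its expected effects
    return None, log

calls = {"segment_scene": 0}

def flaky_execute(name):
    """Stand-in for real vision code: segmentation fails on its first call."""
    if name == "segment_scene":
        calls[name] += 1
        if calls[name] == 1:
            return set()
    return OPERATORS[name][1]

# Query: "What colour is the object?" requires the fact "colour".
state, log = run({"colour"}, flaky_execute)
```

Only the processing needed for the query is planned, so shape detection is never run here; the failed segmentation shows up in `log` as a repeated first step.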

Hardware

The experiments with the integrated PlayMate systems are carried out at the University of Birmingham and at DFKI, using a PeopleBot, a B21r and a Katana arm. Both mobile platforms have one or two cameras mounted on a pan-tilt unit, which allows the robot to scan the scene. The Katana arm used for manipulation also carries a wrist-mounted camera to allow more precise grasping of objects. Because we are limited to using a two-finger gripper, we have currently restricted the graspable objects to everyday objects like cereal packets.

Development

Year 1

The emphasis for the first year was on integrating spatial understanding from vision with that from language. This required building systems that could maintain several models of the world and exchange information between them. We were able to demonstrate a system that could couple the learning of the names of objects with the learning of their appearance, and could then answer questions about the relative locations and names of these objects. The system allows the human speaker to move objects in and out of the scene in real time.

Year 2

During the second year we worked on integrating vision and dialogue with planning and the learning of ontologies. This required a new kind of software architecture that maintains a distributed set of representations of the scene, of the intentions of the actors, and of its general knowledge. In addition to all the functionality of the first year system, the resulting system is capable of learning the meanings of colour words and of descriptions of shape and size. It is also able to follow instructions to move objects. These can be specified in terms of spatial relations that are quite natural for humans, for example "Put the blue thing to the left of the red box".

Year 3

During the third year we worked on integrating manipulation with continual planning. This means that the robot can replan manipulations to achieve qualitative states --- put the red thing to the left of the blue thing --- even when the human interferes with its activities. If the world turns out differently than expected it replans on the fly. The third year also allowed integration of incremental processing of utterances with information from other modalities via binding. Finally we used the PlayMate scenario to investigate trade-offs in the space of architectures.


Publications

2007

D. Skocaj, G. Berginc, B. Ridge, A. Stimec, M. Jogan, O. Vanek, A. Leonardis, M. Hutter, and N. Hawes. A system for continuous learning of visual concepts. In International Conference on Computer Vision Systems ICVS 2007, Bielefeld, Germany, March 2007.
M. Brenner, N. Hawes, J. Kelleher, and J. Wyatt. Mediating between qualitative and quantitative representations for task-orientated human-robot interaction. In Proc. of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.
R. Triebel, R. Schmidt, O. Martínez Mozos, and W. Burgard. Instance-based AMN classification for improved object recognition in 2D and 3D laser range data. In Proc. of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), pages 2225-2230, Hyderabad, India, 2007.

2006

S. Hongeng and J. L. Wyatt. Learning causality and intention in human actions. In Proceedings of the 6th IEEE-RAS International Conference on Humanoid Robots, Genoa, Italy, December 2006.
M. Fritz and B. Schiele. Towards unsupervised discovery of visual categories. In Proceedings of 28th Annual Symposium of the German Association for Pattern Recognition (DAGM), Berlin, Germany, September 2006.
K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class detection with a generative model. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New York, USA, June 2006.
Aaron Sloman. How to put the pieces of AI together again. In Proceedings AAAI'06, Boston, July 2006.
E. Seemann, B. Leibe, and B. Schiele. Multi-aspect detection of articulated objects. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New York, USA, June 2006.
Danijel Skocaj, Martina Uray, Ales Leonardis, and Horst Bischof. Why to combine reconstructive and discriminative information for incremental subspace learning. In CVWW 2006, Telc, Czech Republic, February 2006.
Aaron Sloman, Jeremy Wyatt, Nick Hawes, Jackie Chappell, and Geert-Jan M. Kruijff. Long term requirements for cognitive robotics. In Proceedings CogRob2006, The Fifth International Cognitive Robotics Workshop. The AAAI-06 Workshop on Cognitive Robotics, Boston, Massachusetts, USA, July 2006.
B. Leibe, A. Leonardis, and B. Schiele. Robust object detection by interleaving categorization and segmentation. International Journal of Computer Vision, 2006.
Geert-Jan M. Kruijff, John D. Kelleher, and Nick Hawes. Information fusion for visual reference resolution in dynamic situated dialogue. In Elisabeth Andre, Laila Dybkjaer, Wolfgang Minker, Heiko Neumann, and Michael Weber, editors, Perception and Interactive Technologies: International Tutorial and Research Workshop, PIT 2006, volume 4021 of Lecture Notes in Computer Science, pages 117-128, Kloster Irsee, Germany, June 2006. Springer Berlin / Heidelberg.
B. Leibe, K. Mikolajczyk, and B. Schiele. Efficient clustering and matching for object class recognition. In Proceedings of the 17th British Machine Vision Conference, Edinburgh, England, 2006.
B. Leibe, K. Mikolajczyk, and B. Schiele. Segmentation based multi-cue integration for object detection. In Proceedings of the 17th British Machine Vision Conference, Edinburgh, England, 2006.
D. Skocaj, A. Leonardis, and H. Bischof. Weighted and robust learning of subspace representations. Pattern Recognition, pages 1556-1569, 2006. (accepted for publication).
Nick Hawes and Jeremy Wyatt. Towards context-sensitive visual attention. In Proceedings of the Second International Cognitive Vision Workshop (ICVW06), Graz, Austria, May 2006.

2005

M. Fritz, B. Leibe, B. Caputo, and B. Schiele. Integrating representative and discriminant models for object category detection. In Proceedings of the 10th International Conference on Computer Vision, Beijing, China, 2005.
K. Mikolajczyk, B. Leibe, and B. Schiele. Local features for object class recognition. In Proceedings of the 10th International Conference on Computer Vision, Beijing, China, volume 2, pages 1792-1799, 2005.
Aaron Sloman and Jackie Chappell. The altricial-precocial spectrum for robots. In Proceedings IJCAI'05, pages 1187-1192, Edinburgh, July 2005.
Aaron Sloman and Bernt Schiele. IJCAI-05 tutorial on representation and learning in robots and animals. Edinburgh, July 2005.
Peter Roth, Helmut Grabner, Danijel Skocaj, Horst Bischof, and Ales Leonardis. Conservative visual learning for object detection with minimal hand labeling effort. In DAGM 2005, Vienna, Austria, 2005.
J. Wyatt. Planning clarification questions to resolve ambiguous references to objects. In Proceedings of the 4th Workshop on Knowledge and Reasoning in Practical Dialogue Systems, held at IJCAI 05, 2005.


Last modified: 23.2.2008 0:49:51