Overview

In recent decades, video processing has become a convenient and widely used tool to assist, protect, and simplify people's daily lives in areas such as surveillance, domotics, elderly care, traffic monitoring, and video conferencing. Cameras, which are becoming ever more widespread in airports, cities, and even indoor environments, provide visual information of a scene in order to monitor or analyze certain areas, or to track individuals for specific purposes. However, bandwidth constraints, privacy issues, and the difficulty of storing and analyzing large amounts of video data make many of these applications costly and technically challenging.

Thus, the growing number of cameras, together with the need to handle and analyze these vast amounts of video data, motivates the development of multi-camera applications that cooperatively use multiple sensors, ideally process video data locally, and share only compact and informative representations of the data to fulfill the task at hand: so-called distributed or decentralized multi-camera systems.

A cooperative multi-camera network is often used to track objects and analyze their behavior by observing the same event from different viewpoints. In contrast to a single fixed-viewpoint camera, multi-camera networks are more robust and can handle more difficult situations (e.g. occlusions). For instance, in elderly care it is very important to detect when a person has fallen, or to analyze that person's long-term behavior for signs of Alzheimer's disease. In video conferencing, positional data for each meeting attendee can be very valuable: it can be used to define regions of interest containing people, and thus to limit more detailed processing to those areas. In these applications, both the whereabouts and the behavior of people need to be analyzed.
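To illustrate why multiple viewpoints yield positional data at all, here is a minimal ground-plane sketch (not the system's actual algorithm): each camera contributes a bearing ray toward the person, and intersecting two rays fixes the position. The function name and setup are hypothetical.

```python
import math

def locate(cam_a, theta_a, cam_b, theta_b):
    """Intersect two ground-plane bearing rays to estimate a person's position.

    cam_a, cam_b: (x, y) camera positions; theta_a, theta_b: bearing angles
    (radians) of the person as seen from each camera.
    """
    # Unit direction vectors of the two rays.
    da = (math.cos(theta_a), math.sin(theta_a))
    db = (math.cos(theta_b), math.sin(theta_b))
    # Solve cam_a + t*da = cam_b + s*db for t via Cramer's rule.
    denom = da[0] * (-db[1]) - da[1] * (-db[0])
    if abs(denom) < 1e-9:
        raise ValueError("rays are parallel: no unique intersection")
    rx, ry = cam_b[0] - cam_a[0], cam_b[1] - cam_a[1]
    t = (rx * (-db[1]) - ry * (-db[0])) / denom
    return (cam_a[0] + t * da[0], cam_a[1] + t * da[1])

# Two cameras at (0, 0) and (4, 0), both seeing a person standing at (2, 2):
pos = locate((0.0, 0.0), math.atan2(2, 2), (4.0, 0.0), math.atan2(2, -2))
```

A real system works with calibrated camera matrices and noisy detections, so it would fuse more than two rays (e.g. by least squares), but the geometric principle is the same.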

An application usually consists not of one single algorithm or method, but rather combines different approaches to solve the desired task. These approaches can roughly be summarized as:

  • Low-level approaches: Low-level processing describes the process of extracting features from raw video data, e.g. foreground/background segmentation, face detection, head pose estimation. The processing is usually performed on, but not limited to, a single camera and operates on a frame-by-frame basis.
  • Mid-level approaches: Mid-level processing takes features of the low-level processing into account and combines/fuses these features for a certain task. Such approaches may involve a single camera or a multi-camera network. In particular, multi-camera networks with overlapping views provide substantial advantages over a single fixed-viewpoint camera in terms of the accuracy and precision of the desired algorithms. One essential task of a multi-camera network is the tracking of objects (in most cases humans).
  • High-level approaches: High-level processing operates on an abstract level, and combines several low-level and mid-level cues such as foreground/background segmentation, face detection, face recognition, head pose estimation, or positional data. In most cases, the task is to analyze activities automatically, i.e. to correctly classify video streams into a set of activities. For example, in meetings it can be very valuable to create a complete protocol, or to evaluate the meeting's effectiveness and efficiency automatically.
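As a minimal sketch of the low-level stage (not the actual segmentation method used in my research), foreground/background segmentation can be reduced to per-pixel frame differencing against a background model; the function name and threshold are illustrative assumptions.

```python
import numpy as np

def segment_foreground(background, frame, threshold=25):
    """Per-pixel frame differencing against a static background model.

    Returns a boolean mask that is True wherever the current frame differs
    from the background by more than `threshold` grey levels.
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

# A flat grey background, and a frame containing a small bright object:
background = np.full((6, 6), 100, dtype=np.uint8)
frame = background.copy()
frame[2:4, 2:4] = 200  # 2x2 "moving object"
mask = segment_foreground(background, frame)
```

Practical low-level pipelines replace the static background with an adaptive model (e.g. a per-pixel mixture of Gaussians) to cope with lighting changes, but the frame-by-frame principle is the same.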

However, each of these levels is already a very challenging research task in its own right. Combining approaches from each level into an application, such as smart meeting analysis, is therefore highly non-trivial. In my research, I developed techniques ranging from low-level to high-level approaches, specifically designed for multi-camera networks.

For the purpose of object tracking, the Image Processing and Interpretation (IPI) research group, in cooperation with the Vision Systems research group at Hogeschool Gent, developed a multi-camera system focusing on real-time, low-latency, and scalable tracking of multiple people. Real-time, low-latency operation in particular is needed in many indoor tracking applications, which must react quickly to changes in people's positions.

This research was addressed in the project "iCocoon" (Immersive COmmunication by means of COmputer vision), carried out together with academic institutions and industrial companies in Flanders. The purpose of the project was to drastically change the way people communicate remotely by creating third-generation video conferencing applications based on world-class video technologies (such as computer vision, scene understanding, and 3D), with real-time object tracking as one specific task. The project resulted in a real-time demonstration of the video selection, display, and symbolic overview features of a third-generation video conferencing application.

Watch the video