Research interests and activities
My research interests and activities can roughly be grouped into the topics below;
for guidance, I take inspiration from research into human learning, especially developmental learning during infancy.
System-level integration in cognitive systems
This is quite a new endeavour (see the paper): linking different components of present-day "cognitive systems" the way our brain does! The main inspiration for this project came from studies on predictive visual attention that guides eye movements to where an object *will* be. For the investigated objects, squash balls, this is not a simple
task and requires considerable model building, especially when dealing with rebounding balls. Keeping things simple, however, I decided to
implement something much less challenging, but still complicated to do, which I term "dynamic attention priors": improving a visual detector by predicting where objects (here: pedestrians) will be in the next frame. This knowledge comes from a second module performing trajectory analysis of detected pedestrians ("tracking"). Detection scores are boosted at places where pedestrians will presumably appear, leading to a strong gain in detector reliability, even where pedestrians are actually too small for the detection windows used. In particular, this technique works well where pedestrians appear against difficult backgrounds or are partly occluded.
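The score-boosting idea can be sketched as follows. This is a minimal illustration with made-up function and parameter names (a simple Gaussian prior around the predicted position), not the actual implementation:

```python
import numpy as np

def boost_detection_scores(scores, predicted_pos, sigma=2.0, gain=0.5):
    """Multiplicatively boost a detector score map around the position
    where the tracking module predicts a pedestrian for the next frame."""
    h, w = scores.shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - predicted_pos[0]) ** 2 + (xs - predicted_pos[1]) ** 2
    prior = np.exp(-d2 / (2.0 * sigma ** 2))  # dynamic attention prior
    return scores * (1.0 + gain * prior)
```

A weak detection at the predicted position can thus be pushed above the decision threshold, while scores far from the prediction remain almost unchanged.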
The more general issue here is information exchange in cognitive systems: it seems to me that brain-like performance comes not so much from
one extraordinary single algorithm, but from many ordinary algorithms which work very closely together: tracking talking to object detection, ground-plane estimation talking to tracking, color analysis talking to object detection, inertial sensing talking to tracking, and so on. Enjoy the video!
Object detection in context
This line of research aims at learning
high-level knowledge such as "pedestrians are usually found on sidewalks and not on roofs", and translating
it into lower-level descriptions that can be used to guide local pattern-based detection methods. Not only can such approaches
increase detection accuracy significantly, but design time is also strongly reduced. I have already succeeded in showing this in the context
of vehicle detection, see the paper
. I believe that such "common sense models" (I term them "context models"), which humans have learned for more or less all types of objects in different situations, together with the ability
to translate them into precise and efficient search strategies, are what makes human perception so powerful.
What currently interests me is the question of how to learn situation-specific context models. As a very obvious example, consider the search for pedestrians in inner-city and highway traffic: in the former case one might have to look preferentially at the sidewalk, while
in the latter case one does not look for pedestrians at all, since they are rarely encountered on highways.
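A very simple way to learn such situation-specific location priors from annotated data can be sketched as follows (the data layout is hypothetical, chosen for illustration only): histogram annotated object positions separately per situation.

```python
import numpy as np

def learn_location_priors(samples, grid=(8, 8)):
    """samples: list of (situation, row, col) with coordinates normalized to [0, 1).
    Returns one spatial prior map per situation, e.g. 'inner-city' vs 'highway'."""
    priors = {}
    for situation, r, c in samples:
        p = priors.setdefault(situation, np.zeros(grid))
        p[int(r * grid[0]), int(c * grid[1])] += 1.0
    # normalize each map to a probability distribution over image cells
    return {s: p / p.sum() for s, p in priors.items()}
```

At detection time, the prior matching the current situation would be selected and combined with the local detector scores, so that e.g. the sidewalk region is searched preferentially in inner-city scenes.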
Multi-class classification for human-machine interaction
This activity aims at recognizing hand poses observed by a low-cost time-of-flight sensor. The goal is a robust classification system that:
- works in real time
- works in direct sunlight (Kinect sensors can't)
- achieves near-perfect classification rates
- works independently of persons
- can be used to characterize dynamic gestures
The interesting part here is the multi-class classification aspect, which is not really well understood theoretically.
So we have proposed our own modest contribution, which is very pragmatic in that it does not take sides in the fight over the correct
classification or decomposition method (SVM, MLP, one-versus-one, one-versus-all, ...) but proposes a simple way of
improving on top of virtually any of these architectures. The basic idea is to take the graded outputs
of an initial multi-class system and train a second classifier on top of that, thus exploiting any residual correlations.
Studies on a very large database of 3D hand poses have shown the efficiency and practical applicability of this approach (depicted below), see the corresponding paper.
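The idea of training a second classifier on top of graded first-stage outputs can be illustrated with a minimal sketch. Here the second stage is a plain linear least-squares classifier, standing in for whatever architecture one prefers; the function names are mine, for illustration only:

```python
import numpy as np

def train_second_stage(first_stage_scores, labels, n_classes):
    """Fit a linear map from first-stage score vectors to one-hot targets,
    exploiting residual correlations between the class scores."""
    targets = np.eye(n_classes)[labels]
    X = np.hstack([first_stage_scores, np.ones((len(labels), 1))])  # add bias column
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return W

def predict_second_stage(W, first_stage_scores):
    X = np.hstack([first_stage_scores, np.ones((len(first_stage_scores), 1))])
    return np.argmax(X @ W, axis=1)
```

The first stage can be any multi-class system (one-versus-all SVMs, an MLP, ...) as long as it emits graded per-class scores; the second stage only ever sees those score vectors.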
Benchmark databases for vehicle detection
In the past few years at Honda Research Institute, I have spent considerable time creating a benchmark database for vehicle detection that can be made publicly available for comparison purposes. In 2011, this activity resulted in the HRI RoadTraffic dataset, which provides approximately 70 minutes of high-resolution stereo RGB video, along with ego-motion information and, above all, object annotations, to all interested researchers. The dataset is divided into 5 video streams recorded while driving the same route 5 times under very different environmental and lighting conditions: overcast, rain, low sun, night, and snow. Annotated classes include generic obstacles, traffic signs, cars, trucks and pedestrians, although the number of the latter is not large.
- 800x600 stereo RGB images (10Hz or 20Hz) with camera calibration parameters for stereo computation
- proprioceptive information (vehicle speed, steering rate, steering angle) for the vehicle in which the videos were recorded
- high-quality annotations in LabelMe XML format: vehicles/pedestrians/traffic signs/obstacles (annotated with rectangles), obstacle-free road area (annotated by polygons)
- annotations are semantic, i.e., they contain the whole object even when parts of it are occluded. An occlusion value is defined for each annotated object.
- high environmental variation across the 5 recordings of the same round trip
- includes a night time recording
- computation results for our published free-area computation algorithm are provided with dataset
- all recorded information is timestamped
- Python tools for reading/writing LabelMe XML format provided with dataset
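For readers who cannot use the provided tools, the LabelMe XML annotation format is simple enough to parse with the Python standard library alone. A minimal sketch, reading only object names and polygon points:

```python
import xml.etree.ElementTree as ET

def read_labelme(xml_string):
    """Parse a LabelMe annotation string: returns a list of
    (object name, [(x, y), ...]) tuples, one per annotated object."""
    root = ET.fromstring(xml_string)
    objects = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        points = [(int(pt.findtext("x")), int(pt.findtext("y")))
                  for pt in obj.iter("pt")]
        objects.append((name, points))
    return objects
```

Rectangle annotations are stored as four-point polygons in this format, so the same reader covers both cases.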
This technical report
contains a description of the dataset. Based on the HRI RoadTraffic dataset, we conducted a vehicle detection benchmark
using an object detection system developed at Honda Research Institute Europe GmbH.
Due to company policy at Honda, the dataset is not directly downloadable: instructions on how to obtain the dataset are given in the benchmark paper we wrote using the HRI RoadTraffic dataset. Alternatively, an email to me (alexander dot gepperth at ensta dot fr) will achieve the same effect.
Multi-modal, weakly supervised learning
In object detection, training databases are usually created by human inspection of video images. A sufficient number of training examples is therefore hard to come by, as the inspection ("labelling") process is very time-consuming and therefore expensive.
Often, semi-automatic approaches are used that employ tracking methods to reduce human effort but, especially for multi-class problems like pedestrian pose classification, the number of examples for each class is still low.
What we need are therefore learning methods that can cope with the absence of direct supervision in the form of crisp, symbolic labels ("pedestrian", "cat", "bike"). Instead, "weak supervision" signals need to be discovered
in the high-dimensional data provided by the processing system into which learning is usually embedded. In this sense, a system would no longer learn that a certain pattern class is a "car", but rather that it usually co-occurs with other events, such as, e.g.,
a certain pattern in another sensor stream. An initial step in this direction has been taken in this preliminary work
, using two simulated sensor streams, between which
the PROPRE learning algorithm detects correlated sub-spaces, which are subsequently enhanced.
A currently ongoing effort is to transfer this to real-world data coming from the KITTI vehicle detection benchmark, where the sensor streams are based
on visual and LIDAR information.
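As a toy illustration of detecting correlated sub-spaces between two sensor streams (this is not PROPRE itself, just a generic cross-correlation sketch with names of my own choosing):

```python
import numpy as np

def find_correlated_dimensions(stream_a, stream_b, threshold=0.8):
    """Return (i, j) index pairs of dimensions that are strongly
    correlated across the two streams (samples in rows)."""
    a = (stream_a - stream_a.mean(0)) / stream_a.std(0)  # z-score each dimension
    b = (stream_b - stream_b.mean(0)) / stream_b.std(0)
    cross_corr = a.T @ b / len(a)                        # cross-correlation matrix
    return [tuple(ij) for ij in np.argwhere(np.abs(cross_corr) > threshold)]
```

Dimensions flagged this way carry mutually predictive information, which is exactly the kind of co-occurrence a weakly supervised system can exploit as a "label" substitute.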
Scalable incremental learning
This activity aims at learning algorithms that are usable in a "big data" context:
- scalable, with a linear dependency on the number of training samples
- efficient for problems of very high dimensionality (>1000)
- robust against irrelevant dimensions
The focus on generative algorithms is easily explained: in real applications it is imperative to classify outliers as such, which generative methods are capable of doing but discriminative ones are not.
Furthermore, for the discovery of correlations between high-dimensional data flows, a generative algorithm is beneficial, as it represents the whole distribution and therefore
can detect relations better than a discriminative one that just represents a hyperplane.
The PROPRE algorithm, as proposed in this paper, already fulfills a good deal of these requirements:
it is generative, scalable, efficient, applicable to high-dimensional data, and robust to irrelevant dimensions.
By changing a small detail in the learning architecture (learning when classification is WRONG instead of learning when it is correct),
incremental learning becomes possible. This has been described in a preliminary work
and is an ongoing activity of high priority, see this recent publication
. My interest in this topic led me to co-organize a special session on incremental learning.
Probabilistic information processing with recurrent neural hierarchies
Although no biological organism ever has all the information it needs at its disposal, we know that at least humans have the capacity to take optimal decisions even in the face of incomplete and noisy data. Here, "optimal" means that the probability of a correct decision is maximal when analyzing the problem using probability theory. This capacity of humans is somewhat surprising because individual neurons, or populations of them, do not seem to be very good candidates for performing the special kind of mathematical operations required for optimal decision making in probability theory. Basic issues are:
- how to represent probability densities? There is a variety of proposals around (population coding, sampling), each of which has some support from biology. Furthermore, the basic issues of neural information representation are still unresolved: do neurons primarily encode information in their firing rate, or in the precise timing of correlated spike sequences? And what information is encoded: a probability, a log-probability, or something totally different?
- neurons cannot multiply but mostly perform weighted sums. However, probability theory often requires the multiplication of probability densities, which seems to be impossible with neurons.
- strong lateral connections, which seem to be present in all known neocortical areas in humans, would seem to prevent the exact representation of distributions.
I suggest a simple way out of this dilemma by proposing that:
- neural ensembles do not represent densities but just the interpretation of the input that is most likely under an internal model (i.e., the sought-after density)
- the internal model is not expressed through neural activity but through lateral connections.
- lateral connections do not disrupt but support the representation of probabilities as they help select the input interpretation with the highest probability under the underlying density
- inputs which are probable under the internal model create activity faster, due to the lateral connections, than those that are less probable. This is coherent, as the lateral connections actually encode the internal model.
- sub-leading interpretations can be recovered in a descendingly ordered temporal sequence by applying feedback inhibition (under research)
The beauty of this approach is that it does not rely on the particular properties of a certain model, just on generic mechanisms like lateral competition. It is very probable that both spiking and non-spiking models can be parametrized to achieve the same effect. Furthermore, the probability of the leading input interpretation under the internal model is expressed by latency, which can be easily transmitted and decoded by subsequent neural layers. In this way, deep hierarchies may be built which pass around latency information. This theoretical construct (see here
) has been applied to simple object recognition tasks so far (click here
). It has, in my view, the potential to scale up to real-world recognition tasks, both conceptually and computationally. Stay tuned!
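The latency-coding claim can be demonstrated in a minimal rate-based simulation (a toy sketch with arbitrary parameters of my choosing, not a biologically detailed model): lateral weights store one pattern, and an input matching that pattern reaches the activity threshold sooner than an orthogonal one.

```python
import numpy as np

def response_latency(input_pattern, lateral_W, threshold=0.5,
                     dt=0.01, max_steps=1000):
    """Leaky integration with lateral feedback; returns the number of
    steps until any unit crosses threshold (max_steps if never)."""
    x = np.zeros(len(input_pattern))
    for step in range(1, max_steps + 1):
        x += dt * (input_pattern + lateral_W @ x - x)  # input + lateral - leak
        if x.max() >= threshold:
            return step
    return max_steps

# Lateral connections encode a single stored pattern (the internal model)
p = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
W = np.outer(p, p)

lat_probable = response_latency(0.5 * p, W)  # input matches the model
lat_improbable = response_latency(0.5 * np.array([0.0, 0.0, 1.0, 1.0]) / np.sqrt(2), W)
```

The matching input is reinforced by the lateral feedback and crosses the threshold quickly, while the orthogonal input saturates below it; latency thus encodes how probable the input is under the internal model.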