<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.5">Jekyll</generator><link href="https://msrmblog.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://msrmblog.github.io/" rel="alternate" type="text/html" /><updated>2020-03-03T10:17:15+00:00</updated><id>https://msrmblog.github.io/feed.xml</id><title type="html">MSRM Blog</title><subtitle>The research blog of the Munich School of Robotics and Machine Intelligence.</subtitle><author><name>MSRM</name></author><entry><title type="html">Is this my body? (Part I)</title><link href="https://msrmblog.github.io/is-this-my-body-1/" rel="alternate" type="text/html" title="Is this my body? (Part I)" /><published>2020-01-27T10:00:00+00:00</published><updated>2020-01-27T10:00:00+00:00</updated><id>https://msrmblog.github.io/is-this-my-body-1</id><content type="html" xml:base="https://msrmblog.github.io/is-this-my-body-1/">&lt;p&gt;We wake up every morning, we look in the mirror and… yes, that reflection is you. How do we know it? How do we know where our body parts are in space, or that we produced a given effect in the environment? Surprisingly, our proprioceptive sensors (e.g. muscle spindles) are rather imprecise, and we do not even have exact models of the body and the world. Despite that, we achieve robust control of our body for general-purpose tasks and, at the same time, we are flexible enough to accommodate all the changes that happen to our body during life. One might even argue that we achieved general intelligence because of body imprecision. In fact, adaptation and learning are two of the core processes that allow us to interact with an uncertain world.&lt;/p&gt;

&lt;p&gt;On the one hand, we know that humans create a sensorimotor mapping in the brain by learning the relations between different sensations (cross-modal/multimodal learning). On the other hand, experiments have shown that body perception is rather flexible, with a strong instantaneous (bottom-up) component. For instance, in less than one minute we can make you believe that your limb is a plastic hand in a different location, or that a friend’s body is your own (&lt;a href=&quot;https://en.wikipedia.org/wiki/Body_transfer_illusion&quot;&gt;body transfer illusion&lt;/a&gt;), just by visuotactile stimulation. This effect was first discovered by means of the &lt;a href=&quot;https://www.newscientist.com/article/dn16809-body-illusions-rubber-hand-illusion/&quot;&gt;rubber-hand illusion&lt;/a&gt; and presents body perception as a very flexible and fast process. The same effect is commonly observed in virtual reality, where your body looks different but is rapidly integrated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The challenge&lt;/strong&gt;: Robots, conversely, usually have fixed bodies and precise sensors and models. Why, then, do they still perform so poorly with their bodies in real-world, uncertain situations?&lt;/p&gt;

&lt;h1 id=&quot;perception-as-inference-in-the-brain&quot;&gt;Perception as inference in the brain&lt;/h1&gt;

&lt;p&gt;Visual illusions are a great source of information for understanding how perception works in the brain. The following figure shows Dallenbach’s illusion. If you have never seen the solution, it will be hard to tell what is in the picture. However, after observing the solution, it becomes easy to recognize.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-1/2.png&quot; onmouseover=&quot;this.src='/assets/posts/2020-01-27-is-this-my-body-1/2alt.png'&quot; onmouseout=&quot;this.src='/assets/posts/2020-01-27-is-this-my-body-1/2.png'&quot; style=&quot;width:50%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;But the crucial and astonishing consequence is that, no matter what you do afterward, you will always see the same concept in the picture. It is as if a seed had been planted in your brain and stays there forever.
According to the physician and physicist &lt;a href=&quot;https://en.wikipedia.org/wiki/Hermann_von_Helmholtz&quot;&gt;Hermann von Helmholtz&lt;/a&gt;, visual perception is an unconscious mechanism that infers the world. Under this view, the brain has generative models that complete or reconstruct the world from partial information. In the Dallenbach illusion, prior information helps to reconstruct the animal in the image.&lt;/p&gt;

&lt;p&gt;Assuming that the brain perceives the world in this manner, body perception should rely on a similar process.&lt;/p&gt;

&lt;h1 id=&quot;the-free-energy-principle&quot;&gt;The free-energy principle&lt;/h1&gt;

&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Free_energy_principle&quot;&gt;free-energy principle&lt;/a&gt; (FEP), proposed by Karl Friston, presents the brain as a &lt;a href=&quot;https://en.wikipedia.org/wiki/Markov_blanket&quot;&gt;Markov blanket&lt;/a&gt; that has access to the world only through the body’s senses. Conceptually, body perception and action result from surprise minimization. Mathematically, the FEP can be classified as a variational inference method, as it minimizes the variational free-energy bound for tractability. This bound has been widely investigated in the machine learning community and originated in physics.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Notation tip&lt;/strong&gt;: Be aware that I will use notation and terminology that may be confusing if you are familiar with variational inference: in this post, observations are $s$ and the hidden variables are $x$.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let us imagine that we want to infer our body posture $x$ (the latent state) from the information provided by all body sensors $s$. Using Bayes’ rule we obtain:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;p(x | s)=\frac{p(s | x) p(x)}{p(s)}&lt;/script&gt;

&lt;p&gt;That is, the likelihood of observing that specific sensory information given my posture, multiplied by my prior knowledge of my posture, and divided by the model evidence. To compute the denominator we have to integrate over all possible states of the body, which is intractable in practice for continuous spaces.&lt;/p&gt;
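&lt;p&gt;To see why the denominator is the problem, consider a toy sketch with only two discrete postures and made-up numbers: the evidence $p(s)$ is then a cheap sum, whereas over a continuous posture space it becomes an intractable integral.&lt;/p&gt;

```python
# Bayes' rule for a hypothetical two-posture "body" (illustrative numbers only).
postures = ["arm_up", "arm_down"]
prior = {"arm_up": 0.3, "arm_down": 0.7}       # p(x)
likelihood = {"arm_up": 0.8, "arm_down": 0.1}  # p(s | x) for the observed s

# p(s): a two-term sum here, an intractable integral for continuous x.
evidence = sum(likelihood[x] * prior[x] for x in postures)
posterior = {x: likelihood[x] * prior[x] / evidence for x in postures}
print(posterior)  # the observation shifts the belief towards "arm_up"
```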

&lt;p&gt;Fortunately for us, there is a workaround: we approximate the true posterior $p(x|s)$ with a reference distribution $q(x)$ of a known, tractable form. Our goal is thus to bring $q$ close to the true posterior $p$. We could minimize the KL-divergence between the two, as we know that both distributions are equal when the divergence is 0. But instead of minimizing it directly, we are going to use a bound called the variational free energy $F$, whose negative is the evidence lower bound (ELBO).&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;D_{K L}(q(x) \| p(x | s))=F+\ln p(s)&lt;/script&gt;

&lt;p&gt;where $\ln p(s)$ is the log-evidence of the model (its negative is the surprise, in free-energy-principle terminology). Note that this second term does not depend on the reference distribution, so we can directly optimize $F$. (We leave the derivation of this bound for another post.)&lt;/p&gt;
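&lt;p&gt;The identity above is easy to verify numerically. The sketch below (made-up numbers) builds a two-state model and computes the discrete free energy $F=-\sum_x q(x)\ln p(s,x)+\sum_x q(x)\ln q(x)$, checking that $D_{KL}(q \| p(x|s))$ and $F + \ln p(s)$ coincide.&lt;/p&gt;

```python
import math

# Numeric check of D_KL(q || p(x|s)) = F + ln p(s) on a toy discrete model.
prior = [0.4, 0.6]  # p(x) for x in {0, 1}
lik = [0.9, 0.2]    # p(s | x) for the observed s
q = [0.7, 0.3]      # reference distribution q(x)

joint = [lik[x] * prior[x] for x in (0, 1)]  # p(s, x)
evidence = sum(joint)                        # p(s)
posterior = [j / evidence for j in joint]    # p(x | s)

kl = sum(q[x] * math.log(q[x] / posterior[x]) for x in (0, 1))
F = sum(-q[x] * math.log(joint[x]) + q[x] * math.log(q[x]) for x in (0, 1))
print(kl, F + math.log(evidence))  # identical up to float error
```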

&lt;p&gt;The negative variational free energy is composed of two expectations:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;F=-\int q(x) \ln p(s, x) d x+\int q(x) \ln q(x) d x&lt;/script&gt;

&lt;p&gt;We can simplify $F$ further by means of the Laplace approximation: we restrict the reference distribution to a tractable family of factorized Gaussians. This is sometimes referred to as the Laplace-encoded energy, $F \approx L$. Under this assumption, we can track the reference distribution through its mean $\mu$, and we arrive at our final definition of the variational Laplace-encoded free energy:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;F(\mu, s) \approx-\ln p(s, \mu)-\frac{1}{2}\left(\ln |\Sigma|+n \ln 2 \pi\right)&lt;/script&gt;

&lt;p&gt;where $n$ is the number of latent variables, i.e., the dimensionality of $x$.&lt;/p&gt;

&lt;p&gt;Coming back to the body perception and action problem: we want to infer the body posture, and we cast this as an optimization scheme in which we obtain the optimal statistics of the reference distribution by minimizing the variational free energy:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu^{*}=\arg \min _{\mu} F(\mu, s)&lt;/script&gt;

&lt;p&gt;Therefore, we update our belief about the body posture by minimizing $F$, and $\mu$ is the most plausible solution.&lt;/p&gt;
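&lt;p&gt;For a single Gaussian observation and a Gaussian prior, this minimization can be written in a few lines. In the sketch below (all numbers hypothetical), gradient descent on $F(\mu, s)$ drives $\mu$ to the precision-weighted average of the observation and the prior, which is the analytic minimum.&lt;/p&gt;

```python
# 1-D sketch of perception as minimization of the Laplace-encoded free energy
# F(mu, s) = (s - mu)^2 / (2 Sigma_s) + (mu - mu_p)^2 / (2 Sigma_p) + const,
# i.e. Gaussian likelihood p(s|mu) = N(mu, Sigma_s), prior p(mu) = N(mu_p, Sigma_p).
s, Sigma_s = 2.0, 1.0    # observation and its variance
mu_p, Sigma_p = 0.0, 4.0  # prior belief and its variance

def dF_dmu(mu):
    # dF/dmu = -(s - mu)/Sigma_s + (mu - mu_p)/Sigma_p
    return -(s - mu) / Sigma_s + (mu - mu_p) / Sigma_p

mu = mu_p                 # start at the prior
for _ in range(2000):
    mu -= 0.01 * dF_dmu(mu)  # gradient descent on F

# The minimum is the precision-weighted average of observation and prior.
analytic = (s / Sigma_s + mu_p / Sigma_p) / (1 / Sigma_s + 1 / Sigma_p)
print(mu, analytic)
```

Because the observation is more precise than the prior, the belief settles much closer to the observation.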

&lt;p&gt;&lt;strong&gt;But where is the action?&lt;/strong&gt;
Imagine an organism that has adapted to live in an environment at 30ºC. It senses the outside temperature through chemical sensors. If the temperature drops, the only way to survive is to act on the environment, for example by moving towards a warmer location.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-1/3.gif&quot; style=&quot;width:60%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Now let us think about a more complex organism (e.g., a human) perceiving the world. It can either change its belief about the world or act to produce new observations that fit its expectations better. Thus, according to the FEP, the action should also be computed as the minimization of the variational free energy.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;a^{*}=\arg \min _{a} F(\mu, s)&lt;/script&gt;

&lt;p&gt;When the action is introduced into the FEP, the scheme is called &lt;em&gt;Active Inference&lt;/em&gt;, a form of control as probabilistic inference; this is the terminology Karl Friston originally used for the FEP model in which the action also minimizes the variational free energy. In this case, the action is driven by the error in the predicted observations. To optimize both variables we can use a classical gradient descent approach as follows:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;% &lt;![CDATA[
\begin{aligned}
&amp;\mu=\mu-\nabla_{\mu} F(\mu, s)\\
&amp;a=a-\nabla_{a} F(\mu, s)
\end{aligned} %]]&gt;&lt;/script&gt;

&lt;p&gt;This $\mu$ update only works for static perception tasks; in the next section we show how to include the dynamics of the latent space. I also leave out the link between the FEP and the &lt;a href=&quot;https://en.wikipedia.org/wiki/Predictive_coding&quot;&gt;predictive coding&lt;/a&gt; approach, as well as the hierarchical nature of the FEP.&lt;/p&gt;
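&lt;p&gt;The two gradient updates above can be illustrated with the thermostat organism from before. In this hypothetical sketch, the agent holds a strong prior of sensing 30ºC; perception updates the belief $\mu$, while the action (treated here, for simplicity, as a velocity command with $\partial s/\partial a = 1$ assumed) changes the sensation itself. Both descend the same free energy.&lt;/p&gt;

```python
# Toy active-inference loop (all numbers hypothetical): perception and action
# both minimize F = (s - mu)^2 / (2 Sigma_s) + (mu - prior_temp)^2 / (2 Sigma_p).
prior_temp = 30.0             # the organism "expects" to sense 30 degrees
Sigma_s, Sigma_p = 1.0, 0.1   # tight prior: acting beats changing the belief
s, mu = 20.0, 20.0            # sensed temperature and the belief about it
dt = 0.01

for _ in range(5000):
    dF_dmu = -(s - mu) / Sigma_s + (mu - prior_temp) / Sigma_p
    dF_ds = (s - mu) / Sigma_s
    mu -= dt * dF_dmu         # perception: update the belief
    a = -dF_ds                # action as a velocity command (ds/da = 1 assumed)
    s += dt * a               # acting on the world changes the sensation

print(round(mu, 2), round(s, 2))
```

Both the belief and the sensed temperature settle at 30, i.e. the agent moves to a warmer location rather than revising its strong prior.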

&lt;h1 id=&quot;active-inference-in-a-humanoid-robot&quot;&gt;Active inference in a humanoid robot&lt;/h1&gt;

&lt;p&gt;Now that we know more about the FEP we are ready to formalize the body perception and action problem. We developed the first &lt;a href=&quot;https://arxiv.org/abs/1906.03022&quot;&gt;Active Inference model working in a humanoid robot&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-1/4.png&quot; style=&quot;width:45%&quot; /&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-1/5.jpg&quot; style=&quot;width:35%&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt; Active inference in a humanoid robot and the &lt;a href=&quot;http://www.icub.org/&quot;&gt;iCub&lt;/a&gt; robot description for the reaching task 
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The robot infers its body state (e.g., joint angles) by minimizing the prediction error: the discrepancy between the sensors (visual and joint) and their expected values. In the presence of error, it changes the perception of its body and generates an action to reduce the discrepancy. Both are computed by optimizing the free-energy bound. Two tasks are defined: reaching and tracking. For reaching, the object is a causal variable that acts as a perceptual attractor $\rho$, producing an error in the desired sensory state and driving a reaching action towards the goal; the equilibrium point appears when the hand reaches the object. Meanwhile, the robot’s head keeps the object in its visual field, improving the reaching performance.&lt;/p&gt;

&lt;p&gt;Following the derivation of the previous sections, we define the Laplace-encoded energy of the system as the product of the likelihood, which accounts for the sensory functions in terms of the current body configuration, and the prior, which includes the dynamic model of the system, defining the change of its internal state over time.&lt;/p&gt;

&lt;p&gt;The body configuration, or internal variables, is defined as the joint angles. The estimated state $\mu$ is the belief the agent has about the joint angle positions, and the action $a$ is the angular velocity of those same joints. Because we use velocity control for the joints, the first-order dynamics $\mu^\prime$ must also be considered.&lt;/p&gt;

&lt;p&gt;Sensory data are obtained through several input sensors that provide the position of the end-effector in the visual field, $s_v$, and the joint angle positions, $s_p$. The dynamic model of the latent variables (joint angles) is a function of both the current state $\mu$ and the causal variables $\rho$ (e.g., the 3D position of the object to be reached), with additive normal noise around the value of this function $f(\mu, \rho)$. The reaching goal is defined in the dynamics of the model by introducing a perceptual attractor.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;p(s, \mu, \rho)=p(s | \mu) p\left(\mu^{\prime} | \mu, \rho\right)=p\left(s_{p} | \mu\right) p\left(s_{v} | \mu\right) p\left(\mu^{\prime} | \mu, \rho\right)&lt;/script&gt;

&lt;p&gt;Sensory data and dynamic models are assumed to be noisy following a normal distribution, allowing us to define their likelihood functions.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;% &lt;![CDATA[
\begin{aligned}
&amp;p\left(s_{p} | \mu\right)=\mathcal{N}\left(\mu, \Sigma_{s_{p}}\right)\\
&amp;p\left(s_{v} | \mu\right)=\mathcal{N}\left(g(\mu), \Sigma_{s_v}\right)\\
&amp;p\left(\mu^{\prime} | \mu, \rho\right)=\mathcal{N}\left(f(\mu, \rho), \Sigma_{\mu}\right)
\end{aligned} %]]&gt;&lt;/script&gt;

&lt;p&gt;where $g$ is the predictor or forward model of the visual sensation, and $f$ describes the dynamics of the latent space.&lt;/p&gt;

&lt;p&gt;With the variational free energy of the system defined, we can proceed to its optimization using gradient descent. The differential equations used to update $\mu$, $\mu^\prime$ and $a$ are:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\dot{\mu}=\mu^{\prime}-\frac{\partial F}{\partial \mu} \quad \dot{\mu}^{\prime}=-\frac{\partial F}{\partial \mu^{\prime}} \quad \dot{a}=-\frac{\partial F}{\partial a}&lt;/script&gt;

&lt;p&gt;And the partial derivatives of the free energy are:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;% &lt;![CDATA[
\begin{aligned}
&amp;-\frac{\partial F}{\partial \mu}=\frac{1}{\Sigma_{s_{p}}}\left(s_{p}-\mu\right)+\frac{1}{\Sigma_{s_{v}}} \frac{\partial g(\mu)^{T}}{\partial \mu}\left(s_{v}-g(\mu)\right)\\
&amp;\hspace{4em} +\frac{1}{\Sigma_{\mu}} \frac{\partial f(\mu, \rho)^{T}}{\partial \mu}\left(\mu^{\prime}-f(\mu, \rho)\right)\\
&amp;-\frac{\partial F}{\partial a}=-\left(\frac{1}{\Sigma_{s_{p}}} \frac{\partial s_{p}^{T}}{\partial a}\left(s_{p}-\mu\right)+\frac{1}{\Sigma_{s_{v}}} \frac{\partial s_{v}^{T}}{\partial a}\left(s_{v}-g(\mu)\right)\right)\\
&amp;-\frac{\partial F}{\partial \mu^{\prime}}=-\frac{1}{\Sigma_{\mu}}\left(\mu^{\prime}-f(\mu, \rho)\right)=\frac{1}{\Sigma_{\mu}}\left(f(\mu, \rho)-\mu^{\prime}\right)
\end{aligned} %]]&gt;&lt;/script&gt;

&lt;p&gt;Note that all the terms in the update equations have a similar form to:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;-\underbrace{\frac{\partial g(\mu)^{T}}{\partial \mu}}_{\text {mapping }} \underbrace{\Sigma_{s}^{-1}}_{\text {precision}} \underbrace{(s-g(\mu))}_{\text {prediction error}}&lt;/script&gt;
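&lt;p&gt;A drastically simplified 1D sketch of this scheme (hypothetical gains, scalar sensors, and a first-order attractor in place of the full $\mu^\prime$ dynamics) shows how the attractor error pulls the belief towards the target $\rho$, while the action moves the real joint until the proprioceptive prediction error vanishes:&lt;/p&gt;

```python
# Simplified 1-D reaching sketch (illustrative numbers; not the full iCub model).
rho = 1.0                     # causal variable: target joint angle
Sigma_sp, Sigma_mu = 1.0, 1.0
theta = 0.0                   # true joint angle (the "world")
mu, a = 0.0, 0.0              # belief and action (joint velocity)
dt = 0.01

for _ in range(20000):
    s_p = theta                          # proprioceptive sensation
    e_p = (s_p - mu) / Sigma_sp          # proprioceptive prediction error
    e_d = (rho - mu) / Sigma_mu          # dynamics (attractor) error
    mu += dt * (e_p + e_d)               # perception update
    a += dt * (-e_p)                     # action update (ds_p/da = 1 assumed)
    theta += dt * a                      # world: velocity-controlled joint

print(round(theta, 3), round(mu, 3))
```

The attractor drags the belief ahead of the sensed angle; the resulting proprioceptive error then drives the action, so both belief and joint converge to the target.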

&lt;h1 id=&quot;results&quot;&gt;Results&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Experiment: Adaptation&lt;/strong&gt;
The robot adapts its reaching behavior when we change the location of the visual feature that defines the end-effector. An analogy would be changing the length or the location of your hand. The optimization process finds an equilibrium between the internal model and the real observation through perceptual updating, but also by exerting an action.&lt;/p&gt;

&lt;!-- Courtesy of embedresponsively.com //--&gt;
&lt;div class=&quot;responsive-video-container align-center&quot;&gt;

  &lt;iframe src=&quot;https://www.youtube-nocookie.com/embed/jWjREOH-_8g&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Experiment: Comparison&lt;/strong&gt;
Motion from the active inference algorithm is compared to inverse kinematics.&lt;/p&gt;

&lt;!-- Courtesy of embedresponsively.com //--&gt;
&lt;div class=&quot;responsive-video-container&quot;&gt;

  &lt;iframe src=&quot;https://www.youtube-nocookie.com/embed/V1NSeoMGTXw&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Experiment: Dynamics for 2D and 3D reaching task&lt;/strong&gt;
Body perception and action variables are analyzed during arm reaching with an active head, towards a moving object. The head and eyes keep the object in the middle of the image while the arm performs the reaching task.&lt;/p&gt;

&lt;!-- Courtesy of embedresponsively.com //--&gt;
&lt;div class=&quot;responsive-video-container&quot;&gt;

  &lt;iframe src=&quot;https://www.youtube-nocookie.com/embed/jhFYiI0QqY4&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;/div&gt;

&lt;h1 id=&quot;more-info&quot;&gt;More Info&lt;/h1&gt;

&lt;p&gt;If you are interested in this research and want to learn more, check out the &lt;a href=&quot;http://www.selfception.eu/&quot;&gt;selfception project webpage&lt;/a&gt; and the related papers below. We will release the code as open source soon. The students &lt;a href=&quot;https://www.linkedin.com/in/guiolpei/&quot;&gt;Guillermo Oliver&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/cansu-sancaktar-61715b140/&quot;&gt;Cansu Sancaktar&lt;/a&gt; contributed to the research and to this blog entry. A full video with all the experiments can be watched &lt;a href=&quot;https://youtu.be/jhFYiI0QqY4&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Check our continuation post &lt;a href=&quot;../is-this-my-body-2/&quot;&gt;Part II&lt;/a&gt; to dig into a deep learning version of this approach.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;pre&gt;&lt;code class=&quot;language-bibtex&quot;&gt;@article{oliver2019active,
  title={Active inference body perception and action for humanoid robots},
  author={Oliver, Guillermo and Lanillos, Pablo and Cheng, Gordon},
  journal={arXiv preprint arXiv:1906.03022},
  year={2019}
}

@inproceedings{lanillos2018adaptive,
  title={Adaptive robot body learning and estimation through predictive coding},
  author={Lanillos, Pablo and Cheng, Gordon},
  booktitle={2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  pages={4083--4090},
  year={2018},
  organization={IEEE}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgements&lt;/strong&gt;. This work has been supported by the SELFCEPTION project, European Union Horizon 2020 Programme under grant agreement n. 741941, the European Union’s Erasmus+ Programme, the &lt;a href=&quot;http://www.ics.ei.tum.de/&quot;&gt;Institute for Cognitive Systems&lt;/a&gt; at the Technical University of Munich (TUM) and the &lt;a href=&quot;http://artcogsys.com/&quot;&gt;Artificial Cognitive Systems&lt;/a&gt; group at the Donders Institute for Brain, Cognition and Behaviour.&lt;/p&gt;</content><author><name>Pablo Lanillos</name><email>p.lanillos@donders.ru.nl</email></author><category term="machine learning" /><category term="perception" /><summary type="html">We wake up every morning, we look in the mirror and… yes, that reflection is you. How do we know it? How do we know where our body parts are in space, or that we produced a given effect in the environment?</summary></entry><entry><title type="html">Is this my body? (Part II)</title><link href="https://msrmblog.github.io/is-this-my-body-2/" rel="alternate" type="text/html" title="Is this my body? (Part II)" /><published>2020-01-27T10:00:00+00:00</published><updated>2020-01-27T10:00:00+00:00</updated><id>https://msrmblog.github.io/is-this-my-body-2</id><content type="html" xml:base="https://msrmblog.github.io/is-this-my-body-2/">&lt;p&gt;In our previous post &lt;a href=&quot;../is-this-my-body-1/&quot;&gt;Part I&lt;/a&gt;, the free-energy principle (FEP) was introduced and deployed on a humanoid robot. However, all functions and models were known; in particular, the forward model (the predictor of the expected sensation given the body state) and its partial derivatives. How can we let the robot learn the functions needed for body perception and action? Here, we show how to approach perceptual and active inference in the brain from a scalable machine-learning point of view. For that purpose, we are going to combine variational inference with neural networks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The model shown here is adapted from our &lt;a href=&quot;https://arxiv.org/abs/2001.05847&quot;&gt;latest work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h1 id=&quot;perception-as-a-deep-variational-inference-problem&quot;&gt;Perception as a deep variational inference problem&lt;/h1&gt;

&lt;p&gt;Let us assume that we have a generative model able to map from our reference distribution over the latent space, $q(x)$, which represents a number and is encoded by its mean $\mu$, to the expected sensory input $s$ (an image). We define the generative model $g$ as a nonlinear function with Gaussian noise $w_{s}$:&lt;/p&gt;

&lt;p&gt;$s=g(\mu)+w_{s} \rightarrow s$ follows a Normal distribution $\mathcal{N}\left(g(\mu), \Sigma_{s}\right)$&lt;/p&gt;

&lt;p&gt;We can also write the likelihood of having a sensation given our body internal variable as:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;p(s | x)=\frac{1}{\sqrt{2 \pi \Sigma_{s}}} \exp \left[-\frac{1}{2 \Sigma_{s}}(s-g(\mu))^{2}\right]&lt;/script&gt;

&lt;p&gt;Then our variational free energy optimization under the Laplace approximation becomes:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\dot{\mu}=-\frac{\partial F}{\partial \mu}=\frac{\partial \ln p(s, x)}{\partial \mu}=\frac{\partial \ln p(s | x) p(x)}{\partial \mu}&lt;/script&gt;

&lt;p&gt;For now, we assume that the prior over the latent space, $p(x)$, is uniform and has no dynamics. Taking logarithms and computing the partial derivative of the likelihood $p(s | x)$, we obtain:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\dot{\mu}=\underbrace{\frac{\partial g(\mu)^{T}}{\partial \mu}}_{\text {mapping }} \underbrace{\Sigma_{s}^{-1}}_{\text {precision}} \underbrace{(s-g(\mu))}_{\text {prediction error }}&lt;/script&gt;

&lt;p&gt;The equation above makes it clearer how we compute the change of the internal variable: we use the error between the predicted sensation and the observed one, weighted by the relevance (precision) of that sensor. Finally, the partial derivative of the generative function maps the error back to the latent variable.&lt;/p&gt;

&lt;p&gt;We obtain the update rule with a first-order Euler integration:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu=\mu+\Delta_{t} \dot{\mu}&lt;/script&gt;

&lt;p&gt;To improve clarity and generality, I will first explain the algorithm with the example of perceiving numbers from the &lt;a href=&quot;https://en.wikipedia.org/wiki/MNIST_database&quot;&gt;MNIST database&lt;/a&gt;. We learn a decoder network that converts the latent space $\mu$ into images of the digits 0 to 9, with the exception of the number 8, which we remove from the database.
After training, we can perform the first experiment using the FEP to update the belief about the world. Below is a snippet of our PyTorch code for computing one iteration of the $\mu$ update:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;input = Variable(mu, requires_grad=True)  # hidden (latent) variable
g = network.decoder(input)                # prediction: forward pass
e_v = s - g                               # visual prediction error
dF_dg = (1 / Sigma_v) * e_v               # error weighted by the precision
g.backward(dF_dg)                         # backward pass: input.grad = dg/dmu^T * dF_dg
mu_dot = input.grad                       # belief update direction
mu = torch.add(mu, mu_dot, alpha=dt)      # Euler integration update
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Experiment 1.&lt;/strong&gt; We first initialize the latent variable to the digit 0, but the input image $s$ is a 2.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/1.jpg&quot; style=&quot;width:60%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Gradient descent progressively changes the latent variable, producing the following shift in the predicted output $g(\mu)$:&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/2.gif&quot; style=&quot;width:20%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Experiment 2.&lt;/strong&gt; The same occurs if we set $\mu$ to 7 and then present the image of a 2 as the visual input.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/3.jpg&quot; style=&quot;width:60%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The dynamics of perception, represented by the prediction, are as follows:&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/4.gif&quot; style=&quot;width:20%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Experiment 3.&lt;/strong&gt; Our last test examines how the FEP behaves with inputs that were not used during training. Here we set $\mu$ to 7, and the input $s$ is an 8.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/5.jpg&quot; style=&quot;width:60%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The gradient optimization tries to minimize the difference between the prediction and the real observation, converging to something like a 5 with the top part closed.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/6.gif&quot; style=&quot;width:20%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;But where is the action?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the number-perception example there is no action involved, but one can imagine that the action would be to modify the number so that it better fits our initial belief.&lt;/p&gt;

&lt;h1 id=&quot;pixel-ai-deep-active-inference&quot;&gt;Pixel-AI: Deep active inference&lt;/h1&gt;

&lt;p&gt;We want to scale active inference (both perception and action driven by the FEP) to visual input images with learned functions. To deal with raw pixel input, we developed the &lt;a href=&quot;https://arxiv.org/abs/2001.05847&quot;&gt;Pixel-AI&lt;/a&gt; model: a scalable model of the FEP using convolutional decoders. We deployed the algorithm on the &lt;a href=&quot;https://en.wikipedia.org/wiki/Nao_(robot)&quot;&gt;NAO robot&lt;/a&gt; to evaluate its performance.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/7.jpg&quot; style=&quot;width:60%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Using Pixel-AI, the robot infers its body state by minimizing the visual prediction error, i.e., the discrepancy between the camera sensor value $s_{v}$ and the expected sensation $g(\mu)$. The internal belief of the robot corresponds to the joint angles of its arm. Unlike in the previous model, the mapping $g(\mu)$ between the internal belief and the observed camera image is learned by a convolutional decoder, and the partial derivatives $\partial g(\mu) / \partial \mu$ are obtained by performing a backward pass through the convolutional decoder.&lt;/p&gt;
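&lt;p&gt;For intuition on what that backward pass computes, consider a toy linear "decoder" $g(\mu) = W\mu$ with made-up numbers: the gradient that PyTorch accumulates in the latent variable is just $W^T$ applied to the precision-weighted prediction error.&lt;/p&gt;

```python
# Toy linear "decoder" (illustrative numbers, no learned network): the backward
# pass amounts to multiplying the weighted error by W^T.
W = [[1.0, 0.0],
     [0.5, 2.0],
     [0.0, 1.0]]     # maps 2 latent variables to 3 "pixels"
mu = [0.2, -0.1]     # internal belief
s = [0.5, 0.5, 0.5]  # observed "image"
Sigma_v = 0.1

g = [sum(W[i][j] * mu[j] for j in range(2)) for i in range(3)]  # forward pass
err = [(s[i] - g[i]) / Sigma_v for i in range(3)]               # weighted error
mu_dot = [sum(W[i][j] * err[i] for i in range(3))
          for j in range(2)]                                    # W^T err ("backward pass")

dt = 0.1
mu = [mu[j] + dt * mu_dot[j] for j in range(2)]  # Euler update of the belief
print(mu)
```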

&lt;p&gt;&lt;strong&gt;Perceptual Inference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The robot infers its body posture using the visual input provided by a monocular camera. The robot arm was brought to an initial position, but the internal belief of the body $\mu$ was set to a wrong value. As the visualizations below show, using Pixel-AI the internal belief converged to its true value, so that the internally predicted visual sensation $g(\mu)$ converged to the observed visual sensation $s_{v}$. Note that here we are not using any proprioceptive information, just the raw image.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/8.gif&quot; style=&quot;width:30%&quot; class=&quot;align-center&quot; /&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/9.gif&quot; style=&quot;width:30%&quot; class=&quot;align-center&quot; /&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/10.gif&quot; style=&quot;width:30%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;Active Inference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the active inference tests, we used a reaching task. We set the image of a different arm configuration as an imaginary goal position. Using the actions generated by Pixel-AI, the robot’s arm converged to the goal position. The animations below, recorded in the NAO robot simulation, show the robot performing visual reaching both in position and in pose.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/11.gif&quot; style=&quot;width:45%&quot; class=&quot;align-center&quot; /&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/12.gif&quot; style=&quot;width:45%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The following video shows Pixel-AI running on the real robot. The visual goal is overlaid on the robot arm, which moves until the free energy is minimized and the correct arm pose is reached.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/posts/2020-01-27-is-this-my-body-2/13.gif&quot; style=&quot;width:60%&quot; class=&quot;align-center&quot; /&gt;
&lt;/figure&gt;

&lt;h1 id=&quot;more-info&quot;&gt;More Info&lt;/h1&gt;

&lt;p&gt;If you are interested in this research and want to learn more, check out the &lt;a href=&quot;http://www.selfception.eu/&quot;&gt;selfception project webpage&lt;/a&gt; and the related papers below. We will release the code as open source very soon. The students &lt;a href=&quot;https://www.linkedin.com/in/guiolpei/&quot;&gt;Guillermo Oliver&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/cansu-sancaktar-61715b140/&quot;&gt;Cansu Sancaktar&lt;/a&gt; contributed to the research and to this blog entry.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bibtex&quot;&gt;@article{sancaktar2020active,
  title={End-to-End Pixel-Based Deep Active Inference for Body Perception and Action},
  author={Sancaktar, Cansu and van Gerven, Marcel and Lanillos, Pablo},
  journal={arXiv preprint arXiv:2001.05847},
  year={2020}
}

@inproceedings{lanillos2020robot,
  title={Robot self/other distinction: active inference meets neural networks learning in a mirror},
  author={Lanillos, Pablo and Pages, Jordi and Cheng, Gordon},
  booktitle={2020 European Conference on Artificial Intelligence (ECAI)},
  year={2020}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgements&lt;/strong&gt;. This work has been supported by the SELFCEPTION project, European Union Horizon 2020 Programme under grant agreement n. 741941, the European Union’s Erasmus+ Programme, the &lt;a href=&quot;http://www.ics.ei.tum.de/&quot;&gt;Institute for Cognitive Systems&lt;/a&gt; at the Technical University of Munich (TUM) and the &lt;a href=&quot;http://artcogsys.com/&quot;&gt;Artificial Cognitive Systems&lt;/a&gt; group at the Donders Institute for Brain, Cognition and Behaviour.&lt;/p&gt;</content><author><name>Pablo Lanillos</name><email>p.lanillos@donders.ru.nl</email></author><category term="machine learning" /><category term="perception" /><summary type="html">In our previous post (Part I), the free-energy principle (FEP) was introduced and deployed on a humanoid robot. However, all functions and models were known. In particular, the forward model (the predictor of the expected sensation given the body state) and its partial derivatives were known. How can we let the robot learn the functions needed for body perception and action? Here, we show how we can approach perceptual and active inference in the brain from a scalable machine learning point of view. For that purpose, we are going to combine variational inference with neural networks.</summary></entry><entry><title type="html">Visual Tracking with THOR</title><link href="https://msrmblog.github.io/thor/" rel="alternate" type="text/html" title="Visual Tracking with THOR" /><published>2020-01-10T20:50:00+00:00</published><updated>2020-01-10T20:50:00+00:00</updated><id>https://msrmblog.github.io/thor</id><content type="html" xml:base="https://msrmblog.github.io/thor/">&lt;p&gt;Visual object tracking is a fundamental problem in &lt;a href=&quot;https://en.wikipedia.org/wiki/Video_tracking&quot;&gt;computer vision&lt;/a&gt;. The goal is to follow the movements of an object throughout an image sequence.
Generally, we do not have any information about the object, like its type (e.g., a cup) or a nice and clean CAD model. In the beginning, we get a bounding box around the object, which is drawn manually or given by an object detector, and then we need to keep track of the object throughout the sequence.&lt;/p&gt;

&lt;p&gt;This task is challenging because, during the sequence, the lighting can drastically vary, the object could be occluded, or similar-looking objects could appear and distract our tracker. We are especially interested in object tracking since it is crucial for robotics applications, see some examples below.&lt;/p&gt;

&lt;figure&gt;
&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img alt=&quot;Tracking for AD.&quot; src=&quot;/assets/posts/2020-01-10-thor/nvidia_optimized.gif&quot; style=&quot;width:45%&quot; /&gt;
&lt;img alt=&quot;Tracking in workshop.&quot; src=&quot;/assets/posts/2020-01-10-thor/online_objects.gif&quot; style=&quot;width:45%&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt; &lt;b&gt;Visual object tracking in action.&lt;/b&gt; Left: to navigate safely to its goal, an autonomous car needs to keep track of the whereabouts of other cars and pedestrians. &lt;a href=&quot;https://www.youtube.com/watch?v=ftsUg5VlzIE&quot;&gt;[Source]&lt;/a&gt;. Right: if we put a robot in a workshop environment, it needs to know at all times where the pliers or the electric drill are located, so it can pick them up and use them. &lt;a href=&quot;https://online-objects.github.io/&quot;&gt;[Source]&lt;/a&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;A common way to solve this problem is to do template matching. Given the first bounding box, we keep the patch inside of the box as a template. In the following frames, we match this template with the new image and compute the new bounding box. Siamese neural networks are especially effective to do this matching. Popular real-time capable trackers are &lt;a href=&quot;https://arxiv.org/abs/1606.09549&quot;&gt;SiamFC&lt;/a&gt; and &lt;a href=&quot;http://openaccess.thecvf.com/content_cvpr_2018/papers/Li_High_Performance_Visual_CVPR_2018_paper.pdf&quot;&gt;SiamRPN&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
&lt;img alt=&quot;Tracking for AD.&quot; src=&quot;/assets/posts/2020-01-10-thor/template_tracker.png&quot; /&gt;
&lt;figcaption&gt; &lt;b&gt;Template matching.&lt;/b&gt; Given an input image and a template image, template matching trackers encode both of them in a (learned) feature space. In this space, we can compute the similarity between the two by applying a dot product. This computation yields an activation map that tells us where we have the highest resemblance between both. Based on this map, we compute the new bounding box.
&lt;/figcaption&gt;
&lt;/figure&gt;
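&lt;p&gt;The matching step above can be sketched in a few lines (our own toy illustration, not the SiamFC or SiamRPN code): sliding dot products between the template features and the search-image features produce the activation map, and its peak marks the best match.&lt;/p&gt;

```python
import numpy as np

# Toy sketch of template matching in feature space (our own illustration):
# sliding dot products between the template features and the search features
# yield an activation map whose peak marks the best match.

def score_map(search_feat, template_feat):
    """Cross-correlate the template over the search feature map (valid mode)."""
    H, W = search_feat.shape
    h, w = template_feat.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = search_feat[i:i + h, j:j + w]
            out[i, j] = np.sum(patch * template_feat)   # dot product = similarity
    return out

rng = np.random.default_rng(1)
search = 0.1 * rng.standard_normal((8, 8))   # weak background "features"
search[3:6, 2:5] = 5.0                       # the object sits at offset (3, 2)
template = np.full((3, 3), 5.0)              # template showing the object
scores = score_map(search, template)
i, j = np.unravel_index(scores.argmax(), scores.shape)   # peak = object location
```

&lt;p&gt;Real trackers compute the same dot product on learned deep features and in a single batched operation rather than with explicit loops.&lt;/p&gt;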

&lt;h1 id=&quot;current-challenges&quot;&gt;Current challenges&lt;/h1&gt;
&lt;p&gt;The research community made significant improvements in visual object tracking, especially with the help of neural networks that can learn a very expressive feature space for the matching. However, current state-of-the-art approaches rely heavily on the assumption that the first template is all we need for robust object tracking. This assumption can prove to be problematic:&lt;/p&gt;

&lt;figure&gt;
&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img alt=&quot;Tracking failure.&quot; src=&quot;/assets/posts/2020-01-10-thor/tracking_failure.gif&quot; style=&quot;width:60%&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt; &lt;b&gt;Problems of using only a single template.&lt;/b&gt; In the beginning, the tracker works quite well and tracks the cup reliably. As soon as the coffee stains appear, the object’s appearance changes too much, and the tracker fails.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Such a failure is a big problem. Imagine a robot loading a dishwasher that gets confused mid-task because the appearance of the plates changes too much as it moves them.&lt;/p&gt;

&lt;p&gt;Well, there is an obvious and easy solution: use multiple templates! The reason why state-of-the-art trackers don’t do this is that using more than one template introduces a plethora of problems, two of which stand out. The first is determining whether the current image crop (the pixels inside of the &lt;em&gt;predicted&lt;/em&gt; bounding box) is also a &lt;strong&gt;good&lt;/strong&gt; template. The second is drift – the tracker could lose the object and start using templates that do not show the object, and the performance goes downhill from there.&lt;/p&gt;

&lt;h1 id=&quot;can-we-still-make-multiple-template-tracking-work&quot;&gt;Can we still make multiple template tracking work?&lt;/h1&gt;

&lt;p&gt;We took steps towards this aim with our recent work called &lt;strong&gt;THOR&lt;/strong&gt; (short for Tracking Holistic Object Representations, possibly inspired by a certain Marvel character). Our objective was to develop an approach that can be plugged on top of any tracker (that is, any tracker that computes a similarity measure in feature space based on an inner-product operation) to improve its performance and robustness.&lt;/p&gt;

&lt;p&gt;But instead of training a new network on a big dataset, we want to squeeze as much as we possibly can out of the information accumulated during tracking. Therefore, we make one assumption: we should only keep templates if they contain additional information – they should be as &lt;strong&gt;diverse&lt;/strong&gt; as possible.&lt;/p&gt;

&lt;h1 id=&quot;how-do-we-get-diverse-templates&quot;&gt;How do we get diverse templates?&lt;/h1&gt;

&lt;p&gt;The siamese network was trained to learn a feature space in which similarities are computed. We leverage this property not to do tracking, but to find out how similar two templates are.&lt;/p&gt;

&lt;figure&gt;
&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img alt=&quot;Tracking an object.&quot; src=&quot;/assets/posts/2020-01-10-thor/tracking_an_object.png&quot; style=&quot;width:45%&quot; /&gt;
&lt;img alt=&quot;Computing similarity..&quot; src=&quot;/assets/posts/2020-01-10-thor/computing_similarity.png&quot; style=&quot;width:45%&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt; &lt;b&gt;Using the siamese network in unusual ways.&lt;/b&gt; Left: computing the similarity between input image crop and the template. Right: we use the same neural network, but this time we compute the similarity between two &lt;i&gt;templates&lt;/i&gt;.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;If we compute the similarity of all templates $f_i$ with each other, we can construct a &lt;a href=&quot;https://en.wikipedia.org/wiki/Gramian_matrix&quot;&gt;Gram matrix&lt;/a&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;% &lt;![CDATA[
G\left(f_{1}, \cdots, f_{n}\right)=\left[\begin{array}{cccc}
{f_{1} \star f_{1}} &amp; {f_{1} \star f_{2}} &amp; {\cdots} &amp; {f_{1} \star f_{n}} \\
{\vdots} &amp; {\vdots} &amp; {\ddots} &amp; {\vdots} \\
{f_{n} \star f_{1}} &amp; {f_{n} \star f_{2}} &amp; {\cdots} &amp; {f_{n} \star f_{n}}
\end{array}\right] %]]&gt;&lt;/script&gt;

&lt;p&gt;Now, to increase diversity, we need to increase the volume that the feature vectors $f_i$ span in the feature space – the bigger the volume, the higher the diversity. A nice property of the Gram matrix is that its determinant is proportional to this spanned volume. So, maximizing the determinant maximizes the volume:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\max _{f_{1}, f_{2}, \ldots, f_{n}} \Gamma\left(f_{1}, \ldots, f_{n}\right) \propto \max _{f_{1}, f_{2}, \ldots, f_{n}}\left|G\left(f_{1}, f_{2}, \ldots, f_{n}\right)\right|&lt;/script&gt;

&lt;p&gt;where $\Gamma$ is the spanned volume. So, when we receive a new template, we check if it increases the determinant. If that is the case, we include this template in our memory.&lt;/p&gt;
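&lt;p&gt;A minimal version of this criterion (our own sketch, not the THOR implementation) keeps a candidate template only if swapping it into the memory increases the Gram determinant:&lt;/p&gt;

```python
import numpy as np

# Minimal version of the diversity criterion (our own sketch, not the THOR
# implementation): a new template is only kept if swapping it into the memory
# increases the Gram determinant, i.e. the volume spanned by the features.

def gram_det(feats):
    F = np.stack(feats)                  # n x d matrix of template features
    return np.linalg.det(F @ F.T)        # Gram matrix of pairwise dot products

def try_update(memory, new_feat):
    """Swap new_feat into the slot that yields the largest determinant gain."""
    best_det, best_slot = gram_det(memory), None
    for slot in range(len(memory)):
        candidate = list(memory)
        candidate[slot] = new_feat
        d = gram_det(candidate)
        if d > best_det:
            best_det, best_slot = d, slot
    if best_slot is not None:
        memory[best_slot] = new_feat     # the new template adds diversity
    return memory

f1 = np.array([1.0, 0.0])
f2 = np.array([0.0, 1.0])
memory = [f1, f1.copy()]       # all slots start as the first template
try_update(memory, f2)         # diverse template: accepted
try_update(memory, f1.copy())  # duplicate: rejected, volume would not grow
```

&lt;p&gt;An orthogonal template grows the spanned volume and is kept, while a duplicate leaves the determinant unchanged and is discarded.&lt;/p&gt;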

&lt;figure&gt;
&lt;img alt=&quot;Increasing diversity.&quot; src=&quot;/assets/posts/2020-01-10-thor/thor_ltm.png&quot; /&gt;
&lt;figcaption&gt; &lt;b&gt;Increasing diversity.&lt;/b&gt; Throughout the sequence, we accumulate more diverse templates that are further apart in the feature space. In this example, the number of templates is fixed to 5; in the beginning, they are all initialized with the first template T&lt;sub&gt;1&lt;/sub&gt;.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We do all these calculations in the Long-term module (LTM), which is the heart of THOR. To make it work even better, we introduce other, simpler concepts, like a short-term module that handles abrupt movements and occlusion.&lt;/p&gt;

&lt;h1 id=&quot;experiments&quot;&gt;Experiments&lt;/h1&gt;

&lt;p&gt;So, let’s try the previous setting again:&lt;/p&gt;

&lt;figure&gt;
&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img alt=&quot;THOR succeeds.&quot; src=&quot;/assets/posts/2020-01-10-thor/thor_success.gif&quot; style=&quot;width:80%&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt; &lt;b&gt;THOR dealing with coffee stains.&lt;/b&gt;  THOR finds and uses the most diverse templates, and the tracker can handle the drastic appearance changes.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Not only are we able to handle the problem that we set out to solve, but we also plugged THOR on top of 3 different trackers and were able to improve all of them on commonly used benchmarks. At the time of publishing, THOR even achieved state-of-the-art results on the &lt;a href=&quot;http://www.votchallenge.net/&quot;&gt;VOT benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Speed is especially important for robotics applications, but more templates mean more computation for each frame, which generally slows the tracking down. However, we can do all the additional calculations in parallel, so we don’t slow the tracker down much. We achieved state-of-the-art performance while being 3 times faster than the previous best approach, since we get away with using a smaller, simpler network.&lt;/p&gt;

&lt;figure&gt;
&lt;img alt=&quot;Speed comparison.&quot; src=&quot;/assets/posts/2020-01-10-thor/speed_comparison.gif&quot; /&gt;
&lt;figcaption&gt; &lt;b&gt;Speed comparison.&lt;/b&gt; Plugging THOR on top of SiamRPN only slows it down slightly.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;A nice side effect: recently, researchers added additional output branches to tracking networks that also predict an &lt;a href=&quot;https://github.com/foolwood/SiamMask&quot;&gt;object mask&lt;/a&gt;. We can plug THOR on top of such trackers without any modification.&lt;/p&gt;

&lt;figure&gt;
&lt;img alt=&quot;Object segmentation.&quot; src=&quot;/assets/posts/2020-01-10-thor/siammask_thor.gif&quot; /&gt;
&lt;figcaption&gt; &lt;b&gt;THOR-SiamMask in Action.&lt;/b&gt; THOR can be plugged on top of novel methods that combine object tracking and segmentation.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;more-info&quot;&gt;More Info&lt;/h2&gt;
&lt;p&gt;If you got interested in our work and want to learn more, check out the &lt;a href=&quot;https://sites.google.com/view/vision-thor/&quot;&gt;project page&lt;/a&gt; and the &lt;a href=&quot;https://bmvc2019.org/wp-content/uploads/papers/1065-paper.pdf&quot;&gt;paper&lt;/a&gt;. The code is &lt;a href=&quot;https://github.com/xl-sr/THOR&quot;&gt;open-source&lt;/a&gt;. We were very honored to receive the &lt;strong&gt;Best Science Paper Award&lt;/strong&gt; at the &lt;a href=&quot;https://bmvc2019.org/programme/best-paper-awards/&quot;&gt;British Machine Vision Conference 2019&lt;/a&gt; for this work.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bibtex&quot;&gt;@inproceedings{Sauer2019BMVC,
  author={Sauer, Axel and Aljalbout, Elie and Haddadin, Sami},
  title={Tracking Holistic Object Representations},
  booktitle={British Machine Vision Conference (BMVC)},
  year={2019}
}
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Axel Sauer</name><email>axel.sauer@tum.de</email></author><category term="machine learning" /><category term="computer vision" /><category term="visual object tracking" /><summary type="html">Visual object tracking is an important task for robot applications. With THOR, we can improve several state-of-the-art trackers without any training.</summary></entry><entry><title type="html">Graph Diffusion Convolution</title><link href="https://msrmblog.github.io/graph-diffusion-convolution/" rel="alternate" type="text/html" title="Graph Diffusion Convolution" /><published>2020-01-09T10:50:00+00:00</published><updated>2020-01-09T10:50:00+00:00</updated><id>https://msrmblog.github.io/graph-diffusion-convolution</id><content type="html" xml:base="https://msrmblog.github.io/graph-diffusion-convolution/">&lt;p&gt;In almost every field of science and industry you will find applications that are well described by graphs (a.k.a. networks). The list is almost endless: There are scene graphs in computer vision, knowledge graphs in search engines, parse trees for natural language, syntax trees and control flow graphs for code, molecular graphs, traffic networks, social networks, family trees, electrical circuits, and so many more.&lt;/p&gt;

&lt;figure&gt;
&lt;img alt=&quot;Examples of graphs.&quot; src=&quot;/assets/posts/2020-01-09-graph-diffusion-convolution/graph_examples.png&quot; /&gt;
&lt;figcaption&gt;Some examples of graphs. [Wikimedia Commons, Stanford Vision Lab]
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;While graphs are indeed a good description for this data, many of these data structures are actually artificially created and the underlying ground truth is more complex than what is captured by the graph. For example, molecules can be described by a graph of atoms and bonds but the underlying interactions are far more complex. A more accurate description would be a point cloud of atoms or even a continuous density function for every electron.&lt;/p&gt;

&lt;p&gt;So one of the main questions when dealing with graphical data is how to incorporate this rich underlying complexity while only being supplied with a simple graph. Our group has recently developed one way of leveraging this complexity: &lt;a href=&quot;https://www.in.tum.de/daml/gdc/&quot;&gt;Graph diffusion convolution (GDC)&lt;/a&gt;. This method can be used for improving any graph-based algorithm and is especially aimed at graph neural networks (GNNs).&lt;/p&gt;

&lt;p&gt;GNNs have recently demonstrated great performance on a wide variety of tasks and have consequently seen a huge rise in popularity among researchers. In this blog post I want to first provide a short introduction to GNNs and then show how you can leverage GDC to enhance these models.&lt;/p&gt;

&lt;h1 id=&quot;what-are-graph-neural-networks&quot;&gt;What are Graph Neural Networks?&lt;/h1&gt;

&lt;figure&gt;
&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img alt=&quot;Simple graph.&quot; src=&quot;/assets/posts/2020-01-09-graph-diffusion-convolution/simple_graph.png&quot; style=&quot;width:60%&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt;In each layer the node $\nu$ receives messages from all neighboring nodes $w$ and updates its embedding based on these messages. The node embeddings before the first layer are usually obtained from some given node features. In citation graphs, where papers are connected by their citations, these features are typically a bag-of-words vector of each paper’s abstract.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The idea behind graph neural networks (GNNs) is rather simple: Instead of making predictions for each node individually we pass messages between neighboring nodes after each layer of the neural network. This is why one popular framework for GNNs is aptly called &lt;a href=&quot;https://arxiv.org/abs/1704.01212&quot;&gt;Message Passing Neural Networks (MPNNs)&lt;/a&gt;. MPNNs are defined by the following two equations:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;m_{v}^{(t+1)}=\sum_{w \in N(v)} f_{\text {message}}^{(t+1)}\left(h_{v}^{(t)}, h_{w}^{(t)}, e_{v w}\right),\\
h_{v}^{(t+1)}=f_{\text {update}}^{(t+1)}\left(h_{v}^{(t)}, m_{v}^{(t+1)}\right)&lt;/script&gt;

&lt;p&gt;where $h_{v}$ is a node embedding, $e_{v w}$ an edge embedding, $m_{v}$ an incoming message, and $N(v)$ denotes the neighbors of $v$. In the first equation all incoming messages are aggregated, with each message being transformed by a function $f_{\text {message}}$, which is usually implemented as a neural network.&lt;/p&gt;

&lt;p&gt;The node embeddings are then updated based on the aggregated messages via $f_{\text{update}}$, which is also commonly implemented as a neural network. As you can see, in each layer of a GNN a single message is sent and aggregated between neighbors. Each layer learns independent weights via backpropagation, i.e. $f_{\text{message}}$ and $f_{\text{update}}$ are different for each layer. Arguably the simplest GNN is the &lt;a href=&quot;https://arxiv.org/abs/1609.02907&quot;&gt;Graph Convolutional Network (GCN)&lt;/a&gt;, which can be thought of as the analogue of a CNN on a graph. Other popular GNNs are &lt;a href=&quot;https://arxiv.org/abs/1810.05997&quot;&gt;PPNP&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1710.10903&quot;&gt;GAT&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1706.08566&quot;&gt;SchNet&lt;/a&gt;, &lt;a href=&quot;http://papers.nips.cc/paper/6081-convolutional-neural-networks-on-graphs-with-fast-localized-spectral-filtering&quot;&gt;ChebNet&lt;/a&gt;, and &lt;a href=&quot;https://arxiv.org/abs/1810.00826&quot;&gt;GIN&lt;/a&gt;.&lt;/p&gt;
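&lt;p&gt;The two equations can be made concrete with a toy example (our own stand-ins for $f_{\text{message}}$ and $f_{\text{update}}$, which are learned neural networks in practice):&lt;/p&gt;

```python
import numpy as np

# Toy numeric sketch of one message-passing step (our own stand-ins, not a
# specific published GNN): every node aggregates messages from its neighbors
# and updates its embedding. In practice f_message and f_update are neural
# networks with learned weights.

def mpnn_step(h, neighbors, f_message, f_update):
    """One MPNN layer; h maps each node to its embedding (np.ndarray)."""
    new_h = {}
    for v in h:
        m_v = sum(f_message(h[v], h[w]) for w in neighbors[v])  # aggregate
        new_h[v] = f_update(h[v], m_v)                          # update
    return new_h

# tiny path graph 0 - 1 - 2 with one-dimensional embeddings
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h = {v: np.array([float(v)]) for v in neighbors}
f_message = lambda h_v, h_w: h_w                # message = neighbor embedding
f_update = lambda h_v, m_v: 0.5 * (h_v + m_v)   # simple averaging update
h = mpnn_step(h, neighbors, f_message, f_update)
```

&lt;p&gt;Stacking several such steps lets information travel further than one hop, which is exactly what deeper GNNs do.&lt;/p&gt;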

&lt;p&gt;The above MPNN equations are limited in several ways. Most importantly, we are only using each node’s direct neighbors and give all of them equal weight. However, as we discussed earlier the underlying ground truth behind the graph is usually more complex and the graph only captures part of this information. This is why graph analysis in other domains has long overcome this limitation and moved to more expressive neighborhoods &lt;a href=&quot;https://arxiv.org/abs/0912.0238&quot;&gt;(since around 1900, in fact)&lt;/a&gt;. Can we also do better than just using the direct neighbors?&lt;/p&gt;

&lt;h1 id=&quot;going-beyond-direct-neighbors-graph-diffusion-convolution&quot;&gt;Going beyond direct neighbors: Graph diffusion convolution&lt;/h1&gt;

&lt;p&gt;GNNs and most other graph-based models interpret edges as purely binary, i.e. they are either present or they are not. However, real relationships are far more complex than this. For example, in a social network you might have some good friends with whom you are tightly connected and many acquaintances whom you have only met once.&lt;/p&gt;

&lt;p&gt;To improve the predictions of our model we can try to reconstruct these continuous relationships via graph diffusion. Intuitively, in graph diffusion we start by putting all attention onto the node of consideration. We then continuously pass some of this attention to the node’s neighbors, diffusing the attention away from the starting node. After some time we stop and the attention distribution at that point defines the edges from the starting node to each other node. By doing this for every node we obtain a matrix that defines a new, continuously weighted graph. More precisely, graph diffusion is defined by&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;S=\sum_{k=0}^{\infty} \theta_{k} T^{k}&lt;/script&gt;

&lt;p&gt;where $\theta_{k}$ are coefficients and $T$ denotes the transition matrix, defined e.g. by $A D^{-1},$ with the adjacency matrix $A$ and the diagonal degree matrix $D$ with $d_{i i}=\sum_{j} a_{i j}$.&lt;/p&gt;

&lt;p&gt;These coefficients are predefined by the specific diffusion variant we choose, e.g. personalized PageRank (PPR) or the heat kernel. Unfortunately, the obtained $S$ is dense, i.e. in this matrix every node is connected to every other node. However, we can simply sparsify this matrix by ignoring small values, e.g. by setting all entries below some threshold $\varepsilon$ to $0$. This way we obtain a new sparse graph defined by the weighted adjacency matrix $\tilde{S}$ and use this graph instead of the original one. There are even fast methods for directly obtaining the sparse $\tilde{S}$ without constructing a dense matrix first.&lt;/p&gt;
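&lt;p&gt;The whole preprocessing step can be sketched as follows (our own minimal version with PPR coefficients $\theta_{k}=\alpha(1-\alpha)^{k}$ and a truncated series; the paper also discusses exact solutions and fast approximations):&lt;/p&gt;

```python
import numpy as np

# Minimal GDC sketch (our own version): truncated PPR diffusion with
# theta_k = alpha * (1 - alpha)^k, followed by eps-thresholding.

def gdc(A, alpha=0.15, eps=1e-4, K=50):
    D_inv = np.diag(1.0 / A.sum(axis=0))   # inverse degree matrix
    T = A @ D_inv                          # transition matrix T = A D^-1
    S = np.zeros_like(A)
    Tk = np.eye(len(A))                    # T^0
    for k in range(K):                     # truncated diffusion series
        S += alpha * (1.0 - alpha) ** k * Tk
        Tk = T @ Tk
    return np.where(S >= eps, S, 0.0)      # sparsify: drop tiny edges

# two triangles joined by a single edge
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
S = gdc(A)
```

&lt;p&gt;On this toy graph, the diffused weights within each triangle come out clearly larger than the weights across the connecting edge, which is the behaviour the method relies on.&lt;/p&gt;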

&lt;figure&gt;
&lt;img alt=&quot;GDC process&quot; src=&quot;/assets/posts/2020-01-09-graph-diffusion-convolution/teaser.png&quot; /&gt;
&lt;figcaption&gt;Graph diffusion convolution (GDC): We first perform diffusion on the original graph, starting from some node $\nu$. The density after diffusion defines the edge weights from the starting node $\nu$ to every other node. We then remove all edges with small weights. By doing this once for each node we obtain a new sparse, weighted graph $\tilde{S}$.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Hence, GDC is a preprocessing step that can be applied to any graph and used with any graph-based algorithm. We conducted extensive experiments (more than 100,000 training runs) to show that GDC consistently improves prediction accuracy across a wide variety of models and datasets. Still, keep in mind that GDC essentially leverages the homophily found in most graphs. Homophily is the property that neighboring nodes tend to be similar, i.e. birds of a feather flock together. It is therefore not applicable to every dataset and model.&lt;/p&gt;

&lt;h1 id=&quot;why-does-this-work&quot;&gt;Why does this work?&lt;/h1&gt;

&lt;p&gt;Up to this point we have only given an intuitive explanation for GDC. But why does it really work? To answer this question we must dive a little into graph spectral theory.&lt;/p&gt;

&lt;p&gt;In graph spectral theory we analyze the spectrum of a graph, i.e. the eigenvalues of the graph’s Laplacian $L=D-A$, with the adjacency matrix $A$ and the diagonal degree matrix $D$. The interesting thing about these eigenvalues is that low values correspond to eigenvectors that define tightly connected, large communities, while high values correspond to small-scale structure and oscillations, similar to the small and large frequencies in a normal signal. This is exactly what &lt;a href=&quot;https://arxiv.org/abs/0711.0189&quot;&gt;spectral clustering&lt;/a&gt; takes advantage of.&lt;/p&gt;

&lt;p&gt;When we look into how these eigenvalues change when applying GDC, we find that GDC typically acts as a &lt;em&gt;low-pass filter&lt;/em&gt;. In other words, GDC amplifies large, well-connected communities and suppresses the signals associated with small-scale structure. This directly explains why GDC can help with tasks like node classification or clustering: It amplifies the signal associated with the most dominant structures in the graph, i.e. (hopefully) the few large classes or clusters we are interested in.&lt;/p&gt;
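&lt;p&gt;This can be checked numerically for PPR diffusion (a standard spectral calculation, our own illustration): an eigenvalue $\lambda$ of the transition matrix $T$ is mapped to $\alpha /(1-(1-\alpha) \lambda)$, which grows monotonically with $\lambda$, so components associated with large communities pass through while small-scale structure is damped:&lt;/p&gt;

```python
import numpy as np

# Numeric check of the low-pass behaviour (our own illustration, following
# standard spectral graph theory): PPR diffusion maps an eigenvalue lam of the
# transition matrix T to h(lam) = alpha / (1 - (1 - alpha) * lam). Low graph
# frequencies (small Laplacian eigenvalues) correspond to large lam of T.

alpha = 0.15
def h(lam):
    return alpha / (1.0 - (1.0 - alpha) * lam)

lams = np.linspace(-1.0, 1.0, 201)   # eigenvalues of T lie in [-1, 1]
gains = h(lams)
# the response grows monotonically with lam and peaks at lam = 1: the filter
# passes large communities and suppresses small-scale structure
```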

&lt;figure&gt;
&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img alt=&quot;GDC as a low-pass filter&quot; src=&quot;/assets/posts/2020-01-09-graph-diffusion-convolution/gdc_graph.png&quot; style=&quot;width:60%&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt;GDC acts as a low-pass filter on the graph signal. The eigenvectors associated with small eigenvalues correspond to large, tightly connected communities. GDC therefore amplifies the signals that are most relevant for many graph-based tasks.
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;further-information&quot;&gt;Further Information&lt;/h2&gt;

&lt;p&gt;If you want to get started with graph neural networks I recommend having a look at &lt;a href=&quot;&quot;&gt;PyTorch Geometric&lt;/a&gt;, which implements many different GNNs and building blocks to create the perfect model for your purposes. I have already implemented a nice version of GDC in this library.&lt;/p&gt;

&lt;p&gt;If you want to have a closer look at GDC I recommend checking out &lt;a href=&quot;https://arxiv.org/abs/1911.05485&quot;&gt;our paper&lt;/a&gt; and &lt;a href=&quot;https://github.com/klicperajo/gdc&quot;&gt;our reference implementation&lt;/a&gt;, where you will find a notebook that lets you reproduce our paper’s experimental results.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bibtex&quot;&gt;@inproceedings{klicpera_diffusion_2019,
    title = {Diffusion Improves Graph Learning},
    author = {Klicpera, Johannes and Wei{\ss}enberger, Stefan and G{\&quot;u}nnemann, Stephan},
    booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
    year = {2019}
}
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Johannes Klicpera</name><email>klicpera@in.tum.de</email></author><category term="machine learning" /><category term="graph neural networks" /><summary type="html">Graph Diffusion Convolution (GDC) leverages diffused neighborhoods to consistently improve a wide range of Graph Neural Networks and other graph-based models.</summary></entry><entry><title type="html">Blog Launch</title><link href="https://msrmblog.github.io/blog-launch/" rel="alternate" type="text/html" title="Blog Launch" /><published>2020-01-07T13:00:00+00:00</published><updated>2020-01-07T13:00:00+00:00</updated><id>https://msrmblog.github.io/blog-launch</id><content type="html" xml:base="https://msrmblog.github.io/blog-launch/">&lt;p&gt;We are excited to launch the “Munich School of Robotics and Machine Intelligence” blog! We want to use this outlet to write about our research done at the institutes of the MSRM at TU Munich. However, not many like to read densely written research papers (except &lt;em&gt;some&lt;/em&gt; Ph.D. students). Our goal is to write in a clear and approachable manner about what we are doing here at MSRM and why we are excited about it.&lt;/p&gt;

&lt;p&gt;If you want to receive an email when the blog is updated, you can subscribe via the “Subscribe” button at the top of the page. If you work or study at the MSRM and you want to contribute to the blog, click the “Contribute” button for more info.&lt;/p&gt;</content><author><name>Axel Sauer</name><email>axel.sauer@tum.de</email></author><category term="announcement" /><summary type="html">We are excited to launch the “Munich School of Robotics and Machine Intelligence” blog! We want to use this outlet to write about our research done at the institutes of the MSRM at TU Munich. However, not many like to read densely written research papers (except some Ph.D. students). Our goal is to write in a clear and approachable manner about what we are doing here at MSRM and why we are excited about it.</summary></entry></feed>