<p>Autonomous Vision Blog: Research Blog of the Autonomous Vision Group at the MPI-IS and the University of Tübingen.</p>
<h1 id="defusr">DeFuSR: Learning Non-volumetric Depth Fusion using Successive Reprojections</h1>
<p>2019-06-26</p>
<p>In many fields of data processing, computer vision included, deep learning is dethroning established approaches.
In computer vision, the current wave of deep learning started largely with image classification.
Beginning with relatively easy recognition problems (does this image show a dog or a cat?), networks quickly became better than most humans at distinguishing subspecies.
Whereas hand-crafted classification techniques can only leverage the knowledge of their creators, neural networks distill decision rules directly from data: a perfect fit for classification, where the optimal decision rules can be incredibly complex and are typically not known in advance.</p>
<p>Starting in 2015, Dosovitskiy and others demonstrated that deep learning can also be applied to dense correspondence estimation tasks such as optical flow or stereo.
While optical flow and stereo can be addressed using image-based networks with 2D convolutions, extending these results to the multi-view case, where computation takes place in 3D space, is a difficult task. In particular, the large memory requirements of 3D deep networks limit resolution and therefore also accuracy. In <a href="http://www.cvlibs.net/publications/Donne2019CVPR.pdf">Learning Non-volumetric Depth Fusion using Successive Reprojections</a>, we suggest an alternative approach: instead of performing computations in 3D space, we successively “fold” 3D information back into the original 2D image views, combining prior knowledge about multi-view geometry and triangulation with the strength of deep neural networks. This allows us to iteratively obtain consistent 3D reconstructions while all computation is performed in the 2D image space.</p>
<h2 id="multi-view-stereo">Multi-view Stereo</h2>
<p>In multi-view stereo, we are interested in estimating the 3D structure of a scene based on multiple images of that scene. In the simplest case, the locations $\tilde{u}_1$ and $\tilde{u}_2$ at which two different cameras observe object $x$ can be used to triangulate the object’s location:</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-26-defusr/twoview_triangulation.png" alt="Triangulation for finding the location of an object" class="align-center" width="500px" /></p>
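<p>Assuming the corresponding pixel locations are already known, the triangulation itself can be sketched with a few lines of linear algebra. The following is a minimal illustration of standard linear (DLT) triangulation from two views; the function name and NumPy-based setup are our own, not code from the paper:</p>

```python
import numpy as np

def triangulate(P1, P2, u1, u2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    u1, u2: 2D pixel observations of the same point.
    Returns the 3D point minimizing the algebraic error."""
    # Each observation contributes two linear constraints on the
    # homogeneous 3D point X (cross product of u and P X vanishes).
    A = np.stack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest
    # singular value, i.e. the (approximate) null vector of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

<p>Given noise-free matches, this recovers $x$ exactly; with noisy correspondences it returns the algebraic least-squares estimate.</p>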
<p>Before being able to triangulate object $x$ based on the input images, we must of course determine the corresponding locations $\tilde{u}_i$ in the input images.
The multi-view stereo pipeline is an established approach to this problem:</p>
<ol>
<li>
<p><strong>Feature Description</strong><br />
In the first step, we describe the different possible locations in the images (the pixels) with a feature vector each.
The underlying assumption is that, if the feature descriptions of two image locations $\tilde{u}_1$ and $\tilde{u}_2$ are similar, they belong to the same 3D object and can be used for triangulation.</p>
</li>
<li>
<p><strong>Cost Volume Calculation</strong><br />
For a set of 3D positions, we compare the set of all possible corresponding feature descriptions in the other views.
If the cost is low, they are similar and likely describe the same surface point.
Often, the set of 3D positions is chosen based on a center view: in this case, we draw a line (ray) through each point of interest, and compare the features corresponding to each depth hypothesis with each other – a technique called planesweeping. This is illustrated in the figure below.</p>
</li>
<li>
<p><strong>Depth Estimation</strong><br />
Performing step 2 for all pixels in the center view and all depth hypotheses yields a 3D cost volume from which the most likely location along each ray can be determined.
Often, the resulting estimates are noisy and therefore need to be filtered or smoothed by exploiting the fact that depth is smooth nearly everywhere, with a few exceptions at sharp edges or object boundaries.</p>
</li>
<li>
<p><strong>Depth Fusion</strong><br />
To arrive at a complete 3D representation of the scene, all resulting triangulated points $x$ can be fused into one 3D reconstruction.
However, even after filtering and smoothing the estimates, the result often contains a large degree of noise and inconsistencies.
Traditional techniques therefore leverage various heuristics to arrive at a consistent 3D reconstruction. The most common assumption is that each 3D point $x$ must be supported by 3D points triangulated from other views in order to be trusted. All points which do not fulfill this criterion are removed.</p>
</li>
</ol>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-26-defusr/planesweeping.png" alt="Planesweeping to decide on point x" class="align-center" width="700px" /></p>
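<p>Steps 2 and 3 above can be sketched for the simplified case of a rectified stereo pair, where every fronto-parallel depth plane induces a constant disparity. This toy implementation (its parameter names and the simple SAD matching cost are our own assumptions, not the method used in the paper) builds the cost volume and takes the per-pixel minimum:</p>

```python
import numpy as np

def plane_sweep_depth(feat_c, feat_n, f, baseline, depths):
    """Toy plane sweep for a rectified stereo pair: for each fronto-parallel
    depth hypothesis, shift the neighbor features by the induced disparity,
    compare with the center features, and pick the cheapest depth per pixel.

    feat_c, feat_n: HxWxC feature maps of the center and neighbor view.
    f, baseline: focal length (pixels) and camera baseline.
    depths: list of candidate depth values."""
    H, W, _ = feat_c.shape
    cost = np.full((len(depths), H, W), np.inf)  # the 3D cost volume
    for i, d in enumerate(depths):
        disp = int(round(f * baseline / d))      # disparity induced by depth d
        if disp >= W:
            continue
        # Compare center pixels u with neighbor pixels u - disp.
        diff = feat_c[:, disp:] - feat_n[:, :W - disp]
        cost[i, :, disp:] = np.abs(diff).sum(axis=-1)  # SAD matching cost
    best = np.argmin(cost, axis=0)               # cheapest hypothesis per pixel
    return np.asarray(depths)[best]              # per-pixel depth estimate
```

<p>In the general multi-view setting, the shift is replaced by a plane-induced homography warp per depth hypothesis, but the structure of the cost volume and the winner-take-all depth selection stay the same.</p>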
<h2 id="deep-learning-based-multi-view-stereo">Deep Learning-based Multi-view Stereo</h2>
<p>Various parts of the pipeline described above are currently being replaced by learned alternatives.
Neural networks can be exploited for generating distinctive features as shown by <a href="https://www.ethz.ch/content/dam/ethz/special-interest/baug/igp/photogrammetry-remote-sensing-dam/documents/pdf/Papers/Learned-Multi-Patch-Similarity.pdf">Hartmann et al.</a> and <a href="https://eccv2018.org/openaccess/content_ECCV_2018/papers/Yao_Yao_MVSNet_Depth_Inference_ECCV_2018_paper.pdf">MVSNet</a>, among others.
An approach that explicitly leverages cost volumes for binocular stereo was presented by <a href="https://arxiv.org/pdf/1703.04309.pdf">Kendall et al.</a> and
depth map filtering was demonstrated by, e.g., <a href="http://www.liuyebin.com/DDRNet/DDRNet.pdf">DDRNet</a>.</p>
<p>The final step, combining the different depth maps into a single scene representation, was also tackled from a learning-based angle.
In <a href="http://www.cvlibs.net/publications/Riegler2017THREEDV.pdf">Riegler et al.</a> we have demonstrated that voxel-based depth map fusion is feasible. However, despite exploiting memory efficient octree data structures, this method was limited to resolutions up to $256^3$ voxels.
In this work, we explore depth maps themselves as the representation for fusion. While not as rich as volumetric grids, depth maps can be processed efficiently and at comparably large resolutions, as all computation can be performed in the 2D image domain.</p>
<h2 id="our-fusion-approach">Our Fusion Approach</h2>
<p>After performing planesweeping for all views, we iteratively consider each view as the center view.
To leverage the information of neighbouring views, we reproject depth and feature information from all neighbours which share a similar field of view onto the center view.
These reprojections are then used to refine the initial depth estimate from planesweeping:</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-26-defusr/teaser.png" alt="Refining the center view based on neighbours" class="align-center" width="700px" /></p>
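<p>The reprojection step can be sketched as follows: unproject each neighbor pixel to a 3D point, transform it into the center camera's frame, project it, and resolve collisions with a z-buffer. This is a simplified illustration (nearest-pixel splatting, shared intrinsics; the names are our own), not the project's actual implementation:</p>

```python
import numpy as np

def reproject_depth(depth_n, K, R, t, H, W):
    """Reproject a neighbor's depth map into the center view.

    depth_n: neighbor depth map (h x w).
    K: shared camera intrinsics.
    (R, t): pose of the neighbor camera expressed in the center frame.
    Returns an H x W depth map (inf where nothing projects)."""
    h, w = depth_n.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # Unproject neighbor pixels to 3D points in the neighbor frame ...
    pts_n = np.linalg.inv(K) @ pix * depth_n.reshape(-1)
    # ... transform them into the center frame and project them.
    pts_c = R @ pts_n + t[:, None]
    proj = K @ pts_c
    z = proj[2]
    uc = np.round(proj[0] / z).astype(int)
    vc = np.round(proj[1] / z).astype(int)
    out = np.full((H, W), np.inf)
    valid = (z > 0) & (uc >= 0) & (uc < W) & (vc >= 0) & (vc < H)
    # Z-buffer: where several points land on one pixel, keep the closest.
    for x, y, d in zip(uc[valid], vc[valid], z[valid]):
        out[y, x] = min(out[y, x], d)
    return out
```

<p>The same warping machinery can carry feature channels alongside depth, which is what provides the network with the neighbor information described above.</p>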
<p>As illustrated below, there are two types of information we can obtain from a neighboring view:
(1) the surfaces that the neighbor observes and which support the depth map of the center view (blue), and (2) the depth bounds implied by depth edges in the neighbor’s depth map (orange).
This second cue allows the center view to eliminate depth hypotheses closer than the orange line, which would disagree with the depth map predicted by the neighbor.</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-26-defusr/impliedbounds.png" alt="The bounds implied by a neighbouring view" class="align-center" width="500px" /></p>
<p>The original depth estimate, the input image, and the reprojected information from the neighbors are fed into a neural network that returns both an improved depth map and a confidence estimate for this new depth map.</p>
<p>We perform multiple rounds of this depth fusion approach in an auto-regressive fashion. Since all depth maps improve after each iteration, the neighbor information in subsequent iterations is of higher quality.
In practice, the performance quickly saturates and we found that three iterations over all views are sufficient.
Below, we show the resulting depth error in terms of the iterations, for depth maps estimated by two different techniques (<a href="https://demuc.de/colmap/">COLMAP</a> and <a href="https://eccv2018.org/openaccess/content_ECCV_2018/papers/Yao_Yao_MVSNet_Depth_Inference_ECCV_2018_paper.pdf">MVSNet</a>).
Blue colors indicate low errors and red colors indicate large errors.
Note how quantization artifacts (for MVSNet) and errors are reduced when applying the proposed fusion technique.</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-26-defusr/iterations.png" alt="The impact of iterations on the reconstruction quality" class="align-center" /></p>
<p>Finally, we demonstrated that the proposed approach also works with data different from the data the network was trained on (different camera, object types and environment).</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-26-defusr/realdata.png" alt="Evaluating our approach on data it wasn't trained on" class="align-center" width="600px" /></p>
<h2 id="further-information">Further Information</h2>
<p>To learn more about DeFuSR, check out our video here:</p>
<!-- Courtesy of embedresponsively.com //-->
<div class="responsive-video-container">
<iframe src="https://www.youtube-nocookie.com/embed/Cz7zz7Fuqlg" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>You can find more information (including the paper, code and datasets) on <a href="https://avg.is.tuebingen.mpg.de/research_projects/defusr">our project page</a>.
If you are interested in experimenting with our approach yourself, download the source code of our project and give it a try.
We are happy to receive your feedback!</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{DeFuSR,
title = {DeFuSR: Learning Non-volumetric Depth Fusion using Successive Reprojections},
author = {Donne, Simon and Geiger, Andreas},
booktitle = {Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)},
year = {2019}
}
</code></pre></div></div>
<p>Simon Donne (simon.donne@tue.mpg.de)</p>
<h1 id="superquadrics-revisited">Superquadrics Revisited</h1>
<p>2019-06-15</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-15-superquadrics-revisited/representations.png" alt="representations" class="align-center" /></p>
<p>Recent advances in deep learning coupled with the abundance of large shape repositories gave rise to various methods that seek to learn the 3D model of an object directly from data. Based on the output representation these methods can be categorized into depth-based, voxel-based, point-based and mesh-based techniques. While all of these approaches are able to capture fine details, none of them directly yields a compact, memory-efficient and semantically meaningful representation.</p>
<p>Inspired by the human cognitive system, which perceives an object as a decomposition of parts, researchers have proposed to represent objects as a set of atomic elements, which we refer to as primitives.
Examples of such primitives include 3D polyhedral shapes, generalized cylinders and geons for decomposing 3D objects into a set of parts.
In 1986, Pentland introduced a parametric version of generalized cylinders, based on deformable superquadrics, to the research community. He proposed a system able to represent the scene structure using multiple superquadrics.</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-15-superquadrics-revisited/pentland_1986.png" alt="pentland_1986" class="align-center" /></p>
<p>Superquadrics are a parametric family of surfaces that can be used to describe cubes, cylinders, spheres, octahedra, ellipsoids etc. Their continuous parametrization is particularly amenable to deep learning, as their shape is smooth and varies continuously with their parameters. They can be fully described using only 11 parameters: 6 for pose, 2 for shape and 3 for size.</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-15-superquadrics-revisited/sq_world.png" alt="sqs" class="align-center" /></p>
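<p>To make this compact parametrization concrete, the following sketch samples surface points of a superquadric (superellipsoid) from its size and shape parameters using the standard parametric equations; for brevity, the 6 pose parameters are omitted, and the function name is our own:</p>

```python
import numpy as np

def superquadric_points(size, shape, n=64):
    """Sample surface points of a superellipsoid in its canonical frame.

    size = (a1, a2, a3): the 3 size parameters.
    shape = (eps1, eps2): the 2 shape parameters.
    Uses the signed power f(x, eps) = sign(x) * |x|**eps."""
    a1, a2, a3 = size
    e1, e2 = shape
    f = lambda x, e: np.sign(x) * np.abs(x) ** e
    eta = np.linspace(-np.pi / 2, np.pi / 2, n)   # latitude angle
    omega = np.linspace(-np.pi, np.pi, n)         # longitude angle
    eta, omega = np.meshgrid(eta, omega)
    x = a1 * f(np.cos(eta), e1) * f(np.cos(omega), e2)
    y = a2 * f(np.cos(eta), e1) * f(np.sin(omega), e2)
    z = a3 * f(np.sin(eta), e1)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

<p>Setting both shape parameters to 1 yields an ellipsoid; values near 0 give box-like shapes and values near 2 give octahedral or pinched shapes, which is exactly the shape vocabulary exploited below.</p>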
<p>However, early approaches for fitting primitive-based representations to input shapes remained largely unsuccessful due to the difficulty of optimizing the parameters for the input shape and achieving semantic consistency across instances. In our recent work <a href="https://arxiv.org/pdf/1904.09970.pdf">Superquadrics Revisited: Learning 3D Shape Parsing beyond Cuboids</a>, we lift superquadric representations to the deep era and try to answer the question of whether it is possible to train a neural network that recovers the geometry of a 3D object as a set of superquadrics from input images or 3D shapes in an unsupervised manner, namely without supervision in terms of the primitive parameters. As it turns out, unsupervised learning of primitive-based shape abstractions is not only feasible, but, compared to traditional fitting-based approaches, allows for exploiting semantic regularities across instances, leading to stable and semantically consistent predictions.</p>
<h2 id="our-approach">Our Approach</h2>
<p>More formally, our goal is to learn a neural network</p>
<script type="math/tex; mode=display">\phi_{\theta}: I \rightarrow P</script>
<p>which maps an input representation <script type="math/tex">I</script> to a primitive representation <script type="math/tex">P</script>, where <script type="math/tex">P</script> comprises the primitive parameters. As we use a fixed-dimensional output representation, we also predict the probability of existence for each primitive, indicating whether it is part of the assembled object or not. This allows for inferring the number of primitives required to represent an object faithfully yet compactly.</p>
<p>Despite the absence of supervision in terms of primitive annotations, one can still measure the discrepancy between the target and the predicted shape. Inspired by the work of Tulsiani et al. <a href="https://arxiv.org/pdf/1612.00404.pdf">Learning Shape Abstractions by Assembling Volumetric Primitives</a>, we formulate our optimization objective as the minimization of distances between points uniformly sampled from the surface of the target shape and the predicted shape:</p>
<script type="math/tex; mode=display">L_D(P, X) = L_{P \rightarrow X}(P, X) + L_{X \rightarrow P}(X, P)</script>
<ul>
<li><script type="math/tex">L_{P \rightarrow X}(P, X)</script> measures the distance from the primitives P to the point cloud X and seeks to enforce precision.</li>
<li><script type="math/tex">L_{X \rightarrow P}(X, P)</script> measures the distance from the point cloud X to the primitives P and seeks to enforce coverage.</li>
</ul>
<p>For evaluating our loss, we sample points on the surface of each superquadric. This results in a stochastic approximator of the expected loss.</p>
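<p>As a simplified stand-in for this objective, the two terms can be sketched as a symmetric Chamfer distance between points sampled from the predicted primitives and from the target shape (the paper's actual primitive-to-point distances are more involved; this toy version only conveys the precision/coverage structure):</p>

```python
import numpy as np

def chamfer_loss(P, X):
    """Symmetric point-set distance between points sampled from the
    predicted primitives (P, shape Nx3) and the target shape (X, Mx3).

    The P->X term rewards precision (primitives lie on the surface);
    the X->P term rewards coverage (the surface is fully explained)."""
    # Pairwise squared distances between the two point sets.
    d2 = ((P[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    loss_p_to_x = d2.min(axis=1).mean()  # precision term
    loss_x_to_p = d2.min(axis=0).mean()  # coverage term
    return loss_p_to_x + loss_x_to_p
```

<p>Minimizing only one of the two terms degenerates (e.g. a single tiny primitive on the surface has perfect precision but no coverage), which is why both directions are needed.</p>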
<h2 id="does-it-work">Does it work?</h2>
<p>We conducted experiments that demonstrate that our model leads to expressive 3D shape abstractions that capture fine details, such as the open mouth of the dog (left-most animal in the first row), despite the lack of supervision. We observe that our shape abstractions allow for differentiating between different types of objects, such as scooters, choppers and racebikes, by adjusting the shape of individual object parts. Another surprising property of our model is the semantic consistency of the predicted primitives: the same primitive (highlighted with the same color) consistently represents the same object part.</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-15-superquadrics-revisited/teaser.png" alt="teaser" class="align-center" /></p>
<p>The diverse shape vocabulary of superquadrics allows us to recover more complicated shapes such as the human body under different poses and articulations. For instance, our model predicts pointy octahedral shapes for the feet, ellipsoidal shapes for the head and a flattened elongated superellipsoid for the main body without any supervision on the primitive parameters. Again, the same primitives (highlighted with the same color) consistently represent feet, legs, arms etc. across different poses.</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-06-15-superquadrics-revisited/humans.gif" alt="humans" class="align-center" /></p>
<h2 id="further-information">Further Information</h2>
<p>For more shape parsing results, check out this video:</p>
<!-- Courtesy of embedresponsively.com //-->
<div class="responsive-video-container">
<iframe src="https://www.youtube-nocookie.com/embed/eaZHYOsv9Lw" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>Additional experiments can be found in our <a href="http://www.cvlibs.net/publications/Paschalidou2019CVPR.pdf">paper</a>, our <a href="http://www.cvlibs.net/publications/Paschalidou2019CVPR_supplementary.pdf">supplementary</a> and on our <a href="https://avg.is.tuebingen.mpg.de/publications/paschalidou2019cvpr">project page</a>. If you are interested in experimenting with our model, you can clone the code for this project from our <a href="https://github.com/paschalidoud/superquadric_parsing">github page</a>.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{Paschalidou2019CVPR,
title = {Superquadrics Revisited: Learning 3D Shape Parsing beyond Cuboids},
author = {Paschalidou, Despoina and Ulusoy, Ali Osman and Geiger, Andreas},
booktitle = {Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)},
year = {2019}
}
</code></pre></div></div>
<p>Despoina Paschalidou (despoina.paschalidou@tue.mpg.de)</p>
<h1 id="occupancy-networks">Occupancy Networks</h1>
<p>2019-04-24</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-04-24-occupancy-networks/teaser.png" alt="teaser image" class="align-center" /></p>
<p>Over the last decade, deep learning has revolutionized computer vision. Many vision tasks such as object detection, semantic segmentation, optical flow estimation and more can now be solved with unprecedented accuracy using deep neural networks.</p>
<p>As many of these problems are represented in the 2D image domain, powerful 2D convolutional neural network architectures can be leveraged.
However, the physical world we live in is not two- but three-dimensional! Thus, reasoning in three dimensions is crucial for enabling intelligent systems to interact with their 3D environment. Consider robot navigation as an example: in order to navigate, a robot must reconstruct its environment in 3D and store this 3D representation in a data efficient manner.
But what constitutes a good 3D representation which is easily accessible to deep neural networks?
In our recent work <a href="http://www.cvlibs.net/publications/Mescheder2019CVPR.pdf">Occupancy Networks - Learning 3D Reconstruction in Function Space</a>, we examine this question and propose a novel output representation which allows to apply powerful deep architectures to the 3D domain.</p>
<h2 id="the-challenge">The Challenge</h2>
<p>Several 3D output representations have been proposed for learning-based 3D reconstruction.
Voxels are a straightforward generalization of pixels to the 3D domain. They partition the 3D space into 3D cells according to an equidistant grid. The size of each voxel or grid cell determines the granularity of the representation.
Unfortunately, voxels come with a severe limitation, in particular in the context of deep learning:
while the memory requirements of 2D images grow quadratically with resolution, the memory requirements of voxels grow cubically with resolution.
Consequently, if one were to convert a state-of-the-art fully convolutional architecture for 2D images operating at a resolution of 512<sup>2</sup> pixels into a 3D convolutional architecture operating on 512<sup>3</sup> voxels, the resulting network would require 512 GPUs to satisfy its memory requirements.
In practice, most voxel-based architectures are therefore restricted to very low resolution such as 32<sup>3</sup> or 64<sup>3</sup> voxels, resulting in coarse “Manhattan world” reconstructions:</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-04-24-occupancy-networks/voxels.gif" alt="voxel representation" class="align-center" /></p>
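<p>The quadratic-versus-cubic growth is easy to verify with a back-of-the-envelope calculation (the 64 feature channels and 32-bit floats below are illustrative assumptions, not numbers from the paper):</p>

```python
def feature_memory_mb(resolution, dims, channels=64, bytes_per_value=4):
    """Memory of one dense feature map (dims=2) or grid (dims=3) in MB."""
    return resolution ** dims * channels * bytes_per_value / 1024 ** 2

# A single 512^2 feature map is modest, but a single 512^3 feature grid
# already needs 512x as much memory: 32 GB for one layer's activations.
print(feature_memory_mb(512, dims=2))  # 64.0 (MB)
print(feature_memory_mb(512, dims=3))  # 32768.0 (MB)
```
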
<p>Another representation that has been investigated in the past are point clouds.
However, while very flexible and computationally efficient, they lack connectivity information about the output and most existing architectures are limited in the number of points that can be reconstructed (typically a few thousands):</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-04-24-occupancy-networks/pointcloud.gif" alt="pointcloud representation" class="align-center" /></p>
<p>Other works have considered meshes comprising vertices and faces as output representation. Unfortunately, this representation either requires a template mesh from the target domain or sacrifices important properties of the 3D output such as connectivity. If a template mesh is used, the resulting model is restricted to a very specific domain such as faces or human bodies. It is very difficult to construct models that can handle multiple object categories such as chairs or cars at the same time. Approaches which sacrifice connectivity often result in non-smooth meshes with artifacts such as self-intersections:</p>
<p><img src="https://autonomousvision.github.io/assets/posts/2019-04-24-occupancy-networks/mesh.gif" alt="mesh representation" class="align-center" /></p>
<p>Given the limitations of existing 3D output representations for deep learning, we asked ourselves:
<em>Can we find a 3D output representation for deep neural networks that</em></p>
<ul>
<li>can represent meshes of arbitrary topology and at arbitrary resolution,</li>
<li>is not restricted to a Manhattan world,</li>
<li>is not limited by excessive memory requirements,</li>
<li>preserves connectivity information,</li>
<li>is not restricted to a specific domain (e.g. object class), and</li>
<li>blends well with deep learning techniques?</li>
</ul>
<p>Interestingly, it is indeed possible to find a representation of 3D geometry which satisfies all of these requirements.</p>
<h2 id="our-approach">Our Approach</h2>
<p>The solution is surprisingly simple: we represent the 3D geometry as the decision boundary of a classifier that learns to separate the object’s inside from its outside. This yields a <em>continuous</em> implicit surface representation that can be queried at any point in 3D space and from which watertight meshes can be extracted in a simple post-processing step. More formally, we learn a non-linear function</p>
<script type="math/tex; mode=display">f_\theta: \mathbb R^3 \to [0, 1]</script>
<p>that takes a 3D point as input and outputs its probability of occupancy. In our experiments, we represent this function using a deep neural network which we call <em>Occupancy Network</em>. The decision boundary (at $f_\theta(p)=0.5$) represents the surface of the reconstructed shape:</p>
<p style="text-align: center">
<img src="https://autonomousvision.github.io/assets/posts/2019-04-24-occupancy-networks/vis2d.svg" width="45%" />
<img src="https://autonomousvision.github.io/assets/posts/2019-04-24-occupancy-networks/vis3d.gif" width="45%" />
</p>
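<p>Conceptually, the occupancy network is just a function approximator from (point, observation) pairs to occupancy probabilities. The following toy MLP makes this concrete; it is our own minimal architecture for illustration, not the (much deeper, conditioned) network used in the paper:</p>

```python
import numpy as np

def occupancy_network(points, cond, params):
    """Toy occupancy network f_theta(p, c) -> [0, 1]: maps 3D points p,
    concatenated with a conditioning code c (e.g. an image encoding),
    to occupancy probabilities via a one-hidden-layer MLP."""
    W1, b1, W2, b2 = params
    c = np.broadcast_to(cond, (len(points), len(cond)))
    h = np.concatenate([points, c], axis=1)         # (p, c) input pairs
    h = np.maximum(h @ W1 + b1, 0.0)                # hidden layer, ReLU
    logits = h @ W2 + b2                            # one logit per point
    return (1.0 / (1.0 + np.exp(-logits))).ravel()  # occupancy probability

def extract_inside_mask(probs, tau=0.5):
    """The surface is the decision boundary {p : f_theta(p) = tau};
    thresholding separates inside (True) from outside (False)."""
    return probs >= tau
```

<p>Because the network can be queried at any continuous 3D location, the extraction resolution is chosen at inference time rather than baked into the representation.</p>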
<p>This simple idea solves all of the problems mentioned in the previous section:
the implicit representation can represent meshes of arbitrary topology and geometry, is not restricted by memory requirements, preserves connectivity information and naturally blends with deep learning techniques.
Additionally, the model can be conditioned on an observation such as an image.
This enables it to solve tasks such as 3D reconstruction from a single image.
We train our model with randomly sampled 3D points for which we know the true class label (inside or outside).
For inference, we propose a simple algorithm which efficiently extracts meshes from our representation by incrementally constructing an octree.</p>
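<p>The training objective on those sampled points reduces to binary cross-entropy; a minimal sketch of this objective (batching and the point-sampling strategy of the actual implementation are omitted):</p>

```python
import numpy as np

def occupancy_loss(pred_probs, true_occ):
    """Binary cross-entropy between predicted occupancy probabilities and
    the true inside (1) / outside (0) labels of randomly sampled 3D points."""
    eps = 1e-7  # numerical safety for the logarithms
    p = np.clip(pred_probs, eps, 1.0 - eps)
    bce = -(true_occ * np.log(p) + (1.0 - true_occ) * np.log(1.0 - p))
    return float(bce.mean())
```
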
<h2 id="does-it-work">Does it work?</h2>
<p>We conducted extensive experiments on 3D reconstruction from point clouds, single images and voxel grids. We found that Occupancy Networks can represent fine details of 3D geometry, often leading to superior results compared to existing approaches.</p>
<p style="text-align: center">
<img src="https://autonomousvision.github.io/assets/posts/2019-04-24-occupancy-networks/im2mesh_input.png" width="45%" />
<img src="https://autonomousvision.github.io/assets/posts/2019-04-24-occupancy-networks/im2mesh_output.gif" width="45%" />
</p>
<h2 id="further-information">Further Information</h2>
<p>To learn more about Occupancy Networks, check out our video here:</p>
<!-- Courtesy of embedresponsively.com //-->
<div class="responsive-video-container">
<iframe src="https://www.youtube-nocookie.com/embed/w1Qo3bOiPaE" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>You can find more information (including the <a href="http://www.cvlibs.net/publications/Mescheder2019CVPR.pdf">paper</a> and <a href="http://www.cvlibs.net/publications/Mescheder2019CVPR_supplementary.pdf">supplementary</a>) on our <a href="https://avg.is.tuebingen.mpg.de/publications/occupancy-networks">project page</a>. If you are interested in experimenting with our occupancy networks yourself, download the <a href="https://github.com/autonomousvision/occupancy_networks">source code</a> of our project and run the examples. We are happy to receive your feedback!</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{Mescheder2019CVPR,
title = {Occupancy Networks: Learning 3D Reconstruction in Function Space},
author = {Mescheder, Lars and Oechsle, Michael and Niemeyer, Michael and Nowozin, Sebastian and Geiger, Andreas},
booktitle = {Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)},
year = {2019}
}
</code></pre></div></div>
<p>Lars Mescheder (lars.mescheder@tue.mpg.de)</p>
<h1 id="blog-launch">Blog Launch</h1>
<p>2019-04-24</p>
<p>The Autonomous Vision Group at the Max Planck Institute for Intelligent Systems and the University of Tübingen is excited to launch our new research blog, which will continuously provide updates on our latest research, including non-technical descriptions, videos, links to technical papers, source code and datasets. Given the increased public interest in our research and AI in general, we believe that this blog will help make our findings more accessible, particularly for interested readers outside the computer vision and machine learning field. If you would like to follow this blog via email, simply click on the “Newsletter” or “Feed” buttons in the footer of this page. Feel free to send us your feedback on this blog! We are looking forward to receiving your comments and improving this blog over time.</p>
<p>Andreas Geiger (andreas.geiger@tue.mpg.de)</p>