BibTeX
@article{HuetingEtAl:SceneStructureInference:2016,
  title   = {Scene Structure Inference through Scene Map Estimation},
  author  = {Moos Hueting and Viorica Pătrăucean and Maks Ovsjanikov and Niloy J. Mitra},
  journal = {VMV},
  year    = {2016}
}
¹University College London  ²University of Cambridge  ³LIX, École Polytechnique
VMV 2016
Understanding indoor scene structure from a single RGB image is useful for a wide variety of applications, ranging from scene editing to mining statistics about space utilization. Most efforts in scene understanding focus on extracting either dense information, such as pixel-level depth or semantic labels, or very sparse information, such as bounding boxes obtained through object detection. In this paper we propose the concept of a scene map, a coarse scene representation that describes the locations of the objects present in the scene from a top-down view (i.e., as they are positioned on the floor), as well as a pipeline to extract such a map from a single RGB image. To this end, we use a synthetic rendering pipeline, which supplies an adapted CNN with virtually unlimited training data. We quantitatively evaluate our results, showing that we clearly outperform a dense baseline approach, and argue that scene maps provide a useful representation for abstract indoor scene understanding.
A scene map describes the scene on a per-class basis from a top-down view corresponding to an input RGB image. A white square indicates the presence of an instance of that particular class at that location. Here we show the ground-truth scene map for the input scene on the left.
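To make the representation concrete, the following is a minimal sketch of a scene map as a per-class binary occupancy grid over the floor plan. The class list, grid resolution, and room size are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

CLASSES = ["chair", "table", "sofa"]  # assumed class set for illustration
GRID = 8                              # assumed cells per side of the top-down grid

def make_scene_map(objects, room_size=4.0):
    """Build a per-class top-down occupancy grid.

    objects: list of (class_name, x, z) floor positions in metres.
    Returns a dict mapping each class to a GRID x GRID boolean map,
    where True marks a cell containing an instance of that class.
    """
    scene_map = {c: np.zeros((GRID, GRID), dtype=bool) for c in CLASSES}
    for cls, x, z in objects:
        i = min(int(x / room_size * GRID), GRID - 1)
        j = min(int(z / room_size * GRID), GRID - 1)
        scene_map[cls][i, j] = True
    return scene_map

sm = make_scene_map([("chair", 1.0, 1.2), ("table", 2.5, 2.5)])
```

Each class gets its own grid, matching the figure's per-class panels: a white square in the paper corresponds to a True cell here.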
Our aim is to automatically infer scene maps from single-frame RGB input images. We adapt the well-known VGG network architecture for this purpose.
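One plausible way such an adaptation could look, sketched here with plain NumPy rather than the authors' actual network code: the VGG classification head is swapped for a head that predicts, for each class, a grid of per-cell presence probabilities. The feature dimension, class count, and grid size are assumptions for illustration.

```python
import numpy as np

C, GRID = 5, 8   # assumed number of classes and scene-map resolution
FEAT = 512       # assumed pooled backbone feature dimension

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(FEAT, C * GRID * GRID))
b = np.zeros(C * GRID * GRID)

def scene_map_head(features):
    """Map pooled backbone features (batch, FEAT) to per-class
    presence probabilities of shape (batch, C, GRID, GRID)."""
    logits = features @ W + b
    probs = 1.0 / (1.0 + np.exp(-logits))  # independent per-cell sigmoid
    return probs.reshape(-1, C, GRID, GRID)

out = scene_map_head(rng.normal(size=(2, FEAT)))
```

A per-cell sigmoid (rather than a softmax over classes) fits the representation, since multiple classes can occupy the same floor cell.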
To offset the lack of existing training data for our purposes, we implement a training data generation pipeline that feeds an endless stream of randomly generated indoor scenes, together with segmentation, depth, and scene maps, into our training pipeline.
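The sampling stage of such a generator can be sketched as below; this is a hypothetical stand-in for the paper's rendering pipeline, producing random object placements that a renderer would then turn into RGB, segmentation, depth, and scene-map training tuples. All names and ranges are illustrative.

```python
import random

def sample_scene(rng, n_objects=5, room_size=4.0,
                 classes=("chair", "table", "sofa")):
    """Sample a random scene as a list of (class, x, z) floor placements,
    with x and z drawn uniformly over a square room."""
    return [(rng.choice(classes),
             rng.uniform(0.0, room_size),
             rng.uniform(0.0, room_size))
            for _ in range(n_objects)]

# Seeded RNG so the same scene can be regenerated for debugging.
scene = sample_scene(random.Random(0))
```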
We evaluate our system on a synthetic dataset consisting of models and textures not seen at training time. A green square indicates a true positive, a yellow square a false positive, and a red square a false negative classification. In most cases, the errors can be explained as slight misplacements relative to the ground truth.
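The per-cell scoring implied by the figure's color coding can be sketched as follows; this is a hedged reconstruction of the comparison, not the paper's evaluation code.

```python
import numpy as np

def score_map(pred, gt):
    """Compare a predicted binary scene map against ground truth,
    counting per-cell true positives, false positives, and false
    negatives (green, yellow, and red squares in the figure)."""
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    return tp, fp, fn

pred = np.array([[1, 0], [1, 0]], dtype=bool)
gt   = np.array([[1, 0], [0, 1]], dtype=bool)
tp, fp, fn = score_map(pred, gt)  # 1 TP, 1 FP, 1 FN
```

A misplacement of the kind described, where an object is predicted one cell away from its true location, shows up as one false positive plus one false negative, as in the example above.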
This work is in part supported by the Microsoft PhD fellowship program, EPSRC grant number EP/L010917/1, Marie-Curie CIG-334283, a CNRS chaire d’excellence, chaire Jean Marjoulet from École Polytechnique, FUI project TANDEM 2, a Google Focused Research Award, and ERC Starting Grant SmartGeometry (StG-2013-335373).