Scene Structure Inference through Scene Map Estimation

1University College London     2University of Cambridge     3LIX, École Polytechnique

VMV 2016

Given a single RGB image (left), we propose a pipeline to generate a scene map (middle) in the form of a floorplan with grid locations for the discovered objects. The resulting scene map can then be potentially used to generate a 3D scene mockup (right). Please note that our system does not yet support pose estimation; hence in the mockup the objects are in default front-facing orientations.


Understanding indoor scene structure from a single RGB image is useful for a wide variety of applications ranging from the editing of scenes to the mining of statistics about space utilization. Most efforts in scene understanding focus on extraction of either dense information such as pixel-level depth or semantic labels, or very sparse information such as bounding boxes obtained through object detection. In this paper we propose the concept of a scene map, a coarse scene representation, which describes the locations of the objects present in the scene from a top-down view (i.e., as they are positioned on the floor), as well as a pipeline to extract such a map from a single RGB image. To this end, we use a synthetic rendering pipeline, which supplies an adapted CNN with virtually unlimited training data. We quantitatively evaluate our results, showing that we clearly outperform a dense baseline approach, and argue that scene maps provide a useful representation for abstract indoor scene understanding.

Scene Map representation

A scene map describes the scene on a per-class basis from a top-down view corresponding to an input RGB image. A white square indicates the presence of an instance of that particular class at that location. Here we show the groundtruth scene map for the input scene on the left.

Scene Map representation

Our aim is to automatically infer scene maps from single frame RGB input images. We modify the well-known VGG network architecture to work for this purpose.

Scene Map representation

To offset a lack of existing training data for our purposes, we implement a training data generation pipeline that feeds an endless stream of randomly generated indoor scenes together with segmentation, depth, and scenemap to our training pipeline.

Scene Map representation

We evaluate our system on a synthetic dataset consisting of models and textures not seen at training time. A green square indicates a true positive, a yellow square a false positive, and a red square a false negative classification. In most cases, the errors can be explained as slight misplacements of the ground truth.


  title   = {Scene Structure Inference through Scene Map Estimation},
  author  = {Moos Hueting and Viorica Pătrăucean and Maks Ovsjanikov and Niloy J. Mitra},
  year    = {2016},
  journal = {VMV}

This work is in part supported by the Microsoft PhD fellowship program, EPSRC grant number EP/L010917/1, Marie-Curie CIG-334283, a CNRS chaire d’excellence, chaire Jean Marjoulet from École Polytechnique, FUI project TANDEM 2, a Google Focused Research Award, and ERC Starting Grant SmartGeometry (StG-2013-335373).


(Coming soon)

Data and Code (Coming soon)