Unsupervised Intuitive Physics from Visual Observations

1 University of Oxford, 2 University College London, 3 Niantic
*joint first authors

Asian Conference on Computer Vision (ACCV) 2018

Overview of our unsupervised object tracker. Each training point consists of a sequence of five video frames. Top: the sequence is randomly permuted with probability 50%. The position extractor (a) computes a probability map for the object location, whose entropy is penalised by $\mathcal{L}_{ent}$. The reconstructed trajectory is then fed to a causal/non-causal discriminator network (b) that determines whether the sequence is causal or not, trained with $\mathcal{L}_{disc}$. The bottom Siamese branch (c) of the architecture takes a randomly warped version of the video and is encouraged by $\mathcal{L}_{siam}$ to extract correspondingly warped positions in (d). Blue and green blocks contain learnable weights; the green blocks share their weights across the Siamese branches. At test time only $\Phi$ is retained.
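To make the role of the probability map and its entropy penalty more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the implementation from the paper: the function name `position_and_entropy`, the use of a spatial softmax over a score map, and the expected-coordinate readout are all hypothetical choices made for this example.

```python
import numpy as np

def position_and_entropy(logits):
    """Turn an (H, W) score map into a location estimate and an entropy value.

    Assumption: the probability map is a spatial softmax of the scores and
    the position is its expected (x, y) coordinate. The paper may differ.
    """
    h, w = logits.shape
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z)
    p /= p.sum()                       # probabilities sum to 1 over all pixels
    ys, xs = np.mgrid[0:h, 0:w]        # per-pixel row (y) and column (x) indices
    x = float((p * xs).sum())          # expected column
    y = float((p * ys).sum())          # expected row
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    return (x, y), entropy

# A sharply peaked map gives a confident position and near-zero entropy,
# while a flat map gives the maximum entropy log(H * W).
peaked = np.zeros((8, 8))
peaked[3, 5] = 50.0
peaked_pos, peaked_ent = position_and_entropy(peaked)
flat_pos, flat_ent = position_and_entropy(np.zeros((8, 8)))
```

Penalising the entropy pushes the map towards the peaked case, so that the extracted position commits to a single location rather than spreading over the frame.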


While learning models of intuitive physics is an active area of research, current approaches fall short of natural intelligences in one important regard: they require external supervision, such as explicit access to physical states, at training and sometimes even at test time. Some approaches sidestep these requirements by building models on top of handcrafted physical simulators. In both cases, however, methods cannot automatically learn new physical environments and their laws as humans do. In this work, we successfully demonstrate, for the first time, learning unsupervised predictors of physical states, such as the position of objects in an environment, directly from raw visual observations and without relying on simulators. We do so in two steps: (i) we learn to track dynamically-salient objects in videos using causality and equivariance, two non-generative unsupervised learning principles that do not require manual or external supervision; (ii) we demonstrate that the extracted positions are sufficient to successfully train visual motion predictors that can take the underlying environment into account. We validate our predictors on synthetic datasets; then, we introduce a new dataset, Roll4Real, consisting of real objects rolling on complex terrains (pool table, elliptical bowl, and random height-field). We show that in all such cases it is possible to learn reliable extrapolators of the object trajectories from raw videos alone, without any form of external supervision and with no more prior knowledge than the choice of a convolutional neural network architecture.
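The equivariance principle mentioned in step (i) can be illustrated with a small NumPy sketch: extracting a position from a warped frame should give the same result as warping the position extracted from the original frame. Everything below is an assumption made for illustration. The warp is a simple horizontal flip rather than the random warps used for training, and `brightest_pixel` is a hypothetical stand-in for a learned position extractor.

```python
import numpy as np

def brightest_pixel(frame):
    """Hypothetical stand-in extractor: (x, y) of the brightest pixel."""
    y, x = np.unravel_index(np.argmax(frame), frame.shape)
    return np.array([x, y], dtype=float)

def flip_frame(frame):
    """Warp the image: mirror it left to right."""
    return frame[:, ::-1]

def flip_position(pos, width):
    """Apply the same warp to an (x, y) position."""
    return np.array([width - 1 - pos[0], pos[1]])

def equivariance_residual(extractor, frame):
    """Squared distance between extract(warp(frame)) and warp(extract(frame)).

    The residual is zero when the extractor is equivariant to the warp.
    """
    width = frame.shape[1]
    warped_then_extracted = extractor(flip_frame(frame))
    extracted_then_warped = flip_position(extractor(frame), width)
    return float(np.sum((warped_then_extracted - extracted_then_warped) ** 2))

frame = np.zeros((8, 10))
frame[2, 7] = 1.0                       # a single bright "object"

# The stand-in extractor follows the object under the flip.
good = equivariance_residual(brightest_pixel, frame)

# An extractor that ignores the image cannot follow the warp.
bad = equivariance_residual(lambda f: np.array([0.0, 0.0]), frame)
```

A constant output is not equivariant, which is why a penalty of this form discourages trivial solutions that ignore the image content.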

  author  = {{S\'ebastien} Ehrhardt and Aron Monszpart and Niloy {J. Mitra} and Andrea Vedaldi},
  title   = "{Unsupervised Intuitive Physics from Visual Observations}",
  journal = {{ACCV}},
  year    = 2018,
  month   = dec