3inGAN: Learning a 3D Generative Model from Images of a Self-similar Scene

1University College London, 2Adobe Research
Accepted at: 3DV 2022
3inGAN teaser figure

We introduce 3inGAN, an unconditional 3D generative model trained from 2D images of a single self-similar 3D scene. Such a model can be used to produce 3D “remixes” of a given scene, by mapping spatial latent codes into a 3D volumetric representation, which can subsequently be rendered from arbitrary views using physically based volume rendering. By construction, the generated scenes remain view-consistent across arbitrary camera configurations, without any flickering or spatio-temporal artifacts. During training, we employ a combination of 2D Generative Adversarial Network (GAN) losses, obtained through differentiable volume tracing, and 3D GAN losses, applied across multiple scales, enforcing realism on both the 3D structure of each generated scene and its 2D renderings. We show results on semi-stochastic scenes of varying scale and complexity, obtained from real and synthetic sources. We demonstrate, for the first time, the feasibility of learning plausible view-consistent 3D scene variations from a single exemplar scene and provide qualitative and quantitative comparisons against recent related methods.


Presentation Video


3inGAN: Stage 1 Reconstruction

To model the reconstructed 3D grids, we use the ReLU Fields representation instead of a standard RGBA volumetric grid, as it inherently induces a notion of free space in the 3D grid. Given a user-provided AABB (axis-aligned bounding box), we represent the RGB+A signal inside it using a volumetric grid \(\mathcal{V}\) of fixed resolution \(nv_x \times nv_y \times nv_z\). The \(A\) channel of the grid stores feature values in the range \([-1, 1]\), and the \(RGB\) channels store colour values in the range \([0, 1]\). The values on the grid are trilinearly interpolated to obtain continuous fields, with a \(ReLU\) non-linearity applied after interpolating the values on the \(A\) grid.
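
A minimal PyTorch-style sketch of querying such a field at continuous locations is shown below; the function name and tensor layout are illustrative assumptions rather than the exact implementation:

    import torch
    import torch.nn.functional as F

    def query_relu_field(grid, points):
        """Query an RGB+A ReLU field at continuous 3D locations.

        grid:   (1, 4, D, H, W) tensor; channels are [R, G, B, A], with RGB
                in [0, 1] and the raw (pre-activation) A values in [-1, 1].
        points: (N, 3) query locations in (x, y, z) order, normalised to
                [-1, 1] inside the user-provided AABB.
        Returns colours (N, 3) and densities (N, 1).
        """
        coords = points.view(1, -1, 1, 1, 3)                    # grid_sample layout
        sampled = F.grid_sample(grid, coords, mode="bilinear",  # trilinear for 5D input
                                align_corners=True)             # (1, 4, N, 1, 1)
        sampled = sampled.view(4, -1).t()                       # (N, 4)
        rgb = sampled[:, :3]
        # The ReLU is applied *after* interpolating the raw A values; this is
        # what lets the grid represent sharper boundaries than plain trilinear
        # interpolation of densities would.
        alpha = torch.relu(sampled[:, 3:])
        return rgb, alpha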

Synthetic scenes: Balloons, Dirt, Fish, FishR, Forest, and Plants.

Real captured scenes: Blocks and Chalk.


3inGAN: Stage 2 Multi-scale Generative Model


Training the generative model makes use of the 3D feature grid \(\mathcal{V}\) reconstructed in the previous stage, which we denote as the reference grid. The following subsections detail the generator architecture, the multi-scale generative training, and the adversarial losses used.

Generator: The generator network \(G\) maps a random spatial grid of latent codes \(z\) to a 3D feature grid \(G(z)\) at the coarsest stage; at each subsequent stage, it adds fine residual details to the previous stage's output, similar to SinGAN. The latent codes at the coarsest stage, \(z_0\), are 4-dimensional (seedDimension = 4), while the inputs at subsequent stages are a blend of the previous stage's output and random noise, i.e. \(z_t = G_{t-1}(z_{t-1}) + \epsilon\), where \(\epsilon \in \mathbb{R}^{h_t \times w_t \times d_t}\) and \(\epsilon \sim \mathcal{N}(0, I)\).
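
The cascade above can be summarised in the following sketch; generators, upsample, and noise_std are hypothetical stand-ins for the per-stage networks, the trilinear upsampling between stages, and the noise scale:

    import torch

    def multiscale_generate(generators, z0, upsample, noise_std=1.0):
        """Sketch of the multi-scale generation cascade (SinGAN-style, in 3D).

        generators: per-stage networks G_0 ... G_T; G_0 maps the spatial
                    latent grid to a coarse feature grid, later stages
                    predict residual detail on their (noisy) input.
        z0:         (1, 4, d0, h0, w0) coarsest spatial latent grid.
        upsample:   hypothetical helper that trilinearly upsamples a grid
                    to the next stage's resolution.
        """
        grid = generators[0](z0)                    # coarsest feature grid
        for G_t in generators[1:]:
            grid = upsample(grid)
            # z_t = G_{t-1}(z_{t-1}) + eps,  eps ~ N(0, I)
            z_t = grid + noise_std * torch.randn_like(grid)
            grid = grid + G_t(z_t)                  # add fine residual detail
        return grid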

Training: We train the architecture progressively: the generator first produces grids of reduced resolution, and only once this has converged are layers added and the model trained to produce the higher resolution. We freeze the previously trained layers to prevent the GAN training from diverging. We also employ an additional reconstruction loss that enforces a single fixed seed \(z^*\) to map to the reference grid, supervised via an MSE both over the 3D grids and over 2D rendered patches.
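
A sketch of this fixed-seed reconstruction term is given below, assuming hypothetical render and poses handles for the differentiable renderer and a batch of sampled cameras, and an illustrative weighting lambda_2d:

    import torch.nn.functional as F

    def reconstruction_loss(generator, z_star, reference_grid, render, poses,
                            lambda_2d=1.0):
        """Fixed-seed reconstruction term: the fixed seed z* must map back to
        the reference grid, both directly in 3D and through rendered views.
        For brevity this compares full renderings rather than patches.
        """
        generated = generator(z_star)
        loss_3d = F.mse_loss(generated, reference_grid)
        loss_2d = F.mse_loss(render(generated, poses),
                             render(reference_grid, poses))
        return loss_3d + lambda_2d * loss_2d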

3D discriminator: The 3D discriminator compares 3D patches from the generated grid to 3D patches of the reference grid. Let \(P_{3D}(\mathcal{V})\) be the operator that extracts a random patch from the 3D grid \(\mathcal{V}\). The two distributions the 3D discriminator tries to tell apart are thus \(p^{3D}_F = P_{3D}(G(z))\) and \(p^{3D}_{R} = P_{3D}(\mathcal{V})\).
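
The patch operator \(P_{3D}\) amounts to a random crop in voxel space; a sketch with assumed tensor shapes:

    import torch

    def sample_patch_3d(grid, patch_size):
        """P_3D: extract a random 3D patch from a feature grid.

        grid:       (1, C, D, H, W) tensor (generated grid G(z) or reference V).
        patch_size: (pd, ph, pw) patch extent in voxels.
        """
        _, _, D, H, W = grid.shape
        pd, ph, pw = patch_size
        z0 = torch.randint(0, D - pd + 1, (1,)).item()
        y0 = torch.randint(0, H - ph + 1, (1,)).item()
        x0 = torch.randint(0, W - pw + 1, (1,)).item()
        return grid[:, :, z0:z0 + pd, y0:y0 + ph, x0:x0 + pw]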

2D discriminator: The 2D discriminator discriminates between 2D patches of rendered views of the generated grid and 2D patches of rendered views of the reference grid \(\mathcal{V}\). Rendering the 3D grids requires an additional distribution over camera poses, denoted \(\mathcal{D}\), which we sample uniformly from the upper hemisphere around the grid. The focal lengths of the cameras are interpolated linearly across the stages such that the value at the final stage matches the actual camera intrinsics. Further, let \(P_{2D}()\) be the operator that extracts a random patch from a 2D image. The 2D discriminator thus operates on the distributions \(p^{2D}_F = P_{2D}(R(G(z), \mathcal{D}))\) and \(p^{2D}_R = P_{2D}(R(\mathcal{V}, \mathcal{D}))\), where \(R()\) is the rendering function, for which we use the differentiable EmissionAbsorption model.
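
For reference, the sketch below shows standard emission-absorption compositing along rays, applied to the ReLU-activated densities; the tensor shapes and function name are assumptions, not the exact renderer used in the paper:

    import torch

    def emission_absorption(colors, densities, deltas):
        """Composite samples along rays with the emission-absorption model.

        colors:    (R, S, 3) per-sample RGB for R rays with S samples each.
        densities: (R, S) non-negative densities (the ReLU-activated A values).
        deltas:    (R, S) distances between consecutive samples.
        Returns (R, 3) rendered pixel colours.
        """
        alpha = 1.0 - torch.exp(-densities * deltas)             # per-sample opacity
        transmittance = torch.cumprod(
            torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha], dim=1),
            dim=1)[:, :-1]                                       # light surviving so far
        weights = transmittance * alpha
        return (weights.unsqueeze(-1) * colors).sum(dim=1)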


Quality and Diversity Metrics

Quality: Visual quality is measured as the expected SIFID (Single Image FID) between the exemplar 2D images and the rendered 2D images of generated scenes for a fixed camera, taken over multiple seeds. We compute this expectation as a mean over a number of camera views. This is similar in spirit to SIFID, except that we compute it over images rendered from different views of the 3D scenes. A lower score reflects better quality.
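
A sketch of how this score could be computed, assuming an existing sifid implementation and hypothetical generator, render, seeds, and poses handles:

    def visual_quality(sifid, exemplar_views, generator, render, seeds, poses):
        """Mean SIFID between the exemplar view and generated renderings,
        averaged over camera views and seeds; lower is better."""
        scores = []
        for pose, exemplar in zip(poses, exemplar_views):
            for z in seeds:
                scores.append(sifid(exemplar, render(generator(z), pose)))
        return sum(scores) / len(scores)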

Diversity: Unfortunately, in the single-scene case we cannot meaningfully compute FID scores between the exemplar patch distribution and the distribution of all generated patches across seeds, as that would make diversity itself appear as a distribution error. This differs from a typical GAN setup, where the given real images are samples of a desired distribution capturing both quality and diversity. Instead, we measure scene diversity as the variance of a fixed patch, from a fixed view, over random seeds, a technique also used to assess diversity in texture synthesis. Larger diversity is better.
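
A corresponding sketch of the diversity measure, with a hypothetical crop operator that extracts the fixed patch from a rendered view:

    import torch

    def scene_diversity(generator, render, seeds, pose, crop):
        """Per-pixel variance of a fixed patch, from a fixed view, computed
        over random seeds; higher means more diverse remixes."""
        patches = torch.stack([crop(render(generator(z), pose)) for z in seeds])
        return patches.var(dim=0).mean().item()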


Quantitative Results

Our quality vs. diversity results. A good single-scene generative model needs a good mix of quality and diversity: excellent quality with no diversity, or vice versa, is undesirable. We report Visual Quality and Scene Diversity for different methods (columns) and different data sets (rows). To simplify comparison, we normalize the numbers so that ours is always 1. The best result for each metric on each dataset is bolded and the second best is underlined. Please refer to the supplementary for unscaled numbers.


Qualitative Results

For each scene (Fish, Balloons, Dirt, Plants, Forest, FishR, Blocks, and Chalk) we show the ground truth (GT) reconstruction alongside remixes generated from three random seeds (Seed 1, 2, and 3).

Bibtex


        @inproceedings{
            karnewar_3InGan_3dv_22,
            author = {Karnewar, Animesh and Ritschel, Tobias and Wang, Oliver and Mitra, Niloy},
            title = {{3inGAN}: Learning a {3D} Generative Model from Images of a Self-similar Scene},
            year = {2022},
            booktitle = {Proc. {3D} Vision ({3DV})}
        }
    

Acknowledgements


The research was partially supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 956585, gifts from Adobe, and the UCL AI Centre.