Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion

Tomas Jakab*, Ruining Li*, Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi
University of Oxford
(* equal contribution)
In 3DV 2024

[Figure: monocular 3D reconstruction (input image, 3D shape, textured) and controllable synthesis (animation, relighting, texture swapping).]

Farm3D learns an articulated object category entirely from "free" virtual supervision provided by a 2D diffusion-based image generator. We propose a framework that uses an image generator, such as Stable Diffusion, to produce training data for learning a reconstruction network from scratch. The diffusion model also serves as a scoring mechanism that further improves learning. Our method yields a monocular reconstruction network that generates controllable 3D assets from a single input image, whether real or generated, in a matter of seconds.

Approach

We prompt Stable Diffusion for virtual views of an object category. These views are used to train a monocular articulated object reconstruction model, which factorises the input image of an object instance into articulated shape, appearance (albedo plus diffuse and ambient intensities), viewpoint, and light direction. During training, we also sample synthetic views of the instance, which are then "critiqued" by Stable Diffusion to guide the learning.
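To make the appearance factorisation concrete, the sketch below recombines albedo, ambient and diffuse intensities, and a light direction into a shaded image. It assumes a simple Lambertian shading rule, which this page does not spell out; the function name `shade_lambertian` and its array layout are hypothetical and chosen only for illustration.

```python
import numpy as np

def shade_lambertian(albedo, normals, light_dir, ambient, diffuse):
    """Compose an image from factorised appearance terms (illustrative only).

    albedo:    (H, W, 3) array in [0, 1]
    normals:   (H, W, 3) array of unit surface normals
    light_dir: (3,) vector pointing towards the light
    ambient, diffuse: scalar intensities
    """
    light_dir = np.asarray(light_dir, dtype=np.float64)
    light_dir = light_dir / np.linalg.norm(light_dir)
    # Cosine between normal and light, clamped so surfaces facing away
    # from the light receive only the ambient term.
    cos_term = np.clip(normals @ light_dir, 0.0, None)   # (H, W)
    shading = ambient + diffuse * cos_term               # (H, W)
    # Assumed model: colour = albedo * (ambient + diffuse * max(0, n . l)).
    return np.clip(albedo * shading[..., None], 0.0, 1.0)
```

Because the light direction and the two intensities enter only through `shading`, they can be changed without touching the albedo or the shape that produced the normals.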

Single-view 3D Reconstruction

Farm3D reconstructs the shape of a wide range of categories down to fine details such as legs and ears, despite not being trained on any real images.

Controllable 3D Shape Synthesis

Our method generates controllable 3D assets from either a real image or an image synthesised using Stable Diffusion. Once an asset is generated, we can adjust its lighting, swap textures between models of the same category, and animate its shape.
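The following sketch illustrates why a factorised reconstruction lends itself to these edits: each control changes one factor and leaves the others alone. The `Asset` container, its per-vertex albedo, and the `bone_angles` field are hypothetical stand-ins, not the representation used by Farm3D. The texture swap additionally assumes that assets of the same category share a vertex layout.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass(frozen=True)
class Asset:
    """Hypothetical container for one reconstructed instance."""
    vertices: np.ndarray     # (V, 3) rest-pose shape
    albedo: np.ndarray       # (V, 3) per-vertex colour, standing in for a texture
    light_dir: np.ndarray    # (3,) light direction
    bone_angles: np.ndarray  # (B,) articulation parameters

def relight(asset, new_light_dir):
    """Change only the lighting; shape, texture and pose are untouched."""
    d = np.asarray(new_light_dir, dtype=np.float64)
    return replace(asset, light_dir=d / np.linalg.norm(d))

def swap_texture(target, source):
    """Give `target` the albedo of `source` (assumes matching vertex layout)."""
    if target.albedo.shape != source.albedo.shape:
        raise ValueError("texture swap needs assets with the same vertex layout")
    return replace(target, albedo=source.albedo.copy())

def animate(asset, angle_frames):
    """Yield one asset per frame, varying only the articulation parameters."""
    for angles in angle_frames:
        yield replace(asset, bone_angles=np.asarray(angles, dtype=np.float64))
```

Each function returns a new `Asset` rather than mutating its input, so the original reconstruction stays available for further edits.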

Animation

[Figure: three input images, each shown with its animated reconstruction.]

Relighting

[Figure: three input images, each shown with its relit reconstruction.]

Texture Swapping

[Figure: two examples, each showing an input image, a texture image, and the texture-swapped result.]

BibTeX

@Article{jakab2023farm3d,
    title={{Farm3D}: Learning Articulated 3D Animals by Distilling 2D Diffusion},
    author={Jakab, Tomas and Li, Ruining and Wu, Shangzhe and Rupprecht, Christian and Vedaldi, Andrea},
    journal={arXiv preprint arXiv:2304.10535},
    year={2023},
}