Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

Anand Bhattad (TTI-Chicago)
Konpat Preechakul (UC Berkeley)
Alexei A. Efros (UC Berkeley)


Visual Jenga. Starting from an input image, we remove one object at a time while keeping the rest of the scene stable. This process reveals object dependencies and provides a new way to evaluate grounded scene understanding.

Abstract

This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to keep the tower stable, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both a physical and a geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to exploit the asymmetry in the pairwise relationships between objects within a scene, using a large inpainting model to generate a set of counterfactuals that quantify this asymmetry.



Key Idea: Counterfactual Inpainting

Inpainting Results
We test how much two objects depend on each other by masking one at a time and using a large inpainting model to fill in the missing region. The numbers show the similarity between each inpainted result and the original object. If the replacements vary widely, the object is less likely to be a support. Here, the table consistently reappears, while the cat is replaced by many different objects, suggesting that the table supports the cat.
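The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the aggregation (a plain mean over samples), the function names `support_score` and `removal_order`, and the similarity values are all hypothetical and are not taken from the paper, whose actual scoring may differ.

```python
from statistics import mean


def support_score(similarities):
    """Average similarity between inpainted samples and the original object.

    Assumption: a simple mean is used as the aggregate; the paper's
    actual measure may be different.
    """
    return mean(similarities)


def removal_order(similarity_by_object):
    """Sort objects so that the least 'support-like' object comes first.

    A low score means the inpainted replacements rarely resemble the
    original object, so the object is less likely to be supporting others.
    """
    return sorted(
        similarity_by_object,
        key=lambda name: support_score(similarity_by_object[name]),
    )


# Made-up similarity values, purely for illustration.
example = {
    "table": [0.91, 0.88, 0.93],  # table keeps reappearing after masking
    "cat": [0.21, 0.35, 0.18],    # cat is replaced by varied content
}
print(removal_order(example))  # ['cat', 'table']
```

With these illustrative values the cat is ranked for removal before the table, which matches the intuition that the table supports the cat.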


Our Pipeline

Starting from an input image, we first detect object centers using Molmo. These points are used to segment each object with SAM. We then apply our Counterfactual Inpainting method to rank objects by their dependence, and finally remove them in order using Firefly.
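The four stages above can be wired together as in the following sketch. It deliberately does not call any real Molmo, SAM, or Firefly API: each stage is passed in as a callable, and the names `visual_jenga`, `detect_points`, `segment`, `score`, and `remove` are hypothetical placeholders. It also assumes that objects are ranked once on the input image and then removed in that fixed order, which is one reading of the description.

```python
def visual_jenga(image, detect_points, segment, score, remove):
    """Remove objects one at a time and return every intermediate image.

    All four callables are hypothetical stand-ins:
      detect_points(image)  -> list of object center points (e.g. a pointing model)
      segment(image, point) -> a mask for the object at that point
      score(image, mask)    -> lower means less likely to support other objects
      remove(image, mask)   -> image with the masked object inpainted away
    """
    points = detect_points(image)
    masks = [segment(image, p) for p in points]

    # Assumption: rank once on the original image, lowest score first.
    scores = [score(image, m) for m in masks]
    order = sorted(range(len(masks)), key=lambda i: scores[i])

    frames = [image]
    for i in order:
        image = remove(image, masks[i])
        frames.append(image)
    return frames
```

Passing the stages as arguments keeps the sketch independent of any particular model, so each stand-in can be swapped for a real detector, segmenter, or inpainting backend without changing the control flow.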

BibTeX

@misc{bhattad2025visualjengadiscoveringobject,
      title={Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting}, 
      author={Anand Bhattad and Konpat Preechakul and Alexei A. Efros},
      year={2025},
      eprint={2503.21770},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.21770}, 
}

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
