Program

Date: June 16, 2019 (PM Session)

Room: Seaside 7 (S7)

Invited Talks

Dhruv Batra

Georgia Tech

Habitat: A Platform for Embodied AI Research

I will present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation, before transferring the learned skills to reality. The ‘software stack’ for training embodied agents involves datasets providing 3D assets, simulators that render these assets and simulate agents, and tasks that define goals and evaluation metrics, enabling us to benchmark scientific progress. We aim to standardize this entire stack by contributing specific instantiations at each level: unified support for scanned and designed 3D scene datasets, a new simulation engine (Habitat-Sim), and a modular API (Habitat-API). The Habitat architecture and implementation combine modularity and high performance. For example, when rendering a realistic scanned scene from the Matterport3D dataset, Habitat-Sim achieves several thousand frames per second (FPS) running single-threaded and can reach over 10,000 FPS multi-process on a single GPU! These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or ‘merely’ impractical. Finally, I will describe the Habitat Challenge, an autonomous navigation challenge that aims to benchmark and advance efforts in embodied AI.
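
To give a concrete sense of the modular API mentioned above, the sketch below shows a random agent stepping through a point-goal navigation episode with Habitat-API. It is a minimal example in the spirit of the project's public documentation, not a definitive listing; the config path and exact attribute names are assumptions and may differ between releases.

    # Minimal sketch of driving Habitat-API (assumes the package and a
    # PointNav task config are installed; the config path is illustrative).
    import habitat

    # Load the point-goal navigation task and a simulated agent.
    env = habitat.Env(config=habitat.get_config("configs/tasks/pointnav.yaml"))

    observations = env.reset()

    # Step through one episode with random actions; a learned policy would
    # replace action_space.sample().
    while not env.episode_over:
        observations = env.step(env.action_space.sample())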

Devi Parikh

Georgia Tech

Forcing Vision + Language Models To Actually See, Not Just Talk

Machines can often convincingly describe an image in a natural language sentence, answer a free-form question about an image, or hold a conversation with a human about an image. However, careful inspection reveals that these models often rely on superficial language correlations from training data. I will talk about some of our efforts towards making these models ground their predictions in the image content.

Part of what is exciting about problems at the intersection of vision and language is the possibility that they open up for humans to collaborate with machines. Towards the end of my talk, I will briefly mention some recent work I am excited about in using explainable AI to teach humans new concepts, and in creative AI to augment humans' expressive power.

Sanja Fidler

University of Toronto

Learning to Caption Images through a Lifetime by Asking Questions

In order to bring artificial agents into our lives, we will need to go beyond supervised learning on closed datasets to the ability to continuously expand knowledge. Inspired by a student learning in a classroom, we present an agent that can continuously learn by posing natural language questions to humans. Our agent is composed of three interacting modules: one that performs captioning, another that generates questions, and a decision maker that learns when to ask questions by implicitly reasoning about the uncertainty of the agent and the expertise of the teacher. Compared to current active learning methods, which query images for full captions, our agent is able to ask pointed questions to improve the generated captions. The agent trains on the improved captions, expanding its knowledge. We show that our approach achieves better performance than the baselines on the challenging MSCOCO dataset while using less human supervision.
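
As a rough illustration of how the three modules described in this abstract might interact, here is a hypothetical sketch of one learning step. All identifiers (captioner, question_gen, decision_maker, teacher) are placeholders invented for this example and are not taken from the authors' implementation.

    # Hypothetical sketch of one learning-by-asking step; the objects and
    # method names below are placeholders, not the authors' code.
    def learning_by_asking_step(image, captioner, question_gen, decision_maker, teacher):
        caption = captioner.generate(image)
        # The decision maker weighs the agent's uncertainty against the
        # expected benefit of querying the (imperfect) human teacher.
        if decision_maker.should_ask(image, caption):
            question = question_gen.generate(image, caption)    # pointed question
            answer = teacher.answer(image, question)            # human supervision
            caption = captioner.refine(image, caption, answer)  # improved caption
        captioner.train_on(image, caption)                      # expand knowledge
        return caption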

Tamara Berg

UNC Chapel Hill

Focused Words & Pictures

Much of everyday language and discourse concerns the visual world around us, making it an important challenge for AI to understand the relationship between the physical world and the language that describes it. Comprehending the complex and subtle interplay between the visual and linguistic domains will have broad applicability toward inferring human-like understanding of images, producing natural human-robot interactions, and grounding natural language. In computer vision, alongside improvements in deep-learning-based visual recognition, there has been an explosion of interest in methods that automatically generate natural language outputs for images and videos. In this talk I will describe our group's recent efforts to understand and produce focused natural language about images. In particular, I will review our work on referring expressions and visual question answering.