RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation

Anonymous Author(s)
Abstract
We leverage Large Language Models (LLMs) for zero-shot Semantic Audio-Visual Navigation (SAVN). Existing methods rely on extensive training demonstrations for reinforcement learning, yet achieve relatively low success rates and lack generalizability. The intermittent nature of auditory signals poses additional obstacles to inferring the goal information.
To address these challenges, we present the Reflective and Imaginative Language Agent (RILA). By employing multi-modal models to process sensory data, we instruct an LLM-based planner to actively explore the environment. During exploration, our agent adaptively evaluates and dismisses inaccurate perceptual descriptions. Additionally, we introduce an auxiliary LLM-based assistant that enhances global environmental comprehension by mapping room layouts and providing strategic insights. Through comprehensive experiments and analysis, we show that our method outperforms relevant baselines without requiring training demonstrations from the environment or complementary semantic information.
Demo Videos
· The goal object is a counter.
· At each step, the agent hears a different 1-second audio clip emitted by the goal object.
· The sound stops at step 14 in this example episode.
Method
We partition our pipeline into the following modules:
    - Perception
        - Audio: Direction & Distance
        - Visual: Semantic map & Object Detection
    - Planning
        - Reflective Planner (RefPlanner)
        - Imaginative Assistant (ImaAssistant): Region Imagination & Suggestion
More details are shown in our paper.
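To make the perception-to-planning interface concrete, here is a minimal sketch of how the two perception streams (audio direction/distance and the visual semantic map with object detections) could be serialized into a textual observation for an LLM-based planner. All class and function names, and the observation format, are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class AudioCue:
    """Audio perception output (hypothetical): estimated bearing and range."""
    direction_deg: float  # estimated direction to the sounding goal object
    distance_m: float     # estimated distance to the sounding goal object

@dataclass
class VisualCue:
    """Visual perception output (hypothetical): frontier summaries + detections."""
    semantic_map: dict        # frontier id -> short text summary of the region
    detected_objects: list    # labels from the object detector

def build_observation(audio: AudioCue, visual: VisualCue) -> str:
    """Serialize perception outputs into the text an LLM planner would read."""
    frontier_lines = "\n".join(
        f"- {fid}: {desc}" for fid, desc in visual.semantic_map.items()
    )
    detections = ", ".join(visual.detected_objects) or "none"
    return (
        f"Audio cue: sound at ~{audio.direction_deg:.0f} deg, "
        f"~{audio.distance_m:.1f} m away.\n"
        f"Detected objects: {detections}.\n"
        f"Frontiers:\n{frontier_lines}"
    )
```

This keeps each module's output inspectable as plain text, which also makes it easy for the planner to dismiss a perceptual description it judges inaccurate.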
RefPlanner Prompting Example
Here is a simplified example of how RefPlanner and ImaAssistant work together:
ImaAssistant: Uses the LLM's commonsense reasoning to infer the type of a region and imagine its size, then passes its suggestion to RefPlanner.
RefPlanner: Based on the perceptual information and ImaAssistant's suggestion, chooses which frontier to explore next.
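The two-stage exchange above can be sketched as a pair of prompt builders: one for ImaAssistant's region-imagination query and one for RefPlanner's frontier-selection query. The prompt wording and function signatures here are illustrative assumptions, not the actual prompts used in the paper.

```python
def ima_assistant_prompt(region_observations: dict) -> str:
    """Build a hypothetical ImaAssistant prompt: infer each region's type
    and likely size, then suggest where the goal object may be."""
    lines = [
        f"Region {rid}: objects seen = {objs}"
        for rid, objs in region_observations.items()
    ]
    return (
        "Infer the room type and likely size of each region below, "
        "then suggest which region most likely contains the goal object.\n"
        + "\n".join(lines)
    )

def ref_planner_prompt(goal: str, frontiers: list, suggestion: str) -> str:
    """Build a hypothetical RefPlanner prompt: pick a frontier given the
    perceptual summary and ImaAssistant's suggestion."""
    return (
        f"Goal object: {goal}\n"
        f"Candidate frontiers: {', '.join(frontiers)}\n"
        f"Assistant suggestion: {suggestion}\n"
        "Choose one frontier to explore next and justify briefly."
    )
```

In this scheme the assistant's free-form suggestion is simply appended to the planner's context, so the planner remains free to overrule it when the perceptual evidence disagrees.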
More details are shown in our paper.
Experiments & Metrics