Making Large Language Models Better Planners with Reasoning-Decision Alignment

Zhijian Huang1*
Tao Tang1*
Shaoxiang Chen2
Sihao Lin3
Zequn Jie2
Lin Ma2
Guangrun Wang4
Xiaodan Liang1,5

1Shenzhen Campus of Sun Yat-sen University
2Meituan Inc.
3University of Technology Sydney
4Sun Yat-sen University
5Research Institute of Multiple Agents and Embodied Intelligence,
Peng Cheng Laboratory, Shenzhen, China

ECCV 2024 (Oral)

ArXiv | Code | Bibtex


Abstract

Data-driven approaches for autonomous driving (AD) have been widely adopted in the past decade but are confronted with dataset bias and uninterpretability. Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios. They find that the pretrain-finetune paradigm of LLMs on downstream data with Chain-of-Thought (CoT) reasoning can enhance explainability and scene understanding. However, this popular strategy suffers from a notorious misalignment between the crafted CoTs and the consequent decision-making, a problem left untouched by previous LLM-based AD methods. To address it, we propose an end-to-end decision-making model based on a multimodality-augmented LLM, which simultaneously performs CoT reasoning and produces planning results. Furthermore, we propose a reasoning-decision alignment constraint between the paired CoTs and planning results, imposing correspondence between reasoning and decision-making. Moreover, we redesign the CoTs to enable the model to comprehend complex scenarios and enhance decision-making performance. We dub our proposed large language planner with reasoning-decision alignment RDA-Driver. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of RDA-Driver in enhancing the performance of end-to-end AD systems. Specifically, RDA-Driver achieves state-of-the-art planning performance on the nuScenes dataset with 0.80 L2 error and 0.32 collision rate, and also achieves leading results on the challenging DriveLM-nuScenes benchmark with 0.82 L2 error and 0.38 collision rate.


Methodology

RDA-Driver takes the multi-view images, ego status, and a multi-turn CoT prompt as input, and simultaneously carries out CoT reasoning and produces planning results. We construct multiple misaligned reasoning-decision samples from both the vanilla fine-tuned model and similar scenarios. During training, we compute the token-average score as a measure of each CoT answer, and use the proposed contrastive loss to ensure that the scores of positive samples are higher than those of the generated negative samples.
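The scoring and ranking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the token-average score is the mean log-probability of an answer's tokens, and uses a simple margin-based form for the contrastive loss; the function names, the margin value, and the exact loss shape are assumptions for illustration.

```python
def token_average_score(token_logprobs):
    """Token-average score of a CoT answer: mean log-probability over its
    tokens (assumed scoring rule, for illustration)."""
    return sum(token_logprobs) / len(token_logprobs)

def alignment_ranking_loss(pos_logprobs, neg_logprobs_list, margin=1.0):
    """Margin-based contrastive loss sketch: push the score of the aligned
    (positive) reasoning-decision pair above each misaligned (negative)
    pair's score by at least `margin`, then average over negatives."""
    s_pos = token_average_score(pos_logprobs)
    losses = []
    for neg_logprobs in neg_logprobs_list:
        s_neg = token_average_score(neg_logprobs)
        # Hinge term: zero once the positive leads by the full margin.
        losses.append(max(0.0, margin - (s_pos - s_neg)))
    return sum(losses) / len(losses)
```

A negative whose score already trails the positive by more than the margin contributes zero loss, so training pressure concentrates on the misaligned samples the model still ranks too highly.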

Results

R1: Motion planning performance on the nuScenes benchmark. Our approach significantly outperforms or matches prior works while using a small number of labels.

R2: Motion planning performance on the DriveLM-nuScenes validation set. Our method maintains excellent performance in terms of L2 error and collision rate.


Publication

Z. Huang, T. Tang, S. Chen, S. Lin, Z. Jie, L. Ma, G. Wang, X. Liang
Making Large Language Models Better Planners with Reasoning-Decision Alignment
ECCV 2024 (Oral)
ArXiv | Code | Bibtex





Webpage template modified from here.