CoMA: Compositional Human Motion Generation with Multi-modal Agents

* denotes equal contribution.
1University of California, Irvine     2Southeast University     3Chongqing University     4Huazhong University of Science and Technology     5Northeastern University     6Stony Brook University
Teaser image.

CoMA can generate high-quality motion sequences even for challenging user prompts. Red labels indicate context-rich moves and/or poses, purple labels indicate spatially compositional motions, and gray labels indicate trajectory-editing instructions.

Abstract

3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.

CoMA overview

CoMA is designed around multi-modal agents. Leveraging modern Large Language Models (LLMs) and Video Language Models (VLMs), our pipeline distributes these capabilities across four independent agents: the Task Planner, the Trajectory Editor, the Motion Generator, and the Motion Reviewer. Below are brief descriptions of each agent, as well as an example workflow of CoMA (see the sketch after the overview figure).

[Figure: CoMA pipeline overview]
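To make the workflow concrete, here is a minimal, hypothetical orchestration loop. All function names, the round limit, and the similarity threshold are illustrative stand-ins rather than CoMA's actual interfaces; only the overall control flow (plan, optionally edit the trajectory, generate with SPAM, review, and self-correct) follows the description below.

```python
# Hypothetical sketch of the four-agent workflow; names and values here
# are illustrative stand-ins, not CoMA's actual interfaces.

MAX_ROUNDS = 3        # assumed bound on self-correction iterations
SIM_THRESHOLD = 0.8   # assumed caption/prompt similarity threshold

def generate(user_prompt, task_planner, trajectory_editor,
             motion_generator, motion_reviewer):
    plan = task_planner(user_prompt)                  # recaption + decompose
    trajectory = (trajectory_editor(plan)             # pelvis path, if any
                  if plan.get("has_trajectory") else None)
    motion = motion_generator(plan, trajectory)       # SPAM generation
    for _ in range(MAX_ROUNDS):                       # self-correction loop
        caption, sim = motion_reviewer(motion, user_prompt)
        if sim >= SIM_THRESHOLD:                      # reviewer is satisfied
            break
        motion = motion_generator(plan, trajectory, feedback=caption)
    return motion
```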

Multi-modal agents

Task Planner

This agent leverages an LLM (GPT-4o) to perform its three designated tasks: Recaptioning, Temporal Composition, and Task Decomposition.
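As a rough illustration of the Task Decomposition step, the sketch below prompts GPT-4o to split a motion description into sub-motions. The system prompt and the one-sub-motion-per-line output format are assumptions for illustration; CoMA's actual prompts are not reproduced here.

```python
# Illustrative decomposition request to GPT-4o; the prompt wording and
# output format are assumed, not taken from CoMA.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "Decompose the user's motion description into an ordered list of "
    "short sub-motions, one per line, each describing a single action."
)

def decompose(prompt: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# Example: decompose("a person cartwheels, then bows deeply")
```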

Trajectory Editor

If the Task Planner identifies trajectory information in the prompt, the Trajectory Editor prompts GPT-4o to generate a mathematical function that computes trajectory coordinates for the pelvis joint. This guides the generated sequence to follow the path specified by the user.
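The sketch below shows how such a generated function might be sampled into per-frame pelvis targets. The circular path stands in for whatever closed-form expression GPT-4o returns, and the 196-frame length (a HumanML3D-style maximum) is an assumption.

```python
# Sketch of sampling an LLM-generated trajectory function into per-frame
# pelvis coordinates; the circle stands in for the returned expression.
import numpy as np

def pelvis_trajectory(t: np.ndarray) -> np.ndarray:
    """Example: walk a circle of radius 1 m on the ground plane."""
    x = np.cos(2 * np.pi * t)
    z = np.sin(2 * np.pi * t)
    return np.stack([x, z], axis=-1)     # (T, 2) ground-plane coordinates

num_frames = 196                          # HumanML3D-style length (assumed)
t = np.linspace(0.0, 1.0, num_frames)     # normalized time in [0, 1]
path = pelvis_trajectory(t)               # per-frame pelvis (x, z) targets
```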

Motion Generator

Once the text-processing agents have finished their tasks, the Motion Generator agent uses SPAM to generate and edit a motion sequence according to the fine-grained instructions.

Motion Reviewer

Powered by an instruction-fine-tuned VLM (VideoChat2) operating on colored human body renders produced in Blender, the Motion Reviewer captions the sequence generated by SPAM and uses cosine similarity to gauge the caption's relevance to the original user prompt. If the similarity falls below a preset threshold, this agent initiates the self-correction pipeline.
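A minimal sketch of this relevance check follows. The choice of text encoder and the 0.8 threshold are stand-ins; the embedding model and threshold actually used by CoMA may differ.

```python
# Sketch of the relevance check: embed the VLM caption and the user
# prompt, compare by cosine similarity, and trigger self-correction
# below a threshold. Encoder and threshold are assumed stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
THRESHOLD = 0.8  # assumed; the paper's threshold may differ

def needs_correction(caption: str, prompt: str) -> bool:
    a, b = encoder.encode([caption, prompt])
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity < THRESHOLD
```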


SPAM: Spatially-Aware Masked Generative Motion Model

[Figure: SPAM architecture]
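Since SPAM uses body part-specific encoders and codebooks, the sketch below illustrates the core idea of per-part quantization: each body part gets its own codebook, and part features are mapped to their nearest codes. Part names, code counts, and dimensions are illustrative; SPAM's actual encoders and masking scheme are described in the paper.

```python
# Minimal sketch of body part-specific codebooks; sizes and part names
# are illustrative, not SPAM's actual configuration.
import torch

PARTS = ["left_arm", "right_arm", "left_leg", "right_leg", "torso"]
CODE_DIM, NUM_CODES = 64, 512  # illustrative sizes

codebooks = {p: torch.nn.Embedding(NUM_CODES, CODE_DIM) for p in PARTS}

def quantize(part: str, feats: torch.Tensor) -> torch.Tensor:
    """Map (T, CODE_DIM) part features to nearest-code indices (T,)."""
    codes = codebooks[part].weight       # (NUM_CODES, CODE_DIM)
    dists = torch.cdist(feats, codes)    # (T, NUM_CODES) pairwise distances
    return dists.argmin(dim=-1)          # per-frame token ids
```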

Comparisons with SOTA (More in Gallery)

[Figure: Qualitative comparisons with state-of-the-art methods]