CoMA: Compositional Human Motion Generation with Multi-modal Agents

* denotes equal contribution.
1University of California, Irvine     2Southeast University     3Chongqing University     4Huazhong University of Science and Technology     5Northeastern University     6Stony Brook University
Teaser image.

CoMA can generate high-quality motion sequences even for challenging user prompts. Red labels indicate context-rich moves and/or poses, purple labels indicate spatially compositional motions, and gray labels indicate trajectory-editing instructions.

Abstract

3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.

CoMA overview

CoMA is designed around multi-modal agents. Leveraging modern Large Language Models (LLMs) and Video Language Models (VLMs), our pipeline distributes these capabilities across four independent agents: the Task Planner, the Trajectory Editor, the Motion Generator, and the Motion Reviewer. Below are brief descriptions of each agent, as well as an example workflow of CoMA (see the sketch after the overview figure).

[Figure: CoMA pipeline overview]
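To make the workflow concrete, here is a minimal, hypothetical orchestration loop. All function names, the round limit, and the similarity threshold are illustrative stand-ins rather than CoMA's actual interfaces; only the overall control flow (plan, optionally edit the trajectory, generate with SPAM, review, and self-correct) follows the description below.

```python
# Hypothetical sketch of the four-agent workflow; names and values here
# are illustrative stand-ins, not CoMA's actual interfaces.

MAX_ROUNDS = 3        # assumed bound on self-correction iterations
SIM_THRESHOLD = 0.8   # assumed caption/prompt similarity threshold

def generate(user_prompt, task_planner, trajectory_editor,
             motion_generator, motion_reviewer):
    plan = task_planner(user_prompt)                  # recaption + decompose
    trajectory = (trajectory_editor(plan)             # pelvis path, if any
                  if plan.get("has_trajectory") else None)
    motion = motion_generator(plan, trajectory)       # SPAM generation
    for _ in range(MAX_ROUNDS):                       # self-correction loop
        caption, sim = motion_reviewer(motion, user_prompt)
        if sim >= SIM_THRESHOLD:                      # reviewer is satisfied
            break
        motion = motion_generator(plan, trajectory, feedback=caption)
    return motion
```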

Multi-modal agents

Task Planner

This agent leverages an LLM (GPT-4o) to perform its three designated tasks: Recaptioning, Temporal Composition, and Task Decomposition.
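As a rough illustration of the Task Decomposition step, the sketch below prompts GPT-4o to split a motion description into sub-motions. The system prompt and the one-sub-motion-per-line output format are assumptions for illustration; CoMA's actual prompts are not reproduced here.

```python
# Illustrative decomposition request to GPT-4o; the prompt wording and
# output format are assumed, not taken from CoMA.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "Decompose the user's motion description into an ordered list of "
    "short sub-motions, one per line, each describing a single action."
)

def decompose(prompt: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# Example: decompose("a person cartwheels, then bows deeply")
```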

Trajectory Editor

If the Task Planner identifies trajectory information in the prompt, the Trajectory Editor prompts GPT-4o to generate a mathematical function that computes trajectory coordinates for the pelvis joint. This guides the generated sequence to follow the path specified by the user.
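The sketch below shows how such a generated function might be sampled into per-frame pelvis targets. The circular path stands in for whatever closed-form expression GPT-4o returns, and the 196-frame length (a HumanML3D-style maximum) is an assumption.

```python
# Sketch of sampling an LLM-generated trajectory function into per-frame
# pelvis coordinates; the circle stands in for the returned expression.
import numpy as np

def pelvis_trajectory(t: np.ndarray) -> np.ndarray:
    """Example: walk a circle of radius 1 m on the ground plane."""
    x = np.cos(2 * np.pi * t)
    z = np.sin(2 * np.pi * t)
    return np.stack([x, z], axis=-1)     # (T, 2) ground-plane coordinates

num_frames = 196                          # HumanML3D-style length (assumed)
t = np.linspace(0.0, 1.0, num_frames)     # normalized time in [0, 1]
path = pelvis_trajectory(t)               # per-frame pelvis (x, z) targets
```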

Motion Generator

Once the text-processing agents have finished their tasks, the Motion Generator agent uses SPAM to generate and edit a motion sequence according to the fine-grained instructions.

Motion Reviewer

Powered by an instruction-fine-tuned VLM (VideoChat2) operating on colored human body renders produced in Blender, the Motion Reviewer captions the sequence generated by SPAM and uses cosine similarity to gauge the caption's relevance to the original user prompt. If the similarity falls below a preset threshold, this agent initiates the self-correction pipeline.
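A minimal sketch of this relevance check follows. The choice of text encoder and the 0.8 threshold are stand-ins; the embedding model and threshold actually used by CoMA may differ.

```python
# Sketch of the relevance check: embed the VLM caption and the user
# prompt, compare by cosine similarity, and trigger self-correction
# below a threshold. Encoder and threshold are assumed stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
THRESHOLD = 0.8  # assumed; the paper's threshold may differ

def needs_correction(caption: str, prompt: str) -> bool:
    a, b = encoder.encode([caption, prompt])
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity < THRESHOLD
```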


SPAM: Spatially-Aware Masked Generative Motion Model

[Figure: SPAM architecture]
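Since SPAM uses body part-specific encoders and codebooks, the sketch below illustrates the core idea of per-part quantization: each body part gets its own codebook, and part features are mapped to their nearest codes. Part names, code counts, and dimensions are illustrative; SPAM's actual encoders and masking scheme are described in the paper.

```python
# Minimal sketch of body part-specific codebooks; sizes and part names
# are illustrative, not SPAM's actual configuration.
import torch

PARTS = ["left_arm", "right_arm", "left_leg", "right_leg", "torso"]
CODE_DIM, NUM_CODES = 64, 512  # illustrative sizes

codebooks = {p: torch.nn.Embedding(NUM_CODES, CODE_DIM) for p in PARTS}

def quantize(part: str, feats: torch.Tensor) -> torch.Tensor:
    """Map (T, CODE_DIM) part features to nearest-code indices (T,)."""
    codes = codebooks[part].weight       # (NUM_CODES, CODE_DIM)
    dists = torch.cdist(feats, codes)    # (T, NUM_CODES) pairwise distances
    return dists.argmin(dim=-1)          # per-frame token ids
```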

Comparisons with SOTA (More in Gallery)

[Figure: Qualitative comparisons with state-of-the-art methods]