ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation

Kerui Chen1, Jianrong Zhang2, Ming Li3, Zhonglong Zheng4, Hehe Fan1
1CCAI, Zhejiang University, 2ReLER, University of Technology Sydney
3Guangming Laboratory, China, 4Zhejiang Normal University
Abstract

Existing stylized motion generation models have shown a remarkable ability to extract specific style information from a style motion and insert it into a content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. First, instead of learning an independent embedding from each style motion, we leverage a set of prototypes to model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment a pretrained text-to-motion generation model with a Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.
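To make the prototype idea concrete, here is a minimal sketch of assigning style embeddings to non-learnable prototype anchors and measuring an alignment loss. This is an illustrative toy, not the paper's actual training objective: the prototype count `K`, the cosine-based hard assignment, and the `1 - cosine` loss are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K fixed (non-learnable) prototype anchors for one
# style category, unit-normalized, in a d-dimensional style embedding space.
K, d = 4, 8
prototypes = rng.normal(size=(K, d))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

def assign_and_align(style_embs):
    """Assign each style embedding to its nearest prototype (by cosine
    similarity) and return the mean alignment loss (1 - cosine)."""
    z = style_embs / np.linalg.norm(style_embs, axis=1, keepdims=True)
    sims = z @ prototypes.T               # (N, K) cosine similarities
    idx = sims.argmax(axis=1)             # hard assignment per embedding
    loss = float(np.mean(1.0 - sims[np.arange(len(z)), idx]))
    return idx, loss

# Embeddings of several motions sharing one style category: different
# motions may land on different prototypes, modeling intra-style diversity.
embs = rng.normal(size=(6, d))
idx, loss = assign_and_align(embs)
```

Because each motion can be pulled toward a different anchor, a single style category spreads over several prototypes instead of collapsing to one embedding.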


Stylized Motion Generation

Content Text: A person is walking backward.  ➕  Style: aeroplane

Style Motion

Ours

SMooDi


Content Text: A person walks in a circle.  ➕  Style: old

Style Motion

Ours

SMooDi


Content Text: A person walks some steps, and then sits down.  ➕  Style: chicken

Style Motion

Ours

SMooDi


Motion Style Transfer

Content Motion

+

Style Motion (star style)

Ours


Content Motion

+

Style Motion (chicken style)

Ours


Prototype-based Guidance

Content Text: A person is walking forward.  ➕  Style: aeroplane

We use three different aeroplane global prototypes to generate the stylized motion.

Global1

Global2

Global3

Content Text: A person is walking backward.  ➕  Style: aeroplane

We use two different aeroplane local prototypes to generate the stylized motion.

Local1

Local2
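The guidance shown above can be sketched as choosing among the fixed prototypes of one style category at inference time. The toy `stylize` function below is a hypothetical stand-in for the generator (in the real model the style feature is injected through the Stylistic Modulation Adapter); the blending weight `alpha` and the prototype count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 3, 8
# Hypothetical fixed global prototypes for one style category (e.g. "aeroplane").
global_protos = rng.normal(size=(K, d))

def stylize(content, proto, alpha=0.5):
    """Toy stand-in for the generator: modulate a content feature with a
    chosen style prototype instead of a single per-style embedding."""
    return (1.0 - alpha) * content + alpha * proto

content = rng.normal(size=d)
# Swapping the prototype changes the stylized output for the same content,
# which is how one style yields several distinct motion variants.
variants = [stylize(content, global_protos[k]) for k in range(K)]
```

Picking Global1, Global2, or Global3 in the demos corresponds to selecting a different row of `global_protos` as the style condition.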