Existing stylized motion generation models have shown a remarkable ability to extract style information from a style motion and inject it into a content motion. However, capturing intra-style diversity, where a single style corresponds to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. First, instead of learning an independent embedding from each style motion, we leverage a set of prototypes to model the diverse style patterns shared across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of each motion sequence. These two components shape two style embedding spaces, \ie, global and local, which are optimized by aligning style embeddings with non-learnable prototype anchors. Furthermore, we augment a pretrained text-to-motion generation model with a Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models on stylized motion generation and motion style transfer.
Figure: Qualitative results. (Top) Stylized text-to-motion generation compared with SMooDi for three content texts and styles: "A person is walking backward." + Aeroplane, "A person walks in a circle." + Old, and "A person walks some steps, and then sits down." + Chicken; each example shows the style motion, our result, and SMooDi's result. (Middle) Motion style transfer: given a content motion and a style motion (Star, Chicken), our method produces the stylized motion. (Bottom) Intra-style diversity: for "A person is walking forward." + Aeroplane, three different Aeroplane global prototypes yield distinct stylized motions, and for "A person is walking backward." + Aeroplane, different Aeroplane local prototypes yield distinct stylized motions.
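The abstract describes the method only at a high level. As an illustration of the general idea (not the authors' exact formulation), the following minimal PyTorch-style sketch shows how style embeddings might be aligned with a fixed set of non-learnable prototype anchors and injected into a frozen text-to-motion backbone through an adapter. All names (PrototypeStyleEncoder, StylisticModulationAdapter, num_prototypes), the GRU encoder, the FiLM-style modulation, and the prototype-assignment loss are illustrative assumptions, not the paper's actual architecture or objective.

# Minimal sketch (assumed, not the authors' implementation) of
# prototype-anchored style embeddings and adapter-based style modulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeStyleEncoder(nn.Module):
    """Encodes a style motion and aligns it to fixed (non-learnable) prototypes."""
    def __init__(self, motion_dim=263, embed_dim=256, num_prototypes=8):
        super().__init__()
        self.encoder = nn.GRU(motion_dim, embed_dim, batch_first=True)
        # Non-learnable prototype anchors, stored as a frozen buffer
        # (e.g., randomly initialized or obtained from offline clustering).
        protos = F.normalize(torch.randn(num_prototypes, embed_dim), dim=-1)
        self.register_buffer("prototypes", protos)

    def forward(self, style_motion):
        # style_motion: (B, T, motion_dim)
        hidden, _ = self.encoder(style_motion)
        local_emb = F.normalize(hidden, dim=-1)               # per-frame (local) style
        global_emb = F.normalize(hidden.mean(dim=1), dim=-1)  # sequence-level (global) style
        # Soft assignment of the global embedding to the prototype anchors.
        logits = global_emb @ self.prototypes.t()             # (B, K)
        return global_emb, local_emb, logits

class StylisticModulationAdapter(nn.Module):
    """Injects a style embedding into features of a frozen text-to-motion backbone."""
    def __init__(self, feat_dim=512, style_dim=256):
        super().__init__()
        self.to_scale = nn.Linear(style_dim, feat_dim)
        self.to_shift = nn.Linear(style_dim, feat_dim)

    def forward(self, backbone_feat, style_emb):
        # FiLM-style modulation of backbone features by the style embedding.
        scale = self.to_scale(style_emb).unsqueeze(1)
        shift = self.to_shift(style_emb).unsqueeze(1)
        return backbone_feat * (1.0 + scale) + shift

# Toy usage: align a batch of style motions to prototypes and modulate features.
enc = PrototypeStyleEncoder()
sma = StylisticModulationAdapter()
style_motion = torch.randn(4, 60, 263)    # (batch, frames, motion features)
backbone_feat = torch.randn(4, 60, 512)   # features from the frozen backbone
g, l, logits = enc(style_motion)
# Illustrative alignment loss: pull each embedding toward its nearest prototype.
# The paper's actual objective (e.g., how assignments are balanced) is not
# specified by the abstract.
target = logits.argmax(dim=-1)
proto_loss = F.cross_entropy(logits, target)
styled_feat = sma(backbone_feat, g)

In such a design the backbone stays frozen and only the encoder and adapter are trained, which is one common way an adapter like the SMA could integrate style features without disturbing the pretrained text-to-motion model.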