Category-Aware 3D Object Composition with Disentangled Texture and Shape Multi-view Diffusion
ACMMM 2025


Zeren Xiong1, Zikun Chen1, Zedong Zhang1,
Xiang Li3, Ying Tai2, Jian Yang1, Jun Li1*

1Nanjing University of Science and Technology
2Nanjing University, 3Nankai University

Code (coming soon)


Teaser figure.


Generated Meshes

Original model (Sabrewulf), composited with: Cobalt metal, Titanium metal, African chameleon, Ant, Bald eagle, Bighorn, Chromium metal, Cock, Orchid plant, Shark, Fire salamander.






Original model (Kinni - character), composited with: Aluminum metal, Egyptian cat, Theater building, Gold, Tomato, Lily plant, Indigo bunting, Broccoli, Sycamore tree, Triceratops, Orchid plant.






Original models (Apatosaurus, Devil, Nesting doll, Tiger), composited with: Banana, Kit fox, Cock, Giraffe, Gold, Gazelle, Flamingo, African chameleon, King penguin, Polar bear, Peacock, King penguin.






Abstract

In this paper, we tackle a new task of 3D object synthesis, in which a 3D model is composited with another object category to create a novel 3D model. However, most existing text/image/3D-to-3D methods struggle to effectively integrate multiple content sources, often resulting in inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, category+3D-to-3D (C33D), for generating novel and structurally coherent 3D models. Our method first renders multi-view images and normal maps from the input 3D model, then generates a novel 2D object using adaptive text-image harmony (ATIH), taking the front-view image and a text description of another object category as inputs. To ensure texture consistency, we introduce texture multi-view diffusion, which refines the textures of the remaining multi-view RGB images based on the novel 2D object. To enhance shape accuracy, we propose shape multi-view diffusion, which improves the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations such as the shark (3D)-crocodile (text) in the first row of Fig. 1.
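The pipeline above can be summarized in a minimal sketch. This is only an illustrative decomposition of the stages named in the abstract; the function names (render_views, atih_compose, texture_mv_diffusion, shape_mv_diffusion, reconstruct_mesh) are hypothetical placeholders, not a released API.

    # Hypothetical sketch of the C33D pipeline; all stage functions are
    # placeholders for the steps described in the abstract, not real code.
    def c33d(input_mesh, category_text):
        # 1. Render multi-view RGB images and normal maps from the input 3D model.
        rgb_views, normal_views = render_views(input_mesh)

        # 2. Generate a novel 2D object with ATIH from the front view
        #    and the text description of the other object category.
        novel_front = atih_compose(rgb_views[0], category_text)

        # 3. Texture multi-view diffusion: refine textures of the remaining
        #    RGB views, conditioned on the novel 2D object.
        rgb_views = texture_mv_diffusion(rgb_views, condition=novel_front)

        # 4. Shape multi-view diffusion: refine the 2D shapes of the RGB views
        #    and normal maps, also conditioned on the novel 2D object.
        rgb_views, normal_views = shape_mv_diffusion(rgb_views, normal_views,
                                                     condition=novel_front)

        # 5. Reconstruct the complete, novel 3D model from the refined views.
        return reconstruct_mesh(rgb_views, normal_views)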



Framework

Framework figure.


An example of texture consistency


Our TMDiff achieves better texture consistency compared to ATIH




An example of shape accuracy


Our SMDiff demonstrates better shape accuracy compared to Era3D