Jiaxin Ye 1 Boyuan Cao 1 Qi Gao 1 Xin-Cheng Wen 2

Ziyang Ma 3 Yiwei Guo 3 Zhizhong Huang 4 Junping Zhang 1 Hongming Shan 1 †

1 Fudan University    2 Harbin Institute of Technology (Shenzhen)   
3 Shanghai Jiao Tong University    4 Michigan State University    † Corresponding Author


TL;DR: Diff-Dolly is the first unified multimodal framework to address the multimedia inconsistency issue in media cloning; it incorporates a multimodal diffusion process to probabilistically generate speech and video with both consistency and diversity.

What can Diff-Dolly do?

1. Multilingual Voice Dubbing

Input Ref Speech

English

German

Spanish

French

Portuguese

Russian

Input Ref Speech

English

German

Spanish

French

Portuguese

Russian


2. Video Dubbing

Instruction: Please unmute manually 🔇.

English

German

Spanish

French

Portuguese

Russian

English

German

Spanish

French

Portuguese

Russian


3. Dubbed Video Swapping

Instruction: Please unmute manually 🔇.

Text

aber auch eine ganz andere, eine sehr persönliche bedeutung.

Speech

Facial Image

Talking Video

Result


4. Secure Generation

Instruction: Please unmute manually 🔇.

Speech Secure Generation

Unwatermarked

Watermarked

See Watermarked Waveform

Detection Results

Video Secure Generation

Unwatermarked

Watermarked

Pixel Differences (x5)

Detection Results


Abstract

Media cloning aims to generate speech or video that replicates a source person's physiological (e.g., voiceprint, facial identity) or behavioral (e.g., lip motion, head pose) attributes while keeping other attributes (e.g., background) unchanged. Existing methods mostly focus on cloning a single attribute, which raises multimedia inconsistencies with other attributes (e.g., lip asynchrony or identity mismatch between speech and video) and significantly limits practical applicability. To bridge this gap, we propose Diff-Dolly, a unified 4D diffusion framework for Consistent Media Cloning (CMC), which achieves stronger multimedia consistency than previous methods. Specifically, for consistent multimodal generation, we propose a novel 4D U-Net architecture as the multimodal noise predictor in a unified multimodal denoising process. We then introduce a training-free denoising method that produces realistic content with consistent transitions between temporal chunks and natural blending across pixel regions. Additionally, for privacy preservation in media cloning, we propose a secure generation module that locally embeds imperceptible watermarking messages and enables frame-level detection. To advance media cloning in diverse real-world scenarios, we collect a multilingual text-speech-video dataset, CMC-TED, to validate the effectiveness of methods. Extensive experimental results demonstrate that Diff-Dolly outperforms baselines by a large margin and offers a practical solution with both enhanced consistency and security.
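For intuition about the chunked, training-free denoising mentioned above, the following is a minimal NumPy sketch of one common way to blend overlapping temporal chunks during sampling so that long sequences transition consistently. The chunk length, overlap, and cross-fade weights are illustrative assumptions, not the exact scheme used by Diff-Dolly.

```python
# Hypothetical sketch: blending overlapping temporal chunks of denoised
# video latents so that long sequences stay consistent across chunk borders.
# The chunking and cross-fade weights are illustrative assumptions.
import numpy as np

def blend_chunks(chunks, chunk_len=16, overlap=4):
    """chunks: list of arrays with shape (chunk_len, C, H, W),
    produced by denoising consecutive, overlapping windows."""
    stride = chunk_len - overlap
    total = stride * (len(chunks) - 1) + chunk_len
    out = np.zeros((total,) + chunks[0].shape[1:], dtype=np.float32)
    weight = np.zeros((total, 1, 1, 1), dtype=np.float32)
    # Triangular cross-fade weights inside each chunk.
    ramp = np.minimum(np.arange(1, chunk_len + 1),
                      np.arange(chunk_len, 0, -1)).astype(np.float32)
    ramp = ramp[:, None, None, None]
    for i, chunk in enumerate(chunks):
        start = i * stride
        out[start:start + chunk_len] += chunk * ramp
        weight[start:start + chunk_len] += ramp
    return out / weight  # weighted average in the overlapping regions
```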

Motivation

1. Multimedia Consistency: Previous uni-media cloning methods neglect multimedia-consistency constraints (e.g., lip sync and identity consistency), since they primarily focus on cloning specific attributes. For example, speech and lip motions should be inherently synchronized; however, when producing a Spanish dub of an English film using only voice cloning, the generated speech falls out of sync with the original actor's lip motions due to discrepancies between the two languages. This inconsistency severely impacts the practical applicability of these technologies. Although some media cloning methods introduce diverse reference conditions by ensembling different uni-media cloning models, they inefficiently accept a single media condition step by step, which fails to handle complicated global optimization and greatly limits the user experience. These media cloning tasks still lack a unified formulation for consistent media cloning toward realistic speech-video generation.
2. Realistic Generation: Diffusion models are typically limited to short clips (e.g., 16 frames), which causes temporal inconsistency in long-term generation. Moreover, with only a simple pixel-wise loss, diffusion models struggle to generate semantically meaningful content for missing regions. It is therefore important to explore how to generate realistic speech and video across spatial-temporal dimensions.
3. Security: Previous cloning methods can easily replicate private attributes, leaving them vulnerable to misuse without proper constraints. It is necessary to take security into consideration in order to trace the provenance of AI-generated content and protect privacy.

Method

Overall Framework

See Detailed Multimodal Block Architectures
See Detailed Secure Generation Module


Multimodal LDM of Diff-Dolly: First, it conducts an independent diffusion process in which the inputs $\mathbf{x}^\text{s}$ and $\mathbf{x}^\text{v}$ are encoded and noised. Second, it leverages a 4D U-Net with unimodal and cross-modal transformers for a unified denoising process, enhancing unimodal fidelity and cross-modal consistency. Furthermore, Diff-Dolly randomly injects a localized watermarking message into the latent spaces, with a detector that traces whether content is AI-generated.
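For illustration, the sketch below shows one joint reverse-diffusion step in which the speech and video latents are denoised together by a shared noise predictor. Here `model` stands in for the 4D U-Net with unimodal and cross-modal transformer blocks; its interface, the latent shapes, and the DDIM-style update are assumptions for exposition, not the released implementation.

```python
# Hypothetical PyTorch-style sketch of one joint reverse-diffusion step.
import torch

@torch.no_grad()
def joint_denoise_step(model, z_speech, z_video, t, cond, alphas_cumprod):
    """One DDIM-style reverse step applied jointly to both modalities.

    model: placeholder for the 4D U-Net noise predictor (interface assumed).
    z_speech: (B, C_s, T_s)        noisy speech latent
    z_video:  (B, C_v, T_v, H, W)  noisy video latent
    """
    # A single forward pass predicts noise for both modalities, so
    # cross-modal attention can enforce lip-sync and identity consistency.
    eps_s, eps_v = model(z_speech, z_video, t, cond)

    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

    def ddim_update(z, eps):
        # Deterministic DDIM update (eta = 0).
        x0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        return a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps

    return ddim_update(z_speech, eps_s), ddim_update(z_video, eps_v)
```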

Results

Qualitative Results

1. Voice Cloning Results

Instruction: Please compare the test speech with the Reference TTS Speech for speech naturalness (semantic naturalness), and compare the test speech with the Reference Voice for voice similarity.

Demo 1: but it's not enough (English)

Reference TTS Speech:

Reference Voice:

Ours

VITS

FastSpeech 2

HPM

V2C

Demo 2: she wanted to change policy at the government level (English)

Reference TTS Speech:

Reference Voice:

Ours

VITS

FastSpeech 2

HPM

V2C

Demo 3: und leichte flugzeuge fliegen weiter (German)

Reference TTS Speech:

Reference Voice:

Ours

VITS

FastSpeech 2

HPM

V2C

Demo 4: ese pedido es un mandato (Spanish)

Reference TTS Speech:

Reference Voice:

Ours

VITS

FastSpeech 2

HPM

V2C

Demo 5: elas faziam testes de aptidão musical (Portuguese)

Reference TTS Speech:

Reference Voice:

Ours

VITS

FastSpeech 2

HPM

V2C

See Mel-Spectrogram Comparison

As shown in the figure, in the first row VITS exhibits a duration mismatch with the ground truth, resulting in significant differences. In the third row, we observe severe over-smoothing in FastSpeech 2, HPM, and V2C, which degrades the reconstruction of details. In contrast, our results are closer to the ground truth, benefiting from the enhanced detail reconstruction and duration synchronization of Diff-Dolly.
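For reference, mel-spectrograms like those in the comparison figure can be extracted as in the sketch below; the sampling rate and STFT parameters are common defaults and may differ from the settings used for our figures.

```python
# Minimal sketch of mel-spectrogram extraction with librosa.
# Parameter values are common defaults, not necessarily those used here.
import librosa
import numpy as np

def mel_spectrogram(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # (n_mels, frames) in dB
```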


2. Lip Motion Cloning Results

Instruction: Please compare the test video with the Reference video for their naturalness and lip sync (Please unmute manually 🔇).

Demo 1: Show abuse the light of day by talking about it with your children your coworkers your friends and (English)

Demo 2: I see now i never was one and not the other (English)

Demo 3: Donc, si j'essaye de faire un poème. (French)

Demo 4: denn ich weiß, ihr wollt das zeug doch jetzt schon. gebt es zu. (German)

Demo 5: ou por não ser utilizada, porque não passou para a próxima geração (Portuguese)

See Image Comparison

As shown in the figure, on speaker B other methods produce a blurry mouth, whereas ours is noticeably sharper. On speaker C, our results show fewer artifacts and blend more naturally with the surrounding skin; in contrast, IP-LAP displays blurry artifacts and TalkLip generates boundary artifacts. In terms of lip sync, Diff-Dolly improves on the second-best method by 0.05 in LSE-D and 0.59 (a 28% relative improvement) in LMD, indicating better synchronization. On speaker A, Diff-Dolly produces videos with realistic lip shapes and natural motions, while TalkLip shows significant differences in the pronunciation of /w/.
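For clarity, LMD is commonly computed as the mean L2 distance between the mouth landmarks of generated and ground-truth frames; the sketch below assumes landmarks are extracted and aligned upstream and may differ from the exact evaluation protocol.

```python
# Hedged sketch of the landmark distance (LMD) metric: mean L2 distance
# between mouth landmarks of generated and ground-truth frames. Landmark
# extraction and any normalization are assumed to happen upstream.
import numpy as np

def lmd(pred_landmarks, gt_landmarks):
    """Both inputs: (num_frames, num_mouth_points, 2) pixel coordinates."""
    dists = np.linalg.norm(pred_landmarks - gt_landmarks, axis=-1)
    return dists.mean()  # lower is better
```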


3. Face Cloning Results

See Image Comparison

Specifically, we observe that the baselines tend to align with the attributes of the reference video, which reduces the diversity of the generated content due to their one-to-one mapping. Diff-Dolly incorporates only the landmark and background information from the reference video during inference, reducing dependency on the reference video and enhancing consistency with the source speaker's style. We also observe that DiffSwap, despite using a diffusion model, focuses on image generation and therefore shows poor temporal consistency in videos. In contrast, Diff-Dolly achieves strong temporal consistency in videos while remaining compatible with image generation.


4. Consistent Media Cloning Results

Instruction: Please unmute manually 🔇.

Text

aber auch eine ganz andere, eine sehr persönliche bedeutung.

Speech

Facial Image

Talking Video

Result

Text

weil es einfach eine problematik ist, die mich selbst betroffen hat.

Speech

Facial Image

Talking Video

Result

See Image Comparison

We compare the output quality of our method to a number of previous approaches: (a) shows temporally adjacent frames, while (b) shows frames further apart in time. While our method preserves the spatial structure of the input well and produces the desired output styles, previous methods tend to change even the semantic content of the frames.

Quantitative Results

We compare Diff-Dolly with previous state-of-the-art (SOTA) methods on CMC-TED; these methods come from different media cloning tasks.

User Study

The scoring results of the user study are presented in the figure. Diff-Dolly demonstrates a clear advantage over SOTA methods in all aspects, particularly in voice similarity, lip-sync accuracy, and video quality, which validates the effectiveness of our method for multimedia cloning. Furthermore, for SyncMOS scores of 4 to 5, our method received 100% of the user votes, whereas IP-LAP and TalkLip garnered only 60% and 53%, respectively. For VMOS scores of 4 to 5, our method received 93% of the user votes, whereas VITS and HPM garnered only 40% and 20%, respectively. Although the difference in IDMOS between Diff-Dolly and the top-performing methods is relatively small, we still achieve at least a 10% improvement in video quality on face cloning.
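For illustration, the share of 4-5 ratings per method can be tallied from raw per-user scores as in the hypothetical snippet below; the data layout and column names are assumptions, not our actual analysis code.

```python
# Illustrative tally of the share of 4-5 ratings per method from raw MOS
# scores; the data layout and column names are hypothetical.
import pandas as pd

def high_rating_share(ratings: pd.DataFrame) -> pd.Series:
    """ratings: one row per (user, method) with an integer 'score' in 1..5."""
    return (ratings.assign(high=ratings["score"] >= 4)
                   .groupby("method")["high"].mean()
                   .sort_values(ascending=False))
```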

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thanks to StreamV2V for demo inspiration, and to DreamBooth and Nerfies for the website template.