Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset

(* equal contributions, † Project Lead)
Intelligent Creation Lab, ByteDance   

We introduce Phantom-Data, the first general-purpose large-scale cross-pair dataset aimed at addressing the notorious copy-paste problem in subject-to-video generation. Phantom-Data is built upon three key pillars:
1. Dataset: It comprises approximately one million identity-consistent pairs spanning a wide range of subject categories and visual contexts.
2. Dataset Pipeline: We propose a structured and scalable data construction pipeline designed to build such a dataset.
3. Systematic Study: We conduct a comprehensive study of how varying the training data affects subject-to-video model performance.

Abstract

Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, often referred to as the copy-paste problem, stems from the prevalent in-pair training paradigm, which entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.

Dataset Overview

We construct a large-scale, high-quality cross-pair consistency dataset comprising approximately one million identity-consistent pairs, including over 30,000 multi-subject scenes.
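
To make the structure of a cross-pair training record concrete, below is a minimal sketch of what a single entry could look like. The field names, file paths, and the CrossPairSample class are hypothetical illustrations under our own assumptions, not the released schema; the key property is that the reference crops are drawn from different scenes than the target video.

from dataclasses import dataclass
from typing import List

@dataclass
class CrossPairSample:
    """One hypothetical identity-consistent training pair.

    The reference images and the target video come from different scenes,
    so a model cannot simply copy the reference background. Field names
    are illustrative only.
    """
    target_video_path: str         # video the model learns to generate
    target_caption: str            # text prompt describing the target video
    reference_images: List[str]    # cross-context crops of the same subject(s)
    subject_categories: List[str]  # one entry per subject, e.g. ["person", "dog"]

# A multi-subject scene simply carries several reference crops,
# one for each subject that must remain identity-consistent.
sample = CrossPairSample(
    target_video_path="clips/000123.mp4",
    target_caption="A woman walks a dog along a rainy street.",
    reference_images=["refs/person_88121.jpg", "refs/dog_4410.jpg"],
    subject_categories=["person", "dog"],
)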

Dataset Pipeline

1. A subject-centric detection module, optimized for this task, that extracts general and input-aligned subjects.
2. A large-scale cross-context retrieval system that provides cross-pair candidates from diverse visual contexts.
3. A prior-guided identity verification procedure that ensures the retrieved subjects share a consistent identity.
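
The three stages listed above can be summarized in a short sketch. The helper objects (detector, retrieval_index, verifier), their method names, and the similarity threshold are placeholders under our own assumptions, not the actual implementation described in the paper.

# Minimal sketch of the three-stage cross-pair construction pipeline.
# All helpers are placeholders for the components listed above.
def build_cross_pairs(target_video, detector, retrieval_index, verifier,
                      top_k=50, sim_threshold=0.8):
    pairs = []
    # Stage 1: subject-centric detection -- find subjects in the target
    # video that a user could plausibly reference.
    subjects = detector.detect_subjects(target_video)

    for subject in subjects:
        # Stage 2: cross-context retrieval -- search a large image/video
        # corpus for the same subject appearing in different scenes.
        candidates = retrieval_index.search(subject.embedding, top_k=top_k)

        # Stage 3: prior-guided identity verification -- keep only candidates
        # whose identity matches the subject under contextual variation.
        references = [c for c in candidates
                      if c.source_id != target_video.source_id
                      and verifier.same_identity(subject, c) >= sim_threshold]

        if references:
            pairs.append({"target": target_video,
                          "subject": subject,
                          "references": references})
    return pairs

Discarding candidates that share the target video's source is what enforces the cross-pair property: every kept reference shows the same subject in a different context.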

Study on different training datasets

We evaluate our method against three representative baselines: (1) in-pair training, which samples the reference subject from the same video; (2) in-pair training with copy-augmentation, which introduces spatial and appearance augmentations to reduce overfitting; and (3) face-based cross-pair training, which relies on face-level identity matching across videos. Across multiple prompts and subject categories, models trained with in-pair data consistently fail to follow textual instructions and often generate videos with obvious artifacts. In contrast, our cross-pair-trained model aligns with the prompt in all cases, producing coherent and faithful subject-driven videos.
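
The practical difference between these setups comes down to where the reference subject is sampled from, as sketched below. The mode names, the frame and crop attributes, and the augment helper are illustrative assumptions rather than the actual training code.

import random

def augment(image):
    # Placeholder for the spatial/appearance augmentations used by the
    # copy-augmentation baseline (e.g., random resize, color jitter).
    return image

def make_training_example(target_video, mode, cross_pair_refs=None):
    """Hypothetical reference selection for the compared training setups."""
    if mode == "cross_pair":
        # Phantom-Data setting: the reference subject comes from a different
        # scene, so identity is disentangled from background and context.
        reference = random.choice(cross_pair_refs)
    else:
        # In-pair baselines: the reference is cropped from a frame of the
        # target video itself, which encourages the copy-paste shortcut.
        frame = random.choice(target_video.frames)
        reference = frame.crop(frame.subject_box)
        if mode == "in_pair_copy_aug":
            reference = augment(reference)
    return {"reference": reference,
            "video": target_video,
            "caption": target_video.caption}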

Why don't we use synthetic data?

We test two state-of-the-art models, GPT-4o and DreamO, on generating the same subject in different contexts. The results show that these models still produce inconsistent subjects. In contrast, our non-synthetic cross-pair data construction pipeline provides exactly the same subject across different contexts.

BibTeX

If you find our work useful, please consider citing our papers:

@article{chen2025phantom-data,
  title={Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset},
  author={Chen, Zhuowei and Li, Bingchuan and Ma, Tianxiang and Liu, Lijie and Liu, Mingcong and Zhang, Yi and Li, Gen and Li, Xinghui and Zhou, Siyu and He, Qian and Wu, Xinglong},
  journal={arXiv preprint arXiv:2506.18851},
  year={2025}
}

@article{liu2025phantom,
  title={Phantom: Subject-consistent video generation via cross-modal alignment},
  author={Liu, Lijie and Ma, Tianxiang and Li, Bingchuan and Chen, Zhuowei and Liu, Jiawei and Li, Gen and Zhou, Siyu and He, Qian and Wu, Xinglong},
  journal={arXiv preprint arXiv:2502.11079},
  year={2025}
}