
Phantom

Subject-Consistent Video Generation via Cross-Modal Alignment

*Equal contribution, Project lead

Intelligent Creation Team, ByteDance


Episodes Created from Phantom-generated Videos


Identity Preserving Video Generation
Subject-to-video generation using a facial reference image. Phantom strictly preserves the identity of the reference face while generating vivid videos that follow the provided prompt.
A man wearing black sunglasses and a dark brown long windbreaker turns his back to the camera. The wind blows the hem of his clothes, creating a dynamic movement. The camera focuses on his weathered profile, capturing the lines and textures of his face. He stands on a modern city street in Ma Long, which is bustling with traffic and tall skyscrapers.
A young boy with tousled brown hair is crouching by a stream in a dense forest. He is dressed in casual clothing, with a blue hoodie and khaki pants, and his shoes are slightly muddy. The background shows the dappled light filtering through the tall trees and a carpet of fallen leaves. He faces the side, reaching out to touch the water, creating small ripples, and looks up with a curious expression.
A little girl dressed in a green and yellow dinosaur costume with a tail and a big head is grinning and jumping enthusiastically on a soft, brown sofa. The living room is brightly lit with sunlight coming in through a nearby window, casting a warm glow on the scene. The walls are painted a light blue, and there are books and toys scattered around. The girl's arms are flailing as she bounces, and her dinosaur ears are flopping with each jump.
A woman dressed in elegant court attire, a flowing gown, a pearl necklace, and a delicate ivory fan in her gloved hands. She fanned herself slowly, her movements graceful and deliberate. The background was a grand European castle hall, with a huge domed ceiling, smooth marble floors, and exquisite tables, chairs, and candlesticks, emitting soft and warm lighting.
A man in a sharp black suit, paired with a crisp white shirt and a dark tie, is walking down a busy city street. He holds a steaming cup of coffee in one hand, taking occasional sips as he goes. The street is lined with tall buildings and bustling with people. Sunlight filters through the tall windows of nearby skyscrapers, casting a warm glow on the pavement.
An older gentleman with a beard and salt-and-pepper hair is mountain biking on a rugged trail. He wears a green, moisture-wicking jersey and padded shorts. The challenging terrain features steep inclines and rocky paths, set against the backdrop of a stunning mountain range. His eyes are locked on the path ahead as he navigates the twists and turns with skill and enthusiasm, illustrating an adventurous and spirited moment.
A young tenor with tousled, sandy-blonde hair sings passionately under a single spotlight in an ornate concert hall. He is clad in a sharp, charcoal-grey suit with a vibrant red tie that adds a pop of color. The scene is set against a backdrop of deep blue curtains with golden trim, and the audience is barely visible in the dim light, entranced by his powerful voice. His fervent expression and the way he clutches the microphone stand convey raw emotion and dedication.
A tall, elegant woman with curly, auburn hair stands center stage in an opulent opera house. She is dressed in a flowing, emerald-green gown with intricate gold embroidery that glimmers under the stage lights. The background showcases velvety red curtains, ornate golden designs, and an audience that watches in rapt attention. She lifts her arms expressively, her face reflecting the emotion of the aria she sings, creating a powerful and captivating atmosphere.
A refined cellist with sleek, chestnut hair sits center stage in an elegant orchestra pit. She wears a floor-length, cobalt blue dress that contrasts beautifully with the golden wood of her cello. The background is filled with crimson curtains, opulent antique decor, and fellow musicians absorbed in their instruments. Her bow hand moves with grace and precision, her head gently swaying with the melancholic melody, creating an atmosphere of profound and moving beauty.
Single Reference Subject-to-Video Generation
Subject-to-video generation using a single reference image. Phantom can maintain the integrity of various types of subjects, including objects, clothing, animals, virtual characters, and more.
The man puts on the sneakers and starts jogging along the riverbank.
A woman in a snow leopard print coat stood in the middle of the ice, looking around, her long hair fluttering in the wind.
This bottle of perfume is placed on the beach by the sea, and the waves gently hit it.
The little white dog is bouncing and running from a distance, with the living room in the background, blinking its eyes and smiling at the camera.
The rabbit bounces on the soft mattress and falls from the bed to the wooden floor.
The character is fishing by the river.
On the brightly lit animated stage, a woman is standing in front of a microphone, singing passionately.
A burly man walks around a glittering Christmas tree, which is covered with colorful decorations and twinkling lights. The man has a kind smile on his face and looks around, with a black belt hanging around his waist.
The woman is walking by the seaside in a bikini, then confidently turns around to show the details of the outfit.
Multi-Reference Subject-to-Video Generation
Subject-to-video generation using multiple reference images. Phantom can achieve realistic interactions between multiple subjects, such as group interactions, product demonstrations, virtual try-on, and more.
A character wearing an outfit in a location turning around and posing to the camera.
A woman strolls in the park wearing a dress, with a large area of beautiful flowers in the background.
A woman wearing a green bikini strolls by the river and then poses for a photo.
The man raised his camera to take pictures on the mountaintop, and the lens moved from near to far, showing the beautiful scenery in front of the man.
A fashionable woman stared at a blue bag in the store, then picked it up and observed it carefully.
He held the doll against his chest and raised it above his head. Suddenly, the doll exploded, and the pieces were scattered everywhere.
The two of them were flipping through books in the study.
They sat in front of the easel in the studio and painted, with the camera getting closer and closer.
They were fiercely discussing the solution to a math problem in front of the blackboard, pointing and pointing at the blackboard.

Ethics Concerns

The images used in these demos are sourced from the public domain or generated by models, and are intended solely to showcase the capabilities of this research. If you have any concerns, please contact us at tianxiang.ma@bytedance.com, and we will promptly remove them.

Acknowledgements

We would like to express our gratitude to the SEED team for their support. Special thanks to Lu Jiang, Haoyuan Guo, Zhibei Ma, and Sen Wang for their assistance with the model and data. We are also very grateful to Siying Chen, Qingyang Li, and Wei Han for their help with the evaluation.

BibTeX

@article{liu2025phantom,
  title={Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment},
  author={Liu, Lijie and Ma, Tianxiang and Li, Bingchuan and Chen, Zhuowei and Liu, Jiawei and He, Qian and Wu, Xinglong},
  journal={arXiv preprint arXiv:2502.11079},
  year={2025}
}