AI for Content Creation Workshop

June 12th @ CVPR 2025

Karl F. Dean Grand Ballroom A1, 4th Floor, Music City Center, Nashville, TN, USA

Remote (Zoom): Via CVPR site



Summary

Content creation plays a crucial role in domains such as photography, videography, virtual reality, gaming, art, design, fashion, and advertising. Recent progress in machine learning and AI has transformed hours of manual, painstaking content creation work into minutes or seconds of automated or interactive work. For instance, generative modeling approaches can produce photorealistic 2D and 3D content such as humans, landscapes, interior scenes, virtual environments, clothing, or even industrial designs. New large text, image, and video models that share latent spaces let us imaginatively describe scenes and have them realized automatically, with new multi-modal approaches able to generate consistent video and audio across long timeframes. Such approaches can also super-resolve and temporally upsample videos, interpolate and extrapolate between photos and videos with intermediate novel views, decompose scene objects and appearance, and transfer styles to convincingly render and reinterpret content. Learned priors of images, videos, and 3D data can also be combined with explicit appearance and geometric constraints, perceptual understanding, or even functional and semantic constraints of objects. While often producing awe-inspiring artistic imagery, such techniques also offer unique opportunities for generating diverse synthetic training data for downstream computer vision tasks across image, video, and 3D domains.

The AI for Content Creation workshop explores this exciting and fast-moving research area. We bring together world-class invited speakers in content creation, up-and-coming researchers, and authors of submitted workshop papers for a day of learning, discussion, and networking.

Welcome! -
Deqing Sun (Google)
Lingjie Liu (University of Pennsylvania)
Krishna Kumar Singh (Adobe)
Lu Jiang (ByteDance)
Jun-Yan Zhu (Carnegie Mellon University)
James Tompkin (Brown University)



Firefly Video (Adobe, 2025), Genie 2 (DeepMind, 2024), Sora (OpenAI, 2024).

2025 Schedule

Morning session:
Time CDT
08:45 Welcome and introductions 👋
09:00 Maneesh Agrawala (Stanford University)
09:30 Kai Zhang (Adobe)
10:00 Coffee break
10:30 Charles Herrmann (Google)
11:00 Mark Boss (Stability AI)
11:30 Poster session 1 - ExHall D #412-431
  1. Parallel Rescaling: Rebalancing Consistency Guidance for Personalized Diffusion Models
    JungWoo Chae (Nexon Korea); Jiyoon Kim (LGCNS); Sangheum Hwang (Seoul National University of Science and Technology)
  2. EOPose: Exemplar-based object reposing using Generalized Pose Correspondences
    Sarthak Mehrotra (Indian Institute of Technology Bombay); Rishabh Jain (Adobe); Mayur Hemani (Adobe); Balaji Krishnamurthy (Adobe); Mausoom Sarkar (Adobe)
  3. Don't Mesh with Me: Generating Constructive Solid Geometry Instead of Meshes by Fine-Tuning a Code-Generation LLM
    Maximilian Mews (HU Berlin); Ansar Aynetdinov (HU Berlin); Vivian Schiller (RWTH Aachen); Peter Eisert (HU Berlin); Alan Akbik (HU Berlin)
  4. Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling
    Junhong Lee (POSTECH); Seungwook Kim (POSTECH, ByteDance); Minsu Cho (POSTECH)
  5. MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation
    Sihyun Yu (KAIST); Meera Hahn (Google DeepMind); Dan Kondratyuk (Luma AI); Jinwoo Shin (KAIST); Agrim Gupta (Google DeepMind); José Lezama (Google DeepMind); Irfan Essa (Google DeepMind); David Ross (Google DeepMind); Jonathan Huang (Scaled Foundations)
  6. Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality
    Pramook Khungurn (pixiv, Inc.); Phonphrm Thawatdamrongkit (VISTEC); Sukit Seripanitkarn (VISTEC); Supasorn Suwajanakorn (VISTEC)
  7. Generating Animated Layouts as Structured Text Representations
    Yeonsang Shin (Seoul National University); Jihwan Kim (Seoul National University); Yumin Song (Seoul National University); Kyungseung Lee (SK telecom); Hyunhee Chung (SK telecom); Taeyoung Na (SK telecom)
  8. CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
    Dejia Xu (University of Texas at Austin); Weili Nie (NVIDIA); Chao Liu (NVIDIA); Sifei Liu (NVIDIA); Jan Kautz (NVIDIA); Zhangyang Wang (University of Texas at Austin); Arash Vahdat (NVIDIA) [https://ir1d.github.io/CamCo/]
  9. LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations
    Tung Do (Movian Research); Thuan Nguyen (MBZUAI); Anh Tran (Movian Research); Rang Nguyen (VinAI Research); Binh-Son Hua (Trinity College Dublin)
  10. Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
    Liu He (Purdue University); Yizhi Song (Purdue University); Hejun Huang (University of Michigan); Pinxin Liu (University of Rochester); Yunlong Tang (University of Rochester); Daniel Aliaga (Purdue University); Xin Zhou (Baidu USA)
  11. VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment
    Wenyan Cong (University of Texas at Austin); Hanqing Zhu (University of Texas at Austin); Kevin Wang (University of Texas at Austin); Jiahui Lei (University of Pennsylvania); Colton Stearns (Stanford University); Yuanhao Cai (Johns Hopkins University); Dilin Wang (Meta); Rakesh Ranjan (Meta); Matt Feiszli (Meta); Leonidas Guibas (Stanford University); Atlas Wang (University of Texas at Austin); Weiyao Wang (Meta); Zhiwen Fan (University of Texas at Austin) [https://videolifter.github.io/]
  12. HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
    Jinbin Bai (National University of Singapore); Wei Chow (National University of Singapore); Ling Yang (Peking University); Xiangtai Li (Skywork AI); Juncheng Li (National University of Singapore); Hanwang Zhang (Nanyang Technological University); Shuicheng Yan (National University of Singapore) [https://github.com/viiika/HumanEdit]
  13. DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description
    Adrienne Deganutti (University of Surrey); Simon Hadfield (University of Surrey); Andrew Gilbert (University of Surrey) [https://andrewjohngilbert.github.io/DANTE-AD/]
  1. Stable Flow: Vital Layers for Training-Free Image Editing
    Omri Avrahami (The Hebrew University of Jerusalem); Or Patashnik (Tel Aviv University); Ohad Fried (Reichman University); Egor Nemchinov (Snap); Kfir Aberman (Snap); Dani Lischinski (The Hebrew University of Jerusalem); Daniel Cohen-Or (Tel Aviv University) — CVPR 2025
  2. HyperGS: Hyperspectral 3D Gaussian Splatting
    Christopher Thirgood (University of Surrey); Oscar Mendez (University of Surrey); Erin Ling (University of Surrey); Jon Storey (i3D Robotics); Simon Hadfield (University of Surrey) — CVPR 2025
  3. Tiled Diffusion
    Or Madar (Reichman University); Ohad Fried (Reichman University) [https://madaror.github.io/tiled-diffusion.github.io/] — CVPR 2025
  4. DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models
    Shwetha Ram (Amazon) — WACV 2025
  5. VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors
    Juil Koo (KAIST); Paul Guerrero (Adobe Research); Chun-Hao Huang (Adobe Research); Duygu Ceylan (Adobe Research); Minhyuk Sung (KAIST) — CVPR 2025
  6. HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks
    Maria Pilligua Costa (Computer Vision Center (CVC), Universitat Autònoma de Barcelona (UAB)); Danna Xue (Northwestern Polytechnical University, Computer Vision Center (CVC), Universitat Autònoma de Barcelona (UAB)); Javier Vazquez-Corral (Computer Vision Center (CVC), Universitat Autònoma de Barcelona (UAB)) — CVPR 2025
  7. Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation
    Rajeev Goel (Arizona State University); Utkarsh Nath (Arizona State University); Eun Som Jeon (Seoul National University of Science and Technology); Kyle Min (Intel Labs); Changhoon Kim (Arizona State University); Pavan Turaga (Arizona State Univerisity) [https://moment-3d.github.io/] — WACV 2025
12:30 Lunch break - ExHall C 🥪


Cat4D (Google, 2024), AssetGen (Meta, 2024), DreamFusion (Google, 2022).


Afternoon session:
Time CDT
13:30 Oral session + best paper announcement + best presentation competition
14:00 Yutong Bai (UC Berkeley)
14:30 Nanxuan (Cherry) Zhao (Adobe)
15:00 Coffee break
15:30 Ishan Misra (with Rohit Girdhar) (Meta)
16:00 Panel discussion — Open Source in AI and the Creative Industry 🗣️
17:00 Poster session 2 - ExHall D #412-431
  1. Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks
    Shiyu Xia (AI Lab, Giant Network); Junjie Zheng (AI Lab, Giant Network); Chaoyi Wang (Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences); Zihao Chen (AI Lab, Giant Network); Chaofan Ding (AI Lab, Giant Network); Xiaohao Zhang (AI Lab, Giant Network); Xi Tao (AI Lab, Giant Network); Xiaoming He (School of Life Sciences, Fudan University); Xinhan Di (Deepearthgo)
  2. Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion
    Minseo Kim (KAIST); Minchan Kwon (KAIST); Dongyeun Lee (KAIST); Yunho Jeon (Hanbat University); Junmo Kim (KAIST)
  3. Vectorized Region Based Brush Strokes for Artistic Rendering
    Jeripothula Prudviraj (TCS Research); Vikram Jamwal (TCS Research)
  4. GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting
    Anushka Agarwal (University of Massachusetts Amherst); Yusuf Hassan (University of Massachusetts Amherst); Talha Chafekar (University of Massachusetts Amherst)
  5. DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing
    Qi Li (Amazon); Shuwen Qiu (UCLA); Kee Kiat Koo (Amazon); Julien Han (UCLA); Karim Bouyarmane (Amazon)
  6. Is Concatenation Really All You Need? Efficient Concatenation-Based Pose Conditioning and Pose Control for Virtual Try On
    Qi Li (Amazon); Shuwen Qiu (UCLA); Kee Kiat Koo (Amazon); Julien Han (Amazon)
  7. Art3D: Training-Free 3D Generation from Flat-Colored Illustration
    Xiaoyan Cong (Brown University); Jiayi Shen (Brown University); Zekun Li (Brown University); Rao Fu (Brown University); Tao Lu (Brown University); Srinath Sridhar (Brown University) [https://joy-jy11.github.io/]
  8. Is Your Text-to-Image Model Robust to Caption Noise?
    Weichen Yu (University of Chinese Academy of Sciences); Ziyang Yang (ByteDance); Shanchuan Lin (ByteDance); Qi Zhao (ByteDance); Jianyi Wang (ByteDance); Liangke Gui (ByteDance); Matt Fredrikson (CMU); Lu Jiang (ByteDance)
  9. Training-Free Sketch-Guided Diffusion with Latent Optimization
    Sandra Zhang Ding (The University of Tokyo); Kiyoharu Aizawa (The University of Tokyo); Jiafeng Mao (The University of Tokyo)
  10. InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On
    Meng Han (Amazon.com); Shuwen Qiu (UCLA); Qi Li (Amazon.com); Xingzi Xu (Duke University); Kavosh Asadi (Amazon.com); Karim Bouyarmane (Amazon.com)
  11. Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models
    Ketan Suhaas Saichandran (Boston University); Xavier Thomas (Boston University); Prakhar Kaushik (Johns Hopkins University); Deepti Ghadiyaram (Boston University)
  1. Enhancing Creative Generation on Stable Diffusion-based Models
    Jiyeon Han (Korea Advanced Institute of Science and Technology); Dahee Kwon (Korea Advanced Institute of Science and Technology); Gayoung Lee (NAVER AI Lab); Junho Kim (NAVER AI Lab); Jaesik Choi (Korea Advanced Institute of Science and Technology) — CVPR 2025
  2. NamedCurves: Learned Image Enhancement via Color Naming
    David Serrano-Lozano (Computer Vision Center); Luis Herranz (Universidad Autónoma de Madrid); Michael S. Brown (York University); Javier Vazquez-Corral (Computer Vision Center) — ECCV 2024
  3. ScribbleLight: Single Image Indoor Relighting with Scribbles
    Jun Myeong Choi (University of North Carolina at Chapel Hill); Annie Wang (University of North Carolina at Chapel Hill); Pieter Peers (College of William & Mary); Anand Bhattad (Toyota Technological Institute at Chicago); Roni Sengupta (University of North Carolina at Chapel Hill) — CVPR 2025
  4. 4K4DGen: Panoramic 4D Generation at 4K Resolution
    Renjie Li (Texas A&M University); Bangbang Yang (ByteDance); Zhiwen Fan (The University of Texas at Austin); Dejia Xu (The University of Texas at Austin); Tingting Shen (XMU); Xuanyang Zhang (StepFun AI); Shijie Zhou (UCLA); Zeming Li (ByteDance); Achuta Kadambi (UCLA); Zhangyang Wang (The University of Texas at Austin); Zhengzhong Tu (Texas A&M University); Panwang Pan (ByteDance) — ICLR 2025
  5. LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting
    Xiaoyan Xing (University of Amsterdam); Konrad Groh (Bosch); Sezer Karaoglu (University of Amsterdam); Theo Gevers (University of Amsterdam); Anand Bhattad (Toyota Technological Institute at Chicago) — CVPR 2025
  6. Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
    Lee Chae-Yeon (POSTECH); Oh Hyun-Bin (POSTECH); Han EunGi (POSTECH); Kim Sung-Bin (POSTECH); Suekyeong Nam (KRAFTON); Tae-Hyun Oh (KAIST) — CVPR 2025
  7. T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
    Kaiyue Sun (The University of Hong Kong) [https://t2v-compbench-2025.github.io/] — CVPR 2025
  8. MixerMDM: Learnable Composition of Human Motion Diffusion Models
    Pablo Ruiz Ponce (University of Alicante) — CVPR 2025
  9. Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
    Yu Yuan (Purdue University); Xijun Wang (Purdue University); Yichen Sheng (NVIDIA); Prateek Chennuri (Purdue University); Xingguang Zhang (Purdue University); Stanley Chan (Purdue University) [https://generative-photography.github.io/project/] — CVPR 2025


DALL-E 2 (OpenAI, 2022), Imagen (Google, 2022), GauGAN2 (NVIDIA, 2021).

Previous Workshops (including session videos)