Motivation
Existing TI2V methods tend to generate videos with limited, nearly identical motion when given the same image and different text prompts. We hypothesize that this limitation arises from insufficient emphasis on motion patterns during training. As illustrated in Figure (a), in a video with a static background, 97% of pixels remain unchanged and only 3% show meaningful motion. Such subtle motion is easily overlooked in the standard TI2V training pipeline, where the L2 loss weights all regions equally. This can result in "Condition Leakage," where the loss becomes low simply by copying the condition frames. To address this, we propose Motion Focal loss (MotiF), which re-weights the loss with a motion heatmap so that TI2V training focuses on regions with more motion.
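To make the re-weighting idea concrete, the following is a minimal NumPy sketch, not the actual MotiF implementation. It rests on several assumptions that are not stated in the text above: the motion heatmap is approximated by the absolute difference from the first frame, the per-pixel weight takes the form 1 + lam * heatmap, and the names motion_heatmap, motion_weighted_l2, and lam are hypothetical. The toy video mirrors the 97% / 3% split described above, and the "prediction" simply copies the condition frame.

```python
import numpy as np


def motion_heatmap(video, eps=1e-8):
    """Return a (T, H, W) heatmap in [0, 1] for a (T, H, W) video.

    Assumption for illustration: motion is approximated by the absolute
    difference between each frame and the first (condition) frame.
    """
    diff = np.abs(video - video[:1])
    return diff / (diff.max() + eps)


def motion_weighted_l2(pred, target, heatmap, lam=1.0):
    """Mean squared error with per-pixel weights 1 + lam * heatmap.

    Assumption: this weighting form is a guess at how a motion heatmap
    could re-weight the L2 loss; lam=0 recovers the uniform L2 loss.
    """
    sq_err = (pred - target) ** 2
    weights = 1.0 + lam * heatmap
    return float(np.mean(weights * sq_err))


# Toy video: 4 frames of 10x10 pixels with a static, all-zero background.
# After the first frame, 3 of 100 pixels change, so 97% stay unchanged.
video = np.zeros((4, 10, 10))
video[1:, 5, 0:3] = 1.0

# A degenerate prediction that just copies the condition (first) frame.
copy_pred = np.repeat(video[:1], video.shape[0], axis=0)

heatmap = motion_heatmap(video)
uniform_loss = motion_weighted_l2(copy_pred, video, heatmap, lam=0.0)
weighted_loss = motion_weighted_l2(copy_pred, video, heatmap, lam=10.0)

print("uniform L2 loss:        ", uniform_loss)
print("motion-weighted L2 loss:", weighted_loss)
```

Under the uniform loss, copying the condition frame is penalized only through the few moving pixels, so the loss stays small. With the motion weights, the same copied prediction incurs a noticeably larger loss, which is the behavior the paragraph above argues for.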