DeFactOfficial committed
Commit 94c9a29 · verified · 1 Parent(s): 7603095

Change settings UI

Files changed (1)
  1. app.py +46 -22
app.py CHANGED
@@ -17,20 +17,20 @@ from utils.utils import instantiate_from_config
  from scheduler.t2v_turbo_scheduler import T2VTurboScheduler
  from pipeline.t2v_turbo_vc2_pipeline import T2VTurboVC2Pipeline
 
- DESCRIPTION = """# T2V-Turbo 🚀
-
- Our model is distilled from [VideoCrafter2](https://ailab-cvc.github.io/videocrafter2/).
-
- T2V-Turbo learns a LoRA on top of the base model by aligning to the reward feedback from [HPSv2.1](https://github.com/tgxs002/HPSv2/tree/master) and [InternVid2 Stage 2 Model](https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4).
-
- T2V-Turbo-v2 optimizes the training techniques by finetuning the full base model and further aligns to [CLIPScore](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K).
-
- T2V-Turbo trains on pure WebVid-10M data, whereas T2V-Turbo-v2 carefully optimizes different learning objectives with a mixture of VidGen-1M and WebVid-10M data.
-
- Moreover, T2V-Turbo-v2 supports distilling motion priors from the training videos.
-
- [Project page for T2V-Turbo](https://t2v-turbo.github.io) 🥳
 
  [Project page for T2V-Turbo-v2](https://t2v-turbo-v2.github.io) 🤓
  """
  if torch.cuda.is_available():
@@ -70,20 +70,20 @@ example_txt = [
      "A musician strums his guitar, serenading the moonlit night.",
  ]
 
- examples = [[i, 7.5, 0.5, 16, 16, 0, True, "bf16"] for i in example_txt]
 
  @spaces.GPU(duration=120)
  @torch.inference_mode()
  def generate(
      prompt: str,
      guidance_scale: float = 7.5,
      percentage: float = 0.5,
      num_inference_steps: int = 4,
      num_frames: int = 16,
      seed: int = 0,
      randomize_seed: bool = False,
      param_dtype="bf16",
-     motion_gs: float = 0.05,
      fps: int = 8,
  ):
 
@@ -167,35 +167,50 @@ if __name__ == "__main__":
  demo = gr.Interface(
      fn=generate,
      inputs=[
-         Textbox(label="", placeholder="Please enter your prompt. \n"),
          gr.Slider(
-             label="Guidance scale",
-             minimum=2,
-             maximum=14,
              step=0.1,
              value=7.5,
          ),
          gr.Slider(
-             label="Percentage of steps to apply motion guidance (v2 w/ MG only)",
              minimum=0.0,
-             maximum=0.5,
              step=0.05,
              value=0.5,
          ),
          gr.Slider(
-             label="Number of inference steps",
-             minimum=4,
-             maximum=50,
              step=1,
              value=16,
          ),
          gr.Slider(
              label="Number of Video Frames",
              minimum=16,
-             maximum=48,
              step=8,
              value=16,
          ),
          gr.Slider(
              label="Seed",
              minimum=0,
@@ -210,8 +225,17 @@ if __name__ == "__main__":
              label="torch.dtype",
              value="bf16",
              interactive=True,
-             info="Dtype for inference. Default is bf16.",
-         )
      ],
      outputs=[
          gr.Video(label="Generated Video", width=512, height=320, interactive=False, autoplay=True),
 
@@ -17,20 +17,20 @@ from utils.utils import instantiate_from_config
  from scheduler.t2v_turbo_scheduler import T2VTurboScheduler
  from pipeline.t2v_turbo_vc2_pipeline import T2VTurboVC2Pipeline
 
+ DESCRIPTION = """# T2V-Turbo-v2 🚀
+ ## A fast and efficient txt2video model that doesn't suck
+
+ This space was forked from the original so that I can fix whatever is causing its API not to work with HuggingChat's tools interface.
+
+ You know, because it would be really cool to combine an LLM with a text2video model that's fast, decent quality, and open source.
+
+ I've also increased the upper bounds of some params and made other params adjustable in the UI that were previously locked. Please read the info text, because some of them are likely not worth messing with, but I like to give users the freedom to explore.
+
+ The TL;DR on this model is that it was distilled from VideoCrafter 2 and ended up beating the parent model on all of the benchmarks, even though it's smaller and MUCH faster.
+
+ Don't get TOO excited, though: the paper claims it beats Kling and Runway Gen-3 on comprehensive benchmark scores, but this ain't Gen-3, it's just not. It's a low-res, high-efficiency txt2video engine that's perfect for recreational use and integration with chatbots, but it won't be winning any Oscars.
+
+ Official Project Page with links to Papers, GitHub Code, and Leaderboard:
  [Project page for T2V-Turbo-v2](https://t2v-turbo-v2.github.io) 🤓
  """
  if torch.cuda.is_available():
 
@@ -70,20 +70,20 @@ example_txt = [
      "A musician strums his guitar, serenading the moonlit night.",
  ]
 
+ examples = [[i, 7.5, 0.5, 0.05, 16, 16, 0, True, "bf16", 8] for i in example_txt]
 
  @spaces.GPU(duration=120)
  @torch.inference_mode()
  def generate(
      prompt: str,
      guidance_scale: float = 7.5,
+     motion_gs: float = 0.05,
      percentage: float = 0.5,
      num_inference_steps: int = 4,
      num_frames: int = 16,
      seed: int = 0,
      randomize_seed: bool = False,
      param_dtype="bf16",
      fps: int = 8,
  ):
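For context on why each example row grew from 8 to 10 values: `gr.Interface` feeds a clicked example row into the `inputs` components positionally, so every row must supply one value per input, in the same order as `generate`'s parameters. A minimal sketch of that mapping (illustrative only, not part of app.py):

```python
# Illustrative sketch: how one example row lines up with generate()'s parameters,
# assuming the new signature shown in this diff.
example_row = ["A musician strums his guitar, serenading the moonlit night.",
               7.5, 0.5, 0.05, 16, 16, 0, True, "bf16", 8]

param_names = [
    "prompt", "guidance_scale", "motion_gs", "percentage",
    "num_inference_steps", "num_frames", "seed", "randomize_seed",
    "param_dtype", "fps",
]

# The row is consumed positionally, i.e. roughly generate(*example_row),
# so its order has to match both the inputs list and the signature.
kwargs = dict(zip(param_names, example_row))
print(kwargs["motion_gs"], kwargs["percentage"])  # -> 0.5 0.05 with the row above
```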
 
 
@@ -167,35 +167,50 @@ if __name__ == "__main__":
  demo = gr.Interface(
      fn=generate,
      inputs=[
+         Textbox(label="", placeholder="Please enter your prompt"),
          gr.Slider(
+             label="CFG Guidance",
+             minimum=1,
+             maximum=21,
              step=0.1,
              value=7.5,
+             info="Behaves like CFG guidance on a txt2img diffusion model... 7.5 does indeed appear to be the sweet spot, but for certain prompts you may wish to adjust"
          ),
          gr.Slider(
+             label="MGS Guidance (Don't Change This)",
+             minimum=0.0,
+             maximum=1.0,
+             step=0.01,
+             value=0.05,
+             info="No idea where they came up with the default of 0.05 or why they're so certain it's optimal, since it's not mentioned in the paper. I've therefore opened it up for experimentation, with very low expectations"
+         ),
+
+         gr.Slider(
+             label="Motion Guidance Percentage (Don't Change This)",
              minimum=0.0,
+             maximum=0.8,
              step=0.05,
              value=0.5,
+             info="The authors specifically say in their paper that it's important to apply MG to only the first N inference steps out of M total steps. But the ideal value of N/M is not mentioned, so it may be worth playing with"
          ),
+
          gr.Slider(
+             label="Inference Steps",
+             minimum=2,
+             maximum=200,
              step=1,
              value=16,
+             info="This is an interesting one, because increasing step count is the equivalent of techniques like CoT that we use to increase test-time compute in LLMs. In general, more steps = lower loss (higher quality). But the relationship is asymptotic and returns quickly diminish... Opened this up in case it's needed for certain use cases; otherwise leave @ 16"
          ),
          gr.Slider(
              label="Number of Video Frames",
              minimum=16,
+             maximum=96,
              step=8,
              value=16,
+             info="Generated video length = number of frames / FPS. The benchmark evals involved 16 frames, to my knowledge. It is unclear how high you can go before consistency falls apart... but it would be lovely to get 96 frames at 24 fps of high-quality video. Probably won't happen, but just in case, feel free to try"
          ),
+
          gr.Slider(
              label="Seed",
              minimum=0,
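One note on how the Frames and FPS settings interact: the info text above gives the relationship video length = number of frames / FPS, so the new upper bounds translate directly into clip durations. A quick arithmetic sketch (illustrative only, not part of app.py):

```python
# Illustrative arithmetic: clip duration implied by the new slider bounds.
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    return num_frames / fps

print(clip_duration_seconds(16, 8))   # 2.0 s  -> the defaults
print(clip_duration_seconds(96, 24))  # 4.0 s  -> max frames at max FPS
print(clip_duration_seconds(96, 8))   # 12.0 s -> max frames at min FPS
```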
 
@@ -210,8 +225,17 @@ if __name__ == "__main__":
              label="torch.dtype",
              value="bf16",
              interactive=True,
+             info="bf16 is fast and high quality. End users should not change this setting",
+         ),
+         gr.Slider(
+             label="Desired Output FPS",
+             minimum=8,
+             maximum=24,
+             step=8,
+             value=8,
+             info="Higher = smoother, lower = longer video; purely a matter of preference"
+         ),
+
      ],
      outputs=[
          gr.Video(label="Generated Video", width=512, height=320, interactive=False, autoplay=True),
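Since the new DESCRIPTION says the point of the fork is to get this Space's API working with HuggingChat's tools interface, here is a hedged sketch of what such a call would typically look like from Python using `gradio_client`. The Space id and the `/predict` endpoint name are assumptions (the latter is the default for a `gr.Interface`), not something this diff confirms, and the positional arguments follow the order of `generate`'s parameters shown earlier:

```python
# Hypothetical client-side call; the Space id and api_name are assumptions,
# not taken from this commit. Requires: pip install gradio_client
from gradio_client import Client

client = Client("DeFactOfficial/t2v-turbo-v2")  # assumed Space id

result = client.predict(
    "A musician strums his guitar, serenading the moonlit night.",  # prompt
    7.5,     # CFG Guidance
    0.05,    # MGS Guidance
    0.5,     # Motion Guidance Percentage
    16,      # Inference Steps
    16,      # Number of Video Frames
    0,       # Seed
    True,    # Randomize seed
    "bf16",  # torch.dtype
    8,       # Desired Output FPS
    api_name="/predict",  # default endpoint name for a gr.Interface
)
print(result)  # path to the generated video file on the client machine
```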