jbilcke-hf (HF Staff) committed
Commit 9fd1204 · Parent: 76a0a50

we are going to hack into finetrainers

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full change set.
Files changed (50)
  1. docs/finetrainers-src-codebase/.github/ISSUE_TEMPLATE/bug_report.yaml +51 -0
  2. docs/finetrainers-src-codebase/.github/ISSUE_TEMPLATE/feature-request.yaml +34 -0
  3. docs/finetrainers-src-codebase/.github/workflows/pr_tests.yml +30 -0
  4. docs/finetrainers-src-codebase/.gitignore +179 -0
  5. docs/finetrainers-src-codebase/CONTRIBUTING.md +41 -0
  6. docs/finetrainers-src-codebase/LICENSE +201 -0
  7. docs/finetrainers-src-codebase/Makefile +11 -0
  8. docs/{finetrainers/documentation_README.md → finetrainers-src-codebase/README.md} +6 -3
  9. docs/finetrainers-src-codebase/accelerate_configs/compiled_1.yaml +22 -0
  10. docs/finetrainers-src-codebase/accelerate_configs/deepspeed.yaml +23 -0
  11. docs/finetrainers-src-codebase/accelerate_configs/uncompiled_1.yaml +17 -0
  12. docs/finetrainers-src-codebase/accelerate_configs/uncompiled_2.yaml +17 -0
  13. docs/finetrainers-src-codebase/accelerate_configs/uncompiled_4.yaml +17 -0
  14. docs/finetrainers-src-codebase/accelerate_configs/uncompiled_8.yaml +17 -0
  15. docs/finetrainers-src-codebase/assets/contribute.md +16 -0
  16. docs/finetrainers-src-codebase/assets/contribute_zh.md +16 -0
  17. docs/finetrainers-src-codebase/assets/dataset_zh.md +72 -0
  18. docs/finetrainers-src-codebase/assets/sft_2b.png +0 -0
  19. docs/finetrainers-src-codebase/assets/sft_5b.png +0 -0
  20. docs/finetrainers-src-codebase/assets/tests/metadata.csv +2 -0
  21. docs/finetrainers-src-codebase/docs/_NOTES_FOR_FUTURE_ME.md +20 -0
  22. docs/{finetrainers/documentation_args.md → finetrainers-src-codebase/docs/args.md} +44 -5
  23. docs/{finetrainers/documentation_dataset_README.md → finetrainers-src-codebase/docs/dataset/README.md} +11 -4
  24. docs/finetrainers-src-codebase/docs/dataset/_DEBUG.md +44 -0
  25. docs/{finetrainers/documentation_environment.md → finetrainers-src-codebase/docs/environment.md} +11 -0
  26. docs/{finetrainers/documentation_models_README.md → finetrainers-src-codebase/docs/models/README.md} +0 -0
  27. docs/finetrainers-src-codebase/docs/models/attention.md +263 -0
  28. docs/{finetrainers/documentation_models_cogvideox.md → finetrainers-src-codebase/docs/models/cogvideox.md} +6 -6
  29. docs/finetrainers-src-codebase/docs/models/cogview4.md +94 -0
  30. docs/finetrainers-src-codebase/docs/models/flux.md +53 -0
  31. docs/{finetrainers/documentation_models_hunyuan_video.md → finetrainers-src-codebase/docs/models/hunyuan_video.md} +3 -3
  32. docs/{finetrainers/documentation_models_ltx_video.md → finetrainers-src-codebase/docs/models/ltx_video.md} +3 -3
  33. docs/{finetrainers/documentation_models_optimization.md → finetrainers-src-codebase/docs/models/optimization.md} +0 -0
  34. docs/{finetrainers/documentation_models_wan.md → finetrainers-src-codebase/docs/models/wan.md} +11 -1
  35. docs/{finetrainers/documentation_optimizers.md → finetrainers-src-codebase/docs/optimizer.md} +0 -0
  36. docs/{finetrainers/documentation_parallel_processing_README.md → finetrainers-src-codebase/docs/parallel/README.md} +8 -3
  37. docs/{finetrainers/documentation_trainers_control_trainer.md → finetrainers-src-codebase/docs/trainer/control_trainer.md} +0 -0
  38. docs/{finetrainers/documentation_trainers_sft_trainer.md → finetrainers-src-codebase/docs/trainer/sft_trainer.md} +0 -0
  39. docs/finetrainers-src-codebase/examples/_legacy/training/README.md +459 -0
  40. docs/finetrainers-src-codebase/examples/_legacy/training/README_zh.md +455 -0
  41. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/__init__.py +0 -0
  42. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/args.py +484 -0
  43. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/cogvideox_image_to_video_lora.py +1016 -0
  44. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/cogvideox_image_to_video_sft.py +947 -0
  45. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/cogvideox_text_to_video_lora.py +955 -0
  46. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/cogvideox_text_to_video_sft.py +917 -0
  47. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/dataset.py +428 -0
  48. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/prepare_dataset.py +669 -0
  49. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/text_encoder/__init__.py +1 -0
  50. docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/text_encoder/text_encoder.py +99 -0
docs/finetrainers-src-codebase/.github/ISSUE_TEMPLATE/bug_report.yaml ADDED
@@ -0,0 +1,51 @@
+ name: "\U0001F41B Bug Report"
+ description: Submit a bug report to help us improve CogVideoX-Factory / 提交一个 Bug 问题报告来帮助我们改进 CogVideoX-Factory 开源框架
+ body:
+   - type: textarea
+     id: system-info
+     attributes:
+       label: System Info / 系統信息
+       description: Your operating environment / 您的运行环境信息
+       placeholder: Includes Cuda version, Diffusers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括Cuda版本,Diffusers,Python版本,操作系统,硬件信息(如果您怀疑是硬件方面的问题)...
+     validations:
+       required: true
+ 
+   - type: checkboxes
+     id: information-scripts-examples
+     attributes:
+       label: Information / 问题信息
+       description: 'The problem arises when using: / 问题出现在'
+       options:
+         - label: "The official example scripts / 官方的示例脚本"
+         - label: "My own modified scripts / 我自己修改的脚本和任务"
+ 
+   - type: textarea
+     id: reproduction
+     validations:
+       required: true
+     attributes:
+       label: Reproduction / 复现过程
+       description: |
+         Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
+         If you have code snippets, error messages, stack traces, please provide them here as well.
+         Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
+         Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
+ 
+         请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
+         如果您有代码片段、错误信息、堆栈跟踪,也请在此提供。
+         请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
+         请勿使用截图,因为截图难以阅读,而且(更重要的是)不允许他人复制粘贴您的代码。
+       placeholder: |
+         Steps to reproduce the behavior/复现Bug的步骤:
+ 
+         1.
+         2.
+         3.
+ 
+   - type: textarea
+     id: expected-behavior
+     validations:
+       required: true
+     attributes:
+       label: Expected behavior / 期待表现
+       description: "A clear and concise description of what you would expect to happen. /简单描述您期望发生的事情。"
docs/finetrainers-src-codebase/.github/ISSUE_TEMPLATE/feature-request.yaml ADDED
@@ -0,0 +1,34 @@
+ name: "\U0001F680 Feature request"
+ description: Submit a request for a new CogVideoX-Factory feature / 提交一个新的 CogVideoX-Factory 开源项目的功能建议
+ labels: [ "feature" ]
+ body:
+   - type: textarea
+     id: feature-request
+     validations:
+       required: true
+     attributes:
+       label: Feature request / 功能建议
+       description: |
+         A brief description of the feature proposal. Links to corresponding papers and code are desirable.
+         对功能建议的简述。最好提供对应的论文和代码链接。
+ 
+   - type: textarea
+     id: motivation
+     validations:
+       required: true
+     attributes:
+       label: Motivation / 动机
+       description: |
+         Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
+         您提出建议的动机。如果该动机与另一个 GitHub 问题有关,请在此处提供对应的链接。
+ 
+   - type: textarea
+     id: contribution
+     validations:
+       required: true
+     attributes:
+       label: Your contribution / 您的贡献
+       description: |
+ 
+         Your PR link or any other link you can help with.
+         您的PR链接或者其他您能提供帮助的链接。
docs/finetrainers-src-codebase/.github/workflows/pr_tests.yml ADDED
@@ -0,0 +1,30 @@
+ name: Fast tests for PRs
+ 
+ on:
+   pull_request:
+     branches:
+       - main
+ 
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+   cancel-in-progress: true
+ 
+ jobs:
+   check_code_quality:
+     runs-on: ubuntu-22.04
+     steps:
+       - uses: actions/checkout@v3
+       - name: Set up Python
+         uses: actions/setup-python@v4
+         with:
+           python-version: "3.8"
+       - name: Install dependencies
+         run: |
+           python -m pip install --upgrade pip
+           pip install ruff==0.9.10
+       - name: Check quality
+         run: make quality
+       - name: Check if failure
+         if: ${{ failure() }}
+         run: |
+           echo "Quality check failed. Please install ruff: `pip install ruff` and then run `make style && make quality` from the root of the repository." >> $GITHUB_STEP_SUMMARY
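The failure message above points contributors at the same checks the workflow runs. With the ruff version pinned in this workflow, a local reproduction would look roughly like the following (the `make` targets are defined in the Makefile added later in this commit):

```bash
# Reproduce the CI quality gate locally before opening a PR.
pip install ruff==0.9.10
make style    # auto-fixes lint/format issues
make quality  # the check that this workflow runs
```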
docs/finetrainers-src-codebase/.gitignore ADDED
@@ -0,0 +1,179 @@
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # JetBrains
7
+ .idea
8
+
9
+ # C extensions
10
+ *.so
11
+
12
+ # Distribution / packaging
13
+ .Python
14
+ build/
15
+ develop-eggs/
16
+ dist/
17
+ downloads/
18
+ eggs/
19
+ .eggs/
20
+ lib/
21
+ lib64/
22
+ parts/
23
+ sdist/
24
+ var/
25
+ wheels/
26
+ share/python-wheels/
27
+ *.egg-info/
28
+ .installed.cfg
29
+ *.egg
30
+ MANIFEST
31
+
32
+ # PyInstaller
33
+ # Usually these files are written by a python script from a template
34
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
35
+ *.manifest
36
+ *.spec
37
+
38
+ # Installer logs
39
+ pip-log.txt
40
+ pip-delete-this-directory.txt
41
+
42
+ # Unit test / coverage reports
43
+ htmlcov/
44
+ .tox/
45
+ .nox/
46
+ .coverage
47
+ .coverage.*
48
+ .cache
49
+ nosetests.xml
50
+ coverage.xml
51
+ *.cover
52
+ *.py,cover
53
+ .hypothesis/
54
+ .pytest_cache/
55
+ cover/
56
+
57
+ # Translations
58
+ *.mo
59
+ *.pot
60
+
61
+ # Django stuff:
62
+ *.log
63
+ local_settings.py
64
+ db.sqlite3
65
+ db.sqlite3-journal
66
+
67
+ # Flask stuff:
68
+ instance/
69
+ .webassets-cache
70
+
71
+ # Scrapy stuff:
72
+ .scrapy
73
+
74
+ # Sphinx documentation
75
+ docs/_build/
76
+
77
+ # PyBuilder
78
+ .pybuilder/
79
+ target/
80
+
81
+ # Jupyter Notebook
82
+ .ipynb_checkpoints
83
+
84
+ # IPython
85
+ profile_default/
86
+ ipython_config.py
87
+
88
+ # pyenv
89
+ # For a library or package, you might want to ignore these files since the code is
90
+ # intended to run in multiple environments; otherwise, check them in:
91
+ # .python-version
92
+
93
+ # pipenv
94
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
95
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
96
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
97
+ # install all needed dependencies.
98
+ #Pipfile.lock
99
+
100
+ # poetry
101
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
102
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
103
+ # commonly ignored for libraries.
104
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
105
+ #poetry.lock
106
+
107
+ # pdm
108
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
109
+ #pdm.lock
110
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
111
+ # in version control.
112
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
113
+ .pdm.toml
114
+ .pdm-python
115
+ .pdm-build/
116
+
117
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
118
+ __pypackages__/
119
+
120
+ # Celery stuff
121
+ celerybeat-schedule
122
+ celerybeat.pid
123
+
124
+ # SageMath parsed files
125
+ *.sage.py
126
+
127
+ # Environments
128
+ .env
129
+ .venv
130
+ env/
131
+ venv/
132
+ ENV/
133
+ env.bak/
134
+ venv.bak/
135
+
136
+ # Spyder project settings
137
+ .spyderproject
138
+ .spyproject
139
+
140
+ # Rope project settings
141
+ .ropeproject
142
+
143
+ # mkdocs documentation
144
+ /site
145
+
146
+ # mypy
147
+ .mypy_cache/
148
+ .dmypy.json
149
+ dmypy.json
150
+
151
+ # Pyre type checker
152
+ .pyre/
153
+
154
+ # pytype static type analyzer
155
+ .pytype/
156
+
157
+ # Cython debug symbols
158
+ cython_debug/
159
+
160
+ # PyCharm
161
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
162
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
163
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
164
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
165
+ #.idea/
166
+
167
+ # manually added
168
+ wandb/
169
+ *.txt
170
+ dump*
171
+ outputs*
172
+ *.slurm
173
+ .vscode/
174
+ *dummy*
175
+ *curated*
176
+ validation_dataset/
177
+ wan-framepack/
178
+
179
+ !requirements.txt
docs/finetrainers-src-codebase/CONTRIBUTING.md ADDED
@@ -0,0 +1,41 @@
+ # How to contribute to Finetrainers
+ 
+ Finetrainers is an early-stage library for training diffusion models. Everyone is welcome to contribute - models, algorithms, refactors, docs, etc. - but due to the early stage of the project, we recommend that bigger contributions be discussed in an issue before submitting a PR. Eventually, we will have a better process for this!
+ 
+ ## How to contribute
+ 
+ ### Adding a new model
+ 
+ If you would like to add a new model, please follow these steps:
+ 
+ - Create a new file in the `finetrainers/models` directory with the model name (if it's new), or use the same directory if it's a variant of an existing model.
+ - Implement the model specification in the file. For more details on what a model specification should look like, see the [ModelSpecification](TODO(aryan): add link) documentation.
+ - Update the supported configs in `finetrainers/config.py` to include the new model and the training types supported.
+ - Add a dummy model specification in the `tests/models` directory.
+ - Make sure to test training with the following settings (example launch commands are sketched after this list):
+   - Single GPU
+   - 2x GPU with `--dp_degree 2 --dp_shards 1`
+   - 2x GPU with `--dp_degree 1 --dp_shards 2`
+ 
+   For `SFTTrainer` additions, please make sure to train with at least 1000 steps (at least 2000 data points) to ensure the model training is working as expected.
+ - Open a PR with your changes. Please make sure to share your wandb logs for the above training settings in the PR description. This will help us verify that training is working as expected.
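As a rough sketch of the three required test settings, the launches below use `torchrun` (the same launcher used in the debugging notes elsewhere in this commit) with `train.py` at the repository root; `<your training args>` is a placeholder for whatever model/dataset arguments your new model specification needs, and only `--dp_degree`/`--dp_shards` are taken from the list above.

```bash
# Hypothetical launch commands for the three required test settings.

# Single GPU
torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py <your training args>

# 2x GPU, data-parallel replication
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py <your training args> --dp_degree 2 --dp_shards 1

# 2x GPU, data-parallel sharding
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py <your training args> --dp_degree 1 --dp_shards 2
```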
+ 
+ ### Adding a new algorithm
+ 
+ Currently, we are not accepting algorithm contributions. We will update this section once we are ready 🤗
+ 
+ ### Refactors
+ 
+ The library is in a very early stage. There are many instances of dead code, poorly written abstractions, and other issues. If you would like to refactor/clean up a part of the codebase, please open an issue to discuss the changes before submitting a PR.
+ 
+ ### Dataset improvements
+ 
+ Any changes to dataset/dataloader implementations can be submitted directly. The improvements and the reasons for the changes should be conveyed appropriately so that we can move quickly 🤗
+ 
+ ### Documentation
+ 
+ Due to the early stage of the project, the documentation is not as comprehensive as we would like. Any improvements/refactors are welcome directly!
+ 
+ ## Asking for help
+ 
+ If you have any questions, feel free to open an issue and we will be sure to help you out as soon as possible! Please make sure to describe your issue in either English (preferred) or Chinese. Any other language will make it hard for us to help you, so we will most likely close such issues without an explanation/answer.
docs/finetrainers-src-codebase/LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
docs/finetrainers-src-codebase/Makefile ADDED
@@ -0,0 +1,11 @@
+ .PHONY: quality style
+ 
+ check_dirs := finetrainers tests examples train.py setup.py
+ 
+ quality:
+ 	ruff check $(check_dirs) --exclude examples/_legacy
+ 	ruff format --check $(check_dirs) --exclude examples/_legacy
+ 
+ style:
+ 	ruff check $(check_dirs) --fix --exclude examples/_legacy
+ 	ruff format $(check_dirs) --exclude examples/_legacy
docs/{finetrainers/documentation_README.md → finetrainers-src-codebase/README.md} RENAMED
@@ -30,10 +30,10 @@ Checkout to the latest stable release tag:
 
  ```bash
  git fetch --all --tags
- git checkout tags/v0.1.0
+ git checkout tags/v0.2.0
  ```
 
- Follow the instructions mentioned in the [README](https://github.com/a-r-r-o-w/finetrainers/tree/v0.1.0) for the latest stable release.
+ Follow the instructions mentioned in the [README](https://github.com/a-r-r-o-w/finetrainers/tree/v0.2.0-release) for the latest stable release.
 
  #### Using the main branch
 
@@ -54,9 +54,10 @@ Please checkout [`docs/models`](./docs/models/) and [`examples/training`](./exam
 
  ## Features
 
- - DDP, FSDP-2 & HSDP support for all models
+ - DDP, FSDP-2 & HSDP, CP support
  - LoRA and full-rank finetuning; Conditional Control training
  - Memory-efficient single-GPU training
+ - Multiple attention backends supported - `flash`, `flex`, `sage`, `xformers` (see [attention](./docs/models/attention.md) docs)
  - Auto-detection of commonly used dataset formats
  - Combined image/video datasets, multiple chainable local/remote datasets, multi-resolution bucketing & more
  - Memory-efficient precomputation support with/without on-the-fly precomputation for large scale datasets
@@ -65,6 +66,8 @@ Please checkout [`docs/models`](./docs/models/) and [`examples/training`](./exam
 
  ## News
 
+ - 🔥 **2025-04-25**: Support for different attention providers added!
+ - 🔥 **2025-04-21**: Wan I2V support added!
  - 🔥 **2025-04-12**: Channel-concatenated control conditioning support added for CogView4 and Wan!
  - 🔥 **2025-04-08**: `torch.compile` support added!
  - 🔥 **2025-04-06**: Flux support added!
docs/finetrainers-src-codebase/accelerate_configs/compiled_1.yaml ADDED
@@ -0,0 +1,22 @@
+ compute_environment: LOCAL_MACHINE
+ debug: false
+ distributed_type: 'NO'
+ downcast_bf16: 'no'
+ dynamo_config:
+   dynamo_backend: INDUCTOR
+   dynamo_mode: max-autotune
+   dynamo_use_dynamic: true
+   dynamo_use_fullgraph: false
+ enable_cpu_affinity: false
+ gpu_ids: '3'
+ machine_rank: 0
+ main_training_function: main
+ mixed_precision: bf16
+ num_machines: 1
+ num_processes: 1
+ rdzv_backend: static
+ same_network: true
+ tpu_env: []
+ tpu_use_cluster: false
+ tpu_use_sudo: false
+ use_cpu: false
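For context, configs like the one above are consumed by `accelerate launch`. A minimal sketch, assuming `train.py` at the repository root and placeholder training arguments, might look like:

```bash
# Sketch: launch a single-GPU run with the torch.compile-enabled config above.
# <your training args> is a placeholder for the actual finetrainers arguments.
accelerate launch --config_file accelerate_configs/compiled_1.yaml train.py <your training args>
```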
docs/finetrainers-src-codebase/accelerate_configs/deepspeed.yaml ADDED
@@ -0,0 +1,23 @@
+ compute_environment: LOCAL_MACHINE
+ debug: false
+ deepspeed_config:
+   gradient_accumulation_steps: 1
+   gradient_clipping: 1.0
+   offload_optimizer_device: cpu
+   offload_param_device: cpu
+   zero3_init_flag: false
+   zero_stage: 2
+ distributed_type: DEEPSPEED
+ downcast_bf16: 'no'
+ enable_cpu_affinity: false
+ machine_rank: 0
+ main_training_function: main
+ mixed_precision: bf16
+ num_machines: 1
+ num_processes: 2
+ rdzv_backend: static
+ same_network: true
+ tpu_env: []
+ tpu_use_cluster: false
+ tpu_use_sudo: false
+ use_cpu: false
docs/finetrainers-src-codebase/accelerate_configs/uncompiled_1.yaml ADDED
@@ -0,0 +1,17 @@
+ compute_environment: LOCAL_MACHINE
+ debug: false
+ distributed_type: 'NO'
+ downcast_bf16: 'no'
+ enable_cpu_affinity: false
+ gpu_ids: '3'
+ machine_rank: 0
+ main_training_function: main
+ mixed_precision: bf16
+ num_machines: 1
+ num_processes: 1
+ rdzv_backend: static
+ same_network: true
+ tpu_env: []
+ tpu_use_cluster: false
+ tpu_use_sudo: false
+ use_cpu: false
docs/finetrainers-src-codebase/accelerate_configs/uncompiled_2.yaml ADDED
@@ -0,0 +1,17 @@
+ compute_environment: LOCAL_MACHINE
+ debug: false
+ distributed_type: MULTI_GPU
+ downcast_bf16: 'no'
+ enable_cpu_affinity: false
+ gpu_ids: 0,1
+ machine_rank: 0
+ main_training_function: main
+ mixed_precision: bf16
+ num_machines: 1
+ num_processes: 2
+ rdzv_backend: static
+ same_network: true
+ tpu_env: []
+ tpu_use_cluster: false
+ tpu_use_sudo: false
+ use_cpu: false
docs/finetrainers-src-codebase/accelerate_configs/uncompiled_4.yaml ADDED
@@ -0,0 +1,17 @@
+ compute_environment: LOCAL_MACHINE
+ debug: false
+ distributed_type: MULTI_GPU
+ downcast_bf16: 'no'
+ enable_cpu_affinity: false
+ gpu_ids: 0,1,2,3
+ machine_rank: 0
+ main_training_function: main
+ mixed_precision: bf16
+ num_machines: 1
+ num_processes: 4
+ rdzv_backend: static
+ same_network: true
+ tpu_env: []
+ tpu_use_cluster: false
+ tpu_use_sudo: false
+ use_cpu: false
docs/finetrainers-src-codebase/accelerate_configs/uncompiled_8.yaml ADDED
@@ -0,0 +1,17 @@
+ compute_environment: LOCAL_MACHINE
+ debug: false
+ distributed_type: MULTI_GPU
+ downcast_bf16: 'no'
+ enable_cpu_affinity: false
+ gpu_ids: all
+ machine_rank: 0
+ main_training_function: main
+ mixed_precision: bf16
+ num_machines: 1
+ num_processes: 8
+ rdzv_backend: static
+ same_network: true
+ tpu_env: []
+ tpu_use_cluster: false
+ tpu_use_sudo: false
+ use_cpu: false
docs/finetrainers-src-codebase/assets/contribute.md ADDED
@@ -0,0 +1,16 @@
+ # Contributions Welcome
+ 
+ This project is in a very early stage, and we welcome contributions from everyone. We hope to receive contributions and support in the following areas:
+ 
+ 1. Support for more models. In addition to CogVideoX models, we also highly encourage contributions supporting other models.
+ 2. Support for richer datasets. In our example, we used a Disney video generation dataset, but we hope to support more datasets, as the current one is too limited for deeper fine-tuning exploration.
+ 3. Anything mentioned in the `TODO` items of our README.md
+ 
+ ## How to Submit
+ 
+ We welcome you to create a new PR describing the corresponding contribution. We will review it as soon as possible.
+ 
+ ## Naming Conventions
+ 
+ - Please use English for naming; avoid using pinyin or other languages. All comments should be in English.
+ - Strictly follow PEP8 conventions and use underscores to separate words. Please avoid using names like a, b, c.
docs/finetrainers-src-codebase/assets/contribute_zh.md ADDED
@@ -0,0 +1,16 @@
+ # 欢迎你们的贡献
+ 
+ 本项目属于非常初级的阶段,欢迎大家进行贡献。我们希望在以下方面得到贡献和支持:
+ 
+ 1. 支持更多的模型,除了 CogVideoX 模型之外的模型,我们也非常支持。
+ 2. 更丰富的数据集支持。在我们的例子中,我们使用了一个 Disney 视频生成数据集,但是我们希望能够支持更多的数据集,这个数据集太少了,并不足以进行更深的微调探索。
+ 3. 任何我们在README中`TODO`提到的内容。
+ 
+ ## 提交方式
+ 
+ 我们欢迎您直接创建一个新的PR,并说明对应的贡献,我们将第一时间查看。
+ 
+ ## 命名规范
+ 
+ - 请使用英文命名,不要使用拼音或者其他语言命名。所有的注释均使用英文。
+ - 请严格遵循 PEP8 规范,使用下划线分割单词。请勿使用 a,b,c 这样的命名。
docs/finetrainers-src-codebase/assets/dataset_zh.md ADDED
@@ -0,0 +1,72 @@
+ ## 数据集格式
+ 
+ ### 提示词数据集要求
+ 
+ 创建 `prompt.txt` 文件,文件应包含逐行分隔的提示。请注意,提示必须是英文,并且建议使用 [提示润色脚本](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py) 进行润色。或者可以使用 [CogVideo-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption) 进行数据标注:
+ 
+ ```
+ A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.
+ A black and white animated sequence on a ship’s deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language...
+ ...
+ ```
+ 
+ ### 视频数据集要求
+ 
+ 该框架支持的分辨率和帧数需要满足以下条件:
+ 
+ - **支持的分辨率(宽 * 高)**:
+   - 任意分辨率且必须能被32整除。例如,`720 * 480`, `1920 * 1020` 等分辨率。
+ 
+ - **支持的帧数(Frames)**:
+   - 必须是 `4 * k` 或 `4 * k + 1`(例如:16, 32, 49, 81)
+ 
+ 所有的视频建议放在一个文件夹中。
+ 
+ 
+ 接着,创建 `videos.txt` 文件。 `videos.txt` 文件应包含逐行分隔的视频文件路径。请注意,路径必须相对于 `--data_root` 目录。格式如下:
+ 
+ ```
+ videos/00000.mp4
+ videos/00001.mp4
+ ...
+ ```
+ 
+ 对于有兴趣了解更多细节的开发者,您可以查看相关的 `BucketSampler` 代码。
+ 
+ ### 数据集结构
+ 
+ 您的数据集结构应如下所示,通过运行`tree`命令,你能看到:
+ 
+ ```
+ dataset
+ ├── prompt.txt
+ ├── videos.txt
+ ├── videos
+ ├── videos/00000.mp4
+ ├── videos/00001.mp4
+ ├── ...
+ ```
+ 
+ ### 使用数据集
+ 
+ 当使用此格式时,`--caption_column` 应为 `prompt.txt`,`--video_column` 应为 `videos.txt`。如果您的数据存储在 CSV
+ 文件中,也可以指定 `--dataset_file` 为 CSV 文件的路径,`--caption_column` 和 `--video_column` 为 CSV
+ 文件中的实际列名。请参考 [test_dataset](../tests/test_dataset.py) 文件中的一些简单示例。
+ 
+ 例如,使用 [这个](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset) Disney 数据集进行微调。下载可通过🤗
+ Hugging Face CLI 完成:
+ 
+ ```
+ huggingface-cli download --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset --local-dir video-dataset-disney
+ ```
+ 
+ 该数据集已按照预期格式准备好,可直接使用。但是,直接使用视频数据集可能会导致较小 VRAM 的 GPU 出现
+ OOM(内存不足),因为它需要加载 [VAE](https://huggingface.co/THUDM/CogVideoX-5b/tree/main/vae)
+ (将视频编码为潜在空间)和大型 [T5-XXL](https://huggingface.co/google/t5-v1_1-xxl/)
+ 
+ 文本编码器。为了降低内存需求,您可以使用 `training/prepare_dataset.py` 脚本预先计算潜在变量和嵌入。
+ 
+ 填写或修改 `prepare_dataset.sh` 中的参数并执行它以获得预先计算的潜在变量和嵌入(请确保指定 `--save_latents_and_embeddings`
+ 以保存预计算的工件)。如果准备图像到视频的训练,请确保传递 `--save_image_latents`,它会对图像进行编码,将图像潜在值与视频一起保存。
+ 在训练期间使用这些工件时,确保指定 `--load_tensors` 标志,否则将直接使用视频并需要加载文本编码器和
+ VAE。该脚本还支持 PyTorch DDP,以便可以使用多个 GPU 并行编码大型数据集(修改 `NUM_GPUS` 参数)。
docs/finetrainers-src-codebase/assets/sft_2b.png ADDED
docs/finetrainers-src-codebase/assets/sft_5b.png ADDED
docs/finetrainers-src-codebase/assets/tests/metadata.csv ADDED
@@ -0,0 +1,2 @@
+ video,caption
+ "videos/hiker.mp4","""A hiker standing at the top of a mountain, triumphantly, high quality"""
docs/finetrainers-src-codebase/docs/_NOTES_FOR_FUTURE_ME.md ADDED
@@ -0,0 +1,20 @@
+ # Notes for Future Me
+ 
+ > [!NOTE]
+ > This doc page is intended for developers and contributors.
+ 
+ FSDP dump:
+ - https://pytorch.org/docs/stable/notes/fsdp.html#fsdp-notes
+ - https://github.com/pytorch/pytorch/issues/114299
+   - Using FSDP1 requires that all FSDP flat parameters are of the same dtype. For LoRA training, we default LoRA parameters to fp32 and transformer parameters to the dtype chosen by the user. There seems to be no easier workaround than performing LoRA training in the same dtype.
+ - https://github.com/pytorch/pytorch/issues/100945
+ - https://github.com/pytorch/torchtune/blob/9b3836028fd0b48f593ea43474b86880c49a4d74/recipes/lora_finetune_distributed.py
+ - https://github.com/KellerJordan/modded-nanogpt/pull/68
+ - https://github.com/pytorch/pytorch/pull/125394: monkey-patch method for FSDP pre/post-hooks to be triggered for methods other than `forward`
+ - https://github.com/pytorch/pytorch/pull/127786
+ - https://github.com/pytorch/pytorch/pull/130949
+ - Sanity saver: create optimizers after parallelizing/activation-checkpointing models
+ 
+ DTensor:
+ - https://github.com/pytorch/pytorch/issues/88838
+ - https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/parallel/test_parallelize_api.py
docs/{finetrainers/documentation_args.md → finetrainers-src-codebase/docs/args.md} RENAMED
@@ -75,7 +75,10 @@ layerwise_upcasting_skip_modules_pattern (`List[str]`, defaults to `["patch_embe
  naively (as done in layerwise upcasting), can lead to poorer training and inference quality. We skip these layers
  by default, and recommend adding more layers to the default list based on the model architecture.
  compile_modules (`List[str]`, defaults to `[]`):
- Modules that should be regionally compiled with `torch.compile`. Choose one or more from ['transformer'].
+ Modules that should be regionally compiled with `torch.compile`.
+ compile_scopes (`str`, defaults to `None`):
+ The scope of compilation for each `--compile_modules`. Choose between ['regional', 'full']. Must have the same length as
+ `--compile_modules`. If `None`, will default to `regional` for all modules.
 
  DATASET ARGUMENTS
  -----------------
@@ -109,6 +112,9 @@ dataset_config (`str`):
  dataset_shuffle_buffer_size (`int`, defaults to `1`):
  The buffer size for shuffling the dataset. This is useful for shuffling the dataset before training. The default
  value of `1` means that the dataset will not be shuffled.
+ enable_precomputation (`bool`, defaults to `False`):
+ Whether or not to precompute the embeddings for the dataset. This is useful for faster training. If set to `True`,
+ the embeddings will be precomputed and saved to disk and loaded as required.
  precomputation_items (`int`, defaults to `512`):
  Number of data samples to precompute at once for memory-efficient training. The higher this value,
  the more disk memory will be used to save the precomputed samples (conditions and latents).
@@ -118,8 +124,16 @@ precomputation_dir (`str`, defaults to `None`):
  precomputation_once (`bool`, defaults to `False`):
  Precompute embeddings from all datasets at once before training. This is useful to save time during training
  with smaller datasets. If set to `False`, will save disk space by precomputing embeddings on-the-fly during
- training when required. Make sure to set `precomputation_items` to a reasonable value in line with the size
- of your dataset(s).
+ training when required (that is, computing embeddings of more data samples once `precomputation_items` of them
+ have been exhausted across all distributed ranks). Make sure to set `precomputation_items` to a reasonable value
+ in line with the size of your dataset(s).
+ precomputation_reuse (`bool`, defaults to `False`):
+ Reuse precomputed embeddings from previous training runs. This is useful to save time during training
+ with medium/large datasets. By default, old precomputed embeddings that exist in the specified precomputation
+ directory, or the default precomputation dir `{output_dir}/precomputed`, will be deleted if this is not set to `True`.
+ This flag is ignored if `enable_precomputation` is `False`. The topology of the distributed training run must be
+ the same as the one used to precompute the embeddings for this to work correctly (this limitation will be
+ addressed in the future).
 
  DATALOADER_ARGUMENTS
  --------------------
@@ -248,8 +262,6 @@ logging_dir (`str`, defaults to `logs`):
  The directory where the logs will be stored.
  logging_steps (`int`, defaults to `1`):
  Training logs will be tracked every `logging_steps` steps.
- allow_tf32 (`bool`, defaults to `False`):
- Whether or not to allow the use of TF32 matmul on compatible hardware.
  nccl_timeout (`int`, defaults to `1800`):
  Timeout for the NCCL communication.
  report_to (`str`, defaults to `wandb`):
@@ -260,6 +272,33 @@ verbose (`int`, defaults to `1`):
  - 1: Diffusers/Transformers info logging on local main process only
  - 2: Diffusers/Transformers debug logging on local main process only
  - 3: Diffusers/Transformers debug logging on all processes
+ 
+ TORCH CONFIG ARGUMENTS
+ ----------------------
+ allow_tf32 (`bool`, defaults to `False`):
+ Whether or not to allow the use of TF32 matmul on compatible hardware.
+ float32_matmul_precision (`str`, defaults to `highest`):
+ The precision to use for float32 matmul. Choose between ['highest', 'high', 'medium'].
+ ```
+ 
+ ### Attention Provider
+ 
+ These arguments are relevant to setting the attention provider for different modeling components. The attention providers may be set differently for training and validation/inference.
+ 
+ ```
+ attn_provider_training (`str`, defaults to "native"):
+ The attention provider to use for training. Choose between
+ [
+ 'flash', 'flash_varlen', 'flex', 'native', '_native_cudnn', '_native_efficient', '_native_flash',
+ '_native_math'
+ ]
+ attn_provider_inference (`str`, defaults to "native"):
+ The attention provider to use for validation. Choose between
+ [
+ 'flash', 'flash_varlen', 'flex', 'native', '_native_cudnn', '_native_efficient', '_native_flash',
+ '_native_math', 'sage', 'sage_varlen', '_sage_qk_int8_pv_fp8_cuda', '_sage_qk_int8_pv_fp8_cuda_sm90',
+ '_sage_qk_int8_pv_fp16_cuda', '_sage_qk_int8_pv_fp16_triton', 'xformers'
+ ]
  ```
 
  ## SFT training
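For orientation, here is a hedged sketch of how the compilation, TF32, and attention-provider arguments documented above might be combined on a command line; `python train.py` and `<your training args>` are placeholders for the actual entrypoint and remaining arguments, and the chosen values are only illustrative.

```bash
# Sketch only: flag names come from the argument docs above; values are examples.
python train.py <your training args> \
  --compile_modules transformer \
  --compile_scopes regional \
  --attn_provider_training flash \
  --attn_provider_inference sage \
  --allow_tf32 \
  --float32_matmul_precision high
```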
docs/{finetrainers/documentation_dataset_README.md → finetrainers-src-codebase/docs/dataset/README.md} RENAMED
@@ -57,6 +57,8 @@ dataset
57
 
58
  #### CSV/JSON/JSONL format
59
 
 
 
60
  > [!NOTE]
61
  > Relevant classes to look for implementation:
62
  > - ImageFolderDataset
@@ -75,6 +77,8 @@ Any dataset loadable via the [🤗 HF datasets] directly should work (not widely
75
 
76
  Any dataset loadable via the [🤗 HF datasets] directly should work (not widely tested at the moment). We support the [`webdataset`](https://huggingface.co/docs/datasets/v3.3.2/en/image_dataset#webdataset) and [`webdataset`](https://huggingface.co/docs/datasets/v3.3.2/en/video_dataset#webdataset) formats.
77
 
 
 
78
  ## Validation Dataset Format
79
 
80
  Arguments related to validation are:
@@ -148,18 +152,21 @@ For memory efficient training, it is important to precompute conditional and lat
148
 
149
  The following is a high-level overview of how datasets are loaded and preprocessed:
150
 
151
- - Initially, the dataset is lazy loaded using the HF `datasets` library. Every dataset is loaded in streaming and infinite mode. This means that the dataset will be loaded indefinitely until some end conditions (e.g. user-configured training steps is completed). Users can chain together multiple datasets too! For example, if you only have high resolution data available, but want to perform multi-resolution training at certain lower resolutions too, you would have to perform the resizing manually and chain the data together. Finetrainers makes this easier by allowing you to specify multiple different, or same, datasets with different resolutions.
 
152
  - The dataset is split across data replicas (GPUs groups that perform data parallelism). Each data replica will have a non-overlapping subset of the overall dataset.
153
- - If multiple datasets have been provided, they will be chained together. Shuffling can also be done to ensure better dataset regularization. This is done by shuffling the iterable datasets in a buffer of user-configured `--dataset_shuffle_buffer_size`. For small datasets, it is recommended to not shuffle and use the default value of `1`. For larger datasets, there is a significant overhead the higher this value is set to, so it is recommended to keep it low (< 1000) [this is because we store the data in memory in a not-so-clever way yet].
154
  - The dataset is preprocessed to the user-configured resolution buckets. This is done by resizing the images/videos to the specified resolution buckets. This is also necessary for collation when using batch_size > 1.
155
- - The dataset is precomputed for embeddings and stored to disk. This is done in batches of user-configured `--precompute_batch_size`. This is done to avoid exhausting disk space. The smaller this value, the more number of times conditioning models will be loaded upon precomputation exhaustion. The larger this value, the more disk space will be used.
156
  - When data points are required for training, they are loaded from disk on the main process and dispatched to data replicas. [TODO: this needs some improvements to speedup training eventually]
157
 
158
  ## Understanding how datasets are precomputed
159
 
160
- There are 3 arguments related to precomputation:
 
161
  - `--precomputation_items`: The number of data points to precompute and store to disk at a time. This is useful for performing memory-efficient training without exhausting disk space by precomputing embeddings of the entire dataset(s) at once. We default to `512` data points, but configure this to a lower value for smaller datasets. As training progresses, the precomputed data will be read from disk and dispatched to data replicas. Once all precomputed data has been used, the next batch of data points will be precomputed and stored to disk in a rolling fashion.
162
  - `--precomputation_dir`: The directory where precomputed data will be stored. This is useful for resuming training from a checkpoint, as the precomputed data will be loaded from this directory. If this directory is not provided, the precomputed data will be stored in the `--output_dir/precomputed`.
163
  - `--precomputation_once`: If you're working with small datasets and want to precompute all embeddings at once, set this flag. This will allow you to train without having to compute embeddings every time the precomputed data is exhausted. Currently, `webdataset` format loading does not support this feature, and it is also disabled for `> 1024` data points due to hard coded logic (can be removed manually by users for now).
 
164
 
165
  Batching is not yet supported for precomputation. This will be added in the future.
 
57
 
58
  #### CSV/JSON/JSONL format
59
 
60
+ - Supported names are: `metadata.json`, `metadata.jsonl`, `metadata.csv`
61
+
62
  > [!NOTE]
63
  > Relevant classes to look for implementation:
64
  > - ImageFolderDataset
 
77
 
78
  Any dataset loadable via the [🤗 HF datasets] directly should work (not widely tested at the moment). We support the [`webdataset`](https://huggingface.co/docs/datasets/v3.3.2/en/image_dataset#webdataset) and [`webdataset`](https://huggingface.co/docs/datasets/v3.3.2/en/video_dataset#webdataset) formats.
79
 
80
+
81
+
82
  ## Validation Dataset Format
83
 
84
  Arguments related to validation are:
 
152
 
153
  The following is a high-level overview of how datasets are loaded and preprocessed:
154
 
155
+ - Initially, the dataset is lazy loaded using the HF `datasets` library. Every dataset is loaded in streaming and infinite mode. This means that the dataset will be loaded indefinitely until some end conditions (e.g. user-configured training steps is completed). Multiple datasets can be chained together. For example, if you only have high resolution data available, but want to perform multi-resolution training at certain lower resolutions too, you would have to perform the resizing manually and create a new copy of the dataset containing multiresolution data. Finetrainers makes this easier by allowing you to specify multiple different, or same, datasets with different resolutions.
156
+ - When chaining multiple different datasets, make sure they are roughly the same size to avoid having smaller datasets repeatedly being used in the training loop. This is because the datasets are loaded in a round-robin fashion. For example, if you have 2 datasets of size 1000 and 2000, the first dataset will be fully seen twice before the second dataset is fully seen once by the model.
157
  - The dataset is split across data replicas (GPUs groups that perform data parallelism). Each data replica will have a non-overlapping subset of the overall dataset.
158
+ - If multiple datasets have been provided, they will be chained together. Shuffling can also be done to ensure better dataset regularization. This is done by shuffling the iterable datasets in a buffer of user-configured `--dataset_shuffle_buffer_size`. For small datasets, it is recommended to not shuffle and use the default value of `1`. For larger datasets, there is a significant overhead the higher this value is set to, so it is recommended to keep it low (< 1000) [this is because we store the data in memory in a not-so-clever way].
159
  - The dataset is preprocessed to the user-configured resolution buckets. This is done by resizing the images/videos to the specified resolution buckets. This is also necessary for collation when using batch_size > 1.
160
+ - The dataset is precomputed for embeddings and stored to disk. This is done in batches of user-configured `--precomputation_items` to avoid exhausting disk space. The smaller this value, the more number of times conditioning models will be loaded upon precomputation exhaustion. The larger this value, the more disk space will be used.
161
  - When data points are required for training, they are loaded from disk on the main process and dispatched to data replicas. [TODO: this needs some improvements to speedup training eventually]
162
 
163
  ## Understanding how datasets are precomputed
164
 
165
+ There are 4 arguments related to precomputation:
166
+ - `--enable_precomputation`: If set, precomputation will be enabled. The parameters that follow are only relevant if this flag is set. If this flag is not set, all models will be loaded in memory and training will take place without first precomputing embeddings.
167
  - `--precomputation_items`: The number of data points to precompute and store to disk at a time. This enables memory-efficient training without exhausting disk space, as would happen if embeddings for the entire dataset(s) were precomputed at once. We default to `512` data points, but configure this to a lower value for smaller datasets. As training progresses, the precomputed data will be read from disk and dispatched to data replicas. Once all precomputed data has been used, the next batch of data points will be precomputed and stored to disk in a rolling fashion.
168
  - `--precomputation_dir`: The directory where precomputed data will be stored. This is useful for resuming training from a checkpoint, as the precomputed data will be loaded from this directory. If this directory is not provided, the precomputed data will be stored in `--output_dir/precomputed`.
169
  - `--precomputation_once`: If you're working with small datasets and want to precompute all embeddings at once, set this flag. This will allow you to train without having to compute embeddings every time the precomputed data is exhausted. Currently, `webdataset` format loading does not support this feature, and it is also disabled for `> 1024` data points due to hardcoded logic (which users can remove manually for now).
170
+ - `--precomputation_reuse`: If you're working with medium/large-size datasets and want to precompute all embeddings and re-use them across different training runs, make sure to set this flag.
171
 
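The rolling behaviour of `--precomputation_items` can be pictured with a small self-contained sketch (illustrative only, not the finetrainers implementation; the encoders below are random stand-ins for the text encoder and VAE):

```python
# Illustrative sketch of rolling precomputation (not the finetrainers implementation).
import itertools
import os
import torch

def encode_text(caption: str) -> torch.Tensor:  # stand-in for the text encoder
    return torch.randn(1, 16)

def encode_media(sample: dict) -> torch.Tensor:  # stand-in for the VAE
    return torch.randn(1, 4, 8, 8)

def precompute_chunk(dataset_iter, precomputation_items: int, precomputation_dir: str) -> int:
    """Encode the next `precomputation_items` samples and store them to disk."""
    os.makedirs(precomputation_dir, exist_ok=True)
    count = 0
    for i, sample in enumerate(itertools.islice(dataset_iter, precomputation_items)):
        payload = {"text_embedding": encode_text(sample["caption"]), "latents": encode_media(sample)}
        torch.save(payload, os.path.join(precomputation_dir, f"{i}.pt"))
        count += 1
    return count

# Toy stream; in practice this would be the streaming training dataset.
stream = iter([{"caption": f"sample {i}"} for i in range(1000)])
while precompute_chunk(stream, precomputation_items=512, precomputation_dir="output/precomputed") > 0:
    # Train on the freshly precomputed chunk here, then precompute the next one.
    pass
```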
172
  Batching is not yet supported for precomputation. This will be added in the future.
docs/finetrainers-src-codebase/docs/dataset/_DEBUG.md ADDED
@@ -0,0 +1,44 @@
1
+ # Distributed dataset debugging
2
+
3
+ > [!NOTE]
4
+ > This doc page is intended for developers and contributors.
5
+
6
+ If the number of samples in the dataset is lower than the number of processes per node, training will hang indefinitely. I haven't been able to pin down how this could be fixed due to limited time, but basically:
7
+ - Start training with `--dp_degree 2` and `torchrun --standalone --nnodes=1 --nproc_per_node=2`. This launches training with DDP across 2 ranks.
8
+ - The dataset has `< dp_degree` samples
9
+ - When `datasets.distributed.split_dataset_by_node` is called, the data is distributed correctly to one rank, but the other rank hangs indefinitely. Due to this edge case, fast tests seem to fail.
10
+ - For now, we should just use `>= dp_degree` samples in the test dataset. However, this should be fixed in the future.
11
+
12
+ Minimal reproducer:
13
+
14
+ ```python
15
+ import torch
16
+ import torch.distributed as dist
17
+ from datasets import Dataset
18
+ from datasets.distributed import split_dataset_by_node
19
+ from torch.utils.data import DataLoader
20
+
21
+ ds = Dataset.from_dict({"x": [1]}).to_iterable_dataset()
22
+
23
+ dist.init_process_group()
24
+ rank, world_size = dist.get_rank(), dist.get_world_size()
25
+ ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
26
+ dl = DataLoader(ds)
27
+
28
+ exhausted = torch.zeros(world_size, dtype=torch.bool)
29
+
30
+ def loop():
31
+ while True:
32
+ print(rank, "hello", flush=True)
33
+ yield from dl
34
+ yield "end"
35
+
36
+ for x in loop():
37
+ if x == "end":
38
+ exhausted[rank] = True
39
+ continue
40
+ dist.all_reduce(exhausted)
41
+ if torch.all(exhausted):
42
+ break
43
+ print(f"{rank} {x}", flush=True)
44
+ ```
docs/{finetrainers/documentation_environment.md → finetrainers-src-codebase/docs/environment.md} RENAMED
@@ -26,3 +26,14 @@ NVIDIA A100-SXM4-80GB, 81920 MiB
26
  ```
27
 
28
  Other versions of dependencies may or may not work as expected. We would like to make finetrainers work on a wider range of environments, but due to the complexity of testing at the early stages of development, we are unable to do so. The long term goals include compatibility with most pytorch versions on CUDA, MPS, ROCm and XLA devices.
26
  ```
27
 
28
  Other versions of dependencies may or may not work as expected. We would like to make finetrainers work on a wider range of environments, but due to the complexity of testing at the early stages of development, we are unable to do so. The long term goals include compatibility with most pytorch versions on CUDA, MPS, ROCm and XLA devices.
29
+
30
+ > [!IMPORTANT]
31
+ >
32
+ > For context parallelism, PyTorch 2.6+ is required.
33
+
34
+ ## Configuration
35
+
36
+ The following environment variables may be configured to change the default behaviour of finetrainers:
37
+
38
+ - `FINETRAINERS_ATTN_PROVIDER`: Sets the default attention provider for training/validation. Defaults to `native`, as in native PyTorch SDPA. See [attention docs](./models/attention.md) for more information.
39
+ - `FINETRAINERS_ATTN_CHECKS`: Whether or not to run basic sanity checks when using different attention providers. This is useful for debugging, but you should leave it disabled for longer training runs. Defaults to `"0"`. Can be set to any truthy env value.
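For example, to run with the flash attention provider and enable the sanity checks for a short debugging session (assuming `flash-attn` is installed; see the attention docs linked above):

```bash
# Select the attention provider and enable sanity checks for a debugging run
export FINETRAINERS_ATTN_PROVIDER="flash"
export FINETRAINERS_ATTN_CHECKS="1"
```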
docs/{finetrainers/documentation_models_README.md → finetrainers-src-codebase/docs/models/README.md} RENAMED
File without changes
docs/finetrainers-src-codebase/docs/models/attention.md ADDED
@@ -0,0 +1,263 @@
1
+ # Attention backends
2
+
3
+ Finetrainers supports multiple attention backends to accommodate different hardware and different trade-offs between speed and memory usage. The following attention implementations are supported:
4
+ - Training:
5
+ - If model uses attention masks: `flash_varlen`, `flex`, `native`
6
+ - If model does not use attention masks: `flash`, `flex`, `native`, `xformers`
7
+ - Inference:
8
+ - If model uses attention masks: `flash_varlen`, `flex`, `native`, `sage_varlen`
9
+ - If model does not use attention masks: `flash`, `flash_varlen`, `flex`, `native`, `sage`, `sage_varlen`, `xformers`
10
+
11
+ Additionally, some specialized methods are available for debugging-specific purposes: `_native_cudnn`, `_native_efficient`, `_native_flash`, `_native_math`, `_sage_qk_int8_pv_fp8_cuda`, `_sage_qk_int8_pv_fp8_cuda_sm90`, `_sage_qk_int8_pv_fp16_cuda`, `_sage_qk_int8_pv_fp16_triton`. With time, more attention-specific optimizations and custom implementations will be supported. Contributions are welcome!
12
+
13
+ Unfortunately, due to limited time for testing, only specific versions of the packages that provide these implementations are supported. Other versions may work. The minimum supported versions will gradually be lowered for more flexibility, but for now, please use the following versions:
14
+ - `flash-attn>=2.6.3`
15
+ - `sageattention>=2.1.1`
16
+ - `xformers>=0.0.29.post3`
17
+
18
+ This guide will help you quickly install flash-attn, sageattention, and xformers to make your models run faster and use less memory for training/inference. We'll cover installation on Linux (Ubuntu 22.04) and Windows (using WSL).
19
+
20
+ Before you start, make sure to use a clean Python virtual environment, so that conflicting dependencies or a failed installation do not leave your system environment in a hard-to-recover state.
21
+
22
+ ### Flash attention
23
+
24
+ Providers covered: `flash`, `flash_varlen`
25
+
26
+ The installation steps have only been tested on Ubuntu 22.04 with CUDA 12.2 and 12.6.
27
+ - Check your CUDA version: look at the output of `nvidia-smi` or run `nvcc --version`.
28
+ - You might need the following packages: `pip install packaging ninja`
29
+ - Linux: Run: `pip install flash-attn --no-build-isolation`. Verify the version with `pip show flash-attn`
30
+ - WSL: The same instructions as above should work. Native Windows might require building from source - check community guides and follow the instructions [here](https://github.com/Dao-AILab/flash-attention).
31
+
32
+ ### Sage attention
33
+
34
+ Providers covered: `sage`, `sage_varlen`, `_sage_qk_int8_pv_fp8_cuda`, `_sage_qk_int8_pv_fp8_cuda_sm90`, `_sage_qk_int8_pv_fp16_cuda`, `_sage_qk_int8_pv_fp16_triton`
35
+
36
+ FP8 implementations require a CUDA compute capability of 9.0 or higher (H100, RTX 5090, etc.). Some may also work on compute capability 8.9 (RTX 4090, for example). FP16 implementations require a compute capability of at least 8.0 (A100, RTX 3090, etc.). For other GPUs, FP16 implementations may or may not work (untested).
37
+
38
+ - Check your compute capability with the following command:
39
+ ```bash
40
+ python -c "import torch; print(torch.cuda.get_device_capability())"
41
+ ```
42
+ - Check your CUDA version: look at the output of `nvidia-smi` or run `nvcc --version`.
43
+ - You might need the following packages: `pip install triton`. For Windows, check out the [triton-windows](https://github.com/woct0rdho/triton-windows) project.
44
+ - Linux/WSL: Run: `pip install git+https://github.com/thu-ml/SageAttention`. Verify the version with `pip show sageattention`.
45
+ - Make sure to look at the official installation guide in [SageAttention](https://github.com/thu-ml/SageAttention) too!
46
+
47
+ ### xformers
48
+
49
+ Providers covered: `xformers`
50
+
51
+ - Check your CUDA version: look at the output of `nvidia-smi` or run `nvcc --version`.
52
+ - Linux/WSL: Run: `pip install -U xformers --index-url https://download.pytorch.org/whl/cu126` (assuming CUDA 12.6). Verify the version with `pip show xformers`.
53
+ - Make sure to look at the official installation guide in [xformers](https://github.com/facebookresearch/xformers) too!
54
+
55
+ ----------
56
+
57
+ All other providers are either native PyTorch implementations or require a specific PyTorch version (for example, Flex Attention requires a torch version of at least 2.5.0).
58
+
59
+ ----------
60
+
61
+ ## Usage
62
+
63
+ There are two ways to use the attention dispatcher mechanism:
64
+ - Replace `scaled_dot_product_attention` globally:
65
+ ```python
66
+ import torch.nn.functional as F
67
+ from finetrainers.models.attention_dispatch import attention_dispatch
68
+
69
+ F.scaled_dot_product_attention = attention_dispatch
70
+ ```
71
+ - Replace all occurrences of `scaled_dot_product_attention` in your code with `attention_dispatch`.
72
+
73
+ ```python
74
+ # Use dispatcher directly
75
+ from finetrainers.models.attention_dispatch import attention_provider, AttentionProvider
76
+
77
+ with attention_provider(AttentionProvider.FLASH_VARLEN):
78
+ model(...)
79
+
80
+ # or,
81
+ with attention_provider("sage_varlen"):
82
+ model(...)
83
+ ```
84
+
85
+ ## Context Parallel
86
+
87
+ References and reading material:
88
+ - https://docs.pytorch.org/tutorials/prototype/context_parallel.html
89
+ - https://insujang.github.io/2024-09-20/introducing-context-parallelism/
90
+ - https://www.youtube.com/watch?v=ws7angQYIxI
91
+ - https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/
92
+ - https://arxiv.org/abs/2309.14509
93
+
94
+ There are three steps to enabling context parallelism with any model:
95
+ - Defining the context parallel plan: This is a dictionary that specifies which tensors to split and gather across the CP region at different layers in the model.
96
+ - Applying the CP plan with the `apply_context_parallel` function: This registers the necessary hooks to split and gather tensors at the right places in the model, without having to manually modify the model code.
97
+ - Running model under the `attention_provider` context manager
98
+
99
+ For a quick example, refer to the [inference example](#inference) below.
100
+
101
+ The CP plan is a dictionary that maps module names to `CPInput`/`CPOutput` specifications. The keys are the names of the internal modules in the model, and the values are typically dictionaries that map a parameter identifier (either an argument index or a keyword argument as used in the forward method) to a `CPInput` or `CPOutput` object (a plain list of `CPOutput` objects can also be used, as in the example below). The `CPInput` object specifies the input tensor to be split, and the `CPOutput` object specifies the output tensor to be gathered.
102
+
103
+ ```python
104
+ class ParamId:
105
+ name: Optional[str] = None
106
+ index: Optional[int] = None
107
+
108
+ class CPInput:
109
+ split_dim: int
110
+ expected_dims: Optional[int] = None
111
+ split_output: bool = False
112
+
113
+ class CPOutput:
114
+ gather_dim: int
115
+ expected_dims: Optional[int] = None
116
+ ```
117
+
118
+ - The `split_dim` and `gather_dim` parameters specify the dimension along which to split or gather the tensor. When using CP with native scaled dot product attention from pytorch, the tensor shape is `[B, N, S, D]`, so the `split_dim` and `gather_dim` parameters should be set to `2` as it is the sequence dimension.
119
+ - The `expected_dims` parameter is optional and is used to sanity-check that the tensor has the expected number of dimensions.
120
+ - By default, `CPInput`s are split in a pre-forward hook and `CPOutput`s are gathered in a post-forward hook. If you want to split the output of a module, set the `split_output` parameter to `True`; the tensor will then be split in the post-forward hook instead of the pre-forward hook.
121
+
122
+ - Attention providers supported for training with CP: `flash`, `_native_cudnn`, `_native_efficient`, `_native_flash`
123
+ - Attention providers supported for inference with CP: `flash`, `_native_cudnn`, `_native_efficient`, `_native_flash`
124
+
125
+ ### Training
126
+
127
+ To enable training with context parallelism, you need to make sure a suitable CP plan is registered for the model you are using and launch training with `--cp_degree 2`. For models supported in finetrainers, this is internally done in the [transformer metadata](https://github.com/a-r-r-o-w/finetrainers/tree/main/finetrainers/models/_metadata/transformer.py) file. For custom models, make sure to pass the `plan` argument to the `apply_context_parallel` function.
128
+
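For example, a 2-GPU context-parallel run might be launched roughly as follows (a sketch; substitute your actual training script and the rest of your usual training arguments):

```bash
# Sketch: enable context parallelism across 2 GPUs with torchrun.
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py \
  <your-usual-training-args> \
  --cp_degree 2
```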
129
+ Currently supported models include: CogVideoX, CogView4, Flux, Wan 2.1. Support for more models and attention providers is in progress.
130
+
131
+ ### Inference
132
+
133
+ The following example shows how to run context parallel inference. For more examples and ready-to-use inference scripts, check out the [examples/inference](https://github.com/a-r-r-o-w/finetrainers/tree/main/examples/inference/) folder.
134
+
135
+ <details>
136
+ <summary> Example </summary>
137
+
138
+ ```python
139
+ import torch
140
+ import torch.distributed as dist
141
+ from diffusers import AutoencoderKLWan, WanPipeline
142
+ from diffusers.utils import export_to_video
143
+
144
+ from finetrainers._metadata import ParamId, CPInput, CPOutput
145
+ from finetrainers.parallel.ptd import apply_context_parallel
146
+ from finetrainers.models.attention_dispatch import attention_provider, attention_dispatch
147
+
148
+ torch.nn.functional.scaled_dot_product_attention = attention_dispatch
149
+
150
+
151
+ def apply_compile(model: torch.nn.Module, compile_scope: str) -> torch.nn.Module:
152
+ r"""Apply torch.compile to a model or its submodules if not already compiled."""
153
+ if getattr(model, "_torch_compiled", False):
154
+ return model # Already compiled
155
+
156
+ if compile_scope == "full":
157
+ model = torch.compile(model)
158
+ setattr(model, "_torch_compiled", True)
159
+ elif compile_scope == "regional":
160
+ if isinstance(model, torch.nn.ModuleList):
161
+ for name, module in model.named_children():
162
+ if not getattr(module, "_torch_compiled", False):
163
+ compiled_module = torch.compile(module, mode="max-autotune-no-cudagraphs", fullgraph=False, dynamic=False)
164
+ setattr(compiled_module, "_torch_compiled", True)
165
+ model.register_module(name, compiled_module)
166
+ else:
167
+ for name, module in model.named_children():
168
+ apply_compile(module, compile_scope)
169
+ else:
170
+ raise ValueError(f"Unknown compile mode: {compile_scope}. Use 'full' or 'regional'.")
171
+
172
+ return model
173
+
174
+
175
+ torch.manual_seed(0)
176
+ dist.init_process_group("nccl")
177
+ rank, world_size = dist.get_rank(), dist.get_world_size()
178
+ torch.cuda.set_device(rank)
179
+ cp_mesh = dist.device_mesh.init_device_mesh("cuda", [world_size], mesh_dim_names=["cp"])
180
+
181
+ cp_plan = {
182
+ "rope": {
183
+ ParamId(index=0): CPInput(2, 4, split_output=True),
184
+ },
185
+ "blocks.*": {
186
+ ParamId("encoder_hidden_states", 1): CPInput(1, 3),
187
+ },
188
+ "blocks.0": {
189
+ ParamId("hidden_states", 0): CPInput(1, 3),
190
+ },
191
+ "proj_out": [CPOutput(1, 3)],
192
+ }
193
+
194
+ try:
195
+ model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
196
+ vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
197
+ pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
198
+ pipe.to("cuda")
199
+
200
+ apply_context_parallel(pipe.transformer, mesh=cp_mesh, plan=cp_plan)
201
+
202
+ apply_compile(pipe.transformer, compile_scope="regional")
203
+
204
+ prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
205
+ negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
206
+
207
+ with torch.no_grad():
208
+ prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
209
+ prompt=prompt, negative_prompt=negative_prompt, device="cuda",
210
+ )
211
+
212
+ attention_backend = "_native_flash"
213
+ generator = torch.Generator().manual_seed(0)
214
+
215
+ # Warmup for compilation
216
+ with attention_provider(attention_backend, mesh=cp_mesh, convert_to_fp32=True, rotate_method="alltoall"):
217
+ latents = pipe(
218
+ prompt_embeds=prompt_embeds,
219
+ negative_prompt_embeds=negative_prompt_embeds,
220
+ height=480,
221
+ width=832,
222
+ num_frames=81,
223
+ num_inference_steps=2,
224
+ guidance_scale=5.0,
225
+ output_type="latent",
226
+ generator=generator,
227
+ ).frames[0]
228
+
229
+ # Inference
230
+ with attention_provider(attention_backend, mesh=cp_mesh, convert_to_fp32=True, rotate_method="allgather"):
231
+ latents = pipe(
232
+ prompt_embeds=prompt_embeds,
233
+ negative_prompt_embeds=negative_prompt_embeds,
234
+ height=480,
235
+ width=832,
236
+ num_frames=81,
237
+ guidance_scale=5.0,
238
+ num_inference_steps=30,
239
+ output_type="latent",
240
+ generator=generator,
241
+ ).frames[0]
242
+
243
+ with torch.no_grad():
244
+ latents = latents.to(pipe.vae.dtype)
245
+ latents_mean = (
246
+ torch.tensor(pipe.vae.config.latents_mean)
247
+ .view(1, pipe.vae.config.z_dim, 1, 1, 1)
248
+ .to(latents.device, latents.dtype)
249
+ )
250
+ latents_std = 1.0 / torch.tensor(pipe.vae.config.latents_std).view(1, pipe.vae.config.z_dim, 1, 1, 1).to(
251
+ latents.device, latents.dtype
252
+ )
253
+ latents = latents / latents_std + latents_mean
254
+ video = pipe.vae.decode(latents, return_dict=False)[0]
255
+ video = pipe.video_processor.postprocess_video(video, output_type="pil")[0]
256
+
257
+ if rank == 0:
258
+ export_to_video(video, "output.mp4", fps=16)
259
+ finally:
260
+ dist.destroy_process_group()
261
+ ```
262
+
263
+ </details>
docs/{finetrainers/documentation_models_cogvideox.md → finetrainers-src-codebase/docs/models/cogvideox.md} RENAMED
@@ -20,9 +20,9 @@ On Windows, you will have to modify the script to a compatible format to run it.
20
 
21
  CogVideoX has multiple checkpoints as one can note [here](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce). The following checkpoints were tested with `finetrainers` and are known to be working:
22
 
23
- * [THUDM/CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b)
24
- * [THUDM/CogVideoX-5B](https://huggingface.co/THUDM/CogVideoX-5B)
25
- * [THUDM/CogVideoX1.5-5B](https://huggingface.co/THUDM/CogVideoX1.5-5B)
26
 
27
  ## Inference
28
 
@@ -45,6 +45,6 @@ export_to_video(video, "output.mp4")
45
 
46
  You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
47
 
48
- * [CogVideoX in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox)
49
- * [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
50
- * [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
 
20
 
21
  CogVideoX has multiple checkpoints as one can note [here](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce). The following checkpoints were tested with `finetrainers` and are known to be working:
22
 
23
+ - [THUDM/CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b)
24
+ - [THUDM/CogVideoX-5B](https://huggingface.co/THUDM/CogVideoX-5B)
25
+ - [THUDM/CogVideoX1.5-5B](https://huggingface.co/THUDM/CogVideoX1.5-5B)
26
 
27
  ## Inference
28
 
 
45
 
46
  You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
47
 
48
+ - [CogVideoX in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox)
49
+ - [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
50
+ - [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
docs/finetrainers-src-codebase/docs/models/cogview4.md ADDED
@@ -0,0 +1,94 @@
1
+ # CogView4
2
+
3
+ ## Training
4
+
5
+ For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
6
+
7
+ Examples available:
8
+ - [Raider White Tarot cards style](../../examples/training/sft/cogview4/raider_white_tarot/)
9
+ - [Omni Edit Control LoRA](../../examples/training/control/cogview4/omni_edit/)
10
+ - [Canny Control LoRA](../../examples/training/control/cogview4/canny/)
11
+
12
+ To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):
13
+
14
+ ```bash
15
+ chmod +x ./examples/training/sft/cogview4/raider_white_tarot/train.sh
16
+ ./examples/training/sft/cogview4/raider_white_tarot/train.sh
17
+ ```
18
+
19
+ On Windows, you will have to modify the script to a compatible format to run it. [TODO(aryan): improve instructions for Windows]
20
+
21
+ ## Supported checkpoints
22
+
23
+ The following checkpoints were tested with `finetrainers` and are known to be working:
24
+
25
+ - [THUDM/CogView4-6B](https://huggingface.co/THUDM/CogView4-6B)
26
+
27
+ ## Inference
28
+
29
+ Assuming your LoRA is saved and pushed to the HF Hub, and named `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:
30
+
31
+ ```diff
32
+ import torch
33
+ from diffusers import CogView4Pipeline
34
+ from diffusers.utils import export_to_video
35
+
36
+ pipe = CogView4Pipeline.from_pretrained(
37
+ "THUDM/CogView4-6B", torch_dtype=torch.bfloat16
38
+ ).to("cuda")
39
+ + pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="cogview4-lora")
40
+ + pipe.set_adapters(["cogview4-lora"], [0.9])
41
+
42
+ image = pipe("<my-awesome-prompt>").images[0]
43
+ image.save("output.png")
44
+ ```
45
+
46
+ To use trained Control LoRAs, the following can be used for inference (ideally, native support for this should be requested in Diffusers):
47
+
48
+ <details>
49
+ <summary> Control Lora inference </summary>
50
+
51
+ ```python
52
+ import torch
53
+ from diffusers import CogView4Pipeline
54
+ from diffusers.utils import load_image
55
+ from finetrainers.models.utils import _expand_linear_with_zeroed_weights
56
+ from finetrainers.patches import load_lora_weights
57
+ from finetrainers.patches.dependencies.diffusers.control import control_channel_concat
58
+
59
+ dtype = torch.bfloat16
60
+ device = torch.device("cuda")
61
+ generator = torch.Generator().manual_seed(0)
62
+
63
+ pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=dtype)
64
+
65
+ in_channels = pipe.transformer.config.in_channels
66
+ patch_channels = pipe.transformer.patch_embed.proj.in_features
67
+ pipe.transformer.patch_embed.proj = _expand_linear_with_zeroed_weights(pipe.transformer.patch_embed.proj, new_in_features=2 * patch_channels)
68
+
69
+ load_lora_weights(pipe, "/raid/aryan/cogview4-control-lora", "cogview4-lora")
70
+ pipe.to(device)
71
+
72
+ prompt = "Make the image look like it's from an ancient Egyptian mural."
73
+ control_image = load_image("examples/training/control/cogview4/omni_edit/validation_dataset/0.png")
74
+ height, width = 1024, 1024
75
+
76
+ with torch.no_grad():
77
+ latents = pipe.prepare_latents(1, in_channels, height, width, dtype, device, generator)
78
+ control_image = pipe.image_processor.preprocess(control_image, height=height, width=width)
79
+ control_image = control_image.to(device=device, dtype=dtype)
80
+ control_latents = pipe.vae.encode(control_image).latent_dist.sample(generator=generator)
81
+ control_latents = (control_latents - pipe.vae.config.shift_factor) * pipe.vae.config.scaling_factor
82
+
83
+ with control_channel_concat(pipe.transformer, ["hidden_states"], [control_latents], dims=[1]):
84
+ image = pipe(prompt, latents=latents, num_inference_steps=30, generator=generator).images[0]
85
+
86
+ image.save("output.png")
87
+ ```
88
+ </details>
89
+
90
+ You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
91
+
92
+ - [CogView4 in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogview4)
93
+ - [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
94
+ - [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
docs/finetrainers-src-codebase/docs/models/flux.md ADDED
@@ -0,0 +1,53 @@
1
+ # Flux
2
+
3
+ ## Training
4
+
5
+ For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
6
+
7
+ Examples available:
8
+ - [Raider White Tarot cards style](../../examples/training/sft/flux_dev/raider_white_tarot/)
9
+
10
+ To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):
11
+
12
+ ```bash
13
+ chmod +x ./examples/training/sft/flux_dev/raider_white_tarot/train.sh
14
+ ./examples/training/sft/flux_dev/raider_white_tarot/train.sh
15
+ ```
16
+
17
+ On Windows, you will have to modify the script to a compatible format to run it. [TODO(aryan): improve instructions for Windows]
18
+
19
+ > [!NOTE]
20
+ > Currently, only FLUX.1-dev is supported for training. It is a guidance-distilled model which directly predicts the outputs of its teacher model when the teacher is run with CFG. To match the output distribution of the distilled model with that of the teacher model, a guidance scale of 1.0 is hardcoded into the codebase. Other values may work too, but this is experimental.
21
+ > FLUX.1-schnell is not supported for training yet. It is a timestep-distilled model. Matching its output distribution for training is significantly more difficult.
22
+
23
+ ## Supported checkpoints
24
+
25
+ The following checkpoints were tested with `finetrainers` and are known to be working:
26
+
27
+ - [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev)
28
+ - [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell)
29
+
30
+ ## Inference
31
+
32
+ Assuming your LoRA is saved and pushed to the HF Hub, and named `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:
33
+
34
+ ```diff
35
+ import torch
36
+ from diffusers import FluxPipeline
37
+
38
+ pipe = FluxPipeline.from_pretrained(
39
+ "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
40
+ ).to("cuda")
41
+ + pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="flux-lora")
42
+ + pipe.set_adapters(["flux-lora"], [0.9])
43
+
44
+ # Make sure to set guidance_scale to 0.0 when running inference with FLUX.1-schnell or derivative models
45
+ image = pipe("<my-awesome-prompt>").images[0]
46
+ image.save("output.png")
47
+ ```
48
+
49
+ You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
50
+
51
+ - [Flux in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux)
52
+ - [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
53
+ - [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
docs/{finetrainers/documentation_models_hunyuan_video.md → finetrainers-src-codebase/docs/models/hunyuan_video.md} RENAMED
@@ -50,6 +50,6 @@ export_to_video(output, "output.mp4", fps=15)
50
 
51
  You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
52
 
53
- * [Hunyuan-Video in Diffusers](https://huggingface.co/docs/diffusers/main/api/pipelines/hunyuan_video)
54
- * [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
55
- * [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
 
50
 
51
  You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
52
 
53
+ - [Hunyuan-Video in Diffusers](https://huggingface.co/docs/diffusers/main/api/pipelines/hunyuan_video)
54
+ - [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
55
+ - [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
docs/{finetrainers/documentation_models_ltx_video.md → finetrainers-src-codebase/docs/models/ltx_video.md} RENAMED
@@ -37,6 +37,6 @@ export_to_video(video, "output.mp4", fps=8)
37
 
38
  You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
39
 
40
- * [LTX-Video in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video)
41
- * [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
42
- * [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
 
37
 
38
  You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
39
 
40
+ - [LTX-Video in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video)
41
+ - [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
42
+ - [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
docs/{finetrainers/documentation_models_optimization.md → finetrainers-src-codebase/docs/models/optimization.md} RENAMED
File without changes
docs/{finetrainers/documentation_models_wan.md → finetrainers-src-codebase/docs/models/wan.md} RENAMED
@@ -18,6 +18,16 @@ chmod +x ./examples/training/sft/wan/crush_smol_lora/train.sh
18
 
19
  On Windows, you will have to modify the script to a compatible format to run it. [TODO(aryan): improve instructions for Windows]
20
 
21
  ## Inference
22
 
23
  Assuming your LoRA is saved and pushed to the HF Hub, and named `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:
@@ -102,4 +112,4 @@ You can refer to the following guides to know more about the model pipeline and
102
 
103
  - [Wan in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan)
104
  - [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
105
- - [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
 
18
 
19
  On Windows, you will have to modify the script to a compatible format to run it. [TODO(aryan): improve instructions for Windows]
20
 
21
+ ## Supported checkpoints
22
+
23
+ Wan has multiple checkpoints as one can find [here](https://huggingface.co/Wan-AI). The following checkpoints were tested with `finetrainers` and are known to be working:
24
+
25
+ - [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers)
26
+ - [Wan-AI/Wan2.1-T2V-14B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers)
27
+ - [Wan-AI/Wan2.1-I2V-14B-480P-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P-Diffusers)
28
+ - [Wan-AI/Wan2.1-I2V-14B-720P-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P-Diffusers)
29
+ - [Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers](https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers)
30
+
31
  ## Inference
32
 
33
  Assuming your LoRA is saved and pushed to the HF Hub, and named `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:
 
112
 
113
  - [Wan in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan)
114
  - [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
115
+ - [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
docs/{finetrainers/documentation_optimizers.md → finetrainers-src-codebase/docs/optimizer.md} RENAMED
File without changes
docs/{finetrainers/documentation_parallel_processing_README.md → finetrainers-src-codebase/docs/parallel/README.md} RENAMED
@@ -14,11 +14,12 @@ As an experiment for comparing performance of different training backends, Finet
14
 
15
  ## Support matrix
16
 
17
- There are various algorithms for parallel training. Currently, we only support:
18
  - [DDP](https://pytorch.org/docs/stable/notes/ddp.html)
19
  - [FSDP2](https://pytorch.org/docs/stable/fsdp.html)
20
  - [HSDP](https://pytorch.org/docs/stable/fsdp.html)
21
- - [TP](https://pytorch.org/docs/stable/distributed.tensor.parallel.html)
 
22
 
23
  ## Training
24
 
@@ -28,7 +29,7 @@ The following parameters are relevant for launching training:
28
  - `pp_degree`: The degree of pipeline parallelism. Currently unsupported.
29
  - `dp_degree`: The degree of data parallelism/replicas. Defaults to `1`.
30
  - `dp_shards`: The number of shards for data parallelism. Defaults to `1`.
31
- - `cp_degree`: The degree of context parallelism. Currently unsupported.
32
  - `tp_degree`: The degree of tensor parallelism.
33
 
34
  For launching training with the Pytorch DTensor backend, use the following:
@@ -57,3 +58,7 @@ accelerate launch --config_file accelerate_configs/uncompiled_4.yaml --gpu_ids 0
57
  # Multi-node - Nx8 GPUs available
58
  # TODO(aryan): Add slurm script
59
  ```
 
 
 
 
 
14
 
15
  ## Support matrix
16
 
17
+ Currently supported parallelizations include:
18
  - [DDP](https://pytorch.org/docs/stable/notes/ddp.html)
19
  - [FSDP2](https://pytorch.org/docs/stable/fsdp.html)
20
  - [HSDP](https://pytorch.org/docs/stable/fsdp.html)
21
+ - [CP](https://docs.pytorch.org/tutorials/prototype/context_parallel.html)
22
+ <!-- - [TP](https://pytorch.org/docs/stable/distributed.tensor.parallel.html) -->
23
 
24
  ## Training
25
 
 
29
  - `pp_degree`: The degree of pipeline parallelism. Currently unsupported.
30
  - `dp_degree`: The degree of data parallelism/replicas. Defaults to `1`.
31
  - `dp_shards`: The number of shards for data parallelism. Defaults to `1`.
32
+ - `cp_degree`: The degree of context parallelism.
33
  - `tp_degree`: The degree of tensor parallelism.
34
 
35
  For launching training with the Pytorch DTensor backend, use the following:
 
58
  # Multi-node - Nx8 GPUs available
59
  # TODO(aryan): Add slurm script
60
  ```
61
+
62
+ ## Inference
63
+
64
+ For inference-only use cases, example implementations can be found in the [examples/inference/](../../examples/inference/) directory.
docs/{finetrainers/documentation_trainers_control_trainer.md → finetrainers-src-codebase/docs/trainer/control_trainer.md} RENAMED
File without changes
docs/{finetrainers/documentation_trainers_sft_trainer.md → finetrainers-src-codebase/docs/trainer/sft_trainer.md} RENAMED
File without changes
docs/finetrainers-src-codebase/examples/_legacy/training/README.md ADDED
@@ -0,0 +1,459 @@
1
+ # CogVideoX Factory 🧪
2
+
3
+ [Read in Chinese](./README_zh.md)
4
+
5
+ Fine-tune the Cog family of video models for custom video generation under 24GB of GPU memory ⚡️📼
6
+
7
+ <table align="center">
8
+ <tr>
9
+ <td align="center"><video src="https://github.com/user-attachments/assets/aad07161-87cb-4784-9e6b-16d06581e3e5">Your browser does not support the video tag.</video></td>
10
+ </tr>
11
+ </table>
12
+
13
+ **Update 29 Nov 2024**: We have added an experimental memory-efficient trainer for Mochi-1. Check it out [here](https://github.com/a-r-r-o-w/cogvideox-factory/blob/main/training/mochi-1/)!
14
+
15
+ ## Quickstart
16
+
17
+ Clone the repository and make sure the requirements are installed: `pip install -r requirements.txt` and install diffusers from source by `pip install git+https://github.com/huggingface/diffusers`.
18
+
19
+ Then download a dataset:
20
+
21
+ ```bash
22
+ # install `huggingface_hub`
23
+ huggingface-cli download \
24
+ --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
25
+ --local-dir video-dataset-disney
26
+ ```
27
+
28
+ Then launch LoRA fine-tuning for text-to-video (modify the different hyperparameters, dataset root, and other configuration options as per your choice):
29
+
30
+ ```bash
31
+ # For LoRA finetuning of the text-to-video CogVideoX models
32
+ ./train_text_to_video_lora.sh
33
+
34
+ # For full finetuning of the text-to-video CogVideoX models
35
+ ./train_text_to_video_sft.sh
36
+
37
+ # For LoRA finetuning of the image-to-video CogVideoX models
38
+ ./train_image_to_video_lora.sh
39
+ ```
40
+
41
+ Assuming your LoRA is saved and pushed to the HF Hub, and named `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:
42
+
43
+ ```diff
44
+ import torch
45
+ from diffusers import CogVideoXPipeline
46
+ from diffusers.utils import export_to_video
47
+
48
+ pipe = CogVideoXPipeline.from_pretrained(
49
+ "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
50
+ ).to("cuda")
51
+ + pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="cogvideox-lora")
52
+ + pipe.set_adapters(["cogvideox-lora"], [1.0])
53
+
54
+ video = pipe("<my-awesome-prompt>").frames[0]
55
+ export_to_video(video, "output.mp4", fps=8)
56
+ ```
57
+
58
+ For Image-to-Video LoRAs trained with multiresolution videos, one must also add the following lines (see [this](https://github.com/a-r-r-o-w/cogvideox-factory/issues/26) Issue for more details):
59
+
60
+ ```python
61
+ from diffusers import CogVideoXImageToVideoPipeline
62
+
63
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
64
+ "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
65
+ ).to("cuda")
66
+
67
+ # ...
68
+
69
+ del pipe.transformer.patch_embed.pos_embedding
70
+ pipe.transformer.patch_embed.use_learned_positional_embeddings = False
71
+ pipe.transformer.config.use_learned_positional_embeddings = False
72
+ ```
73
+
74
+ You can also check if your LoRA is correctly mounted [here](tests/test_lora_inference.py).
75
+
76
+ Below, we provide additional sections detailing more of the options explored in this repository. They all attempt to make fine-tuning of video models as accessible as possible by reducing memory requirements.
77
+
78
+ ## Prepare Dataset and Training
79
+
80
+ Before starting the training, please check whether the dataset has been prepared according to the [dataset specifications](assets/dataset.md). We provide training scripts suitable for text-to-video and image-to-video generation, compatible with the [CogVideoX model family](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce). Training can be started using the `train*.sh` scripts, depending on the task you want to train. Let's take LoRA fine-tuning for text-to-video as an example.
81
+
82
+ - Configure environment variables as per your choice:
83
+
84
+ ```bash
85
+ export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
86
+ export TORCHDYNAMO_VERBOSE=1
87
+ export WANDB_MODE="offline"
88
+ export NCCL_P2P_DISABLE=1
89
+ export TORCH_NCCL_ENABLE_MONITORING=0
90
+ ```
91
+
92
+ - Configure which GPUs to use for training: `GPU_IDS="0,1"`
93
+
94
+ - Choose hyperparameters for training. Let's try to do a sweep on learning rate and optimizer type as an example:
95
+
96
+ ```bash
97
+ LEARNING_RATES=("1e-4" "1e-3")
98
+ LR_SCHEDULES=("cosine_with_restarts")
99
+ OPTIMIZERS=("adamw" "adam")
100
+ MAX_TRAIN_STEPS=("3000")
101
+ ```
102
+
103
+ - Select which Accelerate configuration you would like to train with: `ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml"`. We provide some default configurations in the `accelerate_configs/` directory - single GPU uncompiled/compiled, 2x GPU DDP, DeepSpeed, etc. You can create your own config files with custom settings using `accelerate config --config_file my_config.yaml`.
104
+
105
+ - Specify the absolute paths and columns/files for captions and videos.
106
+
107
+ ```bash
108
+ DATA_ROOT="/path/to/my/datasets/video-dataset-disney"
109
+ CAPTION_COLUMN="prompt.txt"
110
+ VIDEO_COLUMN="videos.txt"
111
+ ```
112
+
113
+ - Launch experiments sweeping different hyperparameters:
114
+ ```bash
115
+ for learning_rate in "${LEARNING_RATES[@]}"; do
116
+ for lr_schedule in "${LR_SCHEDULES[@]}"; do
117
+ for optimizer in "${OPTIMIZERS[@]}"; do
118
+ for steps in "${MAX_TRAIN_STEPS[@]}"; do
119
+ output_dir="/path/to/my/models/cogvideox-lora__optimizer_${optimizer}__steps_${steps}__lr-schedule_${lr_schedule}__learning-rate_${learning_rate}/"
120
+
121
+ cmd="accelerate launch --config_file $ACCELERATE_CONFIG_FILE --gpu_ids $GPU_IDS training/cogvideox/cogvideox_text_to_video_lora.py \
122
+ --pretrained_model_name_or_path THUDM/CogVideoX-5b \
123
+ --data_root $DATA_ROOT \
124
+ --caption_column $CAPTION_COLUMN \
125
+ --video_column $VIDEO_COLUMN \
126
+ --id_token BW_STYLE \
127
+ --height_buckets 480 \
128
+ --width_buckets 720 \
129
+ --frame_buckets 49 \
130
+ --dataloader_num_workers 8 \
131
+ --pin_memory \
132
+ --validation_prompt \"BW_STYLE A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::BW_STYLE A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance\" \
133
+ --validation_prompt_separator ::: \
134
+ --num_validation_videos 1 \
135
+ --validation_epochs 10 \
136
+ --seed 42 \
137
+ --rank 128 \
138
+ --lora_alpha 128 \
139
+ --mixed_precision bf16 \
140
+ --output_dir $output_dir \
141
+ --max_num_frames 49 \
142
+ --train_batch_size 1 \
143
+ --max_train_steps $steps \
144
+ --checkpointing_steps 1000 \
145
+ --gradient_accumulation_steps 1 \
146
+ --gradient_checkpointing \
147
+ --learning_rate $learning_rate \
148
+ --lr_scheduler $lr_schedule \
149
+ --lr_warmup_steps 400 \
150
+ --lr_num_cycles 1 \
151
+ --enable_slicing \
152
+ --enable_tiling \
153
+ --optimizer $optimizer \
154
+ --beta1 0.9 \
155
+ --beta2 0.95 \
156
+ --weight_decay 0.001 \
157
+ --max_grad_norm 1.0 \
158
+ --allow_tf32 \
159
+ --report_to wandb \
160
+ --nccl_timeout 1800"
161
+
162
+ echo "Running command: $cmd"
163
+ eval $cmd
164
+ echo -ne "-------------------- Finished executing script --------------------\n\n"
165
+ done
166
+ done
167
+ done
168
+ done
169
+ ```
170
+
171
+ To understand what the different parameters mean, you could either take a look at the [args](./training/args.py) file or run the training script with `--help`.
172
+
173
+ Note: Training scripts are untested on MPS, so performance and memory requirements can differ widely compared to the CUDA reports below.
174
+
175
+ ## Memory requirements
176
+
177
+ <table align="center">
178
+ <tr>
179
+ <td align="center" colspan="2"><b>CogVideoX LoRA Finetuning</b></td>
180
+ </tr>
181
+ <tr>
182
+ <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-2b">THUDM/CogVideoX-2b</a></td>
183
+ <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-5b">THUDM/CogVideoX-5b</a></td>
184
+ </tr>
185
+ <tr>
186
+ <td align="center"><img src="../assets/lora_2b.png" /></td>
187
+ <td align="center"><img src="../assets/lora_5b.png" /></td>
188
+ </tr>
189
+
190
+ <tr>
191
+ <td align="center" colspan="2"><b>CogVideoX Full Finetuning</b></td>
192
+ </tr>
193
+ <tr>
194
+ <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-2b">THUDM/CogVideoX-2b</a></td>
195
+ <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-5b">THUDM/CogVideoX-5b</a></td>
196
+ </tr>
197
+ <tr>
198
+ <td align="center"><img src="../assets/sft_2b.png" /></td>
199
+ <td align="center"><img src="../assets/sft_5b.png" /></td>
200
+ </tr>
201
+ </table>
202
+
203
+ Supported and verified memory optimizations for training include:
204
+
205
+ - `CPUOffloadOptimizer` from [`torchao`](https://github.com/pytorch/ao). You can read about its capabilities and limitations [here](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload). In short, it allows you to use the CPU for storing trainable parameters and gradients. This results in the optimizer step happening on the CPU, which requires a fast CPU optimizer, such as `torch.optim.AdamW(fused=True)` or applying `torch.compile` on the optimizer step. Additionally, it is recommended not to `torch.compile` your model for training. Gradient clipping and accumulation are not supported yet either. A minimal construction sketch is shown after this list.
206
+ - Low-bit optimizers from [`bitsandbytes`](https://huggingface.co/docs/bitsandbytes/optimizers). TODO: test and make the [`torchao`](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim) ones work.
207
+ - DeepSpeed Zero2: Since we rely on `accelerate`, follow [this guide](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed) to configure your `accelerate` installation to enable training with DeepSpeed Zero2 optimizations.
208
+
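A minimal construction sketch for the CPU-offloaded optimizer is shown below (illustrative only; the import path and keyword arguments follow the `torchao` documentation linked above and may differ across versions):

```python
# Illustrative sketch: CPU-offloaded optimizer step with a fused AdamW (per torchao docs).
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,       # optimizer class whose step runs on the CPU
    offload_gradients=True,  # also keep gradients on the CPU (no gradient accumulation)
    fused=True,              # use the fast fused AdamW kernels for the optimizer step
)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```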
209
+ > [!IMPORTANT]
210
+ > The memory requirements are reported after running the `training/prepare_dataset.py`, which converts the videos and captions to latents and embeddings. During training, we directly load the latents and embeddings, and do not require the VAE or the T5 text encoder. However, if you perform validation/testing, these must be loaded and increase the amount of required memory. Not performing validation/testing saves a significant amount of memory, which can be used to focus solely on training if you're on smaller VRAM GPUs.
211
+ >
212
+ > If you choose to run validation/testing, you can save some memory on lower VRAM GPUs by specifying `--enable_model_cpu_offload`.
213
+
214
+ ### LoRA finetuning
215
+
216
+ > [!NOTE]
217
+ > The memory requirements for image-to-video lora finetuning are similar to that of text-to-video on `THUDM/CogVideoX-5b`, so it hasn't been reported explicitly.
218
+ >
219
+ > Additionally, to prepare test images for I2V finetuning, you could either generate them on-the-fly by modifying the script, or extract some frames from your training data using:
220
+ > `ffmpeg -i input.mp4 -frames:v 1 frame.png`,
221
+ > or provide a URL to a valid and accessible image.
222
+
223
+ <details>
224
+ <summary> AdamW </summary>
225
+
226
+ **Note:** Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.
227
+
228
+ With `train_batch_size = 1`:
229
+
230
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
231
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
232
+ | THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.764 | 46.918 | 24.234 |
233
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.121 | 24.234 |
234
+ | THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.314 | 47.469 | 24.469 |
235
+ | THUDM/CogVideoX-2b | 64 | True | 13.036 | 13.035 | 21.564 | 24.500 |
236
+ | THUDM/CogVideoX-2b | 256 | False | 13.095 | 45.826 | 48.990 | 25.543 |
237
+ | THUDM/CogVideoX-2b | 256 | True | 13.094 | 13.095 | 22.344 | 25.537 |
238
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.746 | 38.123 |
239
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 30.338 | 38.738 |
240
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 22.119 | 31.939 | 41.537 |
241
+
242
+ With `train_batch_size = 4`:
243
+
244
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
245
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
246
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.803 | 21.814 | 24.322 |
247
+ | THUDM/CogVideoX-2b | 64 | True | 13.035 | 22.254 | 22.254 | 24.572 |
248
+ | THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.033 | 25.574 |
249
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.492 | 46.492 | 38.197 |
250
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 47.805 | 47.805 | 39.365 |
251
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 41.008 |
252
+
253
+ </details>
254
+
255
+ <details>
256
+ <summary> AdamW (8-bit bitsandbytes) </summary>
257
+
258
+ **Note:** Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.
259
+
260
+ With `train_batch_size = 1`:
261
+
262
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
263
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
264
+ | THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.732 | 46.887 | 24.195 |
265
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.430 | 24.195 |
266
+ | THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.004 | 47.158 | 24.369 |
267
+ | THUDM/CogVideoX-2b | 64 | True | 13.035 | 13.035 | 21.297 | 24.357 |
268
+ | THUDM/CogVideoX-2b | 256 | False | 13.035 | 45.291 | 48.455 | 24.836 |
269
+ | THUDM/CogVideoX-2b | 256 | True | 13.035 | 13.035 | 21.625 | 24.869 |
270
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.602 | 38.049 |
271
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 29.359 | 38.520 |
272
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 21.352 | 30.727 | 39.596 |
273
+
274
+ With `train_batch_size = 4`:
275
+
276
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
277
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
278
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.734 | 21.775 | 24.281 |
279
+ | THUDM/CogVideoX-2b | 64 | True | 13.036 | 21.941 | 21.941 | 24.445 |
280
+ | THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.266 | 24.943 |
281
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.320 | 46.326 | 38.104 |
282
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 46.820 | 46.820 | 38.588 |
283
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.920 | 47.980 | 40.002 |
284
+
285
+ </details>
286
+
287
+ <details>
288
+ <summary> AdamW + CPUOffloadOptimizer (with gradient offloading) </summary>
289
+
290
+ **Note:** Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.
291
+
292
+ With `train_batch_size = 1`:
293
+
294
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
295
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
296
+ | THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.705 | 46.859 | 24.180 |
297
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.395 | 24.180 |
298
+ | THUDM/CogVideoX-2b | 64 | False | 13.035 | 43.916 | 47.070 | 24.234 |
299
+ | THUDM/CogVideoX-2b | 64 | True | 13.035 | 13.035 | 20.887 | 24.266 |
300
+ | THUDM/CogVideoX-2b | 256 | False | 13.095 | 44.947 | 48.111 | 24.607 |
301
+ | THUDM/CogVideoX-2b | 256 | True | 13.095 | 13.095 | 21.391 | 24.635 |
302
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.533 | 38.002 |
303
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.006 | 29.107 | 38.785 |
304
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 20.771 | 30.078 | 39.559 |
305
+
306
+ With `train_batch_size = 4`:
307
+
308
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
309
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
310
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.709 | 21.762 | 24.254 |
311
+ | THUDM/CogVideoX-2b | 64 | True | 13.035 | 21.844 | 21.855 | 24.338 |
312
+ | THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.031 | 24.709 |
313
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.262 | 46.297 | 38.400 |
314
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 46.561 | 46.574 | 38.840 |
315
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 39.623 |
316
+
317
+ </details>
318
+
319
+ <details>
320
+ <summary> DeepSpeed (AdamW + CPU/Parameter offloading) </summary>
321
+
322
+ **Note:** Results are reported with `gradient_checkpointing` enabled, running on a 2x A100.
323
+
324
+ With `train_batch_size = 1`:
325
+
326
+ | model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
327
+ |:------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
328
+ | THUDM/CogVideoX-2b | 13.141 | 13.141 | 21.070 | 24.602 |
329
+ | THUDM/CogVideoX-5b | 20.170 | 20.170 | 28.662 | 38.957 |
330
+
331
+ With `train_batch_size = 4`:
332
+
333
+ | model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
334
+ |:------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
335
+ | THUDM/CogVideoX-2b | 13.141 | 19.854 | 20.836 | 24.709 |
336
+ | THUDM/CogVideoX-5b | 20.170 | 40.635 | 40.699 | 39.027 |
337
+
338
+ </details>
339
+
340
+ ### Full finetuning
341
+
342
+ > [!NOTE]
343
+ > The memory requirements for image-to-video full finetuning are similar to those of text-to-video full finetuning of `THUDM/CogVideoX-5b`, so they have not been reported explicitly.
344
+ >
345
+ > Additionally, to prepare test images for I2V finetuning, you could either generate them on-the-fly by modifying the script, or extract some frames from your training data using:
346
+ > `ffmpeg -i input.mp4 -frames:v 1 frame.png`,
347
+ > or provide a URL to a valid and accessible image.
348
+
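+ As a concrete illustration of the frame-extraction route, here is a minimal sketch (not part of the training scripts; the dataset layout and paths are assumptions) that pulls one frame per training video with `ffmpeg` and collects the paths to pass via `--validation_images`:
+
+ ```
+ # Minimal sketch: extract a first frame per training video and collect the paths for
+ # `--validation_images`. Assumes `ffmpeg` is on PATH and a hypothetical dataset layout;
+ # `diffusers.utils.load_image` also accepts URLs if you prefer remote images.
+ import subprocess
+ from pathlib import Path
+
+ from diffusers.utils import load_image
+
+ video_dir = Path("video-dataset-disney/videos")  # hypothetical location of training videos
+ frame_dir = Path("validation_frames")
+ frame_dir.mkdir(exist_ok=True)
+
+ frame_paths = []
+ for video in sorted(video_dir.glob("*.mp4")):
+     frame = frame_dir / f"{video.stem}.png"
+     subprocess.run(["ffmpeg", "-y", "-i", str(video), "-frames:v", "1", str(frame)], check=True)
+     frame_paths.append(str(frame))
+
+ load_image(frame_paths[0])      # sanity-check that the extracted frame loads
+ print(":::".join(frame_paths))  # the scripts separate multiple validation entries with ':::'
+ ```
+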
349
+ > [!NOTE]
350
+ > Trying to run full finetuning without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.
351
+
352
+ <details>
353
+ <summary> AdamW </summary>
354
+
355
+ With `train_batch_size = 1`:
356
+
357
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
358
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
359
+ | THUDM/CogVideoX-2b | True | 16.396 | 33.934 | 43.848 | 37.520 |
360
+ | THUDM/CogVideoX-5b | True | 30.061 | OOM | OOM | OOM |
361
+
362
+ With `train_batch_size = 4`:
363
+
364
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
365
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
366
+ | THUDM/CogVideoX-2b | True | 16.396 | 38.281 | 48.341 | 37.544 |
367
+ | THUDM/CogVideoX-5b | True | 30.061 | OOM | OOM | OOM |
368
+
369
+ </details>
370
+
371
+ <details>
372
+ <summary> AdamW (8-bit bitsandbytes) </summary>
373
+
374
+ With `train_batch_size = 1`:
375
+
376
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
377
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
378
+ | THUDM/CogVideoX-2b | True | 16.396 | 16.447 | 27.555 | 27.156 |
379
+ | THUDM/CogVideoX-5b | True | 30.061 | 52.826 | 58.570 | 49.541 |
380
+
381
+ With `train_batch_size = 4`:
382
+
383
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
384
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
385
+ | THUDM/CogVideoX-2b | True | 16.396 | 27.930 | 27.990 | 27.326 |
386
+ | THUDM/CogVideoX-5b | True | 16.396 | 66.648 | 66.705 | 48.828 |
387
+
388
+ </details>
389
+
390
+ <details>
391
+ <summary> AdamW + CPUOffloadOptimizer (with gradient offloading) </summary>
392
+
393
+ With `train_batch_size = 1`:
394
+
395
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
396
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
397
+ | THUDM/CogVideoX-2b | True | 16.396 | 16.396 | 26.100 | 23.832 |
398
+ | THUDM/CogVideoX-5b | True | 30.061 | 39.359 | 48.307 | 37.947 |
399
+
400
+ With `train_batch_size = 4`:
401
+
402
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
403
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
404
+ | THUDM/CogVideoX-2b | True | 16.396 | 27.916 | 27.975 | 23.936 |
405
+ | THUDM/CogVideoX-5b | True | 30.061 | 66.607 | 66.668 | 38.061 |
406
+
407
+ </details>
408
+
409
+ <details>
410
+ <summary> DeepSpeed (AdamW + CPU/Parameter offloading) </summary>
411
+
412
+ **Note:** Results are reported with `gradient_checkpointing` enabled, running on a 2x A100.
413
+
414
+ With `train_batch_size = 1`:
415
+
416
+ | model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
417
+ |:------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
418
+ | THUDM/CogVideoX-2b | 13.111 | 13.111 | 20.328 | 23.867 |
419
+ | THUDM/CogVideoX-5b | 19.762 | 19.998 | 27.697 | 38.018 |
420
+
421
+ With `train_batch_size = 4`:
422
+
423
+ | model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
424
+ |:------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
425
+ | THUDM/CogVideoX-2b | 13.111 | 21.188 | 21.254 | 23.869 |
426
+ | THUDM/CogVideoX-5b | 19.762 | 43.465 | 43.531 | 38.082 |
427
+
428
+ </details>
429
+
430
+ > [!NOTE]
431
+ > - `memory_after_validation` is indicative of the peak memory required for training. This is because, apart from the activations, parameters, and gradients stored for training, you also need to load the VAE and text encoder into memory and spend some memory on performing inference. To reduce the total memory required for training, you can choose to skip validation/testing in the training script.
432
+ >
433
+ > - `memory_before_validation` is the true indicator of the peak memory required for training if you choose to not perform validation/testing.
434
+
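+ For reference, peak-memory figures like the ones reported above can be collected with PyTorch's CUDA memory counters; the sketch below is illustrative only and not the exact helper used by the training scripts:
+
+ ```
+ # Minimal sketch of how checkpoints such as `memory_before_validation` /
+ # `memory_after_validation` can be measured. The helper name is illustrative.
+ import torch
+
+ def log_peak_memory(tag: str, device: torch.device) -> None:
+     gib = 1024 ** 3
+     allocated = torch.cuda.max_memory_allocated(device) / gib
+     reserved = torch.cuda.max_memory_reserved(device) / gib
+     print(f"{tag}: peak allocated {allocated:.3f} GiB, peak reserved {reserved:.3f} GiB")
+
+ device = torch.device("cuda")
+ torch.cuda.reset_peak_memory_stats(device)
+ # ... training / validation happens here ...
+ log_peak_memory("memory_after_validation", device)
+ ```
+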
435
+ <table align="center">
436
+ <tr>
437
+ <td align="center"><a href="https://www.youtube.com/watch?v=UvRl4ansfCg"> Slaying OOMs with PyTorch</a></td>
438
+ </tr>
439
+ <tr>
440
+ <td align="center"><img src="assets/slaying-ooms.png" style="width: 480px; height: 480px;"></td>
441
+ </tr>
442
+ </table>
443
+
444
+ ## TODOs
445
+
446
+ - [x] Make scripts compatible with DDP
447
+ - [ ] Make scripts compatible with FSDP
448
+ - [x] Make scripts compatible with DeepSpeed
449
+ - [ ] vLLM-powered captioning script
450
+ - [x] Multi-resolution/frame support in `prepare_dataset.py`
451
+ - [ ] Analyzing traces for potential speedups and removing as many syncs as possible
452
+ - [x] Test scripts with memory-efficient optimizer from bitsandbytes
453
+ - [x] Test scripts with CPUOffloadOptimizer, etc.
454
+ - [ ] Test scripts with torchao quantization, and low bit memory optimizers (Currently errors with AdamW (8/4-bit torchao))
455
+ - [ ] Test scripts with AdamW (8-bit bitsandbytes) + CPUOffloadOptimizer (with gradient offloading) (Currently errors out)
456
+ - [ ] [Sage Attention](https://github.com/thu-ml/SageAttention) (work with the authors to support backward pass, and optimize for A100)
457
+
458
+ > [!IMPORTANT]
459
+ > Since our goal is to make the scripts as memory-friendly as possible, we don't guarantee multi-GPU training.
docs/finetrainers-src-codebase/examples/_legacy/training/README_zh.md ADDED
@@ -0,0 +1,455 @@
1
+ # CogVideoX Factory 🧪
2
+
3
+ [Read in English](./README.md)
4
+
5
+ 在 24GB GPU 内存下对 Cog 系列视频模型进行微调以实现自定义视频生成,支持多分辨率 ⚡️📼
6
+
7
+ <table align="center">
8
+ <tr>
9
+ <td align="center"><video src="https://github.com/user-attachments/assets/aad07161-87cb-4784-9e6b-16d06581e3e5">您的浏览器不支持视频标签。</video></td>
10
+ </tr>
11
+ </table>
12
+
13
+ ## 快速开始
14
+
15
+ 克隆此仓库并确保安装了相关依赖:`pip install -r requirements.txt`。
16
+
17
+ 接着下载数据集:
18
+
19
+ ```
20
+ # 安装 `huggingface_hub`
21
+ huggingface-cli download --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset --local-dir video-dataset-disney
22
+ ```
23
+
24
+ 然后启动 LoRA 微调进行文本到视频的生成(根据您的选择修改不同的超参数、数据集根目录以及其他配置选项):
25
+
26
+ ```
27
+ # 对 CogVideoX 模型进行文本到视频的 LoRA 微调
28
+ ./train_text_to_video_lora.sh
29
+
30
+ # 对 CogVideoX 模型进行文本到视频的完整微调
31
+ ./train_text_to_video_sft.sh
32
+
33
+ # 对 CogVideoX 模型进行图像到视频的 LoRA 微调
34
+ ./train_image_to_video_lora.sh
35
+ ```
36
+
37
+ 假设您的 LoRA 已保存并推送到 HF Hub,并命名为 `my-awesome-name/my-awesome-lora`,现在我们可以使用微调模型进行推理:
38
+
39
+ ```
40
+ import torch
41
+ from diffusers import CogVideoXPipeline
42
+ from diffusers.utils import export_to_video
43
+
44
+ pipe = CogVideoXPipeline.from_pretrained(
45
+ "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
46
+ ).to("cuda")
47
+ + pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="cogvideox-lora")
48
+ + pipe.set_adapters(["cogvideox-lora"], [1.0])
49
+
50
+ video = pipe("<my-awesome-prompt>").frames[0]
51
+ export_to_video(video, "output.mp4", fps=8)
52
+ ```
53
+
54
+ 你也可以在[这里](tests/test_lora_inference.py)来检查你的Lora是否正常挂载。
55
+
56
+ **注意:** 对于图像到视频的微调,您必须从 [这个分支](https://github.com/huggingface/diffusers/pull/9482) 安装
57
+ diffusers(该分支为 CogVideoX 的图像到视频添加了 LoRA 加载支持)直到它被合并。
58
+
59
+ 以下我们提供了更多探索此仓库选项的额外部分。所有这些都旨在尽可能降低内存需求,使视频模型的微调变得更易于访问。
60
+
61
+ ## 训练
62
+
63
+ 在开始训练之前,请你检查是否按照[数据集规范](assets/dataset_zh.md)准备好了数据集。 我们提供了适用于文本到视频 (text-to-video) 和图像到视频 (image-to-video) 生成的训练脚本,兼容 [CogVideoX 模型家族](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce)。训练可以通过 `train*.sh` 脚本启动,具体取决于你想要训练的任务。让我们以文本到视频的 LoRA 微调为例。
64
+
65
+ - 根据你的需求配置环境变量:
66
+
67
+ ```
68
+ export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
69
+ export TORCHDYNAMO_VERBOSE=1
70
+ export WANDB_MODE="offline"
71
+ export NCCL_P2P_DISABLE=1
72
+ export TORCH_NCCL_ENABLE_MONITORING=0
73
+ ```
74
+
75
+ - 配置用于训练的 GPU:`GPU_IDS="0,1"`
76
+
77
+ - 选择训练的超参数。让我们以学习率和优化器类型的超参数遍历为例:
78
+
79
+ ```
80
+ LEARNING_RATES=("1e-4" "1e-3")
81
+ LR_SCHEDULES=("cosine_with_restarts")
82
+ OPTIMIZERS=("adamw" "adam")
83
+ MAX_TRAIN_STEPS=("3000")
84
+ ```
85
+
86
+ - 选择用于训练的 Accelerate 配置文件:`ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml"`
87
+ 。我们在 `accelerate_configs/` 目录中提供了一些默认配置 - 单 GPU 编译/未编译、2x GPU DDP、DeepSpeed
88
+ 等。你也可以使用 `accelerate config --config_file my_config.yaml` 自定义配置文件。
89
+
90
+ - 指定字幕和视频的绝对路径以及列/文件。
91
+
92
+ ```
93
+ DATA_ROOT="/path/to/my/datasets/video-dataset-disney"
94
+ CAPTION_COLUMN="prompt.txt"
95
+ VIDEO_COLUMN="videos.txt"
96
+ ```
97
+
98
+ - 运行实验,遍历不同的超参数:
99
+ ```
100
+ for learning_rate in "${LEARNING_RATES[@]}"; do
101
+ for lr_schedule in "${LR_SCHEDULES[@]}"; do
102
+ for optimizer in "${OPTIMIZERS[@]}"; do
103
+ for steps in "${MAX_TRAIN_STEPS[@]}"; do
104
+ output_dir="/path/to/my/models/cogvideox-lora__optimizer_${optimizer}__steps_${steps}__lr-schedule_${lr_schedule}__learning-rate_${learning_rate}/"
105
+
106
+ cmd="accelerate launch --config_file $ACCELERATE_CONFIG_FILE --gpu_ids $GPU_IDS training/cogvideox_text_to_video_lora.py \
107
+ --pretrained_model_name_or_path THUDM/CogVideoX-5b \
108
+ --data_root $DATA_ROOT \
109
+ --caption_column $CAPTION_COLUMN \
110
+ --video_column $VIDEO_COLUMN \
111
+ --id_token BW_STYLE \
112
+ --height_buckets 480 \
113
+ --width_buckets 720 \
114
+ --frame_buckets 49 \
115
+ --dataloader_num_workers 8 \
116
+ --pin_memory \
117
+ --validation_prompt \"BW_STYLE A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::BW_STYLE A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance\" \
118
+ --validation_prompt_separator ::: \
119
+ --num_validation_videos 1 \
120
+ --validation_epochs 10 \
121
+ --seed 42 \
122
+ --rank 128 \
123
+ --lora_alpha 128 \
124
+ --mixed_precision bf16 \
125
+ --output_dir $output_dir \
126
+ --max_num_frames 49 \
127
+ --train_batch_size 1 \
128
+ --max_train_steps $steps \
129
+ --checkpointing_steps 1000 \
130
+ --gradient_accumulation_steps 1 \
131
+ --gradient_checkpointing \
132
+ --learning_rate $learning_rate \
133
+ --lr_scheduler $lr_schedule \
134
+ --lr_warmup_steps 400 \
135
+ --lr_num_cycles 1 \
136
+ --enable_slicing \
137
+ --enable_tiling \
138
+ --optimizer $optimizer \
139
+ --beta1 0.9 \
140
+ --beta2 0.95 \
141
+ --weight_decay 0.001 \
142
+ --max_grad_norm 1.0 \
143
+ --allow_tf32 \
144
+ --report_to wandb \
145
+ --nccl_timeout 1800"
146
+
147
+ echo "Running command: $cmd"
148
+ eval $cmd
149
+ echo -ne "-------------------- Finished executing script --------------------\n\n"
150
+ done
151
+ done
152
+ done
153
+ done
154
+ ```
155
+
156
+ 要了解不同参数的含义,你可以查看 [args](./training/args.py) 文件,或者使用 `--help` 运行训练脚本。
157
+
158
+ 注意:训练脚本尚未在 MPS 上测试,因此性能和内存要求可能与下面的 CUDA 报告差异很大。
159
+
160
+ ## 内存需求
161
+
162
+ <table align="center">
163
+ <tr>
164
+ <td align="center" colspan="2"><b>CogVideoX LoRA 微调</b></td>
165
+ </tr>
166
+ <tr>
167
+ <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-2b">THUDM/CogVideoX-2b</a></td>
168
+ <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-5b">THUDM/CogVideoX-5b</a></td>
169
+ </tr>
170
+ <tr>
171
+ <td align="center"><img src="assets/lora_2b.png" /></td>
172
+ <td align="center"><img src="assets/lora_5b.png" /></td>
173
+ </tr>
174
+
175
+ <tr>
176
+ <td align="center" colspan="2"><b>CogVideoX 全量微调</b></td>
177
+ </tr>
178
+ <tr>
179
+ <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-2b">THUDM/CogVideoX-2b</a></td>
180
+ <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-5b">THUDM/CogVideoX-5b</a></td>
181
+ </tr>
182
+ <tr>
183
+ <td align="center"><img src="assets/sft_2b.png" /></td>
184
+ <td align="center"><img src="assets/sft_5b.png" /></td>
185
+ </tr>
186
+ </table>
187
+
188
+ 支持和验证的训练内存优化包括:
189
+
190
+ - `CPUOffloadOptimizer` 来自 [`torchao`](https://github.com/pytorch/ao)
191
+ 。你可以在[这里](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload)
192
+ 阅读它的能力和局限性。简而言之,它允许你将可训练参数和梯度存储在 CPU 中,从而在 CPU 上进行优化步骤。这需要快速的 CPU
193
+ 优化器,如 `torch.optim.AdamW(fused=True)`,或者在优化步骤中应用 `torch.compile`
194
+ 。此外,建议不要在训练时对模型应用 `torch.compile`。梯度裁剪和累积目前还不支持。
195
+ - 来自 [`bitsandbytes`](https://huggingface.co/docs/bitsandbytes/optimizers)
196
+ 的低位优化器。TODO:测试并使 [`torchao`](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim) 能正常工作。
197
+ - DeepSpeed Zero2:由于我们依赖 `accelerate`
198
+ ,请按照[此指南](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed) 配置 `accelerate` 以启用 DeepSpeed
199
+ Zero2 优化训练。
200
+
201
+ > [!IMPORTANT]
202
+ > 内存需求是运行 `training/prepare_dataset.py` 后报告的,该脚本将视频和字幕转换为潜在向量和嵌入。在训练期间,我们直接加载这些潜在向量和嵌入,不需要VAE或T5文本编码器。然而,如果执行验证/测试,则必须加载这些模块,并且会增加所需内存的数量。不进行验证/测试可以节省大量内存,这些内存可以用于较小显存的GPU上专注于训练。
205
+ >
206
+ > 如果选择运行验证/测试,可以通过指定 `--enable_model_cpu_offload` 来为较低显存的GPU节省一些内存。
207
+
208
+ ### LoRA微调
209
+
210
+ > [!IMPORTANT]
211
+ > 图像到视频的LoRA微调的内存需求与文本到视频上的 `THUDM/CogVideoX-5b` 类似,因此没有明确报告。
212
+ >
213
+ > 此外,为了准备I2V微调的测试图像,可以通过修改脚本实时生成它们,或使用以下命令从训练数据中提取一些帧:
214
+ > `ffmpeg -i input.mp4 -frames:v 1 frame.png`,
215
+ > 或提供一个有效且可访问的图像URL。
216
+
217
+ <details>
218
+ <summary> AdamW </summary>
219
+
220
+ **注意:** 尝试在没有梯度检查点的情况下运行 CogVideoX-5b 即使在 A100(80 GB)上也会导致 OOM(内存不足)错误,因此内存需求尚未列出。
221
+
222
+ 当 `train_batch_size = 1` 时:
223
+
224
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
225
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
226
+ | THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.764 | 46.918 | 24.234 |
227
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.121 | 24.234 |
228
+ | THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.314 | 47.469 | 24.469 |
229
+ | THUDM/CogVideoX-2b | 64 | True | 13.036 | 13.035 | 21.564 | 24.500 |
230
+ | THUDM/CogVideoX-2b | 256 | False | 13.095 | 45.826 | 48.990 | 25.543 |
231
+ | THUDM/CogVideoX-2b | 256 | True | 13.094 | 13.095 | 22.344 | 25.537 |
232
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.746 | 38.123 |
233
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 30.338 | 38.738 |
234
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 22.119 | 31.939 | 41.537 |
235
+
236
+ 当 `train_batch_size = 4` 时:
237
+
238
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
239
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
240
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.803 | 21.814 | 24.322 |
241
+ | THUDM/CogVideoX-2b | 64 | True | 13.035 | 22.254 | 22.254 | 24.572 |
242
+ | THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.033 | 25.574 |
243
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.492 | 46.492 | 38.197 |
244
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 47.805 | 47.805 | 39.365 |
245
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 41.008 |
246
+
247
+ </details>
248
+
249
+ <details>
250
+ <summary> AdamW (8-bit bitsandbytes) </summary>
251
+
252
+ **注意:** 在没有启用梯度检查点的情况下,尝试运行 CogVideoX-5b 模型即使在 A100(80 GB)上也会导致 OOM(内存不足),因此未列出内存测量数据。
253
+
254
+ 当 `train_batch_size = 1` 时:
255
+
256
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
257
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
258
+ | THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.732 | 46.887 | 24.195 |
259
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.430 | 24.195 |
260
+ | THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.004 | 47.158 | 24.369 |
261
+ | THUDM/CogVideoX-2b | 64 | True | 13.035 | 13.035 | 21.297 | 24.357 |
262
+ | THUDM/CogVideoX-2b | 256 | False | 13.035 | 45.291 | 48.455 | 24.836 |
263
+ | THUDM/CogVideoX-2b | 256 | True | 13.035 | 13.035 | 21.625 | 24.869 |
264
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.602 | 38.049 |
265
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 29.359 | 38.520 |
266
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 21.352 | 30.727 | 39.596 |
267
+
268
+ 当 `train_batch_size = 4` 时:
269
+
270
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
271
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
272
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.734 | 21.775 | 24.281 |
273
+ | THUDM/CogVideoX-2b | 64 | True | 13.036 | 21.941 | 21.941 | 24.445 |
274
+ | THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.266 | 24.943 |
275
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.320 | 46.326 | 38.104 |
276
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 46.820 | 46.820 | 38.588 |
277
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.920 | 47.980 | 40.002 |
278
+
279
+ </details>
280
+
281
+ <details>
282
+ <summary> AdamW + CPUOffloadOptimizer (with gradient offloading) </summary>
283
+
284
+ **注意:** 在没有启用梯度检查点的情况下,尝试运行 CogVideoX-5b 模型即使在 A100(80 GB)上也会导致 OOM(内存不足),因此未列出内存测量数据。
285
+
286
+ 当 `train_batch_size = 1` 时:
287
+
288
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
289
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
290
+ | THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.705 | 46.859 | 24.180 |
291
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.395 | 24.180 |
292
+ | THUDM/CogVideoX-2b | 64 | False | 13.035 | 43.916 | 47.070 | 24.234 |
293
+ | THUDM/CogVideoX-2b | 64 | True | 13.035 | 13.035 | 20.887 | 24.266 |
294
+ | THUDM/CogVideoX-2b | 256 | False | 13.095 | 44.947 | 48.111 | 24.607 |
295
+ | THUDM/CogVideoX-2b | 256 | True | 13.095 | 13.095 | 21.391 | 24.635 |
296
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.533 | 38.002 |
297
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.006 | 29.107 | 38.785 |
298
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 20.771 | 30.078 | 39.559 |
299
+
300
+ 当 `train_batch_size = 4` 时:
301
+
302
+ | model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
303
+ |:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
304
+ | THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.709 | 21.762 | 24.254 |
305
+ | THUDM/CogVideoX-2b | 64 | True | 13.035 | 21.844 | 21.855 | 24.338 |
306
+ | THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.031 | 24.709 |
307
+ | THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.262 | 46.297 | 38.400 |
308
+ | THUDM/CogVideoX-5b | 64 | True | 20.006 | 46.561 | 46.574 | 38.840 |
309
+ | THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 39.623 |
310
+
311
+ </details>
312
+
313
+ <details>
314
+ <summary> DeepSpeed (AdamW + CPU/Parameter offloading) </summary>
315
+
316
+ **注意:** 结果是在启用梯度检查点的情况下,使用 2x A100 运行时记录的。
317
+
318
+ 当 `train_batch_size = 1` 时:
319
+
320
+ | model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
321
+ |:------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
322
+ | THUDM/CogVideoX-2b | 13.141 | 13.141 | 21.070 | 24.602 |
323
+ | THUDM/CogVideoX-5b | 20.170 | 20.170 | 28.662 | 38.957 |
324
+
325
+ 当 `train_batch_size = 4` 时:
326
+
327
+ | model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
328
+ |:------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
329
+ | THUDM/CogVideoX-2b | 13.141 | 19.854 | 20.836 | 24.709 |
330
+ | THUDM/CogVideoX-5b | 20.170 | 40.635 | 40.699 | 39.027 |
331
+
332
+ </details>
333
+
334
+ ### Full finetuning
335
+
336
+ > [!NOTE]
337
+ > 图像到视频的完整微调内存需求与 `THUDM/CogVideoX-5b` 的文本到视频微调相似,因此没有单独列出。
338
+ >
339
+ > 此外,要准备用于 I2V 微调的测试图像,你可以通过修改脚本实时生成图像,或者从你的训练数据中提取一些帧:
340
+ > `ffmpeg -i input.mp4 -frames:v 1 frame.png`,
341
+ > 或提供一个有效且可访问的图像 URL。
342
+
343
+ > [!NOTE]
344
+ > 在没有使用梯度检查点的情况下运行完整微调,即使是在 A100(80GB)上,也会出现 OOM(内存不足)错误,因此未列出内存需求。
345
+
346
+ <details>
347
+ <summary> AdamW </summary>
348
+
349
+ 当 `train_batch_size = 1` 时:
350
+
351
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
352
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
353
+ | THUDM/CogVideoX-2b | True | 16.396 | 33.934 | 43.848 | 37.520 |
354
+ | THUDM/CogVideoX-5b | True | 30.061 | OOM | OOM | OOM |
355
+
356
+ 当 `train_batch_size = 4` 时:
357
+
358
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
359
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
360
+ | THUDM/CogVideoX-2b | True | 16.396 | 38.281 | 48.341 | 37.544 |
361
+ | THUDM/CogVideoX-5b | True | 30.061 | OOM | OOM | OOM |
362
+
363
+ </details>
364
+
365
+ <details>
366
+ <summary> AdamW (8-bit 量化) </summary>
367
+
368
+ 当 `train_batch_size = 1` 时:
369
+
370
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
371
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
372
+ | THUDM/CogVideoX-2b | True | 16.396 | 16.447 | 27.555 | 27.156 |
373
+ | THUDM/CogVideoX-5b | True | 30.061 | 52.826 | 58.570 | 49.541 |
374
+
375
+ 当 `train_batch_size = 4` 时:
376
+
377
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
378
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
379
+ | THUDM/CogVideoX-2b | True | 16.396 | 27.930 | 27.990 | 27.326 |
380
+ | THUDM/CogVideoX-5b | True | 16.396 | 66.648 | 66.705 | 48.828 |
381
+
382
+ </details>
383
+
384
+ <details>
385
+ <summary> AdamW + CPUOffloadOptimizer(带有梯度卸载)</summary>
386
+
387
+ 当 `train_batch_size = 1` 时:
388
+
389
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
390
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
391
+ | THUDM/CogVideoX-2b | True | 16.396 | 16.396 | 26.100 | 23.832 |
392
+ | THUDM/CogVideoX-5b | True | 30.061 | 39.359 | 48.307 | 37.947 |
393
+
394
+ 当 `train_batch_size = 4` 时:
395
+
396
+ | model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
397
+ |:------------------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
398
+ | THUDM/CogVideoX-2b | True | 16.396 | 27.916 | 27.975 | 23.936 |
399
+ | THUDM/CogVideoX-5b | True | 30.061 | 66.607 | 66.668 | 38.061 |
400
+
401
+ </details>
402
+
403
+ <details>
404
+ <summary> DeepSpeed(AdamW + CPU/参数卸载) </summary>
405
+
406
+ **注意:** 结果是在启用 `gradient_checkpointing`(梯度检查点)功能,并在 2 台 A100 显卡上运行时报告的。
407
+
408
+ 当 `train_batch_size = 1` 时:
409
+
410
+ | model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
411
+ |:------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
412
+ | THUDM/CogVideoX-2b | 13.111 | 13.111 | 20.328 | 23.867 |
413
+ | THUDM/CogVideoX-5b | 19.762 | 19.998 | 27.697 | 38.018 |
414
+
415
+ 当 `train_batch_size = 4` 时:
416
+
417
+ | model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
418
+ |:------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
419
+ | THUDM/CogVideoX-2b | 13.111 | 21.188 | 21.254 | 23.869 |
420
+ | THUDM/CogVideoX-5b | 19.762 | 43.465 | 43.531 | 38.082 |
421
+
422
+ </details>
423
+
424
+ > [!NOTE]
425
+ > - `memory_after_validation`(验证后内存) 表示训练所需的峰值内存。这是因为除了存储训练过程中需要的激活、参数和梯度之外,还需要加载
426
+ VAE 和文本编码器到内存中,并且执行推理操作也会消耗一定内存。为了减少训练所需的总内存,您可以选择在训练脚本中不执行验证/测试。
427
+ >
428
+ > - 如果选择不进行验证/测试,`memory_before_validation`(验证前内存) 才是训练所需内存的真实指示器。
429
+
430
+ <table align="center">
431
+ <tr>
432
+ <td align="center"><a href="https://www.youtube.com/watch?v=UvRl4ansfCg"> Slaying OOMs with PyTorch</a></td>
433
+ </tr>
434
+ <tr>
435
+ <td align="center"><img src="assets/slaying-ooms.png" style="width: 480px; height: 480px;"></td>
436
+ </tr>
437
+ </table>
438
+
439
+ ## 待办事项
440
+
441
+ - [x] 使脚本兼容 DDP
442
+ - [ ] 使脚本兼容 FSDP
443
+ - [x] 使脚本兼容 DeepSpeed
444
+ - [ ] 基于 vLLM 的字幕脚本
445
+ - [x] 在 `prepare_dataset.py` 中支持多分辨率/帧数
446
+ - [ ] 分析性能瓶颈并尽可能减少同步操作
447
+ - [ ] 支持 QLoRA(优先),以及其他高使用率的 LoRA 方法
448
+ - [x] 使用 bitsandbytes 的节省内存优化器测试脚本
449
+ - [x] 使用 CPUOffloadOptimizer 等测试脚本
450
+ - [ ] 使用 torchao 量化和低位内存优化器测试脚本(目前在 AdamW(8/4-bit torchao)上报错)
451
+ - [ ] 使用 AdamW(8-bit bitsandbytes)+ CPUOffloadOptimizer(带有梯度卸载)的测试脚本(目前报错)
452
+ - [ ] [Sage Attention](https://github.com/thu-ml/SageAttention) (与作者合作支持反向传播,并针对 A100 进行优化)
453
+
454
+ > [!IMPORTANT]
455
+ > 由于我们的目标是使脚本尽可能节省内存,因此我们不保证支持多 GPU 训练。
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/__init__.py ADDED
File without changes
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/args.py ADDED
@@ -0,0 +1,484 @@
1
+ import argparse
2
+
3
+
4
+ def _get_model_args(parser: argparse.ArgumentParser) -> None:
5
+ parser.add_argument(
6
+ "--pretrained_model_name_or_path",
7
+ type=str,
8
+ default=None,
9
+ required=True,
10
+ help="Path to pretrained model or model identifier from huggingface.co/models.",
11
+ )
12
+ parser.add_argument(
13
+ "--revision",
14
+ type=str,
15
+ default=None,
16
+ required=False,
17
+ help="Revision of pretrained model identifier from huggingface.co/models.",
18
+ )
19
+ parser.add_argument(
20
+ "--variant",
21
+ type=str,
22
+ default=None,
23
+ help="Variant of the model files of the pretrained model identifier from huggingface.co/models, e.g. fp16",
24
+ )
25
+ parser.add_argument(
26
+ "--cache_dir",
27
+ type=str,
28
+ default=None,
29
+ help="The directory where the downloaded models and datasets will be stored.",
30
+ )
31
+
32
+
33
+ def _get_dataset_args(parser: argparse.ArgumentParser) -> None:
34
+ parser.add_argument(
35
+ "--data_root",
36
+ type=str,
37
+ default=None,
38
+ help=("A folder containing the training data."),
39
+ )
40
+ parser.add_argument(
41
+ "--dataset_file",
42
+ type=str,
43
+ default=None,
44
+ help=("Path to a CSV file if loading prompts/video paths using this format."),
45
+ )
46
+ parser.add_argument(
47
+ "--video_column",
48
+ type=str,
49
+ default="video",
50
+ help="The column of the dataset containing videos. Or, the name of the file in `--data_root` folder containing the line-separated path to video data.",
51
+ )
52
+ parser.add_argument(
53
+ "--caption_column",
54
+ type=str,
55
+ default="text",
56
+ help="The column of the dataset containing the instance prompt for each video. Or, the name of the file in `--data_root` folder containing the line-separated instance prompts.",
57
+ )
58
+ parser.add_argument(
59
+ "--id_token",
60
+ type=str,
61
+ default=None,
62
+ help="Identifier token appended to the start of each prompt if provided.",
63
+ )
64
+ parser.add_argument(
65
+ "--height_buckets",
66
+ nargs="+",
67
+ type=int,
68
+ default=[256, 320, 384, 480, 512, 576, 720, 768, 960, 1024, 1280, 1536],
69
+ )
70
+ parser.add_argument(
71
+ "--width_buckets",
72
+ nargs="+",
73
+ type=int,
74
+ default=[256, 320, 384, 480, 512, 576, 720, 768, 960, 1024, 1280, 1536],
75
+ )
76
+ parser.add_argument(
77
+ "--frame_buckets",
78
+ nargs="+",
79
+ type=int,
80
+ default=[49],
81
+ help="CogVideoX1.5 needs ((num_frames - 1) // vae_scale_factor_temporal + 1) % patch_size_t == 0 to hold, e.g. num_frames = 53.",
82
+ )
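+ # Worked example for the constraint in the help text above (descriptive comment; assumes
+ # vae_scale_factor_temporal = 4 and patch_size_t = 2 for CogVideoX1.5):
+ #   num_frames = 53 -> ((53 - 1) // 4 + 1) = 14 and 14 % 2 == 0, so 53 is a valid bucket
+ #   num_frames = 49 -> ((49 - 1) // 4 + 1) = 13 and 13 % 2 == 1, so 49 does not satisfy it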
83
+ parser.add_argument(
84
+ "--load_tensors",
85
+ action="store_true",
86
+ help="Whether to use a pre-encoded tensor dataset of latents and prompt embeddings instead of videos and text prompts. The expected format is that saved by running the `prepare_dataset.py` script.",
87
+ )
88
+ parser.add_argument(
89
+ "--random_flip",
90
+ type=float,
91
+ default=None,
92
+ help="If random horizontal flip augmentation is to be used, this should be the flip probability.",
93
+ )
94
+ parser.add_argument(
95
+ "--dataloader_num_workers",
96
+ type=int,
97
+ default=0,
98
+ help="Number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process.",
99
+ )
100
+ parser.add_argument(
101
+ "--pin_memory",
102
+ action="store_true",
103
+ help="Whether or not to use the pinned memory setting in pytorch dataloader.",
104
+ )
105
+
106
+
107
+ def _get_validation_args(parser: argparse.ArgumentParser) -> None:
108
+ parser.add_argument(
109
+ "--validation_prompt",
110
+ type=str,
111
+ default=None,
112
+ help="One or more prompt(s) used during validation to verify that the model is learning. Multiple validation prompts should be separated by the '--validation_prompt_separator' string.",
113
+ )
114
+ parser.add_argument(
115
+ "--validation_images",
116
+ type=str,
117
+ default=None,
118
+ help="One or more image path(s)/URL(s) used during validation to verify that the model is learning. Multiple validation paths should be separated by the '--validation_prompt_separator' string. These should correspond to the order of the validation prompts.",
119
+ )
120
+ parser.add_argument(
121
+ "--validation_prompt_separator",
122
+ type=str,
123
+ default=":::",
124
+ help="String that separates multiple validation prompts",
125
+ )
126
+ parser.add_argument(
127
+ "--num_validation_videos",
128
+ type=int,
129
+ default=1,
130
+ help="Number of videos that should be generated during validation per `validation_prompt`.",
131
+ )
132
+ parser.add_argument(
133
+ "--validation_epochs",
134
+ type=int,
135
+ default=None,
136
+ help="Run validation every X training epochs. Validation consists of running the validation prompt `args.num_validation_videos` times.",
137
+ )
138
+ parser.add_argument(
139
+ "--validation_steps",
140
+ type=int,
141
+ default=None,
142
+ help="Run validation every X training steps. Validation consists of running the validation prompt `args.num_validation_videos` times.",
143
+ )
144
+ parser.add_argument(
145
+ "--guidance_scale",
146
+ type=float,
147
+ default=6,
148
+ help="The guidance scale to use while sampling validation videos.",
149
+ )
150
+ parser.add_argument(
151
+ "--use_dynamic_cfg",
152
+ action="store_true",
153
+ default=False,
154
+ help="Whether or not to use the default cosine dynamic guidance schedule when sampling validation videos.",
155
+ )
156
+ parser.add_argument(
157
+ "--enable_model_cpu_offload",
158
+ action="store_true",
159
+ default=False,
160
+ help="Whether or not to enable model-wise CPU offloading when performing validation/testing to save memory.",
161
+ )
162
+
163
+
164
+ def _get_training_args(parser: argparse.ArgumentParser) -> None:
165
+ parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.")
166
+ parser.add_argument("--rank", type=int, default=64, help="The rank for LoRA matrices.")
167
+ parser.add_argument(
168
+ "--lora_alpha",
169
+ type=int,
170
+ default=64,
171
+ help="The lora_alpha to compute scaling factor (lora_alpha / rank) for LoRA matrices.",
172
+ )
173
+ parser.add_argument(
174
+ "--mixed_precision",
175
+ type=str,
176
+ default=None,
177
+ choices=["no", "fp16", "bf16"],
178
+ help=(
179
+ "Whether to use mixed precision. Choose between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 and an Nvidia Ampere GPU. "
180
+ "Defaults to the value of the accelerate config of the current system or the flag passed with the `accelerate launch` command. Use this "
181
+ "argument to override the accelerate config."
182
+ ),
183
+ )
184
+ parser.add_argument(
185
+ "--output_dir",
186
+ type=str,
187
+ default="cogvideox-sft",
188
+ help="The output directory where the model predictions and checkpoints will be written.",
189
+ )
190
+ parser.add_argument(
191
+ "--height",
192
+ type=int,
193
+ default=480,
194
+ help="All input videos are resized to this height.",
195
+ )
196
+ parser.add_argument(
197
+ "--width",
198
+ type=int,
199
+ default=720,
200
+ help="All input videos are resized to this width.",
201
+ )
202
+ parser.add_argument(
203
+ "--video_reshape_mode",
204
+ type=str,
205
+ default=None,
206
+ help="All input videos are reshaped to this mode. Choose between ['center', 'random', 'none']",
207
+ )
208
+ parser.add_argument("--fps", type=int, default=8, help="All input videos will be used at this FPS.")
209
+ parser.add_argument(
210
+ "--max_num_frames",
211
+ type=int,
212
+ default=49,
213
+ help="All input videos will be truncated to these many frames.",
214
+ )
215
+ parser.add_argument(
216
+ "--skip_frames_start",
217
+ type=int,
218
+ default=0,
219
+ help="Number of frames to skip from the beginning of each input video. Useful if training data contains intro sequences.",
220
+ )
221
+ parser.add_argument(
222
+ "--skip_frames_end",
223
+ type=int,
224
+ default=0,
225
+ help="Number of frames to skip from the end of each input video. Useful if training data contains outro sequences.",
226
+ )
227
+ parser.add_argument(
228
+ "--train_batch_size",
229
+ type=int,
230
+ default=4,
231
+ help="Batch size (per device) for the training dataloader.",
232
+ )
233
+ parser.add_argument("--num_train_epochs", type=int, default=1)
234
+ parser.add_argument(
235
+ "--max_train_steps",
236
+ type=int,
237
+ default=None,
238
+ help="Total number of training steps to perform. If provided, overrides `--num_train_epochs`.",
239
+ )
240
+ parser.add_argument(
241
+ "--checkpointing_steps",
242
+ type=int,
243
+ default=500,
244
+ help=(
245
+ "Save a checkpoint of the training state every X updates. These checkpoints can be used both as final"
246
+ " checkpoints in case they are better than the last checkpoint, and are also suitable for resuming"
247
+ " training using `--resume_from_checkpoint`."
248
+ ),
249
+ )
250
+ parser.add_argument(
251
+ "--checkpoints_total_limit",
252
+ type=int,
253
+ default=None,
254
+ help=("Max number of checkpoints to store."),
255
+ )
256
+ parser.add_argument(
257
+ "--resume_from_checkpoint",
258
+ type=str,
259
+ default=None,
260
+ help=(
261
+ "Whether training should be resumed from a previous checkpoint. Use a path saved by"
262
+ ' `--checkpointing_steps`, or `"latest"` to automatically select the last available checkpoint.'
263
+ ),
264
+ )
265
+ parser.add_argument(
266
+ "--gradient_accumulation_steps",
267
+ type=int,
268
+ default=1,
269
+ help="Number of updates steps to accumulate before performing a backward/update pass.",
270
+ )
271
+ parser.add_argument(
272
+ "--gradient_checkpointing",
273
+ action="store_true",
274
+ help="Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass.",
275
+ )
276
+ parser.add_argument(
277
+ "--learning_rate",
278
+ type=float,
279
+ default=1e-4,
280
+ help="Initial learning rate (after the potential warmup period) to use.",
281
+ )
282
+ parser.add_argument(
283
+ "--scale_lr",
284
+ action="store_true",
285
+ default=False,
286
+ help="Scale the learning rate by the number of GPUs, gradient accumulation steps, and batch size.",
287
+ )
288
+ parser.add_argument(
289
+ "--lr_scheduler",
290
+ type=str,
291
+ default="constant",
292
+ help=(
293
+ 'The scheduler type to use. Choose between ["linear", "cosine", "cosine_with_restarts", "polynomial",'
294
+ ' "constant", "constant_with_warmup"]'
295
+ ),
296
+ )
297
+ parser.add_argument(
298
+ "--lr_warmup_steps",
299
+ type=int,
300
+ default=500,
301
+ help="Number of steps for the warmup in the lr scheduler.",
302
+ )
303
+ parser.add_argument(
304
+ "--lr_num_cycles",
305
+ type=int,
306
+ default=1,
307
+ help="Number of hard resets of the lr in cosine_with_restarts scheduler.",
308
+ )
309
+ parser.add_argument(
310
+ "--lr_power",
311
+ type=float,
312
+ default=1.0,
313
+ help="Power factor of the polynomial scheduler.",
314
+ )
315
+ parser.add_argument(
316
+ "--enable_slicing",
317
+ action="store_true",
318
+ default=False,
319
+ help="Whether or not to use VAE slicing for saving memory.",
320
+ )
321
+ parser.add_argument(
322
+ "--enable_tiling",
323
+ action="store_true",
324
+ default=False,
325
+ help="Whether or not to use VAE tiling for saving memory.",
326
+ )
327
+ parser.add_argument(
328
+ "--noised_image_dropout",
329
+ type=float,
330
+ default=0.05,
331
+ help="Image condition dropout probability when finetuning image-to-video.",
332
+ )
333
+ parser.add_argument(
334
+ "--ignore_learned_positional_embeddings",
335
+ action="store_true",
336
+ default=False,
337
+ help=(
338
+ "Whether to ignore the learned positional embeddings when training CogVideoX Image-to-Video. This setting "
339
+ "should be used when performing multi-resolution training, because CogVideoX-I2V does not support it "
340
+ "otherwise. Please read the comments in https://github.com/a-r-r-o-w/cogvideox-factory/issues/26 to understand why."
341
+ ),
342
+ )
343
+
344
+
345
+ def _get_optimizer_args(parser: argparse.ArgumentParser) -> None:
346
+ parser.add_argument(
347
+ "--optimizer",
348
+ type=lambda s: s.lower(),
349
+ default="adam",
350
+ choices=["adam", "adamw", "prodigy", "came"],
351
+ help=("The optimizer type to use."),
352
+ )
353
+ parser.add_argument(
354
+ "--use_8bit",
355
+ action="store_true",
356
+ help="Whether or not to use 8-bit optimizers from `bitsandbytes` or `bitsandbytes`.",
357
+ )
358
+ parser.add_argument(
359
+ "--use_4bit",
360
+ action="store_true",
361
+ help="Whether or not to use 4-bit optimizers from `torchao`.",
362
+ )
363
+ parser.add_argument(
364
+ "--use_torchao", action="store_true", help="Whether or not to use the `torchao` backend for optimizers."
365
+ )
366
+ parser.add_argument(
367
+ "--beta1",
368
+ type=float,
369
+ default=0.9,
370
+ help="The beta1 parameter for the Adam and Prodigy optimizers.",
371
+ )
372
+ parser.add_argument(
373
+ "--beta2",
374
+ type=float,
375
+ default=0.95,
376
+ help="The beta2 parameter for the Adam and Prodigy optimizers.",
377
+ )
378
+ parser.add_argument(
379
+ "--beta3",
380
+ type=float,
381
+ default=None,
382
+ help="Coefficients for computing the Prodigy optimizer's stepsize using running averages. If set to None, uses the value of square root of beta2.",
383
+ )
384
+ parser.add_argument(
385
+ "--prodigy_decouple",
386
+ action="store_true",
387
+ help="Use AdamW style decoupled weight decay.",
388
+ )
389
+ parser.add_argument(
390
+ "--weight_decay",
391
+ type=float,
392
+ default=1e-04,
393
+ help="Weight decay to use for optimizer.",
394
+ )
395
+ parser.add_argument(
396
+ "--epsilon",
397
+ type=float,
398
+ default=1e-8,
399
+ help="Epsilon value for the Adam optimizer and Prodigy optimizers.",
400
+ )
401
+ parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
402
+ parser.add_argument(
403
+ "--prodigy_use_bias_correction",
404
+ action="store_true",
405
+ help="Turn on Adam's bias correction.",
406
+ )
407
+ parser.add_argument(
408
+ "--prodigy_safeguard_warmup",
409
+ action="store_true",
410
+ help="Remove lr from the denominator of D estimate to avoid issues during warm-up stage.",
411
+ )
412
+ parser.add_argument(
413
+ "--use_cpu_offload_optimizer",
414
+ action="store_true",
415
+ help="Whether or not to use the CPUOffloadOptimizer from TorchAO to perform optimization step and maintain parameters on the CPU.",
416
+ )
417
+ parser.add_argument(
418
+ "--offload_gradients",
419
+ action="store_true",
420
+ help="Whether or not to offload the gradients to CPU when using the CPUOffloadOptimizer from TorchAO.",
421
+ )
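+ # Illustrative sketch of how these two flags are typically consumed downstream (the actual
+ # wiring lives in the `get_optimizer` helper imported from utils by the training scripts),
+ # assuming the torchao prototype API:
+ #   from torchao.prototype.low_bit_optim import CPUOffloadOptimizer
+ #   optimizer = CPUOffloadOptimizer(
+ #       params_to_optimize, torch.optim.AdamW, fused=True,
+ #       offload_gradients=args.offload_gradients,
+ #   )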
422
+
423
+
424
+ def _get_configuration_args(parser: argparse.ArgumentParser) -> None:
425
+ parser.add_argument("--tracker_name", type=str, default=None, help="Project tracker name")
426
+ parser.add_argument(
427
+ "--push_to_hub",
428
+ action="store_true",
429
+ help="Whether or not to push the model to the Hub.",
430
+ )
431
+ parser.add_argument(
432
+ "--hub_token",
433
+ type=str,
434
+ default=None,
435
+ help="The token to use to push to the Model Hub.",
436
+ )
437
+ parser.add_argument(
438
+ "--hub_model_id",
439
+ type=str,
440
+ default=None,
441
+ help="The name of the repository to keep in sync with the local `output_dir`.",
442
+ )
443
+ parser.add_argument(
444
+ "--logging_dir",
445
+ type=str,
446
+ default="logs",
447
+ help="Directory where logs are stored.",
448
+ )
449
+ parser.add_argument(
450
+ "--allow_tf32",
451
+ action="store_true",
452
+ help=(
453
+ "Whether or not to allow TF32 on Ampere GPUs. Can be used to speed up training. For more information, see"
454
+ " https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices"
455
+ ),
456
+ )
457
+ parser.add_argument(
458
+ "--nccl_timeout",
459
+ type=int,
460
+ default=600,
461
+ help="Maximum time (in seconds) to wait before allgather and related collective operations fail in multi-GPU/multi-node training settings.",
462
+ )
463
+ parser.add_argument(
464
+ "--report_to",
465
+ type=str,
466
+ default=None,
467
+ help=(
468
+ 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`'
469
+ ' (default), `"wandb"` and `"comet_ml"`. Use `"all"` to report to all integrations.'
470
+ ),
471
+ )
472
+
473
+
474
+ def get_args():
475
+ parser = argparse.ArgumentParser(description="Simple example of a training script for CogVideoX.")
476
+
477
+ _get_model_args(parser)
478
+ _get_dataset_args(parser)
479
+ _get_training_args(parser)
480
+ _get_validation_args(parser)
481
+ _get_optimizer_args(parser)
482
+ _get_configuration_args(parser)
483
+
484
+ return parser.parse_args()
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/cogvideox_image_to_video_lora.py ADDED
@@ -0,0 +1,1016 @@
1
+ # Copyright 2024 The HuggingFace Team.
2
+ # All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import gc
17
+ import logging
18
+ import math
19
+ import os
20
+ import random
21
+ import shutil
22
+ from datetime import timedelta
23
+ from pathlib import Path
24
+ from typing import Any, Dict
25
+
26
+ import diffusers
27
+ import torch
28
+ import transformers
29
+ import wandb
30
+ from accelerate import Accelerator, DistributedType
31
+ from accelerate.logging import get_logger
32
+ from accelerate.utils import (
33
+ DistributedDataParallelKwargs,
34
+ InitProcessGroupKwargs,
35
+ ProjectConfiguration,
36
+ set_seed,
37
+ )
38
+ from diffusers import (
39
+ AutoencoderKLCogVideoX,
40
+ CogVideoXDPMScheduler,
41
+ CogVideoXImageToVideoPipeline,
42
+ CogVideoXTransformer3DModel,
43
+ )
44
+ from diffusers.models.autoencoders.vae import DiagonalGaussianDistribution
45
+ from diffusers.optimization import get_scheduler
46
+ from diffusers.training_utils import cast_training_params
47
+ from diffusers.utils import convert_unet_state_dict_to_peft, export_to_video, load_image
48
+ from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
49
+ from huggingface_hub import create_repo, upload_folder
50
+ from peft import LoraConfig, get_peft_model_state_dict, set_peft_model_state_dict
51
+ from torch.utils.data import DataLoader
52
+ from tqdm.auto import tqdm
53
+ from transformers import AutoTokenizer, T5EncoderModel
54
+
55
+
56
+ from args import get_args # isort:skip
57
+ from dataset import BucketSampler, VideoDatasetWithResizing, VideoDatasetWithResizeAndRectangleCrop # isort:skip
58
+ from text_encoder import compute_prompt_embeddings # isort:skip
59
+ from utils import (
60
+ get_gradient_norm,
61
+ get_optimizer,
62
+ prepare_rotary_positional_embeddings,
63
+ print_memory,
64
+ reset_memory,
65
+ unwrap_model,
66
+ )
67
+
68
+
69
+ logger = get_logger(__name__)
70
+
71
+
72
+ def save_model_card(
73
+ repo_id: str,
74
+ videos=None,
75
+ base_model: str = None,
76
+ validation_prompt=None,
77
+ repo_folder=None,
78
+ fps=8,
79
+ ):
80
+ widget_dict = []
81
+ if videos is not None:
82
+ for i, video in enumerate(videos):
83
+ export_to_video(video, os.path.join(repo_folder, f"final_video_{i}.mp4"), fps=fps)
84
+ widget_dict.append(
85
+ {
86
+ "text": validation_prompt if validation_prompt else " ",
87
+ "output": {"url": f"final_video_{i}.mp4"},
88
+ }
89
+ )
90
+
91
+ model_description = f"""
92
+ # CogVideoX LoRA Finetune
93
+
94
+ <Gallery />
95
+
96
+ ## Model description
97
+
98
+ This is a lora finetune of the CogVideoX model `{base_model}`.
99
+
100
+ The model was trained using [CogVideoX Factory](https://github.com/a-r-r-o-w/cogvideox-factory) - a repository containing memory-optimized training scripts for the CogVideoX family of models using [TorchAO](https://github.com/pytorch/ao) and [DeepSpeed](https://github.com/microsoft/DeepSpeed). The scripts were adopted from [CogVideoX Diffusers trainer](https://github.com/huggingface/diffusers/blob/main/examples/cogvideo/train_cogvideox_lora.py).
101
+
102
+ ## Download model
103
+
104
+ [Download LoRA]({repo_id}/tree/main) in the Files & Versions tab.
105
+
106
+ ## Usage
107
+
108
+ Requires the [🧨 Diffusers library](https://github.com/huggingface/diffusers) installed.
109
+
110
+ ```py
111
+ import torch
112
+ from diffusers import CogVideoXImageToVideoPipeline
113
+ from diffusers.utils import export_to_video, load_image
114
+
115
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16).to("cuda")
116
+ pipe.load_lora_weights("{repo_id}", weight_name="pytorch_lora_weights.safetensors", adapter_name="cogvideox-lora")
117
+
118
+ # The LoRA adapter weights are determined by what was used for training.
119
+ # In this case, we assume `--lora_alpha` is 32 and `--rank` is 64.
120
+ # It can be made lower or higher from what was used in training to decrease or amplify the effect
121
+ # of the LoRA upto a tolerance, beyond which one might notice no effect at all or overflows.
122
+ pipe.set_adapters(["cogvideox-lora"], [32 / 64])
123
+
124
+ image = load_image("/path/to/image.png")
125
+ video = pipe(image=image, prompt="{validation_prompt}", guidance_scale=6, use_dynamic_cfg=True).frames[0]
126
+ export_to_video(video, "output.mp4", fps=8)
127
+ ```
128
+
129
+ For more details, including weighting, merging and fusing LoRAs, check the [documentation](https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading_adapters) on loading LoRAs in diffusers.
130
+
131
+ ## License
132
+
133
+ Please adhere to the licensing terms as described [here](https://huggingface.co/THUDM/CogVideoX-5b-I2V/blob/main/LICENSE).
134
+ """
135
+ model_card = load_or_create_model_card(
136
+ repo_id_or_path=repo_id,
137
+ from_training=True,
138
+ license="other",
139
+ base_model=base_model,
140
+ prompt=validation_prompt,
141
+ model_description=model_description,
142
+ widget=widget_dict,
143
+ )
144
+ tags = [
145
+ "text-to-video",
146
+ "image-to-video",
147
+ "diffusers-training",
148
+ "diffusers",
149
+ "lora",
150
+ "cogvideox",
151
+ "cogvideox-diffusers",
152
+ "template:sd-lora",
153
+ ]
154
+
155
+ model_card = populate_model_card(model_card, tags=tags)
156
+ model_card.save(os.path.join(repo_folder, "README.md"))
157
+
158
+
159
+ def log_validation(
160
+ accelerator: Accelerator,
161
+ pipe: CogVideoXImageToVideoPipeline,
162
+ args: Dict[str, Any],
163
+ pipeline_args: Dict[str, Any],
164
+ is_final_validation: bool = False,
165
+ ):
166
+ logger.info(
167
+ f"Running validation... \n Generating {args.num_validation_videos} videos with prompt: {pipeline_args['prompt']}."
168
+ )
169
+
170
+ pipe = pipe.to(accelerator.device)
171
+
172
+ # run inference
173
+ generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) if args.seed else None
174
+
175
+ videos = []
176
+ for _ in range(args.num_validation_videos):
177
+ video = pipe(**pipeline_args, generator=generator, output_type="np").frames[0]
178
+ videos.append(video)
179
+
180
+ for tracker in accelerator.trackers:
181
+ phase_name = "test" if is_final_validation else "validation"
182
+ if tracker.name == "wandb":
183
+ video_filenames = []
184
+ for i, video in enumerate(videos):
185
+ prompt = (
186
+ pipeline_args["prompt"][:25]
187
+ .replace(" ", "_")
188
+ .replace(" ", "_")
189
+ .replace("'", "_")
190
+ .replace('"', "_")
191
+ .replace("/", "_")
192
+ )
193
+ filename = os.path.join(args.output_dir, f"{phase_name}_video_{i}_{prompt}.mp4")
194
+ export_to_video(video, filename, fps=8)
195
+ video_filenames.append(filename)
196
+
197
+ tracker.log(
198
+ {
199
+ phase_name: [
200
+ wandb.Video(filename, caption=f"{i}: {pipeline_args['prompt']}")
201
+ for i, filename in enumerate(video_filenames)
202
+ ]
203
+ }
204
+ )
205
+
206
+ return videos
207
+
208
+
209
+ def run_validation(
210
+ args: Dict[str, Any],
211
+ accelerator: Accelerator,
212
+ transformer,
213
+ scheduler,
214
+ model_config: Dict[str, Any],
215
+ weight_dtype: torch.dtype,
216
+ ) -> None:
217
+ accelerator.print("===== Memory before validation =====")
218
+ print_memory(accelerator.device)
219
+ torch.cuda.synchronize(accelerator.device)
220
+
221
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
222
+ args.pretrained_model_name_or_path,
223
+ transformer=unwrap_model(accelerator, transformer),
224
+ scheduler=scheduler,
225
+ revision=args.revision,
226
+ variant=args.variant,
227
+ torch_dtype=weight_dtype,
228
+ )
229
+
230
+ if args.enable_slicing:
231
+ pipe.vae.enable_slicing()
232
+ if args.enable_tiling:
233
+ pipe.vae.enable_tiling()
234
+ if args.enable_model_cpu_offload:
235
+ pipe.enable_model_cpu_offload()
236
+
237
+ validation_prompts = args.validation_prompt.split(args.validation_prompt_separator)
238
+ validation_images = args.validation_images.split(args.validation_prompt_separator)
239
+ for validation_image, validation_prompt in zip(validation_images, validation_prompts):
240
+ pipeline_args = {
241
+ "image": load_image(validation_image),
242
+ "prompt": validation_prompt,
243
+ "guidance_scale": args.guidance_scale,
244
+ "use_dynamic_cfg": args.use_dynamic_cfg,
245
+ "height": args.height,
246
+ "width": args.width,
247
+ "max_sequence_length": model_config.max_text_seq_length,
248
+ }
249
+
250
+ log_validation(
251
+ pipe=pipe,
252
+ args=args,
253
+ accelerator=accelerator,
254
+ pipeline_args=pipeline_args,
255
+ )
256
+
257
+ accelerator.print("===== Memory after validation =====")
258
+ print_memory(accelerator.device)
259
+ reset_memory(accelerator.device)
260
+
261
+ del pipe
262
+ gc.collect()
263
+ torch.cuda.empty_cache()
264
+ torch.cuda.synchronize(accelerator.device)
265
+
266
+
267
+ class CollateFunction:
268
+ def __init__(self, weight_dtype: torch.dtype, load_tensors: bool) -> None:
269
+ self.weight_dtype = weight_dtype
270
+ self.load_tensors = load_tensors
271
+
272
+ def __call__(self, data: Dict[str, Any]) -> Dict[str, torch.Tensor]:
273
+ prompts = [x["prompt"] for x in data[0]]
274
+
275
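+ # When precomputed tensors are used (`load_tensors`), each "prompt" entry is already a
+ # text-encoder embedding, so the prompts can be stacked directly into a tensor batch.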
+ if self.load_tensors:
276
+ prompts = torch.stack(prompts).to(dtype=self.weight_dtype, non_blocking=True)
277
+
278
+ images = [x["image"] for x in data[0]]
279
+ images = torch.stack(images).to(dtype=self.weight_dtype, non_blocking=True)
280
+
281
+ videos = [x["video"] for x in data[0]]
282
+ videos = torch.stack(videos).to(dtype=self.weight_dtype, non_blocking=True)
283
+
284
+ return {
285
+ "images": images,
286
+ "videos": videos,
287
+ "prompts": prompts,
288
+ }
289
+
290
+
291
+ def main(args):
292
+ if args.report_to == "wandb" and args.hub_token is not None:
293
+ raise ValueError(
294
+ "You cannot use both --report_to=wandb and --hub_token due to a security risk of exposing your token."
295
+ " Please use `huggingface-cli login` to authenticate with the Hub."
296
+ )
297
+
298
+ if torch.backends.mps.is_available() and args.mixed_precision == "bf16":
299
+ # due to pytorch#99272, MPS does not yet support bfloat16.
300
+ raise ValueError(
301
+ "Mixed precision training with bfloat16 is not supported on MPS. Please use fp16 (recommended) or fp32 instead."
302
+ )
303
+
304
+ logging_dir = Path(args.output_dir, args.logging_dir)
305
+
306
+ accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
307
+ ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
308
+ init_process_group_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=args.nccl_timeout))
309
+ accelerator = Accelerator(
310
+ gradient_accumulation_steps=args.gradient_accumulation_steps,
311
+ mixed_precision=args.mixed_precision,
312
+ log_with=args.report_to,
313
+ project_config=accelerator_project_config,
314
+ kwargs_handlers=[ddp_kwargs, init_process_group_kwargs],
315
+ )
316
+
317
+ # Disable AMP for MPS.
318
+ if torch.backends.mps.is_available():
319
+ accelerator.native_amp = False
320
+
321
+ # Make one log on every process with the configuration for debugging.
322
+ logging.basicConfig(
323
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
324
+ datefmt="%m/%d/%Y %H:%M:%S",
325
+ level=logging.INFO,
326
+ )
327
+ logger.info(accelerator.state, main_process_only=False)
328
+ if accelerator.is_local_main_process:
329
+ transformers.utils.logging.set_verbosity_warning()
330
+ diffusers.utils.logging.set_verbosity_info()
331
+ else:
332
+ transformers.utils.logging.set_verbosity_error()
333
+ diffusers.utils.logging.set_verbosity_error()
334
+
335
+ # If passed along, set the training seed now.
336
+ if args.seed is not None:
337
+ set_seed(args.seed)
338
+
339
+ # Handle the repository creation
340
+ if accelerator.is_main_process:
341
+ if args.output_dir is not None:
342
+ os.makedirs(args.output_dir, exist_ok=True)
343
+
344
+ if args.push_to_hub:
345
+ repo_id = create_repo(
346
+ repo_id=args.hub_model_id or Path(args.output_dir).name,
347
+ exist_ok=True,
348
+ ).repo_id
349
+
350
+ # Prepare models and scheduler
351
+ tokenizer = AutoTokenizer.from_pretrained(
352
+ args.pretrained_model_name_or_path,
353
+ subfolder="tokenizer",
354
+ revision=args.revision,
355
+ )
356
+
357
+ text_encoder = T5EncoderModel.from_pretrained(
358
+ args.pretrained_model_name_or_path,
359
+ subfolder="text_encoder",
360
+ revision=args.revision,
361
+ )
362
+
363
+ # CogVideoX-2b weights are stored in float16
364
+ # CogVideoX-5b and CogVideoX-5b-I2V weights are stored in bfloat16
365
+ load_dtype = torch.bfloat16 if "5b" in args.pretrained_model_name_or_path.lower() else torch.float16
366
+ transformer = CogVideoXTransformer3DModel.from_pretrained(
367
+ args.pretrained_model_name_or_path,
368
+ subfolder="transformer",
369
+ torch_dtype=load_dtype,
370
+ revision=args.revision,
371
+ variant=args.variant,
372
+ )
373
+
374
+ # These changes will also be required when trying to run inference with the trained lora
375
+ if args.ignore_learned_positional_embeddings:
376
+ del transformer.patch_embed.pos_embedding
377
+ transformer.patch_embed.use_learned_positional_embeddings = False
378
+ transformer.config.use_learned_positional_embeddings = False
379
+
380
+ vae = AutoencoderKLCogVideoX.from_pretrained(
381
+ args.pretrained_model_name_or_path,
382
+ subfolder="vae",
383
+ revision=args.revision,
384
+ variant=args.variant,
385
+ )
386
+
387
+ scheduler = CogVideoXDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
388
+
389
+ if args.enable_slicing:
390
+ vae.enable_slicing()
391
+ if args.enable_tiling:
392
+ vae.enable_tiling()
393
+
394
+ # We only train the additional adapter LoRA layers
395
+ text_encoder.requires_grad_(False)
396
+ transformer.requires_grad_(False)
397
+ vae.requires_grad_(False)
398
+
399
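+ # VAE_SCALE_FACTOR_SPATIAL is the spatial compression of the VAE (2 ** number of downsample stages).
+ # The RoPE base height/width are the pixel-space resolution implied by the transformer's sample size,
+ # used later as the base grid for the rotary positional embeddings.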
+ VAE_SCALING_FACTOR = vae.config.scaling_factor
400
+ VAE_SCALE_FACTOR_SPATIAL = 2 ** (len(vae.config.block_out_channels) - 1)
401
+ RoPE_BASE_HEIGHT = transformer.config.sample_height * VAE_SCALE_FACTOR_SPATIAL
402
+ RoPE_BASE_WIDTH = transformer.config.sample_width * VAE_SCALE_FACTOR_SPATIAL
403
+
404
+ # For mixed precision training we cast all non-trainable weights (vae, text_encoder and transformer) to half-precision
405
+ # as these weights are only used for inference; keeping them in full precision is not required.
406
+ weight_dtype = torch.float32
407
+ if accelerator.state.deepspeed_plugin:
408
+ # DeepSpeed is handling precision, use what's in the DeepSpeed config
409
+ if (
410
+ "fp16" in accelerator.state.deepspeed_plugin.deepspeed_config
411
+ and accelerator.state.deepspeed_plugin.deepspeed_config["fp16"]["enabled"]
412
+ ):
413
+ weight_dtype = torch.float16
414
+ if (
415
+ "bf16" in accelerator.state.deepspeed_plugin.deepspeed_config
416
+ and accelerator.state.deepspeed_plugin.deepspeed_config["bf16"]["enabled"]
417
+ ):
418
+ weight_dtype = torch.bfloat16
419
+ else:
420
+ if accelerator.mixed_precision == "fp16":
421
+ weight_dtype = torch.float16
422
+ elif accelerator.mixed_precision == "bf16":
423
+ weight_dtype = torch.bfloat16
424
+
425
+ if torch.backends.mps.is_available() and weight_dtype == torch.bfloat16:
426
+ # due to pytorch#99272, MPS does not yet support bfloat16.
427
+ raise ValueError(
428
+ "Mixed precision training with bfloat16 is not supported on MPS. Please use fp16 (recommended) or fp32 instead."
429
+ )
430
+
431
+ text_encoder.to(accelerator.device, dtype=weight_dtype)
432
+ transformer.to(accelerator.device, dtype=weight_dtype)
433
+ vae.to(accelerator.device, dtype=weight_dtype)
434
+
435
+ if args.gradient_checkpointing:
436
+ transformer.enable_gradient_checkpointing()
437
+
438
+ # now we will add new LoRA weights to the attention layers
439
+ transformer_lora_config = LoraConfig(
440
+ r=args.rank,
441
+ lora_alpha=args.lora_alpha,
442
+ init_lora_weights=True,
443
+ target_modules=["to_k", "to_q", "to_v", "to_out.0"],
444
+ )
445
+ transformer.add_adapter(transformer_lora_config)
446
+
447
+ # create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
448
+ def save_model_hook(models, weights, output_dir):
449
+ if accelerator.is_main_process:
450
+ transformer_lora_layers_to_save = None
451
+
452
+ for model in models:
453
+ if isinstance(unwrap_model(accelerator, model), type(unwrap_model(accelerator, transformer))):
454
+ model = unwrap_model(accelerator, model)
455
+ transformer_lora_layers_to_save = get_peft_model_state_dict(model)
456
+ else:
457
+ raise ValueError(f"Unexpected save model: {model.__class__}")
458
+
459
+ # make sure to pop weight so that corresponding model is not saved again
460
+ if weights:
461
+ weights.pop()
462
+
463
+ CogVideoXImageToVideoPipeline.save_lora_weights(
464
+ output_dir,
465
+ transformer_lora_layers=transformer_lora_layers_to_save,
466
+ )
467
+
468
+ def load_model_hook(models, input_dir):
469
+ transformer_ = None
470
+
471
+ # This is a bit of a hack but I don't know any other solution.
472
+ if not accelerator.distributed_type == DistributedType.DEEPSPEED:
473
+ while len(models) > 0:
474
+ model = models.pop()
475
+
476
+ if isinstance(unwrap_model(accelerator, model), type(unwrap_model(accelerator, transformer))):
477
+ transformer_ = unwrap_model(accelerator, model)
478
+ else:
479
+ raise ValueError(f"Unexpected save model: {unwrap_model(accelerator, model).__class__}")
480
+ else:
481
+ transformer_ = CogVideoXTransformer3DModel.from_pretrained(
482
+ args.pretrained_model_name_or_path, subfolder="transformer"
483
+ )
484
+ transformer_.add_adapter(transformer_lora_config)
485
+
486
+ lora_state_dict = CogVideoXImageToVideoPipeline.lora_state_dict(input_dir)
487
+
488
+ transformer_state_dict = {
489
+ f'{k.replace("transformer.", "")}': v for k, v in lora_state_dict.items() if k.startswith("transformer.")
490
+ }
491
+ transformer_state_dict = convert_unet_state_dict_to_peft(transformer_state_dict)
492
+ incompatible_keys = set_peft_model_state_dict(transformer_, transformer_state_dict, adapter_name="default")
493
+ if incompatible_keys is not None:
494
+ # check only for unexpected keys
495
+ unexpected_keys = getattr(incompatible_keys, "unexpected_keys", None)
496
+ if unexpected_keys:
497
+ logger.warning(
498
+ f"Loading adapter weights from state_dict led to unexpected keys not found in the model: "
499
+ f" {unexpected_keys}. "
500
+ )
501
+
502
+ # Make sure the trainable params are in float32. This is again needed since the base models
503
+ # are in `weight_dtype`. More details:
504
+ # https://github.com/huggingface/diffusers/pull/6514#discussion_r1449796804
505
+ if args.mixed_precision == "fp16":
506
+ # only upcast trainable parameters (LoRA) into fp32
507
+ cast_training_params([transformer_])
508
+
509
+ accelerator.register_save_state_pre_hook(save_model_hook)
510
+ accelerator.register_load_state_pre_hook(load_model_hook)
511
+
512
+ # Enable TF32 for faster training on Ampere GPUs,
513
+ # cf https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
514
+ if args.allow_tf32 and torch.cuda.is_available():
515
+ torch.backends.cuda.matmul.allow_tf32 = True
516
+
517
+ if args.scale_lr:
518
+ args.learning_rate = (
519
+ args.learning_rate * args.gradient_accumulation_steps * args.train_batch_size * accelerator.num_processes
520
+ )
521
+
522
+ # Make sure the trainable params are in float32.
523
+ if args.mixed_precision == "fp16":
524
+ # only upcast trainable parameters (LoRA) into fp32
525
+ cast_training_params([transformer], dtype=torch.float32)
526
+
527
+ transformer_lora_parameters = list(filter(lambda p: p.requires_grad, transformer.parameters()))
528
+
529
+ # Optimization parameters
530
+ transformer_parameters_with_lr = {
531
+ "params": transformer_lora_parameters,
532
+ "lr": args.learning_rate,
533
+ }
534
+ params_to_optimize = [transformer_parameters_with_lr]
535
+ num_trainable_parameters = sum(param.numel() for model in params_to_optimize for param in model["params"])
536
+
537
+ use_deepspeed_optimizer = (
538
+ accelerator.state.deepspeed_plugin is not None
539
+ and "optimizer" in accelerator.state.deepspeed_plugin.deepspeed_config
540
+ )
541
+ use_deepspeed_scheduler = (
542
+ accelerator.state.deepspeed_plugin is not None
543
+ and "scheduler" in accelerator.state.deepspeed_plugin.deepspeed_config
544
+ )
545
+
546
+ optimizer = get_optimizer(
547
+ params_to_optimize=params_to_optimize,
548
+ optimizer_name=args.optimizer,
549
+ learning_rate=args.learning_rate,
550
+ beta1=args.beta1,
551
+ beta2=args.beta2,
552
+ beta3=args.beta3,
553
+ epsilon=args.epsilon,
554
+ weight_decay=args.weight_decay,
555
+ prodigy_decouple=args.prodigy_decouple,
556
+ prodigy_use_bias_correction=args.prodigy_use_bias_correction,
557
+ prodigy_safeguard_warmup=args.prodigy_safeguard_warmup,
558
+ use_8bit=args.use_8bit,
559
+ use_4bit=args.use_4bit,
560
+ use_torchao=args.use_torchao,
561
+ use_deepspeed=use_deepspeed_optimizer,
562
+ use_cpu_offload_optimizer=args.use_cpu_offload_optimizer,
563
+ offload_gradients=args.offload_gradients,
564
+ )
565
+
566
+ # Dataset and DataLoader
567
+ dataset_init_kwargs = {
568
+ "data_root": args.data_root,
569
+ "dataset_file": args.dataset_file,
570
+ "caption_column": args.caption_column,
571
+ "video_column": args.video_column,
572
+ "max_num_frames": args.max_num_frames,
573
+ "id_token": args.id_token,
574
+ "height_buckets": args.height_buckets,
575
+ "width_buckets": args.width_buckets,
576
+ "frame_buckets": args.frame_buckets,
577
+ "load_tensors": args.load_tensors,
578
+ "random_flip": args.random_flip,
579
+ "image_to_video": True,
580
+ }
581
+ if args.video_reshape_mode is None:
582
+ train_dataset = VideoDatasetWithResizing(**dataset_init_kwargs)
583
+ else:
584
+ train_dataset = VideoDatasetWithResizeAndRectangleCrop(
585
+ video_reshape_mode=args.video_reshape_mode, **dataset_init_kwargs
586
+ )
587
+
588
+ collate_fn = CollateFunction(weight_dtype, args.load_tensors)
589
+
590
+ train_dataloader = DataLoader(
591
+ train_dataset,
592
+ batch_size=1,
593
+ sampler=BucketSampler(train_dataset, batch_size=args.train_batch_size, shuffle=True),
594
+ collate_fn=collate_fn,
595
+ num_workers=args.dataloader_num_workers,
596
+ pin_memory=args.pin_memory,
597
+ )
598
+
599
+ # Scheduler and math around the number of training steps.
600
+ overrode_max_train_steps = False
601
+ num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
602
+ if args.max_train_steps is None:
603
+ args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
604
+ overrode_max_train_steps = True
605
+
606
+ if args.use_cpu_offload_optimizer:
607
+ lr_scheduler = None
608
+ accelerator.print(
609
+ "CPU Offload Optimizer cannot be used with DeepSpeed or builtin PyTorch LR Schedulers. If "
610
+ "you are training with those settings, they will be ignored."
611
+ )
612
+ else:
613
+ if use_deepspeed_scheduler:
614
+ from accelerate.utils import DummyScheduler
615
+
616
+ lr_scheduler = DummyScheduler(
617
+ name=args.lr_scheduler,
618
+ optimizer=optimizer,
619
+ total_num_steps=args.max_train_steps * accelerator.num_processes,
620
+ num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
621
+ )
622
+ else:
623
+ lr_scheduler = get_scheduler(
624
+ args.lr_scheduler,
625
+ optimizer=optimizer,
626
+ num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
627
+ num_training_steps=args.max_train_steps * accelerator.num_processes,
628
+ num_cycles=args.lr_num_cycles,
629
+ power=args.lr_power,
630
+ )
631
+
632
+ # Prepare everything with our `accelerator`.
633
+ transformer, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
634
+ transformer, optimizer, train_dataloader, lr_scheduler
635
+ )
636
+
637
+ # We need to recalculate our total training steps as the size of the training dataloader may have changed.
638
+ num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
639
+ if overrode_max_train_steps:
640
+ args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
641
+ # Afterwards we recalculate our number of training epochs
642
+ args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
643
+
644
+ # We need to initialize the trackers we use, and also store our configuration.
645
+ # The trackers initialize automatically on the main process.
646
+ if accelerator.distributed_type == DistributedType.DEEPSPEED or accelerator.is_main_process:
647
+ tracker_name = args.tracker_name or "cogvideox-lora"
648
+ accelerator.init_trackers(tracker_name, config=vars(args))
649
+
650
+ accelerator.print("===== Memory before training =====")
651
+ reset_memory(accelerator.device)
652
+ print_memory(accelerator.device)
653
+
654
+ # Train!
655
+ total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
656
+
657
+ accelerator.print("***** Running training *****")
658
+ accelerator.print(f" Num trainable parameters = {num_trainable_parameters}")
659
+ accelerator.print(f" Num examples = {len(train_dataset)}")
660
+ accelerator.print(f" Num batches each epoch = {len(train_dataloader)}")
661
+ accelerator.print(f" Num epochs = {args.num_train_epochs}")
662
+ accelerator.print(f" Instantaneous batch size per device = {args.train_batch_size}")
663
+ accelerator.print(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
664
+ accelerator.print(f" Gradient accumulation steps = {args.gradient_accumulation_steps}")
665
+ accelerator.print(f" Total optimization steps = {args.max_train_steps}")
666
+ global_step = 0
667
+ first_epoch = 0
668
+
669
+ # Potentially load in the weights and states from a previous save
670
+ if not args.resume_from_checkpoint:
671
+ initial_global_step = 0
672
+ else:
673
+ if args.resume_from_checkpoint != "latest":
674
+ path = os.path.basename(args.resume_from_checkpoint)
675
+ else:
676
+ # Get the most recent checkpoint
677
+ dirs = os.listdir(args.output_dir)
678
+ dirs = [d for d in dirs if d.startswith("checkpoint")]
679
+ dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
680
+ path = dirs[-1] if len(dirs) > 0 else None
681
+
682
+ if path is None:
683
+ accelerator.print(
684
+ f"Checkpoint '{args.resume_from_checkpoint}' does not exist. Starting a new training run."
685
+ )
686
+ args.resume_from_checkpoint = None
687
+ initial_global_step = 0
688
+ else:
689
+ accelerator.print(f"Resuming from checkpoint {path}")
690
+ accelerator.load_state(os.path.join(args.output_dir, path))
691
+ global_step = int(path.split("-")[1])
692
+
693
+ initial_global_step = global_step
694
+ first_epoch = global_step // num_update_steps_per_epoch
695
+
696
+ progress_bar = tqdm(
697
+ range(0, args.max_train_steps),
698
+ initial=initial_global_step,
699
+ desc="Steps",
700
+ # Only show the progress bar once on each machine.
701
+ disable=not accelerator.is_local_main_process,
702
+ )
703
+
704
+ # For DeepSpeed training
705
+ model_config = transformer.module.config if hasattr(transformer, "module") else transformer.config
706
+
707
+ if args.load_tensors:
708
+ del vae, text_encoder
709
+ gc.collect()
710
+ torch.cuda.empty_cache()
711
+ torch.cuda.synchronize(accelerator.device)
712
+
713
+ alphas_cumprod = scheduler.alphas_cumprod.to(accelerator.device, dtype=torch.float32)
714
+
715
+ for epoch in range(first_epoch, args.num_train_epochs):
716
+ transformer.train()
717
+
718
+ for step, batch in enumerate(train_dataloader):
719
+ models_to_accumulate = [transformer]
720
+ logs = {}
721
+
722
+ with accelerator.accumulate(models_to_accumulate):
723
+ images = batch["images"].to(accelerator.device, non_blocking=True)
724
+ videos = batch["videos"].to(accelerator.device, non_blocking=True)
725
+ prompts = batch["prompts"]
726
+
727
+ # Encode videos
728
+ if not args.load_tensors:
729
+ images = images.permute(0, 2, 1, 3, 4) # [B, C, F, H, W]
730
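+ # Noise-augment the conditioning image before VAE encoding: sigma is log-normal
+ # (exp of a sample from N(-3.0, 0.5)) and scales Gaussian noise added to the image.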
+ image_noise_sigma = torch.normal(
731
+ mean=-3.0, std=0.5, size=(images.size(0),), device=accelerator.device, dtype=weight_dtype
732
+ )
733
+ image_noise_sigma = torch.exp(image_noise_sigma)
734
+ noisy_images = images + torch.randn_like(images) * image_noise_sigma[:, None, None, None, None]
735
+ image_latent_dist = vae.encode(noisy_images).latent_dist
736
+
737
+ videos = videos.permute(0, 2, 1, 3, 4) # [B, C, F, H, W]
738
+ latent_dist = vae.encode(videos).latent_dist
739
+ else:
740
+ image_latent_dist = DiagonalGaussianDistribution(images)
741
+ latent_dist = DiagonalGaussianDistribution(videos)
742
+
743
+ image_latents = image_latent_dist.sample() * VAE_SCALING_FACTOR
744
+ image_latents = image_latents.permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
745
+ image_latents = image_latents.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
746
+
747
+ video_latents = latent_dist.sample() * VAE_SCALING_FACTOR
748
+ video_latents = video_latents.permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
749
+ video_latents = video_latents.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
750
+
751
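+ # Zero-pad the single-frame image latents along the frame dimension so they match the
+ # video latent length; they are concatenated with the noisy video latents channel-wise below.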
+ padding_shape = (video_latents.shape[0], video_latents.shape[1] - 1, *video_latents.shape[2:])
752
+ latent_padding = image_latents.new_zeros(padding_shape)
753
+ image_latents = torch.cat([image_latents, latent_padding], dim=1)
754
+
755
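+ # Randomly drop the image conditioning (replace it with zeros) so the model also
+ # learns to denoise without an image condition.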
+ if random.random() < args.noised_image_dropout:
756
+ image_latents = torch.zeros_like(image_latents)
757
+
758
+ # Encode prompts
759
+ if not args.load_tensors:
760
+ prompt_embeds = compute_prompt_embeddings(
761
+ tokenizer,
762
+ text_encoder,
763
+ prompts,
764
+ model_config.max_text_seq_length,
765
+ accelerator.device,
766
+ weight_dtype,
767
+ requires_grad=False,
768
+ )
769
+ else:
770
+ prompt_embeds = prompts.to(dtype=weight_dtype)
771
+
772
+ # Sample noise that will be added to the latents
773
+ noise = torch.randn_like(video_latents)
774
+ batch_size, num_frames, num_channels, height, width = video_latents.shape
775
+
776
+ # Sample a random timestep for each image
777
+ timesteps = torch.randint(
778
+ 0,
779
+ scheduler.config.num_train_timesteps,
780
+ (batch_size,),
781
+ dtype=torch.int64,
782
+ device=accelerator.device,
783
+ )
784
+
785
+ # Prepare rotary embeds
786
+ image_rotary_emb = (
787
+ prepare_rotary_positional_embeddings(
788
+ height=height * VAE_SCALE_FACTOR_SPATIAL,
789
+ width=width * VAE_SCALE_FACTOR_SPATIAL,
790
+ num_frames=num_frames,
791
+ vae_scale_factor_spatial=VAE_SCALE_FACTOR_SPATIAL,
792
+ patch_size=model_config.patch_size,
793
+ patch_size_t=model_config.patch_size_t if hasattr(model_config, "patch_size_t") else None,
794
+ attention_head_dim=model_config.attention_head_dim,
795
+ device=accelerator.device,
796
+ base_height=RoPE_BASE_HEIGHT,
797
+ base_width=RoPE_BASE_WIDTH,
798
+ )
799
+ if model_config.use_rotary_positional_embeddings
800
+ else None
801
+ )
802
+
803
+ # Add noise to the model input according to the noise magnitude at each timestep
804
+ # (this is the forward diffusion process)
805
+ noisy_video_latents = scheduler.add_noise(video_latents, noise, timesteps)
806
+ noisy_model_input = torch.cat([noisy_video_latents, image_latents], dim=2)
807
+
808
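+ # `ofs_embed_dim` only exists on some CogVideoX I2V checkpoints; pass an ofs embedding
+ # to the transformer only when the model config defines it.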
+ ofs_embed_dim = model_config.ofs_embed_dim if hasattr(model_config, "ofs_embed_dim") else None
809
+ ofs_emb = None if ofs_embed_dim is None else noisy_model_input.new_full((1,), fill_value=2.0)
810
+ # Predict the noise residual
811
+ model_output = transformer(
812
+ hidden_states=noisy_model_input,
813
+ encoder_hidden_states=prompt_embeds,
814
+ timestep=timesteps,
815
+ ofs=ofs_emb,
816
+ image_rotary_emb=image_rotary_emb,
817
+ return_dict=False,
818
+ )[0]
819
+
820
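+ # Convert the model output into a prediction of the clean video latents (the target below)
+ # and weight the squared error by 1 / (1 - alpha_bar_t) for the sampled timesteps.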
+ model_pred = scheduler.get_velocity(model_output, noisy_video_latents, timesteps)
821
+
822
+ weights = 1 / (1 - alphas_cumprod[timesteps])
823
+ while len(weights.shape) < len(model_pred.shape):
824
+ weights = weights.unsqueeze(-1)
825
+
826
+ target = video_latents
827
+
828
+ loss = torch.mean(
829
+ (weights * (model_pred - target) ** 2).reshape(batch_size, -1),
830
+ dim=1,
831
+ )
832
+ loss = loss.mean()
833
+ accelerator.backward(loss)
834
+
835
+ if accelerator.sync_gradients and accelerator.distributed_type != DistributedType.DEEPSPEED:
836
+ gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
837
+ accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
838
+ gradient_norm_after_clip = get_gradient_norm(transformer.parameters())
839
+ logs.update(
840
+ {
841
+ "gradient_norm_before_clip": gradient_norm_before_clip,
842
+ "gradient_norm_after_clip": gradient_norm_after_clip,
843
+ }
844
+ )
845
+
846
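+ # DeepSpeed performs the optimizer step itself, so step manually only when it is not in use.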
+ if accelerator.state.deepspeed_plugin is None:
847
+ optimizer.step()
848
+ optimizer.zero_grad()
849
+
850
+ if not args.use_cpu_offload_optimizer:
851
+ lr_scheduler.step()
852
+
853
+ # Checks if the accelerator has performed an optimization step behind the scenes
854
+ if accelerator.sync_gradients:
855
+ progress_bar.update(1)
856
+ global_step += 1
857
+
858
+ # Checkpointing
859
+ if accelerator.is_main_process or accelerator.distributed_type == DistributedType.DEEPSPEED:
860
+ if global_step % args.checkpointing_steps == 0:
861
+ # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
862
+ if args.checkpoints_total_limit is not None:
863
+ checkpoints = os.listdir(args.output_dir)
864
+ checkpoints = [d for d in checkpoints if d.startswith("checkpoint")]
865
+ checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))
866
+
867
+ # before we save the new checkpoint, we need to have at _most_ `checkpoints_total_limit - 1` checkpoints
868
+ if len(checkpoints) >= args.checkpoints_total_limit:
869
+ num_to_remove = len(checkpoints) - args.checkpoints_total_limit + 1
870
+ removing_checkpoints = checkpoints[0:num_to_remove]
871
+
872
+ logger.info(
873
+ f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
874
+ )
875
+ logger.info(f"Removing checkpoints: {', '.join(removing_checkpoints)}")
876
+
877
+ for removing_checkpoint in removing_checkpoints:
878
+ removing_checkpoint = os.path.join(args.output_dir, removing_checkpoint)
879
+ shutil.rmtree(removing_checkpoint)
880
+
881
+ save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
882
+ accelerator.save_state(save_path)
883
+ logger.info(f"Saved state to {save_path}")
884
+
885
+ # Validation
886
+ should_run_validation = args.validation_prompt is not None and (
887
+ args.validation_steps is not None and global_step % args.validation_steps == 0
888
+ )
889
+ if should_run_validation:
890
+ run_validation(args, accelerator, transformer, scheduler, model_config, weight_dtype)
891
+
892
+ last_lr = lr_scheduler.get_last_lr()[0] if lr_scheduler is not None else args.learning_rate
893
+ logs.update(
894
+ {
895
+ "loss": loss.detach().item(),
896
+ "lr": last_lr,
897
+ }
898
+ )
899
+ progress_bar.set_postfix(**logs)
900
+ accelerator.log(logs, step=global_step)
901
+
902
+ if global_step >= args.max_train_steps:
903
+ break
904
+
905
+ if accelerator.is_main_process:
906
+ should_run_validation = args.validation_prompt is not None and (
907
+ args.validation_epochs is not None and (epoch + 1) % args.validation_epochs == 0
908
+ )
909
+ if should_run_validation:
910
+ run_validation(args, accelerator, transformer, scheduler, model_config, weight_dtype)
911
+
912
+ accelerator.wait_for_everyone()
913
+
914
+ if accelerator.is_main_process:
915
+ transformer = unwrap_model(accelerator, transformer)
916
+ dtype = (
917
+ torch.float16
918
+ if args.mixed_precision == "fp16"
919
+ else torch.bfloat16
920
+ if args.mixed_precision == "bf16"
921
+ else torch.float32
922
+ )
923
+ transformer = transformer.to(dtype)
924
+ transformer_lora_layers = get_peft_model_state_dict(transformer)
925
+
926
+ CogVideoXImageToVideoPipeline.save_lora_weights(
927
+ save_directory=args.output_dir,
928
+ transformer_lora_layers=transformer_lora_layers,
929
+ )
930
+
931
+ # Cleanup trained models to save memory
932
+ if args.load_tensors:
933
+ del transformer
934
+ else:
935
+ del transformer, text_encoder, vae
936
+
937
+ gc.collect()
938
+ torch.cuda.empty_cache()
939
+ torch.cuda.synchronize(accelerator.device)
940
+
941
+ accelerator.print("===== Memory before testing =====")
942
+ print_memory(accelerator.device)
943
+ reset_memory(accelerator.device)
944
+
945
+ # Final test inference
946
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
947
+ args.pretrained_model_name_or_path,
948
+ revision=args.revision,
949
+ variant=args.variant,
950
+ torch_dtype=weight_dtype,
951
+ )
952
+ pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
953
+
954
+ if args.enable_slicing:
955
+ pipe.vae.enable_slicing()
956
+ if args.enable_tiling:
957
+ pipe.vae.enable_tiling()
958
+ if args.enable_model_cpu_offload:
959
+ pipe.enable_model_cpu_offload()
960
+
961
+ # Load LoRA weights
962
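+ # Scale the adapter by alpha / rank so inference uses the same effective LoRA strength as training.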
+ lora_scaling = args.lora_alpha / args.rank
963
+ pipe.load_lora_weights(args.output_dir, adapter_name="cogvideox-lora")
964
+ pipe.set_adapters(["cogvideox-lora"], [lora_scaling])
965
+
966
+ # Run inference
967
+ validation_outputs = []
968
+ if args.validation_prompt and args.num_validation_videos > 0:
969
+ validation_prompts = args.validation_prompt.split(args.validation_prompt_separator)
970
+ validation_images = args.validation_images.split(args.validation_prompt_separator)
971
+ for validation_image, validation_prompt in zip(validation_images, validation_prompts):
972
+ pipeline_args = {
973
+ "image": load_image(validation_image),
974
+ "prompt": validation_prompt,
975
+ "guidance_scale": args.guidance_scale,
976
+ "use_dynamic_cfg": args.use_dynamic_cfg,
977
+ "height": args.height,
978
+ "width": args.width,
979
+ }
980
+
981
+ video = log_validation(
982
+ accelerator=accelerator,
983
+ pipe=pipe,
984
+ args=args,
985
+ pipeline_args=pipeline_args,
986
+ is_final_validation=True,
987
+ )
988
+ validation_outputs.extend(video)
989
+
990
+ accelerator.print("===== Memory after testing =====")
991
+ print_memory(accelerator.device)
992
+ reset_memory(accelerator.device)
993
+ torch.cuda.synchronize(accelerator.device)
994
+
995
+ if args.push_to_hub:
996
+ save_model_card(
997
+ repo_id,
998
+ videos=validation_outputs,
999
+ base_model=args.pretrained_model_name_or_path,
1000
+ validation_prompt=args.validation_prompt,
1001
+ repo_folder=args.output_dir,
1002
+ fps=args.fps,
1003
+ )
1004
+ upload_folder(
1005
+ repo_id=repo_id,
1006
+ folder_path=args.output_dir,
1007
+ commit_message="End of training",
1008
+ ignore_patterns=["step_*", "epoch_*"],
1009
+ )
1010
+
1011
+ accelerator.end_training()
1012
+
1013
+
1014
+ if __name__ == "__main__":
1015
+ args = get_args()
1016
+ main(args)
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/cogvideox_image_to_video_sft.py ADDED
@@ -0,0 +1,947 @@
1
+ # Copyright 2024 The HuggingFace Team.
2
+ # All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import gc
17
+ import logging
18
+ import math
19
+ import os
20
+ import random
21
+ import shutil
22
+ from datetime import timedelta
23
+ from pathlib import Path
24
+ from typing import Any, Dict
25
+
26
+ import diffusers
27
+ import torch
28
+ import transformers
29
+ import wandb
30
+ from accelerate import Accelerator, DistributedType, init_empty_weights
31
+ from accelerate.logging import get_logger
32
+ from accelerate.utils import (
33
+ DistributedDataParallelKwargs,
34
+ InitProcessGroupKwargs,
35
+ ProjectConfiguration,
36
+ set_seed,
37
+ )
38
+ from diffusers import (
39
+ AutoencoderKLCogVideoX,
40
+ CogVideoXDPMScheduler,
41
+ CogVideoXImageToVideoPipeline,
42
+ CogVideoXTransformer3DModel,
43
+ )
44
+ from diffusers.models.autoencoders.vae import DiagonalGaussianDistribution
45
+ from diffusers.optimization import get_scheduler
46
+ from diffusers.training_utils import cast_training_params
47
+ from diffusers.utils import convert_unet_state_dict_to_peft, export_to_video, load_image
48
+ from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
49
+ from huggingface_hub import create_repo, upload_folder
50
+ from torch.utils.data import DataLoader
51
+ from tqdm.auto import tqdm
52
+ from transformers import AutoTokenizer, T5EncoderModel
53
+
54
+
55
+ from args import get_args # isort:skip
56
+ from dataset import BucketSampler, VideoDatasetWithResizing, VideoDatasetWithResizeAndRectangleCrop # isort:skip
57
+ from text_encoder import compute_prompt_embeddings # isort:skip
58
+ from utils import (
59
+ get_gradient_norm,
60
+ get_optimizer,
61
+ prepare_rotary_positional_embeddings,
62
+ print_memory,
63
+ reset_memory,
64
+ unwrap_model,
65
+ )
66
+
67
+
68
+ logger = get_logger(__name__)
69
+
70
+
71
+ def save_model_card(
72
+ repo_id: str,
73
+ videos=None,
74
+ base_model: str = None,
75
+ validation_prompt=None,
76
+ repo_folder=None,
77
+ fps=8,
78
+ ):
79
+ widget_dict = []
80
+ if videos is not None:
81
+ for i, video in enumerate(videos):
82
+ export_to_video(video, os.path.join(repo_folder, f"final_video_{i}.mp4"), fps=fps)
83
+ widget_dict.append(
84
+ {
85
+ "text": validation_prompt if validation_prompt else " ",
86
+ "output": {"url": f"video_{i}.mp4"},
87
+ }
88
+ )
89
+
90
+ model_description = f"""
91
+ # CogVideoX Full Finetune
92
+
93
+ <Gallery />
94
+
95
+ ## Model description
96
+
97
+ This is a full finetune of the CogVideoX model `{base_model}`.
98
+
99
+ ## License
100
+
101
+ Please adhere to the licensing terms as described [here](https://huggingface.co/THUDM/CogVideoX-5b-I2V/blob/main/LICENSE).
102
+ """
103
+ model_card = load_or_create_model_card(
104
+ repo_id_or_path=repo_id,
105
+ from_training=True,
106
+ license="other",
107
+ base_model=base_model,
108
+ prompt=validation_prompt,
109
+ model_description=model_description,
110
+ widget=widget_dict,
111
+ )
112
+ tags = [
113
+ "text-to-video",
114
+ "image-to-video",
115
+ "diffusers-training",
116
+ "diffusers",
117
+ "cogvideox",
118
+ "cogvideox-diffusers",
119
+ ]
120
+
121
+ model_card = populate_model_card(model_card, tags=tags)
122
+ model_card.save(os.path.join(repo_folder, "README.md"))
123
+
124
+
125
+ def log_validation(
126
+ accelerator: Accelerator,
127
+ pipe: CogVideoXImageToVideoPipeline,
128
+ args: Dict[str, Any],
129
+ pipeline_args: Dict[str, Any],
130
+ is_final_validation: bool = False,
131
+ ):
132
+ logger.info(
133
+ f"Running validation... \n Generating {args.num_validation_videos} videos with prompt: {pipeline_args['prompt']}."
134
+ )
135
+
136
+ pipe = pipe.to(accelerator.device)
137
+
138
+ # run inference
139
+ generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) if args.seed else None
140
+
141
+ videos = []
142
+ for _ in range(args.num_validation_videos):
143
+ video = pipe(**pipeline_args, generator=generator, output_type="np").frames[0]
144
+ videos.append(video)
145
+
146
+ for tracker in accelerator.trackers:
147
+ phase_name = "test" if is_final_validation else "validation"
148
+ if tracker.name == "wandb":
149
+ video_filenames = []
150
+ for i, video in enumerate(videos):
151
+ prompt = (
152
+ pipeline_args["prompt"][:25]
153
+ .replace(" ", "_")
154
+ .replace(" ", "_")
155
+ .replace("'", "_")
156
+ .replace('"', "_")
157
+ .replace("/", "_")
158
+ )
159
+ filename = os.path.join(args.output_dir, f"{phase_name}_video_{i}_{prompt}.mp4")
160
+ export_to_video(video, filename, fps=8)
161
+ video_filenames.append(filename)
162
+
163
+ tracker.log(
164
+ {
165
+ phase_name: [
166
+ wandb.Video(filename, caption=f"{i}: {pipeline_args['prompt']}")
167
+ for i, filename in enumerate(video_filenames)
168
+ ]
169
+ }
170
+ )
171
+
172
+ return videos
173
+
174
+
175
+ def run_validation(
176
+ args: Dict[str, Any],
177
+ accelerator: Accelerator,
178
+ transformer,
179
+ scheduler,
180
+ model_config: Dict[str, Any],
181
+ weight_dtype: torch.dtype,
182
+ ) -> None:
183
+ accelerator.print("===== Memory before validation =====")
184
+ print_memory(accelerator.device)
185
+ torch.cuda.synchronize(accelerator.device)
186
+
187
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
188
+ args.pretrained_model_name_or_path,
189
+ transformer=unwrap_model(accelerator, transformer),
190
+ scheduler=scheduler,
191
+ revision=args.revision,
192
+ variant=args.variant,
193
+ torch_dtype=weight_dtype,
194
+ )
195
+
196
+ if args.enable_slicing:
197
+ pipe.vae.enable_slicing()
198
+ if args.enable_tiling:
199
+ pipe.vae.enable_tiling()
200
+ if args.enable_model_cpu_offload:
201
+ pipe.enable_model_cpu_offload()
202
+
203
+ validation_prompts = args.validation_prompt.split(args.validation_prompt_separator)
204
+ validation_images = args.validation_images.split(args.validation_prompt_separator)
205
+ for validation_image, validation_prompt in zip(validation_images, validation_prompts):
206
+ pipeline_args = {
207
+ "image": load_image(validation_image),
208
+ "prompt": validation_prompt,
209
+ "guidance_scale": args.guidance_scale,
210
+ "use_dynamic_cfg": args.use_dynamic_cfg,
211
+ "height": args.height,
212
+ "width": args.width,
213
+ "max_sequence_length": model_config.max_text_seq_length,
214
+ }
215
+
216
+ log_validation(
217
+ pipe=pipe,
218
+ args=args,
219
+ accelerator=accelerator,
220
+ pipeline_args=pipeline_args,
221
+ )
222
+
223
+ accelerator.print("===== Memory after validation =====")
224
+ print_memory(accelerator.device)
225
+ reset_memory(accelerator.device)
226
+
227
+ del pipe
228
+ gc.collect()
229
+ torch.cuda.empty_cache()
230
+ torch.cuda.synchronize(accelerator.device)
231
+
232
+
233
+ class CollateFunction:
234
+ def __init__(self, weight_dtype: torch.dtype, load_tensors: bool) -> None:
235
+ self.weight_dtype = weight_dtype
236
+ self.load_tensors = load_tensors
237
+
238
+ def __call__(self, data: Dict[str, Any]) -> Dict[str, torch.Tensor]:
239
+ prompts = [x["prompt"] for x in data[0]]
240
+
241
+ if self.load_tensors:
242
+ prompts = torch.stack(prompts).to(dtype=self.weight_dtype, non_blocking=True)
243
+
244
+ images = [x["image"] for x in data[0]]
245
+ images = torch.stack(images).to(dtype=self.weight_dtype, non_blocking=True)
246
+
247
+ videos = [x["video"] for x in data[0]]
248
+ videos = torch.stack(videos).to(dtype=self.weight_dtype, non_blocking=True)
249
+
250
+ return {
251
+ "images": images,
252
+ "videos": videos,
253
+ "prompts": prompts,
254
+ }
255
+
256
+
257
+ def main(args):
258
+ if args.report_to == "wandb" and args.hub_token is not None:
259
+ raise ValueError(
260
+ "You cannot use both --report_to=wandb and --hub_token due to a security risk of exposing your token."
261
+ " Please use `huggingface-cli login` to authenticate with the Hub."
262
+ )
263
+
264
+ if torch.backends.mps.is_available() and args.mixed_precision == "bf16":
265
+ # due to pytorch#99272, MPS does not yet support bfloat16.
266
+ raise ValueError(
267
+ "Mixed precision training with bfloat16 is not supported on MPS. Please use fp16 (recommended) or fp32 instead."
268
+ )
269
+
270
+ logging_dir = Path(args.output_dir, args.logging_dir)
271
+
272
+ accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
273
+ ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
274
+ init_process_group_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=args.nccl_timeout))
275
+ accelerator = Accelerator(
276
+ gradient_accumulation_steps=args.gradient_accumulation_steps,
277
+ mixed_precision=args.mixed_precision,
278
+ log_with=args.report_to,
279
+ project_config=accelerator_project_config,
280
+ kwargs_handlers=[ddp_kwargs, init_process_group_kwargs],
281
+ )
282
+
283
+ # Disable AMP for MPS.
284
+ if torch.backends.mps.is_available():
285
+ accelerator.native_amp = False
286
+
287
+ # Make one log on every process with the configuration for debugging.
288
+ logging.basicConfig(
289
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
290
+ datefmt="%m/%d/%Y %H:%M:%S",
291
+ level=logging.INFO,
292
+ )
293
+ logger.info(accelerator.state, main_process_only=False)
294
+ if accelerator.is_local_main_process:
295
+ transformers.utils.logging.set_verbosity_warning()
296
+ diffusers.utils.logging.set_verbosity_info()
297
+ else:
298
+ transformers.utils.logging.set_verbosity_error()
299
+ diffusers.utils.logging.set_verbosity_error()
300
+
301
+ # If passed along, set the training seed now.
302
+ if args.seed is not None:
303
+ set_seed(args.seed)
304
+
305
+ # Handle the repository creation
306
+ if accelerator.is_main_process:
307
+ if args.output_dir is not None:
308
+ os.makedirs(args.output_dir, exist_ok=True)
309
+
310
+ if args.push_to_hub:
311
+ repo_id = create_repo(
312
+ repo_id=args.hub_model_id or Path(args.output_dir).name,
313
+ exist_ok=True,
314
+ ).repo_id
315
+
316
+ # Prepare models and scheduler
317
+ tokenizer = AutoTokenizer.from_pretrained(
318
+ args.pretrained_model_name_or_path,
319
+ subfolder="tokenizer",
320
+ revision=args.revision,
321
+ )
322
+
323
+ text_encoder = T5EncoderModel.from_pretrained(
324
+ args.pretrained_model_name_or_path,
325
+ subfolder="text_encoder",
326
+ revision=args.revision,
327
+ )
328
+
329
+ # CogVideoX-2b weights are stored in float16
330
+ # CogVideoX-5b and CogVideoX-5b-I2V weights are stored in bfloat16
331
+ load_dtype = torch.bfloat16 if "5b" in args.pretrained_model_name_or_path.lower() else torch.float16
332
+ transformer = CogVideoXTransformer3DModel.from_pretrained(
333
+ args.pretrained_model_name_or_path,
334
+ subfolder="transformer",
335
+ torch_dtype=load_dtype,
336
+ revision=args.revision,
337
+ variant=args.variant,
338
+ )
339
+
340
+ if args.ignore_learned_positional_embeddings:
341
+ del transformer.patch_embed.pos_embedding
342
+ transformer.patch_embed.use_learned_positional_embeddings = False
343
+ transformer.config.use_learned_positional_embeddings = False
344
+
345
+ vae = AutoencoderKLCogVideoX.from_pretrained(
346
+ args.pretrained_model_name_or_path,
347
+ subfolder="vae",
348
+ revision=args.revision,
349
+ variant=args.variant,
350
+ )
351
+
352
+ scheduler = CogVideoXDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
353
+
354
+ if args.enable_slicing:
355
+ vae.enable_slicing()
356
+ if args.enable_tiling:
357
+ vae.enable_tiling()
358
+
359
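+ # Full finetune: the entire transformer is trainable; the VAE and text encoder stay frozen.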
+ text_encoder.requires_grad_(False)
360
+ vae.requires_grad_(False)
361
+ transformer.requires_grad_(True)
362
+
363
+ VAE_SCALING_FACTOR = vae.config.scaling_factor
364
+ VAE_SCALE_FACTOR_SPATIAL = 2 ** (len(vae.config.block_out_channels) - 1)
365
+ RoPE_BASE_HEIGHT = transformer.config.sample_height * VAE_SCALE_FACTOR_SPATIAL
366
+ RoPE_BASE_WIDTH = transformer.config.sample_width * VAE_SCALE_FACTOR_SPATIAL
367
+
368
+ # For mixed precision training we cast all non-trainable weights (vae, text_encoder and transformer) to half-precision
369
+ # as these weights are only used for inference; keeping them in full precision is not required.
370
+ weight_dtype = torch.float32
371
+ if accelerator.state.deepspeed_plugin:
372
+ # DeepSpeed is handling precision, use what's in the DeepSpeed config
373
+ if (
374
+ "fp16" in accelerator.state.deepspeed_plugin.deepspeed_config
375
+ and accelerator.state.deepspeed_plugin.deepspeed_config["fp16"]["enabled"]
376
+ ):
377
+ weight_dtype = torch.float16
378
+ if (
379
+ "bf16" in accelerator.state.deepspeed_plugin.deepspeed_config
380
+ and accelerator.state.deepspeed_plugin.deepspeed_config["bf16"]["enabled"]
381
+ ):
382
+ weight_dtype = torch.bfloat16
383
+ else:
384
+ if accelerator.mixed_precision == "fp16":
385
+ weight_dtype = torch.float16
386
+ elif accelerator.mixed_precision == "bf16":
387
+ weight_dtype = torch.bfloat16
388
+
389
+ if torch.backends.mps.is_available() and weight_dtype == torch.bfloat16:
390
+ # due to pytorch#99272, MPS does not yet support bfloat16.
391
+ raise ValueError(
392
+ "Mixed precision training with bfloat16 is not supported on MPS. Please use fp16 (recommended) or fp32 instead."
393
+ )
394
+
395
+ text_encoder.to(accelerator.device, dtype=weight_dtype)
396
+ transformer.to(accelerator.device, dtype=weight_dtype)
397
+ vae.to(accelerator.device, dtype=weight_dtype)
398
+
399
+ if args.gradient_checkpointing:
400
+ transformer.enable_gradient_checkpointing()
401
+
402
+ # create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
403
+ def save_model_hook(models, weights, output_dir):
404
+ if accelerator.is_main_process:
405
+ for model in models:
406
+ if isinstance(unwrap_model(accelerator, model), type(unwrap_model(accelerator, transformer))):
407
+ model = unwrap_model(accelerator, model)
408
+ model.save_pretrained(
409
+ os.path.join(output_dir, "transformer"), safe_serialization=True, max_shard_size="5GB"
410
+ )
411
+ else:
412
+ raise ValueError(f"Unexpected save model: {model.__class__}")
413
+
414
+ # make sure to pop weight so that corresponding model is not saved again
415
+ if weights:
416
+ weights.pop()
417
+
418
+ def load_model_hook(models, input_dir):
419
+ transformer_ = None
420
+ init_under_meta = False
421
+
422
+ # This is a bit of a hack but I don't know any other solution.
423
+ if not accelerator.distributed_type == DistributedType.DEEPSPEED:
424
+ while len(models) > 0:
425
+ model = models.pop()
426
+
427
+ if isinstance(unwrap_model(accelerator, model), type(unwrap_model(accelerator, transformer))):
428
+ transformer_ = unwrap_model(accelerator, model)
429
+ else:
430
+ raise ValueError(f"Unexpected save model: {unwrap_model(accelerator, model).__class__}")
431
+ else:
432
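+ # Under DeepSpeed, build the transformer on the meta device from its config and then load
+ # the checkpointed weights with `assign=True` below.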
+ with init_empty_weights():
433
+ transformer_ = CogVideoXTransformer3DModel.from_config(
434
+ args.pretrained_model_name_or_path, subfolder="transformer"
435
+ )
436
+ init_under_meta = True
437
+
438
+ load_model = CogVideoXTransformer3DModel.from_pretrained(os.path.join(input_dir, "transformer"))
439
+ transformer_.register_to_config(**load_model.config)
440
+ transformer_.load_state_dict(load_model.state_dict(), assign=init_under_meta)
441
+ del load_model
442
+
443
+ # Make sure the trainable params are in float32. This is again needed since the base models
444
+ # are in `weight_dtype`. More details:
445
+ # https://github.com/huggingface/diffusers/pull/6514#discussion_r1449796804
446
+ if args.mixed_precision == "fp16":
447
+ cast_training_params([transformer_])
448
+
449
+ accelerator.register_save_state_pre_hook(save_model_hook)
450
+ accelerator.register_load_state_pre_hook(load_model_hook)
451
+
452
+ # Enable TF32 for faster training on Ampere GPUs,
453
+ # cf https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
454
+ if args.allow_tf32 and torch.cuda.is_available():
455
+ torch.backends.cuda.matmul.allow_tf32 = True
456
+
457
+ if args.scale_lr:
458
+ args.learning_rate = (
459
+ args.learning_rate * args.gradient_accumulation_steps * args.train_batch_size * accelerator.num_processes
460
+ )
461
+
462
+ # Make sure the trainable params are in float32.
463
+ if args.mixed_precision == "fp16":
464
+ cast_training_params([transformer], dtype=torch.float32)
465
+
466
+ transformer_parameters = list(filter(lambda p: p.requires_grad, transformer.parameters()))
467
+
468
+ # Optimization parameters
469
+ transformer_parameters_with_lr = {
470
+ "params": transformer_parameters,
471
+ "lr": args.learning_rate,
472
+ }
473
+ params_to_optimize = [transformer_parameters_with_lr]
474
+ num_trainable_parameters = sum(param.numel() for model in params_to_optimize for param in model["params"])
475
+
476
+ use_deepspeed_optimizer = (
477
+ accelerator.state.deepspeed_plugin is not None
478
+ and "optimizer" in accelerator.state.deepspeed_plugin.deepspeed_config
479
+ )
480
+ use_deepspeed_scheduler = (
481
+ accelerator.state.deepspeed_plugin is not None
482
+ and "scheduler" in accelerator.state.deepspeed_plugin.deepspeed_config
483
+ )
484
+
485
+ optimizer = get_optimizer(
486
+ params_to_optimize=params_to_optimize,
487
+ optimizer_name=args.optimizer,
488
+ learning_rate=args.learning_rate,
489
+ beta1=args.beta1,
490
+ beta2=args.beta2,
491
+ beta3=args.beta3,
492
+ epsilon=args.epsilon,
493
+ weight_decay=args.weight_decay,
494
+ prodigy_decouple=args.prodigy_decouple,
495
+ prodigy_use_bias_correction=args.prodigy_use_bias_correction,
496
+ prodigy_safeguard_warmup=args.prodigy_safeguard_warmup,
497
+ use_8bit=args.use_8bit,
498
+ use_4bit=args.use_4bit,
499
+ use_torchao=args.use_torchao,
500
+ use_deepspeed=use_deepspeed_optimizer,
501
+ use_cpu_offload_optimizer=args.use_cpu_offload_optimizer,
502
+ offload_gradients=args.offload_gradients,
503
+ )
504
+
505
+ # Dataset and DataLoader
506
+ dataset_init_kwargs = {
507
+ "data_root": args.data_root,
508
+ "dataset_file": args.dataset_file,
509
+ "caption_column": args.caption_column,
510
+ "video_column": args.video_column,
511
+ "max_num_frames": args.max_num_frames,
512
+ "id_token": args.id_token,
513
+ "height_buckets": args.height_buckets,
514
+ "width_buckets": args.width_buckets,
515
+ "frame_buckets": args.frame_buckets,
516
+ "load_tensors": args.load_tensors,
517
+ "random_flip": args.random_flip,
518
+ "image_to_video": True,
519
+ }
520
+ if args.video_reshape_mode is None:
521
+ train_dataset = VideoDatasetWithResizing(**dataset_init_kwargs)
522
+ else:
523
+ train_dataset = VideoDatasetWithResizeAndRectangleCrop(
524
+ video_reshape_mode=args.video_reshape_mode, **dataset_init_kwargs
525
+ )
526
+
527
+ collate_fn = CollateFunction(weight_dtype, args.load_tensors)
528
+
529
+ train_dataloader = DataLoader(
530
+ train_dataset,
531
+ batch_size=1,
532
+ sampler=BucketSampler(train_dataset, batch_size=args.train_batch_size, shuffle=True),
533
+ collate_fn=collate_fn,
534
+ num_workers=args.dataloader_num_workers,
535
+ pin_memory=args.pin_memory,
536
+ )
537
+
538
+ # Scheduler and math around the number of training steps.
539
+ overrode_max_train_steps = False
540
+ num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
541
+ if args.max_train_steps is None:
542
+ args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
543
+ overrode_max_train_steps = True
544
+
545
+ if args.use_cpu_offload_optimizer:
546
+ lr_scheduler = None
547
+ accelerator.print(
548
+ "CPU Offload Optimizer cannot be used with DeepSpeed or builtin PyTorch LR Schedulers. If "
549
+ "you are training with those settings, they will be ignored."
550
+ )
551
+ else:
552
+ if use_deepspeed_scheduler:
553
+ from accelerate.utils import DummyScheduler
554
+
555
+ lr_scheduler = DummyScheduler(
556
+ name=args.lr_scheduler,
557
+ optimizer=optimizer,
558
+ total_num_steps=args.max_train_steps * accelerator.num_processes,
559
+ num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
560
+ )
561
+ else:
562
+ lr_scheduler = get_scheduler(
563
+ args.lr_scheduler,
564
+ optimizer=optimizer,
565
+ num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
566
+ num_training_steps=args.max_train_steps * accelerator.num_processes,
567
+ num_cycles=args.lr_num_cycles,
568
+ power=args.lr_power,
569
+ )
570
+
571
+ # Prepare everything with our `accelerator`.
572
+ transformer, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
573
+ transformer, optimizer, train_dataloader, lr_scheduler
574
+ )
575
+
576
+ # We need to recalculate our total training steps as the size of the training dataloader may have changed.
577
+ num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
578
+ if overrode_max_train_steps:
579
+ args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
580
+ # Afterwards we recalculate our number of training epochs
581
+ args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
582
+
583
+ # We need to initialize the trackers we use, and also store our configuration.
584
+ # The trackers initialize automatically on the main process.
585
+ if accelerator.is_main_process:
586
+ tracker_name = args.tracker_name or "cogvideox-sft"
587
+ accelerator.init_trackers(tracker_name, config=vars(args))
588
+
589
+ accelerator.print("===== Memory before training =====")
590
+ reset_memory(accelerator.device)
591
+ print_memory(accelerator.device)
592
+
593
+ # Train!
594
+ total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
595
+
596
+ accelerator.print("***** Running training *****")
597
+ accelerator.print(f" Num trainable parameters = {num_trainable_parameters}")
598
+ accelerator.print(f" Num examples = {len(train_dataset)}")
599
+ accelerator.print(f" Num batches each epoch = {len(train_dataloader)}")
600
+ accelerator.print(f" Num epochs = {args.num_train_epochs}")
601
+ accelerator.print(f" Instantaneous batch size per device = {args.train_batch_size}")
602
+ accelerator.print(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
603
+ accelerator.print(f" Gradient accumulation steps = {args.gradient_accumulation_steps}")
604
+ accelerator.print(f" Total optimization steps = {args.max_train_steps}")
605
+ global_step = 0
606
+ first_epoch = 0
607
+
608
+ # Potentially load in the weights and states from a previous save
609
+ if not args.resume_from_checkpoint:
610
+ initial_global_step = 0
611
+ else:
612
+ if args.resume_from_checkpoint != "latest":
613
+ path = os.path.basename(args.resume_from_checkpoint)
614
+ else:
615
+ # Get the most recent checkpoint
616
+ dirs = os.listdir(args.output_dir)
617
+ dirs = [d for d in dirs if d.startswith("checkpoint")]
618
+ dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
619
+ path = dirs[-1] if len(dirs) > 0 else None
620
+
621
+ if path is None:
622
+ accelerator.print(
623
+ f"Checkpoint '{args.resume_from_checkpoint}' does not exist. Starting a new training run."
624
+ )
625
+ args.resume_from_checkpoint = None
626
+ initial_global_step = 0
627
+ else:
628
+ accelerator.print(f"Resuming from checkpoint {path}")
629
+ accelerator.load_state(os.path.join(args.output_dir, path))
630
+ global_step = int(path.split("-")[1])
631
+
632
+ initial_global_step = global_step
633
+ first_epoch = global_step // num_update_steps_per_epoch
634
+
635
+ progress_bar = tqdm(
636
+ range(0, args.max_train_steps),
637
+ initial=initial_global_step,
638
+ desc="Steps",
639
+ # Only show the progress bar once on each machine.
640
+ disable=not accelerator.is_local_main_process,
641
+ )
642
+
643
+ # For DeepSpeed training
644
+ model_config = transformer.module.config if hasattr(transformer, "module") else transformer.config
645
+
646
+ if args.load_tensors:
647
+ del vae, text_encoder
648
+ gc.collect()
649
+ torch.cuda.empty_cache()
650
+ torch.cuda.synchronize(accelerator.device)
651
+
652
+ alphas_cumprod = scheduler.alphas_cumprod.to(accelerator.device, dtype=torch.float32)
653
+
654
+ for epoch in range(first_epoch, args.num_train_epochs):
655
+ transformer.train()
656
+ for step, batch in enumerate(train_dataloader):
657
+ models_to_accumulate = [transformer]
658
+ logs = {}
659
+
660
+ with accelerator.accumulate(models_to_accumulate):
661
+ images = batch["images"].to(accelerator.device, non_blocking=True)
662
+ videos = batch["videos"].to(accelerator.device, non_blocking=True)
663
+ prompts = batch["prompts"]
664
+
665
+ # Encode videos
666
+ if not args.load_tensors:
667
+ images = images.permute(0, 2, 1, 3, 4) # [B, C, F, H, W]
668
+ image_noise_sigma = torch.normal(
669
+ mean=-3.0, std=0.5, size=(images.size(0),), device=accelerator.device, dtype=weight_dtype
670
+ )
671
+ image_noise_sigma = torch.exp(image_noise_sigma)
672
+ noisy_images = images + torch.randn_like(images) * image_noise_sigma[:, None, None, None, None]
673
+ image_latent_dist = vae.encode(noisy_images).latent_dist
674
+
675
+ videos = videos.permute(0, 2, 1, 3, 4) # [B, C, F, H, W]
676
+ latent_dist = vae.encode(videos).latent_dist
677
+ else:
678
+ image_latent_dist = DiagonalGaussianDistribution(images)
679
+ latent_dist = DiagonalGaussianDistribution(videos)
680
+
681
+ image_latents = image_latent_dist.sample() * VAE_SCALING_FACTOR
682
+ image_latents = image_latents.permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
683
+ image_latents = image_latents.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
684
+
685
+ video_latents = latent_dist.sample() * VAE_SCALING_FACTOR
686
+ video_latents = video_latents.permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
687
+ video_latents = video_latents.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
688
+
689
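+ # Zero-pad the image latents along the frame axis so they line up with the video latents
+ # before being concatenated as conditioning channels further below.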
+ padding_shape = (video_latents.shape[0], video_latents.shape[1] - 1, *video_latents.shape[2:])
690
+ latent_padding = image_latents.new_zeros(padding_shape)
691
+ image_latents = torch.cat([image_latents, latent_padding], dim=1)
692
+
693
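+ # Randomly drop the image conditioning so the model also learns to denoise without it
+ # (similar in spirit to classifier-free guidance dropout).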
+ if random.random() < args.noised_image_dropout:
694
+ image_latents = torch.zeros_like(image_latents)
695
+
696
+ # Encode prompts
697
+ if not args.load_tensors:
698
+ prompt_embeds = compute_prompt_embeddings(
699
+ tokenizer,
700
+ text_encoder,
701
+ prompts,
702
+ model_config.max_text_seq_length,
703
+ accelerator.device,
704
+ weight_dtype,
705
+ requires_grad=False,
706
+ )
707
+ else:
708
+ prompt_embeds = prompts.to(dtype=weight_dtype)
709
+
710
+ # Sample noise that will be added to the latents
711
+ noise = torch.randn_like(video_latents)
712
+ batch_size, num_frames, num_channels, height, width = video_latents.shape
713
+
714
+ # Sample a random timestep for each image
715
+ timesteps = torch.randint(
716
+ 0,
717
+ scheduler.config.num_train_timesteps,
718
+ (batch_size,),
719
+ dtype=torch.int64,
720
+ device=accelerator.device,
721
+ )
722
+
723
+ # Prepare rotary embeds
724
+ image_rotary_emb = (
725
+ prepare_rotary_positional_embeddings(
726
+ height=height * VAE_SCALE_FACTOR_SPATIAL,
727
+ width=width * VAE_SCALE_FACTOR_SPATIAL,
728
+ num_frames=num_frames,
729
+ vae_scale_factor_spatial=VAE_SCALE_FACTOR_SPATIAL,
730
+ patch_size=model_config.patch_size,
731
+ patch_size_t=model_config.patch_size_t if hasattr(model_config, "patch_size_t") else None,
732
+ attention_head_dim=model_config.attention_head_dim,
733
+ device=accelerator.device,
734
+ base_height=RoPE_BASE_HEIGHT,
735
+ base_width=RoPE_BASE_WIDTH,
736
+ )
737
+ if model_config.use_rotary_positional_embeddings
738
+ else None
739
+ )
740
+
741
+ # Add noise to the model input according to the noise magnitude at each timestep
742
+ # (this is the forward diffusion process)
743
+ noisy_video_latents = scheduler.add_noise(video_latents, noise, timesteps)
744
+ noisy_model_input = torch.cat([noisy_video_latents, image_latents], dim=2)
745
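+ # Not every CogVideoX config defines `ofs_embed_dim`; only build the ofs embedding when it exists.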
+ ofs_embed_dim = model_config.ofs_embed_dim if hasattr(model_config, "ofs_embed_dim") else None
747
+ ofs_emb = None if ofs_embed_dim is None else noisy_model_input.new_full((1,), fill_value=2.0)
748
+ # Predict the noise residual
749
+ model_output = transformer(
750
+ hidden_states=noisy_model_input,
751
+ encoder_hidden_states=prompt_embeds,
752
+ timestep=timesteps,
753
+ ofs=ofs_emb,
754
+ image_rotary_emb=image_rotary_emb,
755
+ return_dict=False,
756
+ )[0]
757
+
758
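+ # get_velocity() effectively converts the v-prediction output back into a prediction of the
+ # clean latents, which is compared against `video_latents` below.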
+ model_pred = scheduler.get_velocity(model_output, noisy_video_latents, timesteps)
759
+
760
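+ # Weight the per-sample loss by 1 / (1 - alpha_bar_t): low-noise timesteps get a larger weight.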
+ weights = 1 / (1 - alphas_cumprod[timesteps])
761
+ while len(weights.shape) < len(model_pred.shape):
762
+ weights = weights.unsqueeze(-1)
763
+
764
+ target = video_latents
765
+
766
+ loss = torch.mean(
767
+ (weights * (model_pred - target) ** 2).reshape(batch_size, -1),
768
+ dim=1,
769
+ )
770
+ loss = loss.mean()
771
+ accelerator.backward(loss)
772
+
773
+ if accelerator.sync_gradients:
774
+ gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
775
+ accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
776
+ gradient_norm_after_clip = get_gradient_norm(transformer.parameters())
777
+ logs.update(
778
+ {
779
+ "gradient_norm_before_clip": gradient_norm_before_clip,
780
+ "gradient_norm_after_clip": gradient_norm_after_clip,
781
+ }
782
+ )
783
+ if accelerator.state.deepspeed_plugin is None:
784
+ optimizer.step()
785
+ optimizer.zero_grad()
786
+
787
+ if not args.use_cpu_offload_optimizer:
788
+ lr_scheduler.step()
789
+
790
+ # Checks if the accelerator has performed an optimization step behind the scenes
791
+ if accelerator.sync_gradients:
792
+ progress_bar.update(1)
793
+ global_step += 1
794
+
795
+ # Checkpointing
796
+ if accelerator.is_main_process or accelerator.distributed_type == DistributedType.DEEPSPEED:
797
+ if global_step % args.checkpointing_steps == 0:
798
+ # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
799
+ if args.checkpoints_total_limit is not None:
800
+ checkpoints = os.listdir(args.output_dir)
801
+ checkpoints = [d for d in checkpoints if d.startswith("checkpoint")]
802
+ checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))
803
+
804
+ # before we save the new checkpoint, we need to have at _most_ `checkpoints_total_limit - 1` checkpoints
805
+ if len(checkpoints) >= args.checkpoints_total_limit:
806
+ num_to_remove = len(checkpoints) - args.checkpoints_total_limit + 1
807
+ removing_checkpoints = checkpoints[0:num_to_remove]
808
+
809
+ logger.info(
810
+ f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
811
+ )
812
+ logger.info(f"Removing checkpoints: {', '.join(removing_checkpoints)}")
813
+
814
+ for removing_checkpoint in removing_checkpoints:
815
+ removing_checkpoint = os.path.join(args.output_dir, removing_checkpoint)
816
+ shutil.rmtree(removing_checkpoint)
817
+
818
+ save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
819
+ accelerator.save_state(save_path)
820
+ logger.info(f"Saved state to {save_path}")
821
+
822
+ # Validation
823
+ should_run_validation = args.validation_prompt is not None and (
824
+ args.validation_steps is not None and global_step % args.validation_steps == 0
825
+ )
826
+ if should_run_validation:
827
+ run_validation(args, accelerator, transformer, scheduler, model_config, weight_dtype)
828
+
829
+ last_lr = lr_scheduler.get_last_lr()[0] if lr_scheduler is not None else args.learning_rate
830
+ logs.update(
831
+ {
832
+ "loss": loss.detach().item(),
833
+ "lr": last_lr,
834
+ }
835
+ )
836
+ progress_bar.set_postfix(**logs)
837
+ accelerator.log(logs, step=global_step)
838
+
839
+ if global_step >= args.max_train_steps:
840
+ break
841
+
842
+ if accelerator.is_main_process:
843
+ should_run_validation = args.validation_prompt is not None and (
844
+ args.validation_epochs is not None and (epoch + 1) % args.validation_epochs == 0
845
+ )
846
+ if should_run_validation:
847
+ run_validation(args, accelerator, transformer, scheduler, model_config, weight_dtype)
848
+ accelerator.wait_for_everyone()
849
+
850
+ if accelerator.is_main_process:
851
+ transformer = unwrap_model(accelerator, transformer)
852
+ dtype = (
853
+ torch.float16
854
+ if args.mixed_precision == "fp16"
855
+ else torch.bfloat16
856
+ if args.mixed_precision == "bf16"
857
+ else torch.float32
858
+ )
859
+ transformer = transformer.to(dtype)
860
+
861
+ transformer.save_pretrained(
862
+ os.path.join(args.output_dir, "transformer"),
863
+ safe_serialization=True,
864
+ max_shard_size="5GB",
865
+ )
866
+
867
+ # Cleanup trained models to save memory
868
+ if args.load_tensors:
869
+ del transformer
870
+ else:
871
+ del transformer, text_encoder, vae
872
+
873
+ gc.collect()
874
+ torch.cuda.empty_cache()
875
+ torch.cuda.synchronize(accelerator.device)
876
+
877
+ accelerator.print("===== Memory before testing =====")
878
+ print_memory(accelerator.device)
879
+ reset_memory(accelerator.device)
880
+
881
+ # Final test inference
882
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
883
+ args.pretrained_model_name_or_path,
884
+ revision=args.revision,
885
+ variant=args.variant,
886
+ torch_dtype=weight_dtype,
887
+ )
888
+ pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
889
+
890
+ if args.enable_slicing:
891
+ pipe.vae.enable_slicing()
892
+ if args.enable_tiling:
893
+ pipe.vae.enable_tiling()
894
+ if args.enable_model_cpu_offload:
895
+ pipe.enable_model_cpu_offload()
896
+
897
+ # Run inference
898
+ validation_outputs = []
899
+ if args.validation_prompt and args.num_validation_videos > 0:
900
+ validation_prompts = args.validation_prompt.split(args.validation_prompt_separator)
901
+ validation_images = args.validation_images.split(args.validation_prompt_separator)
902
+ for validation_image, validation_prompt in zip(validation_images, validation_prompts):
903
+ pipeline_args = {
904
+ "image": load_image(validation_image),
905
+ "prompt": validation_prompt,
906
+ "guidance_scale": args.guidance_scale,
907
+ "use_dynamic_cfg": args.use_dynamic_cfg,
908
+ "height": args.height,
909
+ "width": args.width,
910
+ }
911
+
912
+ video = log_validation(
913
+ accelerator=accelerator,
914
+ pipe=pipe,
915
+ args=args,
916
+ pipeline_args=pipeline_args,
917
+ is_final_validation=True,
918
+ )
919
+ validation_outputs.extend(video)
920
+
921
+ accelerator.print("===== Memory after testing =====")
922
+ print_memory(accelerator.device)
923
+ reset_memory(accelerator.device)
924
+ torch.cuda.synchronize(accelerator.device)
925
+
926
+ if args.push_to_hub:
927
+ save_model_card(
928
+ repo_id,
929
+ videos=validation_outputs,
930
+ base_model=args.pretrained_model_name_or_path,
931
+ validation_prompt=args.validation_prompt,
932
+ repo_folder=args.output_dir,
933
+ fps=args.fps,
934
+ )
935
+ upload_folder(
936
+ repo_id=repo_id,
937
+ folder_path=args.output_dir,
938
+ commit_message="End of training",
939
+ ignore_patterns=["step_*", "epoch_*"],
940
+ )
941
+
942
+ accelerator.end_training()
943
+
944
+
945
+ if __name__ == "__main__":
946
+ args = get_args()
947
+ main(args)
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/cogvideox_text_to_video_lora.py ADDED
@@ -0,0 +1,955 @@
1
+ # Copyright 2024 The HuggingFace Team.
2
+ # All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import gc
17
+ import logging
18
+ import math
19
+ import os
20
+ import shutil
21
+ from datetime import timedelta
22
+ from pathlib import Path
23
+ from typing import Any, Dict
24
+
25
+ import diffusers
26
+ import torch
27
+ import transformers
28
+ import wandb
29
+ from accelerate import Accelerator, DistributedType
30
+ from accelerate.logging import get_logger
31
+ from accelerate.utils import (
32
+ DistributedDataParallelKwargs,
33
+ InitProcessGroupKwargs,
34
+ ProjectConfiguration,
35
+ set_seed,
36
+ )
37
+ from diffusers import (
38
+ AutoencoderKLCogVideoX,
39
+ CogVideoXDPMScheduler,
40
+ CogVideoXPipeline,
41
+ CogVideoXTransformer3DModel,
42
+ )
43
+ from diffusers.models.autoencoders.vae import DiagonalGaussianDistribution
44
+ from diffusers.optimization import get_scheduler
45
+ from diffusers.training_utils import cast_training_params
46
+ from diffusers.utils import convert_unet_state_dict_to_peft, export_to_video
47
+ from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
48
+ from huggingface_hub import create_repo, upload_folder
49
+ from peft import LoraConfig, get_peft_model_state_dict, set_peft_model_state_dict
50
+ from torch.utils.data import DataLoader
51
+ from tqdm.auto import tqdm
52
+ from transformers import AutoTokenizer, T5EncoderModel
53
+
54
+
55
+ from args import get_args # isort:skip
56
+ from dataset import BucketSampler, VideoDatasetWithResizing, VideoDatasetWithResizeAndRectangleCrop # isort:skip
57
+ from text_encoder import compute_prompt_embeddings # isort:skip
58
+ from utils import (
59
+ get_gradient_norm,
60
+ get_optimizer,
61
+ prepare_rotary_positional_embeddings,
62
+ print_memory,
63
+ reset_memory,
64
+ unwrap_model,
65
+ ) # isort:skip
66
+
67
+
68
+ logger = get_logger(__name__)
69
+
70
+
71
+ def save_model_card(
72
+ repo_id: str,
73
+ videos=None,
74
+ base_model: str = None,
75
+ validation_prompt=None,
76
+ repo_folder=None,
77
+ fps=8,
78
+ ):
79
+ widget_dict = []
80
+ if videos is not None:
81
+ for i, video in enumerate(videos):
82
+ export_to_video(video, os.path.join(repo_folder, f"final_video_{i}.mp4"), fps=fps)
83
+ widget_dict.append(
84
+ {
85
+ "text": validation_prompt if validation_prompt else " ",
86
+ "output": {"url": f"video_{i}.mp4"},
87
+ }
88
+ )
89
+
90
+ model_description = f"""
91
+ # CogVideoX LoRA Finetune
92
+
93
+ <Gallery />
94
+
95
+ ## Model description
96
+
97
+ This is a lora finetune of the CogVideoX model `{base_model}`.
98
+
99
+ The model was trained using [CogVideoX Factory](https://github.com/a-r-r-o-w/cogvideox-factory) - a repository containing memory-optimized training scripts for the CogVideoX family of models using [TorchAO](https://github.com/pytorch/ao) and [DeepSpeed](https://github.com/microsoft/DeepSpeed). The scripts were adapted from [CogVideoX Diffusers trainer](https://github.com/huggingface/diffusers/blob/main/examples/cogvideo/train_cogvideox_lora.py).
100
+
101
+ ## Download model
102
+
103
+ [Download LoRA]({repo_id}/tree/main) in the Files & Versions tab.
104
+
105
+ ## Usage
106
+
107
+ Requires the [🧨 Diffusers library](https://github.com/huggingface/diffusers) installed.
108
+
109
+ ```py
110
+ import torch
111
+ from diffusers import CogVideoXPipeline
112
+ from diffusers.utils import export_to_video
113
+
114
+ pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to("cuda")
115
+ pipe.load_lora_weights("{repo_id}", weight_name="pytorch_lora_weights.safetensors", adapter_name="cogvideox-lora")
116
+
117
+ # The LoRA adapter weights are determined by what was used for training.
118
+ # In this case, we assume `--lora_alpha` is 32 and `--rank` is 64.
119
+ # It can be made lower or higher from what was used in training to decrease or amplify the effect
120
+ # of the LoRA up to a tolerance, beyond which one might notice no effect at all or overflows.
121
+ pipe.set_adapters(["cogvideox-lora"], [32 / 64])
122
+
123
+ video = pipe("{validation_prompt}", guidance_scale=6, use_dynamic_cfg=True).frames[0]
124
+ export_to_video(video, "output.mp4", fps=8)
125
+ ```
126
+
127
+ For more details, including weighting, merging and fusing LoRAs, check the [documentation](https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading_adapters) on loading LoRAs in diffusers.
128
+
129
+ ## License
130
+
131
+ Please adhere to the licensing terms as described [here](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE) and [here](https://huggingface.co/THUDM/CogVideoX-2b/blob/main/LICENSE).
132
+ """
133
+ model_card = load_or_create_model_card(
134
+ repo_id_or_path=repo_id,
135
+ from_training=True,
136
+ license="other",
137
+ base_model=base_model,
138
+ prompt=validation_prompt,
139
+ model_description=model_description,
140
+ widget=widget_dict,
141
+ )
142
+ tags = [
143
+ "text-to-video",
144
+ "diffusers-training",
145
+ "diffusers",
146
+ "lora",
147
+ "cogvideox",
148
+ "cogvideox-diffusers",
149
+ "template:sd-lora",
150
+ ]
151
+
152
+ model_card = populate_model_card(model_card, tags=tags)
153
+ model_card.save(os.path.join(repo_folder, "README.md"))
154
+
155
+
156
+ def log_validation(
157
+ accelerator: Accelerator,
158
+ pipe: CogVideoXPipeline,
159
+ args: Dict[str, Any],
160
+ pipeline_args: Dict[str, Any],
161
+ epoch,
162
+ is_final_validation: bool = False,
163
+ ):
164
+ logger.info(
165
+ f"Running validation... \n Generating {args.num_validation_videos} videos with prompt: {pipeline_args['prompt']}."
166
+ )
167
+
168
+ pipe = pipe.to(accelerator.device)
169
+
170
+ # run inference
171
+ generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) if args.seed else None
172
+
173
+ videos = []
174
+ for _ in range(args.num_validation_videos):
175
+ video = pipe(**pipeline_args, generator=generator, output_type="np").frames[0]
176
+ videos.append(video)
177
+
178
+ for tracker in accelerator.trackers:
179
+ phase_name = "test" if is_final_validation else "validation"
180
+ if tracker.name == "wandb":
181
+ video_filenames = []
182
+ for i, video in enumerate(videos):
183
+ prompt = (
184
+ pipeline_args["prompt"][:25]
185
+ .replace(" ", "_")
186
+ .replace(" ", "_")
187
+ .replace("'", "_")
188
+ .replace('"', "_")
189
+ .replace("/", "_")
190
+ )
191
+ filename = os.path.join(args.output_dir, f"{phase_name}_video_{i}_{prompt}.mp4")
192
+ export_to_video(video, filename, fps=8)
193
+ video_filenames.append(filename)
194
+
195
+ tracker.log(
196
+ {
197
+ phase_name: [
198
+ wandb.Video(filename, caption=f"{i}: {pipeline_args['prompt']}")
199
+ for i, filename in enumerate(video_filenames)
200
+ ]
201
+ }
202
+ )
203
+
204
+ return videos
205
+
206
+
207
+ class CollateFunction:
208
+ def __init__(self, weight_dtype: torch.dtype, load_tensors: bool) -> None:
209
+ self.weight_dtype = weight_dtype
210
+ self.load_tensors = load_tensors
211
+
212
+ def __call__(self, data: Dict[str, Any]) -> Dict[str, torch.Tensor]:
213
+ prompts = [x["prompt"] for x in data[0]]
214
+
215
+ if self.load_tensors:
216
+ prompts = torch.stack(prompts).to(dtype=self.weight_dtype, non_blocking=True)
217
+
218
+ videos = [x["video"] for x in data[0]]
219
+ videos = torch.stack(videos).to(dtype=self.weight_dtype, non_blocking=True)
220
+
221
+ return {
222
+ "videos": videos,
223
+ "prompts": prompts,
224
+ }
225
+
226
+
227
+ def main(args):
228
+ if args.report_to == "wandb" and args.hub_token is not None:
229
+ raise ValueError(
230
+ "You cannot use both --report_to=wandb and --hub_token due to a security risk of exposing your token."
231
+ " Please use `huggingface-cli login` to authenticate with the Hub."
232
+ )
233
+
234
+ if torch.backends.mps.is_available() and args.mixed_precision == "bf16":
235
+ # due to pytorch#99272, MPS does not yet support bfloat16.
236
+ raise ValueError(
237
+ "Mixed precision training with bfloat16 is not supported on MPS. Please use fp16 (recommended) or fp32 instead."
238
+ )
239
+
240
+ logging_dir = Path(args.output_dir, args.logging_dir)
241
+
242
+ accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
243
+ ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
244
+ init_process_group_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=args.nccl_timeout))
245
+ accelerator = Accelerator(
246
+ gradient_accumulation_steps=args.gradient_accumulation_steps,
247
+ mixed_precision=args.mixed_precision,
248
+ log_with=args.report_to,
249
+ project_config=accelerator_project_config,
250
+ kwargs_handlers=[ddp_kwargs, init_process_group_kwargs],
251
+ )
252
+
253
+ # Disable AMP for MPS.
254
+ if torch.backends.mps.is_available():
255
+ accelerator.native_amp = False
256
+
257
+ # Make one log on every process with the configuration for debugging.
258
+ logging.basicConfig(
259
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
260
+ datefmt="%m/%d/%Y %H:%M:%S",
261
+ level=logging.INFO,
262
+ )
263
+ logger.info(accelerator.state, main_process_only=False)
264
+ if accelerator.is_local_main_process:
265
+ transformers.utils.logging.set_verbosity_warning()
266
+ diffusers.utils.logging.set_verbosity_info()
267
+ else:
268
+ transformers.utils.logging.set_verbosity_error()
269
+ diffusers.utils.logging.set_verbosity_error()
270
+
271
+ # If passed along, set the training seed now.
272
+ if args.seed is not None:
273
+ set_seed(args.seed)
274
+
275
+ # Handle the repository creation
276
+ if accelerator.is_main_process:
277
+ if args.output_dir is not None:
278
+ os.makedirs(args.output_dir, exist_ok=True)
279
+
280
+ if args.push_to_hub:
281
+ repo_id = create_repo(
282
+ repo_id=args.hub_model_id or Path(args.output_dir).name,
283
+ exist_ok=True,
284
+ ).repo_id
285
+
286
+ # Prepare models and scheduler
287
+ tokenizer = AutoTokenizer.from_pretrained(
288
+ args.pretrained_model_name_or_path,
289
+ subfolder="tokenizer",
290
+ revision=args.revision,
291
+ )
292
+
293
+ text_encoder = T5EncoderModel.from_pretrained(
294
+ args.pretrained_model_name_or_path,
295
+ subfolder="text_encoder",
296
+ revision=args.revision,
297
+ )
298
+
299
+ # CogVideoX-2b weights are stored in float16
300
+ # CogVideoX-5b and CogVideoX-5b-I2V weights are stored in bfloat16
301
+ load_dtype = torch.bfloat16 if "5b" in args.pretrained_model_name_or_path.lower() else torch.float16
302
+ transformer = CogVideoXTransformer3DModel.from_pretrained(
303
+ args.pretrained_model_name_or_path,
304
+ subfolder="transformer",
305
+ torch_dtype=load_dtype,
306
+ revision=args.revision,
307
+ variant=args.variant,
308
+ )
309
+
310
+ vae = AutoencoderKLCogVideoX.from_pretrained(
311
+ args.pretrained_model_name_or_path,
312
+ subfolder="vae",
313
+ revision=args.revision,
314
+ variant=args.variant,
315
+ )
316
+
317
+ scheduler = CogVideoXDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
318
+
319
+ if args.enable_slicing:
320
+ vae.enable_slicing()
321
+ if args.enable_tiling:
322
+ vae.enable_tiling()
323
+
324
+ # We only train the additional adapter LoRA layers
325
+ text_encoder.requires_grad_(False)
326
+ transformer.requires_grad_(False)
327
+ vae.requires_grad_(False)
328
+
329
+ VAE_SCALING_FACTOR = vae.config.scaling_factor
330
+ VAE_SCALE_FACTOR_SPATIAL = 2 ** (len(vae.config.block_out_channels) - 1)
331
+ RoPE_BASE_HEIGHT = transformer.config.sample_height * VAE_SCALE_FACTOR_SPATIAL
332
+ RoPE_BASE_WIDTH = transformer.config.sample_width * VAE_SCALE_FACTOR_SPATIAL
333
+
334
+ # For mixed precision training we cast all non-trainable weights (vae, text_encoder and transformer) to half-precision
335
+ # as these weights are only used for inference, keeping weights in full precision is not required.
336
+ weight_dtype = torch.float32
337
+ if accelerator.state.deepspeed_plugin:
338
+ # DeepSpeed is handling precision, use what's in the DeepSpeed config
339
+ if (
340
+ "fp16" in accelerator.state.deepspeed_plugin.deepspeed_config
341
+ and accelerator.state.deepspeed_plugin.deepspeed_config["fp16"]["enabled"]
342
+ ):
343
+ weight_dtype = torch.float16
344
+ if (
345
+ "bf16" in accelerator.state.deepspeed_plugin.deepspeed_config
346
+ and accelerator.state.deepspeed_plugin.deepspeed_config["bf16"]["enabled"]
347
+ ):
348
+ weight_dtype = torch.bfloat16
349
+ else:
350
+ if accelerator.mixed_precision == "fp16":
351
+ weight_dtype = torch.float16
352
+ elif accelerator.mixed_precision == "bf16":
353
+ weight_dtype = torch.bfloat16
354
+
355
+ if torch.backends.mps.is_available() and weight_dtype == torch.bfloat16:
356
+ # due to pytorch#99272, MPS does not yet support bfloat16.
357
+ raise ValueError(
358
+ "Mixed precision training with bfloat16 is not supported on MPS. Please use fp16 (recommended) or fp32 instead."
359
+ )
360
+
361
+ text_encoder.to(accelerator.device, dtype=weight_dtype)
362
+ transformer.to(accelerator.device, dtype=weight_dtype)
363
+ vae.to(accelerator.device, dtype=weight_dtype)
364
+
365
+ if args.gradient_checkpointing:
366
+ transformer.enable_gradient_checkpointing()
367
+
368
+ # now we will add new LoRA weights to the attention layers
369
+ transformer_lora_config = LoraConfig(
370
+ r=args.rank,
371
+ lora_alpha=args.lora_alpha,
372
+ init_lora_weights=True,
373
+ target_modules=["to_k", "to_q", "to_v", "to_out.0"],
374
+ )
375
+ transformer.add_adapter(transformer_lora_config)
376
+
377
+ # create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
378
+ def save_model_hook(models, weights, output_dir):
379
+ if accelerator.is_main_process:
380
+ transformer_lora_layers_to_save = None
381
+
382
+ for model in models:
383
+ if isinstance(unwrap_model(accelerator, model), type(unwrap_model(accelerator, transformer))):
384
+ model = unwrap_model(accelerator, model)
385
+ transformer_lora_layers_to_save = get_peft_model_state_dict(model)
386
+ else:
387
+ raise ValueError(f"unexpected save model: {model.__class__}")
388
+
389
+ # make sure to pop weight so that corresponding model is not saved again
390
+ if weights:
391
+ weights.pop()
392
+
393
+ CogVideoXPipeline.save_lora_weights(
394
+ output_dir,
395
+ transformer_lora_layers=transformer_lora_layers_to_save,
396
+ )
397
+
398
+ def load_model_hook(models, input_dir):
399
+ transformer_ = None
400
+
401
+ # This is a bit of a hack but I don't know any other solution.
402
+ if not accelerator.distributed_type == DistributedType.DEEPSPEED:
403
+ while len(models) > 0:
404
+ model = models.pop()
405
+
406
+ if isinstance(unwrap_model(accelerator, model), type(unwrap_model(accelerator, transformer))):
407
+ transformer_ = unwrap_model(accelerator, model)
408
+ else:
409
+ raise ValueError(f"Unexpected save model: {unwrap_model(accelerator, model).__class__}")
410
+ else:
411
+ transformer_ = CogVideoXTransformer3DModel.from_pretrained(
412
+ args.pretrained_model_name_or_path, subfolder="transformer"
413
+ )
414
+ transformer_.add_adapter(transformer_lora_config)
415
+
416
+ lora_state_dict = CogVideoXPipeline.lora_state_dict(input_dir)
417
+
418
+ transformer_state_dict = {
419
+ f'{k.replace("transformer.", "")}': v for k, v in lora_state_dict.items() if k.startswith("transformer.")
420
+ }
421
+ transformer_state_dict = convert_unet_state_dict_to_peft(transformer_state_dict)
422
+ incompatible_keys = set_peft_model_state_dict(transformer_, transformer_state_dict, adapter_name="default")
423
+ if incompatible_keys is not None:
424
+ # check only for unexpected keys
425
+ unexpected_keys = getattr(incompatible_keys, "unexpected_keys", None)
426
+ if unexpected_keys:
427
+ logger.warning(
428
+ f"Loading adapter weights from state_dict led to unexpected keys not found in the model: "
429
+ f" {unexpected_keys}. "
430
+ )
431
+
432
+ # Make sure the trainable params are in float32. This is again needed since the base models
433
+ # are in `weight_dtype`. More details:
434
+ # https://github.com/huggingface/diffusers/pull/6514#discussion_r1449796804
435
+ if args.mixed_precision == "fp16":
436
+ # only upcast trainable parameters (LoRA) into fp32
437
+ cast_training_params([transformer_])
438
+
439
+ accelerator.register_save_state_pre_hook(save_model_hook)
440
+ accelerator.register_load_state_pre_hook(load_model_hook)
441
+
442
+ # Enable TF32 for faster training on Ampere GPUs,
443
+ # cf https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
444
+ if args.allow_tf32 and torch.cuda.is_available():
445
+ torch.backends.cuda.matmul.allow_tf32 = True
446
+
447
+ if args.scale_lr:
448
+ args.learning_rate = (
449
+ args.learning_rate * args.gradient_accumulation_steps * args.train_batch_size * accelerator.num_processes
450
+ )
451
+
452
+ # Make sure the trainable params are in float32.
453
+ if args.mixed_precision == "fp16":
454
+ # only upcast trainable parameters (LoRA) into fp32
455
+ cast_training_params([transformer], dtype=torch.float32)
456
+
457
+ transformer_lora_parameters = list(filter(lambda p: p.requires_grad, transformer.parameters()))
458
+
459
+ # Optimization parameters
460
+ transformer_parameters_with_lr = {
461
+ "params": transformer_lora_parameters,
462
+ "lr": args.learning_rate,
463
+ }
464
+ params_to_optimize = [transformer_parameters_with_lr]
465
+ num_trainable_parameters = sum(param.numel() for model in params_to_optimize for param in model["params"])
466
+
467
+ use_deepspeed_optimizer = (
468
+ accelerator.state.deepspeed_plugin is not None
469
+ and "optimizer" in accelerator.state.deepspeed_plugin.deepspeed_config
470
+ )
471
+ use_deepspeed_scheduler = (
472
+ accelerator.state.deepspeed_plugin is not None
473
+ and "scheduler" in accelerator.state.deepspeed_plugin.deepspeed_config
474
+ )
475
+
476
+ optimizer = get_optimizer(
477
+ params_to_optimize=params_to_optimize,
478
+ optimizer_name=args.optimizer,
479
+ learning_rate=args.learning_rate,
480
+ beta1=args.beta1,
481
+ beta2=args.beta2,
482
+ beta3=args.beta3,
483
+ epsilon=args.epsilon,
484
+ weight_decay=args.weight_decay,
485
+ prodigy_decouple=args.prodigy_decouple,
486
+ prodigy_use_bias_correction=args.prodigy_use_bias_correction,
487
+ prodigy_safeguard_warmup=args.prodigy_safeguard_warmup,
488
+ use_8bit=args.use_8bit,
489
+ use_4bit=args.use_4bit,
490
+ use_torchao=args.use_torchao,
491
+ use_deepspeed=use_deepspeed_optimizer,
492
+ use_cpu_offload_optimizer=args.use_cpu_offload_optimizer,
493
+ offload_gradients=args.offload_gradients,
494
+ )
495
+
496
+ # Dataset and DataLoader
497
+ dataset_init_kwargs = {
498
+ "data_root": args.data_root,
499
+ "dataset_file": args.dataset_file,
500
+ "caption_column": args.caption_column,
501
+ "video_column": args.video_column,
502
+ "max_num_frames": args.max_num_frames,
503
+ "id_token": args.id_token,
504
+ "height_buckets": args.height_buckets,
505
+ "width_buckets": args.width_buckets,
506
+ "frame_buckets": args.frame_buckets,
507
+ "load_tensors": args.load_tensors,
508
+ "random_flip": args.random_flip,
509
+ }
510
+ if args.video_reshape_mode is None:
511
+ train_dataset = VideoDatasetWithResizing(**dataset_init_kwargs)
512
+ else:
513
+ train_dataset = VideoDatasetWithResizeAndRectangleCrop(
514
+ video_reshape_mode=args.video_reshape_mode, **dataset_init_kwargs
515
+ )
516
+
517
+ collate_fn = CollateFunction(weight_dtype, args.load_tensors)
518
+
519
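+ # The BucketSampler emits full same-resolution batches, so the DataLoader keeps batch_size=1
+ # and the collate function receives one bucket batch at a time.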
+ train_dataloader = DataLoader(
520
+ train_dataset,
521
+ batch_size=1,
522
+ sampler=BucketSampler(train_dataset, batch_size=args.train_batch_size, shuffle=True),
523
+ collate_fn=collate_fn,
524
+ num_workers=args.dataloader_num_workers,
525
+ pin_memory=args.pin_memory,
526
+ )
527
+
528
+ # Scheduler and math around the number of training steps.
529
+ overrode_max_train_steps = False
530
+ num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
531
+ if args.max_train_steps is None:
532
+ args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
533
+ overrode_max_train_steps = True
534
+
535
+ if args.use_cpu_offload_optimizer:
536
+ lr_scheduler = None
537
+ accelerator.print(
538
+ "CPU Offload Optimizer cannot be used with DeepSpeed or builtin PyTorch LR Schedulers. If "
539
+ "you are training with those settings, they will be ignored."
540
+ )
541
+ else:
542
+ if use_deepspeed_scheduler:
543
+ from accelerate.utils import DummyScheduler
544
+
545
+ lr_scheduler = DummyScheduler(
546
+ name=args.lr_scheduler,
547
+ optimizer=optimizer,
548
+ total_num_steps=args.max_train_steps * accelerator.num_processes,
549
+ num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
550
+ )
551
+ else:
552
+ lr_scheduler = get_scheduler(
553
+ args.lr_scheduler,
554
+ optimizer=optimizer,
555
+ num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
556
+ num_training_steps=args.max_train_steps * accelerator.num_processes,
557
+ num_cycles=args.lr_num_cycles,
558
+ power=args.lr_power,
559
+ )
560
+
561
+ # Prepare everything with our `accelerator`.
562
+ transformer, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
563
+ transformer, optimizer, train_dataloader, lr_scheduler
564
+ )
565
+
566
+ # We need to recalculate our total training steps as the size of the training dataloader may have changed.
567
+ num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
568
+ if overrode_max_train_steps:
569
+ args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
570
+ # Afterwards we recalculate our number of training epochs
571
+ args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
572
+
573
+ # We need to initialize the trackers we use, and also store our configuration.
574
+ # The trackers initialize automatically on the main process.
575
+ if accelerator.distributed_type == DistributedType.DEEPSPEED or accelerator.is_main_process:
576
+ tracker_name = args.tracker_name or "cogvideox-lora"
577
+ accelerator.init_trackers(tracker_name, config=vars(args))
578
+
579
+ accelerator.print("===== Memory before training =====")
580
+ reset_memory(accelerator.device)
581
+ print_memory(accelerator.device)
582
+
583
+ # Train!
584
+ total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
585
+
586
+ accelerator.print("***** Running training *****")
587
+ accelerator.print(f" Num trainable parameters = {num_trainable_parameters}")
588
+ accelerator.print(f" Num examples = {len(train_dataset)}")
589
+ accelerator.print(f" Num batches each epoch = {len(train_dataloader)}")
590
+ accelerator.print(f" Num epochs = {args.num_train_epochs}")
591
+ accelerator.print(f" Instantaneous batch size per device = {args.train_batch_size}")
592
+ accelerator.print(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
593
+ accelerator.print(f" Gradient accumulation steps = {args.gradient_accumulation_steps}")
594
+ accelerator.print(f" Total optimization steps = {args.max_train_steps}")
595
+ global_step = 0
596
+ first_epoch = 0
597
+
598
+ # Potentially load in the weights and states from a previous save
599
+ if not args.resume_from_checkpoint:
600
+ initial_global_step = 0
601
+ else:
602
+ if args.resume_from_checkpoint != "latest":
603
+ path = os.path.basename(args.resume_from_checkpoint)
604
+ else:
605
+ # Get the most recent checkpoint
606
+ dirs = os.listdir(args.output_dir)
607
+ dirs = [d for d in dirs if d.startswith("checkpoint")]
608
+ dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
609
+ path = dirs[-1] if len(dirs) > 0 else None
610
+
611
+ if path is None:
612
+ accelerator.print(
613
+ f"Checkpoint '{args.resume_from_checkpoint}' does not exist. Starting a new training run."
614
+ )
615
+ args.resume_from_checkpoint = None
616
+ initial_global_step = 0
617
+ else:
618
+ accelerator.print(f"Resuming from checkpoint {path}")
619
+ accelerator.load_state(os.path.join(args.output_dir, path))
620
+ global_step = int(path.split("-")[1])
621
+
622
+ initial_global_step = global_step
623
+ first_epoch = global_step // num_update_steps_per_epoch
624
+
625
+ progress_bar = tqdm(
626
+ range(0, args.max_train_steps),
627
+ initial=initial_global_step,
628
+ desc="Steps",
629
+ # Only show the progress bar once on each machine.
630
+ disable=not accelerator.is_local_main_process,
631
+ )
632
+
633
+ # For DeepSpeed training
634
+ model_config = transformer.module.config if hasattr(transformer, "module") else transformer.config
635
+
636
+ if args.load_tensors:
637
+ del vae, text_encoder
638
+ gc.collect()
639
+ torch.cuda.empty_cache()
640
+ torch.cuda.synchronize(accelerator.device)
641
+
642
+ alphas_cumprod = scheduler.alphas_cumprod.to(accelerator.device, dtype=torch.float32)
643
+
644
+ for epoch in range(first_epoch, args.num_train_epochs):
645
+ transformer.train()
646
+
647
+ for step, batch in enumerate(train_dataloader):
648
+ models_to_accumulate = [transformer]
649
+ logs = {}
650
+
651
+ with accelerator.accumulate(models_to_accumulate):
652
+ videos = batch["videos"].to(accelerator.device, non_blocking=True)
653
+ prompts = batch["prompts"]
654
+
655
+ # Encode videos
656
+ if not args.load_tensors:
657
+ videos = videos.permute(0, 2, 1, 3, 4) # [B, C, F, H, W]
658
+ latent_dist = vae.encode(videos).latent_dist
659
+ else:
660
+ latent_dist = DiagonalGaussianDistribution(videos)
661
+
662
+ videos = latent_dist.sample() * VAE_SCALING_FACTOR
663
+ videos = videos.permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
664
+ videos = videos.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
665
+ model_input = videos
666
+
667
+ # Encode prompts
668
+ if not args.load_tensors:
669
+ prompt_embeds = compute_prompt_embeddings(
670
+ tokenizer,
671
+ text_encoder,
672
+ prompts,
673
+ model_config.max_text_seq_length,
674
+ accelerator.device,
675
+ weight_dtype,
676
+ requires_grad=False,
677
+ )
678
+ else:
679
+ prompt_embeds = prompts.to(dtype=weight_dtype)
680
+
681
+ # Sample noise that will be added to the latents
682
+ noise = torch.randn_like(model_input)
683
+ batch_size, num_frames, num_channels, height, width = model_input.shape
684
+
685
+ # Sample a random timestep for each image
686
+ timesteps = torch.randint(
687
+ 0,
688
+ scheduler.config.num_train_timesteps,
689
+ (batch_size,),
690
+ dtype=torch.int64,
691
+ device=model_input.device,
692
+ )
693
+
694
+ # Prepare rotary embeds
695
+ image_rotary_emb = (
696
+ prepare_rotary_positional_embeddings(
697
+ height=height * VAE_SCALE_FACTOR_SPATIAL,
698
+ width=width * VAE_SCALE_FACTOR_SPATIAL,
699
+ num_frames=num_frames,
700
+ vae_scale_factor_spatial=VAE_SCALE_FACTOR_SPATIAL,
701
+ patch_size=model_config.patch_size,
702
+ patch_size_t=model_config.patch_size_t if hasattr(model_config, "patch_size_t") else None,
703
+ attention_head_dim=model_config.attention_head_dim,
704
+ device=accelerator.device,
705
+ base_height=RoPE_BASE_HEIGHT,
706
+ base_width=RoPE_BASE_WIDTH,
707
+ )
708
+ if model_config.use_rotary_positional_embeddings
709
+ else None
710
+ )
711
+
712
+ # Add noise to the model input according to the noise magnitude at each timestep
713
+ # (this is the forward diffusion process)
714
+ noisy_model_input = scheduler.add_noise(model_input, noise, timesteps)
715
+
716
+ # Predict the noise residual
717
+ model_output = transformer(
718
+ hidden_states=noisy_model_input,
719
+ encoder_hidden_states=prompt_embeds,
720
+ timestep=timesteps,
721
+ image_rotary_emb=image_rotary_emb,
722
+ return_dict=False,
723
+ )[0]
724
+
725
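+ # get_velocity() effectively converts the v-prediction output into a prediction of the clean
+ # latents, which is compared against `model_input` below.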
+ model_pred = scheduler.get_velocity(model_output, noisy_model_input, timesteps)
726
+
727
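+ # Per-timestep weighting 1 / (1 - alpha_bar_t) gives low-noise timesteps a larger weight.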
+ weights = 1 / (1 - alphas_cumprod[timesteps])
728
+ while len(weights.shape) < len(model_pred.shape):
729
+ weights = weights.unsqueeze(-1)
730
+
731
+ target = model_input
732
+
733
+ loss = torch.mean(
734
+ (weights * (model_pred - target) ** 2).reshape(batch_size, -1),
735
+ dim=1,
736
+ )
737
+ loss = loss.mean()
738
+ accelerator.backward(loss)
739
+
740
+ if accelerator.sync_gradients and accelerator.distributed_type != DistributedType.DEEPSPEED:
741
+ gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
742
+ accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
743
+ gradient_norm_after_clip = get_gradient_norm(transformer.parameters())
744
+ logs.update(
745
+ {
746
+ "gradient_norm_before_clip": gradient_norm_before_clip,
747
+ "gradient_norm_after_clip": gradient_norm_after_clip,
748
+ }
749
+ )
750
+
751
+ if accelerator.state.deepspeed_plugin is None:
752
+ optimizer.step()
753
+ optimizer.zero_grad()
754
+
755
+ if not args.use_cpu_offload_optimizer:
756
+ lr_scheduler.step()
757
+
758
+ # Checks if the accelerator has performed an optimization step behind the scenes
759
+ if accelerator.sync_gradients:
760
+ progress_bar.update(1)
761
+ global_step += 1
762
+
763
+ if accelerator.distributed_type == DistributedType.DEEPSPEED or accelerator.is_main_process:
764
+ if global_step % args.checkpointing_steps == 0:
765
+ # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
766
+ if args.checkpoints_total_limit is not None:
767
+ checkpoints = os.listdir(args.output_dir)
768
+ checkpoints = [d for d in checkpoints if d.startswith("checkpoint")]
769
+ checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))
770
+
771
+ # before we save the new checkpoint, we need to have at _most_ `checkpoints_total_limit - 1` checkpoints
772
+ if len(checkpoints) >= args.checkpoints_total_limit:
773
+ num_to_remove = len(checkpoints) - args.checkpoints_total_limit + 1
774
+ removing_checkpoints = checkpoints[0:num_to_remove]
775
+
776
+ logger.info(
777
+ f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
778
+ )
779
+ logger.info(f"Removing checkpoints: {', '.join(removing_checkpoints)}")
780
+
781
+ for removing_checkpoint in removing_checkpoints:
782
+ removing_checkpoint = os.path.join(args.output_dir, removing_checkpoint)
783
+ shutil.rmtree(removing_checkpoint)
784
+
785
+ save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
786
+ accelerator.save_state(save_path)
787
+ logger.info(f"Saved state to {save_path}")
788
+
789
+ last_lr = lr_scheduler.get_last_lr()[0] if lr_scheduler is not None else args.learning_rate
790
+ logs.update(
791
+ {
792
+ "loss": loss.detach().item(),
793
+ "lr": last_lr,
794
+ }
795
+ )
796
+ progress_bar.set_postfix(**logs)
797
+ accelerator.log(logs, step=global_step)
798
+
799
+ if global_step >= args.max_train_steps:
800
+ break
801
+
802
+ if accelerator.is_main_process:
803
+ if args.validation_prompt is not None and (epoch + 1) % args.validation_epochs == 0:
804
+ accelerator.print("===== Memory before validation =====")
805
+ print_memory(accelerator.device)
806
+ torch.cuda.synchronize(accelerator.device)
807
+
808
+ pipe = CogVideoXPipeline.from_pretrained(
809
+ args.pretrained_model_name_or_path,
810
+ transformer=unwrap_model(accelerator, transformer),
811
+ scheduler=scheduler,
812
+ revision=args.revision,
813
+ variant=args.variant,
814
+ torch_dtype=weight_dtype,
815
+ )
816
+
817
+ if args.enable_slicing:
818
+ pipe.vae.enable_slicing()
819
+ if args.enable_tiling:
820
+ pipe.vae.enable_tiling()
821
+ if args.enable_model_cpu_offload:
822
+ pipe.enable_model_cpu_offload()
823
+
824
+ validation_prompts = args.validation_prompt.split(args.validation_prompt_separator)
825
+ for validation_prompt in validation_prompts:
826
+ pipeline_args = {
827
+ "prompt": validation_prompt,
828
+ "guidance_scale": args.guidance_scale,
829
+ "use_dynamic_cfg": args.use_dynamic_cfg,
830
+ "height": args.height,
831
+ "width": args.width,
832
+ "max_sequence_length": model_config.max_text_seq_length,
833
+ }
834
+
835
+ log_validation(
836
+ pipe=pipe,
837
+ args=args,
838
+ accelerator=accelerator,
839
+ pipeline_args=pipeline_args,
840
+ epoch=epoch,
841
+ )
842
+
843
+ accelerator.print("===== Memory after validation =====")
844
+ print_memory(accelerator.device)
845
+ reset_memory(accelerator.device)
846
+
847
+ del pipe
848
+ gc.collect()
849
+ torch.cuda.empty_cache()
850
+ torch.cuda.synchronize(accelerator.device)
851
+
852
+ accelerator.wait_for_everyone()
853
+
854
+ if accelerator.is_main_process:
855
+ transformer = unwrap_model(accelerator, transformer)
856
+ dtype = (
857
+ torch.float16
858
+ if args.mixed_precision == "fp16"
859
+ else torch.bfloat16
860
+ if args.mixed_precision == "bf16"
861
+ else torch.float32
862
+ )
863
+ transformer = transformer.to(dtype)
864
+ transformer_lora_layers = get_peft_model_state_dict(transformer)
865
+
866
+ CogVideoXPipeline.save_lora_weights(
867
+ save_directory=args.output_dir,
868
+ transformer_lora_layers=transformer_lora_layers,
869
+ )
870
+
871
+ # Cleanup trained models to save memory
872
+ if args.load_tensors:
873
+ del transformer
874
+ else:
875
+ del transformer, text_encoder, vae
876
+
877
+ gc.collect()
878
+ torch.cuda.empty_cache()
879
+ torch.cuda.synchronize(accelerator.device)
880
+
881
+ accelerator.print("===== Memory before testing =====")
882
+ print_memory(accelerator.device)
883
+ reset_memory(accelerator.device)
884
+
885
+ # Final test inference
886
+ pipe = CogVideoXPipeline.from_pretrained(
887
+ args.pretrained_model_name_or_path,
888
+ revision=args.revision,
889
+ variant=args.variant,
890
+ torch_dtype=weight_dtype,
891
+ )
892
+ pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
893
+
894
+ if args.enable_slicing:
895
+ pipe.vae.enable_slicing()
896
+ if args.enable_tiling:
897
+ pipe.vae.enable_tiling()
898
+ if args.enable_model_cpu_offload:
899
+ pipe.enable_model_cpu_offload()
900
+
901
+ # Load LoRA weights
902
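+ # Mirror the alpha / rank scaling used during training when activating the adapter.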
+ lora_scaling = args.lora_alpha / args.rank
903
+ pipe.load_lora_weights(args.output_dir, adapter_name="cogvideox-lora")
904
+ pipe.set_adapters(["cogvideox-lora"], [lora_scaling])
905
+
906
+ # Run inference
907
+ validation_outputs = []
908
+ if args.validation_prompt and args.num_validation_videos > 0:
909
+ validation_prompts = args.validation_prompt.split(args.validation_prompt_separator)
910
+ for validation_prompt in validation_prompts:
911
+ pipeline_args = {
912
+ "prompt": validation_prompt,
913
+ "guidance_scale": args.guidance_scale,
914
+ "use_dynamic_cfg": args.use_dynamic_cfg,
915
+ "height": args.height,
916
+ "width": args.width,
917
+ }
918
+
919
+ video = log_validation(
920
+ accelerator=accelerator,
921
+ pipe=pipe,
922
+ args=args,
923
+ pipeline_args=pipeline_args,
924
+ epoch=epoch,
925
+ is_final_validation=True,
926
+ )
927
+ validation_outputs.extend(video)
928
+
929
+ accelerator.print("===== Memory after testing =====")
930
+ print_memory(accelerator.device)
931
+ reset_memory(accelerator.device)
932
+ torch.cuda.synchronize(accelerator.device)
933
+
934
+ if args.push_to_hub:
935
+ save_model_card(
936
+ repo_id,
937
+ videos=validation_outputs,
938
+ base_model=args.pretrained_model_name_or_path,
939
+ validation_prompt=args.validation_prompt,
940
+ repo_folder=args.output_dir,
941
+ fps=args.fps,
942
+ )
943
+ upload_folder(
944
+ repo_id=repo_id,
945
+ folder_path=args.output_dir,
946
+ commit_message="End of training",
947
+ ignore_patterns=["step_*", "epoch_*"],
948
+ )
949
+
950
+ accelerator.end_training()
951
+
952
+
953
+ if __name__ == "__main__":
954
+ args = get_args()
955
+ main(args)
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/cogvideox_text_to_video_sft.py ADDED
@@ -0,0 +1,917 @@
1
+ # Copyright 2024 The HuggingFace Team.
2
+ # All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import gc
17
+ import logging
18
+ import math
19
+ import os
20
+ import shutil
21
+ from datetime import timedelta
22
+ from pathlib import Path
23
+ from typing import Any, Dict
24
+
25
+ import diffusers
26
+ import torch
27
+ import transformers
28
+ import wandb
29
+ from accelerate import Accelerator, DistributedType, init_empty_weights
30
+ from accelerate.logging import get_logger
31
+ from accelerate.utils import (
32
+ DistributedDataParallelKwargs,
33
+ InitProcessGroupKwargs,
34
+ ProjectConfiguration,
35
+ set_seed,
36
+ )
37
+ from diffusers import (
38
+ AutoencoderKLCogVideoX,
39
+ CogVideoXDPMScheduler,
40
+ CogVideoXPipeline,
41
+ CogVideoXTransformer3DModel,
42
+ )
43
+ from diffusers.models.autoencoders.vae import DiagonalGaussianDistribution
44
+ from diffusers.optimization import get_scheduler
45
+ from diffusers.training_utils import cast_training_params
46
+ from diffusers.utils import export_to_video
47
+ from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
48
+ from huggingface_hub import create_repo, upload_folder
49
+ from torch.utils.data import DataLoader
50
+ from tqdm.auto import tqdm
51
+ from transformers import AutoTokenizer, T5EncoderModel
52
+
53
+
54
+ from args import get_args # isort:skip
55
+ from dataset import BucketSampler, VideoDatasetWithResizing, VideoDatasetWithResizeAndRectangleCrop # isort:skip
56
+ from text_encoder import compute_prompt_embeddings # isort:skip
57
+ from utils import (
58
+ get_gradient_norm,
59
+ get_optimizer,
60
+ prepare_rotary_positional_embeddings,
61
+ print_memory,
62
+ reset_memory,
63
+ unwrap_model,
64
+ ) # isort:skip
65
+
66
+
67
+ logger = get_logger(__name__)
68
+
69
+
70
+ def save_model_card(
71
+ repo_id: str,
72
+ videos=None,
73
+ base_model: str = None,
74
+ validation_prompt=None,
75
+ repo_folder=None,
76
+ fps=8,
77
+ ):
78
+ widget_dict = []
79
+ if videos is not None:
80
+ for i, video in enumerate(videos):
81
+ export_to_video(video, os.path.join(repo_folder, f"final_video_{i}.mp4"), fps=fps)
82
+ widget_dict.append(
83
+ {
84
+ "text": validation_prompt if validation_prompt else " ",
85
+ "output": {"url": f"final_video_{i}.mp4"},
86
+ }
87
+ )
88
+
89
+ model_description = f"""
90
+ # CogVideoX Full Finetune
91
+
92
+ <Gallery />
93
+
94
+ ## Model description
95
+
96
+ This is a full finetune of the CogVideoX model `{base_model}`.
97
+
98
+ The model was trained using [CogVideoX Factory](https://github.com/a-r-r-o-w/cogvideox-factory) - a repository containing memory-optimized training scripts for the CogVideoX family of models using [TorchAO](https://github.com/pytorch/ao) and [DeepSpeed](https://github.com/microsoft/DeepSpeed). The scripts were adapted from [CogVideoX Diffusers trainer](https://github.com/huggingface/diffusers/blob/main/examples/cogvideo/train_cogvideox_lora.py).
99
+
100
+ ## Download model
101
+
102
+ [Download]({repo_id}/tree/main) the model weights in the Files & Versions tab.
103
+
104
+ ## Usage
105
+
106
+ Requires the [🧨 Diffusers library](https://github.com/huggingface/diffusers) to be installed.
107
+
108
+ ```py
109
+ import torch
110
+ from diffusers import CogVideoXPipeline
111
+ from diffusers.utils import export_to_video
112
+
113
+ pipe = CogVideoXPipeline.from_pretrained("{repo_id}", torch_dtype=torch.bfloat16).to("cuda")
114
+
115
+ video = pipe("{validation_prompt}", guidance_scale=6, use_dynamic_cfg=True).frames[0]
116
+ export_to_video(video, "output.mp4", fps=8)
117
+ ```
118
+
119
+ For more details, check out the [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox) for CogVideoX.
120
+
121
+ ## License
122
+
123
+ Please adhere to the licensing terms as described [here](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE) and [here](https://huggingface.co/THUDM/CogVideoX-2b/blob/main/LICENSE).
124
+ """
125
+ model_card = load_or_create_model_card(
126
+ repo_id_or_path=repo_id,
127
+ from_training=True,
128
+ license="other",
129
+ base_model=base_model,
130
+ prompt=validation_prompt,
131
+ model_description=model_description,
132
+ widget=widget_dict,
133
+ )
134
+ tags = [
135
+ "text-to-video",
136
+ "diffusers-training",
137
+ "diffusers",
138
+ "cogvideox",
139
+ "cogvideox-diffusers",
140
+ ]
141
+
142
+ model_card = populate_model_card(model_card, tags=tags)
143
+ model_card.save(os.path.join(repo_folder, "README.md"))
144
+
145
+
146
+ def log_validation(
147
+ accelerator: Accelerator,
148
+ pipe: CogVideoXPipeline,
149
+ args: Dict[str, Any],
150
+ pipeline_args: Dict[str, Any],
151
+ epoch,
152
+ is_final_validation: bool = False,
153
+ ):
154
+ logger.info(
155
+ f"Running validation... \n Generating {args.num_validation_videos} videos with prompt: {pipeline_args['prompt']}."
156
+ )
157
+
158
+ pipe = pipe.to(accelerator.device)
159
+
160
+ # run inference
161
+ generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) if args.seed else None
162
+
163
+ videos = []
164
+ for _ in range(args.num_validation_videos):
165
+ video = pipe(**pipeline_args, generator=generator, output_type="np").frames[0]
166
+ videos.append(video)
167
+
168
+ for tracker in accelerator.trackers:
169
+ phase_name = "test" if is_final_validation else "validation"
170
+ if tracker.name == "wandb":
171
+ video_filenames = []
172
+ for i, video in enumerate(videos):
173
+ prompt = (
174
+ pipeline_args["prompt"][:25]
175
+ .replace(" ", "_")
177
+ .replace("'", "_")
178
+ .replace('"', "_")
179
+ .replace("/", "_")
180
+ )
181
+ filename = os.path.join(args.output_dir, f"{phase_name}_video_{i}_{prompt}.mp4")
182
+ export_to_video(video, filename, fps=8)
183
+ video_filenames.append(filename)
184
+
185
+ tracker.log(
186
+ {
187
+ phase_name: [
188
+ wandb.Video(filename, caption=f"{i}: {pipeline_args['prompt']}")
189
+ for i, filename in enumerate(video_filenames)
190
+ ]
191
+ }
192
+ )
193
+
194
+ return videos
195
+
196
+
197
+ class CollateFunction:
198
+ def __init__(self, weight_dtype: torch.dtype, load_tensors: bool) -> None:
199
+ self.weight_dtype = weight_dtype
200
+ self.load_tensors = load_tensors
201
+
202
+ def __call__(self, data: List[List[Dict[str, Any]]]) -> Dict[str, torch.Tensor]:
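+ # `data` arrives as a one-element list: the BucketSampler yields a whole bucket of samples
+ # at once, which the DataLoader (created with batch_size=1) wraps in a list, so `data[0]` is
+ # the actual batch. When `load_tensors` is set, `prompt` holds precomputed text embeddings
+ # (tensors) rather than raw strings, which is why it can be stacked like the videos.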
203
+ prompts = [x["prompt"] for x in data[0]]
204
+
205
+ if self.load_tensors:
206
+ prompts = torch.stack(prompts).to(dtype=self.weight_dtype, non_blocking=True)
207
+
208
+ videos = [x["video"] for x in data[0]]
209
+ videos = torch.stack(videos).to(dtype=self.weight_dtype, non_blocking=True)
210
+
211
+ return {
212
+ "videos": videos,
213
+ "prompts": prompts,
214
+ }
215
+
216
+
217
+ def main(args):
218
+ if args.report_to == "wandb" and args.hub_token is not None:
219
+ raise ValueError(
220
+ "You cannot use both --report_to=wandb and --hub_token due to a security risk of exposing your token."
221
+ " Please use `huggingface-cli login` to authenticate with the Hub."
222
+ )
223
+
224
+ if torch.backends.mps.is_available() and args.mixed_precision == "bf16":
225
+ # due to pytorch#99272, MPS does not yet support bfloat16.
226
+ raise ValueError(
227
+ "Mixed precision training with bfloat16 is not supported on MPS. Please use fp16 (recommended) or fp32 instead."
228
+ )
229
+
230
+ logging_dir = Path(args.output_dir, args.logging_dir)
231
+
232
+ accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
233
+ ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
234
+ init_process_group_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=args.nccl_timeout))
235
+ accelerator = Accelerator(
236
+ gradient_accumulation_steps=args.gradient_accumulation_steps,
237
+ mixed_precision=args.mixed_precision,
238
+ log_with=args.report_to,
239
+ project_config=accelerator_project_config,
240
+ kwargs_handlers=[ddp_kwargs, init_process_group_kwargs],
241
+ )
242
+
243
+ # Disable AMP for MPS.
244
+ if torch.backends.mps.is_available():
245
+ accelerator.native_amp = False
246
+
247
+ # Make one log on every process with the configuration for debugging.
248
+ logging.basicConfig(
249
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
250
+ datefmt="%m/%d/%Y %H:%M:%S",
251
+ level=logging.INFO,
252
+ )
253
+ logger.info(accelerator.state, main_process_only=False)
254
+ if accelerator.is_local_main_process:
255
+ transformers.utils.logging.set_verbosity_warning()
256
+ diffusers.utils.logging.set_verbosity_info()
257
+ else:
258
+ transformers.utils.logging.set_verbosity_error()
259
+ diffusers.utils.logging.set_verbosity_error()
260
+
261
+ # If passed along, set the training seed now.
262
+ if args.seed is not None:
263
+ set_seed(args.seed)
264
+
265
+ # Handle the repository creation
266
+ if accelerator.is_main_process:
267
+ if args.output_dir is not None:
268
+ os.makedirs(args.output_dir, exist_ok=True)
269
+
270
+ if args.push_to_hub:
271
+ repo_id = create_repo(
272
+ repo_id=args.hub_model_id or Path(args.output_dir).name,
273
+ exist_ok=True,
274
+ ).repo_id
275
+
276
+ # Prepare models and scheduler
277
+ tokenizer = AutoTokenizer.from_pretrained(
278
+ args.pretrained_model_name_or_path,
279
+ subfolder="tokenizer",
280
+ revision=args.revision,
281
+ )
282
+
283
+ text_encoder = T5EncoderModel.from_pretrained(
284
+ args.pretrained_model_name_or_path,
285
+ subfolder="text_encoder",
286
+ revision=args.revision,
287
+ )
288
+
289
+ # CogVideoX-2b weights are stored in float16
290
+ # CogVideoX-5b and CogVideoX-5b-I2V weights are stored in bfloat16
291
+ load_dtype = torch.bfloat16 if "5b" in args.pretrained_model_name_or_path.lower() else torch.float16
292
+ transformer = CogVideoXTransformer3DModel.from_pretrained(
293
+ args.pretrained_model_name_or_path,
294
+ subfolder="transformer",
295
+ torch_dtype=load_dtype,
296
+ revision=args.revision,
297
+ variant=args.variant,
298
+ )
299
+
300
+ vae = AutoencoderKLCogVideoX.from_pretrained(
301
+ args.pretrained_model_name_or_path,
302
+ subfolder="vae",
303
+ revision=args.revision,
304
+ variant=args.variant,
305
+ )
306
+
307
+ scheduler = CogVideoXDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
308
+
309
+ if args.enable_slicing:
310
+ vae.enable_slicing()
311
+ if args.enable_tiling:
312
+ vae.enable_tiling()
313
+
314
+ text_encoder.requires_grad_(False)
315
+ vae.requires_grad_(False)
316
+ transformer.requires_grad_(True)
317
+
318
+ VAE_SCALING_FACTOR = vae.config.scaling_factor
319
+ VAE_SCALE_FACTOR_SPATIAL = 2 ** (len(vae.config.block_out_channels) - 1)
320
+ RoPE_BASE_HEIGHT = transformer.config.sample_height * VAE_SCALE_FACTOR_SPATIAL
321
+ RoPE_BASE_WIDTH = transformer.config.sample_width * VAE_SCALE_FACTOR_SPATIAL
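+ # Note: for the released CogVideoX checkpoints, the VAE has four `block_out_channels` entries,
+ # so the spatial scale factor evaluates to 2 ** 3 = 8. The RoPE base height/width map the
+ # transformer's native latent sample size back to pixel space.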
322
+
323
+ # For mixed precision training we cast all non-trainable weights (vae, text_encoder and transformer) to half-precision
324
+ # as these weights are only used for inference, keeping weights in full precision is not required.
325
+ weight_dtype = torch.float32
326
+ if accelerator.state.deepspeed_plugin:
327
+ # DeepSpeed is handling precision, use what's in the DeepSpeed config
328
+ if (
329
+ "fp16" in accelerator.state.deepspeed_plugin.deepspeed_config
330
+ and accelerator.state.deepspeed_plugin.deepspeed_config["fp16"]["enabled"]
331
+ ):
332
+ weight_dtype = torch.float16
333
+ if (
334
+ "bf16" in accelerator.state.deepspeed_plugin.deepspeed_config
335
+ and accelerator.state.deepspeed_plugin.deepspeed_config["bf16"]["enabled"]
336
+ ):
337
+ weight_dtype = torch.bfloat16
338
+ else:
339
+ if accelerator.mixed_precision == "fp16":
340
+ weight_dtype = torch.float16
341
+ elif accelerator.mixed_precision == "bf16":
342
+ weight_dtype = torch.bfloat16
343
+
344
+ if torch.backends.mps.is_available() and weight_dtype == torch.bfloat16:
345
+ # due to pytorch#99272, MPS does not yet support bfloat16.
346
+ raise ValueError(
347
+ "Mixed precision training with bfloat16 is not supported on MPS. Please use fp16 (recommended) or fp32 instead."
348
+ )
349
+
350
+ text_encoder.to(accelerator.device, dtype=weight_dtype)
351
+ transformer.to(accelerator.device, dtype=weight_dtype)
352
+ vae.to(accelerator.device, dtype=weight_dtype)
353
+
354
+ if args.gradient_checkpointing:
355
+ transformer.enable_gradient_checkpointing()
356
+
357
+ # create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
358
+ def save_model_hook(models, weights, output_dir):
359
+ if accelerator.is_main_process:
360
+ for model in models:
361
+ if isinstance(unwrap_model(accelerator, model), type(unwrap_model(accelerator, transformer))):
362
+ model: CogVideoXTransformer3DModel
363
+ model = unwrap_model(accelerator, model)
364
+ model.save_pretrained(
365
+ os.path.join(output_dir, "transformer"), safe_serialization=True, max_shard_size="5GB"
366
+ )
367
+ else:
368
+ raise ValueError(f"Unexpected save model: {model.__class__}")
369
+
370
+ # make sure to pop weight so that corresponding model is not saved again
371
+ if weights:
372
+ weights.pop()
373
+
374
+ def load_model_hook(models, input_dir):
375
+ transformer_ = None
376
+ init_under_meta = False
377
+
378
+ # This is a bit of a hack but I don't know any other solution.
379
+ if not accelerator.distributed_type == DistributedType.DEEPSPEED:
380
+ while len(models) > 0:
381
+ model = models.pop()
382
+
383
+ if isinstance(unwrap_model(accelerator, model), type(unwrap_model(accelerator, transformer))):
384
+ transformer_ = unwrap_model(accelerator, model)
385
+ else:
386
+ raise ValueError(f"Unexpected save model: {unwrap_model(accelerator, model).__class__}")
387
+ else:
388
+ with init_empty_weights():
389
+ transformer_ = CogVideoXTransformer3DModel.from_config(
390
+ args.pretrained_model_name_or_path, subfolder="transformer"
391
+ )
392
+ init_under_meta = True
393
+
394
+ load_model = CogVideoXTransformer3DModel.from_pretrained(os.path.join(input_dir, "transformer"))
395
+ transformer_.register_to_config(**load_model.config)
396
+ transformer_.load_state_dict(load_model.state_dict(), assign=init_under_meta)
397
+ del load_model
398
+
399
+ # Make sure the trainable params are in float32. This is again needed since the base models
400
+ # are in `weight_dtype`. More details:
401
+ # https://github.com/huggingface/diffusers/pull/6514#discussion_r1449796804
402
+ if args.mixed_precision == "fp16":
403
+ cast_training_params([transformer_])
404
+
405
+ accelerator.register_save_state_pre_hook(save_model_hook)
406
+ accelerator.register_load_state_pre_hook(load_model_hook)
407
+
408
+ # Enable TF32 for faster training on Ampere GPUs,
409
+ # cf https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
410
+ if args.allow_tf32 and torch.cuda.is_available():
411
+ torch.backends.cuda.matmul.allow_tf32 = True
412
+
413
+ if args.scale_lr:
414
+ args.learning_rate = (
415
+ args.learning_rate * args.gradient_accumulation_steps * args.train_batch_size * accelerator.num_processes
416
+ )
417
+
418
+ # Make sure the trainable params are in float32.
419
+ if args.mixed_precision == "fp16":
420
+ # only upcast trainable parameters (LoRA) into fp32
421
+ cast_training_params([transformer], dtype=torch.float32)
422
+
423
+ transformer_parameters = list(filter(lambda p: p.requires_grad, transformer.parameters()))
424
+
425
+ # Optimization parameters
426
+ transformer_parameters_with_lr = {
427
+ "params": transformer_parameters,
428
+ "lr": args.learning_rate,
429
+ }
430
+ params_to_optimize = [transformer_parameters_with_lr]
431
+ num_trainable_parameters = sum(param.numel() for model in params_to_optimize for param in model["params"])
432
+
433
+ use_deepspeed_optimizer = (
434
+ accelerator.state.deepspeed_plugin is not None
435
+ and "optimizer" in accelerator.state.deepspeed_plugin.deepspeed_config
436
+ )
437
+ use_deepspeed_scheduler = (
438
+ accelerator.state.deepspeed_plugin is not None
439
+ and "scheduler" in accelerator.state.deepspeed_plugin.deepspeed_config
440
+ )
441
+
442
+ optimizer = get_optimizer(
443
+ params_to_optimize=params_to_optimize,
444
+ optimizer_name=args.optimizer,
445
+ learning_rate=args.learning_rate,
446
+ beta1=args.beta1,
447
+ beta2=args.beta2,
448
+ beta3=args.beta3,
449
+ epsilon=args.epsilon,
450
+ weight_decay=args.weight_decay,
451
+ prodigy_decouple=args.prodigy_decouple,
452
+ prodigy_use_bias_correction=args.prodigy_use_bias_correction,
453
+ prodigy_safeguard_warmup=args.prodigy_safeguard_warmup,
454
+ use_8bit=args.use_8bit,
455
+ use_4bit=args.use_4bit,
456
+ use_torchao=args.use_torchao,
457
+ use_deepspeed=use_deepspeed_optimizer,
458
+ use_cpu_offload_optimizer=args.use_cpu_offload_optimizer,
459
+ offload_gradients=args.offload_gradients,
460
+ )
461
+
462
+ # Dataset and DataLoader
463
+ dataset_init_kwargs = {
464
+ "data_root": args.data_root,
465
+ "dataset_file": args.dataset_file,
466
+ "caption_column": args.caption_column,
467
+ "video_column": args.video_column,
468
+ "max_num_frames": args.max_num_frames,
469
+ "id_token": args.id_token,
470
+ "height_buckets": args.height_buckets,
471
+ "width_buckets": args.width_buckets,
472
+ "frame_buckets": args.frame_buckets,
473
+ "load_tensors": args.load_tensors,
474
+ "random_flip": args.random_flip,
475
+ }
476
+ if args.video_reshape_mode is None:
477
+ train_dataset = VideoDatasetWithResizing(**dataset_init_kwargs)
478
+ else:
479
+ train_dataset = VideoDatasetWithResizeAndRectangleCrop(
480
+ video_reshape_mode=args.video_reshape_mode, **dataset_init_kwargs
481
+ )
482
+
483
+ collate_fn = CollateFunction(weight_dtype, args.load_tensors)
484
+
485
+ train_dataloader = DataLoader(
486
+ train_dataset,
487
+ batch_size=1,
488
+ sampler=BucketSampler(train_dataset, batch_size=args.train_batch_size, shuffle=True),
489
+ collate_fn=collate_fn,
490
+ num_workers=args.dataloader_num_workers,
491
+ pin_memory=args.pin_memory,
492
+ )
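+ # `batch_size=1` is deliberate here: each index yielded by the BucketSampler is already a
+ # complete bucket of `args.train_batch_size` samples with matching (frames, height, width).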
493
+
494
+ # Scheduler and math around the number of training steps.
495
+ overrode_max_train_steps = False
496
+ num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
497
+ if args.max_train_steps is None:
498
+ args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
499
+ overrode_max_train_steps = True
500
+
501
+ if args.use_cpu_offload_optimizer:
502
+ lr_scheduler = None
503
+ accelerator.print(
504
+ "CPU Offload Optimizer cannot be used with DeepSpeed or builtin PyTorch LR Schedulers. If "
505
+ "you are training with those settings, they will be ignored."
506
+ )
507
+ else:
508
+ if use_deepspeed_scheduler:
509
+ from accelerate.utils import DummyScheduler
510
+
511
+ lr_scheduler = DummyScheduler(
512
+ name=args.lr_scheduler,
513
+ optimizer=optimizer,
514
+ total_num_steps=args.max_train_steps * accelerator.num_processes,
515
+ num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
516
+ )
517
+ else:
518
+ lr_scheduler = get_scheduler(
519
+ args.lr_scheduler,
520
+ optimizer=optimizer,
521
+ num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
522
+ num_training_steps=args.max_train_steps * accelerator.num_processes,
523
+ num_cycles=args.lr_num_cycles,
524
+ power=args.lr_power,
525
+ )
526
+
527
+ # Prepare everything with our `accelerator`.
528
+ transformer, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
529
+ transformer, optimizer, train_dataloader, lr_scheduler
530
+ )
531
+
532
+ # We need to recalculate our total training steps as the size of the training dataloader may have changed.
533
+ num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
534
+ if overrode_max_train_steps:
535
+ args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
536
+ # Afterwards we recalculate our number of training epochs
537
+ args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
538
+
539
+ # We need to initialize the trackers we use, and also store our configuration.
540
+ # The trackers initialize automatically on the main process.
541
+ if accelerator.distributed_type == DistributedType.DEEPSPEED or accelerator.is_main_process:
542
+ tracker_name = args.tracker_name or "cogvideox-sft"
543
+ accelerator.init_trackers(tracker_name, config=vars(args))
544
+
545
+ accelerator.print("===== Memory before training =====")
546
+ reset_memory(accelerator.device)
547
+ print_memory(accelerator.device)
548
+
549
+ # Train!
550
+ total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
551
+
552
+ accelerator.print("***** Running training *****")
553
+ accelerator.print(f" Num trainable parameters = {num_trainable_parameters}")
554
+ accelerator.print(f" Num examples = {len(train_dataset)}")
555
+ accelerator.print(f" Num batches each epoch = {len(train_dataloader)}")
556
+ accelerator.print(f" Num epochs = {args.num_train_epochs}")
557
+ accelerator.print(f" Instantaneous batch size per device = {args.train_batch_size}")
558
+ accelerator.print(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
559
+ accelerator.print(f" Gradient accumulation steps = {args.gradient_accumulation_steps}")
560
+ accelerator.print(f" Total optimization steps = {args.max_train_steps}")
561
+ global_step = 0
562
+ first_epoch = 0
563
+
564
+ # Potentially load in the weights and states from a previous save
565
+ if not args.resume_from_checkpoint:
566
+ initial_global_step = 0
567
+ else:
568
+ if args.resume_from_checkpoint != "latest":
569
+ path = os.path.basename(args.resume_from_checkpoint)
570
+ else:
571
+ # Get the most recent checkpoint
572
+ dirs = os.listdir(args.output_dir)
573
+ dirs = [d for d in dirs if d.startswith("checkpoint")]
574
+ dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
575
+ path = dirs[-1] if len(dirs) > 0 else None
576
+
577
+ if path is None:
578
+ accelerator.print(
579
+ f"Checkpoint '{args.resume_from_checkpoint}' does not exist. Starting a new training run."
580
+ )
581
+ args.resume_from_checkpoint = None
582
+ initial_global_step = 0
583
+ else:
584
+ accelerator.print(f"Resuming from checkpoint {path}")
585
+ accelerator.load_state(os.path.join(args.output_dir, path))
586
+ global_step = int(path.split("-")[1])
587
+
588
+ initial_global_step = global_step
589
+ first_epoch = global_step // num_update_steps_per_epoch
590
+
591
+ progress_bar = tqdm(
592
+ range(0, args.max_train_steps),
593
+ initial=initial_global_step,
594
+ desc="Steps",
595
+ # Only show the progress bar once on each machine.
596
+ disable=not accelerator.is_local_main_process,
597
+ )
598
+
599
+ # For DeepSpeed training
600
+ model_config = transformer.module.config if hasattr(transformer, "module") else transformer.config
601
+
602
+ if args.load_tensors:
603
+ del vae, text_encoder
604
+ gc.collect()
605
+ torch.cuda.empty_cache()
606
+ torch.cuda.synchronize(accelerator.device)
607
+
608
+ alphas_cumprod = scheduler.alphas_cumprod.to(accelerator.device, dtype=torch.float32)
609
+
610
+ for epoch in range(first_epoch, args.num_train_epochs):
611
+ transformer.train()
612
+
613
+ for step, batch in enumerate(train_dataloader):
614
+ models_to_accumulate = [transformer]
615
+ logs = {}
616
+
617
+ with accelerator.accumulate(models_to_accumulate):
618
+ videos = batch["videos"].to(accelerator.device, non_blocking=True)
619
+ prompts = batch["prompts"]
620
+
621
+ # Encode videos
622
+ if not args.load_tensors:
623
+ videos = videos.permute(0, 2, 1, 3, 4) # [B, C, F, H, W]
624
+ latent_dist = vae.encode(videos).latent_dist
625
+ else:
626
+ latent_dist = DiagonalGaussianDistribution(videos)
627
+
628
+ videos = latent_dist.sample() * VAE_SCALING_FACTOR
629
+ videos = videos.permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
630
+ videos = videos.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
631
+ model_input = videos
632
+
633
+ # Encode prompts
634
+ if not args.load_tensors:
635
+ prompt_embeds = compute_prompt_embeddings(
636
+ tokenizer,
637
+ text_encoder,
638
+ prompts,
639
+ model_config.max_text_seq_length,
640
+ accelerator.device,
641
+ weight_dtype,
642
+ requires_grad=False,
643
+ )
644
+ else:
645
+ prompt_embeds = prompts.to(dtype=weight_dtype)
646
+
647
+ # Sample noise that will be added to the latents
648
+ noise = torch.randn_like(model_input)
649
+ batch_size, num_frames, num_channels, height, width = model_input.shape
650
+
651
+ # Sample a random timestep for each image
652
+ timesteps = torch.randint(
653
+ 0,
654
+ scheduler.config.num_train_timesteps,
655
+ (batch_size,),
656
+ dtype=torch.int64,
657
+ device=model_input.device,
658
+ )
659
+
660
+ # Prepare rotary embeds
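+ # Rotary embeddings are only built when the transformer config enables them; of the released
+ # checkpoints, CogVideoX-5b uses RoPE while CogVideoX-2b does not (image_rotary_emb stays None).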
661
+ image_rotary_emb = (
662
+ prepare_rotary_positional_embeddings(
663
+ height=height * VAE_SCALE_FACTOR_SPATIAL,
664
+ width=width * VAE_SCALE_FACTOR_SPATIAL,
665
+ num_frames=num_frames,
666
+ vae_scale_factor_spatial=VAE_SCALE_FACTOR_SPATIAL,
667
+ patch_size=model_config.patch_size,
668
+ patch_size_t=model_config.patch_size_t if hasattr(model_config, "patch_size_t") else None,
669
+ attention_head_dim=model_config.attention_head_dim,
670
+ device=accelerator.device,
671
+ base_height=RoPE_BASE_HEIGHT,
672
+ base_width=RoPE_BASE_WIDTH,
673
+ )
674
+ if model_config.use_rotary_positional_embeddings
675
+ else None
676
+ )
677
+
678
+ # Add noise to the model input according to the noise magnitude at each timestep
679
+ # (this is the forward diffusion process)
680
+ noisy_model_input = scheduler.add_noise(model_input, noise, timesteps)
681
+
682
+ # Predict the noise residual
683
+ model_output = transformer(
684
+ hidden_states=noisy_model_input,
685
+ encoder_hidden_states=prompt_embeds,
686
+ timestep=timesteps,
687
+ image_rotary_emb=image_rotary_emb,
688
+ return_dict=False,
689
+ )[0]
690
+
691
+ model_pred = scheduler.get_velocity(model_output, noisy_model_input, timesteps)
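+ # The transformer is trained with v-prediction: `get_velocity` combines the predicted velocity
+ # with the noisy latents to reconstruct an estimate of the clean latents, which is compared
+ # against `target = model_input` below. The 1 / (1 - alpha_cumprod) factor acts as a
+ # timestep-dependent (SNR-style) weighting of the loss.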
692
+
693
+ weights = 1 / (1 - alphas_cumprod[timesteps])
694
+ while len(weights.shape) < len(model_pred.shape):
695
+ weights = weights.unsqueeze(-1)
696
+
697
+ target = model_input
698
+
699
+ loss = torch.mean(
700
+ (weights * (model_pred - target) ** 2).reshape(batch_size, -1),
701
+ dim=1,
702
+ )
703
+ loss = loss.mean()
704
+ accelerator.backward(loss)
705
+
706
+ if accelerator.sync_gradients and accelerator.distributed_type != DistributedType.DEEPSPEED:
707
+ gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
708
+ accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
709
+ gradient_norm_after_clip = get_gradient_norm(transformer.parameters())
710
+ logs.update(
711
+ {
712
+ "gradient_norm_before_clip": gradient_norm_before_clip,
713
+ "gradient_norm_after_clip": gradient_norm_after_clip,
714
+ }
715
+ )
716
+
717
+ if accelerator.state.deepspeed_plugin is None:
718
+ optimizer.step()
719
+ optimizer.zero_grad()
720
+
721
+ if not args.use_cpu_offload_optimizer:
722
+ lr_scheduler.step()
723
+
724
+ # Checks if the accelerator has performed an optimization step behind the scenes
725
+ if accelerator.sync_gradients:
726
+ progress_bar.update(1)
727
+ global_step += 1
728
+
729
+ if accelerator.is_main_process or accelerator.distributed_type == DistributedType.DEEPSPEED:
730
+ if global_step % args.checkpointing_steps == 0:
731
+ # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
732
+ if args.checkpoints_total_limit is not None:
733
+ checkpoints = os.listdir(args.output_dir)
734
+ checkpoints = [d for d in checkpoints if d.startswith("checkpoint")]
735
+ checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))
736
+
737
+ # before we save the new checkpoint, we need to have at _most_ `checkpoints_total_limit - 1` checkpoints
738
+ if len(checkpoints) >= args.checkpoints_total_limit:
739
+ num_to_remove = len(checkpoints) - args.checkpoints_total_limit + 1
740
+ removing_checkpoints = checkpoints[0:num_to_remove]
741
+
742
+ logger.info(
743
+ f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
744
+ )
745
+ logger.info(f"Removing checkpoints: {', '.join(removing_checkpoints)}")
746
+
747
+ for removing_checkpoint in removing_checkpoints:
748
+ removing_checkpoint = os.path.join(args.output_dir, removing_checkpoint)
749
+ shutil.rmtree(removing_checkpoint)
750
+
751
+ save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
752
+ accelerator.save_state(save_path)
753
+ logger.info(f"Saved state to {save_path}")
754
+
755
+ last_lr = lr_scheduler.get_last_lr()[0] if lr_scheduler is not None else args.learning_rate
756
+ logs.update(
757
+ {
758
+ "loss": loss.detach().item(),
759
+ "lr": last_lr,
760
+ }
761
+ )
762
+ progress_bar.set_postfix(**logs)
763
+ accelerator.log(logs, step=global_step)
764
+
765
+ if global_step >= args.max_train_steps:
766
+ break
767
+
768
+ if accelerator.is_main_process:
769
+ if args.validation_prompt is not None and (epoch + 1) % args.validation_epochs == 0:
770
+ accelerator.print("===== Memory before validation =====")
771
+ print_memory(accelerator.device)
772
+ torch.cuda.synchronize(accelerator.device)
773
+
774
+ pipe = CogVideoXPipeline.from_pretrained(
775
+ args.pretrained_model_name_or_path,
776
+ transformer=unwrap_model(accelerator, transformer),
777
+ scheduler=scheduler,
778
+ revision=args.revision,
779
+ variant=args.variant,
780
+ torch_dtype=weight_dtype,
781
+ )
782
+
783
+ if args.enable_slicing:
784
+ pipe.vae.enable_slicing()
785
+ if args.enable_tiling:
786
+ pipe.vae.enable_tiling()
787
+ if args.enable_model_cpu_offload:
788
+ pipe.enable_model_cpu_offload()
789
+
790
+ validation_prompts = args.validation_prompt.split(args.validation_prompt_separator)
791
+ for validation_prompt in validation_prompts:
792
+ pipeline_args = {
793
+ "prompt": validation_prompt,
794
+ "guidance_scale": args.guidance_scale,
795
+ "use_dynamic_cfg": args.use_dynamic_cfg,
796
+ "height": args.height,
797
+ "width": args.width,
798
+ "max_sequence_length": model_config.max_text_seq_length,
799
+ }
800
+
801
+ log_validation(
802
+ accelerator=accelerator,
803
+ pipe=pipe,
804
+ args=args,
805
+ pipeline_args=pipeline_args,
806
+ epoch=epoch,
807
+ is_final_validation=False,
808
+ )
809
+
810
+ accelerator.print("===== Memory after validation =====")
811
+ print_memory(accelerator.device)
812
+ reset_memory(accelerator.device)
813
+
814
+ del pipe
815
+ gc.collect()
816
+ torch.cuda.empty_cache()
817
+ torch.cuda.synchronize(accelerator.device)
818
+
819
+ accelerator.wait_for_everyone()
820
+
821
+ if accelerator.is_main_process:
822
+ transformer = unwrap_model(accelerator, transformer)
823
+ dtype = (
824
+ torch.float16
825
+ if args.mixed_precision == "fp16"
826
+ else torch.bfloat16
827
+ if args.mixed_precision == "bf16"
828
+ else torch.float32
829
+ )
830
+ transformer = transformer.to(dtype)
831
+
832
+ transformer.save_pretrained(
833
+ os.path.join(args.output_dir, "transformer"),
834
+ safe_serialization=True,
835
+ max_shard_size="5GB",
836
+ )
837
+
838
+ # Cleanup trained models to save memory
839
+ if args.load_tensors:
840
+ del transformer
841
+ else:
842
+ del transformer, text_encoder, vae
843
+
844
+ gc.collect()
845
+ torch.cuda.empty_cache()
846
+ torch.cuda.synchronize(accelerator.device)
847
+
848
+ accelerator.print("===== Memory before testing =====")
849
+ print_memory(accelerator.device)
850
+ reset_memory(accelerator.device)
851
+
852
+ # Final test inference
853
+ pipe = CogVideoXPipeline.from_pretrained(
854
+ args.pretrained_model_name_or_path,
855
+ revision=args.revision,
856
+ variant=args.variant,
857
+ torch_dtype=weight_dtype,
858
+ )
859
+ pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
860
+
861
+ if args.enable_slicing:
862
+ pipe.vae.enable_slicing()
863
+ if args.enable_tiling:
864
+ pipe.vae.enable_tiling()
865
+ if args.enable_model_cpu_offload:
866
+ pipe.enable_model_cpu_offload()
867
+
868
+ # Run inference
869
+ validation_outputs = []
870
+ if args.validation_prompt and args.num_validation_videos > 0:
871
+ validation_prompts = args.validation_prompt.split(args.validation_prompt_separator)
872
+ for validation_prompt in validation_prompts:
873
+ pipeline_args = {
874
+ "prompt": validation_prompt,
875
+ "guidance_scale": args.guidance_scale,
876
+ "use_dynamic_cfg": args.use_dynamic_cfg,
877
+ "height": args.height,
878
+ "width": args.width,
879
+ }
880
+
881
+ video = log_validation(
882
+ accelerator=accelerator,
883
+ pipe=pipe,
884
+ args=args,
885
+ pipeline_args=pipeline_args,
886
+ epoch=epoch,
887
+ is_final_validation=True,
888
+ )
889
+ validation_outputs.extend(video)
890
+
891
+ accelerator.print("===== Memory after testing =====")
892
+ print_memory(accelerator.device)
893
+ reset_memory(accelerator.device)
894
+ torch.cuda.synchronize(accelerator.device)
895
+
896
+ if args.push_to_hub:
897
+ save_model_card(
898
+ repo_id,
899
+ videos=validation_outputs,
900
+ base_model=args.pretrained_model_name_or_path,
901
+ validation_prompt=args.validation_prompt,
902
+ repo_folder=args.output_dir,
903
+ fps=args.fps,
904
+ )
905
+ upload_folder(
906
+ repo_id=repo_id,
907
+ folder_path=args.output_dir,
908
+ commit_message="End of training",
909
+ ignore_patterns=["step_*", "epoch_*"],
910
+ )
911
+
912
+ accelerator.end_training()
913
+
914
+
915
+ if __name__ == "__main__":
916
+ args = get_args()
917
+ main(args)
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/dataset.py ADDED
@@ -0,0 +1,428 @@
1
+ import random
2
+ from pathlib import Path
3
+ from typing import Any, Dict, List, Optional, Tuple
4
+
5
+ import numpy as np
6
+ import pandas as pd
7
+ import torch
8
+ import torchvision.transforms as TT
9
+ from accelerate.logging import get_logger
10
+ from torch.utils.data import Dataset, Sampler
11
+ from torchvision import transforms
12
+ from torchvision.transforms import InterpolationMode
13
+ from torchvision.transforms.functional import resize
14
+
15
+
16
+ # Must be imported after torch because importing decord earlier can sometimes lead to a nasty segmentation fault or stack smashing error.
17
+ # Very few bug reports, but it happens. See the decord GitHub issues for more relevant information.
18
+ import decord # isort:skip
19
+
20
+ decord.bridge.set_bridge("torch")
21
+
22
+ logger = get_logger(__name__)
23
+
24
+ HEIGHT_BUCKETS = [256, 320, 384, 480, 512, 576, 720, 768, 960, 1024, 1280, 1536]
25
+ WIDTH_BUCKETS = [256, 320, 384, 480, 512, 576, 720, 768, 960, 1024, 1280, 1536]
26
+ FRAME_BUCKETS = [16, 24, 32, 48, 64, 80]
27
+
28
+
29
+ class VideoDataset(Dataset):
30
+ def __init__(
31
+ self,
32
+ data_root: str,
33
+ dataset_file: Optional[str] = None,
34
+ caption_column: str = "text",
35
+ video_column: str = "video",
36
+ max_num_frames: int = 49,
37
+ id_token: Optional[str] = None,
38
+ height_buckets: List[int] = None,
39
+ width_buckets: List[int] = None,
40
+ frame_buckets: List[int] = None,
41
+ load_tensors: bool = False,
42
+ random_flip: Optional[float] = None,
43
+ image_to_video: bool = False,
44
+ ) -> None:
45
+ super().__init__()
46
+
47
+ self.data_root = Path(data_root)
48
+ self.dataset_file = dataset_file
49
+ self.caption_column = caption_column
50
+ self.video_column = video_column
51
+ self.max_num_frames = max_num_frames
52
+ self.id_token = f"{id_token.strip()} " if id_token else ""
53
+ self.height_buckets = height_buckets or HEIGHT_BUCKETS
54
+ self.width_buckets = width_buckets or WIDTH_BUCKETS
55
+ self.frame_buckets = frame_buckets or FRAME_BUCKETS
56
+ self.load_tensors = load_tensors
57
+ self.random_flip = random_flip
58
+ self.image_to_video = image_to_video
59
+
60
+ self.resolutions = [
61
+ (f, h, w) for h in self.height_buckets for w in self.width_buckets for f in self.frame_buckets
62
+ ]
63
+
64
+ # Two methods of loading data are supported.
65
+ # - Using a CSV: caption_column and video_column must be some column in the CSV. One could
66
+ # make use of other columns too, such as a motion score or aesthetic score, by modifying the
67
+ # logic in CSV processing.
68
+ # - Using two files containing line-separated captions and relative paths to videos.
69
+ # For a more detailed explanation about preparing dataset format, checkout the README.
70
+ if dataset_file is None:
71
+ (
72
+ self.prompts,
73
+ self.video_paths,
74
+ ) = self._load_dataset_from_local_path()
75
+ else:
76
+ (
77
+ self.prompts,
78
+ self.video_paths,
79
+ ) = self._load_dataset_from_csv()
80
+
81
+ if len(self.video_paths) != len(self.prompts):
82
+ raise ValueError(
83
+ f"Expected length of prompts and videos to be the same but found {len(self.prompts)=} and {len(self.video_paths)=}. Please ensure that the number of caption prompts and videos match in your dataset."
84
+ )
85
+
86
+ self.video_transforms = transforms.Compose(
87
+ [
88
+ transforms.RandomHorizontalFlip(random_flip)
89
+ if random_flip
90
+ else transforms.Lambda(self.identity_transform),
91
+ transforms.Lambda(self.scale_transform),
92
+ transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True),
93
+ ]
94
+ )
95
+
96
+ @staticmethod
97
+ def identity_transform(x):
98
+ return x
99
+
100
+ @staticmethod
101
+ def scale_transform(x):
102
+ return x / 255.0
103
+
104
+ def __len__(self) -> int:
105
+ return len(self.video_paths)
106
+
107
+ def __getitem__(self, index: int) -> Dict[str, Any]:
108
+ if isinstance(index, list):
109
+ # Here, index is actually a list of data objects that we need to return.
110
+ # The BucketSampler should ideally return indices. But, in the sampler, we'd like
111
+ # to have information about num_frames, height and width. Since this is not stored
112
+ # as metadata, we need to read the video to get this information. You could read this
113
+ # information without loading the full video in memory, but we do it anyway. In order
114
+ # to not load the video twice (once to get the metadata, and once to return the loaded video
115
+ # based on sampled indices), we cache it in the BucketSampler. When the sampler is
116
+ # to yield, we yield the cache data instead of indices. So, this special check ensures
117
+ # that data is not loaded a second time. PRs are welcome for improvements.
118
+ return index
119
+
120
+ if self.load_tensors:
121
+ image_latents, video_latents, prompt_embeds = self._preprocess_video(self.video_paths[index])
122
+
123
+ # This is hardcoded for now.
124
+ # The VAE's temporal compression ratio is 4.
125
+ # The VAE's spatial compression ratio is 8.
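+ # Example: a latent with 13 frames decodes to (13 - 1) * 4 + 1 = 49 video frames, and a
+ # 60x90 latent corresponds to a 480x720 video.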
126
+ latent_num_frames = video_latents.size(1)
127
+ if latent_num_frames % 2 == 0:
128
+ num_frames = latent_num_frames * 4
129
+ else:
130
+ num_frames = (latent_num_frames - 1) * 4 + 1
131
+
132
+ height = video_latents.size(2) * 8
133
+ width = video_latents.size(3) * 8
134
+
135
+ return {
136
+ "prompt": prompt_embeds,
137
+ "image": image_latents,
138
+ "video": video_latents,
139
+ "video_metadata": {
140
+ "num_frames": num_frames,
141
+ "height": height,
142
+ "width": width,
143
+ },
144
+ }
145
+ else:
146
+ image, video, _ = self._preprocess_video(self.video_paths[index])
147
+
148
+ return {
149
+ "prompt": self.id_token + self.prompts[index],
150
+ "image": image,
151
+ "video": video,
152
+ "video_metadata": {
153
+ "num_frames": video.shape[0],
154
+ "height": video.shape[2],
155
+ "width": video.shape[3],
156
+ },
157
+ }
158
+
159
+ def _load_dataset_from_local_path(self) -> Tuple[List[str], List[str]]:
160
+ if not self.data_root.exists():
161
+ raise ValueError("Root folder for videos does not exist")
162
+
163
+ prompt_path = self.data_root.joinpath(self.caption_column)
164
+ video_path = self.data_root.joinpath(self.video_column)
165
+
166
+ if not prompt_path.exists() or not prompt_path.is_file():
167
+ raise ValueError(
168
+ "Expected `--caption_column` to be path to a file in `--data_root` containing line-separated text prompts."
169
+ )
170
+ if not video_path.exists() or not video_path.is_file():
171
+ raise ValueError(
172
+ "Expected `--video_column` to be path to a file in `--data_root` containing line-separated paths to video data in the same directory."
173
+ )
174
+
175
+ with open(prompt_path, "r", encoding="utf-8") as file:
176
+ prompts = [line.strip() for line in file.readlines() if len(line.strip()) > 0]
177
+ with open(video_path, "r", encoding="utf-8") as file:
178
+ video_paths = [self.data_root.joinpath(line.strip()) for line in file.readlines() if len(line.strip()) > 0]
179
+
180
+ if not self.load_tensors and any(not path.is_file() for path in video_paths):
181
+ raise ValueError(
182
+ f"Expected `{self.video_column=}` to be a path to a file in `{self.data_root=}` containing line-separated paths to video data but found at least one path that is not a valid file."
183
+ )
184
+
185
+ return prompts, video_paths
186
+
187
+ def _load_dataset_from_csv(self) -> Tuple[List[str], List[str]]:
188
+ df = pd.read_csv(self.dataset_file)
189
+ prompts = df[self.caption_column].tolist()
190
+ video_paths = df[self.video_column].tolist()
191
+ video_paths = [self.data_root.joinpath(line.strip()) for line in video_paths]
192
+
193
+ if any(not path.is_file() for path in video_paths):
194
+ raise ValueError(
195
+ f"Expected `{self.video_column=}` to be a path to a file in `{self.data_root=}` containing line-separated paths to video data but found at least one path that is not a valid file."
196
+ )
197
+
198
+ return prompts, video_paths
199
+
200
+ def _preprocess_video(self, path: Path) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
201
+ r"""
202
+ Loads a single video, or latent and prompt embedding, based on initialization parameters.
203
+
204
+ If returning a video, returns a [F, C, H, W] video tensor, and None for the prompt embedding. Here,
205
+ F, C, H and W are the frames, channels, height and width of the input video.
206
+
207
+ If returning latent/embedding, returns a [F, C, H, W] latent, and the prompt embedding of shape [S, D].
208
+ F, C, H and W are the frames, channels, height and width of the latent, and S, D are the sequence length
209
+ and embedding dimension of prompt embeddings.
210
+ """
211
+ if self.load_tensors:
212
+ return self._load_preprocessed_latents_and_embeds(path)
213
+ else:
214
+ video_reader = decord.VideoReader(uri=path.as_posix())
215
+ video_num_frames = len(video_reader)
216
+
217
+ indices = list(range(0, video_num_frames, video_num_frames // self.max_num_frames))
218
+ frames = video_reader.get_batch(indices)
219
+ frames = frames[: self.max_num_frames].float()
220
+ frames = frames.permute(0, 3, 1, 2).contiguous()
221
+ frames = torch.stack([self.video_transforms(frame) for frame in frames], dim=0)
222
+
223
+ image = frames[:1].clone() if self.image_to_video else None
224
+
225
+ return image, frames, None
226
+
227
+ def _load_preprocessed_latents_and_embeds(self, path: Path) -> Tuple[torch.Tensor, torch.Tensor]:
228
+ filename_without_ext = path.name.split(".")[0]
229
+ pt_filename = f"{filename_without_ext}.pt"
230
+
231
+ # The current path is something like: /a/b/c/d/videos/00001.mp4
232
+ # We need to reach: /a/b/c/d/video_latents/00001.pt
233
+ image_latents_path = path.parent.parent.joinpath("image_latents")
234
+ video_latents_path = path.parent.parent.joinpath("video_latents")
235
+ embeds_path = path.parent.parent.joinpath("prompt_embeds")
236
+
237
+ if (
238
+ not video_latents_path.exists()
239
+ or not embeds_path.exists()
240
+ or (self.image_to_video and not image_latents_path.exists())
241
+ ):
242
+ raise ValueError(
243
+ f"When setting the load_tensors parameter to `True`, it is expected that the `{self.data_root=}` contains two folders named `video_latents` and `prompt_embeds`. However, these folders were not found. Please make sure to have prepared your data correctly using `prepare_data.py`. Additionally, if you're training image-to-video, it is expected that an `image_latents` folder is also present."
244
+ )
245
+
246
+ if self.image_to_video:
247
+ image_latent_filepath = image_latents_path.joinpath(pt_filename)
248
+ video_latent_filepath = video_latents_path.joinpath(pt_filename)
249
+ embeds_filepath = embeds_path.joinpath(pt_filename)
250
+
251
+ if not video_latent_filepath.is_file() or not embeds_filepath.is_file():
252
+ if self.image_to_video:
253
+ image_latent_filepath = image_latent_filepath.as_posix()
254
+ video_latent_filepath = video_latent_filepath.as_posix()
255
+ embeds_filepath = embeds_filepath.as_posix()
256
+ raise ValueError(
257
+ f"The file {video_latent_filepath=} or {embeds_filepath=} could not be found. Please ensure that you've correctly executed `prepare_dataset.py`."
258
+ )
259
+
260
+ images = (
261
+ torch.load(image_latent_filepath, map_location="cpu", weights_only=True) if self.image_to_video else None
262
+ )
263
+ latents = torch.load(video_latent_filepath, map_location="cpu", weights_only=True)
264
+ embeds = torch.load(embeds_filepath, map_location="cpu", weights_only=True)
265
+
266
+ return images, latents, embeds
267
+
268
+
269
+ class VideoDatasetWithResizing(VideoDataset):
270
+ def __init__(self, *args, **kwargs) -> None:
271
+ super().__init__(*args, **kwargs)
272
+
273
+ def _preprocess_video(self, path: Path) -> torch.Tensor:
274
+ if self.load_tensors:
275
+ return self._load_preprocessed_latents_and_embeds(path)
276
+ else:
277
+ video_reader = decord.VideoReader(uri=path.as_posix())
278
+ video_num_frames = len(video_reader)
279
+ nearest_frame_bucket = min(
280
+ self.frame_buckets, key=lambda x: abs(x - min(video_num_frames, self.max_num_frames))
281
+ )
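+ # Pick the frame bucket closest to the clip length (capped at `max_num_frames`), then sample
+ # that many evenly spaced frames below.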
282
+
283
+ frame_indices = list(range(0, video_num_frames, video_num_frames // nearest_frame_bucket))
284
+
285
+ frames = video_reader.get_batch(frame_indices)
286
+ frames = frames[:nearest_frame_bucket].float()
287
+ frames = frames.permute(0, 3, 1, 2).contiguous()
288
+
289
+ nearest_res = self._find_nearest_resolution(frames.shape[2], frames.shape[3])
290
+ frames_resized = torch.stack([resize(frame, nearest_res) for frame in frames], dim=0)
291
+ frames = torch.stack([self.video_transforms(frame) for frame in frames_resized], dim=0)
292
+
293
+ image = frames[:1].clone() if self.image_to_video else None
294
+
295
+ return image, frames, None
296
+
297
+ def _find_nearest_resolution(self, height, width):
298
+ nearest_res = min(self.resolutions, key=lambda x: abs(x[1] - height) + abs(x[2] - width))
299
+ return nearest_res[1], nearest_res[2]
300
+
301
+
302
+ class VideoDatasetWithResizeAndRectangleCrop(VideoDataset):
303
+ def __init__(self, video_reshape_mode: str = "center", *args, **kwargs) -> None:
304
+ super().__init__(*args, **kwargs)
305
+ self.video_reshape_mode = video_reshape_mode
306
+
307
+ def _resize_for_rectangle_crop(self, arr, image_size):
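+ # Resize so the target bucket resolution fits inside the frames while preserving aspect ratio,
+ # then crop away the overflow on the longer side, either centered or at a random offset.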
308
+ reshape_mode = self.video_reshape_mode
309
+ if arr.shape[3] / arr.shape[2] > image_size[1] / image_size[0]:
310
+ arr = resize(
311
+ arr,
312
+ size=[image_size[0], int(arr.shape[3] * image_size[0] / arr.shape[2])],
313
+ interpolation=InterpolationMode.BICUBIC,
314
+ )
315
+ else:
316
+ arr = resize(
317
+ arr,
318
+ size=[int(arr.shape[2] * image_size[1] / arr.shape[3]), image_size[1]],
319
+ interpolation=InterpolationMode.BICUBIC,
320
+ )
321
+
322
+ h, w = arr.shape[2], arr.shape[3]
323
+ arr = arr.squeeze(0)
324
+
325
+ delta_h = h - image_size[0]
326
+ delta_w = w - image_size[1]
327
+
328
+ if reshape_mode == "random" or reshape_mode == "none":
329
+ top = np.random.randint(0, delta_h + 1)
330
+ left = np.random.randint(0, delta_w + 1)
331
+ elif reshape_mode == "center":
332
+ top, left = delta_h // 2, delta_w // 2
333
+ else:
334
+ raise NotImplementedError
335
+ arr = TT.functional.crop(arr, top=top, left=left, height=image_size[0], width=image_size[1])
336
+ return arr
337
+
338
+ def _preprocess_video(self, path: Path) -> torch.Tensor:
339
+ if self.load_tensors:
340
+ return self._load_preprocessed_latents_and_embeds(path)
341
+ else:
342
+ video_reader = decord.VideoReader(uri=path.as_posix())
343
+ video_num_frames = len(video_reader)
344
+ nearest_frame_bucket = min(
345
+ self.frame_buckets, key=lambda x: abs(x - min(video_num_frames, self.max_num_frames))
346
+ )
347
+
348
+ frame_indices = list(range(0, video_num_frames, video_num_frames // nearest_frame_bucket))
349
+
350
+ frames = video_reader.get_batch(frame_indices)
351
+ frames = frames[:nearest_frame_bucket].float()
352
+ frames = frames.permute(0, 3, 1, 2).contiguous()
353
+
354
+ nearest_res = self._find_nearest_resolution(frames.shape[2], frames.shape[3])
355
+ frames_resized = self._resize_for_rectangle_crop(frames, nearest_res)
356
+ frames = torch.stack([self.video_transforms(frame) for frame in frames_resized], dim=0)
357
+
358
+ image = frames[:1].clone() if self.image_to_video else None
359
+
360
+ return image, frames, None
361
+
362
+ def _find_nearest_resolution(self, height, width):
363
+ nearest_res = min(self.resolutions, key=lambda x: abs(x[1] - height) + abs(x[2] - width))
364
+ return nearest_res[1], nearest_res[2]
365
+
366
+
367
+ class BucketSampler(Sampler):
368
+ r"""
369
+ PyTorch Sampler that groups 3D data by height, width and frames.
370
+
371
+ Args:
372
+ data_source (`VideoDataset`):
373
+ A PyTorch dataset object that is an instance of `VideoDataset`.
374
+ batch_size (`int`, defaults to `8`):
375
+ The batch size to use for training.
376
+ shuffle (`bool`, defaults to `True`):
377
+ Whether or not to shuffle the data in each batch before dispatching to dataloader.
378
+ drop_last (`bool`, defaults to `False`):
379
+ Whether or not to drop incomplete buckets of data after completely iterating over all data
380
+ in the dataset. If set to True, only batches that have `batch_size` number of entries will
381
+ be yielded. If set to False, it is guaranteed that all data in the dataset will be processed
382
+ and batches that do not have `batch_size` number of entries will also be yielded.
383
+ """
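+ # A minimal usage sketch (illustrative, mirroring how the trainer wires things up):
+ #   sampler = BucketSampler(dataset, batch_size=4, shuffle=True)
+ #   loader = DataLoader(dataset, batch_size=1, sampler=sampler, collate_fn=collate_fn)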
384
+
385
+ def __init__(
386
+ self, data_source: VideoDataset, batch_size: int = 8, shuffle: bool = True, drop_last: bool = False
387
+ ) -> None:
388
+ self.data_source = data_source
389
+ self.batch_size = batch_size
390
+ self.shuffle = shuffle
391
+ self.drop_last = drop_last
392
+
393
+ self.buckets = {resolution: [] for resolution in data_source.resolutions}
394
+
395
+ self._raised_warning_for_drop_last = False
396
+
397
+ def __len__(self):
398
+ if self.drop_last and not self._raised_warning_for_drop_last:
399
+ self._raised_warning_for_drop_last = True
400
+ logger.warning(
401
+ "Calculating the length for bucket sampler is not possible when `drop_last` is set to True. This may cause problems when setting the number of epochs used for training."
402
+ )
403
+ return (len(self.data_source) + self.batch_size - 1) // self.batch_size
404
+
405
+ def __iter__(self):
406
+ for index, data in enumerate(self.data_source):
407
+ video_metadata = data["video_metadata"]
408
+ f, h, w = video_metadata["num_frames"], video_metadata["height"], video_metadata["width"]
409
+
410
+ self.buckets[(f, h, w)].append(data)
411
+ if len(self.buckets[(f, h, w)]) == self.batch_size:
412
+ if self.shuffle:
413
+ random.shuffle(self.buckets[(f, h, w)])
414
+ yield self.buckets[(f, h, w)]
415
+ del self.buckets[(f, h, w)]
416
+ self.buckets[(f, h, w)] = []
417
+
418
+ if self.drop_last:
419
+ return
420
+
421
+ for fhw, bucket in list(self.buckets.items()):
422
+ if len(bucket) == 0:
423
+ continue
424
+ if self.shuffle:
425
+ random.shuffle(bucket)
426
+ yield bucket
427
+ del self.buckets[fhw]
428
+ self.buckets[fhw] = []
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/prepare_dataset.py ADDED
@@ -0,0 +1,669 @@
1
+ #!/usr/bin/env python3
2
+
3
+ import argparse
4
+ import functools
5
+ import json
6
+ import os
7
+ import pathlib
8
+ import queue
9
+ import traceback
10
+ import uuid
11
+ from concurrent.futures import ThreadPoolExecutor
12
+ from typing import Any, Dict, List, Optional, Union
13
+
14
+ import torch
15
+ import torch.distributed as dist
16
+ from diffusers import AutoencoderKLCogVideoX
17
+ from diffusers.training_utils import set_seed
18
+ from diffusers.utils import export_to_video, get_logger
19
+ from torch.utils.data import DataLoader
20
+ from torchvision import transforms
21
+ from tqdm import tqdm
22
+ from transformers import T5EncoderModel, T5Tokenizer
23
+
24
+
25
+ import decord # isort:skip
26
+
27
+ from dataset import BucketSampler, VideoDatasetWithResizing, VideoDatasetWithResizeAndRectangleCrop # isort:skip
28
+
29
+
30
+ decord.bridge.set_bridge("torch")
31
+
32
+ logger = get_logger(__name__)
33
+
34
+ DTYPE_MAPPING = {
35
+ "fp32": torch.float32,
36
+ "fp16": torch.float16,
37
+ "bf16": torch.bfloat16,
38
+ }
39
+
40
+
41
+ def check_height(x: Any) -> int:
42
+ x = int(x)
43
+ if x % 16 != 0:
44
+ raise argparse.ArgumentTypeError(
45
+ f"`--height_buckets` must be divisible by 16, but got {x} which does not fit criteria."
46
+ )
47
+ return x
48
+
49
+
50
+ def check_width(x: Any) -> int:
51
+ x = int(x)
52
+ if x % 16 != 0:
53
+ raise argparse.ArgumentTypeError(
54
+ f"`--width_buckets` must be divisible by 16, but got {x} which does not fit criteria."
55
+ )
56
+ return x
57
+
58
+
59
+ def check_frames(x: Any) -> int:
60
+ x = int(x)
61
+ if x % 4 != 0 and x % 4 != 1:
62
+ raise argparse.ArgumentTypeError(
63
+ f"`--frame_buckets` must be of form `4 * k` or `4 * k + 1`, but got {x} which does not fit criteria."
64
+ )
65
+ return x
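+ # CogVideoX's VAE compresses time by a factor of 4, so frame counts of the form 4 * k or
+ # 4 * k + 1 (for example, 48 or 49) map cleanly onto whole latent frames.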
66
+
67
+
68
+ def get_args() -> Dict[str, Any]:
69
+ parser = argparse.ArgumentParser()
70
+ parser.add_argument(
71
+ "--model_id",
72
+ type=str,
73
+ default="THUDM/CogVideoX-2b",
74
+ help="Hugging Face model ID to use for tokenizer, text encoder and VAE.",
75
+ )
76
+ parser.add_argument("--data_root", type=str, required=True, help="Path to where training data is located.")
77
+ parser.add_argument(
78
+ "--dataset_file", type=str, default=None, help="Path to CSV file containing metadata about training data."
79
+ )
80
+ parser.add_argument(
81
+ "--caption_column",
82
+ type=str,
83
+ default="caption",
84
+ help="If using a CSV file via the `--dataset_file` argument, this should be the name of the column containing the captions. If using the folder structure format for data loading, this should be the name of the file containing line-separated captions (the file should be located in `--data_root`).",
85
+ )
86
+ parser.add_argument(
87
+ "--video_column",
88
+ type=str,
89
+ default="video",
90
+ help="If using a CSV file via the `--dataset_file` argument, this should be the name of the column containing the video paths. If using the folder structure format for data loading, this should be the name of the file containing line-separated video paths (the file should be located in `--data_root`).",
91
+ )
92
+ parser.add_argument(
93
+ "--id_token",
94
+ type=str,
95
+ default=None,
96
+ help="Identifier token appended to the start of each prompt if provided.",
97
+ )
98
+ parser.add_argument(
99
+ "--height_buckets",
100
+ nargs="+",
101
+ type=check_height,
102
+ default=[256, 320, 384, 480, 512, 576, 720, 768, 960, 1024, 1280, 1536],
103
+ )
104
+ parser.add_argument(
105
+ "--width_buckets",
106
+ nargs="+",
107
+ type=check_width,
108
+ default=[256, 320, 384, 480, 512, 576, 720, 768, 960, 1024, 1280, 1536],
109
+ )
110
+ parser.add_argument(
111
+ "--frame_buckets",
112
+ nargs="+",
113
+ type=check_frames,
114
+ default=[49],
115
+ )
116
+ parser.add_argument(
117
+ "--random_flip",
118
+ type=float,
119
+ default=None,
120
+ help="If random horizontal flip augmentation is to be used, this should be the flip probability.",
121
+ )
122
+ parser.add_argument(
123
+ "--dataloader_num_workers",
124
+ type=int,
125
+ default=0,
126
+ help="Number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process.",
127
+ )
128
+ parser.add_argument(
129
+ "--pin_memory",
130
+ action="store_true",
131
+ help="Whether or not to use the pinned memory setting in pytorch dataloader.",
132
+ )
133
+ parser.add_argument(
134
+ "--video_reshape_mode",
135
+ type=str,
136
+ default=None,
137
+ help="All input videos are reshaped to this mode. Choose between ['center', 'random', 'none']",
138
+ )
139
+ parser.add_argument(
140
+ "--save_image_latents",
141
+ action="store_true",
142
+ help="Whether or not to encode and store image latents, which are required for image-to-video finetuning. The image latents are the first frame of input videos encoded with the VAE.",
143
+ )
144
+ parser.add_argument(
145
+ "--output_dir",
146
+ type=str,
147
+ required=True,
148
+ help="Path to output directory where preprocessed videos/latents/embeddings will be saved.",
149
+ )
150
+ parser.add_argument("--max_num_frames", type=int, default=49, help="Maximum number of frames in output video.")
151
+ parser.add_argument(
152
+ "--max_sequence_length", type=int, default=226, help="Max sequence length of prompt embeddings."
153
+ )
154
+ parser.add_argument("--target_fps", type=int, default=8, help="Frame rate of output videos.")
155
+ parser.add_argument(
156
+ "--save_latents_and_embeddings",
157
+ action="store_true",
158
+ help="Whether to encode videos/captions to latents/embeddings and save them in pytorch serializable format.",
159
+ )
160
+ parser.add_argument(
161
+ "--use_slicing",
162
+ action="store_true",
163
+ help="Whether to enable sliced encoding/decoding in the VAE. Only used if `--save_latents_and_embeddings` is also used.",
164
+ )
165
+ parser.add_argument(
166
+ "--use_tiling",
167
+ action="store_true",
168
+ help="Whether to enable tiled encoding/decoding in the VAE. Only used if `--save_latents_and_embeddings` is also used.",
169
+ )
170
+ parser.add_argument("--batch_size", type=int, default=1, help="Number of videos to process at once in the VAE.")
171
+ parser.add_argument(
172
+ "--num_decode_threads",
173
+ type=int,
174
+ default=0,
175
+ help="Number of decoding threads for `decord` to use. The default `0` means to automatically determine required number of threads.",
176
+ )
177
+ parser.add_argument(
178
+ "--dtype",
179
+ type=str,
180
+ choices=["fp32", "fp16", "bf16"],
181
+ default="fp32",
182
+ help="Data type to use when generating latents and prompt embeddings.",
183
+ )
184
+ parser.add_argument("--seed", type=int, default=42, help="Seed for reproducibility.")
185
+ parser.add_argument(
186
+ "--num_artifact_workers", type=int, default=4, help="Number of worker threads for serializing artifacts."
187
+ )
188
+ return parser.parse_args()
189
+
190
+
191
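The parser above accepts either a CSV manifest via `--dataset_file` or a folder layout in which `--caption_column` and `--video_column` name line-separated files inside `--data_root`. A small sketch of that folder layout, with hypothetical paths and contents:

```python
# Sketch of the folder-structure format described in the help strings above, e.g.
#   --data_root ./my_dataset --caption_column prompts.txt --video_column videos.txt
import pathlib

data_root = pathlib.Path("./my_dataset")  # placeholder location
(data_root / "videos").mkdir(parents=True, exist_ok=True)
(data_root / "prompts.txt").write_text(
    "A cat playing the piano\nA drone shot of a waterfall\n", encoding="utf-8"
)
(data_root / "videos.txt").write_text(
    "videos/cat_piano.mp4\nvideos/waterfall.mp4\n", encoding="utf-8"
)
```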
+ def _get_t5_prompt_embeds(
192
+ tokenizer: T5Tokenizer,
193
+ text_encoder: T5EncoderModel,
194
+ prompt: Union[str, List[str]],
195
+ num_videos_per_prompt: int = 1,
196
+ max_sequence_length: int = 226,
197
+ device: Optional[torch.device] = None,
198
+ dtype: Optional[torch.dtype] = None,
199
+ text_input_ids=None,
200
+ ):
201
+ prompt = [prompt] if isinstance(prompt, str) else prompt
202
+ batch_size = len(prompt)
203
+
204
+ if tokenizer is not None:
205
+ text_inputs = tokenizer(
206
+ prompt,
207
+ padding="max_length",
208
+ max_length=max_sequence_length,
209
+ truncation=True,
210
+ add_special_tokens=True,
211
+ return_tensors="pt",
212
+ )
213
+ text_input_ids = text_inputs.input_ids
214
+ else:
215
+ if text_input_ids is None:
216
+ raise ValueError("`text_input_ids` must be provided when the tokenizer is not specified.")
217
+
218
+ prompt_embeds = text_encoder(text_input_ids.to(device))[0]
219
+ prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
220
+
221
+ # duplicate text embeddings for each generation per prompt, using mps friendly method
222
+ _, seq_len, _ = prompt_embeds.shape
223
+ prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
224
+ prompt_embeds = prompt_embeds.view(batch_size * num_videos_per_prompt, seq_len, -1)
225
+
226
+ return prompt_embeds
227
+
228
+
229
+ def encode_prompt(
230
+ tokenizer: T5Tokenizer,
231
+ text_encoder: T5EncoderModel,
232
+ prompt: Union[str, List[str]],
233
+ num_videos_per_prompt: int = 1,
234
+ max_sequence_length: int = 226,
235
+ device: Optional[torch.device] = None,
236
+ dtype: Optional[torch.dtype] = None,
237
+ text_input_ids=None,
238
+ ):
239
+ prompt = [prompt] if isinstance(prompt, str) else prompt
240
+ prompt_embeds = _get_t5_prompt_embeds(
241
+ tokenizer,
242
+ text_encoder,
243
+ prompt=prompt,
244
+ num_videos_per_prompt=num_videos_per_prompt,
245
+ max_sequence_length=max_sequence_length,
246
+ device=device,
247
+ dtype=dtype,
248
+ text_input_ids=text_input_ids,
249
+ )
250
+ return prompt_embeds
251
+
252
+
253
+ def compute_prompt_embeddings(
254
+ tokenizer: T5Tokenizer,
255
+ text_encoder: T5EncoderModel,
256
+ prompts: List[str],
257
+ max_sequence_length: int,
258
+ device: torch.device,
259
+ dtype: torch.dtype,
260
+ requires_grad: bool = False,
261
+ ):
262
+ if requires_grad:
263
+ prompt_embeds = encode_prompt(
264
+ tokenizer,
265
+ text_encoder,
266
+ prompts,
267
+ num_videos_per_prompt=1,
268
+ max_sequence_length=max_sequence_length,
269
+ device=device,
270
+ dtype=dtype,
271
+ )
272
+ else:
273
+ with torch.no_grad():
274
+ prompt_embeds = encode_prompt(
275
+ tokenizer,
276
+ text_encoder,
277
+ prompts,
278
+ num_videos_per_prompt=1,
279
+ max_sequence_length=max_sequence_length,
280
+ device=device,
281
+ dtype=dtype,
282
+ )
283
+ return prompt_embeds
284
+
285
+
286
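`compute_prompt_embeddings` simply routes through `encode_prompt`, with or without gradient tracking. A hedged usage sketch, assuming a CUDA device with enough memory for the T5 text encoder and the same loading pattern used later in `main()`:

```python
# Sketch only: embed a prompt with the helpers defined above.
import torch
from transformers import T5EncoderModel, T5Tokenizer

model_id = "THUDM/CogVideoX-2b"
tokenizer = T5Tokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
    model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16
).to("cuda")

embeds = compute_prompt_embeddings(
    tokenizer,
    text_encoder,
    ["A cat playing the piano"],
    max_sequence_length=226,
    device=torch.device("cuda"),
    dtype=torch.bfloat16,
    requires_grad=False,
)
print(embeds.shape)  # (1, 226, hidden_size) -- padded/truncated to 226 tokens
```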
+ to_pil_image = transforms.ToPILImage(mode="RGB")
287
+
288
+
289
+ def save_image(image: torch.Tensor, path: pathlib.Path) -> None:
290
+ image = image.to(dtype=torch.float32).clamp(-1, 1)
291
+ image = to_pil_image(image)
292
+ image.save(path)
293
+
294
+
295
+ def save_video(video: torch.Tensor, path: pathlib.Path, fps: int = 8) -> None:
296
+ video = video.to(dtype=torch.float32).clamp(-1, 1)
297
+ video = [to_pil_image(frame) for frame in video]
298
+ export_to_video(video, path, fps=fps)
299
+
300
+
301
+ def save_prompt(prompt: str, path: pathlib.Path) -> None:
302
+ with open(path, "w", encoding="utf-8") as file:
303
+ file.write(prompt)
304
+
305
+
306
+ def save_metadata(metadata: Dict[str, Any], path: pathlib.Path) -> None:
307
+ with open(path, "w", encoding="utf-8") as file:
308
+ file.write(json.dumps(metadata))
309
+
310
+
311
+ @torch.no_grad()
312
+ def serialize_artifacts(
313
+ batch_size: int,
314
+ fps: int,
315
+ images_dir: Optional[pathlib.Path] = None,
316
+ image_latents_dir: Optional[pathlib.Path] = None,
317
+ videos_dir: Optional[pathlib.Path] = None,
318
+ video_latents_dir: Optional[pathlib.Path] = None,
319
+ prompts_dir: Optional[pathlib.Path] = None,
320
+ prompt_embeds_dir: Optional[pathlib.Path] = None,
321
+ images: Optional[torch.Tensor] = None,
322
+ image_latents: Optional[torch.Tensor] = None,
323
+ videos: Optional[torch.Tensor] = None,
324
+ video_latents: Optional[torch.Tensor] = None,
325
+ prompts: Optional[List[str]] = None,
326
+ prompt_embeds: Optional[torch.Tensor] = None,
327
+ ) -> None:
328
+ num_frames, height, width = videos.size(1), videos.size(3), videos.size(4)
329
+ metadata = [{"num_frames": num_frames, "height": height, "width": width}]
330
+
331
+ data_folder_mapper_list = [
332
+ (images, images_dir, lambda img, path: save_image(img[0], path), "png"),
333
+ (image_latents, image_latents_dir, torch.save, "pt"),
334
+ (videos, videos_dir, functools.partial(save_video, fps=fps), "mp4"),
335
+ (video_latents, video_latents_dir, torch.save, "pt"),
336
+ (prompts, prompts_dir, save_prompt, "txt"),
337
+ (prompt_embeds, prompt_embeds_dir, torch.save, "pt"),
338
+ (metadata, videos_dir, save_metadata, "txt"),
339
+ ]
340
+ filenames = [uuid.uuid4() for _ in range(batch_size)]
341
+
342
+ for data, folder, save_fn, extension in data_folder_mapper_list:
343
+ if data is None:
344
+ continue
345
+ for slice, filename in zip(data, filenames):
346
+ if isinstance(slice, torch.Tensor):
347
+ slice = slice.clone().to("cpu")
348
+ path = folder.joinpath(f"{filename}.{extension}")
349
+ save_fn(slice, path)
350
+
351
+
352
+ def save_intermediates(output_queue: queue.Queue) -> None:
353
+ while True:
354
+ try:
355
+ item = output_queue.get(timeout=30)
356
+ if item is None:
357
+ break
358
+ serialize_artifacts(**item)
359
+
360
+ except queue.Empty:
361
+ continue
362
+
363
+
364
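`save_intermediates` drains the queue on a worker thread so that the GPU loop in `main()` never blocks on disk I/O, and a `None` sentinel ends the loop. A stripped-down sketch of the same producer/consumer pattern (blocking `get` instead of the 30-second timeout used above):

```python
# Minimal sketch of the queue + ThreadPoolExecutor pattern used below.
import queue
from concurrent.futures import ThreadPoolExecutor

def consumer(q: queue.Queue) -> None:
    while True:
        item = q.get()
        if item is None:  # sentinel: producer is done
            break
        print("saving", item)  # stands in for serialize_artifacts(**item)

q: queue.Queue = queue.Queue()
pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(consumer, q)
for i in range(3):
    q.put({"step": i})  # producer enqueues work without waiting for the save
q.put(None)
pool.shutdown(wait=True)
future.result()  # surface any exception raised inside the consumer
```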
+ @torch.no_grad()
365
+ def main():
366
+ args = get_args()
367
+ set_seed(args.seed)
368
+
369
+ output_dir = pathlib.Path(args.output_dir)
370
+ tmp_dir = output_dir.joinpath("tmp")
371
+
372
+ output_dir.mkdir(parents=True, exist_ok=True)
373
+ tmp_dir.mkdir(parents=True, exist_ok=True)
374
+
375
+ # Create task queue for non-blocking serialization of artifacts
376
+ output_queue = queue.Queue()
377
+ save_thread = ThreadPoolExecutor(max_workers=args.num_artifact_workers)
378
+ save_future = save_thread.submit(save_intermediates, output_queue)
379
+
380
+ # Initialize distributed processing
381
+ if "LOCAL_RANK" in os.environ:
382
+ local_rank = int(os.environ["LOCAL_RANK"])
383
+ torch.cuda.set_device(local_rank)
384
+ dist.init_process_group(backend="nccl")
385
+ world_size = dist.get_world_size()
386
+ rank = dist.get_rank()
387
+ else:
388
+ # Single GPU
389
+ local_rank = 0
390
+ world_size = 1
391
+ rank = 0
392
+ torch.cuda.set_device(rank)
393
+
394
+ # Create folders where intermediate tensors from each rank will be saved
395
+ images_dir = tmp_dir.joinpath(f"images/{rank}")
396
+ image_latents_dir = tmp_dir.joinpath(f"image_latents/{rank}")
397
+ videos_dir = tmp_dir.joinpath(f"videos/{rank}")
398
+ video_latents_dir = tmp_dir.joinpath(f"video_latents/{rank}")
399
+ prompts_dir = tmp_dir.joinpath(f"prompts/{rank}")
400
+ prompt_embeds_dir = tmp_dir.joinpath(f"prompt_embeds/{rank}")
401
+
402
+ images_dir.mkdir(parents=True, exist_ok=True)
403
+ image_latents_dir.mkdir(parents=True, exist_ok=True)
404
+ videos_dir.mkdir(parents=True, exist_ok=True)
405
+ video_latents_dir.mkdir(parents=True, exist_ok=True)
406
+ prompts_dir.mkdir(parents=True, exist_ok=True)
407
+ prompt_embeds_dir.mkdir(parents=True, exist_ok=True)
408
+
409
+ weight_dtype = DTYPE_MAPPING[args.dtype]
410
+ target_fps = args.target_fps
411
+
412
+ # 1. Dataset
413
+ dataset_init_kwargs = {
414
+ "data_root": args.data_root,
415
+ "dataset_file": args.dataset_file,
416
+ "caption_column": args.caption_column,
417
+ "video_column": args.video_column,
418
+ "max_num_frames": args.max_num_frames,
419
+ "id_token": args.id_token,
420
+ "height_buckets": args.height_buckets,
421
+ "width_buckets": args.width_buckets,
422
+ "frame_buckets": args.frame_buckets,
423
+ "load_tensors": False,
424
+ "random_flip": args.random_flip,
425
+ "image_to_video": args.save_image_latents,
426
+ }
427
+ if args.video_reshape_mode is None:
428
+ dataset = VideoDatasetWithResizing(**dataset_init_kwargs)
429
+ else:
430
+ dataset = VideoDatasetWithResizeAndRectangleCrop(
431
+ video_reshape_mode=args.video_reshape_mode, **dataset_init_kwargs
432
+ )
433
+
434
+ original_dataset_size = len(dataset)
435
+
436
+ # Split data among GPUs
437
+ if world_size > 1:
438
+ samples_per_gpu = original_dataset_size // world_size
439
+ start_index = rank * samples_per_gpu
440
+ end_index = start_index + samples_per_gpu
441
+ if rank == world_size - 1:
442
+ end_index = original_dataset_size # Make sure the last GPU gets the remaining data
443
+
444
+ # Slice the data
445
+ dataset.prompts = dataset.prompts[start_index:end_index]
446
+ dataset.video_paths = dataset.video_paths[start_index:end_index]
447
+ else:
448
+ pass
449
+
450
+ rank_dataset_size = len(dataset)
451
+
452
+ # 2. Dataloader
453
+ def collate_fn(data):
454
+ prompts = [x["prompt"] for x in data[0]]
455
+
456
+ images = None
457
+ if args.save_image_latents:
458
+ images = [x["image"] for x in data[0]]
459
+ images = torch.stack(images).to(dtype=weight_dtype, non_blocking=True)
460
+
461
+ videos = [x["video"] for x in data[0]]
462
+ videos = torch.stack(videos).to(dtype=weight_dtype, non_blocking=True)
463
+
464
+ return {
465
+ "images": images,
466
+ "videos": videos,
467
+ "prompts": prompts,
468
+ }
469
+
470
+ dataloader = DataLoader(
471
+ dataset,
472
+ batch_size=1,
473
+ sampler=BucketSampler(dataset, batch_size=args.batch_size, shuffle=True, drop_last=False),
474
+ collate_fn=collate_fn,
475
+ num_workers=args.dataloader_num_workers,
476
+ pin_memory=args.pin_memory,
477
+ )
478
+
479
+ # 3. Prepare models
480
+ device = f"cuda:{rank}"
481
+
482
+ if args.save_latents_and_embeddings:
483
+ tokenizer = T5Tokenizer.from_pretrained(args.model_id, subfolder="tokenizer")
484
+ text_encoder = T5EncoderModel.from_pretrained(
485
+ args.model_id, subfolder="text_encoder", torch_dtype=weight_dtype
486
+ )
487
+ text_encoder = text_encoder.to(device)
488
+
489
+ vae = AutoencoderKLCogVideoX.from_pretrained(args.model_id, subfolder="vae", torch_dtype=weight_dtype)
490
+ vae = vae.to(device)
491
+
492
+ if args.use_slicing:
493
+ vae.enable_slicing()
494
+ if args.use_tiling:
495
+ vae.enable_tiling()
496
+
497
+ # 4. Compute latents and embeddings and save
498
+ if rank == 0:
499
+ iterator = tqdm(
500
+ dataloader, desc="Encoding", total=(rank_dataset_size + args.batch_size - 1) // args.batch_size
501
+ )
502
+ else:
503
+ iterator = dataloader
504
+
505
+ for step, batch in enumerate(iterator):
506
+ try:
507
+ images = None
508
+ image_latents = None
509
+ video_latents = None
510
+ prompt_embeds = None
511
+
512
+ if args.save_image_latents:
513
+ images = batch["images"].to(device, non_blocking=True)
514
+ images = images.permute(0, 2, 1, 3, 4) # [B, C, F, H, W]
515
+
516
+ videos = batch["videos"].to(device, non_blocking=True)
517
+ videos = videos.permute(0, 2, 1, 3, 4) # [B, C, F, H, W]
518
+
519
+ prompts = batch["prompts"]
520
+
521
+ # Encode videos & images
522
+ if args.save_latents_and_embeddings:
523
+ if args.use_slicing:
524
+ if args.save_image_latents:
525
+ encoded_slices = [vae._encode(image_slice) for image_slice in images.split(1)]
526
+ image_latents = torch.cat(encoded_slices)
527
+ image_latents = image_latents.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
528
+
529
+ encoded_slices = [vae._encode(video_slice) for video_slice in videos.split(1)]
530
+ video_latents = torch.cat(encoded_slices)
531
+
532
+ else:
533
+ if args.save_image_latents:
534
+ image_latents = vae._encode(images)
535
+ image_latents = image_latents.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
536
+
537
+ video_latents = vae._encode(videos)
538
+
539
+ video_latents = video_latents.to(memory_format=torch.contiguous_format, dtype=weight_dtype)
540
+
541
+ # Encode prompts
542
+ prompt_embeds = compute_prompt_embeddings(
543
+ tokenizer,
544
+ text_encoder,
545
+ prompts,
546
+ args.max_sequence_length,
547
+ device,
548
+ weight_dtype,
549
+ requires_grad=False,
550
+ )
551
+
552
+ if images is not None:
553
+ images = (images.permute(0, 2, 1, 3, 4) + 1) / 2
554
+
555
+ videos = (videos.permute(0, 2, 1, 3, 4) + 1) / 2
556
+
557
+ output_queue.put(
558
+ {
559
+ "batch_size": len(prompts),
560
+ "fps": target_fps,
561
+ "images_dir": images_dir,
562
+ "image_latents_dir": image_latents_dir,
563
+ "videos_dir": videos_dir,
564
+ "video_latents_dir": video_latents_dir,
565
+ "prompts_dir": prompts_dir,
566
+ "prompt_embeds_dir": prompt_embeds_dir,
567
+ "images": images,
568
+ "image_latents": image_latents,
569
+ "videos": videos,
570
+ "video_latents": video_latents,
571
+ "prompts": prompts,
572
+ "prompt_embeds": prompt_embeds,
573
+ }
574
+ )
575
+
576
+ except Exception:
577
+ print("-------------------------")
578
+ print(f"An exception occurred while processing data: {rank=}, {world_size=}, {step=}")
579
+ traceback.print_exc()
580
+ print("-------------------------")
581
+
582
+ # 5. Complete distributed processing
583
+ if world_size > 1:
584
+ dist.barrier()
585
+ dist.destroy_process_group()
586
+
587
+ output_queue.put(None)
588
+ save_thread.shutdown(wait=True)
589
+ save_future.result()
590
+
591
+ # 6. Combine results from each rank
592
+ if rank == 0:
593
+ print(
594
+ f"Completed preprocessing latents and embeddings. Temporary files from all ranks saved to `{tmp_dir.as_posix()}`"
595
+ )
596
+
597
+ # Move files from each rank to common directory
598
+ for subfolder, extension in [
599
+ ("images", "png"),
600
+ ("image_latents", "pt"),
601
+ ("videos", "mp4"),
602
+ ("video_latents", "pt"),
603
+ ("prompts", "txt"),
604
+ ("prompt_embeds", "pt"),
605
+ ("videos", "txt"),
606
+ ]:
607
+ tmp_subfolder = tmp_dir.joinpath(subfolder)
608
+ combined_subfolder = output_dir.joinpath(subfolder)
609
+ combined_subfolder.mkdir(parents=True, exist_ok=True)
610
+ pattern = f"*.{extension}"
611
+
612
+ for file in tmp_subfolder.rglob(pattern):
613
+ file.replace(combined_subfolder / file.name)
614
+
615
+ # Remove temporary directories
616
+ def rmdir_recursive(dir: pathlib.Path) -> None:
617
+ for child in dir.iterdir():
618
+ if child.is_file():
619
+ child.unlink()
620
+ else:
621
+ rmdir_recursive(child)
622
+ dir.rmdir()
623
+
624
+ rmdir_recursive(tmp_dir)
625
+
626
+ # Combine prompts and videos into individual text files and a single jsonl
627
+ prompts_folder = output_dir.joinpath("prompts")
628
+ prompts = []
629
+ stems = []
630
+
631
+ for filename in prompts_folder.rglob("*.txt"):
632
+ with open(filename, "r") as file:
633
+ prompts.append(file.read().strip())
634
+ stems.append(filename.stem)
635
+
636
+ prompts_txt = output_dir.joinpath("prompts.txt")
637
+ videos_txt = output_dir.joinpath("videos.txt")
638
+ data_jsonl = output_dir.joinpath("data.jsonl")
639
+
640
+ with open(prompts_txt, "w") as file:
641
+ for prompt in prompts:
642
+ file.write(f"{prompt}\n")
643
+
644
+ with open(videos_txt, "w") as file:
645
+ for stem in stems:
646
+ file.write(f"videos/{stem}.mp4\n")
647
+
648
+ with open(data_jsonl, "w") as file:
649
+ for prompt, stem in zip(prompts, stems):
650
+ video_metadata_txt = output_dir.joinpath(f"videos/{stem}.txt")
651
+ with open(video_metadata_txt, "r", encoding="utf-8") as metadata_file:
652
+ metadata = json.loads(metadata_file.read())
653
+
654
+ data = {
655
+ "prompt": prompt,
656
+ "prompt_embed": f"prompt_embeds/{stem}.pt",
657
+ "image": f"images/{stem}.png",
658
+ "image_latent": f"image_latents/{stem}.pt",
659
+ "video": f"videos/{stem}.mp4",
660
+ "video_latent": f"video_latents/{stem}.pt",
661
+ "metadata": metadata,
662
+ }
663
+ file.write(json.dumps(data) + "\n")
664
+
665
+ print(f"Completed preprocessing. All files saved to `{output_dir.as_posix()}`")
666
+
667
+
668
+ if __name__ == "__main__":
669
+ main()
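After `main()` finishes, `--output_dir` contains the per-sample artifacts (`images/`, `image_latents/`, `videos/`, `video_latents/`, `prompts/`, `prompt_embeds/`) plus `prompts.txt`, `videos.txt`, and `data.jsonl`. A small sketch of reading the manifest back; the output path is a placeholder:

```python
# Sketch: iterate over the data.jsonl manifest produced above and load one precomputed latent.
import json
import pathlib
import torch

output_dir = pathlib.Path("/path/to/output_dir")  # placeholder
with open(output_dir / "data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        video_latent = torch.load(output_dir / entry["video_latent"], map_location="cpu")
        print(entry["prompt"][:40], tuple(video_latent.shape), entry["metadata"])
        break
```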
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/text_encoder/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .text_encoder import compute_prompt_embeddings
docs/finetrainers-src-codebase/examples/_legacy/training/cogvideox/text_encoder/text_encoder.py ADDED
@@ -0,0 +1,99 @@
1
+ from typing import List, Optional, Union
2
+
3
+ import torch
4
+ from transformers import T5EncoderModel, T5Tokenizer
5
+
6
+
7
+ def _get_t5_prompt_embeds(
8
+ tokenizer: T5Tokenizer,
9
+ text_encoder: T5EncoderModel,
10
+ prompt: Union[str, List[str]],
11
+ num_videos_per_prompt: int = 1,
12
+ max_sequence_length: int = 226,
13
+ device: Optional[torch.device] = None,
14
+ dtype: Optional[torch.dtype] = None,
15
+ text_input_ids=None,
16
+ ):
17
+ prompt = [prompt] if isinstance(prompt, str) else prompt
18
+ batch_size = len(prompt)
19
+
20
+ if tokenizer is not None:
21
+ text_inputs = tokenizer(
22
+ prompt,
23
+ padding="max_length",
24
+ max_length=max_sequence_length,
25
+ truncation=True,
26
+ add_special_tokens=True,
27
+ return_tensors="pt",
28
+ )
29
+ text_input_ids = text_inputs.input_ids
30
+ else:
31
+ if text_input_ids is None:
32
+ raise ValueError("`text_input_ids` must be provided when the tokenizer is not specified.")
33
+
34
+ prompt_embeds = text_encoder(text_input_ids.to(device))[0]
35
+ prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
36
+
37
+ # duplicate text embeddings for each generation per prompt, using mps friendly method
38
+ _, seq_len, _ = prompt_embeds.shape
39
+ prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
40
+ prompt_embeds = prompt_embeds.view(batch_size * num_videos_per_prompt, seq_len, -1)
41
+
42
+ return prompt_embeds
43
+
44
+
45
+ def encode_prompt(
46
+ tokenizer: T5Tokenizer,
47
+ text_encoder: T5EncoderModel,
48
+ prompt: Union[str, List[str]],
49
+ num_videos_per_prompt: int = 1,
50
+ max_sequence_length: int = 226,
51
+ device: Optional[torch.device] = None,
52
+ dtype: Optional[torch.dtype] = None,
53
+ text_input_ids=None,
54
+ ):
55
+ prompt = [prompt] if isinstance(prompt, str) else prompt
56
+ prompt_embeds = _get_t5_prompt_embeds(
57
+ tokenizer,
58
+ text_encoder,
59
+ prompt=prompt,
60
+ num_videos_per_prompt=num_videos_per_prompt,
61
+ max_sequence_length=max_sequence_length,
62
+ device=device,
63
+ dtype=dtype,
64
+ text_input_ids=text_input_ids,
65
+ )
66
+ return prompt_embeds
67
+
68
+
69
+ def compute_prompt_embeddings(
70
+ tokenizer: T5Tokenizer,
71
+ text_encoder: T5EncoderModel,
72
+ prompt: str,
73
+ max_sequence_length: int,
74
+ device: torch.device,
75
+ dtype: torch.dtype,
76
+ requires_grad: bool = False,
77
+ ):
78
+ if requires_grad:
79
+ prompt_embeds = encode_prompt(
80
+ tokenizer,
81
+ text_encoder,
82
+ prompt,
83
+ num_videos_per_prompt=1,
84
+ max_sequence_length=max_sequence_length,
85
+ device=device,
86
+ dtype=dtype,
87
+ )
88
+ else:
89
+ with torch.no_grad():
90
+ prompt_embeds = encode_prompt(
91
+ tokenizer,
92
+ text_encoder,
93
+ prompt,
94
+ num_videos_per_prompt=1,
95
+ max_sequence_length=max_sequence_length,
96
+ device=device,
97
+ dtype=dtype,
98
+ )
99
+ return prompt_embeds
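The `repeat` + `view` sequence in `_get_t5_prompt_embeds` duplicates each prompt's embedding `num_videos_per_prompt` times while keeping the batch dimension first. A small shape check illustrating the same operation on dummy data (the hidden size here is arbitrary):

```python
# Sketch of the duplication done in _get_t5_prompt_embeds, on random tensors.
import torch

batch_size, seq_len, hidden = 2, 226, 64
num_videos_per_prompt = 3
prompt_embeds = torch.randn(batch_size, seq_len, hidden)

out = prompt_embeds.repeat(1, num_videos_per_prompt, 1)            # [2, 678, 64]
out = out.view(batch_size * num_videos_per_prompt, seq_len, -1)    # [6, 226, 64]
print(out.shape)  # torch.Size([6, 226, 64]); rows 0-2 are copies of prompt 0, rows 3-5 of prompt 1
```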