# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.0.28] - TBD
### Added
### Improved
### Removed
## [0.0.27] - 2024-07-10
Pre-built binary wheels require PyTorch 2.3.1
### Added
- fMHA: `PagedBlockDiagonalGappyKeysMask`
- fMHA: heterogeneous queries in `triton_splitk`
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for `merge_attentions`
- fMHA: Added `torch.compile` support for 3 biases (`LowerTriangularMask`, `LowerTriangularMaskWithTensorBias` and `BlockDiagonalMask`) - some might require PyTorch 2.4
- fMHA: Added `torch.compile` support in `memory_efficient_attention` when passing the flash operator explicitly (e.g. `memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))`) - see the sketch after this list
- fMHA: `memory_efficient_attention` now expects its `attn_bias` argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: `AttentionBias` subclasses are now constructed by default on the `cuda` device if available - they used to be created on the CPU device
- 2:4 sparsity: Added `xformers.ops.sp24.sparsify24_ste` for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
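A minimal sketch of the new `torch.compile` path with the flash operator passed explicitly; the wrapper function, shapes, and dtypes below are illustrative:
```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

def attention(q, k, v, bias=None):
    # Passing the flash forward/backward ops explicitly enables torch.compile support
    return xops.memory_efficient_attention(
        q, k, v, attn_bias=bias, op=(fmha.flash.FwOp, fmha.flash.BwOp)
    )

compiled_attention = torch.compile(attention)

B, M, H, K = 2, 1024, 8, 64
q, k, v = (
    torch.randn(B, M, H, K, device="cuda", dtype=torch.float16) for _ in range(3)
)

# The bias must already live on the same device as q/k/v; AttentionBias
# subclasses are now created on "cuda" by default when it is available.
out = compiled_attention(q, k, v, bias=fmha.attn_bias.LowerTriangularMask())
```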
### Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: Fixed a bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a `trigger` file in the profiling directory
### Removed
- Removed support for PyTorch version older than 2.2
## [0.0.26] - 2024-04-29
Pre-built binary wheels require PyTorch 2.3.0
### Added
- [2:4 sparsity] Added support for Straight-Through Estimator for `sparsify24` gradient (`GRADIENT_STE`)
- [2:4 sparsity] `sparsify24_like` now supports the cuSparseLt backend, and the STE gradient
- Basic support for `torch.compile` for the `memory_efficient_attention` operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.
### Improved
- `merge_attentions` no longer needs inputs to be stacked.
- fMHA: triton_splitk now supports additive bias
- fMHA: benchmark cleanup
## [0.0.25.post1] - 2024-03-29
Pre-built binary wheels require PyTorch 2.2.2
## [0.0.25] - 2024-03-14
Pre-built binary wheels require PyTorch 2.2.1
### Added
- New `merge_attentions` function for combining partial attention outputs (a sketch of the underlying LSE merge follows this list)
- fMHA: New gappy attention biases.
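The log-sum-exp (LSE) merging that `merge_attentions` and `triton_splitk` rely on can be written in a few lines of plain PyTorch; this is only a sketch of the underlying math, not the library call itself:
```python
import torch

def merge_partial_attentions(outs: torch.Tensor, lses: torch.Tensor) -> torch.Tensor:
    """Merge S partial attention outputs computed over disjoint key chunks.

    outs: [S, B, M, H, K]  partial attention outputs
    lses: [S, B, M, H]     per-chunk log-sum-exp of the attention logits
    """
    lse = torch.logsumexp(lses, dim=0)             # global LSE across chunks
    weights = torch.exp(lses - lse).unsqueeze(-1)  # per-chunk rescaling factors
    return (weights * outs).sum(dim=0)             # correctly normalized output
```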
### Improved
- fMHA: Updated Flash-Attention to v2.5.6: this has a performance improvement for multiquery.
- fMHA: `triton_splitk` has been reworked and expanded: partial results are now merged using their log-sum-exp (LSE), the kernel can autotune, causal attention works with a small number of queries (not just 1), and there is experimental support for paged attention.
- `rope_padded`: Fixed CUDA error with many queries (more than 65k)
- `rmsnorm`: Fixed CUDA error with large inputs (enables 512k+ sequence length on Llama2 70B)
### Removed
- fMHA: Removed triton operator (`fmha.triton.*`, `xformers.ops.MemoryEfficientAttentionTritonFwdFlashBwOp`, `xformers.ops.TritonFlashAttentionOp`), as it has correctness issues under some conditions, and is slower than other implementations.
## [0.0.24] - 2024-01-31
Pre-built binary wheels require PyTorch 2.2.0
### Added
- Added components for model/sequence parallelism, as near-drop-in replacements for the FairScale/Megatron `ColumnParallelLinear` and `RowParallelLinear` modules. They support fusing communication and computation for sequence parallelism, thus making the communication effectively free. [Read more](https://twitter.com/d_haziza/status/1753030654118211593)
- Added kernels for training models with 2:4 sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with `torch.compile`, see `xformers.ops.sparsify24` (usage sketch below).
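A hedged sketch of dynamic 2:4 weight sparsification during training; only the name `xformers.ops.sparsify24` comes from the release notes, the call pattern and shapes below are assumptions:
```python
import torch
import torch.nn.functional as F
import xformers.ops as xops

w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16, requires_grad=True)
x = torch.randn(16, 4096, device="cuda", dtype=torch.float16)

# Assumed usage: sparsify the dense weight each step (keeps 2 of every 4 values)
# and use the result wherever the dense weight would have been used.
w_sparse = xops.sparsify24(w)
y = F.linear(x, w_sparse)  # assumed: the 2:4-sparse tensor subclass works with F.linear
y.float().pow(2).mean().backward()  # gradients flow back to the dense weight `w`
```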
### Improved
- Made selective activation checkpointing compatible with `torch.compile`.
### Removed
- Triton kernels now require a GPU with at least compute capability 8.0 (A100 or newer). This is due to newer versions of Triton not supporting older GPUs correctly
- Removed support for PyTorch version older than 2.1.0
## [0.0.23] - 2023-12-05
Pre-built binary wheels require PyTorch 2.1.1 (xFormers `0.0.23`) or PyTorch 2.1.2 (xFormers `0.0.23.post1`).
### Fixed
- fMHA: Fixed a bug in cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the BW pass. This would happen with MQA when one sequence has a query with `length%64 == 1`
- fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports `BlockDiagonalCausalWithOffsetPaddedKeysMask`
### Added
- fMHA: Added `LocalAttentionFromBottomRightMask` (local)
- fMHA: Added `LowerTriangularFromBottomRightMask` (causal) - usage sketch after this list
- fMHA: Added `LowerTriangularFromBottomRightLocalAttentionMask` (local + causal)
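A short usage sketch for `LowerTriangularFromBottomRightMask`; shapes and dtypes are illustrative:
```python
import torch
from xformers.ops import fmha, memory_efficient_attention

# 64 new queries attending to a 128-token context: the bottom-right-aligned
# causal mask lines the last query up with the last key.
q = torch.randn(1, 64, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)

out = memory_efficient_attention(
    q, k, v, attn_bias=fmha.attn_bias.LowerTriangularFromBottomRightMask()
)
```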
### Removed
- Removed `xformers.triton.sum_strided`
## [0.0.22] - 2023-09-27
### Fixed
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
### Added
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`, see the documentation for more details and the sketch after this list
- fMHA: Added experimental support for Local Attention biases to `memory_efficient_attention`
- Added an example of efficient [LLaMa decoding](https://github.com/facebookresearch/xformers/tree/main/examples/llama_inference) using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient rope implementation in triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
- `xformers.info` now indicates the Flash-Attention version used
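A hedged sketch of the 5-dimensional input layout for grouped-query attention; the `[batch, seqlen, groups, heads_per_group, head_dim]` layout and the expand trick are assumptions used for illustration, the documentation remains the reference:
```python
import torch
import xformers.ops as xops

B, M, G, H, K = 2, 1024, 4, 8, 128  # 4 key/value groups, 8 query heads per group
q = torch.randn(B, M, G, H, K, device="cuda", dtype=torch.float16)
# One key/value head per group, expanded (not copied) across the group's query heads
k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)
v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)

out = xops.memory_efficient_attention(q, k, v)  # output shape: [B, M, G, H, K]
```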
### Removed
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- **DEPRECATION**: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue facebookresearch/xformers#848)
## [0.0.21] - 2023-08-18
### Improved
- fMHA: Updated [flash-attention](https://github.com/Dao-AILab/flash-attention) to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available
### Bug fixes
- fMHA/cutlass: Fix potential race condition in the FW/BW passes
- fMHA/cutlass: Fix `attn_bias` stride overflow for very long sequences (>32k)
- `LowerTriangularMask` is now backward compatible with older xformers versions
### Breaking changes
- `memory_efficient_attention` now expects the `attn_bias` argument to have a head dimension
- `memory_efficient_attention` no longer broadcasts the batch/head dimensions of `attn_bias`. Please use `.expand` if you need to broadcast the bias (see the sketch after this list)
- Removed `causal_diagonal` argument from `BlockDiagonalCausalWithOffsetPaddedKeysMask`
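A minimal sketch of adapting a dense bias to these changes (shapes are illustrative): give the bias an explicit head dimension and expand the batch/head dimensions yourself instead of relying on broadcasting.
```python
import torch

B, H, M = 2, 8, 1024
# Before: a bias without a head dimension could be broadcast implicitly.
# Now the bias needs a head dimension, and batch/heads must be expanded explicitly:
bias = torch.randn(1, 1, M, M, device="cuda", dtype=torch.float16)
bias = bias.expand(B, H, M, M)  # no memory copy, just strides
```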
### Added
- Binary wheels on PyPI/conda now contain H100 kernels
- fMHA: Added backend specialized for decoding that does not use TensorCores - useful when not using multiquery
**NOTE**: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.
## [0.0.20] - 2023-05-23
### Improved
- fMHA/cutlass (backward): Massive performance improvements when `batch_size * num_heads` is low (10x+)
- fMHA/cutlass: Further performance improvements for both the forward & backward kernels
- fMHA (backward): Now dispatching to cutlass when `embed_dim>64`
- fMHA: Updated Flash-Attention to `v1.0.5`
### Added
- fMHA now runs on H100 (support is experimental)
## [0.0.19] - 2023-04-28
### Added
- Display `nvcc` version used to compile `xformers` in `python -m xformers.info`
### Fixed
- Fixed performance regression with `nvcc>11.6` (facebookresearch/xformers#712)
- fMHA/cutlass: Fixed `nan` in the output when using a `torch.Tensor` with `-inf` prefixes as `attn_bias` (facebookresearch/xformers#722)
- fMHA/cutlass: Fixed `nan` in the output when the sequence length is larger than `2 ** 15` (facebookresearch/xformers#719)
- fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
- fMHA/cutlass: The kernels are now deterministic
- fMHA/cutlass: Fixed backward pass correctness when using dropout (facebookresearch/xformers#724)
## [0.0.18] - 2023-03-31
### Added
- Added `xformers.ops.index_select_cat` and `xformers.ops.scaled_index_add` - these are experimental functions that only work for a few shapes, and can be used, for instance, to implement efficient stochastic depth in transformer architectures
### Fixed
- fMHA: `memory_efficient_attention` now accepts `torch.Tensor` as attention bias for any seqlen, although there are still requirements on the alignment of the bias tensor (see facebookresearch/xformers#683 and the sketch below)
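A hedged sketch of one way to satisfy the alignment requirement: allocate the bias with a padded last dimension and slice it back, so the underlying storage stays aligned (an alignment of 8 elements is assumed here; see the linked issue for the exact constraint).
```python
import torch

B, H, Mq, Mk = 2, 8, 1000, 1000
align = 8  # assumed alignment, in elements
padded_len = (Mk + align - 1) // align * align
bias = torch.zeros(B, H, Mq, padded_len, device="cuda", dtype=torch.float16)
bias = bias[:, :, :, :Mk]  # the sliced view keeps the aligned storage/strides
```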
## [0.0.17] - 2023-03-28
### Fixed
- fMHA: Fixed BW pass on Sm86/Sm89 GPUs when `K > 64` (RTX 3090, RTX 4090, A6000, ..) [facebookresearch/xformers#631]
### Added
- fMHA/CUTLASS: Added tensor attn bias support [facebookresearch/xformers#587] - contribution from [@jfc4050](https://github.com/jfc4050)
- fMHA/CUTLASS: Added tensor attn bias grad support [facebookresearch/xformers#587] - contribution from [@jfc4050](https://github.com/jfc4050)
- fMHA/CUTLASS: Added dropout support [facebookresearch/xformers#587] - contribution from [@jfc4050](https://github.com/jfc4050)
- fMHA: Added support for varying sequence lengths [facebookresearch/xformers#500]
## [0.0.16] - 2023-01-31
### Fixed
- Updated triton dependency [facebookresearch/xformers#418]
- Strip lineinfo from binaries, reducing the binary size [facebookresearch/xformers#549]
- Added support for pip wheels [facebookresearch/xformers#588, facebookresearch/xformers#573, facebookresearch/xformers#534, facebookresearch/xformers#523, ...] big thanks to [@AbdBarho](https://github.com/AbdBarho)!
- Fixed compatibility with Python 3.7 [facebookresearch/xformers#541] - thanks to [@susumuota](https://github.com/susumuota)
- fMHA: Fixed strides for QKV gradients for cutlass attention [facebookresearch/xformers#535]
- fMHA: Stricter inputs validation to avoid CUDA errors for unsupported inputs [facebookresearch/xformers#592]
- fMHA/Flash-Attention: Updated to https://github.com/HazyResearch/flash-attention/commit/a1f49a2b92b6fa022379bbebafed9d7f5e96a675 with multiple changes from [@TriDao](https://github.com/tridao) that make the operator up to 20% faster
- fMHA/Flash-Attention: Fixed backward pass wrapper, where non-contiguous gradients could give the wrong result [facebookresearch/xformers#548]
- fMHA: Separated each operator into forward and backward operators. It's now possible to use any combination of forward+backward (for instance Triton forward and Flash-Attention backward) [facebookresearch/xformers#560] - see the sketch at the end of this section
### Added
- fMHA: Added Triton operator for forward pass from [Flash-Attention](https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py) authored by [@TriDao](https://github.com/tridao), will be automatically used on A100 when compatible
- fMHA: Added [`xformers.ops.memory_efficient_attention_forward`](https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.memory_efficient_attention_forward), [`xformers.ops.memory_efficient_attention_forward_requires_grad`](https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.memory_efficient_attention_forward_requires_grad), [`xformers.ops.memory_efficient_attention_backward`](https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.memory_efficient_attention_backward) for power-users who write custom autograd functions [facebookresearch/xformers#560]
- fMHA: Support for custom scaling for the CUTLASS-based kernel [facebookresearch/xformers#530] - contribution from [@comaniac](https://github.com/comaniac)
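A hedged sketch of mixing forward and backward implementations now that each operator is split; the specific combination, shapes, and module paths below are illustrative.
```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

q, k, v = (
    torch.randn(2, 512, 8, 64, device="cuda", dtype=torch.float16, requires_grad=True)
    for _ in range(3)
)

# Cutlass forward combined with Flash-Attention backward
out = xops.memory_efficient_attention(q, k, v, op=(fmha.cutlass.FwOp, fmha.flash.BwOp))
out.sum().backward()
```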
## [0.0.15] - Skipped
## [0.0.14] - 2022-11-10
### Fixed
- fMHA/CUTLASS: The current CUDA stream is now used by the kernel [facebookresearch/xformers#491]
- fMHA/CUTLASS: Improved overall performance
### Added
- SwiGLU: Added `xformers.ops.SwiGLU` and its functional counterpart (`xformers.ops.swiglu`) [facebookresearch/xformers#490] - usage sketch after this list
- fMHA: Possible to combine CUTLASS's forward with flash-attention's backward pass [facebookresearch/xformers#469] - improves performance on A100 for K = 128
- fMHA: Add custom `xformers.ops.unbind` operator to avoid a cat in the attention block [facebookresearch/xformers#458]
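A hedged usage sketch for the new SwiGLU module; the constructor arguments (`in_features`, `hidden_features`) follow the usual MLP convention and are assumed here.
```python
import torch
import xformers.ops as xops

mlp = xops.SwiGLU(in_features=1024, hidden_features=2816).cuda().half()
x = torch.randn(4, 128, 1024, device="cuda", dtype=torch.float16)
y = mlp(x)  # (silu(x @ w1 + b1) * (x @ w2 + b2)) @ w3 + b3, projected back to 1024 features
```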
## [0.0.13] - 2022-09-26
### Added
- fMHA: Added CUTLASS-based kernel for `xformers.ops.memory_efficient_attention`. This kernel is selected automatically depending on the inputs, and works on any GPU from P100 onwards [facebookresearch/xformers#362]
## [0.0.12] - 2022-08-08
### Fixed
- Removed duplicated biases in the FusedMLP layers [facebookresearch/xformers#317]
- Rotary embeddings respecting input types [facebookresearch/xformers#326]
- Fixed Poolformer-style blocks instantiating useless projection layers [facebookresearch/xformers#349]
- Fix layer position not being properly tracked, causing extra layernorms for programmatic xformers [facebookresearch/xformers#348]
- Pass use_triton flag to LayerNorm module [facebookresearch/xformers#336]
### Added
- Four blocksparsity layouts from DeepSpeed [facebookresearch/xformers#320]
- Support several initialization options [facebookresearch/xformers#312]
- Conv2DFeedforward feedforward part [facebookresearch/xformers#321]
- VisualAttention [facebookresearch/xformers#329]
- Automatic blocksparse for causal attention [facebookresearch/xformers#334]
- Better hierarchical transformer generation [facebookresearch/xformers#345]
- Fused operations with AOTAutograd/NVFuser, integration into MLP [facebookresearch/xformers#357]
- Refactor LRA code to use Pytorch Lightning [facebookresearch/xformers#343]
## [0.0.11] - 2022-05-30
### Fixed
- Fix some torchscriptability [facebookresearch/xformers#246]
- Fix FourierMix being compatible with AMP [facebookresearch/xformers#258]
- Better asserts on QKV dimensions [facebookresearch/xformers#264]
- Better performance for FusedMLP and FusedLinearLayer [facebookresearch/xformers#283]
- Deepnorm init missing self-attention [facebookresearch/xformers#284]
### Added
- Simplicial Embeddings [facebookresearch/xformers#259]
- Mem efficient attention, FW pass [facebookresearch/xformers#267]
- MHA benchmark
- MLP benchmark
- Move all triton kernels to triton v2 [facebookresearch/xformers#272]
- Mem efficient attention, BW pass [facebookresearch/xformers#281]
- Metaformer support [facebookresearch/xformers#294]
## [0.0.10] - 2022-03-14
### Fixed
- Expose bias flag for feedforwards, same default as Timm [facebookresearch/xformers#220]
- Update eps value for layernorm, same default as torch [facebookresearch/xformers#221]
- PreNorm bugfix, only one input was normalized [facebookresearch/xformers#233]
- Fix bug where embedding dimensions that did not match model dim would lead to a crash [facebookresearch/xformers#244]
### Added
- Add DeepNet (DeepNorm) residual path and init [facebookresearch/xformers#227]
## [0.0.9] - 2022-02-09
### Added
- Compositional Attention [facebookresearch/xformers#41]
- Experimental Ragged attention [facebookresearch/xformers#189]
- Mixture of Experts [facebookresearch/xformers#181]
- BlockSparseTensor [facebookresearch/xformers#202]
- Nd-tensor support for triton softmax [facebookresearch/xformers#210]
### Fixed
- Bugfix Favor, single feature map [facebookresearch/xformers#183]
- Sanity check blocksparse settings [facebookresearch/xformers#207]
- Fixed some picklability [facebookresearch/xformers#204]
## [0.0.8] - 2022-01-07
### Fixed
- Much faster fused dropout [facebookresearch/xformers#164]
- Fused dropout repeatability [facebookresearch/xformers#173]
### Added
- Embedding weight tying option [facebookresearch/xformers#172]
## [0.0.7] - 2021-11-30
### Fixed
- Dropout setting not properly passed in many attentions [facebookresearch/xformers#123]
## [0.0.6] - 2021-11-24
### Fixed
- Fix self attention optimization not being triggered, broken residual path [facebookresearch/xformers#119]
- Improve speed by not using contiguous Tensors when not needed [facebookresearch/xformers#119]
### Added
- Attention mask wrapper [facebookresearch/xformers#113]
- ViT comparison benchmark [facebookresearch/xformers#117]
## [0.0.4] - 2021-11-16
### Fixed
- Homogenizing the masks, additive or bool [facebookresearch/xformers#79][facebookresearch/xformers#85][facebookresearch/xformers#86]
- Fix causality flag not being respected [facebookresearch/xformers#103]
- Enabling FusedLayerNorm by default in the factory if Triton is available
- Fixing Favor with fp16
- Fixing Favor trainability
### Added
- Fused dropout/bias/activation layer [facebookresearch/xformers#58]
- Fused layernorm used by default in the factory [facebookresearch/xformers#92]
## [0.0.3] - 2021-11-01
### Fixed
- Nystrom causal attention [facebookresearch/xformers#75]
## [0.0.2] - 2021-11-01
### Fixed
- More robust blocksparse [facebookresearch/xformers#24]
### Added
- Rotary embeddings [facebookresearch/xformers#32]
- More flexible layernorm [facebookresearch/xformers#50]