AbstractPhila

AbstractPhil

AI & ML interests

datasets, research papers, experimentation, vision, classification, text encoders, tokenization, llms, diffusion, distillation, and more.

Recent Activity

posted an update about 2 hours ago
pytorch-parallel-compiler v0.5.0 upgrades:
* Complex benchmarking for wide primitive objects is now supported, including multiple presets for quick tests on hardware.
* All supported primitives either have validity checks or will have them.
* 6 new wide layers are supported directly and will be a key part of the autotuner before v1.0.
* WideTracedModel is a preliminary auto-builder, so the user no longer needs to build wide models manually by gathering layers.

https://github.com/AbstractEyes/pytorch-parallel-compiler

New layers for 0.5.0: WideGRU, WideLSTM, WideGroupNorm, WideMultiheadedAttention, WideInstancenorm1/2d, WideConv3d.

Upcoming for 1.0:
* WideTracedModel fully building any supported layer pattern, with multiple autotune candidates for automatic selection.
* Module cherry-picking for your specific use case, e.g. replacing only WideLinear when it benefits your case by 35% while widened attention only reduces cost by 10%, so attention is left untouched.
* All (roughly 32 more) commonly used PyTorch layer systems supported in one form or another, with wide-batched kernels that benefit both eager and compiled execution; many of these require reworks or complete rewrites.
* Autotuning wide formats based on the hardware's response to the kernels: kernel chunking for big, slow processes such as LSTM; kernel fusion for small processes with excess overhead; expanding kernels with masking to fit specific hardware use cases; and a series of smaller but important optimizations along the way.
* Full transformer and RoPE support with wide-batched optimizations throughout the structures to allow more robust autoregressive throughput.
* Additional Conv1d, Conv2d, and Conv3d optimizations.

Beyond version 1.0:
* Entire diffusion structures specifically kernelized for high-efficiency utilization with eager and compiled execution.
* Video-diffusion-specific targets meant to heavily reduce computation costs on the GPU and increase computation throughput on the GPU.
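For readers unfamiliar with the wide-layer idea: the point is to fuse N independent copies of the same primitive into one batched kernel call instead of N separate module calls. The sketch below hand-rolls that idea for a linear layer in plain PyTorch; the class name, shapes, and initialization are illustrative assumptions and are not the pytorch-parallel-compiler API.

```python
import torch
import torch.nn as nn

class IllustrativeWideLinear(nn.Module):
    """N independent Linear(in_f, out_f) layers fused into one batched matmul.
    Illustrative only; not the pytorch-parallel-compiler API."""
    def __init__(self, n_models: int, in_features: int, out_features: int):
        super().__init__()
        # One stacked weight/bias tensor instead of n_models separate modules.
        self.weight = nn.Parameter(torch.randn(n_models, in_features, out_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_models, 1, out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [n_models, batch, in_features] -> [n_models, batch, out_features].
        # A single baddbmm replaces a Python loop over n_models nn.Linear calls,
        # so eager mode and torch.compile see one large kernel instead of many small ones.
        return torch.baddbmm(self.bias, x, self.weight)

device = "cuda" if torch.cuda.is_available() else "cpu"
wide = IllustrativeWideLinear(n_models=100, in_features=64, out_features=64).to(device)
x = torch.randn(100, 32, 64, device=device)
y = wide(x)  # equivalent to 100 separate Linear forwards with matching weights
```

Collapsing the per-module loop into one batched kernel is the throughput lever the post describes; the library's Wide* layers and autotuner presumably build on the same principle with more sophisticated kernel choices.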
posted an update 2 days ago
The long version: this is a proof of concept. The ensemble-compilation vmap prototype is functional and can be used to increase throughput for wider batches on FFN, MLP, LLM, or other models, not just ensembles. The system traces your model and creates stages of functional activation; based on the stage, it combines or removes combinations of stages so that your layers are assigned to batched functional calls, putting more pressure on your GPU with fewer loops and with directly curated CUDA graph compliance where applicable. Identical weights yield identical results, at the cost of extra hardware and VRAM.

TLDR: This is an ensemble optimization adapted to standard models. It yields large speed improvements through increased throughput for inference and training alike, using carefully traced, staged vmap structures.

https://github.com/AbstractEyes/pytorch-parallel-compiler

The early list of layers isn't fully represented yet, so this is a preliminary look at the potential of this structure once fully fleshed out.

MLP (N=100, batch=32, CUDA):
```
Eager: 2-3x speedup
Compiled: 35-40x speedup
```
ResBlock (N=20, batch=8, CUDA):
```
Eager: ~5x speedup
Compiled: ~10x speedup
```

This is early testing, and so far the results indicate that WIDENING your model with adjacent shared batched vmaps, for uniformly staged models, yields considerably higher inference throughput at the cost of additional hardware utilization. It is akin to lining up all your systems and uniformly passing the necessary inputs through a shared frozen representation gate. Training for this is not tested nor supported yet; use at your own risk.
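To make the mechanism concrete, here is a minimal sketch of the underlying ensemble-vmap pattern using plain torch.func (stack_module_state, functional_call, vmap): all member forwards collapse into one wide batched call that torch.compile can then fuse. This illustrates the general technique the post describes, not the repo's tracing or staging API; the model size, N, and names are assumptions.

```python
import copy
import torch
import torch.nn as nn
from torch.func import stack_module_state, functional_call

# 100 MLPs with independent weights but identical architecture (an "ensemble").
def make_mlp():
    return nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

device = "cuda" if torch.cuda.is_available() else "cpu"
models = [make_mlp().to(device) for _ in range(100)]

# Stack every parameter/buffer along a new leading "model" dimension.
params, buffers = stack_module_state(models)

# A stateless meta-device copy serves as the functional template.
base = copy.deepcopy(models[0]).to("meta")

def call_one(p, b, x):
    return functional_call(base, (p, b), (x,))

# vmap turns 100 separate forwards into one wide batched call (no Python loop);
# torch.compile can then fuse the batched graph, which is where the larger
# "Compiled" speedups reported above would come from.
wide_forward = torch.vmap(call_one, in_dims=(0, 0, 0))
compiled_forward = torch.compile(wide_forward)

x = torch.randn(100, 32, 64, device=device)  # one input batch per model
out = compiled_forward(params, buffers, x)   # shape: [100, 32, 64]
```

With identical weights stacked per model, each slice of `out` matches the corresponding individual model's output, which mirrors the post's "identical weights yield identical results" claim at the cost of the extra VRAM for the stacked parameters.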

Organizations

DeepGHS, BangumiBase, Abstract Powered Research