attention sinks & backward
#3, opened by acforvs
Hi, thanks for all the amazing work!
I noticed that a new s_aux parameter has been added to the fwd pass to support attention sinks. However, I wasn't able to find any related changes in the backward pass.
Does the current implementation support training as well? If not, are there any plans to add attention sink support to the bwd pass?
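To make the question concrete, here is a minimal NumPy sketch of what I understand a sink to do and what gradient the bwd pass would need for it. It assumes s_aux is a single per-head logit that takes part in the softmax normalization but has no value vector; that is my reading, not something I have confirmed from the kernel source. The names attn_with_sink and grad_s_aux are hypothetical and are not part of this repo's API.

```python
import numpy as np

def attn_with_sink(q, k, v, s_aux):
    """Single-head attention with one sink logit (assumed semantics, not the kernel's code)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (Tq, Tk) scaled dot products
    sink = np.full((scores.shape[0], 1), s_aux)        # (Tq, 1) same sink logit for every query
    logits = np.concatenate([scores, sink], axis=1)    # the sink joins the softmax
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    p_keys, p_sink = p[:, :-1], p[:, -1]
    # The sink column is dropped: it absorbs probability mass but adds no value.
    return p_keys @ v, p_sink

def grad_s_aux(q, k, v, s_aux, d_out):
    """Gradient of sum(d_out * out) w.r.t. s_aux under the assumption above."""
    out, p_sink = attn_with_sink(q, k, v, s_aux)
    # Softmax backward with a zero-valued sink reduces to -p_sink * <d_out, out> per query row.
    return float(-(p_sink * (d_out * out).sum(axis=1)).sum())
```

If this reading is right, the gradient for the sink only needs the sink probability per query row plus the row-wise dot product of d_out and out, on top of the usual dq, dk and dv.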
Many thanks,
Vlad
Same question here! Is support for this planned?