attention sinks & backward

#3
by acforvs - opened

Hi, thanks for all the amazing work!

I noticed that a new s_aux parameter has been added to the fwd pass to support attention sinks. However, I wasn't able to find any related changes in the backward pass.

Does the current implementation support training as well? If not, are there any plans to add support for attention sinks to the bwd pass?

Many thanks,
Vlad

kernels-community org

Good question! Are there any plans for it, @danieldk?

Same question here! Is support for this planned?
