I have seen VLMs that are great, but they run inference frame by frame, which can lead to false alarms. Can this model understand the video itself (take chunks of frames and understand the nature of the video)?