DINOv3 Video Tracking
In-browser video tracking, powered by Transformers.js
Does this mean that the Qwen model passes tokens back to the context? That is, at the retrieval level these tokens are encoded into the MALM model's hashes, which are used to retrieve information from the context as MALM tokens, and the result is then converted back into Qwen's tokens?
The result is pretty impressive! Here are some examples with queries (the results are shown as small red circles):
Where to click to upload a file?
Where the result of the request would be presented as an image?
It's worth noting that the points are located at the logical centers of the UI components (not at their titles or visual centers).
I'd also like to see license information added to the article; it took me some time to find it on HF. For those who are curious, it's Apache 2.0 (very permissive).