LLM-Engineers-Handbook / project_notes.txt
Download the tar file locally, unzip it, and obtain the mp4, vtt, json, etc. files. There is no need to store the files in MongoDB.
Process the vtt or json files and create the short sentences for each video.
Remove any repetitions and create timestamped longer sentences.
Decide when to stop the merging of consecutive timestamped short sentences.
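A minimal sketch of the vtt branch, assuming the webvtt-py package (pip install webvtt-py); a json transcript would just be a json.load plus the same dedup loop, and the filename is hypothetical:

    import webvtt  # assumes the webvtt-py package; any VTT parser works

    def load_cues(vtt_path):
        # Yield (start, end, text) per caption cue, dropping exact consecutive
        # repeats (auto-captions often repeat a line across adjacent cues).
        last_text = None
        for caption in webvtt.read(vtt_path):
            text = caption.text.strip().replace("\n", " ")
            if text and text != last_text:
                yield caption.start, caption.end, text
                last_text = text

    cues = list(load_cues("video1.en.vtt"))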
We can merge everything per video and use spaCy's sentence segmentation to create the longer sentences.
https://spacy.io/usage/linguistic-features#sbd (sentence segmentation) - this should be the main and first approach to try.
https://spacy.io/api/sentencizer - much simpler - it may not work as well.
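A sketch of both spaCy options (en_core_web_sm must be installed first via python -m spacy download en_core_web_sm); cues is the output of the vtt sketch above:

    import spacy

    # Main/first option: full pipeline, parser-based sentence boundaries.
    nlp = spacy.load("en_core_web_sm")

    # Simpler fallback: rule-based sentencizer on a blank pipeline.
    # nlp = spacy.blank("en")
    # nlp.add_pipe("sentencizer")

    merged_text = " ".join(text for _, _, text in cues)
    doc = nlp(merged_text)
    long_sentences = [sent.text for sent in doc.sents]
    # sent.start_char / sent.end_char can be mapped back to cue character
    # offsets to keep each long sentence's video and timestamps (next note).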
Preserve, for each sentence, the video it came from and its timestamps.
Use BERTopic to do topic modeling.
Start with 5 sentences in a video that you believe belong to different topics. Just test BERTopic to see if the assigned topics are indeed different.
After verifying that BERTopic works, you can select an embedding model (e.g. CLIP) https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html
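A quick smoke test of this step; the five sentences are made up for illustration, and the UMAP override is needed because BERTopic's defaults assume far more than five documents:

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    docs = [
        "Gradient descent updates the weights of the network.",
        "The recipe calls for two cups of flour.",
        "The stock market fell sharply on Tuesday.",
        "Transformers use self-attention over token embeddings.",
        "Preheat the oven to 180 degrees before baking.",
    ]

    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in CLIP later
    umap_model = UMAP(n_neighbors=2, n_components=2, random_state=42)  # tiny-input override
    topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model,
                           min_topic_size=2)
    topics, probs = topic_model.fit_transform(docs)
    print(topics)  # check that the assigned topic ids differ across the sentences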
You will now have sentences assigned to topics, and in the d-dimensional space, sentences that belong to the same cluster == sentences that are assigned the same topic id.
Retrieve the sentence embeddings that BERTopic used and store them in Qdrant (vector database).
Embed the question as well, using the same model, so that it lies in the same d-dimensional space as the stored vectors.
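A sketch of the storage step with qdrant-client; videos/starts/ends are hypothetical parallel lists carrying each sentence's mp4 path and timestamps, and re-encoding docs with the same model reproduces the vectors BERTopic clustered:

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    client = QdrantClient(":memory:")  # or url="http://localhost:6333"

    embeddings = embedding_model.encode(docs)  # same vectors BERTopic used
    client.create_collection(
        collection_name="video_sentences",
        vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="video_sentences",
        points=[
            PointStruct(id=i, vector=embeddings[i].tolist(),
                        payload={"video": videos[i], "start": starts[i],
                                 "end": ends[i], "text": docs[i]})
            for i in range(len(docs))
        ],
    )

    question_vector = embedding_model.encode("your question here").tolist()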
Use Qdrant's cosine similarity to retrieve the closest k (kNN) sentences from the database.
Look up their timestamps (note that Qdrant allows you to store vectors with any metadata you need).
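A sketch of the retrieval step (client.search is the classic qdrant-client call; newer versions also offer query_points):

    k = 5
    hits = client.search(
        collection_name="video_sentences",
        query_vector=question_vector,
        limit=k,  # cosine kNN, per the collection's Distance.COSINE config
    )
    for hit in hits:
        # Each hit carries the payload we stored: video, start, end, text.
        print(hit.score, hit.payload["video"], hit.payload["start"], hit.payload["end"])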
Use the min and max timestamps from the Qdrant response to slice the video that the sentences belong to, creating the clip. https://shotstack.io/learn/use-ffmpeg-to-trim-video/
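A sketch of the slicing step, calling ffmpeg via subprocess. Restricting to the top hit's video is an assumption (the notes don't say how to handle hits spread across videos), and fixed-width HH:MM:SS.mmm VTT timestamps make lexicographic min/max match chronological order:

    import subprocess

    top_video = hits[0].payload["video"]
    same_video = [h for h in hits if h.payload["video"] == top_video]
    clip_start = min(h.payload["start"] for h in same_video)
    clip_end = max(h.payload["end"] for h in same_video)

    # -c copy trims without re-encoding; cuts snap to keyframes, so the
    # clip boundaries are approximate.
    subprocess.run(
        ["ffmpeg", "-i", top_video, "-ss", clip_start, "-to", clip_end,
         "-c", "copy", "clip.mp4"],
        check=True,
    )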