
Hope this one day will be used for auto-tagging all video assets with time codes. The dream is being able to search for "running horse" and find a clip containing a running horse at 4m42s, somewhere in one of thousands of clips.


This is a solved problem already: check out https://getjumper.io, where you can do exactly this (search through hundreds of hours of footage) offline and locally.

Disclaimer: co-founder


It’s not difficult to hack this together with CLIP. I did it with about a tenth of my movie collection last week on a GTX 1080, though CLIP lacks temporal understanding, so you have to do the scene analysis yourself.


I'm guessing you're not storing a CLIP embedding for every single frame, but rather one every second or so? Also, are you using cosine similarity? How are you finding the nearest vector?


I split per scene using PySceneDetect and sampled frames from each. Distance is via cosine similarity; I fed the embeddings into Qdrant.
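
If anyone wants to replicate the query side, it's roughly this (an untested sketch; the collection name and payload fields are placeholders for whatever you stored at ingest time):

    import torch
    import open_clip
    from qdrant_client import QdrantClient

    # Load the same CLIP checkpoint used at ingest time (ViT-B-32 as an example).
    model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    def search_scenes(query: str, top_k: int = 5):
        tokens = tokenizer([query])
        with torch.no_grad():
            vec = model.encode_text(tokens)
        # Normalize so dot product equals cosine similarity.
        vec = vec / vec.norm(dim=-1, keepdim=True)

        client = QdrantClient(host="localhost", port=6333)
        return client.search(
            collection_name="movie_frames",  # placeholder collection name
            query_vector=vec[0].tolist(),
            limit=top_k,
        )

    for hit in search_scenes("a running horse"):
        print(hit.payload.get("file"), hit.payload.get("timestamp"), hit.score)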


Would you be willing to share more details of what you did?


Sure. I had a lot of help from Claude Opus 4.5, but it was roughly the following (see the sketch after this list):

- Using pyscenedetect to split each video on a per scene level

- Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a particular sample rate (I don't have the exact rate handy, but it was 1-2 frames per scene)

- Aggregating frames in batches of around 256 to be normalized for CLIP embedding on the GPU (I had to rewrite the normalization process for this, because the default library does it on the CPU)

- Uploading the frame embeddings along with metadata (timestamp, etc.) into a vector DB, in my case Qdrant running locally, together with a screenshot of the frame itself for debugging
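
Condensed, the ingest side looked something like this (a simplified sketch, not my exact script; the collection name, vector size, and two-frames-per-scene rate are illustrative, and the batched GPU normalization is omitted):

    import torch
    import open_clip
    from PIL import Image
    from decord import VideoReader, cpu
    from scenedetect import detect, ContentDetector
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    model = model.to(device).eval()

    def index_video(path: str, client: QdrantClient, collection: str = "movie_frames"):
        # 1. Scene boundaries: detect() returns (start, end) FrameTimecode pairs.
        scenes = detect(path, ContentDetector())
        vr = VideoReader(path, ctx=cpu(0))
        fps = vr.get_avg_fps()

        points = []
        for scene_id, (start, end) in enumerate(scenes):
            # 2. Sample two frames per scene: one at the start, one in the middle.
            first = start.get_frames()
            last = max(end.get_frames() - 1, first)
            frame_ids = sorted({first, (first + last) // 2})
            frames = vr.get_batch(frame_ids).asnumpy()  # (N, H, W, 3) uint8

            # 3. Embed with CLIP and L2-normalize so cosine distance works.
            batch = torch.stack([preprocess(Image.fromarray(f)) for f in frames]).to(device)
            with torch.no_grad():
                embs = model.encode_image(batch)
            embs = embs / embs.norm(dim=-1, keepdim=True)

            # 4. Queue points with enough metadata to seek back into the video.
            for fid, emb in zip(frame_ids, embs):
                points.append(PointStruct(
                    id=len(points),  # fine for one video; use UUIDs across a library
                    vector=emb.cpu().tolist(),
                    payload={"file": path, "scene": scene_id, "timestamp": fid / fps},
                ))

        client.upsert(collection_name=collection, points=points)

    client = QdrantClient(host="localhost", port=6333)
    client.recreate_collection(
        collection_name="movie_frames",
        vectors_config=VectorParams(size=512, distance=Distance.COSINE),  # 512 for ViT-B-32
    )
    index_video("movie.mp4", client)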

I'm bottlenecked by GPU compute, so I also started experimenting with Modal for the embedding work, but then vacation ended :) Might pick it up again in a few weeks. I'd like to have a temporal-aware and potentially enriched search, so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and get back a timestamped clip from the movie.


You can do that with Morphik already :)

We use an embedding model that processes videos and lets you perform RAG on them.


Would it allow me to query my library for every movie that contains the dance routine move1-move2-move3, in that order?


RAG as in the content is used to generate an answer, or RAG as in searching for a video?


Gemini already does this (and has for a while): https://ai.google.dev/gemini-api/docs/video-understanding
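
For reference, a minimal sketch with the google-genai Python SDK (model name and prompt are illustrative; uploaded videos must finish server-side processing before you can prompt against them):

    import time
    from google import genai

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    # Upload the video, then poll until server-side processing finishes.
    video = client.files.upload(file="clip.mp4")
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = client.files.get(name=video.name)

    response = client.models.generate_content(
        model="gemini-2.0-flash",  # illustrative model name
        contents=[video, "At what timestamp does a horse run across the frame?"],
    )
    print(response.text)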



